Synopsis Report on
Intelligent Heart Disease Prediction System

BACHELOR OF ENGINEERING
in
COMPUTER ENGINEERING

By
Ujesh Shetty
Pallav Parikh
Vivek Shukla
CERTIFICATE

This is to certify that Ujesh Shetty, Pallav Parikh and Vivek Shukla are bona fide students of Thakur College of Engineering and Technology, Mumbai. They have satisfactorily completed the requirements of PROJECT-I as prescribed by the University of Mumbai while working on the Intelligent Heart Disease Prediction System.

Thakur College of Engineering and Technology, Kandivali (E), Mumbai-400101.
PLACE: Mumbai
DATE:
CONTENTS

List of Figures
List of Tables
Abbreviations and Symbols
Definitions

Chapter 1  Introduction
  1.1 Importance of the Project and its Background
  1.2 Literature Survey
  1.3 Motivation
  1.4 Scope of the Project
  1.5 Organization of the Project Report

Chapter 2  Proposed Work
  2.1 Problem Definition
  2.2 Data Flow Diagram / Flow Chart of Design
  2.3 Use Case Description (as per guide's instruction)

Chapter 3  Feasibility Study, Project Planning and Scheduling (Timeline Chart)

Chapter 4  Technology/Software, Stage-wise Model Development, Flow Chart, Implementation Stages and Installation Stages

Chapter 5  Progress (Optional)
  5.1 Deviations from Design Schedule
  5.2 Remedial Measures Taken

Chapter 6  As per Project (Optional)
Chapter 1: Overview

1.1 Importance of the Project and its Background
Heart failure and stroke are among the most frequent diagnostic categories and are leading causes of death. Heart diseases are the most frequently first-listed diagnoses for hospital discharges. Approximately 60 million people around the world are afflicted with some form of cardiovascular disease, which includes both heart disease and stroke.
The main objective of this research is to develop a prototype Intelligent Heart Disease Prediction System (IHDPS) using two data mining modelling techniques, namely Decision Trees and Naive Bayes. IHDPS can discover and extract hidden knowledge, patterns and relationships associated with heart disease from a historical heart disease database. It can answer complex queries for diagnosing heart disease and thus assist healthcare practitioners in making intelligent clinical decisions, which traditional decision support systems cannot. By enabling effective treatments, it also helps to reduce treatment costs. To enhance visualization and ease of interpretation, it displays the results in both tabular and graphical forms. IHDPS can serve as a training tool to train nurses and medical students to diagnose patients with heart disease. It can also provide decision support to assist doctors in making better clinical decisions, or at least provide a second opinion. The current version of IHDPS is based on the 15 attributes listed in Figure 1.1. This list may need to be expanded to provide a more comprehensive diagnosis system.
List of Attributes

Predictable attribute:
1. Diagnosis
   value 0: < 50% diameter narrowing (no heart disease)
   value 1: > 50% diameter narrowing (has heart disease)

Key attribute:
2. Patient_id: patient's identification number

Input attributes:
3. Sex (value 1: male; value 0: female)
4. Chest Pain Type (value 1: typical angina; value 2: atypical angina; value 3: non-anginal pain; value 4: asymptomatic)
5. Fasting Blood Sugar (value 1: > 120 mg/dl; value 0: <= 120 mg/dl)
6. Restecg: resting electrocardiographic results (value 0: normal; value 1: having ST-T wave abnormality; value 2: showing probable or definite left ventricular hypertrophy)
7. Exang: exercise-induced angina (value 1: yes; value 0: no)
8. Slope: the slope of the peak exercise ST segment (value 1: upsloping; value 2: flat; value 3: downsloping)
9. CA: number of major vessels coloured by fluoroscopy (value 0-3)
10. Thal (value 3: normal; value 6: fixed defect; value 7: reversible defect)
11. Trest Blood Pressure (mm Hg on admission to the hospital)
12. Serum Cholesterol (mg/dl)
13. Thalach: maximum heart rate achieved
14. Oldpeak: ST depression induced by exercise relative to rest
15. Age in years

Figure 1.1. Description of attributes
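For illustration, the coded attribute values of Figure 1.1 can be captured programmatically. The sketch below is an informal Python rendering; the field names are shorthands chosen here, not the actual IHDPS database schema.

```python
# Informal sketch of the Figure 1.1 attribute encodings.
# Field names are illustrative shorthands, not the real IHDPS schema.
ATTRIBUTE_VALUES = {
    "diagnosis": {0: "< 50% diameter narrowing (no heart disease)",
                  1: "> 50% diameter narrowing (has heart disease)"},
    "sex": {1: "male", 0: "female"},
    "chest_pain_type": {1: "typical angina", 2: "atypical angina",
                        3: "non-anginal pain", 4: "asymptomatic"},
    "fasting_blood_sugar": {1: "> 120 mg/dl", 0: "<= 120 mg/dl"},
    "restecg": {0: "normal", 1: "ST-T wave abnormality",
                2: "left ventricular hypertrophy"},
    "exang": {1: "yes", 0: "no"},
    "slope": {1: "upsloping", 2: "flat", 3: "downsloping"},
    "thal": {3: "normal", 6: "fixed defect", 7: "reversible defect"},
}

def describe(attribute, value):
    """Translate a coded attribute value into its human-readable meaning."""
    return ATTRIBUTE_VALUES[attribute][value]
```

A lookup table like this keeps the numeric codes used for mining consistent with the human-readable meanings shown to practitioners.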
1.2 Literature Survey
The healthcare industry today generates large amounts of complex data about patients, hospital resources, disease diagnoses, electronic patient records, medical devices, etc. These large amounts of data are a key resource to be processed and analysed for knowledge extraction that enables support for cost savings and decision making. The majority of conventional clinical decision support systems (CDSS) for disease diagnosis are generally based on the symptoms of the patient or data from simple medical questionnaires. A CDSS is maintained by the healthcare provider over time and includes all of the key administrative and clinical data relevant to a person's care under a particular provider, including demographics, progress notes, problems, medications, vital signs, past medical history, immunizations, laboratory data, medical images and radiology reports. To our knowledge, a CDSS for cardiovascular disease diagnosis using an ensemble of multiple classifiers for comprehensive diagnosis and possible data mining does not currently exist. Specifically, the goal is to improve cardiovascular health and quality of life through the prevention, detection and treatment of risk factors, and the early identification and treatment of heart attacks and strokes with prevention of recurrent cardiovascular events.
1.3 Motivation
A major challenge facing healthcare organizations like hospitals and various medical centers across the country is the provision of quality services at affordable costs. Quality service implies diagnosing patients correctly and administering treatments that are effective. Poor clinical decisions can lead to disastrous consequences which are therefore unacceptable. Hospitals must also minimize the cost of clinical tests. They can achieve these results by employing appropriate computer-based information and/or decision support systems. Most hospitals today employ some sort of hospital information systems to manage their healthcare or patient data. These systems typically generate huge amounts of data which take the form of numbers, text, charts and images. Unfortunately, these data are rarely used to support clinical decision making. There is a wealth of hidden information in these data that is largely untapped. This raises
an important question: How can we turn data into useful information that can enable healthcare practitioners to make intelligent clinical decisions? This is the main motivation for this research.
1.4 Scope of the Project

Physicians can identify effective treatments and best practices, and patients can receive better and more affordable healthcare services. The system could also be useful to healthcare management, particularly in medicine where there is no dispositive evidence favouring a particular treatment option. Based on a patient's profile, history, physical examination and diagnosis, and utilizing previous treatment patterns, new treatment plans can be effectively suggested. Thus we intend to provide a prototype which will aid better clinical decision making and act as a companion to the concerned doctors for a second opinion.
1.5 Organization of the Project Report

In the upcoming chapters, we highlight different aspects of the designed prototype. In Chapter 2, we provide an explicit definition of the problem, along with the class diagram and block diagram of the entire system. In the problem definition, we discuss the need for the project and its impact on the clinical decisions being made. The class diagram shows the various components and entities involved in the system. The block diagram describes the system's functionality in detail; here we depict a low-level representation of the prototype system. We also design the static and dynamic analysis diagrams, activity diagram, interaction diagram, collaboration diagram and deployment diagram. In Chapter 3, we discuss the feasibility of our product with respect to various factors such as the technical, economic and operational domains. In the next part we describe the various stages in the planning of the project over the entire course duration; here the scheduling of the entire project is represented using a Gantt chart. In Chapter 4, we discuss the technologies and software used in the development of the prototype. Along with this, we design the model deployment and the block diagram of the prototype. The stages in both the implementation and installation phases are diagrammatically represented.
Chapter 2: Proposed Work

2.1 Problem Definition
Many hospital information systems are designed to support patient billing, inventory management and the generation of simple statistics. Some hospitals use decision support systems, but they are largely limited. They can answer simple queries like "What is the average age of patients who have heart disease?", "How many surgeries resulted in hospital stays longer than 10 days?" and "Identify the female patients who are single, above 30 years old, and who have been treated for cancer." However, they cannot answer complex queries like "Identify the important preoperative predictors that increase the length of hospital stay", "Given patient records on cancer, should treatment include chemotherapy alone, radiation alone, or both chemotherapy and radiation?" and "Given patient records, predict the probability of patients getting heart disease." Clinical decisions are often made based on doctors' intuition and experience rather than on the knowledge-rich data hidden in the database. This practice leads to unwanted biases, errors and excessive medical costs, which affect the quality of service provided to patients. Our project proposes that the integration of clinical decision support with computer-based patient records could reduce medical errors, enhance patient safety, decrease unwanted practice variation, and improve patient outcomes. This suggestion is promising, as data modelling and analysis tools, e.g. data mining, have the potential to generate a knowledge-rich environment which can help to significantly improve the quality of clinical decisions.
2.2 Data Flow Diagram / Flow Chart of Design

A Data Flow Diagram (DFD) is a diagrammatic representation of the information flows within a system, showing:
- how information enters and leaves the system,
- what changes the information, and
- where information is stored.
Level 0 DFD:

Level 1 DFD:
2.3 Use Case Description

Use Case: Login
Participating Actors: User, Database
Pre-condition: The actor should have authorized access.
Post-condition: The user has an account and can use the application.
Success Scenario:
1) Open the Login page.
2) Enter the user name and password.
3) Enter the attribute values.
4) Get the prediction and time chart.

Login:
Steps:
1) The user/administrator is on the home page.
2) The user provides the username and password.
3) If the combination is correct, the service page is displayed; the administrator can then manage and maintain it.
4) If the combination is incorrect, the user is allowed to re-enter the username/password.
Chapter 3: Feasibility Study, Project Planning and Scheduling

3.1 Feasibility Study

The chosen technology makes it possible to run the system on any stand-alone machine. The system requirements for the project are not large, but better configurations will deliver better performance. The software required can be licensed at a reasonable cost. As the project can be implemented on a stand-alone machine, large-scale adoption in large hospitals and healthcare sectors can be expected.
3.2 Project Planning (Timeline Chart)

1. Problem definition: formulation of the problem statement; brainstorming session amongst the group members; attempt to find similar implemented solutions to the problem. (7-8 hrs)
2. Problem evaluation: searching for multiple alternative solutions to the main objective; discussion and searching on the internet. (15 hrs; 24 Aug 2011 to 7 Sept 2011)
3. Input/output specification: describe the input data required and the output according to the software. (9 hrs; 7 Sept 2011 to 5 Oct 2011)
4. Requirements gathering: software and hardware requirements for the logical execution of the system being developed. (12 hrs; 14 Sept 2011 to 12 Oct 2011)
5. Solution design: develop a general idea of the working of the process; visualize a standard solution which satisfies our goals and objectives. (21 hrs; 12 Oct 2011 to 26 Oct 2011)
6. Coding: using the static and dynamic system diagrams to visualize the working of the system and write the actual code for the software. (36 hrs; 31 Oct 2011 to 8 Mar 2012)
7. Debugging. (16 hrs; 8 Mar 2012 to 28 Mar 2012)
8. Creation of documentation: completely document the project. (13-14 hrs; 28 Mar 2012 to 20 Apr 2012)
3.3 Scheduling

[Gantt chart covering August 2011 to May 2012. Tasks: 1.1 Requirement gathering and specification, 1.2 Feasibility study, 1.3 Web survey, 1.4 Market analysis, 1.5 Data design.]
Chapter 4: Technology and Software, Stage-wise Model Development and Flowchart, Implementation Stages, Installation Stages

4.1 Technology and Software
Visual Basic Express

Visual Basic 2005/2008 Express Edition (but not Visual Basic 2010 Express) contains the Visual Basic 6.0 converter, which makes it possible to upgrade Visual Basic 6.0 projects to Visual Basic .NET. The Express Editions (2005 and 2008) mostly share the following limitations:

- No IDE support for databases other than SQL Server Express and Microsoft Access
- No support for web applications with ASP.NET (this can instead be done with Visual Web Developer Express, though the non-Express version of Visual Studio allows both web and Windows applications from the same IDE)
- No support for developing for mobile devices (no templates or emulator)
- No Crystal Reports
- Fewer project templates (e.g. Windows services template, Excel workbook template)
- Limited options for debugging and breakpoints
- No support for creating Windows services (can be gained through download of a project template)
- No support for OpenMP
- Limited deployment options for finished programs

VB Express also lacks some advanced features of the standard versions; for example, there is no Outlining feature ("Hide selection") to collapse or expand selected text.

Despite being a stripped-down version of Visual Studio, Visual Basic 2008 Express includes the following improvements over Visual Basic 2005 Express:

- Includes the visual Windows Presentation Foundation designer codenamed "Cider"
- Debugs at runtime
- Better IntelliSense support:
  - Fixes common spelling errors
  - Corrects most forms of invalid syntax
  - Provides suggestions for class names when specified classes are not found
Microsoft SQL Server 2008

Microsoft SQL Server is a relational database server developed by Microsoft: it is a software product whose primary function is to store and retrieve data as requested by other software applications, be they on the same computer or running on another computer across a network (including the Internet). There are at least a dozen different editions of Microsoft SQL Server aimed at different audiences and different workloads, ranging from small applications that store and retrieve data on the same computer, to millions of users and computers that access huge amounts of data from the Internet at the same time.
SQL Server Express and Enterprise

SQL Server Express is a freeware, light-weight, and redistributable edition of Microsoft SQL Server. It provides a no-cost database for developers writing basic Windows applications and web sites. SQL Server Express replaces MSDE 2000 and significantly expands on its feature set. SQL Server Management Studio Express, which provides a graphical user interface for administering SQL Server Express, can also be downloaded. SQL Server Enterprise Edition, by contrast, includes both the core database engine and add-on services, with a range of tools for creating and managing a SQL Server cluster; it can manage databases as large as 524 petabytes, address 2 terabytes of memory and supports 8 physical processors. The SQL Server Express Edition has the following limitations:

- Limited to one physical CPU
- Lack of enterprise features support
- 1 GB memory limit for the buffer pool
- Databases have a 4 GB size limit (10 GB beginning with SQL Server Express 2008 R2)
- No data mirroring and/or clustering
- No profiler
- No workload throttling
- No GUI to import or export data from/to spreadsheets
- No SQL Server Agent background process
SQL Server 2008 includes better compression features, which also help in improving scalability. It enhanced the indexing algorithms and introduced the notion of filtered indexes. It also includes Resource Governor, which allows reserving resources for certain users or workflows, as well as capabilities for transparent data encryption (TDE) and compression of backups. SQL Server 2008 supports the ADO.NET Entity Framework, and the reporting tools, replication, and data definition are built around the Entity Data Model. SQL Server Reporting Services gains charting capabilities from the integration of the data visualization products of Dundas Data Visualization, Inc., which was acquired by Microsoft. On the management side, SQL Server 2008 includes the Declarative Management Framework, which allows configuring policies and constraints declaratively, on the entire database or on certain tables. The version of SQL Server Management Studio included with SQL Server 2008 supports IntelliSense for SQL queries against a SQL Server 2008 Database Engine. SQL Server 2008 also makes the databases available via Windows PowerShell providers, and management functionality is available as cmdlets, so that the server and all running instances can be managed from Windows PowerShell.

The main unit of data storage is a database, which is a collection of tables with typed columns. SQL Server supports different data types, including primary types such as Integer, Float, Decimal, Char (including character strings), Varchar (variable-length character strings), Binary (for unstructured blobs of data) and Text (for textual data), among others. The rounding of floats to integers uses either Symmetric Arithmetic Rounding or Symmetric Round Down (Fix) depending on arguments: SELECT Round(2.5, 0) gives 3.
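The two rounding modes mentioned above can be sketched in Python, which is useful because Python's built-in round() uses banker's rounding (round(2.5) gives 2) and so does not match T-SQL's behaviour. This is an illustrative sketch, not part of the project's VB/SQL Server code.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_DOWN

def round_half_up(value, places=0):
    """Symmetric arithmetic rounding: halves round away from zero,
    matching T-SQL ROUND(value, places)."""
    q = Decimal(10) ** -places
    return Decimal(str(value)).quantize(q, rounding=ROUND_HALF_UP)

def round_fix(value, places=0):
    """Symmetric round down: truncation toward zero, matching the
    Fix-style behaviour described above."""
    q = Decimal(10) ** -places
    return Decimal(str(value)).quantize(q, rounding=ROUND_DOWN)
```

Under these definitions, round_half_up(2.5) gives 3 (as in the SELECT example above) while round_fix(2.5) gives 2.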
Logging and Transactions

SQL Server ensures that any change to the data is ACID-compliant, i.e. it uses transactions to ensure that the database will always revert to a known consistent state on failure. Each transaction may consist of multiple SQL statements, all of which will only make a permanent change to the database if the last statement in the transaction (a COMMIT statement) completes successfully. If the COMMIT completes successfully, the transaction is safely on disk.

SQL Server implements transactions using a write-ahead log. Any change made to a page updates the in-memory cache of the page; simultaneously, all the operations performed are written to a log, along with the ID of the transaction the operation was part of. Each log entry is identified by an increasing Log Sequence Number (LSN), which is used to ensure that all changes are written to the data files. During a log restore it is also used to check that no logs are duplicated or skipped. SQL Server requires that the log is written to disk before the data page is written back. It must also ensure that all operations in a transaction are written to the log before any COMMIT operation is reported as completed. At a later point the server checkpoints the database and ensures that all pages in the data files have the state of their contents synchronised to a point at or after the LSN at which the checkpoint started. When completed, the checkpoint marks that portion of the log file as complete and may free it. This enables SQL Server to ensure the integrity of the data even if the system fails. On failure, the database log has to be replayed to ensure the data files are in a consistent state. All pages stored in the roll-forward part of the log (not marked as completed) are rewritten to the database; when the end of the log is reached, all open transactions are rolled back using the roll-back portion of the log file. The database engine usually checkpoints quite frequently.
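The write-ahead rule (append to the log before mutating pages, and replay only committed transactions on recovery) can be illustrated with a toy in-memory sketch. This is a deliberately simplified model, not SQL Server's actual log format.

```python
# Toy write-ahead log: every page change is appended to the log (with an
# increasing LSN) before the in-memory page cache is mutated; recovery
# replays committed transactions and discards uncommitted ones.
class ToyWAL:
    def __init__(self):
        self.log = []          # durable log: (lsn, txn_id, op, page, value)
        self.pages = {}        # stand-in for the data files / page cache
        self.next_lsn = 1

    def write(self, txn_id, page, value):
        self.log.append((self.next_lsn, txn_id, "SET", page, value))
        self.next_lsn += 1
        self.pages[page] = value   # in-memory change only after logging

    def commit(self, txn_id):
        self.log.append((self.next_lsn, txn_id, "COMMIT", None, None))
        self.next_lsn += 1

    def recover(self):
        """Rebuild the pages from the log, applying only committed txns."""
        committed = {t for (_, t, op, _, _) in self.log if op == "COMMIT"}
        pages = {}
        for _, txn_id, op, page, value in self.log:
            if op == "SET" and txn_id in committed:
                pages[page] = value
        return pages
```

Recovery here plays the role of the roll-forward/roll-back pass described above: committed changes survive, open transactions are discarded.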
However, in a heavily loaded database this can have a significant performance impact. It is possible to reduce the frequency of checkpoints or to disable them completely, but the roll-forward during a recovery will then take much longer.

Data Retrieval

The main mode of retrieving data from an SQL Server database is querying for it. The query is expressed using a variant of SQL called T-SQL, a dialect Microsoft SQL Server shares with Sybase SQL Server due to its legacy. The query declaratively specifies what is to be retrieved. It is processed by the query processor, which figures out the sequence of steps that will be necessary to retrieve the requested data. The sequence of actions necessary to execute a query is called a query plan. There might be multiple ways to process the same query. For example, for a query that contains a join statement and a select statement, executing the join on both tables and then executing the select on the results would give the same result as selecting from each table and then executing the join, but would result in different execution plans. In such cases, SQL Server chooses the plan that is expected to yield the results in the shortest possible time. This is called query optimization and is performed by the query processor itself. SQL Server includes a cost-based query optimizer, which tries to optimize the cost in terms of the resources it will take to execute the query. Given a query, the query optimizer looks at the database schema, the database statistics and the system load at that time. It then decides in which sequence to access the tables referred to in the query, in which sequence to execute the operations, and which access method to use for each table. For example, if the table has an associated index, it decides whether the index should be used or not: if the index is on a column whose values repeat for most of the rows (low "selectivity"), it might not be worthwhile to use the index to access the data. Finally, it decides whether to execute the query concurrently or not. While a concurrent execution is more costly in terms of total processor time, the fact that the execution is split across different processors might mean it will execute faster. Once a query plan is generated for a query, it is temporarily cached. For further invocations of the same query, the cached plan is used. Unused plans are discarded after some time.

SQL Server also allows stored procedures to be defined. Stored procedures are parameterized T-SQL queries that are stored in the server itself (and not issued by the client application, as is the case with general queries). Stored procedures can accept values sent by the client as input parameters, and send back results as output parameters. They can call defined functions and other stored procedures, including the same stored procedure (up to a set number of times). Access to them can be granted selectively. Unlike other queries, stored procedures have an associated name, which is used at runtime to resolve them into the actual queries. Also, because the code need not be sent from the client every time (as it can be accessed by name), network traffic is reduced and performance somewhat improved. Execution plans for stored procedures are also cached as necessary.

SQL CLR

Microsoft SQL Server 2005 includes a component named SQL CLR ("Common Language Runtime") via which it integrates with the .NET Framework.
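The plan-caching idea described above (optimize once per query text, reuse the plan on repeat invocations) can be sketched as a toy. The "plan" below is a stand-in, not a real execution plan, and the class is illustrative only.

```python
# Toy query-plan cache: the "plan" for a query is computed once and
# reused on later invocations of the same query text. This illustrates
# only the caching idea, not SQL Server's real cost-based optimizer.
class PlanCache:
    def __init__(self):
        self.cache = {}
        self.optimizer_calls = 0

    def _optimize(self, query):
        self.optimizer_calls += 1
        # Stand-in "plan": the sequence of table names in FROM/JOIN order.
        tokens = query.upper().split()
        return [tokens[i + 1] for i, t in enumerate(tokens)
                if t in ("FROM", "JOIN")]

    def get_plan(self, query):
        if query not in self.cache:
            self.cache[query] = self._optimize(query)
        return self.cache[query]
```

Repeated calls with the same query text hit the cache, so the expensive optimization step runs only once, which is the same saving the server obtains.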
Unlike most other applications that use the .NET Framework, SQL Server itself hosts the .NET Framework runtime, i.e. the memory, threading and resource management requirements of the .NET Framework are satisfied by SQLOS itself rather than by the underlying Windows operating system. SQLOS provides deadlock detection and resolution services for .NET code as well. With SQL CLR, stored procedures and triggers can be written in any managed .NET language, including C# and VB.NET. Managed code can also be used to define UDTs (user-defined types), which can persist in the database. Managed code is compiled to .NET assemblies and, after being verified for type safety, registered in the database. After that, it can be invoked like any other procedure. However, only a subset of the Base Class Library is available when running code under SQL CLR; most APIs relating to user interface functionality are not available.

When writing code for SQL CLR, data stored in SQL Server databases can be accessed using the ADO.NET APIs like any other managed application that accesses SQL Server data. However, doing so creates a new database session, different from the one in which the code is executing. To avoid this, SQL Server provides some enhancements to the ADO.NET provider that allow the connection to be redirected to the same session which already hosts the running code. Such connections are called context connections and are set by setting the context connection parameter to true in the connection string. SQL Server also provides several other enhancements to the ADO.NET API, including classes to work with tabular data or a single row of data, as well as classes to work with internal metadata about the data stored in the database. It also provides access to the XML features in SQL Server, including XQuery support. These enhancements are also available in T-SQL procedures as a consequence of the introduction of the new XML datatype (query, value, nodes functions).

Services

SQL Server also includes an assortment of add-on services. While these are not essential for the operation of the database system, they provide value-added services on top of the core database management system. These services either run as part of some SQL Server component or out-of-process as a Windows service, and present their own API to control and interact with them.

Analysis Services

SQL Server Analysis Services adds OLAP and data mining capabilities for SQL Server databases. The OLAP engine supports MOLAP, ROLAP and HOLAP storage modes for data. Analysis Services supports the XML for Analysis standard as the underlying communication protocol. The cube data can be accessed using MDX and LINQ queries. Data mining specific functionality is exposed via the DMX query language. Analysis Services includes various algorithms for use in data mining: decision trees, a clustering algorithm, the Naive Bayes algorithm, time series analysis, a sequence clustering algorithm, linear and logistic regression analysis, and neural networks.

Reporting Services

SQL Server Reporting Services is a report generation environment for data gathered from SQL Server databases. It is administered via a web interface. Reporting Services features a web services interface to support the development of custom reporting applications. Reports are created as RDL files. Reports can be designed using recent versions of Microsoft Visual Studio (Visual Studio .NET 2003, 2005 and 2008) with Business Intelligence Development Studio installed, or with the included Report Builder. Once created, RDL files can be rendered in a variety of formats including Excel, PDF, CSV, XML, TIFF (and other image formats), and HTML Web Archive.
Notification Services

Originally introduced as a post-release add-on for SQL Server 2000, Notification Services was bundled as part of the Microsoft SQL Server platform for the first and only time with SQL Server 2005. SQL Server Notification Services is a mechanism for generating data-driven notifications, which are sent to Notification Services subscribers. A subscriber registers for a specific event or transaction (which is registered on the database server as a trigger): when the event occurs, Notification Services can use one of three methods to send a message to the subscriber informing them of the occurrence of the event: SMTP, SOAP, or writing to a file in the filesystem. Notification Services was discontinued by Microsoft with the release of SQL Server 2008 in August 2008 and is no longer an officially supported component of the SQL Server database platform.
Integration Services

SQL Server Integration Services is used to integrate data from different data sources. It provides the ETL capabilities of SQL Server for data warehousing needs. Integration Services includes GUI tools to build data extraction workflows incorporating various functionality, such as extracting data from various sources, querying data, transforming data (including aggregating, de-duplicating and merging data), then loading the transformed data into other destinations, or sending e-mails detailing the status of the operation as defined by the user.

Full Text Search Service

SQL Server Full Text Search service is a specialized indexing and querying service for unstructured text stored in SQL Server databases. The full text search index can be created on any column with character-based text data. It allows words to be searched for in the text columns. While this can be done with the SQL LIKE operator, using the SQL Server Full Text Search service can be more efficient. The Full Text Search engine is divided into two processes, the Filter Daemon process and the Search process, which interact with SQL Server. The Search process includes the indexer (which creates the full text indexes) and the full text query processor. The indexer scans through text columns in the database. It can also index binary columns, using iFilters to extract meaningful text from the binary blob (for example, when a Microsoft Word document is stored as an unstructured binary file in a database). The iFilters are hosted by the Filter Daemon process. Once the text is extracted, the Filter Daemon process breaks it up into a sequence of words and hands it over to the indexer, which filters out noise words. With the remaining words, an inverted index is created, associating each word with the columns it was found in. SQL Server itself includes a Gatherer component that monitors changes to tables and invokes the indexer in case of updates. When a full text query is received by the SQL Server query processor, it is handed over to the FTS query processor in the Search process. The FTS query processor breaks up the query into its constituent words, filters out the noise words, and uses an inbuilt thesaurus to find linguistic variants for each word. The words are then queried against the inverted index and a rank of their accuracy is computed. The results are returned to the client via the SQL Server process.

SQL Server Management Studio

SQL Server Management Studio is a GUI tool included with SQL Server 2005 and later for configuring, managing, and administering all components within Microsoft SQL Server. The tool includes both script editors and graphical tools that work with objects and features of the server. SQL Server Management Studio replaced Enterprise Manager as the primary management interface for Microsoft SQL Server beginning with SQL Server 2005. A version of SQL Server Management Studio is also available for SQL Server Express Edition, where it is known as SQL Server Management Studio Express (SSMSE).
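The inverted-index construction described in the Full Text Search section above can be sketched as a toy: split text into words, drop noise words, and map each remaining word to the locations where it occurs. The noise-word list below is a made-up sample for illustration, not SQL Server's actual list.

```python
# Toy inverted index over (row_id, column, text) records, in the spirit
# of the Full Text Search indexer. The noise-word list is illustrative.
NOISE_WORDS = {"a", "an", "the", "of", "and", "is", "in", "to", "no"}

def build_inverted_index(records):
    index = {}
    for row_id, column, text in records:
        for word in text.lower().split():
            if word in NOISE_WORDS:
                continue
            index.setdefault(word, set()).add((row_id, column))
    return index

def search(index, word):
    """Return the (row_id, column) locations where the word occurs."""
    return index.get(word.lower(), set())
```

Looking a word up in such an index is a single dictionary access, which is why it beats a LIKE scan over every row.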
A central feature of SQL Server Management Studio is the Object Explorer, which allows the user to browse, select, and act upon any of the objects within the server. It can be used to visually observe and analyze query plans and to optimize database performance, among other things. SQL Server Management Studio can also be used to create a new database, alter an existing database schema by adding or modifying tables and indexes, or analyze performance. It includes query windows which provide a GUI-based interface to write and execute queries.

Business Intelligence Development Studio

Business Intelligence Development Studio (BIDS) is the IDE from Microsoft used for developing data analysis and Business Intelligence solutions utilizing Microsoft SQL Server Analysis Services, Reporting Services and Integration Services. It is based on the Microsoft Visual Studio development environment but is customized with the SQL Server services-specific extensions and project types, including tools, controls and projects for reports (using Reporting Services), ETL dataflows, OLAP cubes and data mining structures (using Analysis Services).
4.2 DATABASE DESIGN

1. PATIENT DATA

Field Name            Data Type   Allow Null
Patient_id            Int(10)     No
Age                   Int(10)     No
Sex                   Int(10)     No
Chest Pain Type       Int(10)     No
Fasting Blood Sugar   Int(10)     No
Restecg               Int(10)     No
Exang                 Int(10)     No
2. USER DATA
DATASET :
4.3 Installation Stages
1. Run the Intelligent Heart Disease Prediction setup as shown below and click the Next button.
2. Select the installation folder where you want to install the application and click the Next button.
3. Check the disk requirements, choose an appropriate drive for the installation, and click OK.
4. After clicking the Next button, the setup project runs as follows:
5. Click the Close button to finish; the setup is installed successfully.
4.4 Implementation Stages
1. Splash Screen
2. Login Form
3. Prediction Form
4. Result
The Naive Bayes algorithm is based on conditional probabilities. It uses Bayes' Theorem, a formula that calculates a probability by counting the frequency of values and combinations of values in the historical data. Bayes' Theorem finds the probability of an event occurring given the probability of another event that has already occurred. If B represents the dependent event and A represents the prior event, Bayes' Theorem can be stated as follows:

P(B | A) = P(A | B) P(B) / P(A)

Consider a supervised learning problem in which we wish to approximate an unknown target function f : X -> Y, or equivalently P(Y | X). Naive Bayes makes the assumption that each predictor is conditionally independent of the others: for a given target value, the distribution of each predictor is independent of the other predictors. In practice, this assumption of independence, even when violated, does not degrade the model's predictive accuracy significantly, and it makes the difference between a fast, computationally feasible algorithm and an intractable one. Sometimes the distribution of a given predictor is clearly not representative of the larger population. For example, there might be only a few customers under 21 in the training data, while in fact there are many customers in this age group in the wider customer population. The Naive Bayes algorithm affords fast, highly scalable model building and scoring. It scales linearly with the number of predictors and rows, and the build process for Naive Bayes is parallelized. Naive Bayes can be used for both binary and multiclass classification problems.

The MSE of an estimator θ̂ with respect to the estimated parameter θ is defined as

MSE(θ̂) = E[(θ̂ - θ)^2]
The MSE is equal to the sum of the variance and the squared bias of the estimator:

MSE(θ̂) = Var(θ̂) + (Bias(θ̂, θ))^2
The MSE thus assesses the quality of an estimator in terms of its variation and unbiasedness. Since the MSE is an expectation, it is not a random variable. It may be a function of the unknown parameter θ, but it does not depend on any random quantities. However, when the MSE is computed for a particular estimate of the true value of θ, which is not known, it will be subject to estimation error. In a Bayesian sense, this means that there are cases in which it may be treated as a random variable.
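The decomposition above can be checked numerically. The following sketch is illustrative only; the sample estimates and the true parameter value are made-up numbers, not project data:

```python
# Numerical check of MSE = variance + bias^2 for a set of estimates
# of a known parameter theta. The values below are illustrative only.

def mse_decomposition(estimates, theta):
    """Return (mse, variance, bias) of the estimates w.r.t. the true theta."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    mse = sum((e - theta) ** 2 for e in estimates) / n
    variance = sum((e - mean_est) ** 2 for e in estimates) / n
    bias = mean_est - theta
    return mse, variance, bias

mse, var, bias = mse_decomposition([1.0, 2.0, 3.0], theta=1.0)
print(round(mse, 4))  # 1.6667, i.e. 5/3
# The decomposition holds: 5/3 = 2/3 (variance) + 1^2 (squared bias)
assert abs(mse - (var + bias ** 2)) < 1e-12
```

The same check holds for any list of estimates, since the identity is algebraic rather than a property of the data.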
Sample records from the Heart training dataset, with the attributes Id, age, sex, chest pain type, fasting blood sugar, heart rate, exang, restecg, blood pressure, serum cholesterol, slope, CA, thal, old peak and the predict target.
Probability (yes | M) = 0.514
Probability (yes | F) = 0.3414
Probability (yes | chest pain = 4) = 0.818
Probability (yes | fasting sugar > 120) = 0.234
Probability (yes | heart rate > 200) = 0.3414
Probability (yes | exang = 1) = 0.1014
Probability (yes | old peak = 1) = 0.3414
Probability (yes | age > 60) = 0.454
Probability (yes | restecg = 2) = 0.514

Suppose a new tuple appears with attribute values t = (64.0, 0.0, 2.0, 140.0, 294.0, 0.0, 2.0, 153.0, 0.0, 1.3, 2.0, 0.0, 3.0, 0).

P(t | yes) = 0.454 * 0.3414 * 0.1014 * 0.5414 * 0.234 * 0.515

Since this probability is less than 50%, and the minimum confidence is 50%, the tuple is classified as No.
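The scoring step of this worked example can be sketched in a few lines. The factor list reproduces the conditional probabilities multiplied in the text; the classify helper and the 50% minimum-confidence threshold follow the rule stated above:

```python
# Naive Bayes scoring sketch: multiply the class-conditional probabilities
# matched by the new tuple t and compare against the minimum confidence.
# The factors are the hand-computed conditionals from the worked example.
from math import prod

factors = [0.454, 0.3414, 0.1014, 0.5414, 0.234, 0.515]

def classify(p_factors, min_confidence=0.5):
    """Classify as 'Yes' only if the joint score reaches min_confidence."""
    score = prod(p_factors)
    return "Yes" if score >= min_confidence else "No"

print(classify(factors))  # "No": the joint score falls below 50%
```

A full implementation would compute this score for every class and pick the largest; here only the "yes" score is needed because of the fixed 50% cut-off.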
Attribute Discrimination

Attribute            Value              Favors 0   Favors 1
chest                4                  79.101
old                  0                  44.115
Thal                 3                  31.582
Slope                2                  29.208
Slope                1                  29.000
(all other states)                                 100.000
Attribute Profile

Attribute   State     Population (All)   predict = 1   predict = 0   Missing
Size                  176                65            111           0
chest       4         85                 0.815         0.288         0.000
chest       3         46                 0.092         0.360         0.000
chest       2         30                 0.062         0.234         0.000
chest       1         15                 0.031         0.117         0.000
chest       Missing   0                  0.000         0.000         0.000
old         0         88                 0.200         0.676         0.000
old         1         34                 0.246         0.162         0.000
old         2         22                 0.231         0.063         0.000
old         3         21                 0.231         0.054         0.000
old         4         11                 0.092         0.045         0.000
old         Missing   0                  0.000         0.000         0.000
Serum       < 1       95                 0.385         0.631         0.000
Serum       >= 1      81                 0.615         0.369         0.000
Serum       Missing   0                  0.000         0.000         0.000
Slope       1         84                 0.277         0.595         0.000
Slope       2         80                 0.662         0.333         0.000
Slope       3         12                 0.062         0.072         0.000
Slope       Missing   0                  0.000         0.000         0.000
Thal        3         93                 0.292         0.667         0.000
Thal        7         71                 0.585         0.297         0.000
Thal        6         12                 0.123         0.036         0.000
Thal        Missing   0                  0.000         0.000         0.000
PERFORMANCE ANALYSIS:

Ten-fold cross-validation, partition size 2.

Test             Measure                  Fold values (1-10)                Average    Std. Deviation
Classification   Pass                     1, 2, 1, 1, 1, 1, 1, 1, 1, 1      1.1        0.3
Classification   Fail                     1, 0, 1, 1, 1, 1, 1, 1, 1, 1      0.9        0.3
Likelihood       Log Score                -1.0986 in every fold             -1.0986    2.107e-008
Likelihood       Lift                     -0.4055 (fold 2: -1.0986)         -0.4748    0.2079
Likelihood       Root Mean Square Error   0.6667 in every fold              0.6667     NaN
model as a best choice model or online selection model algorithm. Another use of decision trees is as a descriptive means for calculating conditional probabilities.
The main ideas behind the ID3 algorithm are: each non-leaf node of a decision tree corresponds to an input attribute, and each arc to a possible value of that attribute. A leaf node corresponds to the expected value of the output attribute when the input attributes are described by the path from the root node to that leaf node. In a good decision tree, each non-leaf node should correspond to the input attribute that is the most informative about the output attribute amongst all the input attributes not yet considered on the path from the root node to that node, because we would like to predict the output attribute using the smallest possible number of questions on average.

Entropy

Entropy is used to determine how informative a particular input attribute is about the output attribute for a subset of the training data. Entropy is a measure of uncertainty in communication systems introduced by Shannon (1948) and is fundamental in modern information theory. For a set S whose examples belong to classes with proportions p1, ..., pn, the entropy is

Entropy(S) = - sum over i of (pi log2 pi)
For the same dataset as in the above example, let us first try to split on the age attribute. For age >= 60:

Entropy (4F, 5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911

Gain (age >= 60) = 0.9911 - (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911

Similarly, we have Gain (chest pain = 4) = 0.515, Gain (blood pressure > 150) = 0.423, Gain (fasting sugar > 150) = 0.0895, Gain (heart rate > 230) = 0.0034, Gain (exang = 0) = 0.234 and Gain (old peak = 0) = 0.0023. Since chest pain has the highest information gain, the decision tree will start with chest pain.
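The entropy and information-gain computations above can be reproduced with a short sketch. The child class counts (1, 3) and (2, 3) are an assumption chosen to match the stated child entropies 0.8113 and 0.9710; they are not taken from the project data:

```python
# Shannon entropy and ID3-style information gain, reproducing the
# worked age >= 60 split from the text.
from math import log2

def entropy(counts):
    """Shannon entropy of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def information_gain(parent_counts, child_counts_list):
    """Gain = parent entropy minus the weighted entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(sum(c) / total * entropy(c) for c in child_counts_list)
    return entropy(parent_counts) - weighted

# 9 patients with 4 of one class and 5 of the other, as in the example.
print(round(entropy([4, 5]), 4))                              # 0.9911
# Hypothetical children with counts (1,3) and (2,3), which give the
# child entropies 0.8113 and 0.9710 used in the text.
print(round(information_gain([4, 5], [[1, 3], [2, 3]]), 4))   # 0.0911
```

ID3 simply evaluates this gain for every candidate attribute and splits on the largest one, which is why chest pain becomes the root in the example above.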
PERFORMANCE ANALYSIS:

Ten-fold cross-validation, partition size 2.

Test             Measure                  Fold values                       Average    Std. Deviation
Classification   Pass
Classification   Fail                                                      0.9        0.3
Likelihood       Log Score                -0.6949 (fold 2: -0.6931)        -0.6947    0.0005
Likelihood       Lift                     -0.0017 (fold 2: -0.6931)        -0.0709    0.2074
Likelihood       Root Mean Square Error   0.4706 (fold 2: 0.5)
5.3 Neural Network
An artificial neural network (ANN), usually called a neural network (NN), is a mathematical or computational model inspired by the structure and/or functional aspects of biological neural networks. A neural network consists of an interconnected group of artificial neurons, and it processes information using a connectionist approach to computation. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. Modern neural networks are non-linear statistical data modeling tools. They are usually used to model complex relationships between inputs and outputs or to find patterns in data. Artificial neural networks are algorithms that can be used to perform nonlinear statistical modeling and provide a new alternative to logistic regression, the most commonly used method for developing predictive models for dichotomous outcomes in medicine. Neural networks offer a number of advantages, including requiring less formal statistical training, the ability to implicitly detect complex nonlinear relationships between dependent and independent variables, the ability to detect all possible interactions between predictor variables, and the availability of multiple training algorithms. Disadvantages include their "black box" nature, greater computational burden, proneness to overfitting, and the empirical nature of model development. An overview of the features of neural networks and logistic regression is presented, and the advantages and disadvantages of using this modeling technique are discussed. Perhaps the greatest advantage of ANNs is their ability to be used as an arbitrary function approximation mechanism that "learns" from observed data. However, using them is not so straightforward, and a relatively good understanding of the underlying theory is essential.
Choice of model: this depends on the data representation and the application. Overly complex models tend to lead to problems with learning.
Learning algorithm: there are numerous trade-offs between learning algorithms. Almost any algorithm will work well with the correct hyperparameters for training on a particular fixed data set; however, selecting and tuning an algorithm for training on unseen data requires a significant amount of experimentation.
Robustness: if the model, cost function and learning algorithm are selected appropriately, the resulting ANN can be extremely robust.
With the correct implementation, ANNs can be used naturally in online learning and large data set applications. Their simple implementation and the existence of mostly local dependencies exhibited in the structure allows for fast, parallel implementations in hardware. Theoretical and computational neuroscience is the field concerned with the theoretical analysis and computational modeling of biological neural systems. Since neural systems are intimately related to cognitive processes and behavior, the field is closely related to cognitive and behavioral modeling. The aim of the field is to create models of biological neural systems in order to understand how biological systems work. To gain this understanding, neuroscientists strive to make a link between observed biological processes (data), biologically plausible mechanisms for neural processing and learning (biological neural network models) and theory (statistical learning theory and information theory). Backpropagation is a common method of training artificial neural networks so as to minimize the objective function. Backpropagation is an iterative process that can often take a great deal of time to complete. When multicore computers are used, multithreaded techniques can greatly decrease the amount of time that backpropagation takes to converge. If batching is being used, it is relatively simple to adapt the backpropagation algorithm to operate in a multithreaded manner.

ATTRIBUTE PROFILE:

Attribute        Value        Score
Thal             7            25.49
Thal             6            16
Thal             3            15.98
Slope            2            23.97
Slope            1            20.96
Slope            3            12.71
sex              0            4.05
sex              1            1.7
age              77           100
age              40           76.46
age              51           65.24
age              70           43.99
age              62           41.48
age              38           39.31
old              1            11.16
old              3            10.71
old              2            7.43
old              0            4.86
Heart_rate       204 - 230    12.04
Heart_rate       >= 286       8.57
Heart_rate       254 - 286    7.7
Heart_rate       230 - 254    5.56
Heart_rate       < 204        0.52
Fasting          132 - 144    23.46
Fasting          >= 144       14.21
Fasting          120 - 125    10.91
Fasting          < 120        2.27
Fasting          125 - 132    1.5
Exang            1            1.04
Exang            0            0.04
chest            2            37.91
chest            4            34.02
chest            3            27.15
chest            1            15.85
CA               2            9.87
CA               3            9.84
CA               1            8.01
CA               0            1.12
Blood_pressure   111          98.68
Blood_pressure   120          96.93
Blood_pressure   95           84.58
Blood_pressure   128          79.69
Blood_pressure   137          78.59
Blood_pressure   145          75.72
Blood_pressure   136          75.54
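The backpropagation procedure described above can be illustrated with a minimal hand-coded network. Everything here is a toy sketch: the AND-style data set, the initial weights and the learning rate are made up for illustration, and this is not the neural network model actually built by the project:

```python
# Minimal 2-2-1 network trained with plain batch-of-one backpropagation
# on a toy AND-style dataset. Fixed initial weights keep the run
# deterministic; the numbers are illustrative only.
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

# Hidden layer: 2 neurons, each with 2 input weights and a bias.
w_h = [[0.1, -0.2, 0.0], [0.3, 0.1, 0.0]]
# Output neuron: 2 hidden weights and a bias.
w_o = [0.2, -0.1, 0.0]
lr = 0.5

def forward(x):
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_h]
    o = sigmoid(w_o[0] * h[0] + w_o[1] * h[1] + w_o[2])
    return h, o

def loss():
    return sum((forward(x)[1] - y) ** 2 for x, y in data)

before = loss()
for _ in range(2000):
    for x, y in data:
        h, o = forward(x)
        # Output delta: derivative of squared error through the sigmoid.
        d_o = (o - y) * o * (1 - o)
        # Hidden deltas, propagated back through the output weights.
        d_h = [d_o * w_o[i] * h[i] * (1 - h[i]) for i in range(2)]
        # Gradient-descent weight updates.
        for i in range(2):
            w_h[i][0] -= lr * d_h[i] * x[0]
            w_h[i][1] -= lr * d_h[i] * x[1]
            w_h[i][2] -= lr * d_h[i]
        w_o[0] -= lr * d_o * h[0]
        w_o[1] -= lr * d_o * h[1]
        w_o[2] -= lr * d_o
after = loss()
print(after < before)  # True: training reduces the squared error
```

The multithreaded variant mentioned in the text parallelizes exactly the loop over training examples (batching), accumulating the weight updates before applying them.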
PERFORMANCE ANALYSIS:

Ten-fold cross-validation, partition size 2.

Test             Measure                  Fold values (1-10)                Average    Std. Deviation
Classification   Pass                     1, 2, 1, 1, 1, 1, 1, 1, 1, 1      1.1        0.3
Classification   Fail                     1, 0, 1, 1, 1, 1, 1, 1, 1, 1      0.9        0.3
Likelihood       Log Score                -1.0986 in every fold             -1.0986    2.107e-008
Likelihood       Lift                     -0.4055 (fold 2: -1.0986)         -0.4748    0.2079
Likelihood       Root Mean Square Error   0.6667 in every fold              0.6667     NaN