CASE STUDY ON KIDNEY DISEASE ANALYSIS

BY
TEAM : SPARKS

N. VINILA
S. BHAVANI
S. MOUNIKA
CH. MADHUMITHA
I. SRI POORVAJA
INTRODUCTION

Chronic kidney disease (CKD) has become a global health issue
and an area of concern. It is a condition in which the kidneys
become damaged and cannot filter toxic wastes from the body.
Chronic kidney disease, also called chronic kidney failure,
describes the gradual loss of kidney function.
CONCEPTS
• Importing Packages
• Data Collection
• Data Cleaning
• Exploratory Data Analysis
• Algorithms
• Data Visualization
IMPORTING PACKAGES

• Numpy
• Pandas
• Matplotlib
Syntax :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
DATA COLLECTION

• The current data set provides data for kidney disease
analysis. The data set is 21kidney_disease.csv.
• It contains 401 rows and 26 columns.
Describing the data

Correlation
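The loading, describing and correlation steps named above can be sketched as follows. This is a minimal illustration, not the team's actual notebook: the few rows in SAMPLE_CSV are made up, the column names are assumptions, and load_dataset is a hypothetical helper. In practice the path to 21kidney_disease.csv would be passed instead of the in-memory sample.

```python
import io

import pandas as pd

# Hypothetical stand-in for 21kidney_disease.csv: made-up rows and
# assumed column names, used only so the sketch runs on its own.
SAMPLE_CSV = """id,age,bp,sg,al,su,hemo,classification
0,48,80,1.020,1,0,15.4,ckd
1,7,50,1.020,4,0,11.3,ckd
2,62,80,1.010,2,3,9.6,ckd
3,40,80,1.025,0,0,15.0,notckd
4,23,80,1.025,0,0,17.0,notckd
"""


def load_dataset(source):
    """Read a CSV file (a path or a file-like object) into a DataFrame."""
    return pd.read_csv(source)


df = load_dataset(io.StringIO(SAMPLE_CSV))  # or load_dataset("21kidney_disease.csv")

print(df.shape)        # (rows, columns)
summary = df.describe()  # count, mean, std, min, quartiles, max per numeric column
print(summary)

# Pairwise Pearson correlation between the numeric columns only.
corr = df.select_dtypes("number").corr()
print(corr)
```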
Data Cleaning
• Data cleansing or data cleaning is the process of detecting and
correcting (or removing) corrupt or inaccurate data from a
data set.
• There are generally two ways of imputing missing values: the
pandas DataFrame fillna method or a scikit-learn imputer.
Finding null values
• Lambda function
Unique() :

Replace :
Fillna :
• First, call value_counts() on the column to see how often each
value occurs.
• Apply fillna to that column.
• Apply fillna to all the columns with NaN values.
• Check whether any NaN values remain.
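The steps above can be sketched on a toy frame. The two columns and their values are made up for illustration, and the fill strategy (mean for numeric columns, most frequent value for categorical ones) is an assumption, since the slides do not state which fill values were used.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column.
df = pd.DataFrame({
    "age": [48.0, np.nan, 62.0, 40.0],
    "rbc": ["normal", np.nan, "normal", "abnormal"],
})

print(df.isnull().sum())         # number of NaN values per column
print(df["rbc"].value_counts())  # how often each category occurs

# Fill every column that has NaN values.
# Assumed strategy: mean for numeric columns, mode for the rest.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        fill_value = df[col].mean()
    else:
        fill_value = df[col].mode()[0]
    df[col] = df[col].fillna(fill_value)

# Check whether any NaN values remain.
remaining = int(df.isnull().sum().sum())
print(remaining)
```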
Apply Lambda
Split dataset into attributes using iloc
• Encoding labels
• iloc
• train_test_split

TRAINING (%)   TESTING (%)
80             20
70             30
60             40
50             50
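A minimal sketch of label encoding, attribute selection with iloc, and the four train/test ratios listed above. The ten rows are invented, the column names are assumptions, and random_state=0 is chosen only to make the sketch repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; the real data set has far more rows and columns.
df = pd.DataFrame({
    "age": [48, 7, 62, 40, 23, 51, 60, 35, 44, 58],
    "bp": [80, 50, 80, 80, 80, 90, 70, 70, 80, 90],
    "classification": ["ckd", "ckd", "ckd", "notckd", "notckd",
                       "ckd", "ckd", "notckd", "notckd", "ckd"],
})

# Encode the text labels as integers (classes are sorted alphabetically).
df["classification"] = LabelEncoder().fit_transform(df["classification"])

# iloc selects by position: every column but the last is an input
# attribute, and the last column is the target.
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Try each train:test ratio from the table (80:20, 70:30, 60:40, 50:50).
split_sizes = {}
for test_size in (0.2, 0.3, 0.4, 0.5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0
    )
    split_sizes[test_size] = (len(X_train), len(X_test))

print(split_sizes)
```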
Algorithms
• Logistic Regression
• Support Vector Classification
• Decision Tree Classifier
• Random Forest
• K Nearest Neighbors
• Naive Bayes
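One way to compare the six algorithms is to fit each on the same split and record its test accuracy. The sketch below uses synthetic data from make_classification rather than the kidney data set, so its numbers are not the ones reported in these slides. The hyperparameters, and the choice of GaussianNB as the Naive Bayes variant, are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (a stand-in, not the CKD data).
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Assumed settings; the slides do not list the hyperparameters used.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Classification": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "K Nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in accuracies.items():
    print(f"{name}: {acc:.2f}")
```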
Confusion Matrix
• A confusion matrix is a table that is often used to describe the
performance of a classification model (or "classifier") on a set
of test data for which the true values are known.

• Recall
• Precision
• F-Measure
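As a worked example, these measures can be computed directly from a 2x2 confusion matrix. The sketch uses the Decision Tree matrix reported later in these slides (95 5 / 8 52). The layout is an assumption: rows are taken as true classes and columns as predicted classes, with the second class as the positive one (the scikit-learn convention), which the slides do not state.

```python
import numpy as np

# Decision Tree confusion matrix from the slides.
# Assumed layout: [[TN, FP],
#                  [FN, TP]]
cm = np.array([[95, 5],
               [8, 52]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # 147 / 160
precision = tp / (tp + fp)        # 52 / 57
recall = tp / (tp + fn)           # 52 / 60
f_measure = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f_measure={f_measure:.3f}")
```

Under the opposite layout assumption (first class positive), precision and recall would be computed from the first row and column instead, so the values would differ.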
Logistic Regression
A logistic regression model predicts a categorical dependent
variable (here, the class label) by modeling its relationship
with one or more independent variables.
Support Vector Classification

• A Support Vector Machine (SVM) is a supervised machine
learning algorithm that can be employed for
both classification and regression purposes. SVMs are more
commonly used in classification problems.
• Confusion Matrix :
91 9
24 36
Decision Tree

• Decision trees are a type of supervised machine learning
(that is, the training data specifies both the input and the
corresponding output) in which the data is repeatedly split
according to a certain parameter.
• Confusion Matrix :
95 5
8 52
Random Forest
• A random forest is an ensemble method that builds a large
number of randomized decision trees, each analyzing a subset
of the variables, and combines their predictions.
• Confusion Matrix :
95 5
8 52
K Nearest Neighbor
• K nearest neighbors is a simple algorithm that stores all
available cases and classifies new cases based on a similarity
measure (e.g., distance functions).
• Confusion Matrix :
76 24
16 44
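The nearest-neighbor idea described above (store all cases, then classify a new case by its distance to them) can be sketched from scratch. The helper knn_predict, the two-feature points and the choice of Euclidean distance with k=3 are all illustrative assumptions, not the configuration behind the matrix above.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest stored cases."""
    # Euclidean distance from x_new to every stored case.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest cases.
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those cases.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical stored cases: two well-separated groups.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array(["notckd", "notckd", "notckd", "ckd", "ckd", "ckd"])

prediction = knn_predict(X_train, y_train, np.array([5.1, 5.0]))
print(prediction)
```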
Naive Bayes
• Naive Bayes is a simple but surprisingly powerful algorithm for
predictive modeling.
• Confusion Matrix :
85 7
15 53
Accuracy (%) by train:test ratio

Algorithm                        80:20   70:30   60:40   50:50
Logistic Regression              92      89      86      84
Support Vector Classification    82      79      79      79
Decision Tree                    90      90      91      89
Random Forest                    90      90      91      89
KNN                              77      77      75      73
Naïve Bayes                      92      89      86      84
We considered the following input and output variables.
Both sets of input variables achieved their accuracy at different ratios with different algorithms.

Input variable                                          Output variable   Algorithm                          Ratio   Accuracy
Age, blood pressure, specific gravity, albumin, sugar   Classification    Logistic Regression, Naive Bayes   80:20   92%
Haemoglobin                                             Classification    Decision Tree, Random Forest       60:40   91%
Histogram
Box plot
Dist Plot
Heat Map
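The four chart types named above can be sketched with matplotlib alone, the only plotting package these slides import. The data is randomly generated and the column names are assumptions. The density-scaled histogram is only an approximation of a dist plot, and the slides do not say which library produced the original charts.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line for interactive use

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical random data standing in for three numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 80, size=100),
    "bp": rng.normal(80, 10, size=100),
    "hemo": rng.normal(13, 2, size=100),
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Histogram: frequency of values in one column.
axes[0, 0].hist(df["age"].to_numpy(), bins=10)
axes[0, 0].set_title("Histogram (age)")

# Box plot: median, quartiles and outliers of one column.
axes[0, 1].boxplot(df["bp"].to_numpy())
axes[0, 1].set_title("Box plot (bp)")

# Density-scaled histogram as a rough stand-in for a dist plot.
axes[1, 0].hist(df["hemo"].to_numpy(), bins=15, density=True)
axes[1, 0].set_title("Distribution (hemo)")

# Heat map of the correlation matrix.
corr = df.corr()
image = axes[1, 1].imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
axes[1, 1].set_xticks(range(len(corr.columns)))
axes[1, 1].set_xticklabels(corr.columns)
axes[1, 1].set_yticks(range(len(corr.columns)))
axes[1, 1].set_yticklabels(corr.columns)
axes[1, 1].set_title("Heat map (correlation)")
fig.colorbar(image, ax=axes[1, 1])

fig.tight_layout()
# Call plt.show() or fig.savefig(...) here to view or save the figure.
```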
THANK YOU
