
A Step by Step Guide for Predictive Modeling Using R: Part 1

Introduction

R has been a language of choice for predictive analytics due to its vast collection of packages and strong developer community. In this series, we will demonstrate how to use R at each stage of predictive analysis and discuss the R packages available for building a predictive model for one of the datasets in the UC Irvine Machine Learning Repository.

Stages of Predictive Modeling

Predictive modeling is the process of building a model that predicts future outcomes using statistical techniques. To generate the model, historical data of prior occurrences needs to be analyzed, classified, and validated. The stages of predictive modeling are listed below.

1. Data Gathering and Cleansing  

Read the data from various sources and perform data cleansing operations, such as identifying noisy data and removing outliers, to make the prediction more accurate. R packages can then be applied to handle missing data and invalid values.
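
To give a flavor of this stage, here is a minimal sketch of missing-value imputation and outlier removal on a hypothetical numeric attribute (not the project's actual preprocessing code):

# A toy data frame; "age" is a hypothetical attribute and 200 an artificial outlier
df <- data.frame(age = c(23, 45, NA, 51, 38, 200))

# Count missing values per column
colSums(is.na(df))

# Impute missing numeric values with the column mean (one simple strategy)
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)

# Keep only rows within 1.5 * IQR, a common rule of thumb for outliers
q <- quantile(df$age, probs = c(0.25, 0.75))
iqr <- q[2] - q[1]
df <- df[df$age >= q[1] - 1.5 * iqr & df$age <= q[2] + 1.5 * iqr, ]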

2. Data Analysis/Transformation

Before building a model, the data needs to be transformed for efficient processing, for example by normalizing it without losing its significance. Normalization can be done by scaling the values to a particular range. In addition, irrelevant attributes, those that play the least significant role in determining the outcome, can be identified and removed through correlation analysis.
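
As an illustration, here is a minimal sketch of min-max scaling and a correlation check on hypothetical numeric attributes (the column names are made up):

# Hypothetical numeric attributes
df <- data.frame(age = c(23, 45, 51, 38), bp = c(80, 70, 90, 76))

# Min-max scaling of every column into the 0-1 range
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
df_scaled <- as.data.frame(lapply(df, min_max))

# Correlation matrix; attributes with near-zero correlation to the outcome
# are candidates for removal
cor(df_scaled)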

3. Building a Predictive Model 

Generate a decision tree or apply linear/logistic regression techniques to build a predictive model. This involves choosing a classification algorithm, identifying test data, and generating classification rules. The confidence of the classification model is then assessed by applying it to the test data.
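
For a sense of what this looks like in R, here is a minimal sketch using the rpart package and the built-in iris dataset as a stand-in for the CKD data (not the code developed in this series):

# Hold out 30% of the rows as test data
library(rpart)
set.seed(42)
test_idx <- sample(nrow(iris), size = 0.3 * nrow(iris))
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

# Fit a classification tree on the training rows
tree <- rpart(Species ~ ., data = train, method = "class")

# Apply the model to the held-out data and compute its accuracy
pred <- predict(tree, newdata = test, type = "class")
mean(pred == test$Species)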

4. Inferences 

Perform a cluster analysis to segregate the data into groups. These meaningful subsets of the population can then be used to draw inferences.
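
Here is a minimal sketch of a cluster analysis with k-means, again using iris as a stand-in dataset:

# Cluster the scaled numeric attributes into three groups
set.seed(42)
clusters <- kmeans(scale(iris[, 1:4]), centers = 3)

# Inspect the cluster sizes and how they line up with the known classes
table(clusters$cluster, iris$Species)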

Main Module of the Project

In this series, we will demonstrate how to generate a predictive model for chronic kidney disease, illustrating how to step through the various stages of the data mining process and apply the available R packages.

The code snippet below is the main module of our project, which invokes the submodules for each stage of predictive modeling. This is a preview of what’s to come.

# Load the submodules for each stage of the predictive modeling workflow
LoadLibraries <- function() {
  source("Preprocess.R")           # data gathering and missing-value handling
  source("Outlier.R")              # noisy data and outlier treatment
  source("RegressionAnalysis.R")   # correlation/regression analysis
  source("LogisticRegression.R")   # logistic regression model
  source("DecisionTree.R")         # decision tree model
  source("Clustering.R")           # cluster analysis
  source("PredictCKD.R")           # prediction using the generated model
  library('VIM')                   # visualization and imputation of missing values
}
main <- function() {

  LoadLibraries()   # load the submodules defined above

  # Stage 1: read the raw dataset
  CKD_Dataset <- readCKDData("Input",
                             "chronic_kidney_disease_formatted.csv", ";")

  # Stage 1: handle missing values and noisy/outlier data
  CKD_Dataset <- Compute_MissingValues(CKD_Dataset)
  CKD_Dataset <- HandleNoisy_Outlier_Data(CKD_Dataset)

  # Stage 2: correlation/regression analysis
  CKD_Dataset <- PerformRegressionAnalysis(CKD_Dataset)

  # Stage 3: build the models
  GenerateLogisticRegresion(CKD_Dataset, "LogisticRegressionOuput.txt")
  DecisionTree <- GenerateDecisionTree(CKD_Dataset)

  # Stage 4: predict chronic kidney disease using the decision tree
  PredictCKD(DecisionTree)
}

Prerequisites

We will use RStudio as the IDE for developing this model. Any other IDE that supports an R environment can also be used.

Install RStudio 

First Program 


Code Snippet:

setwd("[Folder location of dataset]")   # replace with the folder that contains the dataset
filename <- paste(c(getwd(), "/", "chronic_kidney_disease_formatted.csv"), collapse = "")
CKD_Dataset <- read.csv(filename, sep = ";", header = TRUE)
head(CKD_Dataset)

The snippet above demonstrates how to read a CSV dataset and print its first few records.

The setwd function sets the current working directory for the R environment. Once it is set, the value of the current working directory can be retrieved using the getwd function.

The paste function concatenates a vector of strings, joining the elements with the separator passed as the collapse argument.
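
For example, assuming a hypothetical working directory of /home/user/Input:

paste(c("/home/user/Input", "/", "chronic_kidney_disease_formatted.csv"), collapse = "")
# [1] "/home/user/Input/chronic_kidney_disease_formatted.csv"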

Now that we have had a first glimpse of R by reading the chronic kidney disease dataset, in the next post we will start preprocessing the loaded dataset.