Module 1: What is Predictive Analytics & R Basics?
• Identify the problem and assess whether it should be addressed with predictive modeling. Understand the differences and similarities between predictive modeling and traditional analysis techniques.
• Learn predictive modeling tools: the layout and basic commands of R.
• Learn predictive modeling tools: practice writing basic R scripts and complete additional suggested practice, if necessary.
Module 2: Effective Problem Definition and Project Management
• Translate a vague question into one that can be analyzed with data, statistics and machine learning to solve a business problem.
• Design use cases, then evaluate and prioritize them based on available data and technology, the significance of the business impact and implementation considerations.
• Select and implement appropriate technology to use statistical and machine learning techniques efficiently, taking into account problem objectives and implementation constraints.
• List the key principles of creating and managing a predictive modeling team and understand their importance.
Module 3: Data Design, Transformation & Visualization
• Identify common data types: structured, unstructured and semi-structured.
• Learn variable types and applicable terminology.
• Identify appropriate data sources for a problem and evaluate their quality, including common data problems.
• Identify the types of regulatory, professional standard and ethical issues surrounding predictive modeling and data collection/use, and where they apply.
• Introduce the lapse, mortality and health datasets used for exercises.
• Implement effective data design: time frame, sampling, granularity.
• Use common data blending techniques, e.g. fuzzy matching.
• Learn how, why and when to transform the data using scaling, normalization, standardization, binarization, encoding and imputation.
• Apply each technique using an example model.
• Create and interpret histograms, bar charts and frequency plots.
• Visualize data using one-way, two-way and box plots to identify potential errors, outliers and trends in the data.
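As a rough illustration of the transformation objectives above, the following base R sketch applies imputation, min-max scaling, standardization and binarization to a small made-up vector. The values, the variable names and the threshold of 50 are hypothetical and are not taken from the course datasets.

```r
# Hypothetical ages with one missing value (illustrative data only)
age <- c(25, 40, NA, 55, 70)

# Imputation: replace the missing value with the mean of the observed values
age_imputed <- ifelse(is.na(age), mean(age, na.rm = TRUE), age)

# Min-max scaling: map values onto the interval [0, 1]
age_range  <- max(age_imputed) - min(age_imputed)
age_scaled <- (age_imputed - min(age_imputed)) / age_range

# Standardization: subtract the mean and divide by the sample standard deviation
age_standardized <- (age_imputed - mean(age_imputed)) / sd(age_imputed)

# Binarization: flag ages at or above an arbitrary threshold of 50
age_over_50 <- as.integer(age_imputed >= 50)
```

Mean imputation is only one of several options; which transformation is appropriate depends on the model and the data.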
Module 4: Data Exploration
• Identify data issues by exploring one variable to confirm that its distribution is as expected and to detect any outliers.
• Determine the significant relationships between two variables using scatter plots, calculating correlations and investigating conditional means.
• Determine relationships between many variables and select material ones using principal component analysis.
• Determine relationships between many variables and select material ones using independent component analysis.
• Determine relationships between many variables and select material ones using singular value decomposition.
• Take appropriate action when results of data exploration deviate from what is expected and apply judgment to resolve those differences.
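The two-variable and many-variable objectives above might look something like this in base R. This is a minimal sketch that uses the built-in mtcars dataset purely as a stand-in for course data; the choice of columns is an assumption made for illustration.

```r
# Built-in mtcars dataset, used here only as a stand-in for course data
data(mtcars)

# Relationship between two variables: Pearson and Spearman correlation
pearson  <- cor(mtcars$mpg, mtcars$wt, method = "pearson")
spearman <- cor(mtcars$mpg, mtcars$wt, method = "spearman")

# Conditional means: average mpg within each cylinder group
cond_means <- tapply(mtcars$mpg, mtcars$cyl, mean)

# Principal component analysis on centered, scaled variables
pca <- prcomp(mtcars[, c("mpg", "wt", "hp", "disp")], center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
```

Components that explain only a small share of the variance are candidates for dropping, although the cut-off is a judgment call.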
Module 5: Feature Generation & Selection
• Define the term "feature" and understand how it differs from a "variable".
• Use subject matter expertise and prior knowledge about the data to create features that lead to more effective models.
• List the principles, advantages, disadvantages and limitations of using filter-based selection techniques for tuning a data set to be used in modelling.
• Select appropriate features for a model using Pearson, Kendall and Spearman correlation as selection criteria.
• Select appropriate features for a model using mutual information as a selection criterion.
• Select appropriate features for a model using chi-squared as a selection criterion.
• List the principles, advantages, disadvantages and limitations of using permutation-based selection techniques for tuning a data set to be used in modelling.
PA Certificate Program: Learning Objectives 2
• Apply concepts such as accuracy, precision and recall to select features to be used in classification modelling problems.
• Apply concepts such as MSE, RSE and coefficient of determination to select features to be used in regression modelling problems.
• List the principles, advantages, disadvantages and limitations of using algorithm-based selection techniques for tuning a data set to be used in modelling.
• Use Ridge, Lasso, Elastic Net and tree-based methods to select appropriate features to be used for modelling (detailed lessons in section 5).
• Text mining: apply various text mining methods to generate appropriate features for use when modelling text data.
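The classification and regression measures named above can be computed by hand, as in the following base R sketch. The labels and predictions are made-up values chosen only to keep the arithmetic easy to follow.

```r
# Hypothetical actual and predicted class labels (1 = positive, 0 = negative)
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0)

tp <- sum(predicted == 1 & actual == 1)  # true positives
fp <- sum(predicted == 1 & actual == 0)  # false positives
fn <- sum(predicted == 0 & actual == 1)  # false negatives
tn <- sum(predicted == 0 & actual == 0)  # true negatives

accuracy  <- (tp + tn) / length(actual)  # share of all predictions that are correct
precision <- tp / (tp + fp)              # share of predicted positives that are correct
recall    <- tp / (tp + fn)              # share of actual positives that are found

# Hypothetical observed and fitted values for a regression problem
y     <- c(3.0, 5.0, 7.0, 9.0)
y_hat <- c(2.5, 5.5, 7.0, 8.0)

mse       <- mean((y - y_hat)^2)                                # mean squared error
r_squared <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)      # coefficient of determination
```

Comparing these measures for models fitted with and without a candidate feature is one way to judge whether the feature earns its place.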
Module 6: Model Development & Validation
• Understand how different business problems affect the decisions made about model development and validation.
• Understand the difference between supervised, unsupervised and reinforcement learning and identify examples of problems each would be applied to.
• Understand the difference between classification and regression problems and explain the features of models that make them suitable or unsuitable for each type.
• Understand and explain the concepts of bias, variance and model complexity, the bias-variance tradeoff, and the implications these hold for building robust models.
• Understand the importance of using train, test and holdout data samples during modeling and be able to apply this method appropriately to fit and validate a model. List advantages, disadvantages and common pitfalls when using this method.
• Understand and apply the method of using cross validation during modeling. List advantages, disadvantages and common pitfalls when using this method.
• Supervised learning: for each of the following techniques, understand when it is appropriate (including advantages, disadvantages and limitations), describe the data needed, apply the method to data, and interpret and describe the results.
  o Decision trees
  o Generalized linear models (identity, Poisson, gamma, Tweedie, binomial and shrinkage methods)
  o Ensemble methods (bagging, boosting and blending, specifically gradient boosting machines)
• Unsupervised learning: for each of the following techniques, understand when it is appropriate (including advantages, disadvantages and limitations), describe the data needed, apply the method to data, and interpret and describe the results.
  o K-means clustering
  o Hierarchical clustering
• Advanced topics: for each of the following techniques, understand when it is appropriate (including advantages, disadvantages and limitations) and describe the data needed.
  o Instance-based learning
  o Support vector machines
  o Bayesian learning
  o Additive models
  o Topic modeling
  o Neural networks
  o Gaussian mixture models
  o Genetic equation search
  o Grid search
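For the two unsupervised techniques listed above, a minimal base R sketch could look like the following. It uses the built-in iris measurements as stand-in data, and the choice of three clusters, complete linkage and the random seed are all assumptions made for illustration.

```r
# Built-in iris measurements, used only as stand-in data; species labels are dropped
features <- scale(iris[, 1:4])

# K-means with k = 3 (an assumed choice); the seed makes the random starts reproducible
set.seed(42)
km <- kmeans(features, centers = 3, nstart = 10)

# Hierarchical clustering with complete linkage on Euclidean distances, cut into 3 groups
hc        <- hclust(dist(features), method = "complete")
hc_groups <- cutree(hc, k = 3)

# Cross-tabulate the two sets of cluster assignments to see how far they agree
agreement <- table(kmeans = km$cluster, hierarchical = hc_groups)
```

Cluster numbers are arbitrary labels, so agreement shows up as large counts concentrated in a few cells of the table rather than on its diagonal.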