Summer 2021 - Predictive Analytics (UG)

Course Description

Predictive analytics is the art and science of extracting useful information from historical and current data in order to predict future trends. In this course, students are introduced to the phases of the analytics lifecycle and gain a basic understanding of a variety of tools and machine learning algorithms for analyzing data and discovering forward-looking insights. Techniques covered include data pre-processing, data reduction algorithms, data clustering algorithms, data classification algorithms, association rule mining algorithms, recommender systems, and more.

Applications from financial markets, bioinformatics, social network analytics, and text mining will be covered. Highlights from industrial use cases will demonstrate how predictive analytics relates to improving business performance and supporting better decisions. This is an introductory course that provides students with the basic skills of the new generation of data scientists: structuring and analyzing data and deriving useful insights that could help make better decisions.

Books and Resources

Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. Mining of massive datasets. Cambridge University Press, 2014.

Bari, Anasse, Mohamed Chaouchi, and Tommy Jung. Predictive analytics for dummies. John Wiley & Sons, 2016.

Siegel, Eric. Predictive analytics: The power to predict who will click, buy, lie, or die. John Wiley & Sons, 2016.

Miner, Gary, Joseph M. Hilbe, Linda Miner, Mitchell Goldstein, Nephi Walton, Pat Bolding, Robert Nisbet, and Thomas Hill. Practical Predictive Analytics and Decisioning Systems for Medicine: Informatics ...

Ryza, Sandy, et al. Advanced Analytics with Spark: Patterns for Learning from Data at Scale. O'Reilly Media, Inc., 2015.

Han, Jiawei, Micheline Kamber, and Jian Pei. Data mining: concepts and techniques. Elsevier, 2011.

Professor Bari

Anasse Bari, Ph.D.

Email: abari@nyu.edu

Web: http://cs.nyu.edu/~abari/

Twitter: https://twitter.com/BariAnasse

Office: 425 WWH

Phone: 8-3227

Office Hours: To be sent to you by email

Grading

Final grades for the course will be determined using the following weights:

Assignments & Projects - 40%

Announced Short Quizzes - 20%

Exams - 40%

Tools and Analytics Frameworks

Weka

Java Machine Learning API

IBM Watson Analytics

Hadoop and Hadoop Ecosystem

Apache Spark

RapidMiner

Java ML Libraries

Python ML Libraries

Grading Scale

The following scale will be followed when assigning the final grade:

A 95-100

A- 90-95

B+ 87-90

B 84-87

B- 80-84

C+ 76-80

C 72-76

D 65-72

F <65

Credit for class attendance and participation may be added to your overall final grade.

Topics and Agenda

Chapter One: Introduction to Predictive Analytics and Related Disciplines

Defining Predictive Analytics

Defining Data Science and Big Data

Introducing Skills needed for Predictive Analytics and Data Science

Highlighting Use-cases around Predictive Analytics

Twitter Predicts the Stock Market

Predicting Breast Cancer Survivability Using Data Mining Techniques

Can Twitter Predict Earthquakes in Japan?

NYC Bikers Data as Knowledge Base for Real Estate Recommender System

The Value of Data Analytics in Mergers and Acquisitions

Mr. Invest: A Deep Learning based Predictor for Stocks based on News Articles

From (1) Success to (2) Failure on Building Predictive Models: Google Search Queries Predict Disease Outbreaks

Target Retail Store Determines and Predicts which of its Customers are or will be Pregnant using Customer Data

Defining Statistics, Machine Learning, Data Mining, and Business Intelligence

Introducing (briefly) Hadoop and MapReduce (both topics are covered in a separate chapter, Chapter Nine)

Chapter Two: Data Analytics Project Development Phases

Introducing the Lifecycle of a Data Analytics Project (Reading)

Explaining Predictive Analytics problems and their relationship to Data Clustering, Data Classification, Link Analysis and Recommender Systems

Introducing Supervised and Unsupervised Learning

Chapter Three: Data Preparation Algorithms

Additional subchapter: Tutorial on Model Evaluation and Predictive Model Accuracy Measures

Introducing Problems with Raw Data

Introducing the Process of Data Cleaning

Introducing three widely used tools for predictive analytics: RapidMiner, KNIME, and R

Understanding Data Reduction Algorithms

Feature Reduction using Missing Values Ratio (MVR)

Feature Reduction using Feature Variance Threshold (FVT)

Feature Reduction using Feature Correlation Threshold (FCT)
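The three threshold-based filters above can be illustrated with a small sketch. It is only one possible reading of MVR, FVT, and FCT: the toy dataset, the function names, and the threshold values (0.5, 0.01, 0.95) are assumptions chosen for illustration, not values prescribed by the course.

```python
from math import sqrt
from statistics import mean, pvariance

# Toy dataset (illustrative): feature name -> column values, None = missing.
data = {
    "age": [25, 32, 47, 51, 38, 29],
    "income": [30, 41, 62, 70, 50, 35],        # nearly collinear with "age"
    "constant": [1, 1, 1, 1, 1, 1],            # zero variance
    "sparse": [None, None, 3, None, None, 1],  # mostly missing
}

def missing_ratio(values):
    """Fraction of entries that are missing."""
    return sum(v is None for v in values) / len(values)

def variance(values):
    """Population variance over the observed (non-missing) entries."""
    return pvariance([v for v in values if v is not None])

def pearson(xs, ys):
    """Pearson correlation over rows where both values are observed."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mx = mean([x for x, _ in pairs])
    my = mean([y for _, y in pairs])
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = sqrt(sum((x - mx) ** 2 for x, _ in pairs))
    sy = sqrt(sum((y - my) ** 2 for _, y in pairs))
    return cov / (sx * sy)

def reduce_features(data, max_missing=0.5, min_variance=0.01, max_corr=0.95):
    """Apply the three filters in sequence; thresholds are hypothetical."""
    # Missing Values Ratio: drop features with too many missing entries.
    names = [n for n in data if missing_ratio(data[n]) <= max_missing]
    # Feature Variance Threshold: drop (near-)constant features.
    names = [n for n in names if variance(data[n]) >= min_variance]
    # Feature Correlation Threshold: keep a feature only if it is not
    # highly correlated with any feature already kept.
    kept = []
    for n in names:
        if all(abs(pearson(data[n], data[k])) <= max_corr for k in kept):
            kept.append(n)
    return kept

print(reduce_features(data))
```

On this toy data only "age" survives: "sparse" fails the missing-values filter, "constant" fails the variance filter, and "income" is dropped as nearly collinear with "age". Variance depends on the scale of each feature, so in practice features are often normalized before a variance threshold is applied.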

Tutorial on Introduction to Linear Algebra

Principal Component Analysis (PCA)

Singular Value Decomposition (SVD)

Defining Data Fusion

Introducing Data Compression (Lossy Compression)

Highlighting the Importance of Hash Joins

Practice One: DataPreparationPractice-1 Handling Missing Values with RapidMiner

Practice Two: DataPreparationPractice-2 Reducing Data Dimensionality with KNIME

Practice Three: DataPreparationPractice-3 PCA with RapidMiner

Practice Four: DataPreparationPractice-4 SVD with RapidMiner

Chapter Four: Data Similarity Measures

Assigned Readings for Chapter Four (these will be discussed in class):

1. Boriah, Shyam, Varun Chandola, and Vipin Kumar. "Similarity measures for categorical data: A comparative evaluation." Proceedings of the 2008 SIAM International Conference on Data Mining, 2008.

2. Shi, Jian-Yu, et al. "Predicting drug target interaction for new drugs using enhanced similarity measures and super-target clustering." Methods 83 (2015): 98-104.

3. Ye, Jun. "Improved cosine similarity measures of simplified neutrosophic sets for medical diagnoses." Artificial intelligence in medicine 63.3 (2015): 171-179.

Understanding the Notion of Similarity among Data Records

Learning the Mathematical Properties of a Similarity Measure and a Distance Measure

Learning Major Distances and Similarity Measures

Euclidean Distance

Manhattan Distance

Minkowski Distance

Mahalanobis Distance

Cosine Similarity

Jaccard Similarity

Simple Matching Similarity

Pearson Correlation
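As a rough illustration of the measures listed above, the sketch below implements most of them in plain Python. The function names are illustrative rather than taken from any course material, and Mahalanobis distance is omitted because it needs the inverse covariance matrix of the dataset, which is easier to compute with a linear algebra library.

```python
from math import sqrt

def minkowski(x, y, p):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def jaccard_similarity(s, t):
    """Size of the intersection over size of the union of two sets."""
    return len(s & t) / len(s | t)

def simple_matching(x, y):
    """Fraction of positions at which two equal-length records agree."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def pearson_correlation(x, y):
    """Cosine similarity of the mean-centered vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine_similarity([a - mx for a in x], [b - my for b in y])

print(euclidean([0, 0], [3, 4]))                           # 5.0
print(manhattan([0, 0], [3, 4]))                           # 7
print(jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
print(simple_matching([1, 0, 1, 1], [1, 1, 1, 0]))         # 0.5
```

Jaccard similarity ignores attributes that are absent from both records, while simple matching counts agreement on absence as a match, which is why the two can differ on sparse binary data.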

Hands-on: Introduction to Weka, Data Mining Software

Practice Six: DataPreparationPractice-6 Feature Filtering and Discretization using Weka

Chapter Five: Introduction to Text Mining

Learning the Fundamentals of Text Mining and Text Categorization

Acquiring Understanding of Data Cleaning in Textual Data (e.g. stemming, stop words, parsing)

Learning Document Vector Representation, Term Frequency Measures and Document Nearest Neighbors
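A minimal sketch of the ideas above (stop-word removal, term-frequency document vectors, and nearest-neighbor lookup by cosine similarity) might look as follows. The stop-word list, the sample documents, and all function names are assumptions made for illustration; stemming is left out to keep the example short.

```python
import math
import re
from collections import Counter

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "as", "at", "in", "of", "and"}

def tokenize(text):
    """Lowercase, split into alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def tf_vector(text):
    """Sparse term-frequency vector: term -> count / document length."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return {term: c / len(tokens) for term, c in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def nearest_document(query, documents):
    """Index of the document whose TF vector is closest to the query."""
    q = tf_vector(query)
    scores = [cosine(q, tf_vector(d)) for d in documents]
    return max(range(len(documents)), key=scores.__getitem__)

documents = [
    "The stock market rallied as investors bought shares",
    "Investors sold shares as the stock market fell",
    "The earthquake shook the city at night",
]
print(nearest_document("earthquake in the city", documents))  # 2
```

Raw term frequency treats every term as equally informative; weighting schemes such as TF-IDF down-weight terms that appear in most documents.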

Assigned Reading: Chapter Three from Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. Mining of massive datasets. Cambridge University Press, 2014.

Chapter Six: Feature Selection Algorithms

Understanding the Process of Selecting and Extracting Features

Learning how to Measure the Predictive Power of Features in your Dataset

Learning Greedy Algorithms for Selecting a Near-Optimal Set of Features

Learning Entropy-Based Measures for Ranking and Selecting Features

Gaining Hands-on Practice in Selecting Features from the Hepatitis Dataset

Assigned Readings:

Tang, J., Alelyani, S., & Liu, H. (2014). Feature selection for classification: A review. Data Classification: Algorithms and Applications, 37.

Guyon, Isabelle and Andre Elisseeff. "An introduction to variable and feature selection." Journal of machine learning research 3.Mar (2003): 1157-1182.

Hands-on

Practice Five: DataPreparationPractice-5 PCA, Data Cleaning and Feature Selection with R Project

R Source Code

Chapter Seven: Data Clustering Algorithms

Defining Data Clustering

Understanding Data Clustering and its relationship with Predictive/Data Analytics

Highlighting Data Clustering Algorithms Requirements

Introducing Data Clustering Algorithms

Partitioning Algorithms

K-means

K-modes -- Reading on K-modes

Assigned Reading: Huang, Joshua Zhexue. "Clustering Categorical Data with k-Modes." (2009): 246-250.

Hierarchical Algorithms

Density-based Algorithms

DBSCAN

Grid-based Algorithms

Biologically Inspired Algorithms

Birds Flocking Algorithms for Data Clustering

Flock by Leader Machine Learning Algorithm (by Anasse Bari et al.)

Large Scale Clustering Algorithms

BFR Clustering Algorithm (Bradley-Fayyad-Reina)

CURE Algorithm

Assigned Reading: Chapter 7 from Mining of Massive Datasets (Ullman et al.)

Practice Six: DataPreparationPractice-6 Feature Filtering, Data Clustering and Discretization using Weka

Chapter Eight: Data Classification Algorithms

Learning the essence behind Supervised Learning

Understanding the relationship between data classification and data analytics

Learning how to make predictions using data classification algorithms

Understanding and applying data classification algorithms:

K-nearest neighbors (KNN) algorithm

Assigned Reading: Keller, James M., Michael R. Gray, and James A. Givens. A fuzzy k-nearest neighbor algorithm. IEEE Transactions on Systems, Man, and Cybernetics 4 (1985): 580-585.

Decision Trees & ID3 and C4.5 algorithms

Assigned Reading: Quinlan, J. Ross. Induction of decision trees. Machine learning 1.1 (1986): 81-106.

(In class practice. Handout distributed in class)

Support Vector Machines (SVM) and the Lagrangian of the Original Problem

Assigned Reading: Drucker, Harris, Donghui Wu, and Vladimir N. Vapnik. Support vector machines for spam categorization. IEEE Transactions on Neural Networks 10.5 (1999): 1048-1054.

Markov Models (Handout discussed and distributed in class)

Linear Regression

Logistic Regression

Naive Bayes

Neural Networks

Introduction to Deep Learning

Assigned Reading: LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature 521.7553 (2015): 436-444.

Chapter Nine: Software Frameworks for Large Scale Predictive Analytics Applications

Required Readings:

Chapter Two from the Ullman book (Mining of Massive Datasets): Map-Reduce and the New Software Stack

Dean, Jeffrey, and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM 51.1 (2008): 107-113.

Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. The Google file system. ACM SIGOPS operating systems review. Vol. 37. No. 5. ACM, 2003.

Zhao, Weizhong, Huifang Ma, and Qing He. Parallel k-means clustering based on mapreduce. IEEE International Conference on Cloud Computing. Springer Berlin Heidelberg, 2009.

Additional Reading: Enterprise CIO Guide: How to Use Hadoop with Your SAP Software Landscape (SAP Solutions)

Explaining the importance of fault tolerance and data availability in data analytics applications

Introducing the MapReduce Programming Paradigm

Introducing the Hadoop Distributed File System (HDFS)

Highlighting where the Hadoop Ecosystem can be a good solution to a business problem

Elaborating on when Hadoop can be used as part of a full-stack predictive analytics solution

Hadoop as data storage

Hadoop as a simple database (knowledge base for predictive analytics)

Hadoop as a processing engine

Hadoop for data analytics (Mahout and other parts of the ecosystem)

Mahout library for predictive analytics

Highlighting use cases at Yahoo!, Facebook, and international banks

Predictive Maintenance

Additional Reading: Fulp, Errin W., Glenn A. Fink, and Jereme N. Haack. "Predicting Computer System Failures Using Support Vector Machines." WASL 8 (2008): 5-5.

Recommendation Systems

Enterprise search engine with contextual search and semantic search

Chapter Ten: Mining Hidden Associations in Large Datasets

Understanding a popular and well-researched method for discovering interesting relationships between variables in large databases

Association rule mining is the process of identifying strong rules in databases using different measures of interest.

Examples of Associations that could be discovered:

An increase in sales and a reduction in costs.

Which products are often purchased together (for example, beer and diapers)?

What are the subsequent purchases after buying a PC?

Which types of genes are sensitive to this new drug?

Learning machine learning algorithms that identify strong rules in databases using different measures of interest

Frequent Itemset Generation (Mining)

Apriori Algorithm

FP-Growth Algorithm
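To make frequent itemset generation concrete, here is a minimal sketch of the level-wise Apriori idea, with support and confidence as the measures of interest. The transactions, the function names, and the 0.5 support threshold are assumptions for illustration; the sketch favors readability over the efficiency tricks of production implementations, and FP-Growth is not shown.

```python
from itertools import combinations

# Illustrative market-basket transactions.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"beer", "diapers", "milk"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions)

def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) for the rule A -> C."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)

def apriori(transactions, min_support):
    """Return {frequent itemset: support}, built level by level."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= min_support]
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s, transactions)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        level = [c for c in candidates
                 if support(c, transactions) >= min_support]
    return frequent

frequent = apriori(transactions, min_support=0.5)
for itemset, sup in frequent.items():
    print(sorted(itemset), sup)
print(confidence(frozenset({"beer"}), frozenset({"diapers"}), transactions))  # 0.75
```

With these transactions, the frequent itemsets at a 0.5 support threshold are {beer}, {diapers}, and {beer, diapers}, and the rule beer -> diapers holds in 3 of the 4 transactions that contain beer.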

Practice Seven: Mining Association Rules in R (Distributed in class)