Predictive analytics is the art and science of extracting useful information
from historical and current data in order to predict future trends. In this
course, students are introduced to the phases of the analytics lifecycle and
gain a basic understanding of a variety of tools and machine learning
algorithms for analyzing data and discovering forward-looking insights. Several
techniques will be introduced, including data pre-processing techniques, data
reduction algorithms, data clustering algorithms, data classification
algorithms, association rule mining algorithms, recommender systems, and more.
Applications from financial markets, bioinformatics, social network analytics,
and text mining will be covered. Highlights from industrial use cases will
demonstrate how predictive analytics relates to improving business performance
and supporting better decisions. This introductory course provides students
with the basic skills of the new generation of data scientists, allowing them
to structure, analyze, and derive useful insights from data that could help
make better decisions.
Recommended
Linda Miner, Mitchell Goldstein, Nephi Walton, Pat Bolding, Robert Nisbet, and Thomas Hill
Han, Jiawei, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Elsevier, 2011.
Anasse Bari, Ph.D.
Email: abari@nyu.edu
Web: http://cs.nyu.edu/~abari/
Twitter: https://twitter.com/BariAnasse
Office: 425 WWH
Phone: 8-3227
Office Hours: To be sent to you by email
Final grades for the course will be determined using the following
weights:
Assignments & Projects - 40%
Announced Short Quizzes - 20%
Weka
R
Java Machine Learning API
IBM Watson Analytics
Hadoop and Hadoop Ecosystem
Apache Spark
RapidMiner
Java ML Libraries
Python ML Libraries
The following scale will be followed when assigning the final grade:
A     95-100
A-    90-95
B+    87-90
B     84-87
B-    80-84
C+    76-80
C     72-76
D     65-72
F     <65
Class attendance and participation may be added to your overall final grade.
Chapter One: Introduction to Predictive Analytics and Related Disciplines
Defining Predictive Analytics
Defining Data Science and Big Data
Introducing Skills needed for Predictive Analytics and Data Science
Highlighting Use-cases around Predictive Analytics
Twitter Predicts the Stock Market
Predicting Breast Cancer Survivability Using Data Mining Techniques
Can Twitter Predict Earthquakes in Japan?
NYC Bikers Data as Knowledge Base for Real Estate Recommender System
The Value of Data Analytics in Mergers and Acquisitions
Mr. Invest: A Deep Learning based Predictor for Stocks based on News Articles
From (1) Success to (2) Failure on Building Predictive Models: Google Search Queries Predict Disease Outbreaks
Defining Statistics, Machine Learning, Data Mining, and Business Intelligence
Introducing (briefly) Hadoop and MapReduce (there will be a separate chapter, Chapter Nine, on both topics)
Chapter Two: Data Analytics Project Development Phases
Introducing the Lifecycle of a Data Analytics Project (Reading)
Explaining Predictive Analytics problems and their relationship to Data Clustering, Data Classification, Link Analysis and Recommender Systems
Introducing Supervised and Unsupervised Learning
Chapter Three: Data Preparation Algorithms
Additional subchapter: Tutorial on Model Evaluation and Predictive Model Accuracy Measures
Introducing Problems with Raw Data
Introducing the Process of Data Cleaning
Introducing three widely used tools for Predictive Analytics (PA): RapidMiner, KNIME, and R
Understanding Data Reduction Algorithms
Feature Reduction using Missing Values Ratio (MVR)
Feature Reduction using Feature Variance Threshold (FVT)
Feature Reduction using Feature Correlation Threshold (FCT)
Tutorial on Introduction to Linear Algebra
Principal Component Analysis (PCA)
Singular Value Decomposition (SVD)
Defining Data Fusion
Introducing Data Compression (Lossy Compression)
Highlighting the Importance of Hash Joins
Practice One: DataPreparationPractice-1 Handling Missing Values with RapidMiner
Practice Two: DataPreparationPractice-2 Reducing Data Dimensionality with KNIME
Practice Three: DataPreparationPractice-3 PCA with RapidMiner
Practice Four: DataPreparationPractice-4 SVD with RapidMiner
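As an illustration of two of the feature reduction filters listed in Chapter Three (Missing Values Ratio and Feature Variance Threshold), here is a minimal sketch in plain Python. The function names, the toy table, and both thresholds are assumptions made for this example only; they are not taken from the course material or from RapidMiner or KNIME.

```python
from statistics import pvariance


def missing_values_ratio(column):
    """Fraction of entries in a column that are missing (None)."""
    return sum(v is None for v in column) / len(column)


def reduce_features(table, max_missing=0.4, min_variance=0.01):
    """Keep columns with few missing values and non-trivial variance.

    `table` maps a feature name to a list of numeric values (None = missing).
    Both thresholds are illustrative, not recommended defaults.
    """
    kept = {}
    for name, column in table.items():
        if missing_values_ratio(column) > max_missing:
            continue  # MVR filter: too many missing values
        observed = [v for v in column if v is not None]
        if len(observed) < 2 or pvariance(observed) < min_variance:
            continue  # FVT filter: feature is (nearly) constant
        kept[name] = column
    return kept


# Hypothetical toy data: one useful feature, one constant, one mostly missing.
data = {
    "age": [25, 32, 47, 51, 38],
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],
    "sparse": [None, None, None, 3.2, None],
}
print(sorted(reduce_features(data)))  # ['age']
```

Variance depends on the scale of each feature, so in practice a variance threshold is usually applied after the features have been normalized to a common range.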
Chapter Four: Data Similarity Measures
Assigned Readings for Chapter Four (these will be discussed in class):
Understanding the Notion of Similarity among Data Records
Learning the Mathematical Properties of a Similarity Measure and a Distance Measure
Learning Major Distances and Similarity Measures
Euclidean Distance
Manhattan Distance
Minkowski Distance
Mahalanobis Distance
Cosine Similarity
Jaccard Similarity
Simple Matching Similarity
Pearson Correlation
Hands-on: Introduction to Weka, Data Mining Software
Practice Six: DataPreparationPractice-6 Feature Filtering and Discretization using Weka
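Several of the distance and similarity measures listed in Chapter Four can be written in a few lines. The sketch below uses only the Python standard library; the function names and the example vectors are chosen for illustration, and the Mahalanobis distance is left out because it needs a covariance matrix.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def jaccard_similarity(s, t):
    """Size of the intersection divided by the size of the union of two sets."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)


def simple_matching(x, y):
    """Fraction of positions at which two equal-length records agree."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


print(minkowski([0, 0], [3, 4], p=1))                         # 7.0 (Manhattan)
print(minkowski([0, 0], [3, 4], p=2))                         # 5.0 (Euclidean)
print(jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"}))   # 0.5
print(simple_matching([1, 0, 1, 1], [1, 1, 1, 0]))            # 0.5
```

Minkowski distance is shown as the general form because the Manhattan and Euclidean distances are its special cases for p = 1 and p = 2.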
Chapter Five: Introduction to Text Mining
Learning the Fundamentals of Text Mining and Text Categorization
Acquiring Understanding of Data Cleaning in Textual Data (e.g. stemming, stop words, parsing)
Learning Document Vector Representation, Term Frequency Measures and Document Nearest Neighbors
Assigned Reading: Chapter Three from Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2014.
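The Chapter Five topics of stop-word removal, term-frequency document vectors, and document nearest neighbors can be sketched as follows. This is a simplified illustration: the stop-word list, the whitespace tokenizer, and the example documents are all made up for this example, and no stemming is performed.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "of", "and", "is", "in"}


def term_frequencies(text):
    """Represent a document as a bag of words with raw term counts."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_document(query, documents):
    """Return the document whose term vector is most similar to the query."""
    q = term_frequencies(query)
    return max(documents, key=lambda d: cosine(q, term_frequencies(d)))


docs = [
    "the stock market fell sharply",
    "earthquake hits the coast of japan",
    "clustering of gene expression data",
]
print(nearest_document("japan earthquake news", docs))
# earthquake hits the coast of japan
```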
Chapter Six: Feature Selection Algorithms
Understanding the Process of Selecting and Extracting Features
Learning how to Measure the Predictive Power of Features in your Dataset
Learning Greedy Algorithms for Selecting a Near-optimal Set of Features
Learning Entropy-based Ranking Measures for Ranking and Selecting Features
Gaining Hands-on Practice on Selecting Features from the Hepatitis Dataset
Assigned Readings:
Hands-on
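One common entropy-based ranking measure is information gain: the reduction in class-label entropy obtained by splitting the records on a feature. The sketch below is one possible way to compute it for categorical features. The feature names and the four toy records are hypothetical and are not drawn from the Hepatitis dataset.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after the split."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy records: feature_a separates the classes, feature_b does not.
labels = ["live", "live", "die", "die"]
features = {
    "feature_a": ["yes", "yes", "no", "no"],
    "feature_b": ["x", "y", "x", "y"],
}
ranking = sorted(features,
                 key=lambda f: information_gain(features[f], labels),
                 reverse=True)
print(ranking)  # ['feature_a', 'feature_b']
```

A greedy selection procedure could repeatedly pick the top-ranked remaining feature; the ranking alone ignores redundancy between features.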
Chapter Seven: Data Clustering Algorithms
Defining Data Clustering
Understanding Data Clustering and its relationship with Predictive/Data Analytics
Highlighting Data Clustering Algorithms Requirements
Introducing Data Clustering Algorithms
Partitioning Algorithms
K-means
K-modes -- Reading on K-modes
Assigned Reading: Huang, Joshua Zhexue. "Clustering Categorical Data with k-Modes." (2009): 246-250.
Hierarchical Algorithms
Density-based Algorithms
DBSCAN
Grid-based Algorithms
Biologically Inspired Algorithms
Birds Flocking Algorithms for Data Clustering
Flock by Leader Machine Learning Algorithm (by Anasse Bari et al.)
Large Scale Clustering Algorithms
BFR Clustering Algorithm (Bradley-Fayyad-Reina)
CURE Algorithm
Assigned Reading: Chapter 7 from Mining of Massive Datasets (Ullman et al.)
Practice Six: DataPreparationPractice-6 Feature Filtering, Data Clustering and Discretization using Weka
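Of the partitioning algorithms listed in Chapter Seven, K-means is the simplest to sketch: assign each point to its nearest centroid, recompute each centroid as the mean of its cluster, and repeat. The version below is a bare-bones illustration under simplifying assumptions: the initial centroids are passed in by hand, the number of iterations is fixed, and the sample points are invented.

```python
def kmeans(points, centroids, iterations=10):
    """Plain K-means on tuples of numbers, starting from given centroids."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(clusters[0])  # [(1, 1), (1, 2), (2, 1)]
print(clusters[1])  # [(8, 8), (9, 8), (8, 9)]
```

Results depend on the starting centroids, which is why practical implementations typically run several random initializations and stop when assignments no longer change.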
Chapter Eight: Data Classification Algorithms
Learning the essence behind Supervised Learning
Understanding the relationship between data classification and data analytics
Learning how to make predictions using data classification algorithms
Understanding and applying data classification algorithms:
K-nearest neighbors (KNN) algorithm
Assigned Reading: Keller, James M., Michael R. Gray, and James A. Givens. A fuzzy k-nearest neighbor algorithm. IEEE Transactions on Systems, Man, and Cybernetics 4 (1985): 580-585.
Decision Trees: ID3 and C4.5 algorithms
Assigned Reading: Quinlan, J. Ross. Induction of decision trees. Machine Learning 1.1 (1986): 81-106.
(In-class practice. Handout distributed in class)
Support Vector Machines (SVM) and the Lagrangian of the Original Problem
Assigned Reading: Drucker, Harris, Donghui Wu, and Vladimir N. Vapnik. Support vector machines for spam categorization. IEEE Transactions on Neural Networks 10.5 (1999): 1048-1054.
Markov Models (Handout discussed and distributed in class)
Linear Regression
Logistic Regression
Naive Bayes
Neural Networks
Introduction to Deep Learning
Assigned Reading: LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature 521.7553 (2015): 436-444.
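The K-nearest neighbors algorithm from Chapter Eight makes a prediction by majority vote among the k training records closest to the query. The following is a minimal sketch of the basic (non-fuzzy) version using Euclidean distance. The function name, the labeled points, and the choice k = 3 are assumptions for illustration, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `training` is a list of (point, label) pairs; distance is Euclidean.
    """
    neighbors = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical labeled points forming two well-separated groups.
training = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 2), "A"),
    ((8, 8), "B"), ((9, 8), "B"),
]
print(knn_predict(training, (2, 1)))  # A
print(knn_predict(training, (8, 9)))  # B
```

An odd k is a common choice for two-class problems because it avoids tied votes.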
Chapter Nine: Software Frameworks for Large Scale Predictive Analytics Applications
Required Readings:
Chapter Two from the Ullman book (Mining of Massive Datasets): Map-Reduce and the New Software Stack
Explaining the importance of fault-tolerance and data availability in data analytics applications
Introducing the MapReduce Programming Paradigm
Introducing the Hadoop Distributed File System (HDFS)
Highlighting where the Hadoop Ecosystem can be a good solution to a business problem
Elaborating on when Hadoop can be used as part of a full stack predictive analytics solution
Hadoop as data storage
Hadoop as a simple database (knowledge base for predictive analytics)
Hadoop as a processing engine
Hadoop for data analytics (Mahout and other parts of the ecosystem)
Mahout library for predictive analytics
Highlighting use cases at Yahoo!, Facebook, and international banks
Predictive Maintenance
Additional Reading: Fulp, Errin W., Glenn A. Fink, and Jereme N. Haack. "Predicting Computer System Failures Using Support Vector Machines." WASL 8 (2008): 5-5.
Recommendation Systems
Enterprise search engine with contextual search and semantic search
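The MapReduce programming paradigm introduced in Chapter Nine can be illustrated with the classic word-count example. The sketch below simulates the map, shuffle, and reduce phases in a single Python process. It does not use Hadoop or any of its APIs, and the function names and sample documents are assumptions for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(documents):
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


print(word_count(["big data big insights", "data mining"]))
# {'big': 2, 'data': 2, 'insights': 1, 'mining': 1}
```

In a real cluster the map and reduce calls run in parallel on different machines, and the framework handles the shuffle, data placement, and recovery from failed tasks.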
Chapter Ten: Mining Hidden Associations in Large Datasets
Understanding a popular and well-researched method for discovering interesting relationships between variables in large databases
Mining Association Rules is the process of identifying strong rules discovered in databases using different measures of interest.
Examples of Associations that could be discovered:
An increase in sales and a reduction in costs.
Which products were often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
Which types of genes are sensitive to this new drug?
Learning machine learning algorithms to identify strong rules discovered in databases using different measures of interest
Frequent Itemset Generation (Mining)
Apriori Algorithm
FP-Growth Algorithm
Practice Seven: Mining Association Rules in R (Distributed in class)
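Frequent itemset generation with the Apriori algorithm, listed in Chapter Ten, can be sketched in Python as follows. The idea is to build candidate itemsets level by level and to discard any candidate that has an infrequent subset. The shopping baskets and the 0.5 minimum support are invented for illustration, and support and confidence are the only measures of interest shown.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        level = {c: support(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into candidates of size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical shopping baskets.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "chips"},
    {"diapers", "milk"},
]
frequent = apriori(baskets, min_support=0.5)
pair = frozenset({"beer", "diapers"})
# Confidence of the rule beer -> diapers = support(pair) / support(beer).
confidence = frequent[pair] / frequent[frozenset({"beer"})]
print(frequent[pair], round(confidence, 2))  # 0.5 0.67
```

The prune step is what distinguishes Apriori from brute-force enumeration: here the three-item candidate is dropped without counting because {diapers, chips} is not frequent.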