Data Mining Techniques Class Notes and Outline
Course term: Spring 2012 – Northeastern University
Website: course official website.
Course professor: Professor Mirek Riedewald
Textbook: Data Mining: Concepts and Techniques, Third Edition (The Morgan Kaufmann Series in Data Management Systems)
Some of the outline items will have a separate detailed post.
Course outline
Introduction
Data preprocessing
Classification/Prediction
- Decision trees
- Overfitting
- Statistical learning theory
- Nearest neighbor
- Bayes’ theorem
- Naive Bayes
- Joint Distribution
- Bayesian networks
===========================================
– Naive Bayes
===========================================
– For discrete attributes
  P( Ci | X=x ) = P( X=x | Ci ) * P(Ci) / P(X=x)
– For record x, find the class Ci that has the maximum
posterior probability P( Ci | X=x )
– Conditional independence assumption: P( X=x | Ci ) factors into the product
of the per-attribute probabilities P( Xj=xj | Ci )
– Example
– For continuous attributes without discretization
– use a per-class Gaussian density in place of the count-based probability
– Zero probability problem — do smoothing (e.g. Laplace correction)
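A minimal sketch of the count-based Naive Bayes classifier with Laplace smoothing; the training records and attribute names are made up for illustration.

```python
from collections import Counter, defaultdict

# Toy Naive Bayes for discrete attributes; records are hypothetical.
train = [
    ({"outlook": "sunny", "windy": "no"},  "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "no"},  "play"),
    ({"outlook": "sunny", "windy": "no"},  "play"),
]

classes = Counter(label for _, label in train)
# counts[class][attribute][value] = number of matching training records
counts = defaultdict(lambda: defaultdict(Counter))
for record, label in train:
    for attr, value in record.items():
        counts[label][attr][value] += 1

def posterior_score(record, ci, alpha=1):
    """P(Ci) * prod_j P(Xj=xj | Ci), with Laplace correction alpha.
    P(X=x) is the same for every class, so it can be dropped."""
    score = classes[ci] / len(train)
    for attr, value in record.items():
        n_values = len({r[attr] for r, _ in train})  # distinct values of attr
        score *= (counts[ci][attr][value] + alpha) / (classes[ci] + alpha * n_values)
    return score

def classify(record):
    # pick the class with the maximum posterior score
    return max(classes, key=lambda ci: posterior_score(record, ci))

print(classify({"outlook": "sunny", "windy": "no"}))   # play
```

With alpha=1, an attribute value never seen for a class still gets a small nonzero probability, which is exactly the fix for the zero-probability problem.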
===========================================
– Bayes Theorem
===========================================
– using probabilities
  P( H | X=x ) = P( X=x | H ) * P(H) / P(X=x)
– Posterior = likelihood * prior / evidence
– Find the maximum a posteriori (MAP) hypothesis
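A worked numeric instance of the formula above; the prior and likelihoods are made-up numbers, and the evidence P(X=x) comes from the law of total probability.

```python
# Bayes' theorem with illustrative numbers:
# H = "record belongs to class C", X=x = an observed attribute value.
p_h = 0.3             # prior P(H)
p_x_given_h = 0.8     # likelihood P(X=x | H)
p_x_given_not_h = 0.2 # likelihood under the alternative

# evidence P(X=x) by total probability
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

# posterior = likelihood * prior / evidence
posterior = p_x_given_h * p_h / p_x
print(round(posterior, 4))  # 0.6316
```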
===========================================
– Nearest neighbor
===========================================
– Lazy learning — learn the model when given a test record
– Eager learning — construct the model prior to test (decision tree)
– 1-NN — use the single nearest neighbor
– k-NN — use the k nearest neighbors (1 <= k <= N)
– if k is too small:
— sensitive to noise points
— high variance, low bias
— overfitting
– if k is too large — neighborhood may include points from other classes
— low variance, high bias
— underfitting
– cost:
– brute force: O(|training instances|)
– pre-compute Voronoi: O(log|training instances|)
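A sketch of the brute-force variant, which scans all training instances per query (the O(|training instances|) cost above); the 2-D points and labels are made up.

```python
import math
from collections import Counter

# Toy training set: 2-D points with class labels (hypothetical data).
train = [((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((3.0, 4.0), "b"), ((5.0, 7.0), "b")]

def knn_classify(x, k=3):
    # brute force: sort all training records by Euclidean distance to x
    neighbors = sorted(train, key=lambda p: math.dist(x, p[0]))[:k]
    # majority vote among the k nearest neighbors
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify((1.2, 1.5), k=1))  # 1-NN: a
print(knn_classify((1.2, 1.5), k=3))  # a
```

This is lazy learning in the literal sense: no model is built, all work happens at query time.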
===========================================
– Statistical learning / decision theory
===========================================
– Random variables
– Squared error
– Absolute error
– Bias-Variance tradeoff
– Bias decreases and variance increases as tree becomes larger
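A small empirical check of the squared-error decomposition MSE = bias^2 + variance, using the sample mean of n Gaussian draws as the estimator; the distribution parameters are illustrative, not from the lecture.

```python
import random

# Estimate theta = 2.0 by the mean of n = 10 draws, repeated many times.
random.seed(0)
theta, n, trials = 2.0, 10, 20000
estimates = [sum(random.gauss(theta, 1.0) for _ in range(n)) / n
             for _ in range(trials)]

mean_est = sum(estimates) / trials
bias_sq = (mean_est - theta) ** 2                             # squared bias
variance = sum((e - mean_est) ** 2 for e in estimates) / trials
mse = sum((e - theta) ** 2 for e in estimates) / trials       # squared error

# In-sample, the decomposition holds as an algebraic identity.
print(abs(mse - (bias_sq + variance)) < 1e-9)  # True
```

The sample mean is unbiased (bias^2 is near zero here), so its error is almost all variance; a flexible model such as a large tree sits at the opposite end of the tradeoff.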
==================================
– Classification/Prediction
==================================
– Introduction
idea:
f(X1,..,Xn) -> Y
– Classification – the target attribute Y is discrete
– Prediction – the target attribute Y is continuous
==================================
– Decision tree
==================================
– Decision tree
– The model is a tree
– Greedy algorithm to construct the tree
– Conditions for stopping partitioning:
– Pure node (all records belong to same class)
– No remaining attributes for further partitioning
(majority voting will be done)
– No cases left
– Decision boundary – split on a single attribute
– Oblique decision trees – split on multiple attributes
– Binary split
– Multi-way split
– Attribute selection – how to determine the best split
– Information Gain
– Gain Ratio
– Gini Index
– Underfitting and overfitting
– Training set error
– Test set error
– Minimum Description Length (MDL)
– Tree cost analysis
– Finding optimal decision tree is NP-complete
– O( |attributes| * |training instances| * log(|training instances|))
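The three attribute-selection measures can be sketched directly from their definitions; the parent/child label lists below are a hypothetical split, chosen so the numbers come out clean.

```python
import math
from collections import Counter

def entropy(labels):
    # H = -sum p_i * log2(p_i) over the class proportions p_i
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini = 1 - sum p_i^2
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    # entropy of the parent minus the weighted entropy of the partitions
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

# Hypothetical split: 4 yes / 4 no, separated into two pure nodes.
parent = ["yes"] * 4 + ["no"] * 4
children = [["yes"] * 4, ["no"] * 4]
print(entropy(parent))                     # 1.0
print(gini(parent))                        # 0.5
print(information_gain(parent, children))  # 1.0 — a perfect split
```

Gain Ratio divides the information gain by the split information of the partition, penalizing splits with many small branches.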
==================================
– Data preprocessing
==================================
– Basic definitions
– Data records
– Attributes
– Discrete
– Continuous
– Descriptive data summarization
– Statistic review
– Mean (xbar)
– Median
– Mode
– Variance
– Standard deviation
– Histogram
– Scatter plot
– Data cleaning
– How to handle missing data
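One common cleaning strategy is to fill a missing continuous value with the attribute mean (or the mode, for discrete attributes); a minimal sketch with a made-up column where None marks a missing value.

```python
# Mean imputation for one continuous attribute (hypothetical values).
values = [4.0, None, 6.0, None, 5.0]

observed = [v for v in values if v is not None]
mean = sum(observed) / len(observed)           # mean of the observed values
imputed = [mean if v is None else v for v in values]
print(imputed)  # [4.0, 5.0, 6.0, 5.0, 5.0]
```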
– Correlations
– Covariance
– Pearson’s product-moment correlation coefficient (numerical data)
– Chi-square (categorical/discrete data)
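Pearson's coefficient is the covariance divided by the product of the standard deviations; a sketch on made-up samples that are exactly linearly related, so r should come out at 1.

```python
import math

# Pearson's product-moment correlation for two numerical attributes.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]  # y = 2x, a perfect positive linear relationship

n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n   # covariance
sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)          # std dev of x
sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)          # std dev of y
r = cov / (sx * sy)
print(round(r, 6))  # 1.0
```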
– Data transformation
– Normalization
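Two standard normalizations, sketched on a made-up column of values: min-max scaling to [0, 1] and z-score standardization.

```python
# Hypothetical attribute values to normalize.
values = [10.0, 20.0, 30.0, 40.0]

# Min-max scaling: v' = (v - min) / (max - min), maps values into [0, 1].
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Z-score standardization: v' = (v - mean) / std, centers at 0 with unit std.
n = len(values)
mean = sum(values) / n
std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
z_scores = [(v - mean) / std for v in values]

print(min_max)   # first value 0.0, last value 1.0
print(z_scores)  # sums to ~0
```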