Data Mining Techniques Class Notes and Outline

Course term: Spring 2012 – Northeastern University
Website: course official website.
Course Professor: Mirek Riedewald
Textbook: Data Mining: Concepts and Techniques, Third Edition (The Morgan Kaufmann Series in Data Management Systems)

Some of the outline items will have a separate, more detailed post.

Course outline

Introduction

Data preprocessing

Classification/Prediction

  • Decision trees
  • Overfitting
  • Statistical learning theory
  • Nearest neighbor
  • Bayes’ theorem
  • Naive Bayes
  • Joint Distribution
  • Bayesian networks

===========================================
– Naive Bayes
===========================================

– For discrete attributes

– P(Ci | X=x) = P(X=x | Ci) * P(Ci) / P(X=x)

– for record x, find class Ci that has the maximum
posterior probability P( Ci | X=x )

– conditional independence assumption: attributes are assumed independent given the class, so P(X=x | Ci) = P(x1 | Ci) * ... * P(xn | Ci)

– Example

– For continuous attributes without discretization
– the formula changes: each attribute is typically modeled with a class-conditional Gaussian, with mean and variance estimated per class

– Zero probability problem — do smoothing (e.g. Laplace)
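
A minimal Python sketch of the above for discrete attributes, with Laplace smoothing (the weather-style records and attribute names are made up for illustration):

from collections import Counter, defaultdict

train = [
    ({"outlook": "sunny",    "windy": "false"}, "no"),
    ({"outlook": "sunny",    "windy": "true"},  "no"),
    ({"outlook": "rainy",    "windy": "false"}, "yes"),
    ({"outlook": "rainy",    "windy": "true"},  "no"),
    ({"outlook": "overcast", "windy": "false"}, "yes"),
]

class_counts = Counter(label for _, label in train)
# value_counts[(class, attribute)][value] = number of matching training records
value_counts = defaultdict(Counter)
for record, label in train:
    for attr, value in record.items():
        value_counts[(label, attr)][value] += 1

def posterior_score(record, label):
    # P(Ci) * prod_k P(Xk=xk | Ci); the evidence P(X=x) is the same for
    # every class, so it can be dropped when only comparing classes.
    score = class_counts[label] / len(train)
    for attr, value in record.items():
        distinct = len({r[attr] for r, _ in train})
        # Laplace smoothing: add 1 to every count so an unseen attribute
        # value never forces the whole product to zero.
        score *= (value_counts[(label, attr)][value] + 1) / (class_counts[label] + distinct)
    return score

x = {"outlook": "sunny", "windy": "true"}
print(max(class_counts, key=lambda c: posterior_score(x, c)))  # -> no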

===========================================
– Bayes Theorem
===========================================

– using probabilities

– P(H | X=x) = P(X=x | H) * P(H) / P(X=x)

– Posterior = likelihood * prior / evidence

– Find the maximum a posteriori (MAP) hypothesis
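
A quick worked example with made-up numbers: let H be "has disease" with prior P(H) = 0.01, and let X be a positive test result with P(X | H) = 0.9 and P(X | not H) = 0.05. Then:

– evidence: P(X) = 0.9 * 0.01 + 0.05 * 0.99 = 0.0585
– posterior: P(H | X) = (0.9 * 0.01) / 0.0585 ≈ 0.154

Even a strong likelihood gives a small posterior when the prior is small.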

===========================================
– Nearest neighbor
===========================================

– Lazy learning — learn the model when given a test record
– Eager learning — construct the model prior to test (decision tree)

– 1-NN – use only the single nearest neighbor
– k-NN – use the k nearest neighbors, for any k up to N

– if k is too small:
  – sensitive to noise points
  – high variance, low bias
  – overfitting

– if k is too large:
  – the neighborhood may include points from other classes
  – low variance, high bias
  – underfitting

– cost:
– brute force: O(|training instances|)
– pre-compute Voronoi: O(log|training instances|)
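
A brute-force k-NN sketch in Python (the toy points and k = 3 are arbitrary choices for illustration; assumes numpy):

from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # brute force: one distance computation per training instance, O(N)
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]
    # majority vote among the k nearest neighbors
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 7.5]])
y_train = ["a", "a", "b", "b"]
print(knn_predict(X_train, y_train, np.array([2.0, 1.0])))  # -> a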

===========================================
– Statistical learning / decision theory
===========================================

– Random variables

– Squared error
– Absolute error

– Bias-Variance tradeoff
– Bias decreases and variance increases as the tree becomes larger (illustrated by the sketch below)
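
A small simulation sketch of this tradeoff: shallow vs. deep regression trees fit on repeated noisy samples of a known function (the target function, noise level, and depths are arbitrary choices; assumes scikit-learn):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 50).reshape(-1, 1)
f_test = np.sin(2 * np.pi * x_test).ravel()   # true function values

for depth in (1, 10):
    preds = []
    for _ in range(200):                      # 200 resampled training sets
        X = rng.uniform(0, 1, (50, 1))
        y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, 50)
        preds.append(DecisionTreeRegressor(max_depth=depth).fit(X, y).predict(x_test))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - f_test) ** 2)   # squared bias
    variance = np.mean(preds.var(axis=0))                 # variance across fits
    print(f"depth={depth:2d}  bias^2={bias2:.3f}  variance={variance:.3f}")

The deep trees come out with low bias and high variance, the shallow trees the opposite.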

==================================
– Classification/Prediction
==================================

– Introduction

idea:
f(X1, ..., Xn) -> Y

– Classification – the target attribute is discrete
– Prediction – the target attribute is continuous

==================================
– Decision tree
==================================

– Decision tree
– The model is a tree
– Greedy algorithm to construct the tree

– Conditions for stopping partitioning:
– Pure node (all records belong to same class)
– No remaining attributes for further partitioning
(majority voting will be done)
– No cases left

– Decision boundary – split on a single attribute
– Oblique decision trees – split on multiple attributes

– Binary split
– Multi-way split

– Attribute selection – how to determine the best split
– Information Gain
– Gain Ratio
– Gini Index
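
A sketch of Gini-based split selection (records and labels are made up): compute the weighted Gini index of the partition induced by each candidate attribute and keep the smallest.

from collections import Counter

def gini(labels):
    # 1 - sum of squared class proportions; 0 for a pure node
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(records, labels, attr):
    # weighted average Gini over the partitions induced by attr's values
    partitions = {}
    for record, label in zip(records, labels):
        partitions.setdefault(record[attr], []).append(label)
    n = len(labels)
    return sum(len(p) / n * gini(p) for p in partitions.values())

records = [{"outlook": "sunny", "windy": "false"},
           {"outlook": "sunny", "windy": "true"},
           {"outlook": "rainy", "windy": "false"},
           {"outlook": "rainy", "windy": "true"}]
labels = ["no", "no", "yes", "yes"]
best = min(records[0], key=lambda a: gini_split(records, labels, a))
print(best)  # -> outlook (weighted Gini 0.0 vs. 0.5 for windy)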

– Underfitting and overfitting
– Training set error
– Test set error

– Minimum Description Length (MDL)

– Tree cost analysis
– Finding the optimal decision tree is NP-complete
– O( |attributes| * |training instances| * log(|training instances|))

==================================
– Data preprocessing
==================================

– Basic definitions
– Data records
– Attributes
– Discrete
– Continuous

– Descriptive data summarization
– Statistic review
– Mean (xbar)
– Median
– Mode
– Variance
– Standard deviation
– Histogram
– Scatter plot
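
A quick numpy sketch of these summary statistics on made-up values:

from collections import Counter
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print("mean (xbar):", data.mean())                            # 5.0
print("median:", np.median(data))                             # 4.5
print("mode:", Counter(data.tolist()).most_common(1)[0][0])   # 4
print("variance:", data.var(ddof=1))                          # sample variance, ~4.57
print("std dev:", data.std(ddof=1))                           # ~2.14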

– Data cleaning
– How to handle missing data
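
One common approach, sketched with pandas (the columns are hypothetical): fill missing numeric values with the attribute mean and missing categorical values with the mode.

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 35],
                   "city": ["NYC", "Boston", None, "Boston"]})
df["age"] = df["age"].fillna(df["age"].mean())    # mean of observed ages
df["city"] = df["city"].fillna(df["city"].mode()[0])  # most frequent city
print(df)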

– Correlations
– Covariance
– Pearson’s product-moment correlation coefficient (numerical data)
– Chi-square (categorical/discrete data)
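
A sketch of both tests with scipy (all values made up):

import numpy as np
from scipy.stats import pearsonr, chi2_contingency

# Pearson's r for two numerical attributes; close to 1 here since y ~ 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
r, p = pearsonr(x, y)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")

# chi-square test of independence on a contingency table of two
# categorical attributes (hypothetical counts)
table = np.array([[20, 30],
                  [40, 10]])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p = {p:.3g}, dof = {dof}")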

– Data transformation
– Normalization
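
Two standard normalizations sketched with numpy, min-max scaling to [0, 1] and z-score standardization (values made up):

import numpy as np

v = np.array([10.0, 20.0, 30.0, 50.0])
min_max = (v - v.min()) / (v.max() - v.min())   # maps values into [0, 1]
z_score = (v - v.mean()) / v.std()              # mean 0, std 1
print(min_max)   # [0.   0.25 0.5  1.  ]
print(z_score)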
