Notes on Data Mining
Some selfnotes from the book Data Mining: Concepts, Models, Methods, and Algorithms (by Mehmed Kantardzic).
Data mining includes Predictive and Descriptive.
 Predictive goal: to produce a model, expressed as an executable code, which can be used to perform classification, estimation, prediction, or other similar tasks.
 Descriptive goal: uncovering patterns and relationship in large data sets.
Data mining roots from Statistics (Math) and Machine Learning (Computer Science).
System identification:
a: structure identification
y = f(u, t); y: model’s output, u: input vector; t: parameter vector
b: parameter identification
Data mining process:

 state the problem and formulate the hypothesis

 collect the data

 preprocessing the data: outlier detection (and removal); scaling, encoding, and selecting features

 estimate the model: implementation based on several models

 interpret the model and draw conclusion
Data quality check:
 accurate
 stored according to data type
 integrity
 be consistant
 not be redundant
 timely
 well understood
 data set should be complete
Properties of highdimensional data
 the size of a data set yielding the same density of data points in an ndimensional space increases exponentially with dimensions
 a larger radius is needed to enclose a fraction of the data points in a highdimensional space: e(p) = p^(1/d); p is the prespecified fraction of sample
 amost every point is closer to an edge than to another sample oint in a highdimensional space: D(d,n) = 1/2(1/n)^(1/d); d: dimension; n: points
 almost every point is an outlier
Raw data are rarely good, many transformations may be needed.
Transformation of raw data:
 Normalisation:
 decimal scaling: typical to [1, 1]
 minmax normalisation
 standard deviation normalisation: V’(i) = (V(i)  mean(v))/sd(V)
 data smoothing
 differences and ratios
Missing values:
 replace with a single global constant
 replace with its feature mean
 replace with its feature mean for the given class
Sum raw data before data mining
Times series: sometimes better to show t(n+1)t(n) or t(n+1)/t(n), especially use for logicbased methods, such as decision trees or rules
Moving averages (MA): MA(i,m) =
Outlier detection:
 statistics (e.g., mean, SD) (for single dimension)
 distance based (multipledimension): e.g., Euclidian Distance d = [(x1x2)^2 + (y1y2)^2]^(1/2)
 deviationbased technique: define the basic characteristics of a sample set, and all samples that deviate from these characterics are outliers
Data reduction (delete nonessential data)
three main dimensions of preprocessed data
 columns (features)
 rows (cases or samples)
 values
basic data reduction
 delete a column
 delete a row
 reduce the number of values in a column (smooth a feature)
Parameters;
 computing time
 predictive / descriptive accuracy
 representation of the datamining model
No single data reduction method can be suited for all applications
Feature selection / reduction
two main methods:
 featureranking algorithms
 minimum subset algorithms
two ways
 bottomup approach: start with empty set and fill in most relevant features (heuristic criteria)
 topdown approach: start from full set, remove one by one those are shown irrelevant
promising subsets are usually obtained heuristically
in general, if one feature describes different classes of entities, samples of different classes can be examined. Means of features are normalised by its variance and then compared:
SE(AB) = sqrt(Var(A)/n1 + Var(B)/n2)
For a test:
mean(A)  mean(B)/SE(AB) > threshold
Merging features: principle component
Feature examined, merged, and transformed (linear)
F’ = Σ(j=1 to m)w(j)·f(j) ; w(j): a single set of weights; upto m transformation are generated each vector of m weights is called a principle component
Entropy measure for ranking features based on a similarity measure S that is in inverse proportion to the distance D btw. two ndimensional samples
S for two samples: S_ij = e^(α·D_ij) ; α is a parameter expressed as: α = (ln0.5)/D ; D is the average distances among samples in the dataset
For nominal variables (nonnumeric) use Hamming distance: S_ij = (ΣX_ik = X_jk)/n; xik = xjk is 1 if xik = xjk, and 0 otherwise.
PCA
 a set of samples X = {x1,x2,x3…xm} be transformed into another set: Y = {y1,y2,y3…ym} but most information content is stored in the few dimensions.
 transformation based on: assumption high info correspond to high variance: Y = A·X
 the first PC: an axis in the direction of max. variance
 eigenvalues based on correlatoinmatrix of features
 ratio of selected features for datamining: usually R > 0.9 is good
Values reduction
featurediscretization techniques: discretize the values of continuous features into a small number of intervals
 sort all values
 assign approximately equal number of sorted values to each bin
 move a boder bin, calc total errors > compose and select best bin
other methods for feature discretization: ChiMerge (automated)
Types of inference: induction, deduction, and transduction
 linear: polynomial regression nonlinear: multilayer + artificial neural networks
inducsivelearning dependencies: a learning machine select a function from the set of functions it supports, which best approximates the system’s response
training set (Xi, Yi)
Quality of approximation is measured by the loss function: L(y, f(x, w)); y: output by the system, X: set of inputs, w: set of parameters, f(X, w): output by learning machine
L compares the difference btw. outputs produced by the system (Yi) and that by the learning machine f(Xi, w)
Loss function: nonnegative. Large: poor approximation; small to zero: good
Statistical learning theory (VapnikChervonenkis (VC) theory): mainly for classification but also regression and summarisation
Foundations for successful datamining processes is datapreprocessing and datareduction methods
Scaling and normalisation
 encoding
 outlier detection and removal
 feature selection and comosition
 data cleaning and scrubbing
 data smoothing
 missing data elimination
 case reduction by sampling
datamining and knowledgediscovery techniques
 stats methods: Bayesian interference, logistic regression, ANOVA, longlinear models
 cluster analysis
 decision trees and rules (CLS, ID3, C4.5 methods)
 association rules
 artifical neural networks (multiple perceptrons)
 genetic algorithms
 fuzzy inference systems
 Ndimensional visualisation methods
Resampling methods (splitting datasets into training and test samples)
 resubstitution methods (training=test=all)
 holdout method (1/2  2/3 for training)
 leave one out method (n1 for training)
 rotation method (nfold cross validation)
 bootstrap method: generate fake data set with same size as given data
Stats methods
 mean, median, mode (data occur most frequently), SD
 mean  mode = 3*(mean  median)
Bayesian Interference
 Bayes theorem; P(H/X) = [P(X/H)·P(H)]/P(X)
Generalized linear models
 relationship btw the trend of one variable and the values taken by several other variables
 a common type of GLM is logistic regression
 linear logistic model: logit(P)
Linear Discriminant Analysis (LDA)
 the dependent variable is categorical (nominal or ordinal)
 independent variables are metric
 Z = w1x1 + w2x2 + … + wkxk; w is the weight; Z is the quantity discriminant score
Cluster analysis (see also here)
Schemata for a formal description of discovered clusters
 represent a cluster of points in an ndim space by their centroid or by a set of distant points
 graphically using nodes in a clustering
 using logical expression on sample tree attributes
Distance calculation:
 Euclidean distance (Disadvantages: not scale invariant, not for negative correlations)
 Maximum, Manhattan, Canberra, binary, Minowski, …
 Correlationbased distance: 1r. Pearson correlation coefficient (PCC; Disadvantage: outlier sensitive) or Spearman correlation coefficient (SCC)
Algorithms:
 Hierarchical clustering: Agglomerative approach (hclust() and agnes()) and Divisive approach (diana())
 Kmeans clustering (kmeans())
 Fuzzy CMeans Clustering (e.g. fanny())
 Principal Component Analysis (PCA; prcomp())
 Multidimensional Scaling (MDS; cmdscale())
 Biclustering
 Similarity Measures for Clusters
global criterion: Euclidean;
local criterion: minimal mutual neighbor distance (MND)
Kmeans: very sensitive to outliers and noise data points
Incremental clustering for very large data sets
Decision trees requirements
 attributevalue description
 predefined classes
 discrete classes
 sufficient data
 “Logical classification” methods
Artificial neural networks properties and capabilities
 nonlinearity
 learning from examples
 adaptivisity
 evidential response
 fault tolerence
 uniformity fo analysis and design
Learning tasks
 pattern association
 pattern recognisation
 function approximation
 control
 filtering
genetic algorithms (based on natural evolution)
 populationbased optimization process
 stochastioptimization techniques
 belong to probabilistic algorithms
 usually use a binarycoding schema
Ensemble learning
 to combine results from various predictive models generated using training samples.
 If N classifiers make independent errors and they have the error probability e < 0.5, then it can be shown that the error of an ensemble E is monotonically decreasing the function of N.
 consists of two sequential phases: (a) the training phase, and (b) the testing phase.
Graph mining
 a collection of techniques for mining the relational aspects of data represented as a graph
 Traditional datamining and man agement algorithms such as clustering, classification, frequentpattern mining, and indexing have now been extended to the graph scenario
Temporal data mining
 data mining of large sequential data sets, including Temporal Sequences and Time Series.