Data mining tasks fall into two categories: predictive and descriptive.
- Predictive goal: to produce a model, expressed as executable code, that can be used for classification, estimation, prediction, or other similar tasks.
- Descriptive goal: to uncover patterns and relationships in large data sets.
Data mining has its roots in statistics (mathematics) and machine learning (computer science).
System identification:
a: structure identification: choose the form of the model y = f(u, t); y: model output; u: input vector; t: parameter vector
b: parameter identification: optimise the parameter vector t once the structure is fixed (see the sketch below)
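A minimal Python sketch of the two steps, assuming a linear structure y = t0 + t1·u has already been chosen and using made-up input/output values; a least-squares fit stands in for parameter identification (numpy assumed).

    # Assumed structure: y = t0 + t1*u (structure identification);
    # the parameter vector t is then fitted by least squares (parameter identification).
    import numpy as np

    u = np.array([0.0, 1.0, 2.0, 3.0, 4.0])    # input vector (toy values)
    y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])    # observed system output (toy values)

    A = np.column_stack([np.ones_like(u), u])  # design matrix for y = t0 + t1*u
    t, *_ = np.linalg.lstsq(A, y, rcond=None)  # identified parameter vector t
    print("t =", t, "model output f(u, t) =", A @ t)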
Data mining process:
- state the problem and formulate the hypothesis
- collect the data
- preprocess the data: outlier detection (and removal); scaling, encoding, and selecting features
- estimate the model: implementation is usually based on several candidate models
- interpret the model and draw conclusions
Data quality check:
- accurate
- stored according to data type
- integrity
- be consistent
- not be redundant
- timely
- well understood
- data set should be complete
Properties of high-dimensional data
- the size of a data set yielding the same density of data points in an n-dimensional space increases exponentially with dimensions
- a larger radius is needed to enclose a given fraction of the data points in a high-dimensional space: e(p) = p^(1/d); p is the prespecified fraction of samples
- almost every point is closer to an edge than to another sample point in a high-dimensional space: D(d, n) = (1/2)·(1/n)^(1/d); d: dimension; n: number of points (see the numerical check after this list)
- almost every point is an outlier
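A quick numerical check of the two formulas above; the values of p and n are arbitrary illustrations.

    # e(p) = p^(1/d): edge length needed to enclose a fraction p of the data
    # D(d, n) = (1/2)*(1/n)^(1/d): typical distance from a point to its nearest sample point
    p, n = 0.10, 10000
    for d in (1, 2, 10, 100):
        edge = p ** (1.0 / d)
        dist = 0.5 * (1.0 / n) ** (1.0 / d)
        print(f"d={d:4d}  e(p)={edge:.3f}  D(d,n)={dist:.3f}")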
Raw data are rarely of good quality; many transformations may be needed.
Transformation of raw data:
- Normalisation (see the sketch after this list):
- decimal scaling: typically maps values into [-1, 1]
- min-max normalisation
- standard deviation normalisation: V'(i) = (V(i) - mean(V))/sd(V)
- data smoothing
- differences and ratios
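A small sketch of the three normalisations on a made-up feature vector (Python with numpy assumed).

    import numpy as np

    v = np.array([12.0, -56.0, 7.0, 180.0, 43.0])   # toy feature values

    # Decimal scaling: divide by the smallest power of 10 that maps all values into [-1, 1].
    k = np.ceil(np.log10(np.max(np.abs(v))))
    v_decimal = v / 10 ** k

    # Min-max normalisation to [0, 1].
    v_minmax = (v - v.min()) / (v.max() - v.min())

    # Standard deviation (z-score) normalisation: V'(i) = (V(i) - mean(V)) / sd(V).
    v_zscore = (v - v.mean()) / v.std()

    print(v_decimal, v_minmax, v_zscore, sep="\n")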
Missing values (see the sketch after this list):
- replace with a single global constant
- replace with its feature mean
- replace with its feature mean for the given class
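A sketch of the three replacement strategies on a toy feature, with NaN marking the missing entries; the class labels and the constant 0.0 are invented for illustration.

    import numpy as np

    x = np.array([2.0, np.nan, 4.0, 8.0, np.nan, 6.0])   # feature with missing values
    cls = np.array(["a", "a", "a", "b", "b", "b"])        # class label per sample

    # (1) Replace with a single global constant.
    x_const = np.where(np.isnan(x), 0.0, x)

    # (2) Replace with the feature mean over all samples.
    x_mean = np.where(np.isnan(x), np.nanmean(x), x)

    # (3) Replace with the feature mean within the sample's own class.
    x_class = x.copy()
    for c in np.unique(cls):
        mask = (cls == c) & np.isnan(x)
        x_class[mask] = np.nanmean(x[cls == c])

    print(x_const, x_mean, x_class, sep="\n")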
Summarise the raw data before data mining.
Time series: it is sometimes better to use differences t(n+1) - t(n) or ratios t(n+1)/t(n), especially for logic-based methods such as decision trees or rules.
Moving averages (MA): MA(i, m) = (1/m)·Σ(j = i-m+1 to i) t(j), the average of the m values ending at position i.
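A minimal sketch of the m-point moving average as reconstructed above, on made-up values.

    # MA(i, m) = (1/m) * sum of t(j) for j = i-m+1 .. i
    def moving_average(t, m):
        return [sum(t[i - m + 1:i + 1]) / m for i in range(m - 1, len(t))]

    series = [3, 5, 4, 6, 8, 7, 9]
    print(moving_average(series, 3))   # smoothed series, 3-point window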
Outlier detection:
- statistics (e.g., mean, SD) (for single dimension)
- distance based (multiple dimensions): e.g., Euclidean distance d = [(x1-x2)^2 + (y1-y2)^2]^(1/2) (see the sketch after this list)
- deviation-based technique: define the basic characteristics of a sample set; all samples that deviate from these characteristics are outliers
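A sketch of the distance-based approach: a sample is flagged when fewer than k other samples lie within radius r of it; r and k are illustrative thresholds, not values from the notes.

    import numpy as np

    def distance_outliers(X, r=2.0, k=2):
        # Euclidean distance matrix between all pairs of samples.
        d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
        neighbours = (d < r).sum(axis=1) - 1   # exclude the sample itself
        return neighbours < k                  # True = flagged as outlier

    X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.0], [9.0, 9.0]])
    print(distance_outliers(X))                # the last point is far from all others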
Data reduction (delete nonessential data)
three main dimensions of preprocessed data
- columns (features)
- rows (cases or samples)
- values
basic data reduction
- delete a column
- delete a row
- reduce the number of values in a column (smooth a feature)
Parameters to consider:
- computing time
- predictive / descriptive accuracy
- representation of the data-mining model
No single data-reduction method is suited to all applications.
Feature selection / reduction
two main methods:
- feature-ranking algorithms
- minimum subset algorithms
two ways
- bottom-up approach: start with an empty set and add the most relevant features (heuristic criteria)
- top-down approach: start from the full set and remove, one by one, the features shown to be irrelevant
promising subsets are usually obtained heuristically
in general, to check whether a feature distinguishes different classes of entities, samples of the different classes are examined; the class means, normalised by their variances, are then compared (see the sketch after the test below):
SE(A-B) = sqrt(Var(A)/n1 + Var(B)/n2)
For a test:
|mean(A) - mean(B)|/SE(A-B) > threshold
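A sketch of this test for a single feature measured in two classes A and B; the threshold value and the data are illustrative.

    import numpy as np

    def feature_relevant(a, b, threshold=2.0):
        # SE(A-B) = sqrt(Var(A)/n1 + Var(B)/n2); keep the feature when the
        # normalised difference of means exceeds the threshold.
        se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
        return abs(a.mean() - b.mean()) / se > threshold

    a = np.array([2.1, 2.4, 1.9, 2.2])   # feature values in class A
    b = np.array([4.8, 5.1, 5.0, 4.7])   # feature values in class B
    print(feature_relevant(a, b))        # True: the feature separates the classes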
Merging features: principal components
Features are examined, merged, and transformed linearly:
F' = Σ(j=1 to m) w(j)·f(j); w(j): a set of weights; up to m transformations are generated, and each vector of m weights is called a principal component
Entropy measure for ranking features, based on a similarity measure S that is in inverse proportion to the distance D between two n-dimensional samples (see the sketch after this block)
S for two samples: S_ij = e^(-α·D_ij); α is a parameter expressed as α = -(ln 0.5)/D, where D is the average distance among samples in the data set
For nominal (non-numeric) variables, use the Hamming distance: S_ij = (Σ_k |x_ik = x_jk|)/n, where |x_ik = x_jk| is 1 if x_ik = x_jk and 0 otherwise.
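A sketch of the similarity computation for numeric samples; the entropy aggregation E = -Σ(S·log S + (1-S)·log(1-S)) used at the end is the form commonly paired with this measure and is an assumption beyond the notes. Features are then ranked by how much the entropy changes when each one is removed; the data are synthetic.

    import numpy as np

    def entropy_measure(X):
        # Similarity-based entropy of a numeric data set X (rows = samples).
        n = len(X)
        iu = np.triu_indices(n, k=1)
        D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))[iu]  # pairwise distances
        alpha = -np.log(0.5) / D.mean()        # alpha = -(ln 0.5) / average distance
        S = np.exp(-alpha * D)                 # similarity S_ij = exp(-alpha * D_ij)
        S = np.clip(S, 1e-12, 1 - 1e-12)       # guard the logarithms
        return -(S * np.log(S) + (1 - S) * np.log(1 - S)).sum()  # assumed entropy form

    X = np.random.default_rng(0).normal(size=(20, 4))
    for j in range(X.shape[1]):                # entropy of the data set without feature j
        print(j, entropy_measure(np.delete(X, j, axis=1)))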
PCA
- a set of samples X = {x1, x2, x3, ..., xm} is transformed into another set Y = {y1, y2, y3, ..., ym}, with most of the information content stored in the first few dimensions
- transformation based on the assumption that high information content corresponds to high variance: Y = A·X
- the first PC: an axis in the direction of maximum variance
- eigenvalues are computed from the correlation matrix of the features
- ratio R of variance retained by the selected components: usually R > 0.9 is good (see the sketch after this list)
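A sketch of the eigenvalue-ratio criterion on synthetic data; keeping the smallest number of components with R > 0.9 follows the note above, while the data and seed are made up.

    import numpy as np

    X = np.random.default_rng(1).normal(size=(100, 5))
    X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]        # make two features strongly correlated

    corr = np.corrcoef(X, rowvar=False)             # correlation matrix of the features
    eigvals = np.linalg.eigvalsh(corr)[::-1]        # eigenvalues, largest first

    ratio = np.cumsum(eigvals) / eigvals.sum()      # R after keeping 1, 2, ... components
    k = int(np.argmax(ratio > 0.9) + 1)             # smallest number of components with R > 0.9
    print("eigenvalues:", np.round(eigvals, 3))
    print("components kept:", k, "R =", round(float(ratio[k - 1]), 3))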
Value reduction
feature-discretization techniques: discretize the values of continuous features into a small number of intervals
- sort all values
- assign approximately equal number of sorted values to each bin
- move border values between bins, calculate the total error, and select the partition with the smallest error (see the sketch after this list)
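A sketch of the first two steps (sorting and equal-frequency assignment), with each value smoothed to its bin mean; moving the borders to minimise total error is not shown, and the values are made up.

    import numpy as np

    def equal_frequency_bins(values, n_bins):
        # Sort the values, give each bin roughly the same number of sorted values,
        # and replace each value by the mean of its bin (smoothing).
        order = np.argsort(values)
        smoothed = np.empty_like(values, dtype=float)
        for chunk in np.array_split(order, n_bins):
            smoothed[chunk] = values[chunk].mean()
        return smoothed

    v = np.array([5.0, 1.0, 8.0, 2.0, 9.0, 2.5])
    print(equal_frequency_bins(v, 3))   # 3 bins of 2 values each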
other methods for feature discretization: ChiMerge (automated)
Types of inference: induction, deduction, and transduction
- approximating functions may be linear (e.g., polynomial regression) or non-linear (e.g., multilayer artificial neural networks)
inductive-learning dependencies: a learning machine selects, from the set of functions it supports, the one that best approximates the system's response
training set (Xi, Yi)
Quality of approximation is measured by the loss function L(y, f(X, w)); y: output produced by the system; X: set of inputs; w: set of parameters; f(X, w): output produced by the learning machine
L compares the outputs produced by the system (Yi) with those produced by the learning machine f(Xi, w)
Loss function: non-negative; large values mean a poor approximation, values close to zero a good one
Statistical learning theory (Vapnik-Chervonenkis (VC) theory): mainly for classification but also regression and summarisation
The foundations of a successful data-mining process are data-preprocessing and data-reduction methods:
- scaling and normalisation
- encoding
- outlier detection and removal
- feature selection and composition
- data cleaning and scrubbing
- data smoothing
- missing data elimination
- case reduction by sampling
data-mining and knowledge-discovery techniques
- statistical methods: Bayesian inference, logistic regression, ANOVA, log-linear models
- cluster analysis
- decision trees and rules (CLS, ID3, C4.5 methods)
- association rules
- artificial neural networks (multilayer perceptrons)
- genetic algorithms
- fuzzy inference systems
- N-dimensional visualisation methods
Resampling methods (splitting data sets into training and test samples; see the sketch after this list)
- resubstitution method (training set = test set = all data)
- holdout method (1/2 to 2/3 of the data for training)
- leave-one-out method (n-1 samples for training)
- rotation method (n-fold cross-validation)
- bootstrap method: generate replicate data sets of the same size as the given data by sampling with replacement
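A sketch of how the splits differ; the sample count, number of folds, and seed are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 20
    idx = rng.permutation(n)                  # shuffled sample indices

    # Holdout: roughly 2/3 of the samples for training, the rest for testing.
    split = (2 * n) // 3
    train_hold, test_hold = idx[:split], idx[split:]

    # Rotation (n-fold cross-validation): each fold serves as the test set once.
    for fold in np.array_split(idx, 5):
        train_fold = np.setdiff1d(idx, fold)
        print("fold sizes:", len(train_fold), len(fold))

    # Bootstrap: a data set of the same size, drawn with replacement.
    boot = idx[rng.integers(0, n, size=n)]
    print("holdout:", len(train_hold), len(test_hold), "bootstrap:", len(boot))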
Stats methods
- mean, median, mode (the value that occurs most frequently), SD
- empirical relation: mean - mode ≈ 3·(mean - median)
Bayesian inference
- Bayes' theorem: P(H|X) = [P(X|H)·P(H)]/P(X) (numerical check below)
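A numerical check of the theorem with invented probabilities, expanding P(X) over H and its complement.

    p_h = 0.01            # prior P(H)
    p_x_given_h = 0.90    # likelihood P(X|H)
    p_x_given_not_h = 0.05

    p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)   # total probability P(X)
    p_h_given_x = p_x_given_h * p_h / p_x                   # posterior P(H|X)
    print(round(p_h_given_x, 3))                             # ~0.154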
Generalized linear models
- relationship btw the trend of one variable and the values taken by several other variables
- a common type of GLM is logistic regression
- linear logistic model: logit(P) = log(P/(1-P)) is modelled as a linear function of the predictors
Linear Discriminant Analysis (LDA)
- the dependent variable is categorical (nominal or ordinal)
- independent variables are metric
- Z = w1x1 + w2x2 + ... + wkxk; the wi are weights and Z is the discriminant score (sketch below)
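A sketch of scoring samples with a given weight vector; the weights and the zero cut-off are invented here (in LDA they would be estimated from labelled training data).

    import numpy as np

    w = np.array([0.8, -0.5, 1.2])                    # hypothetical weights
    X = np.array([[1.0, 2.0, 0.5], [3.0, 0.5, 2.0]])  # two samples to score

    z = X @ w              # discriminant score Z = w1*x1 + w2*x2 + ... + wk*xk
    print(z, z > 0)        # classify by comparing Z with a cut-off (0 here)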
Cluster analysis
Schemata for a formal description of discovered clusters
- represent a cluster of points in an n-dim space by their centroid or by a set of distant points
- graphically, using nodes in a clustering tree
- using logical expressions on sample attributes
Distance calculation:
- Euclidean distance (Disadvantages: not scale invariant, not for negative correlations)
- Maximum, Manhattan, Canberra, binary, Minkowski, …
- Correlation-based distance: 1 - r, where r is the Pearson correlation coefficient (PCC; disadvantage: sensitive to outliers) or the Spearman correlation coefficient (SCC)
Algorithms:
- Hierarchical clustering: Agglomerative approach (hclust() and agnes()) and Divisive approach (diana())
- K-means clustering (kmeans())
- Fuzzy C-Means Clustering (e.g. fanny())
- Principal Component Analysis (PCA; prcomp())
- Multidimensional Scaling (MDS; cmdscale())
- Biclustering
- Similarity Measures for Clusters
global criterion: Euclidean;
local criterion: minimal mutual neighbor distance (MND)
K-means: very sensitive to outliers and noisy data points
Incremental clustering for very large data sets
Decision trees requirements
- attribute-value description
- predefined classes
- discrete classes
- sufficient data
- “Logical classification” methods
Artificial neural networks properties and capabilities
- nonlinearity
- learning from examples
- adaptivity
- evidential response
- fault tolerance
- uniformity of analysis and design
Learning tasks
- pattern association
- pattern recognition
- function approximation
- control
- filtering
genetic algorithms (based on natural evolution)
- population-based optimization process
- stochastic-optimization techniques
- belong to probabilistic algorithms
- usually use a binary-coding schema
Ensemble learning
- combines the results of several predictive models generated from training samples.
- If N classifiers make independent errors and each has error probability e < 0.5, then the error of the ensemble E is a monotonically decreasing function of N (see the sketch after this list).
- consists of two sequential phases: (a) the training phase, and (b) the testing phase.
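A sketch of the majority-vote argument behind the claim above: with independent errors, the ensemble errs only when more than half of the N classifiers err, so its error shrinks as N grows whenever e < 0.5. The binomial form of the calculation is the standard way to show this; e = 0.3 is an arbitrary illustration.

    from math import comb

    def ensemble_error(n, e):
        # Probability that more than half of n independent classifiers are wrong.
        k_min = n // 2 + 1
        return sum(comb(n, k) * e**k * (1 - e)**(n - k) for k in range(k_min, n + 1))

    for n in (1, 5, 11, 21):
        print(n, round(ensemble_error(n, 0.3), 4))   # error shrinks as N grows (e < 0.5)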
Graph mining
- a collection of techniques for mining the relational aspects of data represented as a graph
- Traditional data-mining and management algorithms such as clustering, classification, frequent-pattern mining, and indexing have now been extended to the graph scenario
Temporal data mining
- data mining of large sequential data sets, including Temporal Sequences and Time Series.