Data mining tasks fall into two categories: predictive and descriptive.
- Predictive goal: to produce a model, expressed as executable code, that can be used for classification, estimation, prediction, or other similar tasks.
- Descriptive goal: to uncover patterns and relationships in large data sets.
Data mining has its roots in statistics (mathematics) and machine learning (computer science).
System identification:
a: structure identification: choose the form of the model y = f(u, t); y: model output; u: input vector; t: parameter vector
b: parameter identification: optimise the parameter vector t once the structure is fixed (see the sketch below)
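A minimal Python sketch of the two steps, assuming a linear structure y = t0 + t1·u has already been chosen and using made-up input/output values; a least-squares fit stands in for parameter identification (numpy assumed).

    # Assumed structure: y = t0 + t1*u (structure identification);
    # the parameter vector t is then fitted by least squares (parameter identification).
    import numpy as np

    u = np.array([0.0, 1.0, 2.0, 3.0, 4.0])    # input vector (toy values)
    y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])    # observed system output (toy values)

    A = np.column_stack([np.ones_like(u), u])  # design matrix for y = t0 + t1*u
    t, *_ = np.linalg.lstsq(A, y, rcond=None)  # identified parameter vector t
    print("t =", t, "model output f(u, t) =", A @ t)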
Data mining process:
- state the problem and formulate the hypothesis
- collect the data
- preprocess the data: outlier detection (and removal); scaling, encoding, and selecting features
- estimate the model: implementation is usually based on several candidate models
- interpret the model and draw conclusions
Data quality check:
- accurate
- stored according to data type
- integrity
- be consistent
- not be redundant
- timely
- well understood
- data set should be complete
Properties of high-dimensional data
- the size of a data set yielding the same density of data points in an n-dimensional space increases exponentially with dimensions
- a larger radius is needed to enclose a given fraction of the data points in a high-dimensional space: e(p) = p^(1/d); p is the prespecified fraction of samples
- almost every point is closer to an edge than to another sample point in a high-dimensional space: D(d, n) = (1/2)·(1/n)^(1/d); d: dimension; n: number of points (see the numerical check after this list)
- almost every point is an outlier
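A quick numerical check of the two formulas above; the values of p and n are arbitrary illustrations.

    # e(p) = p^(1/d): edge length needed to enclose a fraction p of the data
    # D(d, n) = (1/2)*(1/n)^(1/d): typical distance from a point to its nearest sample point
    p, n = 0.10, 10000
    for d in (1, 2, 10, 100):
        edge = p ** (1.0 / d)
        dist = 0.5 * (1.0 / n) ** (1.0 / d)
        print(f"d={d:4d}  e(p)={edge:.3f}  D(d,n)={dist:.3f}")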
Raw data are rarely of good quality; many transformations may be needed.
Transformation of raw data:
- Normalisation (see the sketch after this list):
- decimal scaling: typically maps values into [-1, 1]
- min-max normalisation
- standard deviation normalisation: V'(i) = (V(i) - mean(V))/sd(V)
- data smoothing
- differences and ratios
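A small sketch of the three normalisations on a made-up feature vector (Python with numpy assumed).

    import numpy as np

    v = np.array([12.0, -56.0, 7.0, 180.0, 43.0])   # toy feature values

    # Decimal scaling: divide by the smallest power of 10 that maps all values into [-1, 1].
    k = np.ceil(np.log10(np.max(np.abs(v))))
    v_decimal = v / 10 ** k

    # Min-max normalisation to [0, 1].
    v_minmax = (v - v.min()) / (v.max() - v.min())

    # Standard deviation (z-score) normalisation: V'(i) = (V(i) - mean(V)) / sd(V).
    v_zscore = (v - v.mean()) / v.std()

    print(v_decimal, v_minmax, v_zscore, sep="\n")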
Missing values (see the sketch after this list):
- replace with a single global constant
- replace with its feature mean
- replace with its feature mean for the given class
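A sketch of the three replacement strategies on a toy feature, with NaN marking the missing entries; the class labels and the constant 0.0 are invented for illustration.

    import numpy as np

    x = np.array([2.0, np.nan, 4.0, 8.0, np.nan, 6.0])   # feature with missing values
    cls = np.array(["a", "a", "a", "b", "b", "b"])        # class label per sample

    # (1) Replace with a single global constant.
    x_const = np.where(np.isnan(x), 0.0, x)

    # (2) Replace with the feature mean over all samples.
    x_mean = np.where(np.isnan(x), np.nanmean(x), x)

    # (3) Replace with the feature mean within the sample's own class.
    x_class = x.copy()
    for c in np.unique(cls):
        mask = (cls == c) & np.isnan(x)
        x_class[mask] = np.nanmean(x[cls == c])

    print(x_const, x_mean, x_class, sep="\n")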
Summarise the raw data before data mining.
Time series: it is sometimes better to use differences t(n+1) - t(n) or ratios t(n+1)/t(n), especially for logic-based methods such as decision trees or rules.
Moving averages (MA): MA(i, m) = (1/m)·Σ(j = i-m+1 to i) t(j), the average of the m values ending at position i.
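A minimal sketch of the m-point moving average as reconstructed above, on made-up values.

    # MA(i, m) = (1/m) * sum of t(j) for j = i-m+1 .. i
    def moving_average(t, m):
        return [sum(t[i - m + 1:i + 1]) / m for i in range(m - 1, len(t))]

    series = [3, 5, 4, 6, 8, 7, 9]
    print(moving_average(series, 3))   # smoothed series, 3-point window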
Outlier detection:
- statistics (e.g., mean, SD) (for single dimension)
- distance based (multiple dimensions): e.g., Euclidean distance d = [(x1-x2)^2 + (y1-y2)^2]^(1/2) (see the sketch after this list)
- deviation-based technique: define the basic characteristics of a sample set; all samples that deviate from these characteristics are outliers
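A sketch of the distance-based approach: a sample is flagged when fewer than k other samples lie within radius r of it; r and k are illustrative thresholds, not values from the notes.

    import numpy as np

    def distance_outliers(X, r=2.0, k=2):
        # Euclidean distance matrix between all pairs of samples.
        d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
        neighbours = (d < r).sum(axis=1) - 1   # exclude the sample itself
        return neighbours < k                  # True = flagged as outlier

    X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.0], [9.0, 9.0]])
    print(distance_outliers(X))                # the last point is far from all others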
Data reduction (delete nonessential data)
three main dimensions of preprocessed data
- columns (features)
- rows (cases or samples)
- values
basic data reduction
- delete a column
- delete a row
- reduce the number of values in a column (smooth a feature)
Parameters to consider:
- computing time
- predictive / descriptive accuracy
- representation of the data-mining model
No single data-reduction method is suited to all applications.
Feature selection / reduction
two main methods:
- feature-ranking algorithms
- minimum subset algorithms
two ways
- bottom-up approach: start with an empty set and add the most relevant features (heuristic criteria)
- top-down approach: start from the full set and remove, one by one, the features shown to be irrelevant
promising subsets are usually obtained heuristically
in general, to check whether a feature distinguishes different classes of entities, samples of the different classes are examined; the class means, normalised by their variances, are then compared (see the sketch after the test below):
SE(A-B) = sqrt(Var(A)/n1 + Var(B)/n2)
For a test:
|mean(A) - mean(B)|/SE(A-B) > threshold
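A sketch of this test for a single feature measured in two classes A and B; the threshold value and the data are illustrative.

    import numpy as np

    def feature_relevant(a, b, threshold=2.0):
        # SE(A-B) = sqrt(Var(A)/n1 + Var(B)/n2); keep the feature when the
        # normalised difference of means exceeds the threshold.
        se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
        return abs(a.mean() - b.mean()) / se > threshold

    a = np.array([2.1, 2.4, 1.9, 2.2])   # feature values in class A
    b = np.array([4.8, 5.1, 5.0, 4.7])   # feature values in class B
    print(feature_relevant(a, b))        # True: the feature separates the classes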
Merging features: principal components
Features are examined, merged, and transformed linearly:
F' = Σ(j=1 to m) w(j)·f(j); w(j): a set of weights; up to m transformations are generated, and each vector of m weights is called a principal component
Entropy measure for ranking features, based on a similarity measure S that is in inverse proportion to the distance D between two n-dimensional samples (see the sketch after this block)
S for two samples: S_ij = e^(-α·D_ij); α is a parameter expressed as α = -(ln 0.5)/D, where D is the average distance among samples in the data set
For nominal (non-numeric) variables, use the Hamming distance: S_ij = (Σ_k |x_ik = x_jk|)/n, where |x_ik = x_jk| is 1 if x_ik = x_jk and 0 otherwise.
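A sketch of the similarity computation for numeric samples; the entropy aggregation E = -Σ(S·log S + (1-S)·log(1-S)) used at the end is the form commonly paired with this measure and is an assumption beyond the notes. Features are then ranked by how much the entropy changes when each one is removed; the data are synthetic.

    import numpy as np

    def entropy_measure(X):
        # Similarity-based entropy of a numeric data set X (rows = samples).
        n = len(X)
        iu = np.triu_indices(n, k=1)
        D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))[iu]  # pairwise distances
        alpha = -np.log(0.5) / D.mean()        # alpha = -(ln 0.5) / average distance
        S = np.exp(-alpha * D)                 # similarity S_ij = exp(-alpha * D_ij)
        S = np.clip(S, 1e-12, 1 - 1e-12)       # guard the logarithms
        return -(S * np.log(S) + (1 - S) * np.log(1 - S)).sum()  # assumed entropy form

    X = np.random.default_rng(0).normal(size=(20, 4))
    for j in range(X.shape[1]):                # entropy of the data set without feature j
        print(j, entropy_measure(np.delete(X, j, axis=1)))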
PCA
- a set of samples X = {x1, x2, x3, ..., xm} is transformed into another set Y = {y1, y2, y3, ..., ym}, with most of the information content stored in the first few dimensions
- transformation based on the assumption that high information content corresponds to high variance: Y = A·X
- the first PC: an axis in the direction of maximum variance
- eigenvalues are computed from the correlation matrix of the features
- ratio R of variance retained by the selected components: usually R > 0.9 is good (see the sketch after this list)
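A sketch of the eigenvalue-ratio criterion on synthetic data; keeping the smallest number of components with R > 0.9 follows the note above, while the data and seed are made up.

    import numpy as np

    X = np.random.default_rng(1).normal(size=(100, 5))
    X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]        # make two features strongly correlated

    corr = np.corrcoef(X, rowvar=False)             # correlation matrix of the features
    eigvals = np.linalg.eigvalsh(corr)[::-1]        # eigenvalues, largest first

    ratio = np.cumsum(eigvals) / eigvals.sum()      # R after keeping 1, 2, ... components
    k = int(np.argmax(ratio > 0.9) + 1)             # smallest number of components with R > 0.9
    print("eigenvalues:", np.round(eigvals, 3))
    print("components kept:", k, "R =", round(float(ratio[k - 1]), 3))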
Value reduction
feature-discretization techniques: discretize the values of continuous features into a small number of intervals
- sort all values
- assign approximately equal number of sorted values to each bin
- move border values between bins, calculate the total error, and select the partition with the smallest error (see the sketch after this list)
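A sketch of the first two steps (sorting and equal-frequency assignment), with each value smoothed to its bin mean; moving the borders to minimise total error is not shown, and the values are made up.

    import numpy as np

    def equal_frequency_bins(values, n_bins):
        # Sort the values, give each bin roughly the same number of sorted values,
        # and replace each value by the mean of its bin (smoothing).
        order = np.argsort(values)
        smoothed = np.empty_like(values, dtype=float)
        for chunk in np.array_split(order, n_bins):
            smoothed[chunk] = values[chunk].mean()
        return smoothed

    v = np.array([5.0, 1.0, 8.0, 2.0, 9.0, 2.5])
    print(equal_frequency_bins(v, 3))   # 3 bins of 2 values each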
other methods for feature discretization: ChiMerge (automated)
Types of inference: induction, deduction, and transduction
- approximating functions may be linear (e.g., polynomial regression) or non-linear (e.g., multilayer artificial neural networks)
inductive-learning dependencies: a learning machine selects, from the set of functions it supports, the one that best approximates the system's response
training set (Xi, Yi)
Quality of approximation is measured by the loss function L(y, f(X, w)); y: output produced by the system; X: set of inputs; w: set of parameters; f(X, w): output produced by the learning machine
L compares the outputs produced by the system (Yi) with those produced by the learning machine f(Xi, w)
Loss function: non-negative; large values mean a poor approximation, values close to zero a good one
Statistical learning theory (Vapnik-Chervonenkis (VC) theory): mainly for classification but also regression and summarisation
The foundations of a successful data-mining process are data-preprocessing and data-reduction methods:
- scaling and normalisation
- encoding
- outlier detection and removal
- feature selection and composition
- data cleaning and scrubbing
- data smoothing
- missing data elimination
- case reduction by sampling
data-mining and knowledge-discovery techniques
- statistical methods: Bayesian inference, logistic regression, ANOVA, log-linear models
- cluster analysis
- decision trees and rules (CLS, ID3, C4.5 methods)
- association rules
- artificial neural networks (multilayer perceptrons)
- genetic algorithms
- fuzzy inference systems
- N-dimensional visualisation methods
Resampling methods (splitting data sets into training and test samples; see the sketch after this list)
- resubstitution method (training set = test set = all data)
- holdout method (1/2 to 2/3 of the data for training)
- leave-one-out method (n-1 samples for training)
- rotation method (n-fold cross-validation)
- bootstrap method: generate replicate data sets of the same size as the given data by sampling with replacement
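A sketch of how the splits differ; the sample count, number of folds, and seed are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 20
    idx = rng.permutation(n)                  # shuffled sample indices

    # Holdout: roughly 2/3 of the samples for training, the rest for testing.
    split = (2 * n) // 3
    train_hold, test_hold = idx[:split], idx[split:]

    # Rotation (n-fold cross-validation): each fold serves as the test set once.
    for fold in np.array_split(idx, 5):
        train_fold = np.setdiff1d(idx, fold)
        print("fold sizes:", len(train_fold), len(fold))

    # Bootstrap: a data set of the same size, drawn with replacement.
    boot = idx[rng.integers(0, n, size=n)]
    print("holdout:", len(train_hold), len(test_hold), "bootstrap:", len(boot))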
Stats methods
- mean, median, mode (the value that occurs most frequently), SD
- empirical relation: mean - mode ≈ 3·(mean - median)
Bayesian inference
- Bayes' theorem: P(H|X) = [P(X|H)·P(H)]/P(X) (numerical check below)
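A numerical check of the theorem with invented probabilities, expanding P(X) over H and its complement.

    p_h = 0.01            # prior P(H)
    p_x_given_h = 0.90    # likelihood P(X|H)
    p_x_given_not_h = 0.05

    p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)   # total probability P(X)
    p_h_given_x = p_x_given_h * p_h / p_x                   # posterior P(H|X)
    print(round(p_h_given_x, 3))                             # ~0.154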
Generalized linear models
- relationship btw the trend of one variable and the values taken by several other variables
- a common type of GLM is logistic regression
- linear logistic model: logit(P) = log(P/(1-P)) is modelled as a linear function of the predictors
Linear Discriminant Analysis (LDA)
- the dependent variable is categorical (nominal or ordinal)
- independent variables are metric
- Z = w1x1 + w2x2 + ... + wkxk; the wi are weights and Z is the discriminant score (sketch below)
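A sketch of scoring samples with a given weight vector; the weights and the zero cut-off are invented here (in LDA they would be estimated from labelled training data).

    import numpy as np

    w = np.array([0.8, -0.5, 1.2])                    # hypothetical weights
    X = np.array([[1.0, 2.0, 0.5], [3.0, 0.5, 2.0]])  # two samples to score

    z = X @ w              # discriminant score Z = w1*x1 + w2*x2 + ... + wk*xk
    print(z, z > 0)        # classify by comparing Z with a cut-off (0 here)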
Cluster analysis
Schemata for a formal description of discovered clusters
- represent a cluster of points in an n-dim space by their centroid or by a set of distant points
- graphically, using nodes in a clustering tree
- using logical expressions on sample attributes
Distance calculation:
- Euclidean distance (Disadvantages: not scale invariant, not for negative correlations)
- Maximum, Manhattan, Canberra, binary, Minkowski, …
- Correlation-based distance: 1 - r, where r is the Pearson correlation coefficient (PCC; disadvantage: sensitive to outliers) or the Spearman correlation coefficient (SCC)
Algorithms:
- Hierarchical clustering: Agglomerative approach (hclust() and agnes()) and Divisive approach (diana())
- K-means clustering (kmeans())
- Fuzzy C-Means Clustering (e.g. fanny())
- Principal Component Analysis (PCA; prcomp())
- Multidimensional Scaling (MDS; cmdscale())
- Biclustering
- Similarity Measures for Clusters
global criterion: Euclidean;
local criterion: minimal mutual neighbor distance (MND)
K-means: very sensitive to outliers and noisy data points
Incremental clustering for very large data sets
Decision trees requirements
- attribute-value description
- predefined classes
- discrete classes
- sufficient data
- “Logical classification” methods
Artificial neural networks properties and capabilities
- nonlinearity
- learning from examples
- adaptivity
- evidential response
- fault tolerance
- uniformity of analysis and design
Learning tasks
- pattern association
- pattern recognition
- function approximation
- control
- filtering
genetic algorithms (based on natural evolution)
- population-based optimization process
- stochastic-optimization techniques
- belong to probabilistic algorithms
- usually use a binary-coding schema
Ensemble learning
- combines the results of several predictive models generated from training samples.
- If N classifiers make independent errors and each has error probability e < 0.5, then the error of the ensemble E is a monotonically decreasing function of N (see the sketch after this list).
- consists of two sequential phases: (a) the training phase, and (b) the testing phase.
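A sketch of the majority-vote argument behind the claim above: with independent errors, the ensemble errs only when more than half of the N classifiers err, so its error shrinks as N grows whenever e < 0.5. The binomial form of the calculation is the standard way to show this; e = 0.3 is an arbitrary illustration.

    from math import comb

    def ensemble_error(n, e):
        # Probability that more than half of n independent classifiers are wrong.
        k_min = n // 2 + 1
        return sum(comb(n, k) * e**k * (1 - e)**(n - k) for k in range(k_min, n + 1))

    for n in (1, 5, 11, 21):
        print(n, round(ensemble_error(n, 0.3), 4))   # error shrinks as N grows (e < 0.5)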
Graph mining
- a collection of techniques for mining the relational aspects of data represented as a graph
- Traditional data-mining and management algorithms such as clustering, classification, frequent-pattern mining, and indexing have now been extended to the graph scenario
Temporal data mining
- data mining of large sequential data sets, including Temporal Sequences and Time Series.