## Notes on Data Mining

### Some self-notes from the book Data Mining: Concepts, Models, Methods, and Algorithms (by Mehmed Kantardzic).

#### Data mining includes Predictive and Descriptive.

• Predictive goal: to produce a model, expressed as an executable code, which can be used to perform classification, estimation, prediction, or other similar tasks.
• Descriptive goal: uncovering patterns and relationship in large data sets.

Data mining roots from Statistics (Math) and Machine Learning (Computer Science).

#### System identification:

a: structure identification

y = f(u, t); y: model’s output, u: input vector; t: parameter vector

b: parameter identification

#### Data mining process:

1. state the problem and formulate the hypothesis
1. collect the data
1. preprocessing the data: outlier detection (and removal); scaling, encoding, and selecting features
1. estimate the model: implementation based on several models
1. interpret the model and draw conclusion

#### Data quality check:

• accurate
• stored according to data type
• integrity
• be consistant
• not be redundant
• timely
• well understood
• data set should be complete

#### Properties of high-dimensional data

• the size of a data set yielding the same density of data points in an n-dimensional space increases exponentially with dimensions
• a larger radius is needed to enclose a fraction of the data points in a high-dimensional space: e(p) = p^(1/d); p is the prespecified fraction of sample
• amost every point is closer to an edge than to another sample oint in a high-dimensional space: D(d,n) = 1/2(1/n)^(1/d); d: dimension; n: points
• almost every point is an outlier

Raw data are rarely good, many transformations may be needed.

#### Transformation of raw data:

1. Normalisation:
• decimal scaling: typical to [-1, 1]
• min-max normalisation
• standard deviation normalisation: V'(i) = (V(i) - mean(v))/sd(V)
1. data smoothing
2. differences and ratios

Missing values:

• replace with a single global constant
• replace with its feature mean
• replace with its feature mean for the given class

Sum raw data before data mining

Times series: sometimes better to show t(n+1)-t(n) or t(n+1)/t(n), especially use for logic-based methods, such as decision trees or rules

Moving averages (MA): MA(i,m) =

#### Outlier detection:

• statistics (e.g., mean, SD) (for single dimension)
• distance based (multiple-dimension): e.g., Euclidian Distance d = [(x1-x2)^2 + (y1-y2)^2]^(1/2)
• deviation-based technique: define the basic characteristics of a sample set, and all samples that deviate from these characterics are outliers

#### Data reduction (delete nonessential data)

three main dimensions of preprocessed data

• columns (features)
• rows (cases or samples)
• values

basic data reduction

• delete a column
• delete a row
• reduce the number of values in a column (smooth a feature)

Parameters;

• computing time
• predictive / descriptive accuracy
• representation of the data-mining model

No single data reduction method can be suited for all applications

#### Feature selection / reduction

two main methods:

• feature-ranking algorithms
• minimum subset algorithms

two ways

• bottom-up approach: start with empty set and fill in most relevant features (heuristic criteria)
• top-down approach: start from full set, remove one by one those are shown irrelevant

promising subsets are usually obtained heuristically

in general, if one feature describes different classes of entities, samples of different classes can be examined. Means of features are normalised by its variance and then compared:

SE(A-B) = sqrt(Var(A)/n1 + Var(B)/n2)

For a test:

|mean(A) - mean(B)|/SE(A-B) > threshold

Merging features: principle component

Feature examined, merged, and transformed (linear)

F' = Σ(j=1 to m)w(j)·f(j) ; w(j): a single set of weights; upto m transformation are generated each vector of m weights is called a principle component

Entropy measure for ranking features based on a similarity measure S that is in inverse proportion to the distance D btw. two n-dimensional samples

S for two samples: S_ij = e^(-α·D_ij) ; α is a parameter expressed as: α = -(ln0.5)/D ; D is the average distances among samples in the dataset

For nominal variables (non-numeric) use Hamming distance: S_ij = (Σ|X_ik = X_jk|)/n; |xik = xjk| is 1 if xik = xjk, and 0 otherwise.

#### PCA

• a set of samples X = {x1,x2,x3…xm} be transformed into another set: Y = {y1,y2,y3…ym} but most information content is stored in the few dimensions.
• transformation based on: assumption high info correspond to high variance: Y = A·X
• the first PC: an axis in the direction of max. variance
• eigenvalues based on correlatoin-matrix of features
• ratio of selected features for data-mining: usually R > 0.9 is good

#### Values reduction

feature-discretization techniques: discretize the values of continuous features into a small number of intervals

• sort all values
• assign approximately equal number of sorted values to each bin
• move a boder bin, calc total errors -> compose and select best bin

other methods for feature discretization: ChiMerge (automated)

### Types of inference: induction, deduction, and transduction

• linear: polynomial regression -non-linear: multilayer + artificial neural networks

inducsive-learning dependencies: a learning machine select a function from the set of functions it supports, which best approximates the system’s response

training set (Xi, Yi)

Quality of approximation is measured by the loss function: L(y, f(x, w)); y: output by the system, X: set of inputs, w: set of parameters, f(X, w): output by learning machine

L compares the difference btw. outputs produced by the system (Yi) and that by the learning machine f(Xi, w)

Loss function: non-negative. Large: poor approximation; small to zero: good

Statistical learning theory (Vapnik-Chervonenkis (VC) theory): mainly for classification but also regression and summarisation

Foundations for successful data-mining processes is data-preprocessing and data-reduction methods

#### Scaling and normalisation

• encoding
• outlier detection and removal
• feature selection and comosition
• data cleaning and scrubbing
• data smoothing
• missing data elimination
• case reduction by sampling

#### data-mining and knowledge-discovery techniques

• stats methods: Bayesian interference, logistic regression, ANOVA, long-linear models
• cluster analysis
• decision trees and rules (CLS, ID3, C4.5 methods)
• association rules
• artifical neural networks (multiple perceptrons)
• genetic algorithms
• fuzzy inference systems
• N-dimensional visualisation methods

#### Resampling methods (splitting datasets into training and test samples)

• resubstitution methods (training=test=all)
• holdout method (1/2 - 2/3 for training)
• leave one out method (n-1 for training)
• rotation method (n-fold cross validation)
• bootstrap method: generate fake data set with same size as given data

#### Stats methods

• mean, median, mode (data occur most frequently), SD
• mean - mode = 3*(mean - median)

#### Bayesian Interference

• Bayes theorem; P(H/X) = [P(X/H)·P(H)]/P(X)

#### Generalized linear models

• relationship btw the trend of one variable and the values taken by several other variables
• a common type of GLM is logistic regression
• linear logistic model: logit(P)

#### Linear Discriminant Analysis (LDA)

• the dependent variable is categorical (nominal or ordinal)
• independent variables are metric
• Z = w1x1 + w2x2 + … + wkxk; w is the weight; Z is the quantity discriminant score

Schemata for a formal description of discovered clusters

• represent a cluster of points in an n-dim space by their centroid or by a set of distant points
• graphically using nodes in a clustering
• using logical expression on sample tree attributes

Distance calculation:

• Euclidean distance (Disadvantages: not scale invariant, not for negative correlations)
• Maximum, Manhattan, Canberra, binary, Minowski, …
• Correlation-based distance: 1-r. Pearson correlation coefficient (PCC; Disadvantage: outlier sensitive) or Spearman correlation coefficient (SCC)

Algorithms:

• Hierarchical clustering: Agglomerative approach (hclust() and agnes()) and Divisive approach (diana())
• K-means clustering (kmeans())
• Fuzzy C-Means Clustering (e.g. fanny())
• Principal Component Analysis (PCA; prcomp())
• Multidimensional Scaling (MDS; cmdscale())
• Biclustering
• Similarity Measures for Clusters

global criterion: Euclidean;

local criterion: minimal mutual neighbor distance (MND)

K-means: very sensitive to outliers and noise data points

Incremental clustering for very large data sets

#### Decision trees requirements

• attribute-value description
• predefined classes
• discrete classes
• sufficient data
• “Logical classification” methods

#### Artificial neural networks properties and capabilities

• nonlinearity
• learning from examples
• evidential response
• fault tolerence
• uniformity fo analysis and design

• pattern association
• pattern recognisation
• function approximation
• control
• filtering

#### genetic algorithms (based on natural evolution)

• population-based optimization process
• stochasti-optimization techniques
• belong to probabilistic algorithms
• usually use a binary-coding schema

#### Ensemble learning

• to combine results from various predictive models generated using training samples.
• If N classifiers make independent errors and they have the error probability e < 0.5, then it can be shown that the error of an ensemble E is monotonically decreasing the function of N.
• consists of two sequential phases: (a) the training phase, and (b) the testing phase.

#### Graph mining

• a collection of techniques for mining the relational aspects of data represented as a graph
• Traditional data-mining and man- agement algorithms such as clustering, classification, frequent-pattern mining, and indexing have now been extended to the graph scenario

#### Temporal data mining

• data mining of large sequential data sets, including Temporal Sequences and Time Series.