Self-notes on ML for Humans

I recommend reading the full book. It's free and very intuitive.

The full book can be read here.

Some notes for myself:

  • Artificial Narrow Intelligence (ANI)
  • Artificial General Intelligence (AGI)
  • singularity

Supervised learning

  • features can be numerical or categorical


linear regression, logistic regression, SVMs


Linear regression

  • Y = f(X) + e, where e (epsilon) is the irreducible error
  • draw the line with ordinary least squares
  • it is a parametric method: define the best model parameters
  • cost function (loss function): find the parameters that minimize the loss
  • X can be a scalar (single number), vector (array), matrix (2D array) or tensor (array with 3 or more dimensions)
  • the cost is built from the difference between each real data point y and our model's prediction ŷ
  • calculus for simple problems; gradient descent for complex loss functions
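To make the least-squares idea concrete for myself, here is a minimal sketch for the single-feature case. The function names `fit_ols` and `mse` and the toy data are my own, not from the book; it assumes the cost is the mean squared error.

```python
def fit_ols(xs, ys):
    """Return (slope, intercept) that minimize the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution: slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mse(xs, ys, slope, intercept):
    """Mean squared difference between each y and the prediction slope * x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data generated from y = 2x + 1 with no noise
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_ols(xs, ys)
print(slope, intercept, mse(xs, ys, slope, intercept))
```

With noisy data the cost would not reach zero; the leftover part is related to the irreducible error e.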

gradient descent: the partial derivative with respect to each parameter tells us how much the total loss increases or decreases when that parameter is increased by a small amount
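A small sketch of gradient descent on the same kind of line-fitting problem. The learning rate, step count and the name `gradient_descent` are assumptions I picked for this toy data, not values from the book.

```python
def gradient_descent(xs, ys, lr=0.05, steps=5000):
    """Fit y = w * x + b by repeatedly stepping against the gradient of the MSE loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the mean squared error with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move each parameter a small step in the direction that lowers the loss
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


gd_xs = [1.0, 2.0, 3.0, 4.0]
gd_ys = [3.0, 5.0, 7.0, 9.0]  # y = 2x + 1
w, b = gradient_descent(gd_xs, gd_ys)
print(w, b)
```

If the learning rate is too large the updates overshoot and diverge; if it is too small, convergence takes many more steps.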

  • bias-variance tradeoff
  • to avoid overfitting: use more training data and use regularization (add a penalty to the loss function)
  • view gradient descent
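One way to picture the penalty: a sketch of a loss with an L2 (ridge) term. The choice of an L2 penalty on the slope only, and the names `ridge_loss` and `lam`, are my assumptions; other penalties (such as L1) follow the same pattern.

```python
def ridge_loss(xs, ys, w, b, lam):
    """Mean squared error plus an L2 penalty lam * w**2 on the slope."""
    n = len(xs)
    mse = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n
    # The penalty grows with the size of w, so large weights cost extra
    return mse + lam * w ** 2


# Same perfect fit, with and without the penalty
print(ridge_loss([1.0, 2.0], [2.0, 4.0], w=2.0, b=0.0, lam=0.0))
print(ridge_loss([1.0, 2.0], [2.0, 4.0], w=2.0, b=0.0, lam=0.5))
```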


Logistic regression

  • outputs a probability between 0 and 1 by applying the sigmoid function S(x) = 1 / (1 + e^(-x))
  • log-odds ratio (logit)
  • threshold
  • minimize loss with logistic regression: the cost function is a measure of how often you predicted 1 when the truth is 0, or vice versa
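A rough sketch of the pieces above: the sigmoid, a threshold and a cost. I am assuming the usual log loss (cross-entropy) as the cost here; the notes only describe it informally, and the names `predict` and `log_loss` are mine.

```python
import math


def sigmoid(x):
    """S(x) = 1 / (1 + e^(-x)), written so that large |x| does not overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict(x, w, b, threshold=0.5):
    """Return (probability, class) for a single feature x."""
    p = sigmoid(w * x + b)  # w * x + b is the log-odds (logit)
    return p, int(p >= threshold)


def log_loss(ys, ps):
    """Average cross-entropy: confident wrong predictions are penalized heavily."""
    eps = 1e-12  # avoid log(0)
    return -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(ys, ps)
    ) / len(ys)


print(predict(0.0, w=1.0, b=0.0))
print(log_loss([1, 0], [0.9, 0.1]), log_loss([1, 0], [0.1, 0.9]))
```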


Support vector machines (SVMs)

  • hyperplane
  • maximize the margin (distance to the nearest point on either side of the line)
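To see what "margin" measures, here is a small sketch that computes the distance from points to a given hyperplane w·x + b = 0. It does not train an SVM; the hyperplane and the helper names are made up for illustration.

```python
import math


def distance_to_hyperplane(point, w, b):
    """Distance from a point to the hyperplane w . x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, point))
    return abs(dot + b) / math.sqrt(sum(wi * wi for wi in w))


def margin(points, w, b):
    """Distance from the hyperplane to the nearest point."""
    return min(distance_to_hyperplane(p, w, b) for p in points)


# Hypothetical separating line x1 = 0 in 2D
print(margin([(3.0, 4.0), (-1.0, 2.0)], w=(1.0, 0.0), b=0.0))
```

An SVM searches over w and b for the separating hyperplane that makes this value as large as possible.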



k-nearest neighbors (k-NN)

  • look at the k closest data points and take the average (for continuous values) or the mode (for categorical values)
  • good for complex situations where the relationship is too complex to be expressed with a simple linear model
  • nearest: using Euclidean distance or Manhattan distance
  • Euclidean distance: from the Pythagorean theorem, a^2 + b^2 = c^2, where c is the hypotenuse (the longest side of a right triangle) and the two legs a and b are orthogonal

  • cross-validation
  • a higher k helps prevent overfitting
  • usage: classification (fraud detection) or regression (predicting housing prices)
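The k-nearest-neighbors idea described above fits in a few lines. This is a brute-force sketch on made-up data; `knn_predict` and its `task` parameter are names I chose for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train, query, k, distance=euclidean, task="classification"):
    """train is a list of (features, label) pairs."""
    # Sort all training points by distance to the query and keep the k closest
    neighbors = sorted(train, key=lambda item: distance(item[0], query))[:k]
    labels = [label for _, label in neighbors]
    if task == "regression":
        return sum(labels) / len(labels)  # average for continuous values
    return Counter(labels).most_common(1)[0][0]  # mode for categorical values


train_cls = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
             ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
train_reg = [((1,), 10.0), ((2,), 20.0), ((10,), 100.0)]
print(knn_predict(train_cls, (0.5, 0.5), k=3))
print(knn_predict(train_reg, (1.5,), k=2, task="regression"))
```

In practice k would be chosen with cross-validation rather than fixed by hand.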

Decision trees

  • split data by maximizing information gain
  • entropy: the amount of disorder in a set; splits aim to minimize entropy
  • classification or regression (leaf node)
  • good for mixed data (numerical or categorical)
  • computationally expensive to train
  • hyperparameter tuning
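A small sketch of how entropy and information gain can be computed for one candidate split. I am assuming Shannon entropy in bits; the function names and labels are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


parent = ["a", "a", "b", "b"]
print(entropy(parent))                                   # maximally mixed set
print(information_gain(parent, ["a", "a"], ["b", "b"]))  # perfect split
print(information_gain(parent, ["a", "b"], ["a", "b"]))  # useless split
```

A tree builder would evaluate this for every candidate split and keep the one with the highest gain.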

Random Forests

  • excellent starting point for the modeling process, since they tend to have strong performance with a high tolerance for less-cleaned data

Unsupervised learning

  • often used to preprocess the data



k-means clustering

  • centroid
  • handwriting recognition with k-means
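A bare-bones sketch of the k-means loop (assign each point to the nearest centroid, then move each centroid to the mean of its points). The starting centroids are passed in by hand here, which is a simplification on my part; real implementations usually pick them randomly or with a smarter seeding scheme.

```python
def kmeans(points, centroids, iterations=10):
    """Return (centroids, clusters) after a fixed number of update rounds."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign the point to the nearest centroid (squared Euclidean distance)
            idx = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[idx].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty)
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```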

Hierarchical clustering

  • start with n clusters, one for each data point
  • merge the two clusters that are closest to each other
  • recompute the distances between the clusters
  • repeat until you get one cluster of n data points
  • pick a number of clusters and draw a horizontal line in the dendrogram

Dimensionality reduction
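The merge loop for hierarchical clustering can be sketched as below. The notes do not say how the distance between two clusters is measured, so I am assuming single linkage (distance between the two closest members); stopping at `num_clusters` plays the role of cutting the dendrogram.

```python
import math


def single_linkage(c1, c2):
    """Distance between two clusters = distance between their closest members."""
    return min(math.dist(a, b) for a in c1 for b in c2)


def agglomerative(points, num_clusters=1):
    """Merge the closest pair of clusters until num_clusters remain."""
    clusters = [[p] for p in points]  # start with one cluster per point
    merge_distances = []
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merge_distances.append(d)  # heights at which merges happen in the dendrogram
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merge_distances


hc_points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
hc_clusters, merge_distances = agglomerative(hc_points, num_clusters=2)
print(hc_clusters)
print(merge_distances)
```

This brute-force version recomputes every pairwise distance on each round, so it is only meant for tiny examples.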

PCA (linear algebra)

  • spaces and bases; basis vector
  • reduce complexity (dimensionality in this case) while maintaining structure (variance)
  • these basis vectors are called principal components, and the subset you select constitutes a new space that is smaller in dimensionality than the original space but maintains as much of the complexity of the data as possible
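A minimal PCA sketch using NumPy, assuming the covariance-eigendecomposition route (libraries often use SVD instead). The function name `pca` and the toy data are mine.

```python
import numpy as np


def pca(X, n_components):
    """Project X (rows = samples) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    # eigh returns eigenvalues in ascending order for symmetric matrices
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]          # new basis vectors (columns)
    explained = eigenvalues[order] / eigenvalues.sum()
    return X_centered @ components, components, explained


# 2D points lying close to the line y = x, reduced to 1 dimension
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.1]])
projected, components, explained = pca(X, n_components=1)
print(projected.ravel())
print(explained)  # share of the variance kept by the selected component
```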

Singular value decomposition (SVD)

  • data as a big m x n matrix A
  • SVD decomposes A into 3 smaller matrices: U (m x r), a diagonal matrix Σ (r x r), and V (r x n), where r is a small number
  • the values in the r x r diagonal matrix Σ are called singular values; they can be used to compress the original matrix
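A quick sketch of the compression idea with NumPy's `np.linalg.svd`, keeping only the r largest singular values. The helper name `truncated_svd` and the rank-1 example matrix are my own.

```python
import numpy as np


def truncated_svd(A, r):
    """Return U (m x r), Sigma (r x r) and V^T (r x n) for the r largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s is sorted in descending order
    return U[:, :r], np.diag(s[:r]), Vt[:r, :]


# A 4 x 3 matrix of rank 1, so r = 1 is enough to rebuild it
A = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 0.5, 2.0])
U, S, Vt = truncated_svd(A, r=1)
A_approx = U @ S @ Vt
print(U.shape, S.shape, Vt.shape)
print(np.allclose(A, A_approx))
```

Here 12 numbers are stored as 4 + 1 + 3 = 8; for real data the reconstruction is approximate and r controls the tradeoff.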


t-SNE can help make the clusters more accurate: it maps the data into a 2-dimensional space where the points tend to form roughly circular groups. That shape suits k-means, which is weak at finding non-circular segments.

Neural Networks & Deep Learning

  • image recognition: input X is, say, a greyscale image represented by a w-by-h matrix of pixel brightnesses; the output Y is a vector of class probabilities
  • the layers in the middle are just doing a bunch of matrix multiplication, summing activations x weights, with non-linear transformations (activation functions) after every hidden layer to enable the network to learn a non-linear function

Image recognition: A single dense layer

Matrix multiplication: Using the first column of weights in the weights matrix W, we compute the weighted sum of all the pixels of the first image. This sum corresponds to the first neuron.

Activation function

Chain neural network layers. The first layer computes weighted sums of pixels. Subsequent layers compute weighted sums of the outputs of the previous layers. You would typically use the “relu” activation function for all layers but the last. The last layer, in a classifier, would use “softmax” activation.
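To check my understanding of the forward pass, a sketch of two dense layers in NumPy: relu on the hidden layer, softmax on the output. The layer sizes and the random weights are arbitrary assumptions; nothing is trained here.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)


def softmax(z):
    # Subtract the row-wise max for numerical stability; rows sum to 1 afterwards
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def forward(X, W1, b1, W2, b2):
    """Each column of W1 holds the weights of one hidden neuron."""
    hidden = relu(X @ W1 + b1)        # weighted sums of pixels, then relu
    return softmax(hidden @ W2 + b2)  # weighted sums of hidden outputs, then softmax


rng = np.random.default_rng(0)
X_img = rng.random((2, 6))            # 2 flattened 2x3 greyscale "images"
W1, b1 = rng.normal(size=(6, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 3)), np.zeros(3)
probs = forward(X_img, W1, b1, W2, b2)
print(probs)                          # one row of class probabilities per image
```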

Four factors for AI

  • Compute
  • Data
  • Algorithms
  • Infrastructure

Convolutional neural networks (CNNs)

  • designed specifically for images as input; effective for computer vision tasks
  • recurrent neural networks (RNNs): well suited for language problems

Deep reinforcement learning

  • heart of OpenAI / AlphaGo
  • applies all these techniques to the problem of teaching an agent to maximize reward
  • can be applied in any context that can be gamified, e.g. self-driving or stock trading

Reinforcement learning

  • exploration/exploitation tradeoff: epsilon-greedy strategy
  • keep in mind that the reward is not always immediate
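The epsilon-greedy strategy is short enough to sketch directly. The list of estimated action values and the function name are illustrative assumptions.

```python
import random


def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, otherwise the best-looking one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit


estimated_values = [0.1, 0.9, 0.3]
print(epsilon_greedy(estimated_values, epsilon=0.0))  # always exploits
print(epsilon_greedy(estimated_values, epsilon=1.0))  # always explores
```

A common variation is to start with a high epsilon and lower it as the agent learns.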

Markov decision process (Mouse-maze example)

  • a finite set of states
  • a set of actions available in each state
  • transitions between states (steps)
  • rewards associated with each transition
  • a discount factor gamma between 0 and 1, which quantifies the difference in importance between immediate rewards and future rewards
  • memorylessness: “the future is independent of the past given the present”

We’re trying to maximize the sum of rewards in the long term.
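One common way to write that long-term sum is the discounted return: the sum over time steps t of gamma^t times the reward at step t. The sketch below assumes that form and a finite list of rewards.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * reward_t over the time steps t = 0, 1, 2, ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


print(discounted_return([1, 1, 1], gamma=0.5))      # 1 + 0.5 + 0.25
print(discounted_return([0, 0, 10], gamma=0.9))     # a delayed reward counts for less
```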


Q-learning

  • learning the action-value function Q(s, a): it determines the value of being in a certain state and taking a certain action in that state
  • Q is supposed to show you the full sum of rewards from choosing action a and all the optimal actions afterward
  • learning rate alpha, reward, estimated future reward
  • deep Q-networks (DQN): an approach that approximates Q-functions using deep neural networks
  • Asynchronous Advantage Actor-Critic (A3C)
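A sketch of the tabular Q-learning update, which combines the learning rate alpha, the reward and the estimated future reward. The nested-dict table, the state and action names, and the default alpha and gamma are all illustrative assumptions.

```python
def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Nudge Q[state][action] toward reward + gamma * (best value of the next state)."""
    # Estimated future reward: value of the best action available in the next state
    best_next = max(Q[next_state].values()) if Q.get(next_state) else 0.0
    target = reward + gamma * best_next
    # Move the old estimate a fraction alpha of the way toward the target
    Q[state][action] += alpha * (target - Q[state][action])
    return Q[state][action]


Q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 1.0, "right": 2.0},
}
print(q_update(Q, "s0", "right", reward=1.0, next_state="s1", alpha=0.5, gamma=0.9))
```

A DQN replaces the table with a neural network that predicts these values from the state.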

Policy learning

  • a more straightforward alternative in which we learn a policy function pi
  • pi maps each state to the best corresponding action at that state


  • predicting molecular bioactivity for drug discovery
  • Google Neural Network Playground
  • TensorFlow, Keras tutorials
  • building a two-layer neural network from scratch
  • CS231n
  • Simple Reinforcement Learning with Tensorflow
  • An introduction to statistical learning
  • Neural networks 101
  • gradient descent scratch
CC-BY-NC 4.0