
Notes for Coursera: Machine Learning

3 Types of machine learning

  1. Supervised learning - learns using labeled training data. Training data has input and output.
  2. Unsupervised learning - learns using unlabeled training data. Training data is input but has no output. The algorithm finds structure on its own. Clustering is an example of unsupervised learning. Examples: social network analysis, market segmentation, astronomical data analysis.
  3. Reinforcement learning - learns to take actions by maximizing a cumulative reward. Example: DeepMind learning to play Breakout.

2 Types of machine learning problems

  1. regression - predicting a continuous value attribute (Example: house prices)
  2. classification - predicting a discrete value. (Example: pass or fail)

Types of classification

Binary classification - classifying elements into one of two groups. Examples: benign/malignant tumor, fraudulent or legitimate financial transaction, spam or non-spam email.

Multiclass classification/multinomial classification - classify instances into more than 2 classes. Example: MNIST takes images of single handwritten digits and classifies them into 10 classes, the digits 0 through 9.

Week 1: Linear Regression with one variable

m = number of training examples
x = input features
y = output variable / target variable

(x, y) refers to one training example.
(x^(i), y^(i)) refers to the specific training example at index i; the (i) is an index, not an exponent.

h = hypothesis, the function that maps x to y.
Predict that y is a linear function of x:
y = h(x) = θ0 + θ1·x

Cost Function

J is the cost function.
θi are the parameters, the coefficients of the function. Similar to weights in a neural net.
Find values of θ0 and θ1 that minimize the cost (error) function.
h(x) − y is the difference between the prediction and the actual value; we want to minimize this.
J is also known as the squared error cost function:
J(θ0, θ1) = (1/(2m)) · Σ from i = 1 to m of (h(x^(i)) − y^(i))²
Find the θ0 and θ1 that minimize this error.
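
A minimal NumPy sketch of the hypothesis and the squared error cost above (the toy data and variable names are just illustrative, not from the course):

```python
import numpy as np

def hypothesis(theta0, theta1, x):
    """Univariate linear hypothesis h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def cost(theta0, theta1, x, y):
    """Squared error cost J(theta0, theta1) = (1/(2m)) * sum((h(x^(i)) - y^(i))^2)."""
    m = len(x)
    errors = hypothesis(theta0, theta1, x) - y
    return np.sum(errors ** 2) / (2 * m)

# made-up toy data, roughly y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])
print(cost(1.0, 2.0, x, y))  # small cost near the "true" parameters
```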

Square the error because

  1. Squaring gets rid of negative errors that would otherwise cancel each other out (you could also use the absolute value).
  2. For many applications small errors are not important, but big errors are very important. Example: a self-driving car. A ½ foot steering error is no big deal; a 5 foot error is a fatal problem. Squaring treats it not as 10x more important but as 100x more important.
  3. The convex nature of the quadratic cost avoids local minima.

Gradient Descent

:= (assignment operator, not equality)
α = learning rate. the learning rate controls the size of the adjustments made during the training process.

Repeat until convergence.
θ0 := θ0 − α · (1/m) · Σ from i = 1 to m of (h_θ(x^(i)) − y^(i))
θ1 := θ1 − α · (1/m) · Σ from i = 1 to m of (h_θ(x^(i)) − y^(i)) · x^(i)
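
A sketch of batch gradient descent for the one-variable case; the learning rate and iteration count here are arbitrary choices for the toy data, not values from the notes:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=1000):
    """Batch gradient descent for univariate linear regression.
    Both parameters are updated simultaneously using the same errors."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = (theta0 + theta1 * x) - y          # h_theta(x^(i)) - y^(i)
        grad0 = np.sum(errors) / m
        grad1 = np.sum(errors * x) / m
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
print(gradient_descent(x, y))  # approaches (1.0, 2.0)
```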

Week 2 Multivariate Linear Regression

n = the number of features
Feature scaling - get features in the range -1 to 1.
x_i := (x_i − average) / range, where range = max − min
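
A sketch of mean normalization applied per feature; the example feature matrix (house size and bedroom count) is made up:

```python
import numpy as np

def mean_normalize(X):
    """Feature scaling: (x_i - average) / (max - min), applied per column."""
    mu = X.mean(axis=0)
    feature_range = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / feature_range

# columns on very different scales: square feet vs. number of bedrooms
X = np.array([[2104.0, 3.0], [1600.0, 3.0], [2400.0, 4.0], [1416.0, 2.0]])
print(mean_normalize(X))  # each column is now roughly in [-1, 1]
```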

Week 3 Logistic regression

Logistic regression is a confusing term because it is a classification algorithm, not a regression algorithm.
"Logistic regression" is named after the logistic function (aka sigmoid function)
The sigmoid function is the hypothesis for logistic regression: h_θ(x) = 1 / (1 + e^(−θᵀx))
The cost function for logistic regression (for a single example) is:
for y = 1: −log(h_θ(x))
for y = 0: −log(1 − h_θ(x))
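
A sketch of the sigmoid hypothesis and the logistic cost in NumPy. The two per-example cases above combine into one expression because y is always 0 or 1 (the X and theta names are just illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """J(theta) = -(1/m) * sum(y * log(h) + (1 - y) * log(1 - h))."""
    m = len(y)
    h = sigmoid(X @ theta)              # h_theta(x) for every example at once
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m
```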

Multiclass classification

Multiclass classification is solved using "One versus All." There are K output classes. Train K binary logistic regression classifiers, one per class, and pick the class whose classifier outputs the highest probability.
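
A sketch of the One-versus-All prediction step, assuming all_theta holds one row of trained parameters per class (the names are my own, not from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(all_theta, X):
    """all_theta has shape (K, n): one binary classifier per class.
    Run every classifier on every example and pick the highest probability."""
    probabilities = sigmoid(X @ all_theta.T)   # shape (m, K)
    return np.argmax(probabilities, axis=1)    # winning class index per example
```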

Regularization

Underfitting (high bias) - output doesn't fit the training data well.
Overfitting (high variance) - output fits the training data well, but doesn't work well on test data. Fails to generalize.

Regularization factor (λ) - variable to control overfitting. If the model is underfitting, you need a lower λ; if the model is overfitting, you need a higher λ.

The intuition of λ is that you add the regularization term to the cost function, and as you minimize the cost function, you minimize the magnitude of theta (the weights). If theta is smaller, especially for higher order polynomials, the hypothesis is simpler.

The regularization term is added to the cost function.
For linear regression the regularization term is: λ · Σ from j = 1 to n of θ_j² (it sits inside the same 1/(2m) factor as the squared errors).
For logistic regression the regularization term is: (λ/(2m)) · Σ from j = 1 to n of θ_j².
Notice the sums start at j = 1, not j = 0, i.e. the bias term θ_0 is not regularized.
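
A sketch of the regularized logistic regression cost, showing the bias term theta[0] being left out of the penalty:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(theta, X, y, lam):
    """Logistic regression cost plus (lambda / (2m)) * sum(theta_j^2) for j = 1..n."""
    m = len(y)
    h = sigmoid(X @ theta)
    unregularized = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m
    penalty = (lam / (2 * m)) * np.sum(theta[1:] ** 2)  # skip the bias term theta[0]
    return unregularized + penalty
```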

Week 5 Neural Networks

L = the number of layers.
K = the number of output units.

How to train a neural network:

  1. Randomly initialize the weights.
  2. Implement forward propagation to compute h_θ(x^(i)).
  3. Implement the cost function J(θ).
  4. Implement backprop to compute the partial derivatives of the cost function with respect to θ, looping over i = 1 through m (each training example).
  5. Do gradient checking
  6. Use gradient descent.

Use gradient checking to verify that backprop is working correctly with no bugs. Don't use gradient checking in production; it is too slow. A numerical check is sketched below.
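
A sketch of the numerical check: approximate each partial derivative with a two-sided difference and compare it against what backprop computed (the epsilon value is a common choice, not taken from the notes):

```python
import numpy as np

def numerical_gradient(cost_fn, theta, epsilon=1e-4):
    """Approximate dJ/dtheta_j as (J(theta + eps*e_j) - J(theta - eps*e_j)) / (2*eps).
    The result should agree with the backprop gradients to several decimal places.
    Far too slow for real training -- use it only to debug backprop."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[j] += epsilon
        theta_minus[j] -= epsilon
        grad[j] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * epsilon)
    return grad
```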

Week 8 Unsupervised learning and dimensionality reduction

K-means clustering. Example using 2 clusters: randomly initialize 2 cluster centroids. K-means is an iterative algorithm. Step 1: assign each data point to the nearest cluster centroid. Step 2: move each cluster centroid to the mean of its assigned data points. Repeat steps 1 and 2 until the assignments stop changing.

K = number of clusters to find.
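
A sketch of the two K-means steps in NumPy; the random initialization and fixed iteration count are arbitrary simplifications:

```python
import numpy as np

def kmeans(X, K, iterations=10):
    """Alternate the assignment step and the centroid-update step."""
    rng = np.random.default_rng(0)
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(iterations):
        # Step 1: assign each data point to the nearest cluster centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        # Step 2: move each centroid to the mean of its assigned points
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels
```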