Commit c7ee8de

make some changes to regression - text changes and remove a section
1 parent 54e8bf6 commit c7ee8de

File tree

1 file changed: +8 -145 lines changed


lessons/01_regression.ipynb

+8 -145
@@ -19,7 +19,7 @@
"\n",
"We're going to use the [Auto MPG dataset](https://archive.ics.uci.edu/ml/datasets/Auto+MPG) from UCI's machine learning repository. The Auto MPG dataset contains information on city-cycle fuel consumption in miles per gallon for various types of cars. Our goal is to predict the miles per gallon of different car make and models using 7 predictors. \n",
"\n",
- "The `auto-mpg` dataset is stored in a `.csv` file that can be accessed from the UCI repository. We've obtained a copy and made a few modifications, which we've stored in the `data` folder. We'll use `pandas` to load in the dataset by specifying the correct path. We'll start by performing some exploratory data analysis, and then building an OLS model.\n",
+ "The `auto-mpg` dataset is stored in a `.csv` file that can be accessed from the UCI repository. We've obtained a copy and made a few modifications, which we've stored in the `data` folder. We'll use `pandas` to load in the dataset by specifying the correct path. We'll start by performing some exploratory data analysis, and then build an OLS model.\n",
"\n",
"First, let's import (or install) some packages we'll need."
]
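
The loading step described in the cell above is straightforward with `pandas`. A minimal sketch for readers following along outside the notebook (the file path and the preview calls are assumptions for illustration, not the notebook's exact code):

```python
import pandas as pd

# Hypothetical path to the modified copy stored in the repo's data folder
df = pd.read_csv("data/auto-mpg.csv")

# Quick look at the first few rows and basic summary statistics
print(df.head())
print(df.describe())
```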
@@ -144,7 +144,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Some variables are pretty strongly correlated with miles per game, so there may be some predictive signal here."
+ "Some variables are pretty strongly correlated with miles per gallon, so there may be some predictive signal here."
]
},
{
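
As context for the cell above, the correlations it refers to are easy to inspect with `pandas`. A hedged sketch (assumes the DataFrame is named `df` and the target column is `mpg`, which may differ in the notebook):

```python
# Pairwise correlations among the numeric columns; the "mpg" column shows
# how strongly each predictor tracks the target
corr = df.select_dtypes("number").corr()
print(corr["mpg"].sort_values())
```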
@@ -172,7 +172,7 @@
"source": [
"## Creating Train and Test Splits\n",
"\n",
- "Next, we'll want to split our dataset into `train` and `test` data. When creating the model, we need to make sure it only sees the training data. Then, we can examine how well it **generalizes** to data it hasn't seen before. The train and test split is a foundational concept in machine learning. Be sure you're confident you understand why we do this before moving forward!\n",
+ "Next, we'll want to split our dataset into training and test data. When creating the model, we need to make sure it only sees the training data. Then, we can examine how well it **generalizes** to data it hasn't seen before. The train and test split is a foundational concept in machine learning. Be sure you're confident you understand why we do this before moving forward!\n",
"\n",
"A dataset is often broken up into a feature set, or **design matrix** (typically with the variable name `X`) as well as the target or response variable `y`. Both have $D$ samples, but the design matrix will have a second dimension indicating the number of features we're using for prediction.\n",
"\n",
@@ -198,7 +198,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Now, we perform the train/test split. The package `scikit-learn` provides a function we can easily use to perform this split. Let's import it:"
+ "Now, we perform the train/test split. The package `scikit-learn` is the most commonly used package for machine learning in Python. It provides a function we can easily use to perform this split. Let's import it:"
]
},
{
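
The split described in the two cells above is typically done with `scikit-learn`'s `train_test_split`. A minimal sketch, assuming the features live in a design matrix `X` and the target in `y` (the exact feature columns, test fraction, and seed are assumptions, not the notebook's actual cells):

```python
from sklearn.model_selection import train_test_split

# Design matrix X (one row per car, one column per predictor) and target y
X = df.drop(columns=["mpg"])   # assumed feature set: everything but the target
y = df["mpg"]                  # assumed target column

# Hold out a portion of the data that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=23
)
print(X_train.shape, X_test.shape)
```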
@@ -260,7 +260,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "## What is Ordinary Least Squares?\n",
+ "### What is Ordinary Least Squares?\n",
"\n",
"At a high level, linear regression is nothing more than finding the best straight line, or line of best fit through a set of data points that most accurately captures the pattern that exists within those data points.\n",
"\n",
@@ -291,7 +291,7 @@
"\n",
"The goal of linear regression, then, is to find a combination of these $\\beta_i$ values such that we pass through or as close to as many data points as possible. In other words, we are trying to find the values of $\\beta$ that reduce or minimize the aggregate distance between our linear model and the data points. \n",
"\n",
- "We can formalize this into an optimization problem and pursue a strategy that is known in machine learning as minimizing the **cost function** or **objective function**. In the case of linear regression, the cost function we are trying to minimize is the **mean squared error (MSE)** function:\n",
+ "We can formalize this into an optimization problem and pursue a strategy that is known in machine learning as minimizing the **cost function** or **objective function** or **loss**. In the case of linear regression, the cost function we are trying to minimize is the **mean squared error (MSE)** function:\n",
"\n",
"$$\\text{MSE} = \\frac{1}{N}\\sum_{i=1}^{N}(y_i - \\hat{y}_i)^2$$\n",
"\n",
@@ -317,7 +317,7 @@
"## OLS in Practice\n",
"\n",
"The package `scikit-learn` makes it very easy to train a linear regression model. In general, `scikit-learn` models follow the same structure:\n",
- "* Import the model you want to train (here, `LinearRegression`)\n",
+ "* Import the model you want to train (here, `LinearRegression`).\n",
"* Create an object for that model with chosen settings. This is *not* training the model. For example, in linear regression, you may choose a linear regression object that does or does not fit an intercept term (see the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) for more details).\n",
"* Train the model using the `fit()` function, passing in the training data.\n",
"* Evaluate the model on new data using the `predict()` and `score()` functions.\n",
@@ -372,7 +372,7 @@
"\n",
"When evaluating models, it's helpful to look at how it performs on both the training and test data, separately. This gives us a sense of the generalization gap, or how much we overfit to our data. If that gap is large, that means we need to make adjustments to the model in order to make sure it learns patterns that generalize well. \n",
"\n",
- "For regression models, the `.score()` method returns the amount of variance in the output variable that can be explained by the model predictions. This is known as $R^2$, or R-squared. It varies from 0 to 1, with 1 being better predictive performance. There are many other performance metrics that can be used when predicting continuous variables.\n",
+ "For regression models, the `score()` method returns the amount of variance in the output variable that can be explained by the model predictions. This is known as $R^2$, or R-squared. It has a maximum of 1, with 1 being better predictive performance. There are many other performance metrics that can be used when predicting continuous variables.\n",
"\n",
"Let's look at the $R^2$ for the training data:"
]
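
Putting the import/instantiate/fit/evaluate pattern from the cells above together, here is a sketch of how the pieces connect (variable names follow the earlier split sketch; the intercept setting and the extra metrics are assumptions, not the notebook's exact cells):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Instantiate the model object (nothing is trained yet), then fit on training data only
lin_reg = LinearRegression(fit_intercept=True)
lin_reg.fit(X_train, y_train)

# Predict on data the model has and hasn't seen
train_pred = lin_reg.predict(X_train)
test_pred = lin_reg.predict(X_test)

# score() reports R^2; r2_score() computes the same quantity explicitly
print("Train R^2:", lin_reg.score(X_train, y_train))
print("Test R^2:", r2_score(y_test, test_pred))

# The MSE from the cost-function discussion above, plus its square root (RMSE)
print("Test MSE:", mean_squared_error(y_test, test_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, test_pred)))
```

A large gap between the train and test numbers here would be the generalization gap the notebook describes.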
@@ -498,143 +498,6 @@
"# YOUR CODE HERE\n"
]
},
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "For linear regression models, one form of regularization is known as **Ridge (L2) regression**. Instead of using the least squares loss (which is the loss function used to calculate our MSE cost function): \n",
- "$$ L(\\beta) = \\sum_i^n (y_i - \\hat y_i)^2 $$ \n",
- "\n",
- "In ridge regression we additionally penalize the coefficients by adding a regularization term: \n",
- "\n",
- "$$ L(\\beta) = \\sum_i^n (y_i - \\hat y_i)^2 + \\alpha \\sum_j^p \\beta^2 $$ \n",
- "\n",
- "This regularization term aims to minimize the size of any one coefficient (or weight), penalizing any reliance on a given subset of features which commonly leads to overfitting.\n",
- "\n",
- "Ridge regression takes a **hyperparameter**, called alpha, $\\alpha$ (sometimes lambda, $\\lambda$). This hyperparameter indicates how much regularization should be done. In other words, how much to care about the coefficient penalty term vs how much to care about the sum of squared errors term. The higher the value of alpha the more regularization, and the smaller the resulting coefficients will be. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge) for more. \n",
- "\n",
- "If we use an `alpha` value of `0` then we get the same solution as the OLS regression done above. Let's prove that."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from sklearn.linear_model import Ridge\n",
- "ridge_reg = Ridge(alpha=0, # regularization\n",
- " solver='auto',\n",
- " random_state = rand_seed) \n",
- "ridge_reg.fit(X_train, y_train)\n",
- "\n",
- "# Predictions\n",
- "ridge_train_pred = ridge_reg.predict(X_train)\n",
- "ridge_test_pred = ridge_reg.predict(X_test)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "print('Train RMSE: %.04f' % (mse(y_train, ridge_train_pred, squared=False)))\n",
- "print('Test RMSE: %.04f' % (mse(y_test, ridge_test_pred, squared=False)))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Generally we don't know what the best value hypterparameter values should be, and so we need to leverage some type of trial and error method to determine the best values. We won't cover it today (it's covered in detail on Day 2), but scikit-learn provides a `RidgeCV` model that does just that. It fits a ridge regression model by first using cross-validation to find a good value of alpha. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html#sklearn.linear_model.RidgeCV) for more."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Just for our sanity, let's see if we can improve on our baseline linear regression model using a ridge model by setting our alpha value to 0.1."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "ridge_reg = Ridge(alpha=0.1, # regularization\n",
- " solver='auto',\n",
- " random_state = rand_seed) \n",
- "ridge_reg.fit(X_train, y_train)\n",
- "\n",
- "# Predictions\n",
- "ridge_train_pred = ridge_reg.predict(X_train)\n",
- "ridge_test_pred = ridge_reg.predict(X_test)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "print('Train RMSE: %.04f' % (mse(y_train, ridge_train_pred, squared=False)))\n",
- "print('Test RMSE: %.04f' % (mse(y_test, ridge_test_pred, squared=False)))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Looks like despite doing slightly worse on the training set, it did a bit better than using regular OLS on the test set!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from sklearn.linear_model import Lasso\n",
- "lasso_reg = Lasso(alpha=0.01, # regularization\n",
- " random_state = rand_seed) \n",
- "lasso_reg.fit(X_train, y_train)\n",
- "\n",
- "# Predictions\n",
- "lasso_train_pred = lasso_reg.predict(X_train)\n",
- "lasso_test_pred = lasso_reg.predict(X_test)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "print('Train RMSE: %.04f' % (mse(y_train, lasso_train_pred, squared=False)))\n",
- "print('Test RMSE: %.04f' % (mse(y_test, lasso_test_pred, squared=False)))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "In this case, we can see that even with a small alpha, we have too much regularization which leads to worse performance on both train and test datasets. In this case, we would call our model **underfit**.\n",
- "\n",
- "Taking a look at our feature coeffiecients, we can see that many of them are 0:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "lasso_reg.coef_"
- ]
- },
{
"cell_type": "markdown",
"metadata": {},
