Bonus: Regression

Bonus lessons are optional.

A regression line

Let's continue with the goatathlon data from the last lesson.

Say you have a new goat, Billy. His running time is 90. Can we predict what his swimming time will be?

Prediction is a big thing in business analytics. Predicting sales for a new product, sales in a new region, labor costs over time... many applications.

That's where regression comes in. It's a statistical technique that gives you a formula for estimating one variable from others. We'll use linear regression, the simplest and most common form.

The technique will give us a formula like:

swimming = m * running + c

This is the formula for a straight line. m and c are constants, like:

swimming = 2 * running + 1

(These values aren't correct. That's what we'll be working out.)

Once we have the formula, we can plug a value for running in, and get a predicted value for swimming. Billy's running time is 90, so his predicted swimming time (using this fake formula) is:

swimming = 2 * 90 + 1
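Here's that calculation as a small Python sketch. The function name predict_swimming is made up for this example, and m = 2 and c = 1 are still the fake values from above, not the real answer.

```python
def predict_swimming(running, m, c):
    """Plug a running time into the straight-line formula."""
    return m * running + c

# Billy's running time is 90. Fake constants: m = 2, c = 1.
billy = predict_swimming(90, m=2, c=1)
print(billy)  # 181
```

Putting the formula in a function means we can swap in the real m and c later without changing anything else.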

Here's what we'll make:

Regression

Rather than plot a line, we'll take each value of running from the data set, and show the predicted value for swimming. So, we'll:

  • Find values for m and c, so we have the prediction equation: swimming = m * running + c
  • Loop over the records. Make a new list with the predicted swimming values for each running value.
  • Plot.

Regression is a method for estimating m and c in y = m*x + c. It finds the best m and c it can, creating the line that is as close to the data points as possible. (Technically, it minimizes the sum of the squared vertical distances between the points and the line.)

Pyplot doesn't know how to compute m and c, but the NumPy package does. It's included with Spyder. You need to import it, like:
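To see what "best" means, here's a rough sketch. It uses made-up numbers, not the goatathlon data, and a helper called sum_squared_error that was written just for this example. It adds up the squared gaps between the actual and predicted values, then checks that nudging m or c away from the fitted values makes the total bigger.

```python
import numpy

# Made-up data for illustration only. Not the goatathlon data set.
running = [60, 70, 80, 90, 100]
swimming = [62, 70, 81, 89, 101]

def sum_squared_error(m, c):
    """Total of the squared vertical gaps between each point and the line."""
    total = 0
    for run_time, swim_time in zip(running, swimming):
        predicted = m * run_time + c
        total += (swim_time - predicted) ** 2
    return total

# Fit a straight line to the data.
best_m, best_c = numpy.polyfit(running, swimming, 1)
best_error = sum_squared_error(best_m, best_c)

# Move the slope or the intercept a little. The error goes up both times.
nudged_error = sum_squared_error(best_m + 0.05, best_c)
shifted_error = sum_squared_error(best_m, best_c + 1)
print(best_error, nudged_error, shifted_error)
```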

  • import numpy

This code will do the work. It comes after the lists for running and swimming have been made.

  1. m, c = numpy.polyfit(running, swimming, 1)
  2. predicted_swimming = []
  3. for run_time in running:
  4.     prediction = run_time * m + c
  5.     predicted_swimming.append(prediction)
  6.  
  7. plt.scatter(running, swimming)
  8. plt.scatter(running, predicted_swimming)
  9. plt.title('Goatathlon: Running and swimming times')
  10. plt.xlabel('Running')
  11. plt.ylabel('Swimming')
  12. plt.show()

Line 1 computes m and c. They make the equation: predicted swimming = m * running + c. (The last argument, 1, asks for a polynomial of degree 1, which is a straight line, rather than a quadratic, or something else.)

Line 2 makes a new list for the predicted values. Line 3 loops over every value in the running list. Line 4 makes the prediction for one value of running. Line 5 adds the value to the prediction list.
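If you're curious about that last argument, here's a quick sketch with made-up data (not the goatathlon data). With 1, polyfit gives back two numbers, m and c. With 2, it gives back three, for a curve like a * x**2 + b * x + c.

```python
import numpy

# Made-up data for illustration only.
running = [60, 70, 80, 90, 100]
swimming = [62, 70, 81, 89, 101]

# Degree 1: a straight line, so two coefficients (m and c).
line_coefficients = numpy.polyfit(running, swimming, 1)

# Degree 2: a quadratic, so three coefficients.
curve_coefficients = numpy.polyfit(running, swimming, 2)

print(len(line_coefficients))   # 2
print(len(curve_coefficients))  # 3
```

We only need the straight line in this lesson.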

Line 8 plots the predicted points. The x-axis value is running, the y is predicted swimming, based on the running value.

We get:

Regression

We can use Spyder's Variable Explorer (VE) to find out the values of m and c:

VE

So the formula for predicting swimming from running is:

predicted swimming = 0.95 * running + 4.7

Billy's running time is 90. His predicted swimming time is 0.95 * 90 + 4.7 = 90.2.
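You can let NumPy do that arithmetic, too. This sketch assumes the rounded values m = 0.95 and c = 4.7 from above, so the result is approximate. It uses numpy.polyval, which takes the coefficients in the same order polyfit returns them (m first, then c).

```python
import numpy

# Rounded coefficients from the lesson. The unrounded values would differ slightly.
m = 0.95
c = 4.7

# polyval computes m * 90 + c for us.
billy_prediction = float(numpy.polyval([m, c], 90))
print(round(billy_prediction, 1))  # 90.2
```

In a real program, you'd pass in the m and c that polyfit computed, rather than typing rounded numbers.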

Summary

  • The NumPy package's polyfit function computes regression coefficients.
  • We can use them to plot predicted values.