Relationships | Python with Pets

Can we talk about our relationship?

The stats we've looked at so far, like mean and standard deviation, are about one variable. The average of Before only tells you about Before, not After.

Often you're interested in the relationship between two variables. The correlation between two variables tells you how closely they move together. If one goes up, does the other tend to go up, or down? Or is there no relationship? For example, if the weather gets colder, do sales of hot chocolate go up?

Correlation coefficients are always between -1 and 1. Coefficients closer to 1 mean the variables tend to go up and down together. For example, say we went to your local grade school (called primary school in some places), and measured the heights and ages of a buncha kids. Each kid has a height value and an age value. They're matched pairs. Like:

Kid	Height (cm)	Age
Letti	121	7
Ellis	128	9
Sophia	136	10
Lance	155	12
Bogdan	167	13

Age and height go up together. The correlation coefficient is 0.98, close to 1.

Let's also get deets on the students' GPA and days absent. Here's what we find:

Kid	Absences	GPA
Bogdan	1	3.8
Ellis	5	3.5
Sophia	10	3.0
Letti	13	3.1
Lance	15	2.8

The correlation coefficient is about -0.96, close to -1. As one variable goes up, the other tends to go down. A coefficient close to 0 would mean there's no clear relationship either way.
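In case you're wondering where a number like 0.98 comes from, here's a minimal sketch of the usual (Pearson) calculation, written out by hand for the height and age table. The function name pearson_r is made up for illustration. It isn't part of the course code, and you won't need to write this yourself.

```python
import math

def pearson_r(xs, ys):
    """Hand-written Pearson correlation, for illustration only."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How far each value is from the mean of its own list.
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    # Matched pairs: multiply each kid's two deviations, then add them up.
    sum_products = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
    sum_sq_x = sum(dx * dx for dx in dev_x)
    sum_sq_y = sum(dy * dy for dy in dev_y)
    return sum_products / math.sqrt(sum_sq_x * sum_sq_y)

height = [121, 128, 136, 155, 167]
age = [7, 9, 10, 12, 13]
print(round(pearson_r(height, age), 2))  # 0.98
```

When both values in a pair sit on the same side of their means, the product is positive. Lots of positive products push r toward 1.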

`statistics.correlation`

You'll need Spyder 6.0 or above for this. (The function itself was added in Python 3.10.) There is a correlation function for older Spyders at the bottom of this page. You can use it if you want, but it's best to upgrade.

Call the correlation function like this: correlation(list1, list2). It returns a float between -1 and 1: the correlation coefficient between the data in list1 and list2. The two lists must be the same length, with at least two values each.

Here's a program using it:

import statistics
height = [121,128,136,155,167]
age = [7,9,10,12,13]
r = statistics.correlation(height, age)
print('r (height, age) = ' + str(r))

The correlation coefficient is usually called r. That's why I used that variable name in the last two lines of the program.

Before and After

OK, how can we write a program to work out the correlation between Before and After in the goats data set?

Adela

It's the pipeline idea again. The correlation function needs two lists to do its job: correlation(list1, list2). We need to get the Before data into one list, and After into another, then pass them to the function. The field extract data machine will do that.

Georgina

Makes sense. We already have most of the code we need. Read the CSV file into a list-of-dictionaries, clean it up, then extract the Before and After data. Hmm. I'll adjust Ray's last program. I like the way it's organized, and how the output is formatted.

# Read goat scores from CSV file.
raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
# Filter out bad records.
cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
# Compute differences.
compute_change(cleaned_goat_scores)
# Extract lists of field values.
befores = extract_field_values(cleaned_goat_scores, 'Before')
afters = extract_field_values(cleaned_goat_scores, 'After')
differences = extract_field_values(cleaned_goat_scores, 'Difference')
# Analysis.
# Descriptive statistics.
befores_mean = statistics.mean(befores)
befores_std_dev = statistics.stdev(befores)
...
# Correlation.
r_before_after = statistics.correlation(befores, afters)
# Output.
print('Befores')
print('=======')
print('Mean: ' + str(round(befores_mean, 2)))
print('Standard deviation: ' + str(round(befores_std_dev, 2)))
...
print()
print('Correlation')
print('===========')
print('r (Before, After): ' + str(round(r_before_after, 2)))

Great!

Georgina added just a few lines. Because the program is well-structured, she could make the changes without any trouble.

That's what you should aim for. A flexible program made of small chunks. Doing a little bit of this to the data, then a little bit of that.

Summary

A correlation coefficient tells you how strongly two fields are related, and in which direction.
Extract each field's values into a list before you pass them to the correlation function.

Exercises

Exercise

Stinky and attractive

Cthulhu has gathered stinkiness and attractiveness ratings for 50 goats. You can download the data set. Here's some of the data.

"Goat","Gender","Stinkiness","Attraction"
"Aisha","F",40,45
"Andreas","M",48,52
"August","M",42,43
"Bertha","F",35,31
"Bessie","F",19,24
"Boyd","M","Yuck",38
"Bridgette","F",31,37

The four fields are:

Goat name. Cannot be missing.
Gender: F or M, though there might be extra spaces, and you should allow for upper and lowercase.
Stinkiness: An integer rating from 0 to 50.
Attraction: An integer rating from 0 to 50.
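As a hint, here's one way the Gender and rating rules could be checked. It's only a sketch. The function names clean_gender and clean_rating are made up, and your own cleaning code from earlier lessons may be organized differently.

```python
def clean_gender(raw_gender):
    """Return 'F' or 'M', or None if the value is not valid."""
    # Drop extra spaces, and allow for upper and lowercase.
    gender = str(raw_gender).strip().upper()
    if gender in ('F', 'M'):
        return gender
    return None

def clean_rating(raw_rating):
    """Return the rating as an int from 0 to 50, or None if not valid."""
    try:
        rating = int(raw_rating)
    except (ValueError, TypeError):
        # Not a whole number, like 'Yuck'.
        return None
    if 0 <= rating <= 50:
        return rating
    return None

print(clean_gender(' f '))
print(clean_rating('Yuck'))
```

A record would count as valid only when the name is present and none of the cleaned values is None.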

Do the analysis for valid records only.

Show the correlations between stinkiness and attraction, but analyze females and males separately. Round to two decimals.

Output:

Correlations
============
Stinkiness and Attraction
Females: 0.89
Males: 0.95

A correlation function is attached to this page if you're using an old Spyder, but it's best to upgrade to at least Spyder 6.

Upload a zip of your project folder. The usual coding standards apply.


Attachments

correlation.py_.txt