Stats

The `statistics` module

The module has 18 statistics functions in the documentation. There's:

mean for computing averages
median to get the middle value of a data set
stdev for standard deviation
Others...

The easiest way to use statistics functions is to send them lists. For example...

import statistics
my_data = [3,7,8]
mean = statistics.mean(my_data)
print('Mean: ' + str(mean))

The line mean = statistics.mean(my_data) calls a function to compute a mean (an average, same thing). Send it a list as the parameter, and get back one number.
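Here's a quick sketch of the same idea, using the three functions named in the list above on the same sample data. The variable names are just for illustration.

```python
import statistics

my_data = [3, 7, 8]

# Each function takes a list and gives back one number.
data_mean = statistics.mean(my_data)      # (3 + 7 + 8) / 3
data_median = statistics.median(my_data)  # Middle value of the sorted data.
data_stdev = statistics.stdev(my_data)    # Sample standard deviation.

print('Mean: ' + str(data_mean))
print('Median: ' + str(data_median))
print('Standard deviation: ' + str(round(data_stdev, 2)))
```

All three calls have the same shape: list in, one number out.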

Before and after

Let's see how we can prep data for the mean function.

Here's a program we wrote.

# Read goat scores from CSV file.
raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
# Filter out bad records.
cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
# Compute difference scores.
compute_change(cleaned_goat_scores)
# Analysis.
total_difference = 0
for goat in cleaned_goat_scores:
    total_difference += goat['Difference']
print('Total difference: ' + str(total_difference))

You can see the pipeline from the code.

Pipeline

Ray

I'm seeing the big picture now. Like, each function is a machine. Start with a pile of raw data, shovel it into read_csv_data_set, and get data that's closer to what we want. Partially processed. Shovel that into clean_goat_scores, and get data that's even closer. Shovel it into another machine, and we get data that's ready to analyze.

Adela

You know, for all the program is doing, the actual analysis part is just a few lines.

# Analysis.
total_difference = 0
for goat in cleaned_goat_scores:
    total_difference += goat['Difference']

Ethan

Yeah, I saw that, too. And every function is something I can understand. Make different machines (functions) from familiar patterns, and call them in sequence.

Georgina

Yeah, I like this way. A big task becomes a buncha small tasks.

Aye! That's how you should design your code.

Something you learn from experience is how to break apart complex tasks, and what the pieces should be. You've already got some nice patterns, for reading CSV, making new data sets while cleaning existing ones, adding new fields to data sets... you can combine them to do many different things.
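To make the machine idea concrete, here's a minimal sketch of a pipeline built from small functions. The functions simple_clean and simple_compute_change are simplified stand-ins made up for this example, not the lesson's real clean_goat_scores and compute_change. The cleaning rule (drop negative scores) and the record named Oops are assumptions too.

```python
def simple_clean(records):
    # Data machine: make a new data set, keeping only usable records.
    # Assumed rule for this sketch: scores can't be negative.
    cleaned = []
    for record in records:
        if record['Before'] >= 0 and record['After'] >= 0:
            cleaned.append(record)
    return cleaned


def simple_compute_change(records):
    # Data machine: add a Difference field to each record.
    for record in records:
        record['Difference'] = record['After'] - record['Before']


# Stand-in for data read from a CSV file.
raw_goat_scores = [
    {'Goat': 'Dewey', 'Before': 10, 'After': 12},
    {'Goat': 'Oops', 'Before': -5, 'After': 12},
    {'Goat': 'Foster', 'Before': 20, 'After': 24},
]
# Shovel the data through each machine in turn.
cleaned_goat_scores = simple_clean(raw_goat_scores)
simple_compute_change(cleaned_goat_scores)
# Analysis.
total_difference = 0
for goat in cleaned_goat_scores:
    total_difference += goat['Difference']
print('Total difference: ' + str(total_difference))
```

Each function does one small job, and the main program just calls them in sequence.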

Being `mean`

Let's get back to the statistics module. We saw the mean function.

import statistics
my_data = [3,7,8]
mean = statistics.mean(my_data)
print('Mean: ' + str(mean))

Give it a list, and it will give you the mean.

So, if we want to compute the average of, say, the Before data, we need to make a list with just that data. This is how the data is arranged, and what we want to give to mean.

What we have	What we want for `mean`
Data set (list of dictionaries)	[ 17, 14, 12, 18, 16, 15, 16, 10, 10, 19, 20, 12, 15 ]

Reflect

In your own words, explain what the code preparing the list will do.

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Ethan

Looks like it will run through the list-of-dictionaries, take the Before value from each record, and add it to a new list.

Correct. We want to make a list we can use like this...

befores_mean = statistics.mean(befores)

... where befores is a list of all the Before values, like in the right-hand column of the table above:

[
17,
14,
12,
18,
16,
15,
16,
10,
10,
19,
20,
12,
15
]

Reflect

Complete this code. goat_records is the cleaned data set.

def extract_before_values(goat_records):
    # Make a new list.
    befores = []
    # What goes here?
    return befores


Adela

Here's what I got:

def extract_before_values(goat_records):
    '''
    Extract Before values into a list.
    Parameters
    ----------
    goat_records : List-of-dictionaries.
        Data set.
    Returns
    -------
    befores : list
        List of Before values.
    '''
    # Make a new list.
    befores = []
    # Loop over the record set.
    for goat_record in goat_records:
        # Get Before value for the current record.
        before = goat_record['Before']
        # Add it to the list.
        befores.append(before)
    return befores

Nice!

A new pattern:

Pattern

Data machine: Field extractor

Write a function that extracts a list with the values of one field from a data set.

When you start a new project, you can use the pattern catalog to remind yourself of useful chunks of code.

Calling `statistics`

Here's the main program.

import statistics
# Read goat scores from CSV file.
raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
# Filter out bad records.
cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
# Extract befores.
befores = extract_before_values(cleaned_goat_scores)
# Analysis.
befores_mean = statistics.mean(befores)
# Output.
print('Mean of befores: ' + str(befores_mean))

Georgina

Gotta say, the actual analysis code barely exists. It's just a call to mean. What about something more complex, like standard deviation?

(Standard deviation measures how spread out data is. You don't need to know anything about it, just that it exists.)

No problem. Add one line:

befores_std_dev = statistics.stdev(befores)

Ray

Wait, that's it?! I remember from my stats course, QMM 2400, the standard deviation formula is... challenging.

Here's the computational formula used in most programs (the definitional formula is different):

Computational formula for standard deviation

You could code that yourself, but most people don't, since there are modules like statistics. Easy is good.
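If you're curious, here's a minimal sketch of coding it yourself. It assumes the usual textbook computational form for the sample standard deviation, s = sqrt((sum of x squared - (sum of x) squared / n) / (n - 1)). The function name computational_stdev is made up for this example.

```python
import math
import statistics


def computational_stdev(values):
    # Sample standard deviation, from the sum of x and the sum of x squared.
    n = len(values)
    sum_x = 0
    sum_x_squared = 0
    for value in values:
        sum_x += value
        sum_x_squared += value * value
    variance = (sum_x_squared - sum_x * sum_x / n) / (n - 1)
    return math.sqrt(variance)


befores = [17, 14, 12, 18, 16, 15, 16, 10, 10, 19, 20, 12, 15]
# The two results should agree, apart from tiny floating-point differences.
print(round(computational_stdev(befores), 2))
print(round(statistics.stdev(befores), 2))
```

That's a dozen lines to write and test, compared with one call to statistics.stdev.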

Like I said earlier, the downside of these modules is they require data in a certain format, like mean needing a list of numbers. However, it's easier to write the code to set up the right format than to write the code to do the analysis yourself.

All the fields!

Here's extract_before_values again, the function that returns a list of Before values for analysis:

def extract_before_values(goat_records):
    '''
    Extract Before values into a list.
    Parameters
    ----------
    goat_records : List-of-dictionaries.
        Data set.
    Returns
    -------
    befores : list
        List of Before values.
    '''
    # Make a new list.
    befores = []
    # Loop over the record set.
    for goat_record in goat_records:
        # Get Before value for the current record.
        before = goat_record['Before']
        # Add it to the list.
        befores.append(before)
    return befores

Multiple choice

What type of thing is goat_record?

A list.

A dictionary.

An array.

The best vinyl album ever!

Not graded. So why do it?

Reflect

In your own words, explain what the highlighted line (before = goat_record['Before']) does.


Adela

goat_record is a dictionary. Before is one of its keys. So the line looks up Before in the dictionary, and puts the value in the variable before.

Good!

Multiple choice

In the line...

before = goat_record['Before']

... what type of thing is 'Before'?

A numeric variable.

A numeric constant.

A string variable.

A string constant.

Can't tell.

Not graded. So why do it?

Multiple choice

Would this work?

field_key = 'Before'
before = goat_record[field_key]

Yes.

No.

Can't tell from the code given.

Not graded. So why do it?

Reflect

How can we change extract_before_values so it extracts values from any field you tell it to?

def extract_before_values(goat_records):
    # Make a new list.
    befores = []
    # Loop over the record set.
    for goat_record in goat_records:
        # Get Before value for the current record.
        before = goat_record['Before']
        # Add it to the list.
        befores.append(before)
    return befores


Georgina

Ooo, I see what you mean! You pass in the key as a parameter:

def extract_field_values(goat_records, field_key):
    # Make a new list.
    values = []
    # Loop over the record set.
    for goat_record in goat_records:
        # Get value for the current record.
        value = goat_record[field_key]
        # Add it to the list.
        values.append(value)
    return values
# Main program ----------------------------------------
# Read goat scores from CSV file.
raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
# Filter out bad records.
cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
# Extract befores.
befores = extract_field_values(cleaned_goat_scores, 'Before')

Ray

That's so cool! You pass in any field name, and get a list with the values of that field!

Right!
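Here's a small runnable sketch of the field extractor at work. The three records are typed in by hand (borrowed from the goat data) so the example runs without a CSV file.

```python
def extract_field_values(goat_records, field_key):
    # Make a new list with the value of one field from each record.
    values = []
    for goat_record in goat_records:
        values.append(goat_record[field_key])
    return values


# Hand-typed stand-in for the cleaned data set.
sample_records = [
    {'Goat': 'Dewey', 'Before': 10, 'After': 12},
    {'Goat': 'Flossie', 'Before': 19, 'After': 23},
    {'Goat': 'Foster', 'Before': 20, 'After': 24},
]
# Same function, different field each time.
befores = extract_field_values(sample_records, 'Before')
names = extract_field_values(sample_records, 'Goat')
print(befores)
print(names)
```

The function doesn't care whether the field holds numbers or strings. It just collects whatever is stored under the key you give it.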

Reflect

Change this code so it extracts lists for Before, After, and Difference.

# Read goat scores from CSV file.
raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
# Filter out bad records.
cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
# Extract befores
befores = extract_field_values(cleaned_goat_scores, 'Before')


Ethan

Maybe...

# Filter out bad records.
cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
# Compute differences.
compute_change(cleaned_goat_scores)
# Extract lists of field values.
befores = extract_field_values(cleaned_goat_scores, 'Before')
afters = extract_field_values(cleaned_goat_scores, 'After')
differences = extract_field_values(cleaned_goat_scores, 'Difference')

Great! I noticed you also brought compute_change back into the pipeline, so the Difference field exists.

We've made a reusable function, extract_field_values. Write it once, call it as many times as you like. A useful addition to the pipeline.

Let's add that to our data machine list.

Field extractor

All the statistics!

Reflect

Add code to Ethan's to compute and print the mean and standard deviation of Before, After, and Difference.


Ray

Before, we had this:

befores_mean = statistics.mean(befores)
befores_std_dev = statistics.stdev(befores)

So, we could copy that? Oh, and I'll change the look of the output, to make it cleaner.

befores_mean = statistics.mean(befores)
befores_std_dev = statistics.stdev(befores)
afters_mean = statistics.mean(afters)
afters_std_dev = statistics.stdev(afters)
differences_mean = statistics.mean(differences)
differences_std_dev = statistics.stdev(differences)
# Output.
print('Befores')
print('=======')
print('Mean: ' + str(befores_mean))
print('Standard deviation: ' + str(befores_std_dev))
print()
print('Afters')
print('=======')
print('Mean: ' + str(afters_mean))
print('Standard deviation: ' + str(afters_std_dev))
print()
print('Differences')
print('=======')
print('Mean: ' + str(differences_mean))
print('Standard deviation: ' + str(differences_std_dev))

Adela

Nice work, Ray!

Indeed! I like the output format, too. Easy to find what you want.

Befores
=======
Mean: 15.405405405405405
Standard deviation: 3.4517262948290695
Afters
=======
Mean: 17.513513513513512
Standard deviation: 4.266286582169697
Differences
=======
Mean: 2.108108108108108
Standard deviation: 2.144551025063078

Ray

Wait, an idea. It would be better to round off the output numbers to two decimal places.

Reflect

Change Ray's code to round numbers to two decimal places.


Ray

We learned about the round function a while back.

print('Mean: ' + str(round(befores_mean, 2)))
print('Standard deviation: ' + str(round(befores_std_dev,2)))

Use round in the output code, where needed.

Aye, that does look better.
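Ray's output code repeats the same four lines for each field. One possible way to cut the repetition, sketched here as an option rather than the lesson's required approach, is a small helper that builds the report text for one field. The name make_field_report is made up for this example, and the short lists stand in for the real extracted values.

```python
import statistics


def make_field_report(title, values):
    # Build the report text for one field, rounded to two decimal places.
    mean_value = round(statistics.mean(values), 2)
    std_dev = round(statistics.stdev(values), 2)
    report = title + '\n'
    # Underline matches the length of the title.
    report += '=' * len(title) + '\n'
    report += 'Mean: ' + str(mean_value) + '\n'
    report += 'Standard deviation: ' + str(std_dev) + '\n'
    return report


# Short stand-ins for the lists made by the field extractor.
befores = [10, 10, 19, 20, 12]
afters = [12, 12, 23, 24, 14]
print(make_field_report('Befores', befores))
print(make_field_report('Afters', afters))
```

Write the rounding and formatting once, and every field gets the same treatment.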

Highs and lows

Georgina

Hey, I was looking at the list of the statistics module's functions. I didn't see one that would find the highest and lowest values. Like, the highest Before score.

Good point. Some listy functions are part of Python's core. For example:

# Extract lists of field values.
befores = extract_field_values(cleaned_goat_scores, 'Before')
max_before = max(befores)
min_before = min(befores)

max takes a list as a parameter, and gives back the maximum value in the list. min does the same for the minimum value. No statistics module needed, since max and min are built into Python.
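A quick sketch you can run, using the list of Before values from earlier in the lesson:

```python
befores = [17, 14, 12, 18, 16, 15, 16, 10, 10, 19, 20, 12, 15]

# No import needed. max and min are built-in functions.
max_before = max(befores)
min_before = min(befores)

print('Highest Before: ' + str(max_before))
print('Lowest Before: ' + str(min_before))
print('Range: ' + str(max_before - min_before))
```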

Highs and lows and names

What if you want to know which goat has the highest After value? max_after = max(afters) will give you the highest value, but won't tell you the name of the goat with that value.

One way to do it is to write a function that loops over the records as usual. Each time through the loop, it asks: Is the current value greater than the largest I have so far?

for each record
    get current-after from the current record
    get current-goat from the current record
    if current-after is more than the-largest-after-so-far
        the-largest-after-so-far is current-after
        goat-with-the-largest-after is current-goat

For each record, if a goat's After is larger than any we've seen, remember the new After, and the name of the goat with the new After.

Reflect

Complete this code:

def find_largest_after(clean_goat_scores):
    largest_after_value = -1
    largest_after_name = ''
    for record in clean_goat_scores:
        goat_name = record['Goat']
        goat_after_value = record['After']
        # Something goes here
    return largest_after_name, largest_after_value


Ethan

I think I got it!

def find_largest_after(clean_goat_scores):
    largest_after_value = -1
    largest_after_name = ''
    for record in clean_goat_scores:
        goat_name = record['Goat']
        goat_after_value = record['After']
        if goat_after_value > largest_after_value:
            # Remember the new large value.
            largest_after_value = goat_after_value
            # Remember the name for that record.
            largest_after_name = goat_name
    return largest_after_name, largest_after_value

Ray

I see most of it. Line 6 gets After for the current record. If that's bigger than the largest so far, remember the new big value and the goat that has it.

What's line 2 about, though?

OK, here's some data and code.

"Dewey",10,12
"Elvira",10,12
"Flossie",19,23
"Foster",20,24
"Georgia",12,14

largest_after_value = 1000
for record in clean_goat_scores:
    goat_after_value = record['After']
    if goat_after_value > largest_after_value:
        # Remember the new large value.
        largest_after_value = goat_after_value
print(largest_after_value)

As you can see, the largest After value in the data is 24.

Reflect

What would the code output?


Adela

1000.

Right!

Here's the stuff again.

"Dewey",10,12
"Elvira",10,12
"Flossie",19,23
"Foster",20,24
"Georgia",12,14

largest_after_value = 1000
for record in clean_goat_scores:
    goat_after_value = record['After']
    if goat_after_value > largest_after_value:
        # Remember the new large value.
        largest_after_value = goat_after_value
print(largest_after_value)

The code runs through the records, grabbing each After value. That's 12, 12, 23, 24, and 14.

Each time through the loop, line 4 compares the After value from the record (12, 12, 23, etc.) with the largest value so far, in largest_after_value, which is initialized to 1000.

Let's look at the first record. After is 12. 12 is not more than 1000, so largest_after_value is not changed.

The second record. After is 12. 12 is not more than 1000, so largest_after_value is not changed.

The third record. After is 23. 23 is not more than 1000, so largest_after_value is not changed.

Continue for all records. largest_after_value will never change, since none of the After values are more than the initial value of largest_after_value: 1000

Ray

Oh, I got it! largest_after_value should start off as a very low number. Then the first time the if runs...

if goat_after_value > largest_after_value:

goat_after_value is guaranteed to be greater than largest_after_value, so the After value from the first record will go into largest_after_value.

Right!
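You can check this yourself with a small experiment. The sketch below is a variation on the function, not the lesson's version: it takes the starting value as an extra parameter (an addition made just for this experiment), so you can see what a good and a bad starting value do to the five sample records.

```python
def find_largest_after(goat_records, initial_value):
    # initial_value is an extra parameter, added for this experiment only.
    largest_after_value = initial_value
    largest_after_name = ''
    for record in goat_records:
        if record['After'] > largest_after_value:
            # Remember the new large value, and who has it.
            largest_after_value = record['After']
            largest_after_name = record['Goat']
    return largest_after_name, largest_after_value


sample_records = [
    {'Goat': 'Dewey', 'Before': 10, 'After': 12},
    {'Goat': 'Elvira', 'Before': 10, 'After': 12},
    {'Goat': 'Flossie', 'Before': 19, 'After': 23},
    {'Goat': 'Foster', 'Before': 20, 'After': 24},
    {'Goat': 'Georgia', 'Before': 12, 'After': 14},
]
# Low starting value: the real largest is found.
print(find_largest_after(sample_records, -1))
# Starting value too high: nothing ever replaces it.
print(find_largest_after(sample_records, 1000))
```

The first call reports Foster with 24. The second reports an empty name and 1000, because no After value ever beats the starting value.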

There's one small improvement we can make. It won't affect this data set, but remember that we might reuse the code for another program, like temperatures on Mars. We copy-and-paste the code, but doing this won't work:

largest_after_value = -1

Reflect

Why would that -1 value fail for the Mars temperature data?


Ray

Because all the data might be less than -1. None of them will replace -1 as the largest value.

Right!

The easiest thing to do is to type in very large/small values to initialize the smallest-value and largest-value variables.

largest_after_value = -999999999
for record in clean_goat_scores:
    goat_after_value = record['After']

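Or even better:

largest_after_value = float('-inf')

float('-inf') is negative infinity, which is lower than any int or float, so it works for both kinds of data. (Don't try int('-inf'). Python has no integer infinity, so that raises a ValueError.)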
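Here's a sketch of the Mars case, starting from float('-inf'). The readings, the field names Site and Temperature, and the function name find_warmest are all made up for this example.

```python
# Made-up readings, in degrees Celsius. Every value is below -1.
mars_readings = [
    {'Site': 'A', 'Temperature': -63},
    {'Site': 'B', 'Temperature': -48.5},
    {'Site': 'C', 'Temperature': -71},
]


def find_warmest(readings):
    # Start below any possible reading, int or float.
    warmest_value = float('-inf')
    warmest_site = ''
    for reading in readings:
        if reading['Temperature'] > warmest_value:
            # Remember the new warmest value, and where it was.
            warmest_value = reading['Temperature']
            warmest_site = reading['Site']
    return warmest_site, warmest_value


print(find_warmest(mars_readings))
```

Starting from -1 here would leave warmest_site empty, since no reading is above -1. Starting from negative infinity, the first record always replaces the starting value.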

A new pattern

Use existing computation functions when you can, like statistics.mean. If you can't, like when you identify records with lowest/highest values in a data set, write your own loopy function.

Pattern

Data machine: Computation

Use existing computation functions where you can, like statistics.mean. If you can't, like when you identify records with lowest/highest values in a data set, write your own loopy function.

Summary

People have made libraries of functions they use often, like the statistics module.
Every module expects data in a certain format. For example, statistics.mean wants a simple list.
We added a new data machine, field extractor, to our collection.
Functions like max and min are part of Python's core. No module needed.
To find, e.g., the name of the goat with the largest value in a field, you need a loop with an if. Initialize the largest-so-far variable with the smallest possible value.

Exercise

Shoes

As you know, goats love shoes. They don't wear shoes in pairs, but in quads.

Cthulhu has data on his goatty friends' fave shoe brands, and the number of quads they have. You can download a CSV file. Help Cthulhu analyze the data.

Here's part of the data set:

"Goat","Fave brand","Quads owned"
"Adria ","Remock",1
"Albertina ","Remock",7
"Amer","Skreecherz",7
"Anneliese ","Abibaaas",1
"Ashanti ","Skreecherz",7

The fields:

Goat. Name must be present.
Fave brand. Valid values: Abibaaas, Remock, or Skreecherz. Allow for extra spaces, and upper- or lowercase.
Number of quads owned, of all brands. Integer from 0 to 10.

Output:

Abibaaas
=======
Mean: 4.27
Standard deviation: 2.34
Max: 7
Remock
======
Mean: 5.12
Standard deviation: 2.8
Max: 8
Skreecherz
===========
Mean: 4.23
Standard deviation: 2.77
Max: 8

Other requirements:

Use the statistics module.
Use one subset function, called three times.
Round all values to two decimal places.

Upload a zip of your project folder. The usual coding standards apply.

