People have been writing analysis code for a long time. To make their life easier (easy is good), they made libraries of functions they use often. The libraries are free, and you can use them to make your life easier, too.
The most common libraries are things like NumPy and Pandas, but they're more complex than we have time to learn. However, there's a simpler module called statistics, part of Python's standard library, that works similarly. Let's use that.
The downside to using any package is that it expects data in a certain format. The list-of-dictionaries format is a great way to represent typical business data, but the statistics module doesn't know how to deal with that directly.
So, you need a data machine for the pipeline to extract data from a list-of-dictionaries to whatever format the statistics module likes.
The statistics module
The module has 18 statistics functions in the documentation. There's:
- mean for computing averages
- median to get the middle value of a data set
- stdev for standard deviation
- Others...
The easiest way to use statistics functions is to send them lists. For example...
    import statistics
    my_data = [3, 7, 8]
    mean = statistics.mean(my_data)
    print('Mean: ' + str(mean))
The call to statistics.mean computes a mean (an average, same thing). Send it a list as the parameter, and get back one number.
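The other functions in the list are called the same way. Here's a minimal sketch with the same small list. The variable names are made up for illustration.

```python
import statistics

my_data = [3, 7, 8]

# Middle value once the data is sorted.
median = statistics.median(my_data)
# Sample standard deviation: how spread out the values are.
std_dev = statistics.stdev(my_data)

print('Median: ' + str(median))
print('Standard deviation: ' + str(round(std_dev, 2)))
```

Same deal each time: a list goes in, one number comes out.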
Before and after
Let's see how we can prep data for the mean function.
Here's a program we wrote.
    # Read goat scores from CSV file.
    raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
    # Filter out bad records.
    cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
    # Compute difference scores.
    compute_change(cleaned_goat_scores)
    # Analysis.
    total_difference = 0
    for goat in cleaned_goat_scores:
        total_difference += goat['Difference']
    print('Total difference: ' + str(total_difference))
You can see the pipeline from the code.
Pipeline

Ray
I'm seeing the big picture now. Like, each function is a machine. Start with a pile of raw data, shovel it into read_csv_data_set, and get data that's closer to what we want. Partially processed. Shovel that into clean_goat_scores, and get data that's even closer. Shovel it into another machine, and we get data that's ready to analyze.

Adela
You know, for all the program is doing, the actual analysis part is just a few lines.
    # Analysis.
    total_difference = 0
    for goat in cleaned_goat_scores:
        total_difference += goat['Difference']

Ethan
Yeah, I saw that, too. And every function is something I can understand. Make different machines (functions) from familiar patterns, and call them in sequence.

Georgina
Yeah, I like this way. A big task becomes a buncha small tasks.
Aye! That's how you should design your code.
Something you learn from experience is how to break apart complex tasks, and what the pieces should be. You've already got some nice patterns: reading CSV, making new data sets while cleaning existing ones, adding new fields to data sets... You can combine them to do many different things.
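To see the machines-in-sequence idea in one runnable piece, here's a sketch. The function bodies are stand-ins, not the course's real versions: read_csv_text is a hypothetical substitute for read_csv_data_set that reads from a string instead of a file, and the cleaning rule (keep records whose scores are whole numbers) is an assumption. The header row and the bad value for Elvira are made up for illustration.

```python
import csv
import io

# Made-up sample data, standing in for the CSV file.
CSV_TEXT = '''"Goat","Before","After"
"Dewey",10,12
"Elvira",10,twelve
"Flossie",19,23
'''


def read_csv_text(csv_text):
    # Stand-in for read_csv_data_set: returns a list of dictionaries.
    return list(csv.DictReader(io.StringIO(csv_text)))


def clean_goat_scores(raw_records):
    # Stand-in cleaner: keep records whose scores are whole numbers.
    cleaned = []
    for record in raw_records:
        try:
            before = int(record['Before'])
            after = int(record['After'])
        except ValueError:
            # Bad data. Skip this record.
            continue
        cleaned.append({'Goat': record['Goat'], 'Before': before, 'After': after})
    return cleaned


def compute_change(records):
    # Stand-in: add a Difference field to each record.
    for record in records:
        record['Difference'] = record['After'] - record['Before']


# Each machine's output is the next machine's input.
raw_goat_scores = read_csv_text(CSV_TEXT)
cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
compute_change(cleaned_goat_scores)

# Analysis.
total_difference = 0
for goat in cleaned_goat_scores:
    total_difference += goat['Difference']
print('Total difference: ' + str(total_difference))
```

Swap in a different machine at any step, and the rest of the pipeline doesn't need to change.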
Being mean
Let's get back to the statistics module. We saw the mean function.
    import statistics
    my_data = [3, 7, 8]
    mean = statistics.mean(my_data)
    print('Mean: ' + str(mean))
Give it a list, and it will give you the mean.
So, if we want to compute the average of, say, the Before data, we need to make a list with just that data. This is how the data is arranged, and what we want to give to mean.
What we have | What we want for mean |
---|---|
A list of dictionaries, one per goat, each with a Before field | A list of just the Before values |
In your own words, explain what the code preparing the list will do.

Ethan
Looks like it will run through the list-of-dictionaries, take the Before value from each record, and add it to a new list.
Correct. We want to make a list we can use like this...
    befores_mean = statistics.mean(befores)
... where befores is a list of all the Before values, like in the right-hand column of the table above:
    [17, 14, 12, 18, 16, 15, 16, 10, 10, 19, 20, 12, 15]
Complete this code. goat_records is the cleaned data set.
    def extract_before_values(goat_records):
        # Make a new list.
        befores = []
        # What goes here?
        return befores

Adela
Here's what I got:
    def extract_before_values(goat_records):
        '''
        Extract Before values into a list.
        Parameters
        ----------
        goat_records : List-of-dictionaries.
            Data set.
        Returns
        -------
        befores : list
            List of Before values.
        '''
        # Make a new list.
        befores = []
        # Loop over the record set.
        for goat_record in goat_records:
            # Get Before value for the current record.
            before = goat_record['Before']
            # Add it to the list.
            befores.append(before)
        return befores
Nice!
A new pattern:
Write a function that extracts a list with the values of one field from a data set.
When you start a new project, you can use the pattern catalog to remind yourself of useful chunks of code.
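Here's a small sketch of the pattern at work, showing the shape of the data going in and coming out. The three records are sample values in the list-of-dictionaries format, not the full data set, and the function is trimmed to its essentials.

```python
def extract_before_values(goat_records):
    # Make a new list.
    befores = []
    for goat_record in goat_records:
        # Pull one field out of the current record.
        befores.append(goat_record['Before'])
    return befores


# What we have: one dictionary per goat.
goat_records = [
    {'Goat': 'Dewey', 'Before': 10, 'After': 12},
    {'Goat': 'Flossie', 'Before': 19, 'After': 23},
    {'Goat': 'Foster', 'Before': 20, 'After': 24},
]

# What we want: a plain list of numbers.
befores = extract_before_values(goat_records)
print(befores)
```

This prints [10, 19, 20], a list that statistics.mean can take as-is.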
Calling statistics
Here's the main program.
    # Read goat scores from CSV file.
    raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
    # Filter out bad records.
    cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
    # Extract befores.
    befores = extract_before_values(cleaned_goat_scores)
    # Analysis.
    befores_mean = statistics.mean(befores)
    # Output.
    print('Mean of befores: ' + str(befores_mean))

Georgina
Gotta say, the actual analysis code barely exists. It's just a call to mean. What about something more complex, like standard deviation?
(Standard deviation measures how spread out data is. You don't need to know anything about it, just that it exists.)
No problem. Add one line:
    befores_std_dev = statistics.stdev(befores)

Ray
Wait, that's it?! I remember from my stats course, QMM 2400, the standard deviation formula is... challenging.
Here's the computational formula used in most programs (the definitional formula is different):
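For reference, this is one standard textbook way to write the computational formula for the sample standard deviation (the quantity statistics.stdev returns), for n values x_1 through x_n. Other sources may lay it out a little differently.

```latex
s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}{n - 1}}
```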
You could code that yourself, but most people don't, since there are modules like statistics. Easy is good.
Like I said earlier, the downside of these modules is they require data in a certain format, like mean needing a list of numbers. However, it's easier to write the code to set up the right format than to write the code to do the analysis yourself.
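To get a feel for the trade-off, here's a sketch that does it both ways: the standard computational formula for the sample standard deviation, coded by hand, next to the one-line module call. The list holds the sample Before values shown earlier. The variable names are made up for illustration.

```python
import math
import statistics

befores = [17, 14, 12, 18, 16, 15, 16, 10, 10, 19, 20, 12, 15]

# By hand: computational formula for the sample standard deviation.
n = len(befores)
sum_of_values = 0
sum_of_squares = 0
for value in befores:
    sum_of_values += value
    sum_of_squares += value ** 2
by_hand = math.sqrt((sum_of_squares - sum_of_values ** 2 / n) / (n - 1))

# With the module: one line.
from_module = statistics.stdev(befores)

print('By hand: ' + str(round(by_hand, 2)))
print('From module: ' + str(round(from_module, 2)))
```

Both lines print the same rounded value. The hand-coded version has more places for typos to hide.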
All the fields!
Here's extract_before_values again, the function that returns a list of Before values for analysis:
    def extract_before_values(goat_records):
        '''
        Extract Before values into a list.
        Parameters
        ----------
        goat_records : List-of-dictionaries.
            Data set.
        Returns
        -------
        befores : list
            List of Before values.
        '''
        # Make a new list.
        befores = []
        # Loop over the record set.
        for goat_record in goat_records:
            # Get Before value for the current record.
            before = goat_record['Before']
            # Add it to the list.
            befores.append(before)
        return befores
In your own words, explain what the highlighted line (before = goat_record['Before']) does.

Adela
goat_record is a dictionary. Before is one of its keys. So the line looks up Before in the dictionary, and puts the value in the variable before.
Good!
How can we change extract_before_values so it extracts values from any field you tell it to?
    def extract_before_values(goat_records):
        # Make a new list.
        befores = []
        # Loop over the record set.
        for goat_record in goat_records:
            # Get Before value for the current record.
            before = goat_record['Before']
            # Add it to the list.
            befores.append(before)
        return befores

Georgina
Ooo, I see what you mean! You pass in the key as a parameter:
    def extract_field_values(goat_records, field_key):
        # Make a new list.
        values = []
        # Loop over the record set.
        for goat_record in goat_records:
            # Get value for the current record.
            value = goat_record[field_key]
            # Add it to the list.
            values.append(value)
        return values

    # Main program ----------------------------------------
    # Read goat scores from CSV file.
    raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
    # Filter out bad records.
    cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
    # Extract befores.
    befores = extract_field_values(cleaned_goat_scores, 'Before')

Ray
That's so cool! You pass in any field name, and get a list with the values of that field!
Right!
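Here's a quick sketch you can run to try the idea. The three records are sample values, and the function is a trimmed copy of Georgina's, with shorter parameter names.

```python
def extract_field_values(records, field_key):
    # Make a new list.
    values = []
    for record in records:
        # Look up whichever field the caller asked for.
        values.append(record[field_key])
    return values


goat_records = [
    {'Goat': 'Dewey', 'Before': 10, 'After': 12},
    {'Goat': 'Flossie', 'Before': 19, 'After': 23},
    {'Goat': 'Foster', 'Before': 20, 'After': 24},
]

# Same function, different field each time.
names = extract_field_values(goat_records, 'Goat')
afters = extract_field_values(goat_records, 'After')
print(names)
print(afters)
```

The key has to match the field name exactly, including case. Ask for a key that isn't in the dictionary, and Python raises a KeyError.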
Change this code so it extracts lists for Before, After, and Difference.
    # Read goat scores from CSV file.
    raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
    # Filter out bad records.
    cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
    # Extract befores.
    befores = extract_field_values(cleaned_goat_scores, 'Before')

Ethan
Maybe...
    # Filter out bad records.
    cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
    # Compute differences.
    compute_change(cleaned_goat_scores)
    # Extract lists of field values.
    befores = extract_field_values(cleaned_goat_scores, 'Before')
    afters = extract_field_values(cleaned_goat_scores, 'After')
    differences = extract_field_values(cleaned_goat_scores, 'Difference')
Great! I noticed you also brought compute_change back into the pipeline, so the Difference field exists.
We've made a reusable function, extract_field_values. Write it once, call it as many times as you like. A useful addition to the pipeline.
Let's add that to our data machine list.
Field extractor
All the statistics!
Add code to Ethan's to compute and print the mean and standard deviation of Before, After, and Difference.

Ray
Before, we had this:
    befores_mean = statistics.mean(befores)
    befores_std_dev = statistics.stdev(befores)
So, we could copy that? Oh, and I'll change the look of the output, to make it cleaner.
    befores_mean = statistics.mean(befores)
    befores_std_dev = statistics.stdev(befores)
    afters_mean = statistics.mean(afters)
    afters_std_dev = statistics.stdev(afters)
    differences_mean = statistics.mean(differences)
    differences_std_dev = statistics.stdev(differences)
    # Output.
    print('Befores')
    print('=======')
    print('Mean: ' + str(befores_mean))
    print('Standard deviation: ' + str(befores_std_dev))
    print()
    print('Afters')
    print('=======')
    print('Mean: ' + str(afters_mean))
    print('Standard deviation: ' + str(afters_std_dev))
    print()
    print('Differences')
    print('=======')
    print('Mean: ' + str(differences_mean))
    print('Standard deviation: ' + str(differences_std_dev))

Adela
Nice work, Ray!
Indeed! I like the output format, too. Easy to find what you want.
    Befores
    =======
    Mean: 15.405405405405405
    Standard deviation: 3.4517262948290695

    Afters
    =======
    Mean: 17.513513513513512
    Standard deviation: 4.266286582169697

    Differences
    =======
    Mean: 2.108108108108108
    Standard deviation: 2.144551025063078

Ray
Wait, an idea. It would be better to round off the output numbers to two decimal places.
Change Ray's code to round numbers to two decimal places.

Ray
We learned about the round function a while back.
    print('Mean: ' + str(round(befores_mean, 2)))
    print('Standard deviation: ' + str(round(befores_std_dev, 2)))
Use round in the output code, where needed.
Aye, that does look better.
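One thing to know about round: it trims extra digits, but doesn't pad with zeros. A short sketch, using the Befores mean from the output above and a made-up value of 2.8:

```python
befores_mean = 15.405405405405405
# Made-up value that needs fewer than two decimal places.
tidy_value = 2.8

rounded_mean = str(round(befores_mean, 2))
rounded_tidy = str(round(tidy_value, 2))
# format() pads with zeros, if you ever need a fixed width.
padded_tidy = format(tidy_value, '.2f')

print('Mean: ' + rounded_mean)
print('Rounded: ' + rounded_tidy)
print('Padded: ' + padded_tidy)
```

This prints 15.41, 2.8, and 2.80. That's why the sample output in the exercise can show a value like 2.8 even though everything is rounded to two places.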
Highs and lows

Georgina
Hey, I was looking at the list of the statistics module's functions. I didn't see one that would find the highest and lowest values. Like, the highest Before score.
Good point. Some listy functions are part of Python's core. For example:
    # Extract lists of field values.
    befores = extract_field_values(cleaned_goat_scores, 'Before')
    max_before = max(befores)
    min_before = min(befores)
max takes a list as a parameter, and gives back the maximum value in the list. No statistics module needed, since max is built into Python.
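A few of these built-ins on a short list of sample After values. No import line needed:

```python
afters = [12, 12, 23, 24, 14]

highest = max(afters)
lowest = min(afters)
total = sum(afters)
count = len(afters)

print('Highest: ' + str(highest))
print('Lowest: ' + str(lowest))
print('Total: ' + str(total))
print('Count: ' + str(count))
```

This prints 24, 12, 85, and 5.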
Highs and lows and names
What if you want to know which goat has the highest After value? max_after = max(afters) will give you the highest value, but won't tell you the name of the goat with that value.
One way to do it is to write a function that loops over the records as usual. Each time through the loop, it asks: Is the current value greater than the largest I have so far?
    for each record
        get current-after from the current record
        get current-goat from the current record
        if current-after is more than the-largest-after-so-far
            the-largest-after-so-far is current-after
            goat-with-the-largest-after is current-goat
For each record, if a goat's After is larger than any we've seen, remember the new After, and the name of the goat with the new After.
Complete this code:
    def find_largest_after(clean_goat_scores):
        largest_after_value = -1
        largest_after_name = ''
        for record in clean_goat_scores:
            goat_name = record['Goat']
            goat_after_value = record['After']
            # Something goes here.
        return largest_after_name, largest_after_value

Ethan
I think I got it!
    def find_largest_after(clean_goat_scores):
        largest_after_value = -1
        largest_after_name = ''
        for record in clean_goat_scores:
            goat_name = record['Goat']
            goat_after_value = record['After']
            if goat_after_value > largest_after_value:
                # Remember the new large value.
                largest_after_value = goat_after_value
                # Remember the name for that record.
                largest_after_name = goat_name
        return largest_after_name, largest_after_value

Ray
I see most of it. Line 6 gets After for the current record. If that's bigger than the largest so far, remember the new big value and the goat that has it.
What's line 2 about, though?
OK, here's some data and code.
    "Dewey",10,12
    "Elvira",10,12
    "Flossie",19,23
    "Foster",20,24
    "Georgia",12,14

    largest_after_value = 1000
    for record in clean_goat_scores:
        goat_after_value = record['After']
        if goat_after_value > largest_after_value:
            # Remember the new large value.
            largest_after_value = goat_after_value
    print(largest_after_value)
As you can see, the largest After value in the data is 24.
What would the code output?

Adela
1000.
Right!
Here's the stuff again.
    "Dewey",10,12
    "Elvira",10,12
    "Flossie",19,23
    "Foster",20,24
    "Georgia",12,14

    largest_after_value = 1000
    for record in clean_goat_scores:
        goat_after_value = record['After']
        if goat_after_value > largest_after_value:
            # Remember the new large value.
            largest_after_value = goat_after_value
    print(largest_after_value)
The code runs through the records, grabbing each After value. That's 12, 12, 23, 24, and 14.
Each time through the loop, line 4 compares the After value from the record (12, 12, 23, etc.) with the largest value so far, in largest_after_value, which is initialized to 1000.
Let's look at the first record. After is 12. 12 is not more than 1000, so largest_after_value is not changed.
The second record. After is 12. 12 is not more than 1000, so largest_after_value is not changed.
The third record. After is 23. 23 is not more than 1000, so largest_after_value is not changed.
Continue for all records. largest_after_value will never change, since none of the After values are more than the initial value of largest_after_value: 1000.

Ray
Oh, I got it! largest_after_value should start off as a very low number. Then the first time the if runs...
    if goat_after_value > largest_after_value:
... goat_after_value is guaranteed to be greater than largest_after_value, so the After value from the first record will go into largest_after_value.
Right!
There's one small improvement we can make. It won't affect this data set, but remember that we might reuse the code for another program, like temperatures on Mars. We copy-and-paste the code, but doing this won't work:
    largest_after_value = -1
Why would that -1 value fail for the Mars temperature data?

Ray
Because all the data might be less than -1. None of them will replace -1 as the largest value.
Right!
The easiest thing to do is to type in extreme starting values: a very small number for the largest-value variable, and a very large number for the smallest-value variable.
    largest_after_value = -999999999
    for record in clean_goat_scores:
        goat_after_value = record['After']
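Here's a sketch of the same loop, generalized so the field names are parameters, and run on negative numbers. Two things here aren't from the lesson: the function name find_largest and the Mars readings are made up for illustration, and float('-inf') is swapped in for the typed-in -999999999. It's a float that's lower than every other number, so it does the same job.

```python
def find_largest(records, name_key, value_key):
    # Start lower than any real value could be.
    largest_value = float('-inf')
    largest_name = ''
    for record in records:
        if record[value_key] > largest_value:
            # Remember the new large value, and who has it.
            largest_value = record[value_key]
            largest_name = record[name_key]
    return largest_name, largest_value


# Made-up temperature readings, all well below -1.
mars_temperatures = [
    {'Sol': 'Sol 1', 'Temperature': -63},
    {'Sol': 'Sol 2', 'Temperature': -55},
    {'Sol': 'Sol 3', 'Temperature': -71},
]

warmest_sol, warmest_temperature = find_largest(mars_temperatures, 'Sol', 'Temperature')
print(warmest_sol + ': ' + str(warmest_temperature))
```

This prints Sol 2: -55. With a starting value of -1, the if would never be true for this data, and the function would return an empty name.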
A new pattern
Use existing computation functions when you can, like statistics.mean. If you can't, like when you identify records with lowest/highest values in a data set, write your own loopy function.
Summary
- People have made libraries of functions they use often, like the statistics module.
- Every module expects data in a certain format. For example, statistics.mean wants a simple list.
- We added a new data machine, field extractor, to our collection.
- Functions like max and min are part of Python's core. No module needed.
- To find, e.g., the name of the goat with the largest value in a field, you need a loop with an if. Initialize the largest-so-far variable with the smallest possible value.
Exercise
Shoes
As you know, goats love shoes. They don't wear shoes in pairs, but in quads.
Cthulhu has data on his goatty friends' fave shoe brands, and the number of quads they have. You can download a CSV file. Help Cthulhu analyze the data.
Here's part of the data set:
    "Goat","Fave brand","Quads owned"
    "Adria ","Remock",1
    "Albertina ","Remock",7
    "Amer","Skreecherz",7
    "Anneliese ","Abibaaas",1
    "Ashanti ","Skreecherz",7
The fields:
- Goat. Name must be present.
- Fave brand. Valid values: Abibaaas, Remock, or Skreecherz. Allow for extra spaces, and upper- or lowercase.
- Number of quads owned, of all brands. Integer from 0 to 10.
Output:
    Abibaaas
    =======
    Mean: 4.27
    Standard deviation: 2.34
    Max: 7
    Remock
    ======
    Mean: 5.12
    Standard deviation: 2.8
    Max: 8
    Skreecherz
    ===========
    Mean: 4.23
    Standard deviation: 2.77
    Max: 8
Other requirements:
- Use the statistics module.
- Use one subset function, called three times.
- Round all values to two decimal places.
Upload a zip of your project folder. The usual coding standards apply.