Stats

Multiple choice

What is a list?

A

A collection of values. Each one is accessed through an index that's an integer.

B

A collection of values.
Each one is accessed through a key that's usually a string.

C

A set of unique values (no two in the list are the same).
Each one is accessed through an index that's an integer.

D

A collection of dictionaries.
Each one is a record.

Not graded. So why do it?

Multiple choice

team is a list of dictionaries of players. Each dictionary has keys id, name, and position. What's the best way to print the players' names?

A
  • index = 0
  • while index < len(team):
  •     print(team[index]['name'])
  •     index += 1
B
  • for player in team:
  •     print(player[name])
C
  • foreach player in team:
  •     player = next(team)
  •     name = player[name]
  •     print(name)
D
  • for player in team:
  •     print(player['name'])
E
  • for 'player' from team:
  •     print('player[name]')


People have been writing analysis code for a long time. To make their lives easier (easy is good), they made libraries of functions they use often. The libraries are free, and you can use them to make your life easier, too.

The most common libraries are things like NumPy and Pandas, but they're more complex than we have time to learn. However, Python's standard library has a simpler module called statistics that works similarly. Let's use that.

The downside to using any package is that it expects data in a certain format. The list-of-dictionaries format is a great way to represent typical business data, but the statistics module doesn't know how to deal with that directly.

So, you need a data machine in the pipeline that converts data from a list-of-dictionaries into whatever format the statistics module likes.

The statistics module

The module's documentation lists 18 statistics functions, including:

  • mean for computing averages
  • median to get the middle value of a data set
  • stdev for standard deviation
  • Others...

The easiest way to use statistics functions is to send them lists. For example...

  1. import statistics
  2.  
  3. my_data = [3,7,8]
  4. mean = statistics.mean(my_data)
  5. print('Mean: ' + str(mean))

Line 4 calls a function to compute a mean (an average, same thing). Send it a list as the param, and get back one number.
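The other functions in the list work the same way: pass in a list, get back one number. Here's a quick sketch using the same three values. The variable names are just for illustration.

```python
import statistics

my_data = [3, 7, 8]

# Each function takes a list and returns a single number.
data_mean = statistics.mean(my_data)
data_median = statistics.median(my_data)
data_std_dev = statistics.stdev(my_data)

print('Mean: ' + str(data_mean))
print('Median: ' + str(data_median))
print('Standard deviation: ' + str(round(data_std_dev, 2)))
```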

Before and after

Let's see how we can prep data for the mean function.

Here's a program we wrote.

  • # Read goat scores from CSV file.
  • raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
  • # Filter out bad records.
  • cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
  • # Compute difference scores.
  • compute_change(cleaned_goat_scores)
  • # Analysis.
  • total_difference = 0
  • for goat in cleaned_goat_scores:
  •     total_difference += goat['Difference']
  • print('Total difference: ' + str(total_difference))

You can see the pipeline from the code.

Figure: Pipeline

Ray

I'm seeing the big picture now. Like, each function is a machine. Start with a pile of raw data, shovel it into read_csv_data_set, and get data that's closer to what we want. Partially processed. Shovel that into clean_goat_scores, and get data that's even closer. Shovel it into another machine, and we get data that's ready to analyze.

Adela

You know, for all the program is doing, the actual analysis part is just a few lines.

  • # Analysis.
  • total_difference = 0
  • for goat in cleaned_goat_scores:
  •     total_difference += goat['Difference']
Ethan

Yeah, I saw that, too. And every function is something I can understand. Make different machines (functions) from familiar patterns, and call them in sequence.

Georgina

Yeah, I like this way. A big task becomes a buncha small tasks.

Aye! That's how you should design your code.

Something you learn from experience is how to break apart complex tasks, and what the pieces should be. You've already got some nice patterns: reading CSV, making new data sets while cleaning existing ones, adding new fields to data sets... You can combine them to do many different things.
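To make the machines-in-sequence idea concrete, here's a small sketch. It isn't the lesson's actual code: drop_incomplete_records and add_difference are hypothetical stand-ins for clean_goat_scores and compute_change, and the data is typed in rather than read from a CSV file. Elvira's After value is blanked out on purpose, to give the cleaning machine something to do. The sketch assumes compute_change works in place, as its call in the program above suggests.

```python
def drop_incomplete_records(records):
    # Hypothetical cleaning machine: keep only records that have both scores.
    cleaned = []
    for record in records:
        if record['Before'] is not None and record['After'] is not None:
            cleaned.append(record)
    return cleaned


def add_difference(records):
    # Hypothetical stand-in for compute_change: adds a Difference field
    # to each record in place.
    for record in records:
        record['Difference'] = record['After'] - record['Before']


# Raw data typed in here, instead of being read from a CSV file.
raw_scores = [
    {'Goat': 'Dewey', 'Before': 10, 'After': 12},
    {'Goat': 'Elvira', 'Before': 10, 'After': None},
    {'Goat': 'Flossie', 'Before': 19, 'After': 23},
]

# Each machine takes the output of the one before it.
cleaned_scores = drop_incomplete_records(raw_scores)
add_difference(cleaned_scores)

# Analysis.
total_difference = 0
for goat in cleaned_scores:
    total_difference += goat['Difference']
print('Total difference: ' + str(total_difference))
```

Each function does one small job, and the main program reads like a list of steps.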

Being mean

Let's get back to the statistics module. We saw the mean function.

  • import statistics
  •  
  • my_data = [3,7,8]
  • mean = statistics.mean(my_data)
  • print('Mean: ' + str(mean))

Give it a list, and it will give you the mean.

So, if we want to compute the average of, say, the Before data, we need to make a list with just that data. This is how the data is arranged, and what we want to give to mean.

What we have: Data set (figure)

What we want for mean:

  • [
  •     17,
  •     14,
  •     12,
  •     18,
  •     16,
  •     15,
  •     16,
  •     10,
  •     10,
  •     19,
  •     20,
  •     12,
  •     15
  • ]
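For comparison, the "what we have" side is a list of dictionaries, one dictionary per goat. Here's a sketch of five records, using rows from the sample data shown later in the lesson. It assumes the CSV columns are Goat, Before, and After, in that order.

```python
# Five records in list-of-dictionaries form. Each dictionary is one goat.
goat_records = [
    {'Goat': 'Dewey', 'Before': 10, 'After': 12},
    {'Goat': 'Elvira', 'Before': 10, 'After': 12},
    {'Goat': 'Flossie', 'Before': 19, 'After': 23},
    {'Goat': 'Foster', 'Before': 20, 'After': 24},
    {'Goat': 'Georgia', 'Before': 12, 'After': 14},
]

# One record is a dictionary. Look up a field by its key.
first_record = goat_records[0]
print(first_record['Goat'] + ' scored ' + str(first_record['Before']) + ' before.')
```

statistics.mean can't take goat_records directly, because each item is a whole dictionary rather than a number.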
Reflect

In your own words, explain what the code preparing the list will do.

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.
Ethan

Looks like it will run through the list-of-dictionaries, take the Before value from each record, and add it to a new list.

Correct. We want to make a list we can use like this...

  • befores_mean = statistics.mean(befores)

... where befores is a list of all the Before values, like in the right-hand column of the table above:

  • [
  •     17,
  •     14,
  •     12,
  •     18,
  •     16,
  •     15,
  •     16,
  •     10,
  •     10,
  •     19,
  •     20,
  •     12,
  •     15
  • ]
Reflect

Complete this code. goat_records is the cleaned data set.

  • def extract_before_values(goat_records):
  •     # Make a new list.
  •     befores = []
  •     What goes here?
  •     return befores
Adela

Here's what I got:

  • def extract_before_values(goat_records):
  •     '''
  •     Extract Before values into a list.
  •  
  •     Parameters
  •     ----------
  •     goat_records : List-of-dictionaries.
  •         Data set.
  •  
  •     Returns
  •     -------
  •     befores : list
  •         List of before values.
  •  
  •     '''
  •     # Make a new list.
  •     befores = []
  •     # Loop over the record set
  •     for goat_record in goat_records:
  •         # Get Before value for the current record.
  •         before = goat_record['Before']
  •         # Add it to the list.
  •         befores.append(before)
  •     return befores

Nice!
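To see the function at work, here's a small sketch that runs it on three records taken from the sample data shown later in the lesson. The function body is the same logic as Adela's, shortened to keep the sketch compact. sample_records is a name made up for this example.

```python
import statistics


def extract_before_values(goat_records):
    # Same logic as Adela's function, without the docstring.
    befores = []
    for goat_record in goat_records:
        befores.append(goat_record['Before'])
    return befores


# Three records, typed in for this sketch.
sample_records = [
    {'Goat': 'Dewey', 'Before': 10, 'After': 12},
    {'Goat': 'Flossie', 'Before': 19, 'After': 23},
    {'Goat': 'Georgia', 'Before': 12, 'After': 14},
]

# The list-of-dictionaries goes in, a simple list comes out.
befores = extract_before_values(sample_records)
print(befores)
print('Mean of befores: ' + str(round(statistics.mean(befores), 2)))
```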

A new pattern:

Pattern

Data machine: Field extractor

Write a function that extracts a list with the values of one field from a data set.

When you start a new project, you can use the pattern catalog to remind yourself of useful chunks of code.

Calling statistics

Here's the main program.

  • # Read goat scores from CSV file.
  • raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
  • # Filter out bad records.
  • cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
  • # Extract befores
  • befores = extract_before_values(cleaned_goat_scores)
  • # Analysis.
  • befores_mean = statistics.mean(befores)
  • # Output.
  • print('Mean of befores: ' + str(befores_mean))
Georgina

Gotta say, the actual analysis code barely exists. It's just a call to mean. What about something more complex, like standard deviation?

(Standard deviation measures how spread out data is. You don't need to know anything about it, just that it exists.)

No problem. Add one line:

  • befores_std_dev = statistics.stdev(befores)
Ray

Wait, that's it?! I remember from my stats course, QMM 2400, the standard deviation formula is... challenging.

Here's the computational formula used in most programs (the definitional formula is different):

Computational formula for standard deviation

You could code that yourself, but most people don't, since there are modules like statistics. Easy is good.
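For the curious, here's roughly what coding it yourself might look like. This sketch assumes the formula is the usual sample version: take the sum of the squared values, subtract the squared sum divided by n, divide by n - 1, and take the square root. computational_stdev is a made-up name, and the data is the Before list from earlier.

```python
import math
import statistics


def computational_stdev(values):
    # Sample standard deviation from running totals.
    n = len(values)
    sum_of_values = sum(values)
    sum_of_squares = sum(value ** 2 for value in values)
    variance = (sum_of_squares - sum_of_values ** 2 / n) / (n - 1)
    return math.sqrt(variance)


befores = [17, 14, 12, 18, 16, 15, 16, 10, 10, 19, 20, 12, 15]

hand_coded = computational_stdev(befores)
from_module = statistics.stdev(befores)
print('Hand-coded: ' + str(round(hand_coded, 2)))
print('statistics.stdev: ' + str(round(from_module, 2)))
```

The two results should agree, apart from tiny floating-point differences. The one-line call is still the easier option.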

Like I said earlier, the downside of these modules is they require data in a certain format, like mean needing a list of numbers. However, it's easier to write the code to set up the right format, than to write the code to do the analysis yourself.

All the fields!

Here's extract_before_values again, the function that returns a list of Before values for analysis:

  • def extract_before_values(goat_records):
  •     '''
  •     Extract Before values into a list.
  •  
  •     Parameters
  •     ----------
  •     goat_records : List-of-dictionaries.
  •         Data set.
  •  
  •     Returns
  •     -------
  •     befores : list
  •         List of before values.
  •  
  •     '''
  •     # Make a new list.
  •     befores = []
  •     # Loop over the record set
  •     for goat_record in goat_records:
  •         # Get Before value for the current record.
  •         before = goat_record['Before']
  •         # Add it to the list.
  •         befores.append(before)
  •     return befores
Multiple choice

What type of thing is goat_record?

A

A list.

B

A dictionary.

C

An array.

D

The best vinyl album ever!


Reflect

In your own words, explain what the highlighted line (before = goat_record['Before']) does.

Adela

goat_record is a dictionary. Before is one of its keys. So the line looks up Before in the dictionary, and puts the value in the variable before.

Good!

Multiple choice

In the line...

before = goat_record['Before']

... what type of thing is 'Before'?

A

A numeric variable.

B

A numeric constant.

C

A string variable.

D

A string constant.

E

Can't tell.


Multiple choice

Would this work?

  • field_key = 'Before'
  • before = goat_record[field_key]
A

Yes.

B

No.

C

Can't tell from the code given.


Reflect

How can we change extract_before_values so it extracts values from any field you tell it to?

  • def extract_before_values(goat_records):
  •     # Make a new list.
  •     befores = []
  •     # Loop over the record set
  •     for goat_record in goat_records:
  •         # Get Before value for the current record.
  •         before = goat_record['Before']
  •         # Add it to the list.
  •         befores.append(before)
  •     return befores
Georgina

Ooo, I see what you mean! You pass in the key as a parameter:

  • def extract_field_values(goat_records, field_key):
  •     # Make a new list.
  •     values = []
  •     # Loop over the record set
  •     for goat_record in goat_records:
  •         # Get value for the current record.
  •         value = goat_record[field_key]
  •         # Add it to the list.
  •         values.append(value)
  •     return values
  •  
  • # Main program ----------------------------------------
  •  
  • # Read goat scores from CSV file.
  • raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
  • # Filter out bad records.
  • cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
  • # Extract befores
  • befores = extract_field_values(cleaned_goat_scores, 'Before')
Ray

That's so cool! You pass in any field name, and get a list with the values of that field!

Right!
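One thing to watch: the field name has to match a key exactly, including upper- and lowercase. If it doesn't, Python raises a KeyError. Here's a sketch of what that looks like. The try/except statement may not have come up yet in the course, so treat that part as a preview. sample_records and key_was_found are names made up for this example.

```python
def extract_field_values(records, field_key):
    # Same idea as Georgina's function, shortened for the sketch.
    values = []
    for record in records:
        values.append(record[field_key])
    return values


sample_records = [
    {'Goat': 'Dewey', 'Before': 10, 'After': 12},
    {'Goat': 'Flossie', 'Before': 19, 'After': 23},
]

# A key that exists.
afters = extract_field_values(sample_records, 'After')
print(afters)

# Keys are case-sensitive, so 'after' is not the same key as 'After'.
try:
    extract_field_values(sample_records, 'after')
    key_was_found = True
except KeyError:
    key_was_found = False
    print('No field called after in the data set.')
```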

Reflect

Change this code so it extracts lists for Before, After, and Difference.

  • # Read goat scores from CSV file.
  • raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
  • # Filter out bad records.
  • cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
  • # Extract befores
  • befores = extract_field_values(cleaned_goat_scores, 'Before')
Ethan

Maybe...

  • # Filter out bad records.
  • cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
  • # Compute differences.
  • compute_change(cleaned_goat_scores)
  • # Extract lists of field values.
  • befores = extract_field_values(cleaned_goat_scores, 'Before')
  • afters = extract_field_values(cleaned_goat_scores, 'After')
  • differences = extract_field_values(cleaned_goat_scores, 'Difference')

Great! I noticed you also brought compute_change back into the pipeline, so the Difference field exists.

We've made a reusable function, extract_field_values. Write it once, call it as many times as you like. A useful addition to the pipeline.

Let's add that to our data machine list.

Figure: Field extractor

All the statistics!

Reflect

Add code to Ethan's to compute and print the mean and standard deviation of Before, After, and Difference.

Ray

Before, we had this:

  • befores_mean = statistics.mean(befores)
  • befores_std_dev = statistics.stdev(befores)

So, we could copy that? Oh, and I'll change the look of the output, to make it cleaner.

  • befores_mean = statistics.mean(befores)
  • befores_std_dev = statistics.stdev(befores)
  • afters_mean = statistics.mean(afters)
  • afters_std_dev = statistics.stdev(afters)
  • differences_mean = statistics.mean(differences)
  • differences_std_dev = statistics.stdev(differences)
  • # Output.
  • print('Befores')
  • print('=======')
  • print('Mean: ' + str(befores_mean))
  • print('Standard deviation: ' + str(befores_std_dev))
  • print()
  • print('Afters')
  • print('=======')
  • print('Mean: ' + str(afters_mean))
  • print('Standard deviation: ' + str(afters_std_dev))
  • print()
  • print('Differences')
  • print('=======')
  • print('Mean: ' + str(differences_mean))
  • print('Standard deviation: ' + str(differences_std_dev))
Adela

Nice work, Ray!

Indeed! I like the output format, too. Easy to find what you want.

  • Befores
  • =======
  • Mean: 15.405405405405405
  • Standard deviation: 3.4517262948290695
  •  
  • Afters
  • =======
  • Mean: 17.513513513513512
  • Standard deviation: 4.266286582169697
  •  
  • Differences
  • =======
  • Mean: 2.108108108108108
  • Standard deviation: 2.144551025063078
Ray

Wait, an idea. It would be better to round off the output numbers to two decimal places.

Reflect

Change Ray's code to round numbers to two decimal places.

Ray

We learned about the round function a while back.

  • print('Mean: ' + str(round(befores_mean, 2)))
  • print('Standard deviation: ' + str(round(befores_std_dev, 2)))

Use round in the output code, where needed.

Aye, that does look better.
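The output code repeats the same four lines for each field, so it's a candidate for its own function. Here's one possible sketch. show_stats is a made-up helper, not part of the lesson's code, and the two lists are short samples typed in for the example.

```python
import statistics


def show_stats(title, values):
    # Hypothetical helper: prints a titled block with rounded statistics,
    # and returns the rounded numbers so the caller can reuse them.
    values_mean = round(statistics.mean(values), 2)
    values_std_dev = round(statistics.stdev(values), 2)
    print(title)
    print('=======')
    print('Mean: ' + str(values_mean))
    print('Standard deviation: ' + str(values_std_dev))
    print()
    return values_mean, values_std_dev


befores = [10, 10, 19, 20, 12]
afters = [12, 12, 23, 24, 14]

befores_stats = show_stats('Befores', befores)
afters_stats = show_stats('Afters', afters)
```

Note that round doesn't pad with zeros: round(2.8, 2) gives 2.8, not 2.80.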

Highs and lows

Georgina

Hey, I was looking at the list of the statistics module's functions. I didn't see one that would find the highest and lowest values. Like, the highest Before score.

Good point. Some listy functions are part of Python's core. For example:

  • # Extract lists of field values.
  • befores = extract_field_values(cleaned_goat_scores, 'Before')
  • max_before = max(befores)
  • min_before = min(befores)

max takes a list as a parameter, and gives back the maximum value in the list. No statistics module needed, since max is built into Python.
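Here's a short sketch of max and min at work, using the list of Before values from earlier in the lesson. No import is needed.

```python
befores = [17, 14, 12, 18, 16, 15, 16, 10, 10, 19, 20, 12, 15]

# max and min are built in.
max_before = max(befores)
min_before = min(befores)

print('Highest Before: ' + str(max_before))
print('Lowest Before: ' + str(min_before))
print('Range: ' + str(max_before - min_before))
```

One caution: max and min raise a ValueError if the list is empty, so make sure the cleaning step leaves at least one record.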

Highs and lows and names

What if you want to know which goat has the highest After value? max_after = max(afters) will give you the highest value, but won't tell you the name of the goat with that value.

One way to do it is to write a function that loops over the records as usual. Each time through the loop, it asks: Is the current value greater than the largest I have so far?

  • for each record
  •     get current-after from the current record
  •     get current-goat from the current record
  •     if current-after is more than the-largest-after-so-far
  •         the-largest-after-so-far is current-after
  •         goat-with-the-largest-after is current-goat

For each record, if a goat's After is larger than any we've seen, remember the new After, and the name of the goat with the new After.

Reflect

Complete this code:

  • def find_largest_after(clean_goat_scores):
  •     largest_after_value = -1
  •     largest_after_name = ''
  •     for record in clean_goat_scores:
  •         goat_name = record['Goat']
  •         goat_after_value = record['After']
  •         Something goes here
  •     return largest_after_name, largest_after_value
Ethan

I think I got it!

  1. def find_largest_after(clean_goat_scores):
  2.     largest_after_value = -1
  3.     largest_after_name = ''
  4.     for record in clean_goat_scores:
  5.         goat_name = record['Goat']
  6.         goat_after_value = record['After']
  7.         if goat_after_value > largest_after_value:
  8.             # Remember the new large value.
  9.             largest_after_value = goat_after_value
  10.             # Remember the name for that record.
  11.             largest_after_name = goat_name
  12.     return largest_after_name, largest_after_value
Ray

I see most of it. Line 6 gets After for the current record. If that's bigger than the largest so far, remember the new big value and the goat that has it.

What's line 2 about, though?

OK, here's some data and code.

  • "Dewey",10,12
  • "Elvira",10,12
  • "Flossie",19,23
  • "Foster",20,24
  • "Georgia",12,14
  1. largest_after_value = 1000
  2. for record in clean_goat_scores:
  3.     goat_after_value = record['After']
  4.     if goat_after_value > largest_after_value:
  5.         # Remember the new large value.
  6.         largest_after_value = goat_after_value
  7.  
  8. print(largest_after_value)

As you can see, the largest After value in the data is 24.

Reflect

What would the code output?

Adela

1000.

Right!

Here's the stuff again.

  • "Dewey",10,12
  • "Elvira",10,12
  • "Flossie",19,23
  • "Foster",20,24
  • "Georgia",12,14
  1. largest_after_value = 1000
  2. for record in clean_goat_scores:
  3.     goat_after_value = record['After']
  4.     if goat_after_value > largest_after_value:
  5.         # Remember the new large value.
  6.         largest_after_value = goat_after_value
  7.  
  8. print(largest_after_value)

The code runs through the records, grabbing each After value. That's 12, 12, 23, 24, and 14.

Each time through the loop, line 4 compares the After value from the record (12, 12, 23, etc.) with the largest value so far, in largest_after_value, which is initialized to 1000.

Let's look at the first record. After is 12. 12 is not more than 1000, so largest_after_value is not changed.

The second record. After is 12. 12 is not more than 1000, so largest_after_value is not changed.

The third record. After is 23. 23 is not more than 1000, so largest_after_value is not changed.

Continue for all records. largest_after_value will never change, since none of the After values are more than the initial value of largest_after_value: 1000

Ray

Oh, I got it! largest_after_value should start off as a very low number. Then the first time the if runs...

if goat_after_value > largest_after_value:

goat_after_value is guaranteed to be greater than largest_after_value, so the After value from the first record will go into largest_after_value.

Right!
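To check this for yourself, here's a runnable sketch that tries both starting values on the five records above. The function name and its starting_value parameter are made up for the comparison. The sketch assumes the CSV columns are Goat, Before, and After.

```python
def find_largest_after_value(records, starting_value):
    # Same loop as above, with the starting value passed in so
    # different choices can be compared.
    largest_after_value = starting_value
    for record in records:
        goat_after_value = record['After']
        if goat_after_value > largest_after_value:
            # Remember the new large value.
            largest_after_value = goat_after_value
    return largest_after_value


goat_scores = [
    {'Goat': 'Dewey', 'Before': 10, 'After': 12},
    {'Goat': 'Elvira', 'Before': 10, 'After': 12},
    {'Goat': 'Flossie', 'Before': 19, 'After': 23},
    {'Goat': 'Foster', 'Before': 20, 'After': 24},
    {'Goat': 'Georgia', 'Before': 12, 'After': 14},
]

result_starting_high = find_largest_after_value(goat_scores, 1000)
result_starting_low = find_largest_after_value(goat_scores, -1)

print('Starting at 1000: ' + str(result_starting_high))
print('Starting at -1: ' + str(result_starting_low))
```

Starting at 1000, no After value ever wins the comparison. Starting at -1, the first record wins right away, and later records replace it whenever they're larger.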

There's one small improvement we can make. It won't affect this data set, but remember that we might reuse the code for another program, like temperatures on Mars. We copy-and-paste the code, but doing this won't work:

  • largest_after_value = -1
Reflect

Why would that -1 value fail for the Mars temperature data?

Ray

Because all the data might be less than -1. None of them will replace -1 as the largest value.

Right!

The easiest thing to do is to type in extreme values when you initialize: a very small value for the largest-value variable, and a very large value for the smallest-value variable.

  • largest_after_value = -999999999
  • for record in clean_goat_scores:
  •     goat_after_value = record['After']
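An alternative you might see in other people's code, though the lesson doesn't use it, is float('-inf'). It means negative infinity, which is lower than any number, so there's no need to guess how small is small enough. Here's a sketch with made-up Mars temperatures. The field names Sol and Temperature are invented for the example.

```python
# Made-up Mars temperatures in degrees Celsius, all below -1.
mars_readings = [
    {'Sol': 1, 'Temperature': -63},
    {'Sol': 2, 'Temperature': -48},
    {'Sol': 3, 'Temperature': -71},
]

# float('-inf') is lower than any number, so the first reading
# always replaces it.
warmest_temperature = float('-inf')
warmest_sol = None
for reading in mars_readings:
    if reading['Temperature'] > warmest_temperature:
        # Remember the new largest value, and the record it came from.
        warmest_temperature = reading['Temperature']
        warmest_sol = reading['Sol']

print('Warmest: ' + str(warmest_temperature) + ' on sol ' + str(warmest_sol))
```

For the smallest-so-far case, float('inf') plays the same role.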

A new pattern

Use existing computation functions when you can, like statistics.mean. If you can't, like when you identify records with lowest/highest values in a data set, write your own loopy function.

Pattern

Data machine: Computation

Use existing computation functions where you can, like statistics.mean. If you can't, like when you identify records with lowest/highest values in a data set, write your own loopy function.

Summary

  • People have made libraries of functions they use often, like the statistics module.
  • Every module expects data in a certain format. For example, statistics.mean wants a simple list.
  • We added a new data machine, field extractor, to our collection.
  • Functions like max and min are part of Python's core. No module needed.
  • To find, e.g., the name of the goat with the largest value in a field, you need a loop with an if. Initialize the largest-so-far variable with the smallest possible value.

Exercise


Shoes

As you know, goats love shoes. They don't wear shoes in pairs, but in quads.

Cthulhu has data on his goatty friends' fave shoe brands, and the number of quads they have. You can download a CSV file. Help Cthulhu analyze the data.

Here's part of the data set:

  • "Goat","Fave brand","Quads owned"
  • "Adria ","Remock",1
  • "Albertina ","Remock",7
  • "Amer","Skreecherz",7
  • "Anneliese ","Abibaaas",1
  • "Ashanti ","Skreecherz",7

The fields:

  • Goat. Name must be present.
  • Fave brand. Valid values: Abibaaas, Remock, or Skreecherz. Allow for extra spaces, and upper- or lowercase.
  • Quads owned. Number of quads owned, of all brands. Integer from 0 to 10.

Output:

  • Abibaaas
  • =======
  • Mean: 4.27
  • Standard deviation: 2.34
  • Max: 7
  •  
  • Remock
  • ======
  • Mean: 5.12
  • Standard deviation: 2.8
  • Max: 8
  •  
  • Skreecherz
  • ===========
  • Mean: 4.23
  • Standard deviation: 2.77
  • Max: 8

Other requirements:

  • Use the statistics module.
  • Use one subset function, called three times.
  • Round all values to two decimal places.

Upload a zip of your project folder. The usual coding standards apply.