Record subsets | Python with Pets

New for the pipeline

Let's add the component to our collection.

New component

Subset data machine

Goatty friends

Here's goatty data again.

"Goat","Before","After"
"Aisha",17,16
"Andreas",14,15
"August",10,13
"Bertha",17,23
"Bessie",20,25
"Boyd",13,12
"Bridgette",16,22
"Carrie",15,19

Let's say we only want to get average before and after scores for goats where After is more than Before. That would be lines 3, 4, 5, 6, 8, and 9. We'd ignore lines 2 and 7.

I'd want to add a new subset process:

Subset process

Add a subset process

Ray

Oh, I get it. You plug together the data machines you need. So, if you're writing a program to analyze an entire data set, without splitting out any subsets, you wouldn't include the last step.

Exactly!

I'd want to do something like this:

# Read goat scores from CSV file.
raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
# Filter out bad records.
cleaned_goat_scores, goats_bad_records = clean_goat_scores(raw_goat_scores)
# Get records where after is more than before.
score_gain_records = filter_score_gain_records(cleaned_goat_scores)
# Analysis.
total_before = 0
total_after = 0
for goat in score_gain_records:
total_before += goat['Before']
total_after += goat['After']
print('Total before:', total_before)
print('Total after:', total_after)

Top-down again

I started by deciding what the new function should do. A good way to do that is to write code that calls the function.

Splitty split

Adela

This is gonna be like the other functions, right? Make a new list. Then loop over the data set, have an if to pick the records you want, then add them to the new list.

Right! That's a common pattern.

Pattern

Loop over a list of dictionaries

You have a data set. Each record is a dictionary. All the records are in a list. Use a for loop to run through each record in the list.

Reflect

Finish this function.

def filter_score_gain_records(goat_records):
'''
Filter records where after is more than before.
Parameters
----------
goat_records : list of dictionaries
Valid goatty records.
Returns
-------
score_gain_records : list of dictionaries
Records where after > before.
'''
# Create a new list.
# Loop over records.
# Is after > before?
# Add to new list.
# Return the new list.

You can have more than one Python line per comment.

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Ray

Here's what my coffee and I got.

def filter_score_gain_records(goat_records):
'''
Filter records where after is more than before.
Parameters
----------
goat_records : list of dictionaries
Valid goatty records.
Returns
-------
score_gain_records : list of dictionaries
Records where after > before.
'''
# Create a new list.
score_gain_records = []
# Loop over records.
for record in goat_records:
# Is after > before?
before = record['Before']
after = record['After']
if after > before:
# Add to new list.
score_gain_records.append(record)
# Return the new list.
return score_gain_records

Nice work!

Ethan

Yes, good job, dude!

More than one group

We just made a subset of the data where After is more than Before. What if you want more than one subset?

Here's part of another data set you can download:

"Neighborhood","Goat","Houses visited","Candy collected"
"Guernsey Glen","Azucena",5,307
"Savanna Shores","Dawne",7,242
"Pygora Park","Rozanne",6,262
"Pygora Park","Elenor",7,288
"Angora Acres","Wava",9,236
"Pygora Park","Fredric",8,312
"Pygora Park","Clarinda",8,312
"Pygora Park","Rosamond",9,230

It shows how much candy goats collected in various neighborhoods last Halloween. Candy is in grams.

There are four neighborhoods:

Angora Acres
Guernsey Glen
Pygora Park
Savanna Shores

Say you want to find the average loot in each neighborhood. You want output like this:

Average Halloween loot by neighborhood
======= ========= ==== == ============
Angora Acres: 211.4
Guernsey Glen: 320.2
Pygora Park: 254.2
Savanna Shores: 226.8

How would you do it?

Ethan

Can we start with one of those diagrams? Showing the different data machines?

Sure.

Halloween machines

Halloween machines (good band name?)

Ethan

The first part we know how to do. We'd want to be able to write this:

# Read the raw data.
halloween_data_set = read_csv_data_set('halloween-candy.csv')
# Keep only valid records.
clean_halloween_data_set = clean_goat_records(halloween_data_set)

Ethan

read_csv_data_set we just copy-and-paste.

For clean_goat_records, we can change what we did last time.

Here's code from before:

def clean_goat_scores(raw_goat_scores):
# Create a new list for the clean records.
clean_scores = []
# Loop over raw records.
for raw_record in raw_goat_scores:
# Is the record OK?
if is_record_ok(raw_record):
# Yes, make a new record with the right data types.
clean_record = {
'Goat': raw_record['Goat'],
'Before': float(raw_record['Before']),
'After': float(raw_record['After'])
}
# Add the new record to the clean list.
clean_scores.append(clean_record)
# Send the cleaned list back.
return clean_scores

Reflect

Write clean_goat_records for the halloween data, to be called like this:

clean_halloween_data_set = clean_goat_records(halloween_data_set)

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Adela

Here's what I got.

Old	New
def clean_goat_scores(raw_goat_scores): # Create a new list for the clean records. clean_scores = [] # Loop over raw records. for raw_record in raw_goat_scores: # Is the record OK? if is_record_ok(raw_record): # Yes, make a new record with the right data types. clean_record = { 'Goat': raw_record['Goat'], 'Before': float(raw_record['Before']), 'After': float(raw_record['After']) } # Add the new record to the clean list. clean_scores.append(clean_record) # Send the cleaned list back. return clean_scores	def clean_goat_records(raw_goat_data_set): # Create a new list for the clean records. clean_records = [] # Loop over raw records. for goat_record in raw_goat_data_set: # Is the record OK? if is_record_ok(goat_record): # Make new record with right data types. new_record = { 'Neighborhood': goat_record['Neighborhood'], 'Goat': goat_record['Goat'], 'Houses visited': int(goat_record['Houses visited']), 'Candy collected': int(goat_record['Candy collected']) } # Add the new record to the clean list. clean_records.append(new_record) # Send the lists back. return clean_records

Old

New

def clean_goat_scores(raw_goat_scores):
# Create a new list for the clean records.
clean_scores = []
# Loop over raw records.
for raw_record in raw_goat_scores:
# Is the record OK?
if is_record_ok(raw_record):
# Yes, make a new record with the right data types.
clean_record = {
'Goat': raw_record['Goat'],
'Before': float(raw_record['Before']),
'After': float(raw_record['After'])
}
# Add the new record to the clean list.
clean_scores.append(clean_record)
# Send the cleaned list back.
return clean_scores

def clean_goat_records(raw_goat_data_set):
# Create a new list for the clean records.
clean_records = []
# Loop over raw records.
for goat_record in raw_goat_data_set:
# Is the record OK?
if is_record_ok(goat_record):
# Make new record with right data types.
new_record = {
'Neighborhood': goat_record['Neighborhood'],
'Goat': goat_record['Goat'],
'Houses visited': int(goat_record['Houses visited']),
'Candy collected': int(goat_record['Candy collected'])
}
# Add the new record to the clean list.
clean_records.append(new_record)
# Send the lists back.
return clean_records

Adela

Almost the same. The old one has three fields, and converts numbers to floats (lines 9 to 13). The new one has four fields, and makes the numbers ints (lines 9 to 14).

Good job!

Both functions call is_record_ok (line 7 in both functions): if is_record_ok(goat_record):

The function takes a dictionary (with one record), and returns True if the data in the record is valid.

Halloween records have different validity rules from the other data. Here they are for each field:

Neighborhood: One of Angora Acres, Guernsey Glen, Pygora Park, or Savanna Shores. As usual, case doesn't matter, and there can be extra learning and trailing spaces.
Goat: name can't be MT.
Houses: int from 0 to 30.
Candy: int from 0 to 500.

Here's what we had for the before/after thing.

def is_record_ok(record):
# Check name.
goat_name = record['Goat']
if goat_name == '' or goat_name == None:
return False
# Check Before value.
before = record['Before']
if not is_score_ok(before):
return False
# Check After value.
after = record['After']
if not is_score_ok(after):
return False
return True

Adela, you wanna try this one?

Adela

Sure. Basically the code is this:

Check a field. Return False if it's bad.
Check another field. Return False if it's bad.
Check another field. Return False if it's bad.
(Repeat for as many fields as we have.)
return True.

Adela

Here's what I came up with.

def is_record_ok(record):
# Check neighborhood.
neighborhood = record['Neighborhood'].strip().lower()
neighborhood_ok = \
neighborhood == 'angora acres' \
or neighborhood == 'guernsey glen' \
or neighborhood == 'pygora park' \
or neighborhood == 'savanna shores'
if not neighborhood_ok:
return False
# Check name.
goat_name = record['Goat'].strip().lower()
if goat_name == '' or goat_name == None:
return False
# Check houses visited value.
houses = record['Houses visited']
if not is_count_ok(houses, 30):
return False
# Check candy value.
candy = record['Candy collected']
if not is_count_ok(candy, 500):
return False
return True

Wow, that's great!

You can see the structure of the function from the comments.

def is_record_ok(record):
# Check neighborhood.
# Check name.
# Check houses visited value.
# Check candy value.

Here's the first one:

# Check neighborhood.
neighborhood = record['Neighborhood'].strip().lower()
neighborhood_ok = \
neighborhood == 'angora acres' \
or neighborhood == 'guernsey glen' \
or neighborhood == 'pygora park' \
or neighborhood == 'savanna shores'
if not neighborhood_ok:
return False

neighborhood_ok is a boolean. It'll be True if neighborhood is any one of the valid values. If neighborhood_ok is, say, bikini bottom, all the ==s will be False, so neighborhood_ok will be False.

The rule for the next field, name, is:

Goat: name can't be MT.

Here's how Adela did it:

# Check name.
goat_name = record['Goat'].strip().lower()
if goat_name == '' or goat_name == None:
return False

That's the first two fields.

Here are the rules for the last two fields:

Houses: int from 0 to 30.
Candy: int from 0 to 500.

The rules are almost the same.

Reflect

When two rule sets are almost the same, you should consider ________.

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Georgina

Make one function to handle both.

Right!

The function is going to have some params. They should be the things that are different between the different calls. Here are the rules again:

Houses: int from 0 to 30.
Candy: int from 0 to 500.

The only differences are the value being checked (houses or candy), and the maximum value (30 or 500). Those become the params passed to the new function. (You can pass other things, too, if you like.)

Here are the calls Adela wrote:

# Check houses visited value.
houses = record['Houses visited']
if not is_count_ok(houses, 30):
return False
# Check candy value.
candy = record['Candy collected']
if not is_count_ok(candy, 500):
return False

The params match what's different between the way the functions are called.

Note

Writing out how a function is called will help you work out what the function should do.

Ethan, wanna try is_count_ok?

Ethan

No problem. The first line will be:

def is_count_ok(count_in, max):

Ethan

The structure will be:

Check a rule. If it's violated, return False.
Check another rule. If it's violated, return False.
...
return True

Reflect

Complete the function, starting with:

def is_count_ok(count_in, max):

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Ethan

The rules are like: int from 0 to something.

We did checking for int and checking range before. Here's my code:

def is_count_ok(count_in, max):
# Is it an int?
try:
count = int(count_in)
except Exception:
# Not an int.
return False
# Check range.
if count < 0 or count > max:
return False
# All OK.
return True

Hey, good work!

Where are we?

It's easy to get lost when you're writing a moderately complex program like this.

Ray

True dat!

It helps to take a step back, and look at the structure of your work.

We're writing a program that takes data like this...

"Neighborhood","Goat","Houses visited","Candy collected"
"Guernsey Glen","Azucena",5,307
"Savanna Shores","Dawne",7,242
"Pygora Park","Rozanne",6,262
"Pygora Park","Elenor",7,288

... and outputs average candy collected for each neighborhood.

We started with:

# Read the raw data.
halloween_data_set = read_csv_data_set('halloween-candy.csv')
# Keep only valid records.
clean_halloween_data_set = clean_goat_records(halloween_data_set)

Next, we wrote clean_goat_records. It's similar to what we've done before.

Splitting

We want...

Average Halloween loot by neighborhood
======= ========= ==== == ============
Angora Acres: 211.4
Guernsey Glen: 320.2
Pygora Park: 254.2
Savanna Shores: 226.8

There are several ways to do this. Let's use one of the simpler approaches. Here's what we'll have in the main program:

# Break out each neighborhood's data.
angora_data_set = get_neighborhood_records('Angora Acres', clean_halloween_data_set)
guernsey_data_set = get_neighborhood_records('Guernsey Glen', clean_halloween_data_set)
pygora_data_set = get_neighborhood_records('Pygora Park', clean_halloween_data_set)
savanna_data_set = get_neighborhood_records('Savanna Shores', clean_halloween_data_set)

Reflect

Explain in your own words what get_neighborhood_records will do.

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Ethan

It goes through a data set, and returns records for one neighbornood.

Indeed.

Reflect

Finish this code.

def get_neighborhood_records(neighborhood_name_to_find, data_set):
# Normalize name to find.
neighborhood_name_to_find = neighborhood_name_to_find.strip().lower()
neighborhood_records = []
Stuff here
return neighborhood_records

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Adela

It's gotta loop over the records. For each one, check whether it's in the neighborhood we're looking for. If it is, add it to neighborhood_records.

def get_neighborhood_records(neighborhood_name_to_find, data_set):
# Normalize name to find.
neighborhood_name_to_find = neighborhood_name_to_find.strip().lower()
neighborhood_records = []
# Loop over records.
for record in data_set:
# Get normalized name for current record.
neighborhood_name_in_record = record['Neighborhood']
neighborhood_name_in_record = neighborhood_name_in_record.strip().lower()
# Is it the one we want?
if neighborhood_name_in_record == neighborhood_name_to_find:
# Aye.
neighborhood_records.append(record)
return neighborhood_records

Good work!

Averages

Here's the main program again:

# Break out each neighborhood's data.
angora_data_set = get_neighborhood_records('Angora Acres', clean_halloween_data_set)
guernsey_data_set = get_neighborhood_records('Guernsey Glen', clean_halloween_data_set)
pygora_data_set = get_neighborhood_records('Pygora Park', clean_halloween_data_set)
savanna_data_set = get_neighborhood_records('Savanna Shores', clean_halloween_data_set)
# Compute stats.
angora_average = compute_average_candy(angora_data_set)
guernsey_average = compute_average_candy(guernsey_data_set)
pygora_average = compute_average_candy(pygora_data_set)
savanna_average = compute_average_candy(savanna_data_set)

We can add a new type of machine:

Averaging machine

This one doesn't return another data set. It returns a single value.

Remember, there's a data machines list list.

Reflect

What's compute_average do?

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Ethan

It works out the average candy for one neighborhood.

Close, but not really. Though that seems right. Thing is, compute_average doesn't test for a neighborhood. The neighborhood doesn't get passed in to the function.

This call...

angora_average = compute_average_candy(angora_data_set)

... does compute the average of a neighborhood, but not because compute_average_candy knows about the neighborhood. It doesn't.

Ethan

Huh... Oh! I think I see it.

compute_average_candy works out the candy average for a data set, whatever the data set is. angora_data_set only contains data for that one neighborhood. That's all the data compute_average_candy gets for the call compute_average_candy(angora_data_set).

We made a different data set for each neighborhood. But compute_average_candy doesn't know that. It just takes a data set, whatever it is, and works out the average.

Correct.

Reflect

Please finish this code:

def compute_average_candy(data_set):
Stuff here
return average

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Georgina

Here's what I came up with:

def compute_average_candy(data_set):
total = 0
# Loop over data set.
for record in data_set:
# Add to total.
total += record['Candy collected']
# Divide total by number of elements.
average = total / len(data_set)
return average

Good!

But...

Dataless

Adela

I'm not sure if this is a problem, but what would happen if the data set is MT?

Ooo, good question! That's called an edge case.

Reflect

What would this code return if the data set was MT?

def compute_average_candy(data_set):
total = 0
# Loop over data set.
for record in data_set:
# Add to total.
total += record['Candy collected']
# Divide total by number of elements.
average = total / len(data_set)
return average

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Georgina

Hmm, let's see. Zero, maybe?

Let me walk it through in my head.

total is zero in line 2.

The loop (lines 4 to 6) wouldn't do anything. Python would skip it, since there are no records in the data set.

Line 8 would divide total by the number of elements in the list, zero in this case.

Oops...

Reflect

Run 0/0 in the console. What happens?

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Ray

I got an error: ZeroDivisionError: division by zero

Aye. Just as in normal math, in Python dividing by 0 is not allowed.

Adela

If data_set was MT, that means a neighborhood had no data. Nobody collected candy there. I suppose that could happen.

But if it does, what should we tell the user?

Ray

We could show an error message. Like, in the main program, maybe.

# Break out each neighborhood's data.
angora_data_set = get_neighborhood_records('Angora Acres', clean_halloween_data_set)
...
# Compute stats.
if len(angora_data_set) > 0:
angora_average = compute_average_candy(angora_data_set)
...
# Output.
if len(angora_data_set) == 0:
print('No data for Angora Acres.')
else:
angora_average = compute_average_candy(angora_data_set)
print('Angora Acres: ', angora_average)

Aye, you could do that.

Georgina

Or return zero from compute_average_candy:

def compute_average_candy(data_set):
# Check there is data.
if len(data_set) == 0:
return 0
total = 0
# Loop over data set.
...

Yes, that would work, too.

Adela

Kieran, if you were an analyst, what would you do?

I'd ask the people who were going to use the program what they wanted.

Principle

Ask users

If you're not sure what a program should do in special cases, ask the people who are going to use it.

This is for the make-it-zero approach:

def compute_average_candy(data_set):
# Check there is data.
if len(data_set) == 0:
return 0
total = 0
# Loop over data set.
for record in data_set:
# Add to total.
total += record['Candy collected']
# Divide total by number of elements.
average = total / len(data_set)
return average

A new pattern

Pattern

Data machine: Records subset

Write a function taking a data set as a param, and returning another data set with a subset of the original records, based on criteria you choose.

When you start a new project, you can use the pattern catalog to remind yourself of useful chunks of code.

Summary

It's common in business analytics to look at subsets of data.
We added a function to the data pipeline to loop over records, making a new list of records that meet criteria, like being about a particular neighborhood.
When two rule sets are almost the same, you should consider making one function to handle both.
Use ifs to handle edge cases, like trying to find the average of no data.
If you're not sure what a program should do in edge cases, ask the people who are going to use it.

Exercise

Compare cuteness

Cthulhu has data on goat cuteness. You can download it. Here's some of the data:

"Name","Cuteness"
"Lino",10
"Johnnie",5
"August",9
"Leticia",5
"Andreas",10
"Rhonda",10

Write a program to compute the average cuteness of goats whose names begin with A-L, and goats whose names begin with M-Z. Do this for valid records only. Validity rules:

Goat name must be present.
Cuteness ratings are ints from 0 to 10.

Output:

First half average cuteness: 7.0
Second half average cuteness: 7.1

Write a subset function that works for creating both subsets.

Hints:

thing[0] is the first character of the string thing.
a < b < c < d... < l < m ... y < z.
def get_records_by_name(data_set, low_letter, high_letter):

Upload a zip of your project folder. The usual coding standards apply.

If you were logged in as a student, you could submit a solution to this exercise.

Attachments

halloween-candy.csv