It's common in business analytics to look at subsets of data. For example, suppose we have a CSV file with sales data for three regions (north, south, and underground). The boss wants us to analyze sales in the southern region only, so we make a subset of the original data.
You can do this with a subset function. It's a function that takes a data set (and other params), and returns another data set with only some of the records you sent it.
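Here's a minimal sketch of that idea. The record layout, the field names (Region, Amount), and the function name get_region_records are all made up for illustration; the real sales file could look quite different.

```python
# Hypothetical sales records; field names and values are made up.
sales = [
    {'Region': 'north', 'Amount': 120.0},
    {'Region': 'south', 'Amount': 80.5},
    {'Region': 'underground', 'Amount': 42.0},
    {'Region': 'south', 'Amount': 19.5},
]


def get_region_records(data_set, region):
    # Make a new list with only the records for one region.
    region_records = []
    for record in data_set:
        if record['Region'] == region:
            region_records.append(record)
    return region_records


south_sales = get_region_records(sales, 'south')
print(len(south_sales), 'southern records')
```

The original list isn't changed. The function builds and returns a new, smaller list.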
New for the pipeline
Let's add the component to our collection.
Subset data machine
There's a data machines list.
Goatty friends
Here's goatty data again.
- "Goat","Before","After"
- "Aisha",17,16
- "Andreas",14,15
- "August",10,13
- "Bertha",17,23
- "Bessie",20,25
- "Boyd",13,12
- "Bridgette",16,22
- "Carrie",15,19
Let's say we only want to get average before and after scores for goats where After is more than Before. That would be lines 3, 4, 5, 6, 8, and 9. We'd ignore lines 2 and 7.
I'd want to add a new subset process:
Add a subset process

Ray
Oh, I get it. You plug together the data machines you need. So, if you're writing a program to analyze an entire data set, without splitting out any subsets, you wouldn't include the last step.
Exactly!
I'd want to do something like this:
- # Read goat scores from CSV file.
- raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
- # Filter out bad records.
- cleaned_goat_scores, goats_bad_records = clean_goat_scores(raw_goat_scores)
- # Get records where after is more than before.
- score_gain_records = filter_score_gain_records(cleaned_goat_scores)
- # Analysis.
- total_before = 0
- total_after = 0
- for goat in score_gain_records:
- total_before += goat['Before']
- total_after += goat['After']
- print('Total before:', total_before)
- print('Total after:', total_after)
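The listing above stops at totals. To get to the averages mentioned earlier, one option is to divide each total by the number of records. This sketch uses a hand-typed stand-in for score_gain_records (the six sample goats whose After is more than Before), and assumes there's at least one record in the list.

```python
# Stand-in for the output of filter_score_gain_records:
# the six sample goats whose After score beats their Before score.
score_gain_records = [
    {'Goat': 'Andreas', 'Before': 14.0, 'After': 15.0},
    {'Goat': 'August', 'Before': 10.0, 'After': 13.0},
    {'Goat': 'Bertha', 'Before': 17.0, 'After': 23.0},
    {'Goat': 'Bessie', 'Before': 20.0, 'After': 25.0},
    {'Goat': 'Bridgette', 'Before': 16.0, 'After': 22.0},
    {'Goat': 'Carrie', 'Before': 15.0, 'After': 19.0},
]

total_before = 0
total_after = 0
for goat in score_gain_records:
    total_before += goat['Before']
    total_after += goat['After']

# Assumes the list is not empty.
record_count = len(score_gain_records)
average_before = total_before / record_count
average_after = total_after / record_count
print('Average before:', round(average_before, 2))
print('Average after:', round(average_after, 2))
```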
Top-down again
I started by deciding what the new function should do. A good way to do that is to write code that calls the function.
Splitty split

Adela
This is gonna be like the other functions, right? Make a new list. Then loop over the data set, have an if to pick the records you want, then add them to the new list.
Right! That's a common pattern.
Loop over a list of dictionaries
You have a data set. Each record is a dictionary. All the records are in a list. Use a for loop to run through each record in the list.
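As a small sketch of that pattern, here are two of the sample goat records typed in by hand. The names list is only there to show the loop visiting each record in turn.

```python
# Two sample records from the goat data above.
goat_records = [
    {'Goat': 'Aisha', 'Before': 17, 'After': 16},
    {'Goat': 'Andreas', 'Before': 14, 'After': 15},
]

names = []
for record in goat_records:
    # record is one dictionary; use a key to get a field.
    names.append(record['Goat'])
    print(record['Goat'], record['Before'], record['After'])
```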
Finish this function.
- def filter_score_gain_records(goat_records):
- '''
- Filter records where after is more than before.
- Parameters
- ----------
- goat_records : list of dictionaries
- Valid goatty records.
- Returns
- -------
- score_gain_records : list of dictionaries
- Records where after > before.
- '''
- # Create a new list.
- # Loop over records.
- # Is after > before?
- # Add to new list.
- # Return the new list.
You can have more than one Python line per comment.

Ray
Here's what my coffee and I got.
- def filter_score_gain_records(goat_records):
- '''
- Filter records where after is more than before.
- Parameters
- ----------
- goat_records : list of dictionaries
- Valid goatty records.
- Returns
- -------
- score_gain_records : list of dictionaries
- Records where after > before.
- '''
- # Create a new list.
- score_gain_records = []
- # Loop over records.
- for record in goat_records:
- # Is after > before?
- before = record['Before']
- after = record['After']
- if after > before:
- # Add to new list.
- score_gain_records.append(record)
- # Return the new list.
- return score_gain_records
Nice work!

Ethan
Yes, good job, dude!
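To check the function, you could run it on the eight sample goats from earlier. This is a sketch with the records typed in by hand (in the real program they'd come from the cleaning step), and with the function's comments trimmed.

```python
def filter_score_gain_records(goat_records):
    # Same logic as Ray's version, with the comments trimmed.
    score_gain_records = []
    for record in goat_records:
        if record['After'] > record['Before']:
            score_gain_records.append(record)
    return score_gain_records


# The eight sample goats from earlier in the lesson.
sample_records = [
    {'Goat': 'Aisha', 'Before': 17.0, 'After': 16.0},
    {'Goat': 'Andreas', 'Before': 14.0, 'After': 15.0},
    {'Goat': 'August', 'Before': 10.0, 'After': 13.0},
    {'Goat': 'Bertha', 'Before': 17.0, 'After': 23.0},
    {'Goat': 'Bessie', 'Before': 20.0, 'After': 25.0},
    {'Goat': 'Boyd', 'Before': 13.0, 'After': 12.0},
    {'Goat': 'Bridgette', 'Before': 16.0, 'After': 22.0},
    {'Goat': 'Carrie', 'Before': 15.0, 'After': 19.0},
]

gainers = filter_score_gain_records(sample_records)
gainer_names = [record['Goat'] for record in gainers]
print(gainer_names)
```

Aisha and Boyd should be missing from the result, since their After scores are lower than their Before scores.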
More than one group
We just made a subset of the data where After is more than Before. What if you want more than one subset?
Here's part of another data set you can download:
- "Neighborhood","Goat","Houses visited","Candy collected"
- "Guernsey Glen","Azucena",5,307
- "Savanna Shores","Dawne",7,242
- "Pygora Park","Rozanne",6,262
- "Pygora Park","Elenor",7,288
- "Angora Acres","Wava",9,236
- "Pygora Park","Fredric",8,312
- "Pygora Park","Clarinda",8,312
- "Pygora Park","Rosamond",9,230
It shows how much candy goats collected in various neighborhoods last Halloween. Candy is in grams.
There are four neighborhoods:
- Angora Acres
- Guernsey Glen
- Pygora Park
- Savanna Shores
Say you want to find the average loot in each neighborhood. You want output like this:
- Average Halloween loot by neighborhood
- ======= ========= ==== == ============
- Angora Acres: 211.4
- Guernsey Glen: 320.2
- Pygora Park: 254.2
- Savanna Shores: 226.8
How would you do it?

Ethan
Can we start with one of those diagrams? Showing the different data machines?
Sure.
Halloween machines (good band name?)

Ethan
The first part we know how to do. We'd want to be able to write this:
- # Read the raw data.
- halloween_data_set = read_csv_data_set('halloween-candy.csv')
- # Keep only valid records.
- clean_halloween_data_set = clean_goat_records(halloween_data_set)

Ethan
read_csv_data_set we just copy-and-paste.
For clean_goat_records, we can change what we did last time.
Here's code from before:
- def clean_goat_scores(raw_goat_scores):
- # Create a new list for the clean records.
- clean_scores = []
- # Loop over raw records.
- for raw_record in raw_goat_scores:
- # Is the record OK?
- if is_record_ok(raw_record):
- # Yes, make a new record with the right data types.
- clean_record = {
- 'Goat': raw_record['Goat'],
- 'Before': float(raw_record['Before']),
- 'After': float(raw_record['After'])
- }
- # Add the new record to the clean list.
- clean_scores.append(clean_record)
- # Send the cleaned list back.
- return clean_scores
Write clean_goat_records for the Halloween data, to be called like this:
clean_halloween_data_set = clean_goat_records(halloween_data_set)

Adela
Here's what I got.
[Side-by-side code listings: old function | new function]

Adela
Almost the same. The old one has three fields, and converts numbers to floats (lines 9 to 13). The new one has four fields, and makes the numbers ints (lines 9 to 14).
Good job!
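Here's a rough sketch of what the new function could look like, going only by Adela's description: four fields, with the two numbers converted to ints. It is not her actual listing, so the line numbers she mentions won't match. The is_record_ok in the sketch is a throwaway placeholder so the code runs by itself; it is not the real validity check.

```python
def is_record_ok(record):
    # Placeholder check so this sketch runs by itself.
    # It is NOT the lesson's real validity rules.
    return record['Goat'].strip() != ''


def clean_goat_records(raw_records):
    # Create a new list for the clean records.
    clean_records = []
    # Loop over raw records.
    for raw_record in raw_records:
        # Is the record OK?
        if is_record_ok(raw_record):
            # Yes, make a new record with the right data types.
            clean_record = {
                'Neighborhood': raw_record['Neighborhood'],
                'Goat': raw_record['Goat'],
                'Houses visited': int(raw_record['Houses visited']),
                'Candy collected': int(raw_record['Candy collected'])
            }
            clean_records.append(clean_record)
    # Send the cleaned list back.
    return clean_records


# Two raw records, as strings. The second has no goat name.
raw = [
    {'Neighborhood': 'Guernsey Glen', 'Goat': 'Azucena',
     'Houses visited': '5', 'Candy collected': '307'},
    {'Neighborhood': 'Pygora Park', 'Goat': '',
     'Houses visited': '6', 'Candy collected': '262'},
]
clean = clean_goat_records(raw)
print(clean)
```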
Both functions call is_record_ok (line 7 in both functions): if is_record_ok(goat_record):
The function takes a dictionary (with one record), and returns True if the data in the record is valid.
Halloween records have different validity rules from the other data. Here they are for each field:
- Neighborhood: One of Angora Acres, Guernsey Glen, Pygora Park, or Savanna Shores. As usual, case doesn't matter, and there can be extra leading and trailing spaces.
- Goat: name can't be MT.
- Houses: int from 0 to 30.
- Candy: int from 0 to 500.
Here's what we had for the before/after thing.
- def is_record_ok(record):
- # Check name.
- goat_name = record['Goat']
- if goat_name == '' or goat_name == None:
- return False
- # Check Before value.
- before = record['Before']
- if not is_score_ok(before):
- return False
- # Check After value.
- after = record['After']
- if not is_score_ok(after):
- return False
- return True
Adela, you wanna try this one?

Adela
Sure. Basically the code is this:
- Check a field. Return False if it's bad.
- Check another field. Return False if it's bad.
- Check another field. Return False if it's bad.
- (Repeat for as many fields as we have.)
- return True.

Adela
Here's what I came up with.
- def is_record_ok(record):
- # Check neighborhood.
- neighborhood = record['Neighborhood'].strip().lower()
- neighborhood_ok = \
- neighborhood == 'angora acres' \
- or neighborhood == 'guernsey glen' \
- or neighborhood == 'pygora park' \
- or neighborhood == 'savanna shores'
- if not neighborhood_ok:
- return False
- # Check name.
- goat_name = record['Goat'].strip().lower()
- if goat_name == '' or goat_name == None:
- return False
- # Check houses visited value.
- houses = record['Houses visited']
- if not is_count_ok(houses, 30):
- return False
- # Check candy value.
- candy = record['Candy collected']
- if not is_count_ok(candy, 500):
- return False
- return True
Wow, that's great!
You can see the structure of the function from the comments.
- def is_record_ok(record):
- # Check neighborhood.
- # Check name.
- # Check houses visited value.
- # Check candy value.
Here's the first one:
- # Check neighborhood.
- neighborhood = record['Neighborhood'].strip().lower()
- neighborhood_ok = \
- neighborhood == 'angora acres' \
- or neighborhood == 'guernsey glen' \
- or neighborhood == 'pygora park' \
- or neighborhood == 'savanna shores'
- if not neighborhood_ok:
- return False
neighborhood_ok is a boolean. It'll be True if neighborhood is any one of the valid values. If neighborhood is, say, bikini bottom, all the ==s will be False, so neighborhood_ok will be False.
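Another way to write the same check, if you like, is to put the valid names in a collection and use Python's in operator. This is a sketch of an alternative, not what Adela wrote. The names VALID_NEIGHBORHOODS and is_neighborhood_ok are made up for the example.

```python
# Valid neighborhood names, already normalized to lowercase.
VALID_NEIGHBORHOODS = {
    'angora acres', 'guernsey glen', 'pygora park', 'savanna shores'
}


def is_neighborhood_ok(raw_name):
    # Hypothetical helper. Normalizes the name the same way
    # Adela's code does, then checks membership.
    neighborhood = raw_name.strip().lower()
    return neighborhood in VALID_NEIGHBORHOODS


print(is_neighborhood_ok('  Pygora PARK '))
print(is_neighborhood_ok('bikini bottom'))
```

The result is the same as chaining the == tests with or. Adding a fifth neighborhood would mean adding one string to the set.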
The rule for the next field, name, is:
- Goat: name can't be MT.
Here's how Adela did it:
- # Check name.
- goat_name = record['Goat'].strip().lower()
- if goat_name == '' or goat_name == None:
- return False
That's the first two fields.
Here are the rules for the last two fields:
- Houses: int from 0 to 30.
- Candy: int from 0 to 500.
The rules are almost the same.
When two rule sets are almost the same, you should consider ________.

Georgina
Make one function to handle both.
Right!
The function is going to have some params. They should be the things that are different between the different calls. Here are the rules again:
- Houses: int from 0 to 30.
- Candy: int from 0 to 500.
The only differences are the value being checked (houses or candy), and the maximum value (30 or 500). Those become the params passed to the new function. (You can pass other things, too, if you like.)
Here are the calls Adela wrote:
- # Check houses visited value.
- houses = record['Houses visited']
- if not is_count_ok(houses, 30):
- return False
- # Check candy value.
- candy = record['Candy collected']
- if not is_count_ok(candy, 500):
- return False
The params match what's different between the way the functions are called.
Note
Writing out how a function is called will help you work out what the function should do.
Ethan, wanna try is_count_ok?

Ethan
No problem. The first line will be:
- def is_count_ok(count_in, max):

Ethan
The structure will be:
- Check a rule. If it's violated, return False.
- Check another rule. If it's violated, return False.
- ...
- return True
Complete the function, starting with:
def is_count_ok(count_in, max):

Ethan
The rules are like: int from 0 to something.
We did checking for int and checking range before. Here's my code:
- def is_count_ok(count_in, max):
- # Is it an int?
- try:
- count = int(count_in)
- except Exception:
- # Not an int.
- return False
- # Check range.
- if count < 0 or count > max:
- return False
- # All OK.
- return True
Hey, good work!
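A quick way to build confidence in a checker like this is to throw a handful of values at it, including the edges of the range. The sketch below copies Ethan's logic, but renames the second parameter to max_count (a choice made for this sketch), so it doesn't hide Python's built-in max function. The test values are made up.

```python
def is_count_ok(count_in, max_count):
    # Same logic as Ethan's function. The second parameter is renamed
    # in this sketch so it doesn't hide Python's built-in max().
    # Is it an int?
    try:
        count = int(count_in)
    except Exception:
        # Not an int.
        return False
    # Check range.
    if count < 0 or count > max_count:
        return False
    # All OK.
    return True


# Each entry: (value, maximum, expected result).
checks = [
    (5, 30, True),
    ('7', 30, True),
    (0, 30, True),
    (30, 30, True),
    (31, 30, False),
    (-1, 500, False),
    ('lots', 500, False),
    (None, 500, False),
]

results = []
for value, maximum, expected in checks:
    result = is_count_ok(value, maximum)
    results.append(result)
    print(repr(value), maximum, result, 'expected', expected)
```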
Where are we?
It's easy to get lost when you're writing a moderately complex program like this.

Ray
True dat!
It helps to take a step back, and look at the structure of your work.
We're writing a program that takes data like this...
- "Neighborhood","Goat","Houses visited","Candy collected"
- "Guernsey Glen","Azucena",5,307
- "Savanna Shores","Dawne",7,242
- "Pygora Park","Rozanne",6,262
- "Pygora Park","Elenor",7,288
... and outputs average candy collected for each neighborhood.
We started with:
- # Read the raw data.
- halloween_data_set = read_csv_data_set('halloween-candy.csv')
- # Keep only valid records.
- clean_halloween_data_set = clean_goat_records(halloween_data_set)
Next, we wrote clean_goat_records. It's similar to what we've done before.
Splitting
We want...
- Average Halloween loot by neighborhood
- ======= ========= ==== == ============
- Angora Acres: 211.4
- Guernsey Glen: 320.2
- Pygora Park: 254.2
- Savanna Shores: 226.8
There are several ways to do this. Let's use one of the simpler approaches. Here's what we'll have in the main program:
- # Break out each neighborhood's data.
- angora_data_set = get_neighborhood_records('Angora Acres', clean_halloween_data_set)
- guernsey_data_set = get_neighborhood_records('Guernsey Glen', clean_halloween_data_set)
- pygora_data_set = get_neighborhood_records('Pygora Park', clean_halloween_data_set)
- savanna_data_set = get_neighborhood_records('Savanna Shores', clean_halloween_data_set)
Explain in your own words what get_neighborhood_records will do.

Ethan
It goes through a data set, and returns records for one neighborhood.
Indeed.
Finish this code.
- def get_neighborhood_records(neighborhood_name_to_find, data_set):
- # Normalize name to find.
- neighborhood_name_to_find = neighborhood_name_to_find.strip().lower()
- neighborhood_records = []
- Stuff here
- return neighborhood_records

Adela
It's gotta loop over the records. For each one, check whether it's in the neighborhood we're looking for. If it is, add it to neighborhood_records.
- def get_neighborhood_records(neighborhood_name_to_find, data_set):
- # Normalize name to find.
- neighborhood_name_to_find = neighborhood_name_to_find.strip().lower()
- neighborhood_records = []
- # Loop over records.
- for record in data_set:
- # Get normalized name for current record.
- neighborhood_name_in_record = record['Neighborhood']
- neighborhood_name_in_record = neighborhood_name_in_record.strip().lower()
- # Is it the one we want?
- if neighborhood_name_in_record == neighborhood_name_to_find:
- # Aye.
- neighborhood_records.append(record)
- return neighborhood_records
Good work!
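Here's a sketch that tries Adela's function on the eight sample rows shown earlier, typed in by hand as already-cleaned records. It also passes a messy version of the name, and a name that isn't in the data, to see what comes back.

```python
def get_neighborhood_records(neighborhood_name_to_find, data_set):
    # Normalize name to find.
    neighborhood_name_to_find = neighborhood_name_to_find.strip().lower()
    neighborhood_records = []
    # Loop over records.
    for record in data_set:
        # Get normalized name for current record.
        name_in_record = record['Neighborhood'].strip().lower()
        # Is it the one we want?
        if name_in_record == neighborhood_name_to_find:
            neighborhood_records.append(record)
    return neighborhood_records


# The eight sample rows from the Halloween data.
sample = [
    {'Neighborhood': 'Guernsey Glen', 'Goat': 'Azucena', 'Houses visited': 5, 'Candy collected': 307},
    {'Neighborhood': 'Savanna Shores', 'Goat': 'Dawne', 'Houses visited': 7, 'Candy collected': 242},
    {'Neighborhood': 'Pygora Park', 'Goat': 'Rozanne', 'Houses visited': 6, 'Candy collected': 262},
    {'Neighborhood': 'Pygora Park', 'Goat': 'Elenor', 'Houses visited': 7, 'Candy collected': 288},
    {'Neighborhood': 'Angora Acres', 'Goat': 'Wava', 'Houses visited': 9, 'Candy collected': 236},
    {'Neighborhood': 'Pygora Park', 'Goat': 'Fredric', 'Houses visited': 8, 'Candy collected': 312},
    {'Neighborhood': 'Pygora Park', 'Goat': 'Clarinda', 'Houses visited': 8, 'Candy collected': 312},
    {'Neighborhood': 'Pygora Park', 'Goat': 'Rosamond', 'Houses visited': 9, 'Candy collected': 230},
]

pygora_records = get_neighborhood_records('Pygora Park', sample)
messy_records = get_neighborhood_records('  pygora PARK ', sample)
none_records = get_neighborhood_records('Bikini Bottom', sample)
print(len(pygora_records), len(messy_records), len(none_records))
```

A name with no matching records gives back an empty list, not an error.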
Averages
Here's the main program again:
- # Break out each neighborhood's data.
- angora_data_set = get_neighborhood_records('Angora Acres', clean_halloween_data_set)
- guernsey_data_set = get_neighborhood_records('Guernsey Glen', clean_halloween_data_set)
- pygora_data_set = get_neighborhood_records('Pygora Park', clean_halloween_data_set)
- savanna_data_set = get_neighborhood_records('Savanna Shores', clean_halloween_data_set)
- # Compute stats.
- angora_average = compute_average_candy(angora_data_set)
- guernsey_average = compute_average_candy(guernsey_data_set)
- pygora_average = compute_average_candy(pygora_data_set)
- savanna_average = compute_average_candy(savanna_data_set)
We can add a new type of machine:
Averaging machine
This one doesn't return another data set. It returns a single value.
Remember, there's a data machines list.
What does compute_average_candy do?

Ethan
It works out the average candy for one neighborhood.
Close, but not quite, though it seems right. Thing is, compute_average_candy doesn't test for a neighborhood. The neighborhood doesn't get passed in to the function.
This call...
- angora_average = compute_average_candy(angora_data_set)
... does compute the average of a neighborhood, but not because compute_average_candy knows about the neighborhood. It doesn't.

Ethan
Huh... Oh! I think I see it.
compute_average_candy works out the candy average for a data set, whatever the data set is. angora_data_set only contains data for that one neighborhood. That's all the data compute_average_candy gets for the call compute_average_candy(angora_data_set).
We made a different data set for each neighborhood. But compute_average_candy doesn't know that. It just takes a data set, whatever it is, and works out the average.
Correct.
Please finish this code:
- def compute_average_candy(data_set):
- Stuff here
- return average

Georgina
Here's what I came up with:
- def compute_average_candy(data_set):
- total = 0
- # Loop over data set.
- for record in data_set:
- # Add to total.
- total += record['Candy collected']
- # Divide total by number of elements.
- average = total / len(data_set)
- return average
Good!
But...
Dataless

Adela
I'm not sure if this is a problem, but what would happen if the data set is MT?
Ooo, good question! That's called an edge case.
What would this code return if the data set was MT?
- def compute_average_candy(data_set):
- total = 0
- # Loop over data set.
- for record in data_set:
- # Add to total.
- total += record['Candy collected']
- # Divide total by number of elements.
- average = total / len(data_set)
- return average

Georgina
Hmm, let's see. Zero, maybe?
Let me walk it through in my head.
total is zero in line 2.
The loop (lines 4 to 6) wouldn't do anything. Python would skip it, since there are no records in the data set.
Line 8 would divide total by the number of elements in the list, zero in this case.
Oops...
Run 0/0 in the console. What happens?

Ray
I got an error: ZeroDivisionError: division by zero
Aye. Just as in normal math, in Python dividing by 0 is not allowed.
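Here's a small sketch showing the same error coming from an empty list, the way it would inside the averaging function. It catches the error with try/except only so the demo can print the message and keep going. Setting average to None here is just for the demo, not a recommendation.

```python
data_set = []  # An empty ("MT") data set.
total = 0

try:
    # len(data_set) is 0, so this is 0 / 0.
    average = total / len(data_set)
except ZeroDivisionError as error:
    average = None
    print('Could not compute average:', error)
```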

Adela
If data_set was MT, that means a neighborhood had no data. Nobody collected candy there. I suppose that could happen.
But if it does, what should we tell the user?

Ray
We could show an error message. Like, in the main program, maybe.
- # Break out each neighborhood's data.
- angora_data_set = get_neighborhood_records('Angora Acres', clean_halloween_data_set)
- ...
- # Compute stats.
- if len(angora_data_set) > 0:
- angora_average = compute_average_candy(angora_data_set)
- ...
- # Output.
- if len(angora_data_set) == 0:
- print('No data for Angora Acres.')
- else:
- angora_average = compute_average_candy(angora_data_set)
- print('Angora Acres: ', angora_average)
Aye, you could do that.

Georgina
Or return zero from compute_average_candy:
- def compute_average_candy(data_set):
- # Check there is data.
- if len(data_set) == 0:
- return 0
- total = 0
- # Loop over data set.
- ...
Yes, that would work, too.

Adela
Kieran, if you were an analyst, what would you do?
I'd ask the people who were going to use the program what they wanted.
If you're not sure what a program should do in special cases, ask the people who are going to use it.
This is for the make-it-zero approach:
- def compute_average_candy(data_set):
- # Check there is data.
- if len(data_set) == 0:
- return 0
- total = 0
- # Loop over data set.
- for record in data_set:
- # Add to total.
- total += record['Candy collected']
- # Divide total by number of elements.
- average = total / len(data_set)
- return average
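To see the make-it-zero version in action, here's a sketch that calls it twice: once with three of the sample Pygora Park records (typed in by hand, with only the fields the function needs), and once with an empty list.

```python
def compute_average_candy(data_set):
    # Make-it-zero approach: no data means an average of 0.
    if len(data_set) == 0:
        return 0
    total = 0
    # Loop over data set.
    for record in data_set:
        total += record['Candy collected']
    # Divide total by number of elements.
    return total / len(data_set)


# Three of the sample Pygora Park records.
pygora_sample = [
    {'Goat': 'Rozanne', 'Candy collected': 262},
    {'Goat': 'Elenor', 'Candy collected': 288},
    {'Goat': 'Fredric', 'Candy collected': 312},
]

sample_average = compute_average_candy(pygora_sample)
empty_average = compute_average_candy([])
print('Pygora sample:', round(sample_average, 1))
print('No data:', empty_average)
```

One thing to weigh: a returned 0 looks the same as a neighborhood where every goat really collected 0 grams, so the caller can't tell the two apart. That's one more reason to ask the users what they want.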
A new pattern
Write a function taking a data set as a param, and returning another data set with a subset of the original records, based on criteria you choose.
When you start a new project, you can use the pattern catalog to remind yourself of useful chunks of code.
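If you find yourself writing many subset functions that differ only in the test inside the if, one optional variation is to pass the test in as a function. This is a sketch of that idea, not something the lesson requires. The names get_subset, keep_record, and visited_many_houses are made up, and so is the 8-houses criterion.

```python
def get_subset(data_set, keep_record):
    # keep_record is a function that takes one record and
    # returns True if the record belongs in the subset.
    subset = []
    for record in data_set:
        if keep_record(record):
            subset.append(record)
    return subset


def visited_many_houses(record):
    # Example criterion (made up): 8 or more houses.
    return record['Houses visited'] >= 8


# Four of the sample Halloween records.
sample = [
    {'Goat': 'Azucena', 'Houses visited': 5},
    {'Goat': 'Dawne', 'Houses visited': 7},
    {'Goat': 'Wava', 'Houses visited': 9},
    {'Goat': 'Fredric', 'Houses visited': 8},
]

busy_goats = get_subset(sample, visited_many_houses)
busy_names = [record['Goat'] for record in busy_goats]
print(busy_names)
```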
Summary
- It's common in business analytics to look at subsets of data.
- We added a function to the data pipeline to loop over records, making a new list of records that meet criteria, like being about a particular neighborhood.
- When two rule sets are almost the same, you should consider making one function to handle both.
- Use ifs to handle edge cases, like trying to find the average of no data.
- If you're not sure what a program should do in edge cases, ask the people who are going to use it.
Exercise
Compare cuteness
Cthulhu has data on goat cuteness. You can download it. Here's some of the data:
- "Name","Cuteness"
- "Lino",10
- "Johnnie",5
- "August",9
- "Leticia",5
- "Andreas",10
- "Rhonda",10
Write a program to compute the average cuteness of goats whose names begin with A-L, and goats whose names begin with M-Z. Do this for valid records only. Validity rules:
- Goat name must be present.
- Cuteness ratings are ints from 0 to 10.
Output:
- First half average cuteness: 7.0
- Second half average cuteness: 7.1
Write a subset function that works for creating both subsets.
Hints:
- thing[0] is the first character of the string thing.
- a < b < c < d ... < l < m ... y < z.
- def get_records_by_name(data_set, low_letter, high_letter):
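A tiny sketch of the first two hints, using one name from the sample data. Lowercasing the first letter is an assumption on my part about how you'd want to compare. In Python, uppercase letters sort before lowercase ones, so 'L' < 'a' is True, and mixing cases can give surprising results.

```python
name = 'Lino'
# First character, normalized to lowercase.
first_letter = name[0].lower()
print(first_letter)

# Letters compare in alphabetical order when the case matches.
in_first_half = 'a' <= first_letter <= 'l'
in_second_half = 'm' <= first_letter <= 'z'
print(in_first_half, in_second_half)

# Uppercase letters sort before lowercase ones.
print('L' < 'a')
```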
Upload a zip of your project folder. The usual coding standards apply.