Record subsets

It's common in business analytics to look at subsets of data. For example, say we have a CSV file with data on sales for three regions (say north, south, and underground). The boss wants us to analyze data on sales in the southern region only. We make a subset of the original data.

You can do this with a subset function. It's a function that takes a data set (and other params), and returns another data set with only some of the records you sent it.

New for the pipeline

Let's add the component to our collection.

New component

Subset data machine ​

There's a data machines list.

Goatty friends

Here's goatty data again.

  1. "Goat","Before","After"
  2. "Aisha",17,16
  3. "Andreas",14,15
  4. "August",10,13
  5. "Bertha",17,23
  6. "Bessie",20,25
  7. "Boyd",13,12
  8. "Bridgette",16,22
  9. "Carrie",15,19

Let's say we only want to get average before and after scores for goats where After is more than Before. That would be lines 3, 4, 5, 6, 8, and 9. We'd ignore lines 2 and 7.

I'd want to add a new subset process:

Subset process

​Add a subset process

Ray
Ray

Oh, I get it. You plug together the data machines you need. So, if you're writing a program to analyze an entire data set, without splitting out any subsets, you wouldn't include the last step.

Exactly!

I'd want to do something like this:

  • # Read goat scores from CSV file.
  • raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
  • # Filter out bad records.
  • cleaned_goat_scores, goats_bad_records = clean_goat_scores(raw_goat_scores)
  • # Get records where after is more than before.
  • score_gain_records = filter_score_gain_records(cleaned_goat_scores)
  • # Analysis.
  • total_before = 0
  • total_after = 0
  • for goat in score_gain_records:
  •     total_before += goat['Before']
  •     total_after += goat['After']
  • print('Total before:', total_before)
  • print('Total after:', total_after)

Top-down again

I started by deciding what the new function should do. A good way to do that is to write code that calls the function.

Splitty split

Adela
Adela

This is gonna be like the other functions, right? Make a new list. Then loop over the data set, have an if to pick the records you want, then add them to the new list.

Right! That's a common pattern.

Pattern

Loop over a list of dictionaries

You have a data set. Each record is a dictionary. All the records are in a list. Use a for loop to run through each record in the list.

Reflect

Finish this function.

  • def filter_score_gain_records(goat_records):
  •     '''
  •     Filter records where after is more than before.
  •  
  •     Parameters
  •     ----------
  •     goat_records : list of dictionaries
  •         Valid goatty records.
  •  
  •     Returns
  •     -------
  •     score_gain_records : list of dictionaries
  •         Records where after > before.
  •  
  •     '''
  •     # Create a new list.
  •     # Loop over records.
  •         # Is after > before?
  •             # Add to new list.
  •     # Return the new list.

You can have more than one Python line per comment.

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.
Ray
Ray

Here's what my coffee and I got.

  • def filter_score_gain_records(goat_records):
  •     '''
  •     Filter records where after is more than before.
  •  
  •     Parameters
  •     ----------
  •     goat_records : list of dictionaries
  •         Valid goatty records.
  •  
  •     Returns
  •     -------
  •     score_gain_records : list of dictionaries
  •         Records where after > before.
  •  
  •     '''
  •     # Create a new list.
  •     score_gain_records = []
  •     # Loop over records.
  •     for record in goat_records:
  •         # Is after > before?
  •         before = record['Before']
  •         after = record['After']
  •         if after > before:
  •             # Add to new list.
  •             score_gain_records.append(record)
  •     # Return the new list.
  •     return score_gain_records

Nice work!

Ethan
Ethan

Yes, good job, dude!

More than one group

We just made a subset of the data where After is more than Before. What if you want more than one subset?

Here's part of another data set you can download:

  • "Neighborhood","Goat","Houses visited","Candy collected"
  • "Guernsey Glen","Azucena",5,307
  • "Savanna Shores","Dawne",7,242
  • "Pygora Park","Rozanne",6,262
  • "Pygora Park","Elenor",7,288
  • "Angora Acres","Wava",9,236
  • "Pygora Park","Fredric",8,312
  • "Pygora Park","Clarinda",8,312
  • "Pygora Park","Rosamond",9,230

It shows how much candy goats collected in various neighborhoods last Halloween. Candy is in grams.

There are four neighborhoods:

  • Angora Acres
  • Guernsey Glen
  • Pygora Park
  • Savanna Shores

Say you want to find the average loot in each neighborhood. You want output like this:

  • Average Halloween loot by neighborhood
  • ======= ========= ==== == ============
  • Angora Acres: 211.4
  • Guernsey Glen: 320.2
  • Pygora Park: 254.2
  • Savanna Shores: 226.8

How would you do it?

Ethan
Ethan

Can we start with one of those diagrams? Showing the different data machines?

Sure.

​Halloween machines

​Halloween machines (good band name?)

Ethan
Ethan

The first part we know how to do. We'd want to be able to write this:

  • # Read the raw data.
  • halloween_data_set = read_csv_data_set('halloween-candy.csv')
  • # Keep only valid records.
  • clean_halloween_data_set = clean_goat_records(halloween_data_set)
Ethan
Ethan

read_csv_data_set we just copy-and-paste.

For clean_goat_records, we can change what we did last time.

Here's code from before:

  • def clean_goat_scores(raw_goat_scores):
  •     # Create a new list for the clean records.
  •     clean_scores = []
  •     # Loop over raw records.
  •     for raw_record in raw_goat_scores:
  •         # Is the record OK?
  •         if is_record_ok(raw_record):
  •             # Yes, make a new record with the right data types.
  •             clean_record = {
  •                 'Goat': raw_record['Goat'],
  •                 'Before': float(raw_record['Before']),
  •                 'After': float(raw_record['After'])
  •             }
  •             # Add the new record to the clean list.
  •             clean_scores.append(clean_record)
  •     # Send the cleaned list back.
  •     return clean_scores
Reflect

Write clean_goat_records for the halloween data, to be called like this:

clean_halloween_data_set = clean_goat_records(halloween_data_set)

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.
Adela
Adela

Here's what I got.

Old New
  1. def clean_goat_scores(raw_goat_scores):
  2.     # Create a new list for the clean records.
  3.     clean_scores = []
  4.     # Loop over raw records.
  5.     for raw_record in raw_goat_scores:
  6.         # Is the record OK?
  7.         if is_record_ok(raw_record):
  8.             # Yes, make a new record with the right data types.
  9.             clean_record = {
  10.                 'Goat': raw_record['Goat'],
  11.                 'Before': float(raw_record['Before']),
  12.                 'After': float(raw_record['After'])
  13.             }
  14.             # Add the new record to the clean list.
  15.             clean_scores.append(clean_record)
  16.     # Send the cleaned list back.
  17.     return clean_scores

  1. def clean_goat_records(raw_goat_data_set):
  2.     # Create a new list for the clean records.
  3.     clean_records = []
  4.     # Loop over raw records.
  5.     for goat_record in raw_goat_data_set:
  6.         # Is the record OK?
  7.         if is_record_ok(goat_record):
  8.             # Make new record with right data types.
  9.             new_record = {
  10.                 'Neighborhood': goat_record['Neighborhood'],
  11.                 'Goat': goat_record['Goat'],
  12.                 'Houses visited': int(goat_record['Houses visited']),
  13.                 'Candy collected': int(goat_record['Candy collected'])
  14.             }
  15.             # Add the new record to the clean list.
  16.             clean_records.append(new_record)
  17.     # Send the lists back.
  18.     return clean_records
Adela
Adela

Almost the same. The old one has three fields, and converts numbers to floats (lines 9 to 13). The new one has four fields, and makes the numbers ints (lines 9 to 14).

Good job!

Both functions call is_record_ok (line 7 in both functions): if is_record_ok(goat_record):

The function takes a dictionary (with one record), and returns True if the data in the record is valid.

Halloween records have different validity rules from the other data. Here they are for each field:

  • Neighborhood: One of Angora Acres, Guernsey Glen, Pygora Park, or Savanna Shores. As usual, case doesn't matter, and there can be extra learning and trailing spaces.
  • Goat: name can't be MT.
  • Houses: int from 0 to 30.
  • Candy: int from 0 to 500.

Here's what we had for the before/after thing.

  • def is_record_ok(record):
  •     # Check name.
  •     goat_name = record['Goat']
  •     if goat_name == '' or goat_name == None:
  •         return False
  •     # Check Before value.
  •     before = record['Before']
  •     if not is_score_ok(before):
  •         return False
  •     # Check After value.
  •     after = record['After']
  •     if not is_score_ok(after):
  •         return False
  •     return True

Adela, you wanna try this one?

Adela
Adela

Sure. Basically the code is this:

  • Check a field. Return False if it's bad.
  • Check another field. Return False if it's bad.
  • Check another field. Return False if it's bad.
  • (Repeat for as many fields as we have.)
  • return True.
Adela
Adela

Here's what I came up with.

  1. def is_record_ok(record):
  2.     # Check neighborhood.
  3.     neighborhood = record['Neighborhood'].strip().lower()
  4.     neighborhood_ok = \
  5.         neighborhood == 'angora acres' \
  6.         or neighborhood == 'guernsey glen' \
  7.         or neighborhood == 'pygora park' \
  8.         or neighborhood == 'savanna shores'
  9.     if not neighborhood_ok:
  10.         return False
  11.     # Check name.
  12.     goat_name = record['Goat'].strip().lower()
  13.     if goat_name == '' or goat_name == None:
  14.         return False
  15.     # Check houses visited value.
  16.     houses = record['Houses visited']
  17.     if not is_count_ok(houses, 30):
  18.         return False
  19.     # Check candy value.
  20.     candy = record['Candy collected']
  21.     if not is_count_ok(candy, 500):
  22.         return False
  23.     return True

Wow, that's great!

You can see the structure of the function from the comments.

  • def is_record_ok(record):
  •     # Check neighborhood.
  •     # Check name.
  •     # Check houses visited value.
  •     # Check candy value.

Here's the first one:

  •     # Check neighborhood.
  •     neighborhood = record['Neighborhood'].strip().lower()
  •     neighborhood_ok = \
  •         neighborhood == 'angora acres' \
  •         or neighborhood == 'guernsey glen' \
  •         or neighborhood == 'pygora park' \
  •         or neighborhood == 'savanna shores'
  •     if not neighborhood_ok:
  •         return False

neighborhood_ok is a boolean. It'll be True if neighborhood is any one of the valid values. If neighborhood_ok is, say, bikini bottom, all the ==s will be False, so neighborhood_ok will be False.

The rule for the next field, name, is:

  • Goat: name can't be MT.

Here's how Adela did it:

  •     # Check name.
  •     goat_name = record['Goat'].strip().lower()
  •     if goat_name == '' or goat_name == None:
  •         return False

That's the first two fields.

Here are the rules for the last two fields:

  • Houses: int from 0 to 30.
  • Candy: int from 0 to 500.

The rules are almost the same.

Reflect

When two rule sets are almost the same, you should consider ________.

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.
Georgina
Georgina

Make one function to handle both.

Right!

The function is going to have some params. They should be the things that are different between the different calls. Here are the rules again:

  • Houses: int from 0 to 30.
  • Candy: int from 0 to 500.

The only differences are the value being checked (houses or candy), and the maximum value (30 or 500). Those become the params passed to the new function. (You can pass other things, too, if you like.)

Here are the calls Adela wrote:

  1.     # Check houses visited value.
  2.     houses = record['Houses visited']
  3.     if not is_count_ok(houses, 30):
  4.         return False
  5.     # Check candy value.
  6.     candy = record['Candy collected']
  7.     if not is_count_ok(candy, 500):
  8.         return False

The params match what's different between the way the functions are called.

Note

Writing out how a function is called will help you work out what the function should do.

Ethan, wanna try is_count_ok?

Ethan
Ethan

No problem. The first line will be:

  • def is_count_ok(count_in, max):
Ethan
Ethan

The structure will be:

  • Check a rule. If it's violated, return False.
  • Check another rule. If it's violated, return False.
  • ...
  • return True
Reflect

Complete the function, starting with:

def is_count_ok(count_in, max):

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.
Ethan
Ethan

The rules are like: int from 0 to something.

We did checking for int and checking range before. Here's my code:

  • def is_count_ok(count_in, max):
  •     # Is it an int?
  •     try:
  •         count = int(count_in)
  •     except Exception:
  •         # Not an int.
  •         return False
  •     # Check range.
  •     if count < 0 or count > max:
  •         return False
  •     # All OK.
  •     return True

Hey, good work!

Where are we?

It's easy to get lost when you're writing a moderately complex program like this.

Ray
Ray

True dat!

It helps to take a step back, and look at the structure of your work.

We're writing a program that takes data like this...

  • "Neighborhood","Goat","Houses visited","Candy collected"
  • "Guernsey Glen","Azucena",5,307
  • "Savanna Shores","Dawne",7,242
  • "Pygora Park","Rozanne",6,262
  • "Pygora Park","Elenor",7,288

... and outputs average candy collected for each neighborhood.

We started with:

  • # Read the raw data.
  • halloween_data_set = read_csv_data_set('halloween-candy.csv')
  • # Keep only valid records.
  • clean_halloween_data_set = clean_goat_records(halloween_data_set)

Next, we wrote clean_goat_records. It's similar to what we've done before.

Splitting

We want...

  • Average Halloween loot by neighborhood
  • ======= ========= ==== == ============
  • Angora Acres: 211.4
  • Guernsey Glen: 320.2
  • Pygora Park: 254.2
  • Savanna Shores: 226.8

There are several ways to do this. Let's use one of the simpler approaches. Here's what we'll have in the main program:

  • # Break out each neighborhood's data.
  • angora_data_set = get_neighborhood_records('Angora Acres', clean_halloween_data_set)
  • guernsey_data_set = get_neighborhood_records('Guernsey Glen', clean_halloween_data_set)
  • pygora_data_set = get_neighborhood_records('Pygora Park', clean_halloween_data_set)
  • savanna_data_set = get_neighborhood_records('Savanna Shores', clean_halloween_data_set)
Reflect

Explain in your own words what get_neighborhood_records will do.

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.
Ethan
Ethan

It goes through a data set, and returns records for one neighbornood.

Indeed.

Reflect

Finish this code.

  • def get_neighborhood_records(neighborhood_name_to_find, data_set):
  •     # Normalize name to find.
  •     neighborhood_name_to_find = neighborhood_name_to_find.strip().lower()
  •     neighborhood_records = []
  •     Stuff here
  •     return neighborhood_records
If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.
Adela
Adela

It's gotta loop over the records. For each one, check whether it's in the neighborhood we're looking for. If it is, add it to neighborhood_records.

  • def get_neighborhood_records(neighborhood_name_to_find, data_set):
  •     # Normalize name to find.
  •     neighborhood_name_to_find = neighborhood_name_to_find.strip().lower()
  •     neighborhood_records = []
  •     # Loop over records.
  •     for record in data_set:
  •         # Get normalized name for current record.
  •         neighborhood_name_in_record = record['Neighborhood']
  •         neighborhood_name_in_record = neighborhood_name_in_record.strip().lower()
  •         # Is it the one we want?
  •         if neighborhood_name_in_record == neighborhood_name_to_find:
  •             # Aye.
  •             neighborhood_records.append(record)
  •     return neighborhood_records

Good work!

Averages

Here's the main program again:

  • # Break out each neighborhood's data.
  • angora_data_set = get_neighborhood_records('Angora Acres', clean_halloween_data_set)
  • guernsey_data_set = get_neighborhood_records('Guernsey Glen', clean_halloween_data_set)
  • pygora_data_set = get_neighborhood_records('Pygora Park', clean_halloween_data_set)
  • savanna_data_set = get_neighborhood_records('Savanna Shores', clean_halloween_data_set)
  • # Compute stats.
  • angora_average = compute_average_candy(angora_data_set)
  • guernsey_average = compute_average_candy(guernsey_data_set)
  • pygora_average = compute_average_candy(pygora_data_set)
  • savanna_average = compute_average_candy(savanna_data_set)

We can add a new type of machine:

Averaging machine

Averaging machine​

This one doesn't return another data set. It returns a single value.

Remember, there's a data machines list list.

Reflect

What's compute_average do?

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.
Ethan
Ethan

It works out the average candy for one neighborhood.

Close, but not really. Though that seems right. Thing is, compute_average doesn't test for a neighborhood. The neighborhood doesn't get passed in to the function.

This call...

  • angora_average = compute_average_candy(angora_data_set)

... does compute the average of a neighborhood, but not because compute_average_candy knows about the neighborhood. It doesn't.

Ethan
Ethan

Huh... Oh! I think I see it.

compute_average_candy works out the candy average for a data set, whatever the data set is. angora_data_set only contains data for that one neighborhood. That's all the data compute_average_candy gets for the call compute_average_candy(angora_data_set).

We made a different data set for each neighborhood. But compute_average_candy doesn't know that. It just takes a data set, whatever it is, and works out the average.

Correct.

Reflect

Please finish this code:

  • def compute_average_candy(data_set):
  •     Stuff here
  •     return average
If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.
Georgina
Georgina

Here's what I came up with:

  • def compute_average_candy(data_set):
  •     total = 0
  •     # Loop over data set.
  •     for record in data_set:
  •         # Add to total.
  •         total += record['Candy collected']
  •     # Divide total by number of elements.
  •     average = total / len(data_set)
  •     return average

Good!

But...

Dataless

Adela
Adela

I'm not sure if this is a problem, but what would happen if the data set is MT?

Ooo, good question! That's called an edge case.

Reflect

What would this code return if the data set was MT?

  1. def compute_average_candy(data_set):
  2.     total = 0
  3.     # Loop over data set.
  4.     for record in data_set:
  5.         # Add to total.
  6.         total += record['Candy collected']
  7.     # Divide total by number of elements.
  8.     average = total / len(data_set)
  9.     return average
If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.
Georgina
Georgina

Hmm, let's see. Zero, maybe?

Let me walk it through in my head.

total is zero in line 2.

The loop (lines 4 to 6) wouldn't do anything. Python would skip it, since there are no records in the data set.

Line 8 would divide total by the number of elements in the list, zero in this case.

Oops...

Reflect

Run 0/0 in the console. What happens?

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.
Ray
Ray

I got an error: ZeroDivisionError: division by zero

Aye. Just as in normal math, in Python dividing by 0 is not allowed.

Adela
Adela

If data_set was MT, that means a neighborhood had no data. Nobody collected candy there. I suppose that could happen.

But if it does, what should we tell the user?

Ray
Ray

We could show an error message. Like, in the main program, maybe.

  • # Break out each neighborhood's data.
  • angora_data_set = get_neighborhood_records('Angora Acres', clean_halloween_data_set)
  • ...
  • # Compute stats.
  • if len(angora_data_set) > 0:
  •     angora_average = compute_average_candy(angora_data_set)
  • ...
  • # Output.
  • if len(angora_data_set) == 0:
  •     print('No data for Angora Acres.')
  • else:
  •     angora_average = compute_average_candy(angora_data_set)
  •     print('Angora Acres: ', angora_average)

Aye, you could do that.

Georgina
Georgina

Or return zero from compute_average_candy:

  • def compute_average_candy(data_set):
  •     # Check there is data.
  •     if len(data_set) == 0:
  •         return 0
  •     total = 0
  •     # Loop over data set.
  •     ...

Yes, that would work, too.

Adela
Adela

Kieran, if you were an analyst, what would you do?

I'd ask the people who were going to use the program what they wanted.

Principle

Ask users

If you're not sure what a program should do in special cases, ask the people who are going to use it.

This is for the make-it-zero approach:

  1. def compute_average_candy(data_set):
  2.     # Check there is data.
  3.     if len(data_set) == 0:
  4.         return 0
  5.     total = 0
  6.     # Loop over data set.
  7.     for record in data_set:
  8.         # Add to total.
  9.         total += record['Candy collected']
  10.     # Divide total by number of elements.
  11.     average = total / len(data_set)
  12.     return average

A new pattern

Pattern

Data machine: Records subset

Write a function taking a data set as a param, and returning another data set with a subset of the original records, based on criteria you choose.

When you start a new project, you can use the pattern catalog to remind yourself of useful chunks of code.

Summary

  • It's common in business analytics to look at subsets of data.
  • We added a function to the data pipeline to loop over records, making a new list of records that meet criteria, like being about a particular neighborhood.
  • When two rule sets are almost the same, you should consider making one function to handle both.
  • Use ifs to handle edge cases, like trying to find the average of no data.
  • If you're not sure what a program should do in edge cases, ask the people who are going to use it.

Exercise

Exercise

Compare cuteness

Cthulhu has data on goat cuteness. You can download it. Here's some of the data:

  • "Name","Cuteness"
  • "Lino",10
  • "Johnnie",5
  • "August",9
  • "Leticia",5
  • "Andreas",10
  • "Rhonda",10

Write a program to compute the average cuteness of goats whose names begin with A-L, and goats whose names begin with M-Z. Do this for valid records only. Validity rules:

  • Goat name must be present.
  • Cuteness ratings are ints from 0 to 10.

Output:

  • First half average cuteness: 7.0
  • Second half average cuteness: 7.1

Write a subset function that works for creating both subsets.

Hints:

  • thing[0] is the first character of the string thing.
  • a < b < c < d... < l < m ... y < z.
  • def get_records_by_name(data_set, low_letter, high_letter):

Upload a zip of your project folder. The usual coding standards apply.

Attachments