Computed fields | Python with Pets

Earlier, on Python with Pets

Here's a goats' scores data set we've been using, with errors removed.

"Goat","Before","After"
"Aisha",17,16
"Andreas",14,15
"August",10,13
"Bertha",17,23
"Bessie",20,25
"Boyd",13,12
"Bridgette",16,22
"Carrie",15,19

Here's the main program from earlier. Please explain it to Burt. Explaining something to someone (or a doggo, a plant, a plushie...) helps you learn it.

Burt

# Read goat scores from CSV file.
raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
# Filter out bad records.
cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
# Analysis.
total_before = 0
total_after = 0
for goat in cleaned_goat_scores:
    total_before += goat['Before']
    total_after += goat['After']
print('Total before: ' + str(total_before))
print('Total after: ' + str(total_after))

Before and After are in the data set. You can see them in the CSV, and they're referenced in lines 9 and 10 of the code.
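If you want to check those totals without the CSV file on disk, here's a rough sketch. It assumes the eight records listed above are the whole cleaned data set, and it reads them from a string with io.StringIO instead of calling the lesson's read_csv_data_set and clean_goat_scores functions.

```python
import csv
import io

# The eight clean records from the top of the lesson, as CSV text.
csv_text = '''"Goat","Before","After"
"Aisha",17,16
"Andreas",14,15
"August",10,13
"Bertha",17,23
"Bessie",20,25
"Boyd",13,12
"Bridgette",16,22
"Carrie",15,19
'''

# DictReader gives one dictionary per row. All values arrive as strings.
rows = list(csv.DictReader(io.StringIO(csv_text)))

total_before = 0
total_after = 0
for goat in rows:
    # Convert here, since this sketch skips the cleaning step.
    total_before += float(goat['Before'])
    total_after += float(goat['After'])
print('Total before: ' + str(total_before))  # Total before: 122.0
print('Total after: ' + str(total_after))    # Total after: 145.0
```

The totals are floats because the scores are converted with float, just as the lesson's cleaning function does.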

A new thing

Often with business data you want to analyze fields that aren't in the data set directly. For example, in this data set, what if you wanted to know about the differences between the before and after scores? That's not in the data set, although the pieces needed to work it out are.

A computed field is a value added to each record based on a computation, usually from other fields, but sometimes using external data as well.

One approach is to add a new field to each record.

Here's the data set in Spyder's Variable Explorer (VE).

Original data set

The data set is a list of dictionaries, with three fields:

Goat
Before
After

We'll write a function to add a new field to each record:

New field

Now we have four fields in each record:

Goat
Before
After
Difference
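As a rough picture of what that looks like, here is one record before and after the new field is added. The values are Aisha's from the data set above, stored as floats the way the cleaning step leaves them, and Difference is taken to be After minus Before, which is how the lesson's code computes it later.

```python
# One record as it comes out of the cleaning step: three fields.
record_before = {'Goat': 'Aisha', 'Before': 17.0, 'After': 16.0}

# The same record once the computed field has been added: four fields.
record_after = {'Goat': 'Aisha', 'Before': 17.0, 'After': 16.0, 'Difference': -1.0}

print(list(record_before.keys()))  # ['Goat', 'Before', 'After']
print(list(record_after.keys()))   # ['Goat', 'Before', 'After', 'Difference']
```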

Let's add a new machine to our collection.

Computed field machine

A new machine for the list.

How?

How to do it?

Ray

Let's work out the signature first. Maybe name the function compute_change, send it the cleaned goat scores, and get back a new data set with an extra column.

new_data_set = compute_change(cleaned_goat_scores)

Note

Working out the signature is a good place to start.

Good, that'll work nicely, Ray.

I want the function to do something a little different this time. We've written functions like:

def clean_goat_scores(raw_goat_scores):
    # Create a new list for the clean records.
    clean_scores = []
    # Loop over raw records.
    for raw_record in raw_goat_scores:
        # Is the record OK?
        if is_record_ok(raw_record):
            # Yes, make a new record with the right data types.
            clean_record = {
                'Goat': raw_record['Goat'],
                'Before': float(raw_record['Before']),
                'After': float(raw_record['After'])
            }
            # Add the new record to the clean list.
            clean_scores.append(clean_record)
    # Send the cleaned list back.
    return clean_scores

This code makes a new list (line 3), new dictionaries (lines 9 to 13), and adds them to the new list (line 15). Then it sends back the new list (line 17).

The code that calls the function puts the return list in a new variable:

cleaned_goat_scores = clean_goat_scores(raw_goat_scores)

Ethan

raw_goat_scores is still there, right?

Aye. We just don't use it anymore.
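Here's a small sketch that shows the old list surviving. The function to_floats is a made-up, stripped-down stand-in for clean_goat_scores (it skips the validity check), and the two records are typed in by hand rather than read from the file.

```python
def to_floats(raw_records):
    # Hypothetical helper: build a NEW list, leaving raw_records alone.
    converted = []
    for raw_record in raw_records:
        converted.append({
            'Goat': raw_record['Goat'],
            'Before': float(raw_record['Before']),
            'After': float(raw_record['After'])
        })
    return converted

raw_goat_scores = [
    {'Goat': 'Aisha', 'Before': '17', 'After': '16'},
    {'Goat': 'Andreas', 'Before': '14', 'After': '15'},
]
cleaned_goat_scores = to_floats(raw_goat_scores)

print(raw_goat_scores[0]['Before'])            # 17 (still a string)
print(cleaned_goat_scores[0]['Before'])        # 17.0 (a float)
print(raw_goat_scores is cleaned_goat_scores)  # False: two separate lists
```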

Rather than creating a new list, I'd like to change the one that's passed in. The signature would be:

compute_change(cleaned_goat_scores)

Adela

Oh! So cleaned_goat_scores goes in. compute_change alters it directly. No new list comes out.

In fact, it looks like nothing comes out. So there's no return statement?

Correct.

Here's how we'd call the new function.

# Read goat scores from CSV file.
raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
# Filter out bad records.
cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
# Compute difference scores.
compute_change(cleaned_goat_scores)
# Analysis.
total_difference = 0
for goat in cleaned_goat_scores:
    total_difference += goat['Difference']
print('Total difference: ' + str(total_difference))

Line 6 sends cleaned_goat_scores to compute_change, which adds a new field. Notice, line 10 accesses a field called Difference, but that wasn't in the original data set. compute_change adds it.

Loopy loop

OK, how do we do it? Here's what we have so far.

def compute_change(goat_scores):

Adela

We're going to loop over the data set, right? So we can add the new field to each record?

Aye.

def compute_change(goat_scores):
    # Loop over list.
        # Compute difference.
        # Add difference field to the dictionary.

Reflect

Write a line of code to loop over the list.


Georgina

That would be:

def compute_change(goat_scores):
    # Loop over list.
    for goat_record in goat_scores:
        # Compute difference.
        # Add difference field to the dictionary.

Reflect

Write code to compute the difference between Before and After.


Ethan

Well, Before and After are fields in goat_record. So maybe:

def compute_change(goat_scores):
    # Loop over list.
    for goat_record in goat_scores:
        # Compute difference.
        change = goat_record['After'] - goat_record['Before']
        # Add difference field to the dictionary.

Good!

Reflect

Write code that will make the new field Difference.


Georgina

No prob!

def compute_change(goat_scores):
    # Loop over list.
    for goat_record in goat_scores:
        # Compute difference.
        change = goat_record['After'] - goat_record['Before']
        # Add difference field to the dictionary.
        goat_record['Difference'] = change

Good work! I'll just add some comments.

def compute_change(goat_scores):
    '''
    Add score differences to the data set.
    The data is changed in-place.
    Parameters
    ----------
    goat_scores : list
        A list of dictionaries with the goats' scores.
    '''
    # Loop over list.
    for goat_record in goat_scores:
        # Compute difference.
        change = goat_record['After'] - goat_record['Before']
        # Add difference field to the dictionary.
        goat_record['Difference'] = change

Georgina

There's no return... Oh, wait, you talked about that with Adela. The function changes goat_scores in place. No need to return anything.

Right! Say we had this code:

# Filter out bad records.
cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
# Compute difference scores.
compute_change(cleaned_goat_scores)
# Do something with scores.
do_something(cleaned_goat_scores)

By the time do_something gets cleaned_goat_scores, it already has the new field added.
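You can see the in-place change for yourself with a tiny sketch like this one. compute_change is the lesson's function, shortened a little. The two records are typed in by hand, so no CSV file is needed.

```python
def compute_change(goat_scores):
    # Changes each dictionary in the caller's list. No return statement.
    for goat_record in goat_scores:
        goat_record['Difference'] = goat_record['After'] - goat_record['Before']

cleaned_goat_scores = [
    {'Goat': 'Aisha', 'Before': 17.0, 'After': 16.0},
    {'Goat': 'Bertha', 'Before': 17.0, 'After': 23.0},
]

# Capture the return value only to show there isn't one.
result = compute_change(cleaned_goat_scores)

print(result)                                # None
print(cleaned_goat_scores[0]['Difference'])  # -1.0
print(cleaned_goat_scores[1]['Difference'])  # 6.0
```

The function and the main program are both looking at the same list, so a change made inside the function shows up outside it.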

Let's put this in with the rest of the program.

# Show total differences in before and after scores.
# Written by the Scoobies, July 19, Year of the Dragon.
import csv
def read_csv_data_set(file_name):
    '''
    Read a data set from a CSV file.
    Parameters
    ----------
    file_name : string
        Name of the CSV file in the current folder.
    Returns
    -------
    data_set : List of dictionaries.
        Data set.
    '''
    # Create a list to be the return value.
    data_set = []
    with open('./' + file_name) as file:
        file_csv = csv.DictReader(file)
        # Put each row into the return list.
        for row in file_csv:
            data_set.append(row)
    return data_set
def is_record_ok(record):
    '''
    Check whether a record of goat test scores is OK.
    Parameters
    ----------
    record : dictionary
        Data to check.
    Returns
    -------
    bool
        True if the data is valid.
    '''
    # Check name.
    goat_name = record['Goat']
    if goat_name == '' or goat_name is None:
        return False
    # Check Before value.
    before = record['Before']
    if not is_score_ok(before):
        return False
    # Check After value.
    after = record['After']
    if not is_score_ok(after):
        return False
    return True
def is_score_ok(score_in):
    '''
    Test whether a string is a valid score.
    Parameters
    ----------
    score_in : string
        Value to check.
    Returns
    -------
    bool
        True if the score is valid.
    '''
    # Is it a number?
    try:
        score_number = float(score_in)
    except Exception:
        return False
    # Check range.
    if score_number < 0 or score_number > 100:
        return False
    # All OK.
    return True
def clean_goat_scores(raw_goat_scores):
    '''
    Remove records with bad data (case-wise deletion).
    Parameters
    ----------
    raw_goat_scores : List of dictionaries
        Goat scores data set.
    Returns
    -------
    clean_scores : List of dictionaries
        Cleaned goat scores data set.
    '''
    clean_scores = []
    for raw_record in raw_goat_scores:
        if is_record_ok(raw_record):
            clean_record = {
                'Goat': raw_record['Goat'],
                'Before': float(raw_record['Before']),
                'After': float(raw_record['After'])
            }
            clean_scores.append(clean_record)
    return clean_scores
def compute_change(goat_scores):
    '''
    Add score differences to the data set.
    The data is changed in-place.
    Parameters
    ----------
    goat_scores : list
        A list of dictionaries with the goats' scores.
    Returns
    -------
    Nothing.
    '''
    # Loop over list.
    for goat_record in goat_scores:
        # Compute difference.
        change = goat_record['After'] - goat_record['Before']
        # Add difference field to the dictionary.
        goat_record['Difference'] = change
# Main program ----------------------------------------
# Read goat scores from CSV file.
raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
# Filter out bad records.
cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
# Compute difference scores.
compute_change(cleaned_goat_scores)
# Analysis.
total_difference = 0
for goat in cleaned_goat_scores:
    total_difference += goat['Difference']
print('Total difference: ' + str(total_difference))

Ray

That's a lotta code!

It is, 146 lines. For a beginning course, that's a lotta code.

Ray, did you have trouble understanding the program?

Ray

No, not really. We'd seen all the pieces before.

Right. The key is not to panic when you see a big program, or get a task you haven't done before. Use patterns. Break it down into chunks. Focus on one thing at a time.

You know a lot of useful stuff.

Ray

You're right. I'm feeling more confident.

Changey changey

Let's see how you would change the program, something that happens a lot in business analytics. Like, a lot a lot.

You want to change the maximum allowed score from 100 to 120. So 110 is no longer a bad value to be filtered out.

Where in the program would you make the change? Ray?

Ray

I'd change the function is_score_ok.

How did you know to choose that?

Ray

It's right there in the main program's comments.

# Filter out bad records.
cleaned_goat_scores = clean_goat_scores(raw_goat_scores)

clean_goat_scores calls is_record_ok:

if is_record_ok(raw_record):

In is_record_ok, you've got:

# Check Before value.
before = record['Before']
if not is_score_ok(before):

Looks like is_score_ok is what I want, if I want to allow scores up to 120. is_score_ok says:

# Check range.
if score_number < 0 or score_number > 100:

There's the thing I want to change.

Right!

The functions and comments help you drill down through the program, to find the code you need to change.
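One way to make a change like this even easier to find is to give the limits names at the top of the program. This is only a sketch of that idea, not how the lesson's program is written: MIN_SCORE and MAX_SCORE are made-up names, and the except line is narrowed to the two errors float raises for bad input.

```python
# Hypothetical named limits. Change the maximum here, in one place.
MIN_SCORE = 0
MAX_SCORE = 120

def is_score_ok(score_in):
    # Is it a number?
    try:
        score_number = float(score_in)
    except (TypeError, ValueError):
        # float() raises one of these for values like None or 'low'.
        return False
    # Check range against the named limits.
    if score_number < MIN_SCORE or score_number > MAX_SCORE:
        return False
    # All OK.
    return True

print(is_score_ok('110'))  # True: allowed now that the maximum is 120
print(is_score_ok('121'))  # False: over the maximum
print(is_score_ok('low'))  # False: not a number
```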

OK, we're done with data preparation.

Adela

We've had a buncha lessons on data cleaning. Do RL (real life) programmers spend a lotta time on that?

Aye, they do. You might hear old timers say GIGO, for garbage in, garbage out. If the data going in is flaky, do you trust the results coming out?

Summary

Often with business data, you want to analyze fields that aren't in the data set directly.
A computed field is a value added to each record based on a computation, usually from other fields.
You can modify a data set that's passed in to a function as a parameter.

Exercise

Goatball point spread

Goatball is a popular sport on Vuohen saari. There are only two teams, the Typerä and the Vakava, but there's nothing else going on, so it's popular. Cthulhu wants to know about goatball point spreads.

Point spread is the difference between scores in a game. So if:

The New York Nannies scored 10 points and the Buffalo Billies scored 7, the spread would be 3 for the game.
The Brisbane Bed Bugs scored 11 points and the Yosemite Yolks scored 16, the spread would be 5.
The average of the spreads for the games would be (3+5)/2 = 4.
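Here's that arithmetic as a small sketch, using the two made-up games above. The field names Home and Away are invented for this illustration and are not the ones in the goatball data set. The spread is treated as the size of the gap, whichever team is ahead, so the sketch uses abs.

```python
# The two example games from the text. Field names are hypothetical.
games = [
    {'Home': 10, 'Away': 7},   # New York Nannies vs. Buffalo Billies
    {'Home': 11, 'Away': 16},  # Brisbane Bed Bugs vs. Yosemite Yolks
]

spreads = []
for game in games:
    # abs() drops the sign, so 11 - 16 gives a spread of 5, not -5.
    spreads.append(abs(game['Home'] - game['Away']))

average_spread = sum(spreads) / len(spreads)
print(spreads)         # [3, 5]
print(average_spread)  # 4.0
```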

Download the data set. Here's part of it:

"Typera","Vakava"
6,7
8,4
3,4
9,"low"
7,5
3,6

Each row is the points scored in one game. Both values should be integers from 0 to 15. Only analyze records with valid data.

Here's the output:

Goatball
========
Typera wins: 28
Vakava wins: 28
Draws: 14
Average spread: 1.97

Extra requirements:

Write a function called add_spread_field that takes a data set, and adds a field called Spread to each record. It returns nothing.
Write a function called count_results that takes a data set, and returns three values: typera_wins, vakava_wins, and draws.
Write a function called compute_average_spread that takes a data set, and returns the average of Spread.

Upload a zip of your project folder. The usual coding standards apply.
