Earlier, on Python with Pets
Here's a goats' scores data set we've been using, with errors removed.
```
"Goat","Before","After"
"Aisha",17,16
"Andreas",14,15
"August",10,13
"Bertha",17,23
"Bessie",20,25
"Boyd",13,12
"Bridgette",16,22
"Carrie",15,19
```
Here's the main program from earlier. Please explain it to Burt. Explaining something to someone (or a doggo, a plant, a plushie...) helps you learn it.
Burt
```python
# Read goat scores from CSV file.
raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
# Filter out bad records.
cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
# Analysis.
total_before = 0
total_after = 0
for goat in cleaned_goat_scores:
    total_before += goat['Before']
    total_after += goat['After']
print('Total before: ' + str(total_before))
print('Total after: ' + str(total_after))
```
Before and After are in the data set. You can see them in the CSV, and they're referenced in lines 9 and 10 of the code.
A new thing
Often with business data you want to analyze fields that aren't in the data set directly. For example, in this data set, what if you wanted to know about the differences between the before and after scores? That's not in the data set, although the pieces needed to work it out are.
A computed field is a value added to each record based on a computation, usually from other fields, but sometimes using external data as well.
One approach is to add a new field to each record.
Here's the data set in Spyder's VE.
The data set is a list of dictionaries, with three fields:
- Goat
- Before
- After
We'll write a function to add a new field to each record:
Now we have four fields in each record:
- Goat
- Before
- After
- Difference
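To make the idea concrete, here's a minimal sketch that adds a computed field to a single record. It assumes the record is a dictionary with numeric Before and After values, like the cleaned records from earlier. The variable name goat_record is just for illustration.

```python
# One record from the goat data set, as a dictionary.
goat_record = {'Goat': 'Bertha', 'Before': 17.0, 'After': 23.0}

# Compute a new value from existing fields, and store it under a new key.
goat_record['Difference'] = goat_record['After'] - goat_record['Before']

print(goat_record)
# {'Goat': 'Bertha', 'Before': 17.0, 'After': 23.0, 'Difference': 6.0}
```

Assigning to a key that doesn't exist yet is how a dictionary gets a new field.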
Let's add a new machine to our collection.
Computed field machine
A new machine for the list.
How?
How to do it?

Ray
Let's work out the signature first. Maybe name the function compute_change, send it the cleaned goat scores, and get back a new data set with an extra column.
new_data_set = compute_change(cleaned_goat_scores)
Note
Working out the signature is a good place to start.
Good, that'll work nicely, Ray.
I want the function to do something a little different this time. We've written functions like:
```python
def clean_goat_scores(raw_goat_scores):
    # Create a new list for the clean records.
    clean_scores = []
    # Loop over raw records.
    for raw_record in raw_goat_scores:
        # Is the record OK?
        if is_record_ok(raw_record):
            # Yes, make a new record with the right data types.
            clean_record = {
                'Goat': raw_record['Goat'],
                'Before': float(raw_record['Before']),
                'After': float(raw_record['After'])
            }
            # Add the new record to the clean list.
            clean_scores.append(clean_record)
    # Send the cleaned list back.
    return clean_scores
```
This code makes a new list (line 3), new dictionaries (lines 9 to 13), and adds them to the new list (line 15). Then it sends back the new list (line 17).
The code that calls the function puts the return list in a new variable:
```python
cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
```

Ethan
raw_goat_scores is still there, right?
Aye. We just don't use it anymore.
Rather than creating a new list, I'd like to change the one that's passed in. The signature would be:
compute_change(cleaned_goat_scores)

Adela
Oh! So cleaned_goat_scores goes in. compute_change alters it directly. No new list comes out.
In fact, it looks like nothing comes out. So there's no return statement?
Correct.
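Here's a minimal sketch of what "nothing comes out" means. The function mark_checked and the Checked field are made up for this illustration. They're not part of the goat program.

```python
def mark_checked(records):
    # Hypothetical helper, not part of the goat program.
    # It adds a field to every dictionary in the list it receives.
    for record in records:
        record['Checked'] = True
    # No return statement, so Python sends back None.


goats = [{'Goat': 'Aisha'}, {'Goat': 'Andreas'}]
result = mark_checked(goats)

print(result)    # None
print(goats[0])  # {'Goat': 'Aisha', 'Checked': True}
```

The function and the caller share the same list, so a change made inside the function shows up in the caller's variable.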
Here's how we'd call the new function.
```python
# Read goat scores from CSV file.
raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
# Filter out bad records.
cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
# Compute difference scores.
compute_change(cleaned_goat_scores)
# Analysis.
total_difference = 0
for goat in cleaned_goat_scores:
    total_difference += goat['Difference']
print('Total difference: ' + str(total_difference))
```
Line 6 sends cleaned_goat_scores to compute_change, which adds a new field. Notice, line 10 accesses a field called Difference, but that wasn't in the original data set. compute_change adds it.
Loopy loop
OK, how do we do it? Here's what we have so far.
```python
def compute_change(goat_scores):
```

Adela
We're going to loop over the data set, right? So we can add the new field to each record?
Aye.
```python
def compute_change(goat_scores):
    # Loop over list.
        # Compute difference.
        # Add difference field to the dictionary.
```
Write a line of code to loop over the list.

Georgina
That would be:
```python
def compute_change(goat_scores):
    # Loop over list.
    for goat_record in goat_scores:
        # Compute difference.
        # Add difference field to the dictionary.
```
Write code to compute the difference between Before and After.

Ethan
Well, Before and After are fields in goat_record. So maybe:
```python
def compute_change(goat_scores):
    # Loop over list.
    for goat_record in goat_scores:
        # Compute difference.
        change = goat_record['After'] - goat_record['Before']
        # Add difference field to the dictionary.
```
Good!
Write code that will make the new field Difference.

Georgina
No prob!
```python
def compute_change(goat_scores):
    # Loop over list.
    for goat_record in goat_scores:
        # Compute difference.
        change = goat_record['After'] - goat_record['Before']
        # Add difference field to the dictionary.
        goat_record['Difference'] = change
```
Good work! I'll just add some comments.
```python
def compute_change(goat_scores):
    '''
    Add score differences to the data set.
    The data is changed in-place.

    Parameters
    ----------
    goat_scores : list
        A list of dictionaries with the goats' scores.
    '''
    # Loop over list.
    for goat_record in goat_scores:
        # Compute difference.
        change = goat_record['After'] - goat_record['Before']
        # Add difference field to the dictionary.
        goat_record['Difference'] = change
```

Georgina
There's no return... Oh, wait, you talked about that with Adela. The function changes goat_scores in place. No need to return anything.
Right! Say we had this code:
```python
# Filter out bad records.
cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
# Compute difference scores.
compute_change(cleaned_goat_scores)
# Do something with scores.
do_something(cleaned_goat_scores)
```
By the time do_something gets cleaned_goat_scores, it already has the new field added.
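You can check this with a small experiment. The sketch below uses the first three records from the cleaned data set at the top of the lesson. do_something is a hypothetical stand-in that just collects the Difference values. In a real program it could be any analysis.

```python
def compute_change(goat_scores):
    # Same logic as the function written above.
    for goat_record in goat_scores:
        goat_record['Difference'] = goat_record['After'] - goat_record['Before']


def do_something(goat_scores):
    # Hypothetical stand-in: collect the Difference values into a list.
    differences = []
    for goat in goat_scores:
        differences.append(goat['Difference'])
    return differences


# First three records from the data set at the top of the lesson.
cleaned_goat_scores = [
    {'Goat': 'Aisha', 'Before': 17.0, 'After': 16.0},
    {'Goat': 'Andreas', 'Before': 14.0, 'After': 15.0},
    {'Goat': 'August', 'Before': 10.0, 'After': 13.0},
]
compute_change(cleaned_goat_scores)
print(do_something(cleaned_goat_scores))  # [-1.0, 1.0, 3.0]
```

If you call do_something before compute_change, Python raises a KeyError, because the Difference field doesn't exist yet.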
Let's put this in with the rest of the program.
```python
# Show total differences in before and after scores.
# Written by the Scoobies, July 19, Year of the Dragon.
import csv


def read_csv_data_set(file_name):
    '''
    Read a data set from a CSV file.

    Parameters
    ----------
    file_name : string
        Name of the CSV file in the current folder.

    Returns
    -------
    data_set : List of dictionaries.
        Data set.
    '''
    # Create a list to be the return value.
    data_set = []
    with open('./' + file_name) as file:
        file_csv = csv.DictReader(file)
        # Put each row into the return list.
        for row in file_csv:
            data_set.append(row)
    return data_set


def is_record_ok(record):
    '''
    Check whether a record of goat test scores is OK.

    Parameters
    ----------
    record : dictionary
        Data to check.

    Returns
    -------
    bool
        True if the data is valid.
    '''
    # Check name.
    goat_name = record['Goat']
    if goat_name == '' or goat_name is None:
        return False
    # Check Before value.
    before = record['Before']
    if not is_score_ok(before):
        return False
    # Check After value.
    after = record['After']
    if not is_score_ok(after):
        return False
    return True


def is_score_ok(score_in):
    '''
    Test whether a string is a valid score.

    Parameters
    ----------
    score_in : string
        Value to check.

    Returns
    -------
    bool
        True if the score is valid.
    '''
    # Is it a number?
    try:
        score_number = float(score_in)
    except Exception:
        return False
    # Check range.
    if score_number < 0 or score_number > 100:
        return False
    # All OK.
    return True


def clean_goat_scores(raw_goat_scores):
    '''
    Remove records with bad data (case-wise deletion)

    Parameters
    ----------
    raw_goat_scores : List of dictionaries
        Goat scores data set.

    Returns
    -------
    clean_scores : List of dictionaries
        Cleaned goat scores data set.
    '''
    clean_scores = []
    for raw_record in raw_goat_scores:
        if is_record_ok(raw_record):
            clean_record = {
                'Goat': raw_record['Goat'],
                'Before': float(raw_record['Before']),
                'After': float(raw_record['After'])
            }
            clean_scores.append(clean_record)
    return clean_scores


def compute_change(goat_scores):
    '''
    Add score differences to the data set.
    The data is changed in-place.

    Parameters
    ----------
    goat_scores : list
        A list of dictionaries with the goats' scores.

    Returns
    -------
    Nothing.
    '''
    # Loop over list.
    for goat_record in goat_scores:
        # Compute difference.
        change = goat_record['After'] - goat_record['Before']
        # Add difference field to the dictionary.
        goat_record['Difference'] = change


# Main program ----------------------------------------
# Read goat scores from CSV file.
raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
# Filter out bad records.
cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
# Compute difference scores.
compute_change(cleaned_goat_scores)
# Analysis.
total_difference = 0
for goat in cleaned_goat_scores:
    total_difference += goat['Difference']
print('Total difference: ' + str(total_difference))
```

Ray
That's a lotta code!
It is, 146 lines. For a beginning course, that's a lotta code.
Ray, did you have trouble understanding the program?

Ray
No, not really. We'd seen all the pieces before.
Right. The key is not to panic when you see a big program, or get a task you haven't done before. Use patterns. Break it down into chunks. Focus on one thing at a time.
You know a lot of useful stuff.

Ray
You're right. I'm feeling more confident.
Changey changey
Let's see how you would change the program, something that happens a lot in business analytics. Like, a lot a lot.
You want to change the maximum allowed score from 100 to 120. So 110 is no longer a bad value to be filtered out.
Where in the program would you make the change? Ray?

Ray
I'd change the function is_score_ok.
How did you know to choose that?

Ray
It's right there in the main program's comments.
```python
# Filter out bad records.
cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
```
clean_goat_scores calls is_record_ok:
```python
if is_record_ok(raw_record):
```
In is_record_ok, you've got:
```python
# Check Before value.
before = record['Before']
if not is_score_ok(before):
```
Looks like is_score_ok is what I want, if I want to allow scores up to 120. is_score_ok says:
```python
# Check range.
if score_number < 0 or score_number > 100:
```
There's the thing I want to change.
Right!
The functions and comments help you drill down through the program, to find the code you need to change.
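Here's a sketch of what the changed function might look like. The names MIN_SCORE and MAX_SCORE are an assumption added for this illustration. The original program uses the literals 0 and 100 directly, and changing 100 to 120 in place would work just as well.

```python
# Hypothetical named constants; the original program uses 0 and 100 directly.
MIN_SCORE = 0
MAX_SCORE = 120


def is_score_ok(score_in):
    # Is it a number?
    try:
        score_number = float(score_in)
    except Exception:
        return False
    # Check range.
    if score_number < MIN_SCORE or score_number > MAX_SCORE:
        return False
    # All OK.
    return True


print(is_score_ok('110'))  # True
print(is_score_ok('121'))  # False
print(is_score_ok('low'))  # False
```

Nothing else in the program needs to change, because every score check goes through is_score_ok.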
OK, we're done with data preparation.

Adela
We've had a buncha lessons on data cleaning. Do RL (real life) programmers spend a lotta time on that?
Aye, they do. You might hear old timers say GIGO, for garbage in, garbage out. If the data going in is flaky, do you trust the results coming out?
Summary
- Often with business data, you want to analyze fields that aren't in the data set directly.
- A computed field is a value added to each record based on a computation, usually from other fields.
- You can modify a data set that's passed in to a function as a parameter.
Exercise
Goatball point spread
Goatball is a popular sport on Vuohen saari. There are only two teams, the Typerä and the Vakava, but there's nothing else going on, so it's popular. Cthulhu wants to know about goatball point spreads.
Point spread is the difference between scores in a game. So if:
- The New York Nannies scored 10 points and the Buffalo Billies scored 7, the spread would be 3 for the game.
- The Brisbane Bed Bugs scored 11 points and the Yosemite Yolks scored 16, the spread would be 5.
- The average of the spreads for the games would be (3+5)/2 = 4.
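The arithmetic above can be sketched in a few lines. This assumes the spread is the absolute difference between the two scores, which is what the examples suggest (11 and 16 give 5, not -5). The field names team_a and team_b are made up for the illustration. They aren't the names in the goatball data set.

```python
# The two example games. Field names are hypothetical.
games = [
    {'team_a': 10, 'team_b': 7},
    {'team_a': 11, 'team_b': 16},
]

# Work out the spread for each game.
spreads = []
for game in games:
    # abs() drops the sign, so it doesn't matter which team won.
    spreads.append(abs(game['team_a'] - game['team_b']))

average_spread = sum(spreads) / len(spreads)

print(spreads)         # [3, 5]
print(average_spread)  # 4.0
```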
Download the data set. Here's part of it:
```
"Typera","Vakava"
6,7
8,4
3,4
9,"low"
7,5
3,6
```
Each row is the points scored in one game. Both values should be integers from 0 to 15. Only analyze records with valid data.
Here's the output:
```
Goatball
========
Typera wins: 28
Vakava wins: 28
Draws: 14
Average spread: 1.97
```
Extra requirements:
- Write a function called add_spread_field that takes a data set, and adds a field called Spread to each record. It returns nothing.
- Write a function called count_results that takes a data set, and returns three values: typera_wins, vakava_wins, and draws.
- Write a function called compute_average_spread that takes a data set, and returns the average of Spread.
Upload a zip of your project folder. The usual coding standards apply.