Not graded. So why do it?
The story so far
read_csv_data_set is a function that reads a CSV file into a list of dictionaries. Each record is one dictionary, and the data set is a list of those records. Every field value is a string.
- goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
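If you want a reminder of what that function does, here's a minimal sketch. It assumes read_csv_data_set is built on Python's csv.DictReader; the version from the earlier lesson may be written differently. The sample file name and its contents are made up for the demo.

```python
import csv
import os
import tempfile


def read_csv_data_set(file_name):
    # Sketch only: assumes the first row of the file holds the field names.
    with open(file_name, newline='') as csv_file:
        reader = csv.DictReader(csv_file)
        # One dictionary per record, collected into a list.
        return [dict(row) for row in reader]


# Make a tiny sample file (made-up data, not the lesson's file).
sample_path = os.path.join(tempfile.mkdtemp(), 'sample-scores.csv')
with open(sample_path, 'w', newline='') as sample_file:
    sample_file.write('Goat,Before,After\nAlma,52,61\nBirdie,,70\n')

sample_scores = read_csv_data_set(sample_path)
for record in sample_scores:
    print(record)
```

Notice that every value comes back as a string, even the ones that look like numbers. A missing value comes back as an empty string.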
We made a validation function. Give it a record (in a dictionary), and it will return true if the data is valid, or false if not.
Explain the code to this goat, to remind yourself how it works. You can explain it in your head, if you don't want to do it out loud.
Persephone
- def is_record_ok(record):
-     # Check name.
-     goat_name = record['Goat']
-     if goat_name == '' or goat_name is None:
-         return False
-     # Check Before value.
-     before = record['Before']
-     if not is_score_ok(before):
-         return False
-     # Check After value.
-     after = record['After']
-     if not is_score_ok(after):
-         return False
-     return True
The record validation function calls is_score_ok, which validates a single field that is supposed to be numeric. Explain it to Persephone.
- def is_score_ok(score_in):
-     # Is it a number?
-     try:
-         score_number = float(score_in)
-     except Exception:
-         return False
-     # Check range.
-     if score_number < 0 or score_number > 100:
-         return False
-     # All OK.
-     return True
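To check your explanation, you could run is_score_ok on a few values and see whether the results match what you told Persephone. This is a small sketch; the test values are made up.

```python
def is_score_ok(score_in):
    # Is it a number?
    try:
        score_number = float(score_in)
    except Exception:
        return False
    # Check range.
    if score_number < 0 or score_number > 100:
        return False
    # All OK.
    return True


# Made-up test values: two good ones, then several bad ones.
test_values = ['85', '85.5', '', 'abc', '-3', '150', None]
for value in test_values:
    print(repr(value), is_score_ok(value))
```

The first two values should give True, and the rest False. One edge case worth knowing about: float('nan') succeeds, and nan fails both range comparisons, so is_score_ok('nan') returns True. If your data could contain the text nan, you might want an extra check.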
Finally, we had a complete program. Persephone awaits.
- goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
- total_before = 0
- total_after = 0
- for goat in goat_scores:
-     if is_record_ok(goat):
-         total_before += float(goat['Before'])
-         total_after += float(goat['After'])
- print('Total before: ' + str(total_before))
- print('Total after: ' + str(total_after))
Another way
We can change the code a little, to make it easier to write analysis code. This...
- for goat in goat_scores:
-     if is_record_ok(goat):
-         total_before += float(goat['Before'])
... checks each record inside a computation loop. That works, but it combines validation and analysis.
Another way to do things is to separate validation and analysis entirely. Remove bad records from the data set before starting any analysis.
Clean data before analysis
clean_goat_scores is a new function that takes in a data set that might have errors, and returns a data set that does not have errors.
The analysis code will be easier to write, since it doesn't have to worry about validation.
Cleaner

Ray
This sounds hard!
What do you do when something's hard?

Ray
Umm... break it into pieces?
Aye.
Eating elephants
There is only one way to eat an elephant: a bite at a time.
Attributed to Desmond Tutu
Like this:
- # Read goat scores from CSV file.
- raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
- # Filter out bad records.
- cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
- # Analysis.
- total_before = 0
- total_after = 0
- for goat in cleaned_goat_scores:
-     total_before += goat['Before']
-     total_after += goat['After']
- print('Total before: ' + str(total_before))
- print('Total after: ' + str(total_after))
Line 2 reads the data set from the CSV file into the variable raw_goat_scores. It might contain errors.
Line 4 calls a new function that goes through raw_goat_scores, and makes a new data set with only valid records. The new data set is called cleaned_goat_scores.
Now that we have a data set without errors, the analysis doesn't need error-checking code. So instead of...
- for goat in goat_scores:
-     if is_record_ok(goat):  # Error check
-         total_before += float(goat['Before'])
-         total_after += float(goat['After'])
... we can have...
- for goat in cleaned_goat_scores:
-     total_before += goat['Before']
-     total_after += goat['After']
Simpler, since there's no validation needed here. It's already been done.

Adela
Looks like the float calls are gone, too. Are they done in clean_goat_scores?
Right! Cleaning means doing anything needed to get the data ready for analysis, including validation and type conversion.
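Here's a quick sketch of why the type conversion matters. The record is made up, but it has the same shape as the ones read_csv_data_set gives you: every value is a string.

```python
raw_record = {'Goat': 'Alma', 'Before': '52', 'After': '61'}

# Adding two strings glues them together.
joined = raw_record['Before'] + raw_record['After']
print(joined)  # 5261

# Converting to float first gives real arithmetic.
added = float(raw_record['Before']) + float(raw_record['After'])
print(added)  # 113.0

# Adding a string to a number raises a TypeError.
total_before = 0
try:
    total_before += raw_record['Before']
except TypeError:
    print('Cannot add a string to a number.')
```

If the cleaning step converts the fields once, the analysis code never has to think about this.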
Let's start with the signature. You can get it from the code we saw earlier.
- # Read goat scores from CSV file.
- raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
- # Filter out bad records.
- cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
- # Analysis.
- total_before = 0
- total_after = 0
- for goat in cleaned_goat_scores:
-     total_before += goat['Before']
-     total_after += goat['After']
- print('Total before: ' + str(total_before))
- print('Total after: ' + str(total_after))

Ray
The function is called clean_goat_scores. It takes one parameter: a list of dictionaries, made by read_csv_data_set. It returns... a list of dictionaries?
Aye. But what's different about the return list?

Ray
It only has valid records. Oh, and the data types. Like, if there's a field called temperature, that'll be a string in the data set read_csv_data_set returns, but it needs to be numeric for analysis.
So it would be something like:
- def clean_goat_scores(raw_goat_scores):
-     Something
-     return cleaned_goat_scores
Indeed! Good job!
Let me give you a head start, with some comments.
- def clean_goat_scores(raw_goat_scores):
-     # Create a new list for the clean records.
-     # Loop over raw records.
-         # Is the record OK?
-             # Yes, make a new record with the right data types.
-             # Add the new record to the clean list.
-     # Send the cleaned list back.
-     return cleaned_goat_scores
Finish the function, as much as you can.

Ray
OK... I think I remember how to make a new list.
- def clean_goat_scores(raw_goat_scores):
-     # Create a new list for the clean records.
-     cleaned_goat_scores = []
-     # Loop over raw records.
-         # Is the record OK?
-             # Yes, make a new record with the right data types.
-             # Add the new record to the clean list.
-     # Send the cleaned list back.
-     return cleaned_goat_scores
The loopy loop is a for. Maybe...
- def clean_goat_scores(raw_goat_scores):
-     # Create a new list for the clean records.
-     cleaned_goat_scores = []
-     # Loop over raw records.
-     for raw_record in raw_goat_scores:
-         # Is the record OK?
-             # Yes, make a new record with the right data types.
-             # Add the new record to the clean list.
-     # Send the cleaned list back.
-     return cleaned_goat_scores
The next step is checking raw_record... I'm not sure...

Adela
We already wrote that code. This is from an earlier lesson.
- total_after = 0
- for goat in goat_scores:
-     if is_record_ok(goat):
-         total_before += float(goat['Before'])
Give is_record_ok a record, and it tells you whether it's OK.

Ray
Oh, you're right! Can I use that function in this code?
- def clean_goat_scores(raw_goat_scores):
-     # Create a new list for the clean records.
-     cleaned_goat_scores = []
-     # Loop over raw records.
-     for raw_record in raw_goat_scores:
-         # Is the record OK?
-         if is_record_ok(raw_record):
-             # Yes, make a new record with the right data types.
-             # Add the new record to the clean list.
-     # Send the cleaned list back.
-     return cleaned_goat_scores
Now, to make a new record... How do I do that?
What's a record look like? Is it a list, a string...
What type of thing is a record?

Ray
It's a dictionary. We talked about them before... Aha! I found this code in an earlier lesson.
- best_pokemon = {
-     'name': 'Snorlax',
-     'generation': 1,
-     'pokedex number': 143
- }
So, to make a new goat scores record...
- def clean_goat_scores(raw_goat_scores):
-     # Create a new list for the clean records.
-     cleaned_goat_scores = []
-     # Loop over raw records.
-     for raw_record in raw_goat_scores:
-         # Is the record OK?
-         if is_record_ok(raw_record):
-             # Yes, make a new record with the right data types.
-             clean_record = {
-                 'Goat': raw_record['Goat'],
-                 'Before': float(raw_record['Before']),
-                 'After': float(raw_record['After'])
-             }
-             # Add the new record to the clean list.
-     # Send the cleaned list back.
-     return cleaned_goat_scores
I remembered to add the floats, to get the data types right.
Nice work!

Ray
Last bit.
- def clean_goat_scores(raw_goat_scores):
-     # Create a new list for the clean records.
-     cleaned_goat_scores = []
-     # Loop over raw records.
-     for raw_record in raw_goat_scores:
-         # Is the record OK?
-         if is_record_ok(raw_record):
-             # Yes, make a new record with the right data types.
-             clean_record = {
-                 'Goat': raw_record['Goat'],
-                 'Before': float(raw_record['Before']),
-                 'After': float(raw_record['After'])
-             }
-             # Add the new record to the clean list.
-             cleaned_goat_scores.append(clean_record)
-     # Send the cleaned list back.
-     return cleaned_goat_scores
Yay!

Georgina
Great work, Ray!

Ethan
Yeah! Good job!
Ray, anything that occurred to you as you wrote this?

Ray
A coupla things. The new function is like other stuff we did. A for loop, with an if inside.
This is a new program, but we reused read_csv_data_set and is_record_ok. Oh, and is_score_ok, too. is_record_ok calls it. Already written, as functions we could reuse.
I'm feeling good about this programming thing.
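If you'd like to run everything together without the CSV file, here's a sketch that puts the pieces in one place. The functions follow the code above. The raw records are made up, standing in for what read_csv_data_set would return.

```python
def is_score_ok(score_in):
    # Is it a number?
    try:
        score_number = float(score_in)
    except Exception:
        return False
    # Check range.
    if score_number < 0 or score_number > 100:
        return False
    return True


def is_record_ok(record):
    # Check name.
    goat_name = record['Goat']
    if goat_name == '' or goat_name is None:
        return False
    # Check both scores.
    if not is_score_ok(record['Before']):
        return False
    if not is_score_ok(record['After']):
        return False
    return True


def clean_goat_scores(raw_goat_scores):
    cleaned_goat_scores = []
    for raw_record in raw_goat_scores:
        if is_record_ok(raw_record):
            # Make a new record with the right data types.
            clean_record = {
                'Goat': raw_record['Goat'],
                'Before': float(raw_record['Before']),
                'After': float(raw_record['After'])
            }
            cleaned_goat_scores.append(clean_record)
    return cleaned_goat_scores


# Made-up sample data. Every field is a string, as in the CSV data set.
raw_goat_scores = [
    {'Goat': 'Alma', 'Before': '52', 'After': '61'},
    {'Goat': '', 'Before': '40', 'After': '50'},         # Missing name.
    {'Goat': 'Birdie', 'Before': 'abc', 'After': '70'},  # Not a number.
    {'Goat': 'Clover', 'Before': '30', 'After': '145'},  # Out of range.
    {'Goat': 'Dot', 'Before': '30', 'After': '45'},
]

cleaned_goat_scores = clean_goat_scores(raw_goat_scores)

# Analysis. No validation or float calls needed here.
total_before = 0
total_after = 0
for goat in cleaned_goat_scores:
    total_before += goat['Before']
    total_after += goat['After']
print('Total before: ' + str(total_before))  # Total before: 82.0
print('Total after: ' + str(total_after))    # Total after: 106.0
```

Only Alma and Dot make it through cleaning, so the totals come from those two records.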
Doing more while checking data
You can add to clean_goat_scores to do all sorts of things. It's a Python function, and can do anything Python can do.
For example, you can print out which records have errors. Maybe show the Goat field of invalid records, like this:
- Bad record:
- Bad record:
- Bad record: Bertha
- Bad record: Bessie
- Bad record: Boyd
- Bad record: Bridgette
- Bad record: Carrie
- Bad record: Darell
- Bad record: Deborah
- Bad record: Gerald
- Bad record: Johnnie
- Bad record: Long
- Bad record: Vincent
- Total before: 570.0
- Total after: 648.0
Two records are missing goat names. That's what the first two lines show you.
Here's the code Ray wrote:
- def clean_goat_scores(raw_goat_scores):
-     # Create a new list for the clean records.
-     cleaned_goat_scores = []
-     # Loop over raw records.
-     for raw_record in raw_goat_scores:
-         # Is the record OK?
-         if is_record_ok(raw_record):
-             # Yes, make a new record with the right data types.
-             clean_record = {
-                 'Goat': raw_record['Goat'],
-                 'Before': float(raw_record['Before']),
-                 'After': float(raw_record['After'])
-             }
-             # Add the new record to the clean list.
-             cleaned_goat_scores.append(clean_record)
-     # Send the cleaned list back.
-     return cleaned_goat_scores
Add code to show the Goat field of invalid records.
Hint: The code inside the if...
- if is_record_ok(raw_record):
... runs when is_record_ok is true. The print for the bad record should run when is_record_ok is false.
Something like:
- for raw_record in raw_goat_scores:
-     # Is the record OK?
-     if is_record_ok(raw_record):
-         ...
-         cleaned_goat_scores.append(clean_record)
-     else:
-         print('Bad record: ' + str(raw_record['Goat']))

Adela
I put the print in is_record_ok.
That works, too.
I love to count, ah, ah
How about changing clean_goat_scores so it outputs the count of bad records, too? Like:
- ...
- Bad record: Long
- Bad record: Vincent
- Number of bad records: 13
- Total before: 570.0
- Total after: 648.0

Ray
A long time ago, we added a counter to a loop:
- best_pet = ''
- counter = 0
- while best_pet != 'dog':
-     best_pet = input('What is the best pet? ')
-     counter += 1
- print('Best pet: ' + best_pet)
- print('Number of times asked: ' + str(counter))
We could do the same thing here. Have a new variable, add one to it in the loop, and then output it after the loop.
Here's the code so far:
- def clean_goat_scores(raw_goat_scores):
-     # Create a new list for the clean records.
-     cleaned_goat_scores = []
-     # Loop over raw records.
-     for raw_record in raw_goat_scores:
-         # Is the record OK?
-         if is_record_ok(raw_record):
-             # Yes, make a new record with the right data types.
-             clean_record = {
-                 'Goat': raw_record['Goat'],
-                 'Before': float(raw_record['Before']),
-                 'After': float(raw_record['After'])
-             }
-             # Add the new record to the clean list.
-             cleaned_goat_scores.append(clean_record)
-         else:
-             print('Bad record: ' + str(raw_record['Goat']))
-     # Send the cleaned list back.
-     return cleaned_goat_scores
Add code to count the number of bad records.

Ray
How about this?
- def clean_goat_scores(raw_goat_scores):
-     # Create a new list for the clean records.
-     cleaned_goat_scores = []
-     # Initialize bad record count.
-     bad_record_count = 0
-     # Loop over raw records.
-     for raw_record in raw_goat_scores:
-         # Is the record OK?
-         if is_record_ok(raw_record):
-             # Yes, make a new record with the right data types.
-             clean_record = {
-                 'Goat': raw_record['Goat'],
-                 'Before': float(raw_record['Before']),
-                 'After': float(raw_record['After'])
-             }
-             # Add the new record to the clean list.
-             cleaned_goat_scores.append(clean_record)
-         else:
-             print('Bad record: ' + str(raw_record['Goat']))
-             bad_record_count += 1
-     print('Number of bad records: ' + str(bad_record_count))
-     # Send the cleaned list back.
-     return cleaned_goat_scores

Adela
Kieran, is that the way you'd do it?
Hmm. Probably not. I like data cleaning functions to not do any output directly, in case I want to do something else with the results, like write them to a file, make a webpage, whatevs.
Remember how we can return more than one value from a function? Like return x, y.
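As a refresher, here's a tiny sketch of a function that returns two values. The function lowest_and_highest and the scores are made up for the example.

```python
def lowest_and_highest(scores):
    # Return two values, separated by a comma.
    return min(scores), max(scores)


# Catch the two values in two variables, in the same order.
lowest, highest = lowest_and_highest([52, 61, 30, 45])
print('Lowest: ' + str(lowest))    # Lowest: 30
print('Highest: ' + str(highest))  # Highest: 61
```

Python packs the returned values into a tuple, and the assignment unpacks them again.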

Ethan
I can see how you'd return the bad record count...
- return clean_scores, bad_record_count
... but how would you send back all the goat names?
- return clean_scores, bad_record_count, goat1, goat2...
That doesn't seem right.
You're correct, that isn't the best way to do it. cleaned_goat_scores is a single variable, but it contains many values within itself.

Ethan
Oh, yeah! Return a variable containing all the names from the bad records.
Use a list, right?
Aye!
Here's the code again, from a few steps back, without counting or showing bad records.
- def clean_goat_scores(raw_goat_scores):
-     # Create a new list for the clean records.
-     cleaned_goat_scores = []
-     # Loop over raw records.
-     for raw_record in raw_goat_scores:
-         # Is the record OK?
-         if is_record_ok(raw_record):
-             # Yes, make a new record with the right data types.
-             clean_record = {
-                 'Goat': raw_record['Goat'],
-                 'Before': float(raw_record['Before']),
-                 'After': float(raw_record['After'])
-             }
-             # Add the new record to the clean list.
-             cleaned_goat_scores.append(clean_record)
-     # Send the cleaned list back.
-     return cleaned_goat_scores
How would you change the code to return a list of Goat fields for records with errors?

Georgina
I think I got it.
- def clean_goat_scores(raw_goat_scores):
-     # Create a new list for the clean records.
-     cleaned_goat_scores = []
-     # Create a list for goats with bad records.
-     goats_bad_records = []
-     # Loop over raw records.
-     for raw_record in raw_goat_scores:
-         # Is the record OK?
-         if is_record_ok(raw_record):
-             # Yes, make a new record with the right data types.
-             clean_record = {
-                 'Goat': raw_record['Goat'],
-                 'Before': float(raw_record['Before']),
-                 'After': float(raw_record['After'])
-             }
-             # Add the new record to the clean list.
-             cleaned_goat_scores.append(clean_record)
-         else:
-             # Add the goat name to the list of bad records.
-             goats_bad_records.append(raw_record['Goat'])
-     # Send the lists back.
-     return cleaned_goat_scores, goats_bad_records

Adela
We've lost the count of bad records, though.
True, but we can get that in the main program.
- # Read goat scores from CSV file.
- raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
- # Filter out bad records.
- cleaned_goat_scores, goats_bad_records = clean_goat_scores(raw_goat_scores)
- print('Number of bad records: ' + str(????))
Finish the code.

Ray
I got:
- print('Number of bad records: ' + str(len(goats_bad_records)))
Great!
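Here's a sketch of how the main program might use both returned lists. The cleaning function does no output; the main program decides what to show. The raw records are made up so you can run this without the CSV file, and is_record_ok is a compact stand-in that is meant to behave like the one from earlier.

```python
def is_score_ok(score_in):
    try:
        score_number = float(score_in)
    except Exception:
        return False
    if score_number < 0 or score_number > 100:
        return False
    return True


def is_record_ok(record):
    # Compact stand-in for the earlier version.
    if record['Goat'] == '' or record['Goat'] is None:
        return False
    return is_score_ok(record['Before']) and is_score_ok(record['After'])


def clean_goat_scores(raw_goat_scores):
    cleaned_goat_scores = []
    goats_bad_records = []
    for raw_record in raw_goat_scores:
        if is_record_ok(raw_record):
            clean_record = {
                'Goat': raw_record['Goat'],
                'Before': float(raw_record['Before']),
                'After': float(raw_record['After'])
            }
            cleaned_goat_scores.append(clean_record)
        else:
            goats_bad_records.append(raw_record['Goat'])
    # No printing in here. Send both lists back.
    return cleaned_goat_scores, goats_bad_records


# Made-up sample data, standing in for the CSV data set.
raw_goat_scores = [
    {'Goat': 'Alma', 'Before': '52', 'After': '61'},
    {'Goat': '', 'Before': '40', 'After': '50'},
    {'Goat': 'Birdie', 'Before': 'abc', 'After': '70'},
    {'Goat': 'Clover', 'Before': '30', 'After': '145'},
    {'Goat': 'Dot', 'Before': '30', 'After': '45'},
]

cleaned_goat_scores, goats_bad_records = clean_goat_scores(raw_goat_scores)

# The main program chooses what to do with the results.
for goat_name in goats_bad_records:
    print('Bad record: ' + str(goat_name))
print('Number of bad records: ' + str(len(goats_bad_records)))
```

Because the function only returns data, the same two lists could just as easily be written to a file or shown on a webpage.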
A new pattern
Write a function that takes a data set as a parameter. Some of the records in the data set might have errors. The function returns a new data set with no errors, and with each field converted to the data type the analysis needs.
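The pattern isn't only for goats. Here's a sketch of the same shape applied to a different data set, using the temperature field Ray mentioned. The names clean_readings and is_reading_ok, the Station field, and the sample readings are all made up for illustration.

```python
def is_reading_ok(reading):
    # Hypothetical validator: needs a station name and a numeric temperature.
    if reading['Station'] == '' or reading['Station'] is None:
        return False
    try:
        float(reading['Temperature'])
    except Exception:
        return False
    return True


def clean_readings(raw_readings):
    # Same pattern: a for loop, with an if inside.
    cleaned_readings = []
    for raw_reading in raw_readings:
        if is_reading_ok(raw_reading):
            clean_reading = {
                'Station': raw_reading['Station'],
                'Temperature': float(raw_reading['Temperature'])
            }
            cleaned_readings.append(clean_reading)
    return cleaned_readings


# Made-up sample data.
raw_readings = [
    {'Station': 'North', 'Temperature': '21.5'},
    {'Station': 'South', 'Temperature': 'warm'},
    {'Station': '', 'Temperature': '19'},
]
cleaned_readings = clean_readings(raw_readings)
print(cleaned_readings)
```

Only the North reading survives, and its temperature is now a number instead of a string.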
Summary
- It's a good idea to separate validation and analysis entirely. Remove bad records from the data set before starting any analysis.
- Decomp(osition) is good.