Even cleaner | Python with Pets


Multiple choice

What is a list?

A collection of values. Each one is accessed through an index that's an integer.

A collection of values.
Each one is accessed through a key that's usually a string.

A set of unique values (no two in the list are the same).
Each one is accessed through an index that's an integer.

A collection of dictionaries.
Each one is a record.

Not graded. So why do it?

Multiple choice

What's a dictionary?

A collection of values.
Each one is accessed through an index that's an integer.

A collection of values.
Each one is accessed through a key that's usually a string.

A set of unique values (no two in the list are the same).
Each one is accessed through an index that's an integer.

A collection of lists.
Each one is a record.


Multiple choice

Lists are delimited by ____, dictionaries by _____.

() and {}

{} and ()

[] and ()

[] and {}

{} and []


Multiple choice

team is a list of dictionaries, one per player. Each dictionary has keys id, name, and position. What's the best way to print the players' names?

index = 0
while index < len(team):
    print(team[index]['name'])
    index += 1

for player in team:
    print(player[name])

foreach player in team:
    player = next(team)
    name = player[name]
    print(name)

for player in team:
    print(player['name'])

for 'player' from team:
    print('player[name]')


Multiple choice

Finish this.

# Function.
def lbs_to_kg(lbs):
    kg = lbs / 2.2
    # What goes here?

# Main program.
pounds = float(input('Pounds? '))
kilos = lbs_to_kg(pounds)
print('Kilos: ' + str(kilos))

return kg

return lbs

lbs_to_kg = kg

return kilos


Multiple choice

def football_player_total_weight_kg(body_weight, unit):
    unit = What goes here?
    if unit == 'lb' or unit == 'lbs':
        body_weight_kg = What goes here?
    else:
        body_weight_kg = body_weight
    parents_expectations_weight = body_weight_kg * 0.2
    total_weight = body_weight_kg + parents_expectations_weight
    return total_weight

def lbs_to_kg(lbs):
    kg = lbs / 2.2
    return kg

def normalize(text):
    text = text.lower().strip()
    return text

body_weight = float(input('Body weight (lbs or kg)? '))
weight_unit = input('Weight unit (lbs or kg)? ')
total_weight = What goes here?
print('Total weight: ' + str(total_weight))

normalize(unit)
...
lbs_to_kg(body_weight)
...
football_player_total_weight_kg(body_weight, weight_unit)

normalize(unit)
...
lbs_to_kg(body_weight)
...
football_player_total_weight_kg(body_weight, unit)

normalize(weight_unit)
...
lbs_to_kg(body_weight)
...
football_player_total_weight_kg(body_weight, weight_unit)

normalize()
...
lbs_to_kg(weight)
...
football_player_total_weight_kg(body_weight)


Multiple choice

Complete:

What goes here?:
    state = What goes here?
    if state == 'mi':
        tax_rate = 0.06
    elif state == 'in':
        tax_rate = 0.07
    else:
        tax_rate = 0.0625
    sales_tax = price * tax_rate
    What goes here?

def normalize(text):
    text = text.lower().strip()
    return text

price = float(input('Price? '))
state_abbrev = input('State (IN, IL, MI)? ')
sales_tax = compute_sales_tax(price, state_abbrev)
total = price + sales_tax
print('Price: ' + str(price))
print('Sales tax (' + state_abbrev + '): ' + str(sales_tax))
print('Total: ' + str(total))

def compute_sales_tax(state, price):
    ...
    normalize(state)
    ...
    return sales_tax

def compute_sales_tax(price, state):
    ...
    normalize(state)
    ...
    return price + sales_tax

def compute_sales_tax(price, state):
    ...
    normalize(state_abbrev)
    ...
    return sales_tax

def compute_sales_tax(price, state):
    ...
    normalize(state)
    ...
    return sales_tax


Multiple choice

Here's a call to read_csv_data_set:

animal_cuteness = read_csv_data_set('animal-cuteness-ratings.csv')

What's the best description of animal_cuteness?

A file with one record per animal.
Field values are separated by commas.

A dictionary of lists.
Each list in the dictionary is one record.

A list of dictionaries.
Each dictionary in the list is one record.

A list of lists.
Each list in the outer list is one record.

A dictionary of dictionaries.
Each dictionary in the outer dictionary is one record.


Multiple choice

Here's a call to read_csv_data_set:

animal_cuteness = read_csv_data_set('animal-cuteness-ratings.csv')

Which of the following is true of animal_cuteness?

All fields in animal_cuteness's dictionaries
are strings.

All fields in animal_cuteness's dictionaries
match the underlying data types in the CSV file.
So if a column in the file contains only integers,
the corresponding field values in
animal_cuteness will be integers.

All fields in animal_cuteness's dictionaries
match the data types suggested by the column names
in the CSV file. So if a column is called Rating,
its values will be numeric in animal_cuteness.
For a column called Name, its values will be strings.


Multiple choice

Complete this program.

clog_sizes = [8, 7, 11, 8.5, 10, 7.5, 6]
total = 0
count = 0
for something in something else:
    total += size
    count += 1
average = total / count
print('Average: ' + str(average))

clog_sizes and size

size and clog_sizes

sizes and clog_size

clog_size and sizes


The story so far

read_csv_data_set is a function that reads a CSV file into a list of dictionaries. Each dictionary is one record, and the list holds all of the records. Every field value is a string.

goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
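
The code for read_csv_data_set isn't shown in this lesson. As a rough sketch of how a function like it might work, assuming it is built on Python's standard csv module (that's an assumption, the course's own version may differ), here is a stand-in that reads CSV text instead of a file. The name read_csv_text and the sample data are made up for illustration. Notice that every field comes back as a string, even the ones that look like numbers.

```python
import csv
import io


def read_csv_text(csv_text):
    # Hypothetical stand-in for read_csv_data_set. It reads CSV text
    # instead of a file, so the example runs without a file on disk.
    reader = csv.DictReader(io.StringIO(csv_text))
    # Each row becomes a plain dictionary, keyed by column name.
    return [dict(row) for row in reader]


# Made-up sample data, not the course's real file.
sample = 'Goat,Before,After\nAlma,40,55\nBruno,sixty,70\n'
records = read_csv_text(sample)
for record in records:
    print(record)
# Every field is a string, even the ones that look like numbers.
print(type(records[0]['Before']))
```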

We made a validation function. Give it a record (a dictionary), and it will return True if the data is valid, or False if not.

Explain the code to this goat, to remind yourself how it works. You can explain it in your head, if you don't want to do it out loud.

A goat
Persephone

def is_record_ok(record):
    # Check name.
    goat_name = record['Goat']
    if goat_name == '' or goat_name is None:
        return False
    # Check Before value.
    before = record['Before']
    if not is_score_ok(before):
        return False
    # Check After value.
    after = record['After']
    if not is_score_ok(after):
        return False
    return True

The record validation function calls is_score_ok, which validates a single numeric field (supposed to be numeric, anyway). Explain it to Persephone.

def is_score_ok(score_in):
    # Is it a number?
    try:
        score_number = float(score_in)
    except Exception:
        return False
    # Check range.
    if score_number < 0 or score_number > 100:
        return False
    # All OK.
    return True
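
To check your explanation, here is a small sketch that calls is_score_ok with a few values. The function is copied from the lesson. The test values are made up, chosen so that each path through the function gets used at least once.

```python
def is_score_ok(score_in):
    # Is it a number?
    try:
        score_number = float(score_in)
    except Exception:
        return False
    # Check range.
    if score_number < 0 or score_number > 100:
        return False
    # All OK.
    return True


# Made-up test values.
print(is_score_ok('87'))     # A numeric string, in range.
print(is_score_ok('87.5'))   # Decimals are fine, too.
print(is_score_ok('lots'))   # Not a number: float() raises ValueError.
print(is_score_ok(''))       # An empty string is not a number, either.
print(is_score_ok(None))     # A missing value: float() raises TypeError.
print(is_score_ok('101'))    # A number, but out of range.
```

The first two calls print True, and the rest print False. Catching Exception covers both the ValueError and the TypeError cases with one except.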

Finally, we had a complete program. Persephone awaits.

goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
total_before = 0
total_after = 0
for goat in goat_scores:
    if is_record_ok(goat):
        total_before += float(goat['Before'])
        total_after += float(goat['After'])
print('Total before: ' + str(total_before))
print('Total after: ' + str(total_after))

Another way

We can change the code a little, to make it easier to write analysis code. This...

for goat in goat_scores:
    if is_record_ok(goat):
        total_before += float(goat['Before'])

... checks each record inside a computation loop. That works, but it combines validation and analysis.

Another way to do things is to separate validation and analysis entirely. Remove bad records from the data set before starting any analysis.

Clean data as a separate step

Clean data before analysis

clean_goat_scores is a new function that takes in a data set that might have errors, and returns a data set that does not have errors.

The analysis code will be easier to write, since it doesn't have to worry about validation.

Cleaner

Ray

This sounds hard!

What do you do when something's hard?

Ray

Umm... break it into pieces?

Aye.

Eating elephants

There is only one way to eat an elephant: a bite at a time.

Attributed to Desmond Tutu

Like this:

# Read goat scores from CSV file.
raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
# Filter out bad records.
cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
# Analysis.
total_before = 0
total_after = 0
for goat in cleaned_goat_scores:
    total_before += goat['Before']
    total_after += goat['After']
print('Total before: ' + str(total_before))
print('Total after: ' + str(total_after))

Line 2 reads the data set from the CSV file into the variable raw_goat_scores. It might contain errors.

Line 4 calls a new function that goes through raw_goat_scores, and makes a new data set with only valid records. The new data set is called cleaned_goat_scores.

Now that we have a data set without errors, the analysis doesn't need error-checking code. So instead of...

for goat in goat_scores:
    if is_record_ok(goat):  # Error check
        total_before += float(goat['Before'])
        total_after += float(goat['After'])

... we can have...

for goat in cleaned_goat_scores:
    total_before += goat['Before']
    total_after += goat['After']

Simpler, since there's no validation needed here. It's already been done.

Adela

Looks like the float calls are gone, too. Are they done in clean_goat_scores?

Right! Cleaning means doing anything needed to get the data ready for analysis, including validation and type conversion.
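
To see why the type conversion matters, here is a small sketch with a made-up record. Fields read from a CSV file are strings, and + on two strings joins them instead of adding them.

```python
# A made-up raw record. Fields are strings, the way they come from a CSV file.
raw_record = {'Goat': 'Alma', 'Before': '40', 'After': '55'}

# With strings, + joins the text.
joined = raw_record['Before'] + raw_record['After']
print(joined)  # 4055

# After conversion, + adds the numbers.
added = float(raw_record['Before']) + float(raw_record['After'])
print(added)  # 95.0
```

In the analysis loop, total_before starts at the number 0, so adding a string field to it would stop the program with a TypeError. Converting during cleaning means the analysis code never has to think about it.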

Multiple choice

What is a function's signature?

The function's name, parameters, and return types.

A mapping from a function's parameters to its return values.

The function's docstring.

Deets on who wrote the function, and when.


Let's start with the signature. You can get it from the code we saw earlier.

# Read goat scores from CSV file.
raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
# Filter out bad records.
cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
# Analysis.
total_before = 0
total_after = 0
for goat in cleaned_goat_scores:
    total_before += goat['Before']
    total_after += goat['After']
print('Total before: ' + str(total_before))
print('Total after: ' + str(total_after))

Ray

The function is called clean_goat_scores. It takes one parameter: a list of dictionaries, made by read_csv_data_set. It returns... a list of dictionaries?

Aye. But what's different about the return list?

Ray

It only has valid records. Oh, and the data types. Like, if there's a field called temperature, that'll be a string in the data set read_csv_data_set returns, but it needs to be numeric for analysis.

So it would be something like:

def clean_goat_scores(raw_goat_scores):
    Something
    return cleaned_goat_scores

Indeed! Good job!

Let me give you a head start, with some comments.

def clean_goat_scores(raw_goat_scores):
    # Create a new list for the clean records.
    # Loop over raw records.
        # Is the record OK?
            # Yes, make a new record with the right data types.
            # Add the new record to the clean list.
    # Send the cleaned list back.
    return cleaned_goat_scores

Reflect

Finish the function, as much as you can.


Ray

OK... I think I remember how to make a new list.

def clean_goat_scores(raw_goat_scores):
    # Create a new list for the clean records.
    cleaned_goat_scores = []
    # Loop over raw records.
        # Is the record OK?
            # Yes, make a new record with the right data types.
            # Add the new record to the clean list.
    # Send the cleaned list back.
    return cleaned_goat_scores

The loopy loop is a for. Maybe...

def clean_goat_scores(raw_goat_scores):
    # Create a new list for the clean records.
    cleaned_goat_scores = []
    # Loop over raw records.
    for raw_record in raw_goat_scores:
        # Is the record OK?
            # Yes, make a new record with the right data types.
            # Add the new record to the clean list.
    # Send the cleaned list back.
    return cleaned_goat_scores

The next step is checking raw_record... I'm not sure...

Adela

We already wrote that code. This is from an earlier lesson.

total_before = 0
for goat in goat_scores:
    if is_record_ok(goat):
        total_before += float(goat['Before'])

Give is_record_ok a record, and it tells you whether it's OK.

Ray

Oh, you're right! Can I use that function in this code?

def clean_goat_scores(raw_goat_scores):
    # Create a new list for the clean records.
    cleaned_goat_scores = []
    # Loop over raw records.
    for raw_record in raw_goat_scores:
        # Is the record OK?
        if is_record_ok(raw_record):
            # Yes, make a new record with the right data types.
            # Add the new record to the clean list.
    # Send the cleaned list back.
    return cleaned_goat_scores

Now, to make a new record... How do I do that?

What's a record look like? Is it a list, a string...

Reflect

What type of thing is a record?


Ray

It's a dictionary. We talked about them before... Aha! I found this code in an earlier lesson.

best_pokemon = {
    'name': 'Snorlax',
    'generation': 1,
    'pokedex number': 143
}

So, to make a new goat scores record...

def clean_goat_scores(raw_goat_scores):
    # Create a new list for the clean records.
    cleaned_goat_scores = []
    # Loop over raw records.
    for raw_record in raw_goat_scores:
        # Is the record OK?
        if is_record_ok(raw_record):
            # Yes, make a new record with the right data types.
            clean_record = {
                'Goat': raw_record['Goat'],
                'Before': float(raw_record['Before']),
                'After': float(raw_record['After'])
            }
            # Add the new record to the clean list.
    # Send the cleaned list back.
    return cleaned_goat_scores

I remembered to add the floats, to get the data types right.

Nice work!

Ray

Last bit.

def clean_goat_scores(raw_goat_scores):
    # Create a new list for the clean records.
    cleaned_goat_scores = []
    # Loop over raw records.
    for raw_record in raw_goat_scores:
        # Is the record OK?
        if is_record_ok(raw_record):
            # Yes, make a new record with the right data types.
            clean_record = {
                'Goat': raw_record['Goat'],
                'Before': float(raw_record['Before']),
                'After': float(raw_record['After'])
            }
            # Add the new record to the clean list.
            cleaned_goat_scores.append(clean_record)
    # Send the cleaned list back.
    return cleaned_goat_scores

Yay!

Georgina

Great work, Ray!

Ethan

Yeah! Good job!

Ray, anything that occurred to you as you wrote this?

Ray

A coupla things. The new function is like other stuff we did. A for loop, with an if inside.

This is a new program, but we reused read_csv_data_set and is_record_ok. Oh, and is_score_ok, too. is_record_ok calls it. Already written, as functions we could reuse.

I'm feeling good about this programming thing.

Doing more while checking data

You can add to clean_goat_scores to do all sorts of things. It's a Python function, and can do anything Python can do.

For example, you can print out which records have errors. Maybe show the Goat field of invalid records, like this:

Bad record:
Bad record:
Bad record: Bertha
Bad record: Bessie
Bad record: Boyd
Bad record: Bridgette
Bad record: Carrie
Bad record: Darell
Bad record: Deborah
Bad record: Gerald
Bad record: Johnnie
Bad record: Long
Bad record: Vincent
Total before: 570.0
Total after: 648.0

Two records are missing goat names. That's what the first two lines show you.

Here's the code Ray wrote:

def clean_goat_scores(raw_goat_scores):
    # Create a new list for the clean records.
    cleaned_goat_scores = []
    # Loop over raw records.
    for raw_record in raw_goat_scores:
        # Is the record OK?
        if is_record_ok(raw_record):
            # Yes, make a new record with the right data types.
            clean_record = {
                'Goat': raw_record['Goat'],
                'Before': float(raw_record['Before']),
                'After': float(raw_record['After'])
            }
            # Add the new record to the clean list.
            cleaned_goat_scores.append(clean_record)
    # Send the cleaned list back.
    return cleaned_goat_scores

Reflect

Add code to show the Goat field of invalid records.

Hint: The code inside the if...

if is_record_ok(raw_record):

... runs when is_record_ok returns True. The print for the bad record should run when is_record_ok returns False.


Something like:

for raw_record in raw_goat_scores:
    # Is the record OK?
    if is_record_ok(raw_record):
        ...
        cleaned_goat_scores.append(clean_record)
    else:
        print('Bad record: ' + str(raw_record['Goat']))

Adela

I put the print in is_record_ok.

That works, too.

I love to count, ah, ah

How about changing clean_goat_scores so it outputs the count of bad records, too? Like:

...
Bad record: Long
Bad record: Vincent
Number of bad records: 13
Total before: 570.0
Total after: 648.0

Ray

A long time ago, we added a counter to a loop:

best_pet = ''
counter = 0
while best_pet != 'dog':
    best_pet = input('What is the best pet? ')
    counter += 1
print('Best pet: ' + best_pet)
print('Number of times asked: ' + str(counter))

We could do the same thing here. Have a new variable, add one to it in the loop, and then output it after the loop.

Here's the code so far:

def clean_goat_scores(raw_goat_scores):
    # Create a new list for the clean records.
    cleaned_goat_scores = []
    # Loop over raw records.
    for raw_record in raw_goat_scores:
        # Is the record OK?
        if is_record_ok(raw_record):
            # Yes, make a new record with the right data types.
            clean_record = {
                'Goat': raw_record['Goat'],
                'Before': float(raw_record['Before']),
                'After': float(raw_record['After'])
            }
            # Add the new record to the clean list.
            cleaned_goat_scores.append(clean_record)
        else:
            print('Bad record: ' + str(raw_record['Goat']))
    # Send the cleaned list back.
    return cleaned_goat_scores

Reflect

Add code to count the number of bad records.


Ray

How about this?

def clean_goat_scores(raw_goat_scores):
    # Create a new list for the clean records.
    cleaned_goat_scores = []
    # Initialize bad record count.
    bad_record_count = 0
    # Loop over raw records.
    for raw_record in raw_goat_scores:
        # Is the record OK?
        if is_record_ok(raw_record):
            # Yes, make a new record with the right data types.
            clean_record = {
                'Goat': raw_record['Goat'],
                'Before': float(raw_record['Before']),
                'After': float(raw_record['After'])
            }
            # Add the new record to the clean list.
            cleaned_goat_scores.append(clean_record)
        else:
            print('Bad record: ' + str(raw_record['Goat']))
            bad_record_count += 1
    print('Number of bad records: ' + str(bad_record_count))
    # Send the cleaned list back.
    return cleaned_goat_scores

Adela

Kieran, is that the way you'd do it?

Hmm. Probably not. I like data cleaning functions to not do any output directly, in case I want to do something else with the results, like write them to a file, make a webpage, whatevs.

Remember how we can return more than one value from a function? Like return x, y.
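
As a quick reminder of how that works, here is a small sketch. The function min_and_max is made up for this example. It returns two values, and the caller unpacks them into two variables.

```python
def min_and_max(numbers):
    # Hypothetical helper: returns two values, separated by a comma.
    smallest = min(numbers)
    largest = max(numbers)
    return smallest, largest


# The caller unpacks the two values into two variables.
low, high = min_and_max([8, 7, 11, 8.5, 10, 7.5, 6])
print('Smallest: ' + str(low))
print('Largest: ' + str(high))
```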

Ethan

I can see how you'd return the bad record count...

return clean_scores, bad_record_count

... but how would you send back all the goat names?

return clean_scores, bad_record_count, goat1, goat2...

That doesn't seem right.

You're correct, that isn't the best way to do it. cleaned_goat_scores is a single variable, but it contains many values within itself.

Ethan

Oh, yeah! Return a variable containing all the names from the bad records.

Use a list, right?

Aye!

Here's the code again, from a few steps back, without counting or showing bad records.

def clean_goat_scores(raw_goat_scores):
    # Create a new list for the clean records.
    cleaned_goat_scores = []
    # Loop over raw records.
    for raw_record in raw_goat_scores:
        # Is the record OK?
        if is_record_ok(raw_record):
            # Yes, make a new record with the right data types.
            clean_record = {
                'Goat': raw_record['Goat'],
                'Before': float(raw_record['Before']),
                'After': float(raw_record['After'])
            }
            # Add the new record to the clean list.
            cleaned_goat_scores.append(clean_record)
    # Send the cleaned list back.
    return cleaned_goat_scores

Reflect

How would you change the code to return a list of Goat fields for records with errors?


Georgina

I think I got it.

def clean_goat_scores(raw_goat_scores):
    # Create a new list for the clean records.
    cleaned_goat_scores = []
    # Create a list for goats with bad records.
    goats_bad_records = []
    # Loop over raw records.
    for raw_record in raw_goat_scores:
        # Is the record OK?
        if is_record_ok(raw_record):
            # Yes, make a new record with the right data types.
            clean_record = {
                'Goat': raw_record['Goat'],
                'Before': float(raw_record['Before']),
                'After': float(raw_record['After'])
            }
            # Add the new record to the clean list.
            cleaned_goat_scores.append(clean_record)
        else:
            # Add the goat name to the list of bad records.
            goats_bad_records.append(raw_record['Goat'])
    # Send the lists back.
    return cleaned_goat_scores, goats_bad_records

Adela

We've lost the count of bad records, though.

True, but we can get that in the main program.

# Read goat scores from CSV file.
raw_goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
# Filter out bad records.
cleaned_goat_scores, goats_bad_records = clean_goat_scores(raw_goat_scores)
print('Number of bad records: ' + str(????))

Reflect

Finish the code.


Ray

I got:

print('Number of bad records: ' + str(len(goats_bad_records)))

Great!
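
Putting the pieces together, here is a sketch of the whole program. To make it run on its own, it uses a small made-up data set in place of the CSV file. The sample records and the variable sample_raw_goat_scores are assumptions for this example, not part of the lesson's data. The three functions are the ones from the lesson, with comments trimmed. All of the output happens in the main program, so clean_goat_scores has no print calls.

```python
def is_score_ok(score_in):
    # Valid if it converts to a number from 0 to 100.
    try:
        score_number = float(score_in)
    except Exception:
        return False
    if score_number < 0 or score_number > 100:
        return False
    return True


def is_record_ok(record):
    # Valid if there is a name, and both scores are OK.
    goat_name = record['Goat']
    if goat_name == '' or goat_name is None:
        return False
    if not is_score_ok(record['Before']):
        return False
    if not is_score_ok(record['After']):
        return False
    return True


def clean_goat_scores(raw_goat_scores):
    cleaned_goat_scores = []
    goats_bad_records = []
    for raw_record in raw_goat_scores:
        if is_record_ok(raw_record):
            # Make a new record with the right data types.
            clean_record = {
                'Goat': raw_record['Goat'],
                'Before': float(raw_record['Before']),
                'After': float(raw_record['After'])
            }
            cleaned_goat_scores.append(clean_record)
        else:
            goats_bad_records.append(raw_record['Goat'])
    return cleaned_goat_scores, goats_bad_records


# Made-up records standing in for the CSV file (an assumption).
sample_raw_goat_scores = [
    {'Goat': 'Alma', 'Before': '40', 'After': '55'},
    {'Goat': '', 'Before': '50', 'After': '60'},
    {'Goat': 'Bruno', 'Before': 'sixty', 'After': '70'},
    {'Goat': 'Cleo', 'Before': '30', 'After': '45'},
    {'Goat': 'Dag', 'Before': '20', 'After': '120'},
]

# Filter out bad records.
cleaned_goat_scores, goats_bad_records = clean_goat_scores(sample_raw_goat_scores)
# Report the bad records. The output is here, not in the cleaner.
for goat_name in goats_bad_records:
    print('Bad record: ' + str(goat_name))
print('Number of bad records: ' + str(len(goats_bad_records)))
# Analysis.
total_before = 0
total_after = 0
for goat in cleaned_goat_scores:
    total_before += goat['Before']
    total_after += goat['After']
print('Total before: ' + str(total_before))
print('Total after: ' + str(total_after))
```

With this sample data, three records are rejected: one with no name, one with a Before value that isn't a number, and one with an After value out of range.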

A new pattern

Pattern

Data machine: Cleaner

Write a function that takes a data set as a param. Some of the records in the data set might have errors. The function returns a data set with no errors.
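
The pattern isn't just for goats. As a sketch of the same shape on a different data set, here is a cleaner for pet weights. The field names, the validation rules, and the sample records are all made up for illustration.

```python
def is_pet_ok(record):
    # Hypothetical validator: a name is required, and the weight
    # must be a number greater than zero.
    if record['Name'] == '' or record['Name'] is None:
        return False
    try:
        weight = float(record['Weight'])
    except Exception:
        return False
    return weight > 0


def clean_pets(raw_pets):
    # Cleaner: takes a data set that might have errors, and
    # returns a data set with only valid, converted records.
    cleaned_pets = []
    for raw_record in raw_pets:
        if is_pet_ok(raw_record):
            clean_record = {
                'Name': raw_record['Name'],
                'Weight': float(raw_record['Weight'])
            }
            cleaned_pets.append(clean_record)
    return cleaned_pets


# Made-up sample data.
raw_pets = [
    {'Name': 'Rosie', 'Weight': '31.5'},
    {'Name': 'Renata', 'Weight': 'heavy'},
    {'Name': '', 'Weight': '12'},
]
cleaned_pets = clean_pets(raw_pets)
print(cleaned_pets)
```

Same moves as before: a new list, a for loop with an if inside, a new record with the right data types, and a return.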

Summary

It's a good idea to separate validation and analysis entirely. Remove bad records from the data set before starting any analysis.
Decomp(osition) is good.