Cleaning

Sloth thieves

Sloths are cute, and harmless. Well, that's how it seems. In fact, sloths are outstanding pickpockets. Nobody suspects them, but they can flash into a pocket or purse, and extract some cash.

Seriously! Sloths are a hazard!

Here's a CSV file (that's a link) with data on sloths at one slothary. Here's part of the file:

"Name","Amount stolen"
"Azucena",261
"Dawne",210
"Rozanne",223
"Elenor",136

Let's write a program to total the amount stolen by sloths. I'll open a new .py file, and put the function read_csv_data_set into it, the thing we talked about earlier. This program will show how many sloths there are in the file.

sloth_theft = read_csv_data_set('sloth-theft.csv')
print('Number of sloths: ' + str(len(sloth_theft)))

The first line reads the data, and the second shows the number of items in the list.

Reflect

Change the program so it shows the names of all the sloths. Hint: it takes two lines.

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Ethan

How about this:

for sloth in sloth_theft:
print(sloth['Name'])

Good! for loops over sloth_theft:

The first time through the loop, sloth is set to sloth_theft[0], and the print runs.
The second time through the loop, sloth is set to sloth_theft[1], and the print runs.
The third time through the loop, sloth is set to sloth_theft[2], and the print runs.
...

Reflect

Change the program to show the amounts.

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Ray

Just change from print(sloth['Name']) to print(sloth['Amount stolen']).

Oh, I changed it to amount stolen at first, but it didn't work.

Reflect

Why didn't it work?

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Ray

Amount, not amount.

Correct!

Totaling

Now let's write a program to total up the amounts. Here's some code:

total = 0
for sloth in sloth_theft:
total += sloth['Amount stolen']
print('Total: ' + str(total))

Reflect

Explain how this code works, in your own words.

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Adela

The first line makes the variable total, and sets it to 0. The for loops over the data. The looping variable sloth is the first sloth the first time through the loop, the second sloth the second time, etc.

Line 3 (total += sloth['Amount stolen']) takes whatever it in the Amount stolen field, and adds it to total. total keeps going up, as each amount is added.

Once it's done, line 4 shows total.

Reflect

Try the program. What happens?

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Adela

Well, 💩. I got an error on the line total += sloth['Amount stolen']:

TypeError: unsupported operand type(s) for +=: 'int' and 'str'

Reflect

What's happening?

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Try this in the console:

one_sloth = sloth_theft[0]
type(one_sloth['Amount stolen'])

(type shows the data type of the param.)

Adela

What?! It says amount stolen is a string, but it's a number. Let me check... oh...

(From the console:)

one_sloth = sloth_theft[0]
type(one_sloth['Amount stolen'])
Out[17]: str
one_sloth['Amount stolen']
Out[18]: '261'

Adela

The quotes: '261'. All the data is read as strings, isn't it? Just like input().

Aye, that's it!

Reflect

How would you fix the code?

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Georgina

Not too bad: total += float(sloth['Amount stolen'])

The float() changes the value to a number before adding it to total.

Bee tea dubs, Kieran, should that be float or int?

The data in the file is int, but it's a currency amount, and they can have decimal parts. I'd make it a float, based on the fact that it's a currency.

Remember, you might use the program again, with a different data set, maybe from a different slothary. They could use fractions, like 260.85.

Note

When choosing between int or float for a numeric field, use whatever makes most sense in the business context (deduction), rather than looking at a few records in a file (induction).

Another notey note

Convert field values from read_csv_data_set, which are always strings, to whatever type you need before doing computations.

Missing and bad data

Often data will have mistakes. Let's see how to handle missing and bad data.

Here's another data set. Cthulhu teaches a database class to goats. It had an idea for improving a lesson. It gave some goats the old lesson, then a test (the "before" test), then the new lesson, then another test (the "after" test). What happened to the scores?

OK, a poor experimental design. Tell Cthulhu, if you like. Though it might eat you. With picante sauce.

Here's part of the data set:

"Goat","Before","After"
"Aisha",17,16
"Andreas",14,15
"August",10,13
"Bertha",17,23
"Bessie",20,25
"Boyd",13,12
"Bridgette",16,22
"Carrie",15,19

Aisha got 17 on the first test, 16 on the second.

Let's mess up the data a bit. You can download the file (that's a link).

"Goat","Before","After"
"Aisha",17,16
"",14,15
,10,13
"Bertha",-22,23
"Bessie",20,
"Boyd",13,Some
"Bridgette",16,"a few"
"Carrie",15 19

Line 3 is missing the goat name, but still has the "s.
Line 4 is missing the name field altogether, quotes and all.
Line 5 has a negative test score.
Line 6 is missing a test score.
Lines 7 and 8 have text for a score, with and without "s.
Line 9 has a missing comma.

Let's see what happens when you read the bad data.

Reflect

Write a program to read the data using read_csv_data_set. (You can download the file (that's a link) the data file). The program should print the number of records in the file, and list the goat names. Paste your code here.

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Ethan

I'll start by making a new project, then download the data (that's a link) into its folder. Now I'll make a new program, and paste read_csv_data_set into it.

Now read the data set:

goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')

To show the number of records... Oh, OK. goat_scores is a list of records, so its length will be the number of records. I'll use the len function.

goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
print('Number of scores: ' + str(len(goat_scores)))

OK, now, print out the names of the goats. Each record has a field Goat, with the name, so I'll loop over the records, and print that field.

goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
print('Number of scores: ' + str(len(goat_scores)))
for goat in goat_scores:
print(goat['Goat'])

Nice!

Let's see what read_csv_data_set makes of the bad data.

Reflect

What does read_csv_data_set put in fields with missing or bad data?

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Georgina

Let's see. I'll put a breakpoint just after the read_csv_data_set does its thing.

Breakpoint

Start debug... Check goat_scores in VE. Oh, if I double-click on it, I get an expanded view.

goat-scores

Here are the errors from before...

Line 3 (of the CSV file, not the VE display) is missing the goat name, but still has the "s.
Line 4 is missing the name field altogether, quotes and all.
Line 5 has a negative test score.
Line 6 is missing a test score.
Lines 7 and 8 have text for a score, with and without "s.
Line 9 has a missing comma.

Hmm... looks like read_csv_data_set put in MT strings for missing field values.

Except... it did something different for the missing comma in "Carrie",15 19.

None

What's that about?

Here's that record again:

"Carrie",15 19

There's one comma. Python's CSV library isn't smart enough to break up "15 19" into two fields. "Carrie" goes into the first field, "15 19" into the second. The third isn't there, so you get the special value None of the special type NoneType.

Let's use the console to explore further. (Recall the console remembers variable values from the last program run.)

goat_scores[7] is Carrie's record. Try goat_scores[7] in the console to confirm this.

We want to get the value for After, so try this: goat_scores[7]['After']. Notice you can use [] in the line twice. The first pair accesses list element 7, which is a dictionary, and the second pair gets the element After from that dictionary.

Ray

Console told me there's nothing there. Makes sense, I suppose.

So, if we look at an MT string in the console, like goat_scores[1]['Goat'], we get nothing. If we look at None in the console, like goat_scores[7]['After'], we get nothing.

Are None and an MT string the same?

Try this:

goat_scores[7]['After'] == None

That's true, the console tells us.

Here's the data again:

goat-scores

Reflect

Predict what these will show in the console.

goat_scores[4]['After'] == ''
goat_scores[4]['After'] == None
goat_scores[7]['After'] == ''
goat_scores[7]['After'] == None

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Adela

I think True, False, False, and True.

That's it! An MT string and None are not the same.

Another thing to try:

goat_scores[7]['After'] == goat_scores[1]['Goat']

Computations

Reflect

Complete this program. Type your guess as to the code, but don't try to run it.

goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
Stuff goes here
print('Total before: ' + str(total_before))
print('Total after: ' + str(total_after))

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Adela

OK, here's what I've got.

total_before = 0
total_after = 0
for goat in goat_scores:
total_before += float(goat['Before'])
total_after += float(goat['After'])
print('Total before: ' + str(total_before))
print('Total after: ' + str(total_after))

I remembered to use Before and not before, and to convert the fields to numbers. They're always read as strings.

Now run it.

Adela

Uh oh. I got ValueError: could not convert string to float: ''.

I'm guessing that's an MT string?

It'd be nice to know which goat record is causing the problem.

Reflect

Add a line to the program, so Adela will know which record causes the issue.

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Adela

I'll change it to:

total_before = 0
total_after = 0
for goat in goat_scores:
print(goat['Goat'])
total_before += float(goat['Before'])
total_after += float(goat['After'])
print('Total before: ' + str(total_before))
print('Total after: ' + str(total_after))

Run it... It's Bessie. The record is:

"Bessie",20,

Hmm. After is missing.

I'll add something else to test whether After is MT for that record.

total_before = 0
total_after = 0
for goat in goat_scores:
print(goat['Goat'])
if goat['After'] == '':
print('After is MT')
total_before += float(goat['Before'])
total_after += float(goat['After'])
print('Total before: ' + str(total_before))
print('Total after: ' + str(total_after))

The output is:

Bertha
Bessie
After is MT

Then the error message. So float(goat['After']) breaks when it gets an MT string.

What to do?

Here's some of the data again.

"Goat","Before","After"
"Aisha",17,16
"",14,15
,10,13
"Bertha",-22,23
"Bessie",20,
"Boyd",13,Some
"Bridgette",16,"a few"
"Carrie",15 19

There are seven broken lines in the entire data set, and 43 that are OK. If we want to compute totals for Before and After...

Reflect

What do we do with the broken records?

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Georgina

We could not include the missing data.

Yes. A question, though. For the record...

"Bertha",-22,23

Before is invalid, but After is OK. Do we omit the entire record, or just the -22? If we omit just the -22, we'd still add 23 to the After total.

Georgina

I'm not sure. What do we think?

Ethan

If the Before value is a mistake, can we trust the other fields for Bertha?

Georgina

I see what you mean. Kieran, what do you think?

With a record like...

"Bertha",-22,23

The most common things would be:

Drop the entire record from the analysis, called case-wise deletion.
Drop -22, but keep the rest of the fields.
Replace -22 with the average of all the other values for that field.

What you do depends on how much data you have, what questions you're asking, and other things.

In this course, we'll do case-wise deletion only. That is, if a record has any errors in any field, we ignore it. It's the most cautious approach.

A validation function

Let's imagine code we'd like to have in the main program if we want to compute totals. This didn't work:

goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
total_before = 0
total_after = 0
for goat in goat_scores:
total_before += float(goat['Before'])
total_after += float(goat['After'])
print('Total before: ' + str(total_before))
print('Total after: ' + str(total_after))

Imagine we had a function returning true if a record was OK, or false if the record has errors. If we had such a function, we could fix the program with one line:

goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
total_before = 0
total_after = 0
for goat in goat_scores:
if is_record_ok(goat):
total_before += float(goat['Before'])
total_after += float(goat['After'])
print('Total before: ' + str(total_before))
print('Total after: ' + str(total_after))

Adela

That'd be nice. The program's easy to understand. Send it a record, and get back true or false.

Are we gonna write is_record_ok?

Aye.

Deciding what functions do what can make your life easier. Let's list what could go wrong.

Name could be MT or None.
Before could be nonnumeric (MT, a string, or None), too small, or too large.
After could be nonnumeric (MT, a string, or None), too small, or too large.

Reflect

The last two are very similar. What does this suggest to you?

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Georgina

Maybe write one function to check both?

Aye!

Ray

Wait. We want to write a function is_record_ok to make writing the program easier.

And to make is_record_ok easier to write, we'll write another function to check the Before and After values.

Yes, that's it. That's common, and very effective.

Principle

Decomposition

Programs are too complex to think about all at one time. You need to break them into pieces, and work on one piece at a time. Of course, the pieces must work together.

The function is the most important tool for decomposition.

Decomposition is the most important thing in making programming easier, and programmers more productive.

Multiple choice

What is a function's signature?

The function's name, parameters, and return types.

A mapping from a function's parameters to its return values.

The function's docstring.

Deets on who wrote the function, and when.

Not graded. So why do it?

Checking `Before` and `After`

Let's do this one first, since you've done some of this already.

Reflect

Write the signature for a function that will check a Before or After value.

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Ethan

How about: is_score_ok(value), returning True or False.

Good, that will work. You send a score into a validation function. You call it like this:

is_score_ok(record['Before'])

Ray

You could also send the key, right? Like is_score_field_ok(key). key is a string, either Before or After.

Aye, you could. You'd need to send in the record, so the function could look up the key: is_score_field_ok(record, key).

You'd call it like this:

@is_score_field_ok(goat_record, 'Before')@

That would work, too. I prefer Ethan's way a little, since the function is more generic, and more reusable in other programs.

However, both are correct. It's common for different solutions to be equally correct.

The function should detect when:

The value can't be made numeric. E.g., it's a word.
The value is too small or too large.

Reflect

Try writing the function.

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Ray

Let me try.

We had this earlier:

try:
meal_cost = float(user_input)
except ValueError:
print('Sorry, you must enter a number.')
sys.exit()

That will check for a nonnumeric thing. The range check was:

if meal_cost <= 0 or meal_cost > 1000000:
print('Sorry, please enter a number more than zero, and less than 1,000,000.')
sys.exit()

I'll start with the signature and some comments.

def is_score_ok(score):
# Is it a number?
# Check the range.

I'll put in the earlier code and change it.

def is_score_ok(score):
# Is it a number?
Start old code
try:
meal_cost = float(user_input)
except ValueError:
print('Sorry, you must enter a number.')
sys.exit()
End old code
# Check the range.

Let's see... line 5, change the param and the variable name.

def is_score_ok(score_in):
# Is it a number?
Start old code
try:
score_number = float(score_in)
except ValueError:
print('Sorry, you must enter a number.')
sys.exit()
End old code
# Check the range.

OK, now... what should happen if there's an error? Hmm, we want to return False, not show a message. How about this?

def is_score_ok(score_in):
# Is it a number?
Start old code
try:
score_number = float(score_in)
except ValueError:
return False
End old code
# Check the range.

Can I use return in the middle of a function?

Reflect

Can you use return in the middle of a function?

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Georgina

Yes, You can.

Note

The function returns True or False, and that's it. It doesn't return a number. So if the param isn't numeric, we have enough information to return the right value.

Reflect

Write the code to check the range. Valid values can be from 0 to 100.

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Ray

Here's my function.

def is_score_ok(score_in):
# Is it a number?
try:
score_number = float(score_in)
except Exception:
return False
# Check range.
if score_number < 0 or score_number > 100:
return False
# All OK.
return True

I'm not sure about the last line. I was thinking, if there are errors, return false. But if the compy gets to the end of the function, well, score_in is OK. Is that right?

Yep. Looks good.

Test `is_score_ok`

Reflect

Write a program to test is_score_ok.

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Ray

I started to write some code, but realized I was repeating. So this is what I did:

def check_it(value):
print('Is ' + str(value) + ' OK? ' + str(is_score_ok(value)))
check_it(3)
check_it(-3)
check_it(55)
check_it(-55)
check_it(23.4)
check_it('Goattos')
check_it(None)

Georgina

Nice going! I didn't think of that.

I also noticed you need to convert a boolean (is_score_ok(value) returns a boolean) to a string before you append it to another string.

Did is_score_ok work?

Ray

Yes, perfectly.

Checking an entire record

Let's see where we are. We're writing code to test a goat score record, like this: "Aisha",17,16. The function's signature is is_record_ok(goat_record), returning true or false.

Here are the errors is_record_ok should look for:

Name could be MT or None.
Before score could be nonnumeric (MT, a string, or None), too small, or too large.
After score could be nonnumeric (MT, a string, or None), too small, or too large.

The last two are almost the same, so we wrote a function what would work for both. We wrote is_score_ok(value).

Reflect

Write is_record_ok, passing in a record (records are stored in dictionaries, remember). Check the name, then the Before score, then the After score. Check the scores using is_score_ok(value). Your function should return True if the data is OK, or False if it isn't.

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.

Adela

I think I have it.

def is_record_ok(record):
# Check name.
goat_name = record['Goat']
if goat_name == '' or goat_name == None:
return False
# Check Before value.
before = record['Before']
if not is_score_ok(before):
return False
# Check After value.
after = record['After']
if not is_score_ok(after):
return False
return True

I love it! Each step is simple. Comments make it easy to follow. Nice work!

Assembling everything

We wrote a program to total the Before and After scores, but it didn't work, because of errors in the data. We decided we wanted something like this:

goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
total_before = 0
total_after = 0
for goat in goat_scores:
if is_record_ok(goat):
total_before += float(goat['Before'])
total_after += float(goat['After'])
print('Total before: ' + str(total_before))
print('Total after: ' + str(total_after))

Most of the complexity is in validation, but we hid that away in is_record_ok. When we're writing is_record_ok, we didn't need to think abut the rest of the program. We just need to know it takes a record, and returns a boolean.

Here's is_record_ok:

def is_record_ok(record):
'''
Check whether a record is valid.
Parameters
----------
record : dictionary
The record to check.
Returns
-------
bool
True of the record is OK, False if not.
'''
# Check name.
goat_name = record['Goat']
if goat_name == '' or goat_name == None:
return False
# Check Before value.
before = record['Before']
if not is_score_ok(before):
return False
# Check After value.
after = record['After']
if not is_score_ok(after):
return False
return True

We made another function for numeric and range checking: is_score_ok. It takes a value, and returns a boolean. We didn't need to think about how is_score_ok would be used when we were writing it. We'd already done that. We defined what is_score_ok should do, and just focused on it when writing the function.

def is_score_ok(score_in):
'''
Check whether a score is valid.
Parameters
----------
score_in : string
A value to check.
Returns
-------
bool
True if score_in is numeric and in-range. False otherwise.
'''
# Is it a number?
try:
score_number = float(score_in)
except Exception:
return False
# Check range.
if score_number < 0 or score_number > 100:
return False
# All OK.
return True

Finally, we had our entire program.

Decomposition - breaking a task into bite-sized chunks - is a programmer's second-best friend. Their best friend is their doggo, of course.

Principle

Decomposition

Programs are too complex to think about all at one time. You need to break them into pieces, and work on one piece at a time. Of course, the pieces must work together.

The function is the most important tool for decomposition.

Summary

read_csv_data_set data is always strings. You need to convert it to int or float if you want to do numeric work.
Python has a special constant None used when data is missing.
If there's an error in a record, and you're using case-wise deletion, you drop the entire record.
Decomposition: Programs are too complex to think about all at one time. You need to break them into pieces, and work on one piece at a time. Of course, the pieces must work together.

Exercises

Exercise

Clean movie ratings

It's Wilde Week in goatville. The Wildes are like the Oscars, but for goat movies.

Download (that's a link) a data set on critic and viewer ratings of movies.

Each record has four fields:

Movie title. Must be present.
Genre (romance, action, or scifi). Must be one of the three values given. Extra leading/trailing spaces are OK. Upper/lowercase OK.
Critic rating, from 1 to 5. Must be an integer in this range.
Viewer rating, from 1 to 5. Must be an integer in this range.

Write a program to read the file, and compute the average critic and viewer for valid records. Also report the number of records and valid records.

Output should be like this:

Number of records read: 33
Number of valid records: 25
Average critic rating: 2.76
Average viewer rating: 3.28

Use at least three functions. I used four in my solution.

Upload a zip of your project folder. The usual coding standards apply.

If you were logged in as a student, you could submit a solution to this exercise.

Attachments

db-lesson-scores-bad-data.csv

Sloth thieves

Totaling

Missing and bad data

Computations

What to do?

A validation function

Checking Before and After

Test is_score_ok

Checking an entire record

Assembling everything

Summary

Exercises

Checking `Before` and `After`

Test `is_score_ok`