Not graded. So why do it?
Not graded. So why do it?
Not graded. So why do it?
Not graded. So why do it?
Not graded. So why do it?
So far, you know how to read a CSV file into a list of dictionaries, with one dictionary for each record. Before we get to analysis, we need to talk about data cleaning. If there are errors in the data, we might not get very far with analysis.
First, we'll talk about type conversion. Then we'll get to cleaning errors.
Sloth thieves
Photo by Michael Jerrard
Sloths are cute, and harmless. Well, that's how it seems. In fact, sloths are outstanding pickpockets. Nobody suspects them, but they can flash into a pocket or purse, and extract some cash.
Seriously! Sloths are a hazard!
Here's a CSV file (that's a link) with data on sloths at one slothary. Here's part of the file:
- "Name","Amount stolen"
- "Azucena",261
- "Dawne",210
- "Rozanne",223
- "Elenor",136
Let's write a program to total the amount stolen by sloths. I'll open a new .py
file, and put the function read_csv_data_set
into it, the thing we talked about earlier. This program will show how many sloths there are in the file.
- sloth_theft = read_csv_data_set('sloth-theft.csv')
- print('Number of sloths: ' + str(len(sloth_theft)))
The first line reads the data, and the second shows the number of items in the list.
Change the program so it shows the names of all the sloths. Hint: it takes two lines.

Ethan
How about this:
- for sloth in sloth_theft:
- print(sloth['Name'])
Good! for
loops over sloth_theft
:
- The first time through the loop,
sloth
is set tosloth_theft[0]
, and theprint
runs. - The second time through the loop,
sloth
is set tosloth_theft[1]
, and theprint
runs. - The third time through the loop,
sloth
is set tosloth_theft[2]
, and theprint
runs. - ...
Change the program to show the amounts.

Ray
Just change from print(sloth['Name'])
to print(sloth['Amount stolen'])
.
Oh, I changed it to amount stolen
at first, but it didn't work.
Why didn't it work?

Ray
Amount
, not amount
.
Correct!
Totaling
Now let's write a program to total up the amounts. Here's some code:
- total = 0
- for sloth in sloth_theft:
- total += sloth['Amount stolen']
- print('Total: ' + str(total))
Explain how this code works, in your own words.

Adela
The first line makes the variable total
, and sets it to 0. The for
loops over the data. The looping variable sloth
is the first sloth the first time through the loop, the second sloth the second time, etc.
Line 3 (total += sloth['Amount stolen']
) takes whatever it in the Amount stolen field, and adds it to total
. total
keeps going up, as each amount is added.
Once it's done, line 4 shows total
.
Try the program. What happens?

Adela
Well, 💩. I got an error on the line total += sloth['Amount stolen']
:
TypeError: unsupported operand type(s) for +=: 'int' and 'str'
What's happening?
Try this in the console:
- one_sloth = sloth_theft[0]
- type(one_sloth['Amount stolen'])
(type
shows the data type of the param.)

Adela
What?! It says amount stolen is a string, but it's a number. Let me check... oh...
(From the console:)
- one_sloth = sloth_theft[0]
- type(one_sloth['Amount stolen'])
- Out[17]: str
- one_sloth['Amount stolen']
- Out[18]: '261'

Adela
The quotes: '261'
. All the data is read as strings, isn't it? Just like input()
.
Aye, that's it!
How would you fix the code?

Georgina
Not too bad: total += float(sloth['Amount stolen'])
The float()
changes the value to a number before adding it to total
.
Bee tea dubs, Kieran, should that be float or int?
The data in the file is int, but it's a currency amount, and they can have decimal parts. I'd make it a float, based on the fact that it's a currency.
Remember, you might use the program again, with a different data set, maybe from a different slothary. They could use fractions, like 260.85.
Note
When choosing between int or float for a numeric field, use whatever makes most sense in the business context (deduction), rather than looking at a few records in a file (induction).
Another notey note
Convert field values from read_csv_data_set
, which are always strings, to whatever type you need before doing computations.
Missing and bad data
Often data will have mistakes. Let's see how to handle missing and bad data.
Here's another data set. Cthulhu teaches a database class to goats. It had an idea for improving a lesson. It gave some goats the old lesson, then a test (the "before" test), then the new lesson, then another test (the "after" test). What happened to the scores?
OK, a poor experimental design. Tell Cthulhu, if you like. Though it might eat you. With picante sauce.
Here's part of the data set:
- "Goat","Before","After"
- "Aisha",17,16
- "Andreas",14,15
- "August",10,13
- "Bertha",17,23
- "Bessie",20,25
- "Boyd",13,12
- "Bridgette",16,22
- "Carrie",15,19
Aisha got 17 on the first test, 16 on the second.
Let's mess up the data a bit. You can download the file (that's a link).
- "Goat","Before","After"
- "Aisha",17,16
- "",14,15
- ,10,13
- "Bertha",-22,23
- "Bessie",20,
- "Boyd",13,Some
- "Bridgette",16,"a few"
- "Carrie",15 19
- Line 3 is missing the goat name, but still has the "s.
- Line 4 is missing the name field altogether, quotes and all.
- Line 5 has a negative test score.
- Line 6 is missing a test score.
- Lines 7 and 8 have text for a score, with and without "s.
- Line 9 has a missing comma.
Let's see what happens when you read the bad data.
Write a program to read the data using read_csv_data_set
. (You can download the file (that's a link) the data file). The program should print the number of records in the file, and list the goat names. Paste your code here.

Ethan
I'll start by making a new project, then download the data (that's a link) into its folder. Now I'll make a new program, and paste read_csv_data_set
into it.
Now read the data set:
- goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
To show the number of records... Oh, OK. goat_scores
is a list of records, so its length will be the number of records. I'll use the len
function.
- goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
- print('Number of scores: ' + str(len(goat_scores)))
OK, now, print out the names of the goats. Each record has a field Goat, with the name, so I'll loop over the records, and print that field.
- goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
- print('Number of scores: ' + str(len(goat_scores)))
- for goat in goat_scores:
- print(goat['Goat'])
Nice!
Let's see what read_csv_data_set
makes of the bad data.
What does read_csv_data_set
put in fields with missing or bad data?

Georgina
Let's see. I'll put a breakpoint just after the read_csv_data_set
does its thing.
Start debug... Check goat_scores
in VE. Oh, if I double-click on it, I get an expanded view.
Here are the errors from before...
- Line 3 (of the CSV file, not the VE display) is missing the goat name, but still has the "s.
- Line 4 is missing the name field altogether, quotes and all.
- Line 5 has a negative test score.
- Line 6 is missing a test score.
- Lines 7 and 8 have text for a score, with and without "s.
- Line 9 has a missing comma.
Hmm... looks like read_csv_data_set
put in MT strings for missing field values.
Except... it did something different for the missing comma in "Carrie",15 19
.
What's that about?
Here's that record again:
- "Carrie",15 19
There's one comma. Python's CSV library isn't smart enough to break up "15 19" into two fields. "Carrie" goes into the first field, "15 19" into the second. The third isn't there, so you get the special value None of the special type NoneType.
Let's use the console to explore further. (Recall the console remembers variable values from the last program run.)
goat_scores[7]
is Carrie's record. Try goat_scores[7]
in the console to confirm this.
We want to get the value for After
, so try this: goat_scores[7]['After']
. Notice you can use []
in the line twice. The first pair accesses list element 7, which is a dictionary, and the second pair gets the element After
from that dictionary.

Ray
Console told me there's nothing there. Makes sense, I suppose.
So, if we look at an MT string in the console, like goat_scores[1]['Goat']
, we get nothing. If we look at None
in the console, like goat_scores[7]['After']
, we get nothing.
Are None
and an MT string the same?
Try this:
- goat_scores[7]['After'] == None
That's true, the console tells us.
Here's the data again:
Predict what these will show in the console.
- goat_scores[4]['After'] == ''
- goat_scores[4]['After'] == None
- goat_scores[7]['After'] == ''
- goat_scores[7]['After'] == None

Adela
I think True, False, False, and True.
That's it! An MT string and None are not the same.
Another thing to try:
- goat_scores[7]['After'] == goat_scores[1]['Goat']
Computations
Complete this program. Type your guess as to the code, but don't try to run it.
- goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
- Stuff goes here
- print('Total before: ' + str(total_before))
- print('Total after: ' + str(total_after))

Adela
OK, here's what I've got.
- total_before = 0
- total_after = 0
- for goat in goat_scores:
- total_before += float(goat['Before'])
- total_after += float(goat['After'])
- print('Total before: ' + str(total_before))
- print('Total after: ' + str(total_after))
I remembered to use Before
and not before
, and to convert the fields to numbers. They're always read as strings.
Now run it.

Adela
Uh oh. I got ValueError: could not convert string to float: ''
.
I'm guessing that's an MT string?
It'd be nice to know which goat record is causing the problem.
Add a line to the program, so Adela will know which record causes the issue.

Adela
I'll change it to:
- total_before = 0
- total_after = 0
- for goat in goat_scores:
- print(goat['Goat'])
- total_before += float(goat['Before'])
- total_after += float(goat['After'])
- print('Total before: ' + str(total_before))
- print('Total after: ' + str(total_after))
Run it... It's Bessie. The record is:
- "Bessie",20,
Hmm. After
is missing.
I'll add something else to test whether After
is MT for that record.
- total_before = 0
- total_after = 0
- for goat in goat_scores:
- print(goat['Goat'])
- if goat['After'] == '':
- print('After is MT')
- total_before += float(goat['Before'])
- total_after += float(goat['After'])
- print('Total before: ' + str(total_before))
- print('Total after: ' + str(total_after))
The output is:
- Bertha
- Bessie
- After is MT
Then the error message. So float(goat['After'])
breaks when it gets an MT string.
What to do?
Here's some of the data again.
- "Goat","Before","After"
- "Aisha",17,16
- "",14,15
- ,10,13
- "Bertha",-22,23
- "Bessie",20,
- "Boyd",13,Some
- "Bridgette",16,"a few"
- "Carrie",15 19
There are seven broken lines in the entire data set, and 43 that are OK. If we want to compute totals for Before
and After
...
What do we do with the broken records?

Georgina
We could not include the missing data.
Yes. A question, though. For the record...
- "Bertha",-22,23
Before
is invalid, but After
is OK. Do we omit the entire record, or just the -22? If we omit just the -22, we'd still add 23 to the After
total.

Georgina
I'm not sure. What do we think?

Ethan
If the Before
value is a mistake, can we trust the other fields for Bertha?

Georgina
I see what you mean. Kieran, what do you think?
With a record like...
- "Bertha",-22,23
The most common things would be:
- Drop the entire record from the analysis, called case-wise deletion.
- Drop -22, but keep the rest of the fields.
- Replace -22 with the average of all the other values for that field.
What you do depends on how much data you have, what questions you're asking, and other things.
In this course, we'll do case-wise deletion only. That is, if a record has any errors in any field, we ignore it. It's the most cautious approach.
A validation function
Let's imagine code we'd like to have in the main program if we want to compute totals. This didn't work:
- goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
- total_before = 0
- total_after = 0
- for goat in goat_scores:
- total_before += float(goat['Before'])
- total_after += float(goat['After'])
- print('Total before: ' + str(total_before))
- print('Total after: ' + str(total_after))
Imagine we had a function returning true if a record was OK, or false if the record has errors. If we had such a function, we could fix the program with one line:
- goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
- total_before = 0
- total_after = 0
- for goat in goat_scores:
- if is_record_ok(goat):
- total_before += float(goat['Before'])
- total_after += float(goat['After'])
- print('Total before: ' + str(total_before))
- print('Total after: ' + str(total_after))

Adela
That'd be nice. The program's easy to understand. Send it a record, and get back true or false.
Are we gonna write is_record_ok
?
Aye.
Deciding what functions do what can make your life easier. Let's list what could go wrong.
- Name could be MT or None.
Before
could be nonnumeric (MT, a string, or None), too small, or too large.After
could be nonnumeric (MT, a string, or None), too small, or too large.
The last two are very similar. What does this suggest to you?

Georgina
Maybe write one function to check both?
Aye!

Ray
Wait. We want to write a function is_record_ok
to make writing the program easier.
And to make is_record_ok
easier to write, we'll write another function to check the Before
and After
values.
Yes, that's it. That's common, and very effective.
Programs are too complex to think about all at one time. You need to break them into pieces, and work on one piece at a time. Of course, the pieces must work together.
The function is the most important tool for decomposition.
Decomposition is the most important thing in making programming easier, and programmers more productive.
Not graded. So why do it?
Checking Before
and After
Let's do this one first, since you've done some of this already.
Write the signature for a function that will check a Before
or After
value.

Ethan
How about: is_score_ok(value)
, returning True or False.
Good, that will work. You send a score into a validation function. You call it like this:
- is_score_ok(record['Before'])

Ray
You could also send the key, right? Like is_score_field_ok(key)
. key
is a string, either Before
or After
.
Aye, you could. You'd need to send in the record, so the function could look up the key: is_score_field_ok(record, key)
.
You'd call it like this:
- @is_score_field_ok(goat_record, 'Before')@
That would work, too. I prefer Ethan's way a little, since the function is more generic, and more reusable in other programs.
However, both are correct. It's common for different solutions to be equally correct.
The function should detect when:
- The value can't be made numeric. E.g., it's a word.
- The value is too small or too large.
Try writing the function.

Ray
Let me try.
We had this earlier:
- try:
- meal_cost = float(user_input)
- except ValueError:
- print('Sorry, you must enter a number.')
- sys.exit()
That will check for a nonnumeric thing. The range check was:
- if meal_cost <= 0 or meal_cost > 1000000:
- print('Sorry, please enter a number more than zero, and less than 1,000,000.')
- sys.exit()
I'll start with the signature and some comments.
- def is_score_ok(score):
- # Is it a number?
- # Check the range.
I'll put in the earlier code and change it.
- def is_score_ok(score):
- # Is it a number?
- Start old code
- try:
- meal_cost = float(user_input)
- except ValueError:
- print('Sorry, you must enter a number.')
- sys.exit()
- End old code
- # Check the range.
Let's see... line 5, change the param and the variable name.
- def is_score_ok(score_in):
- # Is it a number?
- Start old code
- try:
- score_number = float(score_in)
- except ValueError:
- print('Sorry, you must enter a number.')
- sys.exit()
- End old code
- # Check the range.
OK, now... what should happen if there's an error? Hmm, we want to return False, not show a message. How about this?
- def is_score_ok(score_in):
- # Is it a number?
- Start old code
- try:
- score_number = float(score_in)
- except ValueError:
- return False
- End old code
- # Check the range.
Can I use return
in the middle of a function?
Can you use return
in the middle of a function?

Georgina
Yes, You can.
Note
The function returns True
or False
, and that's it. It doesn't return a number. So if the param isn't numeric, we have enough information to return the right value.
Write the code to check the range. Valid values can be from 0 to 100.

Ray
Here's my function.
- def is_score_ok(score_in):
- # Is it a number?
- try:
- score_number = float(score_in)
- except Exception:
- return False
- # Check range.
- if score_number < 0 or score_number > 100:
- return False
- # All OK.
- return True
I'm not sure about the last line. I was thinking, if there are errors, return false. But if the compy gets to the end of the function, well, score_in
is OK. Is that right?
Yep. Looks good.
Test is_score_ok
Write a program to test is_score_ok
.

Ray
I started to write some code, but realized I was repeating. So this is what I did:
- def check_it(value):
- print('Is ' + str(value) + ' OK? ' + str(is_score_ok(value)))
- check_it(3)
- check_it(-3)
- check_it(55)
- check_it(-55)
- check_it(23.4)
- check_it('Goattos')
- check_it(None)

Georgina
Nice going! I didn't think of that.
I also noticed you need to convert a boolean (is_score_ok(value)
returns a boolean) to a string before you append it to another string.
Did is_score_ok
work?

Ray
Yes, perfectly.
Checking an entire record
Let's see where we are. We're writing code to test a goat score record, like this: "Aisha",17,16
. The function's signature is is_record_ok(goat_record)
, returning true or false.
Here are the errors is_record_ok
should look for:
- Name could be MT or None.
Before
score could be nonnumeric (MT, a string, or None), too small, or too large.After
score could be nonnumeric (MT, a string, or None), too small, or too large.
The last two are almost the same, so we wrote a function what would work for both. We wrote is_score_ok(value)
.
Write is_record_ok
, passing in a record (records are stored in dictionaries, remember). Check the name, then the Before
score, then the After
score. Check the scores using is_score_ok(value)
. Your function should return True
if the data is OK, or False
if it isn't.

Adela
I think I have it.
- def is_record_ok(record):
- # Check name.
- goat_name = record['Goat']
- if goat_name == '' or goat_name == None:
- return False
- # Check Before value.
- before = record['Before']
- if not is_score_ok(before):
- return False
- # Check After value.
- after = record['After']
- if not is_score_ok(after):
- return False
- return True
I love it! Each step is simple. Comments make it easy to follow. Nice work!
Assembling everything
We wrote a program to total the Before
and After
scores, but it didn't work, because of errors in the data. We decided we wanted something like this:
- goat_scores = read_csv_data_set('db-lesson-scores-bad-data.csv')
- total_before = 0
- total_after = 0
- for goat in goat_scores:
- if is_record_ok(goat):
- total_before += float(goat['Before'])
- total_after += float(goat['After'])
- print('Total before: ' + str(total_before))
- print('Total after: ' + str(total_after))
Most of the complexity is in validation, but we hid that away in is_record_ok
. When we're writing is_record_ok
, we didn't need to think abut the rest of the program. We just need to know it takes a record, and returns a boolean.
Here's is_record_ok
:
- def is_record_ok(record):
- '''
- Check whether a record is valid.
- Parameters
- ----------
- record : dictionary
- The record to check.
- Returns
- -------
- bool
- True of the record is OK, False if not.
- '''
- # Check name.
- goat_name = record['Goat']
- if goat_name == '' or goat_name == None:
- return False
- # Check Before value.
- before = record['Before']
- if not is_score_ok(before):
- return False
- # Check After value.
- after = record['After']
- if not is_score_ok(after):
- return False
- return True
We made another function for numeric and range checking: is_score_ok
. It takes a value, and returns a boolean. We didn't need to think about how is_score_ok
would be used when we were writing it. We'd already done that. We defined what is_score_ok
should do, and just focused on it when writing the function.
- def is_score_ok(score_in):
- '''
- Check whether a score is valid.
- Parameters
- ----------
- score_in : string
- A value to check.
- Returns
- -------
- bool
- True if score_in is numeric and in-range. False otherwise.
- '''
- # Is it a number?
- try:
- score_number = float(score_in)
- except Exception:
- return False
- # Check range.
- if score_number < 0 or score_number > 100:
- return False
- # All OK.
- return True
Finally, we had our entire program.
Decomposition - breaking a task into bite-sized chunks - is a programmer's second-best friend. Their best friend is their doggo, of course.
Programs are too complex to think about all at one time. You need to break them into pieces, and work on one piece at a time. Of course, the pieces must work together.
The function is the most important tool for decomposition.
Summary
read_csv_data_set
data is always strings. You need to convert it to int or float if you want to do numeric work.- Python has a special constant
None
used when data is missing. - If there's an error in a record, and you're using case-wise deletion, you drop the entire record.
- Decomposition: Programs are too complex to think about all at one time. You need to break them into pieces, and work on one piece at a time. Of course, the pieces must work together.
Exercises
Clean movie ratings
It's Wilde Week in goatville. The Wildes are like the Oscars, but for goat movies.
Download (that's a link) a data set on critic and viewer ratings of movies.
Each record has four fields:
- Movie title. Must be present.
- Genre (romance, action, or scifi). Must be one of the three values given. Extra leading/trailing spaces are OK. Upper/lowercase OK.
- Critic rating, from 1 to 5. Must be an integer in this range.
- Viewer rating, from 1 to 5. Must be an integer in this range.
Write a program to read the file, and compute the average critic and viewer for valid records. Also report the number of records and valid records.
Output should be like this:
- Number of records read: 33
- Number of valid records: 25
- Average critic rating: 2.76
- Average viewer rating: 3.28
Use at least three functions. I used four in my solution.
Upload a zip of your project folder. The usual coding standards apply.