Data sets | Python with Pets

Records

Almost all business data you'll deal with is organized like this:

Episode number	Title	Length
1	Nobody Listens to Paula Poundstone	51.37
2	Maintaining friendships	50.75
3	Audiologist Michele Sherman talks ears	48.9
4	The Survivalist!	53.22

(It's a data set about my fave podcast, Nobody Listens to Paula Poundstone. Tell your old people about it.)

The data is about one type of thing: NLTPP episodes. Each line is called a record, row, or maybe entity. A record describes one episode.

Records are made of attributes, also called fields. Here, each episode has three attributes: number, title, and length. Every record has the same set of attributes, although some data might be missing.

Each attribute has the same data type in every record. Here:

All episode numbers are integers.
All titles are strings.
All lengths are floats.

A data set in memory

We need to have a data set in memory, so we can analyze it. There's a problem, though. Until now, each variable only has one piece of data in it, like:

The variable legs is an integer with the value 8. Or 4, or 2. legs can only hold one value, though. It can't be 8 and 4.
The variable weight is a float with 23.1 in it. It can only have one value.
The variable family_name is a string with "Park" in it. It can be Park, Smith, Felber, whatever, but only one name.

Now we have bunches of data. We might have a customer data set with thousands of records. What do we do?

We need variables that somehow store many values together, and make it easy to get each one when we need it. Just as we have strings, floats, ints, and booleans, we need a new data type to hold lots of data in a variable.

We'll actually need two new data types, one for each of these jobs:

Store fields in a record
Store a collection of records

Let's do the first one.

Dictionary

Dictionaries are perfect for storing individual records. Here's an example of a Python dictionary.

animal1 = {
    'common name': 'Red kangaroo',
    'species name': 'Osphranter rufus',
    'length': 1.5,
    'weight': 74,
    'url': 'https://en.wikipedia.org/wiki/Red_kangaroo'
}

animal1 is the variable containing one record, as a dictionary. It has attributes, each one with a key and a value. The keys are usually strings. The values can be anything. Add as many key/value pairs as you like.

Python knows it's a dictionary because of the braces (the {}). Other types use different symbols, like () and [].
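
To read a value back out of a dictionary, put the key in square brackets after the dictionary's name. Here's a minimal sketch using the animal1 record from above. The variable names common_name and weight are just for illustration.

```python
animal1 = {
    'common name': 'Red kangaroo',
    'species name': 'Osphranter rufus',
    'length': 1.5,
    'weight': 74,
    'url': 'https://en.wikipedia.org/wiki/Red_kangaroo'
}

# Look up values by their keys.
common_name = animal1['common name']
weight = animal1['weight']
print(common_name)
print(weight)
```

The key has to match exactly, including spaces and capitalization. 'common name' works here, but 'Common Name' would not.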

Another dictionary:

best_pokemon = {
    'name': 'Snorlax',
    'generation': 1,
    'pokedex number': 143
}

(Snorlax is my spirit animal.)

That's one Pokémon record.

You set the values of individual fields like this:

best_pokemon['rating'] = 10

Notice that rating wasn't in the original record. We've added it:

best_pokemon = {
    'name': 'Snorlax',
    'generation': 1,
    'pokedex number': 143,
    'rating': 10
}

Doing more things:

# Changing a value.
best_pokemon['rating'] = 11
# Appending to the name.
best_pokemon['name'] += ' (the best)'
# Testing a value.
if best_pokemon['generation'] == 1:
    print('OG!')
# Input a value.
best_pokemon['pokedex number'] = int(input("What's the Pokedex number? "))

Basically, you can do anything with dictionary_name[key] that you can do with a variable. Calculate with it, input it, output it, whatevs. In effect, dictionary_name[key] acts like a regular variable.
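
Here's a small sketch of that idea, calculating with one field and building an output message from two others. The names double_rating and message are made up for this example.

```python
best_pokemon = {
    'name': 'Snorlax',
    'generation': 1,
    'pokedex number': 143,
    'rating': 10
}

# Calculate with a value, just like a regular variable.
double_rating = best_pokemon['rating'] * 2
# Build an output message from two values.
# str() turns the integer into a string so it can be joined with +.
message = best_pokemon['name'] + ' is number ' + str(best_pokemon['pokedex number'])
print(double_rating)
print(message)
```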

Here's some data again.

Episode number	Title	Length
1	Nobody Listens to Paula Poundstone	51.37
2	Maintaining friendships	50.75
3	Audiologist Michele Sherman talks ears	48.9
4	The Survivalist!	53.22

Here's the same data as four dictionaries, one for each record.

an_episode = {
    'Episode number': 1,
    'Title': "Nobody Listens to Paula Poundstone",
    'Length': 51.37
}
another_episode = {
    'Episode number': 2,
    'Title': 'Maintaining friendships',
    'Length': 50.75
}
yet_another_episode = {
    'Episode number': 3,
    'Title': 'Audiologist Michele Sherman talks ears',
    'Length': 48.9
}
yet_yet_another_episode = {
    'Episode number': 4,
    'Title': 'The Survivalist!',
    'Length': 53.22
}

There's a problem, though. We have four variables, each containing a dictionary. But there are hundreds of episodes. Creating hundreds of different variables, one for each episode, would be a pain.

Lists

What we need is to put a bunch o' dictionaries together, in a collection. There's a data type called list that does the job.

A list is a sequence of individual values. The values can be strings, floats, dictionaries, anything. Here's a list of strings, Australian state names.

states = ['Queensland', 'New South Wales', 'Victoria',
          'Tasmania', 'South Australia', 'Western Australia']

I'm using individual values for now, to keep it simple. We'll bring dictionaries back in later.

Python uses [] for lists, as it uses {} for dictionaries.

states can be any size. Australia has six states. The US has 50. No problem. A list can contain as many values as we like. A thousand? No problem. 65,536? OK.

You create a list like this:

name = [values]

The values can be empty, and often are at the start of a program. Like:

movies = []

Here are things you can do with lists.

list_name.append(thing) adds thing to the end of the list.
len(list_name) tells you how many items are in the list.
And lots more.
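
Here's a quick sketch of append and len working together. It starts with an empty list and adds two of the state names from earlier.

```python
states = []
# An empty list has zero items.
print(len(states))

# Each append adds one item to the end of the list.
states.append('Queensland')
states.append('Victoria')
print(len(states))
print(states)
```

The items stay in the order you appended them, so Queensland comes before Victoria.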

You can get items from a list in two main ways. First, you can use an index, like states[3]. Indexes are always numbers.

Paste this into the console.

states = ['Queensland', 'New South Wales', 'Victoria',
          'Tasmania', 'South Australia', 'Western Australia']
states[0]

Reflect

What did you get?

Adela

The first value, Queensland.

Right.

Reflect

What's states[1]? Answer before you try it.

Georgina

It's the second element.

So states[0] is the first element, and states[1] is the second?

Aye. The first element's index is zero. The reason for that is buried in software history. It's not relevant for this course.

So the six values in the list are:

states[0]
states[1]
states[2]
states[3]
states[4]
states[5]

Reflect

Try states[6] in the console. What happens?

Georgina

I got IndexError: list index out of range.

states[5] is the last one.

Right!
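
Since indexes start at zero, the last index is always one less than the number of items. Here's a sketch of that, plus one way you might check an index before using it. The names last_index and index are just for illustration.

```python
states = ['Queensland', 'New South Wales', 'Victoria',
          'Tasmania', 'South Australia', 'Western Australia']

# The last valid index is one less than the number of items.
last_index = len(states) - 1
print(last_index)
print(states[last_index])

# Check an index before using it, to avoid an IndexError.
index = 6
if index < len(states):
    print(states[index])
else:
    print('No item at index ' + str(index))
```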

Reflect

In your own words, explain what this program does, without running it.

fruits = []
done = False
while not done:
    fruit = input('Type the name of a fruit, or bye to quit: ')
    fruit_normalized = fruit.lower().strip()
    if fruit_normalized == 'bye':
        done = True
    else:
        fruits.append(fruit)
print(fruits)

Adela

It asks you to type a fruit. It adds the fruit to a list. It keeps asking until you type bye. Then it shows the list.

Indeed! (Try running it.)

Ethan

About these lines:

fruit = input('Type the name of a fruit, or bye to quit: ')
fruit_normalized = fruit.lower().strip()
if fruit_normalized == 'bye':
    done = True
else:
    fruits.append(fruit)

You get the fruit in line 4 of the program, then normalize it in line 5. But you use a new variable, fruit_normalized, instead of putting the normalized value back into fruit. Why?

Reflect

Why?

Georgina

Ooo! I see it.

You want to add whatever the user typed to the list, with uppercase characters and everything. But when you normalize, which you do to check for bye, you lose things like capitalization.

So, you keep the original fruit around, and make a new variable (fruit_normalized) to normalize and test for bye.

Exactly! Nice work, Georgina.
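
To see the difference for yourself, here's a small sketch with no input() call. The sample strings are made up. The square brackets in the output are only there so you can see the extra spaces.

```python
# Pretend the user typed this, with extra spaces and a capital letter.
fruit = '  Mango '
fruit_normalized = fruit.lower().strip()

# The original keeps its spaces and capitalization.
print('[' + fruit + ']')
# The normalized copy is lowercase, with no spaces at the ends.
print('[' + fruit_normalized + ']')

# Normalizing is what lets BYE, Bye, and bye all end the loop.
quit_word = ' BYE '
print(quit_word.lower().strip() == 'bye')
```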

`for`

You saw how we can use indexes to get the values in a list directly. For example, states[2] gets the third value (the first one is states[0]).

Often, we want to go through all the items in a list to compute some stats, or do something else. We want to get the values one at a time.

We could use a while loop and a counter, but there's an easier way: a for loop.

for is a loop, like while. Remember, while keeps looping as long as a condition is true, like:

count = 1
while count <= 5:
    print('Doggos! ')
    count += 1

That loop prints Doggos! five times.

Here's a loop that prints each state.

states = ['Queensland', 'New South Wales', 'Victoria',
          'Tasmania', 'South Australia', 'Western Australia']
counter = 0
while counter < len(states):
    print(states[counter])
    counter += 1
print('OK, bye!')

counter starts at 0, so the first value printed is states[0]. The loop keeps running while counter is less than 6, the number of items in the list. The last value of counter that passes that test is 5, so the last one printed is states[5].

We could do that, but for is easier.

Ray

Easy is good.

Aye, 'tis so.

for var in collection:
    Do something with var

... runs Do something once for each item in the collection, with var holding that item. Do something can be as many lines of Python as you like.

For example:

states = ['Queensland', 'New South Wales', 'Victoria',
          'Tasmania', 'South Australia', 'Western Australia']
for state in states:
    print(state)

The first time through the loop, state is equal to the first element, 'Queensland' (that's where I'm from). So line 4 prints Queensland.

The second time through the loop, state is equal to the second element, 'New South Wales'. Line 4 prints New South Wales.

And so on, until the last element is run through the code block. The code block is the stuff indented inside the for loop.

Reflect

Add two new states to the list. Call them what you want. Run the program again. Did it work?

Ethan

My code is:

states = ['Queensland', 'New South Wales', 'Victoria',
          'Tasmania', 'South Australia', 'Western Australia',
          'Wombatland', 'Koalaland']
for state in states:
    print(state)

Cool. Did the for loop change?

Ethan

No, it didn't.

Indeed! The for loop works no matter how many items there are in the list.
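
A for loop can do more than print. Here's a sketch that goes through the list to compute a stat: the state with the longest name. The variable longest is made up for this example. It starts as an empty string, so the first state always beats it.

```python
states = ['Queensland', 'New South Wales', 'Victoria',
          'Tasmania', 'South Australia', 'Western Australia']

# Start with an empty string. Any state name is longer.
longest = ''
for state in states:
    # len() works on strings too. It counts the characters.
    if len(state) > len(longest):
        longest = state
print(longest)
```

Again, the loop doesn't care how many states there are. Add Wombatland and Koalaland, and the same code still works.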

Pattern

Loop over a list of dictionaries

You have a data set. Each record is a dictionary. All the records are in a list. Use a for loop to run through each record in the list.

Lists of dictionaries

Our goal is to store records and fields in memory, so we can easily read a file like this:

"Episode number","Title","Length"
1,"Nobody Listens to Paula Poundstone",51.37
2,"Maintaining friendships",50.75
3,"Audiologist Michele Sherman talks ears",48.9
4,"The Survivalist!",53.22

This is a comma-separated values (CSV) file. You'll learn about them in the next lesson.

A dictionary is a good way to store one record. How to store a bunch of records? In a list of dictionaries!

episodes = [
    {
        'Episode number': 1,
        'Title': "Nobody Listens to Paula Poundstone",
        'Length': 51.37
    },
    {
        'Episode number': 2,
        'Title': 'Maintaining friendships',
        'Length': 50.75
    },
    {
        'Episode number': 3,
        'Title': 'Audiologist Michele Sherman talks ears',
        'Length': 48.9
    },
    {
        'Episode number': 4,
        'Title': 'The Survivalist!',
        'Length': 53.22
    }
]

Because of the [], Python knows you want a list. Each list item has {}, which is Pythonese for a dictionary.

Georgina

That's so cool!

Aye!

You can store as many items in a list as you want. So, any number of records.
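
Since each list item is a dictionary, you can combine a list index with a dictionary key to get at one field of one record. Here's a sketch using two of the episodes, then appending a third. The name first_episode is just for illustration.

```python
episodes = [
    {
        'Episode number': 1,
        'Title': "Nobody Listens to Paula Poundstone",
        'Length': 51.37
    },
    {
        'Episode number': 2,
        'Title': 'Maintaining friendships',
        'Length': 50.75
    }
]

# episodes[0] is the first record, a dictionary.
first_episode = episodes[0]
print(first_episode['Title'])
# Index and key together: the Length field of the second record.
print(episodes[1]['Length'])

# append works with dictionaries, just as it does with strings.
episodes.append({
    'Episode number': 3,
    'Title': 'Audiologist Michele Sherman talks ears',
    'Length': 48.9
})
print(len(episodes))
```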

Reflect

What does this output? Type your answer before you run the code.

episodes = [
    {
        'Episode number': 1,
        'Title': "Nobody Listens to Paula Poundstone",
        'Length': 51.37
    },
    {
        'Episode number': 2,
        'Title': 'Maintaining friendships',
        'Length': 50.75
    },
    {
        'Episode number': 3,
        'Title': 'Audiologist Michele Sherman talks ears',
        'Length': 48.9
    },
    {
        'Episode number': 4,
        'Title': 'The Survivalist!',
        'Length': 53.22
    }
]
for episode in episodes:
    print(episode['Title'])

Ray

I like this! It outputs:

Nobody Listens to Paula Poundstone
Maintaining friendships
Audiologist Michele Sherman talks ears
The Survivalist!

Right! Woohoo!
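
The same pattern can compute things, not just print them. Here's a sketch that finds the longest episode. The variable longest_episode is made up for this example. It starts as the first record, and gets replaced whenever the loop finds a longer one.

```python
episodes = [
    {
        'Episode number': 1,
        'Title': "Nobody Listens to Paula Poundstone",
        'Length': 51.37
    },
    {
        'Episode number': 2,
        'Title': 'Maintaining friendships',
        'Length': 50.75
    },
    {
        'Episode number': 3,
        'Title': 'Audiologist Michele Sherman talks ears',
        'Length': 48.9
    },
    {
        'Episode number': 4,
        'Title': 'The Survivalist!',
        'Length': 53.22
    }
]

# Assume the first episode is the longest, until we find a longer one.
longest_episode = episodes[0]
for episode in episodes:
    if episode['Length'] > longest_episode['Length']:
        longest_episode = episode
print(longest_episode['Title'])
print(longest_episode['Length'])
```

Notice that longest_episode holds a whole record, so every field of the winning episode is available after the loop.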

Summary

Almost all business data sets are groups of records. Records are made of attributes, also called fields.
A convenient way to represent data sets in Python is as lists of dictionaries.
Use for loops to process lists of dictionaries.

Exercise

Sum of episode lengths

Write a program to show average episode length. Show the number of episodes as well.

Paste this into your code. Use it without any changes:

episodes = [
    {
        'Episode number': 1,
        'Title': "Nobody Listens to Paula Poundstone",
        'Length': 51.37
    },
    {
        'Episode number': 2,
        'Title': 'Maintaining friendships',
        'Length': 50.75
    },
    {
        'Episode number': 3,
        'Title': 'Audiologist Michele Sherman talks ears',
        'Length': 48.9
    },
    {
        'Episode number': 4,
        'Title': 'The Survivalist!',
        'Length': 53.22
    }
]

Here's what the output should be.

Number of episodes: 4
Average length: 51.06 minutes

Use a for loop.

Upload a zip of your project folder. The usual coding standards apply.
