Data sets

Multiple choice

Every variable has a data type, like string or float. If two variables have different data types, what does that mean?

A

The variables must have different values. So if x = 3.14 and y = 2.71, x and y are different types.

B

The variables store different kinds of data.

C

Variables are more likely to marry other variables of the same type than variables of a different type.

Multiple choice

What's an integer in Python?

A

A numeric type that can have a decimal part, like 3.14 or -2.72.

B

A numeric type for whole numbers, like 4, -1, or 899933. Not 3.14, though.

C

A data type for text, like "Doggos rule!"

D

Someone who harvests grapes for making wine.

Multiple choice

Some code:

  • x = input('Value for X? ')
  • y = input('Value for Y? ')
  • z = x * 2 + y
  • z += 10
  • if z < 30:
  •     message = 'Too low'
  • elif z < 80:
  •     message = 'OK'
  • else:
  •     message = 'Too large'
  • print(message)

What's the output if the user types 4.5 for X and 5.5 for Y?

Try to work it out in your head first. Then you can run the code.

A

Too low

B

OK

C

Too large

D

Error

Records

Almost all business data you'll deal with is organized like this:

Episode number  Title                                   Length (minutes)
1               Nobody Listens to Paula Poundstone      51.37
2               Maintaining friendships                 50.75
3               Audiologist Michele Sherman talks ears  48.9
4               The Survivalist!                        53.22

(It's a data set about my fave podcast, Nobody Listens to Paula Poundstone. Tell your old people about it.)

The data is about one type of thing: NLTPP episodes. Each line is called a record, row, or maybe entity. A record describes one episode.

Records are made of attributes, also called fields. Here, each episode has three attributes: number, title, and length. Every record has the same set of attributes, although some data might be missing.

Each attribute has the same data type in every record. Here:

  • All episode numbers are integers.
  • All titles are strings.
  • All lengths are floats.
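One way to check those types is Python's built-in type() function. This is a minimal sketch using the values from the first episode record. The variable names are made up for illustration:

```python
# Values from the first episode record. The variable names are illustrative.
episode_number = 1
title = 'Nobody Listens to Paula Poundstone'
length = 51.37

# type() reports the data type of a value.
print(type(episode_number))  # <class 'int'>
print(type(title))           # <class 'str'>
print(type(length))          # <class 'float'>
```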

A data set in memory

We need to have a data set in memory, so we can analyze it. There's a problem, though. Until now, each variable has held only one piece of data, like:

  • The variable legs is an integer with the value 8. Or 4, or 2. legs can only hold one value, though. It can't be 8 and 4.
  • The variable weight is a float with 23.1 in it. It can only have one value.
  • The variable family_name is a string with "Park" in it. It can be Park, Smith, Felber, whatever, but only one name.

Now we have bunches of data. We might have a customer data set with thousands of records. What do we do?

We need variables that somehow store many values together, and make it easy to get each one when we need it. Just as we have strings, floats, ints, and booleans, we need a new data type to hold lots of data in a variable.

We'll actually need two new data types:

  • One to store the fields in a record
  • One to store a collection of records

Let's do the first one.

Dictionary

Dictionaries are perfect for storing individual records. Here's an example of a Python dictionary.

  • animal1 = {
  •     'common name': 'Red kangaroo',
  •     'species name': 'Osphranter rufus',
  •     'length': 1.5,
  •     'weight': 74,
  •     'url': 'https://en.wikipedia.org/wiki/Red_kangaroo'
  • }

animal1 is the variable containing one record, as a dictionary. It has attributes, each one with a key and a value. The keys are usually strings. The values can be anything. Add as many key/value pairs as you like.

Python knows it's a dictionary because of the braces (the {}). Other types use different symbols, like () and [].
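To read a value, put its key in square brackets after the dictionary's name. Here's a small sketch reusing animal1. The 'habitat' key is not part of the record; it's only there to show one way of checking for a key before using it:

```python
animal1 = {
    'common name': 'Red kangaroo',
    'species name': 'Osphranter rufus',
    'length': 1.5,
    'weight': 74,
    'url': 'https://en.wikipedia.org/wiki/Red_kangaroo'
}

# Look up values by key.
print(animal1['common name'])  # Red kangaroo
print(animal1['weight'] * 2)   # 148

# len() counts the key/value pairs.
print(len(animal1))            # 5

# Asking for a key that isn't there raises a KeyError,
# so test with "in" first. 'habitat' is a made-up key.
if 'habitat' in animal1:
    print(animal1['habitat'])
else:
    print('No habitat recorded')
```

The last test prints No habitat recorded, because the record has no 'habitat' key.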

[Image: Snorlax, by andrework]

Another dictionary:

  • best_pokemon = {
  •     'name': 'Snorlax',
  •     'generation': 1,
  •     'pokedex number': 143
  • }

(Snorlax is my spirit animal.)

That's one Pokémon record.

You set the values of individual fields like this:

  • best_pokemon['rating'] = 10

Notice that rating wasn't in the original record. We've added it:

  • best_pokemon = {
  •     'name': 'Snorlax',
  •     'generation': 1,
  •     'pokedex number': 143,
  •     'rating': 10
  • }

Doing more things:

  • # Changing a value.
  • best_pokemon['rating'] = 11
  •  
  • # Appending to the name.
  • best_pokemon['name'] += ' (the best)'
  •  
  • # Testing a value.
  • if best_pokemon['generation'] == 1:
  •     print('OG!')
  •  
  • # Input a value.
  • best_pokemon['pokedex number'] = int(input("What's the Pokedex number? "))

Basically, you can do anything with a dictionary_name[key] that you can do with a variable. Calculate with it, input it, output it, whatevs. In practice, dictionary_name[key] behaves just like a regular variable.

Here's some data again.

Episode number  Title                                   Length (minutes)
1               Nobody Listens to Paula Poundstone      51.37
2               Maintaining friendships                 50.75
3               Audiologist Michele Sherman talks ears  48.9
4               The Survivalist!                        53.22

Here's the same data as four dictionaries, one for each record.

  • an_episode = {
  •     'Episode number': 1,
  •     'Title': "Nobody Listens to Paula Poundstone",
  •     'Length': 51.37
  • }
  •  
  • another_episode = {
  •     'Episode number': 2,
  •     'Title': 'Maintaining friendships',
  •     'Length': 50.75
  • }
  •  
  • yet_another_episode = {
  •     'Episode number': 3,
  •     'Title': 'Audiologist Michele Sherman talks ears',
  •     'Length': 48.9
  • }
  •  
  • yet_yet_another_episode = {
  •     'Episode number': 4,
  •     'Title': 'The Survivalist!',
  •     'Length': 53.22
  • }

There's a problem, though. We have four variables, each containing a dictionary. But there are hundreds of episodes. Creating hundreds of different variables, one for each episode, would be a pain.

Lists

What we need is to put a bunch o' dictionaries together, in a collection. There's a data type called list that does the job.

A list is a sequence of individual values. The values can be strings, floats, dictionaries, anything. Here's a list of strings, Australian state names.

  • states = ['Queensland', 'New South Wales', 'Victoria',
  •           'Tasmania', 'South Australia', 'Western Australia']

I'm using individual values for now, to keep it simple. We'll bring dictionaries back in later.

Python uses [] for lists, as it uses {} for dictionaries.

states can be any size. Australia has six states. The US has 50. No problem. A list can contain as many values as we like. A thousand? No problem. 65,536? OK.

You create a list like this:

  • name = [values]

The list can be empty, and often is at the start of a program. Like:

  • movies = []

Here are things you can do with lists.

  • list_name.append(thing) adds thing to the end of the list.
  • len(list_name) tells you how many items are in the list.
  • And lots more.
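Here's a short sketch of append() and len() working together, starting from the empty movies list. The movie titles are placeholders chosen for illustration:

```python
# Start with an empty list.
movies = []
print(len(movies))  # 0

# append() adds each new item to the end of the list.
# The titles are placeholders for illustration.
movies.append('The Princess Bride')
movies.append('Spirited Away')

print(len(movies))  # 2
print(movies)       # ['The Princess Bride', 'Spirited Away']
```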

You can get items from a list in two main ways. First, you can use an index, like states[3]. Indexes are always integers. (The second way, a for loop, comes later in this lesson.)

Paste this into the console.

  • states = ['Queensland', 'New South Wales', 'Victoria',
  •           'Tasmania', 'South Australia', 'Western Australia']
  • states[0]
Reflect

What did you get?

Adela

The first value, Queensland.

Right.

Reflect

What's states[1]? Answer before you try it.

Georgina

It's the second element.

So data_set[0] is the first element, and data_set[1] is the second?

Aye. The first element's index is zero. The reason for that is buried in software history. It's not relevant for this course.

So the six values in the list are:

  • states[0]
  • states[1]
  • states[2]
  • states[3]
  • states[4]
  • states[5]
Reflect

Try states[6] in the console. What happens?

Georgina

I got IndexError: list index out of range.

states[5] is the last one.

Right!
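The last valid index is always one less than the number of items in the list. This sketch works that out with len(), then triggers the same error Georgina saw. try/except isn't covered in this lesson; it's used here only so the sketch runs to the end:

```python
states = ['Queensland', 'New South Wales', 'Victoria',
          'Tasmania', 'South Australia', 'Western Australia']

# The last index is the length minus one.
last_index = len(states) - 1
print(last_index)          # 5
print(states[last_index])  # Western Australia

# Going one past the end raises IndexError.
try:
    states[6]
except IndexError as error:
    print('IndexError:', error)  # IndexError: list index out of range
```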

Reflect

In your own words, explain what this program does, without running it.

  1. fruits = []
  2. done = False
  3. while not done:
  4.     fruit = input('Type the name of a fruit, or bye to quit: ')
  5.     fruit_normalized = fruit.lower().strip()
  6.     if fruit_normalized == 'bye':
  7.         done = True
  8.     else:
  9.         fruits.append(fruit)
  10. print(fruits)

Adela

It asks you to type a fruit. It adds the fruit to a list. It keeps asking until you type bye. Then it shows the list.

Indeed! (Try running it.)

Ethan

About these lines:

  4.     fruit = input('Type the name of a fruit, or bye to quit: ')
  5.     fruit_normalized = fruit.lower().strip()
  6.     if fruit_normalized == 'bye':
  7.         done = True
  8.     else:
  9.         fruits.append(fruit)

You get the fruit in line 4, then normalize it in line 5. But you use a new variable, fruit_normalized, instead of putting the normalized value back into fruit. Why?

Reflect

Why?

Georgina

Ooo! I see it.

You want to add whatever the user typed to the list, with uppercase characters and everything. But when you normalize, which you do to check for bye, you lose things like capitalization.

So, you keep the original fruit around, and make a new variable (fruit_normalized) to normalize and test for bye.

Exactly! Nice work, Georgina.
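To see the difference between the two variables, here's a sketch with hard-coded strings standing in for what a user might type. The sample strings are made up for illustration:

```python
# Pretend the user typed this, with stray spaces and capitals.
fruit = ' Dragon Fruit '
fruit_normalized = fruit.lower().strip()

# repr() shows the quotes, so the spaces are visible.
print(repr(fruit))             # ' Dragon Fruit '
print(repr(fruit_normalized))  # 'dragon fruit'

# Normalizing makes the bye test forgiving about case and spaces.
fruit = 'BYE  '
print(fruit.lower().strip() == 'bye')  # True
```

The original string is untouched by lower() and strip(). They return new strings, which is why the program can still append exactly what the user typed.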

for

You saw how indexes get values from a list directly. For example, states[2] gets the third value (the first one is states[0]).

Often, we want to go through all the items in a list to compute some stats, or do something else. We want to get the values one at a time.

We could use a while loop and a counter, but there's an easier way: a for loop.

for is a loop, like while. Remember, a while loop keeps looping while a condition is true, like:

  • count = 1
  • while count <= 5:
  •     print('Doggos! ')
  •     count += 1

That loop prints Doggos! five times.

Here's a loop that prints each state.

  • states = ['Queensland', 'New South Wales', 'Victoria',
  •           'Tasmania', 'South Australia', 'Western Australia']
  • counter = 0
  • while counter < len(states):
  •     print(states[counter])
  •     counter += 1
  • print('OK, bye!')

counter starts at 0, so the first value printed is states[0]. The loop keeps running while counter is less than 6, the number of items in the list. The last value of counter that passes that test is 5, so the last one printed is states[5].

We could do that, but for is easier.

Ray

Easy is good.

Aye, 'tis so.

  • for var in collection:
  •     Do something with var

... runs Do something for each item in the list. Do something can be as many lines of Python as you like.

For example:

  1. states = ['Queensland', 'New South Wales', 'Victoria',
  2.           'Tasmania', 'South Australia', 'Western Australia']
  3. for state in states:
  4.     print(state)

The first time through the loop, state is equal to the first element, 'Queensland' (that's where I'm from). So line 4 prints Queensland.

The second time through the loop, state is equal to the second element, 'New South Wales'. Line 4 prints New South Wales.

And so on, until the last element is run through the code block. The code block is the stuff indented inside the for loop.

Reflect

Add two new states to the list. Call them what you want. Run the program again. Did it work?

Ethan

My code is:

  • states = ['Queensland', 'New South Wales', 'Victoria',
  •           'Tasmania', 'South Australia', 'Western Australia',
  •           'Wombatland', 'Koalaland']
  • for state in states:
  •     print(state)

Cool. Did the for loop change?

Ethan

No, it didn't.

Indeed! The for loop works no matter how many items there are in the list.

Pattern

Loop over a list of dictionaries

You have a data set. Each record is a dictionary. All the records are in a list. Use a for loop to run through each record in the list.

Lists of dictionaries

Our goal is to store records and fields in memory, so we can easily read a file like this:

  • "Episode number","Title","Length"
  • 1,"Nobody Listens to Paula Poundstone",51.37
  • 2,"Maintaining friendships",50.75
  • 3,"Audiologist Michele Sherman talks ears",48.9
  • 4,"The Survivalist!",53.22

This is a comma-separated values (CSV) file. You'll learn about them in the next lesson.
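As a preview only, here's one way that file could end up as records in memory. It uses csv.DictReader from Python's standard library, and reads from a string instead of a real file so the sketch runs by itself. The next lesson may well do this differently:

```python
import csv
import io

# The CSV data as a string, standing in for a real file.
csv_text = '''"Episode number","Title","Length"
1,"Nobody Listens to Paula Poundstone",51.37
2,"Maintaining friendships",50.75
3,"Audiologist Michele Sherman talks ears",48.9
4,"The Survivalist!",53.22
'''

episodes = []
reader = csv.DictReader(io.StringIO(csv_text))
for row in reader:
    # DictReader gives every value as a string,
    # so convert the numeric fields.
    episodes.append({
        'Episode number': int(row['Episode number']),
        'Title': row['Title'],
        'Length': float(row['Length'])
    })

print(len(episodes))                # 4
print(episodes[0]['Title'])         # Nobody Listens to Paula Poundstone
print(type(episodes[0]['Length']))  # <class 'float'>
```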

A dictionary is a good way to store one record. How to store a bunch of records? In a list of dictionaries!

  • episodes = [
  •     {
  •         'Episode number': 1,
  •         'Title': "Nobody Listens to Paula Poundstone",
  •         'Length': 51.37
  •     },
  •     {
  •         'Episode number': 2,
  •         'Title': 'Maintaining friendships',
  •         'Length': 50.75
  •     },
  •     {
  •         'Episode number': 3,
  •         'Title': 'Audiologist Michele Sherman talks ears',
  •         'Length': 48.9
  •     },
  •     {
  •         'Episode number': 4,
  •         'Title': 'The Survivalist!',
  •         'Length': 53.22
  •     }
  • ]

Because of the [], Python knows you want a list. Each list item has {}, which is Pythonese for a dictionary.

Georgina

That's so cool!

Aye!

You can store as many items in a list as you want. So, any number of records.
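An index and a key can be combined to reach one field of one record, and append() adds a whole new record. This is a minimal sketch that starts with only the first two episodes to keep it short:

```python
episodes = [
    {
        'Episode number': 1,
        'Title': 'Nobody Listens to Paula Poundstone',
        'Length': 51.37
    },
    {
        'Episode number': 2,
        'Title': 'Maintaining friendships',
        'Length': 50.75
    }
]

# The index picks the record, then the key picks the field.
print(episodes[0]['Title'])   # Nobody Listens to Paula Poundstone
print(episodes[1]['Length'])  # 50.75

# Add another record to the end of the list.
episodes.append({
    'Episode number': 3,
    'Title': 'Audiologist Michele Sherman talks ears',
    'Length': 48.9
})
print(len(episodes))          # 3
```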

Reflect

What does this output? Type your answer before you run the code.

  • episodes = [
  •     {
  •         'Episode number': 1,
  •         'Title': "Nobody Listens to Paula Poundstone",
  •         'Length': 51.37
  •     },
  •     {
  •         'Episode number': 2,
  •         'Title': 'Maintaining friendships',
  •         'Length': 50.75
  •     },
  •     {
  •         'Episode number': 3,
  •         'Title': 'Audiologist Michele Sherman talks ears',
  •         'Length': 48.9
  •     },
  •     {
  •         'Episode number': 4,
  •         'Title': 'The Survivalist!',
  •         'Length': 53.22
  •     }
  • ]
  •  
  • # Keys are case-sensitive, so every record uses 'Title'.
  • for episode in episodes:
  •     print(episode['Title'])

Ray

I like this! It outputs:

  • Nobody Listens to Paula Poundstone
  • Maintaining friendships
  • Audiologist Michele Sherman talks ears
  • The Survivalist!

Right! Woohoo!
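The same kind of loop can do more than print. This sketch counts the episodes longer than 50 minutes and finds the longest one. The 50-minute cutoff and the variable names long_count and longest are arbitrary choices for illustration:

```python
episodes = [
    {'Episode number': 1,
     'Title': 'Nobody Listens to Paula Poundstone',
     'Length': 51.37},
    {'Episode number': 2,
     'Title': 'Maintaining friendships',
     'Length': 50.75},
    {'Episode number': 3,
     'Title': 'Audiologist Michele Sherman talks ears',
     'Length': 48.9},
    {'Episode number': 4,
     'Title': 'The Survivalist!',
     'Length': 53.22}
]

long_count = 0          # Episodes over 50 minutes so far.
longest = episodes[0]   # Longest record seen so far.

for episode in episodes:
    if episode['Length'] > 50:
        long_count += 1
    if episode['Length'] > longest['Length']:
        longest = episode

print('Episodes over 50 minutes:', long_count)  # Episodes over 50 minutes: 3
print('Longest:', longest['Title'])             # Longest: The Survivalist!
```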

Summary

  • Almost all business data sets are groups of records. Records are made of attributes, also called fields.
  • A convenient way to represent data sets in Python is as lists of dictionaries.
  • Use for loops to process lists of dictionaries.

Exercise

Average episode length

Write a program to show average episode length. Show the number of episodes as well.

Paste this into your code. Use it without any changes:

  • episodes = [
  •     {
  •         'Episode number': 1,
  •         'Title': "Nobody Listens to Paula Poundstone",
  •         'Length': 51.37
  •     },
  •     {
  •         'Episode number': 2,
  •         'Title': 'Maintaining friendships',
  •         'Length': 50.75
  •     },
  •     {
  •         'Episode number': 3,
  •         'Title': 'Audiologist Michele Sherman talks ears',
  •         'Length': 48.9
  •     },
  •     {
  •         'Episode number': 4,
  •         'Title': 'The Survivalist!',
  •         'Length': 53.22
  •     }
  • ]

Here's what the output should be.

  • Number of episodes: 4
  • Average length: 51.06 minutes

Use a for loop.

Upload a zip of your project folder. The usual coding standards apply.