Reading data sets

Multiple choice

What is a list?

A

A collection of values. Each one is accessed through an index that's an integer.

B

A collection of values.
Each one is accessed through a key that's usually a string.

C

A set of unique values (no two in the list are the same).
Each one is accessed through an index that's an integer.

D

A collection of dictionaries.
Each one is a record.

Multiple choice

What's a dictionary?

A

A collection of values.
Each one is accessed through an index that's an integer.

B

A collection of values.
Each one is accessed through a key that's usually a string.

C

A set of unique values (no two in the list are the same).
Each one is accessed through an index that's an integer.

D

A collection of lists.
Each one is a record.

Multiple choice

Lists are delimited by ____, dictionaries by _____.

A

() and {}

B

{} and ()

C

[] and ()

D

[] and {}

E

{} and []

Multiple choice

team is a list of dictionaries of players. Each dictionary has keys id, name, and position. What's the best way to print the players' names?

A
  • index = 0
  • while index < len(team):
  •     print(team[index]['name'])
  •     index += 1
B
  • for player in team:
  •     print(player[name])
C
  • foreach player in team:
  •     player = next(team)
  •     name = player[name]
  •     print(name)
D
  • for player in team:
  •     print(player['name'])
E
  • for 'player' from team:
  •     print('player[name]')

A list of dictionaries is a good way to store a data set in memory. Now, how do we read the data into memory?

Let's write a function for that. You can reuse it as much as you like.

CSV

Data is typically stored in files or databases, or is fetched over the internet. We'll only discuss files in this course, though the processing isn't much different for the other sources. Once you've fetched the data, you analyze it the same way.

Here's what the data from before would look like in a CSV (comma-separated values) file.

  • "Episode number","Title","Length"
  • 1,"Nobody Listens to Paula Poundstone",51.37
  • 2,"Maintaining friendships",50.75
  • 3,"Audiologist Michele Sherman talks ears",48.9
  • 4,"The Survivalist!",53.22

The first row gives headers for each column. Then there's a row for each entity.

Function to read a data set

Make a new project. Put the data set file, episodes.csv, in the project folder. It's the data set above.

Now make a Python file, and put this code into it.

  1. import csv
  2.  
  3. def read_csv_data_set(file_name):
  4.     '''
  5.     Read a data set from a CSV file.
  6.  
  7.     Parameters
  8.     ----------
  9.     file_name : string
  10.         Name of the CSV file in the current folder.
  11.  
  12.     Returns
  13.     -------
  14.     data_set : List of dictionaries.
  15.         Data set.
  16.  
  17.     '''
  18.     # Create a list to be the return value.
  19.     data_set = []
  20.     with open('./' + file_name) as file:
  21.         file_csv = csv.DictReader(file)
  22.         # Put each row into the return list.
  23.         for row in file_csv:
  24.             data_set.append(row)
  25.     return data_set
  26.  
  27. episodes = read_csv_data_set('episodes.csv')
  28. print(episodes)

Here's the file again, and the data structure read_csv_data_set creates.

File

  • "Episode number","Title","Length"
  • 1,"Nobody Listens to Paula Poundstone",51.37
  • 2,"Maintaining friendships",50.75
  • 3,"Audiologist Michele Sherman talks ears",48.9
  • 4,"The Survivalist!",53.22

Data structure

  • [
  •  {
  •   'Episode number': '1',
  •   'Title': 'Nobody Listens to Paula Poundstone',
  •   'Length': '51.37'
  •  },
  •  {
  •   'Episode number': '2',
  •   'Title': 'Maintaining friendships',
  •   'Length': '50.75'
  •  },
  •  {
  •   'Episode number': '3',
  •   'Title': 'Audiologist Michele Sherman talks ears',
  •   'Length': '48.9'
  •  },
  •  {
  •   'Episode number': '4',
  •   'Title': 'The Survivalist!',
  •   'Length': '53.22'
  •  }
  • ]

As you can see, a list of dictionaries.

How does it work?

  1. import csv
  2.  
  3. def read_csv_data_set(file_name):
  4.     '''
  5.     Read a data set from a CSV file.
  6.  
  7.     Parameters
  8.     ----------
  9.     file_name : string
  10.         Name of the CSV file in the current folder.
  11.  
  12.     Returns
  13.     -------
  14.     data_set : List of dictionaries.
  15.         Data set.
  16.  
  17.     '''
  18.     # Create a list to be the return value.
  19.     data_set = []
  20.     with open('./' + file_name) as file:
  21.         file_csv = csv.DictReader(file)
  22.         # Put each row into the return list.
  23.         for row in file_csv:
  24.             data_set.append(row)
  25.     return data_set

Line 1 (import csv) imports Python's csv module. There are other ways to read CSV, but this one is the easiest to learn.

Line 19 (data_set = []) creates the thing the function will return. data_set is a list.

Line 20 (with open('./' + file_name) as file) opens a file, using the parameter you pass to the function as the file name. with closes the file automatically, as soon as its code block finishes.
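If you'd like to see with closing a file for yourself, here's a minimal sketch. The file demo.csv and the variable names are made up for the demonstration, and it uses a temporary folder so it doesn't touch your project.

```python
import os
import tempfile

# Make a small throwaway file so the sketch can run anywhere.
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'demo.csv')
with open(path, 'w') as file:
    file.write('"Title"\n"The Survivalist!"\n')

with open(path) as file:
    first_line = file.readline()
    inside = file.closed   # False: the file is still open inside the block.

after = file.closed        # True: with closed the file when the block ended.
print(inside, after)
```

Every file object has a closed attribute, so you can check whether a file is still open at any point.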

About the './' thing. It means "the current folder." That's normally the project folder when you run the program from there, so Python will look for the file alongside the program.
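If Python ever says it can't find your file, it can help to check which folder it counts as current. That's usually the project folder, but it depends on how the program was started. A small sketch using the standard os module (the variable names are just for illustration):

```python
import os

file_name = 'episodes.csv'

# The folder Python treats as "current" right now.
print(os.getcwd())

# The full path that './' + file_name points to.
full_path = os.path.abspath('./' + file_name)
print(full_path)
```

If the path printed isn't where you put the file, move the file or run the program from the project folder.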

Line 21 (file_csv = csv.DictReader(file)) creates a DictReader, one of Python's many special types, and puts it in the variable file_csv. A DictReader reads a CSV file one row at a time, and gives you each row as a dictionary.

DictReaders are a bit of a pain, though. You can loop over one only once, and only while the file is still open. So lines 23 and 24 copy the data from file_csv into data_set, the thing that's returned.
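One reason to copy the rows into a regular list: a DictReader can be looped over only once. Here's a minimal sketch of that. It uses io.StringIO to stand in for an open file, and the names csv_text, first_pass, and second_pass are made up for the example.

```python
import csv
import io

csv_text = '"Title","Length"\n"The Survivalist!",53.22\n'

# io.StringIO lets csv.DictReader treat a string like an open file.
reader = csv.DictReader(io.StringIO(csv_text))

first_pass = [row for row in reader]
second_pass = [row for row in reader]   # The reader is already used up.

print(len(first_pass))
print(len(second_pass))
```

A list doesn't have that limit. Once the rows are in data_set, you can loop over them as many times as you like.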

Check out these lines:

  23.         for row in file_csv:
  24.             data_set.append(row)

Line 23 loops over the elements in file_csv. The first time through the loop, row is the first element of file_csv, that is, the first row from the CSV file. Second time through the loop, row is the second element of file_csv, that is, the second row from the CSV file. And so on.

What to do with each row? Line 24 appends it to the list data_set.

Run the program. You should see a list of episodes, printed by print(episodes).

Switch to the console. The program left its variables behind, so we can check out what episodes is.

Try this in the console.

  • type(episodes)

It will tell you the data type of the variable episodes.

Reflect

What type is episodes?

If you were logged in as a student, the lesson would pause here, and you'd be asked to type in a response. If you want to try that out, ask for an account on this site.
Ray

It's a list.

Right!

Reflect

What type is episodes[0]? How about episodes[1]?

Answer before you try it in the console.

Ethan

They're dictionaries.

Correct!

Let's look at the individual fields in the dictionaries. Try:

  • type(episodes[3]['Title'])

Reflect

What's the value of episodes[3]['Title']? What's its type?

Georgina

That's the title of the fourth episode, 'The Survivalist!'. It's a string.

Good!

Reflect

Without typing in the console, what's the value of episodes[0]['length']?

Ethan

It's the length of the first episode, 51.37.

Let me try it in the console...

What?! I got an error: KeyError: 'length'

What's that about?

Anyone?

Adela

I think I see it. You get the same error from episodes[0]['title'], but episodes[0]['Title'] is OK.

Ray

Huh? They're the same... Oh, you've got to be kidding. Title works, but title doesn't.

Reflect

Why does Title work, but not title? Where did Title come from?

Georgina

Title comes from the first line in the CSV file, which gives the column names.

Right! Nice work. Here's the CSV file:

  • "Episode number","Title","Length"
  • 1,"Nobody Listens to Paula Poundstone",51.37
  • 2,"Maintaining friendships",50.75
  • 3,"Audiologist Michele Sherman talks ears",48.9
  • 4,"The Survivalist!",53.22

The lesson:

Note

Dictionary keys are case-sensitive.
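If you're not sure how a key is spelled, you can ask the dictionary. A small sketch, using a hand-typed dictionary as a stand-in for one row of the data set:

```python
# A stand-in for episodes[0].
episode = {
    'Episode number': '1',
    'Title': 'Nobody Listens to Paula Poundstone',
    'Length': '51.37',
}

# in checks whether a key exists, without raising an error.
print('Title' in episode)
print('title' in episode)

# Using the wrong case raises KeyError, as Ethan found out.
try:
    episode['title']
    found = True
except KeyError:
    found = False
print(found)

# keys() shows the exact spelling of every key.
print(list(episode.keys()))
```

Checking episodes[0].keys() in the console is a quick way to see the column names exactly as they appear in the CSV file.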

Use the function

Ethan

The code you gave us, to read the CSV file. Should we just use it? As is?

Aye, that's why I gave it to you. You can modify it if you like, though.

There's more to CSV files, a lot more, but let's leave it at that for now. That's enough for us to get into data analysis.

Using the data

Suppose the episode data was in a file called episodes.csv.

Reflect

Write a line of code that would read the episodes data into a list of dictionaries named episodes.

Ray

I got: episodes = read_csv_data_set('episodes.csv')

Right! Short and sweet.

Reflect

Write two lines that will print the lengths of all the episodes.

Adela

I got this.

  • for episode in episodes:
  •     print(episode['Length'])

Good! Once you have those values, you can do anything with them you want. Print them, add them up, whatevs.
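Adding them up takes one extra step, because csv.DictReader hands back every value as a string, numbers included. Here's a minimal sketch of totaling the lengths. It uses a short hand-typed list in place of the real file, and total_length is just a name picked for the example.

```python
# A hand-made stand-in for what read_csv_data_set returns.
episodes = [
    {'Episode number': '1', 'Title': 'Nobody Listens to Paula Poundstone', 'Length': '51.37'},
    {'Episode number': '2', 'Title': 'Maintaining friendships', 'Length': '50.75'},
]

total_length = 0
for episode in episodes:
    # float() turns the string '51.37' into the number 51.37.
    total_length += float(episode['Length'])

print(round(total_length, 2))
```

Without float(), Python would complain about adding a string to a number.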

A new pattern

Let's add a data machine pattern to the pattern catalog.

Pattern

Data machine: Reader

A function to read a comma-separated values (CSV) file into a data set.

When you start a new project, you can use the pattern catalog to remind yourself of useful chunks of code.

Summary

  • A list of dictionaries is a good way for Python to have a data set in memory.
  • CSV (comma-separated values) files are commonly used in analysis.
  • The function read_csv_data_set reads CSV data sets into memory. Copy-and-paste it as you need.