Data machines

A nice metaphor

Georgina said:

My aunt Tabitha is an engineer, works at a gas refinery. She showed me a diagram of a refinery once, though I didn't understand it all.

[Figure: Gas refinery (Wikipedia)]

You start off with the raw material, do something to it, and get stuff that's closer to the final product. You do something to that, getting something even closer. And so on.

Our code is like that. Got a raw CSV file. read_csv_data_set does something with it, making a list of dictionaries. clean_goat_scores does something to that, getting a step closer. And so on, until the output.

That's a great metaphor. In fact, data analysts talk about data pipelines, with step after step taking the data closer and closer to what they can analyze. We can write our data analysis programs like that.

We talk about several data machines in the course. They're steps in the pipeline. They take data in, do something to it, and send data out. The things they do are well-defined.

Here's an example of a program made up of data machines.

[Figure: Subset process]
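The same idea can be sketched in code. This is a minimal illustration, not the course's actual program: the sample data, the field names (Goat, Before, After), and the short function names are all made up for the example, and the CSV comes from a string rather than a file so the sketch runs on its own.

```python
import csv
import io
import statistics

# Hypothetical raw data, standing in for a CSV file on disk.
RAW_CSV = """Goat,Before,After
Billy,10,12.5
Nanny,abc,3
Gruff,8,7.5
Kid,4,9
"""


def read_data_set(csv_text):
    """Reader: CSV text in, list of dictionaries out."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def clean(goats):
    """Cleaner: drop records with non-numeric scores, convert the rest."""
    cleaned = []
    for goat in goats:
        try:
            before = float(goat['Before'])
            after = float(goat['After'])
        except ValueError:
            continue  # Bad record, leave it out.
        cleaned.append({**goat, 'Before': before, 'After': after})
    return cleaned


def subset_improved(goats):
    """Records subset: keep goats whose score went up."""
    return [goat for goat in goats if goat['After'] > goat['Before']]


def extract(goats, field_name):
    """Field extractor: one field's values, as a list."""
    return [goat[field_name] for goat in goats]


# Each machine's output is the next machine's input.
raw_goats = read_data_set(RAW_CSV)
cleaned_goats = clean(raw_goats)
improved_goats = subset_improved(cleaned_goats)
after_scores = extract(improved_goats, 'After')
mean_after = statistics.mean(after_scores)
print(mean_after)
```

Each variable holds the data at one stage of the pipeline, so you can print any of them to see what a machine did.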

The machines

Here are the machines used in the course.

Reader

Reads a CSV file into a list of dictionaries.

Pattern

Data machine: Reader

A function to read a comma-separated values (CSV) file into a data set.
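Here is one way a reader could look. It's a sketch that uses csv.DictReader from Python's standard library. The function name read_csv_data_set comes from the example above; the parameter name and the UTF-8 encoding are assumptions.

```python
import csv


def read_csv_data_set(file_path):
    """Read a CSV file into a list of dictionaries, one per data row.

    The first row of the file is assumed to hold the field names.
    All values come back as strings; converting them is the cleaner's job.
    """
    with open(file_path, newline='', encoding='utf-8') as csv_file:
        reader = csv.DictReader(csv_file)
        return [dict(row) for row in reader]
```

A call like read_csv_data_set('goats.csv') (a hypothetical file name) would return one dictionary per row, keyed by the column headings.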

Cleaner

Takes a data set, removes errors, and converts fields to numeric data types as needed.

Pattern

Data machine: Cleaner

Write a function that takes a data set as a param. Some of the records in the data set might have errors. The function returns a data set with no errors.
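A cleaner might look something like this. The function name clean_goat_scores is from the example above, but the field names (Before, After) and the rules for what counts as an error (missing, non-numeric, or negative scores) are assumptions made for the sketch.

```python
def clean_goat_scores(raw_goats):
    """Return a new data set with bad records removed and scores as floats.

    Assumed rules: a record is bad if either score is missing,
    is not a number, or is negative.
    """
    cleaned_goats = []
    for raw_goat in raw_goats:
        try:
            before = float(raw_goat['Before'])
            after = float(raw_goat['After'])
        except (KeyError, TypeError, ValueError):
            # Missing or non-numeric score. Skip the record.
            continue
        if before < 0 or after < 0:
            # Negative scores are treated as errors here.
            continue
        # Copy the record, replacing the string scores with numbers.
        cleaned_goats.append({**raw_goat, 'Before': before, 'After': after})
    return cleaned_goats
```

The function builds a new list instead of changing the one passed in, so the raw data is still there if you need to check what was thrown away.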

Computed field

Creates a new field in a data set.

Pattern

Data machine: Computed field

Write a function that adds a new field to records in a data set (a list of dictionaries), based on existing fields.
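As a sketch, here is a computed field machine that adds a Change field. The function name and the field names (Before, After, Change) are hypothetical, and the records are assumed to be cleaned already, so the scores are numbers.

```python
def compute_score_change(goats):
    """Add a 'Change' field to every record: After minus Before.

    The records are changed in place. The data set is also returned,
    so the function can be used as a step in a pipeline.
    """
    for goat in goats:
        goat['Change'] = goat['After'] - goat['Before']
    return goats
```

Changing the records in place is one choice. Another is to copy each dictionary first, which keeps the input data set untouched at the cost of a little extra memory.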

Records subset

Extracts a subset of records from a data set.

Pattern

Data machine: Records subset

Write a function taking a data set as a param, and returning another data set with a subset of the original records, based on criteria you choose.
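For instance, a subset machine could keep only the goats whose score improved. The criterion, the function name, and the field names are made up for this sketch.

```python
def get_improved_goats(goats):
    """Return a new data set with only the goats whose score went up."""
    improved_goats = []
    for goat in goats:
        # The criterion: After is higher than Before.
        if goat['After'] > goat['Before']:
            improved_goats.append(goat)
    return improved_goats
```

The original data set is not changed. The new list holds references to the same record dictionaries, just fewer of them.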

Field extractor

Extracts a field from a data set, into a list.

Pattern

Data machine: Field extractor

Write a function that extracts a list with the values of one field from a data set.
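A field extractor can be quite short. This sketch takes the field name as a parameter, which is an assumption; a version with the field name written into the function would work as well.

```python
def extract_field(data_set, field_name):
    """Return a list with the value of one field from every record."""
    values = []
    for record in data_set:
        values.append(record[field_name])
    return values
```

The result is a plain list, in the same order as the records, ready to hand to a computation function.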

Computation

A function that takes a data set, and computes something.

Pattern

Data machine: Computation

Use existing computation functions where you can, like statistics.mean. If you can't, like when you identify records with lowest/highest values in a data set, write your own loopy function.
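Both cases can be sketched together. statistics.mean is part of Python's standard library. The function find_highest_record, the field names, and the sample data are hypothetical, made up for illustration.

```python
import statistics


def find_highest_record(data_set, field_name):
    """Return the record with the highest value in the given field.

    Returns None for an empty data set. If several records tie,
    the first one found is returned.
    """
    highest_record = None
    for record in data_set:
        if (highest_record is None
                or record[field_name] > highest_record[field_name]):
            highest_record = record
    return highest_record


# Hypothetical cleaned data.
goats = [
    {'Goat': 'Billy', 'After': 12.5},
    {'Goat': 'Gruff', 'After': 7.5},
    {'Goat': 'Kid', 'After': 10.0},
]

# Existing function: works on a list of values.
after_scores = [goat['After'] for goat in goats]
mean_after = statistics.mean(after_scores)

# Own loop: needed because we want the whole record, not just the value.
top_goat = find_highest_record(goats, 'After')
```

Flipping the comparison from > to < would give a matching function for the lowest value.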