Exercises: analysis

Exercise

Puffin length

A puffin

Download a file with data on the length of puffins. Here are the first few records:

  • "Name","Length"
  • "Roderick",26
  • "Junie",24.8
  • "Bea",29
  • "Rodney",26
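Here's a rough sketch of reading records like these. It uses Python's standard csv module, which is an assumption; the course's own data-reading functions may be what you're expected to use. The function name `read_records` is made up for this example, and the sample text is typed in so the sketch runs without the data file.

```python
import csv
import io

# The sample rows shown above, typed in so the sketch runs without a file.
SAMPLE = '''"Name","Length"
"Roderick",26
"Junie",24.8
"Bea",29
"Rodney",26
'''


def read_records(csv_file):
    """Return a list of dicts, one per data row. All values are strings."""
    return list(csv.DictReader(csv_file))


records = read_records(io.StringIO(SAMPLE))
# csv gives back strings, so convert lengths before doing any math.
lengths = [float(record["Length"]) for record in records]
```

With the real file, you would pass an open file instead of the `io.StringIO` object, opened with `newline=""` as the csv documentation recommends.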

The minimum puffin length is 24 cm. Values below that are invalid. Values above 50 cm are invalid, too.

Most puffins are between 24 cm and 34 cm in length. Puffins longer than 34 cm are mutants, possibly with superpowers. For example, the infamous super-villain The Dart was a mutant puffin. She sank 18 ships in the Atlantic before being taken out by Seal Team 3.

Seal Team 3
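The length rules above could be sketched as a small classifier. This is an illustration, not the required solution. The function name `classify_length` and the three labels are made up for this example. It assumes a length of exactly 34 cm counts as normal, since only lengths larger than 34 cm are mutants.

```python
def classify_length(length):
    """Classify a puffin length in cm.

    Returns "invalid", "normal", or "mutant" (hypothetical labels).
    """
    # Below 24 cm or above 50 cm is invalid.
    if length < 24 or length > 50:
        return "invalid"
    # Valid lengths larger than 34 cm are mutants.
    if length > 34:
        return "mutant"
    return "normal"
```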

Write a program that computes statistics for regular and mutant puffins. Use the statistics module, and at least two functions (I used three).

Here's what the report should look like:

  • Puffin Lengths
  • ====== =======
  •  
  • Normals: no super powers
  • Count: 41
  • Mean: 29.01
  • Standard deviation: 3.14
  •  
  • Mutants: might have super powers
  • Count: 3
  • Mean: 38.23
  • Standard deviation: 0.75

- - CUT SCREEN HERE - -

At one point, I put the wrong version of the data file in this exercise. Here's the output from that. Either one will count as correct.

  • Puffin Lengths
  • ====== =======
  •  
  • Normals: no superpowers
  • Count: 37
  • Mean: 28.93
  • Standard deviation: 3.29
  •  
  • Mutants: might have superpowers
  • Count: 7
  • Mean: 38.16
  • Standard deviation: 1.34
  •  
  • Urgent! Send this report to Seal Team 3

- - CUT SCREEN HERE - -

Round to two decimal places.

Show the Seal Team 3 warning if there are more than five mutants.
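A minimal sketch of the statistics and the warning rule might look like this. The function names `summarize` and `needs_warning` are made up. It assumes the sample standard deviation (`statistics.stdev`) is wanted; the exercise doesn't say, so swap in `statistics.pstdev` if the population version matches the expected report.

```python
import statistics


def summarize(lengths):
    """Return (count, mean, standard deviation), rounded to two places."""
    count = len(lengths)
    mean = round(statistics.mean(lengths), 2)
    # Assumption: sample standard deviation. Needs at least two values.
    std_dev = round(statistics.stdev(lengths), 2)
    return count, mean, std_dev


def needs_warning(mutant_count):
    """Return True when there are more than five mutants."""
    return mutant_count > 5


# Made-up lengths, not the exercise data.
count, mean, std_dev = summarize([38.0, 39.0, 37.0])
print(f"Count: {count}")
print(f"Mean: {mean}")
print(f"Standard deviation: {std_dev}")
if needs_warning(count):
    print("Urgent! Send this report to Seal Team 3")
```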

Upload a zip of your project folder. The usual coding standards apply.

Exercise

Goats Just Want to Have Jokes

Molly runs the comedy club, Goats Just Want to Have Jokes. For every comedian's set, she records the number of hecklers and the highest laugh volume, in dB. Write a program to work out some statistics for Molly.

Download the data file. Here's some of the data:

  • "Performer","Hecklers","Highest laugh dB"
  • "Aisha",5,113
  • "Andreas","a lot",100
  • "August",3,101
  • "Bertha",2,118
  • "Bessie",3,"Laptop broken"
  • "Boyd",5,115
  • "Bridgette",-2,114

When Aisha performed, she had 5 hecklers, and the loudest laugh was 113dB.

Valid records must have names. A name can't be MT (empty).

Valid heckler and volume input must be non-negative numbers.
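One way to sketch these checks is shown here. The function names are made up. It assumes fields arrive as strings (as they do from a CSV file) and treats a name of only spaces as empty, which the rules above don't spell out.

```python
def is_non_negative_number(value):
    """Return True if value can be read as a number that is zero or more."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        # Things like "a lot" or "Laptop broken" land here.
        return False
    return number >= 0


def is_valid_record(name, hecklers, volume):
    """Return True if all three fields pass their checks."""
    # Assumption: a name of only whitespace counts as MT.
    return (
        name.strip() != ""
        and is_non_negative_number(hecklers)
        and is_non_negative_number(volume)
    )
```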

Write a program that outputs stats like this:

  • Goats Just Want to Have Jokes
  • ===== ==== ==== == ==== =====
  •  
  • Counts
  • - - -
  • Valid records: 46
  • Invalid records: 4
  • Total records: 50
  •  
  • Hecklers
  • - - - -
  • Mean: 3.67
  • Std dev: 2.07
  • Smallest: 0 Dawn
  • Largest: 7 Johnnie
  •  
  • Sound
  • - - -
  • Mean: 101.3
  • Std dev: 11.36
  • Smallest: 81 Heather
  • Largest: 125 Herman
  •  
  • Correlation (hecklers, sound dB): -0.3

Use the statistics module. Use at least three functions (I had more in my solution).

Upload a zip of your project folder. The usual coding standards apply.

Exercise

Grades and starting salaries

This was from ye olde web on December 7, 2024.

- - CUT SCREEN HERE - -

The average salary for an entry-level Python developer in the United States is around $80,625 per year, or $68.10 per hour.

Salary

The average salary for an entry-level Python developer is $80,625 per year, which is up from $73,551 in 2023.

Hourly rate

The average hourly pay for an entry-level Python programmer is $68.10, with a range of $40.62–$82.93.

Additional pay

The estimated additional pay for an entry-level Python developer is $17,260 per year, which could include cash bonuses, commissions, tips, and profit sharing.

Here are some other Python developer salaries:

  • Mid-level: $127,363 per year
  • Senior: $201,196 per year
  • Top earners: $188,507 per year

- - CUT SCREEN HERE - -

Write a program to compute statistics for the relationship between grades in a Python course and starting salary. Use this data file. (I made up the raw data, based on the stats above.)

Here's some data from the file:

  • Python grade,Starting salary
  • 2.5,74222
  • 3.3,88878
  • 3.1,84680
  • 3.2,83757
  • 2.4,64016

The first field is a student's grade in a Python course. The second is their starting salary.

Rules for valid data:

  • Grades are floats that cannot be less than 0 or greater than 4.
  • Salaries are integers that cannot be less than 0 or greater than 400,000.

Only analyze data from valid records. As usual, all fields in a record must be valid for the record to be considered valid.
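The validity rules could be sketched like this. The function names are made up, and the sketch assumes both fields arrive as strings read from the CSV file. With string input, `int()` rejects text like "84680.5", so a salary with a decimal part counts as invalid.

```python
def is_valid_grade(text):
    """Return True if text is a float from 0 to 4."""
    try:
        grade = float(text)
    except (TypeError, ValueError):
        return False
    return 0 <= grade <= 4


def is_valid_salary(text):
    """Return True if text is an integer from 0 to 400,000."""
    try:
        salary = int(text)
    except (TypeError, ValueError):
        return False
    return 0 <= salary <= 400000


def is_valid_record(grade_text, salary_text):
    """Return True only when every field in the record is valid."""
    return is_valid_grade(grade_text) and is_valid_salary(salary_text)
```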

Here's the program's output with the data set given, showing the type of data your program should produce:

  • Grades and starting salaries for Python programmers
  • ===================================================
  •  
  • Number of records: 49
  • Number of valid records: 45
  •  
  • Mean grade: 3.14
  • Mean starting salary: 87590
  •  
  • Correlation: 0.96

Of course, your program should be able to work with any compatible data set.

Round values as shown. Two decimal places for mean grade, zero for mean salary, and two for correlation.

Use functions. My solution had six, but you can have more or fewer. Include docstrings for every function.

My main program was 17 lines without comments. Ten lines were output, so the guts of the main program was just seven lines. Almost all of the program's code was in functions.

Upload your solution here, not to Moodle. The usual programming standards apply.

Exercise

Pokemon EXP

This data is modified from a Kaggle data set.

The data set has records with three fields. Here is part of the CSV file:

  • "trainer","number_pokemon","total_EXP"
  • "Youngster Tristan",1,60
  • "Youngster Logan",1,65
  • "Lass Natalie",1,62
  • "Youngster Michael",2,150
  • "Camper",1,61

Fields:

  • Trainer name. Validation rule: cannot be MT (empty).
  • Number of Pokémon. Validation rule: an integer from 0 to 6.
  • EXP. Validation rule: an integer from 0 to 50,000.

Write a program to validate the data, and show means and counts comparing trainers with few Pokémon (0 to 2) to those with many (3 to 6). Here is my output:

  • Pokemon Trainers Exp Points
  • ======= ======== === ======
  •  
  • Valid records: 919
  • Invalid records: 8
  • Total records: 927
  •  
  • Low pokemon count (0-2)
  • - - - - - - - - - - - -
  • Records: 697
  • Mean: 1653.22
  •  
  • High pokemon count (3-6)
  • - - - - - - - - - - - -
  • Records: 222
  • Mean: 4097.98

Use line spacing and underlining as shown.

Add docstrings to every function.

Round means to two decimal places.
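As a rough illustration of the low/high comparison, here's one way to group already-validated records and get each group's count and mean. The function names and the list-of-pairs layout are assumptions for this sketch, not the layout of any particular solution.

```python
import statistics


def group_label(number_pokemon):
    """Return "low" for 0 to 2 Pokémon, "high" for 3 to 6."""
    return "low" if number_pokemon <= 2 else "high"


def mean_exp_by_group(records):
    """Return {group: (count, mean EXP rounded to two places)}.

    records is a list of (number_pokemon, total_exp) pairs that have
    already been validated.
    """
    groups = {"low": [], "high": []}
    for number_pokemon, total_exp in records:
        groups[group_label(number_pokemon)].append(total_exp)
    summary = {}
    for label, exps in groups.items():
        # statistics.mean() fails on an empty list, so guard against it.
        mean = round(statistics.mean(exps), 2) if exps else None
        summary[label] = (len(exps), mean)
    return summary


# Made-up records, not the exercise data.
summary = mean_exp_by_group([(1, 60), (1, 65), (2, 150), (3, 400), (6, 1000)])
```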

Upload your solution to this website as usual, not to Moodle.

The usual programming standards apply.

Exercise

Suspicious applications

Grinder Corp.'s HR team wants your help filtering job applications for data analyst positions. They no longer fully trust university grades, since AI cheating is so easy. So, they ask applicants to write programs for basic data analysis tasks. They run the programs through an AI checker that gives a score from 1 (not done by AI) to 10 (definitely AI). They also work out applicants' GPA on math, statistics, and programming courses.

They give you a CSV file with the data (download it). Here's part of it:

  • "Name","GPA","AI detection score"
  • "Roderick",3.4,3
  • "Junie",3.4,6
  • "Bea",3.2,5
  • "Rodney",3.1,6
  • "Weldon",3.5,5
  • "Del",3.2,3
  • "Charissa",6.8,9

The fields and validation rules are:

  • Name should not be MT (empty), or consist of just whitespace characters.
  • GPA should be a float from 0 to 4.
  • AI detection score should be an integer from 1 to 10.
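These three rules might be sketched with one shared range-checking helper. The names `in_range` and `is_valid_application` are made up. The sketch assumes every field arrives as a string; if a score were already a float such as 5.5, `int()` would quietly truncate it instead of rejecting it.

```python
def in_range(text, convert, low, high):
    """Return True if convert(text) works and low <= result <= high."""
    try:
        value = convert(text)
    except (TypeError, ValueError):
        return False
    return low <= value <= high


def is_valid_application(name, gpa, score):
    """Return True if all three fields pass their validation rules."""
    return (
        name.strip() != ""               # Not MT, not just whitespace.
        and in_range(gpa, float, 0, 4)   # GPA: float from 0 to 4.
        and in_range(score, int, 1, 10)  # AI score: integer from 1 to 10.
    )
```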

Write a program that goes through the data, filters out invalid records, lists applicants the company will consider, and shows some summary counts. Here's what the output should be:

  • AI suspect cleaner
  • == ======= =======
  •  
  • Applicants for further review:
  • Weldon
  • Gail
  • Margy
  • Anneliese
  • Nubia
  • Delmar
  • Valery
  • Lecia
  • Ashanti
  • Cassy
  • Chastity
  • Bo
  •  
  • Counts (I love to count, ah, ah, ah!)
  •  
  • Records: 50
  • Valid records: 44
  • Suspicious applications: 32
  • Accepted applications: 12
  • Fraction accepted (%): 27.3
  •  
  • All hail King Snorlax!

The company will consider applicants who have a GPA of at least 3.5 and an AI score of no more than 5.
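The acceptance rule and the percentage could be sketched as below. The function names are made up. The sketch assumes the fraction is accepted applications divided by valid records, which is consistent with the sample output (12 of 44 is 27.3%).

```python
def is_accepted(gpa, ai_score):
    """Return True for a GPA of at least 3.5 and an AI score of at most 5."""
    return gpa >= 3.5 and ai_score <= 5


def percent_accepted(accepted_count, valid_count):
    """Return accepted as a percentage of valid, to one decimal place."""
    if valid_count == 0:
        # Avoid dividing by zero when there are no valid records.
        return 0.0
    return round(accepted_count / valid_count * 100, 1)
```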

Make sure you have at least three functions. Add correctly formatted docstrings to each one. Check the textbook so you know what "correctly formatted" means.

Make sure your output's format matches the above, including the blank lines. Round the accepted fraction to one decimal place. You can replace Snorlax with your fave Pokemon, if you want, though Snorlax is clearly the best. Seriously, what's up with that potato one?

As usual, upload a zip of your solution, with code and data file, to this site. Do not upload to Moodle.

The usual coding standards apply.

Exercise

Goat influencers

More and more goats are following YouTube influencers. But what type of content are they most interested in? Write a program to work out which category of influencer has the most growth.

Download this data set. Here's part of it:

  • Influencer,Category,Last year,This year
  • Aisha,entertainment,84789,131902
  • Andreas,entertainment,60528,103209
  • August,tech,89564,103189
  • Bertha,lifestyle,91941,137909

Each record has four fields. Here they are, with their validation rules.

  • Goat name. Cannot be empty.
  • Content category. One of entertainment, lifestyle, or tech. Extra leading or trailing spaces are OK, and case doesn't matter. So " Tech " is valid, but "t3ch" is not.
  • Last year's subscribers: number, zero or more.
  • This year's subscribers: number, zero or more.
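The category rule could be sketched like this: strip the spaces, lowercase the text, then check it against the allowed values. The names `VALID_CATEGORIES` and `clean_category` are made up for this example.

```python
VALID_CATEGORIES = {"entertainment", "lifestyle", "tech"}


def clean_category(raw):
    """Return the normalized category, or None if it is not valid."""
    # Extra spaces and uppercase/lowercase differences are allowed.
    category = raw.strip().lower()
    return category if category in VALID_CATEGORIES else None
```

Returning the cleaned value, rather than just True or False, means " Tech " and "tech" end up in the same group later.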

Only include valid records in your analysis.

Write a program to show the average changes in subscribers for each category. Use at least three functions. Using the data machine functions from this textbook might be easiest: for example, cleaning the record set, adding the change in subscribers to the data set as a computed field, getting a record subset for each category, and so on. Use the statistics module if you want; I did in my solution.

Here's what your program's output should be:

  • Goats Influencers
  • ===== ===========
  •  
  • Subscriber changes by category
  •  
  • Counts
  • - - -
  • Valid records: 45
  • Invalid records: 5
  • Total records: 50
  •  
  • Category mean changes
  • - - - - - - - - - -
  • Entertainment: 36453.7
  • Lifestyle: 15022.7
  • Tech: 14326.8
  •  
  • Category with the highest change: Entertainment

No, don't write a program with just a bunch of print statements. Someone tries that every so often. Your program's output should change if the data changes.

Include the record counts as shown.

The "Category mean changes" are average changes in subscribers between last year and this, that is, this year minus last year. So for this record...

  • Aisha,entertainment,84789,131902

... the value to be analyzed is 131902 - 84789. Check the computed fields lesson if that is not clear.
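Here's a rough sketch of the computed field and the per-category means, using only the four sample rows shown earlier (so the numbers won't match the full report). The dict keys and the function name `mean_change_by_category` are assumptions; the textbook's data machine functions may organize this differently.

```python
import statistics


def mean_change_by_category(records):
    """Return {category: mean subscriber change, to one decimal place}."""
    changes = {}
    for record in records:
        # Computed field: this year minus last year.
        change = record["this_year"] - record["last_year"]
        changes.setdefault(record["category"], []).append(change)
    return {
        category: round(statistics.mean(values), 1)
        for category, values in changes.items()
    }


# The four sample rows from above, already cleaned.
records = [
    {"category": "entertainment", "last_year": 84789, "this_year": 131902},
    {"category": "entertainment", "last_year": 60528, "this_year": 103209},
    {"category": "tech", "last_year": 89564, "this_year": 103189},
    {"category": "lifestyle", "last_year": 91941, "this_year": 137909},
]

means = mean_change_by_category(records)
# The category whose mean change is largest.
top_category = max(means, key=means.get)
```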

Write a program to read all the data in the CSV file, perform the calculations, and output the results in the format shown. The averages should be to one decimal place, as you can see. The usual coding standards apply.

Upload your solution here, as usual, not to Moodle.