TDM 20100: Project 9 — 2023
Motivation: Although SQL syntax may still feel unnatural and foreign, with more practice it will start to make more sense. The ability to read and write SQL queries is a "bread-and-butter" skill for anyone working with data.
Context: We are in the second of a series of projects that focus on learning the basics of SQL. In this project, we will continue to harden our understanding of SQL syntax, and introduce common SQL functions like AVG
, COUNT
, and MAX
.
Scope: SQL, sqlite
Dataset(s)
For this project, we will be using the lahman
sqlite database. This database contains the data in the directory
-
/anvil/projects/tdm/data/lahman
You may get some more lahman
database information from this youtube video 2023 SABR Analytics:Sean Lahman, "introduction to Baseball Databases"
To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook to establish a connection with the database.
%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db
For every following cell where you want to run a SQL query, prepend %%sql
to the top of the cell — just like we do for R or bash cells.
Questions
In previous projects, we used awk to parse through and summarize data. Using SQL gives us more ways to analyze and summarize data.
|
Make sure all queries limit output to only 100 rows. You may refer to SQLite-Limit Syntax If you want the headers to be more descriptive, you can use aliases.You may refer to SQLite-aliases Syntax |
Question 1 (2 pts)
-
Write a query to find out who won the 2022 World Baseball Series from table
seriespost
? -
For this champion team, please find out the home runs (hr) rate and runs batted in (rbi) rate in year 2022 from table
batting
. Round the rates to 2 decimals. You may get rates by-
hr_rate = sum of home runs / sum of hits
-
rbi_rate = sum of runs batted in / sum of hits
-
Use the
Try to do the calculation without |
Question 2 (2 pts)
-
For the champion team from question 1, please write a query that counts the number of RBIs for each athlete in the champion team during year 2022, using the
batting
table. Display your output in ascending order. -
Run the query again, but this time, display the output in descending order.
-
Which athlete has the highest RBIs in this question? Please provide the player’s
playerID
, along with their first name and last name, from thepeople
table
|
Question 3 (2 pts)
-
Write a query that finds how many times the athlete from question 2 attended All Star Games.
-
Write a query to find out who is the athlete that attended most All Star Games in the entire data set.
Question 4 (1 pt)
-
Write a query that gets the average
salary
for each athlete in the database. Display your output in descending order. Limit the output to 100 rows (i.e., to the top 100 salaries).
Question 5 (1 pt)
Now create your own query about a topic that you are interested in. Use at least one of the aggregation functions, such as min
, max
, count
, or sum
. Be sure to use group by
and display the results in order with order by
.
Project 09 Assignment Checklist
-
Jupyter notebook with your code, comments and output for questions 1 to 5
-
firstname-lastname-project09.ipynb
-
-
Submit files through Gradescope
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |