STAT 29000: Project 13 — Fall 2020
Motivation: Databases you will work with won’t necessarily come organized in the way that you like. Getting really comfortable writing longer queries where you have to perform many joins, alias fields and tables, and aggregate results, is important. In addition, gaining some familiarity with terms like primary key, and foreign key will prove useful when you need to search for help online. In this project we will write some more complicated queries with a fun database. Proper preparation prevents poor performance, and that means practice!
Context: We are towards the end of a series of projects that give you an opportunity to practice using SQL. In this project, we will reinforce topics you’ve already learned, with a focus on subqueries and joins.
Scope: SQL, sqlite
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/movies_and_tv/imdb.db
For this project you will use SQLite to access the data. To connect to the database, copy and paste the following before your solutions in your .Rmd: ````markdown
|
If you want to use a SQLite-specific function like .tables
(or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command:
sqlite3 /class/datamine/data/movies_and_tv/imdb.db
From there, the SQLite-specific commands will function properly. They will not function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option eval=F
like this:
````markdown
SELECT * FROM titles LIMIT 5;
``
This will allow the code to be displayed without throwing an error.
Questions
Question 1
A primary key is a field in a table which uniquely identifies a row in the table. Primary keys must be unique values, and this is enforced at the database level. A foreign key is a field whose value matches a primary key in a different table. A table can have 0-1 primary key, but it can have 0+ foreign keys. Examine the titles
table. Do you think there are any primary keys? How about foreign keys? Now examine the episodes
table. Based on observation and the column names, do you think there are any primary keys? How about foreign keys?
A primary key can also be a foreign key. |
Here are two videos. The first video will remind you how to find the names of all of the tables in the |
-
List any primary or foreign keys in the
titles
table. -
List any primary or foreign keys in the
episodes
table.
Question 2
If you paste a title_id
to the end of the following url, it will pull up the page for the title. For example, www.imdb.com/title/tt0413573 leads to the page for the TV series Grey’s Anatomy. Write a SQL query to confirm that the title_id
tt0413573 does indeed belong to Grey’s Anatomy. Then browse imdb.com and find your favorite TV show. Get the title_id
from the url of your favorite TV show and run the following query, to confirm that the TV show is in our database:
SELECT * FROM titles WHERE title_id='<title id here>';
Make sure to replace "<title id here>" with the title_id
of your favorite show. If your show does not appear, or has only a single season, pick another show until you find one we have in our database with multiple seasons.
-
SQL query used to confirm that
title_id
tt0413573 does indeed belong to Grey’s Anatomy. -
The output of the query.
-
The
title_id
of your favorite TV show. -
SQL query used to confirm the
title_id
for your favorite TV show. -
The output of the query.
Question 3
The episode_title_id
column in the episodes
table references titles of individual episodes of a TV series. The show_title_id
references the titles of the show itself. With that in mind, write a query that gets a list of all episodes_title_id
(found in the episodes
table), with the associated primary_title
(found in the titles
table) for each episode of Grey’s Anatomy.
This video shows how to extract titles of episodes in the |
-
SQL query used to solve the problem in a code chunk.
Question 4
We want to write a query that returns the title and rating of the highest rated episode of your favorite TV show, which you chose in (2). In order to do so, we will break the task into two parts in (4) and (5). First, write a query that returns a list of episode_title_id
(found in the episodes
table), with the associated primary_title
(found in the titles
table) for each episode.
This part is just like question (3) but this time with your favorite TV show, which you chose in (2). |
This video shows how to use a subquery, to |
-
SQL query used to solve the problem in a code chunk.
-
The first 5 results from your query.
Question 5
Write a query that adds the rating to the end of each episode. To do so, use the query you wrote in (4) as a subquery. Which episode has the highest rating? Is it also your favorite episode?
Examples that utilize the relevant topics in this problem can be found here. |
-
SQL query used to solve the problem in a code chunk.
-
The
episode_title_id
,primary_title
, andrating
of the top rated episode from your favorite TV series, in question (2). -
A statement saying whether the highest rated episode is also your favorite episode.