STAT 19000: Project 4 — Fall 2021
Motivation: Control flow consists of the tools and methods that you can use to control the order in which instructions are executed. We can execute certain tasks or code if certain requirements are met using if/else statements. In addition, we can perform operations many times in a loop using for loops. While these are important concepts to grasp, R differs from other programming languages in that operations are usually vectorized and there is little to no need to write loops.
Context: We are gaining familiarity working in Jupyter Lab and writing R code. In this project we introduce and practice using control flow in R, while continuing to reinforce concepts from the previous projects.
Scope: r, data.frames, recycling, factors, if/else, for loops
Dataset(s)
The following questions will use the following dataset(s):
-
/depot/datamine/data/olympics/*.csv
Questions
Question 1
Winning an olympic medal is an amazing achievement for athletes.
Before we take a deep dive into our data, it is always important to do some sanity checks, and understand what population our sample is representative of, particularly if we didn’t get a chance to participate in the data collection phase of the project.
Lets do a quick check on our dataset. We would expect that most athletes would not have won a medal. What percentage of athletes did not get a medal in the olympics
data.frame?
For simplicity, consider an "athlete" a row of the data.frame. Do not worry about the same athlete participating in the different olympics games, or in different sports.
We are considering the combination of Sport
, Event
, and Games
, as a unique identifier for an athlete.
Relevant topics: is.na, mean, indexing, sum
-
Code used to solve this problem.
-
Output from running the code.
Question 2
A friend of yours hypothesized that there is some association between getting a medal and the athletes age.
You want to test it out using our olympics
data.frame. To do so we will compare 2 new variables:
-
(This question) An indicator if the athlete in that year and sport won a medal or not
-
(Next question) Age converted into categories of ages.
Create a new variable in your olympics
data.frame called won_medal
which indicates whether the athlete in that year and sport won a medal or not.
Relevant topics: is.na
-
Code used to solve this problem.
-
Output from running the code.
Question 3
Now that we have our first new column, won_medal
, let’s categorize the Age
column. Using a for loop and if/else statements, create a new column in the olympics
data.frame called age_cat
. Use the guidelines below to do so.
-
"youth": less than 18 years old
-
"young adult": between 18 and 25 years old
-
"adult": 26 to 35 years old
-
"middle age adult": between 36 to 55 years old
-
"wise adult": greater than 55 years old
How many athletes are "young adults"?
Remember to consider the `NA`s as you are solving the problem. |
Relevant topics: nrow, if/else, for loops, indexing, is.na
-
Code used to solve this problem.
-
Output from running the code.
-
How many athletes are "young adults"?
Question 4
We did not need to use a for loop to solve the previous problem. Another way to solve the problem would have been to use a vectorized function called cut
.
Create a variable called age_cat_cut
using the cut
function to solve the problem above.
To check that you are getting the same results, run the following commands. If you used
If you didn’t use
|
Note that by default |
You can use the argument |
These past 2 questions do a good job emphasizing the importance of vectorized functions. How long did it take you to run the solution to question (3) vs question (4)? If you find yourself looping through one or more columns one at a time, there is likely a better option. |
Relevant topics: cut
-
Code used to solve this problem.
-
Output from running the code.
Question 5
Now that we have the new columns in the olympics
data.frame, look at the data and write down your conclusions. Is there some association between winning a medal and the athletes age?
There a couple of ways you can look at the data to make your conclusions. You can visualize using plots, using functions like barplot
, and pie
. Alternatively, you can use numeric summaries, like a table or table with proportions (prop.table
). Regardless of the method used, explain your findings, and feel free to get creative!
You do not need to use any special statistical test to make your conclusions. The goal of this question is to explore the data and think logically. |
The argument |
Relevant topics: barplot, pie, indexing, table, prop.table, balloonplot
-
Code used to solve this problem.
-
Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. |