STAT 29000: Project 11 — Spring 2021
Motivation: Data wrangling is the process of gathering, cleaning, structuring, and transforming data. Data wrangling is a big part of any data-driven project, and can sometimes take a great deal of time. tidyverse is a great, but opinionated, suite of integrated packages to wrangle, tidy, and visualize data. It is useful to gain some familiarity with this collection of packages, in case you run into a situation where these packages are needed. You may even find that you enjoy using them!
Context: We have covered a few topics on the tidyverse packages, but there is a lot more to learn! We will continue our strong focus on the tidyverse (including ggplot) and data wrangling tasks.
Scope: R, tidyverse, ggplot
Make sure to read about and use the template found here, and read the important information about project submissions here.
The tidyverse consists of a variety of packages, including, but not limited to: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and lubridate.
One of the underlying premises of the tidyverse is getting the data to be tidy. You can read a lot more about this in Hadley Wickham’s excellent book, R for Data Science.
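As a rough illustration of what "tidy" means here, the sketch below reshapes a small, made-up table from wide form (one column per question) to long form (one row per user and question). The tibble `wide` and its values are hypothetical; only `pivot_longer` and the `question` and `selected_option` column names mirror the project code.

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one row per user, one column per question.
wide <- tibble(
  id = 1:2,
  q1 = c("Yes", "No"),
  q2 = c("Often", NA)
)

# Long (tidy) form: one row per user-question pair.
long <- pivot_longer(
  wide,
  cols = starts_with("q"),
  names_to = "question",
  values_to = "selected_option"
)
long
```

Each observation (a user's answer to one question) now sits in its own row, which is the shape that functions such as `group_by` and `ggplot` work with most naturally.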
There is an excellent graphic here that illustrates a general workflow for data science projects:
- Import
- Tidy
- Iterate on, to gain understanding:
  - Transform
  - Visualize
  - Model
- Communicate
This is a good general outline of how a project could be organized, but depending on the project or company, this could vary greatly and change as the goals of a project change.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/okcupid/filtered/*.csv
Questions
datamine_py()
library(tidyverse)
questions <- read_csv2("/class/datamine/data/okcupid/filtered/questions.csv")
users <- read_csv("/class/datamine/data/okcupid/filtered/users.csv")
# Add an id, sample 2200 users, pivot all of columns 1:2278 to long format, then merge with questions.
users$id <- 1:nrow(users)
set.seed(12345)
columns_to_pivot <- 1:2278
users_sample_long <- users[sample(nrow(users), 2200),] %>%
  mutate_at(columns_to_pivot, as.character) %>%
  pivot_longer(cols = columns_to_pivot, names_to="question", values_to = "selected_option")
myDF <- merge(users_sample_long, questions, by.x = "question", by.y = "X")
# Same steps, but column 1242 is left out of the pivot.
users$id <- 1:nrow(users)
set.seed(12345)
columns_to_pivot <- 1:2278
users_sample_long <- users[sample(nrow(users), 2200),] %>%
  mutate_at(columns_to_pivot, as.character) %>%
  pivot_longer(cols = columns_to_pivot[-1242], names_to="question", values_to = "selected_option")
myDF <- merge(users_sample_long, questions, by.x = "question", by.y = "X")
# Same steps, but only columns whose names start with "q" are pivoted.
users$id <- 1:nrow(users)
set.seed(12345)
columns_to_pivot <- 1:2278
users_sample_long <- users[sample(nrow(users), 2200),] %>%
  mutate_at(columns_to_pivot, as.character) %>%
  pivot_longer(cols = columns_to_pivot[-(which(substr(names(users), 1, 1) != "q"))], names_to="question", values_to = "selected_option")
myDF <- merge(users_sample_long, questions, by.x = "question", by.y = "X")
# Assign each user to a generation based on d_age.
myDF <- myDF %>% mutate(generation = case_when(
  d_age <= 24 ~ "Gen Z",
  between(d_age, 25, 40) ~ "Millennial",
  between(d_age, 41, 56) ~ "Gen X",
  between(d_age, 57, 66) ~ "Boomers II",
  TRUE ~ "Other"
))
ggplot(myDF[1:100,]) +
  geom_point(aes(x=d_age, y = lf_min_age, col=gender2), alpha=.6) +
  labs(title="Minimum dating age by gender", x="User age", y="Minimum date age")
Question 1
Let’s pick up where we left off in project 10. For those who struggled with project 10, I will post the solutions above either on Saturday morning, or at the latest Monday. Re-run your code from project 10 so we, once again, have our tibble, myDF.
At the end of project 10 we created a scatterplot showing d_age on the x-axis, and lf_min_age on the y-axis. In addition, we colored the points by gender2. In many cases, instead of just coloring the different dots, we may want to do the exact same plot for different groups. This can easily be accomplished using ggplot.
Without splitting or filtering your data prior to creating the plots, create a graphic with plots for each generation where we show d_age on the x-axis and lf_min_age on the y-axis, colored by gender2.
You do not need to modify |
This may take quite a few minutes to create. Before creating a plot with the entire myDF, use myDF[1:50,]. If you are in a time crunch, the minimum number of points to plot to get full credit is 500, but if you wait, the plot is a bit more telling.
- R code used to solve the problem.
- Output from running your code.
- The plot produced.
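One way to draw the same plot for each group without splitting the data is faceting. The sketch below is only an illustration on the `mpg` dataset that ships with ggplot2, used here as a stand-in for myDF; the choice of `displ`, `hwy`, `drv`, and `class` is an assumption made for the example, not part of the project data.

```r
library(ggplot2)

# mpg ships with ggplot2 and stands in for myDF in this sketch.
p <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, col = drv), alpha = 0.6) +
  facet_wrap(~class) +  # one panel per value of class
  labs(title = "Highway mileage by engine size, one panel per class",
       x = "Engine displacement (L)", y = "Highway mpg")
p
```

The data frame is passed to `ggplot` whole; `facet_wrap` does the grouping, so no filtering is needed beforehand.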
Question 2
By default, facet_wrap and facet_grid maintain the same scale for the x and y axes across the various plots. This makes it easier to compare visually. In this case, it may make it harder to see the patterns that emerge. Modify your code from question (1) to allow each facet to have its own x and y axis limits.
Look at the argument |
- R code used to solve the problem.
- Output from running your code.
- The plot produced.
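As a hedged sketch of per-facet axis limits, `facet_wrap` accepts a `scales` argument; setting it to `"free"` lets both axes vary by panel. The example again uses the `mpg` dataset from ggplot2 as a stand-in, so the variables are illustrative only.

```r
library(ggplot2)

# scales = "free" lets each panel pick its own x and y limits.
# "free_x" or "free_y" would free only one of the two axes.
p_free <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, col = drv), alpha = 0.6) +
  facet_wrap(~class, scales = "free")
p_free
```

Free scales make patterns within a panel easier to see, at the cost of making panels harder to compare with one another.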
Question 3
Let’s say we have a theory that the older generations tend to smoke more. You decide to create a plot that compares the percentage of smokers per generation. Before we do this, we need to wrangle the data a bit.
What are the possible values of d_smokes? Create a new column in myDF called is_smoker that has values TRUE, FALSE, or NA when applicable. You will need to determine how you will assign a user as a smoker or not; this is up to you! Explain your cutoffs. Make sure you stay in the tidyverse to solve this problem.
- R code used to solve the problem.
- Output from running your code.
- 1-2 sentences explaining your logic and cutoffs for the new is_smoker column.
- The table of the is_smoker column.
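A minimal sketch of building a logical column with `mutate` and `case_when` is shown below. The answer values in `toy` are hypothetical (the real values of d_smokes may differ), and the cutoff used, treating anything other than "No" as a smoker, is just one possible choice.

```r
library(dplyr)
library(tibble)

# Hypothetical answers; the real d_smokes values may differ.
toy <- tibble(d_smokes = c("No", "Yes", "Sometimes", NA))

toy <- toy %>%
  mutate(is_smoker = case_when(
    is.na(d_smokes)  ~ NA,     # unknown stays unknown
    d_smokes == "No" ~ FALSE,  # explicit non-smokers
    TRUE             ~ TRUE    # every other answer counts as smoking
  ))

# Tabulate the new column, including the NA group.
count(toy, is_smoker)
```

`case_when` evaluates its conditions in order, so the `is.na` check comes first to keep missing answers from falling through to the catch-all branch.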
Question 4
Great! Now that we have our new is_smoker column, create a new tibble called smokers_per_gen. smokers_per_gen should be a summary of myDF containing the percentage of smokers per generation.
The result, |
- R code used to solve the problem.
- Output from running your code.
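The group-then-summarize pattern can be sketched on made-up data as follows. The tibble `toy`, its values, and the column name `pct_smokers` are all hypothetical; the sketch also assumes that NA answers should be excluded from the denominator, which is a choice you may want to make differently.

```r
library(dplyr)
library(tibble)

# Hypothetical data, for illustration only.
toy <- tibble(
  generation = c("Gen X", "Gen X", "Gen Z", "Gen Z", "Gen Z", "Gen Z", "Gen Z"),
  is_smoker  = c(TRUE, FALSE, TRUE, NA, FALSE, FALSE, FALSE)
)

# The mean of a logical vector is the proportion of TRUE values.
toy_per_gen <- toy %>%
  group_by(generation) %>%
  summarize(pct_smokers = mean(is_smoker, na.rm = TRUE))
toy_per_gen
```

Multiply by 100 if you prefer the value on a 0 to 100 scale rather than as a proportion.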
Question 5
Create a Cleveland dot plot using ggplot to show the percentage of smokers for each different generation. Use ggthemr to give your plot a new look! You can choose any theme you’d like!
Is our theory from question (3) correct? Explain why you think so, or not.
(OPTIONAL I, 0 points) To give the plot a more aesthetic look, consider reordering the data by percentage of smokers, or even by the age of generation. You can do that before passing the data using the arrange function, or inside the geom_point function, using the reorder function. To re-order by generation, you can either use brute force, or you can create a new column called avg_age while using summarize. avg_age should be the average age for each group (using the variable d_age). You can use this new column, avg_age, to re-order the data.
(OPTIONAL II, 0 points) Improve our plot by changing the x-axis to be displayed as a percentage. You can use the scales package and the function scale_x_continuous to accomplish this.
Use |
- R code used to solve the problem.
- Output from running your code.
- The plot produced.
- 1-2 sentences commenting on the theory, and your conclusions based on your plot (if any).
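A Cleveland dot plot is essentially `geom_point` with a categorical variable on one axis. The sketch below uses made-up summary values (the numbers and the `pct_smokers` column name are hypothetical) and shows the two optional ideas: ordering the categories with `reorder` and formatting the axis with `scales::percent`. The ggthemr step is left out so the example depends only on ggplot2 and scales.

```r
library(ggplot2)

# Hypothetical summary values, for illustration only.
toy_per_gen <- data.frame(
  generation  = c("Gen Z", "Millennial", "Gen X"),
  pct_smokers = c(0.20, 0.30, 0.25)
)

p_dot <- ggplot(toy_per_gen) +
  # reorder() sorts the generations by their percentage of smokers.
  geom_point(aes(x = pct_smokers, y = reorder(generation, pct_smokers)),
             size = 3) +
  # Show proportions such as 0.25 as 25%.
  scale_x_continuous(labels = scales::percent) +
  labs(x = "Smokers", y = "Generation")
p_dot
```

Replacing `pct_smokers` inside `reorder` with an average-age column would order the generations by age instead.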