STAT 29000: Project 11 — Spring 2021

Motivation: Data wrangling is the process of gathering, cleaning, structuring, and transforming data. It is a big part of any data-driven project and can sometimes take a great deal of time. The tidyverse is a great, but opinionated, suite of integrated packages to wrangle, tidy, and visualize data. It is useful to gain some familiarity with this collection of packages in case you run into a situation where they are needed. You may even find that you enjoy using them!

Context: We have covered a few topics on the tidyverse packages, but there is a lot more to learn! We will continue our strong focus on the tidyverse (including ggplot) and data wrangling tasks.

Scope: R, tidyverse, ggplot

Learning objectives

Explain the differences between regular data frames and tibbles.
Use mutate, pivot, unite, filter, and arrange to wrangle data and solve data-driven problems.
Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows.
Group data and calculate aggregated statistics using group_by, mutate, and transform functions.
Demonstrate the ability to create basic graphs with default settings, in ggplot.
Demonstrate the ability to modify axes labels and titles.
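
As a small illustration of the first objective, the sketch below contrasts two behaviors that commonly differ between a base data frame and a tibble. The data is made up for this example; only the column names echo the project.

```r
library(tibble)

# Hypothetical two-row data set, not taken from the project data.
df  <- data.frame(age = c(23, 35), gender = c("Woman", "Man"))
tbl <- as_tibble(df)

# Selecting one column with [ , ]: a data frame drops to a plain vector,
# while a tibble stays a tibble.
class(df[, "age"])
class(tbl[, "age"])

# Partial matching with $: a data frame matches "gen" to "gender",
# while a tibble returns NULL (and warns).
df$gen
suppressWarnings(tbl$gen)
```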

Make sure to read about and use the template found here, and read the important information about project submissions here.

The tidyverse consists of a variety of packages, including, but not limited to: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and lubridate.

One of the underlying premises of the tidyverse is getting the data to be tidy. You can read a lot more about this in Hadley Wickham’s excellent book, R for Data Science.
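
To make "tidy" a little more concrete, here is a minimal sketch that reshapes a wide table (one column per question) into a long one (one row per user-question pair). The table `wide` and its values are hypothetical; the `question` and `selected_option` names mirror the ones used later in this project.

```r
library(tidyr)
library(tibble)

# Hypothetical wide table: one row per user, one column per question.
wide <- tibble(
  id = 1:2,
  q1 = c("Yes", "No"),
  q2 = c("No", NA)
)

# Long (tidy) form: one row per user-question pair.
long <- pivot_longer(
  wide,
  cols      = starts_with("q"),
  names_to  = "question",
  values_to = "selected_option"
)
long
```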

There is an excellent graphic here that illustrates a general workflow for data science projects:

Import
Tidy
Iterate on, to gain understanding:
1. Transform
2. Visualize
3. Model
Communicate

This is a good general outline of how a project could be organized, but depending on the project or company, this could vary greatly and change as the goals of a project change.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/okcupid/filtered/*.csv

Questions

datamine_py()
library(tidyverse)
questions <- read_csv2("/class/datamine/data/okcupid/filtered/questions.csv")
users <- read_csv("/class/datamine/data/okcupid/filtered/users.csv")

users$id <- 1:nrow(users)
set.seed(12345)
columns_to_pivot <- 1:2278
users_sample_long <- users[sample(nrow(users), 2200),] %>%
    mutate_at(columns_to_pivot, as.character) %>%
    pivot_longer(cols = columns_to_pivot, names_to="question", values_to = "selected_option")
myDF <- merge(users_sample_long, questions, by.x = "question", by.y = "X")

users$id <- 1:nrow(users)
set.seed(12345)
columns_to_pivot <- 1:2278
users_sample_long <- users[sample(nrow(users), 2200),] %>%
    mutate_at(columns_to_pivot, as.character) %>%
    pivot_longer(cols = columns_to_pivot[-1242], names_to="question", values_to = "selected_option")
myDF <- merge(users_sample_long, questions, by.x = "question", by.y = "X")

users$id <- 1:nrow(users)
set.seed(12345)
columns_to_pivot <- 1:2278
users_sample_long <- users[sample(nrow(users), 2200),] %>%
    mutate_at(columns_to_pivot, as.character) %>%
    pivot_longer(cols = columns_to_pivot[-(which(substr(names(users), 1, 1) != "q"))], names_to="question", values_to = "selected_option")
myDF <- merge(users_sample_long, questions, by.x = "question", by.y = "X")

myDF <- myDF %>% mutate(generation=case_when(d_age<=24 ~ "Gen Z",
                                     between(d_age, 25, 40) ~ "Millennial",
                                     between(d_age, 41, 56) ~ "Gen X",
                                     between(d_age, 57, 66) ~ "Boomers II",
                                     TRUE ~ "Other"))

ggplot(myDF[1:100,]) +
    geom_point(aes(x=d_age, y = lf_min_age, col=gender2), alpha=.6) +
    labs(title="Minimum dating age by gender", x="User age", y="Minimum date age")

Question 1

Let’s pick up where we left off in project 10. For those who struggled with project 10, I will post the solutions above on Saturday morning or, at the latest, Monday. Re-run your code from project 10 so we, once again, have our tibble, myDF.

At the end of project 10 we created a scatterplot showing d_age on the x-axis, and lf_min_age on the y-axis. In addition, we colored the points by gender2. In many cases, instead of just coloring the different dots, we may want to do the exact same plot for different groups. This can easily be accomplished using ggplot.

Without splitting or filtering your data prior to creating the plots, create a graphic with plots for each generation where we show d_age on the x-axis and lf_min_age on the y-axis, colored by gender2.

You do not need to modify myDF at all.

This may take quite a few minutes to create. Before creating a plot with the entire myDF, test your code on myDF[1:50,]. If you are in a time crunch, the minimum number of points to plot to get full credit is 500, but if you wait, the plot is a bit more telling.
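
One way to get a panel per group without splitting the data is faceting. The sketch below uses a small made-up data frame, `toy`, that borrows the column names of myDF. It is meant to show the mechanics rather than the expected answer.

```r
library(ggplot2)

# Hypothetical stand-in for myDF; the values are invented.
toy <- data.frame(
  d_age      = c(22, 24, 30, 38, 45, 52),
  lf_min_age = c(18, 20, 25, 30, 35, 40),
  gender2    = c("Woman", "Man", "Woman", "Man", "Woman", "Man"),
  generation = c("Gen Z", "Gen Z", "Millennial", "Millennial", "Gen X", "Gen X")
)

# facet_wrap() draws the same plot once per value of `generation`.
p <- ggplot(toy) +
  geom_point(aes(x = d_age, y = lf_min_age, col = gender2), alpha = .6) +
  facet_wrap(~ generation) +
  labs(title = "Minimum dating age by gender", x = "User age", y = "Minimum date age")
p
```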

Items to submit

R code used to solve the problem.
Output from running your code.
The plot produced.

Question 2

By default, facet_wrap and facet_grid maintain the same scale for the x and y axes across the various plots. This makes the plots easier to compare visually. In this case, however, it may make it harder to see the patterns that emerge. Modify your code from question (1) to allow each facet to have its own x and y axis limits.

Look at the argument scales in the facet_wrap/facet_grid functions.
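
As a rough sketch of how that argument behaves, the example below facets a made-up data frame and lets both axes vary per panel. The data frame `toy` is hypothetical; `scales` also accepts "free_x" and "free_y" if only one axis should vary.

```r
library(ggplot2)

# Hypothetical data with two groups on very different ranges.
toy <- data.frame(
  d_age      = c(20, 22, 24, 50, 53, 56),
  lf_min_age = c(18, 19, 21, 40, 44, 47),
  generation = c("Gen Z", "Gen Z", "Gen Z", "Gen X", "Gen X", "Gen X")
)

# scales = "free" gives every panel its own x and y limits.
p_free <- ggplot(toy) +
  geom_point(aes(x = d_age, y = lf_min_age)) +
  facet_wrap(~ generation, scales = "free")
p_free
```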

Items to submit

R code used to solve the problem.
Output from running your code.
The plot produced.

Question 3

Let’s say we have a theory that the older generations tend to smoke more. You decide to create a plot that compares the percentage of smokers per generation. Before we do this, we need to wrangle the data a bit.

What are the possible values of d_smokes? Create a new column in myDF called is_smoker that has values TRUE, FALSE, or NA when applicable. You will need to determine how you will assign a user as a smoker or not — this is up to you! Explain your cutoffs. Make sure you stay in the tidyverse to solve this problem.
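
The sketch below shows one possible shape for such a column using case_when. The d_smokes values in `toy` are assumptions made up for illustration, so inspect the real values in myDF first (for example with count). The cutoff shown is only one of several defensible choices.

```r
library(dplyr)
library(tibble)

# Hypothetical d_smokes values; the real column may use different labels.
toy <- tibble(d_smokes = c("No", "Sometimes", "Yes", NA, "No"))

toy <- toy %>%
  mutate(is_smoker = case_when(
    is.na(d_smokes)  ~ NA,     # missing answers stay missing
    d_smokes == "No" ~ FALSE,  # assumed cutoff: only "No" is a non-smoker
    TRUE             ~ TRUE    # any other answer counts as smoking
  ))

# Table of the new column.
count(toy, is_smoker)
```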

Items to submit

R code used to solve the problem.
Output from running your code.
1-2 sentences explaining your logic and cutoffs for the new is_smoker column.
The table of the is_smoker column.

Question 4

Great! Now that we have our new is_smoker column, create a new tibble called smokers_per_gen. smokers_per_gen should be a summary of myDF containing the percentage of smokers per generation.

The result, smokers_per_gen, should have 2 columns: generation and percentage_of_smokers. It should have the same number of rows as the number of generations.
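
A grouped summary of this kind can be sketched with group_by and summarize. The data in `toy` is invented, and the name `smokers_per_gen_toy` is used only to keep it apart from the tibble you are asked to build. This sketch assumes that missing is_smoker values should be excluded and that the result is stored as a proportion between 0 and 1; multiplying by 100 is an equally reasonable choice.

```r
library(dplyr)
library(tibble)

# Hypothetical data: two generations, one missing answer.
toy <- tibble(
  generation = c("Gen Z", "Gen Z", "Gen X", "Gen X", "Gen X"),
  is_smoker  = c(TRUE, FALSE, TRUE, TRUE, NA)
)

# The mean of a logical vector is the share of TRUE values;
# na.rm = TRUE drops users with no recorded answer.
smokers_per_gen_toy <- toy %>%
  group_by(generation) %>%
  summarize(percentage_of_smokers = mean(is_smoker, na.rm = TRUE))

smokers_per_gen_toy
```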

Items to submit

R code used to solve the problem.
Output from running your code.

Question 5

Create a Cleveland dot plot using ggplot to show the percentage of smokers for each different generation. Use ggthemr to give your plot a new look! You can choose any theme you’d like!

Is our theory from question (3) correct? Explain why you think so, or not.

(OPTIONAL I, 0 points) To make the plot have a more aesthetic look, consider reordering the data by percentage of smokers, or even by the age of generation. You can do that before passing the data using the arrange function, or inside the geom_point function, using the reorder function. To re-order by generation, you can either use brute force, or you can create a new column called avg_age while using summarize. avg_age should be the average age for each group (using the variable d_age). You can use this new column, avg_age to re-order the data.
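
To illustrate the reorder approach, here is a minimal sketch on a made-up summary table. The numbers in `toy` are invented and say nothing about the real data. Swapping `percentage_of_smokers` for `avg_age` inside reorder would order the rows by generation age instead.

```r
library(ggplot2)

# Hypothetical summary table with invented values.
toy <- data.frame(
  generation            = c("Gen Z", "Millennial", "Gen X"),
  percentage_of_smokers = c(0.30, 0.10, 0.20),
  avg_age               = c(22, 32, 47)
)

# reorder() turns generation into a factor whose levels are sorted
# by the second argument, which controls the order on the y-axis.
p_ordered <- ggplot(toy) +
  geom_point(aes(x = percentage_of_smokers,
                 y = reorder(generation, percentage_of_smokers))) +
  labs(x = "Share of smokers", y = "Generation")
p_ordered
```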

(OPTIONAL II, 0 points) Improve our plot by changing the x-axis to be displayed as a percentage. You can use the scales package and the function scale_x_continuous to accomplish this.
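
A minimal sketch of percentage labels follows, again on invented data. It assumes the values are stored as proportions between 0 and 1; if your column is already on a 0 to 100 scale, the default `scales::percent` labels would be off by a factor of 100.

```r
library(ggplot2)

# Hypothetical summary table with proportions between 0 and 1.
toy <- data.frame(
  generation            = c("Gen Z", "Gen X"),
  percentage_of_smokers = c(0.30, 0.20)
)

# labels = scales::percent formats the axis ticks as percentages.
p_pct <- ggplot(toy) +
  geom_point(aes(x = percentage_of_smokers, y = generation)) +
  scale_x_continuous(labels = scales::percent)
p_pct

# The same formatter applied to a single value.
scales::percent(0.2)
```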

Use geom_point, not geom_dotplot, to solve this problem.

Items to submit

R code used to solve the problem.
Output from running your code.
The plot produced.
1-2 sentences commenting on the theory and stating your conclusions based on your plot (if any).