Calculating New/Returning Students with Factors and dplyr

I have yet to learn the tidy evaluation paradigm. Typically, violating DRY principles with a combination of hacks and brute force gets the job done. But a recent challenge at work took me perilously to the edge.

The task was this: calculate the ratio of returning and new students by academic quarter. Our working definition of a returning student is having taken any course in a previous quarter.

Let’s build some sample data:

library(tidyverse)

set.seed(2018)

ids <- 1:5000
start_dates <- seq(as.Date("2010/01/01"), as.Date("2018/05/18"), by = "day")
departments <- c("Math", "Science", "English", "History", "Modern Languages")

generate_data <- function(reps, samp, size) {
  c(replicate(size, sample(samp, size = size, replace = TRUE)))
}

ids <- generate_data(100, ids, 100)
departments <- generate_data(100, departments, 100)
start_dates <- start_dates %>% 
  sample(500, TRUE) %>% 
  sample(10000, TRUE)

df <- tibble(
  student_id = ids,
  department = departments,
  class_start_date = start_dates
) %>% 
  mutate(quarter = lubridate::quarter(class_start_date),
         year = lubridate::year(class_start_date)) %>% 
  mutate(
    quarter = case_when(
      quarter == 1 ~ "Winter",
      quarter == 2 ~ "Spring",
      quarter == 3 ~ "Summer",
      quarter == 4 ~ "Fall"
      ),
    school = case_when(
      department %in% c("Math", "Science") ~ "STEM",
      TRUE ~ "Liberal Arts"
      )
    ) %>% 
  mutate(quarter = paste(quarter, year)) %>%
  select(student_id, school, department, class_start_date, quarter, year)

A glace at our generated data:

# A tibble: 10,000 x 6
   student_id school       department       class_start_date quarter      year
        <int> <chr>        <chr>            <date>           <fct>       <dbl>
     1681 Liberal Arts History          2010-03-14       Winter 2010 2010.
     2319 STEM         Math             2013-03-29       Winter 2013 2013.
      303 Liberal Arts English          2014-08-04       Summer 2014 2014.
      988 STEM         Science          2011-07-14       Summer 2011 2011.
     2372 Liberal Arts English          2013-11-26       Fall 2013   2013.
     1506 Liberal Arts English          2012-01-11       Winter 2012 2012.
     3034 STEM         Math             2017-10-24       Fall 2017   2017.
      651 Liberal Arts English          2013-11-03       Fall 2013   2013.
     4794 Liberal Arts Modern Languages 2015-02-08       Winter 2015 2015.
     2735 STEM         Science          2012-03-24       Winter 2012 2012.
# ... with 9,990 more rows

To calculate the ratios, I had to encode the quarter variable as a factor. And to get the chronological levels, I arranged the data frame by class start date, pulled out the unique quarters, and factored the original quarter variable with the appropriate levels.

quarters <- df %>% 
  arrange(class_start_date) %>% 
  pull(quarter) %>% 
  unique() 
quarters <- factor(quarters, levels = quarters)

df <- df %>% mutate(quarter = factor(quarter, levels = quarters))

A useful attribute of factors is that they can be converted to integers. For example, note how each factor level corresponds to an ascending integer:

> quarters
 [1] Winter 2010 Spring 2010 Summer 2010 Fall 2010   Winter 2011 Spring 2011 Summer 2011 Fall 2011  
 [9] Winter 2012 Spring 2012 Summer 2012 Fall 2012   Winter 2013 Spring 2013 Summer 2013 Fall 2013  
[17] Winter 2014 Spring 2014 Summer 2014 Fall 2014   Winter 2015 Spring 2015 Summer 2015 Fall 2015  
[25] Winter 2016 Spring 2016 Summer 2016 Fall 2016   Winter 2017 Spring 2017 Summer 2017 Fall 2017  
[33] Winter 2018 Spring 2018
34 Levels: Winter 2010 Spring 2010 Summer 2010 Fall 2010 Winter 2011 Spring 2011 ... Spring 2018

> as.integer(quarters)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
[34] 34

I then tried to imagine the data frame as a timeline with the temporal markers as quarters. For example, I could identify all the new students for a given quarter by (1) subsetting the data with all quarters “less than” (earlier than) that specific quarter, (2) identifying the unique quarters by student, and (3) filtering for the quarter, and then checking if the unique quarters match the given quarter.

The entirety of my function is below, but I’ll break the pipeline down here.

Step 1: Filtering all Earlier Quarters

The quarter_td argument is first factored with the same levels as the data frame, then converted to an integer. I’m then able to filter the earlier quarters within the data frame.

calc_new_students <- function(quarter_td, ...) {
  
  quarter_td_int <- as.integer(factor(quarter_td, levels = quarters))
  
  df %>% 
    filter(as.integer(quarter) <= quarter_td_int)
    ...
}
  calc_new_students("Fall 2015", department) %>% 
    distinct(quarter) %>%
    arrange(desc(quarter))

# A tibble: 24 x 1
   quarter    
   <fct>      
Fall 2015  
Summer 2015
Spring 2015
Winter 2015
Fall 2014  
Summer 2014
Spring 2014
Winter 2014
Fall 2013  
Summer 2013
# ... with 14 more rows

Step 2: Indentify Unique Quarters by Student

Collapsing the unique quarters into a vector produces a snapshot of their registration history.

calc_new_students <- function(quarter_td, ...) {
    
    ...
    group_by(student_id) %>%
    mutate(unique_quarters = paste(unique(quarter), collapse = ", ")) %>%
    ungroup() 
    ...
}

calc_new_students("Fall 2015") %>% 
  select(student_id, unique_quarters)

# A tibble: 7,083 x 2
   student_id unique_quarters                                            
        <int> <chr>                                                      
     1681 Winter 2010, Winter 2012                                   
     2319 Winter 2013, Summer 2013, Winter 2012, Fall 2010           
      303 Summer 2014, Summer 2010                                   
      988 Summer 2011, Spring 2012, Fall 2011, Spring 2013           
     2372 Fall 2013, Winter 2012, Winter 2015                        
     1506 Winter 2012, Winter 2010, Fall 2012                        
      651 Fall 2013, Winter 2010, Summer 2012, Fall 2012, Summer 2015
     4794 Winter 2015                                                
     2735 Winter 2012, Spring 2014, Winter 2011, Fall 2013           
     4911 Spring 2012, Fall 2013                                     
# ... with 7,073 more rows

Step 3: Indentify New and Returning Students

If a student’s unique_quarters is length one and matches the quarter argument, they are a new student. If not, they are a returning student.

calc_new_students <- function(quarter_td, ...) {
  
    ...
    filter(as.integer(quarter) == quarter_td_int) %>% 
    distinct(student_id, .keep_all = TRUE) %>%
    mutate(student = if_else(unique_quarters == quarter_td, "New", "Returning"))
    ...
}

calc_new_students("Fall 2015") %>% 
  select(student_id, student)

# A tibble: 350 x 2
   student_id student  
        <int> <chr>    
     1301 New      
     3022 New      
     2049 Returning
     3229 Returning
     1610 Returning
     2671 Returning
     3857 New      
      431 New      
     2383 Returning
     1311 Returning
# ... with 340 more rows

Step 4: Group the New and Returning Students by Other Columns

I’m able to generalize the function to group by other column names by passing them through ....

calc_new_students <- function(quarter_td, ...) {
    
    ...
    group_by(student, ...) %>% 
    summarize(students = n()) %>% 
    arrange(...) %>% 
    mutate(quarter = quarter_td)
}

calc_new_students("Fall 2015", department) 
calc_new_students("Fall 2015", school)

# A tibble: 10 x 4
# Groups:   student [2]
   student   department       students quarter  
   <chr>     <chr>               <int> <chr>    
New       English                15 Fall 2015
Returning English                62 Fall 2015
New       History                15 Fall 2015
Returning History                54 Fall 2015
New       Math                   14 Fall 2015
Returning Math                   65 Fall 2015
New       Modern Languages       17 Fall 2015
Returning Modern Languages       45 Fall 2015
New       Science                23 Fall 2015
Returning Science                40 Fall 2015

# A tibble: 4 x 4
# Groups:   student [2]
  student   school       students quarter  
  <chr>     <chr>           <int> <chr>    
1 New       Liberal Arts       47 Fall 2015
2 Returning Liberal Arts      161 Fall 2015
3 New       STEM               37 Fall 2015
4 Returning STEM              105 Fall 2015

Here’s a sequence of the transformations, courtesy of ViewPipeSteps:

Alt Text

Step 5: Iterate through all Quarters

Now we can loop through all the quarters and bind the new and returning students into a data frame:

all_quarters <- quarters %>% 
  map_dfr(calc_new_students)

And a final visualization!

ggplot(all_quarters, aes(quarter, students, fill = student)) +
  geom_bar(stat = "identity", position = "fill") +
  geom_hline(yintercept = 0.5, linetype = "dashed") +
  labs(x = "Quarter",
       y = "",
       title = "Ratio of New and Returning Students",
       subtitle = "2010-2018") +
  hrbrthemes::theme_ipsum() +
  theme(axis.text.x = element_text(angle = 45))

ratio plot Obviously in this closed, generated data, the proportion of returning students grows with each successive quarter. But with some additional tinkering, some interesting, seasonal, patterns may emerge.