I have yet to learn the tidy evaluation paradigm. Typically, violating DRY principles with a combination of hacks and brute force gets the job done. But a recent challenge at work took me perilously to the edge.
The task was this: calculate the ratio of returning and new students by academic quarter. Our working definition of a returning student is having taken any course in a previous quarter.
Let’s build some sample data:
A glace at our generated data:
# A tibble: 10,000 x 6
student_id school department class_start_date quarter year
<int> <chr> <chr> <date> <fct> <dbl>
1 1681 Liberal Arts History 2010-03-14 Winter 2010 2010.
2 2319 STEM Math 2013-03-29 Winter 2013 2013.
3 303 Liberal Arts English 2014-08-04 Summer 2014 2014.
4 988 STEM Science 2011-07-14 Summer 2011 2011.
5 2372 Liberal Arts English 2013-11-26 Fall 2013 2013.
6 1506 Liberal Arts English 2012-01-11 Winter 2012 2012.
7 3034 STEM Math 2017-10-24 Fall 2017 2017.
8 651 Liberal Arts English 2013-11-03 Fall 2013 2013.
9 4794 Liberal Arts Modern Languages 2015-02-08 Winter 2015 2015.
10 2735 STEM Science 2012-03-24 Winter 2012 2012.
# ... with 9,990 more rows
To calculate the ratios, I had to encode the quarter variable as a factor. And to get the chronological levels, I arranged the data frame by class start date, pulled out the unique quarters, and factored the original quarter variable with the appropriate levels.
A useful attribute of factors is that they can be converted to integers. For example, note how each factor level corresponds to an ascending integer:
> quarters
[1] Winter 2010 Spring 2010 Summer 2010 Fall 2010 Winter 2011 Spring 2011 Summer 2011 Fall 2011
[9] Winter 2012 Spring 2012 Summer 2012 Fall 2012 Winter 2013 Spring 2013 Summer 2013 Fall 2013
[17] Winter 2014 Spring 2014 Summer 2014 Fall 2014 Winter 2015 Spring 2015 Summer 2015 Fall 2015
[25] Winter 2016 Spring 2016 Summer 2016 Fall 2016 Winter 2017 Spring 2017 Summer 2017 Fall 2017
[33] Winter 2018 Spring 2018
34 Levels: Winter 2010 Spring 2010 Summer 2010 Fall 2010 Winter 2011 Spring 2011 ... Spring 2018
> as.integer(quarters)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
[34] 34
I then tried to imagine the data frame as a timeline with the temporal markers as quarters. For example, I could identify all the new students for a given quarter by (1) subsetting the data with all quarters “less than” (earlier than) that specific quarter, (2) identifying the unique quarters by student, and (3) filtering for the quarter, and then checking if the unique quarters match the given quarter.
The entirety of my function is below, but I’ll break the pipeline down here.
Step 1: Filtering all Earlier Quarters
The quarter_td
argument is first factored with the same levels as the data frame, then converted to an integer. I’m then able to filter the earlier quarters within the data frame.
# A tibble: 24 x 1
quarter
<fct>
1 Fall 2015
2 Summer 2015
3 Spring 2015
4 Winter 2015
5 Fall 2014
6 Summer 2014
7 Spring 2014
8 Winter 2014
9 Fall 2013
10 Summer 2013
# ... with 14 more rows
Step 2: Indentify Unique Quarters by Student
Collapsing the unique quarters into a vector produces a snapshot of their registration history.
# A tibble: 7,083 x 2
student_id unique_quarters
<int> <chr>
1 1681 Winter 2010, Winter 2012
2 2319 Winter 2013, Summer 2013, Winter 2012, Fall 2010
3 303 Summer 2014, Summer 2010
4 988 Summer 2011, Spring 2012, Fall 2011, Spring 2013
5 2372 Fall 2013, Winter 2012, Winter 2015
6 1506 Winter 2012, Winter 2010, Fall 2012
7 651 Fall 2013, Winter 2010, Summer 2012, Fall 2012, Summer 2015
8 4794 Winter 2015
9 2735 Winter 2012, Spring 2014, Winter 2011, Fall 2013
10 4911 Spring 2012, Fall 2013
# ... with 7,073 more rows
Step 3: Indentify New and Returning Students
If a student’s unique_quarters
is length one and matches the quarter argument, they are a new student. If not, they are a returning student.
# A tibble: 350 x 2
student_id student
<int> <chr>
1 1301 New
2 3022 New
3 2049 Returning
4 3229 Returning
5 1610 Returning
6 2671 Returning
7 3857 New
8 431 New
9 2383 Returning
10 1311 Returning
# ... with 340 more rows
Step 4: Group the New and Returning Students by Other Columns
I’m able to generalize the function to group by other column names by passing them through ...
.
# A tibble: 10 x 4
# Groups: student [2]
student department students quarter
<chr> <chr> <int> <chr>
1 New English 15 Fall 2015
2 Returning English 62 Fall 2015
3 New History 15 Fall 2015
4 Returning History 54 Fall 2015
5 New Math 14 Fall 2015
6 Returning Math 65 Fall 2015
7 New Modern Languages 17 Fall 2015
8 Returning Modern Languages 45 Fall 2015
9 New Science 23 Fall 2015
10 Returning Science 40 Fall 2015
# A tibble: 4 x 4
# Groups: student [2]
student school students quarter
<chr> <chr> <int> <chr>
1 New Liberal Arts 47 Fall 2015
2 Returning Liberal Arts 161 Fall 2015
3 New STEM 37 Fall 2015
4 Returning STEM 105 Fall 2015
Here’s a sequence of the transformations, courtesy of ViewPipeSteps:
Step 5: Iterate through all Quarters
Now we can loop through all the quarters and bind the new and returning students into a data frame:
And a final visualization!
Obviously in this closed, generated data, the proportion of returning students grows with each successive quarter. But with some additional tinkering, some interesting, seasonal, patterns may emerge.