David Ranzolin Research Analyst "He had an idea that even when beaten he could steal a little victory by laughing at defeat." -East of Eden

Calculating New/Returning Students with Factors and dplyr

I have yet to learn the tidy evaluation paradigm. Typically, violating DRY principles with a combination of hacks and brute force gets the job done. But a recent challenge at work took me perilously to the edge.

The task was this: calculate the ratio of returning and new students by academic quarter. Our working definition of a returning student is having taken any course in a previous quarter.

Let’s build some sample data:
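What follows is a minimal sketch, not the code from the original post. The enrollments data frame, its column names, and the three quarters are invented for illustration. The idea is to store the quarter as an ordered factor, treat each student's earliest quarter as their "new" quarter, and count everything after that as "returning".

```r
library(dplyr)

# Hypothetical quarter ordering; the factor levels carry the chronology.
quarter_levels <- c("Fall 2017", "Winter 2018", "Spring 2018")

# Hypothetical enrollment records: one row per student per quarter attended.
enrollments <- data.frame(
  student_id = c(1, 2, 3, 1, 2, 4, 1, 4, 5, 6),
  quarter = factor(
    c("Fall 2017", "Fall 2017", "Fall 2017",
      "Winter 2018", "Winter 2018", "Winter 2018",
      "Spring 2018", "Spring 2018", "Spring 2018", "Spring 2018"),
    levels = quarter_levels, ordered = TRUE
  )
)

status_by_quarter <- enrollments %>%
  distinct(student_id, quarter) %>%
  group_by(student_id) %>%
  # A student is "new" in the earliest quarter they appear in.
  mutate(
    quarter_rank = as.integer(quarter),
    status = if_else(quarter_rank == min(quarter_rank), "new", "returning")
  ) %>%
  ungroup() %>%
  count(quarter, status) %>%
  group_by(quarter) %>%
  # Share of new vs. returning students within each quarter.
  mutate(share = n / sum(n)) %>%
  ungroup()

print(status_by_quarter)
```

Because the quarter is an ordered factor, "previous quarter" reduces to a comparison of integer codes, which avoids parsing the quarter labels.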

Reproducing a Mike Bostock d3.js Specialty with ggplot2

Mike Bostock is the Pontifex Maximus of data visualization. As a d3.js novice, I spend hours each week poring over his creations over at Observable HQ, a brilliant new medium to share compelling data visualizations and quantitative analysis. And because, as they say, imitation is the highest form of flattery, I’ve decided to reproduce some of his recent work in R (ggplot).

My interest in reproduction, however, goes beyond mere admiration. While I’ve enjoyed getting to know D3, I confess it’s hard to struggle through the unfamiliar mechanics, knowing I could have produced a ggplot facsimile in a fraction of the time. I tell myself it builds character. But at a programmatic level, I hope to catch a glimpse of what makes these visualization paradigms unique and how each tool might be leveraged in the future. An ambitious goal to be sure.

New rcanvas Tricks to Manage and Analyze your Canvas LMS Course

Canvas is, in my humble opinion, the best LMS around, and my second package rcanvas is for all the students, TAs, instructors, and analysts lucky enough to reside there. A growing number of contributors has made it easier than ever to automate various workflows, and I’m excited to show off some of that new functionality here.

Career Tracks at UC Part I: Cleaning and Tidying

In 2017 the University of California launched Career Tracks, a job classification system for staff not represented by a union. The goal was threefold: (1) to give employees better-defined career paths; (2) to better align university compensation with the market; and (3) to better reflect primary job responsibilities for each employee. Additionally, Career Tracks is meant to promote greater transparency in hiring and promotions. Employees can now chart their UC careers via a hierarchy of job families and functions, each with specified education levels, scopes, and responsibilities. And while the initial feedback has been positive, the overall success of the program remains to be seen.

Introducing ViewPipeSteps: Towards Observable Programming in R

Sophie Alpert makes several good points in a recent post on ‘Observable programming’:

Excel is unusually good at allowing you to build complex programs while allowing you to see the values of every intermediate computation …In contrast, traditional programming environments make the code you write much clearer…But they’re usually terrible at allowing you to observe the behavior of your program when subjected to concrete values.

All true. She continues:

The Data Analyst as Wanderer: Pre-Exploratory Data Analysis with R

This post is about pre-exploratory data analysis. Namely, answering questions about the data at two junctures: before you know anything about the data and when you know only very little about the data. There are roughly three overlapping questions to ask:

  1. What is this?
  2. What’s in this?
  3. What can I do with this?

Simple Probability Trees in R

This may surprise you, but there isn’t an easy, “canonical” method to construct simple probability trees in R. Google uncovers some hacky attempts from years past, but it obviously hasn’t been a pressing issue or priority in the community. The reason for this, I think, is threefold: (1) probability trees are boring, STATS101 material; (2) until recently, there haven’t been great tools to render nodes and trees; and (3) designating a sensible, comprehensive input is somewhat tricky. What do you name the parameters? Should the function(s) take a table or data frame? If so, in what shape?

To expand on the first point, R and the R community are great at introducing programming techniques to people who know statistics, but less great at introducing statistics to people who know how to program. The purpose of this post is to demonstrate how a simple statistical procedure–Bayes’ Theorem–might be calculated and displayed in a simple visual format in R.
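As a rough illustration of the calculation behind such a tree, here is a base R sketch. The prevalence, sensitivity, and specificity values are hypothetical, and so are the variable names. Each leaf of the tree is the product of the probabilities along its branches, and Bayes' Theorem divides one leaf by the sum of the leaves that share the observed outcome.

```r
# Hypothetical inputs: a condition with 1% prevalence, and a test with
# 95% sensitivity and 90% specificity.
prior       <- 0.01
sensitivity <- 0.95
specificity <- 0.90

# The four leaves of the tree: multiply along each branch.
leaves <- data.frame(
  condition = c("present", "present", "absent", "absent"),
  test      = c("positive", "negative", "positive", "negative"),
  joint     = c(
    prior * sensitivity,
    prior * (1 - sensitivity),
    (1 - prior) * (1 - specificity),
    (1 - prior) * specificity
  )
)

# Bayes' Theorem: P(present | positive) is the "present and positive" leaf
# divided by the total probability of a positive test.
positive  <- leaves[leaves$test == "positive", ]
posterior <- positive$joint[positive$condition == "present"] / sum(positive$joint)

print(leaves)
print(round(posterior, 4))
```

With these made-up numbers the posterior is roughly 0.0876: even after a positive test, the condition remains unlikely because the prior is so low.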

R at the Golden State Sprint Triathlon

Earlier today I completed my first (sprint) triathlon. For me, it was 2 hours and 12 minutes of barbarism–a 1/2 mile swim, a 15 mile bike ride, and a three mile run to boot. I knew my time was poor; I struggled to wiggle out of my wet suit, was passed by over 100 people while cycling,1 heard the winner announced before I even started running, and completed the race amongst septuagenarians. But how poor? I needed data.

  1. But not by the guy on a BMX. 

I am a tidyverse Enthusiast

I am a tidyverse enthusiast. The proof is in the pudding: of my six packages on GitHub, only one DESCRIPTION contains a non-tidyverse package (rcicero, tidyjson). I once contemplated rewriting these packages sans the tidyverse–for science, learning, growth, bragging rights, and character building–but I broke into a cold sweat once I typed plot. Admittedly, my reliance on the tidyverse might be considered a crutch. Do I really know R, or just the conventions of a popular subset? A different question for a different time.

Introducing ggtextparallels

The purpose of this post is to introduce ggtextparallels, a tool for frugal biblical scholars who don’t want to pay for books and other proprietary software. Somewhere on the internet Hadley Wickham advises new R developers to find problems and then try and solve them. Within that framework, the problem is about $70 (plus shipping and handling) and the solution is devtools::install_github("daranzolin/ggtextparallels").

The Language Wars are Good

Hot take: the language “wars” are helpful. It’s surprising that the #rstats community, so attentive to the needs of beginners, misses this.

The Perseus Dictionary, Part II

In Part I of this series, I showed how to get the Victorian masterpiece, A Dictionary of Greek and Roman Biography and Mythology by William Smith, LLD, ed. In Part II I’m going to explore it, but not in an archaic, linear fashion. It’s 2017–we can do better.

I wanted to move beyond mere word counts, and a sentiment analysis makes little sense in this context. And so after perusing several entries, I was struck by the image of the network. A network is an abstraction, a visual representation of connections both explicit and implicit. Networks are interesting (and very, very slippery) because their cumulative effect is impossible to pin down; if I see a basketball, a volleyball, a baseball, and a volleyball together, I might think “talent”, “hard work”, “achievement”. But someone else could view the same cluster of objects and think “absurdity”, “waste of time”, and “barbarism”. Our experience shapes our perception of networks.

The Perseus Dictionary, Part I

One of my favorite episodes of Black Adder is “Ink and Incapability”. The plot is this: Samuel Johnson is soliciting patronage from Prince George for his new book, the first ever English dictionary. But due to Black Adder’s petty jealousy, Johnson’s dictionary is instead burned through a hilarious turn of events. The burning of the dictionary represented a catastrophic loss of scholarship, maybe akin to us losing all of Wikipedia, the Encyclopedia Britannica, and basketball-reference.com now. Black Adder must then rewrite the dictionary, proceeding linearly, alphabetically, and manually. For over 300 years, this was how we used dictionaries. But with the advent of computing, we can now scour a dictionary much more creatively.

R Taboos

In Freud’s tripartite model of the psyche, the superego represents the internalization of parental and societal values. Should I consider an immoral act, my superego will reflexively flood me with guilt. Thus are our behavior and neuroses explained.

So it is with writing code. I have an R superego. You probably have an R superego. A catalog of taboo functions and control flows haunts our scripts. And perhaps much like our western ethical norms, their validity is not always obvious. For example, I do not know why match.call is bad form or distasteful. I only know I was once scolded for using it on Reddit by one of our most authoritative authorities, and that I won’t make that mistake again.

How'd they do that? Part I: Console Histograms

This is the first post in a new series I’m calling How’d they do that? Motivated by professional curiosity and personal jealousy, I will stumble through the secrets hidden within my fellow #rstats enthusiasts’ repos. The exposition will not be exhaustive; curiosity is a fickle thing, burning one moment and extinguished the next. So it will be with these blogs.

First up: the console histograms featured in rOpenSci’s skimr. Observe:

> library(skimr)
> skim(mtcars) %>% dplyr::filter(stat == "hist")
# A tibble: 11 x 5
     var    type  stat      level value
   <chr>   <chr> <chr>      <chr> <dbl>
 1   mpg numeric  hist ▂▅▇▇▇▃▁▁▂▂     0
 2   cyl numeric  hist ▆▁▁▁▃▁▁▁▁▇     0
 3  disp numeric  hist ▇▇▅▁▁▇▃▂▁▃     0
 4    hp numeric  hist ▆▆▇▂▇▂▃▁▁▁     0
 5  drat numeric  hist ▃▇▂▂▃▆▅▁▁▁     0
 6    wt numeric  hist ▂▂▂▂▇▆▁▁▁▂     0
 7  qsec numeric  hist ▂▃▇▇▇▅▅▁▁▁     0
 8    vs numeric  hist ▇▁▁▁▁▁▁▁▁▆     0
 9    am numeric  hist ▇▁▁▁▁▁▁▁▁▆     0
10  gear numeric  hist ▇▁▁▁▆▁▁▁▁▂     0
11  carb numeric  hist ▆▇▂▁▇▁▁▁▁▁     0
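
For a feel of how output like this can be produced, here is a base R sketch. It is not skimr's implementation, and the function name console_hist and its n_bins argument are made up for this example. The approach is to bin the values into equal-width intervals, rescale the bin counts, and map each count to one of the eight Unicode block characters.

```r
# The eight Unicode block characters, from shortest to tallest.
bars <- c("\u2581", "\u2582", "\u2583", "\u2584",
          "\u2585", "\u2586", "\u2587", "\u2588")

# Hypothetical helper: one character per bin, with height proportional to count.
console_hist <- function(x, n_bins = 10) {
  x <- x[!is.na(x)]
  # Equal-width bins need at least two distinct values.
  stopifnot(length(unique(x)) > 1)
  breaks <- seq(min(x), max(x), length.out = n_bins + 1)
  counts <- as.vector(table(cut(x, breaks = breaks, include.lowest = TRUE)))
  # Rescale counts to indices 1..8; empty bins get the shortest bar.
  idx <- 1 + round(counts / max(counts) * (length(bars) - 1))
  paste(bars[idx], collapse = "")
}

# One line of bars per column of mtcars.
print(vapply(mtcars, console_hist, character(1)))
```

The binning and rounding choices here are arbitrary, so the bars will not necessarily match the skimr output above character for character.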

'And the winner is...' Text Mining Award Nominations

I recently had the privilege of reviewing internal award nominations. In a secret duel of persuasion, hundreds of employees seized the opportunity to outboast their colleagues in boasting about their colleagues. Letterhead was prepared, anecdotes collected, and thesaureses consulted.1

While the review process on the whole was a charming experience, the sheer volume of submissions made the work somewhat tedious. The word count ran into the tens of thousands. Each nomination required a human’s careful perusal, but it wasn’t long before I was imagining a more programmatic approach. That meant R, and that meant tidytext.

  1. I’m told the plural of “thesaurus” can also be “thesauri”. 

Data Validation with assertr: Dates and Regular Expressions

Being introduced to a new R package is like going on a really good first date–you wish you had met the person sooner, you imagine doing all sorts of fun things together in the future, and you can’t wait to blog about the experience. Okay, maybe just the first two.

Such was my introduction to assertr, a brilliant data validation package from rOpenSci. It was love at first sight. And much like a great date, assertr’s bouquet of functions whisper sweet assurances into your console, increasing your confidence, pushing you forward, and enhancing your desirability. Okay, maybe just the first two again. Here’s a little demonstration.

Do Cats have Different Personalities? A Cat Town Cafe Investigation

I like cats more than other animals. Alongside their evolutionary advantages–retractable claws, night vision, a tail, lightning reflexes, etc.–Mother Nature imbued these creatures with grace, majesty, and personalities unequaled within the animal kingdom. Don’t @ me.

That last phenomenon, personality, merits closer attention. Cats have personalities. My cat Mycroft, for example, is shy but loquacious, snuggly but fiercely independent. He has moods. He is unpredictable in ways a dog or hamster could never be. But am I just describing every cat? I confess I’ve only ever had one cat (and pet), so I’m willing to concede, for the sake of argument, that I’m merely projecting personality traits on everyday cat behavior. Is Mycroft just like every other cat? For an answer, I turned to two of my favorite things: Cat Town Cafe and R.

How well do you know Calvin and Hobbes? An R Game

One of my most prized possessions is The Complete Calvin and Hobbes, the undisputed greatest comic of all time. I defy anyone to name a strip more hilarious, more cohesive, more emblematic of the vicissitudes of life. I’ve read and reread those volumes countless times, and I rank both Calvin and Hobbes as two of the most influential fictional characters in my life. In fact, I once boasted that I knew the strip so well, I could predict the content of each strip’s final panel while only being shown the first two. Incredulous friends would test my ability, and it became a fun game that I’ve since dominated.

But what if I want to play and no one is around? Thankfully, my friend R is always there for me.

How to be an O'Reilly Author

Earlier today I fantasized about publishing a book through O’Reilly Media. Besides producing dozens of essential works on programming, data science, and UX, they’re the ones responsible for introducing IT folk to the African Civet, the Binturong, and my personal favorite, the Springhaas. Would zoology textbooks with terminal prompts and monitors be half as interesting? I doubt it.

The fantasy ended abruptly, as I remembered that I had a B.A. in English instead of a B.S. in Computer Science, and an M.T.S. instead of an MSci. I also assumed that the vast majority of O’Reilly authors hold terminal degrees in their respective fields. But…was that a safe assumption? I went to find out.

Sentiment Analysis of the Four Gospels, Part I

The purpose of this post is twofold: (1) to introduce rperseus, my latest R package; and (2) to venture a sentiment analysis of the four gospels.

Rivers of ink have been spilled over the unity and disunity of the four gospels. Their intertextualities have inspired almost two millennia of speculation both scholarly and pious. We may never know if Q existed, or discern John’s stages of composition, or figure out what the devil bdelugma tes eremoseos (NRSV: “desolating sacrilege”) means. But now, almost 2000 years removed from their original composition, we can do a sentiment analysis! And we owe it all to tidytext and the good people over at the Perseus Digital Library.

rcanvas + the tidyverse

rcanvas continues to grow. Thanks to the recent contributions from Chris Hua, getting user groups, announcements, and discussions from your institution’s Canvas instance has never been easier. More collaborators are welcome!

By my lights, R package development should be attuned to the tidyverse. Piping output into a sequence of clear, logical functions not only makes for clean, readable code, but is an undeniably damned good time.

Exploring the Perseus Digital Library with R

For those fortunate enough to study classical literature, the Perseus Digital Library (PDL) is an invaluable resource. In brief, PDL is a collection of Greek and Latin texts (among others), although it also functions as a lexicon similar to non-digital resources like BDAG. I often used the site to check a word’s alternative definitions or locate other occurrences. Wow, that feels like forever ago now.

'We the People' + ImageMagick

Yesterday I dipped my toes in the now raging currents of American activism. Alongside millions of men, women, and children, we rallied to the #WomensMarch, a full-throated, univocal rejection of Donald Trump and his ilk.

While the signage ranged from merrily goofy to violently hostile, three images stood apart. I speak of course of Shepard Fairey’s breathtaking “We the People” project: Defend Dignity, Are Greater Than Fear, and Protect Each Other.


Reshaping Data: R vs. SPSS at CAIR 2016

One of the highlights of CAIR 2016 was Data Mining to Identify Grading Practices, a splendid presentation by Kelly Wahl and Nida Rinthapol of UCLA. But besides the quality of the slides, the eloquence of the speakers, and the application of machine learning, I confess I was most intrigued by SPSS, the statistical software. What I felt was excitement–not so much excitement at learning a new tool, but the thrill of a challenge: what they’re doing in SPSS, I was determined to replicate in R. And if Twitter and #rstats are at all representative of the R community, I wager most useRs would have felt something similar. Petty? Probably. Insecure? Perhaps. But come hell or high water, anything SPSS can do, R can do better.

Joining a List of Data Frames with purrr::reduce()

I’ve been encountering lists of data frames both at work and at play. Most of the time, I need only bind them together with dplyr::bind_rows() or purrr::map_df(). But recently I’ve needed to join them by a shared key. This operation is more complex. In fact, I admitted defeat earlier this year when I allowed rcicero::get_official() to return a list of data frames rather than a single, tidy table. Forgivable at the time, but now I know better.
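A minimal sketch of the pattern, using made-up tables rather than anything rcicero returns: the list name tables, the key column id, and the other columns are all hypothetical. purrr::reduce() folds the list from left to right, joining the running result to the next data frame each time.

```r
library(dplyr)
library(purrr)

# Hypothetical list of data frames that share an "id" key.
tables <- list(
  data.frame(id = 1:3, name = c("Ann", "Bo", "Cy")),
  data.frame(id = 1:3, office = c("Senate", "House", "Mayor")),
  data.frame(id = c(1L, 3L), party = c("D", "R"))
)

# Fold the list into one table: ((tables[[1]] join tables[[2]]) join tables[[3]]).
joined <- reduce(tables, function(x, y) left_join(x, y, by = "id"))

print(joined)
```

A left join keeps every row of the first table and fills unmatched keys with NA (here, the party for id 2). Swapping in full_join() or inner_join() changes which rows survive.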

CAIR 2016 Recap

Last week I had the good fortune of attending my first CAIR–no, not the Council on American-Islamic Relations, but the California Association for Institutional Research. Over 300 administrators, analysts, and data nerds descended upon the Millennium Biltmore Hotel in Los Angeles to network, share tips and tricks, and even watch a live shoot of ABC’s The Catch.

Elected Official Lookup with R, rcicero

Chances are that you now hate politics. “E-mails” makes you perspire, and every Halloween pumpkin bore an eerie resemblance to the Republican nominee. So why not channel that hate into something productive, like building interactive web applications and/or learning more about your elected officials? In other words, why not make cool stuff with R?

The Cicero folk recently linked to their elected official locator, but it only goes live after election day. In the meantime, we can build our own locator with rcicero and shiny.

Building a Library Catalog with R, Part I

I have a habit of boasting about R’s innumerable merits. One day, I was celebrating the arrival of a particular R package when my wife stopped and asked me why–if R was so great–it couldn’t build an online catalog for her library. I accepted the challenge without hesitation, being infinitely confident in R, but only mildly sure of myself. R’s besmirched reputation must be restored.

Some background: my wife is the librarian at a local high school. Her patrons, however, do not have the luxury of an electronic catalog; if they want to find a book, they must pester her. This is annoying for all parties.

Calculate Ethnic Diversity Index (EDI) with R

How diverse is your student body? A tour through some classrooms may give you some idea. But how diverse is your student body in relation to the school across town? To answer that question, you need a more precise measure. Enter Ethnic Diversity Index (EDI), a reflection of how evenly distributed your students are among the race/ethnicity categories reported to the California Department of Education.

US Senator Tweets, Part III

At long last we come to the end of this series. I confess that I lost interest when–spoiler alert–there was nothing exciting to report here. But for the sake of symmetry, here in Part III we’ll delve into the tweets themselves with the awesome tidytext package.

Full disclosure: what follows is basically David Robinson’s brilliant analysis of Trump’s twitter feed with a different twist: instead of examining tweets by source, we’re examining them by party.

US Senator Tweets, Part II

Time to continue where I left off. With each US senator’s twitter handle handled, it’s time to get their tweets following a similar create/use-function-and-iterate pattern.

US Senator Tweets, Part I

I spent the majority of the most recent presidential debate thinking of a fun #rstats project. I had three objectives for the post: (1) that it be fun to do; (2) that it showcase some functionality of rcicero or rcanvas; and (3) that it be interesting to the public. The third objective was admittedly negotiable.

Prepare Progress Reports to Email with R, Canvas, and rcanvas

While neglecting my Python coursework this fall, I wrote an rstats package called rcanvas. rcanvas is an R client for the Canvas LMS API. It makes getting course data from your institution’s Canvas LMS easy, and I’ve utilized it in a variety of ways at work. And despite the Canvas Developer community’s tepid response, I am optimistic about its future. Perhaps a quick demonstration will help.