In 2017 the University of California launched Career Tracks, a job classification system for staff not represented by a union.
The goal was threefold: (1) to give employees better-defined career paths; (2) to better align university compensation with the market; and (3) to better reflect
primary job responsibilities for each employee. Additionally, Career Tracks is meant to promote greater transparency
in hiring and promotions. Employees can now chart their UC careers via a hierarchy of job families and functions,
each with specified education levels, scopes, and responsibilities. And while the initial feedback has been
positive, the overall success of the program remains to be seen.
Career Tracks is meant to create a coherent bureaucracy. I don’t mean that pejoratively. Rather, Career Tracks expresses
an institutional division of labor: these individuals perform these tasks, those individuals perform other tasks, and
so on and so forth. It is the important work of a massive, complex economy of people, benefiting individuals at every level. It is human resources writ large.
Career Tracks is instantiated by hundreds of job templates and tables. These tables communicate almost
everything you need to know about every UC position. They are concise, precise, and well-written. Unfortunately, the information is squirreled away
in hideous tables within .pdf and .docx files. There is a subtle irony here: while Career Tracks makes grand
claims to transparency, the vehicles of its standards are barely human- or machine-readable. Here is what the
Digital Communications job template looks like:
You may have guessed where this is going. The language of Career Tracks almost cries out for text mining, and
the general hideousness of the templates demands cleaning and tidying, tasks for which both R and my blog
are particularly well-suited. In Part I I’ll work through the cleaning process, and in Part II I’ll
venture a sentiment analysis of UC bureaucracy.
My code for the cleaning process is below. I leaned heavily on the tabulizer and docxtractr packages,
but the former did a poor job parsing the tables within each file. As you can see in the vis_miss plot below,
I was able to parse only about 20% of the templates, a disappointing number. Fortunately, I was able to obtain the salary grades
and other information from other sources.
I begin by loading the required libraries, unzipping the files, and defining some preliminary functions that would prove useful later:
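A minimal sketch of that setup might look like the following. The archive name, directory name, and the `clean_header()` helper are my own illustrative assumptions, not the post’s actual code:

```r
library(tidyverse)   # dplyr, tidyr, purrr, stringr
library(tabulizer)   # PDF table extraction (requires Java)
library(docxtractr)  # .docx table extraction
library(janitor)     # clean_names() and friends
library(naniar)      # missing-data helpers
library(visdat)      # vis_miss()

# Unzip the downloaded archive of job templates into a working directory
# ("templates.zip" and "templates/" are assumed names)
unzip("templates.zip", exdir = "templates")

# A small helper to standardize the messy header text found in the tables
clean_header <- function(x) {
  x %>%
    str_to_lower() %>%
    str_replace_all("[^a-z0-9]+", "_") %>%
    str_remove_all("^_|_$")
}
```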
Then I define two additional functions, one to parse PDFs and another to parse .docx files. Why the templates are stored
under two different file extensions is beyond me.
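The pair of parsers might be sketched like this, using `tabulizer::extract_tables()` and `docxtractr::docx_extract_all_tbls()`. The function names and the `source` column are assumptions for illustration:

```r
# Sketch of a PDF parser: extract_tables() returns a list of matrices
parse_pdf <- function(path) {
  tbls <- tabulizer::extract_tables(path)
  if (length(tbls) == 0) return(NULL)    # tabulizer often chokes on these files
  tbls %>%
    purrr::map(tibble::as_tibble, .name_repair = "minimal") %>%
    dplyr::bind_rows() %>%
    dplyr::mutate(source = basename(path))
}

# Sketch of a .docx parser: read the document, then pull every table
parse_docx <- function(path) {
  doc <- docxtractr::read_docx(path)
  docxtractr::docx_extract_all_tbls(doc, guess_header = FALSE) %>%
    dplyr::bind_rows() %>%
    dplyr::mutate(source = basename(path))
}
```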
Some highlights: the naniar and janitor packages, as well as fill() from the tidyr package. Transposing the entire
table was also a nifty trick.
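The transpose-and-fill pattern might look roughly like this. The templates store field names down the first column, so transposing turns each little table into a one-row record; the function name here is my own:

```r
# Sketch: transpose a raw template table, promote the field names to
# column headers, and fill values down over blank cells
tidy_template <- function(tbl) {
  tbl %>%
    t() %>%                                    # fields become columns
    tibble::as_tibble(.name_repair = "minimal") %>%
    janitor::row_to_names(row_number = 1) %>%  # first row holds the field names
    janitor::clean_names() %>%
    tidyr::fill(everything(), .direction = "down")  # carry values over gaps
}
```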
For extra style points, I created a progress bar while mapping over the entire directory of files:
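One common pattern for this is the progress package’s `progress_bar`, ticking once per file inside the mapped function. This is a sketch under assumed names; `parse_pdf()` and `parse_docx()` stand in for the two parser functions just described:

```r
library(progress)
library(purrr)

files <- list.files("templates", full.names = TRUE)  # assumed directory

pb <- progress_bar$new(
  format = "parsing [:bar] :percent eta: :eta",
  total  = length(files)
)

# Tick the bar once per file, dispatching on the file extension
parsed <- map(files, function(f) {
  pb$tick()
  if (grepl("\\.pdf$", f)) parse_pdf(f) else parse_docx(f)
})
```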
As there were over 300 templates, this operation took several minutes, but it was cool to see the countdown. I was able to speed
things up a little by postponing some additional formatting until I had one large table, rather than reshaping hundreds of little tables
and then joining them together. A visual walkthrough of the cleaning and tidying process is in the gif below, courtesy of ViewPipeSteps:
Finally, I downloaded some additional files and joined them all together:
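The final assembly might be sketched as follows, assuming the parsed tables live in a list called `parsed` and that the salary-grade file and its `job_code` key are named as below (both are my assumptions):

```r
# Stack the parsed templates into one large table
templates <- dplyr::bind_rows(parsed)

# Salary grades came from a separate source; names here are illustrative
salary_grades <- readr::read_csv("salary_grades.csv") %>%
  janitor::clean_names()

# Join everything on the shared job-code key
career_tracks <- templates %>%
  dplyr::left_join(salary_grades, by = "job_code")
```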
Here’s a View() of the data now:
Better. But I admit the final vis_miss was disappointing:
Whether this is a healthy sample remains to be seen in Part II. In the meantime, I may ponder how to get more of the missing data…