Have you ever started a project unsure how to accomplish several of the required tasks? Perhaps you were optimistic you would figure it out eventually, or (more likely) that you would find someone on the internet who had already figured it out for you. Such was my hope when I set out to design a program scheduler for UCSC Silicon Valley Extension. While the end result was only a mild success, far be it from me to let all that code remain unattended on my computer. This is catharsis by blogging.
My original goal was to design an algorithm that could generate a hypothetical schedule under given constraints such as cost, weekday availability, and preference for online classes. I also wanted to calculate time-to-completion and perhaps cluster related courses together. The task ultimately proved beyond my skill, time, energy, and interest, but I maintain that my approximation contains some interesting code. Besides serving as a general tour of web scraping and the tidyverse, the project pushed me into some new coding conventions. To name a few: I wrote my first ever while loop; I unpacked a vector with the zeallot package; I wrote a print method; I dug into CSS selectors; and I simplified function calls with partial(). All in all, a character-building and fruitful experience.
Getting the Program Data
I needed data about Extension’s courses–the more the merrier. As an employee, I could have retrieved the data through other means, but I enjoy web scraping and was curious if one could rely solely on our public website.
I first navigated to the schedule of Internet Programming and Development, the program in which I am currently enrolled. Before attempting to scrape any website, you should make sure it’s permitted: many sites have Terms and Conditions that forbid web crawlers and scrapers. A quick way to check is with the
The offering data on the page resides in several tables but presents an interesting conundrum: some of the data is represented by images. Fortunately, these icons have “alt” HTML attributes that indicate what they represent. The icon highlighted below, for example, has an alt attribute of “Classroom and Online”:
After some tinkering, I found that the .cols-7 .views-field CSS selector scrapes 252 nodes from this page, one for every table cell. Because the structure of each table is identical, we can reformat the nodes into a single table without much effort.
First, we load our packages, read the HTML, and define some helper functions to parse data:
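A minimal sketch of that setup, assuming the rvest package and a tiny inline HTML fragment in place of the live schedule page. The body of parse_xml_node() shown here is one possible implementation, and the fragment and its image file name are made up for illustration:

```r
library(rvest)

# Hypothetical stand-in for the schedule page: one table row whose cells mix
# plain text, an icon image with an alt attribute, and an empty cell.
page <- read_html('
  <table class="cols-7">
    <tr>
      <td class="views-field">LINX.X401</td>
      <td class="views-field">3.0</td>
      <td class="views-field"><img src="both.png" alt="Classroom and Online"></td>
      <td class="views-field"></td>
    </tr>
  </table>')

# Return the alt text of the first image in the node if there is one,
# otherwise the node's trimmed text.
parse_xml_node <- function(node) {
  imgs <- html_elements(node, "img")
  if (length(imgs) > 0) {
    html_attr(imgs[[1]], "alt")
  } else {
    html_text(node, trim = TRUE)
  }
}

# Select every table cell, then parse the cells one at a time.
nodes <- html_elements(page, ".cols-7 .views-field")
cells <- vapply(
  seq_along(nodes),
  function(i) parse_xml_node(nodes[[i]]),
  character(1)
)
```

Empty cells come back as empty strings, which is why blanks show up in the parsed vector.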
parse_xml_node() checks whether the node contains an image; if it does, it extracts the image’s alt text, and otherwise it extracts the node’s text. Then we can loop through each node accordingly:
"Offering Code" "Offering" "Units" "Fall" "Winter" "Spring" "Summer"
"O-CE0359" "Internet Programming & Development Certificate Completion Review" "" "" "" "" ""
"Offering Code" "Offering" "Units" "Fall" "Winter" "Spring" "Summer"
"LINX.X401" "LAMP: Linux Based Web Application Development – Apache, MySQL, PHP" "3.0" "Online" "Classroom and Online" "Online" "Classroom and Online"
You may notice that there are four rows and seven columns worth of data in these 28 elements. The columns are every seventh element, so we need to define a function that plucks them out while shifting the starting point of the sequence:
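One way such a function could look, in base R. This is a sketch of the idea rather than the exact get_every() used in the project; the argument names n and start and the toy vector are assumptions:

```r
# Pluck every n-th element of x, beginning at position `start`.
# Assumes start <= length(x).
get_every <- function(x, n, start = 1) {
  x[seq(from = start, to = length(x), by = n)]
}

# Toy example: three "rows" of three "columns", flattened row by row.
cells <- c("code", "name",  "units",
           "A1",   "Alpha", "3.0",
           "B2",   "Beta",  "1.5")

first_column  <- get_every(cells, n = 3, start = 1)  # "code" "A1" "B2"
second_column <- get_every(cells, n = 3, start = 2)  # "name" "Alpha" "Beta"
```

Shifting start from 1 up to n walks across the columns one at a time.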
We’ll return to the get_every() function later, so I simplified the repeated calls with partial() from the purrr package before unpacking the resulting list with the zeallot operator (%<-%). I now have seven vectors of length 36. Assigning them to columns within a tibble and then tidying the data is done as follows:
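A rough sketch of those steps on toy data, assuming the purrr, zeallot, tibble, and tidyr packages. The two-row cells vector is made up, get_every() is a stand-in definition with assumed argument names, and steps such as dropping repeated header rows or adding the category column are left out:

```r
library(purrr)
library(zeallot)
library(tibble)
library(tidyr)

# Stand-in helper: every n-th element of x, beginning at `start`.
get_every <- function(x, n, start = 1) {
  x[seq(from = start, to = length(x), by = n)]
}

# Toy data: two table rows of seven cells each, flattened row by row.
cells <- c(
  "CMPR.X402", "C# .NET Programming, Advanced", "3.0",
  "Online", "Blended", "Online", "Blended",
  "IPDV.X400", "Cloud Computing, Introduction", "0.5",
  "Classroom", "", "", ""
)

# Fix n = 7 once so later calls only need to vary the starting point.
every_seventh <- partial(get_every, n = 7)

# Build one vector per column, then unpack the list into seven variables.
c(code, name, units, fall, winter, spring, summer) %<-%
  map(1:7, function(i) every_seventh(cells, start = i))

# One row per course, one column per quarter.
wide <- tibble(code, name, units = as.numeric(units),
               fall, winter, spring, summer)

# Tidy: one row per course per quarter.
courses <- pivot_longer(
  wide,
  cols      = fall:summer,
  names_to  = "quarter_name",
  values_to = "availability"
)
```

The long format makes it easy to filter by quarter or availability later on.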
# A tibble: 116 x 6
   code      name                               units quarter_name availability category
   <chr>     <chr>                              <dbl> <chr>        <chr>        <chr>
 1 CMPR.X402 C# .NET Programming, Advanced      3.00  fall         Online       Elective
 2 CMPR.X402 C# .NET Programming, Advanced      3.00  winter       Blended      Elective
 3 CMPR.X402 C# .NET Programming, Advanced      3.00  spring       Online       Elective
 4 CMPR.X402 C# .NET Programming, Advanced      3.00  summer       Blended      Elective
 5 CMPR.X403 C# .NET Programming, Comprehensive 3.00  fall         Blended      Elective
 6 CMPR.X403 C# .NET Programming, Comprehensive 3.00  winter       Online       Elective
 7 CMPR.X403 C# .NET Programming, Comprehensive 3.00  spring       Blended      Elective
 8 CMPR.X403 C# .NET Programming, Comprehensive 3.00  summer       Online       Elective
 9 IPDV.X400 Cloud Computing, Introduction      0.500 fall         Classroom    Elective
10 IPDV.X400 Cloud Computing, Introduction      0.500 winter       ""           Elective
# ... with 106 more rows
Getting the Offering Data
This is a good start, but I need additional data from the individual offering pages. To pull out the URLs, I extract all the anchor nodes and their href attributes from the HTML file, filter out the unwanted URLs, and prefix the remaining paths with the UCSC Extension domain:
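A minimal sketch of that extraction, assuming rvest. The HTML fragment, the /courses/offering/ path pattern, and the example.edu domain are all placeholders, not the real site structure:

```r
library(rvest)

# Hypothetical page fragment with a mix of wanted and unwanted links.
page <- read_html('
  <a href="/courses/offering/linux-web-app-development">LAMP</a>
  <a href="/about">About</a>
  <a href="/courses/offering/cloud-computing-introduction">Cloud</a>
  <a>No href at all</a>')

# Pull the href attribute from every anchor node (NA when it is missing).
hrefs <- html_attr(html_elements(page, "a"), "href")

# Keep only paths that look like offering pages (assumed pattern).
offering_paths <- hrefs[!is.na(hrefs) & grepl("^/courses/offering/", hrefs)]

# Turn the relative paths into full URLs with a placeholder domain.
base_url <- "https://www.example.edu"
offering_urls <- paste0(base_url, offering_paths)
```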
I can then loop through the URLs, scraping and aggregating the table data in a similar fashion. It’s courteous to the website to delay iterative web scraping, so I’m calling a five-second delay up front:
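The shape of such a loop might look like the sketch below, in base R. scrape_all(), its scrape_one argument, and the fake scraper are hypothetical names used for illustration; a real scrape_one would read and parse each offering page:

```r
# Visit each URL in turn, pausing `delay` seconds before every request,
# and stack the per-page results into one data frame.
scrape_all <- function(urls, scrape_one, delay = 5) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    Sys.sleep(delay)                      # be polite to the server
    results[[i]] <- scrape_one(urls[[i]])
  }
  do.call(rbind, results)
}

# Stand-in scraper that makes no network calls.
fake_scrape <- function(url) {
  data.frame(url = url, path = basename(url), stringsAsFactors = FALSE)
}

# delay = 0 here only so the example runs instantly.
offerings <- scrape_all(
  c("https://www.example.edu/offering/a", "https://www.example.edu/offering/b"),
  fake_scrape,
  delay = 0
)
```

Keeping the delay as an argument makes it easy to test the loop quickly and still default to five seconds for real requests.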
All that remains is some additional tidying, mutating, and joining. Here I’m just trying to create additional variables for users to set their preferences against when it comes time to create the hypothetical schedules.
My definition of an “online” course is fairly rigid. Any course classified “blended” or “Classroom and Online” is re-coded as an online course.
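As a small illustration of that re-coding in base R. The label spellings, the new delivery variable, and the choice to treat blank availability as missing are assumptions:

```r
availability <- c("Online", "Blended", "Classroom and Online", "Classroom", "")

# Anything with an online component counts as online.
online_labels <- c("Online", "Blended", "Classroom and Online")

delivery <- ifelse(
  availability %in% online_labels, "online",
  ifelse(availability == "", NA, "classroom")  # blank means not offered
)
```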
The Scheduling “Algorithm”
I omitted some of the quarterly data for reasons I explained above. I never bothered to figure out how to effectively project future quarter dates, which could have been used to calculate time-to-completion. Instead, we’re left with a half-decent approximation. The final scheduling function, plan_my_program_schedule(), is below. This monstrosity of a function has five parameters:
program_data – the scraped program data.
days_available – a vector indicating the weekdays the student is available.
online_willing – a boolean indicating whether the student is willing to take online courses.
weekday_morning_available – a boolean indicating whether the student is available on mornings during the week.
cost_threshold – the maximum amount of money the student is willing to spend.
In the end, I substituted “algorithm” with a “randomize-combinations-until-it-works” approach. The function first calculates the number of possible combinations of six unique courses within the program data. Within the while loop, six random course codes are sampled and then run through a series of checks corresponding to the function inputs. The minimum number of credits (14) is also checked, as is whether or not a “core” class is included in the generated schedule. If the number of attempts reaches the number of possible combinations, the while loop breaks and apologizes to the user. Otherwise, the function returns the first combination that passes all checks.
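The approach described above can be sketched in base R as follows. This is a simplified stand-in, not plan_my_program_schedule() itself: the function name, the column names (units, cost, day, online, morning, category), and the toy program data are all assumptions.

```r
plan_schedule_sketch <- function(program_data, days_available, online_willing,
                                 weekday_morning_available, cost_threshold,
                                 n_courses = 6, min_units = 14) {
  # Upper bound on attempts: the number of distinct course combinations.
  max_tries <- choose(nrow(program_data), n_courses)
  tries <- 0
  while (tries < max_tries) {
    tries <- tries + 1
    # Sample n_courses distinct rows at random.
    picked <- program_data[sample(nrow(program_data), n_courses), ]
    passes <-
      all(picked$day %in% days_available) &&
      (online_willing || !any(picked$online)) &&
      (weekday_morning_available || !any(picked$morning)) &&
      sum(picked$cost) <= cost_threshold &&
      sum(picked$units) >= min_units &&
      any(picked$category == "Core")
    if (passes) return(list(schedule = picked, tries = tries))
  }
  message("Sorry, no schedule found after ", tries, " attempts.")
  invisible(NULL)
}

# Made-up program data: seven evening classroom courses plus one online course.
program_data <- data.frame(
  code     = paste0("TOY.X", 401:408),
  units    = c(3, 3, 3, 3, 3, 2, 2, 3),
  cost     = c(rep(700, 7), 500),
  day      = c("Mon", "Tue", "Wed", "Thu", "Mon", "Tue", "Wed", "Sat"),
  online   = c(rep(FALSE, 7), TRUE),
  morning  = rep(FALSE, 8),
  category = c("Core", "Core", rep("Elective", 6)),
  stringsAsFactors = FALSE
)

set.seed(42)
result <- plan_schedule_sketch(
  program_data,
  days_available = c("Mon", "Tue", "Wed", "Thu"),
  online_willing = FALSE,
  weekday_morning_available = FALSE,
  cost_threshold = 5000
)
```

Because each draw is independent, the same combination can come up more than once, so reaching the attempt limit does not guarantee that every combination was examined. That is the price of skipping a proper enumeration.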
Bonus: I wrote a print method!
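For anyone who hasn’t written one, an S3 print method is just a function named print.<class>. Here is a minimal sketch; the program_schedule class name, the new_schedule() constructor, and the output layout are invented for illustration:

```r
# Wrap a schedule in a list tagged with a custom S3 class.
new_schedule <- function(courses, tries) {
  structure(list(courses = courses, tries = tries), class = "program_schedule")
}

# print() dispatches here for objects of class "program_schedule".
print.program_schedule <- function(x, ...) {
  cat("Schedule found after", x$tries, "attempts\n")
  cat(sprintf("  %-10s %4.1f units  $%.0f\n",
              x$courses$code, x$courses$units, x$courses$cost), sep = "")
  cat(sprintf("Total: %.1f units, $%.0f\n",
              sum(x$courses$units), sum(x$courses$cost)))
  invisible(x)  # print methods conventionally return their input invisibly
}

toy_courses <- data.frame(
  code  = c("TOY.X401", "TOY.X402"),
  units = c(3, 2.5),
  cost  = c(700, 650),
  stringsAsFactors = FALSE
)
schedule <- new_schedule(toy_courses, tries = 12)
print(schedule)
```

Typing schedule at the console calls the same method automatically.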
So finally, in the example below, I’m a student interested in the Internet Programming and Development program. I am only available on Mondays, Tuesdays, Wednesdays, and Thursdays. Because I am very old-fashioned, I am not willing to take online courses, nor am I available on weekday mornings. The maximum amount of money I am willing to spend is $5000. What’s my schedule?
Nice, it only took 6290 permutations to find it!
Scaling to Other Programs
With a little bit of effort, you can wrap the above code into several functions and input another program you might be interested in. For example, I scraped the Software Engineering program code for another hypothetical schedule:
The end. The lesson, as always: let the computers randomize your decisions.