How'd they do that? Part I: Console Histograms

This is the first post in a new series I’m calling How’d they do that? Motivated by professional curiosity and personal jealousy, I will ~~identify, expose, and interpret~~ stumble through the secrets hidden within my fellow #rstats enthusiasts’ repos. The exposition will not be exhaustive; curiosity is a fickle thing, burning one moment and extinguished the next. So it will be with these blogs.

First up: the console histograms featured in rOpenSci’s skimr. Observe:

> library(skimr)
> skim(mtcars) %>% dplyr::filter(stat == "hist")
# A tibble: 11 x 5
     var    type  stat      level value
   <chr>   <chr> <chr>      <chr> <dbl>
 1   mpg numeric  hist ▂▅▇▇▇▃▁▁▂▂     0
 2   cyl numeric  hist ▆▁▁▁▃▁▁▁▁▇     0
 3  disp numeric  hist ▇▇▅▁▁▇▃▂▁▃     0
 4    hp numeric  hist ▆▆▇▂▇▂▃▁▁▁     0
 5  drat numeric  hist ▃▇▂▂▃▆▅▁▁▁     0
 6    wt numeric  hist ▂▂▂▂▇▆▁▁▁▂     0
 7  qsec numeric  hist ▂▃▇▇▇▅▅▁▁▁     0
 8    vs numeric  hist ▇▁▁▁▁▁▁▁▁▆     0
 9    am numeric  hist ▇▁▁▁▁▁▁▁▁▆     0
10  gear numeric  hist ▇▁▁▁▆▁▁▁▁▂     0
11  carb numeric  hist ▆▇▂▁▇▁▁▁▁▁     0

Pretty cool, right? Well, How’d they do that? The rOpenSci folks provide an obvious clue in the README:

uses Hadley’s colformats, specifically colformats::spark-bar()

Thanks. Before we visit that repo, let’s glance at the function used to produce the histograms. After some digging, I found inline_hist:¹

inline_hist <- function(x) {
  x <- x[!is.na(x)]
  out <- 0
  if ( !all(x == 0)) {
    hist_dt <- table(cut(x, 10))
    hist_dt <- hist_dt / max(hist_dt)
    names(out) <- colformat::spark_bar(hist_dt)
  }
  return(out)
}

A brief interpretation: First, NAs are removed from the vector. Second, out is initialized to 0, which is what gets returned when there is nothing to plot. Third, if the vector contains values other than 0, it is cut into 10 equal-width “bins” and tabulated. Fourth, the counts are divided by the largest count so that they all fall between 0 and 1. And fifth, the string of bars returned by spark_bar is assigned as the name of out. That, I think, is why the histogram appears in the level column of the tibble above while the value column holds 0.
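To see the binning and scaling steps on their own, here is a minimal base-R sketch. The helper name bin_heights is my own invention and is not part of skimr; it only mirrors the table(cut(...)) and divide-by-max lines from inline_hist.

```r
# Base-R sketch of the binning and scaling steps inside inline_hist().
# `bin_heights` is a hypothetical helper name, not part of skimr.
bin_heights <- function(x, n_bins = 10) {
  x <- x[!is.na(x)]                 # drop missing values
  counts <- table(cut(x, n_bins))   # count observations per equal-width bin
  as.numeric(counts / max(counts))  # rescale so the tallest bin equals 1
}

heights <- bin_heights(mtcars$mpg)
length(heights)  # one height per bin
max(heights)     # the tallest bin
```

Each of the ten numbers is a relative bar height, which is exactly the kind of input spark_bar expects.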

inline_hist is elsewhere bundled into a list of functions that gets passed to the .default and appears in a separate environment. But here my curiosity was temporarily sated. I wanted to dig deeper into the histograms themselves, which are generated by spark_bar. On to colformat (the package the README calls colformats).

Briefly, colformat is a nifty package that helps you style and color columns of data. Let’s look at spark_bar:

spark_bar <- function(x, safe = TRUE) {
  stopifnot(is.numeric(x))

  bars <- vapply(0x2581:0x2588, intToUtf8, character(1))
  if (safe) {
    bars <- bars[-c(4, 8)]
  }

  factor <- cut(
    x,
    breaks = seq(0, 1, length = length(bars) + 1),
    labels = bars,
    include.lowest = TRUE
  )
  chars <- as.character(factor)
  chars[is.na(chars)] <- crayon::style(bars[length(bars)], colour_na())

  structure(paste0(chars, collapse = ""), class = "spark")
}

The bars object is most intriguing. Watch what happens when I recreate it in the console:

> vapply(0x2581:0x2588, intToUtf8, character(1))
[1] "▁" "▂" "▃" "▄" "▅" "▆" "▇" "█"

intToUtf8 converts integer code points into UTF-8 encoded character strings (its sibling, utf8ToInt, goes the other way). I know very little about encoding, but I do know that the “8” in UTF-8 refers to the 8-bit units the encoding is built from; a single character can take anywhere from one to four of those bytes. So in the example above, 0x2581:0x2588 is hexadecimal notation for the integers 9601 through 9608, and vapply converts each one to its block character and returns a character vector.
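A few console-sized checks make the encoding story concrete. This is only a sketch using base R functions; the variable names block and bars are mine.

```r
# 0x2581 is just a hexadecimal way of writing the integer 9601.
0x2581 == 9601

# intToUtf8() maps an integer code point to a character...
block <- intToUtf8(0x2581)

# ...and utf8ToInt() maps it back to 9601.
utf8ToInt(block)

# One character on screen, but three bytes once encoded as UTF-8.
nchar(block, type = "chars")
nchar(block, type = "bytes")

# multiple = TRUE returns one string per code point, so the vapply()
# call in spark_bar could also be written like this.
bars <- intToUtf8(0x2581:0x2588, multiple = TRUE)
length(bars)
```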

Much of the remaining code is a mystery to me: the input vector is cut, styled, and structured in ways I don’t entirely understand. The important part is that I’m no longer curious, I learned something along the way, and I resurrected my blog after a month-long hiatus.
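For anyone whose curiosity outlasts mine, the cut step can be tried in isolation. Below is a rough stand-in I am calling mini_spark (a made-up name, not a colformat function). It keeps the cut logic from spark_bar and assumes safe = TRUE, but drops the crayon styling and the "spark" class, so values outside 0 to 1 simply show up as "NA" instead of a coloured bar.

```r
# Hypothetical stand-in for spark_bar(): same cut() logic, no crayon styling.
mini_spark <- function(x) {
  stopifnot(is.numeric(x))
  bars <- intToUtf8(0x2581:0x2588, multiple = TRUE)
  bars <- bars[-c(4, 8)]  # the safe = TRUE branch: six bars remain

  # Split [0, 1] into one equal-width interval per bar and label each
  # interval with its bar, so a larger value gets a taller block.
  binned <- cut(
    x,
    breaks = seq(0, 1, length.out = length(bars) + 1),
    labels = bars,
    include.lowest = TRUE
  )
  paste0(as.character(binned), collapse = "")
}

mini_spark(c(0, 0.4, 0.9))
# [1] "▁▃▇"
```

With six bars the breaks sit at multiples of 1/6, so 0 lands in the lowest interval, 0.4 in the third, and 0.9 in the sixth.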

Thanks to the rOpenSci team for always piquing our curiosity.

¹ Special thanks to Maelle Salmon for pointing out the lookup package after this was posted.