This is the first post in a new series I’m calling How’d they do that? Motivated by professional curiosity
and personal jealousy, I will stumble through (rather than identify, expose, and interpret) the secrets hidden within my fellow #rstats enthusiasts’ repos.
The exposition will not be exhaustive; curiosity is a fickle thing, burning one moment and extinguished the next. So it will be with these posts.
First up: the console histograms featured in rOpenSci’s skimr. Observe:
> library(skimr)
> skim(mtcars) %>% dplyr::filter(stat == "hist")
# A tibble: 11 x 5
var type stat level value
<chr> <chr> <chr> <chr> <dbl>
1 mpg numeric hist ▂▅▇▇▇▃▁▁▂▂ 0
2 cyl numeric hist ▆▁▁▁▃▁▁▁▁▇ 0
3 disp numeric hist ▇▇▅▁▁▇▃▂▁▃ 0
4 hp numeric hist ▆▆▇▂▇▂▃▁▁▁ 0
5 drat numeric hist ▃▇▂▂▃▆▅▁▁▁ 0
6 wt numeric hist ▂▂▂▂▇▆▁▁▁▂ 0
7 qsec numeric hist ▂▃▇▇▇▅▅▁▁▁ 0
8 vs numeric hist ▇▁▁▁▁▁▁▁▁▆ 0
9 am numeric hist ▇▁▁▁▁▁▁▁▁▆ 0
10 gear numeric hist ▇▁▁▁▆▁▁▁▁▂ 0
11 carb numeric hist ▆▇▂▁▇▁▁▁▁▁ 0
Pretty cool, right? Well, How’d they do that? The rOpenSci folks provide an obvious clue in the README:
uses Hadley’s colformats, specifically colformats::spark-bar()
Thanks. Before we visit that repo, let’s glance at the function used to produce the histograms. After some digging, I found inline_hist:[1]
inline_hist <- function(x) {
  x <- x[!is.na(x)]
  out <- 0
  if (!all(x == 0)) {
    hist_dt <- table(cut(x, 10))
    hist_dt <- hist_dt / max(hist_dt)
    names(out) <- colformat::spark_bar(hist_dt)
  }
  return(out)
}
A brief interpretation: First, NAs are removed from the vector. Second, the function checks whether every remaining value is 0; if so, it skips the histogram and returns a bare 0. Third, if the vector does contain values other than 0, it is cut into 10 equal-width “bins” and tabulated into counts. Fourth, the counts are divided by the largest count, so every value falls between 0 and 1. And fifth, the string of bars returned by spark_bar is assigned to the name of the output object. That, I think, allows the histogram to be printed within the resulting tibble above.
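To convince myself of that reading, here is a rough walk-through of the same steps in base R. It is only a sketch: it stops before the call to spark_bar, the extra NA is something I added for illustration, and the name assigned at the end is a made-up placeholder rather than real output.

```r
# Walk through the steps of inline_hist() by hand, using base R only.
x <- c(mtcars$mpg, NA)          # mtcars ships with R; the NA is added for illustration

x <- x[!is.na(x)]               # step 1: drop missing values
has_nonzero <- !all(x == 0)     # step 2: is there anything other than 0 to draw?

counts <- table(cut(x, 10))     # step 3: 10 equal-width bins, tabulated into counts
scaled <- counts / max(counts)  # step 4: the tallest bin becomes 1

out <- 0
names(out) <- "bars-go-here"    # step 5: placeholder standing in for spark_bar(scaled)

print(round(as.numeric(scaled), 2))
print(out)
```

Printing out shows the name sitting above the 0, which is at least consistent with the histogram landing in one column of the tibble and 0 in another.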
inline_hist is elsewhere bundled into a list of functions that gets passed to the .default and appears in a separate environment. But here my curiosity was temporarily sated. I wanted to dig deeper into the histograms themselves, which are generated by spark_bar. On to colformat.
Briefly, colformat is a nifty package that helps you style and color columns of data. Let’s look at spark_bar:
spark_bar <- function(x, safe = TRUE) {
  stopifnot(is.numeric(x))
  bars <- vapply(0x2581:0x2588, intToUtf8, character(1))
  if (safe) {
    bars <- bars[-c(4, 8)]
  }
  factor <- cut(
    x,
    breaks = seq(0, 1, length = length(bars) + 1),
    labels = bars,
    include.lowest = TRUE
  )
  chars <- as.character(factor)
  chars[is.na(chars)] <- crayon::style(bars[length(bars)], colour_na())
  structure(paste0(chars, collapse = ""), class = "spark")
}
The bars object is most intriguing. Watch what happens when I recreate it in the console:
> vapply(0x2581:0x2588, intToUtf8, character(1))
[1] "▁" "▂" "▃" "▄" "▅" "▆" "▇" "█"
intToUtf8 converts integer Unicode code points into a UTF-8 encoded string (its sibling, utf8ToInt, goes the other way). I know very little about encoding, but I do know that
the “8” in UTF-8 refers to its 8-bit code units: a character takes between one and four bytes, and each of these blocks takes three. So in the example above, vapply
converts the code points 9601 through 9608 (0x2581 through 0x2588 in hexadecimal) into the eight blocks and returns them as a character vector.
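A quick way to poke at this in the console, using only base R functions (nothing here comes from colformat or skimr):

```r
# Build the eight block characters from their Unicode code points.
bars <- vapply(0x2581:0x2588, intToUtf8, character(1))

# utf8ToInt() is the inverse: it recovers the code point of a character.
first_code <- utf8ToInt(bars[1])

# One character on screen, but more than one byte once encoded as UTF-8.
n_chars <- nchar(bars[1], type = "chars")
n_bytes <- nchar(bars[1], type = "bytes")

# intToUtf8() can also do the vapply() work itself via multiple = TRUE.
bars_alt <- intToUtf8(0x2581:0x2588, multiple = TRUE)

print(first_code)
print(c(chars = n_chars, bytes = n_bytes))
print(all(bars == bars_alt))
```

The multiple = TRUE form is just an alternative spelling; I have no idea whether the author had a reason to prefer vapply.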
Much of the remaining code is a mystery: the input vector is cut, styled, and structured in ways I don’t entirely understand. The important part is that I’m no longer curious, I learned something along the way, and I resurrected my blog after a month-long hiatus.
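For anyone whose curiosity outlasts mine, here is my best guess at the cut step, sketched in base R. The input heights are made up, and I leave out the crayon styling of NAs and the "spark" class entirely, so treat this as an approximation of spark_bar rather than a copy of it.

```r
# Sketch of the binning inside spark_bar(), minus crayon styling and the class.
bars <- vapply(0x2581:0x2588, intToUtf8, character(1))
bars <- bars[-c(4, 8)]   # what safe = TRUE does: drop the 4th and 8th blocks, leaving six

# Made-up bin heights, already scaled to [0, 1] the way inline_hist() scales them.
heights <- c(0.05, 0.25, 0.45, 0.6, 0.8, 1)

# Six equal-width intervals on [0, 1], each labelled with one block character.
binned <- cut(
  heights,
  breaks = seq(0, 1, length.out = length(bars) + 1),
  labels = bars,
  include.lowest = TRUE
)

# Glue the labels together into a single string of bars.
spark <- paste0(as.character(binned), collapse = "")
print(spark)
```

Each height falls into a different interval here, so the result is the six remaining blocks in ascending order. Dropping two of the eight blocks would also explain why only six distinct bar heights appear in the skimr output above.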
Thanks to the rOpenSci team for always piquing our curiosity.
[1] Special thanks to Maelle Salmon for pointing out the lookup package after this was posted.