--- title: "Introduction to gghighlight" author: "Hiroaki Yutani" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to gghighlight} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( fig.width = 6, fig.height = 4, fig.align = "center", collapse = TRUE, comment = "#>" ) ``` ## Motivation Suppose we have data that has so many series that it is hard to identify them by their colours as the differences are so subtle. ```{r data} set.seed(2) d <- purrr::map_dfr( letters, ~ data.frame( idx = 1:400, value = cumsum(runif(400, -1, 1)), type = ., flag = sample(c(TRUE, FALSE), size = 400, replace = TRUE), stringsAsFactors = FALSE ) ) ``` ```{r ggplot2-simple} library(ggplot2) ggplot(d) + geom_line(aes(idx, value, colour = type)) ``` To filter the data to a reasonable number of lines, we can use dplyr's `filter()`. ```{r ggplot2-filter} library(dplyr, warn.conflicts = FALSE) d_filtered <- d %>% group_by(type) %>% filter(max(value) > 20) %>% ungroup() ggplot(d_filtered) + geom_line(aes(idx, value, colour = type)) ``` But, it seems not so handy. For example, what if we want to change the threshold in predicate (`max(value) > 20`) and highlight other series as well? It's a bit tiresome to type all the code above again every time we replace `20` with some other value. Besides, considering one of the main purposes of visualization is to get the overview of a data, it may not be good to simply filter out the unmatched data because the plot will lose its context. Here comes gghighlight package, `dplyr::filter()` equivalent for ggplot2. (If you are interested in more details behind the idea of highlighting, please read this post: [Anatomy of gghighlight](https://yutani.rbind.io/post/2018-06-03-anatomy-of-gghighlight/).) ## gghighlight() The main function of the gghighlight package is `gghighlight()`. For example, by using this function, we can highlight the lines whose max values are larger than 20 as seen below: ```{r gghighlight-simple} library(gghighlight) ggplot(d) + geom_line(aes(idx, value, colour = type)) + gghighlight(max(value) > 20) ``` You can specify as many predicates as you like. For example, the following code highlights the data that satisfies both `max(value) > 15` and `mean(flag) > 0.55`. ```{r gghighlight-two-conds} ggplot(d) + geom_line(aes(idx, value, colour = type)) + gghighlight(max(value) > 15, mean(flag) > 0.5) ``` ## Customization As adding `gghighlight()` results in a ggplot object, it is fully customizable just as we usually do with ggplot2 like custom themes. ```{r gghighlight-theme} ggplot(d) + geom_line(aes(idx, value, colour = type)) + gghighlight(max(value) > 19) + theme_minimal() ``` The plot also can be facetted: ```{r gghighlight-facet} ggplot(d) + geom_line(aes(idx, value, colour = type)) + gghighlight(max(value) > 19) + theme_minimal() + facet_wrap(~ type) ``` There are also some options to control the way of highlighting. See "Options" section below. ## Geoms `gghighlight()` can highlight almost every geom. Here are some examples. ### Bar `gghighlight()` can highlight bars. ```{r bar} p <- ggplot(iris, aes(Sepal.Length, fill = Species)) + geom_histogram() + gghighlight() p ``` Are you wondering if this is really highlighted? Yes, it is. But, the unhighlighted bars are all overwritten by the highlighted bars. This seems not so useful, until you see the facetted version: ```{r bar-wrap} p + facet_wrap(~ Species) ``` ### Point As is explained in [Anatomy of gghighlight](https://yutani.rbind.io/post/2018-06-03-anatomy-of-gghighlight/), lines and points typically have different semantics (group-wise or not). But, in most cases, you don't need to be careful about the difference with `gghighlight()` because it automatically picks the right method of calculation. ```{r point} set.seed(10) d2 <- dplyr::slice_sample(d, n = 20) ggplot(d2, aes(idx, value)) + geom_point() + gghighlight(value > 0, label_key = type) ``` More precisely, `gghighlight()` takes the following strategy: 1. Calculate the group IDs from mapping. a. If `group` exists, use it. b. Otherwise, assign the group IDs based on the combination of the values of discrete variables. 2. If the group IDs exists, evaluate the predicates in a grouped manner. 3. If the group IDs doesn't exist or the grouped calculation fails, evaluate the predicates in an ungrouped manner. Note that, in this case, `label_key = type` is needed to show labels because `gghighlight()` chooses a discrete variable from the mapping, but `aes(idx, value)` consists of continuous variables only. ## Non-logical predicate To construct a predicate expression like below, we need to determine a threshold (in this example, `20`). But it is difficult to choose a nice one before we draw plots. ```{r predicate-example, eval=FALSE} max(value) > 20 ``` So, `gghighlight()` allows predicates that return non-logical (e.g. numeric and character) results. The values are used for sorting data and the top `max_highlight` of rows/groups are highlighted: ```{r numeric-highlight} ggplot(d, aes(idx, value, colour = type)) + geom_line() + gghighlight(max(value), max_highlight = 5L) ``` ## Labels `gghighlight()` adds direct labels for some geoms. Currently, the following geoms are supported: * `point`: add labels at each highlighted point. * `line`: add labels at the right end of each highlighted line. * `bar`: (do not add labels) If you don't want them to be labelled automatically, you can specify `use_direct_label = FALSE` ```{r labels} ggplot(d) + geom_line(aes(idx, value, colour = type)) + gghighlight(max(value) > 20, use_direct_label = FALSE) ``` Labels are drawn by `geom_label_repel()`. If you want to customize the labels, you can pass parameters to it via `label_params`. ```{r labels2} ggplot(d) + geom_line(aes(idx, value, colour = type)) + gghighlight(max(value) > 20, label_params = list(size = 10)) ``` You can also add labels by yourself. It is easy to add labels on only highlighted data because `gghighlight()` replaces the plot's data to the filtered one. ```{r labels3} p <- ggplot(d2, aes(idx, value)) + geom_point(size = 4) + gghighlight(value > 0, use_direct_label = FALSE) # the filtered data p$data p + geom_label(aes(label = type), hjust = 1, vjust = 1, fill = "purple", colour = "white", alpha= 0.5) ``` ## Options ### `unhighlighted_params` If you want to change the style of unhighlighted layers, use `unhighlighted_params`. ```{r gghighlight-params} ggplot(d) + geom_line(aes(idx, value, colour = type), linewidth = 5) + gghighlight(max(value) > 19, unhighlighted_params = list(linewidth = 1, colour = alpha("pink", 0.4))) ``` You can also specify `NULL` to `fill` or `colour` to preserve the original color. ```{r gghighlight-params-null} ggplot(d) + geom_line(aes(idx, value, colour = type)) + gghighlight(max(value) > 19, # preserve colour and modify alpha instead unhighlighted_params = list(colour = NULL, alpha = 0.3)) ``` ### `keep_scales` If you want to keep the original scales, set `keep_scales` to `TRUE`. ```{r keep_scales, fig.show='hold'} p <- ggplot(mtcars, aes(wt, mpg, colour = factor(cyl))) + geom_point() p + gghighlight(cyl == 6) p + gghighlight(cyl == 6, keep_scales = TRUE) + ggtitle("keep_scales = TRUE") ``` ### `calculate_per_facet` If you want to highlight each facet individually, set `calculate_per_facet` to `TRUE`. Note that `gghighlight()` affects the plot *before* `gghighlight()`. If you add `facet_*()` after adding `gghighlight()`, this option doesn't work. ```{r calculate_per_facet, fig.show='hold'} d <- data.frame( idx = c(1, 2, 3, 4, 1, 2, 3, 4), value = c(10, 11, 12, 13, 4, 8, 16, 32), cat1 = rep(c("a", "b"), each = 4), cat2 = rep(rep(c("1-2", "3-4"), each = 2), 2), stringsAsFactors = FALSE ) p <- ggplot(d, aes(idx, value, colour = cat1)) + geom_line() + facet_wrap(vars(cat2)) p + gghighlight(max(value) > 10) p + gghighlight(max(value) > 10, calculate_per_facet = TRUE) + ggtitle("calculate_per_facet = TRUE") ``` ### `line_label_type` (experimental) By default, gghighlight uses [the ggrepel package](https://cran.r-project.org/package=ggrepel) for labeling lines. You can change the method by `line_label_type` argument. The options are: * `"ggrepel_label"` (default): Use `ggrepel::geom_label_repel()`. * `"ggrepel_text"`: Use `ggrepel::geom_text_repel()`. * `"text_path"`: Use `geomtextpath::geom_textline()` for lines and `geomtextpath::geom_textpath()` for paths. * `"label_path"`: Use `geomtextpath::geom_labelline()` for lines and `geomtextpath::geom_labelpath()` for paths. * `"sec_axis"`: Use secondary axis. Please refer to [Simon Jackson's blog post](https://drsimonj.svbtle.com/label-line-ends-in-time-series-with-ggplot2) for the trick. ```{r line_label_type, fig.show='hold'} d <- data.frame( x = rep(1:3, times = 3), y = c(1:3, 2, 4, 2, 0, 1, 1), id = rep(c("a", "b", "c"), each = 3) ) p <- ggplot(d) + geom_line(aes(x, y, colour = id)) p + gghighlight(max(y) >= 3, line_label_type = "label_path", label_params = list(size = 10)) + ggtitle('line_label_type = "label_path"') p + gghighlight(max(y) >= 3, line_label_type = "sec_axis") + ggtitle('line_label_type = "sec_axis"') + theme(axis.text.y.right = element_text(size = 20)) ``` Note that, while this looks good for this example, there are some limitations: * Unlike ggrepel, there's no mechanism to avoid overlapping. You'll probably want to choose `ggrepel_label` or `ggrepel_text` when the data has many series. * Since `"sec_axis"` is a very different approach than the other, some `label_params` are ignored. For example, if you want to change the text size of the labels, you need to specify it via `ggplot2::theme()` instead of `label_params`.