Comparisons

Materials for class on Tuesday, October 3, 2017

Slides
Comparing things
Feedback for today

Slides

Download the slides from today’s lecture.

Comparing things

Sparklines

Create in Excel
Create in Word, HTML, and Adobe products with AtF Spark
Create in R by saving really really tiny PDFs or PNGs

Small multiples

Create in R with facets

Lollipops

Download Lord of the Rings data

First, we load the libraries we’ll need for all the examples. We’ll use tidyverse in both; we’ll use forcats in the lollipop chart and ggrepel in the slopegraph.

library(tidyverse)
library(forcats)
library(ggrepel)

We load the CSV file into R and save it as a variable, or an object, named lotr. We use fct_inorder() inside a mutate command to make the Film variable an ordered factor so that the films plot in the correct order.

lotr <- read_csv("data/Lord_Of_The_Rings.csv") %>%
  mutate(Film = fct_inorder(Film, ordered = TRUE))

Next we summarize the data by race and gender. We can output the summarized table as a Markdown table with knitr::kable(). Remember that this syntax means that we’re using the kable() function inside the knitr package without actually loading it. Alternatively, you can run library(knitr) to load the package and then use kable() as needed without the double colon ::.

race_gender <- lotr %>% 
  group_by(Race, Gender) %>%
  summarize(total_words = sum(Words))

knitr::kable(race_gender)

Race	Gender	total_words
Elf	Female	1743
Elf	Male	1994
Hobbit	Female	16
Hobbit	Male	8780
Man	Female	669
Man	Male	8043

We can plot this summarized data, mapping variables to different aesthetics in the plot. A few things to note:

geom_pointrange() is what makes the lollipops. It takes two aesthetics: ymin and ymax. Here we want ymin to be zero so that all the lines start at the axis, but in other situations you could set it to an actual variable.
position_dodge() and width is what tells the point ranges to be plotted side-by-side.
coord_flip() rotates the whole plot so that the x-axis becomes the y-axis and vice versa.⊕ Alternatively, you can avoid using coord_flip() by using geom_pointrangeh() (and all the other *h functions like geom_barh and geom_colh() in the ggstance library.

ggplot(race_gender, aes(x = Gender, y = total_words, color = Race)) +
  geom_pointrange(aes(ymin = 0, ymax = total_words), 
                  position = position_dodge(width = 0.5)) +
  coord_flip() +
  labs(x = NULL) +
  theme(legend.position = "bottom") +
  scale_color_manual(values = c("red", "orange", "blue"))

We can also plot race and gender across the three films. Because we used fct_inorder() earlier, the Film variable/column is an ordered factor and the films should be in order. Here we use the original lotr data frame instead of the summarized one, since we want the Film column too. We facet by film with facet_wrap(~ Film).

ggplot(lotr, aes(x = Gender, y = Words, color = Race)) +
  geom_pointrange(aes(ymin = 0, ymax = Words), 
                  position = position_dodge(width = 0.5)) +
  coord_flip() +
  labs(x = NULL) +
  theme_light() +
  theme(legend.position = "bottom") +
  scale_color_manual(values = c("red", "orange", "blue")) +
  facet_wrap(~ Film)

Slopegraphs

Download General Conference “isms” data⊕ This data comes from BYU’s LDS General Conference Corpus. I created a list of "*ism" divided between the 1950s/60s and the 1990s/2000s, and then copied/pasted the results in to a CSV file.

First, we load the data. I already filtered and summarized and tidyified this data, based on the original data that looked liked this. You can see the R code I used to clean and tidy the data here.

isms <- read_csv("data/isms_top5.csv")

knitr::kable(isms)

word	decade	count	permil
Baptism	coldwar	596	191.7
Baptism	today	609	207.8
Communism	coldwar	216	69.5
Communism	today	1	0.3
Criticism	coldwar	75	24.1
Criticism	today	51	17.4
Mormonism	coldwar	135	43.4
Mormonism	today	39	13.3
Socialism	coldwar	82	26.4
Socialism	today	0	0.0

Just plotting the data as-is gives us a rudimentary and ugly slopegraph. Note the group aesthetic—without it, the lines will not plot across the coldwar and today columns.

ggplot(isms, aes(x = decade, y = permil, group = word)) + 
  geom_line(size = 1.5) + 
  geom_text(aes(label = word))

We can add a bunch of columns to the original isms data frame to help with plotting. Here’s what’s happening:

We create a label_first variable for the labels on the right side of the plot. We use ifelse()—if the decade is coldwar, use paste0() to combine the word column with the permil (or per million words) value, surrounded by parentheses; if it’s not coldwar, use a missing value, or NA
We create a label_last variable similarly, but this time only include the permil value on rows that aren’t coldwar
We create a highlight column to mark which lines we want colored
We recode the decade column to nicer values

isms_plot <- isms %>%
  mutate(label_first = 
           ifelse(decade == "coldwar", paste0(word, " (", permil, ")"), NA)) %>%
  mutate(label_last = ifelse(decade == "today", permil, NA)) %>%
  mutate(highlight = ifelse(word %in% c("Socialism", "Communism"), TRUE, FALSE)) %>%
  mutate(decade = recode(decade,
                         coldwar = "1950-1969",
                         today = "1990-2009"))

We can use this enhanced data frame to add labels and color specific lines. Here’s the complete, final plot. A few things to note:

geom_text_repel() comes from the ggrepel library and uses fancy algorithms to ensure labels don’t overlap. We set direction = "y" to make sure the algorithm only shifts labels up and down (without it, the labels will show up everywhere), and we use nudge_x to move the labels horizontally away from the lines. We can use seed = some_number to ensure the random positioning of the labels is the same each time the plot is generated.
We use a super minimal theme_minimal() theme and them make a few more adjustments with theme() to remove grid lines and the y axis text.
R will give two warnings saying Removed 5 rows containing missing values (geom_text_repel). This is because it’s trying to plot missing values for the labels—but recall that we created those missing values in both label_first and label_last, so it’s not an issue. We’re purposely causing the warning and we can ignore it.

fancy_plot <- ggplot(isms_plot, aes(x = decade, y = permil, group = word, color = highlight)) + 
  geom_point() +
  geom_line(size = 1.5) + 
  geom_text_repel(aes(label = label_first), direction = "y", nudge_x = -1, seed = 1234) + 
  geom_text_repel(aes(label = label_last), direction = "y", nudge_x = 1, seed = 1234) + 
  scale_color_manual(values = c("black", "red")) + 
  guides(color = FALSE) +
  labs(title = "Frequency of words ending in ‘ism’ in\nLDS General Conference talks",
       subtitle = "Word (occurrences per million words)",
       x = NULL, y = NULL) +
  theme_minimal() + 
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text.y = element_blank())
fancy_plot

Because we saved the plot to a variable (fancy_plot), we can do stuff with it like saving it to our computer with ggsave():

ggsave(fancy_plot, filename = "output/words.pdf", 
       width = 7, height = 4, units = "in")
ggsave(fancy_plot, filename = "output/words.png", 
       width = 7, height = 4, units = "in")

Finally, just for fun, we can use nicer fonts to make the graphic even nicer. We’ll use Roboto Condensed, which is free from Google Fonts.

If you’re using macOS, ensure that you install XQuartz first. That’s all you have to do to get the Cairo graphics library working.
If you’re using Windows, either reload all the system fonts (after installing Roboto Condensed) with extrafont::font_import() or load the Roboto Condensed fonts individually on-the-fly:

windowsFonts(`Roboto Condensed` = windowsFont("Roboto Condensed"))
windowsFonts(`Roboto Condensed Light` = windowsFont("Roboto Condensed Light"))

With that, we can specify font families and font faces (bold, italic, plain, etc.) in geom_text_repel() and in theme():

nice_fonts <- ggplot(isms_plot, aes(x = decade, y = permil, group = word, color = highlight)) + 
  geom_point() +
  geom_line(size = 1.5) + 
  geom_text_repel(aes(label = label_first), direction = "y", nudge_x = -1,
                  family = "Roboto Condensed Light", fontface = "plain",
                  seed = 1234) + 
  geom_text_repel(aes(label = label_last), direction = "y", nudge_x = 0.3,
                  family = "Roboto Condensed Light", fontface = "plain",
                  seed = 1234) + 
  scale_color_manual(values = c("black", "red")) + 
  guides(color = FALSE) +
  labs(title = "Frequency of words ending in ‘ism’ in\nLDS General Conference talks",
       subtitle = "Word (occurrences per million words)",
       x = NULL, y = NULL) +
  theme_minimal(base_family = "Roboto Condensed") + 
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text.y = element_blank(),
        plot.title = element_text(family = "Roboto Condensed", face = "bold"),
        plot.subtitle = element_text(family = "Roboto Condensed Light",
                                     color = "grey50", face = "plain"))
nice_fonts

We can save the plot with the custom fonts with ggsave() like always, but we have to use the Cairo graphics library to get the fonts to embed and to get a PNG with proper dimensions. Note the difference between the two—PDFs need device = cairo_pdf while PNGs need type = "cairo". It’s different and I don’t fully understand why but ¯\_(ツ)_/¯.

ggsave(plot_isms, filename = "isms.pdf", device = cairo_pdf,
       width = 6, height = 4, units = "in")
ggsave(plot_isms, filename = "isms.png", type = "cairo",
       width = 6, height = 4, units = "in")

Bullet charts

Download fake performance data

Bullet charts are just a bunch of bar charts stacked on top of each with with extra dots and lines. First we load a data frame I copied from Stephanie Evergreen’s book, and we make sure the region variable—here named measure—is an ordered factor, and we reverse it with fct_rev() because coord_flip() does weird stuff to the ordering.

performance <- read_csv("data/performance.csv")

knitr::kable(performance)

measure	bad	satisfactory	good	target	value
Region A	33.3	66.6	100	75	70
Region B	33.3	66.6	100	65	72
Region C	33.3	66.6	100	70	78
Region D	33.3	66.6	100	65	71

With the data in this form, it’s easy to plot. Here are a couple things to note:

We overlay a bunch of geom_col() layers with different aesthics set for each
geom_errorbar() creates a line at the target level. We add a geom_point() layer in the same place for fun

ggplot(performance) +
  geom_col(aes(x = measure, y = good), fill="goldenrod2", width = 0.5, alpha = 0.2) +
  geom_col(aes(measure, satisfactory), fill="goldenrod3", width = 0.5, alpha = 0.2) +
  geom_col(aes(measure, bad), fill="goldenrod4", width = 0.5, alpha = 0.2) +
  geom_col(aes(measure, value), fill="black", width = 0.2) +
  geom_errorbar(aes(x= measure, ymin = target, ymax = target), color = "red", width = 0.45) +
  geom_point(aes(measure, target), colour = "red", size = 2.5) +
  scale_y_continuous(breaks = c(0, 50, 100)) +
  labs(x = NULL, y = NULL) +
  coord_flip() +
  theme_minimal(base_family = "Roboto Condensed") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

Feedback for today

Go to this form and answer these three questions (anonymously if you want):

What new thing did you learn today?
What was the most unclear thing about today?
What was the most exciting thing you learned today?