Key Things to Remember

Look at your data!



Summary Statistics: Datasaurus Dozen

dataset     X Mean    Y Mean    X SD      Y SD      Corr.
bullseye    54.26873  47.83082  16.76924  26.93573  -0.0685864
circle      54.26732  47.83772  16.76001  26.93004  -0.0683434
dino        54.26327  47.83225  16.76514  26.93540  -0.0644719
dots        54.26030  47.83983  16.76774  26.93019  -0.0603414
h_lines     54.26144  47.83025  16.76590  26.93988  -0.0617148
high_lines  54.26881  47.83545  16.76670  26.94000  -0.0685042
slant_down  54.26785  47.83590  16.76676  26.93610  -0.0689797
slant_up    54.26588  47.83150  16.76885  26.93861  -0.0686092
star        54.26734  47.83955  16.76896  26.93027  -0.0629611
v_lines     54.26993  47.83699  16.76996  26.93768  -0.0694456
wide_lines  54.26692  47.83160  16.77000  26.93790  -0.0665752
x_shape     54.26015  47.83972  16.76996  26.93000  -0.0655833

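Visualizing the Datasaurus Dozen

The Datasaurus Dozen shows why summary statistics alone are not enough: all of these datasets have nearly identical means, standard deviations, and correlations, yet they look completely different when plotted.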
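To make the point concrete, here is a minimal base R sketch of how such a per-dataset summary table can be computed. The helper `summarise_xy` and the toy data are hypothetical and made up for illustration. The last part assumes the `datasauRus` package (which ships a `datasaurus_dozen` data frame with columns `dataset`, `x`, `y`) is installed, and is skipped otherwise.

```r
# Per-group summary statistics in base R.
# `summarise_xy` is a hypothetical helper, not part of any package.
summarise_xy <- function(df) {
  groups <- split(df, df$dataset)
  stats <- lapply(groups, function(g) {
    data.frame(
      dataset = as.character(g$dataset[1]),
      x_mean = mean(g$x), y_mean = mean(g$y),
      x_sd = sd(g$x), y_sd = sd(g$y),
      corr = cor(g$x, g$y)
    )
  })
  out <- do.call(rbind, stats)
  rownames(out) <- NULL
  out
}

# Toy data (made up): both groups share means and SDs,
# but the relationship between x and y is reversed.
toy <- data.frame(
  dataset = rep(c("a", "b"), each = 4),
  x = c(1, 2, 3, 4, 1, 2, 3, 4),
  y = c(1, 2, 3, 4, 4, 3, 2, 1)
)
toy_stats <- summarise_xy(toy)
print(toy_stats)

# Assumption: the datasauRus package is available.
if (requireNamespace("datasauRus", quietly = TRUE)) {
  dd <- as.data.frame(datasauRus::datasaurus_dozen)
  print(summarise_xy(dd))
  # One small scatterplot per dataset shows what the table hides
  op <- par(mfrow = c(4, 4), mar = c(1, 1, 2, 1))
  for (nm in as.character(unique(dd$dataset))) {
    sub <- dd[dd$dataset == nm, ]
    plot(sub$x, sub$y, main = nm, axes = FALSE, pch = 20)
  }
  par(op)
}
```

In the toy data the means and standard deviations match exactly, so only the correlation column (and a plot) tells the two groups apart.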

Boxplots in Base R: Perceptions of Probability

Maps: Where Europe Lives, in 14 lines of code

- visualization of population density in Europe in 2011, created by Henrik Lindberg
- parse the latitude/longitude of population centers
- arrange into a 0.01 by 0.01 degree grid
- plot each row as a horizontal line with population as the vertical axis
- grid cells with zero population break the line and leave white gaps
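The steps above can be sketched in base R on a few rows. The grid IDs and population counts below are made up for illustration. The regular expressions and the rounding to one decimal place mirror the `gsub()` and `round(lat, 1)` calls in the slide's pipeline, while `aggregate()`, `expand.grid()` and `merge()` stand in for the `group_by()`, `summarize()` and `complete()` steps.

```r
# Made-up grid IDs for illustration, in the "N<digits>E<digits>" style
# that the regular expressions below expect.
cells <- data.frame(
  GRD_ID = c("1kmN2689E4337", "1kmN2690E4341", "1kmN2746W0012"),
  TOT_P  = c(8, 7, 5)
)

# Step 1: parse the coordinates out of the ID (W means a negative value)
cells$lat <- as.numeric(gsub(".*N([0-9]+)[EW].*", "\\1", cells$GRD_ID)) / 100
lng_sign  <- ifelse(gsub(".*([EW]).*", "\\1", cells$GRD_ID) == "W", -1, 1)
cells$lng <- lng_sign *
  as.numeric(gsub(".*[EW]([0-9]+)", "\\1", cells$GRD_ID)) / 100

# Step 2: snap to a coarser grid and sum the population per cell
cells$lat <- round(cells$lat, 1)
cells$lng <- round(cells$lng, 1)
binned <- aggregate(TOT_P ~ lat + lng, data = cells, FUN = sum)

# Step 3: add every missing lat/lng combination with NA population;
# these NAs are what break the plotted lines and leave gaps
full_grid <- expand.grid(lat = unique(binned$lat), lng = unique(binned$lng))
grid <- merge(full_grid, binned, all.x = TRUE)
print(grid)
```

The first two made-up cells fall into the same rounded grid cell, so their populations are summed; the two lat/lng combinations with no data end up as `NA`.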


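library(readr)
library(dplyr)    # mutate(), filter(), group_by(), summarize()
library(tidyr)    # complete()
library(ggplot2)  # ggplot(), geom_line(), ggsave()
# Stack the two population grids into one table
read_csv('../data/moreplots_GEOSTAT_grid_POP_1K_2011_V2_0_1.csv') %>%
  rbind(read_csv('../data/moreplots_JRC-GHSL_AIT-grid-POP_1K_2011.csv') %>%
          mutate(TOT_P_CON_DT='')) %>%
  # Parse the grid coordinates out of GRD_ID; W means a negative value
  mutate(lat = as.numeric(gsub('.*N([0-9]+)[EW].*', '\\1', GRD_ID))/100,
         lng = as.numeric(gsub('.*[EW]([0-9]+)', '\\1', GRD_ID)) *
           ifelse(gsub('.*([EW]).*', '\\1', GRD_ID) == 'W', -1, 1) / 100) %>%
  filter(lng > 25, lng < 60) %>%
  # Sum the population within each rounded grid cell
  group_by(lat=round(lat, 1), lng=round(lng, 1)) %>%
  summarize(value = sum(TOT_P, na.rm=TRUE))  %>%
  ungroup() %>%
  # Fill in missing lat/lng combinations with NA so the lines break there
  complete(lat, lng) %>%
  # One line per latitude row, raised in proportion to population
  ggplot(aes(lng, lat + 5*(value/max(value, na.rm=TRUE)))) +
    geom_line(size=0.4, alpha=0.8, color='#5A3E37', aes(group=lat), na.rm=TRUE) +
    ggthemes::theme_map() +  # requires the ggthemes package
    coord_equal(0.9)
ggsave('../images/moreplots_europe.png', width=10, height=10)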


Interactive ggplots: plotly
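As a minimal sketch of the idea, `plotly::ggplotly()` takes an existing ggplot object and returns an interactive version with hover tooltips and zooming. The example uses the built-in `mtcars` data rather than anything from these slides, and it assumes the `ggplot2` and `plotly` packages are installed; it is skipped otherwise.

```r
# Assumption: ggplot2 and plotly are installed; otherwise nothing is drawn
has_pkgs <- requireNamespace("ggplot2", quietly = TRUE) &&
  requireNamespace("plotly", quietly = TRUE)

if (has_pkgs) {
  library(ggplot2)
  # An ordinary static ggplot
  p <- ggplot(mtcars, aes(wt, mpg, colour = factor(cyl))) +
    geom_point()
  # Convert it to an interactive htmlwidget
  widget <- plotly::ggplotly(p)
  if (interactive()) print(widget)
}
```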


Interactive ggplots: ggiraph (-> ggplot2-exts.org)

price  rooms  bedr  size  location  date
  795      1    NA    18     75016  2017-05-03
 1240      1    NA    28     75003  2017-05-03
 1860      3     1    65     75011  2017-05-03
  810      1    NA    20     75011  2017-05-03
 1300      2     1    42     75016  2017-05-03
## 'data.frame':    2000 obs. of  9 variables:
##  $ price      : int  795 1240 1860 810 1300 1000 850 900 3100 995 ...
##  $ photo      : Factor w/ 166 levels "/images/visuel-annonce-nophoto.jpg",..: 93 149 44 100 1 127 142 101 133 124 ...
##  $ description: Factor w/ 199 levels "Paris 10e (75010). A 3 mn métro Strasbourg Saint Denis et proche métro Château d'Eau. Studio 19 m² au 3ème étage : pièce avec c"| __truncated__,..: 91 161 21 19 83 94 126 182 8 59 ...
##  $ link       : Factor w/ 200 levels "http://www.pap.fr/annonce/immobilier-location-appartement-paris-75-g439-40-annonces-par-page-4-r168106171",..: 98 117 91 102 95 109 114 103 111 108 ...
##  $ rooms      : int  1 1 3 1 2 1 1 1 6 1 ...
##  $ bedr       : int  NA NA 1 NA 1 NA 1 NA 4 NA ...
##  $ size       : int  18 28 65 20 42 24 21 17 146 36 ...
##  $ location   : int  75016 75003 75011 75011 75016 75016 75018 75007 75010 75015 ...
##  $ date       : Factor w/ 1 level "2017-05-03": 1 1 1 1 1 1 1 1 1 1 ...
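A data frame like this one can be turned into an interactive scatterplot with ggiraph. The sketch below is an illustration rather than the code behind the original slide: it rebuilds only the five rows shown above, and the names `flats`, `tip` and `id` are hypothetical. It assumes `ggplot2` and `ggiraph` are installed and skips the plot otherwise.

```r
# The five listings shown in the table above
flats <- data.frame(
  price    = c(795, 1240, 1860, 810, 1300),
  rooms    = c(1, 1, 3, 1, 2),
  size     = c(18, 28, 65, 20, 42),
  location = c(75016, 75003, 75011, 75011, 75016)
)
# Text shown on hover, and an id that links points in the same location
flats$tip <- paste0("price: ", flats$price,
                    " | size: ", flats$size,
                    " | location: ", flats$location)
flats$id <- as.character(flats$location)

# Assumption: ggplot2 and ggiraph are installed
has_pkgs <- requireNamespace("ggplot2", quietly = TRUE) &&
  requireNamespace("ggiraph", quietly = TRUE)

if (has_pkgs) {
  library(ggplot2)
  p <- ggplot(flats, aes(size, price)) +
    ggiraph::geom_point_interactive(aes(tooltip = tip, data_id = id), size = 3)
  widget <- ggiraph::girafe(ggobj = p)
  if (interactive()) print(widget)
}
```

The `tooltip` aesthetic supplies the hover text, and points that share a `data_id` are highlighted together.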


Boxplot Variations: The Pirate Plot

- points, symbols representing the raw data (jittered horizontally)
- bar, a vertical bar showing central tendencies
- bean, a smoothed density curve showing the shape of the distribution (inspired by Kampstra and others (2008))
- inf, a rectangle representing an inference interval (e.g., a Bayesian Highest Density Interval or a frequentist confidence interval)
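To see what each of the four elements is, they can be computed by hand for a single group in base R. This is a simplified sketch, not how `pirateplot()` is implemented: it uses the built-in `ChickWeight` data, a default kernel density for the bean, and a 95% t-based confidence interval for the inference band.

```r
# Weights of chicks on diet 1 (built-in dataset)
x <- ChickWeight$weight[ChickWeight$Diet == 1]

pts_x <- jitter(rep(1, length(x)), amount = 0.1)  # points: horizontal jitter
avg   <- mean(x)                                  # bar: central tendency
bean  <- density(x)                               # bean: smoothed density
inf   <- t.test(x)$conf.int                       # inf: 95% CI of the mean

# Draw the raw points
plot(pts_x, x, xlim = c(0.5, 1.5), pch = 16, col = gray(0, 0.3),
     xaxt = "n", xlab = "Diet 1", ylab = "weight")
# Bean: the density mirrored around x = 1, scaled to a width of 0.3
w <- 0.3 * bean$y / max(bean$y)
polygon(c(1 - w, rev(1 + w)), c(bean$x, rev(bean$x)), border = "gray40")
# Inference band and a line at the mean
rect(0.8, inf[1], 1.2, inf[2], col = gray(0.5, 0.3), border = NA)
segments(0.8, avg, 1.2, avg, lwd = 3)
```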


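library(yarrr)  # provides pirateplot(), piratepal() and the movies dataset
# Write the plot to a PNG file instead of the screen
png(filename = "../images/moreplots_pirateplot_movie.png", width = 800, height = 450)
pirateplot(formula = time ~ genre + sequel,
           # Keep three genres and drop movies without a running time
           data = subset(movies,
                         genre %in% c("Action", "Adventure", "Comedy") &
                         time > 0),
           main = "Movie running times",
           theme = 2,
           gl.col = gray(.7),                    # gridline color
           inf.f.col = piratepal("basel")[1:3],  # fill colors of the inference bands
           bean.f.o = .1,                        # bean fill opacity
           point.o = .05,                        # point opacity
           avg.line.o = 0                        # hide the average line
           )
dev.off()  # close the PNG device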
## png 
##   2
