library(tidyverse)

Some RStudio tricks

  1. Turn off automatic workspace saving: under Tools -> Global Options, select Never for “Save workspace to .RData on exit”.

  2. Ctrl+I for (magic) indenting.

  3. Factor vs numeric

junk <- tibble(
  eyecolorS = sample(x = c("blue", "green", "brown", "purple"), size = 10, replace = TRUE),
  eyecolorF = factor(eyecolorS)
)

junk
## # A tibble: 10 x 2
##    eyecolorS eyecolorF
##    <chr>     <fct>    
##  1 blue      blue     
##  2 green     green    
##  3 purple    purple   
##  4 blue      blue     
##  5 green     green    
##  6 purple    purple   
##  7 blue      blue     
##  8 green     green    
##  9 blue      blue     
## 10 brown     brown
as.numeric(junk$eyecolorS)
## Warning: NAs introduced by coercion
##  [1] NA NA NA NA NA NA NA NA NA NA
as.numeric(junk$eyecolorF)
##  [1] 1 3 4 1 3 4 1 3 1 2
levels(junk$eyecolorF)
## [1] "blue"   "brown"  "green"  "purple"
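The numbers returned for the factor are positions in levels(), not values. That matters most when a factor's labels happen to look like numbers. A minimal sketch, using a made-up vector `sizes` that is not part of the example above:

```r
# Hypothetical factor whose labels look like numbers
sizes <- factor(c("10", "200", "10", "35"))

levels(sizes)                               # levels are sorted as strings

codes  <- as.numeric(sizes)                 # integer codes: positions in levels(sizes)
values <- as.numeric(as.character(sizes))   # the numbers that the labels spell out

codes
values
```

If you want the numbers the labels represent, going through as.character() first is the usual route.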

group_by, mutate and summarize

The most important part of these chapters is the use of group_by() and summarize(), which together allow you to “collapse” a big dataset full of individual observations into a smaller dataset where each row represents a set of similar observations.

A simple example: find the mean duration of flights (from New York airports) by carrier

library(tidyverse)
library(nycflights13)
(by_carrier <- flights %>%
  group_by(carrier) %>%
  summarize(meanDuration = mean(sched_arr_time - sched_dep_time, na.rm = TRUE)))
## # A tibble: 16 x 2
##    carrier meanDuration
##    <chr>          <dbl>
##  1 9E             199. 
##  2 AA             255. 
##  3 AS             311. 
##  4 B6              72.2
##  5 DL             249. 
##  6 EV             178. 
##  7 F9             238. 
##  8 FL             213. 
##  9 HA             517. 
## 10 MQ             177. 
## 11 OO             159. 
## 12 UA             232. 
## 13 US             166. 
## 14 VX             332. 
## 15 WN             190. 
## 16 YV             169.
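One caveat about the numbers above: in nycflights13, sched_dep_time and sched_arr_time are clock times stored as HHMM integers (1530 means 3:30 pm), so subtracting them directly does not give minutes. Here is a hedged sketch of one possible fix, using a hypothetical helper hhmm_to_minutes() and a made-up toy tibble. It ignores overnight flights and time zone changes, so treat it as a starting point only.

```r
library(dplyr)

# Hypothetical helper: convert an HHMM clock time (e.g. 1530) to minutes after midnight
hhmm_to_minutes <- function(hhmm) (hhmm %/% 100) * 60 + hhmm %% 100

# Made-up rows with the same column names as the relevant flights columns
toy <- tibble(
  carrier        = c("AA", "AA", "B6"),
  sched_dep_time = c(900, 1330, 1759),
  sched_arr_time = c(1130, 1605, 1915)
)

toy_by_carrier <- toy %>%
  group_by(carrier) %>%
  summarize(
    meanDuration = mean(
      hhmm_to_minutes(sched_arr_time) - hhmm_to_minutes(sched_dep_time),
      na.rm = TRUE
    )
  )
toy_by_carrier
```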

And the coolest thing is the “pipe operator” (%>%), which passes the output of the function on its left as the first argument of the function on its right, all in a pleasing and somewhat comprehensible manner. The cool part is that the last pipe can pass its result to ggplot(); from there on, plot layers are joined with + rather than %>%.
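As a small sketch of what the pipe does (the vector `x` is made up for illustration), the two computations below are the same; only the reading order differs.

```r
library(dplyr)   # attaches the %>% operator

x <- c(4, 9, 16)

nested <- round(mean(sqrt(x)), 1)                # read from the inside out
piped  <- x %>% sqrt() %>% mean() %>% round(1)   # read from left to right

identical(nested, piped)
```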

# here's the line of code from above ...
## you might want to ask about this factor ... levels ... order business ...
by_carrier %>%
  mutate(
    Carrier = reorder(factor(carrier), meanDuration),
    Carrier.OLD.FASHIONED = factor(carrier, levels = by_carrier$carrier[order(by_carrier$meanDuration)])
  ) %>%
  ggplot(aes(x = Carrier, y = meanDuration)) +
  geom_col(fill = "hotpink2")

## Or skipping the re-ordering of carrier
#by_carrier %>%
#  ggplot(aes(x=carrier,y=meanDuration)) + geom_col(fill='rosybrown2') 
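About that factor / levels / order business: ggplot draws a discrete axis in the order of the factor's levels, so re-ordering the levels re-orders the bars. A minimal sketch on a made-up three-row data frame (values borrowed from the table above) suggests that the reorder() version and the old-fashioned version end up with the same level order:

```r
# Made-up stand-in for by_carrier
toy <- data.frame(
  carrier      = c("AA", "B6", "HA"),
  meanDuration = c(255, 72.2, 517)
)

# Levels sorted by meanDuration, two ways
new_way <- reorder(factor(toy$carrier), toy$meanDuration)
old_way <- factor(toy$carrier, levels = toy$carrier[order(toy$meanDuration)])

levels(factor(toy$carrier))   # default: alphabetical
levels(new_way)               # ordered by meanDuration
levels(old_way)               # same order, built by hand
```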

Segments

Since speed = distance / time, a reasonable way of examining the relationship between flight time and air speed is shown below. The geom used is geom_segment(). Let’s reverse engineer it.
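This is not the answer to the exercise, only a minimal sketch of how geom_segment() is called. The data frame and its values are made up; the point is that each segment needs a start (x, y) and an end (xend, yend).

```r
library(ggplot2)

# Made-up flights: distance in miles, air time in minutes
toy <- data.frame(
  dest     = c("BOS", "ORD", "LAX"),
  distance = c(187, 733, 2475),
  air_time = c(40, 115, 330)
)
toy$speed <- toy$distance / (toy$air_time / 60)   # miles per hour

# One horizontal segment per row, from (0, speed) to (air_time, speed)
p <- ggplot(toy, aes(x = 0, xend = air_time, y = speed, yend = speed)) +
  geom_segment() +
  labs(x = "air time (minutes)", y = "speed (mph)")
p
```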

A box plot example

Dates and times are as tricky in R as they are in preschool. Ya’ know – 7 of these, 60 of those, 12 of the other thing and sometimes 30 sometimes 31 or 28 or even 29 of that middle thing. One venerable strategy is to just convert everything to integer minutes or days and let the reader deal with it, but that’s kind of cowardly here in the 21st Century.

Generally the “easiest” way to deal with dates and times in R is to construct a variable of the mnemonically named type POSIXlt (or POSIXct … it’s easy to go back and forth between the two). These two data types hold, in a single column, the year, month, day, hour, minute, second, and timezone, and provide “easy” ways of extracting or computing the bit that you want.
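A small base R sketch of going back and forth between the two (the date itself is arbitrary):

```r
# POSIXct: one number per date-time, seconds since 1970-01-01 UTC
ct <- as.POSIXct("1911-11-01 08:30:00", tz = "UTC")

# POSIXlt: the same instant, stored as a list of named components
lt <- as.POSIXlt(ct)

lt$hour          # hour of the day
lt$year + 1900   # years are stored as an offset from 1900
lt$mon + 1       # months are stored as 0-11

# And back again
as.POSIXct(lt) == ct
```

POSIXct is usually the more convenient of the two as a tibble column because it is a plain numeric vector underneath.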

But wait: POSIXlt and POSIXct are old base R concepts. The tidyverse has a package called ‘lubridate’ which gives everything a new name but pretty much does the same thing.

In lubridate there is a command called make_datetime() (the counterpart of base R’s ISOdatetime()). You use it to create a “dttm” object (which is a POSIXct under the hood), generally as a column of your tibble, and you can then use that dttm object to compute dates using arithmetic or extract human-friendly renderings. Its sibling make_date() builds a plain Date when you don’t need the time of day.

For example with R’s help, you can become what’s known as an “autistic calendar savant” (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1951792/)

with the following command:

wday(make_date(1911,11,1))

You’ll have to convert numbers like “4” into concepts like “Wed” (by default wday() counts Sunday as day 1), but you’re up to that.
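A short sketch of the lubridate calls mentioned so far, plus wday()'s label argument, which does the number-to-name conversion for you. The specific date and time are arbitrary.

```r
library(lubridate)

d  <- make_date(1911, 11, 1)              # a Date
dt <- make_datetime(1911, 11, 1, 8, 30)   # a dttm (POSIXct), UTC unless tz is given

wday(d)                  # numeric day of the week
wday(d, label = TRUE)    # ordered factor with a readable label

d + days(30)             # date arithmetic
hour(dt)                 # extract one component
```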

Getting finally to the point…

For the boxplot below, you’ll need to work with the day of the week. In the flights dataset, there are a couple of ways of extracting the day of the week from the information given. Any way you want to do it is fine with me.

Note that lubridate needs to be loaded with library(lubridate). It plays well with the tidyverse, but library(tidyverse) does not attach it (at least not in the version used here).
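As a hedged sketch of two of those ways, here is a made-up tibble with the same column names as the date columns in flights (year, month, day, and the dttm column time_hour). The real time_hour uses New York time; this toy version uses UTC.

```r
library(dplyr)
library(lubridate)

# Made-up rows shaped like the date columns of nycflights13::flights
toy <- tibble(
  year      = c(2013, 2013, 2013),
  month     = c(1, 1, 7),
  day       = c(1, 5, 4),
  time_hour = make_datetime(year, month, day, hour = c(5, 13, 18))
)

toy <- toy %>%
  mutate(
    dow_from_parts = wday(make_date(year, month, day), label = TRUE),  # from year/month/day
    dow_from_dttm  = wday(time_hour, label = TRUE)                     # from the dttm column
  )
toy
```

Either column can then go on the x axis of a boxplot.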

OK, with that long-winded “clue”, let’s reverse engineer the graph below:


Plotting with text

An excellent and often overlooked way to cram more information into a plot is to use text in place of plotting symbols.
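A minimal sketch of the idea, on made-up data: map a column to the label aesthetic and use geom_text() where you would otherwise use geom_point(). The nFlights values are invented for illustration.

```r
library(ggplot2)

toy <- data.frame(
  carrier      = c("AA", "B6", "HA", "UA"),
  meanDuration = c(255, 72.2, 517, 232),
  nFlights     = c(120, 300, 15, 210)   # made-up counts
)

# Each observation is drawn as its carrier code instead of a dot
p <- ggplot(toy, aes(x = nFlights, y = meanDuration, label = carrier)) +
  geom_text()
p
```

The plot now shows three variables at once: two positions plus the identity of each point.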