The bootstrap

Today we shall endure a demonstration of the bootstrap and of parallel processing. These are two useful things that are intellectually unrelated – and yet go together like science and technology.

The bootstrap is a statistical technique made popular by Bradley Efron in 1979. The idea is simple but powerful: it allows the lazy scientist to use the massive free computing power to which we are entitled in place of solving very difficult math problems. Parallel processing is a clever computing trick that allows us to harness all that free computing power by, essentially, driving our car in several lanes simultaneously.

The bootstrap is useful for assessing the uncertainty around estimates of complicated statistics. For example, suppose we wished to put a confidence interval around a TFR estimate. I have no idea how one would do that if one were limited to mere mathematics; it would probably require some fairly untestable assumptions about distributions and a whole lot of worn-out erasers. And if the TFR is too easy, consider the confidence interval around some exotic genomic construction, or a median, or an absolute value. You get the idea: demographers like to compute various population indexes and functions of population indexes, and putting error bars around such things is really hard if all you have to work with is mathematics.

The bootstrap relies on a clever insight (plus some gnarly proofs): a sample contains not only the statistic that you want to compute but also a pant load of information about the distribution of the universe from which it was drawn. And further, we can unlock that information by drawing lots of samples with replacement from the original sample, a.k.a. bootstrap samples.
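
To make the idea concrete before we get to the TFR, here is a minimal sketch that bootstraps the mean of an invented vector. The names x and bootMeans and the numbers are made up purely for illustration; they have nothing to do with the fertility example below.

x <- rnorm(500, mean = 10, sd = 2)   # pretend this is our one and only sample
# 1000 bootstrap samples, each the same size as x, each drawn with replacement
bootMeans <- replicate(1000, mean(sample(x, size = length(x), replace = TRUE)))
# the middle 95% of the bootstrapped means is a rough confidence interval for the mean
quantile(bootMeans, probs = c(.025, .975))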

Bootstrapping the TFR

To simplify this exercise, let’s reuse the code that we developed so many weeks ago to compute total fertility rates (TFR).

Here is the old code

library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ipumsr)
## Registered S3 methods overwritten by 'ipumsr':
##   method                                 from 
##   format.pillar_shaft_haven_labelled_chr haven
##   format.pillar_shaft_haven_labelled_num haven
##   pillar_shaft.haven_labelled            haven
## NOTE: read_csv() can read compressed (.gz) files directly; this is not the case
## with every R data-input function.
frt <- read_csv(file="/90days/carlm/usa_00161.csv.gz")
## Parsed with column specification:
## cols(
##   YEAR = col_double(),
##   DATANUM = col_double(),
##   SERIAL = col_double(),
##   HHWT = col_double(),
##   REGION = col_double(),
##   GQ = col_double(),
##   PERNUM = col_double(),
##   PERWT = col_double(),
##   SEX = col_double(),
##   AGE = col_double(),
##   FERTYR = col_double()
## )
ddi<- read_ipums_ddi("/90days/carlm/usa_00161.ddi.xml")
frt<-ipums_collect(frt,ddi)
frt <- frt %>% mutate(region=as_factor(REGION))

# reglabs<-read_tsv(file="https://courses.demog.berkeley.edu/mason213/data/regionLabels.tsv")
# ## expect a warning message about the "second" column.
# frt<-frt %>% mutate(
#   region=factor(REGION,levels=reglabs$REGION,labels=reglabs$CensusName)
# )


ASFRs<-frt %>% filter(! is.na(AGE), FERTYR %in% c(1,2)) %>% group_by(AGE,region) %>% 
  summarize(numer=sum(PERWT * (FERTYR == 2)),
            denom=sum(PERWT),
            asfr=numer/denom)  
  
TFRs<- ASFRs %>%  group_by(region) %>% summarize(tfr=sum(asfr))

ASFRs %>% ggplot(aes(x=AGE,y=asfr,color=region)) + geom_point(alpha=.2) + geom_smooth(se=FALSE,method='loess')
## Don't know how to automatically pick scale for object of type haven_labelled. Defaulting to continuous.

TFRs<- TFRs %>% mutate(region=fct_reorder(region,tfr))
print(TFRs)
## # A tibble: 9 x 2
##   region                     tfr
##   <fct>                    <dbl>
## 1 New England Division      1.63
## 2 Middle Atlantic Division  1.68
## 3 East North Central Div.   1.90
## 4 West North Central Div.   2.07
## 5 South Atlantic Division   1.78
## 6 East South Central Div.   1.85
## 7 West South Central Div.   2.00
## 8 Mountain Division         1.91
## 9 Pacific Division          1.71
ggplot(data=TFRs) + geom_point(aes(x=region,y=tfr)) + coord_flip()

Nice as far as it goes … but how about some confidence bounds around those TFRs? Without confidence bounds, are we even doing science?

The bootstrap involves re-sampling the data in a special but straightforward way and then computing the statistic of interest (the TFR) for each new sample.

  • write a function that computes a TFR for each region when passed a tibble that looks like frt
getTfr <- function(NOTfrt){
  # expects a tibble like frt; returns the TFR of each region
  res <- NOTfrt %>% filter(! is.na(AGE), FERTYR %in% c(1,2)) %>% group_by(AGE,region) %>% 
    summarize(numer=sum(PERWT * (FERTYR == 2)),
              denom=sum(PERWT),
              asfr=numer/denom)  %>%
    group_by(region) %>% summarize(tfr=sum(asfr))
  
  return(res)
}
TFRs<-getTfr(frt)
print(TFRs)
## # A tibble: 9 x 2
##   region                     tfr
##   <fct>                    <dbl>
## 1 New England Division      1.63
## 2 Middle Atlantic Division  1.68
## 3 East North Central Div.   1.90
## 4 West North Central Div.   2.07
## 5 South Atlantic Division   1.78
## 6 East South Central Div.   1.85
## 7 West South Central Div.   2.00
## 8 Mountain Division         1.91
## 9 Pacific Division          1.71

Now we’ll use that function to run a bootstrap. The thing to keep your eye on is the sample_n() function. Every time sample_n() is called, a new resample of frt is drawn. Obviously the resample is going to be similar to the original sample, but it is not going to be identical, because some rows will be missing and others will be duplicated.

Each time we draw a resample, we then compute the TFR for each region implied by that resample.
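
If you want to convince yourself of that, here is a small sanity-check sketch (not part of the bootstrap itself). It draws one set of resampled row numbers and checks how much duplication and omission we get; for a large sample, both shares come out to roughly 1/e, i.e. about 37%.

idx <- sample(nrow(frt), size = nrow(frt), replace = TRUE)   # row numbers of one resample
mean(duplicated(idx))                 # share of draws that repeat an earlier draw, roughly 0.37
mean(!(seq_len(nrow(frt)) %in% idx))  # share of original rows never drawn, also roughly 0.37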

## Slow and boring but effective bootstrap


system.time({
  tfrBS0<-NULL
  for(i in 1:100){
    bstfr<-  sample_n(frt,size=nrow(frt),replace=TRUE) %>% getTfr()

    tfrBS0<-bind_rows(tfrBS0,bstfr)
  }
  
})
##    user  system elapsed 
## 104.312   9.513 113.830
head(tfrBS0,n=20)
## # A tibble: 20 x 2
##    region                     tfr
##    <fct>                    <dbl>
##  1 New England Division      1.66
##  2 Middle Atlantic Division  1.72
##  3 East North Central Div.   1.89
##  4 West North Central Div.   2.04
##  5 South Atlantic Division   1.77
##  6 East South Central Div.   1.85
##  7 West South Central Div.   1.98
##  8 Mountain Division         1.87
##  9 Pacific Division          1.75
## 10 New England Division      1.57
## 11 Middle Atlantic Division  1.70
## 12 East North Central Div.   1.93
## 13 West North Central Div.   2.08
## 14 South Atlantic Division   1.80
## 15 East South Central Div.   1.92
## 16 West South Central Div.   1.99
## 17 Mountain Division         1.97
## 18 Pacific Division          1.74
## 19 New England Division      1.57
## 20 Middle Atlantic Division  1.71

Note that that code took a rather long time to run. We’ll deal with that speed problem later.

And let’s draw some plots and figure out what’s going on.

Each bootstrap trial produces a set of TFRs. Let’s take a look at the histograms.

ggplot(data=tfrBS0) + geom_histogram(aes(x=tfr)) + facet_wrap(~region)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

100 trials is not that many, but we can already see a hint of the bell-shaped distribution that we like to see in situations like this.

Let’s convince ourselves that these histograms show us that the most likely outcomes are near the center and the less likely outcomes are near the extremes…

Then let’s consider how we might come up with what we traditionally call “confidence bounds”.

# how about the 2.5 and 97.5 percentiles?
critvals <- tfrBS0 %>% group_by(region) %>% summarize(c025=quantile(tfr,prob=.025),
                                                      c975=quantile(tfr,prob=.975))
# for optimal presentation, we'll want to sort regions by TFR
tfrBS0 <- tfrBS0 %>% mutate(region =fct_reorder(region,tfr,.fun=mean))
ggplot(data=tfrBS0) + geom_histogram(aes(x=tfr)) + 
  geom_vline(data=critvals,aes(xintercept=c025),color='red',linetype=2) +
  geom_vline(data=critvals,aes(xintercept=c975),color='red',linetype=2) +
  geom_vline(data=TFRs, aes(xintercept=tfr),color='blue') +
  facet_grid(region~.)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

How about speeding this up

library(foreach)
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
# the foreach structure is a slightly different way to write a for loop.
# note that we do not have to collect the results with bind_rows() inside the loop;
# foreach() returns a list, which we bind into one tibble afterwards.
system.time({
tfrBS1<-foreach(i = 1:100) %do% {
  
  sample_n(frt,size=nrow(frt),replace=TRUE) %>% getTfr()
}
})
##    user  system elapsed 
## 104.904   7.439 112.347
tfrBS1<-bind_rows(tfrBS1)

That doesn’t save us much time (because we aren’t yet parallel processing), but we are now ready to do so. The point of foreach() is that it has a back end called ‘doParallel’ that connects it to a package called ‘parallel’, which does the work of splitting up the job, parceling it out to multiple processor cores, and then recombining the results.

The additional steps are:

  • load the doParallel library
  • register the right number of cores
  • modify the loop syntax just a tiny bit (%do% becomes %dopar%)

library(foreach)
library(doParallel)
## Loading required package: iterators
## Loading required package: parallel
# the same thing with parallel processing
detectCores()
## [1] 28
numc=9
registerDoParallel(numc)

system.time({
tfrBS1<-foreach(i = 1:1000) %dopar% {
  
  sample_n(frt,size=nrow(frt),replace=TRUE) %>% getTfr()
}
})
##     user   system  elapsed 
## 1067.831  112.911  142.773
tfrBS1<-bind_rows(tfrBS1)
head(tfrBS1)
## # A tibble: 6 x 2
##   region                     tfr
##   <fct>                    <dbl>
## 1 New England Division      1.69
## 2 Middle Atlantic Division  1.63
## 3 East North Central Div.   1.87
## 4 West North Central Div.   2.04
## 5 South Atlantic Division   1.76
## 6 East South Central Div.   1.85

And the graphs…

critvals <- tfrBS1 %>% group_by(region) %>% summarize(c025=quantile(tfr,prob=.025),
                                                      c975=quantile(tfr,prob=.975))
tfrBS1 <- tfrBS1 %>% mutate(region =fct_reorder(region,tfr,.fun=mean))
ggplot(data=tfrBS1) + geom_histogram(aes(x=tfr)) + 
  geom_vline(data=critvals,aes(xintercept=c025),color='red',linetype=2) +
  geom_vline(data=critvals,aes(xintercept=c975),color='red',linetype=2) +
  geom_vline(data=TFRs, aes(xintercept=tfr),color='blue') +
  facet_grid(region~.)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

A digression to use clever R tricks

It’s nice to have these delightful bell-shaped curves of bootstrapped statistics. If our goal were just to know the reasonable range of the true TFRs of the Census Regions, then we are done. But what if we wished instead to ask a question like: are the differences between the regions merely due to chance? Could it be that there is no real difference between the TFRs, and we just happened to draw a sample in which there was one?

One way of approaching this is to couch it in terms of the range of the TFRs. Let the null hypothesis be that the range of TFRs (max - min) is 0, and that our observed range, West North Central (2.07) minus New England (1.63) = 0.438 (using the unrounded TFRs), is just due to chance.

We can now ask the question: at what level of significance can we reject the null hypothesis that our observed range is just the result of randomness? Clever readers will recognize this as the “p-value”.

If this were not a digression intended to remind us of some nifty R tricks that we have begun to forget, we would probably go back to the “slow and boring…” chunk (or the parallel processing example) and include the computation of the range in the code executed with each trial. But that would simply waste 17.584 seconds and subvert my pedagogical intent. So instead, let’s compute the range from the tfrBS1 tibble and then …

Reverse Engineer the graph below

Steps:

  • reshape the data so that all regional TFRs for a given trial form one row
  • mutate the statistic we want
  • display the CDF of the new bootstrapped statistic; stat_ecdf() is a reasonable option (one possible sketch follows this list)
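
For what it’s worth, here is one possible sketch of those steps, not necessarily the code behind the graph I have in mind. It assumes every one of the 1000 bootstrap trials in tfrBS1 contributed exactly 9 regional rows in order (so a trial id can be reconstructed with rep()); the trial column and the names tfr_wide, region_cols and range are invented for this illustration.

# reconstruct a trial id: 9 regions per bootstrap trial, 1000 trials (an assumption)
tfr_wide <- tfrBS1 %>%
  mutate(trial = rep(1:1000, each = 9)) %>%
  pivot_wider(names_from = region, values_from = tfr)

# the statistic we want: the range (max - min) of the 9 regional TFRs in each trial
region_cols <- setdiff(names(tfr_wide), "trial")
tfr_wide$range <- apply(tfr_wide[, region_cols], 1, max) -
                  apply(tfr_wide[, region_cols], 1, min)

# the CDF of the bootstrapped ranges, with the observed range of 0.438 marked
ggplot(tfr_wide) + stat_ecdf(aes(x = range)) +
  geom_vline(xintercept = 0.438, color = 'blue', linetype = 2)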

Assignment for this week

For this week, use your Brazil infectious disease data to estimate a statistic and then bootstrap it to produce a histogram of uncertainty and a 95% confidence interval.

For your statistic you can use anything you’d like as long as it uses a sizable chunk of your data, and is not something based on or influenced by extreme values. Here are some not terribly creative examples of stuff you could do:

  • The difference between the number of domestic violence cases in the richest 25 percent of municipalities and the poorest 25 percent of municipalities.
  • The year to year change in the median income of the 50 percent of municipalities with the highest female-male gap in dengue cases.
  • The median increase in vaccination rates of municipalities that saw at least one yellow fever case in the previous year.

Profound question

In the TFR example, our data are a sample of 15-45 year old US resident women. It therefore makes perfect sense that we explore the possibility that our sample could differ from other possible samples and from the “true” value in the population as a whole. How big our sample is and how much variation it shows will determine how certain we are that our sample is representative.

But the Brazilian disease data are not a sample. They represent a complete enumeration of municipalities. There are surely errors and omissions in the data. But omissions, typos, auto-corrections, misunderstandings, intentional falsifications and the mysteries of Latin1 are NOT addressed by the bootstrap. The bootstrap is about sampling variation and nothing more. So does this assignment make any sense?