Glitch alert

Make sure that your ~/90days link points to /90days/your-userid.

Introduction

During this week and next, we will use the American Community Survey (ACS) to compute age-specific fertility rates. Fertility rates are the kind of thing demographers are always computing, so it’s kind of exciting that we are finally doing it too. The tidyverse/R tools that you have recently become experts in will make the computations surprisingly easy – but there are a few “issues” to come to terms with – how to get the data, how to use the data on the Demography Lab servers, and what the data actually mean – before one can simply publish.

This week

The plan for this week and next is to

Fortunately, some of the above steps depend on subsequent steps so a certain amount of pedagogically beneficial frustration and confusion are sure to be forthcoming.

By next Monday, I would like for each group/individual to have a graph and a short explanation. We’ll then “crowd source” some possible next steps – and then take another week to refine it. This is essentially a training wheels version of the two projects that we will do with Ayesha and Dennis later in the semester.

Projects in Rstudio

Since we are about to begin our first real project in Rstudio, this is a really good time to start using Rstudio’s concept of “projects”. An Rstudio “project” is a directory in which all the bits of your project (except huge chunks of easily downloaded data) should reside. You can create a new project via the File->New Project menu item.

  • Start a New Project in a brand new working directory.

  • Select Empty Directory to create a new project in a (new) empty directory.

The new project and the directory in which it lives will have the same name, so choose something meaningful like “213/fertility”. Since it’s just a directory, you can move it around later.

Fertility rates

Like so many things in demography, there are two flavors of total fertility rates (TFR): a cohort flavor and a period flavor. And as is so often the case, the former is easier to understand, but the latter is the one that we generally use.

A cohort TFR could be computed by following a birth cohort of women through their reproductive years and – at each age – computing the average annual fertility rate. When the cohort turns 45 years old, we would have a series of age-specific fertility rates (ASFRs), one for each age, and the sum of the ASFRs = TFR. Conceptually, this is not difficult, but it’s a bit tricky from an execution standpoint unless you are either very patient or blessed with predecessors who have been diligently collecting data in the hopes that someday you would turn up and know what to do with it.

Practicality aside, the cohort TFR also falls flat if we want to know something about what’s happening with fertility in the present moment – across all ages of women. For example, did fertility increase between last year and this year? The other kind of TFR is the period TFR. The period flavor is based on the “synthetic cohort” idea: instead of waiting for a cohort of women to age through their reproductive life, we look at all women who are x years old in a single calendar year. In other words, we compute annual fertility rates for women who are 15 years old in, say, 2017, for women who are 16 years old in 2017, and so on. Then we add up those age-specific fertility rates (ASFRs) and call the sum a period TFR for 2017.

The period TFR more or less estimates the number of children a woman will have if:

  1. She lives through all of her reproductive years
  2. At each age x, she behaves (in terms of fertility) exactly the same way as women who are x years old behave this year.

The TFR number is often reported in the media as the “number of children per woman”.

But as demographers, we prefer to see it as:

\[\begin{eqnarray}
\label{eq:tfr}
\text{ASFR}_{a} &=& \frac{\sum \text{births to women of age } a}{\sum \text{woman-years of exposure at age } a}\\
\text{TFR} &=& \sum_{a=15}^{45} \text{ASFR}_{a} = \sum_{a=15}^{45} E(\text{births}_{a} \mid \text{survival through age } a)\\
a &=& \text{single year of age}
\end{eqnarray}\]

In the equations above, ASFRs are ratios of events (births – at a particular maternal age) to person-years-lived while at risk of the event (being a woman of a particular age). Rates generally work this way: events/time spent at risk.
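The arithmetic in the equations above can be sketched with made-up numbers (the ages, births, and exposure counts below are invented purely for illustration):

```r
library(dplyr)

## Toy data: invented births and woman-years of exposure at three ages
toy <- data.frame(
  age      = c(20, 21, 22),
  births   = c(80, 90, 85),
  exposure = c(1000, 1000, 1000)
)

## ASFR at each age = events / person-years at risk
toy <- toy %>% mutate(asfr = births / exposure)

## A (toy) TFR is just the sum of the single-year ASFRs
sum(toy$asfr)
## 0.255 -- children per woman over these three ages only
```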

Subtleties of computing ASFRs with the ACS

The American Community Survey (ACS) provides age and sex, as well as a large number of other characteristics, for each person in the sample (which is 1 percent of the US population). Crucially, it also asks whether or not each person has had a child in the last year (actually, the Census Bureau is clever enough to ask this question only of women). So this is good – but it could be better.

Some details that we shall need to consider are the following:

  1. The data are not a simple random sample. We must therefore use weights.

  2. It is feasible and therefore desirable to use single year age cohorts for women at risk of birth.

  3. Mortality could have an effect–some women might have died before the Census Bureau could interview them. Their births and their person-years-at-risk are lost to us.

  4. Twin and multiple births pose problems (for parents as well as demographers). The possibility of multiple single births during the (one year) period of observation also creates problems for all concerned.

  5. We are calculating a period TFR rather than a cohort TFR, consequently, we would like to observe people over a particular calendar year. In a perfect world, that year of observation would also correspond exactly to the years of ages of those under observation. The ACS does not work that way:
    • The ACS surveys are collected throughout the year. The ACS sample is divided into 12 bits and surveys are administered to one bit every month over the year.
    • Relatively few women filled out the ACS survey on their birthday.
  6. By phrasing the fertility question as they do – “did the person have a birth in the last year” – a bit of uncertainty is introduced:
    • If my baby was born on the first of this month last year, does that count even if today is the 30th?
    • If I started filling out the survey two months ago, when does the year clock start?
    • If my baby is particularly cute, can I count her even if she is really 13 months old now?

You deserve a simpler world than the one we have to study.

While we will explicitly deal with sample weights, it will be prudent to assume our way out of many of the other difficulties.

Pause briefly to consider the implications.

We would like to calculate ASFRs/TFRs for women disaggregated by some interesting characteristic.

The point of this exercise is data exploration – and data exploration, as Hadley Wickham points out – is largely about finding ways in which variables of interest covary. For us, this means looking for some characteristics of women that might reasonably be thought to correlate with fertility behavior over the (synthetic) life course.

The ACS asks lots of questions (variables) which might have an impact on, or be affected by, fertility. This would be a good time to visit http://ipums.org to see if any of those questions might covary with fertility. When you get to ipums.org, note that there are lots of IPUMS projects, including census data from other countries as well as data on historical GIS boundaries, health, time use, environment, and higher education. For this week we’re only going to bother with “IPUMS USA: U.S. Census and American Community Survey micro data from 1850 to the present.”

A couple of things to note:

  1. You need an account with IPUMS in order to download data – you should get one.
  2. The interface for building data sets is very good, but requires a little experience to use well.
    • Select samples – there are decennial censuses, full-count censuses, and American Community Surveys (ACS). We want to use a recent ACS.
    • Variables are either “household” or “person”. Household-level variables are things like geography, which are the same for all individuals in the household.
    • The weighting variables will be included in your extract; you do not need to select them.
    • Once variables and samples are selected, you can “view data in cart” in order to make final adjustments such as:
      • The format you want the data to arrive in
      • The cases you want included and excluded – e.g. include only females age 15-45
    • Once you submit your request, it can take minutes or hours to process, depending on the size of your data set and the whim of god.
  3. IPUMS is an awesome project of the Minnesota Population Center. In addition to never using it for evil, you must also always cite it reverently as:
Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas, 
and Matthew Sobek. IPUMS USA: Version 8.0 [dataset]. Minneapolis, MN: IPUMS, 2018. 
https://doi.org/10.18128/D010.V8.0

IPUMS is oxygen for social science research. It has tons of high quality data that you can easily download for nothing in minutes. For millennials this might just seem like the sort of thing to which you are generationally entitled – but frankly, I’m not sure you deserve it. But you do deserve a simpler world.

A real IPUMS (ACS) dataset

Reading in the data …

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## NOTE: read_csv() can read compressed (.gz) files directly;
## this is not the case with many other R input functions.
frt <- read_csv(file="/90days/carlm/usa_00161.csv.gz")
## Parsed with column specification:
## cols(
##   YEAR = col_double(),
##   DATANUM = col_double(),
##   SERIAL = col_double(),
##   HHWT = col_double(),
##   REGION = col_double(),
##   GQ = col_double(),
##   PERNUM = col_double(),
##   PERWT = col_double(),
##   SEX = col_double(),
##   AGE = col_double(),
##   FERTYR = col_double()
## )
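Before computing anything, it is worth a quick sanity check on the columns we care about. A minimal sketch using an invented miniature data set with the same columns (the SEX and FERTYR codes used here are assumptions – verify them against the codebook):

```r
library(dplyr)

## A tiny fake extract with the same columns as the real one (values invented)
frt_demo <- data.frame(
  SEX    = c(1, 2, 2, 2),   # assuming 2 codes female
  AGE    = c(30, 14, 22, 22),
  FERTYR = c(0, 0, 1, 2)    # assuming 1 = no birth, 2 = birth in the last year
)

## Restrict to women of reproductive age
women <- frt_demo %>% filter(SEX == 2, AGE >= 15, AGE <= 45)
nrow(women)              # 2 records survive the filter
women %>% count(FERTYR)  # distribution of the fertility question
```

With the real frt, the same filter() and count() calls give a first look at how many women and how many reported births you have to work with.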

A useful digression on factors, levels and labels

You already understand how factors work. One of the slightly barbaric features of our current method of reading data from IPUMS is that certain variables which are naturally factors, e.g. Census Region, come to us as integers. There are lots of ways of dealing with this, but today we’ll work through one of the more primitive but generally useful methods:

Create a file from the data dictionary that can be read in and used to create the factor.

  1. File -> New File -> Text File; we’ll eventually call it regionLabels.tsv.
  2. Cut and paste a chunk of the data dictionary https://courses.demog.berkeley.edu/readings/usa_00161.cbk into the file.
  3. Read the tiny file that we just created into R and use it to set the levels and labels of the REGION variable.
reglabs<-read_tsv(file="regionLabels.tsv")
## Warning: Missing column names filled in: 'X2' [2]
## Parsed with column specification:
## cols(
##   REGION = col_double(),
##   X2 = col_logical(),
##   CensusName = col_character()
## )
## expect a warning message about the "second" column.
frt<-frt %>% mutate(
  region=factor(REGION,levels=reglabs$REGION,labels=reglabs$CensusName)
)
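After a recode like this, it pays to verify that the integer codes and the new labels line up. Here is the same factor(levels=, labels=) idiom in a self-contained sketch (the two codes and names below are stand-ins for the full codebook list):

```r
## Integer codes as they might arrive from IPUMS
REGION <- c(11, 21, 11)

## A stand-in for what regionLabels.tsv supplies
reglabs_demo <- data.frame(
  REGION     = c(11, 21),
  CensusName = c("New England Division", "Middle Atlantic Division")
)

## Map each integer code to its label
region <- factor(REGION,
                 levels = reglabs_demo$REGION,
                 labels = reglabs_demo$CensusName)
table(region)
## New England Division appears twice, Middle Atlantic Division once
```

With the real data, `frt %>% count(REGION, region)` is a quick way to confirm that every integer code landed on the label you intended.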

A note on perwt

perwt is the person weight variable. In IPUMS, as in many other data sets, the weight is the inverse probability of selection. The ACS is a complicated stratified random sample, so not everyone has an equal likelihood of being selected. If everyone did, we would not need weights to make our results representative.

Since probabilities of selection in a 1% sample are about 1/100, the sample weights should be something like 100.

frt %>% group_by(region) %>% summarize(N=sum(PERWT),
                                       mnpwt=mean(PERWT)) 
## # A tibble: 9 x 3
##   region                          N mnpwt
##   <fct>                       <dbl> <dbl>
## 1 New England Division      2953415  108.
## 2 Middle Atlantic Division  8397010  110.
## 3 East North Central Div.   9321346  112.
## 4 West North Central Div.   4199227  114.
## 5 South Atlantic Division  12939864  115.
## 6 East South Central Div.   3853669  114.
## 7 West South Central Div.   8380090  118.
## 8 Mountain Division         4885229  114.
## 9 Pacific Division         11081408  112.
frt %>% summarize(N=sum(PERWT))
## # A tibble: 1 x 1
##          N
##      <dbl>
## 1 66011258
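Putting the pieces together: a weighted ASFR at each age is (weighted births last year)/(weighted women at that age), and the period TFR is the sum over single years of age. A sketch on invented miniature data, under two loudly flagged assumptions – that FERTYR == 2 codes “had a birth in the last year” (check the codebook) and that each woman contributes one person-year of exposure:

```r
library(dplyr)

## Invented miniature data; with the real extract you would start from frt
frt_demo <- data.frame(
  SEX    = c(2, 2, 2, 2),
  AGE    = c(20, 20, 21, 21),
  PERWT  = c(100, 100, 100, 100),
  FERTYR = c(2, 1, 1, 1)    # assuming 2 codes "birth in the last year"
)

## Weighted ASFR by single year of age
tfr_tab <- frt_demo %>%
  filter(SEX == 2, AGE >= 15, AGE <= 45) %>%
  group_by(AGE) %>%
  summarize(asfr = sum(PERWT * (FERTYR == 2)) / sum(PERWT))

## Period TFR = sum of the single-year ASFRs
sum(tfr_tab$asfr)   # 0.5 on this toy data
```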

A figure to reverse engineer

Let’s install noMachine (…by Wednesday?)

Data sets downloaded from IPUMS don’t belong in your Demography Lab home directory. They are large and easily reproduced, and therefore stupid to back up to the cloud every day. Where files like data sets belong is your /90days/ directory. It is an easy place for you to read them from and an easy place for you to put them. Having the noMachine remote desktop client working will be essential for this.

https://lab.demog.berkeley.edu/LabWiki has all the instructions.

Once you have noMachine running, start a web browser in your noMachine desktop (make it Firefox for now) and use the menus to set the default download directory to /hdir/0/your-userid/90days. In Firefox, you can currently do this via the three-horizontal-lines menu in the upper right corner of the Firefox window -> Preferences.

Note that /hdir/0/your-userid/90days is a symbolic link to /90days/your-userid. This should be confusing. Make your instructor explain it to you.
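If the symlink is mysterious, you can inspect it without leaving R – base R’s Sys.readlink() reports where a link points (the path below contains your own userid):

```r
## Where does the ~/90days link actually point?
Sys.readlink("~/90days")
## should print something like "/90days/your-userid"
```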

Once you have configured your browser (the one that runs in noMachine) to save your downloads to 90days, you can download stuff with reckless abandon. Stuff you download will be automatically deleted after a year of complete neglect; reading or changing the file in any way restarts the clock (more on this at the LabWiki site). But more importantly for this project, when you download a data set from IPUMS.org with your thus-configured browser in the noMachine remote desktop, it will wind up in a place from which you can easily read it in Rstudio with something like:

newthing <- read_csv(file="/90days/my-userid/usa_00001.csv.gz")