Make sure that your ~/90days link points to /90days/your-userid.
During this week and next, we will use the American Community Survey (ACS) to compute age-specific fertility rates. Fertility rates are the kind of thing demographers are always computing, so it's kind of exciting that we are finally doing it too. The tidyverse/R tools in which you have recently become expert will make the computations surprisingly easy, but there are a few "issues" to come to terms with before one can simply publish: how to get the data, how to use the data on the Demography Lab servers, and what the data actually mean.
The plan for this week and next is to
Fortunately, some of the above steps depend on subsequent steps, so a certain amount of pedagogically beneficial frustration and confusion is sure to be forthcoming.
By next Monday, I would like for each group/individual to have a graph and a short explanation. We’ll then “crowd source” some possible next steps – and then take another week to refine it. This is essentially a training wheels version of the two projects that we will do with Ayesha and Dennis later in the semester.
Since we are about to begin our first real project in Rstudio, this is a really good time to start using Rstudio's concept of "projects". An Rstudio "project" is a directory in which all the bits of your project (except huge chunks of easily downloaded data) should reside. You can create a new project via the File -> New Project menu item.
Start a New Project in a brand new working directory.
Select Empty Directory to create a new project in a (new) empty directory.
The new project and the directory in which it lives will have the same name, so choose something meaningful like "213/fertility". Since it's just a directory, you can move it around later.
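If you prefer the console to the menus, the usethis package can create a project in one line. This is only a sketch: it assumes usethis is installed on the server, and the path is just an example.

# Menu-free alternative; assumes the usethis package is installed.
# The path is only an example; use your own.
usethis::create_project("~/213/fertility")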
Like so many things in demography, there are two flavors of total fertility rates (TFR): a cohort flavor and a period flavor. And as is so often the case, the former is easier to understand, but the latter is the one that we generally use.
A cohort TFR could be computed by following a birth cohort of women through their reproductive years and, at each age, computing the average annual fertility rate. When the cohort turns 45 years old, we would have a series of age-specific fertility rates (ASFRs), one for each age, and the sum of the ASFRs is the TFR. Conceptually, this is not difficult, but it's a bit tricky from an execution standpoint unless you are either very patient or blessed with predecessors who have been diligently collecting data in the hope that some day you would turn up and know what to do with it.
Practicality aside, the cohort TFR also falls flat if we want to know something about what's happening with fertility in the present moment, across all ages of women. For example, did fertility increase between last year and this year? The other kind of TFR is the period TFR. The period flavor is based on the "synthetic cohort" idea: instead of waiting for a cohort of women to age through their reproductive lives, we look at all women who are x years old in a single calendar year. In other words, we compute annual fertility rates for women who are 15 years old in, say, 2017, for women who are 16 years old in 2017, and so on. Then we add up those age-specific fertility rates (ASFRs) and call the sum the period TFR for 2017.
The period TFR more or less estimates the number of children a woman will have if:
The TFR is often reported in the media as the "number of children per woman".
But as demographers, we prefer to see it as:
\[\begin{eqnarray}
\label{eq:tfr}
\text{ASFR}_{a} &=& \frac{\sum \text{births to women of age } a}{\sum \text{woman-years of exposure at age } a} \\
\text{TFR} &=& \sum_{a=15}^{45} \text{ASFR}_{a} = \sum_{a=15}^{45} E(\text{births}_{a} \mid \text{survival through age } a) \\
a &=& \text{single year of age}
\end{eqnarray}\]
In the equations above, ASFRs are ratios of events (births at a particular maternal age) to person-years lived while at risk of the event (being a woman of a particular age). Rates generally work this way: events divided by time spent at risk.
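To make the arithmetic concrete, here is a minimal sketch with made-up numbers: each ASFR is births divided by woman-years of exposure at that age, and the TFR is simply the sum of the ASFRs.

library(tidyverse)
# Toy example: hypothetical births and exposure for three single years of age
toy <- tibble(
  age      = 20:22,
  births   = c(50, 60, 55),        # births to women of each age
  exposure = c(1000, 1000, 1000)   # woman-years lived at each age
)
toy <- toy %>% mutate(asfr = births / exposure)  # age-specific fertility rates
sum(toy$asfr)                                    # the (toy) TFR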
The American Community Survey (ACS) provides age and sex, as well as a large number of other characteristics, for each person in the sample (which covers about 1 percent of the US population). Crucially, it also asks whether or not each person has had a child in the last year (actually, the Census Bureau is clever enough to ask this question only of women). So this is good, but it could be better.
Some details that we shall need to consider are the following:
The data are not a simple random sample. We must therefore use weights.
It is feasible and therefore desirable to use single year age cohorts for women at risk of birth.
Mortality could have an effect: some women might have died before the Census Bureau could interview them. Their births and their person-years at risk are lost to us.
Twin and multiple births pose problems (for parents as well as demographers). The possibility of multiple single births during the (one year) period of observation also creates problems for all concerned.
While we will explicitly deal with sample weights, it will be prudent to assume our way out of many of the other difficulties by:
Defining "birth" as one or more babies.
Ignoring the possibility of two singleton birth events in a single year (sometimes impolitically known as "Irish twins").
Assuming that the probabilities of dying and of giving birth are not correlated.
Assuming that women’s birthdays are randomly distributed throughout the year.
Defining our ASFRs as applying, in some way, to two calendar years.
Pause briefly to consider the implications.
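Once the ACS extract is read in (below), the "one or more babies" definition boils down to a recode along these lines. This is only a sketch: it assumes the usual IPUMS coding of FERTYR (0 = N/A, 1 = No, 2 = Yes), so check your extract's codebook before relying on it.

# Sketch: recode FERTYR to a 0/1 "gave birth in the last year" flag.
# Assumes FERTYR is coded 0 = N/A, 1 = No, 2 = Yes; verify in the codebook.
frt <- frt %>%
  mutate(birth = case_when(
    FERTYR == 2 ~ 1,        # one or more babies counts as a single birth event
    FERTYR == 1 ~ 0,        # no birth in the last year
    TRUE        ~ NA_real_  # not asked (men, women outside the question's ages)
  ))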
The point of this exercise is data exploration – and data exploration, as Hadley Wickham points out – is largely about finding ways in which variables of interest covary. For us, this means looking for some characteristics of women that might reasonably be thought to correlate with fertility behavior over the (synthetic) life course.
The ACS asks lots of questions (variables) that might have an impact on, or be affected by, fertility. This would be a good time to visit http://ipums.org to see whether any of those questions might covary with fertility. When you get to ipums.org, note that there are lots of IPUMS projects, including census data from other countries as well as data on historical GIS boundaries, health, time use, the environment, and higher education. For this week we're only going to bother with "IPUMS USA: U.S. Census and American Community Survey microdata from 1850 to the present."
A couple of things to note:
Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas, and Matthew Sobek. IPUMS USA: Version 8.0 [dataset]. Minneapolis, MN: IPUMS, 2018. https://doi.org/10.18128/D010.V8.0
IPUMS is oxygen for social science research. It has tons of high quality data that you can easily download for nothing in minutes. For millennials this might just seem like the sort of thing to which you are generationally entitled – but frankly, I’m not sure you deserve it. But you do deserve a simpler world.
Reading in the data …
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## NOTE: you can read compressed (gz) files with read_csv; this is not always
## the case with other R data input functions.
frt <- read_csv(file="/90days/carlm/usa_00161.csv.gz")
## Parsed with column specification:
## cols(
## YEAR = col_double(),
## DATANUM = col_double(),
## SERIAL = col_double(),
## HHWT = col_double(),
## REGION = col_double(),
## GQ = col_double(),
## PERNUM = col_double(),
## PERWT = col_double(),
## SEX = col_double(),
## AGE = col_double(),
## FERTYR = col_double()
## )
You already understand how factors work. One of the slightly barbaric features of our current method of reading data from IPUMS is that certain variables that are naturally factors, e.g. Census Region, come to us as integers. There are lots of ways of dealing with this, but today we'll work through one of the more primitive but generally useful methods:
Creating a file from the data dictionary that can be read in and used to create the factor.
reglabs<-read_tsv(file="regionLabels.tsv")
## Warning: Missing column names filled in: 'X2' [2]
## Parsed with column specification:
## cols(
## REGION = col_double(),
## X2 = col_logical(),
## CensusName = col_character()
## )
## Expect a warning message about the "second" column.
frt<-frt %>% mutate(
region=factor(REGION,levels=reglabs$REGION,labels=reglabs$CensusName)
)
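A quick sanity check that the integer codes and the new factor labels line up is worthwhile:

# Check that REGION codes map to the expected labels
frt %>% count(REGION, region)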
PERWT is the person weight variable. In IPUMS, as in many other data sets, the weight is the inverse of the probability of selection. The ACS is a complicated stratified random sample, so not everyone has an equal likelihood of being included. If everyone did, we would not need weights to make our results representative.
Since the probabilities of selection in a 1% sample are about 1/100, the sample weights should be something like 100:
frt %>% group_by(region) %>% summarize(N=sum(PERWT),
mnpwt=mean(PERWT))
## # A tibble: 9 x 3
## region N mnpwt
## <fct> <dbl> <dbl>
## 1 New England Division 2953415 108.
## 2 Middle Atlantic Division 8397010 110.
## 3 East North Central Div. 9321346 112.
## 4 West North Central Div. 4199227 114.
## 5 South Atlantic Division 12939864 115.
## 6 East South Central Div. 3853669 114.
## 7 West South Central Div. 8380090 118.
## 8 Mountain Division 4885229 114.
## 9 Pacific Division 11081408 112.
frt %>% summarize(N=sum(PERWT))
## # A tibble: 1 x 1
## N
## <dbl>
## 1 66011258
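Looking ahead: combining PERWT with a birth indicator like the one sketched earlier gives weighted ASFRs, and their sum over the reproductive ages gives a period TFR. The sketch below assumes that SEX == 2 codes women in this extract and that a birth variable has been created from FERTYR; treat it as a starting point rather than the finished analysis.

# Sketch: weighted ASFRs by single year of age and their sum (the TFR).
# Assumes SEX == 2 means female and `birth` was created from FERTYR as above.
asfrs <- frt %>%
  filter(SEX == 2, AGE >= 15, AGE <= 45, !is.na(birth)) %>%
  group_by(AGE) %>%
  summarize(
    births   = sum(PERWT * birth),  # weighted births in the last year
    exposure = sum(PERWT),          # weighted woman-years of exposure
    asfr     = births / exposure
  )
sum(asfrs$asfr)  # a rough period TFR, before any refinements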
Data sets downloaded from IPUMS don't belong in your Demography Lab home directory. They are large and easily reproduced, and it is therefore stupid to back them up to the cloud every day. Where files like data sets belong is your /90days/ directory.
https://lab.demog.berkeley.edu/LabWiki has all the instructions.
Once you have noMachine running, start a web browser in your noMachine desktop (make it Firefox for now) and use the menus to set the default download directory to /hdir/0/your-userid/90days. In Firefox, you can currently do this via the three-horizontal-lines menu in the upper right corner of the Firefox window -> Preferences.
Note that /hdir/0/your-userid/90days is a symbolic link to /90days/your-userid. This should be confusing. Make your instructor explain it to you.
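If you want to see the link resolve for yourself, base R can report where a symbolic link points:

# Where does ~/90days actually point? (base R, no packages needed)
Sys.readlink(path.expand("~/90days"))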
Once you have configured your browser (the one that runs in noMachine) to save your downloads to 90days, you can download stuff with reckless abandon. Stuff you download will be automatically deleted after a year of complete neglect; reading or changing a file in any way restarts the clock (more on this at the LabWiki site). But more importantly for this project, when you download a data set from IPUMS.org with your thus-configured browser in the noMachine remote desktop, it will wind up in a place from which you can easily read it into Rstudio with something like:
newthing <- read_csv(file="/90days/my-userid/usa_00001.csv.gz")