Carl Mason
Demography 213
Oct 7 2019
Brazil makes available, to all who can navigate its website, a whole mess of data on the incidence of infectious and other sorts of diseases.
Here is the website: http://www2.datasus.gov.br/DATASUS/index.php?area=0203
A couple of things to note:
Let’s bring down a year or two of monthly infection counts by municipality. The way the website is set up, two years means two files.
Ayesha has of course downloaded a bunch of these files already, so we can act like this is a cooking show and use the batter that she has already mixed for us…
To get easy access to Ayesha’s files, we simply need to create a “symbolic link” or “symlink” to the directory that has all the good stuff. Perhaps you should ask your instructor what a “symlink” is?
Symlinks are Unix things. R can create them with “file.symlink()” or via the “system()” function, but since this only needs to be done once, let’s do it in a shell directly.
Then verify that the shell tool and your R interpreter have the same working directory. How do you do this?
ln -s /90days/carlm/Brazil2019 ./BrazilData
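From the R side, one way to check that a link landed where you expected is sketched below. It is only a sketch: it builds a throwaway directory and link under tempdir() (both names are made up for the demonstration) rather than touching the real BrazilData link, and it assumes a Unix-like system where symlinks are available.

```r
# Throwaway stand-ins for Ayesha's directory and for the BrazilData link.
target <- tempfile("Brazil2019_demo")   # hypothetical target directory
dir.create(target)
link <- tempfile("BrazilData_demo")     # hypothetical link name

# Base R equivalent of `ln -s target link`.
file.symlink(from = target, to = link)

getwd()              # compare this with what `pwd` prints in the shell
Sys.readlink(link)   # the path the link points to
dir.exists(link)     # TRUE when the link resolves to a real directory
```

For the real thing, the same two checks would be Sys.readlink("BrazilData") and dir.exists("BrazilData"), run from the working directory where the link was created.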
Now you can use some of the 12 most important Unix commands to explore the directory structure of Ayesha’s excellent files, or you can use the files pane and click yourself crazy.
Check out the Dengue directory and look for the file that we are about to read in below. Click to view it, then maybe execute the next cell. NOTE the read_csv2 (which defaults to ‘;’ as the delimiter and ‘,’ as the decimal mark).
library(tidyverse)
# Read in a file of dengue infection counts
dengue06<- read_csv2(file="BrazilData/Dengue/A150459134_174_250_159.csv",skip=3)
Blame it on colonialism. In the distant past, at the dawn of the computer age, someone had to decide how to “encode” characters and digits, that is, how the various letters and digits would be represented as 1’s and 0’s in computer files. The colonial overlords at the time saw no problem in limiting the number of bits per character to 7, with the remaining bit of each byte reserved for “parity checking”. Since \(2^7=128\), the number of distinct characters allowed was 128. This coding scheme is called “ASCII”, where the “A” stands for “American” and the other letters translate roughly to “who cares about anyone else?”. Eventually, our European friends complained of missing their diacritical marks and other chicken scratches, so several 8 bit encodings (still 1 byte) were developed and called “ISO-8859”, where the leading letter no longer stands for “American” but rather for “International”, which in this case is a pretentious term for “European”. The extra bit produces another 128 characters, bringing the total to 256, which turns out to be not enough to cover Cyrillic, Turkish, Esperanto, and so on all at once. Thus there developed several variants of ISO-8859, of which “Latin1” is one, and it turns out to work for Portuguese.
NOTE that in the 21st century, “Unicode”, with encodings such as “UTF-8” and “UTF-16” that use as many bytes per character as necessary, allows for more than a million distinct characters and is much more common in the world than “Latin1”.
NOTE also that there is no fully reliable way to determine the encoding scheme of a file. Tools such as readr’s guess_encoding() can only make an educated guess, so trial and error is often your only friend.
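To make the byte counting concrete, here is a small sketch that encodes one Portuguese word both ways and then reads the Latin1 version back with readr. The word and the temporary file are invented for the demonstration; nothing here touches Ayesha's files.

```r
library(readr)

# "Município", written with an escape so this script stays plain ASCII.
word <- "Munic\u00edpio"

# The same nine characters as raw bytes under each encoding.
utf8_bytes   <- charToRaw(enc2utf8(word))
latin1_bytes <- iconv(word, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]]
length(utf8_bytes)     # the accented i takes two bytes in UTF-8
length(latin1_bytes)   # and a single byte in Latin1

# Write the Latin1 bytes to a temporary file, then read them back
# while telling readr which encoding to expect.
path <- tempfile(fileext = ".txt")
writeBin(latin1_bytes, path)
as_latin1 <- read_lines(path, locale = locale(encoding = "latin1"))
as_latin1
```

The locale(encoding = "latin1") argument is the same one passed to read_csv2 in this handout; only the reading function differs.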
require(tidyverse)
## Loading required package: tidyverse
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
dengue06<- read_csv2(file="BrazilData/Dengue/A150459134_174_250_159.csv",skip=3, na='-',
locale=locale(encoding = "latin1")) # latin1 works for european languages...sometimes
## Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
## Parsed with column specification:
## cols(
## `Município infecção` = col_character(),
## Jan = col_double(),
## Fev = col_double(),
## Mar = col_double(),
## Abr = col_double(),
## Mai = col_double(),
## Jun = col_double(),
## Jul = col_double(),
## Ago = col_double(),
## Set = col_double(),
## Out = col_double(),
## Nov = col_double(),
## Dez = col_double(),
## Total = col_double()
## )
## Warning: 11 parsing failures.
## row col expected actual file
## 2278 -- 14 columns 1 columns 'BrazilData/Dengue/A150459134_174_250_159.csv'
## 2279 -- 14 columns 1 columns 'BrazilData/Dengue/A150459134_174_250_159.csv'
## 2280 -- 14 columns 1 columns 'BrazilData/Dengue/A150459134_174_250_159.csv'
## 2281 -- 14 columns 1 columns 'BrazilData/Dengue/A150459134_174_250_159.csv'
## 2282 -- 14 columns 1 columns 'BrazilData/Dengue/A150459134_174_250_159.csv'
## .... ... .......... ......... ..............................................
## See problems(...) for more details.
While we’re at it, let’s read another file just like it:
dengue05<- read_csv2(file="BrazilData/Dengue/A150518134_174_250_159.csv",skip=3,na='-',
locale=locale(encoding = "latin1"))
## Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
## Parsed with column specification:
## cols(
## `Município infecção` = col_character(),
## Jan = col_double(),
## Fev = col_double(),
## Mar = col_double(),
## Abr = col_double(),
## Mai = col_double(),
## Jun = col_double(),
## Jul = col_double(),
## Ago = col_double(),
## Set = col_double(),
## Out = col_double(),
## Nov = col_double(),
## Dez = col_double(),
## Total = col_double()
## )
## Warning: 11 parsing failures.
## row col expected actual file
## 1810 -- 14 columns 1 columns 'BrazilData/Dengue/A150518134_174_250_159.csv'
## 1811 -- 14 columns 1 columns 'BrazilData/Dengue/A150518134_174_250_159.csv'
## 1812 -- 14 columns 1 columns 'BrazilData/Dengue/A150518134_174_250_159.csv'
## 1813 -- 14 columns 1 columns 'BrazilData/Dengue/A150518134_174_250_159.csv'
## 1814 -- 14 columns 1 columns 'BrazilData/Dengue/A150518134_174_250_159.csv'
## .... ... .......... ......... ..............................................
## See problems(...) for more details.
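The parsing failures in both files sit in the last few rows, which have 1 column where 14 were expected. They look like a totals line and footnotes rather than municipalities. If you would rather drop them explicitly, one possible approach is sketched below on a toy tibble. The column name municipio is a hypothetical stand-in for `Município infecção`, the values are invented, and the sketch assumes that every genuine municipality row starts with a six-digit code.

```r
library(dplyr)
library(stringr)

# Toy stand-in for a dengue tibble: two municipality rows plus footer debris.
toy <- tibble(
  municipio = c("110001 Alta Floresta D'Oeste", "110002 Ariquemes",
                "Total", "Notas:"),
  Jan = c(1, NA, 1, NA)
)

# Keep only the rows whose label begins with six digits (the assumption above).
municipalities <- toy %>%
  filter(str_detect(municipio, "^\\d{6}"))

municipalities
```

Filtering on the label rather than on missing counts matters here, because na='-' means a real municipality can legitimately have NA in a month.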
The graph below uses both dengue05 and dengue06; you’ll need to use some tools from the wrangling chapter to reshape the data and you may want to use ‘bind_rows()’ to combine the two tibbles at some point. An additional challenge is to get the months to show up in the right order. One way to do that is to create a factor and specify the ‘levels’ argument.
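If you get stuck, here is one possible shape for that reshaping, shown on two tiny made-up tibbles rather than the real data. The names toy05, toy06, municipio, and source are all hypothetical, and the real tibbles also carry a Total column that you would want to leave out of the reshape.

```r
library(dplyr)
library(tidyr)

# Portuguese month abbreviations in calendar order, as in the file headers.
month_levels <- c("Jan", "Fev", "Mar", "Abr", "Mai", "Jun",
                  "Jul", "Ago", "Set", "Out", "Nov", "Dez")

# Toy stand-ins for dengue05 and dengue06 with just two months each.
toy05 <- tibble(municipio = c("110001 A", "110002 B"), Jan = c(1, 2), Fev = c(3, NA))
toy06 <- tibble(municipio = c("110001 A", "110002 B"), Jan = c(5, 0), Fev = c(7, 1))

# Stack the two tibbles, remembering which one each row came from,
# then go from one column per month to one row per month.
long <- bind_rows(dengue05 = toy05, dengue06 = toy06, .id = "source") %>%
  pivot_longer(cols = -c(source, municipio),
               names_to = "month", values_to = "cases") %>%
  mutate(month = factor(month, levels = month_levels))

# Monthly totals per source, ready to hand to ggplot.
monthly <- long %>%
  group_by(source, month) %>%
  summarise(cases = sum(cases, na.rm = TRUE)) %>%
  ungroup()

monthly
```

Because month is a factor with explicit levels, a plot with month on the x axis will run Jan, Fev, Mar and so on instead of alphabetically.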
I stumbled across this site which provides all sorts of numbers at the municipality level:
http://www.atlasbrasil.org.br/2013/en/
Click on this to see a gnarly .xlsx file full of vaguely comprehensible but clearly important stuff:
http://www.atlasbrasil.org.br/2013/data/rawData/atlas2013_dadosbrutos_pt.xlsx
I thoughtfully added this spreadsheet to the GeneralInfo directory of Ayesha’s stash of data so we can read the second sheet of the file with the following:
mhdi <- readxl::read_excel("BrazilData/GeneralInfo/atlas2013_dadosbrutos_pt.xlsx", sheet = 2)
Our goal here is a tibble that has one row for each municipality and contains both the dengue data AND the mhdi data. We’ll use a Chapter 10 style relational database tool, the inner_join. The book uses something quite similar, the left_join. Either way, the challenge is to find a column that holds the same values in both tibbles.
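The difference between the two joins is easiest to see on a toy example. The sketch below uses two invented tibbles; only the key name Codmun6 is borrowed from the real data, and index_value is a hypothetical column.

```r
library(dplyr)

# Invented data: one code appears only in `cases`, another only in `index`.
cases <- tibble(Codmun6 = c(110001, 110002, 999999), Total = c(10, 20, 30))
index <- tibble(Codmun6 = c(110001, 110002, 110003), index_value = c(0.6, 0.7, 0.8))

# inner_join keeps only the keys found in both tibbles.
inner <- inner_join(cases, index, by = "Codmun6")

# left_join keeps every row of the left tibble and fills the gaps with NA.
left <- left_join(cases, index, by = "Codmun6")

nrow(inner)
nrow(left)
```

With an inner join, rows without a match silently disappear, which is convenient for footer debris but worth checking so that real municipalities are not lost along the way.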
Because the world is uncaring, this mhdi data file records municipality a little bit differently from the way it is done in the dengue tibble. Since, in general, it is better to match numbers than words, let’s do that.
This is a job for parse_number OR stringr
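Before turning either tool loose on the real column, it may help to see how they behave on a couple of made-up labels. This is a sketch with invented strings, not a description of what is in the dengue file.

```r
library(readr)
library(stringr)

labels <- c("110001 Alta Floresta D'Oeste", "Total")

# str_extract returns character strings, and NA where the pattern never matches.
codes_chr <- str_extract(labels, "\\d+")

# parse_number returns numbers, and NA (plus a parsing warning) where no number is found.
codes_num <- suppressWarnings(parse_number(labels))

codes_chr
codes_num

# The two can disagree when digits are punctuated: '\\d+' stops at the comma,
# while parse_number treats it as a grouping mark under the default locale.
grouped_chr <- str_extract("1,234 casos", "\\d+")
grouped_num <- parse_number("1,234 casos")
```

Note the difference in type as well: one result is character and the other is double, which matters both when comparing them and when choosing a join key.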
## str_extract is a more general tool that uses "regular expressions" such as '\\d+', which means one or more
## digits.
foo<-str_extract(dengue06$`Município infecção`,'\\d+')
## parse_number works in the present case because what it does happens to be what we want
goo<-parse_number(dengue06$`Município infecção`)
## Warning: 8 parsing failures.
## row col expected actual
## 2277 -- a number Total
## 2278 -- a number -
## 2279 -- a number Notas:
## 2280 -- a number Incluídas notificações de indivíduos residentes no Brasil, independente de sua confirmação, exceto os descartados,
## 2281 -- a number .
## .... ... ........ ..................................................................................................................
## See problems(...) for more details.
sum(foo != goo, na.rm = TRUE)
## [1] 2
To investigate those we can use problems()
problems(parse_number(dengue06$`Município infecção`))
Which looks pretty harmless
So on to the join
# First create a new Codmun6 column, then join on it
Dengue06 <- dengue06 %>% mutate(Codmun6 = parse_number(`Município infecção`)) %>%
inner_join(mhdi %>% filter(ANO == 2010),by="Codmun6")
## Warning: 8 parsing failures.
## row col expected actual
## 2277 -- a number Total
## 2278 -- a number -
## 2279 -- a number Notas:
## 2280 -- a number Incluídas notificações de indivíduos residentes no Brasil, independente de sua confirmação, exceto os descartados,
## 2281 -- a number .
## .... ... ........ ..................................................................................................................
## See problems(...) for more details.
## Warning: Column `Codmun6` has different attributes on LHS and RHS of join
The first warning is familiar; the second is a little more disturbing. Investigation is left to the reader.
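One plausible place to start that investigation is to look at what parse_number hangs on its result besides the numbers themselves. The sketch below uses invented labels, and whether this fully explains the join warning is an assumption to check against the real tibbles.

```r
library(readr)

# Two invented labels: one parses, one does not.
x <- suppressWarnings(parse_number(c("110001 Alta Floresta", "Total")))

# When some values fail to parse, readr records them in an attribute.
names(attributes(x))

# as.vector() drops attributes from an atomic vector, leaving plain numbers.
plain <- as.vector(x)
attributes(plain)
```

If the extra attribute turns out to be the culprit, wrapping the parse_number call in as.vector() or as.numeric() inside mutate() is one way to hand the join a plain numeric key.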
Here are a couple of things that Google turned up that are worth glancing briefly at before Wednesday.
A brief description of what’s available in Brazil https://www.indexmundi.com/brazil/major_infectious_diseases.html
A category of tropical diseases is called “neglected”
http://www.scielo.br/scielo.php?pid=s0036-46652009000500003&script=sci_arttext