Review: Chapters 1 and 2 of R for Data Science

Exercise 1

Here is the basic plot:

ggplot(data=mpg, mapping=aes(x=cty,y=hwy))  +
  geom_point()

Add stuff to the basic plot to produce this fancier version. Don’t be afraid to use the help facility, google, or your mom.

Exercise 2

Reproduce the graph below using the mpg dataframe. The geom is geom_boxplot() and the tricky bit is that some of the variables used are continuous, but are functioning as categorical.
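A minimal sketch of the "continuous but functioning as categorical" idea, not the solution to the graph itself: wrapping a numeric variable in factor() makes ggplot2 draw one box per distinct value. The choice of cyl and hwy is an assumption made purely for illustration, as is the name p_box.

```r
library(ggplot2)

# cyl is stored as a number in mpg; factor() turns it into a discrete variable,
# so geom_boxplot() draws one box per cylinder count.
# (cyl and hwy are illustrative choices, not necessarily those in the target graph.)
p_box <- ggplot(data = mpg, mapping = aes(x = factor(cyl), y = hwy)) +
  geom_boxplot()

p_box
```

Without factor(), a numeric x is treated as a single continuous axis, so you would also need a group aesthetic to get separate boxes.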

Exercise 3

Here is the basic plot:

ggplot(data=diamonds,aes(x=carat,y=price)) + geom_point()

Now how about adding color, alpha, and two smoothers: one showing the 90% confidence interval in gold and the other showing the 50% confidence bounds in blue. Note that using geom_smooth() with the smoother it picks by default for a data set this large, gam, will generate ignorable warning messages.
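A rough sketch of how two confidence bands of different widths can be layered, offered as a starting point rather than the answer. The level argument of geom_smooth() sets the confidence level and fill colors the band. The color mapping, the alpha value, the black smoother line and the name p_smooth are all guesses made for illustration.

```r
library(ggplot2)

# Sketch only: the color mapping and alpha value are assumptions,
# not the settings used in the target graph.
p_smooth <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut), alpha = 0.2) +
  # level sets the width of the confidence band; fill colors the band itself
  geom_smooth(level = 0.90, fill = "gold", color = "black") +
  geom_smooth(level = 0.50, fill = "blue", color = "black")
```

Type p_smooth at the console to draw it. The wider 90% band is added first so that the narrower 50% band is drawn on top of it.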

Exercise 4

Because life is unfair, in addition to geom_x layers, there also exist stat_x layers. I believe it is always possible to create the same graph using a geom_x function as with a stat_x function, but certainty here is elusive. The stat_x functions exist because in some situations they are believed to be easier to deal with than a geom_x function.
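To make the geom_x / stat_x pairing concrete, here is a small sketch using a different plot from the one in this exercise: a bar chart of diamond counts by cut, built once with geom_bar() and once with stat_count(). The names p_geom and p_stat are made up for the example.

```r
library(ggplot2)

# The same bar chart built two ways.
# geom_bar() uses stat = "count" by default; stat_count() uses geom = "bar" by default.
p_geom <- ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))
p_stat <- ggplot(data = diamonds) + stat_count(mapping = aes(x = cut))

# layer_data() returns the data each layer actually draws; the bar heights agree.
all.equal(layer_data(p_geom)$count, layer_data(p_stat)$count)
```

This shows one pair that coincides. It is not proof that every stat_x has a geom_x twin.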

Consider the graph below, which is created using stat_summary(). The data contain individual observations on diamonds, while the plot shows the sum of the prices by cut and clarity, stacked so that we can see the total value of all diamonds as well. Because the plot displays a summary statistic, using stat_summary() sort of makes sense.

Try to produce the graph below with help from this web page: https://ggplot2.tidyverse.org/reference/stat_summary.html or anywhere else Google might take you.

Use stat_summary(..., geom="area") to build the graph below.

For your bemusement, the same graph can be built with geom_area():

# fun replaces fun.y, which is deprecated since ggplot2 3.3.0
ggplot(data=diamonds) +
  geom_area(aes(x=cut,y=price,fill=clarity,group=clarity),stat="summary",fun="sum") +
  labs(title="Total value (sum of price) of diamonds by cut and clarity")

Exercise 5

A preview of what’s in Chapter 3 – dplyr

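library(dplyr)

# One row per cut/clarity combination: count, total price, mean price
df0 <- diamonds %>%
  group_by(cut, clarity) %>%
  summarise(N = length(price),
            sumP = sum(price),
            meanP = mean(price))

# 'group' is needed so that each clarity level gets its own band.
# stat = "identity" is set explicitly: the sums are already computed,
# and the default stat of geom_area() depends on the ggplot2 version.
ggplot(df0, aes(x = cut, y = sumP, group = clarity, fill = clarity)) +
  geom_area(stat = "identity")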

Appendix

And finally …

One of the problems with the graph above is that it doesn’t tell us much that is useful, because we don’t know how many diamonds are in each cut/clarity bin. The total value is, in a sense, the average value * N, so in the above plot we cannot tell whether the total value of those “ideal” cut diamonds is high because an ideal cut makes them valuable or because there are simply lots more of them.
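A quick way to look at the two ingredients separately is to tabulate the count and the mean price for each cut. This is a sketch using dplyr (covered in Chapter 3); the name by_cut is made up for the example.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data

# Count, mean price and total price per cut: total value = N * mean price
by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(N = n(), meanP = mean(price), sumP = sum(price))

by_cut
```

Comparing the N and meanP columns shows whether a large sumP comes from many diamonds or from expensive ones.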

One way to disambiguate this is to decompose the mean price and stack the components. In other words, draw a graph like the one above, but with each band showing not the total value of each category of diamond but the contribution of that category to the average price for the particular “cut”.

\[\bar{p} =\sum_i{\frac{N_i}{N}*\bar{p_i}}\] where \(\bar{p}\) is the mean price of diamonds of a particular cut, \(\bar{p_i}\) is the mean price of diamonds of clarity level \(i\) and that particular cut, \(N_i\) is the number of such diamonds, \(N\) is the number of diamonds of that cut, and \(N_{i}/N\) is therefore the fraction of diamonds of that particular cut that have clarity level \(i\).

What we might plot is a stack of \(\frac{N_i}{N}*\bar{p_i}\) for each particular cut.

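The difficult part of this is the mutate statement. mutate() allows us to add a variable to a tibble, but what’s really hard to get one’s brain around is the way that sum(N) in the mutate statement below refers to N in the equation above. It works because summarise() drops only the last grouping variable (clarity), so df0 is still grouped by cut and sum(N) is computed separately within each cut.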
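A toy sketch may help before the real thing. The tibble toy and its values are invented for illustration. The only point is that sum(N) inside mutate() means "the total within the current group" when the tibble is grouped, and "the grand total" when it is not.

```r
library(dplyr)

# Hypothetical data: two cuts, two clarity levels each
toy <- tibble(
  cut     = c("A", "A", "B", "B"),
  clarity = c("x", "y", "x", "y"),
  N       = c(1, 3, 2, 2)
)

# Grouped by cut: sum(N) is 4 within each cut
grouped <- toy %>% group_by(cut) %>% mutate(share = N / sum(N)) %>% ungroup()

# Not grouped: sum(N) is 8, the grand total
ungrouped <- toy %>% mutate(share = N / sum(N))

grouped$share    # 0.25 0.75 0.50 0.50
ungrouped$share  # 0.125 0.375 0.250 0.250
```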

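df0 <- diamonds %>%
  group_by(cut, clarity) %>%
  summarise(N = length(price),
            sumP = sum(price),
            meanP = mean(price))
# df0 is still grouped by cut, so sum(N) is the number of diamonds of each cut
df0 <- df0 %>% mutate(wtMean = meanP * (N / sum(N)))
ggplot(df0, aes(x = cut, y = wtMean, group = clarity, fill = clarity)) +
  geom_area(stat = "identity")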
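As a sanity check on the decomposition, the bands for each cut should stack up to the plain mean price of that cut. The sketch below recomputes the weighted contributions and compares the two. The names contrib and check are made up, and .groups = "drop_last" only spells out what summarise() already does by default here.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data

# Weighted contributions, computed the same way as above.
# .groups = "drop_last" keeps the result grouped by cut,
# so sum(N) in mutate() is a per-cut total.
contrib <- diamonds %>%
  group_by(cut, clarity) %>%
  summarise(N = n(), meanP = mean(price), .groups = "drop_last") %>%
  mutate(wtMean = meanP * (N / sum(N)))

# Height of each stack next to the plain mean price of each cut
check <- contrib %>%
  summarise(stackHeight = sum(wtMean)) %>%
  inner_join(diamonds %>% group_by(cut) %>% summarise(meanCut = mean(price)),
             by = "cut")

all.equal(check$stackHeight, check$meanCut)
```

If the grouping had been dropped before the mutate(), sum(N) would be the size of the whole data set and the stacks would no longer add up to the per-cut means.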