Did Mary and John go West?
Abstract
As a final post in the baby-names-the-data-scientist’s-way
series, we use the US Social Security Administration 1910-2015 data to
space-time visualize for each state the most popular baby name for girls
and boys, respectively. The code uses in parts the simple features
package (sf
) in order to to get some first experience with
the new approach for handling spatial maps.
This work is licensed under a Creative Commons
Attribution-ShareAlike 4.0 International License. The
markdown+Rknitr source code of this blog is available under a GNU General Public
License (GPL v3) license from github.
Introduction
After a series of posts on naming
uncertainty, name
collisions in classrooms and illustrating these name
collisions over time, it is time to leave onomatology for now.
However, the availability of the US social security baby name data at state level
requires one last effort: visualizing the top names per state for the
years 1910-2015. Creating a map-based visualization also provides a nice
opportunity to experiment with the new sf
(simple features)
package for spatial visualization.
Data Dancing
We download the US social security data, which consists of a zip file containing a bunch of 51 text files - one for each state.
We then read these individual text files and bind them together into
one large data.frame
:
##Get list of all file names containing state specific baby name data
<- list.files(path=file.path(filePath,"namesbystate"), pattern=".TXT")
fList
##Read complete name list of each state
<- purrr::map_df(fList, .f=function(f) {
names read_csv(file=file.path(filePath,"namesbystate",f), col_names=c("State","Sex","Year","Name","Count"),col_types=cols(col_character(), col_factor(c("M","F")), col_integer(), col_character(), col_integer()))
})
##Show result
head(names, n=4)
## # A tibble: 4 × 5
## State Sex Year Name Count
## <chr> <fctr> <int> <chr> <int>
## 1 AK F 1910 Mary 14
## 2 AK F 1910 Annie 12
## 3 AK F 1910 Anna 10
## 4 AK F 1910 Margaret 8
With the complete data in place, it’s easy to compute the top boy and girl name per state and year. For later use we convert this information into long-format.
##Find top-1 names for each state by gender. Data are already sorted.
<- names %>% group_by(Year,State,Sex) %>% do({
topnames head(.,n=1) %>% dplyr::select(Name)
%>% spread(Sex, Name) })
## Source: local data frame [4 x 4]
## Groups: Year, State [4]
##
## Year State M F
## <int> <chr> <chr> <chr>
## 1 1910 AK John Mary
## 2 1910 AL James Mary
## 3 1910 AR James Mary
## 4 1910 AZ John Mary
Map Massaging
For the map visualization we use an US map from the R package fiftystater
where Alaska and Hawaii have been re-located as map-insets. The process
for doing the necessary transforms sp
-style are described
in the package vignette.
We store the output of this transformation as a shapefile
usa.shp
with appropriate support files. Furthermore, a
lines.shp
shapefile was created which contains information
on where to put the text labels for each state. This was easily edited
interactively in QGIS.
We then use the sf
package for loading these two
shapefiles back into R:
suppressMessages(library("sf"))
<- st_read(file.path(filePath, "maps", "usa.shp"))
usa <- st_read(file.path(filePath, "maps", "lines.shp")) textplacement
The textplacement
information is converted to a
data.frame
where each row contains the state name and the
coordinates of the start and endpoint of each line-segment - this
corresponds to text location and geographical centroid of the region,
respectively.
<- textplacement %>% split(.$State) %>% purrr::map_df(.f = function(x) {
location <- st_geometry(x)[[1]]
pos data.frame(State=x$State, x1.loc=pos[1,1], x2.loc=pos[1,2], x1.center=pos[2,1],x2.center=pos[2,2])
%>% ungroup })
(Note: Is there a fancier way to extract the coordinates for the
geometry of the sf
objects while keeping the
data.frame
part alongside?)
State-Time Visualization
By using the animation::saveGIF
function we create an
animation of the the top girl and boy name for each state for the
sequence of years 1910-2015.
State-Time Cartogram
We use the Rcartogram
and getcartr
packages
to create an analogous cartogram - see the previous Cartograms
with R post for further details. The total number of births per
state in a given year is used as scaling variable for the cartogram.
Its amazing to observe how births go west in the US during the considered time period.