US Baby Name Collisions 1880-2014
Abstract
We use US Social Security Administration data to compute, for each birth year YYYY from 1880 to 2014, the probability of a first-name clash in a class of kids all born in year YYYY.
This work is licensed under a Creative Commons
Attribution-ShareAlike 4.0 International License. The
markdown+Rknitr source code of this blog is available under a GNU General Public
License (GPL v3) from github.
Introduction
After reading a cool post by Kasia Kulma on how the release of Disney films affects the naming of girls in the US, I became aware of the babynames package by Hadley Wickham. The package wraps the data of the US Social Security Administration on the frequency of baby names in the US for each year during the period 1880-2014. Because the data fit so well with this blog's two previous posts on onomastics and the birthday problem with unequal probabilities, we use them to extend our name analyses over time.
library(babynames)
library(dplyr)    # %>%, group_by, summarise, mutate, do
library(ggplot2)  # plotting
head(babynames, n=2)
## # A tibble: 2 × 5
##    year   sex  name     n       prop
##   <dbl> <chr> <chr> <int>      <dbl>
## 1  1880     F  Mary  7065 0.07238359
## 2  1880     F  Anna  2604 0.02667896
Check how many babies and how many unique first names are contained in the data each year:
p1 <- babynames %>% group_by(year) %>% summarise(n=sum(n)) %>% ggplot(aes(x=year, y=n)) + geom_line() + xlab("Time (years)") + ylab("Number of babies")
p2 <- babynames %>% group_by(year) %>% summarise(n=n()) %>% ggplot(aes(x=year, y=n)) + geom_line() + xlab("Time (years)") + ylab("Number of unique names")
gridExtra::grid.arrange(p1, p2, ncol=2)
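A side note on the second panel: the babynames data have one row per year, sex and name, so counting rows per year counts name-sex combinations, and a name given to both girls and boys in the same year is counted twice. The following minimal sketch illustrates the difference on an invented toy table (the `toy` data frame and its names are made up for illustration and are not SSA data). It uses base R only; in a dplyr pipeline the distinct count would correspond to `n_distinct(name)`.

```r
# Invented toy table with the same year/sex/name layout as babynames
toy <- data.frame(
  year = c(2000, 2000, 2000, 2001, 2001),
  sex  = c("F", "M", "F", "F", "M"),
  name = c("Alex", "Alex", "Eve", "Eve", "Tom")
)

# Rows per year, i.e. name-sex combinations
rows_per_year <- sapply(split(toy$name, toy$year), length)

# Distinct names per year, ignoring sex
names_per_year <- sapply(split(toy$name, toy$year),
                         function(x) length(unique(x)))

print(rows_per_year)
print(names_per_year)
```

In the toy table, "Alex" appears for both sexes in 2000, so that year has more rows than distinct names.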
We see that the number of live births has remained approximately stable over the last 50 years, whereas the number of unique names has kept increasing. Note that, for reasons of privacy protection, only names with 5 or more occurrences in a given year are contained in the data. We therefore investigate the proportion of babies that have apparently been removed due to this privacy protection. This is done by summing the prop column for each year. If all names were available, the sum per year would be 2 (1 for each sex).
babynames %>% group_by(year) %>% summarise(prop=sum(prop)) %>% ggplot(aes(x=year, y=(2-prop)/2)) + geom_line() + xlab("Time (years)") + ylab("Proportion of the population removed") + scale_y_continuous(labels=scales::percent)
It becomes clear that a non-negligible share of the babies is removed and that this share appears to vary with time. As a simple fix, we rescale the counts within each year so that the resulting proportions p sum to one per year (over the names of both sexes).
babynames <- babynames %>% group_by(year) %>% mutate(p = n/sum(n))
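To make the two steps concrete, here is a minimal sketch on an invented two-year table (the `toy` data frame and all its numbers are made up for illustration and are not SSA figures). It uses base R only, so it runs without the tidyverse: `removed` mirrors the quantity (2 - sum(prop))/2 shown in the plot, and `p` mirrors the rescaling n/sum(n) within each year.

```r
# Invented toy data: counts of listed names and their within-sex proportions
toy <- data.frame(
  year = c(2000, 2000, 2000, 2001, 2001, 2001),
  sex  = c("F", "F", "M", "F", "M", "M"),
  name = c("Ann", "Eve", "Bob", "Ann", "Bob", "Tom"),
  n    = c(40, 50, 80, 30, 60, 30),
  prop = c(0.40, 0.50, 0.80, 0.60, 0.50, 0.25)
)

# prop sums to at most 1 per sex, hence to at most 2 per year;
# the shortfall, averaged over the two sexes, is the share removed
prop_sum <- sapply(split(toy$prop, toy$year), sum)
removed  <- (2 - prop_sum) / 2

# Rescale the counts within each year so that p sums to one per year
toy$p  <- toy$n / ave(toy$n, toy$year, FUN = sum)
p_sums <- sapply(split(toy$p, toy$year), sum)

print(removed)
print(p_sums)
```

Note that the rescaling implicitly assumes the removed babies are spread over the listed names in proportion to their counts, which is a simplification because the removed names are by construction the rare ones.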
Birthday Problem with Unequal Occurrence Probabilities
The data are perfect for testing the name-collision functionality from the previous Happy pbirthday class of 2016 post. Since that post was written, the pbirthday_up function for computing the name collision probability, which is an instance of the birthday problem with unequal occurrence probabilities, has been bundled into a preliminary birthdayproblem R package available from github.
devtools::install_github("hoehleatsu/birthdayproblem")
library(birthdayproblem)
We can now easily calculate, for each year, the probability that 2 or more kids in a class of \(n\in \{20,25,30\}\) kids, all born in a given year YYYY, will have the same first name:
collision <- babynames %>% group_by(year) %>% do({
  n <- c(20L,25L,30L)
  p <- sapply(n, function(n) pbirthday_up(n=n, .$p ,method="mase1992")$prob)
  data.frame(n=n, p=p)
})
It looks like the name distribution has become more diverse over time, since the collision probability decreases over time. However, some bias is to be expected due to the removal of names with fewer than 5 occurrences in a given year.
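For readers who want to see what such a collision probability amounts to, the following is a minimal base-R sketch of one exact way to compute it. It is not the pbirthday_up implementation and not the "mase1992" method used above, and `collision_prob_exact` is a hypothetical helper name. It assumes the n names are drawn independently with probabilities p and uses the recursion \(f_j \leftarrow f_j + j\, p_i\, f_{j-1}\), where \(f_j\) is the probability that j kids all have different names when only the names processed so far are allowed.

```r
# Exact probability that at least two of n kids share a name, when each
# kid's name is drawn independently with probabilities p (summing to one).
# collision_prob_exact is a hypothetical helper, not part of birthdayproblem.
collision_prob_exact <- function(n, p) {
  stopifnot(n >= 1, all(p >= 0), abs(sum(p) - 1) < 1e-8)
  # f[j + 1] = P(j kids all have different names), restricted to the
  # names processed so far; f[1] corresponds to j = 0
  f <- c(1, numeric(n))
  for (p_i in p) {
    # Either name i is unused, or exactly one of the j kids carries it
    f[-1] <- f[-1] + seq_len(n) * p_i * f[-(n + 1)]
  }
  1 - f[n + 1]
}

# Sanity check: 365 equally likely "names" give the classical birthday problem
print(collision_prob_exact(23, rep(1 / 365, 365)))
print(pbirthday(23))
```

Both calls should agree (roughly 0.507). Applied to the rescaled p column of a single year, for example `collision_prob_exact(25, babynames$p[babynames$year == 2014])`, the sketch could serve as a cross-check of the values computed above, under the same independence assumption.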