Surveillance Out of the Box - The #Zombie Experiment
Abstract
We perform a social experiment to investigate whether zombie-related twitter posts can be used as a reliable indicator for an early warning system. We show how such a system can be set up almost out-of-the-box using R - a free software environment for statistical computing and graphics. Warning: This blog entry contains toxic doses of Danish irony and sarcasm as well as disturbing graphs. Update: The blog post was extended with additional data, graphs and text on 2016-11-03 00:08:27. Scroll to the end of the post for details.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) from github.
Introduction
Proposing statistical methods is only mediocre fun if nobody applies them. As an act of desperation the prudent statistician has been forced to provide R packages supplemented with a CRAN, github, useR! or word-of-mouth advertising strategy. To underpin these efforts, a reproducibility crisis has been announced in order to scare decent comma-separated scientists away from using Excel. Social media marketing strategies for your R package include hashtag #rstats twitter announcements, possibly enhanced by a picture or animation showing your package at its best:
Introducing gganimate: #rstats package for adding animation to any ggplot2 figure https://t.co/UBWKHmIc0e pic.twitter.com/oQhQaYBqOj
— David Robinson (@drob) February 1, 2016
Unfortunately, little experience with the interactive aspect of this statistical software marketing strategy appears to be available. In order to fill this scientific advertising gap, this blog post constitutes an advertisement for the out-of-the-box functionality of the surveillance package, hidden as a social experiment. It shows what you can do with R when you combine a couple of packages, wrangle the data, cleverly visualize the results and then team up with the fantastic R community.
The Setup: Detecting a Zombie Attack
As previously explained in a useR! 2015 lightning talk, Max Brooks’ Zombie Survival Guide is very concerned with the early warning of zombie outbreaks. However, despite extensive research and recommendations, no reliable service appears to be available for the early detection of such upcoming events. Twitter, on the other hand, has become the media darling for staying informed about news as it unfolds. Hence, continuous monitoring of hashtags like #zombie or #zombieattack appears to be an essential component of your zombie survival strategy.
Tight Clothes, Short Hair and R
Extending the recommendations of the Zombie Survival Guide, we provide an out-of-the-box (OOTB) monitoring system by using the rtweet R package to obtain all individual tweets containing the hashtags #zombie or #zombieattack.
<- "#zombieattack OR #zombie"
the_query <- "" #To limit the seach to berlin & surroundings: geocode <- "52.520583,13.402765,25km"
geocode #Converted query string which works for storing as file
<- stringr::str_replace_all(the_query, "[^[:alnum:]]", "X") safe_query
In particular, the README of the rtweet package provides helpful information on how to create a twitter app to automatically search tweets using the twitter API. One annoyance of the twitter REST API is that only the tweets of the past 7 days are kept in the index. Hence, your time series are going to be short unless you accumulate data over several queries spread over a time period. Instead of using a fancy database setup for this data collection, we provide a simple R solution based on dplyr and saveRDS - see the underlying R code by clicking on the github logo in the license statement of this post.
Basically,
- all tweets fulfilling the above hashtag search query are extracted
- each tweet is extended with a time stamp of the query time
- the entire result of each query is stored into a separate RDS file using saveRDS (see the sketch below)
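A minimal sketch of one such query-and-save round could look as follows. This is not the post's hidden code: the call to rtweet's search_tweets() and the file naming via the `filePath` directory are assumptions for illustration.

```r
## Sketch of one query-and-save round (not the post's actual hidden code).
## `filePath` is a hypothetical directory for the accumulated RDS files.
library(rtweet)
library(dplyr)

tw_new <- search_tweets(q = the_query, n = 18000, include_rts = TRUE) %>%
  mutate(query_at = Sys.time())  # time stamp each tweet with the query time

## Store the result of this query in its own RDS file, named by query and date
saveRDS(tw_new, file = paste0(filePath, "Tweets-Database-", safe_query, "-",
                              format(Sys.Date()), ".RDS"))
```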
In a next step, all stored queries are loaded from the RDS files and put together. Subsequently, only the newest time-stamped entry about each tweet is kept - this ensures that the retweet counts are up-to-date and no post is counted twice. All these data wrangling operations are easily conducted using dplyr. Of course a full database solution would have been more elegant, but R does the job just as well as long as it's not millions of queries. Actually, in the example we are going to use the results of a single query. No matter the data backend, at the end of this pipeline we have a database of tweets.
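The merge-and-deduplicate step might look like the following sketch. The tweet id column is assumed to be called `status_id` (as in rtweet output); adapt to your data.

```r
## Sketch: combine all stored queries and keep only the newest
## time-stamped version of each tweet.
library(dplyr)

files <- list.files(filePath, pattern = paste0("^Tweets-Database-", safe_query),
                    full.names = TRUE)
tw <- lapply(files, readRDS) %>%
  bind_rows() %>%
  group_by(status_id) %>%        # assumed tweet id column
  arrange(desc(query_at)) %>%
  slice(1) %>%                   # keep only the newest entry per tweet
  ungroup()
```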
```r
#Read the tweet database
tw <- readRDS(file=paste0(filePath,"Tweets-Database-",safe_query,"-","2016-09-25",".RDS"))
options(width=300, tibble.width = Inf)
tw %>% select(created_at, retweet_count, screen_name, text, hashtags, query_at)
```
```
## # A tibble: 10,974 × 6
## created_at retweet_count screen_name text hashtags query_at
## <dttm> <int> <chr> <chr> <list> <dttm>
## 1 2016-09-25 10:26:28 0 Lovebian The latest #Zombie Nation! https://t.co/8ZkOFSZH2v Thanks to @NJTVNews @MaxfireXSA @Xtopgun901X <chr [1]> 2016-09-25 10:30:44
## 2 2016-09-25 10:25:49 2 MilesssAwaaay RT @Shaaooun: I'm gonna turn to a zombie soon! xdxdxdxd #AlmostSurvived #204Days #ITried #Zombie #StuckInMyRoom #Haha\n\n#MediaDoomsDay #Kame <chr [7]> 2016-09-25 10:30:44
## 3 2016-09-25 10:21:10 6 catZzinthecity RT @ZombieEventsUK: 7 reasons #TheGirlWithAllTheGifts is the best #zombie movie in years https://t.co/MB82ssxss2 via @MetroUK #Metro <chr [3]> 2016-09-25 10:30:44
## 4 2016-09-25 10:19:41 0 CoolStuff2Get Think Geek Zombie Plush Slippers https://t.co/0em920WCMh #Zombie #Slippers #MyFeetAreCold https://t.co/iCEkPBykCa <chr [3]> 2016-09-25 10:30:44
## 5 2016-09-25 10:19:41 4 TwitchersNews RT @zOOkerx: Nur der frühe Vogel fängt den #zombie also schaut gemütlich rein bei @booty_pax! Now live #dayz on #twitch \n\nhttps://t.co/OIk6 <chr [3]> 2016-09-25 10:30:44
## 6 2016-09-25 10:17:45 0 ZombieExaminer Washington mall shooting suspect Arcan Cetin was '#Zombie-like' during arrest - USA TODAY https://t.co/itoDXG3L8T https://t.co/q2mURi24DB <chr [1]> 2016-09-25 10:30:44
## 7 2016-09-25 10:17:44 4 SpawnRTs RT @zOOkerx: Nur der frühe Vogel fängt den #zombie also schaut gemütlich rein bei @booty_pax! Now live #dayz on #twitch \n\nhttps://t.co/OIk6 <chr [3]> 2016-09-25 10:30:44
## 8 2016-09-25 10:17:23 0 BennyPrabowo bad miku - bad oni-chan... no mercy\n.\n.\n.\n.\n#left4dead #games #hatsunemiku #fps #zombie #witch https://t.co/YP0nRDFFj7 <chr [6]> 2016-09-25 10:30:44
## 9 2016-09-25 10:12:53 62 Nblackthorne RT @PennilessScribe: He would end her pain, but he could no longer live in a world that demanded such sacrifice. #zombie #apocalypse\nhttps: <chr [2]> 2016-09-25 10:30:44
## 10 2016-09-25 10:06:46 0 mthvillaalva Pak ganern!!! Kakatapos ko lang kumain ng dugo! \n#Zombie https://t.co/Zyd0btVJH4 <chr [1]> 2016-09-25 10:30:44
## # ... with 10,964 more rows
```
OOTB Zombie Surveillance
We are now ready to prospectively detect changes using the surveillance R package (Salmon, Schumacher, and Höhle 2016).
library("surveillance")
We shall initially focus on the #zombie series as it contains more counts. The first step is to convert the data.frame of individual tweets into a time series of daily counts.
```r
#' Function to convert a data.frame of tweets into a time series of daily counts.
#' For convenience we store the time series and the data.frame jointly as a list.
#' This allows for easy manipulations later on, as we see data.frame and time
#' series as a joint package.
#'
#' @param tw data.frame containing the linelist of tweets.
#' @param the_query_subset String containing a regexp to restrict the hashtags
#' @param delete_first_day (boolean) Delete first day of the series due to it being incomplete
#' @return List containing the sts object as well as the original data.frame.
#'
df_2_timeseries <- function(tw, the_query_subset, delete_first_day=TRUE) {
  tw_subset <- tw %>% filter(grepl(gsub("#","",the_query_subset), hashtags))

  #Aggregate data per day and convert the time series to an sts object
  ts <- surveillance::linelist2sts(as.data.frame(tw_subset), dateCol="created_at_Date", aggregate.by="1 day")
  #Drop the first day with observations; due to the moving window of the twitter index this count is incomplete
  if (delete_first_day) ts <- ts[-1,]

  return(list(tw=tw_subset, ts=ts, the_query_subset=the_query_subset))
}

zombie <- df_2_timeseries(tw, the_query_subset = "#zombie")
```
It’s easy to visualize the resulting time series using the plotting functionality of the surveillance package.
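A basic call of the sts plot method could look as follows (the exact styling of the figure in the post may differ):

```r
## Plot the daily number of tweets; sts objects come with a plot method
plot(zombie$ts, xlab = "Time (days)", ylab = "No. tweets")
```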
We see that the counts on the last day are incomplete. This is because the query was performed at 10:30 CEST and not at midnight. We therefore adjust counts on the last day based on simple inverse probability weighting. This just means that we scale up the counts by the inverse of the fraction the query-hour (10:30 CEST) makes up of 24h (see github code for details). The usefulness of this adjustment relies on the assumption that tweets are evenly distributed over the day.
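In code, the adjustment amounts to something like the following sketch (the github code may differ in details):

```r
## Inverse probability weighting of the last day: the query ran at 10:30,
## so only 10.5/24 of that day is observed; scale the count up accordingly.
n <- nrow(zombie$ts)
zombie$ts@observed[n, ] <- zombie$ts@observed[n, ] * 24 / 10.5
```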
We are now ready to apply a surveillance algorithm to the pre-processed time series. We shall pick the so-called C1 version of the EARS algorithm documented in Hutwagner et al. (2003) and Fricker, Hegler, and Dunfee (2008). For a monitored time point \(s\) (here: a particular day, say, 2016-09-23), this simple algorithm takes the seven observations before \(s\) in order to compute the mean and standard deviation, i.e.
\[
\begin{align*}
\bar{y}_s &= \frac{1}{7} \sum_{t=s-7}^{s-1} y_t, \\
\operatorname{sd}_s^2 &= \frac{1}{7-1} \sum_{t=s-7}^{s-1} (y_t - \bar{y}_s)^2.
\end{align*}
\]
The algorithm then computes the z-statistic \(\operatorname{C1}_s = (y_s - \bar{y}_s)/\operatorname{sd}_s\) for each time point to monitor. Once the value of this statistic is above 3, an alarm is flagged. This means that we assume that the previous 7 observations are what is to be expected when no unusual activity is going on. One can interpret the statistic as a transformation to (standard) normality: once the current observation is too extreme under this model, an alarm is sounded. Such a normal approximation is justified given the large number of daily counts in the zombie series we consider, but it takes neither secular trends nor day-of-the-week effects into account. Note that the calculation can also be reversed in order to determine how large the number of observations needs to be in order to generate an alarm (shown as a red line in the graph).
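To make the formulas concrete, here is a didactic computation of \(\operatorname{C1}_s\) for the last day of the series; the earsC call below performs the same calculation for a whole range of time points.

```r
## Didactic sketch: C1 statistic for the last day, mirroring the formulas above
y <- as.numeric(observed(zombie$ts))
s <- length(y)
baseline <- y[(s-7):(s-1)]          # the 7 observations before day s
ybar <- mean(baseline)
sd_s <- sd(baseline)
C1   <- (y[s] - ybar) / sd_s        # alarm if C1 > 3
## Reversing the calculation: the smallest count that would trigger an alarm
threshold <- ybar + 3 * sd_s
```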
We now apply the EARS C1 monitoring procedure to the zombie time series, starting at the 8th day of the time series. It is important to realize that the result of monitoring a time point in the graphic is obtained by only looking into the past. Hence, the relevant time point to consider today is whether an alarm would have occurred on 2016-09-25. We also show the other time points to see if we could have detected potential alarms earlier.
"sts"]] <- earsC(zombie$ts, control=list(range = 8:nrow(zombie$ts),
zombie[[method = "C1", alpha = 1-pnorm(3)))
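The resulting sts object can be inspected with the package's accessor functions, e.g. to read off the alarm threshold in counts:

```r
## Inspect the monitoring result: alarms() gives the binary alarm indicator,
## upperbound() the alarm threshold (in cases) for each monitored day.
surv <- zombie[["sts"]]
tail(data.frame(observed   = observed(surv)[, 1],
                upperbound = upperbound(surv)[, 1],
                alarm      = alarms(surv)[, 1]))
## Number of additional tweets needed before an alarm on the last day
ceiling(upperbound(surv)[nrow(surv), ] - observed(surv)[nrow(surv), ])
```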
What a relief! No suspicious zombie activity appears to be ongoing. Actually, it would have taken 511 additional tweets before we would have raised an alarm on 2016-09-25. This is quite a number.
As an additional sensitivity analysis we redo the analyses for the #zombieattack hashtag. Here the use of the normal approximation in the computation of the alerts is more questionable. Still, we can get a time series of counts together with the alarm limits.
Also no indication of zombie activity. The number of additional tweets needed before alarm in this case is: 21. Altogether, it looks safe out there…
Summary
R provides ideal functionality to quickly extract and monitor twitter
time series. Combining with statistical process control methods allows
you to prospectively monitor the use of hashtags. Twitter has released a
dedicated package for this purpose, however, in case of low count time
series it is better to use count-time series monitoring devices as
implemented in the surveillance
package. Salmon, Schumacher, and
Höhle (2016) contains further details on how to proceed in this
case.
The important question, however, remains: Does this really work in practice? Can you sleep tight while your R zombie monitor scans twitter? Here is where the social experiment starts: Please help by retweeting the post below to create a drill alarm situation. More than 511 (!) and 21 additional tweets, respectively, are needed before an alarm will sound.
New blog entry: Please RT to help me evaluate my #zombie monitoring system - https://t.co/b0gNfpJ0RM #zombieattack #biosurveillance #rstats pic.twitter.com/N3PTZBnaw4
— Michael Höhle (@m_hoehle) September 25, 2016
I will continuously update the graphs in this post to see how our efforts are reflected in the time series of tweets containing the #zombieattack and #zombie hashtags. Thanks for your help!
Update at 2016-11-03 00:08:31
Below we show how the #zombieattack series developed after the post was made public:
The orange part of the bars indicates the fake outbreak tweet (1) as well as its retweets (10). It is obvious that despite the increased activity due to the fake outbreak tweets, no alarm was generated because of them. In part this is explained by the high activity during 20-21 Sep. A previous outbreak? No, advertisements of zombie pinup badges on etsy. Since the EARS algorithm sequentially estimates the variance of the baseline counts, the peak on 20-21 Sep inflates the mean and variance and thus results in a high upper bound as long as it is part of the baseline. Despite some extra activity on 25-27 Sep due to the fake outbreak tweets, none of these days exceed this bound. However, once the 20-21 Sep peak moves out of the 7-day baseline, the alarm threshold decreases noticeably. We thus get an alarm for the peak on 30 September, even though it is not higher than the previous peak on 20-21 Sep and appears to be caused by other phenomena not related to our fake outbreak. A more careful analysis of the tweets reveals that they are caused by a charity fun run near Kuala Lumpur with a zombie attack theme!
SPAM FAM 😋 #zombieattack #charityfunrun2016 pic.twitter.com/YJW8Mjpd0L
— PAAN (@noorfarhan_) September 30, 2016
For further comparison, we also use a negative binomial CUSUM, which keeps the baseline steady, but allows the detection of sustained increases.
The gray line shows the estimated mean from the eight baseline counts, and the orange line shows the corresponding upper 99% quantile of this estimated distribution. The red line shows the alarm limit of a CUSUM method where the potential shift in the mean is estimated by maximum likelihood at each monitoring instance. The threshold is tuned such that it initially roughly coincides with the 99% quantile of the estimated distribution. Here, no signal is detected. Of course this depends on the quantile used for the detection and on the 20-21 September peak being fully included in the baseline. Still, according to this metric, occasional fake zombie runs are part of the routine.
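A rough sketch of such a negative binomial CUSUM using the surveillance package's glrnb function is given below. The `zombieattack` object is a hypothetical list built like `zombie` above; the values of mu0, the overdispersion alpha and the threshold c.ARL are illustrative choices, not the exact settings behind the figure.

```r
## Sketch: negative binomial CUSUM with a fixed baseline mean
phase1 <- 1:8                                   # baseline window kept fixed
mu0    <- mean(observed(zombieattack$ts)[phase1, ])
nb_surv <- glrnb(zombieattack$ts,
                 control = list(range = 9:nrow(zombieattack$ts),
                                mu0   = mu0,    # fixed in-control mean
                                alpha = 0.1,    # NegBin overdispersion
                                c.ARL = 5,      # alarm threshold of the CUSUM
                                theta = NULL,   # shift estimated by max. likelihood
                                ret   = "cases"))
```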
Altogether, the “failed” test proves several points:
- it's hard to distinguish between previous outbreaks and irregular tweeting behaviour
- robust estimation of the baseline parameters might be needed
- the results are sensitive to the choice of which values to include in the baseline and to the underlying probability model
- is twitter monitoring really sensitive enough to detect weak signals early enough?
The take home message of this update
To enhance statistical competence and preserve sleep: stick with the negative binomial CUSUM!