Dynamic visualization of geocoded data (Twitter) using R
- From the sandbox
- Tutorial
“New Year paces the country”
I am an ardent fan of geosocial services. They allow you to visually see the physical implementation of social space. This is what Bourdieu wrote about , but what was accessible to him only in the form of a mental construct. Foursquare is generally my unrequited love. But about it somehow next time, and today we’ll talk about Twitter.
Shortly before the end of the previous 2012 year, I wanted to see what the “wave” of New Year's congratulations tweets looks like. See how it goes through time zones. No sooner said than done. Used tools: R, Python and ffmpeg.

Data collection
While my wife took up the festive preparations and did not remember about me, using python and tweetstream we make a primitive parser of the statuses / filter tweet stream .
import tweetstream
stream = tweetstream.FilterStream('TWITTER_LOGIN', 'TWITTER_PASSWORD', track = keywords_list)
for tweet in stream:
if tweet['coordinates']:
# Декодируем время создания твита в date-time объект
# (в API Twitter встречаются два возможных варианта написания)
try:
timestamp = datetime.datetime.strptime(tweet['created_at'], '%Y-%m-%d %H:%M:%S')
except ValueError:
timestamp = datetime.datetime.strptime(tweet['created_at'][4:-10]+tweet['created_at'][-4:], '%b %d %H:%M:%S %Y')
# Прибавляем 4 часа - "переводим" время на московское
timestamp += datetime.timedelta(hours=4)
# Сохраняем время создания в виде строки
timestamp = datetime.datetime.strftime(timestamp, '%Y-%m-%d %H:%M:%S')
# Пишем все в БД
cursor.execute('INSERT INTO tweets(link, latitude, longitude, date) VALUES ("http://www.twitter.com/{0}/status/{1}", "{2}", "{3}", "{4}")'.format(
tweet['user']['screen_name'],
tweet['id'],
str(tweet['coordinates']['coordinates'][0]),
str(tweet['coordinates']['coordinates'][1]),
str(tweet['created_at'])
))
db.commit()
keywords_list- List of keywords, hash tags and phrases related to Happy New Year. Do not forget that this API call does not take into account morphology, so you should try to add as many relevant word forms as possible to the keywords. My list:- #happy New Year,
- #Christmas2013,
- # NG2013,
- #With the coming 2013,
- #New Year,
- #Happy New Year,
- #NEW YEAR tweet,
- happy New Year,
- holiday greetings,
- happy new 2013,
- Happy Year of the Snake
- Happy new year of the snake
- New Year,
- new year.
That's all, the wife drew attention to the extra working hands, so we launch the parser and rush to help. Finally, we contribute to the "increase in entropy":

Overview of the data
twits <- read.csv2('ny_tweets.csv', header=F)
colnames(twits) <- c('Link', 'Longitude', 'Latitude', 'Timestamp')
twits$Timestamp <- strptime(twits$Timestamp, format='%Y-%m-%d %H:%M:%S')
twits$Latitude <- round(as.numeric(as.character(twits$Latitude)), digits=1)
twits$Longitude <- round(as.numeric(as.character(twits$Longitude)), digits=1)
twits$Longitude <- sapply(twits$Longitude, function(x){
if(x < (-169)){
x<-360+x
} else {x}
})
We translate timestamps into the appropriate format, round off latitudes and longitudes - we will watch twitter activity on the grid at one tenth of a degree. The last conversion of longitudes is necessary for further visualization so as not to lose messages from Chukotka.
Now we load ggplot2 and look at the time distribution:
library('ggplot2')
p <- ggplot()
p <- p + geom_histogram(aes(x=twits$Timestamp, fill = ..count..), binwidth = 3600)
p <- p + ylab('Количество') + xlab('Время (по часам)')
p <- p + theme(legend.position = 'none')
p <- p + ggtitle(expression('Распределение "новогодних" твитов'))
p

Everything looks approximately as expected, which is pleasantly pleasing. We can go on.
Map rendering
Ggplot2 offers a rich set of tools for quickly building informative and attractive graphs. Full documentation with examples: ggplot docs .
To draw the map, we will use the geom_polygon function, which allows us to draw polygons at given coordinates. The contours of countries will be taken from the maps library . The rendering option described in the note “How to draw good looking maps in R” was taken as the basis . We slightly modify the described option, customizing it to our own needs and “ideas about the beautiful”:
# Грузим все данные
# countries - вектор с названиями стран, которые надо отрисовывать
library('maps')
full_map <- map_data('world')
table(full_map$region)
need.map <- subset(full_map, region %in% countries & long>-25 & long<190 & lat>25)
# Рисуем контуры
p <- ggplot()
p <- p + geom_polygon(aes(x=need.map$long, y=need.map$lat, group = need.map$group), colour='white', fill='grey20', alpha=.5)
# Рисуем "точки" отдельных крупных городов для "оживления пейзажа"
# cities - это dataframe вида "название города - широта - долгота"
p <- p + geom_point(aes(cities$Long, cities$Lat), colour='skyblue', size=1.5)
# Убираем подписи к осям, легенду, фиксируем размер поля для вывода, добавляем заголовок
p <- p + theme(axis.line=element_blank(),axis.text.x=element_blank(),
axis.text.y=element_blank(),axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
legend.position = 'none',
text=element_text(family='mono', size=20, face='bold', colour='dodgerblue3')
)
p <- p + scale_x_continuous(limits = c(-15, 190))
p <- p + scale_y_continuous(limits = c(30, 82))
p <- p + ggtitle(expression('#HappyNewYear in Russian Twitter - 2013'))
I want to somehow revive the picture and add information. Put the clock in the frame. Here's an interesting option with a clock: Create an animated clock in R with ggplot2 (and ffmpeg) . But there the polar coordinate system is used, which is not suitable for us, because It will require extra body movements in the form of creating a subgraph. Have to make your own "bike." We take a function for calculating the coordinates of the circle in the center and diameter (peered here ):
draw.circle <- function(center,diameter=1, npoints = 100){
r = diameter / 2
tt <- seq(0,2*pi,length.out = npoints)
xx <- center[1] + r * cos(tt)
yy <- center[2] + r * sin(tt) * roundcoef
return(data.frame(x = xx, y = yy))
}
Now we count the points for three circles: a white backdrop, a gray “top” background and 12 points on a circle for time serifs:
curtime <- c(as.numeric(format(frame.time, '%H')), as.numeric(format(frame.time, '%M')))
clock.center <- c(180, 35)
circdat <- draw.circle(clock.center, diameter=20)
circdat2 <- draw.circle(clock.center, diameter=19.7)
circdat3 <- draw.circle(clock.center, diameter=18, npoints=13)
We consider the position of the arrows:
arrow.r = c(5.5,8.8) # Длина стрелок
if(curtime[1]>=12){curtime[1]=curtime[1]-12}
hourval <- pi*(.5 - (curtime[1]+(curtime[2]/60))/6)
minval <- pi*(.5 - curtime[2]/30)
hour.x <- clock.center[1] + arrow.r[1] * cos(hourval)
hour.y <- clock.center[2] + arrow.r[1] * sin(hourval)
minute.x <- clock.center[1] + arrow.r[2] * cos(minval)
minute.y <- clock.center[2] + arrow.r[2] * sin(minval)
Draw the clock (functions geom_polygon, geom_segment, geom_point):
# Три окружности: фон и часовые засечки
p <- p + geom_polygon(aes(x=circdat$x,y=circdat$y), colour='grey100', fill='grey100', alpha = .5)
p <- p + geom_polygon(aes(x=circdat2$x,y=circdat2$y), colour='grey80', fill='grey80', alpha = .5)
p <- p + geom_point(aes(circdat3$x, circdat3$y), colour='skyblue')
# Часовая стрелка, минутная, "гвоздик" оси стрелок
p <- p + geom_segment(aes(x=clock.center[1], y=clock.center[2], xend=hour.x, yend=hour.y), size=3, colour='dodgerblue3')
p <- p + geom_segment(aes(x=clock.center[1], y=clock.center[2], xend=minute.x, yend=minute.y), size=1.5, colour='dodgerblue4')
p <- p + geom_point(aes(clock.center[1], clock.center[2]), colour='blue4')
Here's what happened (there are also test data on the “activity” on the map, but only in Moscow and its environs):

Draw frames
Initial data - a table with the coordinates and time of publication of the tweet. At the output, we want to get a set of frames. Let's do it this way: in the settings we will set the time of the 1st frame, the time of the last frame and the time difference between adjacent frames. The total number of frames will be calculated in R. Next is a cycle from 1 to the last frame. In each frame, the “current” time will be calculated. Next - from the original table we will select tweets that fall into the required time interval (30 minutes before the "current" frame time - the "current" frame time).
start.date <- '2012-12-31 00:00:00'
finish.date <- '2013-01-01 12:00:00'
seconds.in.frame <- 30
start.date <- strptime(start.date, format='%Y-%m-%d %H:%M:%S')
finish.date <- strptime(finish.date, format='%Y-%m-%d %H:%M:%S')
frames <- as.numeric(difftime(finish.date, start.date, units='secs'))/seconds.in.frame
for(i in 1:frames){
frame.time <- start.date + i*seconds.in.frame
frame.twits <- subset(twits, Timestamp <= frame.time & Timestamp > frame.time - ma.period)
...
Next, we make the contingency table “latitude – longitude” (I recall that the latitude and longitude values were previously converted from continuous to discrete by rounding to one tenth). And we “expand” the contingency table in a dataframe of 3 columns: latitude, longitude, frequency (the number of tweets at a given point). We remove the lines in which the frequency is zero or the coordinates go beyond the established boundaries.
...
frame.twits <- melt(table(frame.twits$Latitude, frame.twits$Longitude))
colnames(frame.twits) <- c('Lat', 'Long', 'Volume')
frame.twits$Lat <- as.numeric(as.character(frame.twits$Lat))
frame.twits$Long <- as.numeric(as.character(frame.twits$Long))
frame.twits <- frame.twits[frame.twits$Volume>0 &
frame.twits$Long>=-25 & frame.twits$Long<=190 &
frame.twits$Lat>=25 & frame.twits$Lat<=85,]
...
Converting latitude and longitude to numeric variables is necessary because after “collapsing-expanding” the data, it is converted to categorical (factor in R terminology).
It remains to count the colors of the dots. To do this, in the preliminary cycle, the maximum possible amount of tweets at one point for the entire future "movie" is calculated. It is taken at the maximum (
max.color), and the colors for all other points are calculated relative to it (logarithm - to "align" the scale): ...
frame.colors <- round(1 + (8*log(frame.twits$Volume)/max.color), digits=0)
...
Now you can draw dots (if any):
...
if(nrow(frame.twits)>0){
p <- p + geom_point(aes(frame.twits$Long,frame.twits$Lat, size=frame.twits$Volume * 5),
colour=twits.colors[frame.colors], alpha = .75)
}
...
It seems to be all, you can save the image and close the loop, but for the next "revitalization of the picture" I decided to add some real messages to the frame. To do this, I accidentally selected several dozens of tweets from a common array, collected message texts for them and created another dataframe - publication time, tweet text, “lifetime”, “appearance time”, “disappearance time”, color, coordinates, size and transparency . The lifetime, coordinates, color and size are generated when the script is initialized with random deviations from the "optimal" values:
twit.texts$x <- rnorm(nrow(twit.texts), mean = 100, sd = 30)
twit.texts$y <- rnorm(nrow(twit.texts), mean = 56, sd = 15)
twit.texts$size <- rnorm(nrow(twit.texts), mean = 10, sd = 2)
The output to the frame is controlled using transparency: if the “current” time of the frame does not fall into the “lifetime” of the tweet, its “opacity” is zero. If in the current time the tweet “exists”, its opacity is calculated depending on the proximity to the center of the life period. It turns out that the tweet "appears" and gradually disappears. In code, it looks like this:
...
twit.texts$opacity <- as.numeric(by(twit.texts, 1:nrow(twit.texts), function(row){
if(frame.time < row$t.start | frame.time > row$t.end){
row$opacity <- 0
} else {
row$opacity <- 0.7 *
(1 - (abs(as.numeric(difftime(row$Timestamp, frame.time, unit='sec'))) /
(row$t.delta * seconds.in.frame / 2)))
}
}))
p <- p + geom_text(aes(x=twit.texts$x, y=twit.texts$y, label=iconv(twit.texts$Text,to='UTF-8')),
colour=twit.texts$color, size=twit.texts$size, alpha = twit.texts$opacity)
...
Text output is organized using geom_text.
Now it’s all. You can save the frame and close the loop.
...
f.name <- as.character(i)
repeat{
if(nchar(f.name) < nchar(as.character(frames))){
f.name <- paste('0', f.name, sep='')
} else {
break
}}
ggsave(p, file=paste('frames/img', f.name, '.png', sep=''), width=6.4, height=3.6, scale = 3, dpi=100)
}
The length of f.name “zeros” is adjusted so that all names “fit” under one mask by the number of characters.
We collect video
To assemble the final roller was used ffmpeg :
ffmpeg -f image2 -i img%04d.png -q:v 0 -vcodec mpeg4 -r 24 happynewyear.mp4But is not that simple. The whole video is not collected - .png files received from ggplot2 have different color depths. Perhaps this problem could be solved in some more correct way, but I used the Python Imaging Library :
import os
from PIL import Image
path = u'/path/to/frames/'
dirList=os.listdir(path)
for filename in dirList:
if filename[-3:] == 'png':
im = Image.open(path + filename).convert('RGB')
im.save(path + filename)
Now all the frames are collected in one video. The video is ready, you can upload to YouTube. Add a couple of YouTube effects (because if the gun hangs on the wall, you need to shoot from it from time to time).
The final:
In this text, I have described not all parts of the R-script, but part of the quotes from the code are arranged according to the principle of the problem being solved, and not according to the logic of work. Therefore, in the application, I post a link to the full code with a small number of comments: pastebin.com
Possible development options
- Add date as The 12-hour dial is not very informative when you want to visualize a time interval of several days.
- Add a bar with a histogram or something like a graph of probability density over the total amount of data at each moment in time.
Community Questions
- Is there a library for R that is comparable in simplicity to maps, but with more relevant information on modern geopolitics? Without USSR, Yugoslavia and Czechoslovakia?
- How can I try to optimize the calculation of the colors of the point palette so as not to “run” the entire “idle” cycle first?
- Is it possible to “overcome” ggplot2 or ffmpeg so that additional recoding of frame color is not required (exclude PIL from the process)?
- And in general - I’ll listen to other indications of my own humanitarian curvature and optimization opportunities.
Thanks!
useful links
- How to draw good looking maps in R - using maps and ggplot2 to draw maps;
- Create an animated clock in R with ggplot2 (and ffmpeg) - animated analog clock on R;
- Simple data mining and plotting data on a map with ggplot2 - use of OSM maps in R.