How your tweets reveal your location

Original author: Technology Review
  • Translation

Researchers at IBM have developed an algorithm that can determine a user's place of residence with an accuracy of 70% by analyzing their last 200 tweets.



USA tweets

One of Twitter's optional features is the ability to attach location information to tweets. Users typically rely on it to tell friends where they are right now, or to remember later where a particular event took place. It is also a valuable tool for researchers, who can use it to study the geographical distribution of tweets in various ways.

At the same time, this feature raises privacy concerns, especially when users do not know, or forget, that their tweets are geotagged. A fairly large number of celebrities are believed to have revealed their home addresses this way. And in 2007 in Iraq, four Apache helicopters belonging to the US Army were destroyed by mortar fire after insurgents worked out their location from geotags in photographs published by American soldiers.

Perhaps these concerns are the reason so few tweets are geotagged. Several studies have shown that fewer than one percent of tweets contain location metadata.

But the absence of geotags does not mean that your location remains a secret. Today, Jalal Mahmud and several of his colleagues at IBM Research in Almaden, California, said they had developed an algorithm that analyzes any user's last 200 tweets and determines their location with an accuracy of 70%.

Such a capability is very useful for researchers, journalists, marketers, and anyone else who wants to determine where a particular tweet was written. On the other hand, it raises privacy concerns for those who prefer to keep their whereabouts confidential.

The method used by Mahmud and his colleagues is relatively simple. Between July and August 2011, they used the Twitter Firehose to filter tweets that were geotagged in one of the 100 largest cities in the USA, collecting 100 different users in each city.

They then downloaded the last 200 tweets posted by each of these users, excluding private ones. This gave them more than 1.5 million geotagged tweets from approximately 10,000 people.

Then they divided the data set into two parts, using 90% of the tweets to train their algorithm, and the remaining 10% to test it.
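The 90/10 split described above can be sketched in a few lines of Python. This is only an illustration: the function name `split_train_test`, the fixed random seed, and the toy `(text, city)` records are assumptions, and the article does not say how the researchers shuffled or grouped their data.

```python
import random


def split_train_test(records, train_fraction=0.9, seed=0):
    """Shuffle the records and split them into (train, test) lists."""
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy data: (tweet_text, city) pairs.
records = [(f"tweet {i}", "Boston" if i % 2 else "Denver") for i in range(100)]
train, test = split_train_test(records)
print(len(train), len(test))
```

Every record ends up in exactly one of the two lists, so the test set is never seen during training.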

The main idea behind their algorithm is that the text of the tweets itself contains important information about the user's likely location. For example, more than 100,000 tweets in their sample were generated by Foursquare, a location-based social network, and so contained a link giving accurate data about the user's location. Almost 300,000 tweets also contained the name of one of the cities listed by the US Geological Survey.
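To make the two textual signals concrete, here is a minimal Python sketch that looks for city names and check-in links in a tweet. The three-entry `CITY_NAMES` list is a hypothetical stand-in for the much larger US Geological Survey list, and treating a `4sq.com/` link as the mark of a Foursquare check-in is an assumption, not something the article specifies.

```python
import re

# Hypothetical, tiny stand-in for the full list of city names.
CITY_NAMES = ["New York", "Boston", "San Francisco"]

# One case-insensitive pattern that matches any listed name as a whole word.
CITY_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in CITY_NAMES) + r")\b",
    re.IGNORECASE,
)


def mentioned_cities(text):
    """Return the listed city names that appear in the tweet text, in order."""
    canonical = {name.lower(): name for name in CITY_NAMES}
    return [canonical[match.lower()] for match in CITY_PATTERN.findall(text)]


def looks_like_checkin(text):
    """Assumption: check-in tweets carry a 4sq.com short link."""
    return "4sq.com/" in text.lower()


print(mentioned_cities("Flying from boston to New York tonight"))
print(looks_like_checkin("I'm at the airport http://4sq.com/abc123"))
```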

In other tweets, the author's location was given away by phrases such as “Why did we take a samovar?”, which is direct evidence of a visit to Tula. Mahmud also points out that in the United States, the distribution of tweets throughout the day is approximately the same in each time zone. The pattern of a user's tweets over the course of the day can therefore give fairly accurate information about their time zone.
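One way to picture the time-zone idea is to slide a user's hourly posting histogram against a reference daily profile and keep the offset that fits best. The Python sketch below does exactly that. It is not the researchers' method: the `REFERENCE` profile is invented for illustration, the offsets are standard-time UTC offsets, and daylight saving time is ignored.

```python
from collections import Counter

# Hypothetical reference profile: relative activity for local hours 0..23,
# peaking in the evening. The numbers are invented for this example.
REFERENCE = [1, 1, 1, 1, 1, 1, 2, 3, 4, 4, 4, 5,
             5, 4, 4, 4, 5, 6, 7, 8, 8, 6, 4, 2]

# Standard-time UTC offsets for the four main US time zones.
US_OFFSETS = {"Eastern": -5, "Central": -6, "Mountain": -7, "Pacific": -8}


def hourly_histogram(utc_hours):
    """Count tweets for each UTC hour 0..23."""
    counts = Counter(utc_hours)
    return [counts.get(hour, 0) for hour in range(24)]


def guess_time_zone(utc_hours):
    """Pick the zone whose shifted reference profile best matches the user."""
    hist = hourly_histogram(utc_hours)

    def score(offset):
        # local_hour = (utc_hour + offset) % 24
        return sum(hist[h] * REFERENCE[(h + offset) % 24] for h in range(24))

    return max(US_OFFSETS, key=lambda zone: score(US_OFFSETS[zone]))


# A user who posts at 19:00 and 20:00 Pacific time, i.e. 03:00 and 04:00 UTC.
print(guess_time_zone([3, 4, 3, 4, 4]))
```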

The researchers were thus trying to answer the following question: can all this information be used to determine a user's location? They could verify their results by comparing them with the geotag metadata.

The IBM researchers used an algorithm known as a Naive Bayes classifier. They trained it on the training data set, which included geolocation information.
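For readers unfamiliar with the technique, here is a minimal word-count Naive Bayes classifier in plain Python. It is a generic textbook sketch with add-one smoothing, not the IBM implementation: the class name, the whitespace tokenizer, and the four toy tweets are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated words."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing constant
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of texts
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.doc_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # log P(label) + sum of log P(word | label), with smoothing.
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log(
                    (counts[word] + self.alpha)
                    / (total_words + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data: tweet text labeled with a home city.
texts = ["go red sox tonight", "fenway park is packed",
         "hiking near the rockies", "broncos game at mile high"]
labels = ["Boston", "Boston", "Denver", "Denver"]
model = NaiveBayes().fit(texts, labels)
print(model.predict("red sox at fenway"))
```

Smoothing matters here because a single word that never appeared for a city would otherwise drive that city's probability to zero.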

They then tested the algorithm on the remaining 10% of the data to check whether the predicted user locations were correct.
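Checking predictions against the geotag ground truth comes down to counting matches. A minimal Python sketch, with a hypothetical `accuracy` helper and made-up labels:

```python
def accuracy(predicted, actual):
    """Return the fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical predictions for four held-out users and their true cities.
predicted = ["Boston", "Denver", "Boston", "Austin"]
actual = ["Boston", "Denver", "Denver", "Austin"]
print(accuracy(predicted, actual))
```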

The results were very interesting. Excluding people who travel, the algorithm predicts a person's home city with an accuracy of 68%, their home state with an accuracy of 70%, and their time zone with an accuracy of 80%. The researchers also say that determining the location of a single user takes less than one second.

This work could serve as a useful tool. Journalists, for example, could use it to identify tweets written from a region hit by a disaster (an earthquake, say), as distinct from tweets commenting on the event from elsewhere. Marketers could use it to increase the popularity of their products in particular cities.

Mahmud and his colleagues say that their algorithm may show even more impressive results in the future. For example, they expect to obtain more accurate information by searching tweets for references to local landmarks. We will have to wait and see what comes of it.

An interesting consequence of all this is that our idea of privacy has once again turned out to be more fragile than most of us believe. How we can strengthen and protect our right to privacy should be the subject of serious public debate.
