FBI Detected: How I Found FBI Agents
In this new issue of Black Archeology of Datamining, we will play spies for a bit. We'll see what an ordinary data specialist can learn from open data on the internet.
It all started with an article on Habr saying that an anonymous hacker was sharing data on FBI agents that had been leaked online. I got hold of this data and started wondering what could be done with it. It contains only a last name, a first name, a work email and a phone number, so not much information.
Looking through it, I saw that the records stop at the letter J. In other words, the dataset is incomplete. Interesting, what is its full size? To find out, you need statistics on how frequently surnames occur.
To do this, I started looking for datasets of American surnames, and here a discovery awaited me: in America you can find open data on, say, a state's voters, and as far as I understand it is completely legal. For example, within half an hour I had the data of all Utah voters without any problems.
This is already much more interesting! In the first dataset we only had a last name, a first name and a one-letter middle initial (I treat the middle name like a patronymic here, although that is not quite accurate). Now we can find much more information on an FBI agent: for example, mailing address, full name, age, and political preferences. So let's get started.
To begin with, let's estimate how complete the leaked dataset is (this is where my research started). We build statistics on how often surnames occur in Utah, sum them up, and look at what share of all records falls on surnames up to the letter J. It turns out that we have a little under half of the data, more precisely 43%. Since the agents file holds about 22 thousand records, a complete list of agents would contain about 50 thousand. And in case anyone needs it, here is the frequency distribution of American surnames by first letter:
Letter | Total records | Frequency |
A | 128934 | 0.030 |
B | 401048 | 0.093 |
C | 298668 | 0.069 |
D | 197078 | 0.046 |
E | 80467 | 0.019 |
F | 152500 | 0.035 |
G | 200349 | 0.046 |
H | 325591 | 0.075 |
I | 17765 | 0.004 |
J | 121452 | 0.028 |
K | 184007 | 0.043 |
L | 183266 | 0.042 |
M | 399768 | 0.093 |
N | 73607 | 0.017 |
O | 53166 | 0.012 |
P | 199195 | 0.046 |
Q | 5802 | 0.001 |
R | 224124 | 0.052 |
S | 456642 | 0.106 |
T | 147229 | 0.034 |
U | 10559 | 0.002 |
V | 52085 | 0.012 |
W | 272087 | 0.063 |
X | 371 | 0.000 |
Y | 28468 | 0.007 |
Z | 27642 | 0.006 |
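As a rough check of the completeness estimate, the counts in the table can be summed directly. This is only a sketch: the 22 thousand figure is the approximate size of the agents file mentioned in the text, and since it is not known exactly where inside the letter J the leak stops, the code computes a lower bound (through I) and an upper bound (through J).

```python
# Record counts per surname initial, copied from the table above.
counts = {
    "A": 128934, "B": 401048, "C": 298668, "D": 197078, "E": 80467,
    "F": 152500, "G": 200349, "H": 325591, "I": 17765, "J": 121452,
    "K": 184007, "L": 183266, "M": 399768, "N": 73607, "O": 53166,
    "P": 199195, "Q": 5802, "R": 224124, "S": 456642, "T": 147229,
    "U": 10559, "V": 52085, "W": 272087, "X": 371, "Y": 28468,
    "Z": 27642,
}

total = sum(counts.values())


def share_through(letter: str) -> float:
    """Fraction of all records whose surname initial is <= letter."""
    covered = sum(n for initial, n in counts.items() if initial <= letter)
    return covered / total


lower = share_through("I")  # as if the leak contained no J surnames at all
upper = share_through("J")  # as if the leak contained every J surname

leaked_records = 22_000     # approximate size of the agents file (from the text)
print(f"share through I: {lower:.3f}, through J: {upper:.3f}")
print(f"estimated full size: {leaked_records / upper:,.0f} "
      f"to {leaked_records / lower:,.0f} records")
```

The 43% quoted in the text falls between the two bounds, which fits a file that breaks off partway through the letter J.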
Next, let's find the agents in the voter list. First, we look for matches on last name, first name, and the first letter of the middle name (that is all the information we have on the agents). The voter dataset is very large, and this step shrinks it considerably, so that it at least fits into the memory of my very ancient computer.
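A minimal sketch of such a match, using only the standard library. The column names (`last`, `first`, `mi`, `middle`, `city`) and the sample rows are made up for illustration; the real files will have their own layout. The small agents list is loaded into a set, and the large voter file is streamed row by row so it never has to sit in memory as a whole.

```python
import csv
import io


def name_key(last: str, first: str, middle: str) -> tuple[str, str, str]:
    """Normalize a name to (LAST, FIRST, middle initial) for matching."""
    return (last.strip().upper(), first.strip().upper(), middle.strip().upper()[:1])


def filter_voters(agent_rows, voter_rows):
    """Yield only the voter rows whose name key also appears among the agents."""
    agent_keys = {name_key(r["last"], r["first"], r["mi"]) for r in agent_rows}
    for row in voter_rows:  # consumed one row at a time
        if name_key(row["last"], row["first"], row["middle"]) in agent_keys:
            yield row


# Tiny invented samples standing in for the two files.
agents_csv = "last,first,mi\nSmith,John,A\nDoe,Jane,Q\n"
voters_csv = (
    "last,first,middle,city\n"
    "SMITH,JOHN,ALLEN,Provo\n"
    "Smith,John,Brian,Ogden\n"
    "Doe,Jane,Quinn,Logan\n"
    "Roe,Richard,T,Moab\n"
)

agents = list(csv.DictReader(io.StringIO(agents_csv)))
matches = list(filter_voters(agents, csv.DictReader(io.StringIO(voters_csv))))
print([m["city"] for m in matches])
```

With real data, `io.StringIO(...)` would be replaced by open file handles; the matching logic stays the same.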
I compute the intersection, and here the first surprise awaits. There are a lot of matches: almost 15 thousand out of the 22 thousand records in the agents file. It is unlikely that the whole FBI lives in one state; rather, some surnames are very popular in America, so there are too many coincidences on last name, first name, and middle initial. Well, let's filter further.
We keep only surnames that occur exactly once. These are rare surnames, and a match on last name and first name is most likely enough to identify a person. It is unlikely that we will meet another Serine Hovhannisyan. After filtering, we get a dataset of 193 unique records. Got it!
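The rare-surname filter can be sketched like this. It is an assumption that surnames are counted within the set of matched rows; the text does not say which dataset the counting was done on, and the field names and sample rows are invented.

```python
from collections import Counter


def unique_surname_rows(rows):
    """Keep only the rows whose surname appears exactly once in `rows`."""
    surname_counts = Counter(r["last"].upper() for r in rows)
    return [r for r in rows if surname_counts[r["last"].upper()] == 1]


# Invented sample: two Smiths are ambiguous, the rare surname is kept.
matched = [
    {"last": "Smith", "first": "John"},
    {"last": "Smith", "first": "Mary"},
    {"last": "Hovhannisyan", "first": "Serine"},
]
rare = unique_surname_rows(matched)
print(rare)
```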
Most likely, these are our agents, with full details: mailing address, full name, date of birth, and political preferences (this is a voter list, and it contains data on how each person has voted since 2002). Just in case, I won't publish the result; what if the Agency really does have long arms :)
It is better to compute some statistics on this data instead. For example, a histogram of age:
Minimum age: 21 years (the voting age in the US is 18)
Maximum: 90 years
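For reference, an age histogram like this can be built from dates of birth in a few lines. The birth dates and the reference date below are invented, and the ten-year bin width is an arbitrary choice for the sketch.

```python
from collections import Counter
from datetime import date


def age_on(birth: date, today: date) -> int:
    """Full years elapsed between `birth` and `today`."""
    years = today.year - birth.year
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1  # birthday has not happened yet this year
    return years


def age_histogram(births, today, bin_width=10):
    """Map the lower edge of each age bin to the number of people in it."""
    bins = Counter((age_on(b, today) // bin_width) * bin_width for b in births)
    return dict(sorted(bins.items()))


# Invented dates of birth and a fixed reference date for reproducibility.
births = [date(1995, 1, 15), date(1990, 6, 1), date(1926, 3, 3), date(1975, 12, 31)]
hist = age_histogram(births, today=date(2016, 3, 1))
for start, n in hist.items():
    print(f"{start:>3}+ {'#' * n}")
```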
Political preferences. I determined party membership either from the declared affiliation (the dataset contains this information) or from the person consistently voting for one of the parties.
Of the 193 people, 43 are Republicans and 32 are Democrats.
Interesting: I thought there would be noticeably more Republicans.
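The party rule described above can be sketched as a small function. How the voter file actually encodes affiliation and voting history is not shown in the text, so the labels (`REP`, `DEM`) and the list-of-ballots format are assumptions, as is reading "consistently" as "every recorded ballot is for the same party".

```python
from collections import Counter


def infer_party(declared, ballots):
    """Return a party label, or None if neither rule applies.

    declared: declared affiliation, or an empty string / None if absent
    ballots:  party label for each recorded election (hypothetical format)
    """
    if declared:
        return declared
    parties = set(ballots)
    if len(parties) == 1:  # every recorded ballot went to the same party
        return next(iter(parties))
    return None


# Invented records: (declared affiliation, voting history).
people = [
    ("REP", ["REP", "DEM"]),
    ("", ["DEM", "DEM", "DEM"]),
    (None, ["DEM", "REP"]),
    ("", []),
]
tally = Counter(infer_party(declared, ballots) for declared, ballots in people)
print(tally)
```

People who match neither rule end up under `None`, which is consistent with the Republican and Democrat counts in the text adding up to fewer than 193.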
How accurate is this data? The comments in the Reddit thread linked above contain links to datasets for most states. One could also gather information from social networks, and... no, thank you. I do not want to spend my life in the Ecuadorian embassy.
Oh, someone is ringing the doorbell. One second, I'll see who's there, and then I'll write about how the preserves...