FBI Detected: How I Found FBI Agents

    In this installment of Black Archeology of Data Mining, we'll play spies for a bit and see what an ordinary data specialist can learn from open data on the web.

    It all started with an article on Habr reporting that an anonymous hacker was sharing data on FBI agents that had been leaked online. I got hold of this data and started wondering what could be done with it. It contains only a last name, a first name, a work email address, and a phone number: not much information.



    Looking through the data, I noticed that it ends at the letter J. In other words, the dataset is incomplete. I wonder, what is its full size? To find out, we need statistics on how frequently surnames occur.

    To do this, I started looking for datasets of American surnames, and a discovery awaited me: in America you can find open data on, say, a state's voters, and as I understand it, this is completely legal. For example, within half an hour I had the data on all Utah voters without any trouble.



    This is already much more interesting! In the first dataset we only had a last name, a first name, and a one-letter middle initial (I treat the middle name like a patronymic here, although that is not quite accurate). Now we can find much more information about an FBI agent: for example, mailing address, full name, age, and political preferences. So let's get started.

    To begin with, we will evaluate the completeness of the agents dataset (the question my research started from). We build statistics on how often surnames occur in Utah, sum them up, and look at what share falls on surnames up to the letter J. It turns out that we have a little under half of the data, more precisely 43%. With about 22 thousand records in the file, a complete list of agents would be about 50 thousand records. And in case anyone needs it, here is the frequency distribution of American surnames by first letter:
    Letter  Total records  Frequency
    A       128934         0.030
    B       401048         0.093
    C       298668         0.069
    D       197078         0.046
    E       80467          0.019
    F       152500         0.035
    G       200349         0.046
    H       325591         0.075
    I       17765          0.004
    J       121452         0.028
    K       184007         0.043
    L       183266         0.042
    M       399768         0.093
    N       73607          0.017
    O       53166          0.012
    P       199195         0.046
    Q       5802           0.001
    R       224124         0.052
    S       456642         0.106
    T       147229         0.034
    U       10559          0.002
    V       52085          0.012
    W       272087         0.063
    X       371            0.000
    Y       28468          0.007
    Z       27642          0.006
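
    The estimate above can be sketched roughly as follows. This is a minimal illustration, not my original script: the function names and the toy surname list are made up, and it assumes for simplicity that the leaked file covers every surname up to and including the cut-off letter.

```python
from collections import Counter


def letter_shares(surnames):
    """Return {first letter: share of all surnames starting with it}."""
    counts = Counter(name[0].upper() for name in surnames if name)
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}


def estimate_full_size(observed_records, shares, last_letter):
    """Scale the observed record count up by the share of surnames covered."""
    # Assumption: every letter up to and including last_letter is fully present.
    covered = sum(s for letter, s in shares.items() if letter <= last_letter)
    return observed_records / covered, covered


# Toy reference list standing in for the voter surnames (made-up values).
reference = ["Adams", "Baker", "Brown", "Clark", "Jones",
             "King", "Moore", "Smith", "Young", "Zuniga"]
shares = letter_shares(reference)
full_size, covered = estimate_full_size(22000, shares, "J")
print(f"covered share: {covered:.2f}, estimated full size: {full_size:.0f}")
```

    If the file stops partway through a letter, the covered share is somewhat smaller than this simplified calculation suggests.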



    Next, we find the agents in the voter list. First, we look for the intersection by last name, first name, and middle initial (this is all the information we have on the agents). The voter dataset is very large, and this step shrinks it considerably, so that it at least fits in the memory of my very ancient computer.

    I compute the intersection, and here the first surprise awaits me: there are a lot of matches, almost 15 thousand out of the 22 thousand records in the agents file. It is unlikely that the whole FBI lives in one state. It's just that America has some very common surnames, so there are too many coincidental matches on last name, first name, and middle initial. Well, let's filter further.
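
    A matching step of this kind might look like the sketch below. The column names (`last`, `first`, `middle`, `city`) and the sample rows are invented for illustration and are not the real schemas of either file. The voter rows are streamed one at a time, so only the small agent key set and the matches stay in memory.

```python
import csv
import io


def name_key(row):
    """Normalized (last, first, middle initial) key used for matching."""
    middle = row.get("middle", "").strip()
    return (row["last"].strip().upper(),
            row["first"].strip().upper(),
            middle[:1].upper())


def match_voters(agent_rows, voter_rows):
    """Keep only voter rows whose key appears in the agent list."""
    agent_keys = {name_key(r) for r in agent_rows}
    return [v for v in voter_rows if name_key(v) in agent_keys]


# Made-up sample data; column names are assumptions, not the real schemas.
agents_csv = "last,first,middle\nDoe,Jane,Q\nRoe,Richard,T\n"
voters_csv = ("last,first,middle,city\n"
              "Doe,Jane,Quinn,Provo\n"
              "Doe,Jane,Anne,Ogden\n"
              "Poe,Edgar,Allan,Logan\n")

agents = list(csv.DictReader(io.StringIO(agents_csv)))
voters = csv.DictReader(io.StringIO(voters_csv))  # streamed row by row
matches = match_voters(agents, voters)
print(matches)
```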

    We keep only the surnames that occur exactly once. These are rare surnames, and a match on surname and first name is most likely enough to identify a person: we are unlikely to meet a second Serine Hovhannisyan. After filtering, we get a dataset of 193 unique records. Got it!
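
    One possible reading of this filter, sketched with made-up names: a surname counts as rare if it occurs exactly once in the whole voter list. Counting occurrences among the matches instead would be an equally plausible variant. The helper name and the record layout are hypothetical.

```python
from collections import Counter


def keep_unique_surnames(matches, all_voters):
    """Keep matches whose surname occurs exactly once among all voters."""
    surname_counts = Counter(v["last"].upper() for v in all_voters)
    return [m for m in matches if surname_counts[m["last"].upper()] == 1]


# Made-up records: "Smith" appears twice, "Quuxley" only once.
all_voters = [
    {"last": "Smith", "first": "John"},
    {"last": "Smith", "first": "Mary"},
    {"last": "Quuxley", "first": "Ann"},
]
matches = [all_voters[0], all_voters[2]]
rare = keep_unique_surnames(matches, all_voters)
print(rare)
```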

    Most likely, these are our agents, with full details: mailing address, full name, date of birth, and political preferences (this is a voter list, and it records how each person has voted since 2002). Just in case, I won't publish the result. What if the Agency really does have a long reach :)

    Better to compute some statistics on this data instead. For example, a histogram of ages:



    Minimum age: 21 years (voting in the US is allowed from age 18)
    Maximum: 90 years
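
    For reference, ages and a simple histogram can be derived from dates of birth along these lines. The birth dates, the reference date, and the ten-year bin width are all made up for the example.

```python
from collections import Counter
from datetime import date


def age_on(birth, today):
    """Full years between birth and today."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)


def age_histogram(birth_dates, today, bin_width=10):
    """Count people per age bin, keyed by the lower edge of the bin."""
    bins = Counter((age_on(b, today) // bin_width) * bin_width
                   for b in birth_dates)
    return dict(sorted(bins.items()))


# Made-up birth dates and reference date.
births = [date(1990, 5, 17), date(1985, 1, 2),
          date(1950, 12, 31), date(1926, 3, 3)]
today = date(2016, 3, 1)
ages = [age_on(b, today) for b in births]
hist = age_histogram(births, today)
print(min(ages), max(ages), hist)
```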

    Political preferences. I determined party membership either from the declared affiliation (the dataset contains this information) or from a person consistently voting for one of the parties.
    Of the 193 people, 43 are Republicans and 32 are Democrats.
    Interesting: I expected noticeably more Republicans.
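
    The party rule described above could be sketched like this. The record layout, the `infer_party` helper, and the threshold of three elections are assumptions made for the example. The text only says "consistently", without giving a number.

```python
from collections import Counter


def infer_party(declared, history, min_elections=3):
    """Return a party label, or None when nothing can be inferred."""
    if declared:
        return declared  # the declared affiliation takes priority
    # Otherwise require the same party in every recorded election.
    if len(history) >= min_elections and len(set(history)) == 1:
        return history[0]
    return None


# Made-up records: (declared affiliation, party per recorded election).
people = [
    ("Republican", []),
    ("", ["Democratic"] * 4),
    ("", ["Democratic", "Republican", "Democratic"]),
    (None, ["Republican"]),
]
tally = Counter(infer_party(declared, history) for declared, history in people)
print(tally)
```

    People for whom neither rule applies end up under `None`, which is how most of the 193 records would be counted if only 75 have an inferred party.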

    How reliable is this data? The comments at the reddit link above contain links to datasets for most states. One could also gather information from social networks, and... no, thank you. I do not want to spend my life in the Ecuadorian embassy.
    Oh, someone is ringing the doorbell. One second, I'll go see who's there, and then I'll write about how to preserv...
