In 2013, a young computational biologist, Yaniv Erlich, shocked the research community by showing how to reveal the identities of people listed in an anonymous genetic database using only an Internet connection. Regulators responded by restricting access to anonymized biomedical genetic data sets. A spokesperson for the National Institutes of Health said at the time: “The chances that this will happen are small for most people, although not zero.”

Fast forward five years, and the amount of DNA information stored in digital databases has exploded, with no sign of slowing down. Consumer companies like 23andMe and Ancestry have compiled genetic profiles for more than 12 million people, according to recent estimates. Customers who download their own data can then choose to add it to public genealogy sites such as GEDmatch, which gained notoriety this year for its role in leading police to a suspect in the “Golden State Killer” case.

These intersecting family trees, which connect people through shared segments of DNA, have already grown so large that they can be used to locate half of the US population. According to a new study by Erlich, published in the journal Science in October 2018, more than 60% of Americans of European descent can be identified through their DNA using open genealogical databases, regardless of whether they ever submitted their own DNA.

“The upshot is that it doesn’t matter whether you have been tested or not,” says Erlich, chief science officer at MyHeritage, the third-largest consumer genetics company after 23andMe and Ancestry. “You can be identified because the databases already cover most of the United States, especially people of European origin.”

To derive these estimates, Erlich and his colleagues from Columbia University and the Hebrew University of Jerusalem analyzed the MyHeritage database, which contains 1.28 million anonymous users, mostly white, like the vast majority of genetic databases in the world. Treating each user as a “target,” they counted the number of relatives who shared large segments of DNA with that person, and found that 60% of searches turned up a second cousin or closer relative. That level of kinship was all investigators needed to find the “Golden State Killer” suspect and to crack another 17 cases, using what law enforcement calls a “search for distant relatives.”

The analysis produced a list of approximately 850 people, a number that depends on how prolific the target's ancestors were. From this starting point, basic demographic information quickly narrows the list. Knowing from public records where a person lives to within 160 km cuts the candidate pool in half. Knowing their age to within five years excludes 9 out of 10 of the remaining candidates. Sex, which can be determined from genetics, cuts the list down to about 16 people. An exact year of birth can leave just one or two candidates.
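The narrowing described above can be checked with simple arithmetic. The sketch below is only an illustration: it assumes each filter acts independently and uses the rounded factors from the text (one half for location, one tenth for age), plus an assumed one half for sex. The names `FILTERS` and `narrow_candidates` are invented here. With these round numbers the pool shrinks to about 21 people, the same order of magnitude as the roughly 16 quoted above, so the study's exact factors presumably differ a little from the rounded ones.

```python
# Rounded retention factors taken from the text. Treating them as
# independent multipliers is an assumption made for illustration only.
FILTERS = [
    ("location within 160 km", 0.5),  # cuts the pool in half
    ("age within five years", 0.1),   # excludes 9 out of 10
    ("sex inferred from DNA", 0.5),   # assumed to halve the pool again
]


def narrow_candidates(initial, filters):
    """Return the expected pool size before and after each filter."""
    sizes = [float(initial)]
    for _name, keep_fraction in filters:
        sizes.append(sizes[-1] * keep_fraction)
    return sizes


if __name__ == "__main__":
    sizes = narrow_candidates(850, FILTERS)
    print(f"start: {sizes[0]:.0f} candidates")
    for (name, _fraction), size in zip(FILTERS, sizes[1:]):
        print(f"after {name}: about {size:.0f}")
```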

To demonstrate how easy the process is, the researchers chose an anonymous woman from the 1000 Genomes Project, an open genomic data project, who was married to a man Erlich had previously identified in his well-known 2013 work. They reformatted her DNA data to resemble that of a typical customer of an online testing service and uploaded it to GEDmatch. The service found two relatives, one in North Dakota and one in Wyoming. The matches pointed to distant kinship, four to six generations back. After an hour of combing through public records, the team found the husband and wife from whom both relatives descended. From there, the researchers traced the genealogies of hundreds of descendants and worked out the identity of their target. The whole thing took one day.

Erlich believes the day is not far off when such a search can be carried out on anyone who has left their DNA somewhere. The study found that once a genetic database covers about 2% of the adult population of any ethnic group, a match to a second cousin or closer relative can be found for almost any person. The databases are richest for Americans of European descent, and for them this milestone could be reached within a few years if interest in recreational DNA tests holds at its current level. According to the latest US census, 2% of that population comes to just four million people.
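As a rough illustration of the 2% threshold, the sketch below computes how many profiles a database would need for a given population. The population of 200 million is not stated in the text; it is simply back-calculated from the article's own figures (four million divided by 2%), and the function names are hypothetical.

```python
COVERAGE_THRESHOLD = 0.02  # about 2% of the target population, per the text


def profiles_needed(population, threshold=COVERAGE_THRESHOLD):
    """Number of profiles needed to reach the coverage threshold."""
    return round(population * threshold)


def reaches_threshold(profiles, population, threshold=COVERAGE_THRESHOLD):
    """True if a database with `profiles` entries covers at least `threshold`."""
    return profiles / population >= threshold


if __name__ == "__main__":
    # Assumed population, back-calculated from "2% is four million".
    assumed_population = 200_000_000
    print(profiles_needed(assumed_population))
    print(reaches_threshold(5_000_000, assumed_population))
```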

Such a resource would dramatically expand the number and diversity of potential suspects whose data is available to law enforcement during investigations. Offender databases, in which police store the DNA of almost 17 million people, including convicted criminals and, in some states, people who have merely been arrested, mainly contain data on Black and Latino people. From the earliest days of DNA testing, incompatible technologies have kept a wall between criminal databases and databases of people who submit DNA for recreational or research purposes. Law enforcement collects and analyzes highly variable non-coding parts of the genome, counting the number of repeats in stretches of "junk" DNA. The result is essentially just a sequence of numbers, and it says nothing about a person’s traits. However, it is unique to each person, something like a barcode or a fingerprint. The method is also fast and cheap, which makes it ideal for police purposes.
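To see why such a profile is "just a sequence of numbers," consider the toy sketch below. It counts the longest run of back-to-back copies of a short motif in a DNA string and keeps one number per site. The sequences, motifs, and site names (`locus_A`, `locus_B`) are invented for illustration and are not real forensic markers, and the approach is a deliberate simplification of actual laboratory practice.

```python
import re


def longest_tandem_repeat(sequence, motif):
    """Return the largest number of consecutive copies of `motif` in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


# Made-up sites: flanking DNA + repeated motif + flanking DNA, and the motif.
SAMPLE = {
    "locus_A": ("CCTA" + "GATA" * 7 + "TTCG", "GATA"),
    "locus_B": ("AAGC" + "TCTA" * 11 + "GGAT" + "TCTA" * 2, "TCTA"),
}


def str_profile(sample):
    """Reduce each site to a single repeat count."""
    return {
        name: longest_tandem_repeat(sequence, motif)
        for name, (sequence, motif) in sample.items()
    }


if __name__ == "__main__":
    # The whole "profile" is nothing more than one small integer per site.
    print(str_profile(SAMPLE))
```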

Medical and recreational DNA records consist of either complete sequences or genotype arrays, which are sets of variants that each occur at a single position in the genome. Such a variant is called a single nucleotide polymorphism (SNP), and SNPs are what give you green eyes, curly hair, or a predisposition to heart disease. They are also much more useful for finding relatives. Because the two types of databases are not compatible, investigators in the “Golden State Killer” case had to extract DNA from old samples, create an SNP profile, and upload it to GEDmatch. But now even that may no longer be necessary.

Another paper, published in October in the journal Cell, demonstrated for the first time how to search for distant relatives using data from criminal databases. Noah Rosenberg's group at Stanford University had already shown that records in the two types of database can be linked by comparing SNPs located near the non-coding repeats. That work was published last year and did not attract much attention. “Silence,” says Rosenberg. But his latest work, which examines the cross-compatibility of the two kinds of database, has taken on new meaning in light of the “Golden State Killer” case.

“This could expand the reach of forensic genetics and potentially help solve even more old cases,” says Rosenberg. “At the same time, it would expose the data of participants in these databases to criminal-investigation searches that they probably did not expect.”

Legal experts see a bigger problem in what Rosenberg’s work implies: that the DNA profiles stored in police databases contain more information than previously thought. They can be used to accurately predict the coding regions of the genome, those associated with green eyes, curly hair, and heart problems. “All of the Supreme Court’s decisions holding that existing criminal databases do not violate the Fourth Amendment rest on the assumption that nothing can be extracted from this junk DNA,” says Andrea Roth, director of the Center for Law and Technology at the University of California, Berkeley. “And now that all turns to dust.”

Rosenberg did not release any software with the paper, so it will take some time before anyone can run real searches. But he says that anyone with access to several databases has all the information needed to start using the technique. That means built-in privacy protections could fall fairly quickly. The paper is intended as a warning that shows regulators what modern technology can do, and Rosenberg hopes it will start a long-overdue discussion about the storage and use of genetic information.

Erlich and his colleagues went further, recommending changes that would allow resources like GEDmatch, which provide an important service for people searching for missing relatives and for adoptees looking for their biological parents, to stay online and remain safe. They called on the US Department of Health and Human Services to revise its rules on personal health information so that they cover anonymized genomes. They also described an encryption strategy that would create a secure chain of custody for genetic data, so that databases could flag users trying to analyze other people's genetic data. But even if every company offering genome-related services were brought into this system, it might not be enough.

“I think the upshot is that everyone will now be under genetic surveillance unless we regulate the government’s ability to conduct genetic searches,” says Roth. She proposes a system similar to California’s regulation of the more traditional familial searches in criminal databases. There, such searches can be used only to investigate violent crimes, such as murder, and the scope of the search is limited so that it does not sweep in information about hundreds of innocent people. Oversight committees can prevent the careless disclosure of sensitive information if, say, someone’s father turns out not to be their biological father. “It’s ironic,” says Roth. “If your relative is in the CODIS database [the criminal database], you have more genetic privacy rights than if you have a relative in GEDmatch.” But once enough of your DNA is out there, it doesn't matter whether you want to be found or not. Opting out is no longer an option.

