They volunteered to provide genetic samples for research and to be listed in a publicly accessible database, based on assurances that their names would not be linked to the samples.
But nearly 50 people who had anonymously participated in studies were identified by a team of Cambridge scientists who used Internet searches and genealogy websites to discern their names, demonstrating that, like credit card and bank account numbers, genetic information is vulnerable to hacking.
Scientists at the Whitehead Institute for Biomedical Research showed how easily this sensitive health information could be revealed and possibly fall into the wrong hands. Identifying the supposedly anonymous research participants did not require fancy tools or expensive equipment: It took a single researcher with an Internet connection about three to seven hours per person.
In the era of Facebook and face-recognition software, the loss of privacy is part of daily conversation, but many people do not realize how much they have already sacrificed. Adding DNA to the mix raises the stakes because of how much information it carries about an individual’s disease risk and traits.
The feat, reported in the journal Science on Thursday, has already triggered action at the National Institutes of Health, which has removed the ages of the participants from a searchable repository of genetic material and put the information under tighter control. It has also sparked a broader discussion about how to better share data among researchers while protecting individuals.
“This is not shocking; I think this is just a moment of recognition, a reflection moment,” said Dr. Eric Green, director of the National Human Genome Research Institute, which worked with the cell repository to remove the ages of participants. “We have all these values which are totally laudable, but are beginning to come into conflict. What is the best way to navigate this?”
Green said some scientists fear that restricting access to genomic databases could slow research, and “in some people’s views, none of this could be completely private in this era. . . . Therefore, what we should be doing is changing the conversation and [being] very open.”
The researchers did not undertake the project with the intent of exposing individuals or violating their privacy. Yaniv Erlich, a fellow at the Whitehead Institute who led the research, was inspired by a previous job at a computer security company. To check the robustness of a bank or credit card company’s database, he would do vulnerability testing, trying to break in to identify security weaknesses.
Now, as a researcher who works with DNA, he became interested in testing the reliability of the assurances given to research volunteers that they were unlikely to be identified when they provided DNA for studies.
Erlich decided to use genealogy websites, which publicly post limited genetic information taken from men’s Y chromosomes to help people try to track down their ancestors. Specifically, he and his team examined short tandem repeats, stretches of DNA with characteristic repetitive patterns that are inherited. Genealogy websites post such information because men pass down both Y chromosomes and last names, and people with similar numbers of repeats may be related. But Erlich thought he could use the database to figure out the last names of people whose anonymous genetic information was available for scientific research.
First, he took the publicly available genome of J. Craig Venter, the biologist who played a key role in sequencing the human genome. He took the repeating stretches of Venter’s Y chromosome and put those into the genealogy website. The top hit for the last name associated with that genetic fingerprint was the last name Venter. With just a few other pieces of information — a year of birth, a state of residence — it was easy to use an Internet search to identify the famous biologist.
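The lookup described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the researchers' actual method or any genealogy website's interface: the marker names (M1 to M4), the repeat counts, the surnames, and the `max_distance` cutoff are all made up. It compares a query set of repeat counts against a handful of records and ranks surnames by how closely their repeat counts match.

```python
# Hypothetical genealogy records: (surname, {marker name: repeat count}).
# Marker names, counts, and surnames are invented for illustration only.
RECORDS = [
    ("Alder", {"M1": 13, "M2": 24, "M3": 14, "M4": 11}),
    ("Alder", {"M1": 13, "M2": 24, "M3": 14, "M4": 12}),
    ("Birch", {"M1": 14, "M2": 22, "M3": 15, "M4": 11}),
    ("Cedar", {"M1": 13, "M2": 23, "M3": 16, "M4": 10}),
]


def distance(a, b):
    """Sum of absolute repeat-count differences over markers both profiles share.

    Returns None when the profiles have no markers in common.
    """
    shared = a.keys() & b.keys()
    if not shared:
        return None
    return sum(abs(a[m] - b[m]) for m in shared)


def rank_surnames(query, records, max_distance=2):
    """Return (surname, best distance) pairs, closest first.

    max_distance is an assumed cutoff: records further away than this
    are treated as unrelated and dropped.
    """
    best = {}
    for surname, profile in records:
        d = distance(query, profile)
        if d is None or d > max_distance:
            continue
        if surname not in best or d < best[surname]:
            best[surname] = d
    return sorted(best.items(), key=lambda item: (item[1], item[0]))


# A query profile that happens to match the first "Alder" record exactly.
query = {"M1": 13, "M2": 24, "M3": 14, "M4": 11}
print(rank_surnames(query, RECORDS))
```

In this toy data only one surname survives the cutoff. As the article notes, a candidate surname alone is not an identification; it took additional details such as a year of birth and a state of residence to narrow the search to one person.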
Then, he decided to extend the technique to see if it would work with truly anonymous data. He began with 10 unidentified men whose DNA sequences had been analyzed and posted online as part of the federally funded 1000 Genomes Project. The men were also part of a separate scientific study in which their family members had provided genetic samples. The samples and the donors’ relationships to one another were listed on a website and publicly available from a tissue repository.
Using the same basic technique used to match Venter, as well as obituaries and other searchable public records, Erlich was able to identify nearly 50 people: some of the original men, plus family members who had provided genetic samples. One man, for example, was identified because his great-great nephew had submitted a sample to a genealogy database.
Green said the paper would add to an ongoing discussion about how to preserve access to scientific data and protect those who participate in studies.
“My hope is that this will make everything more open,” said George Church, a geneticist at Harvard Medical School who runs the Personal Genome Project, a research effort that publicly displays people’s DNA sequences, traits, and, in some cases, names. “I think pretending that there’s a new encryption algorithm or . . . if we put the age in one database and the data in another, to fix things, that’s just sticking our heads in the sand,” Church said.
In some ways, the work echoes other efforts to draw out the identities of people from data they would never have considered identifiable.
“We release information about ourselves without thinking about where it’s going to go and what it means to us,” said Jennifer Lynch, a staff lawyer at the Electronic Frontier Foundation, a nonprofit digital rights group. “And in many instances, I think we release that information for good reason. There’s a lot to be gained by giving up samples of DNA for research purposes.”
Lynch said her fear is that something a single researcher did in three to seven hours could easily be automated and used by companies or insurers to make predictions about a person’s risk for disease. Although the federal Genetic Information Nondiscrimination Act protects DNA from being used by health insurers and employers to discriminate against people, she and others consider it insufficient.
Still, societal attitudes are evolving. Some young people may realize they can be identified by the breadcrumbs they leave on social networks and think, “So what?”
“The up-and-coming generations have a much different concept of privacy than past generations have,” said Dov Greenbaum, an assistant professor of molecular biophysics and biochemistry at Yale University. “Perhaps that will play out in terms of how controlling people are going to want to be over their private information.”