Baylor College of Medicine

Public databases can threaten privacy of research participants

The explosion of genomic information on the World Wide Web--both from publicly funded genetic research and from private genetic genealogy--can be exploited by people with computer know-how and Internet access to identify individuals, even individuals for whom they have no DNA sample, said a consortium of researchers that includes those from Baylor College of Medicine in a report in the journal Science.

While other scientific articles have looked at the vulnerabilities of such databases, "This is the first one that shows you can identify people even without a reference sample," said Dr. Amy McGuire, director of the Center for Medical Ethics and Health Policy at BCM and an expert in the ethics of genomic studies.

She emphasized that work is ongoing to enhance privacy protections and at the same time inform would-be research participants of the risks involved.

Protecting privacy

"Efforts to address these issues began more than five years ago when the National Institutes of Health started to draw back some data that had been in the public domain," she said.

She pointed out that this research combines openly available genomic databases created for research purposes with genealogy databases that are sorted by surname.

"The focus in research has been to try to de-identify things as much as possible, and if that is not possible, to create barriers to intrusion by people who should not have access to the information," she said. "Now that focus is shifting to providing greater security and treating people respectfully as participants in research. It is time for public dialogue about the risks to people’s privacy and how to promote research in this area while protecting individuals from unwanted intrusion."

Markers transmitted

Led by principal investigator Dr. Yaniv Erlich, a fellow at the Whitehead Institute in Cambridge, Mass., a team of researchers identified nearly 50 people who had submitted their DNA to be sequenced and whose genetic sequences were included in databases that are publicly available on the Internet.

They began by analyzing genetic markers called short tandem repeats on the Y chromosomes of men who had contributed DNA to the Center for the Study of Human Polymorphisms (CEPH). These men’s genomes had been sequenced and made available as part of an international effort known as the 1000 Genomes Project.

These markers are transmitted from father to son on the Y chromosome. Similarly, fathers pass their surnames to their sons in many cultures. Some genealogists or companies that do genetic genealogy have established public databases that group these Y chromosome markers by surname. When Erlich’s team compared the Y chromosome markers they had identified from the CEPH participants to those in these genealogy databases, which are also publicly available, they were able to identify the surnames of the men involved. They then took that information and queried other sources of information on the Internet--obituaries, genealogical websites and demographic data included in the National Institute of General Medical Sciences Human Genetic Cell Repository (located at the Coriell Institute in New Jersey). In the final analysis, they were able to identify nearly 50 people who had been participants in CEPH.
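
The matching step described above can be illustrated with a short Python sketch. It is not the researchers' actual method, which this article does not describe in detail. It assumes a simple rule: count how many shared markers differ between a query profile and each genealogy record, then take the surname most common among the close matches. The surnames, repeat counts, function names and the `max_mismatches` threshold are all invented for illustration.

```python
from collections import Counter

# Toy genealogy records: (surname, {Y-STR marker: repeat count}).
# Surnames and repeat counts are invented for illustration only.
GENEALOGY_RECORDS = [
    ("SurnameA", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    ("SurnameA", {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS393": 13}),
    ("SurnameB", {"DYS19": 15, "DYS390": 22, "DYS391": 10, "DYS393": 12}),
    ("SurnameC", {"DYS19": 13, "DYS390": 25, "DYS391": 11, "DYS393": 14}),
]


def marker_distance(query, record):
    """Count shared markers whose repeat counts differ.

    Returns None when the two profiles have no markers in common.
    """
    shared = query.keys() & record.keys()
    if not shared:
        return None
    return sum(1 for marker in shared if query[marker] != record[marker])


def infer_surname(query, records, max_mismatches=1):
    """Return the surname most common among close matches, or None.

    A record is a close match when it differs from the query at no more
    than max_mismatches shared markers (an assumed threshold). A tie
    between the top two surnames is treated as ambiguous.
    """
    votes = Counter()
    for surname, profile in records:
        distance = marker_distance(query, profile)
        if distance is not None and distance <= max_mismatches:
            votes[surname] += 1
    if not votes:
        return None
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# A hypothetical profile read from a publicly available genome.
query = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}
print(infer_surname(query, GENEALOGY_RECORDS))
```

A surname recovered this way is only a lead. As the article notes, the team then combined it with other public information, such as obituaries and demographic data, to narrow the search to one individual.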

Result of disclosure

The Whitehead researchers contacted McGuire about the ethical issues involved and subsequently shared their results with officials at the National Human Genome Research Institute and the National Institute of General Medical Sciences. Those two institutes moved some demographic information from the publicly accessible portion of the Coriell cell repository as a result of the disclosure.

The leaders of the two institutes, Drs. Judith H. Greenberg (NIGMS) and Eric D. Green (NHGRI), wrote a perspective on the issues arising from this work in the same issue of the journal Science, calling for renewed dialogue on the issues involved.

McGuire agrees with that plan, noting that there need to be concrete recommendations about how to inform people about these risks when they are deciding whether or not to take part in any kind of genetic database.

Elephant in the room

"One big elephant in the room is that you could potentially identify someone based on publicly available data that his or her fourth cousin decided to share in a genealogy database," she said.

Who would take the time to do this kind of work, other than researchers trying to test the limits of their privacy safeguards? The publicly available information could be used forensically, said McGuire. DNA found at a crime scene might be linked to a distant relative in a genealogy database. Some people have also used these publicly available databases in order to identify "anonymous" sperm donors.

Others who took part in this work include Melissa Gymrek (first author) of the Whitehead Institute, and Dr. David Golan and Eran Halperin of Tel Aviv University.

Funding for this work came from the National Defense Science and Engineering Graduate Fellowship and the Edmond J. Safra Center for Bioinformatics at Tel Aviv University.

Dr. McGuire is the Leon Jaworski Professor of Biomedical Ethics.
