Clinical Chemistry - Podcast

Intersection of DNA Privacy and Whole-Genome Sequencing

Jason Park






Article

C. Hong, J. Wang, C. Xing, T.H. Hwang, and J.Y. Park. Intersection of DNA Privacy and Whole-Genome Sequencing. Clin Chem 2015;61:900-902.

Guest

Dr. Jason Park is Director of the Advanced Diagnostics Laboratory at the Children's Medical Center in Dallas, and an Assistant Professor at the University of Texas Southwestern Medical Center in the Department of Pathology and the Eugene McDermott Center for Human Growth and Development.



Transcript


Bob Barrett:
This is a podcast from Clinical Chemistry, sponsored by the Department of Laboratory Medicine at Boston Children’s Hospital. I am Bob Barrett.

In the past decade there has been a rapid increase in the number of individuals who have had their entire genome sequenced. Over the next decade that number will go from the thousands to the millions. These people are having their genomes sequenced for research, recreational interest, and clinical testing. In the July 2015 issue of Clinical Chemistry, an article titled “The Intersection of DNA Privacy and Whole-Genome Sequencing” examines the possibility of using existing DNA profile databases to search for and identify individuals who have had their genomes sequenced. One of the authors of this article is Dr. Jason Park, Director of the Advanced Diagnostics Laboratory at the Children’s Medical Center, Dallas, and Assistant Professor at the University of Texas Southwestern Medical Center in the Department of Pathology and the Eugene McDermott Center for Human Growth and Development.

Dr. Park is our guest in this podcast. So doctor, first of all, just what is Whole-Genome Sequencing?

Jason Park:
The genome is the DNA material which contains the instructions, or blueprints, for every organism. Genomes are not only comprised of genes which encode proteins, but they also encompass DNA material which regulates the expression of genes. Whole-Genome Sequencing is the technology by which an individual’s genome can be sequenced or decoded down to the level of individual DNA bases.

Bob Barrett:
Do we know how many humans have had their whole genomes sequenced?

Jason Park:
Beginning with the completion of the Human Genome Project there were initially two individuals whose genomes were sequenced. A decade later, in 2014, there were approximately 230,000 human genomes sequenced. It is estimated that by 2017 there will be over 1.5 million human genomes sequenced. There are individual projects that have been proposed that each have a goal of sequencing over a million genomes.

One of these, announced in 2013, was by the genomics company BGI. Another one-million-genome project, which was proposed just this year, is part of the US President's initiative on Personalized Medicine. With last year’s introduction of the HiSeq X Ten DNA sequencing instrument by Illumina, it is now possible for a single laboratory to sequence tens of thousands of human genomes each year.

Bob Barrett:
Your article describes the possibility of identifying specific individuals using genome sequencing. Why is this something we should be worried about?

Jason Park:
In 2012 the United States Presidential Commission for the Study of Bioethical Issues issued a report titled ‘Privacy and Progress in Whole Genome Sequencing.’ The Commission made recommendations to improve privacy and security in the context of genomics.

Whole genome sequencing reveals the entire genetic makeup of an individual; not only does this genetic data inform us on our current disease status, but it also informs us on our future risk of disease. Furthermore, genomic information is a unique identifier of individuals. There are legitimate privacy concerns because genomic information can be used to identify individuals as well as inform us of an individual’s current and future risk of disease.

Bob Barrett:
So how did the Presidential Commission address privacy concerns?

Jason Park:
The Commission acknowledged that genomics is a rapidly progressing field and that the policies that were made would need to be adaptable to changes in technology. The Commission's recommendations are focused on balancing privacy protection with the need for continued innovation and advancement in the field. The recommendations in 2012 included provisions for protection of research subjects, as well as an outline for coordinated privacy protection at the state and federal level.

Bob Barrett:
Okay, let's turn to your study. How did it examine the use of genomic data to identify individuals?

Jason Park:
My collaborators and I speculated that whole genome sequencing data contains the same DNA identifiers that are used in criminal investigations. We realized that if genome datasets contain these specific DNA identifiers, then existing databases of these DNA identifiers could be used to identify individuals based on their genome data.

Bob Barrett:
Can you tell us more about the DNA identifiers used in criminal investigations? What are the databases that currently exist?

Jason Park:
The DNA identifiers that I am referring to are known as STR, or short tandem repeat, profiles. STR loci vary quite a bit between individuals, and if multiple STR loci are tested, we can get a DNA profile that is unique for an individual. This type of DNA profiling is commonplace in forensics and law enforcement.

Thirteen specific STR loci are used by the US Federal Bureau of Investigation in their Combined DNA Index System, abbreviated CODIS. The STR profile has been popularized on television shows focused on criminal investigation. The CODIS database contains the STR profiles from more than 11 million individuals.
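
To make the idea of an STR allele concrete, here is a minimal Python sketch that is not drawn from the article. It assumes a simplified model in which an allele is simply the longest run of exact, back-to-back copies of a short motif. The function name longest_tandem_run, the GATA motif, and the flanking sequences are all illustrative assumptions; real STR callers also have to cope with sequencing errors and imperfect repeats.

```python
def longest_tandem_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back exact copies of motif."""
    if not motif:
        raise ValueError("motif must be a non-empty string")
    best = 0
    step = len(motif)
    for start in range(len(sequence) - step + 1):
        run = 0
        pos = start
        # Walk forward one motif length at a time while the repeat continues.
        while sequence.startswith(motif, pos):
            run += 1
            pos += step
        best = max(best, run)
    return best


# Hypothetical reads from two people at the same locus (made-up sequences).
person_a = "TTC" + "GATA" * 8 + "CCA"
person_b = "TTC" + "GATA" * 11 + "CCA"

person_a_allele = longest_tandem_run(person_a, "GATA")  # 8 repeats
person_b_allele = longest_tandem_run(person_b, "GATA")  # 11 repeats
print(person_a_allele, person_b_allele)
```

In this toy example the two people differ only in how many times the motif repeats, which is the kind of variation an STR profile records at each locus.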

Bob Barrett:
Okay, let’s get into it. What were you able to find in your study?

Jason Park:
Well, luckily for us, another research group had already created a software tool to examine STR loci across large genome datasets. This other group had applied this software tool across one of the largest whole genome databases, the 1000 Genomes Project. Their study was a broad survey of genomes and they characterized many DNA STR loci, not just the ones used by the FBI.

In their report they briefly addressed the FBI CODIS STR loci, but only in a subset of genomes. They did not give a detailed report on how their software performed at finding the FBI CODIS STR features across all genomes in the dataset. We reanalyzed this group’s published data and we found that on average 5 of 13 CODIS STR loci were identified in each genome. One genome had all 13 CODIS STR loci identified. Thus, matching a genome’s STR profile to an individual may be possible; however, the majority of the genome datasets in this public database only yielded partial STR profiles.
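
The partial-profile situation described above can be sketched in Python. This is an illustration under stated assumptions, not the software used in the study and not how CODIS itself searches: the function names, the record identifiers, the min_loci threshold, and every allele value below are hypothetical. Only the locus names are real CODIS core loci. The sketch compares a genome-derived profile with each database record on the loci they share and keeps the records that agree at every shared locus.

```python
from typing import Dict, List, Tuple

# A profile maps a locus name to the pair of alleles (repeat counts) observed.
Profile = Dict[str, Tuple[float, float]]


def compare_profiles(query: Profile, reference: Profile) -> Tuple[int, int]:
    """Return (matching loci, shared loci) for two profiles."""
    shared = set(query) & set(reference)
    matching = sum(
        1 for locus in shared
        # Allele order carries no meaning, so compare sorted pairs.
        if sorted(query[locus]) == sorted(reference[locus])
    )
    return matching, len(shared)


def candidate_matches(query: Profile, database: Dict[str, Profile],
                      min_loci: int = 5) -> List[str]:
    """List records that agree with the query at every shared locus."""
    hits = []
    for record_id, reference in database.items():
        matching, shared = compare_profiles(query, reference)
        if shared >= min_loci and matching == shared:
            hits.append(record_id)
    return sorted(hits)


# Hypothetical database records (allele values are made up).
database = {
    "record_A": {"CSF1PO": (10, 12), "FGA": (21, 24), "TH01": (6, 9.3),
                 "TPOX": (8, 8), "VWA": (16, 17), "D3S1358": (15, 16)},
    "record_B": {"CSF1PO": (10, 12), "FGA": (22, 24), "TH01": (7, 9.3),
                 "TPOX": (8, 11), "VWA": (16, 17), "D3S1358": (15, 18)},
}

# Hypothetical partial profile recovered from a genome: only 5 loci called.
query = {"CSF1PO": (12, 10), "FGA": (21, 24), "TH01": (9.3, 6),
         "TPOX": (8, 8), "VWA": (17, 16)}

print(candidate_matches(query, database))  # ['record_A']
```

With fewer loci recovered, more unrelated records can agree by chance, which is why a partial profile is weaker evidence of identity than a full 13-locus profile.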

Bob Barrett:
Is this the first examination using genomic data to identify individuals?

Jason Park:
No, actually this is just the latest example of how advances in sequencing technology and software have increased the possibility of re-identifying individuals by their genetic information. There was a recent comprehensive report by the International Cancer Genome Consortium, the ICGC, in Nature Biotechnology last year. This report summarized multiple recent studies which identified that there is the possibility of re-identification of participants in genomic research.

Bob Barrett:
So what are the implications of your article?

Jason Park:
A basic principle in genomic research has been that while the data is unique for each individual, the data in itself cannot be used to identify an individual. In other words, although my genome information is unique to me, if I gave you that information on a disk you could not identify me. You would need a “decoder ring,” that is, a database, to match me to my genomic information.

What my collaborators and I are attempting to illustrate is that the existing STR databases such as FBI CODIS can be the decoder ring. Given the recent revelations on the United States National Security Agency’s secret data collection program on internet users, we should anticipate that publicly available whole genome sequencing data may be similarly examined and queried by law enforcement or government agencies.

Bob Barrett:
Well finally Dr. Park, what recommendations do you have for research and clinical laboratories doing genomic sequencing?

Jason Park:
Research and clinical programs need to rethink how they allow access and share whole genome datasets. CODIS STR profiles can be derived from whole genome datasets. These genome-derived STR profiles can be used to not only identify the research study participants, but also may be used to identify biological relatives. Although access to STR databases such as CODIS is restricted, there may be scenarios where a law enforcement agency could justify searching publicly available genome datasets to identify a person of interest.

We can no longer consider genomic data to be anonymous when de-linked from identifiers such as name, gender, and date of birth. Entities that have access to DNA STR profile databases have the ability to identify individuals and their biological relatives based on their genomic data.

My collaborators and I believe that the solution to this potential privacy issue is to anticipate improvements in DNA genome sequencing and informatics. With each technical improvement in sequencing and information technology, there is an increasing probability of re-identification of individuals. In order to protect the DNA privacy of individuals participating in whole genome sequencing, there needs to be constant vigilance to identify and protect against the possible misuse of information.

Bob Barrett:
Dr. Jason Park is Director of the Advanced Diagnostics Laboratory at the Children’s Medical Center, Dallas, and Assistant Professor at the University of Texas Southwestern Medical Center in the Department of Pathology and the Eugene McDermott Center for Human Growth and Development. He has been our guest in this podcast from Clinical Chemistry. I’m Bob Barrett. Thanks for listening!