Ultra High Throughput Sequencing for Clinical Diagnostic Applications - Approaches to Assess Analytical Validity: Report from the Public Meeting (June 23, 2011)
Ultra high throughput sequencing has exerted considerable impact on basic, applied and clinical research. It is now possible to sequence a complete human genome in a single instrument run. Ultimately, sequencing is expected to become so inexpensive that whole genome sequencing (WGS) may be routinely deployed throughout biomedicine.1,2,3
Ultra high throughput sequencing technologies, commonly referred to as next-generation sequencing or NGS, have many applications. These include resequencing human samples to study inherited variation or somatic mutations; using hybridization-capture techniques to resequence a targeted subset of the genome (such as the protein-coding sequences in exome sequencing); mapping chromatin modification and protein binding by chromatin immunoprecipitation followed by sequencing (ChIP-Seq); employing massively parallel sequencing to create ‘epigenomic maps’; and sequencing RNA transcripts (RNA-Seq) to characterize their abundance and/or identify novel splice forms.
With the emergence of these novel technologies, the FDA needs to prepare for new regulatory challenges while continuing to apply scientific evidence-based oversight. To achieve this regulatory goal, the first step is to understand how to analytically evaluate the WGS data generated by the emerging NGS platforms. To this end, the FDA has partnered with other federal organizations (NCBI, NIST, NHGRI) as we endeavor to develop evaluation protocols and standards.
The Archon Genomics X PRIZE (AGXP) was launched to foster innovative technologies for whole human genome sequencing.4 The $10 million prize will be awarded to the first team to sequence 100 human genomes within a set time limit with 98% completeness, 99.999% accuracy, and accurate haploid phasing, at a cost of less than $1,000 per genome. A scientifically rigorous validation protocol has been developed by some of the leading experts in the field to evaluate sequencing requirements for the AGXP competition. Although the FDA’s goal is to develop a transparent, evidence-based regulatory pathway for evaluating NGS-based medical devices/products that would assure the safety and effectiveness of devices marketed for clinical diagnostics, some lessons may be learned from the X PRIZE’s publicly available validation protocol.
Analytical evaluation of technologies capable of sequencing whole human genomes poses novel regulatory challenges in terms of both biological and technical questions, as well as computational resources. One of the important issues is the selection of suitable reference human genomes for validation purposes. Requirements for extensive computer data storage capacity and powerful computational capability to analyze complex data sets also need to be considered. The availability of a variety of platforms employing an array of different sequencing techniques and strategies dictates that any regulatory requirement should have the flexibility to adapt to rapidly changing technology, in terms of both the wet-lab procedures and the bioinformatics pipeline.
The FDA ultimately has to balance public safety concerns with the goal of fostering innovation and enabling the translation of these new technologies to benefit public health. To evaluate a new technology, the experimental data need to be in accordance with the claim, and the platform performance needs to be adequate for its intended use. In general, the sequencing process can be divided into three phases, with some steps varying between platforms or absent entirely: (1) nucleic acid preparation and amplification (including library preparation, as applicable), (2) signal generation and detection, and (3) bioinformatics analysis. The following sections highlight the gross differences between platforms, and consider possible stages at which formats and metrics could be standardized5 for uniform evaluation. This is not a comprehensive review, but rather an illustration of the diversity of technology and approaches that need to be accounted for within any future regulatory framework.
Variations in DNA preparation/amplification and signal generation/detection
DNA preparation, at least to some extent, and signal generation are currently platform-specific procedures, whereas the bioinformatics processing steps may be somewhat more generalizable. Some notable platform-specific characteristics include: (a) single molecule sequencing or PCR amplification of target (by numerous mechanisms); (b) use of various chemically modified nucleotides, typically enabling step-wise control of the reactions or specific fluorescent labeling; (c) different enzymatic processes such as polymerization or ligation, or even mutant enzyme versions to accommodate the chemical modification of nucleotide substrates; (d) step-wise (base-by-base) synchronous sequencing versus “real time” sequencing; and (e) various mechanisms for generating and detecting nucleotide-specific signals. Signal detection is usually via light emitted upon the specific addition of bases, but may also be accomplished by other means, such as electrochemical detection of the release of protons or the release of the nucleotides themselves. Research efforts continue to advance novel sequencing techniques such as use of nanopores or electron microscopy.
Due to these heterogeneous approaches, each platform may have unique characteristics and particular idiosyncrasies. One important parameter is the read length: some platforms may generate reliable sequences of tens of bases, while others can generate hundreds of bases, or possibly greater than 1000 bases. Some technologies are also adaptable to producing mate-paired or paired-end reads, in which two reads are linked within a defined range of genomic distance. Mate pairs are useful in assembling the sequence and spanning repeat regions in the target DNA. Furthermore, each technology may be susceptible to a particular error mode, such as systematic or random errors in base calling, or insertions and deletions in runs of homopolymers (stretches of a single repeated nucleotide type). Underlying features of the target DNA, such as repeat sequences or regions of high GC content, as well as bias due to PCR-based amplification techniques, may pose particular difficulties for specific technologies.
The variety of detection methods means that each platform will have a unique raw data output at the step analogous to the chromatogram of Sanger sequencing. The output may be a series of images (TIFF format, etc.) that record the step-wise capture of cyclic light emissions, but also may be quite different, such as a digital trace generated by a microchip pH meter. Ultimately, the signal is generally converted into a base call and a quality score representing the probability of a particular nucleotide being at a specific position within the read. The platform-specific nature of the detection method implies the necessity of correspondingly specific (and possibly proprietary) software to convert the signals into base calls.
The bioinformatics pipeline may be divided into several phases: (1) preliminary library/run QC and generation of individual sequence read files, (2) alignment of the sequence against a (human) reference genome, and (3) variant calling (enumeration of differences between the assembled test sequence and the reference). Each of these steps may present an opportunity to evaluate the quality of the process by specific metrics.
The first likely standardizable step in unifying outputs of these disparate platforms is at the level of the “raw reads” (i.e., the sequence of base calls for each stretch of sequenced DNA). One widely used file type for containing the read sequences is the FASTQ format, which is a variant of the FASTA format (nucleotides only) that also contains a quality assessment (Q-score) for each base in the read. The Q-score is not always an accurate estimate of the probability of error. For example, the Q-score has been shown to vary in reliability with sequence context, cycle number, and other covariates.
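As a concrete illustration, the sketch below shows in Python how FASTQ records can be read and how Phred-scaled quality characters translate into per-base error probabilities via P = 10^(-Q/10). It assumes the common Phred+33 ASCII encoding; the function names and the four-line record layout are the standard convention, but this is a minimal illustration, not production parsing code.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality_string) from FASTQ text (4 lines per record)."""
    lines = [ln.rstrip("\n") for ln in lines]
    for i in range(0, len(lines), 4):
        read_id, seq, _, qual = lines[i:i + 4]
        yield read_id.lstrip("@"), seq, qual

def phred_to_error_prob(qual_char, offset=33):
    """Convert one Phred+33 quality character to its error probability: P = 10^(-Q/10)."""
    q = ord(qual_char) - offset
    return 10 ** (-q / 10)

# A single illustrative record: 'I' encodes Phred 40, i.e., P(error) = 1e-4
record = ["@read1", "ACGT", "+", "IIII"]
for rid, seq, qual in parse_fastq(record):
    probs = [phred_to_error_prob(c) for c in qual]
```

The context dependence of Q-score reliability noted above is exactly why a nominal Phred 40 may not correspond to an empirical error rate of 1 in 10,000.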
There is a plethora of software to assemble and align the reads, and quality metrics to appraise the results. The most popular file format for the assembly/alignment stage is SAM (Sequence Alignment/Mapping) or the binary version BAM, which contains the sequences, base and mapping quality scores, alignment location, and various metadata. In the alignment stage, a variety of quality metrics can be calculated, such as the mapping fraction (uniquely mapped reads/total reads), the pairing rate and distance of mate pairs, duplicate reads due to PCR or optical artifacts, coverage uniformity across GC-base content, and the fraction of bases and reads aligned to the reference. In addition, when mapping to a reference sequence without expected variants, the reported base quality scores can be compared to empirical base error rates in relation to covariates such as machine cycle or dinucleotide context. All the above metrics may be employed in assessing the analytical validity of the sequencing at the read level.
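A minimal sketch of how two of these read-level metrics could be computed from SAM FLAG values. The flag bits used (0x4 for unmapped, 0x400 for PCR/optical duplicate) come from the SAM specification; the function name and the simplified input (a bare list of flags rather than full SAM records) are illustrative assumptions.

```python
# SAM FLAG bits (per the SAM specification)
UNMAPPED = 0x4     # read is unmapped
DUPLICATE = 0x400  # read is a PCR or optical duplicate

def alignment_metrics(flags):
    """Compute simple read-level QC metrics from a list of SAM FLAG values."""
    total = len(flags)
    mapped = sum(1 for f in flags if not f & UNMAPPED)
    dups = sum(1 for f in flags if f & DUPLICATE)
    return {
        "total_reads": total,
        "mapping_fraction": mapped / total if total else 0.0,
        "duplicate_fraction": dups / total if total else 0.0,
    }

# e.g., three mapped reads (one flagged as a duplicate) and one unmapped read
m = alignment_metrics([0x0, 0x400, 0x0, 0x4])
```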
Before calling variants, the reads are sometimes refined by removing PCR and optical duplicates, realigning around putative indels, and recalibrating base quality scores. A large number of tools are available to call different types of variants. As no single method is widely accepted for calling any variant type, multiple algorithms are frequently used, resulting in a set of variant calls that deviate from the “known” reference genome (usually reported in VCF format). At this stage, the performance is often evaluated by examining a subset of variants using technologies such as microarrays, capillary sequencing, and high-depth targeted sequencing. However, it is important to recognize that each of these technologies can contain errors and biases as well, particularly in difficult regions of the genome. In this respect, a genomic reference material would be useful in evaluating performance.
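The comparison of a call set against a characterized reference material can be sketched as below. Representing each variant as a minimal (chrom, pos, ref, alt) tuple, as in a stripped-down VCF record, is a simplifying assumption; real comparisons must also handle representation differences (e.g., indel normalization) that this sketch ignores.

```python
def concordance(truth, calls):
    """Compare a set of variant calls against a truth set.

    Variants are (chrom, pos, ref, alt) tuples, a minimal stand-in for VCF records.
    """
    truth, calls = set(truth), set(calls)
    true_positives = truth & calls
    return {
        "sensitivity": len(true_positives) / len(truth) if truth else 0.0,
        "false_positives": len(calls - truth),
        "false_negatives": len(truth - calls),
    }

# Illustrative, fabricated coordinates:
truth_set = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}
call_set = {("chr1", 100, "A", "G"), ("chr1", 300, "G", "A")}
result = concordance(truth_set, call_set)
```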
In the interest of standardization, transparency, and wider usability, it may be preferable to implement a standard pipeline for data, although different platforms may be using different algorithms for sequence alignment. Parameters influencing the pipeline process could be adjusted to accommodate the idiosyncrasies of various platforms in defined ways. It may also be desirable to establish quality thresholds for intermediate metrics in order to understand the inherent strengths and weaknesses of any particular platform, quality of input material, and platform outputs – e.g., high fold-coverage may compensate for an inherently error-prone technology.
Validation and Standardization Considerations
Validation of the ultra high throughput sequencing platforms for whole genome sequencing will ultimately involve evaluation of the final results, possibly in the form of aligned sequence, or more probably as variant calls, percentage of sequence that was detected, percentage that was called accurately, etc. As an example, a well-characterized genomic reference material from a standards agency such as NIST, if available, may be used to validate the sequencing performance by comparing the variant calls and their variant quality scores to the known variants in the sample. Establishment of a standard set of metrics for assessment of platform performance, particularly for completeness and accuracy, would be very helpful. The specific use of the platform may require accurate calling of single nucleotide polymorphisms (SNPs), insertions and deletions, copy number variants, haplotype phasing, complex chromosomal rearrangements or epigenetic modifications.6 It would be helpful for the definition and implementation of any quality scoring system to be consistent (automated if possible) and transparent, yet focused to ensure fidelity vis-à-vis the intended use.
Additionally, the underlying data may need to be available for reanalysis if problems arise. There are several possible intermediates that could be retained for reconstruction prior to the final sequence, including raw reads, images, or even physical DNA libraries. The choice of which levels of data to retain has further consequences for storing large volumes of data. Furthermore, innovations such as distributed computing (“cloud computing”) may need to be considered over historic notions of local data storage and processing, including possible problems related to data confidentiality.
Validation of WGS and the possible need to recreate the data processing steps make it important to consider standardized data formats. This would require the transformation of platform-specific data into a uniform pipeline using standard formats (e.g., FASTQ, BAM, VCF). The idiosyncrasies particular to any platform may require different cognate algorithms, but the final data output may need to be standardized. However, any such process would need to be flexible and sufficiently robust to accommodate differences in technology, especially in the face of the rapid evolution of software and processing algorithms for sequence assembly and alignment.
In June 2011, the FDA hosted a public meeting titled Ultra High Throughput Sequencing for Clinical Diagnostic Applications – Approaches to Assess Analytical Validity. The meeting was organized to seek input from subject matter experts in academia, government, industry, and other stakeholders, and to serve as a platform for discussions on validation methodologies, materials, and bioinformatics approaches needed to address unique analytical validation requirements of NGS-based molecular diagnostics. FDA’s ultimate goal is to accelerate and support the introduction of safe and effective innovative diagnostics in public health applications.
In keeping with the scientific and exploratory fact-finding goal of the meeting, the discussion did not address regulatory questions or define regulatory recommendations, since doing so was judged to be premature.
To streamline discussions, the FDA suggested that the meeting focus on possible analytical accuracy evaluation strategies. The possible evaluation strategies for NGS platforms could be divided into the following applications: gene panels, exome sequencing, and whole genome sequencing. Experience with other genomics platforms, e.g., microarrays, may partially inform the development of an approach for evaluating the analytical performance of gene panels. The panelists were asked to consider the challenges for exome and whole genome sequencing, including pre-analytical issues. While one generally evaluates both analytical7 and clinical8 performance of a clinical diagnostic test, this initial meeting was primarily focused on the analytical performance of NGS-based sequencing tests. Specifically, the panelists were charged with providing input on how the technical and bioinformatics challenges, summarized in the preceding sections, may impact the evaluation of the accuracy9 of sequencing-based tests.
The meeting was divided into two sessions: technical performance and bioinformatics. For the session on the technical performance of a platform, Dr. Vincent Magrini of the Genome Institute at Washington University provided an overview of the existing technologies and their respective advantages and disadvantages, including read length, homopolymer detection, speed, accuracy, etc. Dr. Laurence Kedes of the Archon Genomics X PRIZE Foundation presented one possible perspective on analytical validation by summarizing the Foundation’s framework and criteria for judging the accuracy of new and upcoming whole genome sequencing platforms. Following the technical performance panel discussion, representatives presented their thoughts and suggestions during the public comment session. For the bioinformatics session, Dr. Russ Altman from Stanford University provided background on the gamut of bioinformatics tools, followed by an expert panel discussion of the challenges in the bioinformatics arena.
All presentations, public comments, questions submitted to the discussion panels, and professional affiliations of the participating experts are available at http://www.fda.gov/MedicalDevices/NewsEvents/WorkshopsConferences/ucm255327.htm10.
3. PANEL DISCUSSION
3. i. Technical Performance of a Platform
Topics for discussion, supplied in advance of the meeting, focused on developing reproducible and transparent evaluation protocols that can accommodate emerging NGS technologies for a variety of clinical uses. Protocols elucidating the advantages and pitfalls of any particular platform, sample preparation method, or bioinformatics tool would enable end users to better understand the performance of a platform and to select appropriate methods for their specific application. Discussions at the meeting followed the topics somewhat loosely, which was expected considering the complexity of the issues and the rapid development of the field. Therefore, the questions listed in this section have been adjusted to reflect the main discussion points, and are not necessarily the questions originally listed to guide the discussion11. Of note, meeting discussions gravitated toward a more general philosophical level, without dwelling on some of the specifics raised in the original topics. Here, we highlight the main points raised by discussion panel members as well as the audience.
Q: What evaluation criteria should be used to assess the accuracy of generating a raw read, e.g., sequencing fidelity, completeness, quality scores, sequencing depth?
Discussion participants pointed out that the accuracy of the platform may depend not on the target of the application alone (gene panels, exome, or whole genome sequencing), but may also need to be evaluated in the context of the clinical claim. The discussants cautioned that, for example, claims related to hereditary disease in a homogeneous cell population and claims related to heterogeneous mixtures of cancer cells would require different technical performance characteristics. Of note, clinical claims as discussed by the panel are somewhat different from, and much broader than, the very specific indications for use (IFUs) that the FDA typically evaluates for genetic tests.
There was a discussion of the relative value of machine readout level quality scores (Q-scores) and the general utility of low-level metrics at individual steps in assessing the overall pipeline. The main points brought up in the discussion were:
- Quality scores are platform dependent, but are intended to represent an estimate of the likelihood of error in calling a specific base, given the historical performance of a platform in calling bases with a similar signal profile. Quality scores clearly have value in monitoring data quality and in filtering individual base calls and raw reads. They are useful to estimate the presence of random errors arising from the measurement and base-calling process.
- Quality scores alone may be misleading in assessing the overall quality of the final product, particularly since sufficient read depth can compensate for individual random errors in base calling at the raw read level. In other words, overlapping reads and high coverage of any particular segment of DNA will overwhelm random errors with a consensus base call at that site. The strength of any particular platform may not rest in the fidelity of single instances of base calling, but in the ability to deliver a depth of reads that are, in aggregate, overwhelmingly correct. This ability has led to the aphorism that one can sequence one’s way out of random errors by increasing read depth.
- In contrast to random errors, quality scores are often not accurate indicators of systematic errors that occur either in the pre-analytical steps or as inherent characteristics of the platform and target sequence. For example, an error may generate a substitution early in PCR amplification, which is then propagated to a large proportion of the target molecules. Such a substitution may correctly receive a near-perfect quality score in the context of the signal read, since the error occurred far upstream of the detection machinery. In fact, many errors probably occur in pre-analytical steps or sequence alignment steps, both of which produce systematic errors that are not reflected in raw read level quality scores. Systematic errors in effect decouple the quality of the detected signal pattern from the estimation of the true error rate. Therefore, one cannot sequence one’s way out of systematic errors (i.e., compensate by increasing read depth), since this is exactly the type of error that is often not detected by quality scores.
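The distinction between random and systematic errors can be made concrete with a simple binomial model. Under the pessimistic, purely illustrative assumption that random read errors are independent and all agree on the same wrong base, the chance that a majority-vote consensus is wrong falls rapidly with depth; a systematic error shared by most reads, by contrast, is untouched by this calculation.

```python
import math

def consensus_error_prob(depth, per_base_error):
    """Upper bound on the probability that a simple majority-vote consensus is wrong,
    assuming independent random read errors that all agree on one wrong base
    (a deliberately pessimistic simplification)."""
    p = per_base_error
    return sum(
        math.comb(depth, k) * p**k * (1 - p) ** (depth - k)
        for k in range(depth // 2 + 1, depth + 1)
    )

# With a 1% per-base random error rate, a single read is wrong 1% of the time,
# while at 5x depth the majority consensus is wrong in roughly 1 in 100,000 cases.
single = consensus_error_prob(1, 0.01)
fivefold = consensus_error_prob(5, 0.01)
```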
In general, the panel felt that quality scores on raw reads per se are not necessarily a good focal point for quality metrics. Sequencing depth or, depending on the application, read length may be a more important factor in estimating the quality of the raw output of the sequencing reaction. Since quality scores are assigned to individual bases, they currently lack the power to flag missed insertions or deletions. Furthermore, low-level instrument quality values are not a particularly discriminatory element of different platforms or applications, and generally are not informative for the types of errors that are actually problematic. Although in the hands of expert researchers the general expectation of reaching a high threshold for read-level quality scores may be trivial to meet, for clinical use it may be crucial for quality control. No particular minimum depth of coverage was recommended, as that would depend on the specific application; however, it was mentioned that for most clinical applications sequencing depth needs to be much higher than the coverage currently available in most research data.
Q: What performance characteristics should be measured and what metrics used for final validation?
The overriding view of the panel was that a regulatory framework for ultra high throughput sequencing must be flexible enough to address the rapidly evolving technology and bioinformatics, and the emerging diversity of possible applications. The regulatory framework probably cannot rigidly stipulate a priori thresholds for specific low-level metrics. The view expressed was that it might be difficult and possibly pointless to segregate the process into separate regulatory phases; the process may need to be judged by the accuracy and fidelity of the final result. However, this may be complicated by the specialized nature of current providers of platforms and reagents for different phases.
A consistent trend in the discussion concerning the validation of NGS methods was that validation must be application specific. This parallels the general FDA dictum of validating an intended use in an intended population, although, as mentioned previously, the claims the FDA evaluates are more specific than the general clinical applications, such as cancer, that were mentioned. The discussants were generally skeptical of a “one size fits all” approach, and opined that a more useful measure would be the ability to assay defined clinically useful targets. For example, a standard threshold for sensitivity may be perfectly suited for identifying SNPs among constitutive germline variants, but may be inadequate for identifying low-frequency mutations in tumor samples. Philosophically, this approach moves away from evaluating overall aspects of technical performance, like coverage and overall error rates, toward evaluating defined performance characteristics with respect to somewhat more specific claims.
A critical point expressed during this part of the discussion focused on the idea that the most salient metric in validation might be an accurate estimate of the degree of confidence associated with the calling of any particular variant. The ideal metric would be similar to an overall quality score for the calling of a clinically relevant event and would reflect the probability of accuracy for a particular class of calls. In this framework, the ability to estimate the error probability is almost as important as the ability to make a call. Such an approach would be helpful in having a degree of reliability assigned to a particular clinically relevant outcome. Furthermore, it was underlined how important it is to “know what you don’t know.” Certain parts of the sequence may not be detected when sequencing a sample. Therefore, it is important to understand not just the level of confidence in a call, but also the confidence in the lack of a call – that there may not be a variant at a certain site. A reliable error estimate would tell a user that a call has a certain predictable chance of being incorrect (false negative/false positive) or that the confidence in the call is too low to be useful (failed call). Currently, the field lacks a universal consensus for a standard metric estimating confidence of calls, but there seems to be agreement that such a metric would be useful to both the NGS users and regulatory agencies such as the FDA.
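One way such a confidence metric could be expressed is by Phred-scaling a call’s estimated error probability and applying a threshold, as in the sketch below. The threshold of Q30 (about 1 error in 1,000) and the pass/failed labels are illustrative assumptions, not an endorsed standard.

```python
import math

def phred_scale(p_error):
    """Phred-scale an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p_error)

def classify_call(p_error, min_qual=30):
    """Label a variant call 'pass' or 'failed' by a Phred-scaled confidence threshold.

    The Q30 default is an illustrative choice, not a recommended clinical cutoff.
    """
    return "pass" if phred_scale(p_error) >= min_qual else "failed"
```

A symmetrical estimate for the absence of a call (the confidence that no variant is present at a site) would be needed as well, but is harder to produce and is not sketched here.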
Q: Could an appropriate validation set (reference set) of samples be created to evaluate accuracy of the sequence generated by the platform?
One general approach to the regulation of devices is the evaluation of measurements performed by the instrument against a known standard of “truth” using defined metrics. For WGS, it is not clear what would constitute such a “gold standard” reference set. One possibility would be an individual genome that has been extensively validated by numerous orthogonal methods to generate a highly reliable composite or consensus sequence. It was suggested that such a resource should be widely available and would be invaluable for testing and platform comparison. Another suggestion was that the ideal reference may be an intensely sequenced family trio or quartet that would allow Mendelian Inheritance Error (MIE) detection to identify and reduce sequencing errors by utilizing information across the whole family set.
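The Mendelian Inheritance Error check can be sketched simply: at each site, a child’s diploid genotype must be formable from one maternal and one paternal allele. The genotype representation below (2-tuples of allele strings) is an illustrative simplification of how trio data would actually be encoded.

```python
from itertools import product

def is_mendelian_consistent(mother, father, child):
    """Check whether a child's diploid genotype at one site could be produced
    from the parents' genotypes (each genotype is a 2-tuple of alleles)."""
    child_sorted = tuple(sorted(child))
    return any(
        tuple(sorted((m, f))) == child_sorted
        for m, f in product(mother, father)
    )

# e.g., mother A/G and father G/G can produce a child A/G,
# but a child A/A at this site would be a Mendelian Inheritance Error.
```

In a trio- or quartet-based reference, sites violating this check would flag likely sequencing or genotyping errors in at least one family member.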
Using a standard reference genome also gives rise to a number of problems. One major difficulty is the continual replenishment of DNA material. An attractive source would be immortal cell lines, which can be treated as nearly infinitely expanding DNA factories. However, cell lines are not stable, and even low levels of mutation will accumulate over multiple passages of cells. Another alternative, artificial DNA, is attractive for its stability and homogeneity, but as a technical validation source it is not analogous to DNA extracted and purified from the biochemical and biophysical conditions of a living cell. Furthermore, artificial DNA may itself have systematic errors, or may lack sufficient sequence and structural complexity to act as a critical surrogate for biologically derived genomic DNA. A more subtle objection is the danger of overfitting to a singular reference material: lab processes, instrumentation, and analytical tools might be developed with a bias toward correctly resolving sequences specific to the reference genome. Finally, a reference genome may not be an accurate model for any other particular individual’s genome, and an inappropriate reference may be particularly misleading for sequence alignment.
There was also recognition that a universal reference may not be suitable for all applications and a validation set may need to be tailored to the application type. For example, applications targeting cell mixtures would need an appropriately heterogeneous reference material. Ultimately, the reference should reflect the scope of the target sequence, such as whole genome, exome, transcriptome or a select panel of genes.
Q: What constitutes a suitable set of variants in a reference genome that should be used to evaluate whole genome sequencing?
The convened experts suggested that there were probably 30-40 different classes of variants that would need validation, possibly ranging from simple single-nucleotide variants to chromosomal rearrangements to epigenetic states, such as DNA methylation. These variants may occur within genomic contexts that may be problematic for reliable identification, such as high GC content regions, repeats, multiple close homologues or pseudogenes, and copy number variations (CNVs). These conditions may interfere with the ability to sequence the region with high fidelity, or to map the resultant reads to the proper location. For example, various CYP homologues may not be difficult to sequence, but may be problematic for mapping. Highly homologous regions with a sprinkling of SNP variants impose a premium on the ability of a platform to generate long reads.
The discussion participants were generally sanguine about the possibility of monitoring the accuracy of SNP calls, but felt that no other variant class would be nearly as straightforward to verify; most variant types were likely to “behave badly”. For validation, the HapMap and 1000 Genomes Projects may provide a large pool of SNP data and samples with a certain overall accuracy, although clinical lab representatives cautioned that the level and depth of sequencing in these data sets may not be adequate for a number of clinical applications. Also, different error rates would be required for heterozygote detection in germline tissue versus useful base calling in mixed cell populations (such as tumors). Furthermore, many of the SNPs have been widely verified by orthogonal technologies like Sanger sequencing and genotyping microarrays. Sanger sequencing has long been considered the “gold standard” for DNA sequencing and performs well for constitutive variants, but is not so useful in cases of chimeras, heterozygous tissues, or CNVs. One paradox for the validation of ultra high throughput sequencing by orthogonal methods is the recognition that most other methods (including mass spectrometry) may not have the sensitivity of ultra high throughput sequencing, or the ability to interrogate many individual copies of DNA.
3. ii. Bioinformatics
The overall opinion of discussants was that sample preparation steps and bioinformatics analysis are generally more likely to generate false calls than are errors generated within the sequencing platforms themselves. The two primary functions of bioinformatics in whole genome sequencing are to assemble the sequence without a reference (de novo) or, more commonly for human DNA sequencing, to align the reads to an extant reference sequence and call variants relative to the reference. The bioinformatics steps, particularly alignment against a reference genome, are typically the greatest source of error in ultra high throughput sequencing variant calling. In particular, regions with high SNP density, such as HLA, are not so much difficult to sequence as they are difficult to map with fidelity. Sequence variation in pseudogenes and highly homologous gene regions can easily be mismapped unless read length and sequence fidelity are sufficient for unambiguous placement against the reference. Sequences that are present in the test sample but not in the reference genome are likely to be entirely misaligned or discarded as spurious unless there are constraints against the “best-match” policy. One suggestion was to include a set of DNA sequences carrying a collection of interesting variations, termed a “Frankenstein” reference. This would represent a set of major use cases, clinically relevant variation, and rare or difficult events for which the sequence would act as an alternative reference target during alignment.
The panel discussed the key bioinformatics steps that have the largest impact on the downstream biology. Both the accuracy of a base call and the subsequent alignment of sequence reads composed of multiple base calls (~25 to ~400 bases) are highly significant. Some suggested that a confidence metric needs to be established for each base call so that the quality of a sequence can be assessed and justified. Others thought that such a metric could be difficult to define. In terms of sequence alignment, there are many algorithms to choose from, and significant parameter settings for each. Overall, alignment algorithms are extremely complex, but they are evolving and improving rapidly in their ability to perform reproducibly and reliably. In general, bioinformaticians can eventually achieve good performance given reliable standards, that is, with suitable reference samples and detection specifications. Still, the algorithms need to be robust against varying sequence quality, varying read depth, novel sequence, and the many clinically interesting loci that are, thus far, mostly intractable.
The bioinformatics steps cannot yet realistically be considered a “pipeline”. There are typically several algorithms to choose from for each step of ultra high throughput sequencing alignment and variant calling, with options and parameters that must be optimized for each. This situation is diametrically opposed to the requirements of clinical practice, which in general needs standardized, uniform, and precisely defined steps with fixed, pre-programmed options and parameters. Typically, even minimal changes in protocol require re-validation.
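One way a laboratory might make such a protocol precisely defined, and make any deviation detectable, is to fingerprint the validated tool versions and parameters. The following is a hedged sketch of that idea; the tool names and parameter values are illustrative assumptions, not a prescribed configuration.

```python
# Sketch: lock down a validated bioinformatics protocol by fingerprinting a
# canonical serialization of its tool versions and parameters. Any change
# (which would trigger re-validation) produces a different fingerprint.
# Tool names and parameters below are invented for illustration.

import hashlib
import json

validated_protocol = {
    "aligner": {"name": "example-aligner", "version": "1.2.3",
                "params": {"seed_length": 19, "max_mismatches": 2}},
    "variant_caller": {"name": "example-caller", "version": "4.5",
                       "params": {"min_base_quality": 20, "min_depth": 10}},
}

def protocol_fingerprint(cfg: dict) -> str:
    """Deterministic SHA-256 over a canonical JSON encoding of the protocol."""
    canonical = json.dumps(cfg, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

baseline = protocol_fingerprint(validated_protocol)

# Even a minimal parameter change yields a different fingerprint.
changed = dict(validated_protocol)
changed["aligner"] = {**validated_protocol["aligner"],
                      "params": {"seed_length": 20, "max_mismatches": 2}}
assert protocol_fingerprint(changed) != baseline
```

The design choice here is canonical serialization (sorted keys, fixed separators), so that two logically identical configurations always hash to the same value.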
As discussed in the session on technical validation of platforms, the bioinformatics phase would also be hard to judge by a specific set of performance characteristics at each step of the pipeline; instead, it may be better evaluated as an application-specific outcome with respect to a reference set of appropriate variants.
Data Output Format
The discussion focused on several currently available, widely accepted, and generally useful formats, such as FASTQ and BAM/SAM, that may lack sufficient “expressiveness” to maintain utility in the future. “Expressiveness” here means the capacity of a data format to contain and convey a richness and depth of information in a standard way. The format may need to be flexible enough to include not only currently useful information but also information that may prove beneficial in the future. The current formats are generally adequate for single nucleotide events but lack the facility to express “in-between events” such as insertions/deletions and haplotype phase, and are clearly inadequate for more complex structural events such as CNVs or epigenetic phenomena. Such information requires increased data complexity, which has not been warranted thus far but may be worthwhile in the future. Members of the panel suggested learning from standards-development efforts in other areas of genomics. Alternatively, it may be worthwhile to establish a simpler data exchange format that can be applied uniformly across different platforms. Standards for file formats should be set by consensus in a community-wide effort.
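The expressiveness limits are visible in the FASTQ format itself: each record is four lines carrying an identifier, the called bases, and a per-base quality string, with no standard slot for haplotype phase or structural events. A minimal parser sketch (the read name and sequence are invented):

```python
# Sketch: parse the four-line FASTQ record structure. Note the record
# carries only identifier, bases, and per-base qualities -- there is no
# field for haplotype phase, structural variants, or epigenetic state.

from io import StringIO

fastq = StringIO(
    "@read_001\n"    # line 1: '@' + read identifier
    "ACGTACGT\n"     # line 2: called bases
    "+\n"            # line 3: separator (may repeat the identifier)
    "IIIIHHHH\n"     # line 4: per-base quality string
)

def parse_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from a FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()                 # discard the '+' separator line
        qual = handle.readline().strip()
        yield header[1:], seq, qual

for rid, seq, qual in parse_fastq(fastq):
    print(rid, seq, qual)  # read_001 ACGTACGT IIIIHHHH
```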
The immense amount of data created by ultra high throughput sequencing is absorbing a huge amount of resources and manpower. Academic and clinical professionals had two different perspectives on data storage (both of which may differ from the FDA’s viewpoint). In cases of perfect consensus with high coverage, many academics suggest discarding redundant data, although they are likely to keep old data because new analysis methods may yield novel discoveries from it in the future. Clinical laboratories may need to repeat the experiment; however, they may also need to store the data to satisfy medical and legal requirements, and failed experiments may be useful for QC follow-up. Clinical laboratories would like guidance and standards regarding what to store and for how long, in terms of both legal requirements and medical utility.
In summary, discussants opined that the regulatory challenge in analytical validation of ultra high throughput sequencing lies primarily in the need to develop a flexible approach that accommodates a rapidly evolving field both at the bench and in bioinformatics analyses. These processes are interconnected, and the intermediate steps cannot be easily separated from the end product, making individual validation somewhat problematic. One consideration was that a more general validation approach, both platform-agnostic and application-specific, may need to be developed. Discussants expressed the opinion that the FDA should determine what would constitute appropriate reference materials and metrics for validating the detection of different classes of variants (SNPs, insertions, deletions, CNVs, etc.), particularly with regard to clinically interesting uses and cases. These metrics should interrogate not only the ability to make a call at a certain accuracy level but possibly also the ability to estimate the confidence of each call accurately, distinguishing three categories of calls: correct, incorrect, and no call at a specified level of confidence.
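The three-way taxonomy of calls can be sketched directly: each site is scored against a truth set as correct, incorrect, or “no call” when its confidence falls below a chosen threshold. The genotypes, confidences, and threshold below are invented for illustration.

```python
# Sketch of the three-category call taxonomy raised by the discussants:
# correct, incorrect, or no-call at a specified confidence threshold.
# All genotypes and confidence values are invented.

def classify_calls(calls, truth, min_confidence=0.99):
    """Bin calls into correct / incorrect / no-call at a confidence cutoff."""
    counts = {"correct": 0, "incorrect": 0, "no_call": 0}
    for site, (genotype, confidence) in calls.items():
        if confidence < min_confidence:
            counts["no_call"] += 1       # declined: below confidence threshold
        elif genotype == truth[site]:
            counts["correct"] += 1       # confident and matches the truth set
        else:
            counts["incorrect"] += 1     # confident but wrong
    return counts

truth = {"chr1:1000": "A/G", "chr1:2000": "C/C", "chr1:3000": "T/T"}
calls = {"chr1:1000": ("A/G", 0.999),   # confident and correct
         "chr1:2000": ("C/T", 0.995),   # confident but wrong
         "chr1:3000": ("T/T", 0.80)}    # below threshold -> no call

print(classify_calls(calls, truth))
# {'correct': 1, 'incorrect': 1, 'no_call': 1}
```

Separating “no call” from “incorrect” matters for the metrics the discussants describe: a platform that declines to call hard sites is behaving very differently from one that calls them wrongly with high confidence.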
This was an initial FDA-organized meeting to discuss some of the issues related to possible regulation of NGS. Rather than providing specific answers, the meeting exposed many questions and uncertainties related to the rapid evolution and development of novel, and potentially better, methodologies and platforms. As discussed, it may be helpful to develop reproducible and transparent evaluation protocols that are adaptable to emerging ultra high throughput sequencing technologies for a variety of intended uses. Such protocols would elucidate the advantages and pitfalls of any particular platform, sample preparation method, or bioinformatics tool, ultimately enabling end users to select the appropriate method for their specific application.
The FDA is proactively participating with other Federal partners in building the regulatory path for emerging new technologies, such as ultra high throughput sequencing, to facilitate the translation of novel technology into clinical practice. Here, we have summarized the background on the regulatory challenges as well as the discussions at the FDA-organized public meeting on Ultra High Throughput Sequencing for Clinical Diagnostic Applications. One outcome was the suggestion that a coalition may be needed, likely involving academia, device manufacturers, software developers, and clinical laboratory end users, to collectively develop standards, including procedures for analytically evaluating ultra high throughput sequencing platforms and applications. In parallel, the FDA will continue working with platform manufacturers to establish a flexible and transparent way to accomplish this while assuring that the devices used in clinical applications are safe and effective.
We thank our FDA colleagues from CDRH, CFSAN and NCTR for careful review of the document. We are especially thankful to M. Salit and J. Zook from NIST for their invaluable help in drafting this document.
4 The complete AGXP competition guidelines are available at http://genomics.xprize.org/sites/genomics.xprize.org/files/AGXP_Competition_Guidelines_20111004.pdf.
8 Clinical performance includes evaluation of whether the test results correlate with the target condition of interest in a clinically significant manner; the metrics include clinical sensitivity, clinical specificity, etc.
10 Ultra High Throughput Sequencing for Clinical Diagnostic Applications - Approaches to Assess Analytical Validity, June 23, 2011 (including meeting Agenda, Presentations, Topics for Discussion, Webcast, and Transcripts) - http://www.fda.gov/MedicalDevices/NewsEvents/WorkshopsConferences/ucm255327.htm