Genomics is a discipline in genetics concerning the study of the genomes of organisms. This field builds upon the sequencing of the genome, and aims to annotate where the genes and their regulatory regions are located in the genome. That, in turn, allows scientists to better understand how the cell reads the "genetic code."
Genomics is only one of a number of "-omics" fields that have emerged in recent years, other examples being transcriptomics, proteomics, and metabolomics. Linguistically, the suffix "-omics" just means "to study," but in recent years it has become synonymous with consideration of a system -- genes, gene transcripts, proteins, etc. -- in its entirety. What has made this explosion of new -omics fields possible has been a revolution in the molecular-biology lab called high-throughput technology.
High-throughput technology refers to experimental methods for performing a large number of assays in parallel, using miniaturized instrumentation and automation, made possible by advances in microfluidics, high-resolution microscopy, robotics, and molecular-biology techniques. This technology has allowed researchers in the biological sciences to increase the scope of their experiments by orders of magnitude. In drug discovery, we can now screen thousands of compounds at a time; in protein science, we can search for interactions between thousands of proteins; in gene expression, we can probe the transcriptional level of most genes in the time it used to take to examine that of one gene. In recent years, the "whole genome" approach using these new technologies has allowed researchers to make remarkable advances in studying the varied and complex functional pathways of an entire organism.
As the name implies, high-throughput technology churns out enormous volumes of data -- on the order of gigabytes in a single experiment. Dealing with this data generated in the "wet lab" has spawned an entirely new discipline, bioinformatics, that marries classical molecular biology with statistics, computer science, and mathematics. We'll look at how bioinformatics teases out insights from terabytes of genomic data in the next section.
In this section, meanwhile, we'll explore a number of high-throughput wet-lab techniques that have been crucial (especially in the modENCODE project) to moving from genome sequence to biological insight.
Next-Generation Sequencing (NGS)
A central technique in much of genomics is the sequencing of DNA and RNA. A recent technological breakthrough here has been the introduction of next-generation sequencing.
Introduced in various platforms starting in 2005, "next-gen" sequencing essentially involves sequencing a massive number of DNA fragments in parallel. The increase in the number of sequenced fragments, often referred to as reads, has been breathtaking. We can now sequence over 3 billion reads in a single run of a sequencer, orders of magnitude more than in a run of the previous generation of sequencing technology.
Next-generation sequencing also has a limitation, however -- the short length of the reads, initially at ~20-30 bases but now at 100-150 bases. Longer reads are advantageous for many applications. In genome sequencing, for instance, longer reads enable more efficient assembly of a genome, just as a paragraph broken into sentences would be easier to put together than the paragraph broken into phrases.
How does next-gen sequencing overcome that limitation? Through sheer numbers. Each sample that is used for DNA sequencing typically consists of many cells, each containing the same DNA sequence. Thus, an overwhelming number of fragments can be sequenced, and that helps to get past the limitation imposed by the shortness of the reads -- just as one could reconstruct a paragraph if many copies of the paragraph were fragmented into a string of words at random places. Even with massive redundancy of sequence, though, it's tricky: the human genome in particular contains a portion of highly repetitive elements (the same sentence has been duplicated many times), and short reads don’t allow scientists to distinguish one copy from another.
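The "sheer numbers" argument can be made concrete with a small coverage calculation. The sketch below is illustrative only: the genome size, read length, and read count are assumed round numbers, and the uncovered-fraction estimate assumes reads land uniformly at random (a Poisson approximation), which repetitive regions in a real genome violate.

```python
import math


def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering each base: N * L / G."""
    return num_reads * read_length / genome_size


def fraction_uncovered(coverage: float) -> float:
    """Expected fraction of bases covered by no read at all.

    Assumes reads start at uniformly random positions (Poisson approximation).
    """
    return math.exp(-coverage)


# Assumed, illustrative values (not taken from any particular experiment).
genome_size = 3_000_000_000   # roughly human-sized genome, in bases
read_length = 100             # bases per read
num_reads = 900_000_000       # reads in the run

coverage = mean_coverage(num_reads, read_length, genome_size)
print(f"mean coverage: {coverage:.1f}x")
print(f"expected uncovered fraction: {fraction_uncovered(coverage):.2e}")
```

With these assumed numbers, each base is covered about 30 times on average, which is why short reads can still tile a genome. The calculation says nothing about repeats, which remain ambiguous however deep the coverage.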
Next-gen sequencing technology represents a vast increase in scale over older capillary sequencing methods (often called Sanger sequencing). Capillary sequencing generates longer reads (800-1000 base pairs), but is much more limited in the number of reads that can be sequenced simultaneously. For comparison, the Human Genome Project sequenced roughly 90% of the euchromatic genome (the more accessible portion of the genome) to 99.99% accuracy, at a cost of about $300 million over five years. On the latest NGS platform, the same amount of data can be generated for ~$5,000 in a week, with further rapid declines in cost expected in the coming years. NGS is also a substantial improvement over microarray technology, which has been used successfully in many applications discussed below.
The primary NGS platforms used by the modENCODE Consortium were Genome Analyzer II and its successor HiSeq 2000, from the company Illumina.
Web Mission: What It Takes to Sequence a Genome
How does next-gen sequencing really differ from previous-generation approaches (so-called Sanger sequencing)? To explore that question, let's have a look at two video projects from the Genome Center at Washington University in St. Louis.
The first set of videos -- Sequencing a Genome: Inside the Washington University Genome Sequencing Center -- was produced in 2004, and provides a detailed tour through a state-of-the-art sequencing facility at that time. As you go through the videos, note the main components of the sequencing "pipeline," and its similarity to an industrial process. One particular technique, polymerase chain reaction (PCR), looms especially large in this presentation. What is PCR, and what is its role in the traditional sequencing pipeline?
Next, turn to a later video feature from the same university, A Tour of Next Generation Sequencing: Inside the Washington University Genome Center. This series was produced in 2010, only six years after the previous series, yet the changes have been profound. Can you name two or three ways that the processing pipeline has changed with the advent of next-gen sequencing in this lab?
In one of the videos, Dr. Elaine Mardis shows a graph comparing the number of experimental "runs" on sequencing machines with the number of reads processed, and highlights the substantial decline in runs and huge increase in reads. What is this graph saying about the efficiency of sequencing in this lab? When did the change take place, and why? Dr. Mardis also cites some numbers on the cost of sequencing before and after the transition to next-gen techniques. How much has the cost of sequencing changed in the past few years?
In the course of these videos, the interviewees frequently note the large number of model organisms being sequenced in the lab -- not just human, fly, and worm, the focus of our study here, but many others, including corn, chimpanzee, and the Anopheles mosquito. Why do you think the genomes of these latter three organisms might be important in studies of human well-being, evolution, and disease?
Once the reads are generated from the sequencer, they need to be "mapped" back to the genome (a process we'll explore more in our look at bioinformatics). In mapping, we try to associate each read with a location in a "reference" genome -- the sequence that scientists have accepted as representative for a species. Some reads may not match the reference genome exactly; just as no two people are exactly alike, the genomes of different individuals of a species will have differences from each other and from the reference genome. The various types of variation include:
- Single nucleotide polymorphisms (SNPs) -- for instance, one individual may have a C at a particular location while another has a T.
- Copy number variations -- a segment of the genome may have been duplicated or removed.
- Other structural variations -- one or more segments in the genome may have been rearranged in a complex manner, often involving multiple chromosomes.
Characterization of these variations from whole-genome sequencing is an important area of research, particularly given its potential importance in disease. It was less important in the context of the modENCODE project, where inbreeding of fly and worm strains has eliminated much of the natural variation.
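To make the idea of mapping concrete, here is a toy sketch that places short reads on a reference sequence and reports single-base differences (candidate SNPs). The function name `map_read`, the sequences, and the mismatch threshold are all hypothetical. Real aligners use indexed data structures and quality scores rather than this brute-force scan.

```python
def map_read(read: str, reference: str, max_mismatches: int = 1):
    """Return (position, mismatches) for the best placement of a read, or None.

    Each mismatch is (reference_position, reference_base, read_base).
    Brute-force scan over every offset; for illustration only.
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = [
            (pos + i, ref_base, read_base)
            for i, (ref_base, read_base) in enumerate(zip(window, read))
            if ref_base != read_base
        ]
        if len(mismatches) > max_mismatches:
            continue
        # Keep the placement with the fewest mismatches (first one wins ties).
        if best is None or len(mismatches) < len(best[1]):
            best = (pos, mismatches)
    return best


# Made-up reference and reads.
reference = "ACGTTGCAAGGCTTACGATC"
reads = ["TTGCAAGG", "GGCTTATG", "CCCCCCCC"]

for read in reads:
    print(read, map_read(read, reference))
```

The first read matches the reference exactly, the second maps with one differing base (a C in the reference, a T in the read), and the third cannot be placed at all.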
Transcriptomics is the analysis of the RNA transcripts produced from the genome, and is one of the key components of the modENCODE project. An important part of the transcriptome consists of messenger RNAs (mRNAs), which are produced from genes and translated by ribosomes into proteins. We can largely characterize the state of the cell by quantifying what genes are expressed, at what levels, at what time points in the cell life span. Much of our understanding of gene regulation comes from knowledge of the transcriptome.
There are a number of ways to get at the state of the transcriptome. The expression level of specific genes, for example, can be obtained through a technique called quantitative polymerase chain reaction (qPCR). But NGS technology now allows a more direct and accurate quantification of the abundance of specific RNA transcripts across the genome -- and importantly the details of the transcript structure -- using a technique called RNA-seq.
Essentially, RNA-seq is another high-throughput technology that enables sequencing of a cell's complement of RNA, in the same way that genome sequencing addresses the cell's DNA. To sequence RNAs using RNA-seq, the first step is to build a "library" consisting of RNAs that have been reverse transcribed into more stable complementary DNA (cDNA) molecules. In reverse transcription, an enzyme called reverse transcriptase synthesizes a DNA strand using an RNA molecule as the template. The ends of the cDNA molecules in a library have also been modified so that the NGS platform can recognize them and sequence them -- the "seq" in RNA-seq.
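At the sequence level, reverse transcription amounts to building the complementary DNA strand of an RNA template. The short sketch below illustrates that base-pairing rule only; the function name and the example sequence are made up, and it ignores everything about the actual chemistry (primers, enzyme errors, second-strand synthesis).

```python
# Base pairing between an RNA template and the DNA strand copied from it.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(rna: str) -> str:
    """Return the first-strand cDNA, written 5'->3', for an RNA written 5'->3'.

    The new strand is antiparallel to the template, so the template is read
    in reverse while each base is complemented.
    """
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(rna))


mrna = "AUGGCCUAA"  # made-up example sequence
print(first_strand_cdna(mrna))  # TTAGGCCAT
```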
An RNA-seq dataset can be a rich source of information -- the modENCODE Fly Transcription Consortium, for instance, has sequenced many cell types and developmental stages of the fly, generating more than 16 billion reads. Such a dataset can be used to:
- Discover thousands of new genes and transcripts in the model organisms, as was done in the modENCODE project, to more accurately annotate the genome.
- Obtain a quantitative measure of gene expression levels by counting the number of reads matching each gene. These gene expression levels are used to describe the activity of the cell and form the basis for understanding gene regulation in multiple cell types during the development of an organism.
- Give exon-specific expression levels that can be used to characterize the patterns of alternative splicing, a process in which different combinations of exons from the same gene are joined together to form mature mRNA and subsequently multiple protein isoforms.
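Counting reads per gene, as described in the list above, can be sketched in a few lines. Everything here is hypothetical: the gene coordinates, the read positions, and the function names. The length-and-depth normalization shown (reads per kilobase per million mapped reads, RPKM) is one commonly used measure, offered as an assumption rather than as the method used by modENCODE.

```python
# Hypothetical annotation: gene name -> (start, end), half-open coordinates.
genes = {"geneA": (100, 500), "geneB": (800, 1000)}

# Hypothetical mapped reads, given by start position; all reads are 50 bases.
READ_LENGTH = 50
read_starts = [120, 130, 450, 480, 790, 900, 2000]


def count_reads(genes, read_starts, read_length):
    """Count the reads that overlap each gene by at least one base."""
    counts = {name: 0 for name in genes}
    for start in read_starts:
        end = start + read_length
        for name, (gene_start, gene_end) in genes.items():
            if start < gene_end and end > gene_start:  # intervals overlap
                counts[name] += 1
    return counts


def rpkm(count, gene_length, total_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return count / (gene_length / 1_000) / (total_reads / 1_000_000)


counts = count_reads(genes, read_starts, READ_LENGTH)
for name, (gene_start, gene_end) in genes.items():
    value = rpkm(counts[name], gene_end - gene_start, len(read_starts))
    print(name, counts[name], round(value, 1))
```

The same counting logic applied to exon intervals instead of whole genes would give the exon-specific levels used to study alternative splicing.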
Because of limitations in the current NGS platforms, each sequenced read covers only a fraction of an entire RNA transcript. Some portions of transcripts are easier to capture than others using the technology. For instance, the exact 5' and 3' ends of transcripts are difficult to determine using RNA-seq because, of the many fragments that we see, we don't know which correspond to the precise beginnings or ends of individual transcripts. Thus, the modENCODE Consortium used other methods to supplement RNA-seq data. For instance, rapid amplification of cDNA ends (RACE) was used to more precisely identify "transcription start sites." In addition, when a gene includes multiple alternatively spliced exons, it is difficult to identify precisely which set of exons comes together to form each transcript based on reads that cover just a portion of the transcript. In those situations, the modENCODE Consortium used cDNA capture technology to isolate a particular transcript or set of transcripts, and then sequenced the full transcript on a traditional capillary sequencer.
Protein-DNA Interactions and Chromatin (ChIP-chip and ChIP-seq)
As we learned in the sections on transcription and chromatin, many biochemical processes function together to regulate when and how many times a gene is transcribed into RNA. On top of transcriptional regulation, there is also translational regulation, the process that determines when and where RNAs are translated into protein. Some basic mechanisms are known, but how they function together in a network of genes and proteins to govern a complicated biological process is largely unknown. One primary mechanism of transcriptional control is the binding of an important class of proteins, called transcription factors, to DNA. Transcription factors generally bind to a specific DNA sequence to promote or inhibit the recruitment of RNA polymerase to promoters. Other proteins interact with histone proteins to remodel chromatin into an open or accessible state so that, for instance, a transcription factor can bind more readily to the open DNA.
How can scientists hope to unravel these complexities? The answer, once again, lies in new high-throughput technologies, which now allow identification of most locations on the DNA to which a particular protein is bound in a single assay. The modENCODE consortium has made ample use in particular of two techniques, ChIP-chip and ChIP-seq. In both cases, "ChIP" stands for chromatin immunoprecipitation. In ChIP-chip, what follows is profiling on a microarray (also known as a "gene chip"); in ChIP-seq, what follows is sequencing. At a high level, here's how it works:
- ChIP. Both techniques begin by isolating the regions of the DNA to which the protein of interest is bound. To do this, one first adds a chemical compound that fixes the protein to the DNA ("cross-linking") so that the protein and DNA remain attached in subsequent steps. Subsequently, the DNA is fragmented, typically by sonication (the sound waves break the long lengths of DNA into shorter pieces). This is followed by immunoprecipitation, a process in which an antibody designed to recognize a particular protein is used to isolate the protein as well as the DNA fragments that are cross-linked to it. Then, finally, the cross-linking is reversed, the protein is removed, and we are left with a sample of the DNA fragments of interest.
- Microarray analysis. In ChIP-chip, the DNA fragments obtained by chromatin immunoprecipitation are allowed to bind to matching genomic sequences on a microarray. The signal intensity at each probe on the array then indicates how enriched the corresponding genomic region is in the immunoprecipitated sample.
- Sequencing. Alternatively, in ChIP-seq, the DNA fragments are sequenced directly on an NGS platform to obtain the exact sequences, which can be mapped to the genome to identify their precise location. ChIP experiments often start with a large number of cells, e.g., millions, and each DNA fragment is also amplified many times. Thus, multiple copies of the DNA sequence from the same region will be present in the data if the binding is strong and/or if the binding is present in many cells.
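The last point above, that bound regions show up as piles of overlapping reads, is the basis for locating binding sites in ChIP-seq data. The sketch below is a deliberately simplified illustration with made-up read positions and a fixed depth threshold. Actual peak-calling tools compare against a control sample and use statistical models, which this does not attempt.

```python
def coverage_profile(read_starts, read_length, genome_length):
    """Per-base read depth for reads given by their mapped start positions."""
    depth = [0] * genome_length
    for start in read_starts:
        for pos in range(start, min(start + read_length, genome_length)):
            depth[pos] += 1
    return depth


def enriched_regions(depth, min_depth):
    """Return half-open (start, end) intervals where depth >= min_depth."""
    regions = []
    region_start = None
    for pos, d in enumerate(depth):
        if d >= min_depth and region_start is None:
            region_start = pos                      # a region opens here
        elif d < min_depth and region_start is not None:
            regions.append((region_start, pos))     # the region closes here
            region_start = None
    if region_start is not None:                    # region runs to the end
        regions.append((region_start, len(depth)))
    return regions


# Made-up data: four reads pile up near position 20; two others are isolated.
read_starts = [2, 20, 21, 22, 23, 40]
depth = coverage_profile(read_starts, read_length=5, genome_length=50)
print(enriched_regions(depth, min_depth=3))  # [(22, 26)]
```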
When ChIP-chip was developed, it brought a significant advance in our ability to study gene regulation. By knowing exactly where a transcription factor binds in the genome, for instance, one can identify the genes that are potentially regulated by that factor. One important limitation of ChIP-chip, however, was that the potential sites to which a protein is bound had to be predetermined and incorporated onto the microarray. For large genomes, it was possible to include only a small fraction of the genome on the microarray, which limited its usefulness. ChIP-seq represents a further advance, allowing one to identify the protein-bound fragments with higher resolution, without first having to specify their potential sites. For the first time, researchers can get a genome-wide view of protein-DNA interactions in humans. In modENCODE, ChIP-chip was used initially, as the fly and worm genomes were small enough to fit on the latest microarrays. As the NGS platforms improved, a switch was made to ChIP-seq.
A successful ChIP-seq experiment provides a genome-wide map of the binding sites of the targeted protein as well as modifications of chromatin. This provides useful information for biologists who want to understand which proteins regulate particular genes. The modENCODE Consortium has used hundreds of ChIP-seq experiments and extensive computational analysis to assemble "networks" of protein-DNA interactions (each gene and protein as a node, and each protein-DNA binding as an edge between a gene and a protein) that may be responsible for gene regulation in flies and worms.
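The network description above (genes and proteins as nodes, binding events as edges) maps naturally onto a simple graph data structure. The following is a minimal sketch with hypothetical factor and gene names. It shows one plausible representation, not the Consortium's actual analysis pipeline.

```python
from collections import defaultdict

# Hypothetical binding events: (bound protein, gene near the binding site).
binding_events = [
    ("TF1", "geneA"),
    ("TF1", "geneB"),
    ("TF2", "geneB"),
    ("TF2", "TF1"),   # a factor may also bind near another factor's gene
]


def build_network(edges):
    """Build two lookups: protein -> target genes, and gene -> bound proteins."""
    targets = defaultdict(set)
    regulators = defaultdict(set)
    for protein, gene in edges:
        targets[protein].add(gene)
        regulators[gene].add(protein)
    return targets, regulators


targets, regulators = build_network(binding_events)
print("TF1 binds near:", sorted(targets["TF1"]))
print("geneB is bound by:", sorted(regulators["geneB"]))
```

Keeping both directions of the lookup makes it cheap to ask either "which genes might this factor regulate?" or "which factors might regulate this gene?"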
Understanding these regulatory relationships, or networks, is an essential step in developing animal models of human disease. Many mutations, even deletions of entire protein-coding genes or large gene "deserts," do not reveal an obvious phenotype in model organisms. This is due to network robustness -- often, other proteins can substitute for the function of one protein in a network. Only by understanding the regulatory relationships that determine when and how genes are transcribed and ultimately translated can we begin to understand, and therefore predict, the effects of genomic mutations on phenotype.
Additional Data Types
Other aspects of the genome have been characterized in the modENCODE project. These include profiling of microRNAs (short RNAs that bind to their target mRNAs to repress translation) and other small RNAs, origins of replication (locations in the genome where DNA replication is initiated), and histone variants (histone proteins make up a nucleosome and several variants have been linked to its different functions).
- Describe how microarray technology allows researchers to identify genetic relationships among different disease types.
- High-throughput technology makes rapid genome sequencing easily accessible. It will soon become relatively inexpensive to have your own genome sequenced. What are some of the scientific and ethical issues, should this become a routine process?
On to Bioinformatics
We've had a look at some high-throughput technologies that are capable of generating vast quantities of biological data. But data doesn't mean much if you don't have the tools to analyze and learn from it. And that is where bioinformatics, the subject of our final section, comes in.