Bioinformatics is the use of computers to analyze data from biological experiments. People who work in bioinformatics are known as bioinformaticians. In recent years, with high-throughput biological experiments generating massive amounts of data, computational analysis of data has become an integral part of biology. Assays based on Next Generation Sequencing (NGS) in particular require extensive computerization, since the output of the wet-lab work is millions or billions of sequence reads.
In this section, to explain what bioinformatics is all about, we'll walk through an analysis of data from a specific high-throughput experiment. Before doing that, let's talk generally about some of the main steps, and challenges, in a bioinformatic analysis of a large data set.
Aligning Short DNA Sequences to the Reference Genome
As outlined elsewhere, ChIP-seq can find regions in the genome bound by a particular protein. ChIP-seq experiments result in several million sequence reads representing the DNA fragments from those binding regions. However, we need to sift through these millions of sequence reads to learn the binding sites of the protein. This task is not easy -- and, needless to say, cannot be done by hand.
The first step in the analysis of a ChIP-seq experiment, after sequencing, is mapping or alignment. Alignment is the process by which we attempt to learn where each read originated in the genome. That is, we find the sequence in the genome that matches the sequence of the read. Finding exactly where such a sequence matches best in something as big as the genome is difficult -- the classic "needle in haystack" problem.
Look, for example, at the image below, showing a 21-nucleotide sequence read and a larger, 100-nucleotide sequence. Time yourself -- how long does it take you to determine, by eye alone, where the 21-nt read sits inside the longer sequence?
Now click on the image, to see where the 21-nt fragment actually maps within the longer sequence. The genomes of fruit fly and worm, however, are millions of times longer than this -- ~120 million and ~100 million bases, respectively. If the above task took you 5 seconds, then it would probably take you more than a month to find that one fragment in the whole genome! Clearly, this is a job made for a computer.
Even with computer assistance, however, there are other challenges. Sequencing technologies aren't perfect, and reads often have errors. The best match in the genome therefore might not be perfect. The genome might have an "A" where the read has a "T" because of a sequencing error, yet it may still be the correct match. Because of individual variation, the sequenced genome may also contain variations such as single-nucleotide polymorphisms (a single base difference) compared to the reference genome, resulting in more mismatches in alignment. Furthermore, the genome contains repeated sequences -- e.g., a sequence of 300 base pairs might occur several times in a row. Hence, some alignments can be ambiguous, and it is difficult to figure out exactly where a particular read came from.
Because experiments are done examining millions of cells at a time, there will be variation among these cells in the amount of protein binding to DNA. Thus our results are an aggregate over these millions of cells. In a ChIP-seq experiment, we expect to see more reads aligning where the protein we are studying was bound in most cells or very strongly bound in some cells, and fewer reads where it was not bound.
Computers can help us visualize all of this, in a kind of visualization known as a signal track or signal trace. After all the reads have been aligned, the signal trace shows the frequency or strength with which the protein is bound in a given region for that particular experiment. The diagram below is an example of such a signal trace visualization, for an experiment run twice for the same short region on the X chromosome:
The first thing you'll notice is that the two traces have a lot of variability. In some places in the sequence, there appears to be relatively strong binding for the protein in both experiments; in others, the peaks in one experiment are missing in the other. How do we know which peaks are really binding sites? Why do those apparently random peaks show up at all?
Is the Observed Pattern Meaningful?
Let's do an experiment. Suppose you flipped a coin ten times in a row and got ten heads:
Before you flipped the coin, you probably assumed the coin was fair, so you expected to get about 5 heads and 5 tails. You had a null hypothesis regarding the behavior of the coin. Since we expected each flip to have a 50/50 or one half chance of heads each time, the chance of getting 10 heads after 10 flips of a fair coin is 0.510, which is about 0.001 -- a 1-in-1000 chance. Not impossible -- but you would have to believe that you are very, very lucky in order to still believe that the coin is fair.
But what if this wasn't the first time you'd ever flipped this coin before? What if, instead of flipping the coin just 10 times, you flipped the coin 1000 times, and looked for 10 heads in a row? What would be the chance of getting at least 10 heads in a row during a marathon of 1000 flips? The math is a bit trickier, but it turns out that the chance is around 4 in 10, so it's not as unlikely as it initially might seem. That's an important point: if you are doing something just once, then the chance of seeing something unlikely is low, but if you are doing it over and over, then the chance of seeing at least one rare outcome can become quite high. Or, to put it in more formal terms: when multiple tests are run, the probability of seeing an unlikely event among those multiple tests is relatively high.
For much the same reason -- that is, because of the large numbers of features being examined at any one time -- it is important to be cautious when interpreting genome-wide experiments. For instance, we may observe a gene with higher expression in an experiment in which worms are starved compared to an experiment where the worms are fed (this is known as a control experiment). It may be that the gene is related to the worm’s response to starvation. However, in our coin example, if there had been 10,000 people each flipping the coin 10 times, we would expect about 10 people to get all heads (.001 x 10,000 = 10) by chance. Similarly, in our experiment we are examining all of the ~20,000 genes in the worm genome. If the chance of seeing one gene expressed significantly above the control at random were 0.001, then we would expect 20 genes to be over-expressed by chance. Thus, careful statistical assessment and replicate experiments (see below) are needed to determine biologically significant results.
Making Inferences from Genomic Data
Another way that biologists distinguish real events from events that are due to chance alone is by replicating experiments -- that is, by doing the same experiment more than once. Chance events, like the random binding of a regulatory protein on a segment of DNA on which it doesn’t normally bind, can be identified because, unlike real events, they aren’t reproducible. Suppose 10,000 people flip the coin 10 times, and then the same 10,000 people flip the coin 10 times again, replicating the experiment. If all the coins are fair (have a 50/50 chance of showing heads), then the chance that the same person would get 10 heads in a row in both replicates is roughly 1 in 100,000. This means that out of the 10,000 people, we would not expect anyone to get 10 heads in a row in two consecutive replicates by chance alone. In the starvation experiment, by filtering out genes more highly expressed in starvation conditions in only one replicate, we generate a list of genes related to starvation with higher confidence and remove random noise.
The picture below shows two ChIP-seq tracks where we account for the variation between replicates and and other statistical issues. In addition to DPY27 from the above figure, we also show the transcription factor HLH-1. We have highlighted those peaks that represent binding sites most likely to be real after correcting for chance events. You can see that the two replicates of HLH-1 are more similar to each other and show less variation (noise) relative to the DPY27 replicates.
Working together, bioinformaticists, statisticians, and biologists have built sophisticated computer programs that make inferences from genomic data in an automated fashion, taking into account the strength of signal, the likelihood of signal strength under various models, the effects of multiple testing, and experimental replicates. The output of these programs usually identifies particular genomic regions with specific biological function. For instance, an RNA-seq assay identifies transcribed regions of the genome, and ChIP-seq identifies regions bound by a particular protein.
Web Mission: Drilling Down in WormBase
Using the concepts introduced in this educational supplement, it’s now your turn to explore the modENCODE data. For this exploration, we'll use WormBase, a central resource for the community of C. elegans researchers. (See the fly chromatin Vignette for an exploration of FlyBase, an analogous resource for D. melanogaster.) Before starting, open up a separate tab or browser window -- the purpose of this mission is for you to explore WormBase on your own, following the guidelines documented below.
In this missiion, we will examine muscle development in the worm, guided by a study by T. Fukushige et al. We focus on a transcription factor called Helix-Loop-Helix-1 (HLH-1) and its binding profile around one of its target genes, troponin T (abbreviated as tnt-3). Both of these genes are identified as important for muscle development in the article cited above.
In the course of this mission, you'll use WormBase to answer the following questions:
- What is the relationship of these genes in the worm to human biology and disease?
- Where is the tnt-3 gene located in the worm's genome?
- How do the data show that HLH-1 bind to tnt-3 -- that is, that tnt-3 is indeed a binding target of HLH-1?
- Is HLH-1 a promoter or a repressor -- that is, does it turn transcription of tnt-3 on or off?
To get started, we first go to the WormBase homepage in a Web browser. At the top of the WormBase homepage is a search box that says "Search for a gene."
Entering hlh-1 in the search box brings up a Gene Summary page for that gene:
A similar query for tnt-3 brings up that gene's summary:
Look at the Overview paragraph on the tnt-3 Gene Summary. It notes that tnt-3’s human ortholog is associated with a human disease, and references the number OMIM:191045. That's a reference to another online database, Online Inheritance in Man (OMIM), hosted at Johns Hopkins University. Drilling down to the entry for 191045 on the site reveals that mutations in human tnt-3 lead to thickening of the heart mucle (hypertrophic cardiomyopathy). Thus, study of muscle development in the worm can be helpful in understanding human muscle diseases.
To see the available hlh-1 data and identify binding sites near tnt-3, go to the "Location" section of the tnt-3 Gene Summary page, by scrolling down or clicking to highlight "Location" in the left-hand toolbar under "Page Contents." You'll see that tnt-3 is on the X chromosome and has four annotated isoforms.
Clicking on the Genomic Position hyperlink in the upper left of the Location section takes us to a Genome Browser view where we can add signal tracks for the modENCODE datasets we are interested in. Here is the default view most users will see when they first examine the tnt-3 locus.
We are not interested in the Allele and SNP annotations right now. Click the red "X" () in their headers to turn off display of those annotations.
Next, click on the "Select Tracks" tab to find the modENCODE signal tracks for HLH-1 mapped onto the C. elegans genome. The easiest way to find the HLH-1 ChIP-seq tracks on the track list page is to use your browser's Find function and search for "HLH-1."
You see that the HLH-1 ChIP-seq tracks are found in the "Snyder Group" subheading of the modENCODE datasets; the Snyder Lab at Stanford is where the data were generated. We choose to display both replicates of the track by checking "Replicates HLH-1" under the "Replicates" group of experiments.
You should next select the "Gene Models" track at the top of the page (if it is not already selected), and then click "Back to Browser".
What are we seeing on this genome browser view? From top to bottom:
- The Overview shows that the X chromosome is more than 17 million nucleotides (nt) long. The red vertical line shows the location of the tnt-3 gene at about the 8M nt mark among a set of marker loci along the length of the chromosome.
- The Detail view below shows a narrow window around the gene, which we can see from the scroll / zoom bar is 9.427 kilobases (kb) -- i.e., 9,427 bases -- wide.
- In the Gene Models area below, we see the annotation of the troponin T gene. Tnt-3 has four known alternative transcripts; exons are light blue boxes, and introns are black lines connecting the exons. (See the transcription section for a deeper discussion of gene structure.)
- Below the Gene Models section we see the HLH-1 replicate signal tracks that we turned on.
- One of the screenshots in the introductory discussion above showed tracks from another protein, DPY27. How would you use WormBase to display the DPY27 tracks?
- Look at the gene summary for DPY27. Why do you think we chose to show DPY27 binding around tnt-3? (Hint: Which chromosome is tnt-3 on?)
- What happens when you display the ChIP_seq tracks for all the transcription factors assayed so far by modENCODE? Choose "All On" in the Select Tracks display; you might want to select "Combined" tracks to show just one track per protein, since we understand the issue of replicates now. The "Binding Sites" display is even more succinct; it shows only peak regions that have statistically significant binding. Do factors other than HLH-1 bind near tnt-3?
- Browse through the genome using WormBase and find several other genes regulated by HLH-1.
At first glance, there do not appear to be any strong HLH-1 binding peaks near tnt-3, but that may be a problem of the scale at which the data are being presented. To find out, click on the wrench icon () above the HLH-1 tracks to adjust the display settings. You'll see something that looks like the left-hand image in the screenshot below.
At left in the figure above are the default display settings for the HLH-1 signal tracks. Change the Y-axis scaling to "fixed," set the Y-axis range to 0-50, and adjust the height to 100. There are different default settings at different magnifications, so to keep these settings at all magnifications, set "Apply config when view between min and max."
The resulting settings are shown in the right-hand image above. Click the Change button to see the new view. In the image shown below, we also removed the marker loci track from the overview section, zoomed out to show 20kb around the gene and scrolled the display to the right a little to see neighboring genes.
The main feature in the HLH-1 signal track near the troponin T gene is a binding site to the right of the gene. The arrow pointing left in the gene model shows that the HLH-1 binding site is at the beginning of the gene, which is called the promoter region. This view of HLH-1 binding the promoter of tnt-3 is a classic example of a transcription factor turning its target gene on by binding at the promoter, which recruits RNA polymerase to copy the gene from DNA into RNA, and eventually to express the target protein.
- The fly genome contains six sequenced and assembled euchromatic chromosome arms, 2L, 2R, 3L, 3R, 4 and X. You've done a ChIP-seq experiment in replicate on fruit fly embryos and have inferred a large set of binding targets for the the factor you are studying. This particular factor tends to bind the promoters of gene. Now, you want to discover if this factor binds chromosome X more often than you would expect given it's binding patterns on the other chromosomes. How would you approach this problem? Would you need to take into account the lengths of chromosomes? Would you need to take into account the number of genes or the gene-density (genes per unit length) on chromosomes?
- You need to map 2 million sequence reads from a ChIP-seq experiment to the fly genome. What sort of computer program might you use? Note: you can use your favorite search engine (e.g. Google), or you can read through the modENCODE literature to find out how modENCODE tackled this sort of challenge. Are their any drawbacks to the algorithm you chose?
- You've done ChIP-seq for RNA polymerase II (Pol II) and you've done RNA-seq for the same sample. Is there any reason to believe that there will be a relationship between the output of these two assays? How might you test your hypothesis?