Credit: Courtesy of The Tech Museum, San Jose, CA.


Most genes can be thought of as simply the recipes for proteins. And proteins are both the building blocks and the worker bees of the cell -- much of what gets done in a cell is done by proteins. Yet, if you think about it, not every protein needs to be made all the time in every cell:

  • Cells need to turn genes "on" and "off." The chromosomes of the cells in your eye, and virtually every other cell in your body, include every gene in your genome. Yet there wouldn't be much point in making a protein in your eye that helps to digest your food. So genes are turned on or off selectively in different cells.
  • Cells need to turn genes "up" or "down." There may be times when the cell needs more or less of a certain protein. Pregnancy, for example, temporarily changes the protein requirements of a woman's body; a prolonged period of oxygen deprivation (hypoxia) will cause the body to boost hemoglobin levels; muscle mass, fundamentally made of protein, will increase in response to exercise. These are all temporary, and often reversible, changes in proteins for selected cell types, based on environmental circumstances. The body's cells have dialed up, or down, protein levels depending on specific needs.

The processes by which different cells manage which genes are turned on and off, and how much protein is produced, are called gene regulation. And cells have more than one way of regulating gene activity. One approach to regulation, for example, relates to how chromosomes are wrapped up and coiled within the cell in the complicated structure known as chromatin. We'll cover this approach to gene regulation, sometimes called epigenetics, in another part of this module.

Here, we're going to focus on another way that cells regulate gene expression -- a team of specialized proteins in the cell called transcription factors (TFs). Understanding how TFs work is a big part of the modENCODE project.

Transcription Factors: Regulatory Authorities

Credit: DNA Learning Center/YouTube.

Most biology students are familiar with the basic molecular process by which a genotype (the way the gene for a specific trait is encoded in the DNA) is transformed into a phenotype (the expressed trait itself). The two steps:

  • Transcription. First, the genetic code is read off of the DNA and encoded into messenger RNA (mRNA). This process is handled by a protein called RNA polymerase, which zips down the DNA region and patches together pieces of RNA to form the long mRNA strand. (The process is well illustrated by the video at right from the DNA Learning Center, and also in a set of interactive slides from biologist-author John Kyrk.)
  • Translation. Once the mRNA has been assembled, it's used as a template for creating cellular proteins in the cell's ribosomes, which decode the message on the mRNA and construct the appropriate protein chain, in the process called translation. The assembled protein (sometimes altered after creation, in a process called posttranslational modification) then goes off to perform its specialized function in the cell.
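
To make the two steps concrete, here is a minimal Python sketch of transcription followed by translation. It is a toy model, not a description of the cell's real machinery: the DNA sequence is made up, the codon table is deliberately partial (it holds only the codons used in the example), and the function names are our own.

```python
# Toy model of the two steps above. The sequence is hypothetical and the
# codon table is partial: it lists only the codons this example needs.
CODON_TABLE = {
    "AUG": "Met",  # start codon
    "GCU": "Ala",
    "UGG": "Trp",
    "AAA": "Lys",
    "UAA": None,   # stop codon
}

# RNA base that pairs with each DNA base on the template strand.
COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_strand):
    """Step 1: pair each template DNA base with its RNA partner."""
    return "".join(COMPLEMENT[base] for base in template_strand)


def translate(mrna):
    """Step 2: read the mRNA three bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid is None:  # stop codon reached
            break
        protein.append(amino_acid)
    return protein


template = "TACCGAACCTTTATT"  # hypothetical template strand, written 3' to 5'
mrna = transcribe(template)
protein = translate(mrna)
print(mrna)     # AUGGCUUGGAAAUAA
print(protein)  # ['Met', 'Ala', 'Trp', 'Lys']
```

Notice that nothing in this sketch decides whether the gene should be read at all, or how often. That missing piece is the regulation problem discussed next.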

While the process seems straightforward enough, in practice it's actually a complicated business. The organism, faced with a specific need, has to find, and read,

  • the right gene
  • in the right cell
  • at the right time
  • at the right level.

How can the cell possibly solve this problem? The most obvious way (but not the only way) to regulate gene expression is at the point of transcription itself -- directly controlling whether, and to what extent, a gene is activated. And that's where transcription factors (TFs) come in.

What Transcription Factors Are . . .

TFs are proteins that sit on the DNA -- more specifically, they are chemically tuned to bind to a specific sequence in the DNA next to the gene itself, known as an "enhancer" or "promoter" region -- and basically tell the RNA polymerase what to do with respect to that gene. If a TF helps polymerase find a gene and get started transcribing it, it is called an activator. And if it blocks RNA polymerase from transcribing the gene, it is called a repressor.

TFs can help determine not just whether a gene is expressed at all, but how strongly it's expressed -- that is, how much protein gets made. If a TF increases expression of a gene, it's said to up-regulate the gene; if it decreases gene expression, it's down-regulating. How do TFs accomplish this? A big part of it relates simply to how long the activator or repressor TF sits on the DNA -- the longer a TF is there, the bigger effect it can have.

. . . and Why They're Important

TFs are crucial for the survival of an organism, as they control which genes get expressed when and where and to what level. Failure to express the right gene, at the right time, in the right place -- or expressing the wrong gene at the wrong time in the wrong place -- can have serious consequences. Scientists need to figure out which TFs are bound where in a genome to understand the regulatory networks that control how an organism develops and functions. This is what the modENCODE project set out to do in the model organisms Drosophila and C. elegans. Some of the information learned by modENCODE from the fly and the worm can be directly translated into insights on how our own genomes function.

TFs Hooking Up with Certain DNA Sequences

TFs (suggested by the red ribbons pictured here) prefer to bind certain DNA sequences. The contact points of the protein and DNA result in binding of the protein to the DNA, if it is the right sequence. The closer the sequence is to the preferred TF site, the longer the TF hangs around.

Image source: European Bioinformatics Institute/Wikimedia Commons

Before diving into the modENCODE results, though, we need to address a few basic questions that have probably occurred to you:

  • How do transcription factors know where to go?

    TFs are a major category of a class of proteins called DNA-binding proteins. Such proteins are chemically tuned to recognize different combinations of the same four DNA bases -- A, G, C, and T -- that genes use to code for proteins. The basic idea is that each TF recognizes a different combination of these bases: one TF might recognize AGGCCT, for example, and another might recognize ACCGCA. If that particular DNA sequence is in the right place near a gene, then the right TF binds there and either marks it as an active gene (up-regulation) or hides it from the polymerase (down-regulation).

  • How do they know how long to stick around on a given gene?

    The effect the TF has on a gene is partly determined by how long it stays at a certain spot on the DNA. Many things can affect this. For example, the more similar a DNA sequence is to a TF's "favorite" site, the longer that TF will bind. Put another way: As noted above, TFs are chemically tuned to recognize certain specific combinations of bases and bind to them. If the combination in the DNA is close to the combination recognized by the TF, but not exactly the same, the TF may still bind to the DNA -- but not as strongly.

    Other nearby proteins can help a TF bind or keep it from binding too. One example, which has been studied in detail, is the TF called Transcription Factor IIB (TFIIB), which requires the help of another protein, TATA binding protein (TBP), to form a "platform" for gene transcription to take place. (This is actually a "basal" interaction common to a wide range of transcription processes, rather than one that manages differential gene expression, but it does illustrate the point that proteins often interact to do their work -- and such interactions are common in differential gene expression as well.) We'll explore this more in the Web Mission below.

  • How are the TFs -- which are so important in regulating gene activity -- themselves regulated?

    TFs, of course, are proteins encoded by genes, and so -- just as with any other gene expression -- the transcription of the TFs themselves needs somehow to be regulated. How this happens is actually quite complicated -- and, as we'll see, the modENCODE project has shed some light on some of the details.

    One interesting mechanism by which the action of TFs is regulated, however, is that a TF can in many cases regulate itself. For example, a TF may be able to bind to the DNA at the locus of its own gene as well as at others, and in so doing, down-regulate production of itself, thereby keeping a limit on the amount of that particular TF available in the nucleus at a given time.
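
The first two answers above (each TF has a preferred sequence, and near-matches bind more weakly) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the DNA string is made up, the function names are our own, and the number of mismatches is only a crude stand-in for binding strength. Real analyses typically use weighted scoring schemes rather than a simple mismatch count.

```python
def count_mismatches(site, motif):
    """Number of positions where a candidate site differs from the motif."""
    return sum(1 for a, b in zip(site, motif) if a != b)


def scan_for_sites(dna, motif, max_mismatches=1):
    """Slide along the DNA and keep every window close enough to the motif.

    Returns (position, site, mismatches) tuples. Fewer mismatches is used
    here as a rough proxy for stronger, longer-lasting binding.
    """
    hits = []
    width = len(motif)
    for start in range(len(dna) - width + 1):
        window = dna[start:start + width]
        mismatches = count_mismatches(window, motif)
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


dna = "TTAGGCCTGACAGGCTTCA"  # hypothetical stretch of DNA
hits = scan_for_sites(dna, "AGGCCT", max_mismatches=1)
for position, site, mismatches in hits:
    print(position, site, mismatches)
# 2 AGGCCT 0   (exact match: the TF's "favorite" site)
# 11 AGGCTT 1  (near match: the TF may still bind, but less strongly)
```

A scan like this finds places where a TF could bind. As the ChIP-Seq discussion later in this section explains, that is not the same as knowing where it actually is bound in a living cell.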

Web Mission: TFs in Action -- TATA, TBP, and TFIIB

TFIIB (right), TBP (left), and DNA (center).
Image source: By Engineer gena (Own work), via Wikimedia Commons

Take a look at the image to the left. It shows a molecule of DNA -- the twisted complex of red and white in the middle of each diagram -- surrounded by two proteins, transcription factor IIB (TFIIB) and the TATA-binding protein (TBP). These molecules (along with others) form the so-called preinitiation complex -- a platform that can engage with RNA polymerase II and allow transcription to commence. Let's take a Web Mission to dig a bit deeper into this molecular interaction.

The action starts in a sequence of DNA known as the TATA box -- a key part of the promoter region of some 24% of human genes. The TATA box is named for a specific sequence of DNA rich in thymine (T) and adenine (A) bases. For genes whose promoter regions include the TATA box, transcription begins when TBP, a specialized transcription factor (really a part of a larger general transcription factor called TFIID), binds to the TATA region in the DNA. To see part of the process in action, have a look at a video clip from Molvisions.

The binding of TBP to the TATA region, however, is only the first step in the development of the preinitiation complex, which is actually a group of proteins that help position the RNA polymerase at the active site of transcription and kick off the process of transcription. To see one of the next steps, have a look at a student project on human TFIIB from the Biology Department at Kenyon College. What you see in the left pane is a 3D representation of the complex of the general transcription factor TFIIB, the DNA, and the TATA-binding protein (your browser will need to have Java enabled to see this diagram).

Click the gray box in the section of text labeled "Introduction." You will now see a less cluttered view consisting of the DNA helix at the center, the TBP in red, and TFIIB in blue. The proteins are depicted in ribbon diagrams, a common schematic way of depicting proteins, with the spiral ribbons and twisted arrows showing specific key features of the protein's detailed structure. Why are the details of a protein's structure important in understanding the transcription process?

Now scroll down in the text to the right to the section titled "Interactions of the Ternary Complex." Click the button that says "Reset Ternary Complex." Now select, in turn, the three gray buttons within this section of the text. Note that TBP is structurally and chemically favored to bind to the DNA's TATA region, and that it introduces a kink in the DNA molecule. TFIIB is structured ideally to fit with the complex of the bent DNA and TBP, and forms a molecular "clamp" that holds the TBP in place. The result is a stable platform for RNA polymerase II to kick off the transcription process in the gene's active region. (In reality, formation of the preinitiation complex is even more complicated, with the action of several other transcription factors in addition to TBP and TFIIB.)

Intricate as the process might seem, the TATA box is only one of a number of different DNA elements that can constitute the business end of a gene's core promoter region. Some evidence has emerged, for example, that genes lacking the TATA box might be important in a wide range of basic "housekeeping" functions, while the TATA box is more prevalent in highly cell-type-specific or highly regulated, specific genes. Why do you think evolution might have developed different core promoter elements for different biological processes?

modENCODE Insights

With some basic knowledge of transcription factors under our belts, we can now turn to how the modENCODE project expanded our understanding of these gene regulation elements through studies of Drosophila and C. elegans. In what follows we'll look in particular at studies of the worm, with insights from the fly covered in a separate Vignette.

Journal Club

Students who want to dig deeper into the science that follows can refer to some of the scientific papers that came out of the modENCODE project. Important in the discussion below are the papers by Gerstein et al. (Science), Niu et al. (Genome Research), and MacArthur et al. (Genome Biology). You may want to open these links up in separate tabs in your browser for reference as you go through the material below.

The Raw Material: Desperately Seq-ing Binding Sites

ChIP-Seq: A method to accurately identify where TFs are bound in the genome
Image source: Chris Taplin/Wikipedia

A first step for modENCODE was to find out where 23 specific transcription factors in C. elegans actually bind on the DNA. This is harder than it sounds, not only because even the lowly worm has more than 20,000 genes in its genome, but also because scientists can't easily predict where TFs are bound based on sequence alone. For example, not every possible DNA site that can be bound is bound -- and some DNA sites that are bound under specific conditions don't look like they would be a place where a TF would likely bind.

Fortunately, scientists have come up with ways to figure out where a particular TF is bound in the genome in a particular kind of cell. One technique -- depicted in the diagram to the right, and crucial to the modENCODE project -- is called ChIP-Seq. And, as the name implies, it's really a combination of two other techniques:

  • Chromatin immunoprecipitation (ChIP). In the ChIP process, represented by steps 1 to 4 of the diagram, DNA is cross-linked to the proteins bound to it in the cell, the DNA is mechanically or chemically clipped into short fragments, and the fragments linked to the TF of interest are "tagged" with antibodies that, because of their structure, themselves bind specifically to just that TF. Binding the TFs to particular antibodies allows the TF-linked DNA to be specifically isolated through a process called immunoprecipitation, so that the scientists know they are working only with DNA regions specifically targeted by the TF.
  • Sequencing. In the second part of the process (steps 5 and 6 of the diagram), the TFs are un-linked from the DNA (usually by heating), and the DNA fragments are identified in a massively parallel process that allows thousands of sites to be assayed at once (we'll learn more about this new methodology in the Genome module).
A 1994 paper in Science pioneered the use of GFPs in studies of C. elegans.
Image source: Science

For Drosophila, using ChIP-Seq required finding a separate antibody for each individual transcription factor. For C. elegans, however, the scientists could once again take advantage of the worm's transparency, coupled with the use of green fluorescent protein (GFP) as a tracer, to do things differently.

As you'll recall from the worm module, GFP is a protein that glows green when illuminated with blue or ultraviolet light, and DNA specifying GFP can be added to a specific gene sequence to see when that protein is made and, thus, where the protein encoded by that gene is active. For the modENCODE worm project, rather than find or create an antibody to each of the worm's transcription factors, the scientists created an antibody to just one protein -- GFP -- and fused the GFP individually to each of the TFs they were interested in. This allowed the team to use a single antibody for lots of different TFs -- and, since GFP glows green under the right light, and since the worm is transparent throughout its life cycle, they could see when and where these TFs were present in the living worm.

So, for the worm, the ChIP-Seq process was:

  • Create many different strains of worms, each with a GFP attached to a different TF.
  • Cross-link the bound GFP-labeled TF to the DNA.
  • Chop up the DNA, and use the antibody to GFP to pull out the appropriate TF-linked DNA fragment.
  • Release the TF from the DNA fragments.
  • Sequence the DNA to determine the base-pair sequence of all of the regions to which the TF was bound.
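
What happens after the last step? Once the sequenced fragments are matched back to positions in the genome, binding sites show up as places where many fragments pile up. The Python sketch below illustrates that idea only. The read coordinates, the toy genome length, the depth cutoff, and the function names are all hypothetical, and real ChIP-Seq analyses rely on control samples and statistical models rather than a fixed cutoff.

```python
def coverage(reads, genome_length):
    """Count how many sequenced fragments overlap each genome position.

    Each read is a (start, end) pair covering positions start..end-1.
    """
    depth = [0] * genome_length
    for start, end in reads:
        for pos in range(start, end):
            depth[pos] += 1
    return depth


def call_peaks(depth, min_depth):
    """Return (start, end) stretches where coverage reaches min_depth."""
    peaks = []
    peak_start = None
    for pos, d in enumerate(depth):
        if d >= min_depth and peak_start is None:
            peak_start = pos                   # a pile-up begins
        elif d < min_depth and peak_start is not None:
            peaks.append((peak_start, pos))    # the pile-up ends
            peak_start = None
    if peak_start is not None:                 # pile-up runs to the end
        peaks.append((peak_start, len(depth)))
    return peaks


# Hypothetical fragment positions on a 40-base toy genome.
reads = [(2, 8), (4, 10), (5, 11), (20, 26), (30, 36), (31, 37), (32, 38)]
depth = coverage(reads, 40)
peaks = call_peaks(depth, min_depth=3)
print(peaks)  # [(5, 8), (32, 36)]
```

The lone fragment at positions 20 to 25 never reaches the cutoff, so it is treated as background rather than as a binding site.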

The result: Looking at 23 TFs in the worm, the modENCODE scientists identified 16,700 binding sites -- the raw material for the insights on TFs highlighted below.

Insight 1: You don't need to code for a protein to be important

Most of the cell's RNA is not translated into protein, yet it performs important tasks in the cell.

Source: Todd Smith, "Small RNAs Get Smaller," Geospiza FinchTalk (2009)

We all know the importance of RNA in the process of gene transcription and protein translation -- both as the blueprint for protein synthesis transcribed from the DNA template (messenger RNA, or mRNA), and as central elements of the translational apparatus, including the RNAs of the ribosome itself (rRNAs) and the coded carriers of the amino acid building blocks of the proteins themselves (transfer RNA, or tRNA). But a major theme in molecular biology over the past few decades has been an increased understanding of the role of other forms of RNA in managing gene expression and other tasks. These other RNAs include so-called micro-RNAs (miRNA) and small interfering RNAs (siRNA), which can down-regulate gene expression through a process called RNA interference. (You can read a bit more about these RNAs in a blog posting here.)

The modENCODE project's exhaustive examination of transcription factors allowed new insights on these so-called noncoding RNAs -- RNAs that are transcribed from the DNA sequence but never get translated into proteins, yet still perform important work in the cell. Before modENCODE, most of these noncoding RNAs were thought to be on all the time (which is known as being "constitutively expressed").

Surprisingly, after analyzing the 16,700 binding sites found across 23 TFs in the worm, the modENCODE scientists found many TF binding sites near DNA that gets transcribed into these noncoding RNAs. This suggests that, like protein-coding genes, noncoding RNAs might respond to developmental and environmental signals too.

Insight 2: Little RNAs regulating TFs

The C. elegans TF-miRNA regulatory network (click to visit).

In addition to figuring out TF binding sites, the modENCODE project also found many micro-RNAs (miRNAs) using another next-generation sequencing technology, RNA-Seq. miRNAs are small noncoding RNAs that, like TFs, regulate gene expression. The modENCODE scientists found that there are some miRNAs that control when, where, and how much of a given TF gets made in a cell -- and that the reverse is true, too. These results changed how scientists look at the human genome and how they think about the various roles noncoding RNAs play in a cell.

The diagram on the left schematically shows the network of connections between TFs and miRNAs, in both directions, and among the TFs themselves. To get an idea of the network's complexity, visit the modENCODE project's data-warehouse site, modMINE, for the full-sized interactive figure. Toggle buttons at the bottom of the figure allow you to highlight interactions from miRNAs to TFs, from TFs to miRNAs, and between individual TFs.

Insight 3: Genes that are HOT, HOT, HOT

Sometimes a gene is so important it needs to be controlled by lots of TFs to make sure it stays on.
Image Source: Trupti Kawli

Most genes are controlled by a few transcription factors (TFs) that bind to certain sites near the gene called "regulatory regions." Genes that are expressed in similar cell types are often controlled by the same TFs. Results from the modENCODE project showed that this is not always the case, though. Sometimes a gene is associated with a short piece of DNA called a High-Occupancy Target (HOT) region -- 300 or so base pairs of DNA that can be bound by 15 or more different, unrelated TFs.
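
One simple way to picture the search for HOT regions is to divide the genome into 300-base-pair windows and count how many different TFs have a binding site in each. The Python sketch below does just that. It is an assumption-laden illustration, not the modENCODE method: the TF names and positions are invented, the function name is our own, and fixed windows can split a real cluster across a window boundary.

```python
from collections import defaultdict


def find_hot_regions(binding_sites, window=300, min_tfs=15):
    """Flag windows bound by at least min_tfs different TFs.

    binding_sites is a list of (tf_name, position) pairs. Returns
    (window_start, window_end, number_of_distinct_tfs) tuples.
    """
    tfs_by_window = defaultdict(set)
    for tf_name, position in binding_sites:
        # Integer division assigns each site to a fixed-size window.
        tfs_by_window[position // window].add(tf_name)
    return sorted(
        (index * window, (index + 1) * window, len(tfs))
        for index, tfs in tfs_by_window.items()
        if len(tfs) >= min_tfs
    )


# Hypothetical data: 16 different TFs bind between positions 900 and 1050,
# while only two TFs bind near position 5000.
binding_sites = [(f"TF{i}", 900 + 10 * i) for i in range(16)]
binding_sites += [("TF0", 5000), ("TF1", 5050)]

hot = find_hot_regions(binding_sites)
print(hot)  # [(900, 1200, 16)]
```

Using a set for each window means that a TF binding twice in the same window is counted only once, which matches the idea of "15 or more different" TFs.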

Why would so many TFs be needed for these particular genes? One possibility is that the genes controlled by HOT regions are so important that the cell "throws everything it can at them" to make sure the gene is on and stays on. This idea makes sense, since HOT regions are usually close to key genes expressed at high levels in many cell types. And, not surprisingly, this strategy to ensure proper expression of key genes is used by Drosophila, too.

Scientists have found similar regions in humans. The presence of HOT regions in such a diverse array of organisms suggests that they reflect very basic biological processes.

Insight 4: Policing the policemen

Transcription factors play a vital role in policing and regulating gene expression. But their very importance raises some additional mysteries -- how the transcription factors themselves are made, for example, and how the cell regulates the activity of the transcription factors (which in turn regulate gene expression).

TF regulatory hierarchy. (a) The entire TF regulatory network that the modENCODE scientists were able to identify. (b) Expression of higher-up TFs in the hierarchy is controlled by fewer TFs. (c) Middle-level TFs are controlled by higher-up TFs, and in turn control lower TFs. (d) Lower TFs are controlled by lots of higher TFs.

The first step toward answering these quandaries is to remember that, like any other protein, TFs are encoded by genes. The modENCODE scientists set out to understand how these genes, like others, are regulated. What they found was that there's a definite hierarchy: some TF genes in the worm are controlled by lots of other TFs, and some are controlled by very few. This makes sense, if you think about the fact that genes tend to work together to make something happen in a cell.

For example, it usually takes more than one TF to respond to a certain environmental or developmental signal. In fact, it often takes a cascade of them. So some signal might set off one TF, which turns on a number of genes, some of which are genes for other TFs -- which, in turn, might turn on other TF genes in an expanding cascade.
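
A cascade like this can be pictured as a network in which each TF points to the TF genes it turns on. The Python sketch below follows a signal through such a network, one step at a time, and records how many steps it takes to reach each TF. The network, the TF names, and the function name are all hypothetical and are not taken from the modENCODE data.

```python
from collections import deque


def cascade_levels(regulates, first_tf):
    """Trace a cascade outward from the TF that first receives the signal.

    regulates maps each TF to the TF genes it turns on. Returns a dict
    giving the number of steps from first_tf to every TF it reaches.
    """
    levels = {first_tf: 0}
    queue = deque([first_tf])
    while queue:
        tf = queue.popleft()
        for target in regulates.get(tf, []):
            if target not in levels:   # visit each TF once, even in a loop
                levels[target] = levels[tf] + 1
                queue.append(target)
    return levels


# Hypothetical network: TF_A responds to the signal first.
regulates = {
    "TF_A": ["TF_B", "TF_C"],
    "TF_B": ["TF_D"],
    "TF_C": ["TF_D", "TF_E"],
    "TF_D": ["TF_B"],  # a feedback loop back up the cascade
}

levels = cascade_levels(regulates, "TF_A")
print(levels)
# {'TF_A': 0, 'TF_B': 1, 'TF_C': 1, 'TF_D': 2, 'TF_E': 2}
```

In this made-up network, TF_D is turned on by both TF_B and TF_C, a small example of a TF lower in the cascade being controlled by more TFs than one at the top.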

As you might have guessed, the TFs at the top of the cascade are controlled by fewer TFs than are those at the next level of TFs, and this is true of the next levels too. Scientists call the TFs at the top of the heap modulator TFs, while the lower-down TFs are called mediator TFs. The modulator TFs are often the first responders to a developmental cue and so more often regulate specific developmental processes in multiple tissues. The mediator TFs are more uniformly expressed in multiple tissues, are essential for survival, and tend to have more interactions with other proteins. The TF hierarchical network that the modENCODE scientists observed was similar to those observed in yeast and E. coli.

modENCODE Insights: Summing Up

The initial studies of just 23 of the 700-plus different TFs in C. elegans by modENCODE scientists revealed important aspects of how TFs act to regulate gene expression throughout the genome. Many newly identified features provide new ways to look at the roles that molecules such as TFs can play in a cell. Understanding how genes are regulated at the right time in the right place by TFs provides key understanding of how an organism is made and how it functions.

Why should we care? Because failure to execute these key regulatory steps often manifests itself as disease. Understanding how gene regulation works in these model organisms might be a first step towards fixing genetic defects that may lead to disease in humans.

What About the Fly?

In the discussion above we've looked at the worm side of modENCODE, and at some specific insights that have emerged, particularly on the importance of small, noncoding RNAs in running gene expression. But the fly side of the project also provided some interesting insights into transcription, a few of which are reviewed in the Vignette below.


Next Stop: Chromatin

Fascinating and intricate as TFs are, they form only one approach to regulating gene expression. Another, very different approach lies in the architecture of the DNA-containing chromosomes themselves, and in the chemical helpers that pack and unpack the DNA within the remarkable folded structure called chromatin.