CHEMICAL BIOLOGY
Transcript Profiling, Tools for
Valtteri Wirta, School of Biotechnology, Department of Gene Technology, KTH, Royal Institute of Technology, AlbaNova University Center, Stockholm, Sweden, Department of Bacteriology, Swedish Institute for Infectious Disease Control, Solna, Sweden
Joakim Lundeberg, School of Biotechnology, Department of Gene Technology, KTH, Royal Institute of Technology, AlbaNova University Center, Stockholm, Sweden
doi: 10.1002/9780470048672.wecb602
Regulation of gene expression plays a central role in controlling and shaping the functions of a cell. Tools for quantification of the expression of individual genes have been available for years, but over the past decade, development of the microarray technology and accompanying bioinformatics tools has made it possible to generate comprehensive overviews of the transcriptional events in both diseased and normal cells and tissues. This review covers various approaches for transcript profiling, from single-gene methods to more global analysis approaches. In addition, a detailed description is provided of the different microarray-based technology platforms, all of which enable an essentially genome-wide characterization of transcript levels. We also discuss the current challenges and future trends within transcriptional profiling, and we briefly introduce the next-generation DNA sequencing technology that will enable a more detailed description of the entire transcriptome, which includes various small RNA species (e.g., microRNA) and other noncoding transcripts.
Transcription is the essential cellular and biochemical process that links the genetic information encoded in the genome (DNA) to the functionally active macromolecules, proteins, which carry out most tasks in a cell. Transcription generates an RNA transcript, and the process is regulated at multiple levels, which includes both synthesis and degradation of the transcript. To understand the complexity of this mechanism and to connect the process to the phenotype of an organism, we need to measure the transcript levels accurately under various situations and samples. In many cases, it is advantageous to do this in a genome-wide and unbiased manner.
Biologic Background
The central dogma of molecular biology states that genetic information flows from genes, via RNA, to proteins. In this flow of information, messenger RNA (mRNA) is generated in a process called transcription and is subsequently processed to yield a mature transcript. The transcriptome is the combined set of all transcripts present in a cell at a certain time, but it should be noted that mRNA is only a minor component of the entire RNA population of a cell. The cell contains highly abundant ribosomal RNA, transfer RNA, microRNA, small nucleolar RNA, small nuclear RNA, and additional rare types of RNA, but the focus of this review is on the protein-coding mRNA transcripts. The maturation of these protein-coding transcripts consists of several distinct steps, all of which are regulated specifically (Fig. 1).
For years, it was assumed that the rate of RNA synthesis was the rate-limiting step that controls indirectly the amount of protein synthesized. However, this simplistic view has been replaced by results clearly showing that mRNA and protein levels do not always correlate fully, and that extensive regulation of mRNA transcript processing and availability for translation occurs.
A typical human cell contains approximately 300,000-500,000 transcripts, and most genes are transcribed at low to moderate levels, whereas only a small number of genes give rise to a large number of transcripts. The underlying assumption of most transcript profiling studies is that the pattern of mRNA transcripts in a cell at a certain time can be used to explain the phenotype of the cell and the activities within the cell. For example, by comparing a cancer cell with a normal cell, active pathways and signaling mechanisms in the cancer cell can be identified; in the next step, we can attempt to alter these pathways in treatment of the cancer.
Interestingly, another level of complexity has been added by research during the last few years that has shown that only a fraction of the transcribed loci generate protein-coding transcripts and that almost the entire genome is transcribed (1, 2). Even though these transcripts do not code for a protein, they may be functional in other ways; therefore, they constitute interesting target transcripts that may be analyzed using the same approaches as the protein-coding transcripts. This review will focus on analysis of the protein-coding transcripts, but it should be kept in mind that the same tools can be used to analyze many other types of transcripts and RNA molecules as well.
This review is organized as follows: a brief overview of the various approaches for transcript profiling is given in the section on Tools and techniques. In the section on Microarray-based transcriptional profiling, we provide a detailed description of the microarray technology, which enables essentially a genome-wide characterization of the transcript levels. The last section discusses the current challenges and future trends within transcriptional profiling.
Figure 1. (a) The flow and the regulation of genetic information. (b) The processing of a eukaryotic mRNA transcript. ATG, the methionine start codon; ORF, open reading frame; UTR, untranslated region.
Tools and Techniques
Essentially every cell in an organism is, at any given time, transcribing thousands of its genes in various quantities. As described in the previous section, the amount of an mRNA transcript is regulated, and there is interest, from both basic science and clinical perspectives, in quantifying transcript levels accurately. The available techniques can be divided broadly into either gene-by-gene methods (see sections on Northern blot and Quantitative real-time RT-PCR) or global methods (see sections on Sequencing-based transcriptional profiling and onward). The gene-by-gene methods aim at high-accuracy quantification of a low number of transcripts, whereas the global methods aim at a highly parallelized quantification of many transcripts, often even genome-wide. The focus of this review is on the microarray technology, which is the most widely used technology for genome-wide transcriptional profiling (see the section on Microarray-based transcriptional profiling). The other techniques are reviewed briefly below.
Northern blot
The Northern blot technique allows quantification and size determination of a transcript in a complex mixture (e.g., the entire transcriptome) by first separating the transcripts by denaturing agarose gel electrophoresis, followed by transfer to a membrane strip and hybridization with a labeled probe (3). Historically, Northern blotting has been used widely, but in recent years a shift toward more sensitive methods has taken place. These alternative methods are often less sensitive to RNA degradation, and they have a wider dynamic range.
Quantitative real-time RT-PCR
Quantitative real-time reverse-transcription PCR (qRT-PCR), which is the current “gold standard” for high-accuracy transcript profiling, provides superior sensitivity for analysis of transcript levels compared with other methods. A complex mixture of total RNA is converted to cDNA using reverse transcriptase with either random or gene-specific priming. Next, a 100-200 bp fragment is PCR-amplified using gene-specific primers that often target two different exons of the transcript, and the accumulation of the amplicons is monitored in real time using a fluorophore that targets either the amplicon of interest specifically (highest possible specificity) or any double-stranded DNA (lower cost). Theoretically, during the exponential phase of the amplification, each PCR cycle doubles the amount of product, which on a log2 scale corresponds to a linear increase. Extrapolation of this linear increase back to the baseline level provides an estimate of the initial starting amount of mRNA.
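The extrapolation described above can be illustrated with a minimal sketch. It fits a straight line to log2-transformed fluorescence readings from the exponential phase and extrapolates back to cycle 0. The function name, the readings, and the signal units are hypothetical and chosen only for illustration; real instruments and analysis packages use more elaborate baseline and threshold handling.

```python
import math

def estimate_initial_signal(cycles, fluorescence):
    """Least-squares fit of log2(fluorescence) against cycle number.

    Returns (signal extrapolated to cycle 0, fold increase per cycle).
    A fold increase of 2.0 corresponds to perfect doubling.
    """
    logs = [math.log2(f) for f in fluorescence]
    n = len(cycles)
    mean_c = sum(cycles) / n
    mean_l = sum(logs) / n
    sxx = sum((c - mean_c) ** 2 for c in cycles)
    sxy = sum((c - mean_c) * (l - mean_l) for c, l in zip(cycles, logs))
    slope = sxy / sxx
    intercept = mean_l - slope * mean_c
    return 2 ** intercept, 2 ** slope

# Hypothetical exponential-phase readings: perfect doubling from 0.001 units.
cycles = [15, 16, 17, 18, 19]
fluorescence = [0.001 * 2 ** c for c in cycles]
initial, fold_per_cycle = estimate_initial_signal(cycles, fluorescence)
print(initial, fold_per_cycle)
```

The estimate is expressed in the same arbitrary units as the fluorescence signal, so it is most useful for comparing samples rather than as an absolute copy number.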
qRT-PCR has many advantages that make it the method of choice for high-accuracy, but low-throughput, gene expression analysis: 1) it offers, at its best, a dynamic range of 7-8 orders of magnitude (4), 2) it can achieve single-copy detection (5), 3) it can be carried out in one step, 4) it has low coefficients of variation, which facilitates detection of small differences between samples (6), and 5) design of specific amplicons allows for discrimination between similar transcripts, such as transcript isoforms or different gene family members. The drawback of the method is that genes must be targeted individually; therefore, a large-scale approach is not feasible.
Sequencing-based transcriptional profiling
Recently, several platforms for high-throughput sequencing have been developed (see section on Next generation of sequence technologies), and these platforms offer an impressive improvement of the number of bases sequenced compared with previous technologies. However, use of standard capillary sequencing has also provided means for large-scale sequencing-based transcript profiling, albeit at lower throughput levels. These different sequencing-based methods are outlined briefly below and are discussed in more detail elsewhere (7).
Expressed sequence tag analysis
Expressed sequence tag (EST) sequencing generates random, 200-900 bp single-pass sequences of cDNA clones. The initial purpose of these sequences was to facilitate gene detection (8), but the technique has been used subsequently to estimate gene expression levels. The underlying assumption is that the EST sequences are generated randomly, and hence the EST counts correspond with the transcript’s abundance in the original sample. The main drawback is the low throughput (number of counts) caused by the high data generation costs (library generation and sequencing). Given that gene expression levels follow a distribution with many genes expressed at low levels and a few genes expressed at high levels (9), transcript profiling that uses EST technology enables reliable detection only of a small number of genes expressed at moderate or high levels.
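The effect of limited sampling depth can be made concrete with a small sketch. Assuming that ESTs are drawn randomly and independently, the probability of observing a transcript at least once among N ESTs is 1 - (1 - f)^N, where f is the relative abundance of the transcript. The figure of 300,000 transcripts per cell is taken from the text above; the library size of 5,000 ESTs and the function name are hypothetical.

```python
def detection_probability(copies_per_cell, transcripts_per_cell, n_ests):
    """Probability that a transcript is sampled at least once among n_ests
    randomly drawn ESTs, assuming independent draws."""
    f = copies_per_cell / transcripts_per_cell
    return 1.0 - (1.0 - f) ** n_ests

# Hypothetical library of 5,000 ESTs from a cell with 300,000 transcripts.
for copies in (1, 10, 100, 1000):
    p = detection_probability(copies, 300_000, 5_000)
    print(copies, round(p, 3))
```

Under these assumptions, a single-copy transcript is rarely sampled, whereas a transcript present at a hundred copies or more is very likely to be seen, which is consistent with the limitation noted above.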
Serial analysis of gene expression
Serial analysis of gene expression (SAGE) (10) was the first approach to provide large-scale absolute estimates of transcript frequencies, and it relies on a biotinylated primer, streptavidin-coated beads, and type IIs restriction endonucleases (that cleave outside the recognition site) to generate short tags of each transcript. The tags are concatenated and sequenced using standard sequencing technology to derive a digital representation of the transcript frequencies (counts). In the original approach, SAGE was used to isolate approximately 14-bp 3' tags, but the method has been later modified to allow for isolation of 5' tags and of longer 26-bp tags.
Cap analysis of gene expression
Cap analysis of gene expression (CAGE) (11) uses 5' cap-trapping methods to isolate full-length cDNAs selectively and generates 20-bp tags from these. After isolation, the tags are ligated to yield ~700 bp of concatenated sequence, which is cloned into a vector and sequenced. The first-strand synthesis can be primed with random primers, which allows for analysis of polyA-negative transcripts. Recently, CAGE has been used for large-scale transcription start site mapping (12).
Massive parallel signature sequencing
In massive parallel signature sequencing (13), 3' sequences of each transcript are isolated using a biotinylated primer in the cDNA synthesis and are cleaved with a restriction enzyme that generates a cohesive end. Next, the 3' signature sequences are ligated into specifically designed plasmid vectors that contain 32-nt oligonucleotide tags (in total 16.8 × 10^6 different tags), and amplified using PCR. Use of a large number of tags provides a unique tag for each 3' signature sequence, which is subsequently coupled to a 5-μm microbead. Each bead contains one type of capture tag complementary to one of the 32-nt oligonucleotide tags. Next, the captured signature sequences are sequenced on beads to yield 16-20 nt signature tags, which are counted to derive a global estimate of transcript levels.
Next generation of sequencing technologies
The technology development driven by the race toward low-cost sequencing of the entire human genome has provided the research community with new ultra-high-throughput DNA sequencers, which in the near future may enable sequencing-based analysis to generate a global overview of the transcriptome. The new sequencing approaches include the bead-based Genome Sequencer pyrosequencing instrument (454 Life Sciences, Branford, CT, USA), which produces more than 100 million bases of sequence per run; the Solexa Clonal Single Molecule Array (Illumina, San Diego, CA, USA), which produces up to one billion bases of sequence per run; and the SOLiD sequencing chemistry (Applied Biosystems, Foster City, CA, USA), which aims at generation of more than one billion bases per run.
Common for all these techniques is that they are based on random fragmentation of the sample to be analyzed (e.g., a genome), ligation of adapter molecules to both ends of the fragment, and an amplification step (e.g., emulsion PCR for Genome Sequencer and the SOLiD technologies) on a solid-phase surface. This step is followed by the actual sequence reading step that is based on detection of fluorescence (Solexa and SOLiD technologies) or emitted light (Genome Sequencer) using a CCD camera.
The large amount of data generated is a consequence of many sequence reads (~300,000 for Genome Sequencer, tens of millions for SOLiD and Solexa). The read lengths are however significantly longer for the Genome Sequencer (200-300 nt) than for the two other technologies (25-35 nt). All platforms are expected to be improved in the near future.
In addition to de novo sequencing and whole-genome resequencing, all these technologies also enable large-scale transcriptome analysis, possibly in combination with the approaches described above. For example, the Genome Sequencer platform has been used in several transcriptome analysis studies (http://www.454.com/news-events/publications.asp?cat=4).
Microarray technology
Since the first publications in the 1990s (14-16), the use of the microarray technology for transcriptional profiling has become widespread, with more than 25,000 publications in NCBI’s PubMed database. In addition, the number of microarray hybridizations in two public data repositories (see section on Data sharing) is increasing rapidly and is already approaching 250,000 hybridizations. Finally, an entire industry now exists that provides resources (e.g., arrays and reagents) and analysis support for the microarray community, which brings the technology within reach of essentially every researcher.
The term “microarray” refers to a solid-phase support on which multiple capture probes have been immobilized in an ordered fashion, and which participate in a capture reaction of a specific target molecule. Microarrays are used commonly to measure levels of mRNA transcripts, microRNAs, and proteins, but also to analyze characteristics of genomes (e.g., SNPs, gene copy number changes, and larger chromosomal gains and deletions). Since their first use in the mid-1990s, the different microarray platforms have been modified and improved extensively. The fundamental underlying advantage of the technique is that a simultaneous, highly parallelized measurement of thousands of different targets is possible; in some cases allowing for analysis of all known protein-coding transcripts.
A detailed description of the technology is provided in the section on Microarray-based transcriptional profiling.
In situ hybridizations
In situ hybridizations are based on labeled probes that base pair with and identify target transcripts in fixed samples. This technology is the only approach that provides a snapshot of transcripts together with their cellular localization. The probe is labeled using a radioactive isotope, a fluorophore, or an antigen. After washing, the probes that hybridized to their target transcripts can be detected using autoradiography, fluorescence microscopy, or immunohistochemistry, respectively. Use of fluorescence or antigens allows for use of probes for multiple transcripts simultaneously, which allows identification of genes with overlapping expression patterns.
The in situ hybridization technique is used in several large-scale efforts to provide a comprehensive picture of the gene expression pattern in, for example, mouse embryos and brain (17).
Microarray-Based Transcriptional Profiling
A typical transcript analysis starts with isolation of total RNA, followed by cDNA synthesis and labeling. Next, the purified and labeled cDNA is applied onto a microarray that contains thousands of immobilized probes, hybridized, washed and scanned, and the signal for every probe estimated and analyzed (Fig. 2).
The aim of a typical microarray-based transcriptional profiling experiment is to identify target genes for downstream validation experiments, for example, to identify genes that are expressed differentially after treatment with a certain compound. Likewise, the array technology can be used to verify a hypothesis, for example, to verify that a compound does induce a certain expected effect. Furthermore, microarrays are used widely in various classification studies in which an initial set of samples with known “labels” (e.g., type of leukemia) is profiled, followed by profiling of a different set of samples with unknown labels (e.g., patient biopsies) and assignment of these to the previously identified classes based on their expression profiles.
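The classification idea can be sketched with a nearest-centroid rule, which is only one of many possible classifiers and is not necessarily the method used in any particular study. Each class is summarized by the mean expression profile of its labeled samples, and an unlabeled sample is assigned to the class whose mean profile is closest. The expression values, gene count, and class names below are hypothetical.

```python
import math

def class_centroids(profiles, labels):
    """Mean expression profile for each class label."""
    sums, counts = {}, {}
    for profile, label in zip(profiles, labels):
        acc = sums.setdefault(label, [0.0] * len(profile))
        for i, value in enumerate(profile):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def classify(profile, centroids):
    """Assign a profile to the class with the nearest centroid (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(profile, centroids[label]))

# Hypothetical log-scale expression values for three genes in four labeled samples.
training = [[8.0, 2.0, 5.0], [7.5, 2.5, 5.2], [3.0, 9.0, 5.1], [2.5, 8.5, 4.9]]
labels = ["class_A", "class_A", "class_B", "class_B"]
centroids = class_centroids(training, labels)
print(classify([7.0, 3.0, 5.0], centroids))
```

In practice, genes are usually filtered or weighted before classification, and performance is assessed on samples that were not used to define the classes.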
Figure 2. A schematic overview of the microarray technology. The figure exemplifies the two-channel technology.
Typical workflow of a microarray experiment
This section describes the use of the microarray technology for transcriptional profiling. The section includes a summary of the experimental procedures (section on Experimental design) and the most widely used array platforms (section on Platforms for Gene Expression Analysis), and it briefly summarizes the data analysis steps, the software that can be used, and the public microarray data repositories (sections on Analysis of microarray data and Data sharing).
Experimental design
Microarray experiments should be designed to be maximally informative given a certain amount of resources, and they need to answer the primary questions of the experiment. Extensive reviews on experimental design are available elsewhere (18-20). The consequence of a nonoptimal design ranges from loss of statistical power and an increased number of false negatives to inability to answer the primary scientific question of the experiment. The number of available arrays is often determined by financial resources; therefore, one of the most important questions is to determine how to allocate the different samples to a given set of arrays (hybridization scheme) and what to replicate (e.g., biologic samples or hybridizations). In addition, selection of the array platform and the target preparation approach must be considered.
For Affymetrix (Santa Clara, CA) (see below) and for other single-channel experiments, the hybridization scheme is straightforward, but for two-channel platforms, the allocation of samples to arrays is important to prioritize the primary scientific question of the study. Furthermore, balanced designs should be used so that treatments are not confounded with technical issues such as dye assignments, batch of slides, or day of hybridization. Replication is carried out to control the three levels of variation in an experiment: biologic variation (e.g., differences between animals), technical variation (e.g., differences caused by the RNA amplification), and measurement error (e.g., uneven hybridizations). Statistical testing can be carried out on any of these levels, but interpretations of the results differ; is the purpose to analyze the difference between two mice (inference at the level of technical replicates), or is the purpose to generalize the results and to draw conclusions at the level of a population (inference at the level of biologic replicates)? It can be safely assumed that the purpose of most, if not all, experiments is to analyze differences at the population level; hence, biologic replication is essential.
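One common way to keep treatment effects from being confounded with dye assignment on two-channel arrays is a dye-swap pair, in which the same two samples are hybridized twice with the dye assignments reversed. The sketch below, with hypothetical intensities and a hypothetical dye bias, shows why averaging the two log ratios with opposite signs cancels a multiplicative dye effect. The function names are illustrative only.

```python
import math

def log_ratio(red, green):
    """log2 ratio of the red channel to the green channel for one spot."""
    return math.log2(red / green)

def dye_swap_estimate(array1, array2):
    """Combine a dye-swap pair.

    array1: (red, green) with the treated sample in the red channel.
    array2: (red, green) with the treated sample in the green channel.
    A dye bias adds to both log ratios, so it cancels in the difference.
    """
    return (log_ratio(*array1) - log_ratio(*array2)) / 2

# Hypothetical spot: true twofold induction; red channel reads 1.5x too bright.
control, treated, dye_bias = 1000.0, 2000.0, 1.5
array1 = (treated * dye_bias, control)
array2 = (control * dye_bias, treated)
print(log_ratio(*array1), dye_swap_estimate(array1, array2))
```

The single-array log ratio overestimates the induction, whereas the dye-swap estimate recovers a log2 ratio of 1, that is, the twofold change built into the example.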
Sample preparation
Direct analysis of complex samples such as unfractionated tissue is often of little value because of cellular heterogeneity. Consider bulk brain, for example, which is a mixture of hundreds of different cell types. Unless a specific cell type is selected prior to mRNA extraction, the obtained gene expression profile will be a weighted average of the total gene expression of all different cell types. To enrich for a certain cell type or to obtain homogeneous samples, several different approaches have been used. First, experiments can be designed to include sampling shortly after perturbation, which allows for monitoring of early events before secondary changes accumulate. Second, synchronized cell cultures can be used, which allows for analysis of cell cycle phase-specific gene expression patterns (21). Third, fluorescence-activated cell sorting, which uses one or multiple fluorophore-conjugated antibodies to identify cells that express a combination of different cell-surface molecules, provides a rapid and sensitive cell fractionation assay. Finally, laser-capture microdissection, which is based on a microscopic evaluation of the sample and use of a computer-controlled laser to excise and to isolate specific cells into a collection vessel, can be used.
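The weighted-average effect can be illustrated with a short sketch. It assumes, for simplicity, that every cell contributes the same amount of RNA, which is an idealization; the cell types, proportions, and expression values are hypothetical.

```python
def bulk_expression(expression_by_type, fraction_by_type):
    """Expression of one gene in a mixed sample, computed as the average of
    the cell-type-specific levels weighted by the fraction of each cell type."""
    total = sum(fraction_by_type.values())
    weighted = sum(expression_by_type[t] * fraction_by_type[t] for t in fraction_by_type)
    return weighted / total

# Hypothetical gene: strongly expressed in a rare cell type only.
expression = {"rare_type": 200.0, "other_cells": 10.0}
fractions = {"rare_type": 0.05, "other_cells": 0.95}
print(bulk_expression(expression, fractions))
```

In this example, a 20-fold difference between cell types is reduced to less than a twofold elevation over the background level in the bulk sample, which illustrates why enrichment for a specific cell type can be necessary.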
Target preparation
Depending on the array platform, a labeled sample that originates from 1-20 μg of total RNA is required for each hybridization. This corresponds to 0.1-2 million cells (assuming 10 pg of total RNA per cell), which is obtainable in cell culturing studies. However, use of various sample preparation methods (see previous section) compromises the yield; hence, a target amplification method is often required. Linear T7-based in vitro transcription (IVT) (22, 23) typically yields 300- to 1000-fold amplification, and a higher amplification can be obtained by performing up to three consecutive rounds of amplification. A double-stranded DNA template that contains a T7 RNA polymerase binding site in the 5' end is synthesized using the mRNA as template, and subsequently transcribed in a 3-12 hour isothermal IVT reaction during which the amplified RNA accumulates linearly. It should be noted that all samples used on Affymetrix arrays are subjected to IVT amplification. PCR-based exponential amplification methods are diverse and typically are based on ligation of linker sequences to both ends of double-stranded cDNA, followed by a limited number of PCR cycles to yield double-stranded DNA. Generally, these methods are assumed to introduce bias to the data because of transcript-length dependent or base composition differences in amplification efficiencies. To circumvent this problem, approaches have been developed that restrict the length of the template and make it more uniform (24, 25). The advantage of PCR-based methods over linear IVT methods is that a much faster and a higher amplification is achievable (26). Comprehensive literature reviews of target amplification approaches are available (27, 28).
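The arithmetic behind the cell numbers and amplification factors quoted above can be checked with a short sketch. The figure of 10 pg of total RNA per cell and the 300- to 1000-fold range come from the text; the assumption that consecutive rounds multiply with the same fold amplification in each round is an idealization introduced here for illustration, and the function names are hypothetical.

```python
PG_PER_UG = 1_000_000  # picograms per microgram

def cells_required(total_rna_ug, rna_per_cell_pg=10.0):
    """Number of cells needed to obtain a given amount of total RNA."""
    return total_rna_ug * PG_PER_UG / rna_per_cell_pg

def cumulative_amplification(fold_per_round, rounds):
    """Idealized overall amplification, assuming every round performs equally."""
    return fold_per_round ** rounds

for micrograms in (1, 20):
    print(micrograms, "ug ->", cells_required(micrograms), "cells")

for rounds in (1, 2, 3):
    low = cumulative_amplification(300, rounds)
    high = cumulative_amplification(1000, rounds)
    print(rounds, "round(s):", low, "to", high, "fold")
```

Actual yields after several rounds are likely to be lower than this idealized product, because each round involves additional enzymatic and purification steps.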
Labeling using fluorophores is typically carried out in a prehybridization fashion for the spotted array platforms, and in a posthybridization manner for the Affymetrix platform. For Affymetrix arrays, biotin-modified nucleotides are incorporated into the aRNA during the IVT step, and the dye coupling is carried out after hybridization using phycoerythrin-streptavidin and biotinylated antistreptavidin-antibody conjugates. For spotted arrays, target labeling and incorporation of the fluorophore can be carried out either directly (the fluorophore is attached to the nitrogenous base of one of the nucleotides) or indirectly (the fluorophore is attached to modified nucleotides after cDNA synthesis using a chemical coupling). Direct labeling is often affected by incorporation difficulties and differences in efficiencies between the dyes. Indirect labeling avoids these problems by using only one type of modified nucleotide in the cDNA synthesis. Alternatively, an emerging approach, based on labeled platinum conjugates, can be used to label the RNA or DNA chemically (29). The possibility to omit all enzymatic steps makes this approach interesting and promising.
Hybridization
In the hybridization of the labeled target to the immobilized probe and the subsequent washing, two opposing forces need to be balanced: too stringent conditions yield low signals, whereas too unspecific hybridizations yield compressed ratios with little differential expression. An extensive analysis of the conditions is beyond the scope of this review, but the main parameters that must be considered are probe length, hybridization buffer composition, hybridization temperature, duration of hybridization, mixing, and wash stringency.
Scanning and image analysis
Typically, spotted microarrays are scanned at 5- or 10-μm resolution one channel at a time, which generates two 25-100 Mb 16-bit images. To facilitate the image analysis and visualization, the two images are overlaid to generate one 24-bit RGB pseudo-color image with the red, green, and yellow spots associated commonly with microarray data. Affymetrix arrays are scanned using only one wavelength. After scanning, and irrespective of array platform, the purpose of the image analysis step is to separate foreground and background pixels, to derive an estimate of the gene expression level for each feature, and to calculate various intensity and quality-control parameters (30). The background intensity is considered to represent the contribution of nonspecific hybridization to the slide surface and to the immobilized DNA. Various correction approaches have been described to account for this binding, but it should be noted that the subtraction issue is controversial, and the effect may vary depending on the dataset being analyzed (31).
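The image analysis step for one spot on a two-channel array can be sketched as follows. The sketch separates foreground from background pixels with a precomputed mask, summarizes each by its median, subtracts the background, and reports a log2 ratio. Simple subtraction is shown only because it is easy to follow; as noted above, background correction is controversial. The pixel values, the mask, and the floor value used to avoid non-positive intensities are all hypothetical.

```python
import math
from statistics import median

def spot_signal(pixels, spot_mask):
    """Return (foreground median, background median) for one channel.

    pixels: flat list of pixel intensities around a spot.
    spot_mask: flat list of booleans, True where the pixel lies inside the spot.
    """
    foreground = [p for p, inside in zip(pixels, spot_mask) if inside]
    background = [p for p, inside in zip(pixels, spot_mask) if not inside]
    return median(foreground), median(background)

def corrected_log_ratio(red_pixels, green_pixels, spot_mask, floor=1.0):
    """Background-subtracted log2(red/green) ratio for one spot."""
    red_fg, red_bg = spot_signal(red_pixels, spot_mask)
    green_fg, green_bg = spot_signal(green_pixels, spot_mask)
    red = max(red_fg - red_bg, floor)        # floor avoids log of zero or negatives
    green = max(green_fg - green_bg, floor)
    return math.log2(red / green)

# Hypothetical six-pixel neighbourhood with three spot pixels.
mask = [False, True, True, True, False, False]
red = [100, 4100, 4200, 4000, 110, 90]
green = [200, 1200, 1250, 1150, 210, 190]
print(corrected_log_ratio(red, green, mask))
```

Medians are used here because they are less affected by a few bright dust pixels than means; this is a design choice of the sketch rather than a requirement of the technology.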
Platforms for gene expression analysis
Multiple platforms are available for high-throughput, microarray-based, genome-wide transcriptional profiling. They mainly differ in the type of probe attached to the surface, the number of target samples that can be hybridized on each array, and the principal expression measurement (ratio for two-channel arrays and absolute level estimate for single-channel experiments). In addition, the target labeling and hybridization, image analysis, and initial low-level data analysis aspects often differ. At the high-level data analysis stage (where biologic inference is sought), the data analyses for the different array platforms converge, and the approaches and the interpretation of results generated are similar.
Several different microarray platforms have been compared in the Microarray Quality Control project using commercial RNA samples. This comprehensive study of reproducibility and variability, both within and between different microarray platforms, showed clearly that the data obtained using the microarray platforms are generally of high quality. The study also included a large-scale analysis of the expression levels by quantitative RT-PCR, and these results allowed, for the first time, the microarray technology to be benchmarked against the current “gold standard” technique (32).
The next section describes briefly the platforms that are used most widely for transcript profiling, starting with the two-channel platforms.
Spotted cDNA arrays
The relatively low cost of cDNA array production, the access to thousands of EST clones in the freezers of many laboratories, and the commercial distribution of EST clone collections propelled the early development and popularity of the cDNA arrays in the late 1990s (15). The arrays are generated through PCR-amplification of cloned 200-4000 bp insert sequences using vector-specific primers. The double-stranded amplicons are then purified (ethanol precipitation or filter plates), printed, and immobilized on coated glass slides. To avoid plate-handling errors, rigorous quality control steps, which include complete, partial, or random resequencing and agarose gel electrophoresis analysis of the purified clones, are advantageous, but they are labor intensive and costly.
The general advantages of the spotted cDNA arrays include the following: 1) the low cost of arrays that allow for design of experiments with extensive replication, 2) the possibility to use two-color detection and thereby reducing the number of arrays, 3) large-scale clone collections are widely available from multiple sources, 4) they are compatible with most target amplification protocols, and 5) they include established laboratory protocols. The drawbacks (many of which are shared with oligonucleotide and Affymetrix arrays, see below) include: 1) unspecific target-probe interaction because of long probes, 2) false negatives caused by drop-outs during probe preparation or array printing, 3) batch-to-batch variability in array production, 4) incomplete transcriptome coverage, 5) uncertainty over which region or isoform of a transcript is targeted with a given probe (the complete probe sequence is rarely known), 6) difficulties in maintaining high-quality probe collections (avoidance of evaporation, well-to-well contamination, plate rotation, etc.), and 7) confounded measurement of sense and antisense transcripts.
Spotted long-oligonucleotide arrays
Spotted arrays with long (50-90 nt) oligonucleotides have been available for several years, and they offer a higher specificity than is achievable using the cDNA arrays. Using publicly available genome sequences, oligonucleotides are designed in silico for each gene in a genome. The melting temperatures are also taken into account to achieve uniform hybridization conditions. Typically, the oligonucleotides are bought presynthesized, dissolved in an appropriate printing buffer, and printed. Use of presynthesized oligonucleotides offers several advantages. First, probes can be generated for any organism given that its genome sequence and gene predictions are available. Second, the probes are targeted to specific regions of genes, which allows for some differentiation of splice variants. Third, clone handling is reduced, which minimizes the risk of plate or clone handling errors. Fourth, replacement plates are easy to obtain. Last, the probes are designed to have the same sense as the mRNA; hence, they are complementary to the labeled cDNA generated from the mRNA, and a confounded measurement between sense and antisense strands is avoided. These arrays share many of the drawbacks listed for cDNA arrays (see points 2, 3, 4 and 6 in cDNA array section); in addition, the initial purchase investment for oligonucleotide collections is substantial.
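The in silico design step can be sketched in a highly simplified form. The sketch slides a window along a transcript sequence and keeps candidate probes whose GC fraction falls within a chosen range; GC fraction is used here only as a crude stand-in for melting temperature, and real design tools also check specificity against the rest of the genome. The toy sequence, the 10-nt window (real probes are 50-90 nt), and the GC limits are hypothetical. The reverse complement function illustrates that a probe with the same sense as the mRNA pairs with the first-strand cDNA.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a DNA sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def reverse_complement(seq):
    """Reverse complement of a DNA sequence (the strand a sense probe captures)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def candidate_probes(transcript, length, gc_min, gc_max):
    """All (start, probe) windows whose GC fraction lies within [gc_min, gc_max]."""
    probes = []
    for start in range(len(transcript) - length + 1):
        probe = transcript[start:start + length]
        if gc_min <= gc_fraction(probe) <= gc_max:
            probes.append((start, probe))
    return probes

# Toy transcript and a short window, for illustration only.
transcript = "ATGGCGCGTATATATTAGCGCGGCCATATTTAAAGCGC"
for start, probe in candidate_probes(transcript, 10, 0.4, 0.6):
    print(start, probe, reverse_complement(probe))
```

Restricting probes to a narrow GC range is one simple way to make hybridization behavior more uniform across an array, at the cost of excluding some regions of a transcript.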
Affymetrix GeneChip arrays
Affymetrix arrays are one of the most widely used platforms for transcriptional profiling. The probes are designed in silico and are synthesized directly on the array using photolithography (33, 34). Each gene (transcript) is interrogated by a “probe set,” which constitutes 11-20 different 25-mer perfect match (PM) probes and their corresponding mismatch (MM) probes. The MM probes differ from their PM probes by one mismatched base in the central position that functions to destabilize the probe-to-target complementarity. Depending on the data processing approach, the intensities from the MM probes can be used to correct the signal from the PM probes, but other approaches are available (35).
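The relationship between PM and MM probes, and one very simple way of using the MM signal, can be sketched as follows. The text states only that the MM probe carries one mismatched base in the central position; substituting the complementary base is an assumption made here for illustration. The averaged PM minus MM difference is likewise only one simple summary and should not be taken as the vendor's algorithm. The probe sequence and intensities are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def mismatch_probe(pm_probe):
    """Derive an MM probe by changing the central base of a PM probe.

    Assumption: the central base is replaced by its complement.
    For a 25-mer, the central base is at index 12 (the 13th base).
    """
    mid = len(pm_probe) // 2
    return pm_probe[:mid] + COMPLEMENT[pm_probe[mid]] + pm_probe[mid + 1:]

def average_difference(pm_intensities, mm_intensities):
    """Mean of PM minus MM over all probe pairs in a probe set."""
    diffs = [pm - mm for pm, mm in zip(pm_intensities, mm_intensities)]
    return sum(diffs) / len(diffs)

# Hypothetical 25-mer and a hypothetical probe set with three probe pairs.
pm = "ACGTACGTACGTACGTACGTACGTA"
print(pm)
print(mismatch_probe(pm))
print(average_difference([1200, 900, 1500], [200, 300, 400]))
```

Other processing approaches ignore the MM probes altogether or model the probe-level data statistically, as noted above.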
Affymetrix arrays have several advantages. First, arrays and operating procedures are standardized, which allows for direct comparison of data between projects and laboratories. Second, the direct synthesis of the probes on the array avoids problems with plate handling and ensures that the batch-to-batch variability is minimized. Third, small feature sizes yield dense arrays, which allows for genome-wide transcriptional profiling with multiple probes per gene. Fourth, probes are single-stranded, and nonconfounded measurements between overlapping transcripts are obtained. The drawbacks include the inflexibility in probe content caused by the initially high production costs and the fact that sample preparation always includes linear amplification (i.e., extra enzymatic steps that may introduce bias into the results).
Recently, Affymetrix also launched their exon arrays (i.e., arrays that use multiple probes to target essentially every exon of each transcript variant). This platform is the most detailed for analysis of gene expression levels, and it allows for identification of alternative splicing in addition to the standard genome-wide transcriptional analysis.
Affymetrix provides arrays that not only analyze expression levels of genes but also interrogate the entire genome for transcriptional activity. These arrays are termed “tiling arrays,” and they contain probes that are more or less evenly spaced (approximately 35 bp apart) along the genome. The arrays have been used in several studies to identify extensive transcriptional activity from regions that were not considered to encode genes. Typically, these arrays are used to provide a genome-wide transcriptional map or are used in chromatin immunoprecipitation studies.
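The idea of evenly spaced probes can be illustrated with a small sketch that lists probe start coordinates along a genomic region. It assumes that the 35-bp spacing is measured between the starts of adjacent probes and that the probes are 25 nt long. Neither detail is specified above for tiling arrays, and the function name is hypothetical.

```python
def tiling_probe_starts(region_length, probe_length=25, spacing=35):
    """Start coordinates (0-based) of probes tiled along a region.

    `spacing` is the assumed distance between the starts of adjacent
    probes; only probes that fit entirely within the region are returned.
    """
    last_start = region_length - probe_length
    return list(range(0, last_start + 1, spacing))
```

With these assumed values, adjacent 25-nt probes leave a 10-bp gap between them, so the region is sampled at regular intervals rather than covered base by base.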
Illumina's BeadArray platform
The BeadArray technology (Illumina, San Diego, CA, USA) allows for genome-wide transcript profiling using a bead-based, high-density microarray platform. The 3-μm beads are coated with hundreds of thousands of copies of a specific capture probe, and these beads are assembled into an array on either a fiber-optic bundle substrate or planar silica slides with etched microwells. For transcript profiling arrays, the oligonucleotides are 50 nt long. Multiple beads per target transcript are used to generate a large redundancy in the data, which increases the precision of the final measurement. In addition, quality control steps in the oligonucleotide synthesis and bead attachment process ensure that the frequency of dropouts is kept low (36). Besides transcript profiling, the BeadArray platform can also be used to analyze DNA, for example, in a comparative genomic hybridization experiment.
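Why redundant beads increase precision can be shown with a short sketch. Under the assumption that replicate beads give independent measurements, the standard error of their mean shrinks in proportion to the square root of the number of beads. The function name and the choice of a simple mean as the summary are hypothetical and do not describe the vendor's software.

```python
import math
import statistics

def summarize_beads(bead_intensities):
    """Return (mean, standard error of the mean) for replicate beads."""
    n = len(bead_intensities)
    mean = statistics.mean(bead_intensities)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(bead_intensities) / math.sqrt(n)
    return mean, sem
```

For a fixed bead-to-bead variability, quadrupling the number of beads per transcript would halve the standard error of the summarized intensity.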
NimbleGen gene expression platform
NimbleGen arrays (NimbleGen Systems, Inc., Madison, WI) are similar to the Affymetrix arrays in that they are produced using a light-directed probe synthesis approach, but they use computer-controlled micromirrors instead of masks to de-protect selected surfaces of the array during probe extension. The use of the Maskless Array Synthesizer (NimbleGen Systems, Inc.) technology allows for synthesis of longer probes, which typically yields arrays with probes up to 70 nt long; hence, the platform is in this respect similar to the long oligonucleotide array platforms. Redundancy and reliability of the data are increased by using 6 to 20 independent probes to interrogate each gene. Other advantages of the synthesis approach are the flexibility to change the probe content of the array frequently as new genomic information is published and the ability to generate arrays rapidly for newly sequenced organisms. In addition to gene expression analysis, the NimbleGen arrays can also be used to analyze genomes, for example, in comparative genomic hybridizations.
Analysis of microarray data
The generation of the raw data is followed by an extensive analysis of the data. A detailed description of this analysis is beyond the scope of this review, but it is available in a recent review (37). Briefly, in most cases the analysis is divided into two parts: pre-processing of the data (low-level data processing) and subsequent data mining (high-level data processing). The purpose of the pre-processing is to identify and to correct for systematic and nonsystematic technical artifacts and other nonbiologic bias in the data. Steps that are typically included are correction of the hybridization background levels, exclusion of unreliable data (e.g., spots affected by dust particles on the hybridization surface), log2-transformation of the data, and normalization to account for technical intra- and inter-slide differences. The high-level data mining includes steps in which biologic inferences are sought; for example, identification of differentially expressed genes using a moderated t-test (38, 39), identification of enriched (overrepresented) biologic themes from the Gene Ontology database (40), clustering analyses to identify coregulated genes and patterns in the data, and various other dimension reduction and classification tools (41, 42).
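The pre-processing steps listed above can be sketched for a single two-channel slide. The sketch assumes simple subtraction of local background, a fixed intensity floor for flagging unreliable spots, and global median centering as the normalization. These are only a few of many possible choices, and the function name, argument names, and floor value are hypothetical.

```python
import math
import statistics

def log_ratios(fg_red, bg_red, fg_green, bg_green, floor=1.0):
    """Background-correct two channels and return median-centered log2 ratios.

    Spots whose corrected intensity falls below `floor` in either channel
    are treated as unreliable and reported as None.
    """
    raw = []
    for fr, br, fgn, bgn in zip(fg_red, bg_red, fg_green, bg_green):
        red, green = fr - br, fgn - bgn          # background correction
        if red < floor or green < floor:         # exclude unreliable spots
            raw.append(None)
        else:
            raw.append(math.log2(red / green))   # log2 transformation
    valid = [m for m in raw if m is not None]
    center = statistics.median(valid) if valid else 0.0
    # Global median normalization: shift so the median log ratio is zero.
    return [None if m is None else m - center for m in raw]
```

Median centering rests on the assumption that most genes are not differentially expressed between the two samples. Intensity-dependent or between-slide normalization would require additional steps that are not shown here.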
A typical microarray experiment generates large quantities of data, and to analyze these data efficiently, both commercial and open-source software solutions have been developed. Open-source software has gained considerable popularity, mainly because of the availability of R packages that provide tools for analysis steps described in publications, a large user community that improves existing functions, the possibility to modify and to automate analysis steps, and the fact that the software is available at no cost. R is a programming language and an environment for statistical computing and graphics (43), and its functionality can be extended by packages such as those of the Bioconductor project (44), which provides a comprehensive collection of tools for all steps of microarray data analysis. TM4 is another open-source software suite that provides an easy-to-use, Java-based graphical interface (45).
Data sharing
To facilitate comparisons between experiments and especially meta-analyses, two raw data storage and exchange repositories are available: ArrayExpress, which is run by the European Bioinformatics Institute (Cambridge, UK) (46), and Gene Expression Omnibus (GEO), which is run by the National Center for Biotechnology Information (Bethesda, MD) (47). Both accept submissions that fulfill the Minimum Information About a Microarray Experiment (MIAME) standards (48). The purpose of the MIAME standards is to ensure that all essential information regarding the experiment underlying a publication is available, so that the results can be interpreted properly. An increasing number of journals also require the data to be publicly available in these repositories as a condition of publication. Hence, it is not surprising that these repositories are widely used; ArrayExpress contains more than 2,500 experiments and over 80,000 hybridizations, whereas GEO has passed 6,000 experiments and 160,000 hybridizations (August 2007).
Practical Aspects and Future Trends
Since the early years of microarray technology, the field has matured rapidly, and technology development has resulted in numerous platforms for genome-wide transcript profiling. With the availability of the genome sequence and gene predictions of an organism, probes can be designed easily, and arrays can be generated. In parallel with the development of the experimental platforms, the tools and the statistical framework for the analysis have advanced considerably. Today, multiple excellent tools are available as both commercial and open-source software packages. The sheer magnitude of the data has also required the development of data management systems that can store, back up, and process large data sets efficiently. Altogether, during the last 15 years, array technology has been shown multiple times to yield an accurate description of the transcriptional status of a large number of genes in a cell. It has to be remembered, however, that our understanding of the transcription process, the transcriptome, and the central dogma itself is changing rapidly, and that transcript profiling technologies, irrespective of platform, need to adjust to this changing picture, in which new details are emerging continuously.
The new information gained on non-protein-coding transcripts has changed the view of transcription in many ways in recent years, and it has changed the requirements for transcript profiling. Future technology development will have to deal with improved separation of transcript isoforms, for example, to separate coding and noncoding (e.g., partial) versions of a transcript, and with transcript strand assignment (i.e., sense or antisense transcription). In addition, the identification of transcription start sites may be useful to understand in more detail how transcription is regulated. Furthermore, the sensitivity of the methods needs to be improved to measure transcripts of low abundance accurately.
The recently published genome of an individual human (49), together with the development of the next generation of sequencing technologies, prepares the ground for an exciting alternative to microarray-based transcript profiling. These new technologies have a capacity that is unmatched by the older sequencing approaches, and with extended read lengths, additional information about the transcripts can be obtained, such as detailed transcript isoform information. In combination with the information obtained on individual genomes, exciting possibilities to combine gene copy number variation and gene expression levels may emerge in the near future. Hence, the sequencing-based methods for transcript profiling not only will challenge the microarray-based transcript profiling methods but also will complement them with additional detailed information about the transcripts and the genome from which they originate.
References
1. Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007; 447:799-816.
2. Carninci P, Kasukawa T, Katayama S, Gough J, Frith MC, Maeda N, Oyama R, Ravasi T, Lenhard B, Wells C. The transcriptional landscape of the mammalian genome. Science 2005; 309:1559-1563.
3. Alwine JC, Kemp DJ, Stark GR. Method for detection of specific RNAs in agarose gels by transfer to diazobenzyloxymethyl-paper and hybridization with DNA probes. Proc. Natl. Acad. Sci. U.S.A. 1977; 74:5350-5354.
4. Morrison TB, Weis JJ, Wittwer CT. Quantification of low-copy transcripts by continuous SYBR Green I monitoring during amplification. Biotechniques 1998; 24:954-958, 960, 962.
5. Palmer S, Wiegand AP, Maldarelli F, Bazmi H, Mican JM, Polis M, Dewar RL, Planta A, Liu S, Metcalf JA, Mellors JW, Coffin JM. New real-time reverse transcriptase-initiated PCR assay with single-copy sensitivity for human immunodeficiency virus type 1 RNA in plasma. J. Clin. Microbiol. 2003; 41:4531-4536.
6. Gentle A, Anastasopoulos F, McBrien NA. High-resolution semiquantitative real-time PCR without the use of a standard curve. Biotechniques 2001; 31:502, 504-506, 508.
7. Harbers M, Carninci P. Tag-based approaches for transcriptome research and genome annotation. Nat. Methods 2005; 2:495-502.
8. Wilcox AS, Khan AS, Hopkins JA, Sikela JM. Use of 3’ untranslated sequences of human cDNAs for rapid chromosome assignment and conversion to STSs: implications for an expression map of the genome. Nucleic Acids Res. 1991; 19:1837-1843.
9. Kuznetsov VA, Knott GD, Bonner RF. General statistics of stochastic process of gene expression in eukaryotic cells. Genetics 2002; 161:1321-1332.
10. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Serial analysis of gene expression. Science 1995; 270:484-487.
11. Kodzius R, Kojima M, Nishiyori H, Nakamura M, Fukuda S, Tagami M, Sasaki D, Imamura K, Kai C, Harbers M, Hayashizaki Y, Carninci P. CAGE: cap analysis of gene expression. Nat. Methods 2006; 3:211-222.
12. Carninci P, Sandelin A, Lenhard B, Katayama S, Shimokawa K, Ponjavic J, Semple CA, Taylor MS, Engstrom PG, Frith MC, et al. Genome-wide analysis of mammalian promoter architecture and evolution. Nat. Genet. 2006; 38:626-635.
13. Brenner S, Johnson M, Bridgham J, Golda G, Lloyd DH, Johnson D, Luo S, McCurdy S, Foy M, Ewan M. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat. Biotechnol. 2000; 18:630-634.
14. Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Horton H, Brown EL. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol. 1996; 14:1675-1680.
15. Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995; 270:467-470.
16. Southern EM, Maskos U, Elder JK. Analyzing and comparing nucleic acid sequences by hybridization to arrays of oligonucleotides: evaluation using experimental models. Genomics 1992; 13:1008-1017.
17. Visel A, Thaller C, Eichele G. GenePaint.org: an atlas of gene expression patterns in the mouse embryo. Nucleic Acids Res. 2004; 32:D552-D556.
18. Churchill GA. Fundamentals of experimental design for cDNA microarrays. Nat. Genet. 2002; 32(suppl):490-495.
19. Glonek GF, Solomon PJ. Factorial and time course designs for cDNA microarray experiments. Biostatistics 2004; 5:89-111.
20. Yang YH, Speed T. Design issues for cDNA microarray experiments. Nat. Rev. Genet. 2002; 3:579-588.
21. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 1998; 9:3273-3297.
22. Eberwine J, Yeh H, Miyashiro K, Cao Y, Nair S, Finnell R, Zettel M, Coleman P. Analysis of gene expression in single live neurons. Proc. Natl. Acad. Sci. U.S.A. 1992; 89:3010-3014.
23. Van Gelder RN, von Zastrow ME, Yool A, Dement WC, Barchas JD, Eberwine JH. Amplified RNA synthesized from limited quantities of heterogeneous cDNA. Proc. Natl. Acad. Sci. U.S.A. 1990; 87:1663-1667.
24. Brady G, Billia F, Knox J, Hoang T, Kirsch IR, Voura EB, Hawley RG, Cumming R, Buchwald M, Siminovitch K. Analysis of gene expression in a complex differentiation hierarchy by global amplification of cDNA from single cells. Curr. Biol. 1995; 5:909-922.
25. Hertzberg M, Sievertzon M, Aspeborg H, Nilsson P, Sandberg G, Lundeberg J. cDNA microarray analysis of small plant tissue samples using a cDNA tag target amplification protocol. Plant J. 2001; 25:585-591.
26. Subkhankulova T, Livesey FJ. Comparative evaluation of linear and exponential amplification techniques for expression profiling at the single-cell level. Genome Biol. 2006; 7:R18.
27. Nygaard V, Hovig E. Options available for profiling small samples: a review of sample amplification technology when combined with microarray profiling. Nucleic Acids Res. 2006; 34:996-1014.
28. Sievertzon M. Transcript profiling of small tissue samples using microarray technology. PhD Thesis. 2005. Royal Institute of Technology, Stockholm, Sweden. p. 89.
29. van Gijlswijk RP, Talman EG, Janssen PJ, Snoeijers SS, Killian J, Tanke HJ, Heetebrij RJ. Universal Linkage System: versatile nucleic acid labeling technique. Expert Rev. Mol. Diagn. 2001; 1:81-91.
30. Yang YH, Buckley M, Dudoit S, Speed T. Comparison of methods for image analysis on cDNA microarray data. Technical Report 584, Department of Statistics, University of California at Berkeley. 2000.
31. Qin LX, Kerr KF. Empirical evaluation of data transformations and ranking statistics for microarray analysis. Nucleic Acids Res. 2004; 32:5471-5479.
32. Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, de Longueville F, Kawasaki ES, Lee KY, et al. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat. Biotechnol. 2006; 24:1151-1161.
33. Fodor SP, Read JL, Pirrung MC, Stryer L, Lu AT, Solas D. Light-directed, spatially addressable parallel chemical synthesis. Science 1991; 251:767-773.
34. Pease AC, Solas D, Sullivan EJ, Cronin MT, Holmes CP, Fodor SP. Light-generated oligonucleotide arrays for rapid DNA sequence analysis. Proc. Natl. Acad. Sci. U.S.A. 1994; 91:5022-5026.
35. Irizarry RA, Wu Z, Jaffee HA. Comparison of Affymetrix GeneChip expression measures. Bioinformatics 2006; 22:789-794.
36. Gunderson KL, Kruglyak S, Graige MS, Garcia F, Kermani BG, Zhao C, Che D, Dickinson T, Wickham E, Bierle J, et al. Decoding randomly ordered DNA arrays. Genome Res. 2004; 14:870-877.
37. Grant GR, Manduchi E, Stoeckert CJ. Analysis and management of microarray gene expression data. In: Current Protocols in Molecular Biology. 2006. John Wiley & Sons, Inc., New York.
38. Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 2004; 3:Article3.
39. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. U.S.A. 2001; 98:5116-5121.
40. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 2000; 25:25-29.
41. Azuaje F. Clustering-based approaches to discovering and visualising microarray data patterns. Brief Bioinform. 2003; 4:31-42.
42. Quackenbush J. Computational analysis of microarray data. Nat. Rev. Genet. 2001; 2:418-427.
43. R Development Core Team. R: A Language and Environment for Statistical Computing. http://www.R-project.org.
44. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004; 5:R80.
45. Saeed AI, Sharov V, White J, Li J, Liang W, Bhagabati N, Braisted J, Klapa M, Currier T, Thiagarajan M, et al. TM4: a free, open-source system for microarray data management and analysis. Biotechniques 2003; 34:374-378.
46. Parkinson H, Sarkans U, Shojatalab M, Abeygunawardena N, Contrino S, Coulson R, Farne A, Lara GG, Holloway E, Kapushesky M, Lilja P, Mukherjee G, Oezcimen A, Rayner T, Rocca-Serra P, Sharma A, Sansone S, Brazma A. ArrayExpress--a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 2005; 33:D553-D555.
47. Barrett T, Suzek TO, Troup DB, Wilhite SE, Ngau WC, Ledoux P, Rudnev D, Lash AE, Fujibuchi W, Edgar R. NCBI GEO: mining millions of expression profiles--database and tools. Nucleic Acids Res. 2005; 33:D562-D566.
48. Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, et al. Minimum information about a microarray experiment (MIAME)--toward standards for microarray data. Nat. Genet. 2001; 29:365-371.
49. Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, et al. The Diploid Genome Sequence of an Individual Human. PLoS Biol. 2007; 5:e254.
See Also
Array-Based Techniques for Proteins
Array-Based Tools for Nucleic Acids
Nucleic Acids, Design and Engineering of
Oligonucleotide Arrays to Monitor Polymorphisms