Systems Approach to Studying Disease

CHEMICAL BIOLOGY

Gregory D. Foltz and Leroy Hood, Institute for Systems Biology, Seattle, Washington

doi: 10.1002/9780470048672.wecb587

The emerging field of systems biology promises to transform our understanding of the molecular basis of human disease. Recent technological advances allow for multiparameter measurements to be integrated across global genomic and proteomic platforms to inform predictive and probabilistic gene and protein regulatory networks. A systems approach to disease is based on the idea that disease-perturbed gene and protein regulatory networks differ from their normal counterparts, and that these differences are predictive of the disease course and response to therapy. The ability to detect disease-related perturbations in individual patients will transform health care over the next decade from our current reactive medicine to a new medical practice that is predictive, personalized, preventive, and participatory (P4 medicine).

The completion of the Human Genome Project fundamentally transformed contemporary approaches to medicine and disease. At the root of this transformation is the concept that biology is an informational science based on a digital code, encoded in the genome, from which all biologic processes are derived. This realization has several important implications for the study of disease. First, the digital code of the human genome is knowable and can be defined, interrogated, and compared in disease and healthy states. A second type of biologic information is that which emerges from the environment to modify the digital genomic readout. Thus, the integration of the digital genomic information and the environmental information across the development of organisms, their physiologic responses to the environment, and their responses to disease is the heart of what has come to be known as systems biology and systems medicine. This field provides a basic blueprint of disease from which hypotheses can be formulated regarding etiologies and possible therapeutic entry points. Second, the digital nature of genomic data, including the complex hierarchy of molecular (biologic) networks and their dynamic response to environmental stimuli, is amenable to modern computational and analytical techniques. Third, because of the parallel advancement of high-throughput genomic and proteomic technologies, the genomic code and its derivative dynamic molecular networks are increasingly accessible and testable at the individual level. Fourth, the digital code and the dynamics of its encoded networks can be compared across model organisms, allowing for the targeted manipulation of gene networks implicated in disease states. A key point in the new medicine, then, is the idea that drugs can be designed to reengineer networks (and this will take multiple drugs!) to make them behave in a more normal manner, quite a different concept from the idea that one drug should destroy or enhance the activity of one particular target. Finally, unbiased systems analyses reveal emergent properties that are not predicted a priori, which leads to new insights into disease states. As a result of these influences, medicine is evolving from an observational, population-based, reactive approach to a more quantitative, predictive, and individualized approach to understanding disease and disease treatment based on rational genomic analysis. This evolution will ultimately transform traditional medicine into a new field of “systems medicine” that will provide personalized, predictive, preventative, and participatory treatment of disease in individual patients.

Systems Biology: An Introduction

Understanding a systems approach to disease first requires an introduction to the emerging field of systems biology. Systems approaches to understanding biologic complexity emerged as a result of several key transformative technological and theoretical advances that created new ways of thinking about biologic analysis and the scientific infrastructure required for integrated experimental models. These advances include 1) the development of high-throughput platforms for the rapid acquisition of global data sets. These platforms, which include high-speed DNA sequencers, DNA microarrays, mass spectrometry-based global proteomic technologies, antibody arrays, and the study of metabolomics with nuclear magnetic resonance (NMR) and mass spectrometry, have fundamentally enabled systems biology. 2) The completion of the Human Genome Project, which provided a blueprint for the accurate prediction of the genetic “parts list” that defines all human genes, regulatory control elements, mRNAs, and proteins. The success of this large-scale discovery project led to an explosion of similar efforts to quantify globally entire genomes, transcriptomes, proteomes, and interactomes (networks of interacting proteins and other molecules) in several individual organisms and cell types. Generating these system-wide “parts lists” is critical for accurate global analyses and model building. 3) The discovery that biologic complexity is based on a digital code, which led to the development of systems biology as a multidisciplinary field that could incorporate powerful recent advances in computer science, engineering, and mathematics and apply them to the study of biology. And 4) the parallel emergence of the Internet and high-powered computational infrastructure that enabled the acquisition and dissemination of large global data sets across multidisciplinary environments both within and between institutions. This infrastructure has revolutionized large-scale collaborative efforts and has led to the rapid integration of global analytical technologies and computational software tools for mathematical modeling and network analysis.

The goal of systems biology is to establish a theoretical and experimental framework for deciphering and predictably modeling biologic complexity on a global scale. A key feature of this approach, which will prove fundamental to our understanding of disease, is the use of hypothesis-driven perturbations to identify system-wide responses, many of which would not otherwise be predicted using classic reductionistic, or one-element-at-a-time, analytical approaches. These system-wide “emergent properties” are then incorporated into the original theoretical model, leading to the development of new hypotheses that can be tested in an iterative manner using specific hypothesis-driven perturbations. Repeated reformulation of the theoretical model continues until the experimental data, which is typically composed of dynamic global data sets (DNA, RNA, proteins, protein modifications and interactions, metabolites), comes into alignment with the model predictions. This iterative refinement of the theoretical model depends on the accurate definition of an initial model system. A biologic model system can be defined at the level of single molecules, molecular networks, cells, organs, individuals, and even ecosystems. The key feature that enables a systems biology analysis is the capability of studying all elements of the biologic system at the same time, including their interrelationships and responses to genetic or environmental perturbations.

A Systems Biology Approach to Disease

A systems approach to disease is derived from two very simple hypotheses. First, the functions of living organisms are executed by biologic networks of two different types: 1) protein networks (generally protein/protein interactions) that use biologic information to carry out functions such as signal transduction, metabolism, development, or physiologic responses, and 2) gene regulatory networks (transcription factors controlling layered networks of other transcription factors) that take input biologic information from, for example, signal transduction networks, integrate and modulate it, and then output it to the protein networks mediating, for example, development or physiologic responses. Second, disease states originate from one or more normal networks that have been disease-perturbed, for instance, either by mutations or by pathological environmental signals. The accurate identification and subsequent functional analysis of disease-perturbed networks provides a foundation for fundamental new insights into the origins of disease states and approaches for early diagnosis and therapy.

The conceptualization of disease as a dynamic, continually changing process is reflected in the systems view in two important respects. First, any disease, such as cancer, may really be multiple diseases with similar clinical signs and even histologic features, yet originating from distinct disease-perturbed networks. Thus, the stratification of disease into its different types may have very important implications for prognosis and therapy. A systems approach to disease that delineates the disease-perturbed networks in individual patients permits this stratification. Second, any particular disease type goes through progressive changes in the behaviors of the disease-perturbed networks. Accordingly, the systems approach to disease allows one to assess where the disease of an individual patient is with regard to the stages of disease progression for each disease type. Once again, the stage of disease progression may have very important implications for future therapies, such as the requirement for the therapy to change during the progression of the disease. In this regard, a long-term goal of a systems approach to disease is to gather the network data necessary to determine the extent to which a disease may be stratified into distinct types and to identify the natural stages of disease progression for each of these types. The central task of a systems approach to disease is 1) to gather information comprehensively defining each of the disease-perturbed networks that characterize distinct disease states and 2) to integrate these data to generate predictive mathematical models of disease behavior and response to therapy.

Systems medicine requires the integration of many different types of data into models that have predictive behavior. We discuss below several of the high-throughput platforms that are generating large-scale data for a systems approach to disease.

High-Throughput Platforms for Systems Analysis

Genomics

Complete genome sequencing provided the foundation for systems biology by enabling investigations that could examine specific molecular hypotheses in the context of a completely defined catalog of genes and proteins for humans. The first complete genome of a free-living species, Haemophilus influenzae, was completed in 1995 (1) and was followed quickly by other important human bacterial pathogens (2-4), yeast (5), human (6), mouse (7), and chimpanzee (8). By 2005, only a decade later, more than 1000 complete genomes had been sequenced. This massive expansion of sequence data was enabled by technological improvements in the Sanger sequencing method (9), which allowed for increased automation and throughput (10, 11). Despite these advances, significant infrastructure and cost are required for sequencing even a relatively small genome. As a result, most genome sequencing is still performed at large dedicated genome centers, and most completed genomes represent only one or a few sampled organisms (12). Recent developments in microfluidics, image processing, and enzymology promise to increase massively the speed and capacity of DNA sequencing at a significantly reduced cost. Currently, this “next-generation” high-throughput DNA sequencing (from companies such as 454, Solexa, Applied Biosystems, and Helicos) is being used to characterize cancer-associated mutations across extended patient populations (13). Once individual genome sequencing is achieved economically, comprehensive identification of individualized markers of disease susceptibility and treatment response will enable predictive and preventative strategies to treat and prevent disease.

One powerful application that has emerged from large-scale genomic sequencing is the identification and characterization of polymorphisms, either single nucleotide polymorphisms (SNPs) or, more commonly, simple sequence repeats, in genes that identify and predict variations in biologic response, behavior, and predisposition to disease states (by DNA arrays and hybridization or by DNA sequencing of various types). Most common diseases are thought to result from a mixture of genetic and environmental factors. Many diseases demonstrate a complex genetic predisposition thought to result from the contribution of small variations in several genes. More than four million putative SNPs have been identified in the human genome. Because they are highly abundant, occurring on average every few hundred base pairs in the genome, and relatively stable, they provide useful markers for linkage analysis of genes involved in the pathogenesis of complex disease. The development of high-throughput genotyping methods makes genome-wide linkage disequilibrium mapping of SNPs a viable approach to the study of complex disease susceptibility. Another important application of SNP analysis is the identification of genetic variants that influence a patient’s response to a drug, which is usually used to predict pharmacological efficacy or the likelihood of harmful side effects. An important extension of this application is the identification of SNPs that alter cellular responses to biologic signaling during normal development and function. This largely untapped area offers vast potential for defining the genetic basis of normal biologic variation in the development of disease.

Transcriptomics

Advances in genome sequencing technology have enabled the systematic measurement and comparative analysis of complete transcriptional programs of cells and tissues at various points in time and at any given developmental, pathological, or functional stage. The most widely used methodologies for transcriptome analysis include DNA microarrays, serial analysis of gene expression (SAGE), microbead-based massively parallel signature sequencing (MPSS), and massively parallel sequencing by synthesis (SBS). DNA microarrays are a powerful tool for high-throughput identification and quantitation of nucleic acids in biologic systems. DNA arrays typically consist of thousands of short gene-specific DNA molecules spatially arranged on a solid surface. Nucleic acid-specific hybridization allows for the precise identification and quantification of transcript levels in a cellular system at two or more different states. DNA microarrays can either be spotted arrays of oligonucleotides (25-60 bp in length) or cDNA molecules, or they can be oligonucleotide arrays produced by piezoelectric deposition or in situ synthesis. Recently, custom-designed arrays have become available containing up to 400,000 oligonucleotides. This flexibility in array design allows for economical approaches to study specific disease states in systems biology.

In principle, oligonucleotide arrays are more specific than the cDNA array and have the capability to distinguish between single-nucleotide differences. This method has the advantage of distinguishing between transcripts derived from individual members of multigene families and alternatively spliced variants. The widespread acceptance of DNA array technology has led to the development of applications beyond transcriptome analysis that have an impact on the study of disease. DNA array technology has been used in genotyping studies to identify SNPs and to confirm the sequence identity of known regions of DNA. Currently, DNA array applications include promoter analysis, ChIP-on-chip studies of protein-promoter binding site occupancy, mutation analysis, comparative genomic hybridization, and genome resequencing. DNA sequences can also be tagged or labeled in such a way that they can be identified in solution. A powerful new application of sequencing technology, massively parallel signature sequencing (MPSS or SBS), combines advances in microfluidics, enzymology, and image processing technologies to allow up to 1,000,000 different sequences of up to 35 residues to be determined simultaneously per sample (14).

Proteomics

Proteomics can be defined as the global characterization of proteins in complex mixtures, including protein identity, abundance, processing, chemical modifications, interactions in protein complexes, and subcellular localization within a cell or tissue. Currently, no proteomics technology approaches the throughput and level of automation of genomic technology. Strategies for protein identification and quantification can be divided into MS-based techniques, designed to provide unbiased global measurements of protein abundance, and antibody-based techniques designed to identify known proteins in a biologic sample. These and other strategies to reduce sample complexity, differentially label protein samples, and improve relative quantification of proteins by MS analysis and antibody arrays promise to enhance a systems approach to disease.

Proteomics: mass spectrometry techniques

The standard approach to proteome analysis is based on the separation of complex protein samples by two-dimensional gel electrophoresis (2DE) and the subsequent identification of selected separated protein species by one of a variety of mass spectrometric techniques (15). This approach is limited fundamentally because specific classes of proteins either are not represented in the gels (e.g., membrane proteins, small proteins, or very basic proteins) or are undetectable because of their limited abundance. Furthermore, this method remains labor intensive despite automation of 2DE gel computerized pattern matching, protein extraction and digestion, and mass spectrometry-based analysis. In addition, the enormous dynamic range of protein abundances found in biologic systems, ranging from 1 to 10⁶ copies or greater in cells and up to 1 to 10¹² in serum, is a major impediment for detecting low-abundance proteins. Improved throughput is provided by direct analysis using tandem mass spectrometry (MS/MS) of peptides generated by the digestion of complex, unseparated protein mixtures (16). The key feature of this method is the ability of a tandem mass spectrometer to collect sequence information from a specific peptide, even if numerous other peptides are concurrently present in the sample. This collection is accomplished in the instrument by the isolation of the peptide ion of interest from other peptides, fragmentation of the peptide ion in a collision cell (collision-induced dissociation, CID), and the acquisition of the fragment ion masses in a computer. It is these fragment ion masses that represent unique identifiers for a peptide and the sequence of the peptide, and therefore, the identity of a protein is determined by correlating the CID spectrum with the contents of sequence databases (17). Recently, protein separation techniques have been enhanced by the use of multidimensional liquid chromatography (LC) followed by specific protein/peptide capture strategies (18).

The development of technologies for global comparative measurements of proteomes from cells or tissues of different states (e.g., healthy vs. disease) is a fundamental requirement for a systems approach to disease. Stable isotope labeling of proteins/peptides enables high-throughput relative quantification of proteins using MS on a scale approaching several thousand per sample. The general strategy involves differentially labeling proteins or proteolytic peptides with stable isotopes and mixing the labeled samples at a 1:1 ratio, followed by combined sample processing and subsequent MS analysis. As the labeling reagents possess almost identical chemical properties, the labeled peptides appear closely paired in the LC and MS processes. Relative quantification is achieved by comparing ion signal intensities or peak areas of isotope-encoded peptide pairs observed in the corresponding mass spectra. Methods in current use for the introduction of mass tags to proteins and peptides include isotope-coded affinity tags (ICATs), stable isotope labeling by amino acids in cell culture (SILAC), and isobaric tag for relative and absolute quantitation (iTRAQ).
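
The relative quantification step described above can be sketched in a few lines, assuming that peak areas have already been extracted for each light/heavy peptide pair. The function name, the input layout, the protein identifiers, and the choice of a median of log2 ratios as the per-protein summary are illustrative assumptions, not part of any published workflow.

```python
import math
from statistics import median


def protein_log2_ratios(peptide_pairs):
    """Summarize heavy/light peak-area ratios per protein.

    peptide_pairs: iterable of (protein_id, light_area, heavy_area).
    Returns {protein_id: median log2(heavy / light)} (hypothetical layout).
    """
    per_protein = {}
    for protein_id, light_area, heavy_area in peptide_pairs:
        # Skip pairs in which one partner of the isotope pair was not detected.
        if light_area <= 0 or heavy_area <= 0:
            continue
        log_ratio = math.log2(heavy_area / light_area)
        per_protein.setdefault(protein_id, []).append(log_ratio)
    # The median damps the effect of a single poorly measured peptide.
    return {pid: median(values) for pid, values in per_protein.items()}


# Invented peak areas for two hypothetical proteins.
pairs = [
    ("P1", 1000.0, 2000.0),
    ("P1", 500.0, 1000.0),
    ("P1", 800.0, 1600.0),
    ("P2", 400.0, 100.0),
    ("P2", 0.0, 50.0),  # light partner missing, so this pair is ignored
]
log2_ratios = protein_log2_ratios(pairs)
```

Working in log2 space keeps a twofold increase and a twofold decrease symmetric around zero, which is why the sketch summarizes log ratios rather than raw ratios.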

The isotope-coded affinity tag (ICAT) technique involves differential labeling of two different protein populations on the side chain of reduced cysteinyl residues using one of two chemically identical but isotopically different ICAT reagents (19). By incorporating a biotin affinity tag into the ICAT reagents, selective isolation and purification of labeled peptides substantially reduces sample complexity. The ICAT approach has been applied successfully to the systematic identification and quantification of proteins contained in the microsomal fraction of cells (20). It has also been applied, in conjunction with DNA microarray analysis, to identify differential expression profiles of hematopoietic progenitor cells (21). A major drawback of the ICAT technique is that it labels only the fraction of proteins containing cysteine residues. An alternative approach, SILAC, involves growing two populations of cells under identical conditions, except that the culture medium for one population contains all 20 standard amino acids in their naturally occurring isotopic forms (“light” population), whereas the other population is grown in medium where one or more amino acids are replaced by stable, heavy isotope-labeled analogs (22). The incorporation of a heavy amino acid into a peptide, which is referred to as metabolic labeling, results in a known mass shift relative to the peptide that contains the light version of the amino acid. The advantage of using metabolic labeling is that it allows mixing of labeled and unlabeled cells before the fractionation and purification steps and therefore avoids the introduction of errors in relative quantification during subsequent sample preparation. Furthermore, all peptides within the sample can be analyzed, not just those containing cysteine residues, increasing confidence in both identification and quantification.

Isobaric tag for relative and absolute quantitation (iTRAQ) is a multiplexed strategy that allows up to four samples to be analyzed simultaneously by MS in the same experiment (23). Peptides are labeled on the free amine groups at the amino terminus and on lysine residues. Unlike other isotopic labeling strategies, the iTRAQ label reagents are designed to provide quantitative information during peptide fragmentation. This technique modifies peptides by linking a mass balance group (carbonyl group) and a reporter group (based on N-methylpiperazine). Designed to be isobaric (having the same mass), the iTRAQ reagents are chromatographically indistinguishable in the LC step, which causes the ion peak for each of the identical labeled peptides to be detected simultaneously by the mass spectrometer. When MS/MS is used for analysis, the mass balancing carbonyl moiety is released as a neutral fragment, which thereby liberates isotope-encoded reporter ions that provide relative quantitative information on protein abundance. Because four different iTRAQ reagents are currently available, comparative analysis of a set of two to four samples is feasible within a single MS run. As is the case with SILAC, all peptides are labeled in iTRAQ experiments. An advantage of this method is that the multiplexed nature of iTRAQ greatly reduces the amount of MS time required to characterize individual samples, which increases instrumentation throughput.
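
As a rough illustration of how reporter-ion intensities translate into relative abundances, the sketch below sums the reporter signal for each of four channels over the MS/MS spectra assigned to one protein and expresses every channel relative to a chosen reference. The channel keys, the function name, and the intensities are assumptions made for the example; real pipelines also apply corrections (for example, for isotopic impurity of the reagents) that are omitted here.

```python
def reporter_ratios(spectra, reference="114"):
    """Express summed reporter-ion intensities relative to a reference channel.

    spectra: list of {channel_label: intensity} dicts, one per MS/MS spectrum
    assigned to the same protein (hypothetical layout).
    """
    totals = {}
    for spectrum in spectra:
        for channel, intensity in spectrum.items():
            totals[channel] = totals.get(channel, 0.0) + intensity
    reference_total = totals.get(reference, 0.0)
    if reference_total <= 0:
        raise ValueError("reference channel has no reporter signal")
    return {channel: total / reference_total for channel, total in totals.items()}


# Two invented spectra from the same hypothetical protein, four samples each.
spectra = [
    {"114": 100.0, "115": 200.0, "116": 50.0, "117": 100.0},
    {"114": 300.0, "115": 600.0, "116": 150.0, "117": 300.0},
]
channel_ratios = reporter_ratios(spectra)
```

Summing intensities before dividing weights each spectrum by its signal strength, so weak spectra contribute less to the ratio than they would in a simple average of per-spectrum ratios.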

Proteomics: reducing complexity

Protein concentrations in biologic systems span as many as 12 orders of magnitude, whereas the most common available mass spectrometry-based methods only allow for the identification of proteins spanning approximately three orders of magnitude in concentration from a given sample. Several methods have been advanced that select for specified fractions of the proteome in order to reduce the complexity of the sample sufficiently to identify biologically interesting proteins. Protein glycosylation, one of the most common posttranslational modifications, is characteristic of secreted proteins and cell-surface markers but is not found on the predominant serum proteins such as albumin. Recent approaches selecting for N-linked or O-linked glycosylated peptides using affinity capture techniques or solid-phase extraction followed by stable isotope labeling enrich for these biologically active proteins (24). Zhang et al. examined glycosylated proteins from several tissues, cells, and plasma and compared the glycoproteins identified in the tissues and cells to those identified in the plasma (25, 26). Significant overlap was observed. This study demonstrates that tissue-derived proteins are indeed present and detectable in the plasma via direct MS analysis of captured glycopeptides, supporting the feasibility of an MS-based approach for plasma protein discovery and analysis. A significant improvement on this technique has been the capture of glycopeptides (rather than glycoproteins) (27).

Multiple reaction monitoring (MRM) is a highly selective, highly sensitive mass spectrometry approach for detecting the presence of particular peptide species in a complex mixture such as plasma (28). A specific tryptic peptide is selected as a stoichiometric representative of the protein from which it is cleaved. This peptide is quantified by MS against a spiked internal standard (a synthetic stable isotope-labeled version of the peptide) to yield a measure of protein concentration. In principle, such an assay requires only knowledge of the masses of the selected peptide and its fragment ions and an ability to make the stable isotope-labeled version. This method can reliably quantify protein concentrations over a dynamic range of 4.5 orders of magnitude in human plasma using a multiplexed approach. MRM assays coupled with enrichment of proteins by immunodepletion and size exclusion chromatography (29), or with enrichment of peptides by antibody capture (30), have also been reported. Stable isotope standards and capture by anti-peptide antibodies (SISCAPA) has been shown to extend the sensitivity of a peptide assay by at least two orders of magnitude and, with additional development, appears capable of extending the MRM method to cover the full known dynamic range of plasma (i.e., to the pg/mL level) (30). In systems approaches to disease, many important bioactive proteins are presumed to be secreted into the blood as key regulators of systems processes. These and other strategies to reduce or overcome the complexity of serum hold great promise for identifying key proteins active in disease states.
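
The arithmetic behind quantification against a spiked internal standard is simple enough to sketch. The example assumes that the endogenous peptide and its isotope-labeled standard give the same signal per unit amount, and that each monitored transition yields one pair of peak areas; the function name, the units, and the numbers are hypothetical.

```python
def mrm_concentration(transitions, standard_conc):
    """Estimate peptide concentration from MRM peak areas.

    transitions: list of (endogenous_area, standard_area) pairs, one per
    monitored fragment-ion transition (hypothetical layout).
    standard_conc: known concentration of the spiked labeled peptide.
    Assumes an identical response for labeled and unlabeled peptide.
    """
    ratios = []
    for endogenous_area, standard_area in transitions:
        if standard_area <= 0:
            raise ValueError("internal standard not detected for a transition")
        ratios.append(endogenous_area / standard_area)
    # Average the per-transition ratios, then scale by the known spike level.
    return standard_conc * sum(ratios) / len(ratios)


# Invented peak areas for three transitions; standard spiked at 10 ng/mL.
transitions = [(1200.0, 2400.0), (600.0, 1200.0), (300.0, 600.0)]
estimated_conc = mrm_concentration(transitions, standard_conc=10.0)
```

With every transition at half the standard's signal, the estimate is half the spike level. In practice, disagreement between transitions is a useful warning that one of them suffers from interference.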

Proteomics: antibody-based arrays

An alternative strategy to global proteome analysis using MS is the use of antibody-based array techniques (31-33). A variety of methods have been developed based on antibody binding, all of which are limited by 1) dependence on the affinity and specificity of the antibodies employed for detection, 2) the relatively high cost of generating monoclonal antibodies, and 3) potential cross-reactivities in complex protein mixtures. Despite these limitations, antibody arrays have the advantage of providing a quantitative and comparative platform for rapid screening of proteomes from different disease states such as lung (34), pancreatic (35), and prostate cancer (36). One emerging approach with tremendous promise is surface plasmon resonance (SPR), which enables real-time, label-free measurement of protein abundance (37). SPR is a physical phenomenon that occurs when electromagnetic waves, such as light, are reflected off a thin metal film at specific incident angles and wavelengths. A fraction of the light energy (either polychromatic, many colors, or monochromatic, one color) interacts with and transfers to the surface plasmons, thus reducing the reflected light intensity at a sharply defined angle or at a specific wavelength. Any modification of the metal surface, such as that which occurs with the interaction between antibody and antigen, affects the SPR condition and can be used to detect and monitor specific molecular interactions. Sensitivity of SPR for low-abundance proteins is estimated to be in the picogram per square centimeter range. Current SPR-based chips have 800 unique antibodies arrayed at approximately 4-µm spatial resolution. Significant advantages of this method are that 1) protein abundance can be monitored in real time, allowing for assessment of binding dynamics, and 2) slides can be regenerated, allowing for cost-effective screens of multiple samples. Another recently developed technique, DNA-encoded antibody libraries (DEALs), is a highly sensitive measurement technique that can detect protein and single-stranded DNA simultaneously on a single chip (38). DNA-encoded antibodies are labeled with single-stranded DNA oligomers. DNA-encoded antibodies and secondary (fluorescently) labeled antibodies are added to the biologic sample containing the protein of interest. The entire complex is then captured by nucleic-acid hybridization onto a spot that was prepatterned with the complementary single-stranded DNA oligomer. This approach has been used for the rapid detection of multiple proteins within a single microfluidic channel with a lower detection limit of 10 fM, which is 150 times more sensitive than the analogous ELISA.

Computational Approaches

A systems approach to disease requires the integration of vast amounts of quantitative biologic data generated by global genomic and proteomic analyses in order to 1) identify comprehensively key molecular components defining disease and healthy states, and 2) determine how these components interact in biologic networks in a predictable way. Initial efforts primarily integrated and analyzed large databases of gene expression data to identify subsets of genes with predictive value for disease stratification and prognosis. A variety of methods have been applied to disease diagnoses, including approaches based on support vector machines and relative expression reversals, among many others. Application of these methods has led to the discovery of molecular classifiers of varying degrees of accuracy to identify prognostic signatures for breast cancer, ovarian cancer, colon cancer, prostate cancer, and brain cancer (39, 40).
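
One simple form of a classifier based on relative expression reversals looks for a pair of genes whose expression ordering flips between two classes, an idea often called a top-scoring pair. The sketch below is a toy version under stated assumptions: the gene names and expression values are invented, both classes are assumed to be non-empty, and the scoring rule is a minimal reading of the idea rather than the exact procedure of the cited studies.

```python
from itertools import combinations


def top_scoring_pair(samples, labels):
    """Find the gene pair whose ordering differs most between classes 0 and 1.

    samples: list of {gene: expression} dicts; labels: list of 0/1 labels.
    Returns ((gene_a, gene_b, a_below_b_means_class1), score).
    """
    genes = sorted(samples[0])
    best_pair, best_score = None, -1.0
    for gene_a, gene_b in combinations(genes, 2):
        freq = []
        for cls in (0, 1):
            members = [s for s, y in zip(samples, labels) if y == cls]
            # Fraction of this class in which gene_a is expressed below gene_b.
            freq.append(sum(s[gene_a] < s[gene_b] for s in members) / len(members))
        score = abs(freq[0] - freq[1])
        if score > best_score:
            best_pair, best_score = (gene_a, gene_b, freq[1] > freq[0]), score
    return best_pair, best_score


def predict(sample, pair):
    """Classify one sample from the ordering of the selected gene pair."""
    gene_a, gene_b, a_below_b_means_class1 = pair
    return int((sample[gene_a] < sample[gene_b]) == a_below_b_means_class1)


# Invented expression values for three hypothetical genes.
samples = [
    {"g1": 5, "g2": 2, "g3": 3},
    {"g1": 6, "g2": 1, "g3": 7},
    {"g1": 1, "g2": 4, "g3": 2},
    {"g1": 2, "g2": 6, "g3": 9},
]
labels = [0, 0, 1, 1]
pair, score = top_scoring_pair(samples, labels)
new_label = predict({"g1": 3, "g2": 8, "g3": 1}, pair)
```

Because the rule depends only on which of two genes is higher within a sample, it is unaffected by any normalization that preserves rank order within that sample, which is part of the appeal of this family of classifiers.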

With the development of protein-protein and protein-DNA interaction databases, gene expression data can be mapped onto interaction networks to identify relevant biologic pathways active in specific disease states or experimental perturbations. This type of approach is useful for assigning disease-specific relevance to differentially expressed genes or molecular pathways and has been applied in several human diseases, most notably cancer. Although these interaction networks are very useful tools for visualizing large data sets, they are not computable, predictive network models, which are those that hold the most promise for predictive medicine and drug development. One approach uses an integrated framework, Pointillist (41), for combining diverse data sets and inferential algorithms to generate model networks, which are incorporated into Cytoscape (42) for visualization and simulation. The integration methodology of Pointillist can handle data of different types and sizes (e.g., interactions, protein expression, or gene expression) to create a higher confidence interaction network than that resulting from a single data set alone. A novel aspect of this methodology is that it does not require a “gold standard” set of data to be used for training nor does it make assumptions about the underlying statistical distributions in the form of parametric models. This process involves designing an efficient deterministic optimization algorithm to minimize the numbers of misses and false positives via an iterative element by element procedure. The methodology is general purpose so that it can be applied to integrate data from any existing and future technologies (43).
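
Mapping expression data onto an interaction network can be illustrated with a deliberately small example: keep the genes whose expression change exceeds a threshold, then report the groups of such genes that are directly connected in the interaction network. This is not the Pointillist method and does not use the Cytoscape API; the gene names, interactions, fold changes, and threshold are all invented for illustration.

```python
from collections import deque


def active_modules(interactions, log2_fold_changes, threshold=1.0):
    """Group differentially expressed genes into connected network modules.

    interactions: iterable of (gene_a, gene_b) undirected edges.
    log2_fold_changes: {gene: log2 fold change between two states}.
    Returns a list of gene sets, largest module first.
    """
    active = {g for g, fc in log2_fold_changes.items() if abs(fc) >= threshold}
    neighbors = {g: set() for g in active}
    for gene_a, gene_b in interactions:
        # Keep only edges whose two endpoints are both differentially expressed.
        if gene_a in active and gene_b in active:
            neighbors[gene_a].add(gene_b)
            neighbors[gene_b].add(gene_a)

    modules, seen = [], set()
    for start in sorted(active):
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        modules.append(component)
    return sorted(modules, key=len, reverse=True)


# Hypothetical interaction network and expression changes.
interactions = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")]
changes = {"A": 2.0, "B": -1.5, "C": 0.2, "D": 1.8, "E": 1.1, "F": -1.2}
modules = active_modules(interactions, changes)
```

As the text notes, a result of this kind is a way of organizing and visualizing data; it is not a computable, predictive model of network behavior.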

Other approaches infer genetic regulatory networks from expression data (both steady-state and time-course) using parsimonious linear regression models, probabilistic Boolean networks, or Bayesian networks (44). Probabilistic Boolean networks are robust in the face of biologic and measurement uncertainty and offer the ability to characterize and simulate global network dynamics using the inferred model structure (45). They also provide a natural way to determine the influences of particular genes on the global network behavior (46). Thus, the model can be used to predict the effects of perturbations on network dynamics, which is an important goal for understanding disease development and treatment response. These and other predictive models stemming from mathematical descriptions of biochemical reaction networks and statistical influence models are critical for identifying disease-perturbed networks. Dynamic and predictive network models have been developed for important signaling networks in diseases such as cancer. Such approaches are now used to predict responses to network perturbations in mammalian systems using the Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNe) (47). These and other computational modeling approaches will play a key role in identifying disease-perturbed networks and potential therapeutic targets. With the development of comprehensive databases, hypothesis-driven global analysis methods, and predictive network models, systems biology has matured into a predictive science that is capable of generating testable hypotheses based on network models with predictable behaviors. Applying this systems approach to disease holds great promise for the development of new therapies.
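To make the idea of a probabilistic Boolean network concrete, the following Python sketch propagates an exact probability distribution over network states and compares the unperturbed network with one in which a gene is held at a fixed value. It is a toy under stated assumptions and is not taken from the cited work: the three genes (A, B, C), their candidate update rules, the rule probabilities, and the helper names are all hypothetical. Each gene has one or more candidate Boolean rules, and at every synchronous update one rule per gene is selected independently with the stated probability.

```python
# State is a tuple (A, B, C) of 0/1 values. All rules below are hypothetical.
PREDICTORS = [
    [(1.0, lambda s: s[0])],                 # A keeps its value
    [(0.7, lambda s: s[0]),                  # B copies A, or
     (0.3, lambda s: s[0] & (1 - s[2]))],    # B = A AND NOT C
    [(1.0, lambda s: s[1])],                 # C copies B
]


def step_distribution(state, predictors):
    """Exact distribution over next states after one synchronous update."""
    dist = {(): 1.0}
    for gene_predictors in predictors:
        extended = {}
        for partial, p in dist.items():
            for prob, rule in gene_predictors:
                nxt = partial + (rule(state),)
                extended[nxt] = extended.get(nxt, 0.0) + p * prob
        dist = extended
    return dist


def evolve(initial, predictors, steps, clamp=None):
    """Propagate the state distribution for a number of updates.

    clamp: optional {gene_index: value} modelling a sustained
    intervention that holds a gene at a fixed value.
    """
    clamp = clamp or {}

    def apply_clamp(state):
        return tuple(clamp.get(i, v) for i, v in enumerate(state))

    dist = {apply_clamp(initial): 1.0}
    for _ in range(steps):
        updated = {}
        for state, p in dist.items():
            for nxt, q in step_distribution(state, predictors).items():
                nxt = apply_clamp(nxt)
                updated[nxt] = updated.get(nxt, 0.0) + p * q
        dist = updated
    return dist


def probability_on(dist, gene):
    """Probability that the given gene is ON under a state distribution."""
    return sum(p for state, p in dist.items() if state[gene] == 1)


UNPERTURBED = evolve((1, 0, 0), PREDICTORS, steps=3)
A_KNOCKED_OUT = evolve((1, 0, 0), PREDICTORS, steps=3, clamp={0: 0})
```

Comparing `probability_on(UNPERTURBED, 2)` with `probability_on(A_KNOCKED_OUT, 2)` shows how holding gene A off changes the probability that gene C is active, which is the kind of perturbation question the text describes. Enumerating the distribution exactly is feasible only for very small networks; sampling trajectories would be the usual choice for larger ones.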

P4 Medicine: Personalized, Predictive, Preventive, and Participatory

Health care will be transformed over the next decade from our current reactive medicine to a new medical practice that is predictive, personalized, preventive, and participatory (P4). Predictive medicine will have two major components: 1) individual genome sequences will be analyzed to generate probabilistic future health histories for each individual; and 2) perhaps 2500 proteins will be analyzed from a droplet of blood for each individual, perhaps twice a year. These measurements reflect the blood molecular fingerprints derived from the 50 or so human organs and major cell types, and the fingerprints will constitute a status report for each organ, distinguishing health from disease and, if disease, which disease. Personalized medicine will reflect the fact that each human differs from every other by approximately 6 million DNA bases and has unique predispositions to differing combinations of diseases. The genome analyses and blood molecular fingerprints will permit each individual to be assessed individually and to be treated individually. Preventive medicine will emerge from the realization that a systems approach to disease will bring deep insights into the disease-perturbed networks and from the fact that drugs can eventually be used to reengineer network behavior. Thus, the strategy for choosing drug targets will change in a fundamental manner. In a similar vein, once an individual’s DNA has predicted a high likelihood of, say, brain cancer at the age of 50 or older, one could design drugs that, if taken 10 or 20 years before onset, prevent the relevant brain networks from ever becoming perturbed; this would be preventive medicine. Finally, participatory medicine originates from the fact that, given more information, patients will be able to participate more fully in choosing their own health trajectories. This participation will also require the education of physicians as to the nature of P4 medicine. This transformation, if focused and properly leveraged, can both immensely improve the lives of people and substantially reduce the growing burden of health-care costs in modern countries.

The transformation that is currently on the horizon, from our current reactive medicine to a medicine that is predictive, personalized, preventive, and participatory, will impact much more than medical science. It will affect national economies, social policy, and the spectrum of business relationships, opportunities, and constraints. The science that is driving this change has not yet impacted medical practice, but it has been building for the past 20 years, beginning with the Human Genome Project. In coming decades the health-care system will not only generate billions of bits of data for each individual patient but must also learn how to use these data effectively, both for the individual and for the collective knowledge of medical susceptibilities and responses to therapies that will be enabled. This transformation will be driven by new systems strategies for studying disease, powerful new measurement technologies (e.g., nanotechnology), and revolutionary new computational and mathematical tools for dealing with the enormous amounts of information that will be gathered and for converting it into hypotheses about health and disease.

References

1. Fleischmann RD, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 1995; 269:496-512.

2. Blattner FR, et al. The complete genome sequence of Escherichia coli K-12. Science 1997; 277:1453-1474.

3. Gardner MJ, et al. Sequence of Plasmodium falciparum chromosomes 2, 10, 11 and 14. Nature 2002; 419:531-534.

4. Hall N, et al. Sequence of Plasmodium falciparum chromosomes 1, 3-9 and 13. Nature 2002; 419:527-531.

5. Goffeau A, et al. Multidrug-resistant transport proteins in yeast: complete inventory and phylogenetic characterization of yeast open reading frames with the major facilitator superfamily. Yeast 1997; 13:43-54.

6. Lander ES, et al. Initial sequencing and analysis of the human genome. Nature 2001; 409:860-921.

7. Waterston RH, et al. Initial sequencing and comparative analysis of the mouse genome. Nature 2002; 420:520-562.

8. Mikkelsen T. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 2005; 437:69-87.

9. Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. U.S.A. 1977; 74:5463-5467.

10. Madabhushi RS. Separation of 4-color DNA sequencing extension products in noncovalently coated capillaries using low viscosity polymer solutions. Electrophoresis 1998; 19:224-230.

11. Prober JM, et al. A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides. Science 1987; 238:336-341.

12. Hall N. Advanced sequencing technologies and their wider impact in microbiology. J. Exp. Biol. 2007; 210:1518-1525.

13. Margulies M, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005; 437:376-380.

14. Brenner S, et al. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat. Biotechnol. 2000; 18:630-634.

15. Patterson SD, Aebersold RH. Proteomics: the first decade and beyond. Nat. Genet. 2003; 33:311-323.

16. Link AJ, et al. Direct analysis of protein complexes using mass spectrometry. Nat. Biotechnol. 1999; 17:676-682.

17. Sadygov RG, et al. Code developments to improve the efficiency of automated MS/MS spectra interpretation. J. Proteome Res. 2002; 1:211-215.

18. Wolters DA, Washburn MP, Yates JR 3rd. An automated multidimensional protein identification technology for shotgun proteomics. Anal. Chem. 2001; 73:5683-5690.

19. Gygi SP, et al. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat. Biotechnol. 1999; 17:994-999.

20. Han DK, et al. Quantitative profiling of differentiation-induced microsomal proteins using isotope-coded affinity tags and mass spectrometry. Nat. Biotechnol. 2001; 19:946-951.

21. Tian Q, et al. Integrated genomic and proteomic analyses of gene expression in mammalian cells. Mol. Cell. Proteomics 2004; 3:960-969.

22. Ong SE, et al. Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol. Cell. Proteomics 2002; 1:376-386.

23. Ross PL, et al. Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Mol. Cell. Proteomics 2004; 3:1154-1169.

24. Zhang H, et al. Identification and quantification of N-linked glycoproteins using hydrazide chemistry, stable isotope labeling and mass spectrometry. Nat. Biotechnol. 2003; 21:660-666.

25. Zhang H, et al. Mass spectrometric detection of tissue proteins in plasma. Mol. Cell. Proteomics 2007; 6:64-71.

26. Zhang H, et al. High throughput quantitative analysis of serum proteins using glycopeptide capture and liquid chromatography mass spectrometry. Mol. Cell. Proteomics 2005; 4:144-155.

27. Sun B, et al. Shotgun glycopeptide capture approach coupled with mass spectrometry for comprehensive glycoproteomics. Mol. Cell. Proteomics 2007; 6:141-149.

28. Anderson L, Hunter CL. Quantitative mass spectrometric multiple reaction monitoring assays for major plasma proteins. Mol. Cell. Proteomics 2006; 5:573-588.

29. Liao H, et al. Use of mass spectrometry to identify protein biomarkers of disease severity in the synovial fluid and serum of patients with rheumatoid arthritis. Arthritis Rheum. 2004; 50:3792-3803.

30. Anderson NL, et al. Mass spectrometric quantitation of peptides and proteins using Stable Isotope Standards and Capture by Anti-Peptide Antibodies (SISCAPA). J. Proteome Res. 2004; 3:235-244.

31. Haab BB, Dunham MJ, Brown PO. Protein microarrays for highly parallel detection and quantitation of specific proteins and antibodies in complex solutions. Genome Biol. 2001; 2(2).

32. Huang RP, et al. Simultaneous detection of multiple cytokines from conditioned media and patient’s sera by an antibody-based protein array system. Anal. Biochem. 2001; 294:55-62.

33. Olle EW, et al. Development of an internally controlled antibody microarray. Mol. Cell. Proteomics 2005; 4:1664-1672.

34. Gao WM, et al. Distinctive serum protein profiles involving abundant proteins in lung cancer patients based upon antibody microarray analysis. BMC Cancer 2005; 5:110.

35. Orchekowski R, et al. Antibody microarray profiling reveals individual and combined serum proteins associated with pancreatic cancer. Cancer Res. 2005; 65:11193-11202.

36. Miller JC, et al. Antibody microarray profiling of human prostate cancer sera: antibody screening and identification of potential biomarkers. Proteomics 2003; 3:56-63.

37. Lyon LA, Musick MD, Natan MJ. Colloidal Au-enhanced surface plasmon resonance immunosensing. Anal. Chem. 1998; 70:5177-5183.

38. Bailey RC, et al. DNA-encoded antibody libraries: a unified platform for multiplexed cell sorting and detection of genes and proteins. J. Am. Chem. Soc. 2007; 129:1959-1967.

39. Mischel PS, Cloughesy TF, Nelson SF. DNA-microarray analysis of brain cancer: molecular classification for therapy. Nat. Rev. Neurosci. 2004; 5:782-792.

40. Sotiriou C, Piccart MJ. Taking gene-expression profiling to the clinic: when will molecular signatures become relevant to patient care? Nat. Rev. Cancer 2007; 7:545-553.

41. Hwang D, et al. A data integration methodology for systems biology. Proc. Natl. Acad. Sci. U.S.A. 2005; 102:17296-17301.

42. Shannon P, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003; 13:2498-2504.

43. Hwang D, et al. A data integration methodology for systems biology: experimental verification. Proc. Natl. Acad. Sci. U.S.A. 2005; 102:17302-17307.

44. Price ND, Shmulevich I. Biochemical and statistical network models for systems biology. Curr. Opin. Biotechnol. 2007.

45. Shmulevich I, et al. Probabilistic Boolean Networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics 2002; 18:261-274.

46. Shmulevich I, Dougherty ER, Zhang W. Gene perturbation and intervention in probabilistic Boolean networks. Bioinformatics 2002; 18:1319-1331.

47. Basso K, et al. Reverse engineering of regulatory networks in human B cells. Nat. Genet. 2005; 37:382-390.

Further Reading

Hood L, Heath JR, Phelps ME, Lin B. Systems biology and new technologies enable predictive and preventative medicine. Science 2004; 306:640-643.

Ideker T, Galitski T, Hood L. A new approach to decoding life: systems biology. Annu. Rev. Genomics Hum. Genet. 2001; 2:343-372.

Weston AD, Hood L. Systems biology, proteomics, and the future of health care: toward predictive, preventative, and personalized medicine. J. Proteome Res. 2004; 3:179-196.

See Also

Functional Genomics

Informatics for Proteins

Proteomics

Systems Biology