Cellular Organization of Metabolism
Peter D. Karp, SRI International, Menlo Park, California
The field of pathway bioinformatics is concerned with the representation and the manipulation of metabolic information within computers. By organizing genome information into pathways, pathway bioinformatics places genes and their products into a mechanistic framework. This article describes how metabolic pathways are represented in a computer, and it describes the BioCyc (SRI International, Menlo Park, CA) collection of pathway/genome databases for several hundred organisms. Each BioCyc database describes the genome and the metabolic network of a single organism. This article describes computational algorithms for computing with pathway data. Pathway visualization algorithms help scientists comprehend this complex information space and facilitate analysis of large-scale omics datasets. Pathway analysis algorithms predict the metabolic network of an organism from its genome, identify the genes coding for missing enzymes in metabolic pathways, and enable the comparison of metabolic networks from multiple organisms. They also allow the prediction of the metabolic capabilities of an organism and identify potential drug targets within the metabolic network.
The field of pathway bioinformatics is concerned with a range of problems related to the representation and the manipulation of metabolic information within computers. How do we capture our knowledge accurately about metabolic pathways and enzymes within the computer? How do we construct databases of metabolic information? How do we predict the metabolic pathways of an organism from its sequenced genome, and how do we compare the metabolic networks of two organisms?
The bioinformatics subfield of pathway bioinformatics is concerned with developing computer representations of the metabolic network of an organism, with developing databases of metabolic information, and with developing algorithms for computing with metabolic information. This article will discuss the approaches for each of these problems in the Pathway Tools and BioCyc (SRI International, Menlo Park, CA) projects. Pathway Tools is a set of algorithms for computational analysis of metabolic data, and it includes computer representations of the metabolic network (1). BioCyc is a collection of Pathway/Genome Databases (PGDBs) for several hundred organisms (2). The BioCyc databases were constructed using Pathway Tools, and they can be queried and analyzed using Pathway Tools.
Together, Pathway Tools and BioCyc permit extremely fast and accurate modeling of the metabolic network of an organism from its genome sequence. Previously, hundreds of person-years of laboratory work were required to characterize an organism’s metabolic map. Now, given an annotated genome sequence for the organism, its metabolic network can be predicted computationally within a few days. Manual review of that computational prediction will yield a more accurate result within a few weeks.
Computational reconstructions of metabolism are not perfectly accurate; thus, for increased model fidelity, we recommend following the computational reconstruction with manual curation of the metabolic network. A manual curation effort surveys the past biochemical literature for an organism, tracks newly emerging literature on an ongoing basis, and updates the metabolic model within a PGDB to reflect those findings.
Representing Metabolic Knowledge in PGDBs
Metabolites, reactions, and pathways
One might choose to represent the metabolic network in a computer in two alternative ways: by listing all metabolic reactions that occur in the cell or by partitioning that reaction set into a carefully delineated set of metabolic pathways that describe small, functionally linked subsets of reactions. Which approach is preferred? Both approaches have value, and they are not mutually exclusive; therefore, Pathway Tools supports both views of metabolism in a PGDB.
Pathway Tools conceptualizes the metabolic network in three layers. The first layer consists of the small-molecule substrates on which the metabolism operates. The second layer consists of the reactions that interconvert the small-molecule metabolites. The third layer is the metabolic pathways in which the components are the metabolic reactions of the second layer. Note that not all reactions in the second layer are included in pathways in the third layer because some metabolic reactions are not assigned to any metabolic pathway.
Scientists who choose to view the metabolic network solely as a reaction list can operate on the second layer directly without interference from the third layer. But for a scientist for whom the pathway definitions are important, the pathway layer is available.
The pathways in PGDBs are modules of the metabolic network of a single organism. Often, they are conserved across many species. These pathways are regulated as a unit (based on substrate-level regulation of enzymes, on regulation of gene expression, and on other types of regulation), and their boundaries are defined at high-connectivity, stable metabolites (3). PGDB pathways are defined based on pathways published in the experimental literature.
More precisely, the compounds, reactions, and pathways in levels 1-3 are each represented as distinct database objects within a PGDB. That is, separate PGDB objects encode each metabolite, each metabolic reaction, and each metabolic pathway.
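The three-layer organization can be illustrated with a minimal sketch. It is not the Pathway Tools schema: the class names, the field names, and the IDs HYPOTHETICAL-PRODUCT and RXN-ORPHAN are assumptions made for illustration, whereas TRP, RXN0-2382, and TRYPSYN-PWY are the IDs used in this article. The sketch shows that each metabolite, reaction, and pathway is a separate object and that a reaction may belong to no pathway.

```python
from dataclasses import dataclass


@dataclass
class Metabolite:          # layer 1
    uid: str
    name: str


@dataclass
class Reaction:            # layer 2
    uid: str
    left: list[str]        # IDs of reactant metabolites
    right: list[str]       # IDs of product metabolites


@dataclass
class Pathway:             # layer 3
    uid: str
    reaction_list: list[str]   # IDs of component reactions


metabolites = {
    "TRP": Metabolite("TRP", "tryptophan"),
    # Hypothetical placeholder; the article does not list the products.
    "HYPOTHETICAL-PRODUCT": Metabolite("HYPOTHETICAL-PRODUCT", "placeholder"),
}
reactions = {
    "RXN0-2382": Reaction("RXN0-2382", ["TRP"], ["HYPOTHETICAL-PRODUCT"]),
    # Hypothetical reaction that is not assigned to any pathway.
    "RXN-ORPHAN": Reaction("RXN-ORPHAN", ["HYPOTHETICAL-PRODUCT"], ["TRP"]),
}
pathways = {
    "TRYPSYN-PWY": Pathway("TRYPSYN-PWY", ["RXN0-2382"]),
}


def reactions_without_pathway(reactions, pathways):
    """Return IDs of layer-2 reactions that no layer-3 pathway contains."""
    assigned = {rid for pwy in pathways.values() for rid in pwy.reaction_list}
    return sorted(rid for rid in reactions if rid not in assigned)


unassigned = reactions_without_pathway(reactions, pathways)
```

A scientist who prefers the reaction-list view can work with the reactions dictionary alone; the pathways dictionary adds the third layer without altering the second.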
The proteome and the genome
The PGDB definitions of metabolism provided thus far are independent of the proteins that catalyze that metabolism and of the genome that encodes those proteins. The following section describes how PGDBs define the proteome and the genome of an organism.
The proteome of the organism is described as a set of PGDB objects: one for each gene product in the organism and one for each multimeric protein formed by aggregation of those gene products. Furthermore, every chemically modified form of a monomer or of a multimer is encoded by a distinct PGDB object. Each protein object is in turn linked, through a field in the object, to the metabolic reactions that it catalyzes. Proteins can also be substrates of reactions. Additional PGDB objects define features on proteins, such as phosphorylation sites, enzyme active sites, signal sequences, and metal ion binding sites.
Protein objects are also linked to gene objects that define the gene that encodes each protein. Each gene in the genome is defined by a distinct PGDB object, as is every replicon (chromosome or plasmid) in the genome. Genes are linked to the replicon on which they reside. In addition, other features on the genome, such as operons, promoters, and transcription-factor binding sites, are described by PGDB objects.
Database relationships and attributes
The previous two subsections described the important types of objects in a PGDB. Here, we describe how these objects are linked together by biologically meaningful database relationships. PGDB relationships knit together the objects in a PGDB by defining how these objects are interrelated. For example, user queries can follow the relationship from a gene to the protein that it codes for, from a protein to a reaction that it catalyzes, and from a reaction to a metabolic pathway in which it is a component, to answer questions such as “find all metabolic pathways in which the products of a gene play a role.”
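The example query above can be sketched as a chain of lookups. The dictionaries below are illustrative stand-ins for PGDB relationships, not the actual Pathway Tools field names. The IDs are those used in the figures of this article, and the assumption that the complex CPLX0-2401 catalyzes RXN0-2382 is made only for the sake of the example.

```python
# Each dictionary stands in for one PGDB relationship (names are hypothetical).
gene_product = {"EG11025": ["TRYPSYN-BPROTEIN"]}        # gene -> monomer
component_of = {"TRYPSYN-BPROTEIN": ["CPLX0-2401"]}     # monomer -> complex
catalyzes = {"CPLX0-2401": ["RXN0-2382"]}               # assumed for illustration
in_pathway = {"RXN0-2382": ["TRYPSYN-PWY"]}             # reaction -> pathway


def pathways_for_gene(gene):
    """Find all pathways in which the products of a gene play a role."""
    to_visit = list(gene_product.get(gene, []))
    seen = set()
    pathways = set()
    while to_visit:
        protein = to_visit.pop()
        if protein in seen:
            continue
        seen.add(protein)
        # A monomer may also act as part of one or more complexes.
        to_visit.extend(component_of.get(protein, []))
        for reaction in catalyzes.get(protein, []):
            pathways.update(in_pathway.get(reaction, []))
    return sorted(pathways)


result = pathways_for_gene("EG11025")
```

The traversal visits both the monomer and every complex that contains it, because either form may be the catalytically active one.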
Every PGDB object has a stable unique identifier (ID), that is, a symbol that identifies that object uniquely within the PGDB. Example unique IDs include TRP (an identifier for a metabolite), RXN0-2382 (an identifier for a reaction), and PWY0-1280 (an identifier for a pathway). Relationships within a PGDB are implemented using IDs. For example, to state that the TRP (tryptophan) object is a reactant in the reaction RXN0-2382, a field of RXN0-2382 called LEFT (meaning “reactants”) contains the value TRP. Many PGDB relationships exist in both forward and backward directions; for example, the TRP object contains a field called APPEARS-IN-LEFT-SIDE-OF that lists all reactions in which TRP is a reactant. The fields LEFT and APPEARS-IN-LEFT-SIDE-OF are called inverses.
Figure 1 shows the relationships that link together levels 1-3 of the metabolic network representation. Note how inverse fields allow the user to query relationships in any direction within a PGDB; for example, given a metabolite we can query for the pathways in which it is involved, and given a pathway we can query for its metabolites.
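The behavior of inverse fields can be sketched with a small in-memory store. This is an illustration under stated assumptions rather than the storage layer of Pathway Tools: the class FrameStore and its methods are hypothetical, and only the field names LEFT, APPEARS-IN-LEFT-SIDE-OF, REACTION-LIST, and IN-PATHWAY come from this article.

```python
from collections import defaultdict

# Pairs of inverse fields, registered in both directions.
INVERSES = {"LEFT": "APPEARS-IN-LEFT-SIDE-OF", "REACTION-LIST": "IN-PATHWAY"}
INVERSES.update({v: k for k, v in list(INVERSES.items())})


class FrameStore:
    """Hypothetical object store that keeps inverse fields in sync."""

    def __init__(self):
        self.slots = defaultdict(lambda: defaultdict(list))

    def _add(self, frame, slot, value):
        values = self.slots[frame][slot]
        if value not in values:
            values.append(value)

    def add_value(self, frame, slot, value):
        """Store the value and record the backward link on the target object."""
        self._add(frame, slot, value)
        inverse = INVERSES.get(slot)
        if inverse is not None:
            self._add(value, inverse, frame)

    def get(self, frame, slot):
        return list(self.slots[frame][slot])


def pathways_of_metabolite(db, metabolite):
    """Follow inverse links from a metabolite (as reactant) to its pathways."""
    pathways = []
    for reaction in db.get(metabolite, "APPEARS-IN-LEFT-SIDE-OF"):
        for pathway in db.get(reaction, "IN-PATHWAY"):
            if pathway not in pathways:
                pathways.append(pathway)
    return pathways


db = FrameStore()
db.add_value("RXN0-2382", "LEFT", "TRP")
db.add_value("TRYPSYN-PWY", "REACTION-LIST", "RXN0-2382")
```

Only the forward assertions are made explicitly; the backward links appear as a side effect, which is what allows a query to start from either end. For brevity the sketch follows reactant links only; a fuller version would treat the product side the same way.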
Figure 2 shows the relationships that link together the genome and the proteome. For clarity, many related objects are omitted from Figs. 1 and 2, such as the many other Escherichia coli genes that are components of its chromosome, the other reactions that are components of TRYPSYN-PWY, and the reactants and products of RXN0-2382.
The representations in Figs. 1 and 2 must be connected because enzymes in the proteome catalyze reactions in the reactome. Thus, PGDBs contain relationships that link enzymes with the reactions they catalyze. However, these relationships are indirect, passing first through an intermediary object called an enzymatic reaction, as shown in Fig. 3. This arrangement allows us to capture the many-to-many relationship that exists between enzymes and reactions—one reaction can be catalyzed by multiple enzymes, and multifunctional enzymes catalyze multiple reactions. The purpose of the enzymatic reaction is to encode information that is specific to the pairing of the enzyme with the reaction, such as cofactors, activators, and inhibitors. Consider a bifunctional enzyme with two active sites, in which one of the active sites is inhibited by pyruvate, and the second active site is inhibited by lactate. We would represent this situation with two enzymatic reactions that link the enzyme to the two reactions it catalyzes, and each enzymatic reaction would specify a different inhibitor.
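The bifunctional-enzyme example can be written down as a minimal sketch. The class EnzymaticReaction and the IDs ENZ-BIFUNCTIONAL, ENZ-ISOZYME, RXN-A, and RXN-B are hypothetical names chosen for illustration. The point is that regulator information lives on the pairing of enzyme and reaction, not on either object alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnzymaticReaction:
    """Intermediary object that pairs one enzyme with one reaction."""
    enzyme: str
    reaction: str
    inhibitors: tuple[str, ...] = ()
    activators: tuple[str, ...] = ()
    cofactors: tuple[str, ...] = ()


enzymatic_reactions = [
    # One enzyme, two active sites, a different inhibitor for each reaction.
    EnzymaticReaction("ENZ-BIFUNCTIONAL", "RXN-A", inhibitors=("pyruvate",)),
    EnzymaticReaction("ENZ-BIFUNCTIONAL", "RXN-B", inhibitors=("lactate",)),
    # A second enzyme that catalyzes the same reaction (many-to-many).
    EnzymaticReaction("ENZ-ISOZYME", "RXN-A"),
]


def reactions_of(enzyme):
    return [er.reaction for er in enzymatic_reactions if er.enzyme == enzyme]


def enzymes_of(reaction):
    return [er.enzyme for er in enzymatic_reactions if er.reaction == reaction]


def inhibitors_of(enzyme, reaction):
    for er in enzymatic_reactions:
        if er.enzyme == enzyme and er.reaction == reaction:
            return er.inhibitors
    return ()
```

Had the inhibitors been stored on the enzyme object, it would be impossible to tell which of its two reactions pyruvate affects.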
Figure 1. Relationships that link levels 1-3 of the metabolic network. The metabolite tryptophan (ID TRP) is a reactant of the reaction whose ID is RXN0-2382, which in turn is a member of the pathway whose ID is TRYPSYN-PWY. The field IN-PATHWAY is the inverse of the field REACTION-LIST.
Figure 2. Relationships that link the genome and proteome of a PGDB. The E. coli chromosome contains thousands of genes, one of which is EG11025 (trpB). Its product is the TrpB protein, whose ID is TRYPSYN-BPROTEIN. That monomer forms a homomultimer represented by the object CPLX0-2401.
Figure 3. Relationships that link the genome and proteome of a PGDB to the metabolic network.
The preceding conceptual structure underlies all PGDBs created by Pathway Tools. Those PGDBs fall into several categories. The BioCyc collection of PGDBs is a collaboration between the Bioinformatics Research Group at SRI International and the Computational Genomics Group at the European Bioinformatics Institute (2). In addition, many other PGDBs have been created by other users of Pathway Tools. Some are listed in a table on the BioCyc home page (http://BioCyc.org). These PGDBs can be accessed through the web sites operated by their creators, and in some cases they are available through the BioCyc web site. In addition, some PGDBs can be downloaded for local use within Pathway Tools through SRI’s online registry of PGDBs (http://BioCyc.org/registry.html).
The overall framework of BioCyc is to define a single foundational database of experimentally elucidated pathways from many organisms (MetaCyc; SRI International, Menlo Park, CA) that is used to predict the metabolic pathways of other organisms from their sequenced genomes. Each prediction is modeled as a single organism-specific PGDB. Thus, the BioCyc organism-specific PGDBs each model the metabolism of a single organism in detail, whereas MetaCyc captures well-defined pathways from many organisms but does not define a comprehensive model of the pathways of any organism [except for E. coli, because MetaCyc contains all metabolic pathways from the EcoCyc (SRI International, Menlo Park, CA) PGDB].
BioCyc is divided into three tiers that reflect the degree of manual curation of these databases. The Tier 1 PGDBs EcoCyc and MetaCyc have each undergone more than two person-decades of curation. By curation, we mean effort on the part of biologists to read the biomedical literature and to enter information from publications into these PGDBs.
Tier 1: EcoCyc
EcoCyc (4) describes the genome, the metabolic pathways, and the transcriptional regulatory network of E. coli K-12. EcoCyc curators enter newly discovered functions of E. coli genes into EcoCyc, as reported in the literature. They also enter E. coli metabolic pathways and information about E. coli operon organization, promoter locations, and control of those promoters by binding of transcription factors to nearby DNA sites. EcoCyc contains a written summary of the function of every E. coli gene for which experimental information is available. The information in EcoCyc was obtained from the more than 14,000 publications cited by EcoCyc.
Tier 1: MetaCyc
MetaCyc (5) is a multiorganism encyclopedia of metabolic pathways and enzymes. Like EcoCyc, it contains literature-derived information on experimentally elucidated metabolic pathways and enzymes. MetaCyc version 10.5 (October 2006) contains all 197 metabolic pathways from EcoCyc and all EcoCyc metabolic enzymes. MetaCyc includes another 600 metabolic pathways from other organisms. Approximately half of the pathways in MetaCyc are from microorganisms, and approximately one-third are from plants, with the remainder coming largely from animals. The metabolic pathways in MetaCyc were elucidated experimentally in more than 700 organisms, and the information in MetaCyc has been drawn from more than 10,000 publications. MetaCyc contains extensive mini-review summaries and literature citations in its pathway and enzyme entries. These summaries explain the biological functions of pathways and enzymes and make this information accessible to scientists who are not experts in each pathway and enzyme.
The Tier 2 and Tier 3 PGDBs were derived computationally by applying the following sequence of operations to the annotated genome of each organism, as described in more detail in the next section and in Reference 2.
1. The annotated genome of each organism was converted to PGDB format.
2. The PathoLogic program predicted the metabolic pathway complement of each organism.
3. The PHFiller program predicted which genes within the organism will code for missing enzymes within the predicted metabolic pathways.
4. An operon predictor was executed for the bacterial genomes.
The Tier 2 PGDBs were created computationally by the preceding methodology, and then some amount of manual curation was applied to these PGDBs. For example, after being created computationally, the HumanCyc (SRI International, Menlo Park, CA) PGDB (6) received extensive curation to assign human metabolic enzymes manually to their associated reactions; to enter 10 metabolic pathways and their enzymes from the literature into HumanCyc; and to enter associated summaries, literature citations, and other information such as enzyme regulators, cofactors, and subunit structure.
The Tier 3 PGDBs were created computationally by the preceding methodology, with no subsequent manual curation.
We encourage scientists to adopt Tier 2 and Tier 3 PGDBs for ongoing curation and refinement. No single group can curate all the world’s genomes, so we encourage experts of the biology of an organism to assume responsibility for updating its PGDB to reflect existing and emerging information in the literature, on an ongoing basis.
Computing with the Metabolism of a PGDB
Once the metabolic network of an organism has been encoded using the preceding representation, many types of computational analyses are enabled.
Querying and visualization of metabolism
We are confronted immediately with the need to allow users to access information within metabolic databases. Pathway Tools provides several types of queries for each datatype within a PGDB. Users can query pathways, enzymes, metabolites, and proteins by exact name or by substring search. Additional supported queries include querying reactions by their Enzyme Commission (EC) number; querying metabolites by chemical substructures expressed in the SMILES language; querying pathways and reactions according to their substrates; and querying enzymes by molecular weight, by pI, and by the small molecules that activate and inhibit them.
In presenting the answer to a query, the complexity of metabolic information demands the development of visualizations of the data that speed their comprehension by the user. Thus, an important aspect of the bioinformatics of metabolism is the visualization of metabolic information. Pathway Tools contains several visualizations of metabolism, all of which are generated automatically. It can produce drawings of individual metabolic pathways and of clusters of related pathways called superpathways. These drawings can be generated at multiple levels of detail so that the user can choose to show or to hide information such as enzyme and gene names, names of intermediate or side metabolites, and the chemical structures of metabolites. The drawings depict substrate-level regulation of the enzymes within a pathway, and all components of the drawing are clickable by the user. For example, clicking on a metabolite takes the user to a page that shows the metabolite structure and lists all its synonyms, all reactions and pathways in which it is a substrate, and all enzymes whose activities it regulates. Pathway Tools also generates information pages for enzymes and for biochemical reactions.
Pathway Tools can generate a visualization of the entire metabolic network of an organism, which we call the cellular overview diagram (7). This diagram is generated automatically from any PGDB, and it depicts all metabolic pathways in the PGDB as well as reactions not assigned to any pathway and all transporters identified in the PGDB. The overview diagram can be used to visualize omics datasets in a mode of operation called the Omics Viewer (7). The input to the Omics Viewer is a combination of gene expression data, proteomics data, metabolomics data, or other measurements that associate numbers with genes, reactions, or metabolites. The numbers are mapped to colors that are painted onto the elements of the cellular overview to allow the power of the human visual system to be used to interpret large-scale datasets in a pathway context. For example, a dot in the diagram that represents a single metabolite would be assigned a color that indicates the measured concentration of that metabolite in a metabolomics experiment. Finally, Pathway Tools can generate a poster-size version of the cellular overview complete with labels for entities in the diagram.
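The mapping of numbers to colors can be sketched as follows. This is an assumption-laden illustration, not the color scheme used by the Omics Viewer: the linear blue-to-red ramp, the function names, and the measurement values are all invented for the example.

```python
def value_to_color(value, vmin, vmax):
    """Linearly map a value in [vmin, vmax] to a hex color from blue to red."""
    if vmax == vmin:
        t = 0.5                      # degenerate range: use the midpoint
    else:
        t = (value - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))        # clamp out-of-range values
    red = round(255 * t)
    blue = round(255 * (1 - t))
    return f"#{red:02x}00{blue:02x}"


def color_overview(measurements):
    """Assign one color to each measured entity (gene, reaction, or metabolite)."""
    vmin = min(measurements.values())
    vmax = max(measurements.values())
    return {
        entity: value_to_color(value, vmin, vmax)
        for entity, value in measurements.items()
    }


# Made-up metabolite concentrations keyed by hypothetical IDs.
colors = color_overview({"MET-LOW": 0.0, "MET-MID": 4.0, "MET-HIGH": 10.0})
```

In practice, expression ratios are often placed on a logarithmic scale before coloring; the linear ramp here is the simplest choice that demonstrates the idea.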
Prediction of metabolic pathways and pathway hole fillers
Pathway Tools predicts the metabolic pathway complement of an organism by assessing which known pathways from the MetaCyc PGDB are present in the annotated genome of a new organism. This inference is performed in two steps. First, enzymes in the annotated genome are assigned to their corresponding reactions in MetaCyc, which defines the reactome of the organism. The assignment proceeds by matching both the gene-product names (enzyme names) and the EC numbers assigned to genes in the genome. For example, the fabD gene in Bacillus anthracis is annotated with the function “malonyl CoA-acyl carrier protein transacylase.” That name was recognized by Pathway Tools as corresponding to the MetaCyc reaction whose EC number is 2.3.1.39. Therefore, Pathway Tools imported that reaction and its substrates into the B. anthracis PGDB, and it created an enzymatic-reaction object to link that reaction to that B. anthracis protein.
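The first inference step can be sketched as a lookup by normalized enzyme name and by EC number. The sketch makes several assumptions: the reference reaction IDs, the genes geneX and geneY, and the normalization rule are hypothetical, and the actual name-matching logic of Pathway Tools is more elaborate than an exact match on normalized strings. EC 1.1.1.1 (alcohol dehydrogenase) is used only as a familiar example of an EC number.

```python
import re


def normalize(name):
    """Lower-case a name and collapse punctuation so near-identical names match."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def build_indexes(reference_reactions):
    """Index reference reactions by normalized enzyme name and by EC number."""
    by_name, by_ec = {}, {}
    for rxn in reference_reactions:
        for name in rxn["enzyme_names"]:
            by_name.setdefault(normalize(name), set()).add(rxn["id"])
        if rxn.get("ec"):
            by_ec.setdefault(rxn["ec"], set()).add(rxn["id"])
    return by_name, by_ec


def assign_reactions(genes, reference_reactions):
    """Return {gene name: [reaction IDs]} for genes that match a reaction."""
    by_name, by_ec = build_indexes(reference_reactions)
    reactome = {}
    for gene in genes:
        hits = set(by_name.get(normalize(gene["function"]), set()))
        if gene.get("ec"):
            hits |= by_ec.get(gene["ec"], set())
        if hits:
            reactome[gene["name"]] = sorted(hits)
    return reactome


# Hypothetical reference data standing in for MetaCyc.
reference = [
    {"id": "REF-RXN-1", "ec": None,
     "enzyme_names": ["malonyl-CoA-acyl carrier protein transacylase"]},
    {"id": "REF-RXN-2", "ec": "1.1.1.1",
     "enzyme_names": ["alcohol dehydrogenase"]},
]
genome = [
    {"name": "fabD", "ec": None,
     "function": "malonyl CoA-acyl carrier protein transacylase"},
    {"name": "geneX", "ec": "1.1.1.1", "function": "putative oxidoreductase"},
    {"name": "geneY", "ec": None, "function": "hypothetical protein"},
]
reactome = assign_reactions(genome, reference)
```

The fabD annotation matches by name despite the differing hyphenation, geneX matches by EC number alone, and geneY remains unassigned.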
Once the reactome of the organism has been established, Pathway Tools imports into the new PGDB all MetaCyc pathways that contain at least one reaction in the organism’s reactome. After the import, Pathway Tools attempts to prune out those pathways that are likely to be false-positive predictions. That pruning process considers both the fraction of reaction steps in the pathway that have assigned enzymes and how many of the reactions with assigned enzymes are unique to that pathway (as opposed to being used in additional metabolic pathways in that organism). The remaining pathways are those predicted to occur in the organism under analysis.
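The import-then-prune step can be sketched using the two kinds of evidence just described. The decision rule and the 0.5 threshold below are assumptions chosen for illustration; they are not the rule implemented in Pathway Tools. All pathway and reaction IDs are hypothetical.

```python
from collections import Counter


def predict_pathways(reference_pathways, reactome, min_fraction=0.5):
    """Import pathways that touch the reactome, then prune weak candidates.

    reference_pathways: {pathway ID: [reaction IDs]}
    reactome: set of reaction IDs that have an assigned enzyme
    """
    # Import: every pathway with at least one reaction in the reactome.
    candidates = {
        pwy: rxns for pwy, rxns in reference_pathways.items()
        if any(r in reactome for r in rxns)
    }
    # Count in how many candidate pathways each reaction occurs.
    usage = Counter(r for rxns in candidates.values() for r in set(rxns))

    kept = {}
    for pwy, rxns in candidates.items():
        present = [r for r in rxns if r in reactome]
        fraction = len(present) / len(rxns)
        unique = [r for r in present if usage[r] == 1]
        # Assumed rule: keep well-covered pathways, or any pathway supported
        # by an enzyme-assigned reaction that occurs in no other pathway.
        if fraction >= min_fraction or unique:
            kept[pwy] = {"fraction": fraction, "unique_reactions": unique}
    return kept


reference_pathways = {
    "PWY-A": ["R1", "R2", "R3"],
    "PWY-B": ["R3", "R4", "R5", "R6"],
    "PWY-C": ["R7", "R8"],
}
kept = predict_pathways(reference_pathways, reactome={"R1", "R2", "R3"})
```

Here PWY-C is never imported, and PWY-B is imported but pruned: its only supported reaction, R3, is shared with PWY-A, so it offers no independent evidence.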
A final inference tool provided by Pathway Tools is called the pathway hole filler. A pathway hole is a reaction in a metabolic pathway for which no enzyme has been identified in the genome that catalyzes that reaction. Typical microbial genomes contain 200-300 pathway holes. Although some pathway holes are probably genuine, we believe that most probably result from the failure of the genome annotation process to identify the genes that correspond to those pathway holes. For example, genome annotation systems systematically under-annotate genes with multiple functions, and we believe that the enzyme functions for many pathway holes are unidentified second functions for genes that already have one assigned function.
The method used by the pathway hole filling program PHFiller (8) is as follows. Given a reaction that is a pathway hole, the program first queries the UniProt database to find all known sequences for enzymes that catalyze that same reaction in other organisms. The program then uses the BLAST tool to compare that set of sequences against the full proteome of the organism in which we are seeking hole fillers. It scores the resulting BLAST hits by considering information such as genome localization, for example, whether a potential hole filler lies in the same operon as another gene in the same metabolic pathway. At a stringent score cutoff, our method finds potential hole fillers for approximately 45% of the pathway holes in a microbial genome.
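The scoring step can be sketched as a ranking of precomputed BLAST hits. This is a deliberately simplified stand-in, not the published method, which combines evidence in a Bayesian model (8). The additive score, the operon bonus, the cutoff, and every gene and operon name below are assumptions made for illustration, and the sketch does not run BLAST itself.

```python
import math


def score_candidate(hit, pathway_genes, operon_of, operon_bonus=5.0):
    """Score one hit from sequence similarity plus genome context."""
    # Stronger similarity (smaller E-value) gives a larger score.
    score = -math.log10(max(hit["evalue"], 1e-300))
    operon = operon_of.get(hit["gene"])
    shares_operon = operon is not None and any(
        operon_of.get(g) == operon for g in pathway_genes if g != hit["gene"]
    )
    if shares_operon:
        score += operon_bonus
    return score


def rank_hole_fillers(hits, pathway_genes, operon_of, cutoff=20.0):
    """Return candidate genes at or above the cutoff, best first."""
    scored = [
        (score_candidate(h, pathway_genes, operon_of), h["gene"]) for h in hits
    ]
    return [gene for score, gene in sorted(scored, reverse=True) if score >= cutoff]


# Hypothetical BLAST hits against the proteome of the target organism.
hits = [
    {"gene": "geneA", "evalue": 1e-30},
    {"gene": "geneB", "evalue": 1e-18},
    {"gene": "geneC", "evalue": 1e-10},
]
operon_of = {"geneA": "operon2", "geneB": "operon1", "pwyGene1": "operon1"}
ranked = rank_hole_fillers(hits, pathway_genes=["pwyGene1"], operon_of=operon_of)
```

In this example geneB clears the cutoff only because it shares an operon with a gene already assigned to the pathway, which illustrates how genome context can rescue a candidate with moderate sequence similarity.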
Analysis and comparison of metabolic networks
Once the metabolism of an organism is captured in a computable form, we can write programs to characterize the size and structure of the metabolic network of an organism (9). For example, version 10.5 of the EcoCyc PGDB contains 176 pathways of small-molecule metabolism, which contain 702 component reactions. Another 245 reactions of small-molecule metabolism are not assigned to a specific pathway. One hundred thirty-five reactions in E. coli are catalyzed by more than one enzyme. Conversely, 177 E. coli enzymes are multifunctional, meaning they catalyze more than one reaction. The E. coli metabolic network contains 975 metabolites. Each reaction contains an average of 4.1 metabolites, and each metabolite is a substrate in 5.2 reactions, on average. Most pathways are 1-7 reactions in length, but the longest pathway contains 20 reactions. Interestingly, substrate-level inhibition of enzymes is more than four times more common than substrate-level enzyme activation: 92 enzymes have recorded inhibitors, whereas 21 enzymes have recorded activators, for a total of 97 enzymes in the metabolic network that have some type of known substrate-level regulation.
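Statistics of this kind can be computed with a few lines of code once the network is available as data. The sketch below runs on a toy network with hypothetical IDs; the input layout (a mapping from reaction to its metabolites and a list of enzyme-reaction pairs) is an assumption, not a PGDB export format.

```python
from collections import defaultdict


def network_statistics(reaction_metabolites, catalysis):
    """Summarize the size and connectivity of a metabolic network.

    reaction_metabolites: {reaction ID: set of metabolite IDs}
    catalysis: iterable of (enzyme ID, reaction ID) pairs
    """
    reactions_per_metabolite = defaultdict(set)
    for rxn, mets in reaction_metabolites.items():
        for met in mets:
            reactions_per_metabolite[met].add(rxn)

    enzymes_per_reaction = defaultdict(set)
    reactions_per_enzyme = defaultdict(set)
    for enzyme, rxn in catalysis:
        enzymes_per_reaction[rxn].add(enzyme)
        reactions_per_enzyme[enzyme].add(rxn)

    n_rxn = len(reaction_metabolites)
    n_met = len(reactions_per_metabolite)
    return {
        "reactions": n_rxn,
        "metabolites": n_met,
        "mean_metabolites_per_reaction":
            sum(len(m) for m in reaction_metabolites.values()) / n_rxn,
        "mean_reactions_per_metabolite":
            sum(len(r) for r in reactions_per_metabolite.values()) / n_met,
        "reactions_with_multiple_enzymes":
            sum(1 for e in enzymes_per_reaction.values() if len(e) > 1),
        "multifunctional_enzymes":
            sum(1 for r in reactions_per_enzyme.values() if len(r) > 1),
    }


toy_reactions = {"R1": {"A", "B", "C"}, "R2": {"C", "D"}, "R3": {"D", "E", "A"}}
toy_catalysis = [("E1", "R1"), ("E1", "R2"), ("E2", "R2"), ("E3", "R3")]
stats = network_statistics(toy_reactions, toy_catalysis)
```

In the toy network, R2 is the only reaction catalyzed by more than one enzyme and E1 is the only multifunctional enzyme, mirroring the two counts reported above for E. coli.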
A computer formulation of metabolism also facilitates comparisons of the metabolic networks of two or more organisms. The cellular overview diagram can be used for comparative purposes by coloring those metabolic reactions shared between two organisms by using the desktop version of Pathway Tools. The Web version of Pathway Tools provides a suite of comparative analysis tools. For example, Fig. 4 shows a comparison of the overall pathway complements of E. coli and B. anthracis, broken down according to the Pathway Tools ontology of pathways. Figure 5 shows a detailed comparison of the pathways of biosynthesis of fatty acids and lipids in these two organisms.
A second form of pathway analysis is computing the potential outputs that the metabolic network might produce when supplied with a set of input metabolites (10). A third computational analysis method predicts choke points in the metabolic network: enzymes whose inhibition would be likely to create a major bottleneck in the network and that are therefore likely to be good targets for developing antimicrobial drugs (11). In addition, it is possible to compute the equilibrium flux rates through an entire metabolic network (12).
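The first two analyses can be sketched on a toy network. Both functions rest on assumptions: the output computation is modeled as simple forward propagation that fires a reaction once all of its reactants are available, and a choke point is taken to be a reaction that is the only consumer of some metabolite or the only producer of some metabolite, which is one common formalization and is stated here in terms of reactions rather than enzymes. All IDs are hypothetical, and the sketch makes no attempt to compute fluxes.

```python
from collections import defaultdict


def reachable_metabolites(reactions, nutrients):
    """Propagate forward from the nutrients until no new reaction can fire.

    reactions: {reaction ID: (set of reactants, set of products)}
    Returns the set of available metabolites and the set of fired reactions.
    """
    available = set(nutrients)
    fired = set()
    changed = True
    while changed:
        changed = False
        for rid, (reactants, products) in reactions.items():
            if rid not in fired and reactants <= available:
                fired.add(rid)
                available |= products
                changed = True
    return available, fired


def choke_points(reactions):
    """Reactions that uniquely consume or uniquely produce some metabolite."""
    consumers = defaultdict(set)
    producers = defaultdict(set)
    for rid, (reactants, products) in reactions.items():
        for met in reactants:
            consumers[met].add(rid)
        for met in products:
            producers[met].add(rid)
    result = set()
    for rxns in list(consumers.values()) + list(producers.values()):
        if len(rxns) == 1:
            result |= rxns
    return result


toy_network = {
    "R1": ({"A"}, {"B"}),
    "R2": ({"A"}, {"B"}),   # parallel route from A to B
    "R3": ({"B"}, {"C"}),
    "R4": ({"X"}, {"C"}),
}
available, fired = reachable_metabolites(toy_network, nutrients={"A"})
chokes = choke_points(toy_network)
```

With A as the only nutrient, metabolites B and C become reachable while R4 never fires because X is absent. R1 and R2 are not choke points because each can substitute for the other, whereas R3 is the sole consumer of B.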
Figure 4. Pathway comparison of B. anthracis Ames and E. coli K-12. Rows 1 and 2 of the table indicate that these organisms contain 142 and 114 biosynthetic pathways, respectively, of which 8 and 6 pathways are for biosynthesis of amines and polyamines, respectively.
Figure 5. Detailed comparison of pathways of biosynthesis of lipids and fatty acids of B. anthracis and E. coli. This report indicates the presence of specific named pathways in each organism with an “X.” Clicking on the name of the pathway will display the pathway itself.
Computational access to PGDBs
In addition to the user-friendly graphical interfaces to PGDBs provided through the Web and desktop versions of Pathway Tools, we provide the following modes of access to PGDBs to facilitate the construction of programs that explore pathway data computationally.
Programmatic access through application program interfaces (APIs)
Programmers can access and update PGDB data directly by writing programs in the Java, Perl, and Common Lisp languages (13).
Downloadable files in multiple formats
Pathway Tools can export PGDBs into several different file formats that are described at http://bioinformatics.ai.sri.com/ptools/flatfile-format.html. These formats include column-delimited tables, SBML (see http://sbml.org/), BioPAX (see http://biopax.org/), GenBank, FASTA, and attribute-value.
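A reader for attribute-value files might be sketched as follows. The layout assumed here (one "ATTRIBUTE - value" pair per line, records terminated by a line containing only "//", and comment lines beginning with "#") is an assumption for illustration; the format description at the URL above is authoritative, and this sketch ignores details such as continuation lines. The sample text is a made-up excerpt that reuses IDs and field names from this article.

```python
def parse_attribute_value(text):
    """Parse attribute-value text into a list of {attribute: [values]} records."""
    records = []
    current = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank lines and comments
        if line.strip() == "//":          # assumed end-of-record marker
            if current:
                records.append(current)
            current = {}
            continue
        attribute, separator, value = line.partition(" - ")
        if separator:
            # An attribute may repeat, so values are collected in a list.
            current.setdefault(attribute.strip(), []).append(value.strip())
    if current:
        records.append(current)
    return records


sample = """\
# hypothetical excerpt
UNIQUE-ID - RXN0-2382
LEFT - TRP
IN-PATHWAY - TRYPSYN-PWY
//
UNIQUE-ID - TRYPSYN-PWY
REACTION-LIST - RXN0-2382
//
"""
records = parse_attribute_value(sample)
```

Splitting on the first " - " only keeps hyphenated identifiers such as RXN0-2382 and TRYPSYN-PWY intact.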
Relational database access via BioWarehouse
For those who want to query PGDB data through a relational database system, the attribute-value files exported by Pathway Tools can be loaded into SRI’s BioWarehouse system (14). BioWarehouse is an Oracle- or MySQL-based system for integration of multiple public bioinformatics databases. PGDB data can be queried through BioWarehouse alone or in combination with other bioinformatics databases such as UniProt, GenBank, NCBI Taxonomy, ENZYME, and KEGG.
Queries using the Pathway Tools query language, BioVelo
Pathway Tools provides a powerful and easy-to-use query language for querying PGDBs, called BioVelo. See http://biocyc.org/query.html for details.
I thank Carol Fulcher and Alexander Shearer for comments on this manuscript. This work was supported by grants GM077678, GM077905, and GM70065 from the National Institutes of Health.
1. Karp PD, Paley S, Romero P. The Pathway Tools software. Bioinformatics. 2002; 18:225-232.
2. Karp PD, et al. Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Res. 2005; 33:6083-6089.
3. Green ML, Karp PD. The outcomes of pathway database computations depend on pathway ontology. Nucleic Acids Res. 2006; 34:3687-3697.
4. Keseler IM, et al. EcoCyc: a comprehensive database resource for Escherichia coli. Nucleic Acids Res. 2005; 33:334-337.
5. Caspi R, et al. MetaCyc: A multiorganism database of metabolic pathways and enzymes. Nucleic Acids Res. 2006; 34:511-516.
6. Romero P, et al. Computational prediction of human metabolic pathways from the complete human genome. Genome Biol. 2005; 6:2.
7. Paley SM, Karp PD. The Pathway Tools cellular overview diagram and Omics Viewer. Nucleic Acids Res. 2006; 34:3771-3778.
8. Green ML, Karp PD. A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics. 2004; 5:76.
9. Ouzounis CA, Karp PD. Global properties of the metabolic map of Escherichia coli. Genome Res. 2000; 10:568-576.
10. Romero PR, Karp P. Nutrient-related analysis of pathway/genome databases. Pac Symp Biocomput. 2001; 471-482.
11. Yeh I, et al. Computational analysis of Plasmodium falciparum metabolism: organizing genomic information to facilitate drug discovery. Genome Res. 2004; 14:917-924.
12. Varma A, Palsson BO. Metabolic flux balancing: basic concepts, scientific and practical use. Bio/Technology. 1994; 12:994-998.
13. Krummenacker M, et al. Querying and computing with BioCyc databases. Bioinformatics. 2005; 21:3454-3455.
14. Lee TJ, et al. BioWarehouse: a bioinformatics database warehouse toolkit. BMC Bioinformatics. 2006; 7:170.