An Introduction to the Reactome Knowledgebase of Human Biological Pathways and Processes
The human genome encodes approximately 25,000 proteins. Functional information is available for about half of these, either from direct studies of human proteins or from studies of well–conserved homologues in model organisms. The goal of the Reactome Project is to systematically associate these proteins with their functions in order to generate a consistently annotated knowledgebase of human biological processes that is useful both as an online reference for individual processes and as a data mining and analysis resource for systems biology.
The Reactome knowledgebase embodies a reductionist data model, which asserts that all of biology can be represented as events located in subcellular compartments that convert input physical entities into output physical entities. Information is stored as a generic and qualitative parts list — neither kinetic parameters nor tissue and state specificities for events are captured. Instead, data are organized to facilitate the superposition of user–generated expression data onto the Reactome parts list and to allow the export of the list in forms that enable model building and the integration of other data types. The knowledgebase is human–centric, manually curated and linked to published data.
A brief discussion of our data model and of the orthology–based electronic inference of non–human events provides a basis for describing the potential uses of the Reactome Knowledgebase.
The Reactome data model
In a living cell, molecules are synthesized, covalently modified, degraded and transported from one location to another, and they form complexes with other molecules. Although these are diverse events, they can all be represented as the transformation of input physical entities into output physical entities. Reactome captures these physical entities and their interactions in a frame–based data model, which we describe here.
Classes. Classes (frames) describe concepts such as an event, physical entity, subcellular location and catalysis. Specific reactions such as the 'Reduction of acetoacetyl–ACP to β–hydroxybutyryl–ACP', entities such as 'acetoacetyl–ACP', locations such as the 'cytosol', and activities such as '3–oxoacyl–[acyl–carrier protein] reductase activity' are instances of these classes. Class attributes (slots) hold information about the instances.
Events. Each 'reaction' instance has slots to record its reactants (input), products (output), catalyst and subcellular location. A 'catalyst' in turn is an instance of a class whose attributes are a physical entity and a Gene Ontology (GO) 'Molecular Function' term that describes its activity (Ashburner et al., 2000). Instances of the 'regulation' class link reactions to the factors that modulate them. The Reactome data model extends the concept of a biochemical reaction to include events such as the association of molecules to form a complex, the transport of a molecule between two cell compartments and the transmission of a signal by a receptor.
A set of reactions that are linked by shared inputs and outputs can be organized into a goal–directed 'pathway'. Attributes of a pathway instance are the names of the reactions and the smaller pathways it contains, as well as a GO 'Biological Process' term. A pathway has no molecular attributes — these are inferred from the attributes of its included reactions. A single reaction may belong to one or more pathways.
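The event classes described above can be sketched as plain data structures. The Python fragment below is only an illustration under stated assumptions: the class names, field names and the `participants` helper are invented for this sketch and are not Reactome's actual schema, and the catalyst entity and pathway name are placeholders. It shows how a pathway, which has no molecular attributes of its own, can derive them from the reactions and smaller pathways it contains.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Catalyst:
    physical_entity: str          # the entity that carries the activity
    go_molecular_function: str    # GO 'Molecular Function' term


@dataclass
class Reaction:
    name: str
    inputs: List[str]
    outputs: List[str]
    location: str
    catalyst: Optional[Catalyst] = None


@dataclass
class Pathway:
    name: str
    go_biological_process: str
    reactions: List[Reaction] = field(default_factory=list)
    subpathways: List["Pathway"] = field(default_factory=list)

    def participants(self) -> Set[str]:
        """Collect molecular participants from the included events."""
        found: Set[str] = set()
        for reaction in self.reactions:
            found.update(reaction.inputs, reaction.outputs)
            if reaction.catalyst is not None:
                found.add(reaction.catalyst.physical_entity)
        for sub in self.subpathways:
            found |= sub.participants()
        return found


reduction = Reaction(
    name="Reduction of acetoacetyl-ACP to beta-hydroxybutyryl-ACP",
    inputs=["acetoacetyl-ACP"],            # cofactors omitted for brevity
    outputs=["beta-hydroxybutyryl-ACP"],
    location="cytosol",
    catalyst=Catalyst(
        physical_entity="reductase (placeholder entity)",
        go_molecular_function="3-oxoacyl-[acyl-carrier protein] reductase activity",
    ),
)
pathway = Pathway(
    name="example pathway (hypothetical)",
    go_biological_process="GO term (placeholder)",
    reactions=[reduction],
)
```

Calling `pathway.participants()` returns the input, the output and the catalyst entity of the single reaction; a parent pathway that lists `pathway` as a subpathway would report the same set.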
Entities. Physical entities include proteins, nucleic acids and small molecules, as well as complexes of two or more molecules. Molecules are modified, moved from place to place, cleaved or take on different three–dimensional conformations. Many of these modifications are functionally crucial: phosphorylation of a protein at a particular amino–acid residue may convert it from an inactive form to an active form. The Reactome data model captures this information in a computable format by treating each modification of a molecule as a separate physical entity. The corresponding modification process is annotated as a reaction in which the input is the unmodified physical entity and the output is the modified version.
Subcellular locations. The functions of biological molecules depend on their subcellular locations, so chemically identical entities that are located in different compartments are represented as distinct physical entities. Transport events are therefore treated as ordinary reactions: for example, the interconversion of the extracellular and cytosolic forms of a molecule that is mediated by a transport protein at the plasma membrane. Subcellular locations of molecules are annotated with a subset of the GO 'Cellular Component Ontology', refined to remove ambiguous terms such as 'intracellular'.
Reference entities. The annotation of alternative locations, post–translational modifications and conformations of a molecule causes instances of a physical entity to proliferate. The basic chemical information that all forms share is therefore stored in a separate class of 'reference physical entities'. This allows information to be entered only once, which reduces error and facilitates data maintenance, and it explicitly links all the alternative forms of a single entity so that, for example, all events involving all forms of a cyclin protein can be retrieved. The attributes of a reference entity include its name, reference chemical structure or sequence, and its accession numbers in reference databases such as UniProt for proteins (Wu et al., 2006), ChEBI for small molecules (de Matos et al., 2007) and EMBL for nucleic acids (Stoesser et al., 2000).
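The relationship between physical entities and their reference entity can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the class and function names are invented, the event names are hypothetical, and only the CDC2 accession (UniProt: P06493, used as an example later in the text) comes from the article. The point is that each modified or relocated form is a distinct entity, while the shared reference allows all events involving any form to be retrieved.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ReferenceEntity:
    name: str
    accession: str


@dataclass(frozen=True)
class PhysicalEntity:
    reference: ReferenceEntity
    compartment: str
    modifications: Tuple[str, ...] = ()


def events_involving(ref: ReferenceEntity,
                     events: Dict[str, List[PhysicalEntity]]) -> List[str]:
    """Names of all events in which any form of `ref` participates."""
    return [name for name, participants in events.items()
            if any(p.reference == ref for p in participants)]


cdc2_ref = ReferenceEntity("CDC2", "UniProt:P06493")
other_ref = ReferenceEntity("other protein", "placeholder accession")

# Two forms of the same protein are two distinct physical entities.
cdc2 = PhysicalEntity(cdc2_ref, "cytosol")
cdc2_p = PhysicalEntity(cdc2_ref, "cytosol", ("phospho-threonine-14",))
other = PhysicalEntity(other_ref, "cytosol")

events = {
    "event 1 (hypothetical)": [cdc2, cdc2_p],
    "event 2 (hypothetical)": [other],
    "event 3 (hypothetical)": [cdc2_p, other],
}
```

Here `events_involving(cdc2_ref, events)` returns events 1 and 3, because each contains at least one form of CDC2, even though `cdc2` and `cdc2_p` compare as different entities.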
Complexes. Many biological reactions involve macromolecular complexes. The Reactome knowledgebase annotates these entities as instances of the 'complex' class, the attributes of which are the identities of components of the complex (macromolecules, small molecules and other complexes) and its subcellular location. Molecular assembly operations, such as recruiting components of the double–strand break repair complex to a site of DNA damage, can then be described as a series of reactions in which the inputs and outputs are intermediates in the formation of the DNA repair complex. Because complexes refer to all of the components they contain, it is possible to fetch all complexes that involve a particular component or to dissect a complex to find its constituents.
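Because a complex may contain other complexes, both queries mentioned above amount to a recursive walk over the component lists. The following Python sketch uses made-up component names (A, B, C, D) and an invented dictionary layout; it is not Reactome's storage format, only one way the two operations could be expressed.

```python
from typing import Dict, List, Set

# complex name -> direct components, which may themselves be complexes
COMPLEXES: Dict[str, List[str]] = {
    "A:B": ["A", "B"],
    "A:B:C": ["A:B", "C"],
    "D:C": ["D", "C"],
}


def constituents(name: str, complexes: Dict[str, List[str]]) -> Set[str]:
    """Dissect a complex down to its non-complex components."""
    if name not in complexes:
        return {name}            # a molecule, not a complex
    found: Set[str] = set()
    for component in complexes[name]:
        found |= constituents(component, complexes)
    return found


def complexes_containing(molecule: str,
                         complexes: Dict[str, List[str]]) -> List[str]:
    """All complexes that contain `molecule`, directly or via a sub-complex."""
    return sorted(name for name in complexes
                  if molecule in constituents(name, complexes))
```

With this data, `constituents("A:B:C", COMPLEXES)` yields A, B and C, and `complexes_containing("C", COMPLEXES)` yields `A:B:C` and `D:C`.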
Entity sets. Often it is convenient to group physical entities on the basis of common properties. For example, the SLC28A2 plasma membrane nucleoside transporter operates equally well on adenosine, guanosine, inosine and uridine (Wang et al., 1997). To avoid creating four nearly identical reactions, the Reactome data model allows for the creation of 'defined sets', one comprising the extracellular forms of the four nucleosides and one comprising their cytosolic forms. SLC28A2–mediated nucleoside transport is then classed as a single reaction that converts the extracellular set into the cytosolic set. Defined sets are also used to describe protein paralogues that are functionally interchangeable, equivalent RNA splice variants and isoenzymes.
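The nucleoside transport example can be written out as a small Python sketch. The dictionary layout, the set names and the `expand` helper are assumptions made for illustration; the sketch only shows how one set-based reaction stands in for four member-level reactions, pairing each extracellular nucleoside with its cytosolic counterpart.

```python
from typing import Dict, List, Tuple

Entity = Tuple[str, str]   # (molecule, compartment)

NUCLEOSIDES = ["adenosine", "guanosine", "inosine", "uridine"]

defined_sets: Dict[str, List[Entity]] = {
    "nucleosides [extracellular]": [(n, "extracellular") for n in NUCLEOSIDES],
    "nucleosides [cytosol]": [(n, "cytosol") for n in NUCLEOSIDES],
}

# One reaction annotated at the level of sets.
set_reaction = {
    "name": "SLC28A2-mediated nucleoside transport",
    "input": "nucleosides [extracellular]",
    "output": "nucleosides [cytosol]",
    "catalyst": "SLC28A2",
}


def expand(reaction: Dict[str, str],
           sets: Dict[str, List[Entity]]) -> List[Tuple[Entity, Entity]]:
    """List the member-level (input, output) pairs implied by a set reaction."""
    outputs = {molecule: (molecule, compartment)
               for molecule, compartment in sets[reaction["output"]]}
    return [(entity, outputs[entity[0]]) for entity in sets[reaction["input"]]]


expanded = expand(set_reaction, defined_sets)
```

`expanded` contains four pairs, one per nucleoside, which is exactly the set of nearly identical reactions that the defined sets make it unnecessary to annotate by hand.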
If it is unclear which one(s) of a group of related physical entities performs a particular task, a 'candidate set' is created. For example, when only one member of a structurally well–conserved protein family has been functionally characterized, the use of a candidate set composed of the members captures both the known family function and any ambiguity. Finally, an 'open set' class can be used when set members can be identified but not enumerated; for example, to annotate a splicing event that can take any mRNA precursor with an intron as the input.
Together, the entity, complex and set classes allow the detailed and flexible annotation and querying of physical entities and their interactions. For example, cytosolic CDC2 protein (UniProt: P06493) phosphorylated at threonine–14 is distinct from unmodified CDC2. Both the phosphorylated and unphosphorylated forms can also be found in complexes with two related cyclins, B1 and B2. These two cyclins are represented collectively by a defined set (cyclin B). Complexes between cyclin B and CDC2 are represented as two instances of the complex class, one consisting of the cyclin B defined set and unphosphorylated CDC2, and one consisting of the cyclin B defined set and phosphorylated CDC2. These complexes then take part in the various reactions of the cell–cycle pathway. We can simultaneously create complexes of CDC2 with individual cyclins if a particular cyclin/CDC2 complex does something that the others do not. The use of sets also simplifies any querying of the Reactome Knowledgebase. A search for cyclin B returns all of these reactions and protein isoforms.
Evidence. Every reaction that is entered into the Reactome Knowledgebase must be backed by direct or indirect evidence from the biomedical literature. Direct evidence for a human reaction comes from an assay on human cells, described in a research publication, with the PubMed identifier stored as an attribute of the reaction. Much biomedical knowledge, however, derives from observations in experimentally tractable non–human systems that are thought to be good functional homologues for human ones. We can use such non–human data to document a human reaction in two steps. First, we annotate the reaction in the non–human species, using the physical entities of that organism — for example, the Drosophila melanogaster Notch protein — with appropriate literature reference attributes. Second, we annotate the human reaction, using human physical entities — for example, the four human Notch paralogues. The human reaction has no reference within the literature, but instead has an attribute indicating its inference from the experimentally validated reaction in D. melanogaster. In this way, the complete chain of evidence is preserved from the primary experiment to the non–human reaction to the inferred human reaction.
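The two-step evidence chain can be modelled as a linked structure in which an inferred reaction points to the reaction it was inferred from. In the Python sketch below, the class, the field names, the reaction names and the literature identifier are all placeholders invented for illustration; only the species and the Notch example come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AnnotatedReaction:
    name: str
    species: str
    literature_refs: List[str] = field(default_factory=list)
    inferred_from: Optional["AnnotatedReaction"] = None


def evidence_chain(reaction: AnnotatedReaction) -> List[AnnotatedReaction]:
    """Follow 'inferred from' links back to the experimentally supported reaction."""
    chain = []
    current: Optional[AnnotatedReaction] = reaction
    while current is not None:
        chain.append(current)
        current = current.inferred_from
    return chain


fly = AnnotatedReaction(
    name="Notch reaction (illustrative)",
    species="Drosophila melanogaster",
    literature_refs=["PubMed identifier (placeholder)"],
)
human = AnnotatedReaction(
    name="Notch reaction (illustrative)",
    species="Homo sapiens",
    inferred_from=fly,      # no literature reference of its own
)
```

`evidence_chain(human)` returns the human reaction followed by the fly reaction, and only the last element carries a literature reference.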
Electronic inference of non–human events
The Reactome Knowledgebase includes computationally inferred pathways and reactions in 22 non–human species, including Mus musculus, Tetraodon nigroviridis, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Arabidopsis thaliana, Plasmodium falciparum and Escherichia coli. These species represent more than 4,000 million years of evolution and span the main branches of life.
We can project the set of curated human reactions onto the genome of another species using OrthoMCL protein similarity clusters (Li et al., 2003). The human reaction is checked to establish whether all of its protein participants (inputs, outputs and catalysts) have at least one orthologue or recent paralogue (OP) in the other species. In the case of protein complexes, we relax this requirement so that a complex is considered to be present in the other species if at least 75% of its protein components are present as OPs. For each reaction that meets these criteria, we create an equivalent reaction for the other species by replacing all human protein components with their corresponding OPs. For proteins with more than one OP in the other species, we create a 'defined set' named 'Homologues of ...' that contains these OPs, and use this set as the corresponding component of the equivalent reaction.
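The projection rule can be summarized in a few lines of Python. This is a minimal sketch under stated assumptions, not the Reactome inference code: the function names and data layout are invented, a defined set is represented only by its 'Homologues of ...' name, and complex components that lack an OP are simply omitted from the projected complex (the text does not say how they are handled).

```python
from typing import Dict, List, Optional, Sequence, Tuple, Union

Orthologues = Dict[str, List[str]]          # human protein -> OPs in target species
Participant = Union[str, Tuple[str, ...]]   # a protein, or the proteins of a complex


def project_protein(protein: str, ops: Orthologues) -> Optional[str]:
    """Return the OP, a 'Homologues of ...' set name, or None if there is no OP."""
    hits = ops.get(protein, [])
    if not hits:
        return None
    if len(hits) == 1:
        return hits[0]
    return f"Homologues of {protein}"       # stands for a defined set of the OPs


def project_complex(components: Sequence[str], ops: Orthologues,
                    threshold: float = 0.75) -> Optional[Tuple[str, ...]]:
    """A complex is present if at least `threshold` of its proteins have OPs."""
    projected = [project_protein(p, ops) for p in components]
    found = [p for p in projected if p is not None]
    if not components or len(found) / len(components) < threshold:
        return None
    return tuple(found)                     # assumption: missing components dropped


def project_reaction(participants: Sequence[Participant],
                     ops: Orthologues) -> Optional[List[Participant]]:
    """Project every participant; the reaction is inferred only if all succeed."""
    result: List[Participant] = []
    for item in participants:
        projected = (project_protein(item, ops) if isinstance(item, str)
                     else project_complex(item, ops))
        if projected is None:
            return None
        result.append(projected)
    return result
```

For example, with OPs known for P1, P3 and P4 (one each) and for P2 (two), the reaction `["P4", ("P1", "P2", "P3", "P5")]` is projected because three of the four complex components (75%) have OPs, whereas a reaction that requires P5 on its own is not.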
The Reactome Knowledgebase user interface
Browsing the Reactome Knowledgebase. The front page of the Reactome web site is the starting point for browsing the database and launching simple queries against it. Buttons at the top of the page provide access to additional resources and advanced querying tools (Figure 1).
The pathway topics list shows the pathway topics currently represented in the Reactome Knowledgebase. A more detailed list of the pathways contained in the knowledgebase is provided in the table of contents. The topics covered in the Reactome Knowledgebase to date include apoptosis, cell cycle, DNA repair, DNA replication, electron transport chain, gap junction trafficking and regulation, gene expression, HIV infection, hemostasis, influenza infection, integration of energy metabolism, lipid and lipoprotein metabolism, metabolism of amino acids, metabolism of carbohydrates, metabolism of non–coding RNA, metabolism of xenobiotics, nucleotide metabolism, porphyrin metabolism, pyruvate metabolism, TCA cycle, post–translational protein modification, signaling by EGFR, signaling by FGFR, signaling in the immune system, signaling by insulin receptor, signaling by Notch, signaling by NGF, signaling by Rho GTPases, signaling by TGF β, signaling by Wnt, telomere maintenance, and transcription.
The reaction map provides an interactive graphical representation of the Reactome pathways. A pathway is depicted as a set of linked arrows, each arrow representing a reaction. Human reactions are shown by default; the user can select other available species in the drop–down menu above the reaction map.
Every reaction in the knowledgebase is shown as a single gray arrow on the map. Reactions that follow one another are connected by a thin gray line between the two reaction arrows. When a topic name is 'moused over', the corresponding reactions are highlighted on the map in color: blue if the reaction has been manually annotated in the species displayed, purple if manually inferred and green if electronically inferred. A pale halo of the same color as the arrow(s) is added to facilitate visualization.
In the example shown in Figure 2, the red arrow on the right points to the reaction 'glutaryl–CoA + FAD → crotonyl–CoA + FADH2 + CO2', within the pathway 'metabolism of amino acids'. This reaction precedes the reaction 'crotonyl–CoA + H2O → (S)–3–hydroxybutanoyl–CoA' (red arrow on the left) in the 'lipid metabolism' pathway, as shown by the gray line (indicated by the green arrow) that links the head (output) end of the first reaction to the tail (input) end of the second reaction. Mousing over an individual reaction arrow displays the name of that reaction and highlights the name(s) of the pathway(s) within the pathway topics list in which this reaction occurs.
Selecting a pathway in the pathway topics list or an individual reaction on the reaction map directs the user to an event page that provides a detailed description of the selected pathway or reaction.
The reaction page (Figure 3) is divided into four sections: the reaction map, the reaction diagram, the event hierarchy and the event description. The location of the reaction is highlighted on the reaction map. The reaction diagram provides a graphical representation of the reaction, showing the participating input, output and catalyst molecules or complexes, and the names of reactions that immediately precede and follow the subject reaction. All reactions and molecules are hyperlinked to their corresponding description pages. The event hierarchy shows the relationships among the subpathways and reactions within a selected pathway, each marked by its own symbol. Clicking on the boxed '+' signs next to a pathway name opens the list of events within the pathway. The event description section contains a brief narrative text summary and sometimes a figure describing the reaction, PubMed–linked references to published experimental data concerning the reaction, a table of reaction attributes, component molecules (input/output/catalyst), cellular location, and, if the reaction involves a catalyst, the GO molecular function term that describes its activity.
Each component molecule is hyperlinked to a separate 'molecule page' (Figure 4). This page indicates the molecule's cellular location, reference entity, and, if applicable, post–translational modification(s). Links to external databases and a list of all events in which that particular form of the molecule participates are also provided. The reference entity, in turn, is hyperlinked to a page that describes any relevant additional general features of the reference molecule. The reaction map on this page highlights all the reactions that involve any form of the reference entity (Figure 4). The reaction arrows are color coded to describe the role of the given molecule in the reaction: red indicates that the molecule functions as an input, green as an output, yellow as both input and output, blue as a catalyst and violet as both input and catalyst.
Each component complex is hyperlinked to a separate 'complex page'. The layout of the complex page (Figure 5) is similar to the molecule page. In addition, a diagram with a collapsible key provides a graphical view of the components in the complex. Each component is hyperlinked to its corresponding molecule page. A hierarchical view of the component molecules of the complex, with links to external databases, is provided along with a list of all events in which the complex participates. This event list is further categorized according to whether the complex is produced or consumed in the reaction or whether it functions as a catalyst.
Stable identifiers are associated with all events (reactions, pathways and regulatory events) and physical entities (molecules, complexes and compounds) to facilitate the tracking of data and revisions between releases of the knowledgebase. Stable identifiers have the format REACT_XXX.YYY, where XXX is the identifier number assigned to the entity or event and YYY is the version number. Stable identifiers are assigned to entities and events when they are released on the Reactome Database. If the annotations of an existing data object are revised, the version number of the object's identifier is increased by one. The stable identifier of an event or entity is displayed in the 'details' section on the event pages (Figure 3), hyperlinked to a history page for the identifier. This history page shows the version numbers of the entity or event since release 15 of the Reactome Database.
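The identifier format lends itself to simple parsing and version handling. The Python sketch below assumes that both XXX and YYY are decimal integers, which the text implies but does not state; the helper names and the example identifier `REACT_1234.2` are invented for illustration.

```python
import re
from typing import NamedTuple


class StableId(NamedTuple):
    number: int    # XXX: identifier number of the entity or event
    version: int   # YYY: version number

    def __str__(self) -> str:
        return f"REACT_{self.number}.{self.version}"


_PATTERN = re.compile(r"^REACT_(\d+)\.(\d+)$")


def parse_stable_id(text: str) -> StableId:
    """Split 'REACT_XXX.YYY' into its identifier and version numbers."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a stable identifier: {text!r}")
    return StableId(int(match.group(1)), int(match.group(2)))


def revise(stable_id: StableId) -> StableId:
    """A revised annotation keeps its number and increases its version by one."""
    return stable_id._replace(version=stable_id.version + 1)
```

For instance, `str(revise(parse_stable_id("REACT_1234.2")))` gives `REACT_1234.3`, while a string without a version part raises `ValueError`.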
Searching the Reactome Knowledgebase. There are several approaches to searching the Reactome Knowledgebase for information about specific molecules or events and to carrying out large–scale data analysis.
First, a 'simple search' for information on particular proteins, reactions or pathways can be carried out. Entering a word of interest in the simple search box, choosing a species (Homo sapiens by default) and hitting the return key retrieves a list of all data objects in the Reactome Database with names that contain the term. A simple search for 'CDC6', for example, returns 130 matches in 8 categories: 7 physical entities, 3 pathways, 8 reactions, and so forth (Figure 6a). Clicking on the number for a category yields a list of all the matched items in that category, each hyperlinked to the appropriate entity or event page. In addition, each reaction that involves the molecule is highlighted on the reaction map of the organism, providing a more integrated graphical view of the role of the molecule and its relationship with other molecules in the context of all annotated biological pathways (Figure 6b).
Figure 6b. The molecule page for the CDC6 reference entity highlights reactions (circled in red) in which this molecule participates. Mousing over a reaction arrow reveals its name.
The 'extended search' feature allows for a more detailed specification of the data object class in which to search, and allows searches that use multiple attributes. For example, one can search for all human reactions that produce pyruvate as an output (Figure 7a), identify all human catalytic reactions occurring in the nucleus or all pathways created by a particular author in a given year.
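Conceptually, an extended search is a filter over instances of one class in which every specified attribute must match. The Python sketch below illustrates this idea only; the record layout, attribute names and example reactions are hypothetical, and the real extended search runs against the Reactome database rather than an in-memory list.

```python
from typing import Any, Dict, List

REACTIONS: List[Dict[str, Any]] = [
    {"name": "reaction A (hypothetical)", "species": "Homo sapiens",
     "outputs": ["pyruvate", "ATP"], "location": "cytosol", "catalysed": True},
    {"name": "reaction B (hypothetical)", "species": "Homo sapiens",
     "outputs": ["mRNA"], "location": "nucleus", "catalysed": True},
    {"name": "reaction C (hypothetical)", "species": "Mus musculus",
     "outputs": ["pyruvate"], "location": "cytosol", "catalysed": False},
]


def extended_search(records: List[Dict[str, Any]],
                    **criteria: Any) -> List[Dict[str, Any]]:
    """Return the records for which every given attribute matches."""
    def matches(record: Dict[str, Any], key: str, wanted: Any) -> bool:
        value = record.get(key)
        if isinstance(value, list):      # multi-valued slot: membership test
            return wanted in value
        return value == wanted

    return [record for record in records
            if all(matches(record, key, wanted)
                   for key, wanted in criteria.items())]
```

A query such as `extended_search(REACTIONS, species="Homo sapiens", outputs="pyruvate")` corresponds to the first example in the text and returns only reaction A; adding `location="nucleus", catalysed=True` instead selects reaction B.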
Figure 7. The Extended Search (a) and Reactome Mart (b) queries.
A user who is interested in an overview of a particular pathway can download it in a simplified PDF textbook format that includes a list of component pathways and reactions, reaction diagrams, text descriptions, figures, names of authors and reviewers, and references.
Using the Reactome Knowledgebase for data mining and large–scale data analysis. The Reactome Knowledgebase can also be used for data mining and large–scale analysis of gene functions. 'Reactome Mart' uses the BioMart query–orientated data–management system (Durinck et al., 2005) to generate integrated queries across Reactome and other databases, including Ensembl and UniProt. Several pre–formatted (canned) queries (Figure 7b) are available in the menu bar of the Reactome Mart tool. Users can also define their own queries with the menus that are accessible by means of the highlighted terms and the boxes on the mart page. For example, a coupled search across the Reactome and Ensembl databases will retrieve a list of orthologues of the human proteins that are involved in a pathway of interest, or will identify the Affymetrix IDs that are associated with genes in the selected Reactome pathways.
The SkyPainter tool allows users to visualize the functional relationships between genes that are analysed in large–scale experiments. For example, a list of identifiers for genes that are coexpressed in a microarray analysis may be submitted by the user. Figure 8 shows reactions involving proteins that are associated with a human genetic disorder (in other words, those that have a UniProt identifier with a corresponding entry in the OMIM Morbid Map; OMIM, 2006). The reactions in the reaction map are colored according to the statistical likelihood that they would contain the listed genes by chance. This highlights those pathways in which the listed genes are over–represented.
The SkyPainter tool can also be used to identify genes that have different expression levels under different conditions. In this case, numeric values, such as expression levels derived from microarray experiments, can be entered in association with the gene identifiers in the list. In addition, SkyPainter can represent more complicated data sets, such as a time–course series, as an animated movie.
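The text does not specify which statistical test SkyPainter applies. One common way to estimate the likelihood that a pathway contains the listed genes by chance is a one-sided hypergeometric test, sketched below in Python as an assumption rather than as a description of SkyPainter's implementation; the function and parameter names are invented.

```python
from math import comb


def overrepresentation_p(total: int, in_pathway: int,
                         submitted: int, hits: int) -> float:
    """P(X >= hits) for a hypergeometric draw.

    total      -- number of annotated genes overall
    in_pathway -- how many of those belong to the pathway
    submitted  -- size of the user's gene list
    hits       -- submitted genes that fall in the pathway
    """
    upper = min(in_pathway, submitted)
    favourable = sum(
        comb(in_pathway, k) * comb(total - in_pathway, submitted - k)
        for k in range(hits, upper + 1)
    )
    return favourable / comb(total, submitted)
```

As a small worked case, if 3 of 10 annotated genes belong to a pathway and all 3 submitted genes fall in it, the probability of that happening by chance is 1/120; a small value of this kind would mark the pathway as over-represented in the submitted list.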
Finally, to support more systematic data mining, analysis and modeling based on Reactome content, individual reactions and pathways can be exported in SBML, BioPAX, Cytoscape and Protégé formats, and the entire data content of Reactome can be downloaded in SBML or BioPAX formats or as a MySQL database.
The Reactome Knowledgebase is an online, manually curated resource that provides an integrated view of the molecular details of biological processes that range from metabolism to DNA replication and repair to signaling cascades. Its data model allows these diverse processes to be represented in a consistent way to facilitate usage as online text and as a resource for data mining, modeling and analysis of large–scale expression data sets.
We are grateful to Geeta Joshi–Tope, Beth Nickerson and Marcela Tello–Ruiz who worked with us during the initial stages of the Reactome Project, and to the many scientists who collaborated with us as authors and reviewers to build the content on the knowledgebase. The development of the Reactome Knowledgebase is supported by a grant from the US National Institutes of Health (P41 HG003751), a grant from the European Union Sixth Framework Programme (LSHG-CT-2003-503269) and subcontracts from the NIH Cell Migration Consortium and the EBI Industry Programme. L.M. receives support from the Fondation pour la Recherche Médicale.
1. Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, New York 11724, USA
2. Centre d'Immunologie de Marseille-Luminy, INSERM/CNRS/Université de la Méditerranée, Case 906, 13288 Marseille Cedex 9, France
3. New York University School of Medicine, 550 First Avenue, New York 10016, USA
4. College of Pharmacy and Allied Health Professions, St John’s University, 8000 Utopia Parkway, Queens, New York 11439, USA
5. European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
6. Lawrence Berkeley National Laboratory, 1 Cyclotron Road 64R0121, Berkeley, California 94720, USA
Correspondence and requests for materials should be addressed to L.M. (Email: firstname.lastname@example.org).
- ASHBURNER, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29 (2000).
- WU, C. H. et al. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 34, D187–D191 (2006).
- DE MATOS, P. et al. ChEBI — Chemical Entities of Biological Interest. NAR Molecular Biology Database Collection http://www3.oup.co.uk/nar/database/summary/646 (2007).
- STOESSER, G. et al. The EMBL Nucleotide Sequence Database. Nucleic Acids Res. 28, 19–23 (2000).
- WANG, J. et al. Na+-dependent purine nucleoside transporter from human kidney: cloning and functional characterization. Am. J. Physiol. 273, F1058–F1065 (1997).
- LI, L., STOECKERT, C. J. & ROOS, D. S. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13, 2178–2189 (2003).
- DURINCK, S. et al. BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics 21, 3439–3440 (2005).
- Online Mendelian Inheritance in Man, OMIM. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD) and National Center for Biotechnology Information, National Library of Medicine (Bethesda, MD). http://www.ncbi.nlm.nih.gov/omim/ (2006).
© Matthews, L. et al. 2007. Published under a Creative Commons Attribution-Non-Commercial-Share Alike 3.0 Licence.