KEGG Primer: An Introduction to Pathway Analysis Using KEGG
KEGG, which stands for Kyoto Encyclopedia of Genes and Genomes, has become a major resource for pathway analysis and contains a wealth of data associated with pathways, genes, genomes, chemical compounds and reaction information, in addition to links to outside resources such as PubMed (Kanehisa et al., 2006). This primer will focus on the PATHWAY data resource in KEGG. We will describe how to browse and search the pathways that are currently available. We also discuss the KEGG Automatic Annotation Server (KAAS) service, which overlays genes onto pathways. Additionally, we introduce the KegArray tool for exploring the pathways for expressed genes based on microarray data.
Data representation in the pathway database
The pathways in KEGG are manually drawn and derived from textbooks, literature and expert knowledge. Genomic information is derived from publicly available resources — RefSeq data from the NCBI (National Center for Biotechnology Information) are a major resource — and is maintained in the KEGG GENES database. KEGG contains completely sequenced genomes in addition to draft genome sequences and EST (expressed sequence tag) contigs. All available genome sequences and their sources are listed at http://www.genome.jp/kegg/catalog/org_list.html. All protein sequences are annotated with KEGG Orthology (KO) IDs, a system that was developed on the basis of pathway maps and BRITE hierarchies; sequences that are both highly similar and involved in the same interaction in the same pathway are grouped together.
The components of the pathways in KEGG are defined at http://www.genome.jp/kegg/document/help_pathway.html. In brief, a rectangular box represents a gene product (in most cases, a protein, but sometimes an RNA molecule), a small circle represents a compound and a large oval represents a link to another pathway map. The relationships between gene products are represented by various arrows for molecular interactions, which may be labeled, for example, +p, -p, +g and +m for phosphorylation, dephosphorylation, glycosylation and methylation, respectively. Complexes are represented as multiple rectangular boxes that touch each other.
Browsing the pathway database
The pathway database consists of various biological processes that are divided into five groups: metabolism, genetic information processing, environmental information processing (including signal transduction), cellular processes and human diseases. The metabolism group is further subdivided into carbohydrates, energy, lipids, nucleotides, amino acids, glycans, polyketides/non-ribosomal peptides, cofactors/vitamins, secondary metabolites and xenobiotics. Recently, pathways covering drug development have been created, which describe the structural relationships of drugs. The full list of pathways can be browsed on the main PATHWAY page at http://www.genome.jp/kegg/pathway.html.
One of the newest additions to KEGG is the KEGG BRITE database, which consists of hierarchical classifications of genes, proteins, compounds, reactions, drugs, diseases, cells and organisms. The KO system is part of the KEGG BRITE database, as it contains classifications of orthologous genes, including orthologous relationships of paralogous gene groups that are based on pathway information. The KO system is also the basis for drawing KEGG PATHWAY maps (and generating the genes/proteins category of KEGG BRITE).
KEGG provides pathways in multiple formats, which include the KEGG reference pathway, and reference pathways derived according to KO, reaction or species. The pathway format may be selected by the pull-down menu at the top of the web page. The KEGG reference pathways are uncolored and provide a complete pathway map that is based on existing knowledge. If the reference pathways are viewed according to species, the species-specific genes are colored in green (Figure 1). Clicking on a green-colored enzyme will lead to the species-specific gene entry for that enzyme.
Another view of pathways is the KO view, which contains purple-colored boxes that correspond to genes that have been assigned to a KO group (Figure 2). Clicking on a purple-colored entry will lead to the corresponding KO entry page (Figure 3), which lists all of the orthologous genes from various species that correspond to that pathway entry. From the KO entry, one can click on any of the gene names for a specific organism to display the corresponding KEGG GENES entry (Figure 4).
In the reference metabolic pathways, the gene entries are hyperlinked to the corresponding enzyme entry pages and annotated by an enzyme commission (EC) number. The genes listed on this page correspond to those that are assigned the given EC number. By contrast, the 'Reference pathways (Reaction)' maps contain gene entries that are hyperlinked to the corresponding KEGG REACTION entry pages (Figure 5). These entry pages show the actual chemical reaction that takes place, as well as the chemical compounds that are linked to the corresponding KEGG COMPOUND pages. The KEGG REACTION entry also lists the 'Reactant Pair', which consists of RDM patterns, a classification system for chemical reactions. The RDM pattern is defined as KEGG atom type changes at the reaction center (R), the difference region (D), and the matched region (M) for each reactant pair.
Finally, the 'All organisms in KEGG' pathway map will display the pathway colored with purple and pink boxes (Figure 6). In this case, genes in either color are hyperlinked to their corresponding KO entry. A purple-colored entry has been assigned to at least one KEGG GENES entry, whereas a pink-colored entry corresponds to an Ortholog Table entry, which corresponds to a gene that is, for example, part of an operon and also part of a functional unit on the pathway (Aoki and Kanehisa, 2005). Ortholog Tables are in the process of being replaced by the concept of pathway modules (identified as M numbers), which is another category in KEGG BRITE that groups KO group numbers (or K numbers) based on function along the pathway.
Figure 1: Reference pathway with species-specific genes colored in green.
Figure 2: Reference pathway: purple-colored boxes correspond to genes that have been assigned to a KO group.
Figure 6: 'All organisms in KEGG' pathway map. A purple-colored entry has been assigned to at least one KEGG GENES entry, whereas a pink-colored entry corresponds to an Ortholog Table entry. Genes in either color are hyperlinked to their corresponding KO entry.
At the top of each pathway map there is a pink 'Select' button, which allows users to customize the list of species used to generate species-specific maps. Figure 7 shows the dialog box that is displayed when this button is clicked. The default is all genes from completely sequenced organisms. Other options are all draft genomes (DGENES) and all ESTs (EGENES). Any combination of these three gene groups may be selected. In addition, these selections can be restricted to eukaryotes, prokaryotes, animals, plants, bacteria, archaea or some other pre-defined subcategory. These pre-defined categories are available in this dialog box (Figure 8) and can also be found at http://www.genome.jp/kegg/catalog/org_list.html. The options may be overridden by entering a list of three-letter KEGG organism codes at the bottom of the 'Select organism' dialog box.
Figure 7: Dialog box to generate species-specific pathway maps.
Figure 8: Generation of species-specific pathway maps: Dialog box of pre-defined subcategory members.
Searching the pathway database
From the main KEGG PATHWAY page, the pathways can be searched for using the KEGG pathway ID or pathway name. The pathway ID may be either a reference pathway ID (beginning with 'map') or a species-specific pathway ID (beginning with the three-letter organism code). For example, the glycolysis reference pathway ID is map00010, whereas the human glycolysis pathway ID is hsa00010.
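The ID convention just described can be sketched in a few lines of Python. This is a minimal illustration, not part of KEGG itself: the function name `classify_pathway_id` is hypothetical, and the pattern (a three-letter prefix followed by five digits) is an assumption drawn only from the two examples in the text.

```python
import re

# Assumed pattern, based only on the examples above:
# a three-letter prefix ('map' or an organism code) plus five digits.
_PATHWAY_ID = re.compile(r"^([a-z]{3})(\d{5})$")


def classify_pathway_id(pathway_id):
    """Return (kind, prefix, number) for a pathway ID such as 'map00010'."""
    match = _PATHWAY_ID.match(pathway_id.strip().lower())
    if match is None:
        raise ValueError(f"not a recognizable pathway ID: {pathway_id!r}")
    prefix, number = match.groups()
    # 'map' marks a reference pathway; any other prefix is read as an organism code.
    kind = "reference" if prefix == "map" else "species-specific"
    return kind, prefix, number


print(classify_pathway_id("map00010"))
print(classify_pathway_id("hsa00010"))
```

Both example IDs share the number 00010, which is what ties the human map to the glycolysis reference map.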
To search for a specific gene, the gene ID prefixed by the species code should be given in the search field of either the KEGG table of contents (http://www.genome.jp/kegg/kegg2.html) or the KEGG GENES main page (http://www.genome.jp/kegg/genes.html). In either case, the GENES database contains the gene data to be searched. For example, to search for the lacZ-related gene in yeast, enter 'sce:YNL139C' in the text search box on the main pathway page. Providing a portion of the gene ID will list all gene IDs that contain the given text in yeast (for example, 'NL13' will find 12 hits including YNL139C). Once a gene entry is found, clicking on its corresponding EC number will display the enzyme information and the pathway(s) in which it is involved.
To search the KEGG resources globally, a text search is available on the KEGG Table of Contents page. For example, gene names (such as 'lacZ'), enzyme names (such as 'oxidoreductase'), chemical compounds and pathway names may be entered here. All entries containing those terms will be returned. Specific KEGG entries may be found by prefixing the ID with the code for the corresponding KEGG database. For example, the pathway for glycolysis can be searched by entering 'path:map00010' and KO group 00010 can be found by entering 'ko:k00010'. A description of all of the KEGG identifiers is provided at http://www.genome.jp/dbget/.
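The prefixed identifiers used in these searches share one shape: a database or organism code, a colon, then the entry ID. The following sketch splits such strings; the helper name `split_kegg_identifier` is hypothetical, and lowercasing the prefix is an assumption made here for convenience rather than a documented KEGG rule.

```python
def split_kegg_identifier(identifier):
    """Split 'prefix:entry' into (prefix, entry)."""
    prefix, sep, entry = identifier.strip().partition(":")
    if not sep or not prefix or not entry:
        raise ValueError(f"expected 'prefix:entry', got {identifier!r}")
    # Assumption: prefixes are treated as case-insensitive and normalized to lowercase.
    return prefix.lower(), entry


for example in ("path:map00010", "ko:k00010", "sce:YNL139C"):
    print(split_kegg_identifier(example))
```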
Using the KAAS system for automatically generating pathway maps
KAAS is available at http://www.genome.jp/kegg/kaas/ (Moriya et al., 2007). KAAS can automatically reconstruct pathways using a set of amino-acid query sequences derived from a complete genome. It is also possible to use nucleotide sequences representing a set of ESTs or EST contigs. In either case, query sequences should be in the FASTA format. The query sequences are compared against the existing genes in KEGG using BLASTP for protein sequences and BLASTX and TBLASTN for nucleotide sequences. Those that are most similar to existing genes are then mapped onto the existing pathways. The results are the pathways onto which the query genes were mapped, colored accordingly. In practice, this server is used to annotate genes from draft genome sequences.
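Before submitting sequences, it can help to check that a file really is FASTA and whether it holds protein or nucleotide data, since that determines which BLAST programs apply. The sketch below is a local pre-check only and does not reproduce what KAAS does internally; the function names, the sample sequences and the 90% cutoff for calling a sequence nucleotide are all assumptions for illustration.

```python
SAMPLE = """>query1 hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>query2
ATGGCGTACGTTAGC
"""


def read_fasta(text):
    """Parse FASTA text into {name: sequence}, joining wrapped sequence lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            # Use the first word of the header as the name; invent one if it is empty.
            name = parts[0] if parts else f"seq{len(records) + 1}"
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def looks_like_nucleotide(sequence, cutoff=0.9):
    """Heuristic (assumed cutoff): mostly A/C/G/T/U/N means nucleotide."""
    seq = sequence.upper()
    if not seq:
        return False
    nucleotide_chars = sum(seq.count(c) for c in "ACGTUN")
    return nucleotide_chars / len(seq) >= cutoff


def blast_programs(sequence):
    """Programs named in the text for each sequence type."""
    return ["BLASTX", "TBLASTN"] if looks_like_nucleotide(sequence) else ["BLASTP"]


records = read_fasta(SAMPLE)
for name, seq in records.items():
    print(name, blast_programs(seq))
```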
Figure 9 illustrates how the results of a KAAS query can be analyzed. First, the genes in the input are each assigned a K number that corresponds to the KO group to which it belongs. These K numbers can then be used as the basis for pathway mapping to reconstruct the pathways given the input genes, or KEGG BRITE mapping, to investigate the annotations that correspond to the given K numbers. In Figure 9, the list of K numbers can be viewed as proteins or ligands, as they are categorized together under both groups.
Figure 9: KEGG Automatic Annotation Server (KAAS) for overlaying genes onto pathways.
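The two-step logic described above (genes receive K numbers, and K numbers are then mapped to pathways) can be sketched with plain dictionaries. Every identifier below is a made-up placeholder, not a real KEGG entry, and `reconstruct_pathways` is a hypothetical helper rather than anything KAAS exposes.

```python
from collections import defaultdict

# Placeholder data: which K number each query gene was assigned ...
gene_to_k = {
    "query_gene_1": "K_demo1",
    "query_gene_2": "K_demo2",
    "query_gene_3": "K_demo1",
}
# ... and which pathway maps each K number appears on.
k_to_pathways = {
    "K_demo1": ["map_demoA", "map_demoB"],
    "K_demo2": ["map_demoB"],
}


def reconstruct_pathways(gene_to_k, k_to_pathways):
    """Group query genes by the pathways their K numbers belong to."""
    pathway_hits = defaultdict(set)
    for gene, k_number in gene_to_k.items():
        # K numbers without any pathway simply contribute nothing.
        for pathway in k_to_pathways.get(k_number, ()):
            pathway_hits[pathway].add(gene)
    return {pathway: sorted(genes) for pathway, genes in pathway_hits.items()}


pathways = reconstruct_pathways(gene_to_k, k_to_pathways)
for pathway, genes in sorted(pathways.items()):
    print(pathway, genes)
```

The same K-number list could be looked up in a second table to drive BRITE mapping instead of pathway mapping.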
The computation time for comparing the query sequences against those in KEGG is proportional to the number of sequences against which the query sequences are being compared (the target sequences). Thus, it is recommended that, if possible, the scope of the target sequences be limited to those that are closely related to the query sequences.
The input sequences are assigned to existing genes based on a sequence similarity criterion that the user can define. For the most accurate results, the bidirectional best-hit (BBH) method is recommended, although its computation time is twice that of the single-directional best-hit (SBH) method. BBH compares a query sequence A1 against the target sequences to retrieve the top hit, B1. B1 is then used as a query against all of the query sequences; if it retrieves A1 as the top hit, then A1 and B1 are BBHs. It is possible, however, that another sequence from the query set, A2, is the top hit for B1. In this case, A1 and B1 cannot be considered BBHs, and thus A1 would not be assigned to B1, although B1 would still be considered an SBH of A1. Figure 10 describes the concepts of the BBH and SBH methods. In this figure, A1 and B1 are BBHs, whereas A2 and B2 are not: B2 is the SBH for A2 but, in contrast, A4 is the SBH for B2.
Figure 10: Sequence similarity methods for KAAS: Bidirectional best-hit (BBH) and single-directional best-hit (SBH).
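The BBH and SBH definitions can be made concrete with a toy example. The sketch below is not the KAAS implementation: the similarity scores are invented, and it assumes a single score table serves both search directions, whereas a real run performs two separate BLAST searches whose scores may differ.

```python
# Invented similarity scores for (query, target) pairs.
SCORES = {
    ("A1", "B1"): 90, ("A1", "B2"): 10,
    ("A2", "B1"): 20, ("A2", "B2"): 60,
    ("A4", "B1"): 15, ("A4", "B2"): 80,
}


def classify_hits(queries, targets, scores):
    """For each query, return (top target, 'BBH' or 'SBH')."""
    result = {}
    for query in queries:
        # Forward search: best target for this query.
        top_target = max(targets, key=lambda t: scores[(query, t)])
        # Reverse search: best query for that target.
        top_query = max(queries, key=lambda q: scores[(q, top_target)])
        # Bidirectional only if the reverse search returns the original query.
        result[query] = (top_target, "BBH" if top_query == query else "SBH")
    return result


hits = classify_hits(["A1", "A2", "A4"], ["B1", "B2"], SCORES)
for query, (target, kind) in hits.items():
    print(query, target, kind)
```

With these scores, A2 retrieves B2 but B2 retrieves A4, so B2 is only an SBH for A2, mirroring the situation described for Figure 10.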
KEGG provides the pathway-coloring functionality, which is not limited to KAAS. Users may color pathways to their liking given a set of genes or chemical compounds and corresponding colors. The pathway-coloring tool is available at http://www.genome.jp/kegg/tool/color_pathway.html. Reference maps may be colored such that only those items in the input set are colored, or species-specific pathways may be additionally colored by differentially coloring the input items from the default shade (used to identify the species-specific genes). Examples are provided on the input page.
Figure 11 is a screenshot of the pathway-coloring tool at http://www.genome.jp/kegg/tool/color_pathway.html. The 'Search against:' pull-down menu provides the list of pathways that can be colored. In the 'Enter objects' text area, if a species is selected, the gene ID(s) and/or EC number(s) and/or KEGG ID(s) can be entered. Each object should be listed on its own line. For each object, the background color (bgcolor) and foreground (text) color (fgcolor) can be optionally specified on the same line. Alternatively, this object-coloring information may be supplied in an ASCII text file, selected by browsing for the file name. The file should be formatted in the same manner as the input to the text area. The default background color for those objects with an unspecified background color is pink. The default background color can also be modified at the bottom. The 'Genes bgcolor:' field can be used to modify the default background color for species-specific pathways from green to another color. Finally, the checkbox at the bottom specifies whether or not to display any warning messages in the results for those objects that were listed in the query but could not be found in any pathways.
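An object list of this kind is easy to generate from a script. The sketch below builds one line per object; the exact layout (object, a space, then 'bgcolor,fgcolor') is an assumption, since the text above states only that both colors go on the same line as the object, and the object names are placeholders. The examples on the tool's input page should be treated as the authority.

```python
def format_color_lines(entries):
    """Build the text for an object list.

    entries: iterable of (object_id, bgcolor, fgcolor); colors may be None.
    Assumed layout per line: 'object', 'object bgcolor' or 'object bgcolor,fgcolor'.
    """
    lines = []
    for object_id, bgcolor, fgcolor in entries:
        if fgcolor and not bgcolor:
            # The assumed layout cannot express a text color without a background.
            raise ValueError(f"{object_id}: fgcolor given without bgcolor")
        if bgcolor and fgcolor:
            lines.append(f"{object_id} {bgcolor},{fgcolor}")
        elif bgcolor:
            lines.append(f"{object_id} {bgcolor}")
        else:
            lines.append(object_id)  # falls back to the tool's default color
    return "\n".join(lines)


color_text = format_color_lines([
    ("geneA", "red", "white"),   # placeholder gene IDs
    ("geneB", "yellow", None),
    ("1.1.1.1", None, None),     # an EC number with default coloring
])
print(color_text)
```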
Figure 12 is a screenshot of the results of coloring a pathway. All the pathways that were found to contain the input genes are listed. If any input genes could not be found in any pathways, a red warning message would be displayed at the top of this list. By clicking on the name of a pathway, a colored pathway such as in Figure 13 would be displayed. In this case, the species-specific genes were colored in yellow, and the genes in the input list were colored as specified. In this way, pathways can be colored for a number of purposes, such as for the visualization of microarray expression data overlaid on the pathway or for genes that are related in a genomic context.
KegArray for microarray data analysis on pathways
The KEGG pathways can be used to analyze microarray expression data by coloring upregulation and downregulation according to the array results. The KEGG EXPRESSION database at http://www.genome.jp/kegg/expression/ provides precomputed microarray expression data that can be mapped onto the pathways using the KegArray tool. For example, from the KEGG EXPRESSION main page, clicking on 'List of experimental data available' will display the hierarchy of available microarray data. When one particular data set is selected, the option to use KegArray will be available at the bottom. As an example, select the data set 'ex0000012' under Synechocystis sp. PCC6803 by Suzuki et al. Under the 'Options' list there is the option to launch KegArray. Once launched, as in Figure 14, there will be the option to view the pathways that contain the differentially expressed genes. In this example, the pathway map for purine metabolism will be listed. Clicking on the name of the pathway will display a map with colored enzymes: red for downregulated genes, green for upregulated genes, yellow for no difference in regulation and grey for no regulation, as in Figure 15.
Figure 14: KegArray tool for microarray data analysis on pathways.
Figure 15: KegArray result: Pathway is colored for differentially regulated genes.
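The color scheme described for the KegArray result can be written as a small lookup. This is a sketch under stated assumptions, not KegArray's actual logic: expression is taken to be a log2 ratio, the threshold of 1.0 is arbitrary, and "no regulation" is interpreted here as a gene with no measurement.

```python
def expression_color(log2_ratio, threshold=1.0):
    """Map an expression log2 ratio to the colors named in the text."""
    if log2_ratio is None:
        return "grey"      # assumption: no measurement available
    if log2_ratio >= threshold:
        return "green"     # upregulated
    if log2_ratio <= -threshold:
        return "red"       # downregulated
    return "yellow"        # no difference in regulation


for value in (2.3, -1.7, 0.2, None):
    print(value, expression_color(value))
```

The output of such a function could be paired with gene IDs and passed to the pathway-coloring tool to reproduce a similar view for custom data.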
For viewing expression data that are not found in the KEGG EXPRESSION database, the KegArray tool can be downloaded from http://www.genome.jp/download/. Here, other downloadable tools such as KegHier for browsing hierarchical text files and KegDraw for drawing and querying compound and glycan structures are also available. All of these tools can access the KEGG resources from the client's computer via an internet connection. Thus, it is possible to analyze custom array data and map their expression onto the KEGG PATHWAY maps to visualize the expression patterns.
KGML for pathway data exchange
KGML, which stands for KEGG Markup Language, supports the BioPAX Level 1 format, such that all the metabolic pathways can be easily transferred between systems that support BioPAX. The BioPAX version of KGML can be downloaded from ftp://ftp.genome.jp:/pub/db/community/biopax. (Further information on BioPAX, a standard exchange format for pathway data, is available at http://www.biopax.org/.) In KGML, pathways are specified as graph objects, with the entry elements as their nodes and the relation and reaction elements as their edges. The relation elements indicate the connection patterns of rectangles (gene products) and the reaction elements indicate the connection patterns of circles (compounds) in the KEGG pathways. All of the metabolic and regulatory pathways in KEGG are available for download as KGML files at ftp://ftp.genome.jp/pub/kegg/xml/.
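The graph view of KGML can be illustrated by reading a file into nodes and edges with Python's standard XML parser. The fragment below is hand-written with placeholder names, not a real KEGG file, and the attribute names (`id`, `name`, `entry1`, `entry2`, `subtype`, `substrate`, `product`) are assumptions about the KGML layout that should be checked against the KGML documentation.

```python
import xml.etree.ElementTree as ET

# Hand-written, hypothetical KGML-like fragment for illustration only.
KGML_SAMPLE = """<pathway name="path:demo00000" org="demo" number="00000">
  <entry id="1" name="demo:geneA" type="gene"/>
  <entry id="2" name="demo:geneB" type="gene"/>
  <entry id="3" name="cpd:demoC1" type="compound"/>
  <entry id="4" name="cpd:demoC2" type="compound"/>
  <relation entry1="1" entry2="2" type="PPrel">
    <subtype name="phosphorylation" value="+p"/>
  </relation>
  <reaction name="rn:demoR1" type="irreversible">
    <substrate name="cpd:demoC1"/>
    <product name="cpd:demoC2"/>
  </reaction>
</pathway>
"""


def kgml_to_graph(kgml_text):
    """Return (nodes, edges): nodes map entry id to name, edges are (source, target, labels)."""
    root = ET.fromstring(kgml_text)
    nodes = {entry.get("id"): entry.get("name") for entry in root.findall("entry")}
    edges = []
    # relation elements connect gene products (the rectangles on the map).
    for relation in root.findall("relation"):
        labels = [subtype.get("value") for subtype in relation.findall("subtype")]
        edges.append((nodes[relation.get("entry1")], nodes[relation.get("entry2")], labels))
    # reaction elements connect compounds (the circles on the map).
    for reaction in root.findall("reaction"):
        for substrate in reaction.findall("substrate"):
            for product in reaction.findall("product"):
                edges.append((substrate.get("name"), product.get("name"), [reaction.get("name")]))
    return nodes, edges


nodes, edges = kgml_to_graph(KGML_SAMPLE)
print(nodes)
print(edges)
```

The resulting edge list can be handed to any graph library for layout or path queries.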
This primer mainly focuses on the KEGG PATHWAY database. However, this is only one database among the five main knowledgebases in KEGG. The KEGG resources are the result of continuous modifications and additions, with new pathways being added and novel tools being developed on a regular basis. User feedback is also often incorporated to provide the most recent and biochemically accurate information. As progress continues to be made in the fields of bioinformatics and genomics research, KEGG will continue to develop.
1. Dept. of Bioinformatics, Faculty of Engineering, Soka University
2. Bioinformatics Center, Institute for Chemical Research, Kyoto University
- KANEHISA, M. et al. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 34, D354–D357 (2006).
- AOKI, K. F. & KANEHISA, M. Using the KEGG database resource, chapter 1.12 in Current Protocols in Bioinformatics, Current Protocols (Miranker, L., series ed.) (John Wiley & Sons, August, 2005).
- MORIYA, Y. et al. KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res 35, W182–W185 (2007).
© Aoki-Kinoshita, K.F. et al. 2007. Published under a Creative Commons Attribution-Non-Commercial-Share Alike 3.0 Licence.