An Introduction to COSMIC, the Catalogue of Somatic Mutations in Cancer
COSMIC (Catalogue of Somatic Mutations in Cancer; http://www.sanger.ac.uk/cosmic) is a comprehensive resource that aims to curate the world's literature on somatic mutations in known cancer genes. The catalog includes full and up-to-date curation of mutation data in over 60 well known point-mutated genes, together with novel gene fusion products expressed across genome rearrangement breakpoints, and all of the somatic mutation data from candidate gene screens at the UK's Cancer Genome Project. This primer introduces the COSMIC web system and how it may be used by cancer researchers, together with downloads available for data mining by bioinformaticians.
COSMIC (Catalogue of Somatic Mutations in Cancer) was conceived out of the need to combine cancer mutation data from a variety of disorganized and distributed sources, most notably the scientific literature, which is not accurately searchable by any aggregate or automated methods. The vast majority of available online resources, though sometimes extensive (for example, the International Agency for Research on Cancer's p53 database; Petitjean et al., 2007), do not provide genome-wide mutation information and comprise mainly single-locus databases. In particular, estimates of prevalence are difficult to obtain from most of these resources. Larger online resources, such as the Human Gene Mutation Database (Cooper et al., 2005) and Online Mendelian Inheritance in Man (Hamosh et al., 2002), store minimal information, usually only on high-frequency mutant alleles, thus losing much contextual detail. COSMIC avoids most of these drawbacks by curating as much detail as it is possible to extract from targeted literature. These data are presented in tandem with somatic mutation screening derived from ongoing Cancer Genome Project (CGP) studies. Large data sets are thus made available for deep data mining while maintaining sample sizes that allow good statistical significance.
The CGP maintains the Cancer Gene Census (http://www.sanger.ac.uk/genetics/CGP/Census/; Futreal et al., 2004), a listing of all genes that are proven to be involved in cancer through structural mutations. This resource, which describes almost 400 genes, has led the curation efforts of the COSMIC project. Genes with small somatic intragenic mutations have been prioritized in order of the amount of mutation data available (beginning with the enormous KRAS gene dataset), and the curation of these is nearing completion (with the notable exception of TP53, which is independently curated in the International Agency for Research on Cancer's TP53 database; Petitjean et al., 2007). COSMIC is also committed to updating these genes as new articles are published. In 2007, curation of cancer genes derived from translocation and gene fusion events began. In combination with these curated data, COSMIC has added the results of the CGP, which is examining a range of tumors through 4,000 candidate genes. Whereas the curated data focus on high-mutation-frequency genes, the CGP data present a much wider spread of analysis. The COSMIC product of this effort is a system that ties mutant genes to cancer phenotypes and can be examined from a genetic or a phenotypic viewpoint. COSMIC presents mutation spectra and mutation frequency statistics while allowing detailed examination of any individual data point.
Data representation in COSMIC
Fundamentally, COSMIC represents the analyses of samples through genes containing somatic, putatively oncogenic mutations. These analyses are drawn either from published manuscripts or from CGP studies. Three core data classes, detailed as follows, are described in depth to ensure the website is as informative as possible.
Samples. In COSMIC, every sample must have a name (by which it is primarily identified) and a cancer phenotype. Samples are aggregated by name only when there is genotypic evidence that two reports for a sample are examining the same entity. Otherwise, the sample receives a new database entry; this usually occurs for very simple sample names or common cancer cell lines such as PC-3, which has 36 entries. For the sample entry to be meaningful, its phenotype must be described and navigated in a standardized fashion. The COSMIC project has developed its own tumor classification system to which all published data are translated during curation, so that the broadest range of cancer phenotypes may be encompassed. This system breaks the tumor classification into tissue site information (such as brain or lung) and histology (such as sarcoma, adenocarcinoma or non-small-cell lung cancer). The most precise description of a tumor may not always have an immediately intuitive translation; for instance, a hemangioblastoma is often described in the literature as a tumor of the brain, but COSMIC reclassifies it as a soft-tissue and blood-vessel tumor that occurs in the brain (simply using 'brain' as a search term will also retrieve these records). In addition to the classification of the tumor itself, its source is recorded, so it is possible to differentiate not only between cell lines and primary tumors but also between surgery and autopsy samples, frozen and fixed samples, and samples collected for testing (such as urine or stool).
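The split between tissue site and histology can be pictured as a small record type. The sketch below is illustrative only: the field names, the sample names and the glioma entry are invented for this example and are not taken from the COSMIC schema. It simply shows how a hemangioblastoma classified under soft tissue could still be retrieved by a search for 'brain'.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phenotype:
    """Hypothetical phenotype record; field names are assumptions."""
    primary_site: str      # tissue class assigned during curation
    histology: str         # tumor type
    anatomical_site: str   # where the tumor occurred, as reported


def matches_site(phenotype: Phenotype, term: str) -> bool:
    """True if the search term names either the curated or the reported site."""
    term = term.lower()
    return term in (phenotype.primary_site.lower(),
                    phenotype.anatomical_site.lower())


# Invented samples for illustration.
samples = {
    "sample_A": Phenotype("soft tissue", "hemangioblastoma", "brain"),
    "sample_B": Phenotype("lung", "adenocarcinoma", "lung"),
    "sample_C": Phenotype("brain", "glioma", "brain"),
}

hits = sorted(name for name, p in samples.items() if matches_site(p, "brain"))
print(hits)
```

Keeping the curated class and the reported location as separate fields is one way to let a broad search term reach records whose precise classification differs from the term used in the literature.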
Genes. A gene in COSMIC refers to its cDNA coding sequence only. The exceptions to this rule are genes involved in fusion rearrangements, where sequences from the untranslated regions (UTRs) may be used to recreate the published novel mRNA sequence. For each gene, a reference transcript is assigned, usually either the one favored in the literature or (for CGP studies) the longest of the selections available at the time. Ensembl representations of these transcripts have recently been used (http://www.ensembl.org) to facilitate the derivation of each mutation's genomic coordinates. All mutations for a given gene are mapped from all publications to this reference transcript, with coding domain sequence (CDS) coordinates reassigned where necessary (and possible, where mRNA sequences align). A gene's HUGO name is its primary identifier. Name changes are reflected in COSMIC in the release after they appear in the HUGO Gene Nomenclature Committee Gene database. Synonyms and old names are also retained so they can be searched as gene names change.
Mutations. Small intragenic mutations form the bulk of the COSMIC database, with the majority being point mutations. Substitutions are defined as single-base pair (bp) events, whereas small deletions range from 1-bp to whole-gene losses, insertions are rarely longer than 50 bp, and complex insertion, deletion or substitution mutations are usually less than 20 bp. These small mutations are positioned on the gene's CDS using the first adenine in the first methionine codon as position 1. Over 50,000 of these small mutations are currently detailed in the system (version 36, March 2008).
Fusion genes have recently been added to COSMIC, describing novel mRNAs transcribed across breakpoints of genome rearrangements. These data are described in two different ways, as gene expression across a genomic breakpoint can result in multiple different transcripts. For each publication, the range of transcripts identified in a sample is recorded in COSMIC as experimental evidence ('observed mRNAs'). To simplify these complex data for easy navigation, we have also deduced the approximate position of the genomic breakpoint from this mRNA repertoire (the 'inferred breakpoint'). This relies on the assumption that the breakpoint exists between the 3'-most expressed exon of the 5' gene partner and the 5'-most exon of the 3' gene partner from the observed mRNA repertoire reported in that sample. In Figure 1a, for example, three mRNAs were observed in a single sample. The first eight to ten exons of the 5' gene partner (red) were observed in at least one mRNA, so the tenth exon must have been retained before the breakpoint; similarly, the last three or four exons of the 3' gene partner (blue) were observed in mRNAs in the same sample, so all four exons of the 3' gene partner must be included in the fusion gene. The breakpoint is therefore inferred to lie between the closest observed exons of each partner gene (Figure 1b).
Fusion gene description. Across the breakpoint of a genomic rearrangement, if a fusion gene is created, a number of transcripts can be expressed and detected by RT-PCR.
Fusion gene description. An approximate position of the relevant breakpoint can be inferred from this mutant mRNA repertoire, using the location of the 3'-most exon of the 5' gene partner and the 5'-most exon of the 3' fusion partner—in this case, between exons 10 and 11 of EWSR1 (red) and exons 2 and 3 of ATF1 (blue). The wild-type structure of the 5' partner is shown in red above the fusion; similarly, the wild-type 3' partner is shown in blue beneath it.
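The inference rule described above amounts to taking the 3'-most exon seen for the 5' partner and the 5'-most exon seen for the 3' partner. A minimal sketch follows, assuming each observed mRNA is reduced to a pair of exon numbers; the function name and the exon numbers of the individual transcripts are hypothetical, chosen so that the result matches the EWSR1/ATF1 example in Figure 1b.

```python
def infer_breakpoint(observed_mrnas):
    """Infer the breakpoint from the observed mRNA repertoire of one sample.

    Each observed mRNA is a tuple:
        (last exon of the 5' partner present, first exon of the 3' partner present)
    Returns the 3'-most exon of the 5' partner and the 5'-most exon of the
    3' partner; the breakpoint is assumed to lie just beyond these exons.
    """
    last_exon_5p = max(mrna[0] for mrna in observed_mrnas)
    first_exon_3p = min(mrna[1] for mrna in observed_mrnas)
    return last_exon_5p, first_exon_3p


# Three transcripts observed in a single sample (illustrative exon numbers).
observed = [(8, 3), (9, 4), (10, 3)]
last_5p, first_3p = infer_breakpoint(observed)

print(f"5' partner: breakpoint between exons {last_5p} and {last_5p + 1}")
print(f"3' partner: breakpoint between exons {first_3p - 1} and {first_3p}")
```

With these inputs the sketch places the breakpoint between exons 10 and 11 of the 5' partner and between exons 2 and 3 of the 3' partner.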
The Human Genome Variation Society mutation nomenclature system has been adopted for the concise and precise description of each sequence change. Each small mutation has a 'c'-prefixed annotation to describe the change in the cDNA (CDS) sequence and a 'p'-prefixed annotation to describe the (usually predicted) change in the peptide translation of the cDNA sequence. Some of these descriptions can become rather complex; a definition and discussion of these recommendations are available at http://www.hgvs.org/mutnomen/. Fusion mutations have similar concise descriptions, although necessarily more complex, as these describe the content of the novel mRNA in terms of portions of their respective wild-type GenBank mRNAs. Because UTR sequences are often included in the transcribed product, an 'r'-prefixed syntax has been adopted, indicating the annotation is to an mRNA and not a cDNA or peptide sequence.
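Because CDS position 1 is the adenine of the initiating methionine codon, the codon affected by a 'c'-prefixed substitution follows directly from its position. The sketch below parses only simple single-base substitutions written with the HGVS '>' symbol; the function name and the returned dictionary keys are assumptions made for this example, and deletions, insertions and 'r'-prefixed fusion descriptions are deliberately not handled.

```python
import re

# Matches simple substitutions such as "c.35G>A"; nothing more complex.
SUBSTITUTION = re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$")


def parse_cds_substitution(description: str) -> dict:
    """Split a c.-prefixed substitution and locate it within its codon."""
    match = SUBSTITUTION.match(description)
    if match is None:
        raise ValueError(f"not a simple CDS substitution: {description!r}")
    position = int(match.group(1))
    return {
        "position": position,                       # 1-based CDS coordinate
        "ref": match.group(2),
        "alt": match.group(3),
        "codon": (position - 1) // 3 + 1,           # 1-based codon number
        "base_in_codon": (position - 1) % 3 + 1,    # 1, 2 or 3
    }


kras = parse_cds_substitution("c.35G>A")
jak2 = parse_cds_substitution("c.1849G>T")
print(kras["codon"], kras["base_in_codon"])
print(jak2["codon"], jak2["base_in_codon"])
```

The two inputs are the KRAS and JAK2 substitutions mentioned elsewhere in this primer; the computed codons, 12 and 617, agree with their peptide-level descriptions p.G12D and p.V617F.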
These core data classes, together with much additional information, are stored in an Oracle database schema, currently using over 60 normalized tables that are reduced into a small number of warehouse tables for each public release. The normalized schema for the system is available (together with full per-release database exports) on the COSMIC FTP site at ftp://ftp.sanger.ac.uk/pub/CGP/cosmic/oracle_export/cosmic_export_feb_2006.pdf.
Use case for scientific and clinical researchers
COSMIC can be browsed in a gene-centric or phenotype-centric fashion. Entry into the COSMIC system usually begins with the selection of a gene name or tissue type. A summary page showing brief details of mutation frequencies and spread follows, leading to deeper navigation of the data. At every stage, the data being viewed can be further specialized, zoomed in or redefined. In addition to COSMIC's graphical overviews, export functions are available so that the data can be examined offline. Figure 2 summarizes the core work flow of the COSMIC system.
Work flow in COSMIC. Work flow in COSMIC is not linear, and continuous rounds of query specialization or generalization are possible. Navigation between the main Web pages is shown.
COSMIC searches usually begin at the homepage (Figure 3; http://www.sanger.ac.uk/cosmic/). Although 'Browse by' options are available, the easiest way into the system is to type a gene name, tissue or cancer type into the 'Text Search' box. For a gene search, a link to the gene summary page should be the top link in the list returned (followed by a listing of mutations in this gene); this contains a summary of its mutations together with a series of links to various views of its sequence and structure (Figure 4). If a tissue or cancer phenotype is used as a search term, the top link should be to the tissue summary page (Figure 5), detailing which genes have been examined in the tissue and which were mutated; the five genes with the highest mutation frequencies are further detailed, showing sample and mutant counts.
The COSMIC homepage. This screen shot was taken from version 36 (March 2008 release). Entry into the system is usually through a search term typed into the 'Text Search' box or by browsing by gene or tissue.
Gene overview page for PTEN. A search for a gene returns the gene overview page, which displays mutation spread, external links and details of the data's origin.
'Tissue Overview for Lung' page. A search for a tissue or specialized phenotype returns a listing of each gene examined in the tissue, with mutation rates detailed for the five genes with the highest mutation frequencies.
Both the gene and tissue summary pages provide links to COSMIC's histogram page. The central hub of the system, this page displays the spread of mutations across the gene sequence as a histogram above the scale bar for point mutations, and as a series of icons below the scale bar for deletions (blue triangles), insertions (red triangles) and complex replacements (short bars). Below the histogram graphic, the 'Details' button, once clicked, displays a table of the mutation frequencies for this gene in a range of tissues, and the 'Mutations' button changes the table to show each unique sequence change observed and a count of how many times it was found. This zoomed-out histogram graphic provides immediate views of the gene's oncogenic characteristics. For instance, 99.6% of oncogenic JAK2 mutations (Figure 6a) are caused by a very specific alteration at one nucleotide, a clear gain of function characteristic of a dominant oncogene. The histogram for PTEN (Figure 6b), however, displays a range of mutation types across the gene length with a high rate of truncating (loss-of-function) mutations, clear evidence of a tumor suppressor gene.
Histogram page for JAK2. The histogram page contains most of the detail on gene mutations. The histogram itself describes the frequency across a gene's coding domain sequence of single-base substitutions, beneath which are indications of the position of complex replacements, deletions (blue triangles) and insertions (red triangles). Under this, domain positions from InterPro, Pfam and UniProt are indicated. Options to change the graphical view are also available. Under the graphic, the data are tabulated, displaying either mutation count and frequencies by tissue type (obtained by clicking the 'Details' button) or each individual mutation description with its number of occurrences (obtained by clicking the 'Mutations' button). In this example, the histogram for JAK2 clearly indicates a gain-of-function mutation.
Histogram page for PTEN. In contrast to JAK2, the tumor suppressor gene PTEN shows a spread of mutations across its coding domain sequence.
By default, the histogram graphic is calculated from the amino acid data using coordinates and sequences along the length of the translated CDS. An option is available to change this view to nucleotides so that the exact cDNA changes can be examined. Choosing this option also alters the Mutations table, which will now display the mutation list using nucleotide sequences and coordinates on the CDS. It is also possible to zoom in on the histogram graphic. Clicking on a histogram peak or an 'Indel' icon creates a small popup providing zoom options; choosing one of these produces a view such as that in Figure 7a (showing that the gain-of-function mutation in JAK2 is a G-to-T mutation at nucleotide 1,849; p.V617F) or Figure 7b (showing the mutation range around the peak mutation position in PTEN). Once the graphic is zoomed into a region, the two tables below it are also altered; the Details table is regenerated so that mutation frequencies are only calculated using mutations between the zoom boundaries, and the Mutations table only details sequence changes between these specified boundaries. The 'Zoom Out' tab above the histogram returns the graphic to the original gene-wide overview.
Zoomed-in view of Figure 6a. This view, focused on histogram peaks, is switched to nucleotide view to show the changes to the cDNA within the sequence context of the change.
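The histogram and its zoomed view can be thought of as the same tally restricted to a coordinate window. The following sketch assumes the mutations are available as a plain list of positions; the positions are invented for a hypothetical gene, and the function is not part of any COSMIC interface.

```python
from collections import Counter


def histogram(positions, start=None, end=None):
    """Count mutations per position, optionally within inclusive zoom boundaries."""
    return Counter(
        p for p in positions
        if (start is None or p >= start) and (end is None or p <= end)
    )


# Invented amino acid positions of substitutions in a hypothetical gene.
positions = [617] * 5 + [130, 130, 233, 17]

full = histogram(positions)               # gene-wide overview
zoomed = histogram(positions, 600, 650)   # after zooming in on the peak

print(full.most_common(1))
print(sum(full.values()), sum(zoomed.values()))
```

Any table derived from the zoomed tally, such as per-tissue frequencies, would then be computed only from the mutations that fall between the zoom boundaries, as described for the Details and Mutations tables.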
For each individual mutation, a link is available (from either the histogram graphic or the Mutations table) to a detailed summary page (Figure 8). This page provides the mutation characteristics, a summary of tissues in which that mutation has been observed and a listing of samples containing it (which can become very long; KRAS c.35G→A; p.G12D has been identified in almost 4,000 samples). Links are available from this page and the gene summary page to the Ensembl system (http://www.ensembl.org), where the gene and mutation details in COSMIC can be examined in Ensembl's ContigView (Figure 9a), which displays comprehensive genomic annotation for a region of a chromosome. ContigView can give a much better view of the genomic context of COSMIC's mutation information, with Ensembl's navigation system providing options to scrutinize the local sequence context of an individual mutation (Figure 9b) or to examine the broad genomic region in which the selected gene resides (Figure 9a).
Detailed mutation page. A summary is available for each individual sequence change recorded, providing as much detail as possible on the mutation itself (in cDNA, amino acid and genomic contexts), together with tissue distribution and links to samples affected (see Figure 10).
COSMIC mutation data viewed in Ensembl's ContigView. This application, which is accessed using the link on the mutation overview page (Figure 8), provides a view of the mutation and its neighbors within the context of full-genome annotation. In this example, the view of the selected gene is zoomed out to show mutation positions relative to the transcript structure.
COSMIC mutation data viewed in Ensembl's ContigView. A zoomed-in view shows individual nucleotides and their range of mutations in genomic sequence context. Clicking on a mutation provides links back to COSMIC.
To examine a single COSMIC sample in isolation, a summary page detailing the sample's characteristics is available (Figure 10). Beginning with the sample's source and phenotype, the page details any mutations identified within the sample and papers and studies in which they have been described. For CGP samples (as in Figure 10), links are often also available to other CGP projects that detail examinations of microsatellite loss of heterozygosity ('LOH' link) and a more thorough assessment of the sample's genome copy number and zygosity using high-resolution whole-genome mapping arrays ('CGH' link). Finally, the summary page indicates the sample's microsatellite instability—a gauge of how unstable its genome has become—before listing the genes for which the sample was examined with no mutations found.
Sample summary page. A portion of the potentially long sample summary page is shown. The common details usually available for a sample include name, phenotype and mutation content. Many other details may be shown, depending on how well the sample was originally described.
The gene fusion data are currently derived only from literature curation. Separate pages are used to detail these mutations (Figure 11), as they are much more complex to describe than are small, intragenic sequence changes. As described above, an inferred breakpoint is identified for each sample, based on the repertoire of mRNAs expressed at the rearrangement breakpoint. Details of this breakpoint are indicated relative to the wild-type mRNA, in tabulated form; in the example shown, precise locations of the breakpoint are often defined with coordinates relative to the nearest exon boundary. These mutation annotations use locations relative to the mRNA, rather than the CDS, and include all the UTR sequences in the specified sequence. Reciprocal fusions are also detailed where this information is available. Graphical representations are provided for each observed mRNA product, and frequencies of fusions between selected gene pairs are summarized for each examined tissue type. Links from this page lead to a mutation summary page, similar to that described above, detailing exact characteristics of the individual mutation.
Fusion gene page. When two genes are frequently fused in cancer, this summary page overviews the range of mutant transcripts detected. Beginning with a tabulation of inferred breakpoints (see Figure 1), the total observed mRNA repertoire is presented graphically, followed by the tissue distribution of all fusions between the gene pair.
Use case for bioinformaticians
Scientists approaching COSMIC from a research or clinical perspective will navigate the system in similar ways, as described above. Bioinformaticians, however, may want to mine the data from perspectives other than those available on the website. For these users, a series of data export files is provided on COSMIC's FTP site (ftp://ftp.sanger.ac.uk/pub/CGP/cosmic). For simple examinations of large amounts of data, the 'data_export' folder holds all of the sample and mutation data in spreadsheet format. These are provided as one file per gene or as one file summarizing the whole release. Each file contains all pertinent information for each mutation in each sample, including mutation genomic coordinates. This format is the easiest way to input large amounts of data to simple programs, perform statistical evaluations or compare genomic localizations with other required targets.
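As an indication of how such a spreadsheet-style export might be read by a simple program, the sketch below tallies mutations per tissue for one gene. It assumes a tab-separated layout with a header row; the column names, sample names and tissue assignments are invented for illustration and should be replaced by those found in the actual export files.

```python
import csv
import io
from collections import Counter

# Stand-in for an export file; columns and rows are hypothetical.
header = ["gene", "sample", "cds_mutation", "primary_site"]
records = [
    ["KRAS", "S1", "c.35G>A", "pancreas"],
    ["KRAS", "S2", "c.35G>A", "lung"],
    ["KRAS", "S3", "c.35G>A", "pancreas"],
    ["JAK2", "S4", "c.1849G>T", "haematopoietic"],
]
export_text = "\n".join("\t".join(row) for row in [header] + records) + "\n"


def mutations_per_site(handle, gene):
    """Count rows per tissue for one gene in a tab-separated export."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["primary_site"] for row in reader if row["gene"] == gene)


kras_counts = mutations_per_site(io.StringIO(export_text), "KRAS")
print(kras_counts)
```

For a real file, the `io.StringIO` object would be replaced by a file opened with `open(path, newline="")`.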
For deep data mining, a full export of the Oracle database that is used to drive the website is provided in the 'oracle_exports' folder. Using this export requires significant informatics support, as it must be imported into an established Oracle database system (COSMIC fits easily within the constraints of Oracle's free 'express' edition). Once installed, the database can be examined using Oracle's SQL computer language to perform very complex analyses and statistical examinations of COSMIC's contents.
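The kind of SQL analysis this makes possible can be sketched without an Oracle installation by using Python's built-in sqlite3 module as a stand-in. Everything in the example is an assumption made for illustration: the two tables, their columns and the rows are invented and do not reflect the real COSMIC schema, which should be taken from the published schema document. The query counts, per tissue, how many samples exist and how many carry a mutation in a chosen gene.

```python
import sqlite3

# In-memory SQLite database standing in for an Oracle import (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (
    id INTEGER PRIMARY KEY,
    name TEXT,
    primary_site TEXT
);
CREATE TABLE mutation (
    id INTEGER PRIMARY KEY,
    sample_id INTEGER REFERENCES sample(id),
    gene TEXT,
    cds TEXT
);
INSERT INTO sample VALUES (1, 'S1', 'lung'), (2, 'S2', 'lung'), (3, 'S3', 'pancreas');
INSERT INTO mutation VALUES (1, 1, 'KRAS', 'c.35G>A'), (2, 3, 'KRAS', 'c.35G>A');
""")

query = """
SELECT s.primary_site,
       COUNT(DISTINCT s.id)        AS samples,
       COUNT(DISTINCT m.sample_id) AS mutated
FROM sample s
LEFT JOIN mutation m ON m.sample_id = s.id AND m.gene = ?
GROUP BY s.primary_site
ORDER BY s.primary_site
"""
rows = conn.execute(query, ("KRAS",)).fetchall()
for site, samples, mutated in rows:
    print(site, samples, mutated)
conn.close()
```

The sketch treats every sample as having been examined for the gene, which is a simplification; a faithful frequency calculation would also need the record of which genes each sample was screened for.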
For software developers wishing to integrate COSMIC into their own systems, details of gene transcripts and mutations have been localized on the National Center for Biotechnology Information's genome consensus sequence (NCBI36.1) and offered online as a Web service detailed in the Distributed Annotation System registry (http://www.dasregistry.org/showProjectDetails.jsp?project_id=75). This allows the database to be queried online for specific data sets at any time using the latest release; the data returned in XML format can be integrated into any piece of software.
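A client for a Distributed Annotation System source typically requests the features overlapping a genomic segment and reads the XML that comes back. The sketch below follows the general DAS conventions (a 'features' command with a 'segment' parameter, and a DASGFF document containing FEATURE elements), but the server address, the source name, the coordinates and the sample response are all placeholders invented for this example; the real values should be taken from the registry entry.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def features_url(server, source, chromosome, start, end):
    """Build a DAS 'features' request for one genomic segment."""
    query = urlencode({"segment": f"{chromosome}:{start},{end}"}, safe=":,")
    return f"{server}/das/{source}/features?{query}"


# Hypothetical server and source name; placeholder coordinates.
url = features_url("http://das.example.org", "cosmic_mutations", "12", 1000, 2000)
print(url)

# Invented response in the DASGFF layout, used instead of a live request.
SAMPLE_RESPONSE = """<DASGFF>
  <GFF version="1.0" href="http://das.example.org/das/cosmic_mutations/features">
    <SEGMENT id="12" start="1000" stop="2000">
      <FEATURE id="mut_1" label="c.35G&gt;A">
        <TYPE id="substitution">substitution</TYPE>
        <START>1500</START>
        <END>1500</END>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""

root = ET.fromstring(SAMPLE_RESPONSE)
features = [
    {
        "id": feature.get("id"),
        "label": feature.get("label"),
        "start": int(feature.findtext("START")),
        "end": int(feature.findtext("END")),
    }
    for feature in root.iter("FEATURE")
]
print(features)
```

In a working client the sample string would be replaced by the body of an HTTP response fetched from the constructed URL.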
The COSMIC project comprises three distinct subprojects, each with separate color-coded websites (Figure 12). The literature curation project aims to fully curate all of the published data on known cancer genes, beginning with all genes with primarily small, intragenic mutations (including KRAS, HRAS, NRAS, BRAF, PTEN and RB1). Most of this effort is now complete, so a new effort to curate published data on oncogenic gene fusion events has begun. Summary statistics of the literature curation subproject are presented on COSMIC's gold webpage (http://www.sanger.ac.uk/genetics/CGP/Classic/); these data represent roughly half of the database's contents.
The three COSMIC subprojects. The main COSMIC website is presented largely in blue. Each of the three main subprojects has its own color-coded website. COSMIC green and COSMIC red have the full functionality of the blue website but are restricted to only the portion of the total data generated by the Cancer Genome Project. COSMIC gold is a breakdown of the literature curation project. Links to the total data, presented in blue, are always available on all three subsites.
The rest of COSMIC's data are derived exclusively from the CGP, whose resequencing studies represent an effort to identify novel somatic mutations by resequencing approximately 4,000 candidate genes across 40 matched-pair cell lines and 96 matched-pair primary renal-cell carcinomas. These studies also include a similar analysis of 518 protein kinase genes through a large number of primary tumors and cell lines from a wide range of phenotypes (Greenman et al., 2007). The CGP resequencing studies have found many new sequence variants, already highlighting a new mutation hotspot not previously identified in the well-known oncogene KRAS (Edkins et al., 2006). The results of this effort are available on COSMIC's red website (http://www.sanger.ac.uk/genetics/CGP/Studies/), which is a fully functional and color-coded version of the COSMIC web system.
Finally, the third component project is the CGP Cancer Cell Line Project, which aims to resequence up to 70 known cancer genes with somatic intragenic mutations in nearly 800 common cancer cell lines. Within this subproject, mutations are annotated as to their likely role, and only those mutations considered to be cancer-causing are released. The results of this project can be examined on COSMIC's green website (http://www.sanger.ac.uk/genetics/CGP/CellLines/), again a fully functional and color-coded version of the COSMIC web system. The blue COSMIC web pages, probably the best known, provide an overview of the data from these three component subprojects, including fully integrated statistics and summaries.
Until 2005, the COSMIC system was designed exclusively for curation of data from the literature and included publications usually investigating only one or two genes in a small number of samples. In that year, the system was upgraded to handle the much larger data sets generated by high-throughput resequencing efforts such as that at the CGP. In 2008, a new challenge has emerged with the CGP reporting its first successful examinations of whole genomes using next-generation sequencers (Campbell et al., 2008). Individual samples can now receive annotations across their entire genome, not only for mutations in putative cancer genes but also for genome mutations and rearrangements regardless of their location relative to known coding or regulatory sequences. With total mutation numbers in the thousands, the storage and navigation of these data in COSMIC will need to be radically altered to offer options to browse the system by anonymous genome positions and to provide broader graphical summaries of mutation content. Integrating this into existing systems so that standard gene-based data can be examined as usual is an upgrade challenge that is currently being addressed. Our goal is to keep COSMIC up to date as mutation detection technology improves apace.
Cancer Genome Project, The Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, UK CB10 1SA
Correspondence should be addressed to P.A.F. (email@example.com).
- Petitjean, A. et al. Impact of mutant p53 functional properties on TP53 mutation patterns and tumor phenotype: lessons from recent developments in the IARC TP53 database. Hum. Mutat. 28, 622–629 (2007)
- Cooper, D.N., Stenson, P.D. & Chuzhanova, N.A. The human gene mutation database (HGMD) and its exploitation in the study of human mutational mechanisms. Curr. Protoc. Bioinformatics 1.13.1–1.13.20 (2005)
- Hamosh, A. et al. Online Mendelian Inheritance in Man (OMIM), a knowledge base of human genes and genetic disorders. Nucl. Acids Res. 30, 52–55 (2002)
- Futreal, P.A. et al. A census of human cancer genes. Nat. Rev. Cancer 4, 177–183 (2004)
- Greenman, C. et al. Patterns of somatic mutation in human cancer genomes. Nature 446, 153–158 (2007)
- Edkins, S. et al. Recurrent KRAS codon 146 mutations in human colorectal cancer. Cancer Biol. Ther. 5, 928–932 (2006)
- Campbell, P.J. et al. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat. Genet. 40, 722–729 (2008)
© Forbes, S.A. et al. 2008. Published under a Creative Commons Attribution-Non-Commercial-Share Alike 3.0 Licence.