Slideshow OverRepresentation (about 12 minutes)
Handout OverRepresentation_Handout.pdf (6 pages)
Tutorial Curators Scooter Morris
Data Files collins.cys, GPL51filt.txt
Version Applies to clusterMaker2 0.95, setsApp 2.1.0, BiNGO 3.0.3, and ClueGO 2.2.5. Last updated 8/29/2016
Over-representation (or enrichment) analysis is a technique for determining whether a set of categories is present more often than would be expected by chance (over-represented) in a subset of your data. It is often applied to lists of genes or proteins that have been selected from a genome or transcriptome based on some criterion, such as over or under expression in the presence of a condition; the categories are then the GO terms or pathway annotations for those genes or proteins. For example, the human transcriptome has about 30,000 genes. If 200 genes are categorized as "ribosome biogenesis", and in an experiment we find that 1000 genes are differentially expressed and 150 of those have the "ribosome biogenesis" category, what are the chances that this is random? In this tutorial, we will load an expression dataset into Cytoscape and, using hierarchical clustering from clusterMaker2, determine a set of genes that are consistently over expressed and a different set of genes that are consistently under expressed. We will then use BiNGO, and optionally ClueGO, to determine the GO terms that are enriched in those two sets. A note about ClueGO: if you intend to do the ClueGO section of this tutorial, you will need to get a license (free for non-commercial entities) from: http://www.ici.upmc.fr/cluego/cluegoLicense.shtml.
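To make the "ribosome biogenesis" question above concrete, here is a minimal sketch of one common way to answer it: a one-sided hypergeometric test, which asks how likely it is to draw at least 150 annotated genes when sampling 1000 genes out of 30,000 of which 200 carry the annotation. The function name and its parameters are my own choices for illustration, and this is not necessarily the exact calculation that BiNGO or ClueGO perform internally.

```python
from fractions import Fraction
from math import comb


def hypergeom_upper_tail(population, annotated, sample, observed):
    """Probability of seeing at least `observed` annotated genes in a random
    sample of size `sample`, drawn without replacement from `population`
    genes of which `annotated` carry the category."""
    total = comb(population, sample)
    upper = min(annotated, sample)
    # Count the samples that contain k annotated genes, for every k >= observed.
    hits = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(observed, upper + 1)
    )
    # Exact integer arithmetic first, then a single conversion to float.
    return float(Fraction(hits, total))


# Numbers from the example in the text: 30,000 genes, 200 annotated,
# 1000 differentially expressed, 150 of those annotated.
p = hypergeom_upper_tail(30000, 200, 1000, 150)
print(p)
```

By chance we would expect only about 1000 × 200 / 30,000 ≈ 7 annotated genes in the list, so the printed probability is vanishingly small. Real tools also correct for the fact that many categories are tested at once, which this sketch does not do.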
Biological Use Case: Find GO terms or pathways over represented in a particular subset of a transcriptome.
Dependencies: This Tutorial will use clusterMaker2, setsApp, BiNGO, and (optionally) ClueGO.
- From the App store, load clusterMaker2, setsApp, BiNGO, and (optionally) ClueGO. Remember that if you are going to use ClueGO, you'll need to apply for a license (free for non-commercial users).
- Start by loading a yeast interactome. The file collins.cys contains a data set from the 2007 paper "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae", Mol Cell Proteomics 6(3):439-50 (PubMed). Download the file, then in Cytoscape go to File → Open and select collins.cys to load the session.
Map expression data onto the network
The NCBI GEO and EBI ArrayExpress servers provide a set of tools to assist in analyzing deposited expression data sets and to obtain lists of over expressed or under expressed genes. In general, I find it useful to load the log2 normalized differential expression data directly into Cytoscape so that I can interactively cluster the data and view the heatmaps in the context of the network. This can be somewhat cumbersome in GEO, depending on the way the author deposited the "processed" data. For GSE18, an early yeast stress response data set, the differential expression values were included in the uploaded data. For other datasets, more processing may be required. ArrayExpress (particularly the Expression Atlas), on the other hand, seems to do a better job for recent datasets in providing tools to download processed data that includes the differential expression. To simplify this tutorial, I have included the data for the first experiment in GSE18 (GPL51) and modified it slightly to set all missing values to 0.0, purely for illustrative purposes.
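If you want to prepare a different data set in the same way, the missing-value step can be sketched as follows. This assumes a tab-delimited file with a header row and one identifier column followed by numeric columns, with missing values left as empty cells; the actual layout of GPL51filt.txt and of the original GEO deposit may differ, and the function name and sample rows are made up for illustration.

```python
import csv
import io


def fill_missing(text, fill="0.0", id_columns=1):
    """Return tab-delimited `text` with empty data cells replaced by `fill`.
    The first `id_columns` columns are treated as identifiers and left alone."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    header = next(reader)
    writer.writerow(header)
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Pad short rows so every row has one cell per header column.
        row = row + [""] * (len(header) - len(row))
        data = [cell if cell.strip() else fill for cell in row[id_columns:]]
        writer.writerow(row[:id_columns] + data)
    return out.getvalue()


# Made-up sample rows, not taken from GPL51filt.txt.
sample = "ID\tcond1\tcond2\ngeneA\t1.25\t\ngeneB\t\t-0.80\n"
cleaned = fill_missing(sample)
print(cleaned)
```

Replacing missing values with 0.0 treats "not measured" as "no change", which is convenient for a tutorial but is not something to do blindly with real data.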
- Download the expression data file: GPL51filt.txt. This file contains data from a Yeast Stress Response experiment published in 2000 by Gasch et al. and was one of the early uses of microarray technology.
- Once the file is downloaded, import it into your Cytoscape session: File → Import → Table → File... and select the downloaded file. This will bring up the Table Import dialog shown below.
- Make sure that all of the data columns are floating point values, then import the table by selecting OK. This will add the expression data to the nodes.
Run clustering to determine interesting subnets
- Select Apps → clusterMaker → Hierarchical cluster.
- In the Node attributes for cluster box, select all of the GPL51- values.
- Deselect Only use selected nodes/edges for cluster.
- Select Show TreeView when complete.
- Click OK.
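For readers who want a feel for what hierarchical clustering does with the expression columns, here is a toy sketch of agglomerative clustering with average linkage and Euclidean distance. It is only an illustration: clusterMaker2 has its own implementation and offers several linkage and distance options, and the gene names and values below are made up.

```python
from itertools import combinations
from math import dist


def average_linkage(profiles, n_clusters):
    """Greedy agglomerative clustering with average linkage.
    `profiles` maps a gene name to its list of expression values.
    Returns a sorted list of clusters, each a sorted list of names."""
    clusters = [[name] for name in profiles]

    def linkage(a, b):
        # Mean Euclidean distance over all cross-cluster pairs.
        total = sum(dist(profiles[x], profiles[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Made-up log2 ratios: two over expressed genes, two under expressed, one flat.
toy = {
    "geneA": [2.1, 1.8, 2.4],
    "geneB": [1.9, 2.2, 2.0],
    "geneC": [-2.0, -1.7, -2.3],
    "geneD": [-1.8, -2.1, -1.9],
    "geneE": [0.1, -0.2, 0.0],
}
groups = average_linkage(toy, 3)
print(groups)
```

The genes with consistently positive values end up together, as do those with consistently negative values, which is the same pattern the TreeView heatmap makes visible at the two ends of the dendrogram.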
Save the over and under expressed genes as sets
At this point, we could save all genes that are differentially expressed, but for illustrative purposes, let's separate those genes that are over expressed from those that are under expressed. To help us "remember" those selections, we're going to use the setsApp, which provides tools to save selections, and if desired perform union, intersection, and difference set operations on multiple sets.
- The resulting tree view (shown below) shows a collection of consistently over expressed (yellow) genes at one end and under expressed (blue) genes at the other.
- Using the dendrogram to the left of the full view, select the under expressed genes. These will also be selected in the network.
- In the Cytoscape Control Panel, select the Sets tab (this assumes you previously installed the setsApp).
- Click the + and choose selected nodes.
- Name the set Down.
- Repeat steps 2-5 for the up regulated nodes (substituting Up for Down in step 5).
- This should result in 83 nodes as part of the Down set and 80 nodes in the Up set.
- At this point, you can close the TreeView.
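The union, intersection, and difference operations that setsApp offers behave like ordinary set algebra. A minimal sketch, using placeholder node names and a made-up third set (called hubs here) that is not part of this tutorial:

```python
# Placeholder node names; in the tutorial these would be the selected yeast genes.
up = {"geneA", "geneB", "geneC"}
down = {"geneD", "geneE"}
hubs = {"geneB", "geneD", "geneF"}  # hypothetical extra set, for illustration only

differential = up | down    # union: every differentially expressed node
up_hubs = up & hubs         # intersection: nodes in both Up and hubs
down_not_hubs = down - hubs # difference: Down nodes that are not hubs

print(sorted(differential))
print(sorted(up_hubs))
print(sorted(down_not_hubs))
```

Because Up and Down were selected from opposite ends of the dendrogram, their intersection should be empty, which is a quick sanity check on the selections.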
Determine the GO over representation for both sets
At this point, we have a set of nodes that are over expressed under stress and another set that are under expressed under stress. We now want to find out if these sets of nodes are enriched in any GO terms. We'll start by using BiNGO to look.
- In the Sets panel, select Down to select all of the under expressed genes.
- Start BiNGO by selecting Apps → BiNGO.
- This will bring up the BiNGO Settings panel.
- For Cluster name: enter Down
- Select Get Cluster From Network
- Under Select ontology file: choose GO_Full.
- Select Start BiNGO. This will calculate the over representation and provide a new network that should have three connected components, one for each branch of GO. The darker colored nodes represent terms that are over represented. For example, in the "molecular function" branch, we see that "RNA binding" is enriched. Looking at the table, the p-value is 5.16 × 10^-14. The most significant p-values are for "nucleolus" in the "cellular component" branch and "ribosome biogenesis" in the "biological process" branch.
- Now, in the Cytoscape Control Panel, go to the Network panel and select the main network "compabined_scores_good.txt".
- Go back to the Sets panel and select Up to select all of the over expressed genes.
- Back in the BiNGO Settings panel, enter Up in the Cluster name: field and click Start BiNGO.
- Now explore the resulting network. Notice that the most over represented terms are various catabolic processes, which makes sense in response to stress.
Optional: use ClueGO to view the over represented terms
BiNGO does a nice job showing us the over represented terms for our over expressed genes and our under expressed genes. However, it would be nice to look at both over expressed and under expressed genes in the same visualization, and we would like to also know if any particular known pathways are enriched. ClueGO provides a nice set of tools for exactly that purpose. We'll now use ClueGO to analyze the same data that we viewed above.
- To avoid clutter, if desired, close previous BiNGO output panels and color scales.
- Once you have a ClueGO license, go to Apps → ClueGO. This will bring up the ClueGO license panel and allow you to enter the license you obtained.
- Select the ClueGO panel in Cytoscape's Control Panel.
- Since our data is from yeast, the first step is to load the Saccharomyces cerevisiae gene list. Under Load Marker List(s), next to the species list (which probably shows Homo sapiens by default), is a small icon that is supposed to suggest a disk with a down arrow. Selecting that will allow you to select new species to download. You will want to download Saccharomyces cerevisiae.
- Once the species is downloaded, we can create our two groups (called Clusters in ClueGO). Since we already have our lists defined, we can use our sets to populate our clusters. Start by selecting the Up set to select all of the nodes in the network that are over expressed.
- Now we have the set of over expressed genes selected in our network, so we can select Network in the ClueGO panel. ClueGO needs to know which data column to use to populate the field, so select name next to the Load Attributes button (but don't select the button).
- To load the gene names, click on the little file folder icon. That will create Cluster #1.
- Click on the + icon to get space to create another cluster.
- Select the Down genes using the previously defined Down set.
- Populate the cluster using the file folder icon.
- In the ClueGO Settings section, select all three GO branches and KEGG. If desired, change the shape of the KEGG pathway nodes (I set them to diamonds).
- Finally, click Start (you may need to scroll down to see the Start button).
- You can now explore the network to see the over represented terms in each Cluster or in the overall group. By default, the network is organized by functional group (see below).
- You can also see a set of nodes that compares the clusters. Unfortunately, by default ClueGO uses a red-green color gradient, which can cause significant difficulties for those with red-green color blindness. In the image below, I've changed the labels to all black and changed the color scale to cyan-yellow using the Style panel.
ClueGO results can be saved and restored for further analysis. For more details on ClueGO, please see the ClueGO Documentation.