Stanford MicroArray Database

Help : Help Data Selection for Analysis


Related Help Documents

  • Data Selection: Explanation of the program used to select hybridizations (arrays) for viewing or analyzing data
  • Analysis Methods: Information about the algorithms used for hierarchical clustering and Self-Organizing Maps (SOMs)
  • File Formats: Information about preclustering (.pcl), clustered data table (.cdt), gene tree (.gtr) and array tree (.atr) files generated in the process of clustering data


Description

The Data Selection for Analysis tool is available only after you have selected a set of hybridized arrays using either the Basic Search or the Advanced Search programs. Once a set has been selected, Data Selection for Analysis allows you to select genes or spots to cluster, and to filter data based on a variety of parameters. This tool can be used to generate a preclustering (.pcl) file, or the files needed for viewing a cluster with TreeView. In addition, Data Selection for Analysis will lead you to tools that will let you view clustered data via the web.

Data Selection for Analysis is split into three large steps:

  • Gene Selection & Annotation allows you to choose the genes or spots to retrieve for analysis, how to represent and annotate the genes and how to describe the hybridized arrays you've selected.
  • Data Filtering Options lets you select which data column to retrieve and filter the retrieved data based on the values of any measurement associated with the results.
  • Gene Filtering Options allows you to filter genes based on their data as well as to transform (center) data.

Gene Selection Options

Although we use the word 'gene,' it really refers to any DNA sample spotted on the microarrays. A 'gene' might be a PCR product representing an entire section of a gene, a portion of a gene, a clone associated with a gene, an intergenic region or anything at all.

This section allows you to specify which genes are of interest to you, then decide how to collapse your data, how to identify genes in your output file, which biological annotation to include, and how to label the arrays you're using.

  • Specify genes or clones for which to retrieve results:
    Use one of the following three options to decide for which genes on your arrays to retrieve data. Only genes that have at least one piece of data will be included in the final .pcl file - see Choose the data column to retrieve, below.

    • Use all genes/clones on arrays
      You can select all the genes/clones in the experiments you have selected.
    • Select a list of genes
      This option selects the genes listed in a genelist file and is available if you own a "loader.stanford.edu" account. Shared standard files are available for many organisms. In addition, you may create your own precompiled list of genes. To do this, use the "genelists" directory in your loader account, which was created automatically together with your account. Create a tab-delimited text file whose first column contains the sequence NAME, SUID, LUID, or SPOT of each gene. (Example sequence names are YPR119W for yeast and HPY1808 for H. pylori. For cloned organisms (human, mouse, fly), clone IDs are used, e.g. IMAGE:1542757.) The first line (header) of the genelist file should contain the appropriate label for the identifiers it contains (either NAME, SUID, LUID, or SPOT). Your files will appear in the pull-down menu under 'Select a list of genes.' Your file may contain additional columns for your own information, but the database will not read them. The one exception is if you check the "or keep annotation from genelist (if using one)" button in the "Biological Data To Select" section; in that case, the second column is retained as annotation.
    • Enter gene names
      You may enter gene names separated by two colons (::). All the genes you enter that have data in the chosen experiments will be selected. Use the systematic names used in SMD (e.g. clone IDs or ORF names, as appropriate), not the actual gene names. Examples of the systematic names appearing on the first selected array are provided for guidance.

  • Decide how to collapse data
    When a single gene is represented more than once on an array, you can choose how to represent the different spots. When you retrieve by SUID, results are averaged across sequences with the same identifier in the database (the same SUID). If you retrieve data by LUID, on the other hand, results are averaged only for spots that were derived from the same original microtiter well sample in the laboratory (those having the same LUID). You can also retrieve data by spot, which works only if all your arrays are from the same print. In this case, no averaging is performed.

  • Choose the contents of the UID column of the output file
    If you like, you can label each row of data with the SMD ID (SUID), the laboratory's microtiter well ID (LUID) and the spot identifier (SPOT). This information will be produced in your output preclustering file. For more information, see the File Format Help page.

  • Choose your biological annotation
    The results of these selections will appear in your clustering results. The information will vary depending on the organism you selected. You can select multiple types of biological data from the pull-down menu or you may check the retain annotation from genelist (if using one) button if you are using your own precompiled list of genes. For organism-specific details, please refer to the Tables for Specific Organisms on the SMD Specifications page.
  • Choose a label for each array/hybridization
    You can label each hybridized array with either the experiment name or the slide name in the output preclustering file. For more information, see the File Format Help page.
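
As a minimal sketch of the genelist format described under "Select a list of genes", the following Python writes a tab-delimited file whose header names the identifier type and whose second column carries optional annotation. It also builds the "::"-separated string for the "Enter gene names" option. The function names, the ANNOTATION header label, the annotation text and the second ORF name are illustrative assumptions, not part of SMD.

```python
import csv
import io

VALID_ID_TYPES = {"NAME", "SUID", "LUID", "SPOT"}

def write_genelist(handle, id_type, rows):
    """Write a tab-delimited genelist to an open text handle.

    rows is a list of (identifier, annotation) pairs. The first column holds
    the identifier; the second column is optional extra information.
    """
    if id_type not in VALID_ID_TYPES:
        raise ValueError(f"unsupported identifier type: {id_type}")
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Header line: the label of the identifier column comes first.
    # "ANNOTATION" is a hypothetical label for the second column.
    writer.writerow([id_type, "ANNOTATION"])
    for identifier, annotation in rows:
        writer.writerow([identifier, annotation])

def join_gene_names(names):
    """Join systematic names with two colons, as the entry box expects."""
    return "::".join(names)

# Example with placeholder annotations (assumed values for illustration).
buffer = io.StringIO()
write_genelist(
    buffer,
    "NAME",
    [("YPR119W", "example annotation 1"), ("YAL040C", "example annotation 2")],
)
genelist_text = buffer.getvalue()
entry_box_text = join_gene_names(["YPR119W", "YAL040C"])
```

To produce a real file, pass a handle opened with open(path, "w", newline="") in place of the in-memory buffer and save it in your "genelists" directory.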

Data Filtering Options

This section of the tool allows you to choose what data you think is reliable enough to include in your analysis. The steps are:
  • Choose the data column to retrieve
    You can select any measurement produced by the feature extraction software used to analyze the arrays. Different options will be presented depending on the software used (e.g., GenePix versus Affymetrix MAS 5). Any field may be used for clustering, but the defaults presented generally make the most sense. Note that some fields presented as options may be invalid: e.g., ScanAlyze and GenePix data are stored together and the same options are presented, but ScanAlyze and older versions of GenePix do not produce all of the measurements shown. If no data are retrieved for a given gene (spot, clone, etc.), either for this reason or because the data are bad or non-existent for that clone, it will not appear in the final .pcl file even if you specifically requested data for it in the gene selection step, above.

  • Decide whether to filter by spot flag
    Sometimes a spot may be flagged as unreliable, either by software or based on visual inspection by the experimenter. If a spot has NOT been flagged, its flag value is 0. If you do not want to retrieve spots that have been flagged as unreliable, simply keep the default selection.

  • Decide how to handle reverse-dye experiments
    This option appears only if you have selected experiments denoted as reverse. It inverts ratio and log ratio data for those experiments. If you cluster the resulting data, the appearance will change and the experiments may cluster differently, but the gene clustering won't be affected (just due to the mathematics involved).

  • Select criteria for spots to be selected
    You can filter out datapoints based on multiple criteria using these filters, and you can combine the filters in several ways using filter strings. Each filter has a checkbox that makes it active or inactive; check this box if you want to use the filter. The first pull-down menu indicates which measurement or data point you want to use in the filter. Remember that not all measurements are available for hybridizations that were analyzed with ScanAlyze or with older versions of GenePix. The second pull-down menu gives you several mathematical operators you can apply to your measurements. The final field is editable and holds the value to which you want to compare your measurements. Several default examples are available, but you should change the filters as you see fit.

    If you don't want your filters joined by "AND"s, use the FilterString box to enter the method by which you want your filters joined. If you do not enter a filter string, the default is that all active filters will be connected with the AND operator.
    You may enter a string that dictates how you want the filters combined. For instance, the filter string:

    1 AND (2 OR 3)

    means that you want datapoints that pass filter 1 and either filter 2 OR filter 3. (Note: filters 1, 2, and 3 must all be active for this to work.)
    You may also use more complex queries, such as:

    (1 AND ((2 OR 3) AND (4 OR 5))) OR 6

    The filtering will abort with an error message if the parentheses don't match or if the string is not syntactically correct.

  • Decide on some image presentation options
    If you are planning on viewing an assembled image of each array, select the retrieve spot coordinates option. If you are retrieving a large number of arrays, you are best served by NOT using this option, since you might run out of memory. The show all spots option allows you to view even the spots that you filtered out, but can make data retrieval extremely slow.
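
The filter-string behavior described under "Select criteria for spots to be selected" can be illustrated with a small evaluator. This is only a sketch, not SMD's own parser: the function name and the results mapping (filter number to pass/fail for one datapoint) are hypothetical, and unparenthesized mixes of AND and OR are evaluated left to right here, which is an assumption because the help text only shows fully parenthesized examples.

```python
import re

def evaluate_filter_string(filter_string, results):
    """Evaluate a string such as '1 AND (2 OR 3)' for one datapoint.

    results maps each ACTIVE filter number to True (pass) or False (fail).
    Raises ValueError for bad syntax, unbalanced parentheses, or a reference
    to a filter that is not active.
    """
    text = filter_string.upper()
    tokens = re.findall(r"\d+|AND|OR|\(|\)", text)
    # Anything the tokenizer skipped (other than whitespace) is illegal.
    if "".join(tokens) != re.sub(r"\s+", "", text):
        raise ValueError("filter string contains unexpected characters")
    pos = 0

    def parse_expression():
        nonlocal pos
        value = parse_term()
        # Assumption: AND and OR share one precedence level, left to right.
        while pos < len(tokens) and tokens[pos] in ("AND", "OR"):
            operator = tokens[pos]
            pos += 1
            right = parse_term()
            value = (value and right) if operator == "AND" else (value or right)
        return value

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of filter string")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_expression()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1
            return value
        if token.isdigit():
            number = int(token)
            if number not in results:
                raise ValueError(f"filter {number} is not active")
            return results[number]
        raise ValueError(f"unexpected token {token!r}")

    value = parse_expression()
    if pos != len(tokens):
        raise ValueError("unbalanced parentheses or trailing tokens")
    return value

# A datapoint that passes filters 1 and 3 but fails filter 2.
keep = evaluate_filter_string("1 AND (2 OR 3)", {1: True, 2: False, 3: True})
```

Here keep is True, because the datapoint passes filter 1 and at least one of filters 2 and 3.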

Gene Filtering Options

There are several steps to this part of the tool. Which options appear depends on what sort of data you have retrieved. Operations are carried out in the order in which they are presented on the page. The steps are:
  • Choose options for transformation of single-channel data
    These options are available only for single channel data, including single-channel intensities from two-color arrays. You may choose to adjust the average values of the retrieved data by multiplying each value by a constant factor (each array will have a constant calculated for it specifically). This is essentially a simple cross-array normalization. Second, you may choose to log-transform the data, with or without addition of a constant for variance stabilization. This is generally appropriate if you intend to cluster the data.

  • Choose one of these methods to filter genes based on data distribution
    If you don't wish to filter genes based on the distribution of their data, leave the "Do not filter genes on the basis of data distribution" option selected. Otherwise, you can choose one of two options.

    You can use the Rank filter to select only those genes whose retrieved values are in the top Nth percentile. You can decide what the percentile must be and the number of arrays for which a gene must be in your percentile. If you elect to show the percentiles in your preclustering file (for more information, see the File Format Help page), you will be unable to cluster your data with our tools.

    You can use the Deviations filters to select only those genes with a retrieved value different from the mean (for a single array) by more than a selected multiple of the standard deviation (for that array). You can decide what that multiple is and over how many arrays it must be true.

  • Decide whether to center data
    This option is only appropriate if you are retrieving log-transformed data. Centering is a data transformation that adjusts the values of your data. If, for example, you choose to center genes by means, the mean value for each gene will become zero after the centering. You can decide whether you want to center genes and/or arrays by either means or by medians. The mean or median of all values, for each gene or array, is subtracted from each value for that gene or array. Centering data for each gene is usually done in those cases where you are comparing hybridized arrays that use a common reference in the green channel.

    When you choose to center both by gene and by array, you can decide whether or not to iterate the operation. Upon centering arrays, values for centered genes may be thrown off, because of missing values, or when centering by medians. Iterating allows the centering to be repeated on both genes and arrays until the values stop changing. Obviously, iterating will increase the time spent calculating your results. Iteration continues until the maximum change to any array is less than 0.01 (in units of log-ratio), up to a maximum of ten iterations.

  • Decide whether to zero-transform a time course
    This option is most appropriate for time-course experiments. It allows you to adjust the data so that the values for each gene are relative to a specified zero-time point array (or multiple arrays, if you have repeated measurements of the zero-time point). For each gene, the value of that gene in the zero-time point array (or the average value in several) is subtracted from all values for that gene. This is most appropriate for log-transformed data.

    If you choose to zero-transform your data, you must indicate one or more arrays that represent the zero-time point, and a method for averaging their values (mean or median) if you select more than one.

  • Select a method to filter genes based on data values
    You can choose not to filter genes based on their data values, but if you do, there are two options. The first is to use a Cutoff value, which requires values to exceed a given value for some number of arrays. You determine the mathematical operator to use for the comparison and the value to which the gene's log ratio is compared. The default setting for log-transformed ratio data selects genes that are at least 4-fold induced or 4-fold repressed in at least 1 experiment. (Note that it is 4-fold because it is the absolute log(base2) ratio that must be greater than 2, and thus the ratio must be greater than 4-fold up or down (2^2).) Note that the default value for intensity data is not appropriate if you log-transform the data, and should be adjusted appropriately. You may change these settings to suit your needs. For example, you may filter out genes that vary by this amount in fewer than 3 experiments, or you can choose ones that vary by a different amount.

    If you are retrieving log-transformed ratio data, you can also select only those genes whose distance in result-space exceeds a given value. The log transformed data for a given gene across the selected experiments constitute a vector, and this filter determines whether the length of this vector is greater than the specified minimum.

  • Choose whether to filter genes and arrays based on the amount of data passing the spot filter criteria
    Based on the filtering criteria you entered in the "Select criteria for spots to be selected" step of the Data Filtering Options section of this tool, you can now indicate which genes or arrays to use. You can enter the percentage of arrays for which any gene must pass your filter criteria. In addition, you can select only those arrays that have some percentage of spots passing your filter criteria. For example, if a gene passes your filter in more than 80% of the hybridized arrays you are analyzing, you will retrieve data for that gene, but only the data that passes your filter criteria. The data that doesn't pass will be discarded. If you selected non-log-transformed data earlier, this is the only option available for you to filter the data.
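
The centering and data-value filtering steps above can be sketched in plain Python. This is an illustration under stated assumptions, not SMD's implementation: the function names are hypothetical, missing values are represented as None, and the stopping rule is read as "the largest adjustment made while centering the arrays is below 0.01", with at most ten passes.

```python
import math
from statistics import mean, median

def center_rows(matrix, average=mean):
    """Subtract each row's mean (or median) from its non-missing values."""
    centered = []
    for row in matrix:
        present = [v for v in row if v is not None]
        offset = average(present) if present else 0.0
        centered.append([None if v is None else v - offset for v in row])
    return centered

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def center_genes_and_arrays(matrix, average=mean, tolerance=0.01, max_iterations=10):
    """Alternate gene (row) and array (column) centering until values settle."""
    for _ in range(max_iterations):
        before = transpose(center_rows(matrix, average))  # genes centered
        after = center_rows(before, average)              # arrays centered
        largest_change = max(
            (abs(new - old)
             for new_col, old_col in zip(after, before)
             for new, old in zip(new_col, old_col)
             if old is not None),
            default=0.0,
        )
        matrix = transpose(after)
        if largest_change < tolerance:
            break
    return matrix

def passes_cutoff(log2_ratios, cutoff=2.0, min_arrays=1):
    """True if |log2 ratio| > cutoff (2.0 means 4-fold) in enough arrays."""
    hits = sum(1 for v in log2_ratios if v is not None and abs(v) > cutoff)
    return hits >= min_arrays

def vector_length(log2_ratios):
    """Length of a gene's vector of log ratios, ignoring missing values."""
    return math.sqrt(sum(v * v for v in log2_ratios if v is not None))

# Two genes across three arrays; one value is missing.
data = [
    [1.0, 3.0, None],
    [2.0, 4.0, 6.0],
]
centered = center_genes_and_arrays(data)
centered_by_median = center_genes_and_arrays(data, average=median)
```

Because of the missing value, a single pass of gene and array centering does not leave both sets of means at zero for this example, which is why the operation is repeated. With the defaults, passes_cutoff keeps a gene such as [2.5, 0.1] (more than 4-fold induced in one array) and rejects [1.9, -1.9].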

Viewing Clustering Results

Once you've submitted a clustering query, you will see a page on which progress messages are written to your screen. When the preclustering file is complete, the last line will read, "...genes were selected."

  • 'Download Preclustering File' allows you to download the raw data to your machine for analysis using your own methods.
  • 'Clustering and Image Generation' allows you to view the results after setting some final clustering and image generation options.

Data Analysis

SMD allows you to perform some data analysis on your preclustering file, using either of two methods: hierarchical clustering or Self-Organizing Maps (SOMs).

Clustering Options

You have to define the following options when hierarchically clustering:

  • Whether to cluster genes, and if so whether to use a centered, or a non-centered metric.

    The centered vs non-centered metric only applies if you are using the Pearson Correlation (see below). It will not make a difference if using the Euclidean distance.

  • Whether to cluster experiments

    The same considerations apply for experiments as described for genes above.

  • Whether to use the Pearson Correlation or the Euclidean distance

    These are distance metrics that are used for measuring the similarity of expression between genes.

  • Whether to Hierarchically Cluster, or make a Self Organizing Map.

    If you choose 'Self Organizing Map Cluster', be sure to specify x and y dimensions. Your settings for hierarchical clustering described above will still be used when each partition of the SOM is clustered.

  • Whether to generate a file of sorted correlations. The default correlation is 0.8.

    Click 'Submit' when you have chosen the appropriate options.
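
To make the difference between the metrics above concrete, here is a short sketch using the textbook definitions of the centered and uncentered Pearson correlation and of the Euclidean distance. Whether SMD's implementation matches these formulas in every detail (for example, in how missing values are handled) is an assumption; the function names and the two expression profiles are hypothetical.

```python
import math

def pearson(x, y, centered=True):
    """Pearson correlation of two equal-length profiles.

    centered=True subtracts each profile's own mean first.
    centered=False (uncentered) treats both means as zero.
    """
    mean_x = sum(x) / len(x) if centered else 0.0
    mean_y = sum(y) / len(y) if centered else 0.0
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Two profiles with the same shape but a constant offset of 10.
gene_a = [1.0, 2.0, 3.0]
gene_b = [11.0, 12.0, 13.0]

centered_r = pearson(gene_a, gene_b, centered=True)
uncentered_r = pearson(gene_a, gene_b, centered=False)
distance = euclidean(gene_a, gene_b)
```

The centered correlation sees the two profiles as identical in shape (a value of 1), the uncentered correlation is lower because the offset counts against the match, and the Euclidean distance depends only on the pointwise differences, so it has no centered variant.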

    Image Generation Options

    Here are a couple of tips that will help you reduce the time it takes to analyze the experiments you selected.
    • Selecting 'Show spot images' will slow down the analysis.
    • Broken up images load faster and can be navigated more quickly than unbroken images.

    Browsing, Viewing, and Downloading Clustered Data

    To interactively browse the clustered data, click the red and green image in the lower left-hand corner of the window. This takes you to the 'Hierarchical Cluster View' where you can focus on specific gene sub-clusters.
    • The map on the left contains the entire cluster, and its size can be changed by entering new parameters in the upper left-hand corner.
    • Clicking on this map changes the view of the graph on the right, which contains the experiment names as columns and gene names as rows.
    You can view the clustered data in the following ways.
    • 'View broken images' displays a .gif of the clustered genes based on the average retrieved value.
    • 'View broken spot images' displays a .gif of the clustered genes. The spots of the experiment are displayed in a way that allows you to see the variation within the spot.
    • 'View joint broken images' places both the above .gifs in the same window. If you don't see the broken spot image, scroll left to bring it onto your screen.
    • Clicking on 'pcl' at the bottom of the screen allows you to view the preclustering file.
    The other links at the bottom of the screen download files to your machine.
    • 'cdt' downloads the complete tree view datafile.
    • 'gtr' downloads the genetree view datafile, which describes the tree of clustered genes.
    • If you chose an experiment clustering option on the previous page, you will also have the option to click on 'atr' to download the arraytree file.


    Please send comments or questions to: array@genome.stanford.edu