Advanced Clustering

This function retrieves all groups of homologous sequences that fall within the cutoff limits you set. Enter four values: the expectation cutoff, given as a normalized expectation value (the absolute value of the exponent in the expectation value); the bit score cutoff; the identity cutoff (in percent); and the overlap cutoff (alignment length in aa or nt).
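
As an illustration of the normalized expectation value and of how the four cutoffs might combine, here is a minimal Python sketch. The names normalized_expectation and passes_cutoffs are hypothetical and are not part of PyMood. The sketch assumes that a hit must meet or exceed all four cutoffs, and it applies the definition above literally; how PyMood itself treats an expectation value of 0.0 is not described here, so the sketch returns None in that case.

```python
def normalized_expectation(evalue_text):
    """Absolute value of the exponent of an E-value, e.g. '3e-50' -> 50."""
    text = evalue_text.strip().lower()
    if text.startswith("e"):
        text = "1" + text  # some BLAST reports print 'e-105' for '1e-105'
    value = float(text)
    if value == 0.0:
        return None  # no exponent to take; handling is left to the caller
    # Format in scientific notation and read the exponent back.
    exponent = int(f"{value:e}".split("e")[1])
    return abs(exponent)


def passes_cutoffs(norm_exp, bit_score, identity, overlap, cutoffs):
    """Assumption: a hit is kept only if it meets or exceeds every cutoff."""
    return (
        norm_exp >= cutoffs["expectation"]
        and bit_score >= cutoffs["bit_score"]
        and identity >= cutoffs["identity"]
        and overlap >= cutoffs["overlap"]
    )


# Hypothetical cutoff values, chosen only for this example.
cutoffs = {"expectation": 20, "bit_score": 100, "identity": 40, "overlap": 100}
keep = passes_cutoffs(normalized_expectation("3e-50"), 210.0, 62.5, 180, cutoffs)
```

For example, an expectation value of 3e-50 has a normalized expectation value of 50, so entering 20 as the cutoff would keep that hit as far as the expectation limit is concerned.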

When using this function, it is highly recommended to raise the number of displayed alignments above the default of '3' in the PyMood BLAST Launcher before running BLAST.

After the analysis, three new files will be produced:
     .CLUST.group_info
     .CLUST.all_pairs
     .CLUST.adj_list

The .CLUST.group_info file consists of five columns and can be viewed in a spreadsheet program such as MS Excel, where the columns are:

     A: gene ID
     B: number of other genes clustered to the current gene
     C: number of genes in the cluster
     D: group number
     E: either **** or blank. The **** marks the beginning of a new group and is for visual purposes only.
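
The column layout above can also be read outside a spreadsheet. The following Python sketch collects gene IDs by group number. It assumes the file is tab-delimited with no header row, which is not stated in this document, and the function names are hypothetical.

```python
import csv


def members_by_group(rows):
    """Map each group number (column D) to its gene IDs (column A)."""
    groups = {}
    for row in rows:
        if len(row) < 4:
            continue  # skip blank or incomplete lines
        gene_id, group_number = row[0], row[3]
        groups.setdefault(group_number, []).append(gene_id)
    return groups


def read_group_info(path, delimiter="\t"):
    """Read a .CLUST.group_info file; the delimiter is an assumption."""
    with open(path, newline="") as handle:
        return members_by_group(csv.reader(handle, delimiter=delimiter))


# Made-up rows in the A-E layout described above.
example_rows = [
    ["geneA", "2", "3", "1", "****"],
    ["geneB", "2", "3", "1", ""],
    ["geneC", "2", "3", "1", ""],
    ["geneD", "1", "2", "2", "****"],
    ["geneE", "1", "2", "2"],
]
groups = members_by_group(example_rows)
```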

The .CLUST.all_pairs file is a pairwise (binary) matrix of all primary hits that are better than the cutoff values. The file consists of six columns and can be viewed in a spreadsheet program such as MS Excel, where the columns are:
     A and B: the pair of genes
     C: normalized expectation (absolute value of the exponent in the expectation value)
     D: bit score
     E: percentage of identity
     F: length of the alignment

The .CLUST.all_pairs file can be used by PhyloGrapher and GenomePixelizer as a matrix file after the following modifications:
     1. The third column should contain the data that you would like to use for displaying the results. Any of the columns C (expectation), D (bit score), or E (identity) can be used; just move the chosen column into the third position.
     2. This data should be normalized between 0 and 1.
Both steps can easily be accomplished in MS Excel.
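
The same two steps can also be scripted. The Python sketch below is one possible approach and not part of PyMood: it keeps columns A and B, moves the chosen value column into the third position, and rescales it to the range 0 to 1 with min-max scaling. The tab delimiter, the choice of min-max scaling, and the function names are all assumptions; check the PhyloGrapher or GenomePixelizer documentation for the exact matrix file format they expect.

```python
import csv


def to_matrix_rows(rows, value_index):
    """Return [gene1, gene2, scaled_value] rows with values in 0..1.

    value_index is the zero-based column to use: 2 = C (expectation),
    3 = D (bit score), 4 = E (identity).
    """
    rows = [row for row in rows if row]  # drop blank lines
    if not rows:
        return []
    values = [float(row[value_index]) for row in rows]
    low, high = min(values), max(values)
    span = high - low
    result = []
    for row, value in zip(rows, values):
        # If every value is identical, min-max scaling is undefined; use 1.0.
        scaled = (value - low) / span if span else 1.0
        result.append([row[0], row[1], f"{scaled:.4f}"])
    return result


def convert_file(in_path, out_path, value_index=3, delimiter="\t"):
    """Read an all_pairs file and write a three-column matrix file."""
    with open(in_path, newline="") as src:
        rows = list(csv.reader(src, delimiter=delimiter))
    with open(out_path, "w", newline="") as dst:
        csv.writer(dst, delimiter=delimiter).writerows(
            to_matrix_rows(rows, value_index)
        )


# Made-up rows in the A-F layout of the .CLUST.all_pairs file.
example_rows = [
    ["g1", "g2", "50", "200", "90", "300"],
    ["g1", "g3", "10", "100", "45", "150"],
    ["g2", "g3", "30", "150", "60", "200"],
]
matrix = to_matrix_rows(example_rows, 3)
```

Min-max scaling maps the smallest value in the chosen column to 0 and the largest to 1. For the identity column, dividing by 100 would be an equally simple alternative.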

The .CLUST.adj_list file is an adjacency list for sequences, based on the .CLUST.all_pairs file. If the query sequence in the first column is similar to other genes (the subjects in the BLAST report) within the defined cutoff values (expectation, identity, bit score and alignment length), then the IDs of these genes are written to the corresponding row.
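
To make the relationship between the two files concrete, the following Python sketch builds an adjacency list from gene pairs that have already passed the cutoffs. It is an illustration under assumptions, not PyMood's own code: the function name is hypothetical, and the sketch treats the first gene of each pair as the query and the second as the subject.

```python
def build_adjacency(pairs):
    """Map each query gene to the subject genes paired with it.

    pairs is an iterable of (query, subject) tuples, for example
    columns A and B of the .CLUST.all_pairs file.
    """
    adjacency = {}
    for query, subject in pairs:
        neighbours = adjacency.setdefault(query, [])
        if subject not in neighbours:  # keep each subject once per query
            neighbours.append(subject)
    return adjacency


# Made-up pairs for illustration.
pairs = [("g1", "g2"), ("g1", "g3"), ("g2", "g1"), ("g1", "g2")]
adjacency = build_adjacency(pairs)

# One row per query: the query ID followed by its similar genes.
lines = ["\t".join([query] + subjects) for query, subjects in adjacency.items()]
```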