Advanced: Sequence Processor

Sequence Processor removes and masks undesired (low-quality, vector, contaminant, etc.) sequences in FASTA files. It sorts DNA and protein sequences according to user-selected criteria (see "Options" in Step 2). The sequences from the query file are separated into a new ".good" FASTA file (sequences having at least some good content) and a ".bad" FASTA file (sequences having no good content). The undesired portions of the ".good" sequences are masked so that they are not used in subsequent analyses.

Step 1. Blast for Sequence Processor

When processing DNA sequences, select "blastn". For protein sequences, select "blastp". The query file is the sequence file to be analyzed. The target file (the reference sequence file) can be any FASTA file with sequences such as vector sequences, repeats, primers, specific protein domains, etc.

Here, as a reference target file we supply a FASTA file ("Vector_M_ATGC_R.fasta") containing all vector sequences from GenBank, mononucleotide repeats, dinucleotide repeats, and trinucleotide repeats. This file can be found in the “cocoa” project folder.
  1. To start Sequence Processor, open PyMood Blast Launcher, click the "Advanced" menu, and select "Sequence Processor..."
  2. Click the "Set BLAST options..." button.
In the BLAST Launcher window:
  1. Change options if necessary
  2. Select a query FASTA file
  3. Select the reference target FASTA file
  4. Click BLAST!
  5. If requested, click “Format”
  6. Wait until the “BLAST is done” message appears, then click "OK". The BLAST run can take from a few minutes to many hours, depending on the size of the files and the speed of the computer. For example, BLASTing 6557 cocoa ESTs against the supplied reference file “Vector_M_ATGC_R.fasta” typically takes a few minutes.
  7. Now proceed to Step 2 in the Sequence Processor window
Step 2. Sequence Processor

Options
All sequences that meet the first three options are written to a ".good" file, the ones that do not are written to a ".bad" file.

Length – sequence length
% N – percentage of “N” letters in a sequence. Note: when using protein FASTA files, set this to 100.
% GC – percentage of “G” and “C” letters in a sequence. Note: when using protein FASTA files, set this to “0 to 100” (two input boxes).
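
The three filtering options above can be illustrated with a minimal Python sketch. It is not PyMood's own code. It assumes that Length is a minimum length, that % N is an upper limit, and that % GC is an inclusive range taken from the two input boxes. The function names are hypothetical.

```python
def percent(count, total):
    """Return count as a percentage of total (0.0 for an empty sequence)."""
    return 100.0 * count / total if total else 0.0


def passes_filters(seq, min_length, max_n_percent, gc_range):
    """Return True if a sequence meets all three options.

    Assumptions (not confirmed by the PyMood documentation):
    - min_length is a lower bound on sequence length,
    - max_n_percent is an upper bound on the percentage of N letters,
    - gc_range is an inclusive (low, high) range for the G+C percentage.
    """
    seq = seq.upper()
    length = len(seq)
    n_pct = percent(seq.count("N"), length)
    gc_pct = percent(seq.count("G") + seq.count("C"), length)
    gc_low, gc_high = gc_range
    return (
        length >= min_length
        and n_pct <= max_n_percent
        and gc_low <= gc_pct <= gc_high
    )


# A DNA sequence with 40% GC and no N letters.
print(passes_filters("ATGCATGCAT", 5, 10, (20, 80)))
# A protein sequence, using the settings recommended in the notes above.
print(passes_filters("MKVLNNG", 5, 100, (0, 100)))
```

With % N set to 100 and % GC set to 0 to 100, neither test can reject a sequence, which is why those settings are suggested for protein files.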

The last two options affect only ".good" sequences.

Masking letter - letter that will be used to mask the undesired portions of good sequences.

Unmasked letters – the minimum number of unmasked letters a sequence must have to be placed in the “.good.masked.fasta” file. Sequences with fewer unmasked letters are placed in the “.good.maskedx” file.
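
The two masking options can be sketched as follows. This is an illustration only. It assumes that the undesired portions are the query regions covered by BLAST hits, given as 1-based inclusive positions, and that "X" is used as the masking letter. The function names are hypothetical.

```python
def mask_sequence(seq, hit_ranges, mask_letter="X"):
    """Mask the given regions and count the letters left unmasked.

    hit_ranges holds (first, last) pairs, assumed to be 1-based and
    inclusive. A pair may be reversed (first > last).
    """
    masked = [False] * len(seq)
    for first, last in hit_ranges:
        start, end = sorted((first, last))
        # Convert to 0-based indices and clip to the sequence bounds.
        for i in range(max(start, 1) - 1, min(end, len(seq))):
            masked[i] = True
    masked_seq = "".join(
        mask_letter if is_masked else letter
        for letter, is_masked in zip(seq, masked)
    )
    # Count positions rather than mask letters, so that a sequence that
    # already contains the masking letter is counted correctly.
    return masked_seq, masked.count(False)


def output_file_suffix(unmasked, min_unmasked):
    """Pick the output file according to the 'Unmasked letters' option."""
    if unmasked >= min_unmasked:
        return ".good.masked.fasta"
    return ".good.maskedx"


masked_seq, unmasked = mask_sequence("ATGCATGCAT", [(1, 3), (8, 6)])
print(masked_seq, unmasked, output_file_suffix(unmasked, 5))
```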
File options:
Fasta file – select the query file for processing.
.all_hits file – select the corresponding all_hits file for processing. This file is produced by the PyMood BLAST parser from the corresponding BLAST output file.
Output suffix – suffix attached to Sequence Processor files. Do not type a filename.

Output Files produced by PyMood Sequence Processor

.stat – a tab-delimited file with data on sequence composition for each sequence in the original query FASTA file
.all.fasta – the original query FASTA file formatted so that each sequence is in one line
.bad.fasta – a new FASTA file containing only sequences that have not passed the first three selected options in the Sequence Processor
.good.fasta – a new FASTA file containing only sequences that have (at least partially) passed the first three selected options in the Sequence Processor
.good.masked.fasta – a new FASTA file containing only sequences that have passed all selected options in the Sequence Processor and have the undesired parts masked with the masking letter
.good.maskedx – a new FASTA file containing sequences that passed the first three selected options in the Sequence Processor but have not passed the last option.
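
The one-line-per-sequence layout described for the .all.fasta file can be reproduced with a short sketch like the one below. It is a generic FASTA reader, not the PyMood implementation, and the function name is hypothetical.

```python
def linearize_fasta(lines):
    """Yield (header, sequence) pairs with each sequence joined into one line."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = [">seq1 example", "ATGC", "ATGC", "", ">seq2", "GGCC"]
for header, sequence in linearize_fasta(example):
    print(header)
    print(sequence)
```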

.tab_all – a tab-delimited file (produced from the .proc.all.fasta file)
.tab_bad – a tab-delimited file (produced from the .proc.bad.fasta file)
.tab_good – a tab-delimited file (produced from the .proc.good.fasta file)
These three files have three columns:
  1. sequence ID
  2. sequence length
  3. sequence composition
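
A three-column file of this kind could be read with a sketch like the following. It assumes the files have no header row and keeps the composition column as plain text, because its exact format is not described here. The function name and the sample values are hypothetical.

```python
import csv
import io


def read_tab_file(handle):
    """Read (sequence ID, length, composition) records from a tab-delimited file.

    Assumption: no header row. The composition column is kept as a string.
    """
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or incomplete lines
        seq_id, length, composition = row[0], row[1], row[2]
        records.append((seq_id, int(length), composition))
    return records


# Hypothetical content; real files are written by Sequence Processor.
sample = io.StringIO("seq1\t10\tcomposition-1\nseq2\t25\tcomposition-2\n")
for seq_id, length, composition in read_tab_file(sample):
    print(seq_id, length, composition)
```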
Output file produced by PyMood BLAST Launcher (Step 1) and used in Step 2

The file produced by the BLAST run and used by the Sequence Processor is the .all_hits file. It contains 14 columns with the results from the BLAST output and can be opened in a spreadsheet editor. An example file name:
cocoa_vs_Vector_M_ATGC_R.matrix.all_hits
The description of the columns:

     A. unique identifier for the query gene
     B. unique identifiers of all BLAST hits above the cutoff of the 'Expect value'
     C. normalized expectation values for all hits above the cutoff of the 'Expect value'
     D. score (bits)
     E. percentage of identity between overlapping regions
     F. number of identical letters in the overlap
     G. length of the overlap
     H. order number of the hit for every query gene
     I. assigns number 1 to every primary alignment*, and numbers 2, 3, 4, etc. to the alternative alignments**, if any
     J. indicates if the alignment is primary, 'PRM' or alternative, 'ALT'
     K. first position of the query sequence in the alignment
     L. last position of the query sequence in the alignment
     M. first position of the target sequence in the alignment
     N. last position of the target sequence in the alignment

* an alignment is primary when it is the best-scoring alignment between the query sequence and the particular target sequence
** an alignment is alternative when it is not the best-scoring alignment between the query sequence and the particular target sequence
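
The column layout above can be read programmatically with a sketch like the one below, which collects the query positions (columns K and L) of every alignment for each query sequence. It assumes the file is tab-delimited with one alignment per row. The column names and the function name are hypothetical labels for columns A to N, not names used by PyMood.

```python
import csv
import io
from collections import defaultdict

# Hypothetical labels for columns A to N, in the order described above.
COLUMNS = [
    "query_id", "hit_id", "expect", "bit_score", "percent_identity",
    "identical", "overlap_length", "hit_order", "alignment_number",
    "alignment_type", "query_first", "query_last",
    "target_first", "target_last",
]


def query_ranges(handle):
    """Map each query ID to the list of (first, last) query positions."""
    ranges = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed lines
        record = dict(zip(COLUMNS, row))
        try:
            first = int(record["query_first"])
            last = int(record["query_last"])
        except ValueError:
            continue  # skip a header row, if the file has one
        ranges[record["query_id"]].append((first, last))
    return dict(ranges)


# Hypothetical row; real rows come from the .all_hits file.
row = ["q1", "v1", "1e-20", "80", "98", "49", "50",
       "1", "1", "PRM", "5", "54", "100", "149"]
print(query_ranges(io.StringIO("\t".join(row) + "\n")))
```

The resulting position pairs are the kind of input a masking step would need, since they mark which part of each query sequence matched the reference file.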

Link: Other files produced by PyMood BLAST Launcher