user manual SeqHandle.pl


Introduction
============

[...]


File Format Support
===================

 The input file format is determined automatically and may be one of:

   Experiment     Staden Experiment file, may be multi-sequence.
   fastA          Pearson format
   GAP4 database  contig consensus sequences - input only.
                  NOTE: the sequence ID will be the ID of the leftmost reading
                  in the contig. Keep this in mind when using switch -SlcID=S.
   GenBank        GenBank
   GFF            open GFF-file(s) and an accompanying sequence file which will
                  be recognised by having the same name root and one of the
                  suffixes: '.fa', '.fasta', '.tbl', '.table', '.pln', ''.
   plain          just the plain sequence. The filename will be interpreted
                  to yield a sequence identifier. This format is limited to
                  single sequence containers.
   selex          like "table"
   struct         plain file format encoding nested data structures
   table          table format file containing lines with ID & plain sequence
                  separated by <TAB> or spaces.

 See module SeqLab::SeqFormat for details of sequence format processing.

 Output is written to STDOUT in standard fastA format or one of:

   Experiment     Staden Experiment file, may be multi-sequence.
   fastA          Pearson format, with a line feed every 60 characters
   FeatureTable   feature table format that may be used as annotation input to
                  NCBI's sequin tool
   GenBank        GenBank
   GFF            GFF
   plain          sequence output in condensed plain text format, with a line
                  feed every 60 characters.
   PrettyHTML     formatting of annotations using HTML tags. Block structuring
                  of the sequence, similar to GenBank format.
                  --PosRef=N  set position N to output position 0
   selex          table format holding identifiers and sequences using space
                  characters as the field delimiter. The sequence fields are
                  indented to the same adjusted column position.
   struct         plain file format encoding nested data structures. This is
                  a plain formatted mirror of internal data representation,
                  and therefore optimally conserves complex sequence information
                  (like nested annotation sub-structures).
   table          TAB-delimited table format of identifier and sequence
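
The fastA line layout described above (a line feed every 60 characters) can be
illustrated with a short Python sketch. This is not code from SeqHandle.pl. The
function name format_fasta and the header layout ">ID description" are
assumptions made for this example only.

```python
def format_fasta(seq_id, descr, seq, width=60):
    """Return one fastA record with the sequence wrapped at `width` letters."""
    # Assumption: the header line is ">ID", optionally followed by a space
    # and the description.
    header = ">" + seq_id + (" " + descr if descr else "")
    # Cut the sequence into chunks of at most `width` letters.
    chunks = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + chunks) + "\n"


# A 160-letter sequence yields three sequence lines of 60, 60 and 40 letters.
record = format_fasta("seq1", "demo entry", "ACGT" * 40)
print(record, end="")
```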


SeqHandle.pl standard options
=============================

abbreviations for switch argument types used in the descriptions:
 B := boolean
 F := floating point/scientific
 N := integer
 S := string
 X := varying type

 -debug[=N]    print debug messages to STDERR (sometimes STDOUT) and keep
               temporary files. Switch argument N specifies the debug depth,
               default 1.
 -FilterDescr=S
               apply a regular expression to the sequence descriptions, ignore
               matching sequence entries
 -FilterID=S   apply a regular expression to the sequence IDs, ignore matching
               sequence entries
 -fofn=S       append the entries of the specified file to the list of command
               line arguments. Multiple -fofn switch statements are allowed,
               taking effect cumulatively.
 -lower        force input sequences to lower case letters
 -OutDir=S     directory for file-targeting output. This switch overrides any
               directory statement provided with switch -OutStump.
 -OutSeq=S     file path for sequence output, default: STDOUT. For multi-file
               output use switches -OutDir and -OutStump.
 -OutSeq="rewrite"
               preserve the file structure found in the input. The input
               files are overwritten by the output unless option -OutDir is
               set. May be combined with switches:
                 -OutSeqFmt  write in specified file format
                 -OutDir     write files into specified directory
 -OutSeq="SingleSeq"
               write single-sequence output files with filenames corresponding
               to the sequence IDs
 -OutSeqFmt=S  format of sequence output, default "fastA"
 -OutSeqSort[=S]
               sort sequence output in ascending order by one of the criteria:
                 "id" (default) or
                 "descr"
 -pid=S        write process ID to file. This may be useful for monitoring of
               background processes in pipeline architectures.
 -pure[=S]     purify input sequence strings, leaving only letters that conform
               to the sequence alphabet. You may specify a sequence type (one
               of DNA, DNA5, RNA, RNA5, protein). In that case, fuzzy letters
               are converted to the official "unknowns", i.e. N for nucleotide
               and X for protein sequences.
 -SlcID=S      apply a regular expression to the sequence IDs and skip sequence
               entries that do not match
 -SlcLen=N1[..N2]
               select input sequences according to their length
                  N1  minimum length
                  N2  maximum length, default: no limit
 -SlcType=S    select input sequences according to their sequence type, "DNA" or
               "protein".
 -upper        force input sequences to upper case letters
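
The difference between the selecting switches (-SlcID) and the filtering
switches (-FilterID, -FilterDescr) may be easier to see in code. The following
Python sketch is only an illustration of the semantics described above. The
function keep_entry is hypothetical, the patterns are matched unanchored, and
Python's regular expression dialect stands in for Perl's. How the program
combines several switches internally is an assumption (here: an entry must
pass all of them).

```python
import re


def keep_entry(seq_id, descr, slc_id=None, filter_id=None, filter_descr=None):
    """Decide whether a sequence entry passes the selection switches."""
    # -SlcID: skip entries whose ID does NOT match.
    if slc_id is not None and not re.search(slc_id, seq_id):
        return False
    # -FilterID: ignore entries whose ID matches.
    if filter_id is not None and re.search(filter_id, seq_id):
        return False
    # -FilterDescr: ignore entries whose description matches.
    if filter_descr is not None and re.search(filter_descr, descr):
        return False
    return True


entries = [
    ("contigA", "human clone"),
    ("contigB", "cloning vector"),
    ("readC", "human clone"),
]
kept = [sid for sid, d in entries
        if keep_entry(sid, d, slc_id="^contig", filter_descr="vector")]
print(kept)
```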


SeqHandle.pl -BreakIntoAssembly
===============================

This program mode reformats sequences from standard file formats to so-called
"Experiment" files that can be imported into a GAP4 database. Because GAP4
limits the size of sequences (readings), a "directed assembly" of overlapping
fragments is generated. Regions of fragment overlap are turned to Ns in one of
the fragments in order to avoid a bias in consensus computations. Conforming
with GAP4 behaviour, a file of filenames "fofn" is generated, listing all
fragments of the directed assembly. Experiment files and fofn are created in
the current working directory, or in a custom directory specified by option
-OutDir.

COMMAND LINE SYNTAX
 SeqHandle.pl -BreakIntoAssembly[=size1[,size2]] [options] <seqfile1>  \
   [<seqfile2> ...]

program mode arguments, optional:
 size1         fragment size, default 1000
 size2         fragment overlap, default 5

arguments:
 seqfile       sequence source

options:
 -AnnotLbl=S   an annotation is created to cover the complete assembly. This is
               done by default, using annotation label "ENZ9" and entering the
               sequence description field into the annotation text.
 -lower        force input sequences to lower case letters
 -OutDir=dir   place directed assembly in a directory different from current
               working directory
 -upper        force input sequences to upper case letters

EXAMPLES

 SeqHandle.pl -BreakIntoAssembly seq.fa

   Break all sequences in seq.fa into overlapping fragments of size 1000, and
   generate single-fragment sequence files in Experiment format, written to the
   current working directory. Additionally, create file "fofn" that lists the
   names of the Experiment files.

 SeqHandle.pl -BreakIntoAssembly=3000,1 -OutDir=da seq.fa

   Break all sequences in seq.fa into overlapping fragments of size 3000 with
   an overlap of 1, and write the directed assembly to directory "./da"
   (created automatically).
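
The fragmentation scheme of this program mode can be sketched in Python as
follows. This is an illustration under stated assumptions, not the program's
actual code: the function break_into_fragments is hypothetical, and the manual
does not say which of two overlapping fragments gets its overlap turned to Ns
(here it is the leading overlap of the later fragment). Writing Experiment
files and the fofn is left out.

```python
def break_into_fragments(seq, size=1000, overlap=5):
    """Return a list of (start_offset, fragment) covering `seq`.

    Consecutive fragments share `overlap` letters. Assumption: the shared
    letters are replaced by N at the start of the later fragment.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    fragments = []
    start = 0
    while True:
        frag = seq[start:start + size]
        if start > 0:
            # Mask the region that is already present in the previous fragment.
            frag = "N" * overlap + frag[overlap:]
        fragments.append((start, frag))
        if start + size >= len(seq):
            break
        start += step
    return fragments


seq = "ACGT" * 625  # 2500 letters
frags = break_into_fragments(seq)  # defaults: size 1000, overlap 5
for offset, frag in frags:
    print(offset, len(frag), frag[:8])
```

Every position of the input stays unmasked in exactly one fragment, so the
original sequence can be rebuilt from the unmasked parts.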


SeqHandle.pl [-cat]
===================

Simply re-output the input sequences. This is the default program mode. It
offers a way to apply the many options for selection and filtering. In
addition, the output may be written in a different format (cf. manual section
"File Format Support"), split into single-sequence files, etc.

COMMAND LINE SYNTAX
 SeqHandle.pl [-cat] [options] <seqfile1> [<seqfile2> ...]

arguments:
 seqfile       sequence input

options:
 -FilterDescr=S
               apply a regular expression to the sequence descriptions, ignore
               matching sequence entries
 -FilterID=S   apply a regular expression to the sequence IDs, ignore matching
               sequence entries
 -fofn=S       append the entries of the specified file to the list of command
               line arguments. Multiple -fofn switch statements are allowed,
               taking effect cumulatively.
 -lower        force input sequences to lower case letters
 -OutDir=S     directory for file-targeting output. This switch overrides any
               directory statement provided with switch -OutStump.
 -OutIdFmt=S   ...
 -OutSeq=S     file path for sequence output, default: STDOUT. For multi-file
               output use switches -OutDir and -OutStump.
 -OutSeq="rewrite"
               preserve the file entities as found in the input. The input
               files are overwritten by the output unless option -OutDir is
               set. May be combined with switches:
                 -OutSeqFmt  write in specified file format
                 -OutDir     write files into specified directory
 -OutSeq="SingleSeq"
               write single-sequence output files with filenames corresponding
               to the sequence IDs
 -OutSeqFmt=S  format of sequence output, default "fastA"
 -OutSeqSort[=S]
               sort sequence output in ascending order by one of the criteria:
                 "id" (default) or
                 "descr"
 -pure[=S]     purify input sequence strings, leaving only letters that conform
               to the sequence alphabet. You may specify a sequence type (one
               of DNA, DNA5, RNA, RNA5, protein). In that case, fuzzy letters
               are converted to the official "unknowns", i.e. N for nucleotide
               and X for protein sequences.
 -SlcDescr=S   apply a regular expression to the sequence descriptions and skip
               sequence entries that do not match
 -SlcID=S      apply a regular expression to the sequence IDs and skip sequence
               entries that do not match
 -SlcLen=N1[..N2]
               select input sequences according to their length
                  N1  minimum length
                  N2  maximum length, default: no limit
 -SlcType=S    select input sequences according to their sequence type, "DNA" or
               "protein".
 -upper        force input sequences to upper case letters

EXAMPLES

 SeqHandle.pl -cat -SlcLen=100 myseq.fa

   lists all sequences from file "myseq.fa" that have a minimum sequence
   length of 100 nts/aas, in fasta format.

 SeqHandle.pl -OutSeq=SingleSeq -OutDir=./singseq/ -OutSeqFmt=Experiment  \
   myseq.fa

   writes all sequences from file "myseq.fa" into single-sequence files, in
   Experiment format.  The output files are created in directory "./singseq/".
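
The argument syntax of switch -SlcLen (N1[..N2], as used in the first example)
can be read as a minimum length with an optional maximum. The Python sketch
below shows one way to interpret such an argument. The function names
parse_len_range and length_ok are hypothetical, and treating both limits as
inclusive is an assumption.

```python
import re


def parse_len_range(arg):
    """Parse 'N1' or 'N1..N2' into (minimum, maximum); maximum may be None."""
    match = re.fullmatch(r"(\d+)(?:\.\.(\d+))?", arg)
    if match is None:
        raise ValueError("expected N1 or N1..N2, got %r" % arg)
    minimum = int(match.group(1))
    # A missing N2 means "no upper limit".
    maximum = int(match.group(2)) if match.group(2) is not None else None
    return minimum, maximum


def length_ok(seq, minimum, maximum=None):
    """True if the sequence length lies within the (inclusive) limits."""
    return len(seq) >= minimum and (maximum is None or len(seq) <= maximum)


lo, hi = parse_len_range("100..500")
print(lo, hi, length_ok("A" * 250, lo, hi))
```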


SeqHandle.pl -CatAnnotSeq
=========================

Given a combination of sequences and feature annotations, e.g. a sequence
database entry in GenBank format or a fasta-formatted sequence file together
with a GFF file, this program mode extracts the sub-sequences that are covered
by a certain annotation type.  Sequences are printed to STDOUT.

COMMAND LINE SYNTAX
 SeqHandle.pl -CatAnnotSeq=AnnotLbl [options] <seqfile1> [<seqfile2> ...]

arguments:
 seqfile       sequence input

program mode argument:
 AnnotLbl      label of annotations whose sequences shall be extracted

options:
 -AnnotLbl     no effect, cf. program mode argument
 -flank=N      add flanks of specified size to output sequence
 --gap=N       join sequence of two annotated ranges if they are separated by
               less than the specified gap length, default: no join of adjacent
               sequence ranges
 --uplow=B     turn annotated sequence upper case, flanking sequence lower case

EXAMPLES

 SeqHandle.pl -CatAnnotSeq=CDS AY937229.gb

   lists all CDS regions of annotated genome fragment AY937229, in fasta
   format.

 SeqHandle.pl -CatAnnotSeq=CDS -flank=10 --uplow=1 AY937229.gb

   again, lists all CDS regions of annotated genome fragment AY937229, in fasta
   format.  In addition, 10 nt flanking sequences are displayed in lower-case
   letters while the actual CDS is highlighted in upper-case letters.
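
The interplay of options --gap, -flank and --uplow can be sketched in Python.
This is a simplified illustration, not the program's implementation. The
functions merge_ranges and extract are hypothetical, ranges are given as
0-based (start, end) pairs with the end excluded, and strand handling is
ignored.

```python
def merge_ranges(ranges, gap=None):
    """Join ranges that are separated by fewer than `gap` letters.

    With gap=None (the default) no ranges are joined.
    """
    merged = []
    for start, end in sorted(ranges):
        if gap is not None and merged and start - merged[-1][1] < gap:
            # Extend the previous range instead of opening a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def extract(seq, start, end, flank=0, uplow=False):
    """Cut out seq[start:end] plus up to `flank` letters on either side."""
    left = max(0, start - flank)
    right = min(len(seq), end + flank)
    upstream, core, downstream = seq[left:start], seq[start:end], seq[end:right]
    if uplow:
        # Annotated part in upper case, flanks in lower case.
        return upstream.lower() + core.upper() + downstream.lower()
    return upstream + core + downstream


seq = "aaaaCCCCggggTTTTaaaa"
annotated = [(4, 8), (10, 14)]  # two ranges separated by 2 letters
for start, end in merge_ranges(annotated, gap=3):
    print(extract(seq, start, end, flank=2, uplow=True))
```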
