kPerl package
user manual SeqMotif.pl
Sequence Laboratory - Motifs in Sequences


Introduction
============

[...]


Strand Models for the Analysis of DNA and RNA Molecules
=======================================================

DNA may be viewed from different perspectives with respect to the
strandedness of the molecule. The kPerl package provides four strand models
to suit different needs in sequence motif analysis. On the command line, a
strand model is chosen via option `-strands=<model#>' where <model#> may be
one of:

  0  Consider the polynucleotide as a duplex molecule. Motifs are treated
     as duplex fragments rather than words in a single-strand molecule.

  1  Analyse the forward (Watson) strand only

 -1  Analyse the reverse-complement (Crick) strand only

  2  Analyse both DNA strands independently. For example, treat the occurrence
     of a symmetric motif as two independent instances, one in the forward and
     one in the reverse-complement strand. The result is the same as applying
     models 1 and -1 successively and adding the counts/measures. This way,
     relative frequencies will sum to 1.00.
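
The four models can be illustrated with a small motif counter. The following
Python sketch is not part of the kPerl package; the function names and the
exact counting rule for model 0 (a symmetric motif site counts as one duplex
fragment) are assumptions made for illustration only.

```python
# Illustrative sketch only; this is NOT the kPerl implementation.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_overlapping(seq, motif):
    """Count occurrences of motif in seq, allowing overlaps."""
    return sum(1 for i in range(len(seq) - len(motif) + 1)
               if seq.startswith(motif, i))


def count_motif(seq, motif, strands):
    """Count motif instances under an assumed reading of models 0, 1, -1, 2."""
    fwd = count_overlapping(seq, motif)           # forward (Watson) strand
    rev = count_overlapping(revcomp(seq), motif)  # reverse-complement strand
    if strands == 1:
        return fwd
    if strands == -1:
        return rev
    if strands == 2:
        # both strands counted independently
        return fwd + rev
    if strands == 0:
        # duplex view (assumption): a symmetric motif is seen on both strands
        # at the same site, so that site counts as one duplex fragment
        return fwd if motif == revcomp(motif) else fwd + rev
    raise ValueError("strand model must be 0, 1, -1 or 2")
```

For the sequence "GAATTCAAGG" and the symmetric motif "GAATTC", this sketch
returns 1 for models 1, -1 and 0, and 2 for model 2.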

The front-end programs which make use of these model definitions are:

 SeqHandle.pl
 SeqMotif.pl

In addition, the behaviour of some modules depends directly on the choice of
a strand model:

 SeqLab/Motif*.pm
 SeqLab/SuffixTrie.pm

The following listing explains more specifically which program functions are
affected by the choice of the strand model, and in which way.

 SeqMotif.pl -TupleLib

   For strand model 0 (duplex model), the program calculates frequency values
   that reflect the probability of observing a given tuple in at least one of
   the two DNA strands. The overall sum of frequencies will be somewhat less
   than 2.00 due to the existence of symmetric tuples, i.e. tuples that are
   identical to their own reverse complement. This applies only to tuples of
   even size, because only these can have symmetric instances. A typical
   question behind this DNA strandedness model would be: How often does a
   restriction endonuclease recognition motif occur in a dsDNA molecule?
   For strand model 1 or -1 (single-strand models), the program calculates
   frequency values that always sum to exactly 1.00.

 SeqMotif.pl -randomize

   Strand model 0 does not make sense here. If specified, it effectively
   results in the use of model 2, which is the default model for this
   program mode. Models 1 and -1 yield strand-specific base or word
   frequencies in the randomized output.
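
The frequency sums stated for `SeqMotif.pl -TupleLib' can be checked with a
short calculation. The Python sketch below is an illustration under assumed
counting rules, not the program's actual code: under model 0 every window
position contributes its tuple and the tuple's reverse complement, but a
symmetric tuple only once. The function name tuple_freqs is hypothetical.

```python
# Illustrative sketch only; counting rules are assumptions, not kPerl code.
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def tuple_freqs(seq, k, strands):
    """Relative k-tuple frequencies of seq under strand model 0, 1, -1 or 2."""
    counts = Counter()
    n = len(seq) - k + 1                      # number of window positions
    for i in range(n):
        w = seq[i:i + k]
        if strands == 1:
            counts[w] += 1
        elif strands == -1:
            counts[revcomp(w)] += 1
        elif strands == 2:
            counts[w] += 1                    # forward-strand instance
            counts[revcomp(w)] += 1           # reverse-strand instance
        elif strands == 0:
            # duplex: the set collapses a symmetric tuple to a single count
            for t in {w, revcomp(w)}:
                counts[t] += 1
        else:
            raise ValueError("strand model must be 0, 1, -1 or 2")
    total = 2 * n if strands == 2 else n      # model 2 normalizes over both strands
    return {t: c / total for t, c in counts.items()}
```

For "GAATTCAAGG" with tuple size 2, one of the nine windows ("AT") is
symmetric, so the model 0 frequencies sum to 17/9 (about 1.89). With tuple
size 3 no window can be symmetric and the sum is 2.00. Models 1, -1 and 2
sum to 1.00 in both cases.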


SeqMotif.pl -randomize
----------------------

Statistical sequence analysis essentially requires a null model which
describes what would be expected if the sequence had a purely random
composition, i.e. no purifying selection related to functional constraints
is taking place. A null model helps to predict the frequency of sequence
motifs, like transcription factor binding sites, splice motifs, and protein
domains. Random or randomized sequences are a way to supply a stochastic
sequence model to any analysis program. This method is generally applicable
in software evaluation studies, independent of any stochastic components
already built into the evaluated software.

Random sequence models may have different levels of complexity and accuracy.
The simplest possible model assumes that all sequence symbols (nucleotides,
in the case of DNA or RNA sequences) occur with equal frequency. More
complicated models try to mimic the natural nucleotide composition, with
respect to the frequency of single nucleotides, or even di- and
trinucleotide substrings.

The program function `SeqMotif.pl -randomize' generates random nucleotide
sequence sets by Monte Carlo simulation of a single-nucleotide frequency
model (default) or an Nth-order Markov chain model (option -TupleSize=M,
M=N+1). The program analyzes the sequence composition of the input to derive
a model. Based on this model, it creates random sequences that conform to
this composition. The output contains the same number of sequences as the
input sample, unless option --nseq=N is used to specify a desired number.
Likewise, the sequence lengths of the output are a randomization of the
values found in the input (cf. options --seqlen and --seqlenipol). Output is
written to stdout in FASTA format, using the sequence identifiers "rand0",
"rand1", ...
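
The principle of this randomization can be sketched in a few lines of
Python. This is a minimal illustration under stated assumptions, not the
program's implementation: all function names are hypothetical, only the
forward strand is modelled (as with -strands=1), output lengths are drawn at
random from the input lengths unless a fixed length is given, and a context
that was never followed by a symbol in the input triggers a restart from a
freshly drawn context.

```python
# Illustrative sketch only; NOT the SeqMotif.pl implementation.
import random
from collections import Counter, defaultdict


def build_model(seqs, tuple_size):
    """For each (tuple_size-1)-mer context, count the symbols that follow it."""
    order = tuple_size - 1
    model = defaultdict(Counter)
    for s in seqs:
        for i in range(len(s) - order):
            model[s[i:i + order]][s[i + order]] += 1
    return model


def draw(counter, rng):
    """Draw one key of counter with probability proportional to its count."""
    keys = list(counter)
    return rng.choices(keys, weights=[counter[k] for k in keys])[0]


def random_sequence(model, tuple_size, length, rng):
    """Generate one sequence by walking the Markov chain."""
    order = tuple_size - 1
    starts = Counter({ctx: sum(c.values()) for ctx, c in model.items()})
    out = list(draw(starts, rng))             # initial context (empty if order 0)
    while len(out) < length:
        successors = model.get("".join(out[len(out) - order:]))
        if successors:
            out.append(draw(successors, rng))
        else:
            out.extend(draw(starts, rng))     # dead end: restart (assumption)
    return "".join(out[:length])


def randomize(seqs, tuple_size=1, nseq=None, seqlen=None, seed=None):
    """Return (identifier, sequence) pairs named rand0, rand1, ..."""
    rng = random.Random(seed)
    model = build_model(seqs, tuple_size)
    if not model:
        raise ValueError("input sequences are too short for this tuple size")
    n = len(seqs) if nseq is None else nseq
    records = []
    for i in range(n):
        length = seqlen if seqlen is not None else len(rng.choice(seqs))
        records.append((f"rand{i}", random_sequence(model, tuple_size, length, rng)))
    return records
```

With the input "ACACACAC" and a tuple size of 2, the only observed
transitions are A to C and C to A, so every generated sequence alternates
between the two bases, whereas a tuple size of 1 would only preserve the
50:50 base composition.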


COMMAND LINE SYNTAX
 SeqMotif.pl -randomize [options] <seqfile1> [<seqfile2> ...] > <seqout>

arguments:
 seqfile       sequence source

options:
 -strands=N    strand model for sequence composition analysis, default:
               2 := analyze both strands separately. See also chapter
               "Strand Models for the Analysis of DNA and RNA Molecules".
               If the input sequences have a strand-specific orientation,
               this switch can be used to preserve that feature.
 -TupleSize=N  elementary unit size for randomization, default: 1 bp.
               A unit size N>1 selects a randomization mechanism
               based on an (N-1)th-order Markov model.
 --nseq=N      number of sequences in randomized output, default: same
               number as in input
 --seqlen=N    fixed sequence length, default: input values randomized
 --seqlenipol=B
               interpolate sequence lengths, default: explicit input
               values randomized

EXAMPLES

 SeqMotif.pl -randomize -strands=1 --nseq=1000 intron.fa  \
   > intron_rand1k.fa

   Randomize an RNA sequence, using the strand-specific base frequency
   derived from the input sequence(s) in file intron.fa. Output 1000
   randomized sequences.

 SeqMotif.pl -randomize -TupleSize=4 --seqlen=1000000 Ecoli_frag.fa  \
   > Ecoli_rand.fa

   Create random sequences, using a 3rd-order Markov model that is
   built from genomic input sequence. The program generates as many
   sequence entries as there are in the input, each 1 Mbp in size.

