Current version v1.1.1 (2008-12-08).
-- tuple_plot README --
Copyright (C) 2006
Karol Szafranski & Niels Jahn
Genome Analysis Group, Leibniz Institute for Age Research - Fritz Lipmann
Institute, Jena (Germany)
PURPOSE
Tuple_plot identifies and visualizes local similarities between two genomic
sequences, typically 100 kb or longer, by applying the well-known dotplot
principle. The implemented scoring scheme results in a high signal-to-noise
ratio.
INSTALLATION
The software is known to build and run properly on different Linux/UNIX
platforms and under MacOS. It should compile and run on any system
providing an ANSI/ISO-conforming C++ compiler (C++98), preferably GNU g++.
Tuple_plot requires some shared libraries to be installed on your system:
  GD (http://www.boutell.com/gd/)
which itself depends on:
  libjpeg (http://www.ijg.org/)
  libpng (http://www.libpng.org/pub/png/libpng.html)
  libttf or freetype (http://www.freetype.org)
Installation under MacOS first requires installation of the Xcode developer
tools (which provide the compiler) as well as X11 (both available on the OS
disc). For installing the GD library on MacOS via the terminal, some useful
descriptions are available on the web
(e.g. http://www.paginar.net/matias/articles/gd_x_howto.html). You may also
follow the protocol file macos-install-gd.rtf included in this package.
DarwinPorts offers another way to do the installation
(http://homepage.mac.com/duling/halfdozen/GD-Howto.html).
On Linux systems, installation of required libraries should be convenient
using rpm files, as provided by the system distributor or available on the
web. Note that you need to install the shared library as well as development
versions of the library packages, at least for GD.
To compile the tuple_plot program, run the makefile included in this package
with the commands:
  cd install_dir
  make
If you have successfully installed the program on a new platform, or if you
encounter any problems during installation, please contact the authors
through the distribution web site http://genome.fli-leibniz.de/software.html
PROGRAM DESCRIPTION
This section focuses on the command line interface of tuple_plot. The
implemented algorithm and the general procedural scheme have been published
(see HOW TO CITE below).
A minimal program call requires two items: (i) the path(s) of the two input
sequences and (ii) a directive describing what type of output is desired.
The input sequences must be provided in fasta format, either together in a
single file or separately in two files. The output mode may be either a PNG
image file alone (directive -o) or that image file wrapped by an HTML
document (directive -H):
  tuple_plot -o tplot.png seq1.fa [seq2.fa]
  tuple_plot -H tplot seq1.fa [seq2.fa]
The latter mode is highly recommended since the HTML document (file
ofile_stump.html) verbosely describes all program settings used for
computation as well as the steps of the computational process. This
information documents the sequence comparison results in detail for later
inspection, and it helps you develop an effective strategy to optimize the
comparison task, if necessary.
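For scripted runs, the two call forms shown above can be assembled
programmatically. The following Python sketch is illustrative only: the
helper name build_tuple_plot_cmd is hypothetical and not part of this
package, and it uses only the -o and -H directives described in this README.
It builds the argument list without running the program.

```python
def build_tuple_plot_cmd(output, seq1, seq2=None, html=True):
    """Return an argument list for a tuple_plot call (hypothetical helper).

    html=True  -> tuple_plot -H <output> seq1 [seq2]  (PNG wrapped by HTML)
    html=False -> tuple_plot -o <output> seq1 [seq2]  (PNG image only)
    """
    cmd = ["tuple_plot", "-H" if html else "-o", output, seq1]
    if seq2 is not None:
        # The second file is optional: both sequences may be in one file.
        cmd.append(seq2)
    return cmd


if __name__ == "__main__":
    print(" ".join(build_tuple_plot_cmd("tplot", "seq1.fa", "seq2.fa")))
```

The returned list can be passed to subprocess.run(cmd, check=True) once
tuple_plot has been compiled and is found on the PATH.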
tuple_plot dynamically adjusts different parameters of the comparison
procedure. This self-parametrization yields satisfactory results in most
cases. However, several program options can be used to obtain optimal
results, and these are described in the following. A complete listing of the
command line options can be obtained by calling the built-in usage help:
  tuple_plot -h
First, to better understand the available command line options, it is useful
to know about the basic structure of the program's work flow. It is
organized in three sections:
A. prelude
  - analysis of sequence composition
  - suggestion of optimal word size used for local sequence comparison
  - construction of word frequency and word instance dictionaries
  - masking of overrepresented words
  - ranking of words
B. actual performance of the sequence comparison
  - word hits are sampled, scored, and transferred to the dot matrix
C. dotplot presentation
  - application of display thresholds to values of the dot matrix
  - preparation of the dotplot image
  - merging of supplied annotation data
  - finishing of output files
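The work flow above can be illustrated with a minimal word-based dotplot in
Python. This is a simplified sketch and not the implementation of
tuple_plot: hits are counted without any scoring or masking, and the
function names (word_positions, dot_matrix) are hypothetical. It covers the
dictionary construction of section A and the hit sampling of section B;
thresholding and image output (section C) are left out.

```python
from collections import defaultdict


def word_positions(seq, k):
    """Map each word (k-mer) of seq to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def dot_matrix(seq_x, seq_y, k, width, height):
    """Accumulate unweighted word hits into a height x width matrix.

    Both sequences must be at least k bases long.
    """
    index = word_positions(seq_x, k)          # section A: word dictionary
    matrix = [[0] * width for _ in range(height)]
    n_x = len(seq_x) - k + 1                  # number of words in seq_x
    n_y = len(seq_y) - k + 1                  # number of words in seq_y
    for j in range(n_y):                      # section B: sample word hits
        for i in index.get(seq_y[j:j + k], ()):
            col = i * width // n_x            # bin sequence position to pixel
            row = j * height // n_y
            matrix[row][col] += 1
    return matrix
```

Comparing a sequence with itself at one pixel per word gives a matrix with
a filled main diagonal, which is the expected self-match line of a dotplot.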
Options that affect the sensitivity and specificity of the dotplot approach
will modulate steps in sections A and C, as indicated by headlines in the
command line help (section A: "options affecting the sequence comparison";
section C: "thresholding hit display" and "options affecting the dotplot
image").
The user can choose whether the sequence comparison is performed in both
co-directional and counter-directional orientation (forward/forward as well
as reverse/forward; this is the default behavior), co-directional only
(option -f), or counter-directional only (option -r). Co-directional and
counter-directional hits are computed as independent layers of the dotplot
and are displayed in different colors (co-directional black,
counter-directional red).
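A counter-directional hit is a match between a word of one sequence and the
reverse complement of a word of the other. The Python sketch below shows the
idea with a deliberately simple quadratic scan; the function names are
hypothetical, and which of the two sequences tuple_plot reverses internally
is not specified here.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def word_hits(seq_x, seq_y, k, counter=False):
    """Return (i, j) start positions of matching words of length k.

    counter=False: co-directional hits (forward/forward).
    counter=True:  counter-directional hits, i.e. the word in seq_x
                   equals the reverse complement of the word in seq_y.
    """
    hits = []
    for i in range(len(seq_x) - k + 1):
        word = seq_x[i:i + k]
        if counter:
            word = reverse_complement(word)
        for j in range(len(seq_y) - k + 1):
            if seq_y[j:j + k] == word:
                hits.append((i, j))
    return hits
```

Running word_hits once with counter=False and once with counter=True gives
the two independent layers that would be drawn in black and red.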
A word (tuple) size optimal for comparison is automatically suggested by the
program, dynamically adapted to the length of the input sequences. Forced
settings (option -t) will have little effect on the dotplot results unless
extreme values are applied. Note that increasing the word size will cause
longer computation time and increased memory requirements. However, neither
effect is an issue with sequence sizes below 1 Mb.
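To get a feeling for why the word size should grow with sequence length,
the following Python sketch estimates the number of chance word hits between
two random sequences. It assumes independent, equally frequent bases, which
is a textbook approximation and not the rule tuple_plot uses to suggest a
word size; the function name is hypothetical.

```python
def expected_random_hits(len_x, len_y, k):
    """Expected number of chance co-directional word hits of size k.

    Assumes independent, uniformly distributed bases, so any two words
    of length k match with probability 4**-k.
    """
    n_x = len_x - k + 1     # number of words in the first sequence
    n_y = len_y - k + 1     # number of words in the second sequence
    return n_x * n_y / 4 ** k


if __name__ == "__main__":
    # Two sequences of 100 kb each, a range of word sizes.
    for k in range(8, 17, 2):
        print(k, expected_random_hits(100_000, 100_000, k))
```

Each additional base in the word size lowers the expected number of chance
hits by about a factor of four, while the number of distinct possible words
(4**k) grows by the same factor.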
A stochastic scoring scheme is the distinguishing feature of tuple_plot and
results in a high signal-to-noise ratio. First, words are completely ignored
if their overall frequency is at least x times the frequency expected under
a homogeneous distribution of words (factor x set by option -i). Second, the
expected frequency of random hits is used to counter-correct the observed
hits (default -s1, switched off by -s0). A second correction scheme
(option -s2), additional to the one described in the publication, uses
squared correction weights and yields slightly different results. However,
since the latter is less well founded theoretically, we recommend the
default correction scheme. Reports that allow you to monitor the process of
word exclusion and word/hit scoring can be invoked using options -n and -m,
possibly in combination with option -M.
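The word exclusion step can be sketched in Python as follows. This is an
illustration under stated assumptions, not the code of tuple_plot: the
expected frequency is taken as the number of words divided by 4**k (all
words equally likely), the comparison uses "at least x-fold", and the
function name is hypothetical. The hit correction weights of options -s1
and -s2 are described in the publication and are not reproduced here.

```python
from collections import Counter


def overrepresented_words(seq, k, fold):
    """Return the set of words occurring at least `fold` times as often
    as expected under a homogeneous word distribution (assumption:
    expected count = number of words / 4**k)."""
    n_words = len(seq) - k + 1
    expected = n_words / 4 ** k
    counts = Counter(seq[i:i + k] for i in range(n_words))
    return {word for word, n in counts.items() if n >= fold * expected}


if __name__ == "__main__":
    # A poly-A stretch makes the word "AA" strongly overrepresented.
    print(overrepresented_words("A" * 20 + "ACGT", 2, 5))
```

Words returned by such a filter would simply be skipped when hits are
sampled, which removes the dense blocks that low-complexity repeats
otherwise produce in a dotplot.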
After scored hits have been sampled to the dotplot matrix (work flow
section B), the next subtask is to transfer the matrix data to a graphical
representation, i.e. the dotplot image. Parametrization of this subtask
concerns the fraction of the dotplot image pixels that are colored to
indicate hit state. The default behavior refers to the expectation that the
dotplot will show a perfect solid diagonal, composed of
  2 * min(size_x,size_y)
pixels. With default settings (corresponding to option -A 1.0), the program
determines this number of highest-scoring matrix values and transfers these
as colored pixels into the dotplot image. If you expect (or observe) much
background signal that scatters outside the expected match diagonal, it is
reasonable to raise the sensitivity of the sequence comparison by
increasing the value given with option -A. Option -a similarly scales the
signal of the dotplot image, directly specifying the fraction of colored
pixels. Option -c directly sets the score threshold that is applied during
transfer of dotplot matrix values to the dotplot image. Option -A is
recommended over -a or -c because it gives the most robust behavior with
varying settings of image size and other parameters that influence the
sensitivity/specificity of the comparison.
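The relation between option -A and the resulting score threshold can be
sketched in Python. The sketch assumes that size_x and size_y are the image
dimensions in pixels and that the matrix has one value per pixel; the
function names are hypothetical and the handling of ties is a choice made
for this example only.

```python
def score_cutoff(matrix, size_x, size_y, a_factor=1.0):
    """Return the score of the n-th best matrix value, where
    n = a_factor * 2 * min(size_x, size_y); None if nothing qualifies."""
    n_pixels = int(a_factor * 2 * min(size_x, size_y))
    scores = sorted((v for row in matrix for v in row if v > 0), reverse=True)
    if n_pixels <= 0 or not scores:
        return None
    # Fewer scored values than requested pixels: take the lowest score.
    return scores[min(n_pixels, len(scores)) - 1]


def to_pixels(matrix, cutoff):
    """Mark every matrix value at or above the cutoff as a colored pixel."""
    if cutoff is None:
        return [[False] * len(row) for row in matrix]
    return [[v >= cutoff for v in row] for row in matrix]
```

Because values equal to the cutoff are all kept, ties can color slightly
more pixels than requested. A cutoff computed this way plays the role of
the threshold that option -c sets directly.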
Finally, a set of options influences the shape of the dotplot image. Options
-x and -y set the image dimensions. By default, the maximum edge size is
500 pixels and the ratio of horizontal (x) and vertical (y) dimensions is
proportional to the sizes of the two input sequences. Option -q turns off
the proportional scaling and forces a square shape. Option -g provides an
interface to user-supplied GFF-formatted annotations that will be merged
into the dotplot image, using colors specified in the feature field (field
#3 according to the GFF definition, cf.
http://www.sanger.ac.uk/Software/formats/GFF/) in a hexadecimal RGB color
format as defined by the HTML standard (e.g. "#C8E2C8"). Use sequence IDs
"seq1"/"seq2" or "seqA"/"seqB" in the GFF sequence field (field #1) to
refer to one of the input sequences.
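Two small Python helpers illustrate the default image geometry and the
layout of an annotation line for option -g. Both are sketches with
hypothetical names. The rounding rule, the mapping of the first sequence to
the x axis, and the square edge length are assumptions. The GFF line follows
the general tab-separated field order (seqname, source, feature, start, end,
score, strand, frame); the source value "user" and the "." placeholders are
assumptions as well.

```python
def image_size(len_x, len_y, max_edge=500, square=False):
    """Return (width, height) in pixels, longest edge = max_edge."""
    if square:
        return max_edge, max_edge
    longest = max(len_x, len_y)
    width = max(1, round(max_edge * len_x / longest))
    height = max(1, round(max_edge * len_y / longest))
    return width, height


def gff_line(seq_id, color, start, end, source="user"):
    """Build one GFF line with the color in the feature field (field #3).

    seq_id should be "seq1"/"seq2" (or "seqA"/"seqB"); start and end are
    1-based sequence positions.
    """
    fields = [seq_id, source, color, str(start), str(end), ".", ".", "."]
    return "\t".join(fields)


if __name__ == "__main__":
    print(image_size(200_000, 100_000))
    print(gff_line("seq1", "#C8E2C8", 1000, 2000))
```

Lines produced by gff_line can be written to a text file, one per feature,
and that file passed to tuple_plot with option -g.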
HOW TO CITE
The program tuple_plot and its underlying algorithm are described in the
following publication:
  Szafranski K, Jahn N, Platzer M. tuple_plot: fast pairwise nucleotide
  sequence comparison with noise suppression. Bioinformatics 22, 1917-1918
  (2006).
THANKS TO
We thank Christoph Grunau for documentation material concerning gdlib
installation under MacOS, and Klaus Huse for extensive beta testing.
LICENSE
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA