
GenALA: a toolkit facilitating prokaryotic genome projects by linking GAP4 assembly and GenColors annotation

GenALA is a suite of programs (Table 1) that integrates essential steps (see flow diagram (pdf)) in prokaryote genome analysis projects. Their functionality includes the import of annotated genome sequences and gene predictions into the Staden Genome Assembly Program (GAP4) and the data export from GAP4 into GenBank format. A novel genome (target) can be assembled using an ab initio or a backbone procedure depending on the availability of a closely related genome. At any stage, the consensus sequence including quality values (confidence and coverage) and annotations can be exported from the GAP4 database into the web-based software/database system GenColors for a fast and reliable annotation by genome comparisons. The refined GenColors annotation can be returned to the GAP4 database, thus enabling maintenance of the annotation during further gap closure and finishing steps. This iterative process can be performed until the finally annotated target genome sequence is obtained. Furthermore, from backbone assembly projects whole genome alignment files can be exported and analyzed in GenColors.

The software is free for academic and non-profit use (download: GenALA.tar.gz and genALA_samples.tar.gz).

Acknowledgement: This work was supported by the German Ministry of Education and Research, grant 0312704E.

Table 1. Short description of GenALA tools

    genbank2gap
       Description: transforms a GenBank flat file into a GAP4 tagged sequence backbone and/or into GAP4 consensus tags
       Input:       GenBank flat file
       Output:      *.fofn, *.exp, *.tags

    gap2genbank
       Description: extracts the consensus sequence and annotation tags from a GAP4 database as a GenBank flat file
       Input:       GAP4 database
       Output:      *.gb

    bbgap2genbank
       Description: extracts from a backbone GAP4 database GenBank flat files for several target variants (t, n, h) and the reference (r), the target-reference alignment, as well as confidence and coverage
       Input:       GAP4 database
       Output:      *, *_t.gcc, *, *_n.gcc, *, *_h.gcc, *, *.msf

    gap2annotation
       Description: concatenates GAP4 consensus sequences for external feature predictions
       Input:       GAP4 database
       Output:      *.fa, *.rc, *.fap, *.fau

    annotation2gap
       Description: parses annotations from a simple tabular format into a GAP4 tag file
       Input:       tab delimited table file
       Output:      *.tags

    trna2gap
       Description: analyzes the tRNA content of a GAP4 genome project (tRNAscan-SE)
       Input:       GAP4 database
       Output:      *_trs.tags

    gapdeletag.tcl
       Description: deletes tags from a GAP4 database
       Input:       GAP4 database
       Output:      GAP4 database

    gapsequence.tcl
       Description: exports GAP4 sequences
       Input:       GAP4 database
       Output:      cons.exp

Hint to CONSED users: tools are available to swap between GAP4 and CONSED.

Detailed descriptions

    genbank2gap transforms a GenBank file for import into GAP4 projects
    USAGE: genbank2gap [OPTIONS]

    (-f)

       Generates a GAP4-readable tag output of GenBank features.
         NOTE: May be bound to an existing GAP4 project (-g) to take pads
               into account. The GenBank file should have the same sequence
               as the GAP4 project, or the tags will be placed at wrong
               positions.
             - Not meant to update a GAP4 project with new annotations; in
               that case you must use -u.

    (-n)

       Generates files for a directed assembly into a GAP4 project. Unless
       -i is given, the identifier of the GenBank file will be used as
       artificial "readname".

    (-u)

       GenBank features will replace existing GAP4 tags of the corresponding
       database reference (same GC2ID).
         NOTE: Requires -g PROJECT option!
             - Not meant for import of new db_xref numbers! (The lack of
               corresponding db_refs in GAP4 will result in loss of new
               db_xref.)
             - GC2IDs not found in the project will be lost!
             - Project tags without a corresponding GC2ID in the GenBank file
               will remain untouched.
       (--not TAG)

          List tags that shall remain as they are in the GAP4 project and
          must not be overwritten by the update. Multiple names must be
          comma separated, no blanks.

    Optional switches

    (-c GENBANK_TAG=GAP_TAG, --convert GENBANK_TAG=GAP_TAG)

       If there are more or different tags to be included in the process
       (see below "Converted features"), they must be listed here, each one
       with -c/--convert. The program's settings for this tag will be
       overwritten.
         NOTE: Use GENBANK_TAG=undef to turn off a conversion.

    (--delete_tags MIN_LEN)

       Delete tags above the given minimum length (MIN_LEN).
         NOTE: This option can be used with -u and it will run 
               gapdeletag for you.

    (-g PROJECT, --gap_project PROJECT)

       Name of the target gap project.

    (-i TEXT, --id TEXT)

       Contig ID (name of first read). Multiple IDs must be enclosed in
       double quotes ("") and separated by blanks.
         NOTE: If there are values given here, they will override potential
               values read from the file.

    (-o NUM, --opt_len NUM)

       Change the "optimal size length" (standard is 3500 bases) for the
       backbone fragments.

    (-w FILENAME, --write FILENAME)

       Write results to the named file.

    VERSION: 4.02, DATE: 21.06.2006

    USAGE: genbank2gap [OPTIONS]

    DESCRIPTION

    1) genbank2gap -f transforms the contents of a GenBank file into tag
    output (written to stdout or to -w FILENAME) that can be imported into
    an existing GAP4 project via its ">Edit >Enter tags" function. If the
    target GAP4 project contains pads (*), it is advisable to use the -g
    option, as the program will then take the pads into account while
    calculating the tag positions.

      Example: genbank2gap -f -w import.tags

    2) genbank2gap -n generates files for a directed assembly and a *.tags
    file for the ">Edit >Enter tags" function of GAP4 to import the
    consensus tags. Concatenated entries work well. If no contig IDs are
    given in the GenBank file as a "segment" qualifier, the GenBank display
    ID will be used.

      Example: genbank2gap -n

    3) genbank2gap -u updates an annotated GAP4 project with the data from
    a GenBank file. The GenBank features will replace GAP4 tags with the
    same IMBGC2 identifiers in the db_xref. GAP4 tags missing a valid GC2ID
    will remain untouched. The output can be imported into the GAP4 project
    via its ">Edit >Enter tags" function.

      Example: genbank2gap -u -g test.0 -w test.tags

    Before importing these new tags, all existing tags of the kind to be
    imported should be removed to prevent overloading of existing tags.

    4) Preparing a GenBank file for import into an existing project which
    may only differ by the existence of pads (*) in the GAP4 project:

      Example: genbank2gap -f -g test.0.aux -w test.tags

    Converted features

    So far the conversion of the following feature tags is supported:

    * genbank-feature      => GAP's tag
    * CDS genbank-feature  => CDS_ tag
    * rRNA genbank-feature => RRNA tag
    * tRNA genbank-feature => TRNA tag

    To modify, use the -c option.
    All qualifiers of the processed features will be converted into GAP4
    comment lines.

    KNOWN BUGS

    Split genes over the entry's ends

       Genes that extend over the start and end of the provided sequence
       will result in an empty first contig and a second contig containing
       the whole sequence. Deleting this split gene from the GenBank file
       (you will lose this gene tag!) will allow the creation of a proper
       project and/or tag file.

    Tag inside a tag

       Tags that lie entirely inside another tag will be left out. The list
       is saved (*skipped).

    Very long words

       In the GenBank format, long tag lines without blanks (e.g. long
       enzyme names) are broken into lines of suitable length. In GAP4 this
       will result in lines that have a blank at the former line break, not
       in a continuous line.

    GAP4 cannot identify IUPAC bases

       If there are IUPAC bases (e.g. w, s, r, even n) in the GenBank
       sequence, GAP4 will translate them into an "a" (adenine) in the
       consensus, as long as they have a confidence value associated with
       them. genbank2gap will tag these bases with an UNSR (unsure) tag.

    AUTHOR

    Markus B Schilhabel         mail: mbs
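The IUPAC issue described under KNOWN BUGS can be checked before import. The following is a minimal Python sketch, not part of GenALA, that lists runs of ambiguous bases in a sequence string. The function name `find_ambiguous_runs` and the 1-based inclusive coordinates are assumptions made for illustration only.

```python
import re


def find_ambiguous_runs(sequence):
    """Return (start, end, bases) for each run of non-ACGT characters.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    Pads (*) are skipped, because they are not ambiguous bases.
    """
    runs = []
    for match in re.finditer(r"[^ACGTacgt*]+", sequence):
        runs.append((match.start() + 1, match.end(), match.group()))
    return runs


if __name__ == "__main__":
    for start, end, bases in find_ambiguous_runs("ACGTNNACWSGT"):
        print(start, end, bases)
```

Each reported run corresponds to a stretch that GAP4 would otherwise show as "a" in the consensus.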

    gap2genbank generates a genbank file out of a GAP4 project

    -g FILE, --gap_project FILE

       Name of the GAP4 project to generate the new data file(s) from.
       Occurring pads will be stripped; add the -s option to keep them.


    -f FILE, --file FILE

       Name of experiment file, if there already is one you wish to use. 
       Make sure that you saved only non-cutoff reading annotations.

    -a NUM, --accvers NUM

       Version number of this accession


    (-c GAP_TAG=GENBANK_TAG, --convert GAP_TAG=GENBANK_TAG)

       If there are more or different tags to be included in the process
       (see "Converted tags"), they must be listed here, each one with
       --convert. Existing values will be overwritten!
         NOTE: Use GAP_TAG=undef to turn off a conversion.

    (-e FILE, --edit FILE)

       Some entries of a GenBank record cannot be found in the GAP4
       project. Name the file containing these data if you have one.
       Otherwise you will be asked to enter some data (e.g.):

         ORGANISM    = Borrelia garinii
         strain/ssp  = PBi
         ORG_Lineage = Bacteria; Spirochaetes; ...;  Borrelia.
         codon_table = 11
         locusID     = BGC
         division    = BCT
         DEFINITION  = linear chromosome
         ACCESSION   = AC00000 (or a Name eg IMB_PBil)
         KEYWORDS    = 
         mol_type    = genomic DNA

        One line per entry!
        Lineage separated by ; from Kingdom -> Species

     (-h, --help)

        print this help.

     (-l NUM, --low_limit NUM)

        Define a minimum length for the contigs to be used in the
        GenBank output.

     (-q NUM, --qual_cov NUM)

        Add quality and coverage files.

     (-r, --readtags)

       Also add tags from reads to the GenBank file. Beware: tags on
       separate reads will result in separate entries!

     (-s, --strip_no)

       Don't strip pads.

     (-w FILENAME, --write FILENAME)

       Write to file.

    VERSION: 3.74, DATE: 27.07.2006

    gap2genbank is designed to generate a genbank file out of an existing
    GAP4 project.

    As input it needs either the name of an existing GAP4 project or the
    name of an existing experiment file.

    For each contig there will be a separate GenBank entry. The entries will
    be printed to stdout as a single stream (unless using -w).

    Converted tags

    * GAP tag   => genbank qualifier
    * CDS_      => feature CDS tag.
    * RRNA      => feature rRNA tag.
    * TRNA      => feature tRNA tag.
    * REPT      => feature repeat_region tag.

    The CDS tag will be translated into proteins.

    If you need more tags in the genbank file, use --convert.

    The comments in the GAP4 tags must follow the naming convention (one
    comment per line):

    * GAP comment in tag    => genbank qualifier
    * >tag                  => /tag
    * >tag=comment for tag  => /tag="comment for tag"
    * tag="comment for tag" => /tag="comment for tag"

    all other comments will be collected in "note" fields:

    * any text here => /note="any text here"
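
The naming convention above can be illustrated with a short Python sketch. It is not the gap2genbank implementation; the function name `comment_to_qualifier` is hypothetical, and the handling of quotes is an assumption.

```python
import re


def comment_to_qualifier(line):
    """Map one GAP4 tag comment line to a GenBank qualifier string."""
    line = line.strip()
    if line.startswith(">"):
        # ">tag" or ">tag=comment for tag"
        name, sep, value = line[1:].partition("=")
        if not sep:
            return "/" + name
        return '/%s="%s"' % (name, value.strip('"'))
    # tag="comment for tag"
    match = re.match(r'^(\w+)="(.*)"$', line)
    if match:
        return '/%s="%s"' % (match.group(1), match.group(2))
    # Everything else is collected as a note
    return '/note="%s"' % line


if __name__ == "__main__":
    for comment in (">pseudo", ">gene=dnaA",
                    'product="DNA polymerase"', "any text here"):
        print(comment_to_qualifier(comment))
```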

    The output is the GenBank record (please redirect to a file).
    If there are more contigs in the project, every contig gets its own
    entry, but all will be printed in a single stream. They can be separated
    at the "end of sequence tag": //

    A program doing this is available from: mbs at
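
Splitting such a stream at the "//" terminator can also be sketched in a few lines of Python. This is an illustrative sketch only, not the program mentioned above; the function name `split_genbank_records` is hypothetical.

```python
def split_genbank_records(text):
    """Split a multi-entry GenBank stream into one string per entry.

    An entry ends at a line that contains only the terminator "//".
    """
    records, current = [], []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == "//":
            records.append("\n".join(current) + "\n")
            current = []
    # Keep a trailing entry that lacks its terminator
    if any(line.strip() for line in current):
        records.append("\n".join(current) + "\n")
    return records


if __name__ == "__main__":
    stream = "LOCUS A\nORIGIN\n//\nLOCUS B\nORIGIN\n//\n"
    for number, record in enumerate(split_genbank_records(stream), start=1):
        print("entry", number, "has", len(record.splitlines()), "lines")
```

Each returned string could then be written to its own file, one per contig.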

    Program messages will be printed to STDERR. Errors are saved.

    Tags will not be sorted by position. The program relies on the output of
    the GAP4 project.

    Long lines from GenBank files are broken when imported into GAP4. The
    continuation lines lack the qualifier tag and will be added as a note.

    1) Making a genbank file out of a GAP4 project:

      gap2genbank -g gap_p.0 > or
      gap2genbank -g gap_p.0 -w

    2) Making a GenBank file out of an experiment file when there is a
    saved data file (-e) and you want the ENZ5 tag to be a misc_feature in
    the GenBank file:

     gap2genbank -f cons.exp -e -c ENZ5=misc_feature -w

     Markus B Schilhabel         mail: mbs at

    bbgap2genbank v.1.1

    writes different types of sequence files (*.fa / *.gb) from a genome
    assembly project (GAP4) that was assembled against a reference


     [ -p 'GAP4_project_name.Version' ]
     [ -b 'common_root_of_all_reference_reading_names' ]
     [ -h ] print this online help
     [ -e 'resource_file' ]
     [ -d 'directory_of_result_files' ]
    bbgap2genbank writes a collection of different sequence files:

    1. consensus sequence of the reference assembly project with reference
       sequence in case of gaps in target sequence (*_h.fa, *

    2. consensus sequence of the reference assembly project with masked
       reference sequence (N) in case of gaps in target sequence(*_n.fa,

    3. consensus sequence - only target sequence (*_t.fa, *

    4. consensus sequence - only reference sequence (*_r.fa, *

    5. reference sequence with pads (*_rp.fa)

    6. target sequence with pads (*_tp.fa)

    7. msf-alignment of reference and target sequence

    8. quality files (tab delimited table) of hybrid-, target- and
       reference-project (*_h.gcc, *_t.gcc, *_r.gcc)

File name explanation
     <project>_h.fa/                     (1)+(8)  
     <project>_n.fa/                            (2)      
     <project>_t.fa/                     (3)+(8)  
     <project>_r.fa/_r.gk/_r.gcc                     (4)+(8)  
     <project>_rp.fa                                 (5)      
     <project>_tp.fa                                 (6)      
     <project>.msf Alignment of reference and target (7)
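
The difference between the hybrid (_h) and masked (_n) variants can be illustrated with a small Python sketch. It assumes, for illustration only, that target and reference are available as aligned strings of equal length and that missing target sequence is marked with "-". The function name `fill_target_gaps` is hypothetical and this is not the bbgap2genbank code.

```python
def fill_target_gaps(target, reference, gap="-", mask=None):
    """Fill gaps in the target with reference bases or with a mask symbol.

    mask=None gives a hybrid sequence (reference base where the target
    has a gap); mask="N" gives the masked variant.
    """
    if len(target) != len(reference):
        raise ValueError("aligned sequences must have equal length")
    result = []
    for target_base, reference_base in zip(target, reference):
        if target_base == gap:
            result.append(reference_base if mask is None else mask)
        else:
            result.append(target_base)
    return "".join(result)


if __name__ == "__main__":
    print(fill_target_gaps("AC--GT", "ACTTGA"))            # hybrid variant
    print(fill_target_gaps("AC--GT", "ACTTGA", mask="N"))  # masked variant
```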

     gap2annotation v.1.1 

     concatenates GAP4 consensus sequences for external feature predictions


      [ -p 'GAP4 project name' ]

      [ -h ] this online help
      [ -c 'lower limit of contig length' ]
      [ -g 'spacer length between 2 joined contig sequences' ]
            default value: 1000 x n (a spacer of 1000 "n" characters)
            A spacer is necessary because GeneMarkS accepts only FASTA
            files with one single sequence with a minimum length of 1 Mb.
      [ -l 'maximum length of sequence in FASTA format' ]
            default value: 15 Mb
            currently no known upper limit for GeneMarkS

        test__gms.fa   (concatenated sequence)
        test__gms.fap  (padded concatenated sequence)
        test__gms.fau  (unpadded concatenated sequence)
        normally you need only test_1.fau

Index file
        The contig positions in the FASTA file are written here.
        You need this file for annotation2gap.

Example of program call (minimal)
        gap2annotation -p  -g 40
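
The concatenation with spacers and the role of the index file can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the gap2annotation code: the function name `concatenate_contigs`, the in-memory (name, sequence) input, and the 1-based inclusive index coordinates are all hypothetical.

```python
def concatenate_contigs(contigs, spacer_length=1000, min_length=0):
    """Join contig sequences with spacers of "n" and record their positions.

    contigs is a list of (name, sequence) pairs. Returns the concatenated
    sequence and an index of (name, start, end) tuples, 1-based inclusive.
    Contigs shorter than min_length are skipped (compare option -c).
    """
    spacer = "n" * spacer_length
    parts, index, offset = [], [], 0
    for name, sequence in contigs:
        if len(sequence) < min_length:
            continue
        if parts:
            # A spacer goes between two joined contigs (compare option -g)
            parts.append(spacer)
            offset += spacer_length
        index.append((name, offset + 1, offset + len(sequence)))
        parts.append(sequence)
        offset += len(sequence)
    return "".join(parts), index


if __name__ == "__main__":
    joined, index = concatenate_contigs(
        [("c1", "ACGT"), ("c2", "GG"), ("c3", "TTT")],
        spacer_length=3, min_length=3)
    print(joined)
    for entry in index:
        print(*entry)
```

With such an index, a gene position predicted on the concatenated sequence can be mapped back to the contig it came from.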

     annotation2gap v.1.1 

     parses annotations from a simple tabular format into a GAP4 tag file


      [ -p 'GAP4 project name' ]
      [ -d 'directory of result files' ]
      [ -f 'space delimited table' ]
        file from GeneMarkS with CDS-Positions
        It arrives by e-mail with the subject
        'GeneMarkS: Gene Listing: <input file>'.
        Copying and pasting the listing into a file is the easiest way.
        default value: test_pos.txt

      [ -h ] this online help

      Example of required table:

      Gene  Strand  LeftEnd  RightEnd   Gene   Class
       #                               Length
       1      +        <3      1403     1401     1
       2      -      1751      3946     2196     2
       3      +      4267      6708     2442     1

     tag-file: <project name>.<version>_cds_tags.txt
        in Experiment-File-Format (q.v. STADEN-Package)
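
Reading a table of the form shown above can be sketched in Python. This is an illustrative parser, not the annotation2gap code; the function name `parse_gene_listing` and the returned dictionary keys are hypothetical. A leading "<" or ">" on a coordinate is treated as a marker for a partial gene, which is an assumption.

```python
def parse_gene_listing(text):
    """Parse a whitespace-delimited gene listing into a list of dicts."""
    genes = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have six fields and start with the gene number;
        # header lines are skipped by this check.
        if len(fields) != 6 or not fields[0].isdigit():
            continue
        number, strand, left, right, length, gene_class = fields
        genes.append({
            "gene": int(number),
            "strand": strand,
            "left": int(left.lstrip("<>")),
            "right": int(right.lstrip("<>")),
            "partial": left[0] in "<>" or right[0] in "<>",
            "length": int(length),
            "class": int(gene_class),
        })
    return genes


if __name__ == "__main__":
    table = """Gene  Strand  LeftEnd  RightEnd   Gene   Class
 #                               Length
 1      +        <3      1403     1401     1
 2      -      1751      3946     2196     2
 3      +      4267      6708     2442     1
"""
    for gene in parse_gene_listing(table):
        print(gene["gene"], gene["strand"], gene["left"], gene["right"])
```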

     trna2gap v.1.4

     produces a tag file with padded positions of tRNA-genes in a
     GAP4-project (based on a tRNAscan-SE analysis)


      [ -p 'GAP4 project name.version' ]

      [ -c 'lower limit of contig length' ]
      [ -d 'directory of result files' ]
      [ -h ] this online help

     tag-file: <project name>.<version>_trs.tags
        in Experiment-File-Format (q.v. STADEN-Package)
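
Because gene predictions are made on sequence without pads while GAP4 tags use padded positions, a coordinate conversion is needed. The following Python sketch shows one way to do it; it is not the trna2gap code, and the function name `unpadded_to_padded` and the 1-based coordinates are assumptions.

```python
def unpadded_to_padded(padded_sequence, unpadded_position, pad="*"):
    """Convert a 1-based unpadded position to its 1-based padded position."""
    if unpadded_position < 1:
        raise ValueError("positions are 1-based")
    bases_seen = 0
    for padded_position, base in enumerate(padded_sequence, start=1):
        if base != pad:
            bases_seen += 1
            if bases_seen == unpadded_position:
                return padded_position
    raise ValueError("position lies beyond the end of the sequence")


if __name__ == "__main__":
    consensus = "AC*GT**A"
    for position in (1, 3, 5):
        print(position, "->", unpadded_to_padded(consensus, position))
```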

deletes cons or read tags
gapdeletag.tcl -g gap_project.v -r [-c or -C NAME] [-l -a NUM -t TAG1,TAG2,... or all tags]

  -g FILE.V   name of gap_project
  -r          delete tags from reads
  -c          delete tags from consensus (all)
  -C NAME     delete tags from listed consensus (comma separated, no blanks)
 (-a NUM      delete above contig length (useless for read tags!))
 (-t TAGS     TAG1,TAG2,TAG3, ... name of tags to be deleted,
              comma separated, no blanks)
 (-l          only looking, not doing anything yet)
 (-v          be verbose)
  -h          print a help

saves consensus like GAP4
gapsequence.tcl -g gap_project.v [-c NAME -f [F|X|S] -o [filename] -s [y|n]]

  -g FILE.V   name of gap_project
 (-c NAME(S)  selected contigs in C1,C2,C3    (default: all))
 (-f FORMAT   X(periment), F(asta), S(taden)  (default: F))
 (-o FILE     output filename                 (default: cons))
 (-s y/n      strip pads [y or n]             (default: n))
 (-h          print a help)