
      SPLITDB                                              update 16 Aug 2001
      
      
      NAME
            splitdb - split GenBank files into annotation, sequence, and index
      
      SYNOPSIS
            splitdb  [-gepvlct] dbfile anofile seqfile indfile
      
      DESCRIPTION
      Splitdb splits a database (dbfile) among three files: anofile, seqfile
      and indfile. Splitdb ignores any header information that might be in the
      file and begins processing at the first entry. 
      
      anofile contains the annotation portion of each entry. Entries are
      terminated with '//' or '///' (PIR only). Trailing blanks present in
      dbfile are omitted in anofile.
      
      seqfile contains the sequence data for each entry. Each sequence 
      entry begins with a header line, followed by sequence data on 
      succeeding lines of 75 characters per line.  The header line 
      includes the header flag character '>' in column 1, followed by the 
      name, followed by the first 50 characters of the 1st 
      DEFINITION line. An example is shown below:
      
      >UNHOR1 - Unicorn horn protein 1, complete cDNA sequence
      attcctctatagtctattctagctagccaaataggttagatggctgtcttactacttacgc
      ...
      
      Removal of blanks and numbers from sequence lines makes makes split
      datasets about 8-9% smaller than the original GenBank files.

      indfile is an index which tells the line numbers for each entry in 
      anofile and seqfile.  It is assumed to be in alphabetical order by 
      name. Each line contains a name and accession number, followed by the
      line numbers on which the annotation and sequence data begin in anofile 
      and seqfile, respectively.  Thus the file plants.ind might contain:

                 1         2         3         4         5         6         7         8
        12345678901234567890123456789012345678901234567890123456789012345678901234567890

	A15660           TA156608        1        1
	A15671           A15671         33       11
	A15673           A15673         65       25
	A15675           AK156751       97       36
	A15677           BA156770      128       46
	A16780           BA167807      160       57
	A16782           A16782        192       70
	ATHRPRP1C        GM905105      225       83
        etc...
      
      Note that indfile is a perfectly legitimate .nam file, for use with 
      programs such as getloc, getob, or comm. 


      The following options identify the type of database being read:

            -g    GenBank (default)
            -e    EMBL
            -p    PIR (NBRF)
            -v    Vecbase
            -l    LiMB

      Other options:
            -c    Compress 3 or more leading blanks in annotation lines
                  to take the form <CRUNCHFLAG><CRUNCHCHAR>, where CRUNCHFLAG
                  is the ASCII character specified by the Pascal const 
                  CRUNCHOFFSET, which is set to 33 ("!") in the current 
                  implementation. For each annotation line read, if the
                  number of leading blanks is >=3, splitdb sets CRUNCHCHAR
                  to CRUNCHOFFSET+the number of blanks. Thus, for lines 
                  with 3, 4, or 5 leading blanks, CRUNCHCHAR would be
                  '$', '%' and '&', respectively. GETLOC and GETOB
                  automatically expand crunched blanks when CRUNCHFLAG
                  is encountered on an input line. Empiracle observations
                  indicate that the -c option decreases the size of 
                  GenBank files by about 10%.

                  This compression method may fail when the number of 
                  leading blanks exceeds 127-CRUNCHOFFSET. However, 
                  none of the above mentioned databases currently
                  supports any datafield with anywhere near that number
                  of leading blanks.
                  
             -t   (GenBank only) Append all information in the first
                  ORGANISM to the end of each line in indfile. For example,
                  the entry which begins:
                  
     LOCUS       GORMTDLOOZ    282 bp    DNA             UNA       11-MAR-1996
     DEFINITION  GGGOMT493; Gorilla gorilla gorilla (BomBom, ISIS 438, Audubon
                 Zoological Gardens) mitochondrial D-loop DNA.
     ACCESSION   L76759
     NID         g1222584
     KEYWORDS    D-loop.
     SOURCE      Mitochondrion Gorilla gorilla gorilla (individual_isolate BomBom,
            ISIS 438, Audubon Zoological Gardens, sub_species gorilla) male
            DNA.
       ORGANISM  Mitochondrion Gorilla gorilla gorilla
                 Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
                 Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Gorilla.
                  
                  might be indexed as
                  
      GORMTDLOOZ   L76759   1    1 Mitochondrion Gorilla gorilla gorilla
                 
                  This is useful for taxonomic studies, or as a way of making
                  it easy to create subsets from a single index. Thus,
                  'grep gorilla primates.ind' would print all lines in the
                  file that contained the word gorilla. The output from
                  this command could be used as a .nam file for extracting
                  just gorilla sequences from a larger dataset using 
                  fetch.
                 
                
     NOTES
       1. Header lines that aren't part of entries are automatically
       stripped out during processing. For example, in a file containing
       GenBank entries, all lines up to the first occurrence of 'LOCUS'
       starting in column 1, are ignored. Similarly for PIR, processing
       begins on the first line containing 'ENTRY' beginning in column 1.
       2. GenBank/EMBL/DDBJ entries created on or after Feb. 1, 1996,
       have accession numbers of 8 characters, rather than 6. Previously
       assigned accession numbers will remain at 6 characters. Splitdb has
       been updated to write all accession numbers to the .ind file, left
       justified in a field of 8 characters.  
       3. As of GenBank Release 127.0, the LOCUS names will increase from
       a maximum of 12 to a maximum of 16 characters. To accommodate this
       change, the NAME filed in .ind files has expanded to 16 
       characters. The current version of splitdb can read both old and
       new GenBank flatfiles, but will create .ind files using the
       new format.
  
     SEE ALSO
            getloc, getob, comm(1) (Unix command).

     AUTHOR
       Dr. Brian Fristensky
       Dept. of Plant Science
       University of Manitoba
       Winnipeg, MB  Canada  R3T 2N2
       Phone: 204-474-6085
       FAX: 204-261-5732
       frist@cc.umanitoba.ca

     REFERENCE
       Fristensky, B. (1993) Feature expressions: creating and manipulating
       sequence datasets. Nucleic Acids Research 21:5997-6003.
