


User Commands                                             tacg(1)



NAME
     tacg - finds short patterns in nucleic acids, translates DNA
     <-> protein.


SYNOPSIS
     tacg -flag [option] -flag [option]  ...   tacg  takes  input
     from  stdin  (|  or  <);  spits  output to screen (default),
     >file, | next command

     [-chHlLqQsv]  [-b #]  [-e #]  [-C {1-13}]  [--clone '#_#,#x#...']
     [--cost #]  [-D 0-4]  [--dam]  [--dcm]  [--example]  [-f {0|1}]
     [-F {0-3}]  [-g #,#]  [-G #,{X|Y|L}]  [-H (--HTML) 0|1]
     [-i (--idonly) 0-2]  [-m #]  [-M #]  [-n {3-8}]  [--numstart]
     [--notics]  [-o {0|1|3|5}]  [-O {1-6},#]  [-p Name,pattern,Err]
     [-P NameA,(+|-)(l|g)Dist_Lo(-Dist_Hi),NameB]  [--ps]  [--pdf]
     [--logdegens]
     [-r (--regex) {'Label:RegexPat' | 'FILE:FileOfRegexPats'}]
     [--rule 'Name,(LabA:m:M&LabB:m:M),Win']
     [--rulefile '/path/to/rulefile']
     [-R alternative Pattern/Matrix file]  [--raw]  [-S (1*|2)]
     [--silent]  [--strands {1|2}]  [-T {0|1|3|6},{1|3}]  [-V {1-3}]
     [-w {1|#}]  [-W (--slidwin) #]  [-x NameA(=),NameB..(,C)]
     [-X (--extract) {b,e,[0|1]}]  [-# %]  [--rev]  [--comp]  [--revcomp]


DESCRIPTION
     tacg takes input from stdin, automagically  translates  most
     standard  ASCII  formats of Nucleic Acid (NA) sequence, then
     analyses that sequence for restriction enzyme (RE) sites and
     other  NA  motifs  such as Transcription Factor (TF) binding
     sites (w/ or w/o mismatch errors), matrix matches, and regu-
     lar  expressions,  finally  writing  analyses to stdout.  It
     also can translate the NA input to  protein  in  any  frame,
     using any of a number of codon translation tables, and
     search for Open Reading Frames (ORFs), as  well  as  perform
     many  other  analyses.   Most  of  the internals use dynamic
     memory so there are few limits on sequence  input  size  and
     pattern number.  It is roughly 5-50x faster than the comparable
     routines in GCG or EMBOSS and, as it is written in ANSI C, it is
     portable to all Unix variants, and even to Microsoft Win32 with
     the Cygwin and mingw32 toolkits.

     tacg searches the sequence read from stdin for matches based
     on descriptions stored in a database of patterns: explicit
     sequences, possibly containing IUPAC degeneracies (default
     rebase.data, in GCG format or extended format); matrix
     descriptions (default matrix.data, in TRANSFAC format);
     regular expressions (default regex.data, in GCG-like format);
     or a rules file (default rules.data, in a simple format).
     Based on the matches and the options entered on the command
     line, it sends ALL output to stdout. (Unless requested,  it  no
     longer sends errors to stderr (except failure errors) and it
     no longer emits default output - you  have  to  request  all
     output,  except  for  the  simplest case: the '-p' flag will
     also set the -S flag to generate Sites.)


     tacg now automagically translates most ASCII  formats  (Gen-
     bank,  FASTA,  etc)  via  Jim Knight's SEQIO library and now
     handles multiple sequences at one time, internally converting
     'u's to 't's.  It considers both strands at the same
     time, so you don't have to manually reverse complement the
     sequence (although you can; see --rev, --comp, --revcomp), and
     will by default accept all IUPAC degeneracies (yrmkwsbdhv),
     performing all possible operations on that sequence.   It
     treats degeneracies in the input sequence in different  ways
     depending   on the -D flag (see below). It either strips all
     letters other than 'a','c','g',  or  't'  and  analyzes  the
     sequence  as  'pure'  using a fast incremental hashing algo-
     rithm or it treats it as degenerate and analyses  it  via  a
     slower  de  novo  hash.   By  default, it treats sequence as
     'pure' unless it detects an IUPAC degeneracy, in which  case
     it  will  adaptively  switch back and forth between the fast
     and slow hashing routines.


     NB: tacg can produce  lots  of  output,  especially  in  the
     Linear  map  mode;  while  it's  possible  to pipe direct to
     lp/lpr, you'll probably regret it.


REQUIREMENTS
     tacg 3.5 requires an external codon file, codon.data, but does
     not absolutely require a pattern/REBASE file, allowing you
     to enter patterns via the command line with the '-p' flag.
     However, most users will want to use a REBASE file in GCG
     format to supply the RE definitions.  By default the name of
     this (supplied) file is rebase.data, although other files in
     the same format can be specified by the -R flag.  While you
     can use the default GCG-formatted file from NEB's REBASE
     distribution (http://rebase.neb.com), additional information
     is required to use the --dam, --dcm, or --cost options.
     This info is included in the distribution of tacg and can be
     added or modified with a text editor.  Searching for matrices
     requires the use of a TRANSFAC-formatted file (also supplied,
     under the default name matrix.data).


     The codons/pattern/matrix data files may exist in any  of  3
     locations  which  are  searched in the order of: the current
     directory $PWD, your  home  directory  $HOME,  or  tacg  lib
     $TACGLIB. Many shells will automatically define the 1st two;
     the last must be specified either via  command  line  or  in
     your ._c_s_h_r_c file.

     ie. 'setenv TACGLIB /usr/local/lib/tacg'   [csh/tcsh]

     or 'export TACGLIB=/usr/local/lib/tacg'    [bash]
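
     The search order described above can be sketched in Python as
     follows.  This is only an illustration: the function name
     find_data_file and its parameters are hypothetical and are not
     part of tacg, and the sketch assumes that the first location
     holding the file wins.

```python
import os

def find_data_file(name, env=None, exists=os.path.isfile):
    """Return the first path to `name` found in $PWD, $HOME, then $TACGLIB.

    `env` and `exists` are injectable only so the sketch is easy to test;
    they are assumptions of this example, not tacg parameters.
    """
    env = os.environ if env is None else env
    for var in ("PWD", "HOME", "TACGLIB"):
        directory = env.get(var)
        if not directory:
            continue  # variable not set; try the next location
        path = os.path.join(directory, name)
        if exists(path):
            return path
    return None  # not found in any of the three locations
```

     For example, find_data_file("rebase.data") returns None when
     the file is in none of the three places.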


FLAGS and OPTIONS
     {} = required for flag; * =  default  (doesn't  need  to  be
     entered); # = an integer value; () = optional

     e.g. -f {0|1*} means that the flag must be entered as -f1 or
     -f0.  The forms -f0 and -f 0 are equally acceptable, and flags
     without options can be grouped together (-sScl).  A single
     flag requiring an option can be appended to the end of a
     string of simple flags, but not more than 1.  -Ls is
     OK, and -Lsn6 is OK, but -Lsn6F3 is NOT; it must be entered as
     -Lsn6 -F3.  Appending a flag that expects an option value
     without one will cause odd behavior, usually a cryptic error
     message and the program halting.  NOT entering a flag
     gives the default behavior.


     -b {#}
          select the beginning of a  subsequence  from  a  larger
          sequence  file;  1*  for  1st  base of sequence. In the
          Linear Map output, the upper label indicates  numbering
          from  beginning  of  subsequence; the lower label indi-
          cates  numbering  from  the  beginning  of  the  entire
          sequence  (see  file 'tacg.main.html' for more detail).
          The smallest sequence that tacg can handle is 4  bases,
          10  for  the  ladder map (-l).  This allows analysis of
          primers and linkers.


     -e {#}
          select the end of a subsequence from a larger  sequence
          file;  0*  for  last  base  of  sequence.   The largest
          sequence that I've sent thru it is ~225MB.
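
          The -b/-e coordinates can be sketched in Python as below.
          The function name is hypothetical, and treating the end
          coordinate as inclusive is an assumption of this example
          rather than something stated above.

```python
def select_subsequence(seq, begin=1, end=0):
    """Return the subsequence from `begin` to `end` (1-based).

    begin=1 is the 1st base and end=0 means "through the last base",
    matching the defaults of -b and -e.  The end is treated as
    inclusive here, which is an assumption.
    """
    if begin < 1:
        raise ValueError("begin must be >= 1")
    stop = len(seq) if end == 0 else end
    if stop > len(seq) or begin > stop:
        raise ValueError("range falls outside the sequence")
    return seq[begin - 1:stop]
```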


     -c   order the output by #  of  cuts/fragments  by  each  RE
          (Strider  style)  and  thence alphabetically; otherwise
          output is by order of appearance in the REBASE file.




     -C {1*-13}
          Codon Usage table to use for translation:

           1 Standard                  8 Euplotid_Nuc
           2 Vert_Mito                 9 Bacterial
           3 Yeast_Mito               10 Alt_Yeast_Nuc
           4 Mold_Protoz_Coel_Mito    11 Ascidian_Mito
           5 Invert_Mito              12 Flatworm_Mito
           6 Ciliate_Nuc              13 Blepharisma_Nuc
           7 Echino_Mito

          The Codon Usage file used in Ver 3 (codon.data) is in a
          slightly modified NCBI format, which includes info
          (currently ignored) about multiple initiator codons and
          references.


     --clone '#_#,#x#...'
          Clone finds sequence ranges which either  MUST  NOT  be
          cut (#_#) or that MUST be cut (#x#), up to a maximum of
          15 at once.  Ranges not specified can be either cut  or
          not cut.  The output first lists all REs (if any) which
          match ALL the rules, then  all  REs  which  match  SOME
          rules  as  long as all NO-CUT rules are respected.  The
          same filters that work in other RE selections (-n,  -o,
          -m,  -M,  --cost,  --dam/dcm)  can  be  applied here to
          fine-tune the selection.
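
          A minimal Python sketch of how such rules could be checked
          against one enzyme's cut positions.  All names here are
          hypothetical, and reading "SOME rules" as "every NO-CUT
          range respected, but not every MUST-CUT range satisfied" is
          an interpretation of the text above.

```python
def parse_clone_rules(spec):
    """Split '#_#,#x#...' into (no_cut_ranges, must_cut_ranges)."""
    no_cut, must_cut = [], []
    for item in spec.split(","):
        if "_" in item:                      # '#_#' = MUST NOT be cut
            lo, hi = item.split("_")
            no_cut.append((int(lo), int(hi)))
        elif "x" in item:                    # '#x#' = MUST be cut
            lo, hi = item.split("x")
            must_cut.append((int(lo), int(hi)))
        else:
            raise ValueError(f"bad range: {item!r}")
    if len(no_cut) + len(must_cut) > 15:
        raise ValueError("at most 15 ranges are allowed")
    return no_cut, must_cut

def has_cut(cuts, lo, hi):
    """True if any cut position lies within lo..hi (inclusive)."""
    return any(lo <= c <= hi for c in cuts)

def classify(cuts, no_cut, must_cut):
    """Return 'all', 'some', or 'rejected' for one enzyme's cuts."""
    if any(has_cut(cuts, lo, hi) for lo, hi in no_cut):
        return "rejected"                    # a NO-CUT range was cut
    if all(has_cut(cuts, lo, hi) for lo, hi in must_cut):
        return "all"                         # every rule matched
    return "some"                            # NO-CUT respected only
```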


     --cost {#}
          Cost controls which REs are chosen, based on the number of
          units/$, where the higher the number, the lower the
          cost (>100 U/$ is cheap; <10 U/$ is quite expensive),
          based on the prices quoted in NEB's catalog for their
          high unit products.


     -D {0-4}
          Degeneracy flag - controls input and analysis of degen-
          erate sequence input where:

           0   FORCES exclusion of degens in sequence; only 'acgtu' accepted
           1*  cut as NONdegen unless degens found; then cut as '-D3'
           2   degens OK; ignore in KEY hexamer, but match outside of KEY
           3   degens OK; expand in KEY hexamer, find only EXACT matches
           4   degens OK; expand in KEY hexamer, find ALL POSSIBLE matches

          The pattern matching is adaptive; given a small  window
          of  nondegenerate  sequence,  the  algorithm will match
          very fast; if degenerate sequence is detected, it  will
          switch  to  a slower, iterative approach.  This results
          in speed that is proportional to  degeneracy  for  most
          cases.  If you have long sequences of 'n's (inserted as
          placekeepers,  for  instance),  -D2  may  be  a  better
          choice.  In all cases, as soon as degeneracy of the KEY
          hexamer exceeds a compiled-in limit  (usually  256-fold
          degeneracy), the KEY is skipped.
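
          The fold-degeneracy of a KEY hexamer can be computed as the
          product of the number of bases each IUPAC code stands for.
          The Python sketch below illustrates that arithmetic only;
          the function names are hypothetical, the default limit of
          256 is the "usual" value quoted above, and tacg's own
          implementation may differ.

```python
# Standard IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "ag", "y": "ct", "m": "ac", "k": "gt", "w": "at", "s": "cg",
    "b": "cgt", "d": "agt", "h": "act", "v": "acg", "n": "acgt",
}

def fold_degeneracy(kmer):
    """Number of pure sequences a degenerate k-mer expands to."""
    fold = 1
    for base in kmer.lower():
        fold *= len(IUPAC[base])
    return fold

def key_is_skipped(hexamer, limit=256):
    """True if the hexamer's degeneracy exceeds the limit."""
    return fold_degeneracy(hexamer) > limit
```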


     --dam
          Dam sensitivity: simulates Dam methylation of the
          DNA.  Dam methylase has a palindromic recognition site
          (GmATC), and methylation there can interfere with the
          binding and cutting of a number of Type II REs.  This flag
          simulates the effect of Dam methylation, but requires extra
          data to be available in the rebase file.  If the RE is
          completely blocked, the summary statement will note that it
          did not cut at all.  Otherwise, the effect
          is noted only by a difference in the number of sites
          listed for the -S and -F flags.  The sites are still
          listed in the Linear Map to indicate where they WOULD
          be if the DNA were not methylated.


     --dcm
          Dcm sensitivity: similar to the '--dam' simulation above,
          but for Dcm methylation of the DNA.  Dcm methylase also
          has a palindromic recognition site (CmCWGG) which can
          interfere with RE action.


     --example {1-10}
          example code to show how to  add  your  own  flags  and
          functions.   Search  for  'EXAMPLE' in 'SetFlags.c' and
          'tacg.c' for the code.


     -f {0|1*}
          form (or topology) of DNA - 0 (zero)  for  circular;  1
          for linear.  This flag also operates on subsequences.


     -F {0*-3}
          print/sort Fragments, based on the user-supplied selec-
          tion  criteria ('-n', '-m', '-M', '-o', etc).  See also
          '-c' above.

           0* - omit.
           1  - unsorted; fragments printed in order of generation.
           2  - sorted; fragments sorted by size, smallest to largest.
           3  - both.

          This flag has been left active for matrix matching, even
          though it doesn't make much sense to use it in that way.


     -g {min#(,Max#)}
          specify if you want a pseudo-gel map graphic, with a
          low end cutoff of min# bases and a high end cutoff of
          Max#.  If Max# is omitted, the length of the sequence
          is assumed, although you can set Max# to any number so
          as to constrain the output for comparisons between
          sequences.  These numbers can be any integer
          power of 10 (10, 100, 1000, etc).  See examples
          below.


     -G {binsize,X|Y|L}
          Graphic data output, so  (mis)named  for  its  original
          use, where:

          binsize = # bases for which hits should be pooled
          X|Y|L   = whether the BaseBins should be on the X or Y
                    axis (L gives one long list per Name)

           X: BaseBins 1000 2000 3000 4000  ..
              NameA      0    4    0    7   ..
              NameB     22   57   98   29   ..   (#s = matches per bin)
              NameC      1    0    0    3   ..
              .
           Y: BaseBins  NameA   NameB   NameC   ..
                1000      0      22       1     ..
                2000      4      57       0     ..
                3000      0      98       0     ..
                4000      7      29       3     ..
               .
           L: Basebins  NameA
                1000      0
                2000      4
                  .      .
              Basebins  NameB
                1000     22
                2000     57
                  .      .

          This addresses some missing features: it allows the
          export of match data for the selected Names for
          external analysis of the raw data.  Like other output,
          it is streamed to stdout, so it's not wise to mix -G
          with other analyses; the lines generated (esp. w/ the
          X option) can be quite long and are NOT governed by
          the -w flag.
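
          The pooling shown in the layouts above can be sketched in
          Python as follows.  The function names are hypothetical,
          and labelling each bin by its upper bound (1000, 2000, ..)
          is an assumption read off the example tables.

```python
def bin_hits(hits, binsize, seq_len):
    """Pool 1-based hit positions into bins of `binsize` bases.

    Returns (labels, table): labels are the upper bound of each bin,
    table maps each name to its list of per-bin match counts.
    """
    nbins = -(-seq_len // binsize)           # ceiling division
    labels = [binsize * (i + 1) for i in range(nbins)]
    table = {}
    for name, positions in hits.items():
        counts = [0] * nbins
        for pos in positions:
            counts[(pos - 1) // binsize] += 1
        table[name] = counts
    return labels, table

def rows_x(labels, table):
    """X layout: one row per name, one column per bin."""
    rows = [("BaseBins", *labels)]
    rows += [(name, *counts) for name, counts in table.items()]
    return rows

def rows_y(labels, table):
    """Y layout: one row per bin, one column per name."""
    names = list(table)
    rows = [("BaseBins", *names)]
    for i, label in enumerate(labels):
        rows.append((label, *(table[n][i] for n in names)))
    return rows
```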


     -h   brief help page (condensed man page).


     -H (--HTML) {0*|1}
          generates complete or partial HTML tags for viewing
          with a Web browser.

           0* - makes a standalone HTML page, with Table of
                Contents (TOC).
           1  - no page headers, only the TOC, for embedding in
                other HTML pages.

          Not useful in a functional sense in the command line
          version.  More HTML markup could always be added as eye
          candy.


     -i (--idonly) {0*-2}
          controls the output for sequences (in a collection)
          that have no hits for the options selected.

           0* - ID line and normal output are printed regardless
                of hits.
           1  - BOTH ID line and normal output are printed ONLY IF
                there are hits.
           2  - ONLY the ID line is printed if there are hits (to
                identify sequences of interest in a scan for
                further analysis).



     -l   specify if you want a ladder map of selected enzymes,
          much like the GCG MAPPLOT output.  Also appends a summary
          of those enzymes that match a few times.  The
          number of matches that is included in the summary is
          length-sensitive in the distributed source code, but it
          can be overridden by changing the value assigned to
          '#define SUMMARY_CUTS' in 'tacg.h'.



     -L   specify if you WANT a Linear map. This spews  the  most
          output  (about  10x  the  #  of  input  characters) and
          depending on what other options are specified,  can  be
          of moderate to very  little use.  This option no longer
          generates the co-translation by default as  it  did  in
          prior versions.  If you want the co-translation, you'll
          have to specify it via the -T flag below.   The  Linear
          map  also  no  longer shows ALL the patterns that match
          from the pattern file. It now obeys the same  filtering
          rules  that  the Sites, Fragments, Ladder Map and other
          analyses do.  This behavior was  requested  by  several
          people,  and  I  have  to admit it makes sense.  tacg 3
          also labels non-palindromic patterns as to  orientation
          if they are reversed relative to the way they were
          entered, by appending a ~ character to the end of the
          pattern label in the linear map.


     --strands {1|2*}
          in Linear Map, print 1 or 2 strands.  Along with
          '--notics', can be used to compact the output by 2 lines
          per stanza.

           1  - only the top strand is printed.
           2* - both top and bottom strands are printed.



     --notics
          in Linear Map, DON'T print the tics -  can be  used  to
          compact the output by up to 2 lines per stanza.


     --numstart {#}
          the value given with this flag is the beginning  number
          in  the  Linear  Map  (-L) output.  This can be used to
          force a particular numbering scheme on the output or to
          force upstream (negative) numbering for promoter
          sequences.



     -m/M {#}
          select enzyme by minimum (-m)  and/or  Maximum  (-M)  #
          cuts  in  sequence;  0*  for all. Affects the number of
          enzymes displayed by the sites  (-s),  fragments  (-F),
          gel (-g), ladder (-l), and linear map (-L) flags.


     -n {3*-10}
          select enzymes by magnitude of recognition  site;  3  =
          all,  5  = 5,6,7,8...  n's don't count, other degenera-
          cies are summed  ie:   tgca=4,  tgyrca=5,  tgcnnngca=6,
          tannnnnnnnnnta=4
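
          The examples above are consistent with each pure base
          counting 1, each 2-fold degeneracy counting 1/2, and each
          'n' counting 0.  The Python sketch below reproduces them;
          the function name is hypothetical and the weight used for
          3-fold codes (b, d, h, v) is a guess, since the text gives
          no example for them.

```python
# Number of bases each IUPAC code can stand for.
FOLD = {
    "a": 1, "c": 1, "g": 1, "t": 1,
    "r": 2, "y": 2, "m": 2, "k": 2, "w": 2, "s": 2,
    "b": 3, "d": 3, "h": 3, "v": 3,
    "n": 4,
}

# Contribution of one position to the magnitude.
# The 3-fold value is an assumption; the others follow the examples.
WEIGHT = {1: 1.0, 2: 0.5, 3: 0.25, 4: 0.0}

def magnitude(site):
    """Magnitude of a recognition site as used by the -n filter."""
    return sum(WEIGHT[FOLD[base]] for base in site.lower())
```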


     -o {0,1*,3,5}
          select enzymes by overhang generated; 5 = 5', 3 = 3', 0
          for blunt, 1 for all.


     -O {1-6(x),MinSiz}
          crude ORF analysis producing either a line or  a  block
          (depends on -w) for each ORF including:

           = Frame of the Current ORF
           = Sequence # of the Current ORF
           = Offset from the start in both bases and AAs
           = Size of the ORF in AAs and kDa
           = ORF itself in 1 letter code
           = if 'x' is appended to frames, extended info is
             included (# & % of total AAs)

          NB: If -w is set to 1, the output is  written  in  a  2
          line,  FASTA-like  stanza for each ORF (the header pre-
          fixed by '>',  and  the  ORF  itself),  so  that  line-
          oriented  pattern-matching tools (grep, egrep, awk) can
          examine the ORF for matching regular  expressions  (see
          the  GNU  grep  man  page for an explanation of regular
          expressions). In this way you can search all  6  frames
          of >MinSize AAs for whatever pattern interests you.  If
          -w is set to one of the regular widths, the ORF will be
          wrapped  at that length to form a FASTA formatted block
          for analysis by other  apps,  more  biologically  aware
          tools like FASTA, BLAST, etc.

          Examples:
           -O 145,25  frames 1,4,5 with a min ORF size of 25 AAs
           -O 35x,200  frames 3 & 5 with a min ORF  size  of  200
          AAs, with extended info.
           -O 2,66    frame 2 with a min ORF size of 66 AAs
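
          How the option string in these examples breaks down can be
          sketched in Python.  The parser below is hypothetical and
          only mirrors the syntax shown above; it is not tacg's own
          option handling.

```python
def parse_orf_option(value):
    """Parse an -O value such as '145,25' or '35x,200'.

    Returns (frames, min_size, extended): the list of frames (1-6),
    the minimum ORF size in AAs, and whether 'x' requested extended info.
    """
    frames_part, min_part = value.split(",")
    extended = frames_part.endswith("x")
    if extended:
        frames_part = frames_part[:-1]
    frames = sorted({int(ch) for ch in frames_part})
    if not frames or any(f < 1 or f > 6 for f in frames):
        raise ValueError("frames must be digits between 1 and 6")
    return frames, int(min_part), extended
```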

     -p {Name,Pattern[,Err]}
          allows entry of search patterns from the command line;

             Name = Pattern name (1-10 chars)
             Pattern = <30 IUPAC characters (ie. gryttcnnngt)
             Err = (optional) max # of errors that are tolerated
                   (<6). If omitted, Err is set to 0

          This flag also logs the patterns you've entered into
          the file tacg.patterns in the correct format for later
          copying to a REBASE file.  Up to 10 of these can be entered
          at a time.  Patterns should consist of < 30 IUPAC bases.
          This uses a brute force approach, so long patterns with
          high #s of errors (>3) will cause SUBSTANTIAL cpu usage
          (ie. minutes) in validating the patterns, but the actual
          search will go very fast.
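
          What "Err" means can be illustrated with a naive scan in
          Python: a window matches if at most Err positions disagree
          with the IUPAC pattern.  This is a sketch for clarity, not
          tacg's hashing algorithm; the function name is
          hypothetical and only the top strand is scanned.

```python
# Standard IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "ag", "y": "ct", "m": "ac", "k": "gt", "w": "at", "s": "cg",
    "b": "cgt", "d": "agt", "h": "act", "v": "acg", "n": "acgt",
}

def find_with_errors(seq, pattern, max_err=0):
    """Return 1-based starts where `pattern` matches with <= max_err mismatches."""
    seq, pattern = seq.lower(), pattern.lower()
    hits = []
    for start in range(len(seq) - len(pattern) + 1):
        errors = 0
        for base, code in zip(seq[start:start + len(pattern)], pattern):
            if base not in IUPAC[code]:
                errors += 1
                if errors > max_err:
                    break                     # too many mismatches
        else:
            hits.append(start + 1)            # loop finished without break
    return hits
```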


     -P   {NameA,[+-][lg]Dist_Lo[-Dist_Hi],NameB}
          Proximity matching.  Use this option to search for spatial
          relationships between factors, 2 at a time (up to
          a total of 10).

          NameA and NameB must be  in  a  REBASE-formatted  file,
          either  the default _r_e_b_a_s_e._d_a_t_a or another specified by
          the -R flag and are case INsensitive.  NameA/B patterns
          can  be  composed  of any IUPAC bases and ERRORs can be
          specified in the REBASE entry ie:

           Pit1  5  WWTATNCATW  0  2 ! a Pit1 site with 2 errors
           Tataa 4  TATAAWWWW   0  1 ! a Tataa site with 1 error

           +  NameA is DOWNSTREAM of NameB (default is either)
           -  NameA is UPSTREAM of NameB (ditto)
           l  NameA is LESS THAN Dist_Lo from NameB (default)
           g  NameA is GREATER THAN Dist_Lo from NameB
           Dist_Hi - if used, implies a RANGE, obviates l or g

          Example I
             -PHindIII,350,bamhi
                  Match all HindIII sites within 350 bases of BamHI
                  sites.

          Example II
             -PPit1,-30-2500,Tataa
                  Match all Pit1 sites that are 30 to 2500 bases
                  UPSTREAM of a Tataa site.
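
          The option syntax can be taken apart with a regular
          expression, as in this Python sketch.  The parser and the
          names of the returned fields are hypothetical; the defaults
          (either direction, LESS THAN) follow the table above.

```python
import re

# NameA,[+-][lg]Dist_Lo[-Dist_Hi],NameB
_SPEC = re.compile(
    r"^(?P<a>[^,]+),(?P<dir>[+-])?(?P<mode>[lg])?"
    r"(?P<lo>\d+)(?:-(?P<hi>\d+))?,(?P<b>[^,]+)$"
)

def parse_proximity(spec):
    """Parse a -P value into its parts."""
    m = _SPEC.match(spec)
    if m is None:
        raise ValueError(f"bad proximity spec: {spec!r}")
    hi = int(m.group("hi")) if m.group("hi") else None
    if hi is not None:
        mode = "range"                        # Dist_Hi obviates l or g
    elif m.group("mode") == "g":
        mode = "greater"
    else:
        mode = "less"                         # 'l' is the default
    return {
        "name_a": m.group("a"),
        "name_b": m.group("b"),
        "direction": m.group("dir"),          # None means either side
        "mode": mode,
        "dist_lo": int(m.group("lo")),
        "dist_hi": hi,
    }
```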


     --ps generates a PostScript plasmid map (and multiple pages
          with the same parameters if fed a multi-sequence file).
          The output file is named tacg_Map.ps and additional
          plots will be appended to it if it exists in the same
          directory.  REs to be plotted can be selected with the
          usual parameters (-m -M --cost -n -x -p), but you'll
          usually want to use -M1 or -M2.  Degeneracies are plotted
          along the rim as grayscale arcs (remember tacg can
          tolerate degeneracies in sequence, so you can compose
          accurate plasmid maps by connecting known sequences
          with N's).  ORFs from any and all frames can be plotted
          internal to the sequence ring by using the -O flag.


     --pdf
          Invokes --ps above and automatically converts the
          PostScript output to Adobe's Portable Document Format,
          which is considerably more compact.


     --logdegens
          (off by default) Using this flag forces the logging  of
          every  degeneracy  in  the sequence, trivial if a short
          sequence (<1Mb), but of  concern  for  chromosome-sized
          chunks.   This  info  will  be used for drawing graphic
          maps of the  sequence  and  shading  degeneracies  dif-
          ferently.  It is quite memory intensive as it marks the
          beginning and end of every degeneracy run.   No  exter-
          nal  data is produced, but could be as it's just a sim-
          ple 2-step array.



     -q   Work quietly. DISallows  sending  diagnostic  UDP  info
          back  to  author;  this is now the default behavior, on
          the request of a number of people.  If you wish to send
          UDP  packets  back to my server and possibly annoy your
          network security people, you may try  to  with  the  -Q
          flag.


     -Q   Work UNquietly.  Allows sending diagnostic UDP info back
          to author's machine.  Report stream includes this info:

          Date Time IP# UID hardware OS OS_version TACG_version
          [tacg commandline] <# bases analyzed>

          ie. 1996-03-08 17:02:26 128.200.2.43:[uid=502 hw=i486
          os=Linux osver=1.2.6] [TACG Version 1.33F] tacg -t 3 -n 6
          <434 bp>





     -R {REBASE|Matrix file}
          specifies an alternative database (RE or Matrix) to use.
          The RE database must be in the same GCG format as
          rebase.data.  There are some example alternative REBASE
          files shipped with the tacg distribution named '*.RB'.

          The latest REBASE files are available via FTP:

          ftp://ftp.neb.com/pub/rebase/

          or via WWW:

          http://www.neb.com/rebase/rebase.html

          and the latest TRANSFAC database is available at:

          http://transfac.gbf.de/TRANSFAC/index.html

          The file specified with the -R flag is searched for  in
          the  same order as the other data files: $PWD , $HOME ,
          $TACGLIB.


     --raw
          makes tacg  consider  ALL  input  as  raw,  unformatted
          sequence.   This allows it to process unstructured data
          such as fragments of  files  and  editor  buffers.   It
          ignores  everything  NOT  an IUPAC degeneracy, but will
          consider all possible IUPAC degeneracies, so will  pro-
          duce  odd  output if fed a regularly formatted sequence
          file  (it  will  process  headers  and   comments    as
          sequence.)   This is the behavior of the version 2 tacg
          (before SEQIO).
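
          Why headers turn into sequence under --raw can be seen in
          a small Python sketch.  The function name is hypothetical
          and the filter is only an approximation of the behavior
          described above.

```python
# Letters treated as sequence: the four bases, 'u', and IUPAC degeneracies.
IUPAC_LETTERS = set("acgtu" "rymkwsbdhvn")

def raw_sequence(text):
    """Keep only IUPAC letters from arbitrary text, converting u to t."""
    kept = [ch for ch in text.lower() if ch in IUPAC_LETTERS]
    return "".join(kept).replace("u", "t")
```

          Fed the FASTA-style text ">seq1" followed by "ACGU", this
          keeps the 's' of the header as well as the bases, which is
          the odd output warned about above.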


     -r (--regex) {'Label:RegexPat'} | {'FILE:FileOfRegexPats'}
          searches for regular expressions entered from the  com-
          mandline  using  the 1st approach above or searches for
          the regular expressions read from a file using the  2nd
          approach.   The regular expression syntax can be formal
          regex patterns or the  IUPAC'ed  version  thereof;  the
          translation  from one to the other is handled automati-
          cally.  Because regex's typically have many  characters
          that  shells  are  happy  to  misinterpret,  the single
          quotes (') surrounding the  option  string  are  almost
          always  required.   When  trying to specify a file, the
          term FILE must be in CAPs (so don't go naming  a  regex
          pattern 'FILE').  Specific regex patterns from the file
          can be specified by using the '-r' flag  to  name  them
          explicitly.   Regular expression searches are consider-
          ably slower than other types of searches, but  searches
          of  100Kb,  with  <10 regex patterns of even reasonably
          high complexity should be tolerable.
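
          The translation from an IUPAC'ed pattern to a formal regex
          amounts to replacing each degenerate letter with a
          character class.  A minimal Python sketch follows; the
          function name is hypothetical, it passes non-letter
          characters through untouched, and it does not try to handle
          patterns that already contain character classes.

```python
import re

# Character class for each IUPAC code.
IUPAC_CLASS = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "[ag]", "y": "[ct]", "m": "[ac]", "k": "[gt]",
    "w": "[at]", "s": "[cg]",
    "b": "[cgt]", "d": "[agt]", "h": "[act]", "v": "[acg]",
    "n": "[acgt]",
}

def iupac_to_regex(pattern):
    """Expand IUPAC letters into regex character classes."""
    return "".join(IUPAC_CLASS.get(ch, ch) for ch in pattern.lower())

# Usage: search a sequence with the expanded pattern.
match = re.search(iupac_to_regex("tgyrca"), "aatgcgcaa")
```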



     --rule {logic}
          (see also -P above) --rule allows you to specify  arbi-
          trarily complex logical associations of characteristics
          to detect the patterns that interest  you.  Admittedly,
          that  phrase  is incomprehensible on its own, so let me
          give an example:

          Say you wanted to  search  for  an  enhancer  that  you
          suspected  might  be  involved  in  the transcriptional
          regulation of a pituitary-specific gene.  You knew that
          you  were  looking for a sequence about 1000 bp long in
          which there were at least 2 Pit1 sites and 3-5 Estrogen
          response  elements,  but  NO  TATAA  boxes.  If you had
          defined these patterns in a file called pit.specific
          as:

           Pit1  0  WWTATNCATW    0 1 ! Pit1 site w/ 1 error
           ERE   0  GGTCAGCCTGACC 0 1 ! ERE site w/ 1 error
           TATAA 0  tataawwww     0 0 ! TATAA site, no errors allowed

           you could specify this search by:

          tacg --rule '((Pit1:2:7&ERE:3:5)&(TATAA:0:0),1000)'  -R
          pit.specific <


          This query searches a sliding window of 1000 bps (-W
          1000) for ((2-7 Pit1 AND 3-5 ERE sites) AND (0 TATAA
          sites)).  These combinations can be as large as your OS
          allows your command line to be, with arbitrarily complex
          relations represented with logical AND (&), OR (|), and
          XOR (^) as conjunctions.  Parens enforce groupings;
          otherwise it's evaluated left to right.
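
          The sliding-window test for the example rule can be
          sketched in Python.  The function names are hypothetical,
          the rule is hard-coded rather than parsed from the option
          string, and the hit positions are assumed to be already
          known.

```python
from bisect import bisect_left

def count_in_window(sorted_positions, start, width):
    """Count hits p with start <= p < start + width."""
    hi = bisect_left(sorted_positions, start + width)
    lo = bisect_left(sorted_positions, start)
    return hi - lo

def matching_windows(hits, seq_len, width, rule, step=1):
    """Return the 1-based start of every window where rule(counts) holds."""
    ordered = {name: sorted(pos) for name, pos in hits.items()}
    last_start = max(seq_len - width + 1, 1)
    starts = []
    for start in range(1, last_start + 1, step):
        counts = {name: count_in_window(pos, start, width)
                  for name, pos in ordered.items()}
        if rule(counts):
            starts.append(start)
    return starts

def example_rule(counts):
    """((Pit1:2:7 & ERE:3:5) & (TATAA:0:0)) from the example above."""
    return ((2 <= counts["Pit1"] <= 7 and 3 <= counts["ERE"] <= 5)
            and counts["TATAA"] == 0)
```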




     --rulefile '/path/to/the/rulefile'
          This option allows you to read in a complete file of
          the kind of complex rules described above and have them
          all evaluated.  The file format is described in the
          example data file supplied, rules.data.


     -s   prints the summary of site information, describing  how
          many   times   each pattern matches the sequence. Those
          that match zero times are shown first.  In Ver >2, only
          those  that match at least once are shown in the second
          part (the 0 matchers are  not reiterated)


     -S (1*|2)
          prints the actual matched Sites in tabular form,
          much like Strider's output.  See also '-c', above.


     --silent
          requests that the nucleic acid sequence submitted be
          translated starting at the 1st base, in frame 1 (use -b
          to shift the starting base), according to the Codon
          Translation table specified with -C, then reverse
          translated with the same table using all the possible
          degeneracies.  That (quite) degenerate sequence is then
          restricted to show all the REs that will match it.  You
          should use the '-L' and '-T' flags to generate the
          linear map, which shows both the REs and the
          cotranslated sequence, to verify that all is as it
          should be.  NB: Depending on the Codon Table, some AAs are
          not reversibly translatable.  Using the standard table,
          Arg (=mgn), Leu (=ytn), and Ser (=wsn) cannot be forward
          translated from their reverse translation.
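
          Why those three amino acids are not reversible can be
          checked with a short Python sketch that expands a
          degenerate codon and translates every expansion with the
          standard genetic code.  The names are hypothetical; the
          codon table is the standard NCBI table 1 in t, c, a, g
          order.

```python
from itertools import product

# Standard IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "ag", "y": "ct", "m": "ac", "k": "gt", "w": "at", "s": "cg",
    "b": "cgt", "d": "agt", "h": "act", "v": "acg", "n": "acgt",
}

# Standard genetic code, codons enumerated with bases ordered t, c, a, g.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): AMINO[i]
    for i, codon in enumerate(product("tcag", repeat=3))
}

def amino_acids_encoded(degenerate_codon):
    """Set of amino acids (or '*') reachable from a degenerate codon."""
    choices = [IUPAC[base] for base in degenerate_codon.lower()]
    return {CODON_TABLE["".join(c)] for c in product(*choices)}
```

          For Ser, 'wsn' also expands to codons for Thr, Cys, Trp,
          Arg and a stop, so the forward translation is ambiguous,
          whereas Ala ('gcn') comes back as Ala only.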


     --tmppath /path/to/tmp/dir
          passes the path to tacg to cooperate with CGIs or other
          programs  that  need  to  tell  tacg where to place the
          ps/pdf files.


     -T {[0*|1|3|6],[1|3]}
          requests frames 1, 1-3, or 1-6 to be cotranslated  with
          the  Linear Map using 1 or 3 letter codes.  Requires '-L'
          to have any effect.

          Ex: "-T3,3" translates Frames 1,2,3 with 3 letter
          labels.
              "-T1,1" translates Frame 1 with 1 letter labels.



     -v   asks for program version (there may  be  multiple  ver-
          sions  of  the  same  functional  program  to track its
          migration).


     -V {1-3}
          Verbose output- requests all kinds of  ugly  diagnostic
          info  to be spat to the screen.  May be useful in diag-
          nosing why tacg did not behave as  expected..but  maybe
          not.   The  values  1 - 3 ask for increasing amounts of
          detail.


     -w {1|#}
          output width in bp's (the option number must be exactly
          1 or between 60* and 210).

          The number (if not 1)  is  truncated  to  a  #  exactly
          divisible  by  15  ('-w 100' will be interpreted as '-w
          90') and actual printed output will be about 20 charac-
          ters  wider.  Also  applies to output of the ladder and
          gel maps, so if you're trying to get more accuracy  and
          your  output  device  can  display small fonts, you may
          want to use this flag to widen the output.  In  version
          3, the option '-w 1' allows you to put as much informa-
          tion as possible on one line for easier parsing by some
          external apps.

          Ex: "-w 1" prints output in one line
              "-w 150" causes wrapping at  about  170  characters
          (150 bp wide in the Linear map option).
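
          The width arithmetic described above can be sketched as
          follows.  This is an illustration in Python, not tacg's
          own code; the function names are hypothetical, and
          raising an error for out-of-range values is an
          assumption, since the text does not say what tacg does
          in that case.

```python
def effective_width(requested):
    """Return the bp width implied by a '-w' value as described above."""
    if requested == 1:
        return 1  # special case: as much as possible on one line
    if not 60 <= requested <= 210:
        # Assumption: the text only says the value "must be" in range.
        raise ValueError("-w must be exactly 1 or between 60 and 210")
    return (requested // 15) * 15  # truncate to a multiple of 15


def approx_printed_width(requested):
    """Approximate printed columns for wrapped output (bp width + ~20)."""
    if requested == 1:
        raise ValueError("one-line output has no fixed printed width")
    return effective_width(requested) + 20


print(effective_width(100))       # '-w 100' is treated as '-w 90'
print(approx_printed_width(150))  # about 170 characters
```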



     -x {Label(,=),Label..(,C)}
          used to restrict the  patterns  searched  for  by  Name
          label  (either  from  the  1st field of a REBASE format
          file or the NA field from a TRANSFAC format file) up to
          a  maximum of 15. Case INsensitive (HindIII = hindiii =
          HinDiIi), but it HAS to be  spelled  exactly  like  the
          entry  in rebase.data with no spaces.  (HindIII != Hind
          III != Hind3).

          The '=' tag invokes the Hookey  function  (named  after
          its  requestor, John Hookey), in which the '=' tags the
          RE to which it is appended.  This is useful  if  you're
          trying  to  discern or predict a labelled fragment in a
          mixture of fragments.  The output shows  the  fragments
          generated  only if they have one or both ends generated
          by the tagged RE.  This option works even if there  are
          a  number  of  REs, but only one can be tagged. Ex: '-x
          HindIII,=,MseI,HinfI' causes the DNA to be cut by  Hin-
          dIII,  MseI,  and HinfI, but only fragments that have a
          HindIII end will be shown.  The output  is  shown  both
          unsorted  and  sorted by fragment size.  If you want to
          cause the output to simulate a multiple digest with all
          the  REs  designated,  append  a ',C' to the list of RE
          names.  Ex: -xBamHI,EcoRI,NruI,C

          NB: Don't assign the name 'C' to any patterns or REs.
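
          The selection rule behind the '=' tag can be sketched in
          a few lines.  This Python is a hypothetical illustration
          for a linear sequence, not tacg's implementation; the
          cut positions are made up and the name tagged_fragments
          is an assumption.

```python
def tagged_fragments(seq_len, cuts, tagged):
    """Return (start, end, length) for each fragment of a linear
    sequence that has one or both ends generated by the tagged RE.

    cuts maps an RE name to a list of its cut positions.
    """
    sites = sorted((pos, name)
                   for name, positions in cuts.items()
                   for pos in positions)
    # The two sequence ends are not generated by any enzyme.
    bounds = [(0, None)] + sites + [(seq_len, None)]
    frags = []
    for (start, left), (end, right) in zip(bounds, bounds[1:]):
        if end > start and tagged in (left, right):
            frags.append((start, end, end - start))
    return frags


# Made-up cut positions on a 1000 bp linear sequence.
cuts = {"HindIII": [300], "MseI": [100, 600], "HinfI": [800]}
unsorted_frags = tagged_fragments(1000, cuts, "HindIII")
sorted_frags = sorted(unsorted_frags, key=lambda f: f[2])
print(unsorted_frags)
print(sorted_frags)
```

          Only the two fragments that border the HindIII site are
          kept; the MseI-HinfI and end fragments are dropped.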




     -X (--extract) {b,e,[0|1]}
          causes the sequences bounding the match to be  spat  to
          stdout  in  FASTA format. b and e are the beginning and
          ending offsets  respectively  for  varying  the  window
          around  the  match.  NB: both b and e are measured from
          the start of the match, so e must be corrected for  the
          length of the pattern itself.
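
          The correction of e for the pattern length can be shown
          with a small sketch.  This Python is not tacg's code; it
          ASSUMES that the window runs from b bases before the
          start of the match to e bases after the start of the
          match, which is one reading of the description above.
          The function names are hypothetical.

```python
def extract_window(seq, match_start, b, e):
    """Return the sequence from b bases before the match start to
    e bases after the match start (assumed meaning of b and e)."""
    lo = max(0, match_start - b)
    hi = min(len(seq), match_start + e)
    return seq[lo:hi]


def e_for_flank(pattern_len, flank):
    """e needed to get `flank` bases after the END of the match."""
    return pattern_len + flank


seq = "a" * 10 + "gaattc" + "t" * 10
start = seq.find("gaattc")
# 3 bases on each side of a 6 bp match: b = 3, e = 6 + 3 = 9.
print(extract_window(seq, start, 3, e_for_flank(6, 3)))
```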



     -# {#}
          calls for matrix matching of either ALL the patterns in
          the default Matrix file matrix.data or that specified
          via the '-R' flag, or ONLY THOSE specified via the '-r'
          flag, regardless of the input file.  The number
          indicates the CUTOFF as the percentage of the maximum
          score possible (the sum of the highest score at each
          nucleotide across the matrix - see tacg3.main.html for
          more info).  Example: 'tacg -# 95 -r GCN4 -S
          <yeastchromo4.genbank' will look up GCN4 in matrix.data
          and search the sequence for matches at a cutoff of 95%
          (the pattern has to match the matrix at 95% or better).
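
          The cutoff calculation can be sketched as below.  This
          Python is an illustration only: the toy matrix is made
          up (it is not GCN4), only one strand is scanned, and the
          function names are hypothetical rather than anything in
          tacg.

```python
def max_score(matrix):
    """Sum of the highest score at each position of the matrix."""
    return sum(max(column.values()) for column in matrix)


def matrix_hits(seq, matrix, cutoff_pct):
    """Return (offset, score) for every window scoring at or above
    cutoff_pct percent of the maximum possible score."""
    threshold = max_score(matrix) * cutoff_pct / 100.0
    width = len(matrix)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        # Bases missing from a column score 0 in this sketch.
        score = sum(col.get(base, 0) for col, base in zip(matrix, window))
        if score >= threshold:
            hits.append((i, score))
    return hits


# Made-up 3-position matrix: one dict of base scores per position.
TOY = [
    {"a": 10, "c": 0, "g": 0, "t": 0},
    {"a": 0, "c": 0, "g": 8, "t": 2},
    {"a": 0, "c": 9, "g": 0, "t": 1},
]
print(max_score(TOY))
print(matrix_hits("ttagcagc", TOY, 95))
```

          Lowering the cutoff admits weaker matches: with this toy
          matrix, "atc" scores 21 of a possible 27, so it passes
          at 70% but not at 95%.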



     --rev
          causes the sequence(s) to be reversed before  analysis:
          tacg -> gcat.  Useful for figuring out sequencing/entry
          errors.


     --comp
          causes  the  sequence(s)  to  be  complemented   before
          analysis:  tacg  ->  atgc.   Useful  for  figuring  out
          sequencing/entry errors.


     --revcomp
          causes  the  sequence(s)  to  be   reverse-complemented
          before analysis: tacg -> cgta.  Useful for checking the
          translation in opposite orientation without  having  to
          read translation backwards or convert with another pro-
          gram.
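
          The three transformations above can be reproduced with a
          short sketch.  This Python is for illustration and is
          not how tacg implements them; it handles only a, c, g
          and t (degenerate IUPAC codes are left unchanged), and
          the function names are hypothetical.

```python
# Complement table for the four unambiguous bases, both cases.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def rev(seq):
    """Reverse the sequence, as for --rev."""
    return seq[::-1]


def comp(seq):
    """Complement the sequence, as for --comp."""
    return seq.translate(COMPLEMENT)


def revcomp(seq):
    """Reverse-complement the sequence, as for --revcomp."""
    return comp(seq)[::-1]


for func in (rev, comp, revcomp):
    print(func.__name__, "tacg ->", func("tacg"))
```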




RELATED PROGRAMS
     In    Ver    3,    tacg    incorporated     Jim     Knight's
     (jknight@guarneri.curagen.com)  SEQIO  library calls to pro-
     vide automagic  format  conversion  of  incoming  sequences.
     This  also  allows  multiple sequences to be run at the same
     time, allowing tacg to scan databases.

     Wu and Manber's agrep is an amazing piece  of  software  for
     searching  for  multiple  patterns  with  errors.  While not
     optimized for molecular biology,  it  can  be  used  to  scan
     sequences.   Jim  Knight  distributes a variant of it called
     grepseq with his SEQIO pkg, which IS molbio-aware,  but  not
     as generally useful (to me anyway) as tacg, as it only scans
     one strand and will only search up to  6  matches  for  some
     reason.  However,  I've  started  to incorporate the grepseq
     core    into    tacg.     agrep     is     available     via
     ftp://ftp.cs.arizona.edu/agrep/ or http://manber.com.
     The SEQIO pkg is distributed around the web.


     You can also use the excellent paging utility less  to  move
     thru  your  sequence  file  and  use  its marking and piping
     facility to punt the sequence of  interest  to  'tacg'.   In
     many  terminal  emulators  it  will  also  highlight matched
     search terms, and so makes an excellent way to scan the out-
     put for regions of interest.  Many editors also allow piping
     a selection of text to an external program and inclusion  of
     the  result  into  another  window  (nedit, crisp, joe, the
     indefatigable emacs/xemacs and others).


     Much of the output benefits from wider-than-normal printing.
     The  '-w#'  flag  allows  output  up to about 230 characters
     wide, however to print this without wrapping,  you  need  to
     use  small fonts.  A number of unix printing utilities allow
     you to do this, notably genscript:
     http://www.hut.fi/%7Emtr/genscript/index.html




EXAMPLES
     Used alone:

          tacg -f0 -n5 -T3,1 -sL -F3 -g100,1000 <NewFile.Genbank >output.file

          Translation: read sequence from NewFile.Genbank and
          analyze it as circular (-f0), with 5+ cutters (-n5),
          returning both site info and linear map (-sL) as well
          as sorted and unsorted fragment data (-F3), do 3 frame
          translation w/ 1 letter codes (-T3,1) on the linear
          map, and produce a pseudo gel diagram for those enzymes
          that pass the filtering, with a low cutoff of 100 bp
          and a high cutoff of 1000 bp (-g100,1000), then write
          the output to output.file.


     Matching matrices:

          tacg -R yeast.matrices -# 85 -sSlc -w90 <yst_chr_4.seq >out

          Translation: search the sequence in yst_chr_4.seq for
          all the matrices described in the file yeast.matrices,
          applying a uniform cutoff of 85% of the maximum possible
          score (-# 85), writing the summary, Sites, and ladder
          map, doubly-sorted (-sSlc) and printed 90 characters
          wide (-w90), to the file out.


     Specifying patterns on the command-line

          tacg -p Pit1,tatwcata,1 -p ap2,tgygcatw,1 -sSL -w90 <rprlPromo.seq >promo.map

          Translation: search the sequence from the file
          rprlPromo.seq for the patterns labeled Pit1 and ap2,
          allowing 1 error each, and print the results (summary
          (-s), Sites (-S), and the Linear Map (-L)) 90 characters
          wide (-w90) to the file promo.map.


     Used to search the entire yeast  500bp  Upstream  Regulatory
     sequences  (a database of 6226 500 bp sequences) for matches
     to the MATa1 binding site (from TRANSFAC):

          tacg -R TRANSFAC.data -sScw1 -rMATa1 -#95 <utr5_sc_500.fasta >yeast.summary

          Translation: translate each of the FASTA formatted
          entries in the input file utr5_sc_500.fasta into usable
          sequence and, after finding the MATa1 matrix
          description (-r MATa1) in the database given by -R
          TRANSFAC.data, search the sequences for matches at 95%
          of the max score that it has in the TRANSFAC database
          (-# 95), returning the summary (-s) and the sites (-S)
          sorted in Strider order (-c), with results printed on 1
          line (-w1), directing the output into the file
          yeast.summary.




BUGS and ODDITIES
     Major



          tacg, if used with the -Q flag, spits back about 100
          bytes of information about its use (the hosting OS, the
          command line flags, and the sequence length) to enable
          me to track how and how often it is being used.  If you
          are uncomfortable with this trait, you may disable it
          from the command line ('-q', see above), but this is
          now the default, so unless you WANT me to get the data,
          I won't.  I would appreciate it tho.


          The inclusion of the SEQIO functions has caused an
          enormous increase in the compiled size of the executable
          to ~340kB (up from ~50kB before).  If I get a lot
          of  complaints about this, I'll look into stripping out
          the functions that I use from the  SEQIO  library,  but
          I'd  rather  not  as  it does include a lot of (hidden)
          functionality that I plan to use later.


          tacg v2.0 will not currently cut sequence shorter  than
          5  bases; if you need to analyze sequences shorter than
          this, perhaps you're using the wrong program.


          main() and functions were originally written as
          single-pass code, but with the help of Gray Watson's
          excellent (!) dmalloc malloc debugging library
          (available at http://www.dmalloc.com), I've recently
          put some effort into tracking memory leaks, especially
          since much of the code has to be re-entrant for doing
          analyses over many sequences.  However, it's not
          completely leak-free yet, so user beware.


          The command line handling has been completely rewritten,
          using the getopt() and getopt_long() functions, so the
          flags are considerably less sensitive to spacing and
          order.


          Translation  in  6  frames  assumes  circular  sequence
          regardless  of  '-f' flag, so that the last amino acids
          in frames 5 and 6 in the 1st output block are obviously
          incorrect if you are assuming linear sequence.

          See the manual for other bugs that I think are less
          problematic.


          Harry Mangalam (hjm@tacgi.com)