                        Shell Scripts for use with

                               fastDNAml

                                  and

                              DNAml_rates



                                SUMMARY


UNIX shell scripts have proven quite useful in running the fastDNAml and/or
DNAml_rates programs.  They have been used in two different contexts.  First,
many of the program options can be invoked by simple editing of the input.
The second category are scripts that help run and maintain results of the
program.

bootstrap           add B (bootstrap) option (and optional seed) to input
categories          add C (rate categories) option and values to input
categories_file     add Y (categories file) option to input
clean_checkpoints   remove checkpoint files when there is a finished treefile
clean_jumbles       remove all but one optimal jumble for a given result
fastDNAml_boot      loop over bootstrap seeds, doing 1 or more jumbles each
fastDNAml_loop      do jumbles, stopping when same best tree found n times
frequencies         add F (empirical frequencies) option to input
global              add G (global) option (and optional region size) to input
jumble              add J (jumble) option (and optional seed) to input
min_info            add M (minimum information) option and value to input
n_categories        add C (categories) option (without rate values) to input
out.PID             append process ID to output file name of a program
outgroup            add O (outgroup) option and number to input
printdata           add 1 (print data) option to input
quickadd            add Q (quickadd) option to input
restart             add R (restart) option and checkpoint tree to input
scores              summarize and sort likelihoods from jumble output files
transition          add T (transition/transversion) option and value to input
treefile            add Y (treefile) option to input
trees2NEXUS         combine trees and add a NEXUS wrapper for PAUP and MacClade
trees2prolog        convert Newick format trees to prolog facts
userlengths         add L (userlengths) option to input
usertree            add U (usertree) option, tree count, and tree(s) to input
usertrees           add U (usertree) option, tree count, and tree(s) to input
weights             add W (userweight) option and values to the input
weights_categories  add W and C options and values to the input



                SCRIPTS THAT INVOKE DNAML OPTIONS


GENERAL COMMENTS:

The program fastDNAml takes data from standard input.  Thus, to run the
program with data in the file called "infile", the command would be

    fastDNAml <infile

In this case, the output goes to standard output (generally the user's
terminal).  To put in into a file, one can use output redirection as in

    fastDNAml <infile >outfile

Because of the use of standard input, the input to fastDNAml can by
preprocessed by a function, and then piped to the program.  For example,

    bootstrap <infile | fastDNAml >outfile

       or

    bootstrap 137 <infile | fastDNAml >outfile

can be used to add the bootstrap option and a random number seed to the input,
and then pass it on to fastDNAml for analysis.

Many of the fastDNAml options are amenable to this arrangement.  In
each case, the preprocessing can simply add options (and auxiliary data
lines, as necessary) to the input.  In addition to avoiding the need to play 
with UNIX text editors, there are several advantages to this approach:

   1. The files remain relatively compatible with PHYLIP DNAML.
   2. It reduces the chance of introducing errors into the data.
   3. It is easier to try alternative options on the same data.
   4. If the data for each sequence are provided in one long line (so that
      interleaved and non-interleaved formats are the same), then some text
      editors will truncate the lines.

Shell scripts are available for each of the above program options.  The
corresponding formats and effects are described below.


THE SCRIPTS:


BOOTSTRAP (B)

Format:  bootstrap [random_seed]

Adds a bootstrap option and a random number seed to the input.  If the random
seed is not supplied, then the process id of the bootstrap shell is used.


CATEGORIES (C)

Format:  categories  categories_data_file

Adds the categories option and the corresponding data to the input.  The data
must have the format specified for PHYLIP dnaml 3.3.  The first line must be
the letter C, followed by the number of categories (a number in the range 1
through 35), and then a blank-separated list of the rates for each category.
The next line should be the word Categories followed by one rate category
character per sequence position.  The categories 1 - 35 are represented by the
series 1, 2, 3, ..., 8, 9, A, B, C, ..., Y, Z.  These latter data can be on
one or more lines.  For example,

    C  12  0.0625  0.125  0.25  0.5  1  2  4  8  16  32  64  128
    CATEGORIES  5111136343678975AAA8949995566778888889AAAAAA9239898629AAAAA9
                633792246624457364222574877188898132984963499AA9899975

In order to generate output compatible with PHYLIP dnaml, this should be the
first option added (so that the categories data are inserted immediately
before the sequence data).


CATEGORIES_FILE (Y)

Format:  categories_file

Adds the Y option to the input data to cause the DNAml_rates program to write
a file of weights and categories that can be directly added to the input for
the fastDNAml program (see weights_categories script).


FREQUENCIES (F)

Format:  frequencies

Adds empirical frequencies option (F) to the input stream.


GLOBAL (G)

Format:  global [final_tree_rearrangements [partial_tree_rearrangements]]

Adds a global option to the input.  If a rearrangement distance is
specified, then this value is added as part of the option auxiliary
information.  In this latter case, it is essential that the input contain
(or the global command be preceded by) a B, J or T option.


JUMBLE (J)

Format:  jumble [random_seed]

Adds a jumble option and a random number seed to the input.  If the random
seed is not supplied, then the process id of the jumble shell is used.


MIN_INFO (M)

Format:  min_info  minimum_unambiguous_residues

Adds the minimum information option and an auxiliary data line to the input
for the DNAml_rates program.  This changes the threshold value of unambiguous
residues that must be present in a column before the program will consider the
rate to be "defined".  The default value is currently 4.


N_CATEGORIES

Format:  n_categories  number

Adds the C options and an auxiliary data line of the form "C number" to the
input.  This is for use with the program DNAml_rates.


OUTGROUP (O)

Format:  outgroup  outgroup_number

Adds the outgroup option and appropriate auxiliary data line to the input.


PRINTDATA (1)

Format:  printdata

Adds a printdata option to the input.


QUICKADD (Q)

Format:  quickadd

Adds a quickadd option to the input.  This greatly decreases the time in
initially placing a new sequence in the growing tree (but does not change the
time required to subsequently test rearrangements).  Its downside, if any, is
unknown.  This will probably become default program behavior in the near
future.


RESTART (R)

Format:  restart  checkpoint_file_name

Adds a restart option and appends the last tree in the checkpoint file to the
input.  This conflicts with a usertree option.  Note that this script makes a
valiant attempt to figure out the beginning location of the last tree in the
file.  In particular, you should not modify lines so that they end with a
semicolon or a period (other than the final line of a tree).


TRANSITION-TRANSVERSION RATIO (T)

Format:  transition  ratio_value

Adds the transition-transversion option and appropriate auxiliary data line
to the input.  This option with a value of 2.0 (the program's default value)
can be used before a global or treefile option with auxiliary data.


TREEFILE (Y)

Format:  treefile  [format_number]

Adds the treefile option, and, if requested, the treefile format auxiliary
data line to the input data.  In this latter case, it is essential that the
input contain (or the treefile command be preceded by) a B, J or T
option.  The only formats at present are 1 (default Newick format) and 2 (a
prolog phylip_tree fact).


USERLENGTHS (L)

Format:  userlengths

Adds the userlengths option to the input.  This must be used in conjunction
with the usertree option, so it is usually easier to add the second parameter
to the usertree script (see below).  This script is useful if the usertree is
already part of the input, and you only need to alter the lengths option.


USERTREE (U)

Format:  usertree   tree_file [ L ]
or       usertrees  tree_file [ L ]

Adds the usertree option and appends the tree count and all of the trees in
the named file to the input.  If the L parameter is included, it is placed on
the options line to active that use of the program userlengths option.  Thus,
the following two commands are equivalent:

   usertree intree L < indata | fastDNAml
   usertree intree < indata | userlengths | fastDNAml

Note that all of the trees in the file are added.  Thus, this is not an 
appropriate script for evaluating the last tree in a checkpoint file.  Restart 
is the script to use for that latter task.


WEIGHTS (W)

Format:  weights  weight_data_file

Adds the weights option (W) and the corresponding data to the input.  The data
must have the format specified for PHYLIP dnaml 3.3.  The first line must be
the word Weights and AT LEAST 3 BLANK SPACES, followed by one weight value per
sequence position.  The weights 0 - 35 are represented by the series 0, 1, 2,
3, ..., 8, 9, A, B, C, ..., Y, Z.  These latter data can be on one or more
lines.  For example,

    Weights     111111111111001100000100011111100000000000000110000110000000
                111101111111111111111111011100000111001011100000000011


WEIGHTS and CATEGORIES (W C)

Format:  weights_categories  weight_and_category_data_file

Adds both the userweights and categories options to the input, along with the
weights and categories data in the named file.  This is useful for importing
the data from the DNAml_rates program into fastDNAml.  The format of the data
file should conform to that of weights auxiliary data followed by categories
auxiliary data.  As noted for the categories option above, in order to
generate output compatible with PHYLIP dnaml, this should be the first option
added (so that the categories data are inserted immediately before the
sequence data).




              SCRIPTS THAT PERFORM OTHER USEFUL FUNCTIONS



clean_checkpoints

Usage:  clean_checkpoints

Locates checkpoint files for which there is a corresponding treefile, and
deletes the former.  This is useful in cleaning up the directory when the
treefile option is in use.



clean_jumbles

Usage:  clean_jumbles  [ -nosummary ]  file_name_root

Finds all files with the form "file_name_root.PID", where PID is the process
ID appended by fastDNAml_boot and fastDNAml_loop to results from analyses with
different jumble seeds.  Unless the nosummary flag is included, all likelihoods
are summarized in a file named "file_name_root.summary".  The likelihoods are
screened for the best value.  An output file with the optimal score is renamed
to "file_name_root.out", and the corresponding treefile (if any) is renamed
"file_name_root.tree".  All other treefiles and output files matching the name
are deleted.  Corresponding checkpoint files are also deleted.



fastDNAml_boot

Usage:  fastDNAml_boot [-seed seed]  [-boots nboot] \
                       [-max maxjumble]  [-nice nicevalue]  [-noclean] \
                       in_file  n_best  [ "dnaml_opt1 [ | dnaml_opt2 ... ]" ]

For the current bootstrap seed, the sequence input order is jumbled (up to
maxjumble times) until the same best tree is found n_best times.  The output
files are then reduced to a summary of the scores produced by jumbling, and one
example of the best tree.  The number process is then repeated with new
bootstrap seeds until nboot samples have been analyzed.

Boot, jumble, treefile and quickadd are included by the script and should not
be specified by the user or in the data file.  AdditionalfastDNAml
program options are enclosed in quotes, and separated by vertical bars (|).

Flags and parameters:

    in_file         -- name of the input data file
    n_best          -- input order is jumbled (up to maxjumble times) until
                       same tree is found n_best times
    -boots nboot    -- number of different bootstrap samples (Default=1)
    -seed seed      -- seed for first bootstrap (Default is based on the process
                       ID and time of day)
    -max maxjumble  -- maximum attempts at replicating inferred tree
                       (Default=10)
    -nice nicevalue -- run fastDNAml with specified nice value (Default=10)
    -noclean        -- inhibits cleanup of the output files for the individual
                       seeds

Example:

    fastDNAml_boot -boots 100 -seed 137 -max 8 -nice 20  test_data.phylip  2



fastDNAml_loop

Usage:  fastDNAml_loop [-max maxjumble]  [-nice nicevalue]  [-noclean] \
                       in_file  n_best  [ "dnaml_opt1 [ | dnaml_opt2 ... ]" ]

For the given input file, the sequence input order is jumbled (up to maxjumble
times) until the same best tree is found n_best times.  The output files are
then reduced to a summary of the scores produced by jumbling, and one example of
the best tree.

Jumble, treefile and quickadd are included by the script and should not
be specified by the user or in the data file.  AdditionalfastDNAml
program options are enclosed in quotes, and separated by vertical bars (|).

Flags and parameters:

    in_file         -- name of the input data file
    n_best          -- input order is jumbled (up to maxjumble times) until
                       same tree is found n_best times
    -max maxjumble  -- maximum attempts at replicating inferred tree
                       (Default=10)
    -nice nicevalue -- run fastDNAml with specified nice value (Default=10)
    -noclean        -- inhibits cleanup of the output files for the individual
                       seeds

Example:

    fastDNAml_loop -max 15 -nice 20  test_data.phylip  3



out.PID

Usage:  out.PID  program  outfile
   or   out.PID  -o outfile  program  program_args

When a process sends its output to standard out, this script puts it in a
file whose name is the outfile.PID, where outfile is the name given by the
user and PID is the process ID of the running program.  This is useful when
a program is run a number of times, and output needs to be put in different
files.  Since the fastDNAml program uses the PID in generating the checkpoint
file name and the tree file name, it is useful to be able to put the program
output itself in a file with the corresponding suffix.



scores

Usage:  scores  files

Extracts and sorts likelihood values from fastDNAml output files.  The
corresponding file names are also provided.



trees2NEXUS

Usage:  trees2NEXUS  [ treefile ... ]

Concatenate Newick format trees from named file(s) or from standard input
into a single file suitable for the PAUP (tested) or MacClade (untested)
programs.  For example, it is possible to combine trees from bootstrap
replicates in order to use the PAUP concensus tree functions.



trees2prolog

Usage:  trees2prolog  [ treefile ... ]

Accepts Newick format trees from named file(s) or from standard input and
as converts them to prolog facts of the form pseudoNewick(Comment, Tree).
Each tree must be entirely on one line, and they must be similar to the
form produced by the fastDNAml program.  In particular, the only permitted
comment is one preceeding the whole tree.  If there is no leading commment,
then the fact produced will have arity 1.
