
     comp.doc                                           update   5/15/90
                                      COMP

     I. Function- COMP determines the base or amino acid composition of a DNA, 
     RNA, or protein sequence as a function of position.  The user defines a 
     set of characters, called COMPSET, which are searched for at each 
     position. At the starting position, COMP calculates the percentage of the 
     first REGION bases or amino acids that are members of COMPSET. It then 
     shifts SKIP positions to the right and recalculates the percentage. This 
     cycle is repeated until the entire sequence or subsequence has been 
     searched.  The output may be sent to any file in the form of a table, but 
     by default, COMP will write the resultant coordinates to a file that is 
     directly readable by LINEPLOT.  LINEPLOT then uses these points to create 
     a graph of composition as a function of position in the sequence. 
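
     The scan described above can be sketched in Python as follows. This is 
     an illustrative sketch, not COMP's actual source: the name comp_scan, 
     its arguments, and the stopping rule (the whole window must lie within 
     START..FINISH) are assumptions. Reported positions follow the centring 
     rule given under REGION in section III. The handling of N, X, and * 
     described in the usage notes is left out.

```python
def comp_scan(seq, compset, region, skip, start=1, finish=None):
    """Return (position, percent) pairs for a sliding composition window.

    Positions are 1-based. Each reported position is the window's left
    edge plus region // 2, following the centring rule in section III.
    """
    if region < 1 or skip < 1:
        raise ValueError("region and skip must be positive")
    if finish is None:
        finish = len(seq)
    members = set(compset)
    results = []
    left = start  # 1-based left edge of the current window
    # Assumed stopping rule: the whole window must lie within START..FINISH.
    while left + region - 1 <= finish:
        window = seq[left - 1:left - 1 + region]
        hits = sum(1 for residue in window if residue in members)
        results.append((left + region // 2, 100.0 * hits / region))
        left += skip  # shift SKIP positions to the right
    return results


# First 50 residues of the Beta-globin sequence shown in section II.
BETA_GLOBIN_1_50 = "MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLS"

for position, percent in comp_scan(BETA_GLOBIN_1_50, "AFILMPVW", region=10, skip=5):
    print(position, percent)
```

     With REGION=10 and SKIP=5, the first eight pairs agree with the sample 
     table in section II (40.0 at position 6, 50.0 at position 11, and so 
     on). The sketch also yields a ninth window covering residues 41-50, 
     which the sample table does not list, so COMP's own stopping rule may 
     differ from the one assumed here.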

     NOTE: COMP ONLY READS FREE FORMAT FILES, NOT GENBANK, NBRF, OR BIONET.

     II. Program Flow
     Program output and user responses are listed as they would actually 
     appear on the screen.  Comments, which are listed here for explanatory 
     purposes but would not appear, are enclosed in the symbols (* *).

                         COMP                 Version 5/15/90
     Type N for DNA or RNA, P for protein sequence:
     P
     Type input filename:
     b:humhbb.pro                           (* IBM-PC DOS protocol *)
     Reading input file...

     MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
     AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
     ALAHKYH*
     Type output filename:
     b:humhbb.dat                           (* IBM-PC DOS protocol *)
     Type name to appear on output:
     Beta-globin

     (* COMP displays parameters which may be changed by user *)
     Parameter   Description/Response                    Value
     ---------------------------------------------------------
      1)START    first position searched                    1
      2)FINISH   last  position searched                  148
      3)REGION   width of region searched at each posn.    10
      4)SKIP     move right SKIP posn.after each search     5
      5)FORMAT   G: print a graph  T: print a table         G
      6)COMPSET  bases or amino acids to search for
        COMPSET  [ ]

     Type number of parameter you wish to change
     (0 to continue)
     6


     COMPSET:    [ ]

     The following may be added to or subtracted from COMPSET:
      1) a single amino acid
      2)NONPOLAR    [ A F I L M P V W ]
      3)UNCHPOLAR   [ C G N Q S T Y ]
      4)ACIDIC      [ D E ]
      5)BASIC       [ H K R ]

     Type the number of your choice:
        (0 to continue)
     2                   (* user has chosen to search for non-polar a.a.'s*)
     Type + to add, - to subtract:
     +
     COMPSET:    [ A F I L M P V W ]    (* Current value of COMPSET *)

     The following may be added to or subtracted from COMPSET:
      1) a single amino acid
      2)NONPOLAR    [ A F I L M P V W ]
      3)UNCHPOLAR   [ C G N Q S T Y ]
      4)ACIDIC      [ D E ]
      5)BASIC       [ H K R ]

     Type the number of your choice:
        (0 to continue)
     0

     Parameter   Description/Response                    Value
     ---------------------------------------------------------
      1)START    first position searched                    1
      2)FINISH   last  position searched                  148
      3)REGION   width of region searched at each posn.    10
      4)SKIP     move right SKIP posn.after each search     5
      5)FORMAT   G: print a graph  T: print a table         G
      6)COMPSET  bases or amino acids to search for
        COMPSET  [ A F I L M P V W ]

     Type number of parameter you wish to change
     (0 to continue)
     0

     (* COMP calculates the coordinates and writes them to the output file.  
     Using LINEPLOT, MAXHSCALE is set to 0.15 (i.e., 0.15 x 1000 amino acids) 
     and the graph is printed as shown below. *)

     P 1.000E+02|
     E          |
     R          |
     C          |
     E 9.000E+01|
     N          |
     T          |
                |
     [ 8.000E+01|
     A          |
     F          |
     I          |
     L 7.000E+01|                                    *
     M          |
     P          |
     V          |
     W 6.000E+01|    *    * *                          *  * *
     ]          |
                |
                |
       5.000E+01....*............*.*.....*.........*.........*.....
                |
                |
                |
       4.000E+01| *    * *    **    * * *  * **   *     *
                |
                |
                |
       3.000E+01|                               *
                |
                |
                |
       2.000E+01|
                |
                |
                |
       1.000E+01|
                |
                |
                |
       0.000E+00-----+----+----+----+----+----+----+----+----+----+
          0.0E+00   3.0E-02   6.0E-02   9.0E-02   1.2E-01   1.5E-01
     Posn. in  Beta-globin (REGION= 10 SKIP=       5)

          If only a few data points are expected, COMP may also be used to 
     produce output in tabular format.  This is most useful if you want to 
     find the base composition of a large region or the entire sequence.  To 
     use the table option, change FORMAT to 'T'.  COMP will then search as 
     above, but print only the values found at each position, omitting 
     graph parameters.  After each search has been completed, the message 

     Type Q to quit, S to search again:

     gives the user the option to change search parameters and search again.  
     If, for the sequence used above, the parameters had been changed so that 
     FORMAT='T' and FINISH=50, the non-polar amino acid content of positions 
     1 through 50 would have been calculated and sent to the output file as 
     shown below: 

     Beta-globin         SKIP=        5
     POSN.  PERCENT   [ A F I L M P V W ]
     0.006  40.0 
     0.011  50.0 
     0.016  60.0 
     0.021  40.0 
     0.026  40.0 
     0.031  60.0 
     0.036  60.0 
     0.041  40.0 
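
     In the sample table, the POSN. column appears to hold the sequence 
     position divided by 1000 (position 6 is written as 0.006), which is 
     consistent with the MAXHSCALE comment in the graph example. That reading 
     is an inference from the sample output, not a documented rule. Under 
     that assumption, rows of this form could be produced as sketched here; 
     the function name table_rows and the column widths are hypothetical.

```python
def table_rows(points, scale=1000.0):
    """Format (position, percent) pairs like the rows of the sample table.

    `scale` is an assumed divisor: positions are written in units of
    1000 residues, so position 6 becomes 0.006.
    """
    return [f"{position / scale:5.3f}  {percent:4.1f}"
            for position, percent in points]


for row in table_rows([(6, 40.0), (11, 50.0), (16, 60.0)]):
    print(row)
```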

     III. Parameters
     START
     FINISH
     START and FINISH determine the part of the sequence to be searched. By 
     default, START is the first position and FINISH is the last. 

     REGION
     REGION is the width of the region centered on a given position in the 
     sequence, for which a percent composition is to be calculated.  Thus, if 
     REGION = 30 and the current position is 260, COMP will calculate the 
     percent composition of the part of the sequence beginning at 245 and 
     ending at 274.  COMP can only determine composition for complete regions. 
     Thus, if REGION=20 and START=1, the first position at which a value can 
     be calculated is 11, since its region begins at position 1. 
          The size of REGION strongly affects how much the percent 
     composition fluctuates.  As one might expect, over very large REGIONs 
     the fluctuations are damped out and the composition tends toward a 
     constant value. Conversely, for small values of REGION (e.g., a few 
     nucleotides), the resultant graph will have numerous jagged peaks and 
     valleys. 
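
     The window arithmetic in the REGION examples above can be written out 
     as a small sketch. The function names are hypothetical, and the use of 
     integer division (which decides how an odd REGION is split around the 
     centre) is an assumption; the document only gives examples with even 
     REGION values.

```python
def window_bounds(position, region):
    """Return the 1-based (first, last) positions of the window
    centred on `position`, with region // 2 residues to its left."""
    first = position - region // 2
    last = first + region - 1
    return first, last


def first_valid_position(start, region):
    """Smallest position whose window does not begin before START."""
    return start + region // 2


print(window_bounds(260, 30))       # REGION=30 centred on position 260
print(first_valid_position(1, 20))  # REGION=20 with START=1
```

     For REGION=30 at position 260 this gives the window 245 to 274, and 
     for REGION=20 with START=1 the first usable position is 11, matching 
     the examples in the text.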

     SKIP
     After calculating the percent composition at a given position, COMP moves 
     right SKIP positions.  Generally, SKIP should be small relative to 
     REGION. If REGION equals the size of the sequence, then the entire 
     sequence will be searched once, and the value of SKIP is irrelevant.  
     This would occur if the base composition of the sequence as a whole were 
     to be determined. 
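
     The interaction between REGION and SKIP can be illustrated by counting 
     how many complete windows fit in a sequence. This is a sketch under the 
     assumption that only windows lying entirely within the sequence are 
     evaluated; window_count is a hypothetical helper, not part of COMP.

```python
def window_count(length, region, skip):
    """Number of complete windows of width `region` in a sequence of
    `length` residues when the window advances `skip` positions."""
    if region < 1 or skip < 1:
        raise ValueError("region and skip must be positive")
    if region > length:
        return 0  # no complete window fits
    return (length - region) // skip + 1


# When REGION equals the sequence length, exactly one window fits,
# whatever the value of SKIP.
print(window_count(148, 148, 5), window_count(148, 148, 1))
```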

     FORMAT
     By default, FORMAT=G, which results in the output file being written in 
     a format readable by LINEPLOT.  Setting FORMAT to T will result in the 
     output appearing in tabular form, one coordinate per line.  This is 
     recommended only when few output points are expected. 

     COMPSET
     The user is given the option of defining a set of nucleotides or amino 
     acids to search for. For example, to search for purine rich regions, 
     COMPSET would be set to [A G]. Nucleotides or amino acids can be added to 
     or subtracted from COMPSET one at a time, or, for amino acids, in groups, 
     such as ACIDIC, BASIC, etc. Subtracting nucleotides or amino acids that 
     are not members of COMPSET will have no effect. 
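
     The add and subtract behaviour of COMPSET maps naturally onto set union 
     and set difference. The sketch below uses the amino acid groups listed 
     in the COMPSET menu of section II; the function and variable names are 
     hypothetical and do not reflect how COMP itself stores the set.

```python
# Amino acid groups as listed in the COMPSET menu.
AMINO_GROUPS = {
    "NONPOLAR": set("AFILMPVW"),
    "UNCHPOLAR": set("CGNQSTY"),
    "ACIDIC": set("DE"),
    "BASIC": set("HKR"),
}


def add_to_compset(compset, members):
    """Return COMPSET with `members` added (set union)."""
    return set(compset) | set(members)


def subtract_from_compset(compset, members):
    """Return COMPSET with `members` removed (set difference).
    Removing characters that are not in COMPSET changes nothing."""
    return set(compset) - set(members)


compset = add_to_compset(set(), AMINO_GROUPS["NONPOLAR"])
compset = subtract_from_compset(compset, AMINO_GROUPS["ACIDIC"])  # no effect
print(sorted(compset))
```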

                           
     IV. Input file
     The input for COMP may be any free-format DNA, RNA, or protein sequence 
     file as described in the general notes. 


     V. Usage notes
     1. When COMP calculates base composition for a given region of a DNA 
     sequence, N's are ignored.  Similarly, for proteins, X's and *'s are 
     ignored.  Thus, the A-composition for a given REGION is 


                                                  A
                 A composition=      ---------------------------
                                            A + G + C + T

     such that the sum of the base compositions always equals unity.  If N's 
     were included in the calculation, all base compositions would be 
     underestimates, since, in reality, even the unknown part of the sequence 
     consists of A,G,C, & T.  Although the true base composition of the 
     unknown part of the sequence may differ from that of the known part, it 
     is probably best, unless other information is available, to assume that 
     they are the same. 
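
     The effect of ignoring N's can be seen in a short sketch. The function 
     name base_composition is hypothetical, and one simplification is 
     assumed: every character outside the given alphabet is dropped from the 
     denominator, not only N.

```python
def base_composition(seq, alphabet="AGCT"):
    """Fraction of each base among the known bases in `seq`.

    Characters outside `alphabet` (for example N) are ignored, so the
    returned fractions sum to 1 whenever any known base is present.
    """
    known = [base for base in seq.upper() if base in alphabet]
    total = len(known)
    if total == 0:
        return {base: 0.0 for base in alphabet}
    return {base: known.count(base) / total for base in alphabet}


comp = base_composition("AAGNCTNN")
print(comp["A"], sum(comp.values()))
```

     Here A makes up 2 of the 5 known bases. Had the three N's been counted 
     in the denominator, the A composition would have been 2 of 8, an 
     underestimate in the sense described above.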

