
     hom.doc                                               update 10/29/90
                            SIMILARITY SEARCH PROGRAMS

     I.  Theory      
          The programs P1HOM, P2HOM, D3HOM and D4HOM perform matrix similarity
     searches similar to those described by Pustell & Kafatos [1].  Briefly, 
     the two sequences to be compared are placed on the X and Y axes of a 
     matrix.  The programs begin the search by comparing the first L-bases or 
     peptides of SEQY (the sequence on the Y axis) with every group of L-bases 
     or peptides in SEQX (the sequence on the X axis).  If a match is found 
     which is as good or better than some predefined minimum value, a 
     character is printed at the corresponding X/Y position in the graph. The 
     programs then move down one position in sequence Y and repeat the 
     comparisons.  This cycle continues until every posible comparison has 
     been made. Similarities are detected as diagonal lines of characters in the 
     matrix.  Below is a sample plot, which compares part of the Alu1 Family 
     consensus sequence [2] with itself. 


     D3HOM         Version  3/23/90
     X-axis: Alu1-Consensus
     Y-axis: Alu1-Consensus
     SIMILARITY RANGE:  10      MIN.PERCENT SIMILARITY:  50
     SCALE FACTOR:    0.95      COMPRESSION:              1

                   185       195       205       215       225       235
            GGAGGCTGAGGCAGGAGAATCGCTTGAACCCAGGAGGTGGAGGTTGCAGTGAGCCGAGAT
           GA        .         .         .         .         .         .
           G A       .         .         .   U     .         .         .
           A  A      .         .         .    U    .         .         .
           G   A     .         .         .     S   .         .         .
        180G....A.......................................................
           C     A   .         .         .         .         .         .
           T      A  .         .         .         .         .         .
           G       A .         .         .         .         .         .
           A        A.         .         .         .T        .         .
           G         A         .         .         . V       .         .
           G         .A        .         .         .         .         .
           C         . A       .         .         .         .         .
           A         .  A      .         .         .         .         .
           G         .   A     .         .         .         .         .
        190G..............A.............................................
           A         .     A   .         .         .         .         .
           G         .      A  .         .         .         .         .
           A         .       A .         .         .         .         .
           A         .        A.         .         .         .         .
           T         .         A         .         .         .         .
           C         .         .A        .         .         .         .
           G         .         . A       .         .         .         .
           C         .         .  A      .         .         .         .
           T         .         .   A     .         .         .         .
        200T........................A...................................

          The efficiency of this search algorithm is proportional to the 
     lengths of SEQX and SEQY (N & M, respectively) times the length of the 
     subsequences compared (L).  Thus, a search of two sequences 100 bases 
     long, using subsequences of 11 bases requires L x M x N = 110,000 base
     comparisons.  A quick inspection of most matrix similarity outputs shows 
     that the vast majority of the area of the matrix contains either blank 
     space, which indicates that no local similarities were found, or very 
     small similarities which have probably occurred at random.  Thus, the 
     majority of the search time is spent investigating regions of 
     non-similarity.
          P1HOM, P2HOM, D3HOM and D4HOM use an algorithm [3] that greatly 
     speeds up the search.  A brief description of the algorithm for 
     comparison of two DNA sequences, used in D3HOM, follows.  The first step 
     involves the construction of a table listing each occurrence of the 64 
     trinucleotides in SEQX.  The table will contain N-2 numbers representing 
     the locations of the trinucleotides, divided among 64 classes.  Now the 
     search begins at the first trinucleotide in SEQY.  Using the table as a 
     guide, each occurrence of that trinucleotide in SEQX is located, and the 
     region centered on that position, HOMRANGE nucleotides to the left and 
     the right, are compared with the corresponding position in SEQY.  If the 
     match is good enough, a character is printed at the point in the matrix 
     which corresponds to the centers of the two trinucleotides.  The process 
     is repeated for each trinucleotide in SEQY.  Since each trinucleotide 
     occurs on the average only once every 64 bases, the algorithm only makes 
     N/64 searches for each triplet in Y, rather than N.  Thus, the efficiency 
     of the algorithm is L x M x N/64.  Likewise, D4HOM uses a 
     tetranucleotide table, with a search efficiency of L x M x N/256. The 
     search for similarities between two protein sequences is even faster.  In 
     P2HOM, a table of dipeptides is used, and since there are 400 possible 
     dipeptides, the efficiency of the protein comparison is L x M x N/400.  
     P1HOM, which uses a monopeptide table, is a more thorough but slower 
     search.  Its efficiency is L x M x N/20. 
          Whereas the simple L x M x N algorithm is exhaustive (ie. every 
     possible comparison is made), the new algorithm described herein will 
     miss some similarities.  This is because the search requires that a 
     similarity be centered on a perfect match of one or two amino acids (P1HOM 
     or P2HOM) or three or four nucleotides (D3HOM or D4HOM).  The search 
     algorithm allows us to generalize about how frequently the programs will 
     sample a given pair of sequences.  In the case of D3HOM, we expect that a 
     trinucleotide match will occur once every 64 bases by random chance 
     alone.  Since regions which share significant similarity must by definition 
     have a frequency of matches which is greater than random chance can 
     account for, similar regions will have more frequent matches, and are 
     consequently more likely to be sampled in a search.  See [3] for a 
     thorough discussion these and other points. 
          The power of the matrix similarity approach lies in the fact that 
     the real work of finding similarities is done not by the computer, but by 
     the user's brain.  It is the user's ability to recognize diagonal lines 
     which makes this method work. In the exhaustive approach, the local 
     similarities which constitute a diagonal are overlapping, and thus often 
     superfluous.  For example, if one is searching for similarities of 14/20 
     or better, and a similarity of 15/20 is found at position (x,y) in the 
     matrix, then the probability that the next similarity (x+1,y+1) will be 
     good enough is 1.  Hence, the redundancy of the search allows the 
     luxury of missing some local similarities without losing the overall 
     picture of the resultant diagonal.  It has been demonstrated
     mathematically that most significant similarities will be found by this
     approach [3].

     II. Menus

     The similarity programs begin by asking the user for tne names of files
     containing sequences to be placed on the X and Y axes of the matrix, and
     and one for output to be written to. In the case of the X-axis file, the
     user is also asked whether that sequence represents the opposite strand
     of the original sequence (see note 6).
     _____________________________________________________________________
     D3HOM                      MAIN MENU
     _____________________________________________________________________
     X-axis file:       b:aluicon.dna
     Y-axis file:       b:aluicon.dna
     Output file:       prn
     _____________________________________________________________________
                         1) Read in a new X-axis sequence
                         2) Read in a new Y-axis sequence
                         3) Open a new output file
                         4) Change parameters
                         5) Compare sequences and write output to screen
                         6) Compare sequences and write output to file
     _____________________________________________________________________
     Type the number of your choice  (0 to quit program)
     4
     
     Choosing option 4 in the main menu brings the user into the parameters
     menu.  The parameters menu can be thought of as dealing with the 
     sequence itself, and is thus independent of operating system.  The name
     and topology are read from the sequence file in the case of GENBANK or
     BIONET files, or are typed in when the sequence is read, in the case of
     Free-format files.      
     
     X-axis: Alu1-Consensus   Topology:     LINEAR    Length:      303 nt
     Y-axis: Alu1-Consensus   Topology:     LINEAR    Length:      303 nt
     _____________________________________________________________________
            Parameter   Description/Response                 Value
     _____________________________________________________________________
             1)STARTX   first nucleotide position in SEQX        1
             2)FINISHX  last  nucleotide position in SEQX      303
             3)STARTY   first nucleotide position in SEQY        1
             4)FINISHY  last  nucleotide position in SEQY      303
             5)HOMRANGE dist. from central triplet in a match   10
             6)SCALEFAC scale factor for exponential curve    0.95
             7)MINPER   minimum percent similarity printed      60
             8)COMPFACT graph compression factor                10
             9)KIND     D:DNA            R:RNA                   D
            10)LINESIZE width of output line (ex.70,120)        70
     _____________________________________________________________________
     Type number of parameter you wish to change (0 to continue)
     0
 

     Search begins...
     (* The output appears on the printer as shown below. *)

     D3HOM         Version 3/23/90
     X-axis: Alu1-Consensus
     Y-axis: Alu1-Consensus
     SIMILARITY RANGE:  10      MIN.PERCENT SIMILARITY:  60
     SCALE FACTOR:    0.95      COMPRESSION:             10

                   100       200       300

            C        .   QK    .         .         .         .         .
             A   T   .    I    .         .         .         .         .
              A      .     HK  .         .         .         .         .
               A     .      J  .         .         .         .         .
                A    .       MN. U       .         .         .         .
             T   A   .         .         .         .         .         .
                  A  .         .R        .         .         .         .
                   A .         .S        .         .         .         .
                    A.         .     L   .         .         .         .
        100 .........A...............LM.................................
                     .A        .         .         .         .         .
                     . A       .       P .         .         .         .
                     .  A      .        NS         .         .         .
            Q        .   A     .         .         .         .         .
            KI       .    A    .         .         .         .         .
              H      .     A   .         .         .         .         .
              KJ     .      A  .         .         .         .         .
                M    .       A .US       .         .         .         .
                N    .        A. T       .         .         .         .
        200 ...................A........................................
                  RS .       U .A        .         .         .         .
                U    .       ST. A       .         .         .         .
                     .         .  A      .         .         .         .
                     .         .   A     .         .         .         .
                     .         .    A    .         .         .         .
                    LL         .     A   .         .         .         .
                     M         .      A  .         .         .         .
                     . P       .       A .         .         .         .
                     .  N      .        AE         .         .         .
        300 ............S...............EAS.............................


     III. What the output means
          Since much has already been written on the theory and applications 
     of matrix similarity searches, a thorough discussion here would be 
     redundant, and the reader is strongly urged to read references 1,3 and 4
     before attempting to use these programs.  In reference 4, Maizel and Lenk 
     provide a clear and concise discussion of the theory of matrix similarity 
     searches.  In reference 1, Pustell and Kafatos present the character 
     matrix format used by P1HOM, P2HOM, etc., with some differences.  These 
     programs, as in the Pustell package, use characters instead of dots to 
     designate the positions of local similarities in the matrix.  Each 
     character represents a small range of similarity as shown in the table 
     matrix represents a local homology with 100% match.  "P" represents a 
     local similarity whose score is 70-71% of the score for a perfect match, 
     using the weighting function described in SCALEFAC (below) and in [1].  

     Char.    %        Char.    %        Char.    %        Char.    %
     A      100        N    74-75        a    48-49        n    22-23
     B    98-99        O    72-73        b    46-47        o    20-21
     C    96-97        P    70-71        c    44-45        p    18-19
     D    94-95        Q    68-69        d    42-43        q    16-17
     E    92-93        R    66-67        e    40-41        r    14-15 
     F    90-91        S    64-65        f    38-39        s    12-13
     G    88-89        T    62-63        g    36-37        t    10-11
     H    86-87        U    60-61        h    34-35        u     9- 8
     I    84-85        V    58-59        i    32-33        v     6- 7
     J    82-83        W    56-57        j    30-31        w     5
     K    80-81        X    54-55        k    28-29
     L    78-79        Y    52-53        l    26-27
     M    76-77        Z    50-51        m    24-25

          Unlike the Pustell package, no separate programs have been written 
     for comparisons of opposite strands or for finding inverted repeats.  
     Instead, these features have been incorporated into both programs.

     IV.  Constants
          Constant values are defined in the constant definition parts of 
     the main procedures of D3HOM and P2HOM.  To change them, it is 
     necessary to change their values in the Pascal text and re-compile.

     MAXSEQ- Maximum length of a DNA,RNA, or protein sequence.  The IBM-PC 
     defaults for D3HOM and D4HOM are 19000, and in P1HOM and P2HOM, 3000. 
     MAXWORD- Maximum length of sequence name on printout. (Default: 25) 
     MAXRANGE- Maximum value of RANGE parameter. (Default: 30)
     MAXLINE- Maximum width of output graph line.  Default value for
              LINESIZE.

     V.   Parameters
     STARTX
     FINISHX
     STARTY
     FINISHY
     STARTX and FINISHX are the first and last nucleotide or peptide positions 
     of SEQX to be compared with the part of SEQY beginning at STARTY and 
     ending at FINISHY.  For DNA sequences, it is necessary to compare both 
     strands of one sequence with one strand of the other sequence.  For 
     simplicity, D3HOM and D4HOM only allow SEQX to be reversed, while SEQY 
     remains constant.  To compare the opposite strand of SEQX with SEQY, use 
     NUMSEQ to create a file containing the opposite strand (see NUMINFO).  
     Now, use this file as input, and answer Y when asked whether SEQX is the 
     inverse strand.  In the printout, the numbering system of the original 
     strand will be retained.  Remember, however that since the numbering will 
     be decreasing, FINISHX must be less than STARTX.  Finally, although 
     D3HOM and D4HOM do not allow START and FINISH values which span the 
     origin of a circular molecule, similarities which overlap the origin will 
     still be correctly evaluated, because both "ends" of a circular sequence 
     are padded with a short stretch of sequence from the other end. 

     HOMRANGE
     HOMRANGE defines distance on either side of a local similarity which is 
     used in a comparison.  In D3HOM and P1HOM , the position of a local 
     similarity is the center of a given trinucleotide or peptide, and the size 
     of the area compared is L = (2 x HOMRANGE) + 1.  In P2HOM and D4HOM, the 
     position of a local homology is the leftmost amino acid or nucleotide and 
     the size of the area compared is L = 2 x HOMRANGE.  The defaults for 
     HOMRANGE are 10 in the DNA searches and 5 in the protein searches. 

     SCALEFAC
     SCALEFAC is the scale factor by which the score at a given distance 
     from the center of a local similarity is multiplied, to the power of the 
     distance.  The score given to an identity at the n-th position from the 
     center is
                                           n
                SUBSCORE[n] =  v x SCALEFAC

     where v is the value (100) for an identity at the center of the local 
     similarity.  In D3HOM and P1HOM, the center of a similarity is the center
     of a triplet or monopeptide, 0 bases or amino acids from the center.  In 
     P2HOM and D4HOM, both amino acids or bases in the central match are 1 
     from the center, and both are given a value of 100.  The total score is 
     the sum of the SUBSCORES at each positon, and the value of the homology 
     is expressed as a percent of the score that would be obtained if all 
     positions matched.  The percentage is indicated by the character printed 
     (see section III).  For example, if SCALEFAC= 0.95 and HOMRANGE=5, then a 
     match at the central nucleotide receives a subscore of 100, while an
     identity one base from the center receives a subscore of 95, 2 bases,
     90.25, and so on.  The maximum possible score (ie., 11/11 matches) would
     be 959.63... 
          SCALEFAC has the effect of weighting the central positions in a 
     local similarity more heavily than the outer ones.  This helps 
     eliminate some of the redundancy in the output of matrix plots, as well 
     as some of the background.  A SCALEFAC of 1 weighs matches at all 
     positions equally, so that the percent printed is the actual percent 
     identity.

     MINPER
     MINPER is the minimum percent identity printed on the matrix.  For D3HOM, 
     MINPER may range between 25 and 100 (default 60), while D4HOM only prints 
     similarities of 40% or better.  For P1HOM and P2HOM, MINPER may range 
     between 5 and 100 (default 50). 

     COMPFACT
     If COMPFACT= 1, the sequences will be printed on the X and Y axes, and 
     each character position in the graph will represent one nucleotide or 
     amino acid position.  For higher values of COMPFACT, each character 
     position in the graph represents COMPFACT positions.  For example, if 
     COMPFACT = 10 (the default), then each character in the matrix 
     represents 10 x 10 = 100 comparisons.  The best similarity found at that 
     point in the graph will always be printed.  Although COMPFACT may be 
     set as high as 100, it is not recommended to use a value greater than 
     20, since many similarities will be difficult to distinguish above 
     background. 

     KIND (D3HOM & D4HOM only)
     If KIND='D' (the default) the nucleotide sequence will be printed as 
     DNA. If KIND='R' then it will be printed as RNA.

     LINESIZE
     Width in characters of one graph line.  For printers with standard 8 
     1/2 in. wide pages, use 70 (the default).  For printers capable of 
     handling wide forms, use 100 or 120.  These sizes may be expanded 
     further on dot-matrix printers capable of using small characters.  See 
     note 2 below for further discussion. 

     VI. Usage Notes 
     1.  As with any similarity search, the programs may be used to find 
         direct repeats and inverted repeats.  Direct repeats within a 
         sequence may be found by comparing a sequence with itself, as shown 
         in the self comparisons of the Alu1 Family consensus sequence above.  
         The "line of identity", which is composed of 'A's, occupies the 
         main diagonal, indicating that the sequence is identical 
         with itself.  The off-center diagonals of imperfect similarity
         show that the Alu1 Family sequence is really a dimer, formed by an 
         ancient duplication event.  
              Inverted repeats, which in DNA or RNA sequences are regions  
         which potentially could form stem and loop structures, are found by 
         comparing a given strand, written 5'-->3', with the same strand 
         written  3'-->5'.  NUMSEQ can be used to generate a 3'-->5' strand.  
         However,  this method of finding potential secondary structures has 
         its limitations.  For more sophisticated methods, see articles in 
         [5]. 
              Very poor similarities, in which the percent identity of local 
         matches is close to background, can often be visualized as a 
         line of unconnected local matches lying on a long diagonal.  
         Frequently, their percentage identity will be consistantly a little 
         higher than background. 

     2.  When comparing sequences of more than a few hundred bases or amino 
         acids, one must consider both speed of execution and memory 
         limitations.  By using high vales of COMPFACT and LINESIZE, it is 
         possible to fit long stretches of sequece on one output page.  For 
         example, if COMPFACT=20 and LINESIZE=100, 2000 bases of SEQX may 
         fit on one page.  If the region between STARTX and FINISHX is 
         greater than 2000, then that region will be compared with SEQY in 
         increments of 2000. 
             
         The efficiency of the search algorithm used in the programs must be 
         paid for by an increase in memory used.  If SEQX requires N-bytes of 
         memory, then the table for SEQX will require approximately 4N-bytes.  
         In practice, however, this is usually not a problem, since only the 
         part of the sequence being searched at any given time needs to have a 
         table.  Thus, if the most positions which will fit on a page are 
         2000, then the table need only be 8000 bytes.  For computers larger 
         than the AppleII, the limiting factor becomes the page size.  Since 
         readability is greatly impaired using COMPFACT values greater than 
         20, probably no more than 2000 bases should ever be searched at one 
         time. 
         
         There is a trade-off between the sizes of the sequences which can be 
         stored in memory and the size of the region which may be compared at 
         one time (ie. on one page).  The programs allocate room for two 
         sequences of length MAXSEQ-MAXRANGE, regardless of how long the two 
         sequences actually are.  This sort of storage is referred to as a 
         static array, and is required by Standard Pascal.  However, the table 
         used to search SEQX is allocated dynamically, and is thus limited by 
         the amount of free memory in the heap.  This amount is much smaller 
         than one might guess. The amount actually allocated depends on the 
         size of the region of SEQX searched at one time, which, in turn, 
         depends upon the values of STARTX, FINISHX, COMPFACT, and LINESIZE.  
         Thus, it is possible to increase MAXSEQ in order to compare longer 
         sequences, but memory limitations may require that shorter stretches 
         be compared at one time. 

     3.  When comparing a very long sequence with a short sequence, it is 
         probably easiest to put the smaller sequence on the X-axis and the 
         longer sequence on the Y-axis.  Although the time required to 
         actually make up the table is negligible, it will take less printing 
         time to do the search in one long page rather than a series of short 
         pages.  Remember that the efficiency of the search is the same, 
         regardless of which sequence is on the X-axis. 

     4.  Unknown nucleotides in a DNA or RNA sequence (N's) or unknown 
         peptides in a protein (X's) are always counted as mismatches in a 
         similarity comparison.  This is sensible, since, in a comparison of 
         any two bases or amino acids chosen at random, it is always more 
         likely that a mismatch will occur, rather than a match.  One 
         unexpected consequence of this is that the ends of linear 
         molecules, which are padded with N's or X's to facilitate an 
         efficient search, will always show imperfect homology, even in a 
         self comparison.

         Other ambiguous nucleotides (eg. Y, which stands for C or T) can be    
         read in by D3HOM and D4HOM, but will only count as a match when
         matching themselves.  For example, Y in sequence X will match Y in
         sequence Y, but not A or G.  Additionally, no ambiguous nucleotides
         may appear in the central triplet or tetranucleotide of a match.
         These rules allow ambiguous nucleotides to serve as place holders
         without contributing significantly to the overall score of similarity.
         At the same time, they make it possible to perform similarity searches 
         between two sequences containing many ambiguities, such as consensus
         sequences. 

     5.  P1HOM & P2HOM permit the use of STOP (*) as an amino acid.  This 
         allows one to translate long stretches of DNA into the 
         corresponding amino acid sequence and compare them without having 
         to divide them into discrete polypeptides. 

     6.  If both strands of a DNA sequence must be compared, choose one
         sequence to appear on the X-axis, and generate its opposite strand
         using NUMSEQ.  (The Y-axis sequence will remain constant.)  Compare 
         both original strands first, and then use option 1 in the main menu
         to read in the opposite strand.  When asked whether the sequence 
         being read is the opposite strand, type Y. Subsequently, numbering on
         the X-axis will decrease from left to right, to retain the numbering
         scheme of the original sequence. Note that the Y-axis sequence can not
         be numbered as an opposite strand.

     VII. References 
     1.  Pustell,J. and Kafatos, F., "A high speed, high capacity similarity 
         matrix: zooming through SV40 and polyoma" Nucl. Acids Res. 10:15 
         (1982) pp4765-4782.
     2.  Deininger, P.L. et al., "Base Sequence Studies of 300 Nucleotide 
         Renatured Repeated Human DNA Clones", J.Mol.Biol. 151 (1981) 
         pp17-33.                          
     3.  Fristensky, B. "Improving the Efficiency of Dot-Matrix Similarity
         Searches Through Use of an Oligomer Table", Nucl. Acids Res. 14
         p597-610 (1986).
     4.  Maizel, J.V. and Lenk, R.P.,"Enhanced graphic matrix analysis of 
         nucleic acid and protein sequences", Proc. Natl. Acad. Sci. USA 
         78:12 (1981) pp7665-7669.
     5.  "The applications of computers to research on nucleic acids II",
         Soll,D. and Roberts, R.J. ed., Nucl. Acids Res. 12:1 (1984).
