Instructions for  reticulate, 
a program for calculating Compatibility Matrices

Copyright 1995 by Ingrid Jakobsen, Australian National University

Permission is granted to use, copy, modify and distribute this program 
and accompanying documentation provided no fee is charged, and this 
copyright notice is not removed. 
No representations are made about the suitability of this software for 
any purpose.

email    ingrid@helios.anu.edu.au

Note: this file includes ASCII diagrams. You will need to display/print it
using a non-proportional font (Courier) to get full benefit of the diagrams.

Contents
********
List of supplied files
Advice on compiling the program successfully
Modifying the program for systems that don't have X11 graphics
How to run the program, including description of data input format
Notes on the various options available
Descriptions of the randomisation functions
References

Supplied Files
**************
Instructions	This file

reticulate.c	Main body of program
XPlot.c		X graphics functions
XPlot.h		Headers for X graphics
Makefile	Makefile

gamglob.gde	Gamma globin sequences aligned in gde flat file format


Compilation of program
**********************
You need the four files: reticulate.c XPlot.c XPlot.h Makefile.

Edit the makefile to suit your local system, most importantly there must
be a line (without the #) where CFLAGS points at the local X11 include files.

If necessary, edit the #define statements at the beginning of reticulate.c.
It may be required to set the papersize to suit the printer you intend 
to send output to. The program currently sets up the page to be the smaller
of A4 and US letter in each direction, which should allow printing on most 
printers, but you may wish to set it to the exact local size. In addition,
the margins the program will print out to are quite narrow, and may
need to be widened. All measurements are in points, where 72 points = 1 inch.
Unfortunately points do not convert nicely to cm, since 1 inch = 2.54 cm.

Type   make   at the prompt. The compiled program is called reticulate.

There may be unknown difficulties compiling this program on other systems,
I only have experience with Sun OS 4.1.3 and Silicon Graphics IRIX. I have
tried very hard to only use standard ANSI C.

Compiling the program without X11
*********************************
You can compile and run a modified version of reticulate. Unfortunately, you 
won't be able to see the matrices on screen. You will only need reticulate.c
Edit it, to remove the following lines:

    (In the headers to the program)
#include <X11/Xlib.h>
#include <X11/Xutil.h>
#include "XPlot.h"

    (In the main body of the program, the interactive section)
	if (option == 'd') /* Display the matrix on screen */
	{
	    printf("\n\nThe matrix will be displayed in a separate window.");
	    printf(" When you have examined\nit, click the mouse once in the ");
	    printf("window to continue the program.\n");
	    RetCode = ShowXPlot(NumSeq, SeqLen, codonnum, codon_pos, 
			site_matrix[into]);
	    if (RetCode)
		printf("Unable to display matrix. \n");
	}

    (edit the immediately following line to remove the "else")
	else if (option == 'p')  /* postscript output */
    (becomes:)
	if (option == 'p')  /* postscript output */

    (You can also remove this line for niceness:)
	printf("  (d) Display the current matrix on screen (X display)\n");

    (You may also wish to change the definition of 'option' in main() , to 
     whatever suits you, 'p' is probably best)
    char    option = 'd';
     (becomes:)
    char    option = 'p';

The program should now compile with a standard C compiler, for example
gcc reticulate.c -o reticulate

Alternatively, you may have a local graphics package you can plug in. Instead
of removing the above lines, you need to edit them to include the package,
however it works. Basically the idea is to generate a square window, the
size of the number of cladistically informative sites, and make all 
co-ordinates corresponding to the letter 'm' in the compatibility matrix
black, the rest should be left white. See the ShowXPlot function in XPlot.c

Running the program, data input formats
***************************************
Type   reticulate   at the prompt.

The program will ask for the name of the input file, followed by various
options. The most common problem we encounter when running the program is to 
be in a different directory than the input data file.

There should be at least 4 sequences to be analysed, and they should previously
have been aligned. If the alignment isn't suitable for phylogenetic analysis,
it's not suitable for a compatibility matrix.

The format of the data should be FASTA or related formats, eg

>Sequence1
ACTACGGGATAACAATT
AGATACAGATACGAGAG
AGTCTCTC
>Sequence2
ACTACGCGATAATGATT
AG--ACAGATAGCAGAG
?GTCCCTC
>Sequence3
-----GGGATTACGATG
AG--ACAG?TACCAGAG
AGTCTCTT

etc. 

Each sequence appears as a complete block, one after the other. The first
line of each sequence has a marker character, followed by the sequence name
and any other comments you like, but no actual sequence data. Only a fixed
number of characters are read in and saved as the sequence name. This is
currently set to 15, but can be altered in the source code.

The marker character can be any non-alphanumeric character.
For example   >   #   ;   %   are all acceptable. The same character should 
be used to mark all sequences. It should be the first character on the 
name line. For example, check that there are no spaces in front of it. 
Avoid using  ?  or  -  as a marker character. They are used to mark 
unknowns and gaps in the sequences themselves.

Gaps should be marked with '-' characters, as many as the length of the
gap, to maintain the correct alignment. However, if any sequence is shorter
than the others, the final gap does not need to be marked. It will be filled
automatically.
Unknown characters should be marked with a '?', or alternatively one
of the characters 'N' (for DNA) or 'X' may be used. The program will ask you
if any 'N' or 'X' characters it encounters are to be taken as unknown and if
you reply yes, will code them internally as '?'. It is only possible to set
one of 'N' or 'X' to be recorded as unknown.

There can be any amount of whitespace in the sequences, it will be ignored.
Make sure that for example no gaps in the alignment have been marked 
using spaces.

Related formats which are also accepted include GDE flat file format:

#Sequence1
ACTACGGGATAACAATT
AGATACAGATACGAGAG
AGTCTCTC
#Sequence2
ACTACGCGATAATGATT
AG--ACAGATAGCAGAG
?GTCCCTC
#Sequence3(5)
GGGATTACGATGAG--A
CAG?TACCAGAGAGTCT
CTT

etc

In GDE flat file format, a number in parentheses on the name line is 
interpreted as an initial number of gaps at the front of the sequence
(i.e. an offset). The program asks if numbers in parentheses on the name 
line should be interpreted as initial gaps. Note that answering this 'no'
when there are gaps marked in this way, will result in the wrong alignment 
being read in, and possibly vice versa if you happen to have any comments in
brackets on the sequence name line.

Calculating the matrix
**********************
The program will read in the data file and then examine the file for 
cladistically informative sites. The options for dealing with sites
with more than two characters (non-binary sites) are: to ignore those 
sites altogether; to leave them as multiple-character sites; to convert 
DNA into transversion sites, or to manually decide binary partitions for 
each such site in turn.
In the manual option, the number of sequences with each character state are 
presented for each site. The options are to ignore the site, convert to 
binary, or to leave as is. When converting to a binary site, enter the 
characters that you would like grouped into the first of the two groups. As 
a check, the binary characters and counts are listed before the next site is 
presented.
When the DNA transversion site option is invoked, non-binary sites are
grouped into 'Y' (Y or C or T/U) and 'R' (R or A or G). If any other
character is found, the program moves onto the manual option above to
resolve the site.
At this point, the informative sites can be saved into a text file, with
the character states in the form the program used to determine compatibility
- for example if the transversion option was set for DNA, non-binary sites
will appear as 'Y' and 'R' characters rather than nucleotides.

For determining sites and calculating compatibility, sequences with gaps or
unknowns will be ignored. If for example a gap needs to be considered for
determining compatibility, one of the gap characters should be replaced by
a letter (or number) in all the sequences sharing the gap.

Compatibility is calculated by comparing two informative sites. All distinct
combinations of characters at the two sites are found. For example, for the
five sequences below, the program had previously identified three informative
                            sites (they are repeated to the right of the 
     2   6 8       2 6 8    alignment) then, when comparing sites 2 and 6,
A   ACGAACGTACGT   C C T    the pairs (C,C) (A,C) (A,T) are found, for
B   ACGTACGGACGT   C C G    2 and 8 they are (C,T) (C,G), (A,T) (A,G) and
C   AAGTACGTACGT   A C T    for 6 and 8, (C,T) (C,G) (T,G). Note that the
D   AAGTATGGACGC   A T G    order within the pairs is important. To find out
E   AAGTATGGACGT   A T G    if two sites are compatible, any pairs that have
                            a unique character in either position is 
eliminated. So for sites 2 and 6 we can eliminate (C,C) since no other pair 
has a C in the first position, and (A,T) since no other pair has a T in 
the second position. The process is repeated and (A,C) can now be eliminated, 
since it is the only pair left. Since all pairs are eliminated sites 2 and 6 
are compatible. On the other hand, for sites 2 and 8, we cannot eliminate
any pair, as both C and A in the first position occurs in two pairs, and 
similarly T and G in the second position. This means sites 2 and 8 are
incompatible. For sites 6 and 8, (C,T) and (T,G) can be eliminated - the 
T is unique in each position. Thus 6 and 8 are also compatible.

   2 6 8   The comparisons of sites are displayed in a matrix format.
2  \ . M   The compatible comparisons are coloured white, the incompatible
6  . \ .   are black. On the left, white squares are represented  by  .
8  M . \   and  black by M . The diagonal (self with self) is \ .

The compatibility matrix is calculated as the pairwise compatibility of all 
informative sites and displayed in a separate X11 window on the screen. 
Click the mouse anywhere in the matrix window to continue the program.

Options
*******
The program ends with an interactive options menu. The options are to 
display the matrix (using X11), save the matrix in postscript for printing,
and a randomisation option which allows you to find some statistics about 
the matrix, and the significance of any clustering you may have observed.

The random matrix option calculates the matrix's neighbour similarity score
and compatibility of subregions of the matrix if desired. Random matrices
are then calculated, as many as desired, provided memory can be allocated.
The summary statistics for the random matrices are printed in a text file.
The final random matrix is stored. In the main options menu, it is possible
to change "current matrix" to the random matrix, and so display or save the 
random matrix. Obviously, by calling the randomisation option several times,
you can view a number of different random matrices of the original data. 
However, the randomisation is done based on the "current matrix", so it is 
necessary to return to the original matrix before running the randomisation 
option again, if you would like the statistics to make any sense.

Description of randomisation algorithm and Neighbour similarity score.
**********************************************************************
The neighbour similarity score is calculated by examining each internal
edge in the (triangular half) matrix and scoring those edges separating
squares of the same colour. Thus most squares are compared to four other
squares, while those on the edges and corners are only compared to three,
two respectively. The diagonal entries are included and treated as compatible.
However since this convention is maintained when scoring the random matrices, 
it should be of little practical importance.

The compatibility score of the overall matrix or of subregions about the
diagonal is the fraction of (off-diagonal triangle) entries that are compatible
in that region. Thus the compatibility of a region with only one site is
undefined, two sites have only one entry, three sites have three, four sites
have six, etc. Again the value is calculated identically for the observed
and random matrices.

The matrix is randomised by shuffling the sites rather than the individual
squares. It is a shuffle rather than a bootstrap - each original site appears 
once in each random matrix. The randomisation by sites ensures that the 
individual character of sites is retained, for example hypermutating sites 
remain as "black" rows and columns in the matrix. Preliminary testing had 
shown that randomisation of squares rather than sites made matrices appear 
much more statistically significant - the test was of the degree of 
compatibility of individual sites rather than their order along the sequence.

Notes on pairwise comparisons and non-binary sites
**************************************************
The current program only calculates compatibility for two sites at a time,
however it is possible to do compatibility for multiple sites at once. It
wasn't implemented here as it is difficult to display and interpret multi-
dimensional matrices. As you would have noticed in the example, it is 
possible for site A to be compatible with B and B with C but that does not
mean A is compatible with C. Somewhat similarly it is possible for A to be
compatible with B, B with C and A with C, but all three sites are *not*
compatible when considered as a triplet. This problem however only occurs
when sites have more than two distinct character states. This is why the
program allows sites with more character states to be ignored altogether,
or converted into binary sites.

If you are worried about a group of sites (all of which have more than two
character states) being pairwise compatible but not overall compatible, the
algorithm for determining compatibility can be extended to triplets etc. It
gets progressively more messy however. "Sets" (the generic of the pairs) are
not eliminated when unique in one position, but in some cases replaced by 
new sets. For the above example (which we already know to be incompatible 
since 2 and 8 are incompatible) the distinct triplets are (C,C,T) (C,C,G)
(A,C,T) (A,T,G). In (A,T,G), T is unique in the middle position. The triplet
is replaced by any (A,*,G) we can infer from the other sets. (A,C,T) 
matches in the first position generating (A,C,G) and similarly (C,C,G)
matches in the last position, also generating (A,C,G). However, only one is
included as each distinct set is only represented once, as before. Now
(C,C,T) (C,C,G) (A,C,T) and (A,C,G) are left and the three sites are 
incompatible since every character in every position appears in at least one
set. The fact that the second position (site 6) is now only C, confirms that
the incompatibility in this triplet is really between 2 and 8.

The pairwise algorithm is really a special case of the description above, as
the pair(s) that replaces an eliminated pair is already present in the list.

References
**********
Sneath, P.H.A., Sackin,M.J. & Ambler,R.P.
Detecting evolutionary incompatibilities from protein sequences
Syst. Zool. 24: 311-332 (1975)

Jakobsen, I.B. & Easteal, S.
A program for calculating and displaying compatibility matrices as an aid
in determining reticulate evolution in molecular sequences
submitted to CABIOS (1995)

**********************************************************
