  
      XYLEM_IDENTIFY                                          update 26Jul 02 
      
      
      NAME
            xylem_identify - creates a file of locus names corresponding to
             lines found by grep in a GenBank annotation file.
      
      SYNOPSIS
            xylem_identify grepfile indfile namefile findfile
      
      DESCRIPTION
      grepfile is created using the Unix grep command to search a .ano 
      file created by splitgb.  For example, to find all lines containing 
      the word 'chlorophyll' in plant.ano, use
      
           grep -n -i 'chlorophyll' plant.ano > plant.grep
      
      In the example shown, the -n option causes each line written to 
      plant.grep to be preceeded by the number of that line in plant.ano. 
      (The -i option causes grep to ignore case.) Identify can use the 
      indfile do determine which entry a given numbered line was found 
      in, and writes the corresponding LOCUS name to namefile.  In 
      addition, all lines found in a given entry are re-written to 
      findfile without the line numbers, and preceeded by the LOCUS name 
      for that entry. 
      
      EXAMPLES
      Suppose you wanted to obtain a list of names for all plant 
      sequences which code for proteins.  The task is complicated by the 
      fact that many fungal sequences are included in the GenBank plant 
      file.  You could begin by searching plant.ano (containing all 
      GenBank plant entries) for the word 'Planta':
      
      grep -n 'Planta' plant.ano > Planta.grep
      
      However, we want to eliminate all fungal sequences, as well as all 
      sequences for RNAs other than mRNAs.  If we create the file 
      bad.str containing the keywords
      
      Mycophyta
      tRNA
      rRNA
      uRNA
      
      we can then type 
      
      grep -n -f bad.str plant.ano > bad.grep
      
      bad.grep now contains all lines containing the offending keywords.  
      We next use xylem_identify to find the names of the entries found by 
      grep.
      
      xylem_identify Planta.grep plant.ind Planta.nam Planta.fnd
      xylem_identify bad.grep plant.ind bad.nam bad.fnd
      
      Next, we can use the Unix comm command to compare the two .nam 
      files and produce an output file containing only names which are 
      present in Planta.nam but not bad.nam:
      
      comm -23 Planta.nam bad.nam > plants.nam
      
      The file plants.nam now contains names of either plant cDNA or 
      genomic sequences which do not code for structural RNAs.
      At this point, getloc could to create a sub-database containing 
      only those entries listed in planta.nam.  See documentation for 
      getloc for a more detailed discussion. 
           
      SEE ALSO
            grep, fgrep, egrep, ngrep, comm, splitgb, getloc
      
     AUTHOR
       Dr. Brian Fristensky
       Dept. of Plant Science
       University of Manitoba
       Winnipeg, MB  Canada  R3T 2N2
       Phone: 204-474-6085
       FAX: 204-261-5732
       frist@cc.umanitoba.ca

     REFERENCE
       Fristensky, B. (1993) Feature expressions: creating and manipulating
       sequence datasets. Nucleic Acids Research 21:5997-6003.
