Updated February 7, 2010
NAME
uniqid.py - Replace names in a file with unique identifiers, or restore the original names in place of the unique identifiers
SYNOPSIS
uniqid [options] -encode sourcein sourceout csvout
uniqid [options] -decode textin textout csvin

DESCRIPTION
The problem: Different programs handle identifiers, such as sequence names, in many different ways. For example, the PHYLIP programs and HMMER both truncate sequence names in output reports. If two names are identical after truncation, there is no way to be sure which sequence a report refers to.

The solution: uniqid replaces each name in sourcein with a unique identifier, and writes the substituted output to sourceout. The unique IDs, along with the corresponding original names, are stored as tab-separated key-value pairs in csvout. csvout is a csv file that can be read by any spreadsheet program.
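
The encoding step described above can be pictured with the following minimal Python sketch. It is not the code used by uniqid: the function names encode_fasta and mapping_to_text, the "ID" prefix and the five-digit numbering are all assumptions made for illustration, and only FASTA-style input is handled.

```python
import csv
import io


def encode_fasta(lines, prefix="ID"):
    """Replace each FASTA definition line with a generated unique ID.

    Returns (encoded_lines, mapping), where mapping is a list of
    (unique_id, original_definition_line) pairs in input order.
    The ID format (prefix + zero-padded counter) is hypothetical.
    """
    encoded, mapping = [], []
    for line in lines:
        if line.startswith(">"):
            # Every definition line gets its own ID, even if the text repeats.
            uid = f"{prefix}{len(mapping) + 1:05d}"
            mapping.append((uid, line[1:].rstrip("\n")))
            encoded.append(f">{uid}\n")
        else:
            # Sequence data is passed through unchanged.
            encoded.append(line)
    return encoded, mapping


def mapping_to_text(mapping, delimiter="\t"):
    """Serialize the (unique_id, original) pairs, one pair per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerows(mapping)
    return buf.getvalue()
```

Because the identifiers are generated from a counter, two sequences whose names would collide after truncation still receive distinct IDs.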

The data from sourceout can be taken through any number of analytical steps, retaining the identities of the data (e.g. sequences) through the unique IDs. At the end of the data pipeline, the original names can be restored by running uniqid with the -decode option. textin is the text file containing unique IDs, csvin is the csvout file produced when the names were first encoded, and textout is the file to which the substituted output will be written.
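
The decoding step can be sketched in the same spirit. Again, this is an illustration under stated assumptions rather than the actual uniqid code: decode_text is a hypothetical name, and the mapping is assumed to have already been read from csvin into a dictionary of unique ID to original name.

```python
import re


def decode_text(text, mapping):
    """Replace every known unique ID found in text with its original name.

    mapping is a dict {unique_id: original_name}; how it is loaded
    from csvin is left out of this sketch.
    """
    if not mapping:
        return text
    # Try longer IDs first so a short ID never matches inside a longer one.
    ids = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(uid) for uid in ids))
    return pattern.sub(lambda m: mapping[m.group(0)], text)
```

Since the substitution is purely textual, the same function would work on any report, tree file or alignment in which the IDs survive intact.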

OPTIONS

-encode (default)   Options begin with a dash; filenames do not. The first three filenames on the command line are read as sourcein, the original file; sourceout, the file in which the definition line is replaced with a unique ID; and csvout, a comma-separated value file containing the unique identifier and the corresponding definition line.

-decode   Options begin with a dash; filenames do not. The first three filenames on the command line are read as textin, any text file containing unique IDs generated by a previous run using -encode; textout, the output file in which each unique ID is replaced by the original name, or the name plus parts of the definition line; and csvin, the csv file generated by a previous run using -encode.

-f list_of_fields   similar to -f in the Unix cut command. A comma-separated list of fields to be written to textout when decoding files.

-s separator - separator is a character to use as the separator when parsing a definition line into fields. default = " ", a blank space

-nf string - string is one or more characters to begin the unique identifier, with which the definition line is replaced.
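
How -f and -s might interact when decoding can be illustrated with a short Python sketch. The function select_fields is hypothetical and is not taken from uniqid; it assumes 1-based field numbers as in cut, keeps the fields in the order they are listed, and silently skips field numbers past the end of the line.

```python
def select_fields(definition, fields, separator=" "):
    """Return the requested fields of a definition line, cut-style.

    fields is a comma-separated string of 1-based field numbers,
    e.g. "1,3"; separator plays the role of the -s option.
    """
    parts = definition.split(separator)
    wanted = [int(f) for f in fields.split(",")]
    # Assumption: out-of-range field numbers are ignored, not an error.
    return separator.join(parts[i - 1] for i in wanted if 1 <= i <= len(parts))


# With the default blank separator, fields 1 and 3 of a definition line:
print(select_fields("gi|123 Arabidopsis thaliana chalcone synthase", "1,3"))
# prints: gi|123 thaliana
```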

EXAMPLES


FEATURES NOT YET IMPLEMENTED

At present, uniqid is mainly set up to work with FASTA files. It should be possible to expand the types of input files this program accepts in two ways. First, we should support the numerous sequence file formats that are already supported in readseq, probably using the same switch readseq uses to specify the format of the input file. Even more generic would be the ability to specify a regular expression to be substituted with a unique ID. In principle, this program could then be used with any type of text file.
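
The regular-expression idea could look roughly like the Python sketch below. This feature does not exist in uniqid; encode_by_pattern, its parameters and the ID format are all hypothetical. Unlike FASTA encoding, where each definition line would get its own ID, this sketch gives identical matches the same ID so that repeated names stay consistent throughout the text.

```python
import re


def encode_by_pattern(text, pattern, prefix="ID"):
    """Replace every match of pattern in text with a unique ID.

    Returns (encoded_text, mapping) where mapping is a dict
    {unique_id: original_match}. Identical matches share one ID.
    """
    seen = {}  # original match -> unique ID

    def substitute(match):
        name = match.group(0)
        if name not in seen:
            seen[name] = f"{prefix}{len(seen) + 1:05d}"
        return seen[name]

    encoded = re.sub(pattern, substitute, text)
    return encoded, {uid: name for name, uid in seen.items()}
```

For example, a pattern such as AT\dG\d{5} would pick out Arabidopsis locus names in free text, which could later be restored from the returned mapping.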


SEE ALSO

 
AUTHOR
Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB  Canada R3T 2N2
frist@cc.umanitoba.ca
http://home.cc.umanitoba.ca/~frist