update February 7, 2010
NAME
uniqid.py - Replace names in a file with unique identifiers, or
restore the original names by substituting them for the unique
identifiers
SYNOPSIS
uniqid [options] -encode sourcein sourceout csvout
uniqid [options] -decode textin textout csvin
DESCRIPTION
The problem: Different programs handle identifiers, such as sequence
names, in many different ways. For example, the PHYLIP programs and
HMMER both truncate sequence names in output reports. If two names are
identical after truncation, there is no way to be sure which sequence
is being referred to in the report.
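The truncation problem can be illustrated with a short sketch. The
function name find_collisions and the width of 10 characters are
assumptions chosen for illustration; they are not part of uniqid.

```python
from collections import defaultdict


def find_collisions(names, width=10):
    """Group names that become identical when cut to `width` characters.

    `width=10` is an illustrative assumption, not a value taken from uniqid.
    """
    groups = defaultdict(list)
    for name in names:
        groups[name[:width]].append(name)
    # Keep only truncated names shared by more than one original name.
    return {short: full for short, full in groups.items() if len(full) > 1}


names = ["Arabidopsis_thaliana_1", "Arabidopsis_thaliana_2", "Oryza_sativa"]
collisions = find_collisions(names)
```

Here both Arabidopsis names share the same first ten characters, so a
report that prints only the truncated name cannot distinguish them.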
The solution: uniqid replaces each name in sourcein with a unique
identifier, and writes the substituted output to sourceout. The unique
IDs, along with the corresponding original names, are stored as
tab-separated key-value pairs in csvout. csvout is a csv file that
can be read by any spreadsheet program.
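The encoding step might look roughly like the following sketch for a
FASTA file. It is not the uniqid source code: the function name
encode_fasta, the default prefix "ID" and the six-digit numbering are
all assumptions made for illustration.

```python
import csv
import io


def encode_fasta(source, out, mapping, prefix="ID"):
    """Replace each FASTA definition line with a generated ID.

    Hypothetical sketch: the ID format (prefix plus a six-digit counter)
    is an assumption, not the format uniqid actually uses.
    """
    writer = csv.writer(mapping, delimiter="\t", lineterminator="\n")
    count = 0
    for line in source:
        if line.startswith(">"):
            count += 1
            uid = f"{prefix}{count:06d}"
            # Record the ID and the original definition line as a pair.
            writer.writerow([uid, line[1:].rstrip("\n")])
            out.write(f">{uid}\n")
        else:
            # Sequence lines are copied unchanged.
            out.write(line)
    return count


source = io.StringIO(">seq1 first sequence\nACGT\n>seq2 second\nGGCC\n")
out, mapping = io.StringIO(), io.StringIO()
encode_fasta(source, out, mapping)
```

In-memory streams are used here only to keep the sketch self-contained;
open file objects for sourcein, sourceout and csvout would work the
same way.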
The data from sourceout can be taken through any number of analytical
steps, retaining the identities of the data (e.g. sequences) through the
unique IDs. At the end of the data pipeline, the original names can be
restored by running uniqid with the -decode option. textin is the
text file containing unique IDs, csvin is the csvout file produced when
the names were first encoded, and textout is the file to which the
substituted output will be written.
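Decoding can be sketched as a lookup-and-substitute pass over the text.
The function names load_mapping and decode_text are hypothetical, and
the sketch assumes the mapping file holds one tab-separated ID/name
pair per line, as described above.

```python
import csv
import io
import re


def load_mapping(mapping):
    """Read tab-separated (ID, original name) pairs into a dictionary."""
    reader = csv.reader(mapping, delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def decode_text(text, names):
    """Replace every known ID in `text` with its original name."""
    if not names:
        return text
    # Try longer IDs first so an ID that is a prefix of another cannot
    # shadow it.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(uid) for uid in ordered))
    return pattern.sub(lambda match: names[match.group(0)], text)


names = load_mapping(io.StringIO("ID000001\tseq1 first\nID000002\tseq2 second\n"))
restored = decode_text("(ID000001:0.1,ID000002:0.2);", names)
```

Because the substitution works on plain text, the same approach applies
to any report or tree file in which the IDs survived intact.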
OPTIONS
-encode (default) - Options begin with a dash; filenames do not. The
first three filenames on the command line are read as sourcein, the
original file; sourceout, the file in which the definition line is
replaced with a unique ID; and csvout, a comma-separated value file
containing the unique identifier and the corresponding definition
line.
-decode - Options begin with a dash; filenames do not. The first three
filenames on the command line are read as textin, any text file
containing unique IDs generated by a previous run using
-encode; textout, the output file in which each unique ID is replaced by
the original name, or the name plus parts of the definition line; and
csvin, the csv file generated by a previous run using -encode.
-f list_of_fields - Similar to -f in the Unix cut command: a
comma-separated list of fields to be
written to textout when decoding files.
-s separator - separator is a character to use as the separator
when parsing a definition line into fields. Default = " ",
a blank space.
-nf string - string is one or more characters with which to begin the
unique identifier that replaces the definition line.
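How the field list and the separator work together when decoding can be
sketched as follows. The function name select_fields is hypothetical,
and the sketch assumes 1-based field numbers as in cut, without support
for ranges; uniqid's own parsing may differ.

```python
def select_fields(definition, fields, separator=" "):
    """Return the chosen fields of a definition line, joined by `separator`.

    `fields` is a comma-separated list of 1-based field numbers such as
    "1,3" (an assumption modelled on cut -f; ranges are not handled).
    """
    parts = definition.split(separator)
    wanted = [int(number) for number in fields.split(",")]
    # Field numbers beyond the end of the line are silently skipped.
    return separator.join(parts[i - 1] for i in wanted if 1 <= i <= len(parts))


name_only = select_fields("seq1 Homo sapiens kinase", "1")
name_and_species = select_fields("seq1 Homo sapiens kinase", "1,2,3")
```

With the default blank-space separator, field 1 is the sequence name
and later fields are the words of the description.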
EXAMPLES
FEATURES NOT YET IMPLEMENTED
At present, uniqid is mainly set up to
work with FASTA files. It should be possible to expand the types of
input files this program accepts in two ways. First, we should support
the numerous sequence file formats that are already supported in
readseq, probably using the same switch readseq uses to specify the
format of the input file. Even more generic would be the ability to
specify a regular expression to be substituted with a unique ID. In
principle, this program could then be used with any type of text file.
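The regular-expression idea could be prototyped along these lines. This
is only a sketch of a feature that does not yet exist in uniqid: the
function name encode_by_pattern, the "ID" prefix and the numbering
scheme are assumptions.

```python
import re


def encode_by_pattern(text, pattern, prefix="ID"):
    """Replace every match of `pattern` with a unique ID.

    Repeated occurrences of the same name receive the same ID.
    Returns the encoded text and a dictionary mapping ID -> original name.
    """
    assigned = {}  # original name -> ID

    def replace(match):
        name = match.group(0)
        if name not in assigned:
            assigned[name] = f"{prefix}{len(assigned) + 1:06d}"
        return assigned[name]

    encoded = re.sub(pattern, replace, text)
    return encoded, {uid: name for name, uid in assigned.items()}


encoded, table = encode_by_pattern("gene_A binds gene_B; gene_A is induced", r"gene_\w+")
```

The returned table plays the role of csvout: it holds what is needed to
restore the original names later.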
SEE ALSO
AUTHOR
Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
frist@cc.umanitoba.ca
http://home.cc.umanitoba.ca/~frist