generic_ncbi_data_fetcher.pl

This script uses NCBI's Entrez Programming Utilities to search NCBI
databases. It can return either the complete database records or the
IDs of the records (recommended). It is up to you to know how to
handle the IDs and records. The results are written to a single
output file.

For additional information on NCBI's Entrez Programming Utilities see:
http://eutils.ncbi.nlm.nih.gov/entrez/query/static/esearch_help.html
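
As a rough illustration of the kind of request the Entrez Programming
Utilities accept, the sketch below builds an ESearch URL in Python. It
is not taken from this script. The helper name build_esearch_url is
hypothetical, and the use of the [Organism] field tag to restrict a
search by species is an assumption about one reasonable approach; the
script itself may construct its queries differently.

```python
from urllib.parse import urlencode

# Public E-utilities endpoint for ESearch.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(query, database, max_records, species=None):
    """Return an ESearch URL for the given search (illustrative helper)."""
    term = query
    if species:
        # Assumption: restrict by organism with the [Organism] field tag.
        # The Perl script may build its species filter differently.
        term = f'({query}) AND "{species}"[Organism]'
    params = {"db": database, "term": term, "retmax": max_records}
    # urlencode escapes spaces, quotes, and brackets in the query text,
    # which is why the -q option can take unescaped text.
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_esearch_url("diabetes", "protein", 100, species="homo sapiens"))
```

The db, term, and retmax parameters correspond loosely to the -d, -q,
and -m options described below.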

There are five required command line options:

-q unescaped query text, i.e. the query as a user would enter it.

-o the results file to create.

-d the database to search. The following values are supported:
pubmed, protein, nucleotide, nuccore, nucgss, nucest, structure,
genome, books, cancerchromosomes, cdd, domains, gene, genomeprj,
gensat, geo, gds, homologene, journals, mesh, ncbisearch, nlmcatalog,
omia, omim, pmc, popset, probe, pcassay, pccompound, pcsubstance, snp,
swissprot, taxonomy, unigene, unists.

-r the return type. Use "id" to obtain record IDs, or use "complete"
to obtain the complete records in the default format. Alternatively,
supply any of the formats supported by NCBI. The accepted formats vary
depending on the database you are searching. To associate particular
formats with particular databases, edit the "Associating certain
formats with certain databases" portion of this script. The handling
of "swissprot" searches in that section serves as an example.

-m the maximum number of records or IDs to obtain. The recommended
value is 100.

There is one optional command line option:

-s the species to restrict the search to. Quote names that contain
spaces, for example "homo sapiens".
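
If the script is driven from another program, the options above can be
assembled as an argument list. The following Python sketch shows one
way to do that. The function name build_command and its parameter
names are hypothetical and not part of this script; only the option
letters come from the description above.

```python
def build_command(query, output_file, database, return_type, max_records,
                  species=None, script="generic_ncbi_data_fetcher.pl"):
    """Return the argument list for one run of the script."""
    command = [
        "perl", script,
        "-q", query,             # unescaped query text
        "-o", output_file,       # results file to create
        "-d", database,          # database to search
        "-r", return_type,       # "id", "complete", or an NCBI format
        "-m", str(max_records),  # maximum number of records or IDs
    ]
    if species is not None:
        # Optional. A multi-word species stays a single argument here
        # because no shell is involved in splitting the list.
        command += ["-s", species]
    return command


if __name__ == "__main__":
    print(build_command("diabetes", "results.txt", "swissprot", "id", 100,
                        species="homo sapiens"))
```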


Example usage:

The following obtains up to 100 NCBI IDs for sequences found in
swissprot. The search phrase is "diabetes" and the search is
restricted to Homo sapiens. The species name is quoted so that the
shell passes it to -s as a single argument.

perl generic_ncbi_data_fetcher.pl -q diabetes -o results.txt \
  -d swissprot -r id -m 100 -s "homo sapiens"

The following obtains up to 10 PubMed IDs for articles found in
PubMed. The search phrase is "dysphagia" and the search is restricted
to Homo sapiens.

perl generic_ncbi_data_fetcher.pl -q dysphagia -o results.txt \
  -d pubmed -r id -m 10 -s "homo sapiens"

The following obtains up to 50 protein sequences in fasta format from
NCBI's protein database. The search phrase is "telomere" and the
search is not restricted to any organism.

perl generic_ncbi_data_fetcher.pl -q telomere -o results.txt \
  -d protein -r fasta -m 50

Exit status:
If the script encounters an error it exits with a status of 1. If no
error is encountered the script exits with a status of 0 upon
completing.
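
A calling program can use this exit status to decide whether the
results file is safe to read. The Python sketch below is one possible
way to check it. The function name run_and_check is hypothetical, and
the sketch relies only on the behavior stated above: 0 on success and
a non-zero status on error.

```python
import subprocess
import sys


def run_and_check(command):
    """Run a command and return True only if it exits with status 0."""
    completed = subprocess.run(command)
    if completed.returncode != 0:
        # The script documents status 1 for errors; treat any non-zero
        # status as a failure to be safe.
        print(f"command failed with exit status {completed.returncode}",
              file=sys.stderr)
        return False
    return True
```

For example, calling run_and_check with the list ["perl",
"generic_ncbi_data_fetcher.pl", "-q", "telomere", "-o", "results.txt",
"-d", "protein", "-r", "fasta", "-m", "50"] would return False if the
script reported an error.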

Written by Paul Stothard, Canadian Bioinformatics Help Desk.

stothard@ualberta.ca
