getNCBImetadata

Retrieves NCBI metadata from nucleotide or biosample accession ids.

Requirements

  • Linux, macOS, or Windows with Windows Subsystem for Linux (WSL) installed
  • Bash shell (the default shell on many Linux distributions; recent macOS versions default to zsh, but Bash is still available)
  • Python 2.7 or Python 3
  • EDirect (NCBI Entrez Direct command-line utilities)
  • Internet access over HTTPS, which is required for retrieving data from NCBI

Installation

git clone https://github.com/AlexOrlek/getNCBImetadata.git
cd getNCBImetadata

The getmetadata.py executable script is located in the repository directory. If you add this directory to your $PATH variable, you can run the script as getmetadata.py [arguments...] from any location. The edirect directory must also be in your $PATH variable.
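As a minimal sketch of the $PATH setup described above: the two directory locations below are assumptions (a clone in your home directory, and EDirect in $HOME/edirect), so adjust them to match your system. esearch is one of the EDirect commands and is used here only to check that EDirect resolves.

```shell
# Hypothetical locations: change both to where you actually cloned the
# repository and installed EDirect.
GETNCBI_DIR="$HOME/getNCBImetadata"
EDIRECT_DIR="$HOME/edirect"

# Prepend both directories to PATH for the current shell session.
export PATH="$GETNCBI_DIR:$EDIRECT_DIR:$PATH"

# Report whether each command now resolves (prints its path, or a warning).
command -v getmetadata.py || echo "getmetadata.py not found on PATH"
command -v esearch || echo "esearch (EDirect) not found on PATH"
```

To make the change persist across sessions, the export line can be added to your shell startup file (for example ~/.bashrc).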

Quick start

The -t flag specifies whether the accessions in accessions.txt are nucleotide or biosample accessions.
The -e flag takes your own email address; this is passed to NCBI so that they can monitor usage.
accessions.txt is a text file in which the first column contains NCBI (nucleotide or biosample) accession ids.

Nucleotide metadata can be retrieved by running the following code:

getmetadata.py -a accessions.txt -t nucleotide -o outdir -e first.last@company.com

Either RefSeq or GenBank nucleotide accessions can be provided, in either "accession" or "accession.version" format.
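For illustration, a nucleotide accessions.txt could be created as follows. This is a sketch rather than a required layout: it assumes one accession per line, and the three accessions are examples only (two RefSeq records with a version suffix and one GenBank record without).

```shell
# Write an illustrative accessions file: one accession per line.
# NC_000913.3 and NC_002695.2 are RefSeq accessions in accession.version format;
# U00096 is a GenBank accession given without a version.
printf '%s\n' 'NC_000913.3' 'U00096' 'NC_002695.2' > accessions.txt

# Sanity checks before running getmetadata.py:
# count the entries and list the first column.
wc -l < accessions.txt
cut -f1 accessions.txt
```

The resulting file can then be passed to the -a flag as in the nucleotide command above.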

BioSample metadata can be retrieved by running the following code:

getmetadata.py -a accessions.txt -t biosample -o outdir -e first.last@company.com --biosampleattributes attributes.txt

The --biosampleattributes flag is optional. It specifies the path to a file containing harmonized attribute names in the first column. A full list of BioSample attribute harmonized names is provided here. The specified attributes are retrieved in addition to the fields retrieved by default (see Output for details).
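As a sketch of what such an attributes file might contain, the snippet below writes four BioSample harmonized attribute names, one per line. The choice of attributes is an example only; pick whichever harmonized names are relevant to your samples.

```shell
# Write an illustrative attributes file: one harmonized attribute name per line.
printf '%s\n' 'collection_date' 'geo_loc_name' 'host' 'isolation_source' > attributes.txt

# Print the file so the names can be checked against the harmonized-name list.
cat attributes.txt
```

The file is then supplied via --biosampleattributes attributes.txt, as in the BioSample command above.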

Output

Nucleotide metadata

When nucleotide accessions are provided, the following fields are extracted:

  • AccessionVersion
  • Dates of first submission and last update: Create Date, Update Date
  • Molecular characteristics: Molecule Type (e.g. dna), Length, Completeness, Source Genome Type (e.g. plasmid)
  • Taxonomy data: Source Taxon, Source Taxonomic ID
  • Genome assembly data: Assembly Method, Genome Coverage, Sequencing Technology
  • Genome annotation data: Annotation Pipeline, Annotation Method
  • DBLink data: Bioproject Accession, Biosample Accession, Sequence Read Archive Accession, Assembly Accession
  • PubMedID

Biosample metadata

When biosample accessions are provided, the following fields are extracted:

  • Identifiers: Accession, Accession ID, Sample name
  • Submission data: Model, Package
  • Dates: last_update, publication_date, submission_date
  • Title
  • Comment
  • Taxonomic data: taxonomy_id, taxonomy_name, OrganismName
  • Affiliation data: Owner/Name, email, Contact/Name/First, Contact/Name/Last
  • Attribute data will be retrieved if a file containing harmonized attribute names is provided to the --biosampleattributes flag.

License

MIT License
