Molecular Biology Data Sources: Sequence Data
On This Page
Nucleotide Sequences
- GenBank
- Database of Short Genetic Variation (dbSNP)
- Database of Genomic Structural Variation (dbVar)
- Sequence Read Archive (SRA)
Amino Acid Sequences
- Protein
GenBank
What is it?
- Annotated collection of all publicly available DNA sequences
- Part of International Nucleotide Sequence Database Collaboration
- New release every 2 months
Access
- Public access
- NCBI places no restrictions on use or distribution of data
- Submitters may place patent, copyright or other intellectual property rights restrictions on all or a portion of the data they submit
How to Retrieve Data
- Search for sequence identifiers and annotations with Nucleotide
- Search and align sequences to query sequence with BLAST
- Search, link and download sequences using NCBI E-utilities
- Download sequence records in flat file format via FTP. Full contents are described in the README.genbank file on GenBank's FTP server
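As an illustration of the E-utilities option listed above, the following is a minimal sketch that only builds request URLs for ESearch (find record IDs) and EFetch (download records). The helper names `esearch_url` and `efetch_url` are hypothetical, and the defaults (the `nuccore` database, GenBank flat-file output as text) are assumptions chosen for this example rather than settings taken from this guide.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities web service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nuccore", retmax=20):
    """Build an ESearch URL that asks for record IDs matching `term`."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(ids, db="nuccore", rettype="gb"):
    """Build an EFetch URL that asks for the given records as plain text.

    rettype="gb" requests GenBank flat-file format; "fasta" requests FASTA.
    """
    params = {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical search term and accession, used purely for illustration.
    print(esearch_url("insulin", retmax=5))
    print(efetch_url(["U49845"]))
```

Either URL can then be opened in a browser or passed to a standard HTTP client; the sketch stops short of sending the request so that it runs without network access.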
Submissions
- Accepts mRNA or genomic sequence data determined directly by submitter
- Several options for submitting data to GenBank
- GenBank will, upon request, withhold release of new submissions for a specified period of time
Database of Short Genetic Variation (dbSNP)
What is it?
- Archive and repository for short genetic variations, including:
- Single nucleotide polymorphisms
- Small-scale multi-base insertions or deletions
- Short tandem repeats
- Integrates genetic variation and clinical data in collaboration with locus-specific databases and diagnostic laboratories
- Two major classes of content:
- Submitted: original observations of sequence variation
- Computed: generated during SNP build cycle
- Each entry includes:
- Sequence surrounding polymorphism
- Occurrence frequency of polymorphism
- Experimental methods used to assay the variation
Access
- Public access
How to Retrieve Data
- Searching
- Where to start
- Multiple search options are available:
- Visit dbSNP Access section of the new NCBI Handbook for more information
- Use the SNP Advanced Search Builder page to construct a complex search using combinations of different search fields and qualifiers (Advanced Search Builder video tutorial)
- Data available for download via FTP from the dbSNP homepage
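A complex search of the kind the Advanced Search Builder produces is a string of field-qualified terms joined by Boolean operators. The sketch below assembles such a string by hand. The helper names `field_clause` and `combine` are hypothetical, and the field names in the example (`Gene Name`, `Organism`) are assumptions for illustration; the fields actually available for dbSNP should be checked in the Advanced Search Builder itself.

```python
def field_clause(value, field):
    """Format one term with an Entrez-style field qualifier, e.g. BRCA1[Gene Name].

    Values containing spaces are wrapped in double quotes so they are
    treated as a single phrase.
    """
    if " " in value:
        value = f'"{value}"'
    return f"{value}[{field}]"


def combine(clauses, operator="AND"):
    """Join clauses with one Boolean operator and wrap the result in parentheses."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return "(" + f" {operator} ".join(clauses) + ")"


if __name__ == "__main__":
    # Field names below are illustrative assumptions, not a verified dbSNP field list.
    query = combine([
        field_clause("BRCA1", "Gene Name"),
        field_clause("homo sapiens", "Organism"),
    ])
    print(query)
```

The resulting string can be pasted into the search box or used as the search term of a programmatic query.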
Submissions
- Accepts submissions from all organisms, including prokaryotes
- Loose definition of SNPs: no requirement about minimum allele frequencies
- Large-scale insertions, deletions, etc. should be submitted to dbVar
- Need handle assignment from NCBI prior to submission
- For more information, visit Submissions to dbSNP
Database of Genomic Structural Variation (dbVar)
What is it?
- Database for large-scale genomic variants: insertions, deletions, translocations, inversions
- Accepts data from all species, including clinical data
Access
- Public access
How to Retrieve Data
- Enter search terms into dbVar search box on homepage
- Search will return studies and variants
- Refine search using Limits or Advanced Search
- For more information, visit dbVar Entrez Search Help
- Data available via FTP on a per study and per assembly basis
- For more information, visit dbVar FTP Site Help
Submissions
- Variants longer than 50 base pairs should be submitted to dbVar
- Accepts submissions from all organisms
- Accepts human clinical studies
- Complete information on how to submit data available on the dbVar homepage under "Submitting Data"
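The size guidance for choosing between dbSNP and dbVar can be written as a simple rule. This is a minimal sketch, not an NCBI tool: the function name `suggest_archive` is hypothetical, and treating a variant of exactly 50 base pairs as a dbSNP submission is an assumption read from the "greater than 50 base pairs" wording above.

```python
# Threshold taken from the dbVar submission guidance above.
DBVAR_MIN_LENGTH_BP = 50


def suggest_archive(variant_length_bp):
    """Return the archive suggested for a variant of the given length in base pairs."""
    if variant_length_bp < 1:
        raise ValueError("variant length must be at least 1 base pair")
    # Variants longer than the threshold go to dbVar; shorter ones go to dbSNP.
    if variant_length_bp > DBVAR_MIN_LENGTH_BP:
        return "dbVar"
    return "dbSNP"


if __name__ == "__main__":
    for length in (1, 50, 51, 12000):
        print(length, suggest_archive(length))
```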
NCBI General Resources
Search Tools
- Entrez: NCBI's integrated database retrieval system that allows searching across 35 databases or within a single database
- Basic Local Alignment Search Tool (BLAST): Locates regions of similarity between biological sequences
Submission Information
- NCBI Submission Portal: Provides information on how to submit data to NCBI databases, organized according to the type of data being submitted: sequence, project, biological materials, microarray, manuscript or clinical
- Submit Sequence Data to NCBI: Information on how to submit specific types of sequence data
Sequence Read Archive (SRA)
What is it?
- Repository for raw sequencing data from next-generation sequencing technologies, including:
- Roche 454 GS System®
- Ion Torrent®
- Illumina Genome Analyzer®
- SOLiD®
- Helicos HeliScope®
- Complete Genomics®
- Goals for SRA
- Provide central repository for next-generation sequencing data
- Provide links to other resources using this data
- Provide retrieval based on ancillary information and sequence comparison
- Track studies and experiments
- Separate submission from content
Access
- Public access
How to Retrieve Data
- Searching
- Where to start
- SRA Homepage
- Entrez
- Multiple search options available; for more information, visit Searching for SRA Data
- Multiple options for downloading data
- Via Aspera or FTP from SRA homepage
- Run Browser
- Download data from one or more runs in FASTA and FASTQ formats
- Under "Browse" tab on SRA homepage
- Individual-level data require controlled access through dbGaP
- For more information, visit SRA Download Guide
- Using data
- SRA Software Development Kit (SDK) provides application programming interfaces (APIs) for accessing and manipulating larger quantities of data
- On SRA Homepage, under "Software"
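Once run data have been downloaded in FASTQ format, each read occupies four lines: a header starting with `@`, the sequence, a separator starting with `+`, and a quality string of the same length as the sequence. The following is a minimal sketch of a reader for that layout. It assumes unwrapped four-line records and is not part of the SRA software; the names `FastqRecord` and `read_fastq` and the sample reads are hypothetical.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a text handle containing four-line FASTQ entries."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\r\n")
        if not header:
            continue  # skip blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Two made-up reads, used only to exercise the reader.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+read2\n!!!!\n")
records = list(read_fastq(example))
```

For a real download, the same function can be given a file opened in text mode instead of the in-memory example.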
Submissions
- Accepts primary sequencing data from next-generation sequencing platforms
- Submission tools available under "Submit" tab on SRA homepage
- For more information, visit Submitting to the SRA
Protein
What is it?
- Collection of protein sequences
- Source of sequences
- GenPept sequences: translations of annotated coding regions in GenBank
- RefSeq database: curated database of genomic DNA, transcripts (RNA) and protein products
- Third Party Annotation (TPA) database: database of sequences derived from primary sequences
- UniProtKB/Swiss-Prot
- Protein Research Foundation
- Protein Data Bank
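To make concrete what "translations of annotated coding regions" means for GenPept sequences, the sketch below translates a coding DNA sequence into a protein sequence. It assumes the standard genetic code and a sequence that starts in frame at its first base; real GenBank records can specify a different translation table or reading-frame offset, which this example ignores. The function name `translate` and the input sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon.

    RNA input is accepted (U is read as T). Codons containing ambiguous
    bases are reported as 'X'; a trailing partial codon is ignored.
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for start in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[start:start + 3], "X")
        if amino_acid == "*":
            break  # stop codon ends the protein
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # Made-up coding sequence: ATG GCC AAA TAA
    print(translate("ATGGCCAAATAA"))
```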
Access
- Public access
How to Retrieve Data
- Searching
- Enter keywords into search box on Protein homepage
- Can limit search to records within a particular component database, gene location, publishing or modification date
- Search and align sequences using BLAST
- Select 'Protein BLAST' under Basic BLAST heading
- Additional tools available on Protein homepage
- Multiple options available for downloading data
- Download directly to a file on your computer by selecting 'Send To' in the upper right-hand corner of the results page
- Via FTP from:
- GenBank
- RefSeq
- BLAST
- Links to these resources found on Protein homepage
Submissions
- Protein sequences alone are not accepted; they must be accompanied by a nucleotide sequence (DNA or RNA), which can be submitted through GenBank
- For more information, please view the information in this guide on GenBank