The GigaDB website allows any user to browse, search and view datasets and to access data files. If you want to submit a dataset, save searches or be alerted to new content of interest, we request that you create an account.
A 'Latest news' section will be visible to announce any updates or new features to the database and the RSS feed automatically announces each new dataset release.
The GigaDB homepage allows you to browse datasets by type eg Genomic, Metagenomic, Transcriptomic. Clicking on the DOI (digital object identifier) or image will take you directly to the webpage for the dataset of interest.
Alternatively you can use the search functions to find datasets, samples or files of interest.
To search across all Dataset, Sample and File records in GigaDB, simply enter a search term in the search bar found at the top of all GigaDB pages.
The search is case-insensitive, which means that uppercase and lowercase keywords return the same results.
The search results are grouped by GigaDB Datasets (G), Samples (S) and Files (F).
For each dataset result, the author names and DOI are displayed. Hovering over the dataset name shows the dataset description. Dataset and sample names link to the DOI page for those data, and file links are provided for download.
For each sample result, the sample name, species name and species ID are displayed with links to the NCBI taxonomy page for the species and to the GigaDB dataset page.
For each file result, the file name, file type and file size are displayed with a direct link to the FTP server location of that file.
Only objects that match the search term directly are displayed in the search results, i.e. the only Files displayed will be those that match the search term; all other files within the same dataset will NOT be displayed.
For example, searching for the term "Potato" will return the dataset with the title "Genomic data from the potato", which contains 17 files. However, the search results table will display only 3 of those 17 files, because only 3 contain the search term "potato". To find all data associated with a dataset you must follow the link to the dataset page.
On the left of the search results you have the option to further refine the results by using the filters. By default all filters are disabled, allowing you to see all search results for your keyword. If you want to hide some results based on some criteria, choose the filter for your criteria, and select the options that match what you want to see.
Filter options for Datasets:
- Dataset Type (Dataset Type controlled vocabulary eg 'Genomic', 'Proteomic')
- Project (eg 'Genome 10K', '1000 Genomes')
- External Link Types (Controlled vocabulary: 'Genome Browser' or 'Additional Data')
- Publication Date (From and To. Format: dd-mm-yyyy)
Filter options for Samples:
- Common Name (Internally controlled eg 'Human', 'Mouse')
Filter options for Files:
All sequence, assembly, variation, and microarray data must be deposited in a public database at NCBI, EBI, or DDBJ before you submit them to GigaDB. In the cases where you would like GigaDB to host files associated with genomic data not fully consented for public release, you must first submit the non-public data to dbGaP or EGA.
The template file contains:
- 3 tabs which must all be completed [Study, Samples, Files]
- 4 informational tabs [Samples (info), Files (info), CV, Links]
Mandatory fields are highlighted in yellow.
Required information includes submitter name, email and affiliation, upload status [can we publish this dataset immediately after review (Publish) or should it be held until publication (HUP)], author list, dataset type(s) (selected from a controlled vocabulary list), dataset title and description, estimated total size of the files that will be submitted and dataset image information.
Optional information includes links to additional resources and related manuscripts, accessions for data in other databases (prefixes are found in the Links tab), and relationship (if any) to a previously published GigaDB dataset (selected from a controlled vocabulary list).
Optional information includes sample attributes (these are automatically populated in GigaDB if an NCBI BioSample ID is provided).
Required information includes a file name or path relative to your home directory and file type (selected from a controlled vocabulary list). A readme file must be provided.
- Filenames should be unique.
- Filenames should not include spaces. We recommend using the underscore (_) in place of spaces in the filenames.
- Filenames should only include the following characters: a-z, A-Z, 0-9, _, -, +, .
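The three filename rules can be checked before filling in the Files tab. The following is a minimal Python sketch, not a GigaDB tool: the function name check_filenames and the wording of the problem messages are hypothetical, and it assumes you pass bare file names rather than paths containing directory separators.

```python
import re

# Characters permitted by the rules above: a-z, A-Z, 0-9, _, -, +, .
ALLOWED = re.compile(r"[A-Za-z0-9_\-+.]+")


def check_filenames(names):
    """Return a list of (name, problem) pairs; an empty list means every name passes."""
    problems = []
    seen = set()
    for name in names:
        # Rule 1: filenames should be unique.
        if name in seen:
            problems.append((name, "duplicate filename"))
        seen.add(name)
        # Rule 2: filenames should not include spaces.
        if " " in name:
            problems.append((name, "contains a space"))
        # Rule 3: only the listed characters are allowed.
        elif not ALLOWED.fullmatch(name):
            problems.append((name, "contains a character that is not allowed"))
    return problems


if __name__ == "__main__":
    for name, problem in check_filenames(["reads_1.fq", "my file.txt", "reads_1.fq"]):
        print(f"{name}: {problem}")
```

Replacing each space with an underscore, as recommended above, is usually enough to clear the second kind of problem.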
Optional information includes a file description and a sample ID or name.
You can expect a response from the GigaDB team within 5 days to verify the information in your submission and to arrange upload of your files to our FTP site.
If you have any questions, please contact us at email@example.com.
Genomic - includes all genetic and genomic data eg sequence, assemblies, alignments, genotypes, variation and annotation. Minimal requirements: DNA sequence data eg next-gen raw reads (fastq files) OR assembled DNA sequences (fasta files)
Epigenomic - includes methylation and histone modification data. Minimal requirements: Details on methylation sites/status eg qmap files OR details on histone modification sites/status.
Metagenomic - includes all genetic and genomic data eg sequence, assemblies, alignments, genotypes, variation and annotation from environmental samples. Minimal requirements: Environmental DNA sequence data eg next-gen raw reads (fastq files) OR assembled DNA sequences (fasta files).
Proteomic - includes all mass spec data. Minimal requirements: Peptide/protein data eg mass spec.
Transcriptomic - includes all data relating to mRNA. Minimal requirements: RNA sequence data eg next-gen raw reads (fastq files) OR transcript statistics eg RNA coverage/depth.
Additional dataset types can be added, upon review, as new submissions are received.
File types and examples of associated file extensions:
Alignments: .bam, .chain, .maf, .net, .sam
Allele frequencies: .frq
Annotation: .gff, .ipr, .kegg, .wego
Coding sequence: .cds, .fa
InDels: .gff, .txt, .vcf
ISA-Tab: see ISA tools
Genome assembly: .agp, .contig, .depth, .fa, .length, .scafseq
Genome sequence: .fastq, .fq
Methylome data: .fa, .qmap, .rpm, .txt
Protein sequence: .fa, .pep
Readme: .pdf, .txt
SNPs: .annotation, .gff, .txt, .vcf
SVs: .gff, .txt, .vcf
Transcriptome data: .depth, .rpkm, .wig
Other: .xls, .pdf, .txt
Additional file types can be added, upon review, as new submissions are received.
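Several extensions in the list above appear under more than one file type, so the extension alone does not determine the file type to select. The sketch below, in Python, copies the list into a dictionary and looks up every file type whose example extensions include that of a given file. The function name candidate_file_types is hypothetical; ISA-Tab is omitted because no extension is listed for it, and compressed names such as .fq.gz are not handled.

```python
import os

# File types and example extensions, copied from the list above.
FILE_TYPE_EXTENSIONS = {
    "Alignments": [".bam", ".chain", ".maf", ".net", ".sam"],
    "Allele frequencies": [".frq"],
    "Annotation": [".gff", ".ipr", ".kegg", ".wego"],
    "Coding sequence": [".cds", ".fa"],
    "InDels": [".gff", ".txt", ".vcf"],
    "Genome assembly": [".agp", ".contig", ".depth", ".fa", ".length", ".scafseq"],
    "Genome sequence": [".fastq", ".fq"],
    "Methylome data": [".fa", ".qmap", ".rpm", ".txt"],
    "Protein sequence": [".fa", ".pep"],
    "Readme": [".pdf", ".txt"],
    "SNPs": [".annotation", ".gff", ".txt", ".vcf"],
    "SVs": [".gff", ".txt", ".vcf"],
    "Transcriptome data": [".depth", ".rpkm", ".wig"],
    "Other": [".xls", ".pdf", ".txt"],
}


def candidate_file_types(filename):
    """Return, in alphabetical order, every file type that lists this file's extension."""
    extension = os.path.splitext(filename)[1].lower()
    return sorted(
        file_type
        for file_type, extensions in FILE_TYPE_EXTENSIONS.items()
        if extension in extensions
    )


if __name__ == "__main__":
    print(candidate_file_types("genome.fa"))
    print(candidate_file_types("calls.vcf"))
```

A file ending in .fa, for instance, matches four file types, which is why the file type has to be chosen explicitly from the controlled vocabulary.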
AGP (.agp) - the Accessioned Golden Path (AGP) file describes the assembly of a larger sequence object from smaller objects:
chr1 1 1972671 0 W scaffold43 1 1972671 m
chr1 1972672 3061819 1 W scaffold8 1 1089148 p
chr1 3061820 3181505 2 W scaffold548 1 119686 m
chr1 3181506 4176151 3 W scaffold548 1 994646 m
The large object can be a contig, a scaffold (supercontig), or a chromosome. See AGP Specification v2.0
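As an illustration of how the nine columns of a component line fit together, here is a minimal Python sketch. The field names follow the AGP specification (object, object start and end, part number, component type, component ID, component start and end, orientation); the class and function names are hypothetical. It assumes whitespace-separated component lines like the example rows above and does not handle gap lines, whose later columns have a different meaning.

```python
from typing import NamedTuple


class AgpComponent(NamedTuple):
    obj: str             # the larger object being assembled, e.g. chr1
    obj_beg: int         # start of this part on the object (1-based)
    obj_end: int         # end of this part on the object (inclusive)
    part_number: int     # running number of the part within the object
    component_type: str  # e.g. W
    component_id: str    # the smaller object placed here, e.g. scaffold8
    comp_beg: int        # start on the component (1-based)
    comp_end: int        # end on the component (inclusive)
    orientation: str     # kept exactly as written in the file


def parse_agp_line(line):
    """Split one component line into typed fields."""
    f = line.split()
    return AgpComponent(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                        f[5], int(f[6]), int(f[7]), f[8])


def spans_match(record):
    """True when the span on the object is as long as the span on the component."""
    object_span = record.obj_end - record.obj_beg + 1
    component_span = record.comp_end - record.comp_beg + 1
    return object_span == component_span


if __name__ == "__main__":
    record = parse_agp_line("chr1 1972672 3061819 1 W scaffold8 1 1089148 p")
    print(record.component_id, spans_match(record))
```

Checking that both spans have the same length is a quick way to catch coordinate mistakes before submission.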
BAM (.bam) - the Binary Alignment/Map (BAM) format is the compressed binary version of the Sequence Alignment/Map (SAM) format, a compact and index-able representation of nucleotide sequence alignments.
BIGWIG (.bw) - the BIGWIG format is for storing dense, continuous data (such as GC percent, probability scores, and transcriptome data) that will be displayed in the UCSC Genome Browser as a graph. BIGWIG files are created initially from wiggle (WIG) type files, using the program wigToBigWig.
CHAIN (.chain) - the CHAIN format describes a pairwise alignment that allows gaps in both sequences simultaneously and is used by the UCSC Genome Browser.
CONTIG (.contig) - the CONTIG format is a direct output from the SOAPdenovo assembly program:
>1 length 32 cvg_0.0_tip_0
GAGAACGGCGAAGCCTGCTCGGGCCCGTTATA
>3 length 32 cvg_23.0_tip_0
TAGCAGCGATTTGATCAAACTCAATCTTACCG
>5 length 32 cvg_40.0_tip_0
GGTAAGATTGAGTTTGATCAAATCGCTGCTAT
EXCEL (.xls, .xlsx) - Microsoft Office spreadsheet files
FASTA (.fasta, .fa, .seq, .cds, .pep, .scafseq [SOAPdenovo output file - sequence of each scaffold]) - FASTA is a text-based format for representing either nucleotide sequences or peptide sequences.
FASTQ (.fq, .fastq) - the FASTQ format stores sequences (usually nucleotide sequence) and Phred qualities in a single file.
GFF (.gff) - The General Feature Format (GFF) is used for describing genes and other features of DNA, RNA and protein sequences.
CR_ENSP00000334840 CR_ENSMMUP00000018123 IPR000504 IPR003954 CR_ENSP00000333725 IPR001781 IPR015880 IPR007087 IPR001909
CR_ENSMMUP00000031408 ko03010 CR_ENSP00000364815 ko00970 ko00290 CR_ENSP00000414605 ko05146 ko04510 ko04512
MAF (.maf) - the Multiple Alignment Format (MAF) stores a series of multiple alignments at the DNA level between entire genomes.
NET (.net) - the NET file format is used to describe the axtNet data that underlie the net alignment annotations in the UCSC Genome Browser.
PDF (.pdf) - portable document format
PNG (.png) - portable network graphics
QUAL (.qual) - the QUAL file format stores base quality scores for next-generation sequencing data (similar in layout to FASTA).
RPKM (.rpkm) - Gene expression levels are calculated by Reads Per Kilobase per Million (RPKM) mapped reads eg 1kb transcript with 1000 alignments in a sample of 10 million reads (out of which 8 million reads can be mapped) will have RPKM = 1000/(1 * 8) = 125:
ENSP00000379387 15.5651433366423 6002951 289 3093
ENSP00000349977 24.7483107230444 6002951 398 2679
ENSP00000368887 24.6477413647837 6002951 174 1176
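The RPKM arithmetic can be written as a short function. This is a minimal Python sketch; the function and parameter names are illustrative. The column meanings of the example rows are not stated in the text: the second call below assumes they are gene ID, RPKM, total mapped reads, reads mapped to the gene and gene length in bases, an interpretation under which the first row's RPKM value is reproduced.

```python
def rpkm(read_count, transcript_length_bp, mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    kilobases = transcript_length_bp / 1_000
    millions_mapped = mapped_reads / 1_000_000
    return read_count / (kilobases * millions_mapped)


if __name__ == "__main__":
    # Worked example from the text: a 1 kb transcript with 1000 alignments
    # in a sample where 8 million reads could be mapped.
    print(rpkm(1000, 1000, 8_000_000))
    # First example row, under the assumed column meanings.
    print(rpkm(289, 3093, 6_002_951))
```

Note that the denominator uses the number of mapped reads (8 million in the worked example), not the total number of reads sequenced (10 million).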
SAM (.sam) - the Sequence Alignment/Map (SAM) format is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. Most often it is generated as a human readable version of its sister BAM format, which stores the same data in a compressed, indexed, binary form. Currently, most SAM format data is output from aligners that read FASTQ files and assign the sequences to a position with respect to a known reference genome. In the future, SAM will also be used to archive unaligned sequence data generated directly from sequencing machines. See The Sequence Alignment/Map format and SAMtools
TAR (.tar) - an archive containing other files
TEXT (.doc, .readme, .text, .txt) - a text file
VCF (.vcf) - the Variant Call Format (VCF) is a text file format for representing eg SNPs, InDels, CNVs, SVs, microsatellites, genotypes.
Bmb015379_2_IPR001092 Bmb003749_1_IPR006329 GO:0009168 GO:0003876 Bmb006173_1_IPR000909 GO:0007165 GO:0004629 GO:0007242
UNKNOWN - any file format not in this list
XML (.xml) - eXtensible Markup Language
Publish: this dataset is fully consented for immediate release upon GigaDB approval
HUP: this dataset should be Held Until Publication (HUP)
The DOI relationship vocabulary is taken from the DataCite 'relationType' schema property (ID=12.2).
Definition: Description of the relationship of the resource being registered (A) and the related resource (B).
IsSupplementTo: indicates that A is a supplement to B
IsSupplementedBy: indicates that B is a supplement to A
IsNewVersionOf: indicates A is a new edition of B, where the new edition has been modified or updated
IsPreviousVersionOf: indicates A is a previous edition of B
IsPartOf: indicates A is a portion of B; may be used for elements of a series
HasPart: indicates A includes the part B
References: indicates B is used as a source of information for A
IsReferencedBy: indicates A is used as a source of information by B
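The eight terms above form four pairs in which each term is the inverse of the other: a statement about A and B under one term is the same statement about B and A under its partner. A minimal Python sketch of that pairing follows; the dictionary and function names are hypothetical, and "A" and "B" stand in for real identifiers.

```python
# Each relation term mapped to its inverse, following the definitions above.
INVERSE_RELATION = {
    "IsSupplementTo": "IsSupplementedBy",
    "IsSupplementedBy": "IsSupplementTo",
    "IsNewVersionOf": "IsPreviousVersionOf",
    "IsPreviousVersionOf": "IsNewVersionOf",
    "IsPartOf": "HasPart",
    "HasPart": "IsPartOf",
    "References": "IsReferencedBy",
    "IsReferencedBy": "References",
}


def invert(a, relation, b):
    """Restate 'A relation B' from the point of view of B."""
    return (b, INVERSE_RELATION[relation], a)


if __name__ == "__main__":
    print(invert("A", "IsPartOf", "B"))
```

When relating a new submission to a previously published GigaDB dataset, pick the term that reads correctly with your new dataset as A.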
Missing Value reporting
For attributes (sample, dataset or files) that have some or all values missing please use the following controlled value terms to describe the exact reason for the missing value.
not applicable: information is inappropriate to report; often this attribute can be removed entirely.
restricted access: information exists but cannot be released openly because of privacy concerns.
not provided: information is not available at the time of submission; a value may be provided at a later stage.
not collected: information was not collected and will therefore never be available.
The current API version is available on our main production database. This version will be periodically updated with additional functionality. Whenever possible we will maintain backwards compatibility, but occasionally this may not be possible; for this reason we recommend regularly checking and updating your usage of our API.
The basic functionality of the API is to retrieve dataset metadata held in GigaDB. The actual data files will still need to be pulled by FTP, but you can gather the exact FTP locations from the metadata using the API, then use that to pull only the files you actually need/want.
The search function is based on the web-search function and will therefore give the same results.
Comments and Bug reporting
The GigaScience GitHub issue for the API work is here:
Please add feedback, comments or questions to that issue.
It is currently possible to search "all" fields, or to specify one of a select few fields to search.
Results can return all metadata for each dataset with "hits" to the search term, or only a particular portion of the metadata. These portions are currently "dataset", "sample" and "file", in line with the same functionality on the web-search tool. The default is to return results as GigaDB v3 XML.
We plan to add the option to specify the format as GigaDBv3-JSON or ISA2.0-JSON in the future, but that has not been implemented yet.
To specify the exact fields to return data from, use the terms dataset?=, sample?=, file?= (or experiment?=*)
* - experiment will be implemented in the future
To search for datasets without knowing their IDs, use the term search?keyword=
To search by specific attributes use search?<attribute_name>=
Available attribute_name values to search include:
taxno = Taxonomic ID (NCBI)
taxname = species name (NB: the spelling must be exact; synonyms are not searched)
author = restricts search to the author table
datasettype = restricts search to the types of datasets, e.g. metagenomic, genomic, transcriptomic etc.
manuscript = restricts search to the manuscript ID associated with GigaDB dataset(s) e.g. search?manuscript=10.1186/2047-217X-3-21
project = restricts search to the project name, e.g. Genome 10K
To restrict the results returned to ONLY a particular level of data, append &results=dataset, &results=sample or &results=file to the query.
NB - the search still looks everywhere, but the results returned are only those samples that are in datasets that are found by the search.
Default results are "dataset" only.
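The query terms described above can be assembled programmatically. The following Python sketch builds a search URL from one attribute, a value and an optional results level. It rests on stated assumptions: API_ROOT is a placeholder (the real API root is not given in this text), the function name build_search_url is hypothetical, and only the attribute names listed above plus keyword are accepted.

```python
from urllib.parse import urlencode

# Placeholder only: substitute the real GigaDB API root, which is not given here.
API_ROOT = "https://example.org/api"

# Search terms taken from the list above, plus the general keyword search.
SEARCH_ATTRIBUTES = {"keyword", "taxno", "taxname", "author",
                     "datasettype", "manuscript", "project"}
RESULT_LEVELS = {"dataset", "sample", "file"}


def build_search_url(attribute, value, results=None, api_root=API_ROOT):
    """Return a search URL of the form <root>/search?<attribute>=<value>[&results=<level>]."""
    if attribute not in SEARCH_ATTRIBUTES:
        raise ValueError(f"unknown search attribute: {attribute}")
    params = {attribute: value}
    if results is not None:
        if results not in RESULT_LEVELS:
            raise ValueError(f"unknown results level: {results}")
        params["results"] = results
    # safe="/" keeps the slash in manuscript IDs unescaped, as in the example above.
    return f"{api_root}/search?{urlencode(params, safe='/')}"


if __name__ == "__main__":
    print(build_search_url("keyword", "potato", results="sample"))
    print(build_search_url("manuscript", "10.1186/2047-217X-3-21"))
```

Omitting the results argument leaves the level unspecified, in which case the default described above ("dataset" only) applies.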
1. Retrieve known datasets by DOI
2. Retrieve samples from a known DOI
3. Retrieve file information from a known DOI
4. Search all GigaDB by keyword, return only the top level dataset metadata
5. Search all GigaDB by keyword, return only the sample level metadata
6. Search all GigaDB by keyword, return only the file level metadata
7. Refine search to just the title of the dataset
8. Refine search to the descriptions of datasets
9. Refine search to NCBI taxonomic ID
10. Refine search to taxonomic names
11. Refine search to authors
12. Refine search to linked manuscript IDs
13. Refine search to dataset types
14. Refine search to project names
15. List all dataset DOIs
16. Dump the database
Command line usage
You can also use curl on the command line to retrieve metadata:
If you want to check whether a search will work you can use the -I flag:
HTTP/1.1 200 OK
HTTP/1.1 404 Not Found / HTTP/1.1 500 Internal Server Error
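The same check can be scripted. The Python sketch below sends a HEAD request, which asks for the response headers only (as curl -I does), and reports the status code. The function names are hypothetical and the URL is whatever API query you intend to run; network failures such as an unreachable host are not caught and will raise an exception.

```python
import urllib.error
import urllib.request


def build_head_request(url):
    """Build a request that asks for headers only, like curl -I."""
    return urllib.request.Request(url, method="HEAD")


def head_status(url, timeout=10.0):
    """Return the HTTP status code for the URL without downloading the body."""
    try:
        with urllib.request.urlopen(build_head_request(url), timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses are raised as exceptions; the code is still available.
        return error.code


def search_will_work(status):
    """True for a success status such as 200 OK; False for 404 or 500."""
    return 200 <= status < 300
```

A typical use is to call head_status on a query URL first and fetch the full metadata only when search_will_work returns True.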