Help

The GigaDB website allows any user to browse, search and view datasets and to access data files. If you want to submit a dataset, save searches or be alerted to new content of interest, we request that you create an account.

A 'Latest news' section will be visible to announce any updates or new features to the database and the RSS feed automatically announces each new dataset release.

The GigaDB homepage allows you to browse datasets by type eg Genomic, Metagenomic, Transcriptomic. Clicking on the DOI (digital object identifier) or image will take you directly to the webpage for the dataset of interest.

Alternatively you can use the search functions to find datasets, samples or files of interest.

GigaDB is an open-access database. As such, all data submitted to GigaDB must be fully consented for public release (for more information about our data policies, please see our Terms of use page).

All sequence, assembly, variation, and microarray data must be deposited in a public database at NCBI, EBI, or DDBJ before you submit them to GigaDB. In the cases where you would like GigaDB to host files associated with genomic data not fully consented for public release, you must first submit the non-public data to dbGaP or EGA.

Step 1 - Create an account or log in to GigaDB

Step 2 - Download and complete the Excel template file. Completed example files for the E. coli (10.5524/100001) and Sorghum (10.5524/100012) datasets are available.

The template file contains:

  1. 3 tabs which must all be completed [Study, Samples, Files]
  2. 4 informational tabs [Samples (info), Files (info), CV, Links]

Mandatory fields are highlighted in yellow.

Study

Required information includes submitter name, email and affiliation, upload status [can we publish this dataset immediately after review (Publish) or should it be held until publication (HUP)], author list, dataset type(s) (selected from a controlled vocabulary list), dataset title and description, estimated total size of the files that will be submitted and dataset image information.

Optional information includes links to additional resources and related manuscripts, accessions for data in other databases (prefixes are found in the Links tab), and relationship (if any) to a previously published GigaDB dataset (selected from a controlled vocabulary list).

Samples

Required information includes a sample ID or name (please use an NCBI BioSample ID when possible), species NCBI taxonomy ID, and species common name.

Optional information includes sample attributes (these are automatically populated in GigaDB if an NCBI BioSample ID is provided).

Files

Required information includes a file name or path relative to your home directory and file type (selected from a controlled vocabulary list). A readme file must be provided.

Please note:
- Filenames should be unique.
- Filenames should not include spaces. We recommend using the underscore (_) in place of spaces.
- Filenames should only include the following characters: a-z, A-Z, 0-9, underscore (_), hyphen (-), plus (+) and full stop (.)
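
The filename rules above can be checked before you fill in the Files tab. The following is a minimal sketch in Python, not a GigaDB tool: the function names check_filenames and suggest_name are hypothetical, and it assumes each entry is a bare filename rather than a path (a path separator such as / would be reported as a disallowed character).

```python
import re

# Allowed characters as listed above: a-z, A-Z, 0-9, underscore, hyphen, plus, full stop.
ALLOWED_NAME = re.compile(r"[A-Za-z0-9_+.\-]+")


def check_filenames(names):
    """Return (name, problem) pairs; an empty list means no problems were found."""
    problems = []
    seen = set()
    for name in names:
        if name in seen:
            problems.append((name, "duplicate filename"))
        seen.add(name)
        if " " in name:
            problems.append((name, "contains a space"))
        elif not ALLOWED_NAME.fullmatch(name):
            problems.append((name, "contains a character outside the allowed set"))
    return problems


def suggest_name(name):
    """Replace spaces with underscores, as recommended above."""
    return name.replace(" ", "_")
```

For example, check_filenames(["my reads.fq"]) reports the space, and suggest_name("my reads.fq") gives my_reads.fq.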

Optional information includes a file description and a sample ID or name.

Step 3 - Confirm you have read our Terms of use page and upload the completed Excel template file.

You can expect a response from the GigaDB team within 5 days to verify the information in your submission and to arrange upload of your files to our FTP site.

If you have any questions, please contact us at database@gigasciencejournal.com.

Dataset types

Genomic - includes all genetic and genomic data eg sequence, assemblies, alignments, genotypes, variation and annotation. Minimal requirements: DNA sequence data eg next-gen raw reads (fastq files) OR assembled DNA sequences (fasta files)

Epigenomic - includes methylation and histone modification data. Minimal requirements: Details on methylation sites/status eg qmap files OR details on histone modification sites/status.

Metagenomic - includes all genetic and genomic data eg sequence, assemblies, alignments, genotypes, variation and annotation from environmental samples. Minimal requirements: Environmental DNA sequence data eg next-gen raw reads (fastq files) OR assembled DNA sequences (fasta files).

Proteomic - includes all mass spec data. Minimal requirements: Peptide/protein data eg mass spec.

Transcriptomic - includes all data relating to mRNA. Minimal requirements: RNA sequence data eg next-gen raw reads (fastq files) OR transcript statistics eg RNA coverage/depth.

Additional dataset types can be added, upon review, as new submissions are received.

File types

File types and examples of associated file extensions:

Alignments: .bam, .chain, .maf, .net, .sam

Allele frequencies: .frq

Annotation: .gff, .ipr, .kegg, .wego

Coding sequence: .cds, .fa

InDels: .gff, .txt, .vcf

ISA-Tab: see ISA tools

Genome assembly: .agp, .contig, .depth, .fa, .length, .scafseq

Genome sequence: .fastq, .fq

Haplotypes: .haplotype

Methylome data: .fa, .qmap, .rpm, .txt

Protein sequence: .fa, .pep

Readme: .pdf, .txt

SNPs: .annotation, .gff, .txt, .vcf

SVs: .gff, .txt, .vcf

Transcriptome data: .depth, .rpkm, .wig

Other: .xls, .pdf, .txt

Additional file types can be added, upon review, as new submissions are received.


File formats

  1. AGP
  2. BAM
  3. BIGWIG
  4. CHAIN
  5. CONTIG
  6. EXCEL
  7. FASTA
  8. FASTQ
  9. GFF
  10. IPR
  11. KEGG
  12. MAF
  13. NET
  14. PDF
  15. PNG
  16. QMAP
  17. QUAL
  18. RPKM
  19. SAM
  20. TAR
  21. TEXT
  22. VCF
  23. WEGO
  24. WIG
  25. UNKNOWN
  26. XML

AGP (.agp) - the Accessioned Golden Path (AGP) file describes the assembly of a larger sequence object from smaller objects:

chr1 1       1972671 0 W scaffold43  1 1972671 m
chr1 1972672 3061819 1 W scaffold8   1 1089148 p
chr1 3061820 3181505 2 W scaffold548 1 119686  m
chr1 3181506 4176151 3 W scaffold313 1 994646  m

The large object can be a contig, a scaffold (supercontig), or a chromosome. See AGP Specification v2.0
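
As an illustration of how lines like those above can be read, here is a minimal Python sketch. The field names follow the column order of the AGP specification for component lines and are our own labels, not GigaDB identifiers. The sketch assumes whitespace-delimited, nine-column component lines such as the example; gap lines are not handled, and the last column is kept as a plain string.

```python
from typing import NamedTuple


class AgpRow(NamedTuple):
    object_id: str
    object_beg: int
    object_end: int
    part_number: int
    component_type: str
    component_id: str
    component_beg: int
    component_end: int
    orientation: str  # kept exactly as written in the file


def parse_agp_line(line):
    """Parse one whitespace-delimited, nine-column AGP component line."""
    f = line.split()
    if len(f) != 9:
        raise ValueError(f"expected 9 columns, got {len(f)}")
    return AgpRow(f[0], int(f[1]), int(f[2]), int(f[3]),
                  f[4], f[5], int(f[6]), int(f[7]), f[8])


def spans_match(row):
    """True when the object span and the component span have the same length."""
    return row.object_end - row.object_beg == row.component_end - row.component_beg
```

A simple sanity check on a file is that spans_match is true for every component line, as it is for the four example lines above.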

BAM (.bam) - the Binary Alignment/Map (BAM) format is the compressed binary version of the Sequence Alignment/Map (SAM) format, a compact and index-able representation of nucleotide sequence alignments.

BIGWIG (.bw) - the BIGWIG format is for storing dense, continuous data (such as GC percent, probability scores, and transcriptome data) that will be displayed in the UCSC Genome Browser as a graph. BIGWIG files are created initially from wiggle (WIG) type files, using the program wigToBigWig.

CHAIN (.chain) - the CHAIN format describes a pairwise alignment that allows gaps in both sequences simultaneously and is used by the UCSC Genome Browser.

CONTIG (.contig) - the CONTIG format is a direct output from the SOAPdenovo assembly program:

>1 length 32 cvg_0.0_tip_0
GAGAACGGCGAAGCCTGCTCGGGCCCGTTATA
>3 length 32 cvg_23.0_tip_0
TAGCAGCGATTTGATCAAACTCAATCTTACCG
>5 length 32 cvg_40.0_tip_0
GGTAAGATTGAGTTTGATCAAATCGCTGCTAT

EXCEL (.xls, .xlsx) - Microsoft office spreadsheet files

FASTA (.fasta, .fa, .seq, .cds, .pep, .scafseq [SOAPdenovo output file - sequence of each scaffold]) - FASTA is a text-based format for representing either nucleotide sequences or peptide sequences.

FASTQ (.fq, .fastq) - the FASTQ format stores sequences (usually nucleotide sequence) and Phred qualities in a single file.

GFF (.gff) - The General Feature Format (GFF) is used for describing genes and other features of DNA, RNA and protein sequences.

IPR (.ipr) - the Web Gene Ontology (WEGO) Annotation format consists of the protein ID, followed by column(s) that are the IPR (InterPro) ID(s):

CR_ENSP00000334840
CR_ENSMMUP00000018123 IPR000504 IPR003954
CR_ENSP00000333725    IPR001781 IPR015880 IPR007087 IPR001909

See WEGO: a web tool for plotting GO annotations

KEGG (.kegg) - the Web Gene Ontology (WEGO) Annotation format consists of the protein ID, followed by column(s) that are the KEGG (Kyoto Encyclopedia of Genes and Genomes) ID(s):

CR_ENSMMUP00000031408 ko03010
CR_ENSP00000364815    ko00970 ko00290
CR_ENSP00000414605    ko05146 ko04510 ko04512

See WEGO: a web tool for plotting GO annotations

MAF (.maf) - the Multiple Alignment Format (MAF) stores a series of multiple alignments at the DNA level between entire genomes.

NET (.net) - the NET file format is used to describe the axtNet data that underlie the net alignment annotations in the UCSC Genome Browser.

PDF (.pdf) - portable document format

PNG (.png) - portable network graphics

QMAP (.qmap) - QMAP files are generated for methylation data from an internal BGI pipeline.

QUAL (.qual) - the QUAL file format stores base quality scores for NextGen data (similar in layout to FASTA).

RPKM (.rpkm) - gene expression levels are calculated as Reads Per Kilobase per Million mapped reads (RPKM). For example, a 1 kb transcript with 1000 alignments in a sample of 10 million reads (of which 8 million can be mapped) will have RPKM = 1000/(1 * 8) = 125:

ENSP00000379387 15.5651433366423 6002951 289 3093
ENSP00000349977 24.7483107230444 6002951 398 2679
ENSP00000368887 24.6477413647837 6002951 174 1176
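
The RPKM calculation described above can be written as a short Python function. This is a sketch of the arithmetic only; the function and parameter names are hypothetical, and it makes no assumption about the meaning of the columns in the example file.

```python
def rpkm(read_count, transcript_length_bp, mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    length_kb = transcript_length_bp / 1_000
    mapped_millions = mapped_reads / 1_000_000
    return read_count / (length_kb * mapped_millions)


# The worked example from the text: a 1 kb transcript, 1000 alignments,
# 8 million mapped reads.
example = rpkm(1000, 1000, 8_000_000)  # 1000 / (1 * 8) = 125.0
```

Note that the denominator uses the number of mapped reads (8 million), not the total number of reads in the sample (10 million).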

SAM (.sam) - the Sequence Alignment/Map (SAM) format is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. Most often it is generated as a human readable version of its sister BAM format, which stores the same data in a compressed, indexed, binary form. Currently, most SAM format data is output from aligners that read FASTQ files and assign the sequences to a position with respect to a known reference genome. In the future, SAM will also be used to archive unaligned sequence data generated directly from sequencing machines. See The Sequence Alignment/Map format and SAMtools

TAR (.tar) - an archive containing other files

TEXT (.doc, .readme, .text, .txt) - a text file

VCF (.vcf) - the Variant Call Format (VCF) is a text file format for representing eg SNPs, InDels, CNVs, SVs, microsatellites, genotypes.

WEGO (.wego) - the Web Gene Ontology (WEGO) Annotation format consists of the protein ID, followed by column(s) that are the GO ID(s):

Bmb015379_2_IPR001092
Bmb003749_1_IPR006329 GO:0009168 GO:0003876
Bmb006173_1_IPR000909 GO:0007165 GO:0004629 GO:0007242

See WEGO: a web tool for plotting GO annotations
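
Because each line is a protein ID followed by zero or more IDs, these annotation files are simple to read programmatically. A minimal Python sketch, assuming whitespace-separated columns as in the example (the function name is hypothetical):

```python
def parse_annotation_line(line):
    """Split an annotation line into (protein_id, [annotation_ids])."""
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    return fields[0], fields[1:]


protein, go_ids = parse_annotation_line("Bmb003749_1_IPR006329 GO:0009168 GO:0003876")
```

A protein with no annotations, such as the first example line, yields an empty list. The same approach should work for the .ipr and .kegg files, which share this layout.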

WIG (.wig) - the output file from TopHat is a UCSC wigglegram of alignment coverage.

UNKNOWN - any file format not in this list

XML (.xml) - eXtensible Markup Language


Upload status

Publish: this dataset is fully consented for immediate release upon GigaDB approval

HUP: this dataset should be Held Until Publication (HUP)


DOI relationship

The DOI relationship vocabulary is taken from the DataCite 'relationType' schema property (ID=12.2).

Definition: Description of the relationship of the resource being registered (A) and the related resource (B).

IsSupplementTo: indicates that A is a supplement to B

IsSupplementedBy: indicates that B is a supplement to A

IsNewVersionOf: indicates A is a new edition of B, where the new edition has been modified or updated

IsPreviousVersionOf: indicates A is a previous edition of B

IsPartOf: indicates A is a portion of B; may be used for elements of a series

HasPart: indicates A includes the part B

References: indicates B is used as a source of information for A

IsReferencedBy: indicates A is used as a source of information by B


Missing Value reporting

For attributes (sample, dataset or files) that have some or all values missing please use the following controlled value terms to describe the exact reason for the missing value.

not applicable: information is inappropriate to report; often this attribute can be removed entirely

restricted access: information exists but cannot be released openly because of privacy concerns

not provided: information is not available at the time of submission; a value may be provided at a later stage

not collected: information was not collected and will therefore never be available

Availability

The current API version is available on our main production database. This version will be periodically updated with additional functionality. We will maintain backwards compatibility whenever possible, but occasionally this may not be possible. For this reason we recommend regularly checking and updating your usage of our API.

The basic functionality of the API is to retrieve dataset metadata held in GigaDB. The actual data files will still need to be pulled by FTP, but you can gather the exact FTP locations from the metadata using the API, then use that to pull only the files you actually need/want.

The search function is based on the web-search function and will therefore give the same results.


Comments and Bug reporting

The GigaScience GitHub issue for the API work is here:

https://github.com/gigascience/gigadb-website/issues/27

Please add feedback, comments and questions to that issue.


Summary

It is currently possible to search "all" fields, or to specify one of a select few fields to search.

It is possible to have results return all metadata for each dataset with "hits" to the search term, or to specify a particular portion of the metadata. These portions are currently "dataset", "sample" and "file", in line with the same functionality in the web-search tool. The default is to return results as GigaDB v3 XML.

It is planned that we will have the option to specify the format to be GigaDBv3-JSON or ISA2.0-JSON in the future, but that has not been implemented yet.


Terminology

To specify exact fields to return data from, use the terms dataset?=, sample?=, file?= (or experiment?=*)

* - experiment will be implemented in the future

To search for datasets without the IDs, use the term search?keyword=

To search by specific attributes use search?<attribute_name>=

Available attribute_name values to search include:

taxno = Taxonomic ID (NCBI)

taxname = species name (NB must be the exact spelling; no synonyms are searched)

author = restricts search to the author table

datasettype = restricts search to the types of datasets, e.g. metagenomic, genomic, transcriptomic etc.

manuscript = restricts search to the manuscript ID associated with GigaDB dataset(s) e.g. search?manuscript=10.1186/2047-217X-3-21

project = restricts search to the project name, e.g. Genome 10K

eg .../search?taxno=9606

To restrict the results returned to ONLY a particular level of data, add &result=dataset (or file or sample):

e.g.

http://gigadb.org/api/search?project=Genome%2010K&result=sample

NB - the search still looks everywhere, but the results returned are only those samples that are in datasets that are found by the search.

Default results are "dataset" only.
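
The search terms above can be combined into a URL programmatically. The following Python sketch only builds the URL string and does not contact the server. The helper name build_search_url is hypothetical, and the field list is limited to the attributes named on this page.

```python
from urllib.parse import quote, urlencode

API_BASE = "http://gigadb.org/api"
SEARCH_FIELDS = {"keyword", "taxno", "taxname", "author",
                 "datasettype", "manuscript", "project"}
RESULT_LEVELS = {"dataset", "sample", "file"}


def build_search_url(field, value, result=None):
    """Build a search URL for one field, optionally restricting the result level."""
    if field not in SEARCH_FIELDS:
        raise ValueError(f"unknown search field: {field}")
    params = {field: value}
    if result is not None:
        if result not in RESULT_LEVELS:
            raise ValueError(f"unknown result level: {result}")
        params["result"] = result
    # quote_via=quote encodes spaces as %20; "/" and ":" are left readable.
    return f"{API_BASE}/search?" + urlencode(params, safe="/:", quote_via=quote)
```

For example, build_search_url("project", "Genome 10K", result="sample") produces the same URL as the example above.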


Examples

1. retrieve known datasets by doi

http://gigadb.org/api/dataset?doi=100051

2. retrieve samples from a known DOI

http://gigadb.org/api/sample?doi=100051

3. retrieve file information from a known DOI

http://gigadb.org/api/file?doi=100051

4. Search all GigaDB by keyword, return only the top level dataset metadata

http://gigadb.org/api/search?keyword=chimp&result=dataset

5. Search all GigaDB by keyword, return only the sample level metadata

http://gigadb.org/api/search?keyword=chimp&result=sample

6. Search all GigaDB by keyword, return only the file level metadata

http://gigadb.org/api/search?keyword=chimp&result=file

7. refine search to just the title of the dataset

http://gigadb.org/api/search?keyword=title:human&result=dataset

8. refine search to the descriptions of datasets

http://gigadb.org/api/search?keyword=description:human&result=dataset

9. refine search to NCBI taxonomic ID

http://gigadb.org/api/search?taxno=9606&result=dataset

10. refine search to taxonomic names

http://gigadb.org/api/search?taxname=Homo%20sapiens&result=dataset

11. refine search to Authors

http://gigadb.org/api/search?author=Wang%20Jun

12. refine search to linked manuscript IDs

http://gigadb.org/api/search?manuscript=10.1371/journal.pone.0005795

13. refine search to dataset types

http://gigadb.org/api/search?datasettype=Genomic

14. refine search to project names

http://gigadb.org/api/search?project=Genome%2010K&result=sample

15. list all dataset doi

http://gigadb.org/api/list

16. dump the database

http://gigadb.org/api/dump


Command line usage

You can also use curl on the command line to retrieve metadata:

eg.

curl 'http://gigadb.org/api/dataset?doi=100051'

If you want to check whether a search will work you can use the -I flag:

curl -I 'http://gigadb.org/api/dataset?doi=100051'

HTTP/1.1 200 OK

or

HTTP/1.1 404 Not Found / HTTP/1.1 500 Internal Server Error
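
The same check can be made from a script. The sketch below uses only the Python standard library to send a HEAD request and return the numeric status code. The function name head_status is hypothetical, and the opener parameter is our own addition so that the function can be exercised without network access.

```python
import urllib.error
import urllib.request


def head_status(url, opener=urllib.request.urlopen):
    """Send a HEAD request and return the numeric HTTP status code."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the code is the answer we want.
        return err.code
```

Calling head_status("http://gigadb.org/api/dataset?doi=100051") is the scripted counterpart of the curl -I example: a return value of 200 corresponds to the OK response shown above.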