The chimpanzee is arguably the most important species for the study of human origins. A key resource for these studies is a high-quality reference genome assembly. The current iteration of the chimpanzee reference genome assembly (Pan_tro_2.1.4) is highly fragmented: it comprises more than 183,000 contigs and over 159,000 gaps, with a genome-wide contig N50 of 51 Kbp.
In this work we produce an extensive and diverse array of sequencing datasets to rapidly assemble a new chimpanzee reference that surpasses previous iterations in bases represented and organized in large scaffolds. We show substantial improvements over the Pan_tro_2.1.4 version by several metrics: contiguity increased by >750% for contigs and by 300% for scaffolds, and 77% of the gaps in the Pan_tro_2.1.4 assembly are closed, with the closed gaps spanning >850 Kbp of novel coding sequence based on RNASeq data. We furthermore report over 2,700 genes that had putatively erroneous frame-shift predictions relative to human in Pan_tro_2.1.4 and show a substantial increase in the annotation of repetitive elements.
We apply a simple 3-way hybrid approach to considerably improve the reference genome assembly for the chimpanzee, providing a valuable resource to study human origins. We furthermore produced extensive sequencing datasets that are all derived from the same cell line, generating a broad non-human benchmark dataset.
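
The contiguity figures quoted above are N50 values: the length such that sequences of at least that length together contain half of all assembled bases. The following is a minimal sketch of how such a value could be computed, not the authors' pipeline. The function names `n50` and `contig_lengths` and the `min_gap` parameter are illustrative assumptions. The rule for where a scaffold is split into contigs (here, any run of `N` characters) is also an assumption, since the record does not state the gap definition that was used.

```python
import re


def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running total reaches half of
        # all bases is the N50.
        if running >= half:
            return length
    return 0


def contig_lengths(scaffold, min_gap=1):
    """Split a scaffold at runs of N (assumed gap marker); return contig lengths.

    min_gap is a hypothetical threshold: runs of N shorter than this
    are not treated as gaps.
    """
    pattern = "[Nn]{%d,}" % min_gap
    return [len(piece) for piece in re.split(pattern, scaffold) if piece]


if __name__ == "__main__":
    # Toy scaffolds, not real chimpanzee sequence.
    scaffolds = ["ACGTACGTNNNNACGT", "ACNNAC"]
    lengths = [n for s in scaffolds for n in contig_lengths(s)]
    print("contig lengths:", lengths)
    print("contig N50:", n50(lengths))
    print("scaffold N50:", n50(len(s) for s in scaffolds))
```

Applying `n50` to whole-scaffold lengths gives the scaffold N50, while applying it to the lengths returned by `contig_lengths` gives the contig N50, which is why the two metrics can improve by different amounts.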

Kuderna, L. F. K., Tomlinson, C., Hillier, L. W., Tran, A., Fiddes, I. T., Armstrong, J., … Marques-Bonet, T. (2017). A 3-way hybrid approach to generate a new high-quality chimpanzee reference genome (Pan_tro_3.0). GigaScience, 6(11), 1–6. doi:10.1093/gigascience/gix098

WGS: AACZ04000000
BioProject: PRJEB18078

Sample ID: SAMEA4557838
Taxonomic ID: 9598
Common Name / Genbank Name: chimpanzee
Scientific Name: Pan troglodytes
Sample Attributes: Description: DNA extracted from Chimpanzee cell lin...
                   Specimen voucher: Coriell:S006007

Data Type          File Format  Size       Release Date
Genome sequence    FASTA        932.4 MB   2017-08-24
Genome assembly    FASTA        896.82 MB  2017-08-24
annotation         UNKNOWN      181.84 MB  2017-08-24
annotation         UNKNOWN      25.03 MB   2017-08-03
Sequence variants  TSV          8.17 MB    2017-08-24
Script             zip          200.92 KB  2017-08-24
Other              BED          124.95 KB  2017-08-24
Sequence variants  TSV          85.75 KB   2017-08-24
Readme             TEXT         2.37 KB    2017-08-03

Funding body                               Awardee          Award ID                Comments
Swedish Foundation for Strategic Research  L Feuk           F06-0045
National Institutes of Health              AJ Sharp         DA033660
National Institutes of Health              AJ Sharp         HG006696
National Institutes of Health              AJ Sharp         HD073731
National Institutes of Health              AJ Sharp         MH097018
Ministerio de Economía y Competitividad    T Marques-Bonet  MINECO BFU2014-55090-P
National Institutes of Health              EE Eichler       HG002385
National Institutes of Health              B Paten          HG007990
National Institutes of Health              B Paten          HG007234
March of Dimes Foundation                  AJ Sharp         6-FY13-92
Ministerio de Economía y Competitividad    T Marques-Bonet  BFU2015-6215-ERC
Ministerio de Economía y Competitividad    T Marques-Bonet  BFU2015-7116-ERC
Ministerio de Economía y Competitividad    LFK Kuderna      BFU2014-55090-P         FPI fellowship

Date                Action
September 13, 2017  Dataset published
November 13, 2017   Manuscript link added: 10.1093/gigascience/gix098