Analyses of Genomes from Science Publication
Complete Genomics has published a report describing three human genome sequences
in the journal Science. Two of the samples were derived from potentially different passages of cell lines
used in the International HapMap Project: 1) a Caucasian male of European descent (NA07022), and 2) a
Yoruban female (NA19240). The third sample was generated from lymphoblast DNA from a Personal Genome Project
(PGP) Caucasian male sample (NA20431). Sequencing of these genomes was conducted at Complete Genomics’
commercial-scale genome center.
Results presented in this paper are summarized and discussed below.
NOTE: Analysis of these data is ongoing, and we have made considerable additions to our
production analysis software since this paper was written. Please refer to the
Sequence Data Available for Download page for updated results.
Summary of Genomes Sequenced
Summary information from mapping and assembly of the three genomes.
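| Sample | Gb mapped | Mean coverage depth (fold) | Fully called | Partially called |
|---|---|---|---|---|
| NA19240 | 178 | 63 | 95% | 1% |
| NA07022 | 241 | 87 | 91% | 2% |
| NA20431 | 124 | 45 | 86% | 3% |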
Results were obtained by mapping sequence reads to the human genome reference
(NCBI Build 36) and assembling variants with custom algorithms specifically designed for
Complete Genomics data. Between 124 and 241 Gigabases (Gb) were mapped, for an overall mean depth
of coverage of 45- to 87-fold per genome. Fully called regions are those where both diploid
alleles could be determined at high accuracy (see below), while partially called regions are those
where one of the two alleles was determined but the second was not.
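The relationship between mapped bases and depth of coverage can be checked with a few lines of arithmetic. The sketch below assumes that mean depth is simply mapped bases divided by the length of the genome being covered; the source does not state which denominator was used, so the function name and that assumption are illustrative only.

```python
# Values copied from the genome summary table: (Gb mapped, mean fold coverage).
GENOMES = {
    "NA19240": (178, 63),
    "NA07022": (241, 87),
    "NA20431": (124, 45),
}


def implied_genome_size_gb(gb_mapped, fold_coverage):
    """Return the genome length (Gb) implied by depth = mapped bases / length.

    Assumption: coverage was computed against a single fixed genome length.
    """
    return gb_mapped / fold_coverage


for sample, (gb_mapped, coverage) in GENOMES.items():
    size = implied_genome_size_gb(gb_mapped, coverage)
    print(f"{sample}: {gb_mapped} Gb / {coverage}x = {size:.2f} Gb")
```

All three genomes imply a denominator of roughly 2.8 Gb, which suggests the reported depths were computed on a consistent basis.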
Summary of Variations
Variations detected relative to reference genome (NCBI Build 36).
| Category | Variation type | NA19240 count | NA19240 % novel | NA07022 count | NA07022 % novel | NA20431 count | NA20431 % novel |
|---|---|---|---|---|---|---|---|
| SNPs | All | 4,042,801 | 19% | 3,076,869 | 10% | 2,905,517 | 10% |
| | Homozygous | 1,297,601 | 4% | 1,097,899 | 2% | 965,029 | 1% |
| | Heterozygous | 2,639,864 | 27% | 1,800,287 | 15% | 1,657,540 | 16% |
| | Transitions | 3,635,882 | | 2,858,818 | | 2,658,112 | |
| | Transversions | 1,706,195 | | 1,316,837 | | 1,213,232 | |
| | Coding | 23,000 | 16% | 18,723 | 9% | 16,532 | 10% |
| | Non-synonymous | 11,400 | 19% | 9,286 | 11% | 8,215 | 12% |
| Indels | Short insertions | 242,391 | 40% | 168,909 | 37% | 136,786 | 37% |
| | Short deletions | 253,803 | 44% | 168,726 | 37% | 133,008 | 36% |
| | Total | 496,194 | | 337,635 | | 269,794 | |
| | Coding short indels | 549 | 56% | 556 | 58% | 435 | 59% |
| | Frameshifting short indels | 327 | 61% | 310 | 62% | 299 | 71% |
| Block substitutions | Length conserving | 54,054 | 39% | 40,103 | 42% | 38,449 | 33% |
| | Length changing | 34,432 | 64% | 22,680 | 61% | 18,166 | 60% |
Between 2.91 and 4.04 million single nucleotide polymorphisms (SNPs) with
respect to the reference genome were identified. Of these SNPs, 81% to 90% have been previously
reported in dbSNP build 129. This is consistent with reports of other complete human genome
sequences from different ethnicities compared to this reference.
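Two simple checks on the SNP counts are sketched below: the share of SNPs already in dbSNP and the transition/transversion (Ti/Tv) ratio. The sketch assumes that the percentage next to each count in the variation table is the share not found in dbSNP, an inference from the 81% to 90% figure above rather than something the table states. The function names are hypothetical.

```python
# Counts copied from the variation table, one tuple per genome column, left to right:
# (all SNPs, assumed % novel, transitions, transversions).
SNP_COUNTS = [
    (4_042_801, 19, 3_635_882, 1_706_195),
    (3_076_869, 10, 2_858_818, 1_316_837),
    (2_905_517, 10, 2_658_112, 1_213_232),
]


def percent_known(percent_novel):
    """Share of SNPs previously reported, assuming the table lists % novel."""
    return 100 - percent_novel


def ti_tv_ratio(transitions, transversions):
    """Transition/transversion ratio from the two reported counts."""
    return transitions / transversions


for total, novel, ti, tv in SNP_COUNTS:
    print(
        f"{total:,} SNPs: {percent_known(novel)}% in dbSNP, "
        f"Ti/Tv = {ti_tv_ratio(ti, tv):.2f}"
    )
```

Under that reading, the three genomes give 81%, 90%, and 90% previously reported SNPs, matching the range quoted in the text, and Ti/Tv ratios between 2.1 and 2.2.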
Concordance with HapMap and other orthogonal technologies
The data generated show excellent concordance with SNP genotypes generated by
the HapMap project, particularly with the highest quality Illumina Infinium™ subset. The HapMap
paper can be found at:
http://hapmap.org/downloads/presentations/nature_hapmap3.pdf; see Supplementary Table 3 for
details of genotyping accuracy by technology and center:
http://www.hapmap.org/downloads/presentations/nature_supp3.pdf.
The high concordance of our genotypes with those generated using independent technologies affirms
the accuracy of Complete Genomics' sequencing technology for discovery and validation of polymorphisms.
Sample NA19240
Genotype calls compared against HapMap Phase I and II genotypes and the HapMap Infinium subset.
| | | HapMap Phase I and II | HapMap Infinium subset |
|---|---|---|---|
| | # reported | 3.8M | 144K |
| | % called | 98.46% | 98.45% |
| | % locus concordance | 99.14% | 99.85% |
| HapMap genotype calls | Homozygous ref (% concordance) | 99.22% | 99.92% |
| | Heterozygous (% concordance) | 99.62% | 99.81% |
| | Homozygous alt (% concordance) | 98.26% | 99.79% |
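The metrics in this kind of comparison can be illustrated with a minimal sketch. It is not the pipeline used for the publication: the function name, the genotype encoding (count of non-reference alleles, with `None` for a no-call), and the exact definitions of each percentage are assumptions made for illustration.

```python
from collections import Counter

CLASS_NAMES = {0: "homozygous ref", 1: "heterozygous", 2: "homozygous alt"}


def concordance_summary(reference_calls, test_calls):
    """Compare test genotypes with reference genotypes at the same loci.

    Both arguments map locus -> alt-allele count (0, 1, or 2); a missing or
    None entry in test_calls is treated as a no-call.
    Returns (% called, % locus concordance, per-class % concordance), where
    concordance is measured among called loci only (an assumption).
    """
    def pct(numerator, denominator):
        return 100.0 * numerator / denominator if denominator else float("nan")

    called = concordant = 0
    class_called = Counter()
    class_concordant = Counter()
    for locus, ref_gt in reference_calls.items():
        test_gt = test_calls.get(locus)
        if test_gt is None:
            continue  # no-call: counts against % called only
        called += 1
        class_called[ref_gt] += 1
        if test_gt == ref_gt:
            concordant += 1
            class_concordant[ref_gt] += 1

    per_class = {
        name: pct(class_concordant[gt], class_called[gt])
        for gt, name in CLASS_NAMES.items()
    }
    return pct(called, len(reference_calls)), pct(concordant, called), per_class


# Toy data, not taken from the publication.
array_calls = {"rs1": 0, "rs2": 1, "rs3": 2, "rs4": 1, "rs5": 0}
sequence_calls = {"rs1": 0, "rs2": 1, "rs3": 2, "rs4": 2, "rs5": None}
print(concordance_summary(array_calls, sequence_calls))
```

With the toy data, four of five loci are called (80%), three of those four agree (75%), and the heterozygous class shows 50% concordance because one of its two called loci disagrees.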
Sample NA07022
Genotype calls compared against Infinium 1M, HapMap Phase I and II genotypes,
and the HapMap Infinium subset. To determine whether discordances were due to errant calls in the
Complete Genomics data or the Infinium subset of HapMap, discordant loci were tested by Sanger sequencing.
| | | Infinium 1M | HapMap Phase I and II | HapMap Infinium subset | These data correct | These data incorrect | % affirmed |
|---|---|---|---|---|---|---|---|
| | # reported | 1M | 3.9M | 143K | | | |
| | % called | 95.98% | 94.39% | 96.00% | | | |
| | % locus concordance | 99.89% | 99.15% | 99.88% | | | |
| HapMap genotype calls | Homozygous ref (% concordance) | 99.96% | 99.34% | 99.96% | 18 | 2 | 90% |
| | Heterozygous (% concordance) | 99.78% | 99.39% | 99.80% | 28 | 46 | 38% |
| | Homozygous alt (% concordance) | 99.81% | 98.14% | 99.84% | 28 | 12 | 70% |
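The "% affirmed" figures follow directly from the Sanger counts in the same rows. The short sketch below reproduces them, assuming "% affirmed" means the share of Sanger-tested discordant loci at which the Complete Genomics call was confirmed; the function name is hypothetical.

```python
# (these data correct, these data incorrect) per HapMap genotype class,
# copied from the Sanger columns of the NA07022 table.
SANGER_RESULTS = {
    "homozygous ref": (18, 2),
    "heterozygous": (28, 46),
    "homozygous alt": (28, 12),
}


def percent_affirmed(correct, incorrect):
    """Share of Sanger-tested discordant loci where the sequencing call held up."""
    return 100.0 * correct / (correct + incorrect)


for genotype_class, (correct, incorrect) in SANGER_RESULTS.items():
    print(f"{genotype_class}: {percent_affirmed(correct, incorrect):.0f}% affirmed")
```

Rounded to whole percentages, this gives 90%, 38%, and 70%, matching the table.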
Additionally, to determine a whole-genome false positive rate, 291 novel non-synonymous
variants (a category enriched for errors) were tested with Sanger sequencing. This approach yielded an
extrapolated false positive rate of approximately 1 in 100 kilobases. For more detail, see Supplemental
Table S8 in the Science publication.
Sample NA20431
Genotype calls compared to Affymetrix® 500K SNP genotypes. Genotypes were assayed in
duplicate, and only SNPs with identical calls between the replicates are considered.
| | | Affymetrix 500K |
|---|---|---|
| | # reported | 475K |
| | % called | 94.18% |
| | % locus concordance | 99.75% |
| Array genotype calls | Homozygous ref (% concordance) | 99.88% |
| | Heterozygous (% concordance) | 99.45% |
| | Homozygous alt (% concordance) | 99.78% |
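The replicate filter described for this sample can be sketched as follows. The function name, the genotype strings, and the treatment of no-calls (dropped) are assumptions for illustration; the source states only that SNPs with identical calls between the two replicates were considered.

```python
def consistent_replicate_calls(replicate_a, replicate_b):
    """Keep loci genotyped identically in both array replicates.

    Each argument maps locus -> genotype string, with None for a no-call.
    Loci that are uncalled in either replicate are dropped (an assumption).
    """
    return {
        locus: genotype
        for locus, genotype in replicate_a.items()
        if genotype is not None and replicate_b.get(locus) == genotype
    }


# Toy data, not taken from the publication.
replicate_a = {"rs1": "AA", "rs2": "AG", "rs3": "GG", "rs4": None}
replicate_b = {"rs1": "AA", "rs2": "GG", "rs3": "GG", "rs4": None}
print(consistent_replicate_calls(replicate_a, replicate_b))
```

Here only `rs1` and `rs3` survive: `rs2` differs between replicates and `rs4` is a no-call in both. Filtering this way removes array errors from the comparison set before concordance is measured.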
©2010 Complete Genomics, Inc. All rights reserved. cPAL and DNB are trademarks of Complete Genomics, Inc.
in the US and certain other countries. All other trademarks are the property of their respective owners.