Complete Genomics README File
Chromosome 21 Data Subset for NA19240 (S3 version)
February 2010
Software Version 1.7.0
File Format Version 0.5

I. General Information:

This data set includes a subset of the genome sequence derived from HapMap individual NA19240, a Yoruban female. We have included the complete set of files that Complete Genomics typically produces when delivering a complete genome sequence. However, the data included here describe only the sequences and variants found on NA19240 chromosome 21. We also include only the supporting data for that subset of calls, which includes reads, mappings, coverage data and assemblies (evidence).

Given that chromosome 21 represents approximately 1.5% of the human genome, we expect this relatively small data set to be a useful yet nimble resource for various purposes, such as:

- Understanding CGI file formats and data
- Preparing software and systems for CGI data
- Evaluating certain aspects of CGI data

It is important to note that this data subset was created by extracting results from a complete genome sequence of NA19240. Thus there are a number of attributes of these data one should consider:

- All of the overall statistics (such as those in the summary file, mentioned below) are computed genome-wide, not only on chromosome 21.
- The initial mappings of reads against the reference genome sequence were also performed genome-wide. For DNBs (clones) with arms (ends) that map to multiple possible locations in the genome, if any of those locations for either arm is on chromosome 21, then the read and all of its genome-wide mapping information are included in this data subset. For this reason, approximately 2.5% of the mapped reads from the full genome set are included. Reads with no mappings meeting our minimum threshold on chromosome 21 are not included; thus this subset contains no unmapped reads.
- The assembly process was also performed genome-wide, and the algorithms we use for calling variants in duplicated and repetitive regions take into account the conserved regions on chromosome 21 along with those located elsewhere in the genome. If one were to mistakenly treat the reads in this data subset as isolated from the rest of the genome and re-map them against only chromosome 21, we would expect different (and generally poorer) results.

Finally, one should note that this genome assembly is derived from the same raw data as described in our Science paper (Drmanac et al., ePub 2009, Jan 2010 print edition). These data have been re-processed with newer software to be more representative of other data sets now produced by Complete Genomics, and thus the results are not identical to those described in the paper.

II. Downloading and unpacking instructions:

Data included in this release are divided into four portions for ease of downloading.

1. GS19240-170-21-ASM-LIB-DOC-MAP-S3-pt1.tar
   Contains the contents of the DOC, ASM and LIB folders. These folders contain the complete set of derived results from the genome sequencing: variant calls, annotations, coverage information and assemblies. This archive also includes the individual reads, quality scores, and initial mapping results for a single lane of one slide, primarily as an example.

2. GS19240-170-21-MAP-S3-pt2.tar
   Contains individual reads, quality scores, and initial mapping results for about a third of the data set.

3. GS19240-170-21-MAP-S3-pt3.tar
   Contains individual reads, quality scores, and initial mapping results for roughly another third of the data set.

4. GS19240-170-21-MAP-S3-pt4.tar
   Contains all remaining individual reads, quality scores, and initial mapping results, also roughly a third of the data set.

Be sure to download these data in binary mode if given the option (web-based downloads from a compatible browser should automatically use binary mode).
These data are provided as 'tar' format archives that can be unpacked using the tar command-line tool on a Linux/Unix/MacOSX system, or by using many graphical utilities on various platforms. Each tar file should be extracted from the same working directory, because each one creates and populates part of a subdirectory hierarchy starting at a folder named GS19240-170-21-ASM. Many of the files inside the tar archives are already compressed using gzip (.gz files), and thus the tar files themselves are not compressed.

Manifest files are provided for users to confirm that the data have been downloaded and unpacked without issue. These files contain MD5 checksums formatted for use with the 'md5sum' utility present on most Linux systems. One file (manifest.all) is provided for the complete data set, and additionally a separate manifest (pt1, pt2, and pt3) is provided for each of the three download components listed above.

Note that all CGI data files (when decompressed) are text files formatted for use on Linux, Unix and MacOSX. Working with these files on a Windows machine may require format conversion because of the differences in line break characters used in standard Windows text files (CR-LF) vs. those on Unix (LF only). Whether conversion is needed depends on the Windows application you are using: some Windows programs read Unix format files without conversion while others do not. Windows Notepad, for example, does not.

III. Contents of the data set:

We encourage users to read the DataFileFormats.pdf document provided in the DOC folder, which contains detailed descriptions of Complete Genomics' data files and their formats. In brief summary you will find:

DOC Folder:
  NA19240_DataSheet.pdf contains summary information describing the sequence results of NA19240.
  DataFileFormats.pdf contains detailed descriptions of Complete Genomics data formats.
ASM Folder (within GS00028-DNA-C01):
Contains various files describing the overall results of the sequencing project, including variants found and their annotations against known genes and known polymorphisms.
  summary-GS19240_170_chr21_ASM.tsv.gz: Overall information and statistics for this sequencing project.
  var-GS19240_170_chr21_ASM.tsv.gz: Called sequence relative to the NCBI reference, including all variant base calls, reference base calls, and no-calls.
  dbSNPAnnotated-GS19240_170_chr21_ASM.tsv.gz: For known polymorphisms in dbSNP, states the call at the corresponding site in the genome sequence.
  gene-GS19240_170_chr21_ASM.tsv.gz: Functional annotation of individual variants called in NA19240 (nonsynonymous SNPs, frameshifting indels, etc.) against NCBI RefSeq gene alignments provided in build 36.3.
  gene-var-summary-GS19240_170_chr21_ASM.tsv.gz: For each gene, summarizes the numbers of variations with certain functional impacts (counts of nonsynonymous SNPs, etc.).

EVIDENCE Folder (within ASM folder):
Contains the underlying assemblies of variant regions. The formats of these files are described in DataFileFormats.pdf.

REF Folder (within ASM folder):
Contains one file, coverageRefScore-GS19240_170_chr21_ASM.tsv.gz, which reports unique mappable coverage (read depth) for each base position. It also reports the Reference Score, which measures discordance between the mapped reads and the reference sequence. Reference Score is one of the metrics we use to find regions of possible variation to select for later re-assembly.

Please note that the coverage reported in this file represents ONLY reads which are determined to map uniquely at a high threshold by the initial mapping process. It does not take into account the additional reads determined to map to each location later in the assembly process, which uses an alignment method that is both more sensitive and more stringent.
The values in the coverageRefScore file will systematically under-represent reads from repetitive or duplicated regions, as well as reads from alleles with larger degrees of variation (most indels, large block substitutions, and regions of dense SNPs). The more sensitive data following assembly are provided in the EVIDENCE folder.

LIB Folder (within GS00028-DNA-C01):
Describes the gap structure of the reads in the library. The formats of these files are described in DataFileFormats.pdf.

MAP Folder:
Contains individual reads, quality scores, and initial mapping results. The formats of these files are described in DataFileFormats.pdf.

IV. Errata

The DataFileFormats.pdf document omits a detailed description of the "OffsetInLane" field in evidenceDnbs files. When this number is less than 30,000,000, the referenced reads are in the first chunk (e.g., the mapping and reads files ending with "_001") at this 0-based position (data line) in that file. When the number is 30,000,000 to 59,999,999, the reads are in the second chunk ("_002"), at a position in that file equal to the provided index minus 30,000,000. When the number is 60,000,000 or greater, the reads are in the third chunk and 60,000,000 should be subtracted from the index to get the position, etc.

Users who are familiar with previous Complete Genomics data sets may wish to request "Release Notes" from support@completegenomics.com, which highlight differences between the current and previous data sets.

V. DISCLAIMER

Disclaimer of Warranties. COMPLETE GENOMICS, INC. PROVIDES THESE DATA IN GOOD FAITH TO THE RECIPIENT "AS IS." COMPLETE GENOMICS, INC. MAKES NO REPRESENTATION OR WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE OR USE, OR ANY OTHER STATUTORY WARRANTY. COMPLETE GENOMICS, INC. ASSUMES NO LEGAL LIABILITY OR RESPONSIBILITY FOR ANY PURPOSE FOR WHICH THE DATA ARE USED.
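
Illustrative note on the OffsetInLane field: the chunk arithmetic described in the Errata section can be sketched in Python as shown here. This is a minimal sketch and not Complete Genomics software. The names CHUNK_SIZE and locate_read are hypothetical. It assumes that the 30,000,000-record pattern simply continues for chunks beyond the third (one reading of the "etc." in that section) and that chunk suffixes remain zero-padded to three digits.

```python
from typing import Tuple

# Records per chunk, taken from the Errata text (30,000,000).
CHUNK_SIZE = 30_000_000


def locate_read(offset_in_lane: int) -> Tuple[str, int]:
    """Map an OffsetInLane value to (chunk suffix, 0-based data line).

    Assumption: the pattern continues in CHUNK_SIZE steps past the
    third chunk, and suffixes are zero-padded to three digits.
    """
    if offset_in_lane < 0:
        raise ValueError("OffsetInLane must be non-negative")
    # Integer division selects the chunk; the remainder is the
    # 0-based position (data line) within that chunk's file.
    chunk_index, position = divmod(offset_in_lane, CHUNK_SIZE)
    return "_{:03d}".format(chunk_index + 1), position


if __name__ == "__main__":
    for offset in (12_345, 30_000_000, 75_000_000):
        suffix, line = locate_read(offset)
        print(offset, "-> files ending with", suffix, "at data line", line)
```

Because the position is 0-based and refers to data lines, any header lines in the reads and mapping files would need to be skipped before counting; consult DataFileFormats.pdf for the exact file layout.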