Friday, February 9, 2018

New technique closes gaps in GRCm38.p6


Ongoing efforts to close gaps and to correct clone problems remaining in the GRCm38 mouse reference assembly have proved difficult. The available clone library resources have been exhausted, and the remaining gaps are recalcitrant to cloning, with either no clones available or gap-spanning clones deleted for the expected genomic sequence. The GRC has previously used contigs from publicly available whole genome shotgun assemblies to provide sequence at some of these gaps, and in some cases has been able to close gaps entirely with this approach. Nonetheless, several hundred sequence gaps, many of which are known to contain genes, remain.

With the release of 17 strain-specific genome assemblies from the Mouse Genomes Project, the GRC evaluated alignments between C57BL/6NJ, the most closely related strain, and the GRCm38 reference (C57BL/6J). This evaluation found that genes missing from the reference assembly are present in the new strain assembly. Utilising the C57BL/6J read set (PRJNA51977) deposited in GenBank by the Broad Institute, and used in the production of the C57BL/6J ALLPATHS WGS assembly GCA_000185105.2, the Genome Reference Consortium sought to generate local assemblies from these reads that could be used for curation of the GRCm38 reference. The read set was initially aligned to the C57BL/6NJ assembly using bwa-mem. Once this was completed, reads aligning to the C57BL/6NJ regions corresponding to GRCm38 gaps and to the locations of clone-assembly problems in the GRCm38 reference were identified and subsequently assembled using the Geneious software platform (version 10.1.3). The resulting assembly BAMs were then loaded into GAP5 for manual curation. The assembled WGS contigs were then submitted to GenBank.
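The read-selection step described above can be illustrated with a minimal sketch. It is not the GRC's actual pipeline: the function names, the tuple layout for alignments and regions, and the use of half-open coordinates are all assumptions made for illustration. It simply keeps the names of reads whose alignment overlaps any target region, such as a region of the C57BL/6NJ assembly corresponding to a GRCm38 gap.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """Return True when two half-open intervals [start, end) share a base."""
    return a_start < b_end and b_start < a_end


def select_reads(alignments, regions):
    """Return names of reads whose alignment overlaps any target region.

    alignments: iterable of (read_name, contig, start, end)  (hypothetical layout)
    regions:    iterable of (contig, start, end)             (hypothetical layout)
    Read names are reported once, in order of first appearance.
    """
    # Group target regions by contig so each read is only compared
    # against regions on the contig it aligned to.
    by_contig = defaultdict(list)
    for contig, start, end in regions:
        by_contig[contig].append((start, end))

    selected = []
    seen = set()
    for name, contig, start, end in alignments:
        if name in seen:
            continue
        targets = by_contig.get(contig, [])
        if any(overlaps(start, end, r_start, r_end) for r_start, r_end in targets):
            selected.append(name)
            seen.add(name)
    return selected
```

In practice the alignment tuples would come from parsing the bwa-mem output, and the selected reads would be passed on to an assembler. The linear scan over regions is adequate for a few hundred targets; an interval index would be a natural replacement for larger sets.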

The patch release GRCm38.p6 addresses 20 regions with these newly created and submitted sequences. These contigs fix and improve representation for several genes, examples of which are shown in Table 1 and Figure 1.

Table 1: Examples of issues fixed in GRCm38.p6 using assembled Illumina reads.

Figure 1: (Top) Incomplete representation of the Anxa13 gene in GRCm38 due to a deletion in reference component AC152395.9. (Middle) Clone error corrected in GRCm38.p6. The fix patch uses MF597750.1 and MF597749.1 to add the deleted sequence to AC152395.9, which also provides a complete representation of Anxa13. (Bottom) Representation of Anxa13 by reference chr. 15 and by the fix patch, highlighting the complete representation of Anxa13 (NM_027211.2) in the patch.

Wednesday, September 13, 2017

GRCh38.p11: Clinically Relevant Updates to SLC39A4

The GRCh38.p11 patch release includes the fix patch scaffold KZ208914.1/NW_018654716.1, which updates two bases in SLC39A4. The change was prompted by a user request and has implications for clinically relevant variant analyses of this gene. This GRCh38 fix patch restores the gene to the same representation found in GRCh37.

SLC39A4, solute carrier family 39 member 4 (NCBI Gene ID: 55630), encodes a protein that is required for dietary zinc absorption in the intestine. In GRCh37, the BAC clone AF205589.5 (CTA-393G12) provided the sequence for SLC39A4 on chromosome 8. In GRCh38, AC233992.5 (RP11-735F20) replaced AC110280.8 (CTD-3232M19), the BAC clone upstream and adjacent to AF205589.5. As a consequence of this change, the default switch point in the overlap between these two clones led to the newly added RP11 clone providing the underlying sequence for SLC39A4 in GRCh38. A user subsequently contacted the GRC to report that this GRCh38 update introduced changes at two bases, a G to C substitution at CM000670.2/NC_000008.11: 144,414,297 (rs1871534) and a G to A substitution at CM000670.2/NC_000008.11: 144,415,811 (rs1871533). The “A” allele in GRCh38 has a global MAF of only 0.0098, according to data from the 1000 Genomes Project, making it a rare allele. Additionally, the GRCh38 haplotype confounds clinical prediction algorithms at this locus because the p.Leu372Val (rs1871534) allele found in GRCh38 is incompatible with the p.Leu372Pro variant, which is implicated in acrodermatitis enteropathica (PMID: 12032886).

Review of Illumina genome sequencing data from an RP11 paired end WGS library (SRR834589) aligned to GRCh38 confirms the sequence of AC233992.5, the RP11 BAC clone providing the SLC39A4 sequence in GRCh38. Even though there are no sequence errors at these locations, the GRC agreed to further update these sites to address the needs of clinical researchers and to reduce the presence of globally rare alleles in the reference. Changing the switch point between the two GRCh38 BAC components allows the clone AF205589.5 to once again provide the sequence for SLC39A4 (Figure 1), and review by RefSeq annotators confirmed that this update addresses the two bases without introducing other changes that negatively affect gene representation. These updated bases are now available in the fix patch, and will be incorporated into the primary chromosome at the next major assembly release (GRCh39, currently unscheduled).
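The effect of moving a switch point can be shown with a toy sketch. The sequences, coordinates and the function name below are hypothetical and are not the actual clone sequences or the GRC's tiling software. Two overlapping components differ at one base within their overlap, and the position of the switch point decides which component supplies that base in the assembled sequence.

```python
def tile(left, right, right_offset, switch):
    """Build an assembled sequence from two overlapping components.

    left:         component whose first base is at coordinate 0
    right:        component whose first base is at coordinate right_offset
    switch:       coordinate at which the path changes from left to right;
                  it must lie within the overlap of the two components
    All names and the coordinate scheme are illustrative assumptions.
    """
    if not (right_offset <= switch <= len(left)):
        raise ValueError("switch point must lie within the overlap")
    # Bases before the switch come from the left component,
    # bases from the switch onward come from the right component.
    return left[:switch] + right[switch - right_offset:]


# Hypothetical components that overlap at coordinates 3-8 and differ
# only at coordinate 5 ('A' in left, 'C' in right).
left = "ACGTGACGT"
right = "TGCCGTAA"

early = tile(left, right, right_offset=3, switch=3)  # right supplies the overlap
late = tile(left, right, right_offset=3, switch=9)   # left supplies the overlap
```

With the early switch point the assembled sequence carries the right component's base at coordinate 5, and with the late switch point it carries the left component's base. Neither component contains an error; only the choice of which one is used in the overlap changes.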