Tuesday, July 11, 2017

GRCh38.p11: Update to GCNT2

The GRC prioritizes curation efforts that improve gene representation in the human reference genome assembly. In some cases, such curation takes the form of base-pair level edits. The recent GRCh38.p11 patch release includes a new, curated representation of the GCNT2 gene. The representation of the GCNT2 gene in the GRCh38 reference assembly contains the "C" allele for SNP rs539351 on chromosome 6 (NC_000006.12) at position 10,586,805, which reflects the sequence from the underlying component AL358777.12 (RP11-421M1) (Figure 1, top). During human development, the fetal blood group antigen (i) is converted to the adult antigen (I) by beta-1,6-N-acetylglucosaminyltransferase-2 (GCNT2). Alternative splicing of the gene generates 3 isoforms, which differ only in their first exon. The SNP rs539351 is found in the first exon unique to GCNT2 isoform C, the only isoform expressed in red blood cells, where this conversion occurs (NM_145655.3: c.816C>G; NP_663630.2: p.Asp272Glu). A user contacted the GRC with information that the reference allele had previously been described as a rare allele [1].
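To make the relationship between the coding change c.816C>G and the protein change p.Asp272Glu concrete, here is a minimal Python sketch. It only does the position arithmetic and a lookup in a four-entry excerpt of the standard genetic code. The reference codon GAC is an inference (Asp is encoded by GAT or GAC, and the reference base at the third codon position is C), not a value read from the transcript, and the function names are illustrative.

```python
from typing import Tuple

# Excerpt of the standard genetic code: only the Asp and Glu codons.
CODON_TABLE = {"GAT": "Asp", "GAC": "Asp", "GAA": "Glu", "GAG": "Glu"}


def codon_index(cds_pos: int) -> Tuple[int, int]:
    """Return (1-based codon number, 1-based position within that codon)."""
    return (cds_pos - 1) // 3 + 1, (cds_pos - 1) % 3 + 1


def substitute(codon: str, pos_in_codon: int, alt: str) -> str:
    """Return the codon with the base at pos_in_codon replaced by alt."""
    i = pos_in_codon - 1
    return codon[:i] + alt + codon[i + 1:]


codon_number, pos_in_codon = codon_index(816)

# Assumption: inferred reference codon (Asp with C in the third position).
ref_codon = "GAC"
alt_codon = substitute(ref_codon, pos_in_codon, "G")

print(codon_number, pos_in_codon, CODON_TABLE[ref_codon], CODON_TABLE[alt_codon])
# prints: 272 3 Asp Glu
```

Coding position 816 is the third base of codon 272, so a C>G change there turns an Asp codon into a Glu codon, consistent with the p.Asp272Glu annotation.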

Although the reference assembly does not provide the most common alleles for all loci, the GRC does make an effort to ensure that reference alleles are not universally rare (defined for reference purposes as those with a global MAF < 5%), provided that it can do so while representing a biologically valid haplotype and a functional allele. Data from the 1000 Genomes Project revealed that the "C" allele in GRCh38 has a global MAF of 0.017, well below this threshold. Thus, this allele was in scope for an update.
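The scoping rule described above can be written as a simple check. This is only a sketch of the stated criterion, not a GRC tool: the function name and the `valid_haplotype_available` flag are hypothetical, and the flag stands in for the manual curation step that confirms a biologically valid haplotype and functional allele exist.

```python
RARE_MAF_THRESHOLD = 0.05  # global MAF below 5% counts as universally rare


def in_scope_for_update(ref_allele_global_freq: float,
                        valid_haplotype_available: bool = True) -> bool:
    """Return True if the reference allele is universally rare and a
    biologically valid replacement haplotype is available."""
    is_rare = ref_allele_global_freq < RARE_MAF_THRESHOLD
    return is_rare and valid_haplotype_available


# The GRCh38 "C" allele of rs539351 has a global frequency of 0.017.
print(in_scope_for_update(0.017))  # prints: True
```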

The GRC used sequence from ABBA01022081.1, a component of the HuRef assembly, as a new assembly component to provide the more common G allele at this position (Figure 1, bottom). We used haplotype information provided by Ensembl to confirm that the new coding representation is biologically valid (GCNT2: 272D>E). This update is now included in the fix patch (KZ208911.1). It should make it easier to review the results of variation analyses that use the reference assembly as a model. The GRC continues to make these base updates for GRCh38. If you have questions or concerns about this process, let us know.

Figure 1 Top: Zoomed-in graphical view of the GCNT2 gene in GRCh38. The assembly sequence is shown at the top. The GCNT2 gene is shown in green. The reference allele D272 is a minor allele (brown box). Bottom: Zoomed-in graphical view of HG2057_PATCH, which represents the more common allele (G) from ABBA01022081.1 (red box).


  1. Reid M., et al. The Blood Group Antigen FactsBook (3rd Edition), 603–608 (2012)

Wednesday, June 28, 2017

Improvements in the 5S rRNA gene cluster on chromosome 1q42.11-q42.13

Sequence updates that improve gene representation in the human reference genome assembly are priorities for the GRC. The recent GRCh38.p11 patch release includes a newly curated representation of the 5S rRNA gene cluster (RN5S1@) located on chromosome 1q42.11-q42.13. The 5S ribosomal RNA (rRNA) is a component of the large subunit of the ribosome in all organisms. In humans, the 5S rRNA cluster consists of individual rRNA genes repeated in head-to-tail orientation with non-rRNA sequences in the spacer regions. The number of 5S rRNA repeats per haploid human genome is highly polymorphic, ranging from 35 to 175 (1).

The repetitive clustered nature of the 5S rRNA region has long complicated both its sequencing and assembly, and its representation is incomplete in GRCh37 and GRCh38, the last two major reference assembly versions, though in different ways. The underlying components AL139288.15 (RP5-915N17) and AL713899.14 (RP4-621O15) provide the sequence for the 5S rRNA region in both assemblies. In GRCh37, a false alignment between repeat copies in the two components led to a contiguous, but collapsed, representation. In GRCh38, the false alignment was broken and a default 50 kb gap was inserted in chromosome 1 (CM000663.2/NC_000001.11) at 228,558,365 bp as a placeholder for the missing sequence (Figure 1, top). The GRCh38 representation of the cluster includes only 19 5S rRNA gene unit copies (17 functional and 2 pseudogenes) (Figure 1, top).

The fix patch (KZ208906.1) included in the GRCh38.p11 release now provides a contiguous and validated representation of the 5S rRNA genomic region. The patch closes the assembly gap and replaces the 5S rRNA copies from AL139288.15 and AL713899.14 with sequences from AC275639.1 (CH17-275P10), a BAC clone that completely spans the cluster (Figure 1, bottom). The patch provides 35 copies (34 functional and a single pseudogene) of 5S rRNA genes (Figure 1, bottom). The haplotype represented in this clone has been verified by BioNano optical map data for the haploid CHM1 sample, from which the clone library was derived. This new representation should serve as an improved substrate for analysis of the region, including read alignment and variation analysis.

Figure 1 Top: 5S rRNA region in GRCh38. Incomplete representation of 5S rRNA gene cluster in GRCh38 due to an assembly gap. Bottom: 5S rRNA fix patch in GRCh38.p11. The gap is closed and a complete representation of the 5S rRNA is provided.

  1. Stults, DM. et al. Genome Res. 18(1):13-8 (2008)

Tuesday, May 23, 2017

GRCz11 – the latest zebrafish reference genome assembly

After 2.5 years of assembly curation, the GRC is proud to present the new zebrafish reference genome assembly, GRCz11.

This latest assembly has been refined by the addition of nearly 1000 finished clone sequences and by the resolution of more than 400 assembly issues. This resulted in a significant reduction in scaffold numbers (3399 to 1905) and an increase in scaffold N50 (2.18 Mb to 7.5 Mb), whilst the overall genome size was not affected. Figure 1 shows an overview of contig and scaffold N50s over time, indicating the advance in assembly curation.
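For readers less familiar with the N50 statistic, the following Python sketch computes it using the common definition (the largest length L such that sequences of length L or more contain at least half of the total bases). The lengths are made-up toy values, not GRCz10 or GRCz11 data. They illustrate how joining scaffolds raises N50 while leaving the total size unchanged.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Toy example (hypothetical lengths): the same 100 units of sequence,
# before and after joining the two largest scaffolds.
before = [10, 20, 30, 40]
after = [10, 20, 70]

print(sum(before), n50(before))  # prints: 100 30
print(sum(after), n50(after))    # prints: 100 70
```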

Figure 1: Contig vs. scaffold N50s for zebrafish reference genome assemblies. Release dates: Zv7: 2008, Zv8: 2009, Zv9: 2010, GRCz10: 2014, GRCz11: 2017.

Alignments of 16133 RefSeq sequences showed a further improvement over past assemblies: only 31 sequences could not be found (down from 34), 105 transcripts are still split between locations (down from 205) and only 441 exhibit less than 95% CDS coverage (down from 566). Figure 2 shows an example of an improved region, correcting the representation of two genes.

Figure 2: gEVAL screenshot of the supt4h1 gene (red arrow) in GRCz10 (top) and GRCz11 (bottom). In GRCz10 the supt4h1 gene on chromosome 5 is incomplete, missing its first exon, and surrounded by a truncated duplicated copy of rnf150b (blue arrow). In GRCz11, the supt4h1 gene is complete and neighbours the hsf5 gene, as seen in other vertebrates, whereas the rnf150b gene is now complete and present as a single copy on chromosome 23. gEVAL, the GRC’s genome assembly evaluation browser, indicates completeness of genes and other features via colours (green > 98% coverage, yellow = 50-98% coverage, red < 50% coverage).
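The colour scheme in the caption amounts to a small threshold rule, sketched here in Python. This is an illustration of the thresholds as stated, not gEVAL's own code. The function name is hypothetical, and treating exactly 98% as yellow and exactly 50% as yellow is an assumption based on the caption's wording.

```python
def coverage_colour(coverage_percent: float) -> str:
    """Map a feature's percent coverage to the colour described in the caption."""
    if coverage_percent > 98:
        return "green"
    if coverage_percent >= 50:
        return "yellow"
    return "red"


for value in (100, 98, 75, 49.9):
    print(value, coverage_colour(value))
```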

GRCz11 was built as described previously using clone sequences ordered and oriented according to genetic markers and BioNano data, the latter greatly influencing the scaffolding. Remaining gaps were filled with selected contigs from whole genome sequencing assemblies, mainly WGS31, and in a few cases, WGS32.

For the first time in a zebrafish assembly, GRCz11 also features alternate loci scaffolds (ALT_REF_LOCI). The alternate loci provide variant sequence representations for certain genomic regions. They were selected from a pool of 1895 finished clones that mapped to an assembly region already occupied by clone sequence and were therefore not included in the primary chromosomal path. All surplus clones that exhibited at least 5 kb of unique sequence not present in the primary chromosomal path were added to the assembly as alternate loci scaffolds, totaling 186 Mb of additional sequence in 1150 clones. The alignments of the alternate loci scaffolds to the primary chromosomal path are also included in the GRCz11 assembly to provide the chromosome context for these alternate sequences. The alternate loci will be represented in genome browsers in the same way as human and mouse ALT_REF_LOCI. Additional smaller scale variation will be submitted to dbSNP/dbVar/EVA.

Following the release of this assembly, the Sanger team is transferring maintenance of the zebrafish reference to ZFIN, a new GRC member. ZFIN will take on the future curation of the assembly, and invites user reports on assembly issues. Future updates to the assembly will be issued as patch releases, adding sequence but not impacting the chromosomal coordinates.