Chapter: Understanding Human Genome Versions

The reference genome is one of the most fundamental tools in bioinformatics. It serves as the coordinate system for almost all genomic analysis, from aligning sequencing reads to identifying variants. However, there is no single human reference: multiple versions exist, and understanding their relationships and differences is critical for anyone working in genomics.

1. A Brief History of the Human Reference Genome

The first human genome reference was a monumental achievement, but it was incomplete, full of gaps, and built from the DNA of just a few anonymous donors. Over time, this reference has been improved through a series of versions released by the Genome Reference Consortium (GRC).

Two major versions dominate the field today:

GRCh37, also known as hg19 (UCSC naming)
GRCh38, also known as hg38

These references form the basis for billions of dollars' worth of research, clinical pipelines, and diagnostics.

2. GRCh37 vs. GRCh38 (hg19 vs. hg38)

The GRCh identifiers come from the GRC, while the hg names are used primarily by the UCSC Genome Browser. Although GRCh37 and hg19 refer to the same underlying assembly, the two distributions are not identical: chromosome names differ ("1" vs. "chr1"), annotations depend on the source, and the original UCSC hg19 release carries an older mitochondrial sequence than GRCh37.

GRCh37 / hg19: Released in 2009, still widely used in many labs and legacy databases.
GRCh38 / hg38: Released in 2013, includes corrected misassemblies, improved centromere representation, and more alternate loci.

3. The Rise of the Telomere-to-Telomere (T2T) Assembly

While GRCh38 remains the most current GRC version, it still contains unresolved gaps — especially in repetitive or structurally complex regions like centromeres.

The T2T-CHM13 assembly, published in 2022, is the first gapless assembly of a human genome:

It spans entire chromosomes from end to end (telomere to telomere).
It includes previously missing regions such as centromeres and satellite DNA.
T2T does not yet replace GRCh38 in most tools due to limited annotations and ecosystem support.

4. What’s in a Reference Genome?

Each reference genome includes:

Chromosomes 1 to 22: The autosomes.
chrX and chrY: The sex chromosomes.
chrM (mitochondrial DNA): Special in being circular, and very compact (~16.5 kb).
Alternate loci and patches: Supplementary representations for highly variable or complex regions.

4.1 chrX and chrY

These chromosomes differ between biological sexes. While chrX is present in both males and females, chrY is typically present only in males. The two chromosomes share small homologous regions called pseudoautosomal regions (PARs), which pair and recombine like autosomes.

4.2 Mitochondrial DNA: chrM

The mitochondrial genome is distinct in several ways:

It is circular, not linear.
It is inherited maternally.
It is present in hundreds to thousands of copies per cell.
Its short size makes it a popular target for forensic and ancestry analysis.
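Because chrM is circular, a region of interest can wrap past the end of the linear record stored in the reference file. The following is a minimal sketch of wrap-around handling; the helper names `circular_slice` and `circular_distance` are hypothetical, and the 16,569 bp length assumes the revised Cambridge Reference Sequence (rCRS).

```python
MT_LENGTH = 16569  # assumed length of the rCRS mitochondrial reference, in bp


def circular_slice(seq: str, start: int, length: int) -> str:
    """Return `length` bases from 0-based `start`, wrapping around the end of `seq`."""
    n = len(seq)
    return "".join(seq[(start + i) % n] for i in range(length))


def circular_distance(a: int, b: int, n: int = MT_LENGTH) -> int:
    """Shortest distance between two 0-based positions on a circle of size `n`."""
    d = abs(a - b) % n
    return min(d, n - d)


toy = "ACGTACGGTT"  # toy circular sequence, not real mtDNA
print(circular_slice(toy, 8, 4))     # TTAC
print(circular_distance(10, 16560))  # 19
```

Most aligners treat chrM as linear, so reads spanning the artificial breakpoint may be clipped or split; the sketch only illustrates the coordinate arithmetic.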

5. Coordinate Systems and Naming Conventions

Within the same version of the genome, you may encounter different naming conventions for chromosomes:

"1" — simple numeric form (used by Ensembl and many pipelines)
"chr1" — used by UCSC tools and some aligners
"NC_000001.11" — RefSeq accession format

While these all point to the same chromosome, mixing formats in one analysis can lead to errors. Tools often require matching conventions between reference genomes and annotation files.
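A mismatch in naming can often be caught before an analysis starts. The sketch below converts between the numeric and "chr"-prefixed styles and reports names that appear in an annotation but not in the reference. The function names are hypothetical, and only the primary chromosomes are handled; it assumes the common convention that the mitochondrion is "MT" in the numeric style and "chrM" in the UCSC style.

```python
def to_ucsc(name: str) -> str:
    """Convert a numeric-style name ("1", "X", "MT") to UCSC style ("chr1", "chrX", "chrM")."""
    if name.startswith("chr"):
        return name
    return "chrM" if name == "MT" else f"chr{name}"


def to_numeric(name: str) -> str:
    """Convert a UCSC-style name to the numeric style used by Ensembl."""
    if not name.startswith("chr"):
        return name
    return "MT" if name == "chrM" else name[3:]


def missing_from_reference(reference_names, annotation_names):
    """Return annotation chromosome names that the reference does not contain."""
    ref = set(reference_names)
    return sorted(n for n in set(annotation_names) if n not in ref)


print(to_ucsc("MT"))                                             # chrM
print(to_numeric("chrX"))                                        # X
print(missing_from_reference(["chr1", "chr2"], ["1", "chr2"]))   # ['1']
```

A non-empty result from `missing_from_reference` is a cheap early warning that two files follow different conventions.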

6. Patches and Alternate Loci

Even after a genome version is released, the GRC may issue patches:

Fix patches correct errors in the primary assembly.
Novel patches add new alternate loci, that is, alternative representations of complex regions (e.g., the MHC locus).

These updates do not change the coordinate system, so analyses remain compatible, but additional sequences may appear in alignments or reference databases.

7. Base Representation and IUPAC Codes

While the reference genome is used as a “standard,” it does not necessarily represent the most common allele at each position. It is a composite built from multiple individuals and regions.

In some contexts, ambiguous bases are represented using IUPAC codes. The two-base codes and N are listed here; the three-base codes (B, D, H, V) are omitted:

IUPAC Code	Possible Bases
R	A or G
Y	C or T
S	G or C
W	A or T
K	G or T
M	A or C
N	Any base (A/C/G/T)

Some reference sequences use these codes, especially in alternate loci or uncertain regions, but many pipelines prefer a reference with a single concrete base at each position, even when that base is not the most common allele.
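The table above translates directly into a lookup. As a small sketch (the names `IUPAC`, `matches`, and `expand` are illustrative, and only the codes listed in the table are covered), the following checks whether an observed base is compatible with a code and enumerates the concrete sequences an ambiguous string stands for.

```python
from itertools import product

# Codes from the table above, plus the four unambiguous bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "N": "ACGT",
}


def matches(code: str, base: str) -> bool:
    """True if the concrete `base` is one of the bases allowed by the IUPAC `code`."""
    return base.upper() in IUPAC[code.upper()]


def expand(seq: str):
    """List every concrete sequence that an IUPAC-coded sequence can represent."""
    return ["".join(p) for p in product(*(IUPAC[c] for c in seq.upper()))]


print(matches("R", "G"))  # True
print(matches("Y", "A"))  # False
print(expand("RY"))       # ['AC', 'AT', 'GC', 'GT']
```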

8. The Reference Genome in Practice

8.1 Variant Calling and VCF Files

The reference genome plays a key role in variant calling:

Sequencing reads are aligned to the reference, and the alignments are stored in a format such as BAM.
Differences between the reads and the reference are reported as variants (in VCF format).
The reference base at each position is recorded in the VCF alongside the observed alternate base(s).
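Since every VCF record carries the reference base, a record can be checked against the reference sequence it claims to describe. The sketch below does this for a toy in-memory reference; the function names and the toy data are assumptions for illustration, and a real pipeline would read the sequence from an indexed FASTA file instead.

```python
def parse_vcf_line(line: str):
    """Extract CHROM, POS, REF and the list of ALT alleles from one VCF data line."""
    chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return chrom, int(pos), ref, alt.split(",")


def ref_matches(reference: dict, chrom: str, pos: int, ref: str) -> bool:
    """Check that REF equals the reference sequence at POS (VCF positions are 1-based)."""
    seq = reference[chrom]
    return seq[pos - 1 : pos - 1 + len(ref)].upper() == ref.upper()


reference = {"chr1": "ACGTTGCA"}            # toy reference, not real sequence
record = "chr1\t4\t.\tT\tC,G\t50\tPASS\t."  # toy VCF data line

chrom, pos, ref, alts = parse_vcf_line(record)
print(alts)                                     # ['C', 'G']
print(ref_matches(reference, chrom, pos, ref))  # True
```

Many REF mismatches across a file usually mean the VCF was produced against a different assembly than the one being checked.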

8.2 Why Version Matters

Analysis pipelines are tightly coupled to the reference version:

A VCF called against hg19 is not directly comparable to one called against hg38 without liftover.
Annotation tools (e.g., SnpEff, VEP) require reference-specific databases.
Using the wrong version can result in incorrect variant positions or gene annotations.
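When the provenance of a file is unclear, chromosome lengths recorded in its header can hint at the assembly. The following sketch reads `##contig` lines from a VCF header and compares the length of chromosome 1 with the known values for GRCh37 and GRCh38. The function name `guess_build` is hypothetical, the comma-splitting parser is deliberately simple and does not handle quoted values that contain commas, and an "unknown" result only means that no matching contig line was found.

```python
import re

# Length of chromosome 1 (bp) in the two GRC assemblies discussed here.
CHR1_LENGTHS = {249250621: "GRCh37/hg19", 248956422: "GRCh38/hg38"}

CONTIG_RE = re.compile(r"^##contig=<(.*)>$")


def guess_build(header_lines) -> str:
    """Guess the assembly from the chr1 length in VCF '##contig' header lines."""
    for line in header_lines:
        m = CONTIG_RE.match(line.strip())
        if not m:
            continue
        # Simplified key=value parsing; assumes no commas inside quoted values.
        fields = dict(kv.split("=", 1) for kv in m.group(1).split(",") if "=" in kv)
        if fields.get("ID") in ("1", "chr1") and "length" in fields:
            return CHR1_LENGTHS.get(int(fields["length"]), "unknown")
    return "unknown"


header = ["##fileformat=VCFv4.2", "##contig=<ID=chr1,length=248956422>"]
print(guess_build(header))  # GRCh38/hg38
```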

8.3 Current Practices

Despite the release of GRCh38 over a decade ago, many labs still use hg19:

Legacy databases and regulatory pipelines are built around it.
Switching to a new version requires time-consuming validation.

That said:

New sequencing projects generally use hg38.
hg19 is often considered sufficient for exome sequencing, where the targets are well-characterized coding regions that changed little between versions.
For whole-genome sequencing, hg38 is preferred due to improved structure and reduced bias.
T2T is promising but currently lacks broad support in annotations and tools.

🔬 Advanced: Liftover — A Way to Convert Coordinates Between Versions

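Sometimes you need to translate coordinates between genome versions. This is possible using dedicated liftover tools, which rely on a chain file describing how regions of one assembly align to the other.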

However, not all positions can be lifted. Reasons include:

Sequence has changed significantly between versions.
The region does not exist in the target assembly.
The original coordinate maps to multiple locations in the new genome.

In these cases, the tool may drop the record or mark it as "unmapped." Always verify your results when lifting VCF, BED, or GTF files.
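The core idea can be sketched with a handful of aligned blocks. This toy example is not how any particular tool is implemented: the block format, the function names, and the coordinates are invented for illustration, and strand changes and multiple overlapping mappings are ignored. It shows why positions that fall outside every aligned block end up unmapped.

```python
from typing import Dict, List, Optional, Tuple

# One aligned block: (source_start, source_end, target_start), 0-based, half-open.
Block = Tuple[int, int, int]


def lift_position(pos: int, blocks: List[Block]) -> Optional[int]:
    """Map one 0-based position; return None if it lies in no aligned block."""
    for src_start, src_end, dst_start in blocks:
        if src_start <= pos < src_end:
            return dst_start + (pos - src_start)
    return None


def lift_all(positions: List[int], blocks: List[Block]):
    """Lift many positions, keeping the ones that cannot be mapped in a separate list."""
    lifted: Dict[int, int] = {}
    unmapped: List[int] = []
    for p in positions:
        new = lift_position(p, blocks)
        if new is None:
            unmapped.append(p)
        else:
            lifted[p] = new
    return lifted, unmapped


# Toy mapping: source positions 100-149 have no counterpart in the target.
toy_blocks = [(0, 100, 0), (150, 300, 120)]
lifted, unmapped = lift_all([10, 120, 200], toy_blocks)
print(lifted)    # {10: 10, 200: 170}
print(unmapped)  # [120]
```

Keeping the unmapped records in their own list, rather than silently discarding them, makes it possible to report how much was lost.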

⚠️ Liftover is lossy. Never assume a perfect 1-to-1 mapping between assemblies.

🧬 Advanced: Known GRCh38 Issues — Duplications, Haplotypes, and Mapping Pitfalls

While GRCh38 improved many aspects of the genome, it also introduced some new complexities:

1. Duplicated Regions

GRCh38 includes regions that appear more than once, such as:

Segmental duplications
Pseudogenes
Duplicated contigs (e.g. on chr21, chr22)

If read mapping is not performed carefully, reads can align to the wrong copy of a duplicated region. This leads to:

Misaligned reads
Apparent homozygosity
Missing variants

This is especially problematic when using BWA-MEM with default parameters. Special flags like -a (report all alignments) or using more robust aligners like DRAGEN or Giraffe may help in certain regions.
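One way to spot affected regions is to look for reads with very low mapping quality, since aligners such as BWA-MEM typically assign a MAPQ of 0 to reads that fit equally well in more than one place. The sketch below counts such reads per contig from SAM text lines. The function name and the toy records are assumptions, and the threshold is a parameter to tune rather than a recommendation.

```python
from collections import Counter


def low_mapq_by_contig(sam_lines, threshold: int = 1) -> Counter:
    """Count mapped primary alignments with MAPQ below `threshold`, per contig."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname, mapq = int(fields[1]), fields[2], int(fields[4])
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800) records.
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        if mapq < threshold:
            counts[rname] += 1
    return counts


toy_sam = [
    "@SQ\tSN:chr21\tLN:100",
    "r1\t0\tchr21\t10\t0\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r2\t0\tchr21\t20\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r3\t256\tchr22\t20\t0\t5M\t*\t0\t0\tACGTA\tIIIII",
]
print(dict(low_mapq_by_contig(toy_sam)))  # {'chr21': 1}
```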

2. Alternate Haplotypes

GRCh38 includes over 200 alternate loci — these are alternative representations of complex, variable regions (e.g., the MHC, KIR cluster).

These loci are not on the primary chromosomes.
They may be included or excluded during mapping depending on the index used.
This creates ambiguity: where should a read go — the primary locus or the alternate?

Incorrect mapping in these areas can lead to:

False negatives (variants not detected)
Overrepresentation of certain alleles
Misinterpretation of zygosity

3. Recommendations

Use genome indices that are aware of alternate loci.
If your downstream pipeline does not support alt-aware mapping, avoid GRCh38 builds that include alternate loci and use a primary-assembly-only version of the reference instead.
Be cautious when analyzing immune-related regions (MHC, TCR, KIR).
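Before choosing a mapping strategy, it helps to know what a given reference file actually contains. The sketch below classifies contig names taken from the first column of a FASTA index (.fai) file. It assumes the suffix conventions seen in commonly distributed UCSC-style hg38 files ("_alt", "_fix", "_random", "_decoy", the "chrUn_" prefix, and "HLA-" names); other distributions may name these sequences differently, and the function names and toy index lines are illustrative.

```python
from collections import Counter

PRIMARY = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY", "chrM"}


def classify_contig(name: str) -> str:
    """Classify a UCSC-style hg38 contig name by its prefix or suffix."""
    if name in PRIMARY:
        return "primary"
    if name.endswith("_alt"):
        return "alt"
    if name.endswith("_fix"):
        return "fix_patch"
    if name.endswith("_decoy"):
        return "decoy"
    if name.startswith("HLA-"):
        return "hla"
    if name.startswith("chrUn_") or name.endswith("_random"):
        return "unplaced_or_unlocalized"
    return "other"


def summarize_fai(fai_lines) -> Counter:
    """Count contig categories from .fai lines (contig name is the first column)."""
    return Counter(
        classify_contig(line.split("\t", 1)[0]) for line in fai_lines if line.strip()
    )


# Toy index lines; the lengths are placeholders, not real contig lengths.
toy_fai = ["chr1\t100", "chr6_GL000250v2_alt\t100", "chrUn_KI270302v1\t100"]
print(dict(summarize_fai(toy_fai)))
```

If the summary shows any "alt" contigs, the aligner and downstream tools need to be alt-aware, or a reference without those contigs should be used.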

🧠 GRCh38 is more complete, but also more complex. Know your tools and whether they handle alt sequences, decoys, and duplicate regions correctly.

9. The Future: Graph Genomes

Linear references are inherently limited in representing human diversity. The emerging solution is the graph genome:

Instead of a single path through the genome, a graph can represent multiple alleles and structural variations.
This reduces reference bias and improves variant discovery, especially in underrepresented populations.

Projects like the Human Pangenome Reference Consortium are actively developing such models.
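To make the idea concrete, here is a toy sequence graph with a single SNP "bubble". It is a sketch only: the data structures and names are invented for illustration and do not follow the format of any real pangenome toolkit. Each node holds a piece of sequence, edges say which nodes may follow each other, and each path through the graph spells one possible haplotype.

```python
# Node ID -> sequence. Nodes 2 and 3 are the two alleles of one SNP.
nodes = {1: "ACG", 2: "T", 3: "C", 4: "GGA"}
# Node ID -> list of nodes that may follow it.
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}


def spell(path):
    """Concatenate node sequences along `path`, checking that every step is an edge."""
    for a, b in zip(path, path[1:]):
        if b not in edges[a]:
            raise ValueError(f"no edge from node {a} to node {b}")
    return "".join(nodes[n] for n in path)


def all_paths(start, end):
    """Enumerate every path from `start` to `end` (fine for a small acyclic toy graph)."""
    if start == end:
        return [[end]]
    return [[start] + rest for nxt in edges[start] for rest in all_paths(nxt, end)]


print([spell(p) for p in all_paths(1, 4)])  # ['ACGTGGA', 'ACGCGGA']
```

A read carrying either allele matches a path in this graph exactly, whereas a linear reference would contain only one of the two sequences.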

10. Summary

The human genome reference is central to modern bioinformatics. Understanding its versions, structure, and naming conventions is essential for accurate analysis.

Version	Description	Notes
GRCh37/hg19	Older, widely used	Still standard in many pipelines
GRCh38/hg38	Improved, more complete	Preferred for new analyses
T2T-CHM13	Telomere-to-telomere, complete	Limited tool and annotation support

✅ Always align your tools, data, and databases to the same genome version. A mismatch can break your analysis.
