Representing Genetic Variants

When working with genomic data, it's essential to understand the different ways in which variants can be represented. This page focuses on:

SNVs: Single Nucleotide Variants
MNVs: Multi-Nucleotide Variants
Small indels: Small insertions or deletions (usually up to ~50 bp)

These are the most common types of genetic variants in research and clinical practice.

Genomic Representation

One of the most explicit and common ways to describe a variant is the chromosome-position-reference-alternate format:

chr1:123456:A:T

This states that on chromosome 1, position 123,456 (1-based), the reference genome has an A, and the sample has a T.
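As a minimal sketch of how such a string might be handled in code, the snippet below splits it into fields and labels the variant by comparing allele lengths. The names `Variant`, `parse_variant`, and `classify` are illustrative and not part of any library, and the classification rule is a simplification based only on the definitions at the top of this page.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int  # 1-based position, as in the example above
    ref: str
    alt: str


def parse_variant(key: str) -> Variant:
    """Split a 'chrom:pos:ref:alt' string into typed fields."""
    chrom, pos, ref, alt = key.split(":")
    return Variant(chrom, int(pos), ref.upper(), alt.upper())


def classify(variant: Variant) -> str:
    """Rough label based on allele lengths only (a simplification)."""
    if len(variant.ref) == 1 and len(variant.alt) == 1:
        return "SNV"
    if len(variant.ref) == len(variant.alt):
        return "MNV"
    return "indel"


v = parse_variant("chr1:123456:A:T")
print(v, classify(v))
```

The parser assumes the colon-separated form shown above; other separators (such as `-`) would need their own handling.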

Coordinate Systems: 0-based vs 1-based

A critical detail:

VCF format uses 1-based coordinates.
BED uses 0-based, half-open intervals, as do some tools.
SPDI also uses 0-based positions.

Be careful: mixing these conventions shifts positions by one base and can lead to incorrect interpretations.
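The conversion between the two conventions is a one-line offset, which is exactly why it is easy to get wrong. The sketch below shows one way to go from a 1-based VCF position to a 0-based, half-open BED-style interval and back; the function names are hypothetical.

```python
from typing import Tuple


def vcf_to_bed_interval(pos: int, ref: str) -> Tuple[int, int]:
    """Convert a 1-based VCF POS and REF allele to a 0-based, half-open interval."""
    start = pos - 1          # shift from 1-based to 0-based
    end = start + len(ref)   # half-open: end is one past the last reference base
    return start, end


def bed_start_to_vcf_pos(start: int) -> int:
    """Convert a 0-based start coordinate back to a 1-based position."""
    return start + 1


print(vcf_to_bed_interval(123456, "A"))  # (123455, 123456)
print(bed_start_to_vcf_pos(123455))      # 123456
```

For a single-base reference allele the interval has length 1; for a longer reference allele the interval covers every reference base.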

SPDI: Sequence Position Deletion Insertion

SPDI is a machine-friendly model developed by NCBI. Its structure:

NC_000001.11:123456:A:T

NC_000001.11: Reference sequence (genome build and contig)
123456: 0-based position (the same base is at 1-based position 123457 in VCF)
A: Deleted sequence
T: Inserted sequence

SPDI avoids ambiguity and is suited for pipelines and APIs.
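To illustrate the four fields and the 0-based offset, here is a small sketch that builds an SPDI-style string from 1-based, VCF-style fields and parses one back. The helper names are made up for this example, and the code does no normalization or validation against the reference sequence, so its output is not guaranteed to be the canonical SPDI that NCBI would report for an indel.

```python
from typing import Tuple


def vcf_to_spdi(accession: str, pos: int, ref: str, alt: str) -> str:
    """Build an SPDI-style string from a 1-based position (no normalization)."""
    return f"{accession}:{pos - 1}:{ref}:{alt}"


def parse_spdi(spdi: str) -> Tuple[str, int, str, str]:
    """Split an SPDI string into (sequence, 0-based position, deletion, insertion)."""
    sequence, position, deletion, insertion = spdi.split(":")
    return sequence, int(position), deletion, insertion


print(vcf_to_spdi("NC_000001.11", 123456, "A", "T"))  # NC_000001.11:123455:A:T
print(parse_spdi("NC_000001.11:123456:A:T"))
```

Note that the first call uses the 1-based position from the chromosome-position example, so the SPDI position comes out one lower.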

🧬 All variant representations must be interpreted in the context of a specific reference genome (e.g., GRCh37 or GRCh38).

HGVS: Human Genome Variation Society Nomenclature

HGVS notation is human-readable and used to describe variants relative to sequences at different levels:

g. – Genomic (e.g., NC_000001.11:g.123456A>T)
c. – Coding DNA (e.g., NM_000059.3:c.2207A>T)
n. – Non-coding transcript (e.g., NR_046018.2:n.123A>G)
p. – Protein (e.g., NP_000050.2:p.Lys736Asn)

⚠️ HGVS notations must include the reference sequence accession (genomic, transcript, or protein) and its version. For example, p.Lys736Asn is not valid on its own; it must be tied to a specific protein sequence (e.g., NP_000050.2:p.Lys736Asn).
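A quick way to enforce the accession requirement in code is to reject any description that lacks the `accession.version:` prefix. The regular expression below is a deliberately narrow sketch: it only recognizes simple nucleotide substitutions of the form used in the examples on this page (`g.`, `c.`, `n.`), not the full HGVS grammar, and the names `HGVS_SUBSTITUTION` and `parse_simple_hgvs` are hypothetical.

```python
import re
from typing import Optional

# Matches only simple substitutions such as NM_000059.3:c.2207A>T.
# Intronic offsets, UTR positions, deletions, insertions, and protein
# descriptions are intentionally out of scope for this sketch.
HGVS_SUBSTITUTION = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+\.\d+):"
    r"(?P<kind>[gcn])\."
    r"(?P<position>\d+)"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_simple_hgvs(text: str) -> Optional[dict]:
    """Return the parsed fields, or None if the text does not match."""
    match = HGVS_SUBSTITUTION.match(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields["position"] = int(fields["position"])
    return fields


print(parse_simple_hgvs("NM_000059.3:c.2207A>T"))
print(parse_simple_hgvs("c.2207A>T"))  # None: no accession
```

For real workloads a dedicated HGVS parsing and validation tool is a better choice than a hand-written pattern, since the full nomenclature has many more forms.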

Transcript Identifiers and Versions

HGVS relies on reference transcripts. Two major sources are:

RefSeq:
- NM_000059.3 → NM: mRNA, .3: version
Ensembl:
- ENST00000380152.4 → ENST: transcript, .4: version

Transcript versions can change due to reannotation or sequence corrections. Gene symbols (e.g., BRCA1) can also change over time. This complicates reproducibility and variant comparison.
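Because the version is part of the identifier, it helps to treat it as a separate field when comparing annotations. The following sketch splits an accession into its base and version and flags two identifiers that refer to the same transcript at different versions. Both function names are illustrative.

```python
from typing import Optional, Tuple


def split_accession(accession: str) -> Tuple[str, Optional[int]]:
    """Split 'NM_000059.3' into ('NM_000059', 3); version is None if absent."""
    base, sep, version = accession.rpartition(".")
    if not sep or not version.isdigit():
        return accession, None
    return base, int(version)


def same_transcript_different_version(a: str, b: str) -> bool:
    """True when two accessions share a base but differ in version."""
    base_a, version_a = split_accession(a)
    base_b, version_b = split_accession(b)
    return base_a == base_b and version_a != version_b


print(split_accession("NM_000059.3"))        # ('NM_000059', 3)
print(split_accession("ENST00000380152.4"))  # ('ENST00000380152', 4)
```

A check like this can warn when two datasets describe a variant against different versions of a transcript, in which case positions may not be directly comparable.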

Variant Identifiers and Registries

dbSNP (rsID)

rs123456: A commonly seen identifier.
Issues:
- Can be merged or removed.
- May represent multiple alternate alleles.
- Some alleles under the same rsID may be benign and others pathogenic, so an rsID alone does not always identify a specific allele.

ClinVar Allele ID

Numeric, allele-specific identifiers assigned by ClinVar.
More precise than rsIDs, but scoped to ClinVar: only alleles present in ClinVar have one, so they are not a universal identifier.

ClinGen Allele Registry (CA ID)

Format: CA123456789
Globally unique, stable identifier.
Supports mapping across genome builds, transcripts, and notation systems.
Recommended for use in production systems and variant databases.

Historical / Colloquial Names

Example: BRCA1 185delAG (described as c.68_69del in current HGVS notation)
Still seen in older literature or clinical reports.
Not always interpretable by software and may be ambiguous or incorrect.

Choosing the Right Representation

| Use case | Recommended format |
| --- | --- |
| Human-readable interpretation | HGVS `p.` notation with protein accession (e.g., `NP_000050.2:p.Lys736Asn`) |
| Precise computational processing | `chr-pos-ref-alt` or SPDI |
| Stable ID across builds | ClinGen Allele Registry ID (`CA...`) |
| VCF-compatible tools | `chr-pos-ref-alt` (1-based) |
| Database interoperability | CA ID + HGVS `p.` or SPDI |

✅ Best practice: combine a unique identifier (e.g., a CA ID, or chr-pos-ref-alt together with the genome build) with a human-readable description (e.g., HGVS p. notation with its protein accession) whenever possible.
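One way to follow this practice in a data model is to store the identifier and the description side by side in a single record. The dataclass below is only a sketch: `VariantRecord` and its fields are hypothetical, and the values in the usage example are the placeholder identifiers used elsewhere on this page, not a real curated variant.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VariantRecord:
    assembly: str                 # reference genome, e.g. "GRCh38"
    chrom: str
    pos: int                      # 1-based, VCF-style
    ref: str
    alt: str
    hgvs: Optional[str] = None    # human-readable description with accession
    ca_id: Optional[str] = None   # ClinGen Allele Registry ID, if known

    @property
    def key(self) -> str:
        """Build-qualified chr-pos-ref-alt key."""
        return f"{self.assembly}:{self.chrom}:{self.pos}:{self.ref}:{self.alt}"

    def label(self) -> str:
        """Unique ID (CA ID if present, else the key) plus the HGVS text."""
        parts = [self.ca_id or self.key]
        if self.hgvs:
            parts.append(self.hgvs)
        return " | ".join(parts)


# Placeholder values taken from the examples on this page.
record = VariantRecord(
    "GRCh38", "chr1", 123456, "A", "T",
    hgvs="NC_000001.11:g.123456A>T",
    ca_id="CA123456789",
)
print(record.key)
print(record.label())
```

Including the assembly in the key keeps the chr-pos-ref-alt form unambiguous when data from different genome builds are mixed.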

Summary

Describing variants is not as straightforward as it might seem. The description of a variant can change over time as annotations improve, reference genomes are updated, and transcripts are redefined. This makes reproducibility challenging, especially across systems or institutions.

To ensure clarity:

Always specify the reference genome.
Always include transcript or contig identifiers and versions.
Prefer stable, globally recognized IDs (e.g., ClinGen CA IDs) when sharing data.
Use HGVS notations with complete metadata when communicating with clinicians or curating databases.
