Representing Genetic Variants

When working with genomic data, it's essential to understand the different ways in which variants can be represented. This page focuses on:

  • SNVs: Single Nucleotide Variants
  • MNVs: Multi-Nucleotide Variants
  • Small indels: Small insertions or deletions (usually up to about 50 bp)

These are the most common types of genetic variants in research and clinical practice.


Genomic Representation

One of the most explicit and common ways to describe a variant is the chromosome-position-reference-alternate format:


chr1:123456:A:T

This states that on chromosome 1, position 123,456 (1-based), the reference genome has an A, and the sample has a T.

Coordinate Systems: 0-based vs 1-based

A critical detail:

  • VCF uses 1-based coordinates.
  • BED uses 0-based, half-open intervals, as do some other tools.
  • SPDI also uses 0-based positions.

Be careful: mixing these conventions shifts a variant by one base and leads to incorrect interpretations.
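
The conversion between the two conventions can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the function names are hypothetical, and it assumes the input position is the 1-based position of the first reference base, as in VCF.

```python
def one_based_to_zero_based(pos: int) -> int:
    """Convert a 1-based position (VCF style) to a 0-based position."""
    if pos < 1:
        raise ValueError("1-based positions start at 1")
    return pos - 1


def vcf_to_bed_interval(pos: int, ref: str) -> tuple[int, int]:
    """Return the 0-based, half-open [start, end) interval covered by REF."""
    start = one_based_to_zero_based(pos)
    return start, start + len(ref)


# The SNV chr1:123456:A:T covers exactly one reference base.
print(vcf_to_bed_interval(123456, "A"))  # (123455, 123456)
```

The half-open end means the interval length is always `end - start`, which is why a single-base variant maps to an interval of width 1.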


SPDI: Sequence Position Deletion Insertion

SPDI is a machine-friendly model developed by NCBI. Its structure:


NC_000001.11:123456:A:T

  • NC_000001.11: Reference sequence (genome build and contig)
  • 123456: 0-based position (the same base is position 123457 in a 1-based system such as VCF)
  • A: Deleted sequence
  • T: Inserted sequence

SPDI avoids ambiguity and is suited for pipelines and APIs.
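
As a rough sketch of how the four fields might be handled in code, the hypothetical `Spdi` class below parses the colon-separated form and builds an SPDI string from 1-based, VCF-style input. It assumes the caller already knows the sequence accession (here `NC_000001.11` for chromosome 1), handles only literal deleted and inserted sequences, and does not normalize indels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spdi:
    sequence: str   # reference sequence accession, e.g. "NC_000001.11"
    position: int   # 0-based position
    deletion: str   # deleted sequence
    insertion: str  # inserted sequence

    @classmethod
    def parse(cls, text: str) -> "Spdi":
        """Split 'sequence:position:deletion:insertion' into its fields."""
        parts = text.split(":")
        if len(parts) != 4:
            raise ValueError(f"expected 4 colon-separated fields, got {len(parts)}")
        sequence, position, deletion, insertion = parts
        return cls(sequence, int(position), deletion, insertion)

    @classmethod
    def from_vcf(cls, sequence: str, pos: int, ref: str, alt: str) -> "Spdi":
        """Build from a 1-based position by shifting it to 0-based."""
        return cls(sequence, pos - 1, ref, alt)

    def __str__(self) -> str:
        return f"{self.sequence}:{self.position}:{self.deletion}:{self.insertion}"


# The 1-based variant chr1:123456:A:T becomes position 123455 in SPDI.
print(Spdi.from_vcf("NC_000001.11", 123456, "A", "T"))
```

Note the one-base shift: a 1-based position of 123456 and a 0-based position of 123455 refer to the same reference base.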

🧬 All variant representations must be interpreted in the context of a specific reference genome (e.g., GRCh37 or GRCh38).


HGVS: Human Genome Variation Society Nomenclature

HGVS notation is human-readable and used to describe variants relative to sequences at different levels:

  • g. – Genomic (e.g., NC_000001.11:g.123456A>T)
  • c. – Coding DNA (e.g., NM_000059.3:c.2207A>T)
  • n. – Non-coding transcript (e.g., NR_046018.2:n.123A>G)
  • p. – Protein (e.g., NP_000050.2:p.Lys736Asn)

⚠️ HGVS notations must include a versioned reference sequence ID (genomic, transcript, or protein). For example, p.Lys736Asn is not valid on its own; it must be tied to a specific reference sequence (e.g., NP_000050.2:p.Lys736Asn).
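
A simple check for the "accession required" rule could look like the sketch below. The regular expression and the `split_hgvs` helper are illustrative assumptions: they cover only the four prefixes listed above and do not validate the change description itself, so they are no substitute for a full HGVS parser.

```python
import re

# Simplified pattern: versioned accession, colon, type prefix, change description.
HGVS_PATTERN = re.compile(
    r"^(?P<accession>[A-Za-z]+_?\d+\.\d+):(?P<kind>[gcnp])\.(?P<change>\S+)$"
)


def split_hgvs(expression: str) -> dict[str, str]:
    """Split an HGVS string into accession, kind (g/c/n/p) and change."""
    match = HGVS_PATTERN.match(expression)
    if match is None:
        raise ValueError(f"not a fully qualified HGVS expression: {expression!r}")
    return match.groupdict()


print(split_hgvs("NP_000050.2:p.Lys736Asn"))

try:
    split_hgvs("p.Lys736Asn")  # no reference sequence, so it is rejected
except ValueError as err:
    print(err)
```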

Transcript Identifiers and Versions

HGVS relies on reference transcripts. Two major sources are:

  • RefSeq:
    • NM_000059.3 → NM: mRNA, .3: version
  • Ensembl:
    • ENST00000380152.4 → ENST: transcript, .4: version

Transcript versions can change due to reannotation or sequence corrections. Gene symbols (e.g., BRCA1) can also change over time. This complicates reproducibility and variant comparison.
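
One way to catch version drift when comparing variants is to split each accession into its base and version. The sketch below is a minimal illustration with hypothetical helper names; the `.4` version used in the example is only there to demonstrate the comparison.

```python
def split_accession(accession: str) -> tuple[str, int]:
    """Split 'NM_000059.3' into ('NM_000059', 3)."""
    base, sep, version = accession.rpartition(".")
    if not sep or not base or not version.isdigit():
        raise ValueError(f"accession has no version suffix: {accession!r}")
    return base, int(version)


def same_transcript_different_version(a: str, b: str) -> bool:
    """True when two accessions share a base but differ in version."""
    base_a, version_a = split_accession(a)
    base_b, version_b = split_accession(b)
    return base_a == base_b and version_a != version_b


print(split_accession("ENST00000380152.4"))
print(same_transcript_different_version("NM_000059.3", "NM_000059.4"))
```

Two HGVS c. descriptions on different versions of the same transcript are not guaranteed to describe the same change, so a flag like this is a cue to re-map rather than compare strings directly.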


Variant Identifiers and Registries

dbSNP (rsID)

  • rs123456: A commonly seen identifier.
  • Issues:
    • Can be merged or removed.
    • May represent multiple alternate alleles.
    • Some alleles may be benign and others pathogenic, so an rsID alone does not always identify a specific allele.

ClinVar Allele ID

  • Numeric, allele-specific identifiers assigned by ClinVar.
  • More precise than rsIDs, but still not globally unique in all contexts.

ClinGen Allele Registry (CA ID)

  • Format: CA123456789
  • Globally unique, stable identifier.
  • Supports mapping across genome builds, transcripts, and notation systems.
  • Recommended for use in production systems and variant databases.

Historical / Colloquial Names

  • Example: BRCA1 185delAG
  • Still seen in older literature or clinical reports.
  • Not always interpretable by software and may be ambiguous or incorrect.
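
When identifiers arrive as free text, a first triage step can be to recognize which scheme a string follows. The sketch below is an assumption-laden illustration: the patterns are inferred only from the formats shown above (rs plus digits, CA plus digits, bare digits), and a match says nothing about whether the identifier actually exists in the registry.

```python
import re

# Patterns inferred from the identifier formats described above.
ID_PATTERNS = {
    "dbSNP rsID": re.compile(r"^rs\d+$"),
    "ClinGen CA ID": re.compile(r"^CA\d+$"),
    "numeric ID (possibly a ClinVar Allele ID)": re.compile(r"^\d+$"),
}


def classify_identifier(text: str) -> str:
    """Return a label for the first matching pattern, else 'unrecognized'."""
    for label, pattern in ID_PATTERNS.items():
        if pattern.match(text):
            return label
    return "unrecognized"


for example in ["rs123456", "CA123456789", "BRCA1 185delAG"]:
    print(example, "->", classify_identifier(example))
```

Colloquial names such as BRCA1 185delAG fall through to "unrecognized", which reflects the point that software cannot reliably interpret them.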

Choosing the Right Representation

| Use Case | Recommended Format |
| --- | --- |
| Human-readable interpretation | HGVS p. notation with its reference sequence (e.g., NP_000050.2:p.Lys736Asn) |
| Precise computational processing | chr-pos-ref-alt or SPDI |
| Stable ID across builds | ClinGen Allele Registry ID (CA...) |
| VCF-compatible tools | chr-pos-ref-alt (1-based) |
| Database interoperability | CA ID + HGVS p. or SPDI |

✅ Best practice: combine a unique ID (e.g., a CA ID or chr-pos-ref-alt) with a human-readable description (e.g., HGVS p. notation with its reference sequence) whenever possible.


Summary

Describing variants is not as straightforward as it might seem. Variant descriptions change over time as annotations improve, reference genomes are updated, and transcripts are redefined. This makes reproducibility challenging, especially across systems or institutions.

To ensure clarity:

  • Always specify the reference genome.
  • Always include transcript or contig identifiers and versions.
  • Prefer stable, globally recognized IDs (e.g., ClinGen CA IDs) when sharing data.
  • Use HGVS notations with complete metadata when communicating with clinicians or curating databases.
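
The checklist above can be turned into a lightweight record check. The `VariantRecord` class below is a hypothetical sketch, not a standard schema: the field names and the rules in `problems()` are assumptions chosen to mirror the four points, and the CA ID shown is the placeholder format used earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VariantRecord:
    assembly: str                 # reference genome, e.g. "GRCh38"
    chrom: str
    pos: int                      # 1-based position
    ref: str
    alt: str
    hgvs: Optional[str] = None    # e.g. "NC_000001.11:g.123456A>T"
    ca_id: Optional[str] = None   # ClinGen Allele Registry ID

    def problems(self) -> list[str]:
        """List checklist items this record fails to satisfy."""
        issues = []
        if not self.assembly:
            issues.append("missing reference genome")
        if self.hgvs is not None:
            accession, sep, _ = self.hgvs.partition(":")
            if not sep or "." not in accession:
                issues.append("HGVS lacks a versioned accession")
        if self.ca_id is None:
            issues.append("no stable identifier (CA ID)")
        return issues


record = VariantRecord("GRCh38", "chr1", 123456, "A", "T",
                       hgvs="NC_000001.11:g.123456A>T", ca_id="CA123456789")
print(record.problems())  # []
```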