Usage Guide for pygenebe
pygenebe is a Python client designed to integrate with the GeneBe platform, offering efficient annotation of genetic variants via its API. It supports pandas DataFrames, VCF files, HGVS parsing, and more, making it a versatile tool for genetic research. Below are detailed usage instructions and examples.
Basic Setup
After installing pygenebe (see the installation guide), import the library to start annotating genetic variants.
import pygenebe
Annotating Variants
Annotating a Single Variant
To annotate a single genetic variant, use the annotate_variant function. Provide the chromosome, position, reference allele, and alternate allele:
result = pygenebe.annotate_variant(chr="6", pos=160585140, ref="T", alt="G")
print(result)
This returns a JSON-like response with annotations such as gene function, population frequencies, and pathogenicity scores.
Example with Additional Options
You can specify the genome build (e.g., GRCh38 or GRCh37) and request specific fields:
result = pygenebe.annotate_variant(
    chr="6",
    pos=160585140,
    ref="T",
    alt="G",
    genome="GRCh38",
    fields=["variant_id", "gene", "consequence"],
)
print(result)
Annotating Variants in a Pandas DataFrame
For batch processing, pygenebe supports pandas DataFrames. Create a DataFrame with columns for chromosome, position, reference allele, and alternate allele:
import pandas as pd
data = pd.DataFrame({
    "chr": ["1", "6"],
    "pos": [10020, 160585140],
    "ref": ["A", "T"],
    "alt": ["G", "G"],
})
annotated = pygenebe.annotate_dataframe(data)
print(annotated)
Customizing DataFrame Annotation
You can customize the annotation by specifying the genome build and desired fields:
annotated = pygenebe.annotate_dataframe(
    df=data,
    genome="GRCh37",
    fields=["variant_id", "gnomad_af", "pathogenicity"],
)
print(annotated)
The result is a DataFrame with additional annotation columns appended.
Annotating Variants from HGVS Notation
pygenebe can parse HGVS (Human Genome Variation Society) notation for variant annotation:
result = pygenebe.annotate_hgvs("NM_000546.5:c.215C>G")
print(result)
This returns annotations for the specified HGVS variant, such as its genomic coordinates and effects.
Annotating a VCF File (CLI)
To annotate variants from a VCF file, use the command-line interface (CLI). The input VCF must be single-allelic (split multi-allelic entries with bcftools if needed):
genebe annotate --input input.vcf.gz --output output.vcf.gz
The output VCF includes additional annotation fields. VCF support requires the cyvcf2 package (pip install cyvcf2).
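Because the input must be single-allelic, it can help to check a file before submitting it. The following is a minimal sketch using only the Python standard library; it is not part of pygenebe, and the helper names open_vcf and multiallelic_records are hypothetical. It relies only on the VCF convention that the fifth column (ALT) separates multiple alleles with commas.

```python
import gzip
from typing import Iterable, Iterator, TextIO, Tuple


def open_vcf(path: str) -> TextIO:
    """Open a plain or gzip-compressed VCF file as text (hypothetical helper)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def multiallelic_records(lines: Iterable[str]) -> Iterator[Tuple[str, int, str]]:
    """Yield (chrom, pos, alt) for records whose ALT column lists several alleles."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record; ignored in this sketch
        chrom, pos, alt = fields[0], fields[1], fields[4]
        if "," in alt:  # comma-separated ALT means a multi-allelic record
            yield chrom, int(pos), alt


# In-memory sample standing in for a real file; with a file on disk use:
#     with open_vcf("input.vcf.gz") as handle:
#         offenders = list(multiallelic_records(handle))
sample = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "6\t160585140\t.\tT\tG\t.\t.\t.\n",
    "7\t92754574\t.\tA\tC,G\t.\t.\t.\n",
]
print(list(multiallelic_records(sample)))  # [('7', 92754574, 'C,G')]
```

If the list is non-empty, split the file with bcftools before running genebe annotate.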
VCF Example with Specific Fields
Request only specific annotation fields:
genebe annotate --input input.vcf.gz --output output.vcf.gz --fields variant_id,gene,consequence
Handling Large Datasets with API Key
For large datasets (e.g., over 10,000 variants), request limits may apply. Create a GeneBe account, generate an API key, and use it to increase your limit.
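One way to stay within request limits is to send variants in fixed-size batches. The sketch below is a generic standard-library helper, not a pygenebe feature; the name batched and the batch size are assumptions for illustration and do not reflect any documented GeneBe limit.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


# Example: five variants split into batches of two (the size is arbitrary).
variants = [
    ("1", 10020, "A", "G"),
    ("6", 160585140, "T", "G"),
    ("17", 7674220, "G", "A"),
    ("7", 11324574, "C", "T"),
    ("7", 11297537, "A", "C"),
]
for chunk in batched(variants, 2):
    print(len(chunk), chunk[0])
```

Each chunk can then be turned into a DataFrame and annotated separately, with a pause between calls if the limits require it.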
CLI with API Key
genebe annotate --input input.vcf.gz --output output.vcf.gz --username your_username --api-key your_api_key
Python with API Key
Set credentials before making requests:
pygenebe.set_credentials(username="your_username", api_key="your_api_key")
result = pygenebe.annotate_variant(chr="6", pos=160585140, ref="T", alt="G")
print(result)
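To keep the API key out of source code, credentials can be read from the environment before they are passed to set_credentials. This is a minimal sketch; the helper load_credentials and the variable names GENEBE_USERNAME and GENEBE_API_KEY are assumptions made for this example, not names defined by pygenebe.

```python
import os
from typing import Mapping, Optional, Tuple


def load_credentials(
    env: Mapping[str, str] = os.environ,
) -> Optional[Tuple[str, str]]:
    """Return (username, api_key) from the environment, or None if either is unset.

    The variable names are hypothetical; choose whatever suits your setup.
    """
    username = env.get("GENEBE_USERNAME")
    api_key = env.get("GENEBE_API_KEY")
    if username and api_key:
        return username, api_key
    return None


creds = load_credentials()
if creds is None:
    print("No credentials found; requests fall under the anonymous limits.")
else:
    username, api_key = creds
    # Pass the values on as shown in the example above:
    # pygenebe.set_credentials(username=username, api_key=api_key)
    print(f"Loaded credentials for {username}")
```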
Check your account limits:
genebe account
Using Docker
For a pre-configured environment, use the Docker image:
docker run -v "$(pwd)/input.vcf:/tmp/input.vcf" --rm genebe/pygenebe:0.0.14 genebe annotate --input /tmp/input.vcf --output /dev/stdout
Mount your VCF file and retrieve the annotated output.
Additional Usage Examples
Exploring Variant Details
To explore detailed annotations for a variant:
result = pygenebe.annotate_variant(chr="17", pos=7674220, ref="G", alt="A")
print(result["gene"]) # Access specific annotation fields
print(result["gnomad_af"]) # Population allele frequency
Parse variants
The parse_variants function can translate variants expressed in several forms, in particular:
- HGVS notation (e.g. NM_000546.5:c.215C>G)
- dbSNP ID (e.g. rs11)
- gene symbol with an amino acid change (e.g. AGT M259T)
The usage examples below assume the client module has been imported under the alias gnb:
res = gnb.parse_variants(
    ["NM_000546.5:c.215C>G", "AGT M259T", "rs10"],
    multiple=True,
)
print(res)
If multiple is True, all possible values are returned for each variant (the result is a List[List[str]]). Otherwise, one value is returned per variant.
To use a pandas DataFrame as both input and output:
df = pd.DataFrame({"variant": ["AGT M259T", "rs10", "rs11", "rs12"]})
gnb.parse_variants_df(
    df,
    multiple=True,
)
  variant  parsed_variants
0 rs10     [7-92754574-A-C, 7-92754574-A-G, 7-92754574-A-T]
1 rs11     [7-11324574-C-T]
2 rs12     [7-11297537-A-C]
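The parsed values in the output above appear to follow a chromosome-position-ref-alt pattern joined by hyphens. Assuming that format holds, a small helper can split each value into the four columns used in the DataFrame annotation example. The function name split_variant_key is hypothetical, and the sketch rejects anything that does not have exactly four hyphen-separated parts.

```python
from typing import Dict, Union


def split_variant_key(key: str) -> Dict[str, Union[str, int]]:
    """Split 'chr-pos-ref-alt' into a dict (format inferred from the output above)."""
    parts = key.split("-")
    if len(parts) != 4:
        raise ValueError(f"expected 'chr-pos-ref-alt', got {key!r}")
    chrom, pos, ref, alt = parts
    return {"chr": chrom, "pos": int(pos), "ref": ref, "alt": alt}


parsed = ["7-92754574-A-C", "7-92754574-A-G", "7-92754574-A-T"]
rows = [split_variant_key(k) for k in parsed]
print(rows[0])  # {'chr': '7', 'pos': 92754574, 'ref': 'A', 'alt': 'C'}
```

Passing rows to pd.DataFrame(rows) yields a frame with chr, pos, ref, and alt columns.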
When parsing many variants from an external source, some of them may be invalid, in which case the backend raises an error. To ignore these errors and continue, pass ignore_errors=True. In the example below, DHFR:p.N51I is an invalid HGVS expression (the reference amino acid is incorrect).
df = pd.DataFrame({"variant": ["DHFR:p.N51I", "rs10"]})
res = gnb.parse_variants_df(
    df,
    endpoint_url="http://localhost:7180/cloud/api-public/v1/convert",
    multiple=True,
    ignore_errors=True,
)
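Before sending a mixed list to the backend, it can be useful to see which of the three input forms each entry resembles. The sketch below is a rough client-side heuristic based on regular expressions; it is not the parser GeneBe uses, and the function guess_form and its patterns are assumptions. It checks shape only, so an entry such as DHFR:p.N51I still looks like valid HGVS here even though the backend rejects it.

```python
import re

# Rough shape checks only; these patterns are illustrative assumptions.
RSID = re.compile(r"^rs\d+$")
HGVS = re.compile(r"^[^\s:]+:[cgnmpr]\.\S+$")
GENE_AA = re.compile(r"^[A-Za-z0-9-]+ [A-Z]\d+[A-Z*]$")


def guess_form(text: str) -> str:
    """Return 'dbsnp', 'hgvs', 'gene_aa', or 'unknown' for a variant string."""
    text = text.strip()
    if RSID.match(text):
        return "dbsnp"
    if HGVS.match(text):
        return "hgvs"
    if GENE_AA.match(text):
        return "gene_aa"
    return "unknown"


inputs = ["NM_000546.5:c.215C>G", "AGT M259T", "rs10", "DHFR:p.N51I", "not a variant"]
for item in inputs:
    print(f"{item!r}: {guess_form(item)}")
```

Entries reported as unknown can be reviewed by hand, while the rest are submitted with ignore_errors=True to catch semantic problems.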
Error Handling
If a variant is invalid, pygenebe raises an exception:
try:
    result = pygenebe.annotate_variant(chr="X", pos=-1, ref="A", alt="T")
except ValueError as e:
    print(f"Error: {e}")
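Obvious input mistakes can also be caught locally before any request is made. The following sketch is independent of pygenebe; the function check_variant and its rules (a position of at least 1, alleles made of A, C, G, T, or N) are assumptions for illustration, and the service may accept other allele representations.

```python
VALID_BASES = set("ACGTN")


def check_variant(chrom: str, pos: int, ref: str, alt: str) -> None:
    """Raise ValueError for clearly malformed input (hypothetical pre-check)."""
    if not chrom:
        raise ValueError("chromosome must not be empty")
    if pos < 1:
        raise ValueError(f"position must be >= 1, got {pos}")
    for name, allele in (("ref", ref), ("alt", alt)):
        # Reject empty alleles and any character outside the assumed alphabet.
        if not allele or set(allele.upper()) - VALID_BASES:
            raise ValueError(f"{name} allele {allele!r} is not a plain nucleotide string")


try:
    check_variant("X", -1, "A", "T")
except ValueError as e:
    print(f"Rejected locally: {e}")
```

A call that passes this check can still fail on the server, so the try/except around the annotation call remains necessary.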
Notes and Limitations
- VCF Format: Ensure VCF files are single-allelic. Use bcftools norm -m - to split multi-allelic variants.
- Request Limits: Free usage has limits to prevent server overload. API key holders get higher limits (tens of thousands of requests daily). Contact GeneBe support for custom limits.
- Documentation: For more details, see the official documentation and GitHub examples.
Troubleshooting
- Installation Issues: Ensure cyvcf2 is installed for VCF support (pip install cyvcf2).
- API Errors: Verify your API key and network connection. Check GitHub issues for help or to report problems.
pygenebe simplifies genetic variant annotation with flexible options for researchers and bioinformaticians. Dive into its features and enhance your genetic analysis workflows!