Common File Formats in Bioinformatics

In bioinformatics workflows, data is stored and transferred in various specialized file formats. Understanding these formats is crucial for effective data management and analysis. This page provides an overview of the most common file formats you'll encounter.

On this page you will find information about the FASTQ, BAM, CRAM, VCF and BED file formats. In a typical pipeline, the FASTQ files are aligned to produce the BAM file, which is then used to generate the VCF file, the starting point for subsequent expert analysis.

FASTQ Format

FASTQ is the standard file format for storing the raw output of high-throughput sequencing instruments.

Purpose:

FASTQ files serve as the primary storage format for raw sequencing data directly from sequencing machines. They represent the starting point of most bioinformatics analyses and contain both the sequence data and quality information needed for downstream processing.

FASTQ files are typically the format in which data is shared between labs to reproduce genetic analyses, as they are usually the primary input for bioinformatics pipelines.

Key characteristics:

File extension: .fastq or .fq (usually compressed as .fastq.gz or .fq.gz)
Compression: Almost always gzipped to save storage space
Human readability: Plain text once decompressed, but not practical to read by eye because of the file size and the encoded quality line
Size: For human exome sequencing, typically several GB even when compressed
Distribution pattern: Usually distributed in pairs (R1 and R2 in filenames), indicating "paired-end" sequencing

Data contained:

FASTQ files contain:

Sequence identifier (header line starting with '@')
Raw sequence (nucleotides as A, C, G, T, N)
Separator line (starting with '+')
Quality scores (encoded as ASCII characters) for each nucleotide

@SRR001666.1 HWUSI-EAS1513_0001:1:1:1429:1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

The quality scores are Phred scores: each one represents the estimated probability that the corresponding base call is wrong, and is stored as a single ASCII character per base.
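To make the four-line record structure and the quality encoding concrete, here is a minimal Python sketch that parses the example record above and decodes its quality string. It assumes the Phred+33 offset used by Sanger and current Illumina output; older files may use a different offset. The function names are illustrative and not part of any library.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def parse_fastq_record(lines):
    """Split one four-line FASTQ record into (identifier, sequence, quality)."""
    header, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a well-formed FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], sequence, quality


def phred_scores(quality, offset=PHRED_OFFSET):
    """Convert the ASCII quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probability(score):
    """Phred definition: Q = -10 * log10(p), hence p = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


record = [
    "@SRR001666.1 HWUSI-EAS1513_0001:1:1:1429:1",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
name, seq, qual = parse_fastq_record(record)
scores = phred_scores(qual)
```

Under this assumption the first character, '!', decodes to a score of 0 (no confidence in the base call), while a score of 20 corresponds to a 1% error probability.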

For more detailed information about FASTQ format, see the Wikipedia article on FASTQ format.

BAM Format

BAM (Binary Alignment Map) is the compressed binary version of the SAM format used to store sequence alignments.

Purpose:

BAM files store aligned sequencing reads, showing how each read maps to a reference genome. They are essential for variant calling, coverage analysis, and visualization of sequencing data in its genomic context. The binary format allows for efficient storage and rapid access to specific genomic regions via indexing.

Key characteristics:

File extension: .bam
Human readability: Not readable by humans (binary format)
Companion file: Usually paired with a .bai index file
Size: Typically several GB for human samples
Reference genome: Aligned to specific genome versions (e.g., GRCh38)

Data contained:

BAM files contain:

Read sequences aligned to a reference genome
Quality scores for each base
Alignment information (position, CIGAR string, flags)
Optional metadata fields
Duplicate marking: depending on the source experiment, duplicate reads may be flagged as such
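The flags and CIGAR string listed above are compact encodings, so a short sketch may help. The Python below decodes a flag value into its bits as defined in the SAM specification, and computes how many reference bases an alignment spans from its CIGAR string. The function names and the example values are illustrative. In practice a library such as pysam or Samtools itself would read these fields from the BAM file.

```python
import re

# Bit values as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")
REFERENCE_CONSUMING = set("MDN=X")  # operations that advance along the reference


def decode_flag(flag):
    """Return the names of all bits set in a SAM flag value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def reference_span(cigar):
    """Number of reference bases covered by an alignment with this CIGAR."""
    return sum(
        int(length)
        for length, operation in CIGAR_OPERATION.findall(cigar)
        if operation in REFERENCE_CONSUMING
    )


flags = decode_flag(1107)                 # hypothetical flag value
span = reference_span("5S50M2I10M3D33M")  # hypothetical CIGAR string
```

Here the duplicate bit (0x400) is what a duplicate-marking step sets, which is how downstream tools know to skip those reads.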

BAM files require the use of specialized tools for viewing and manipulation, most commonly:

Samtools for command-line operations
IGV (Integrative Genomics Viewer) for visualization

Reference genome versions

BAM files are aligned to specific reference genome versions. The most commonly used human reference genome version currently is GRCh38 (also known as hg38). Earlier versions like GRCh37 (hg19) are still in use in some contexts.

Converting to FASTQ

It's possible to extract the original FASTQ sequences from a BAM file, which can be useful for realignment or other analyses. This can be done using tools like Samtools or Picard.

CRAM Format

CRAM is an alternative to BAM that provides additional compression to reduce file sizes.

Purpose:

CRAM was developed specifically to address the growing storage challenges of genomic data. Its primary purpose is to provide a more space-efficient format for long-term archiving of aligned sequence data while maintaining the ability to reconstruct the original information when needed.

Key characteristics:

File extension: .cram
Human readability: Not readable by humans (binary format)
Companion file: Usually paired with a .crai index file
Reference dependency: Requires access to the reference sequence used for alignment
Size: Significantly smaller than BAM files (typically 30-60% smaller)

Data contained:

Similar to BAM files, CRAM contains aligned sequence data but uses reference-based compression to achieve smaller file sizes.
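The idea behind reference-based compression, and the reason a CRAM file cannot be decoded without its reference, can be illustrated with a toy Python sketch. This is not the actual CRAM codec: it handles only gapless alignments and stores just the start position, the read length, and the bases that differ from the reference. All names and sequences are made up for illustration.

```python
def encode_read(reference, position, read):
    """Keep only the start, the length and the bases that differ from the reference."""
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    differences = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if base != ref_base
    ]
    return position, len(read), differences


def decode_read(reference, encoded):
    """Rebuild the read; this is impossible without the same reference sequence."""
    position, length, differences = encoded
    bases = list(reference[position:position + length])
    for offset, base in differences:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGCCGATAGGC"  # made-up reference sequence
read = "CGTTAGACGA"                 # matches reference[5:15] except at one base
encoded = encode_read(reference, 5, read)
restored = decode_read(reference, encoded)
```

A read that matches the reference exactly is reduced to a position and a length, which is why well-aligned data compresses so strongly under this kind of scheme.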

Usage considerations:

Best used for archiving data
Not recommended for distributing data due to the reference sequence dependency
Requires specialized tools similar to those used for BAM files
May have compatibility issues with some older software

VCF Format (Variant Call Format)

VCF is the standard format for storing genetic variation data, such as SNPs, insertions, deletions, and structural variants.

VCF files are typically the format in which variant data is shared between labs to support downstream interpretation, as they represent the final output of many bioinformatics pipelines.

Purpose:

VCF files store information about genomic variations relative to a reference genome. They serve as a standard interchange format for variant calling, annotation, and analysis. VCF enables researchers to categorize, filter, and analyze genetic variations across populations or within individuals.

Key characteristics:

File extension: .vcf (often compressed as .vcf.gz, sometimes accompanied by a separate index file, .vcf.gz.tbi)
Human readability: Semi-readable by humans, but complex
Machine friendliness: Somewhat challenging for automated processing due to complexity
Size: Much smaller than raw sequence data; varies widely based on variant content

Data contained:

VCF files have:

A header section with metadata and descriptions of annotations
A data section with variant information

Each variant entry contains information about position, reference and alternative alleles, quality metrics, and optional annotations. VCF files can contain data for a single sample or multiple samples, allowing for population-level variant analysis.
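As a rough illustration of what a variant entry looks like, the Python sketch below splits one tab-separated data line into the eight fixed VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and, when present, the per-sample genotype columns. The example line is written in the style of the VCF specification's example, and the sample name SAMPLE1 is hypothetical. For real work, an established parser such as pysam or cyvcf2 is a safer choice.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info):
    """Turn 'KEY=value;FLAG' into a dict; flags without a value map to True."""
    if info == ".":
        return {}
    parsed = {}
    for item in info.split(";"):
        key, separator, value = item.partition("=")
        parsed[key] = value if separator else True
    return parsed


def parse_variant_line(line, sample_names=()):
    """Parse one VCF data line (not a '#' header line) into a dict."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])        # VCF positions are 1-based
    record["ALT"] = record["ALT"].split(",")  # several alternative alleles are allowed
    record["INFO"] = parse_info(record["INFO"])
    if len(fields) > 9:
        # Column 9 (FORMAT) names the colon-separated keys of each sample column.
        keys = fields[8].split(":")
        record["samples"] = {
            name: dict(zip(keys, values.split(":")))
            for name, values in zip(sample_names, fields[9:])
        }
    return record


line = "\t".join(
    ["20", "14370", "rs6054257", "G", "A", "29", "PASS",
     "NS=3;DP=14;AF=0.5;DB", "GT:GQ:DP", "0|0:48:1"]
)
variant = parse_variant_line(line, sample_names=["SAMPLE1"])  # hypothetical sample name
```

In an actual file the sample names come from the final header line (the one starting with #CHROM), and the meaning of each INFO and FORMAT key is declared in the header section.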

Usage notes:

Can represent small variants (SNPs, indels) or large structural variants
May include data for a single sample or multiple samples
Can contain various annotations about functional impact, population frequencies, etc.
Can be viewed in spreadsheet software like Excel, but not ideal due to formatting issues
Specialized viewers like IGV provide better visualization

BED Format

BED (Browser Extensible Data) is a flexible, line-oriented format used to define genomic regions of interest.

Purpose:

BED files define specific genomic intervals for various applications including targeted sequencing capture regions, gene annotations, and visualization tracks. They're particularly important in targeted sequencing experiments to define which regions of the genome were specifically captured and analyzed.

Key characteristics:

File extension: .bed
Human readability: Easily readable (tab-delimited text format)
Machine friendliness: Simple format that's easy to parse and generate
Size: Typically very small (kilobytes to a few megabytes)

Data contained:

BED files consist of tab-separated fields with the first three fields being mandatory:

Chromosome name (chrom)
Start position (chromStart, 0-based)
End position (chromEnd, not included in the region)

Additional optional fields can include:

4. Name of the region (name)
5. Score (often used for visualization)
6. Strand (+ or -)
7. thickStart (for display purposes)
8. thickEnd (for display purposes)
9. RGB color value (itemRgb, for display)
10. Block count (blockCount, for representing exons)
11. Block sizes (blockSizes)
12. Block starts (blockStarts)

Usage in targeted sequencing:

In targeted sequencing (like exome or gene panel sequencing), BED files define the regions that capture probes were designed to target. This information is crucial for:

Calculating coverage across targeted regions
Evaluating capture efficiency
Restricting variant calling to intended regions
Interpreting results in the context of what was actually targeted

Example BED file entries:

chr7    127471196    127472363    Pos1    0    +
chr7    127472363    127473530    Pos2    0    +

BED files can be easily manipulated with tools like BEDTools and viewed in genome browsers like UCSC or IGV.
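Because BED starts are 0-based and ends are excluded, interval arithmetic is easy to get wrong by one base. The minimal Python sketch below parses the two example entries, computes their lengths, and checks whether a 1-based position (as used in VCF) falls inside a target region. The function names are illustrative. For anything beyond a quick check, BEDTools or a dedicated library is the more robust option.

```python
def parse_bed_line(line):
    """Parse the three mandatory BED fields plus name, score and strand if present."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, chromStart and chromEnd")
    interval = {"chrom": fields[0], "start": int(fields[1]), "end": int(fields[2])}
    for key, value in zip(["name", "score", "strand"], fields[3:6]):
        interval[key] = value
    return interval


def interval_length(interval):
    """Half-open coordinates: the length is simply end minus start."""
    return interval["end"] - interval["start"]


def contains(interval, chrom, position_1based):
    """Check whether a 1-based position (as in VCF) lies inside the BED interval."""
    zero_based = position_1based - 1
    return interval["chrom"] == chrom and interval["start"] <= zero_based < interval["end"]


bed_lines = [
    "chr7\t127471196\t127472363\tPos1\t0\t+",
    "chr7\t127472363\t127473530\tPos2\t0\t+",
]
regions = [parse_bed_line(line) for line in bed_lines]
total_targeted = sum(interval_length(region) for region in regions)
```

The two example regions are adjacent rather than overlapping: the end of the first equals the start of the second, and since the end is excluded, each base belongs to exactly one region.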
