Common File Formats in Bioinformatics
In bioinformatics workflows, data is stored and transferred in various specialized file formats. Understanding these formats is crucial for effective data management and analysis. This page provides an overview of the most common file formats you'll encounter.
This page covers the FASTQ, BAM, CRAM, VCF and BED file formats. In a typical workflow, the FASTQ files are aligned to build the BAM file, which is then used to generate the VCF file, the starting point for subsequent expert analysis.
FASTQ Format​
FASTQ is the standard file format for storing the raw output of high-throughput sequencing instruments.
Purpose:​
FASTQ files serve as the primary storage format for raw sequencing data directly from sequencing machines. They represent the starting point of most bioinformatics analyses and contain both the sequence data and quality information needed for downstream processing.
FASTQ files are typically the format in which data is shared between labs to reproduce genetic analyses, as they are usually the primary input for bioinformatics pipelines.
Key characteristics:​
- File extension: .fastq or .fq (usually compressed as .fastq.gz or .fq.gz)
- Compression: Almost always gzipped to save storage space
- Human readability: Not easily readable by humans in its raw form
- Size: For human exome sequencing, typically several GB even when compressed
- Distribution pattern: Usually distributed in pairs (R1 and R2 in filenames), indicating "paired-end" sequencing
Data contained:​
Each FASTQ record consists of four lines:
- Sequence identifier (header line starting with '@')
- Raw sequence (nucleotides as A, C, G, T, N)
- Separator line (starting with '+')
- Quality scores (encoded as ASCII characters) for each nucleotide
@SRR001666.1 HWUSI-EAS1513_0001:1:1:1429:1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
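The quality scores represent the estimated probability of a sequencing error for each base. Each score is a Phred value Q = -10 × log10(p), stored as a single ASCII character (character code Q + 33 in the encoding used by current Illumina instruments).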
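As a minimal sketch of how the four lines of a record and the quality encoding fit together, the Python snippet below reads the example record above and decodes its quality string. It assumes the simple four-lines-per-record layout and the Phred+33 offset; the function names are illustrative and do not come from any library.

```python
import io

def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a four-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a trailing blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line starting with '+', ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality

def phred_scores(quality, offset=33):
    """Convert a quality string to integer Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]

def error_probability(q):
    """Estimated probability that a base with Phred score q is wrong."""
    return 10 ** (-q / 10)

example = (
    "@SRR001666.1 HWUSI-EAS1513_0001:1:1:1429:1\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)

records = list(read_fastq(io.StringIO(example)))
identifier, sequence, quality = records[0]
scores = phred_scores(quality)
print(identifier, len(sequence), min(scores), max(scores))
```

For real data, the same generator could be fed a handle from `gzip.open(path, "rt")`, since FASTQ files are almost always gzipped.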
For more detailed information about FASTQ format, see the Wikipedia article on FASTQ format.
BAM Format​
BAM (Binary Alignment Map) is the compressed binary version of the SAM (Sequence Alignment/Map) format used to store sequence alignments.
Purpose:​
BAM files store aligned sequencing reads, showing how each read maps to a reference genome. They are essential for variant calling, coverage analysis, and visualization of sequencing data in its genomic context. The binary format allows for efficient storage and rapid access to specific genomic regions via indexing.
Key characteristics:​
- File extension: .bam
- Human readability: Not readable by humans (binary format)
- Companion file: Usually paired with a .bai index file
- Size: Typically several GB for human samples
- Reference genome: Aligned to specific genome versions (e.g., GRCh38)
Data contained:​
BAM files contain:
- Read sequences aligned to a reference genome
- Quality scores for each base
- Alignment information (position, CIGAR string, flags)
- Optional metadata fields
- Duplicate marks: depending on the source experiment, duplicate reads may be flagged as duplicates rather than removed
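To make the alignment fields more concrete, here is a small Python sketch that decodes a flag value and a CIGAR string as they would appear in the text (SAM) form of a record. The bit values and operation letters follow the SAM specification; the label names and function names are informal choices made for this example.

```python
import re

# Bit values defined by the SAM specification; the labels are informal.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

def decode_flag(flag):
    """Return the labels of all bits set in a SAM flag value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REFERENCE_CONSUMING = set("MDN=X")  # operations that advance along the reference

def reference_span(cigar):
    """Number of reference bases covered by an alignment with this CIGAR string."""
    return sum(int(length) for length, op in CIGAR_OP.findall(cigar)
               if op in REFERENCE_CONSUMING)

print(decode_flag(1107))
print(reference_span("5S90M2I3D10M"))
```

The duplicate mark mentioned in the list above corresponds to bit 0x400, so a read can be checked with `flag & 0x400`.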
BAM files require the use of specialized tools for viewing and manipulation, most commonly:
- Samtools for command-line operations
- IGV (Integrative Genomics Viewer) for visualization
Reference genome versions​
The reads in a BAM file are aligned to one specific reference genome version, and coordinates from different versions are not interchangeable. The most commonly used human reference genome version is currently GRCh38 (also known as hg38). Earlier versions such as GRCh37 (hg19) are still in use in some contexts.
Converting to FASTQ​
It's possible to extract the original FASTQ sequences from a BAM file, which can be useful for realignment or other analyses. This can be done using tools like Samtools or Picard.
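The following Python sketch illustrates what such a conversion has to do for a single record in SAM text form. It is not how Samtools or Picard are implemented, only an illustration under simplifying assumptions: the record stores the full read (no hard clipping), secondary and supplementary alignments are skipped, and the "/1" and "/2" name suffixes are a convention chosen for this example.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def sam_record_to_fastq(sam_line):
    """Convert one SAM text line to a FASTQ record, or None if it should be skipped."""
    fields = sam_line.rstrip("\n").split("\t")
    name, flag = fields[0], int(fields[1])
    seq, qual = fields[9], fields[10]

    # Secondary (0x100) and supplementary (0x800) alignments repeat a read
    # that already has a primary record, so they are skipped.
    if flag & 0x900:
        return None

    # Reads aligned to the reverse strand are stored reverse-complemented;
    # undo that to recover the sequence as the instrument produced it.
    if flag & 0x10:
        seq = seq.translate(COMPLEMENT)[::-1]
        qual = qual[::-1]

    if flag & 0x40:
        name += "/1"
    elif flag & 0x80:
        name += "/2"
    return f"@{name}\n{seq}\n+\n{qual}\n"

# Hypothetical record: first read of a pair, aligned to the reverse strand.
line = "read1\t83\tchr1\t100\t60\t4M\t=\t50\t-54\tAACG\tIIH#"
print(sam_record_to_fastq(line), end="")
```

In practice a dedicated tool should be preferred, since it also takes care of grouping the two reads of each pair and writing separate R1 and R2 files.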
CRAM Format​
CRAM is an alternative to BAM that provides additional compression to reduce file sizes.
Purpose:​
CRAM was developed specifically to address the growing storage challenges of genomic data. Its primary purpose is to provide a more space-efficient format for long-term archiving of aligned sequence data while maintaining the ability to reconstruct the original information when needed.
Key characteristics:​
- File extension: .cram
- Human readability: Not readable by humans (binary format)
- Companion file: Usually paired with a .crai index file
- Reference dependency: Requires access to the reference sequence used for alignment
- Size: Significantly smaller than BAM files (typically 30-60% smaller)
Data contained:​
Similar to BAM files, CRAM contains aligned sequence data but uses reference-based compression to achieve smaller file sizes.
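The idea behind reference-based compression can be shown with a toy Python example. This is a deliberately simplified illustration, not the actual CRAM encoding: it handles only ungapped reads and ignores quality scores, and all names are hypothetical. It does show why the reference sequence must be available to read the data back.

```python
def encode_read(reference, start, read):
    """Represent an ungapped read as (start, length, mismatches against the reference)."""
    window = reference[start:start + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    mismatches = [(offset, base)
                  for offset, (ref_base, base) in enumerate(zip(window, read))
                  if base != ref_base]
    return start, len(read), mismatches

def decode_read(reference, encoded):
    """Rebuild the read; this is impossible without the same reference."""
    start, length, mismatches = encoded
    bases = list(reference[start:start + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)

reference = "ACGTACGTACGT"   # made-up reference sequence
read = "GTAGGT"              # differs from the reference at one position

encoded = encode_read(reference, 2, read)
print(encoded)
print(decode_read(reference, encoded))
```

Only the position, the length and the single differing base are stored, instead of all six bases of the read.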
Usage considerations:​
- Best used for archiving data
- Not recommended for distributing data due to the reference sequence dependency
- Requires specialized tools similar to those used for BAM files
- May have compatibility issues with some older software
VCF Format (Variant Call Format)​
VCF is the standard format for storing genetic variation data, such as SNPs, insertions, deletions, and structural variants.
VCF files are typically the format in which variant data is shared between labs to support downstream interpretation, as they represent the final output of many bioinformatics pipelines.
Purpose:​
VCF files store information about genomic variations relative to a reference genome. They serve as a standard interchange format for variant calling, annotation, and analysis. VCF enables researchers to categorize, filter, and analyze genetic variations across populations or within individuals.
Key characteristics:​
- File extension: .vcf (often compressed as .vcf.gz, sometimes accompanied by a .vcf.gz.tbi index file)
- Human readability: Semi-readable by humans, but complex
- Machine friendliness: Somewhat challenging for automated processing due to complexity
- Size: Much smaller than raw sequence data; varies widely based on variant content
Data contained:​
VCF files have:
- A header section with metadata and descriptions of annotations
- A data section with variant information
Each variant entry contains information about position, reference and alternative alleles, quality metrics, and optional annotations. VCF files can contain data for a single sample or multiple samples, allowing for population-level variant analysis.
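A minimal Python sketch of this two-part structure is shown below. It separates the header from the data section and splits a few of the fixed columns. The sample data is made up for illustration, and the parser ignores many details of the format (for example typed INFO values and per-sample FORMAT fields), so it should be read as an outline rather than a complete VCF reader.

```python
import io

def parse_info(info):
    """Split the INFO column into a dict; keys without a value become True."""
    if info == ".":
        return {}
    result = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result

def read_vcf(handle):
    """Return (meta lines, column names, variant records) from VCF text."""
    meta, columns, variants = [], [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line)                 # header: metadata lines
        elif line.startswith("#"):
            columns = line[1:].split("\t")    # header: column names
        else:
            record = dict(zip(columns, line.split("\t")))
            record["POS"] = int(record["POS"])
            record["ALT"] = record["ALT"].split(",")
            record["INFO"] = parse_info(record["INFO"])
            variants.append(record)
    return meta, columns, variants

# Made-up example with a single hypothetical sample column.
example = (
    "##fileformat=VCFv4.2\n"
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">\n'
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\n"
    "chr1\t12345\t.\tG\tA,T\t50\tPASS\tDP=14;DB\tGT\t1/2\n"
)

meta, columns, variants = read_vcf(io.StringIO(example))
print(variants[0]["CHROM"], variants[0]["POS"], variants[0]["REF"], variants[0]["ALT"])
```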
Usage notes:​
- Can represent small variants (SNPs, indels) or large structural variants
- May include data for a single sample or multiple samples
- Can contain various annotations about functional impact, population frequencies, etc.
- Can be viewed in spreadsheet software like Excel, but not ideal due to formatting issues
- Specialized viewers like IGV provide better visualization
BED Format​
BED (Browser Extensible Data) is a flexible, line-oriented format used to define genomic regions of interest.
Purpose:​
BED files define specific genomic intervals for various applications including targeted sequencing capture regions, gene annotations, and visualization tracks. They're particularly important in targeted sequencing experiments to define which regions of the genome were specifically captured and analyzed.
Key characteristics:​
- File extension:
.bed
- Human readability: Easily readable (tab-delimited text format)
- Machine friendliness: Simple format that's easy to parse and generate
- Size: Typically very small (kilobytes to a few megabytes)
Data contained:​
BED files consist of tab-separated fields with the first three fields being mandatory:
1. Chromosome name (chrom)
2. Start position (chromStart, 0-based)
3. End position (chromEnd, not included in the region)

Additional optional fields can include:

4. Name of the region
5. Score (often used for visualization)
6. Strand (+ or -)
7. thickStart (for display purposes)
8. thickEnd (for display purposes)
9. RGB color value (for display)
10. Block count (for representing exons)
11. Block sizes
12. Block starts
Usage in targeted sequencing:​
In targeted sequencing (like exome or gene panel sequencing), BED files define the regions that capture probes were designed to target. This information is crucial for:
- Calculating coverage across targeted regions
- Evaluating capture efficiency
- Restricting variant calling to intended regions
- Interpreting results in the context of what was actually targeted
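When restricting variants to targeted regions, the coordinate systems need care: BED start positions are 0-based and the end position is excluded, whereas VCF positions are 1-based. The Python sketch below shows one way to test whether a variant position falls inside a target. The function name and the in-memory list of targets are assumptions made for this example; dedicated tools should be preferred for real data.

```python
def in_targets(chrom, pos_1based, targets):
    """True if a 1-based position lies inside any 0-based, end-exclusive BED interval."""
    pos_0based = pos_1based - 1
    return any(target_chrom == chrom and start <= pos_0based < end
               for target_chrom, start, end in targets)

# One target region, given as (chrom, chromStart, chromEnd).
targets = [("chr7", 127471196, 127472363)]

print(in_targets("chr7", 127471196, targets))  # one base before the region
print(in_targets("chr7", 127471197, targets))  # first base of the region
print(in_targets("chr7", 127472363, targets))  # last base of the region
print(in_targets("chr7", 127472364, targets))  # one base after the region
```

A linear scan like this is fine for a handful of regions; large target sets would normally be sorted or indexed first.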
Example BED file entry:​
chr7 127471196 127472363 Pos1 0 +
chr7 127472363 127473530 Pos2 0 +
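As a short illustration, the two entries above can be parsed in Python as follows. Because the start is 0-based and the end is excluded, the length of a region is simply end minus start. The function name and the returned dictionary layout are choices made for this sketch.

```python
def parse_bed_line(line):
    """Parse the first six BED fields of one line into a dict."""
    # BED files are tab-delimited; splitting on any whitespace keeps
    # this example tolerant of copy-pasted text.
    fields = line.split()
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if start > end:
        raise ValueError(f"start is greater than end: {line!r}")
    return {
        "chrom": chrom,
        "start": start,
        "end": end,
        "name": fields[3] if len(fields) > 3 else None,
        "score": int(fields[4]) if len(fields) > 4 else None,
        "strand": fields[5] if len(fields) > 5 else None,
        "length": end - start,
    }

bed_text = """chr7 127471196 127472363 Pos1 0 +
chr7 127472363 127473530 Pos2 0 +"""

regions = [parse_bed_line(line) for line in bed_text.splitlines()]
total_bases = sum(region["length"] for region in regions)
print([region["name"] for region in regions], total_bases)
```

Note that the two regions are adjacent but do not overlap: the first ends at 127472363 (excluded) and the second starts at 127472363 (included).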
BED files can be easily manipulated with tools like BEDTools and viewed in genome browsers like the UCSC Genome Browser or IGV.