🧬Bioinformatics

Major Bioinformatics File Formats

Study smarter with Fiveable

Get study guides, practice questions, and cheatsheets for all your subjects. Join 500,000+ students with a 96% pass rate.

Get Started

Why This Matters

Every bioinformatics pipeline you'll encounter—from basic sequence searches to complex variant analysis—depends on understanding how biological data is structured and stored. These file formats aren't arbitrary; each one evolved to solve a specific computational problem, whether that's storing raw sequencing reads with quality metrics, representing alignments against a reference genome, or encoding three-dimensional protein structures. You're being tested on your ability to recognize which format fits which analysis task, how data flows between formats in a pipeline, and what information each format captures or loses.

Don't just memorize file extensions—know what biological question each format helps answer. When you see a question about quality-aware sequence data, you should immediately think FASTQ. When asked about storing genomic coordinates for visualization, BED should come to mind. Understanding the conceptual purpose behind each format will help you troubleshoot pipelines, choose appropriate tools, and answer exam questions that test your grasp of data flow in genomic analysis.

Sequence Storage Formats

These formats represent the fundamental building blocks: raw biological sequences. The key distinction is whether quality information is preserved alongside the sequence itself.

FASTA

Header-sequence structure—each entry begins with > followed by an identifier, with the sequence on subsequent lines
No quality information—stores only the sequence itself, making files compact but unsuitable for raw NGS data
Universal compatibility—the default input for BLAST searches, multiple sequence alignment, and database construction

FASTQ

Four-line entry structure—identifier, sequence, separator (+), and ASCII-encoded quality scores in strict order
Phred quality scores—each base gets a quality value representing the probability of sequencing error, typically $Q = -10 \log_{10}(P_{error})$
NGS data standard—the primary output format from Illumina, Ion Torrent, and other high-throughput sequencing platforms

Compare: FASTA vs. FASTQ—both store sequences, but FASTQ adds per-base quality scores essential for filtering low-confidence reads. If asked which format is appropriate for raw sequencing data, FASTQ is always the answer; FASTA is for processed or reference sequences.

Alignment and Mapping Formats

Once sequences are generated, they must be aligned to a reference. These formats encode where reads map, how confidently, and what differences exist.

SAM/BAM

SAM is human-readable text; BAM is compressed binary—identical information, but BAM files are ~70% smaller and indexed for fast access
CIGAR strings—encode alignment operations (matches, insertions, deletions) in compact notation like 50M2I48M
Mandatory fields include mapping quality (MAPQ), reference position, and bitwise flags indicating read properties (paired, unmapped, reverse complement)

VCF (Variant Call Format)

One variant per line—records chromosome, position, reference allele, alternate allele(s), and quality score
Genotype fields—stores individual-level data for population studies, including zygosity and read depth supporting each call
Essential for GWAS and clinical genomics—the standard format for SNPs, indels, and structural variants across all major variant callers

Compare: SAM/BAM vs. VCF—SAM/BAM stores where reads align, while VCF stores where alignments differ from the reference. In a typical pipeline, you generate BAM first, then call variants to produce VCF. Understanding this data flow is critical for pipeline questions.

Genomic Annotation Formats

These formats describe what features exist at specific genomic coordinates. They're essential for connecting raw data to biological meaning.

BED (Browser Extensible Data)

Minimal three-column structure—chromosome, start position (0-based), and end position define each feature
Optional columns—name, score, strand, and display parameters can extend the format for richer annotation
Genome browser visualization—the standard for uploading custom tracks to UCSC Genome Browser or IGV

GFF/GTF (General Feature Format/Gene Transfer Format)

Nine-column structure—includes source, feature type (exon, CDS, gene), and a flexible attributes column
GTF is gene-centric—a stricter GFF variant requiring gene_id and transcript_id attributes for RNA-seq analysis
Hierarchical relationships—can represent gene → transcript → exon structures through parent-child attribute links

Compare: BED vs. GFF/GTF—BED is simpler and uses 0-based coordinates; GFF/GTF is more detailed with 1-based coordinates and standardized attribute fields. Use BED for quick region definitions, GFF/GTF for full gene annotations. Coordinate system differences are a common source of off-by-one errors.

Structural Biology Formats

Protein and nucleic acid structures require specialized formats that capture three-dimensional atomic positions. These enable molecular modeling, docking, and structure-function analysis.

PDB (Protein Data Bank)

Atomic coordinate records—each ATOM line specifies x, y, z coordinates, atom type, residue, and chain identifier
Fixed-column format—legacy design with strict character positions, now supplemented by mmCIF for large structures
Experimental metadata—includes resolution, R-factor, and method (X-ray, NMR, cryo-EM) critical for assessing structure quality

Compare: PDB vs. sequence formats—PDB captures spatial arrangement while FASTA captures linear sequence. You can derive sequence from structure but not vice versa. Structure prediction tools like AlphaFold bridge this gap by predicting PDB-style coordinates from FASTA input.

Phylogenetic and Evolutionary Formats

Evolutionary analysis requires formats that represent both aligned sequences and tree topologies. These formats encode relationships between taxa.

PHYLIP

Header line specifies dimensions—first line contains number of taxa and sequence length
Interleaved or sequential—supports both layouts for multiple sequence alignments
Distance matrix support—can store pairwise evolutionary distances for distance-based tree construction methods

Newick

Parenthetical tree notation—nested parentheses represent branching patterns, e.g., ((A,B),C); shows A and B as sister taxa
Branch lengths optional—colons followed by numbers indicate evolutionary distance, e.g., (A:0.1,B:0.2):0.3
Compact and portable—entire tree topologies fit on a single line, making them easy to share and parse programmatically

Compare: PHYLIP vs. Newick—PHYLIP stores sequence alignments used to build trees; Newick stores the resulting tree topology. A typical phylogenetic workflow uses PHYLIP-formatted alignments as input and produces Newick-formatted trees as output.

Analysis Output Formats

Specialized formats capture results from specific bioinformatics algorithms. Understanding output options helps you choose the right downstream tools.

BLAST Output Formats

Multiple format options—tabular (-outfmt 6), XML (-outfmt 5), and pairwise alignment (-outfmt 0) serve different parsing needs
Tabular format fields—customizable columns including percent identity, alignment length, e-value, and bit score
E-value interpretation—represents expected number of hits by chance; lower values indicate more significant matches

Quick Reference Table

Concept	Best Examples
Raw sequence storage	FASTA, FASTQ
Quality-aware sequencing data	FASTQ
Read alignment to reference	SAM, BAM
Variant representation	VCF
Genomic region annotation	BED, GFF, GTF
3D molecular structure	PDB
Phylogenetic trees	Newick
Sequence alignment for phylogenetics	PHYLIP
Sequence similarity search results	BLAST output formats

Self-Check Questions

You receive raw Illumina sequencing data and need to assess read quality before alignment. Which file format contains the information you need, and what mathematical relationship defines the quality scores?
Compare SAM/BAM and VCF: at what stage of a variant-calling pipeline would you encounter each, and what biological question does each format help answer?
A collaborator sends you genomic coordinates, but your analysis is off by one base pair. Which two annotation formats might be involved, and how do their coordinate systems differ?
You need to visualize a custom set of regulatory regions in a genome browser and also annotate full gene structures with exon boundaries. Which format would you choose for each task, and why?
Describe the data flow in a typical phylogenetic analysis: which format would store your input multiple sequence alignment, and which format would represent your final tree topology?

🧬Bioinformatics

Major Bioinformatics File Formats

Why This Matters

Sequence Storage Formats

FASTA

FASTQ

Alignment and Mapping Formats

SAM/BAM

VCF (Variant Call Format)

Genomic Annotation Formats

BED (Browser Extensible Data)

GFF/GTF (General Feature Format/Gene Transfer Format)

Structural Biology Formats

PDB (Protein Data Bank)

Phylogenetic and Evolutionary Formats

PHYLIP

Newick

Analysis Output Formats

BLAST Output Formats

Quick Reference Table

Self-Check Questions

history

social science

english & capstone

arts

science

math & computer science

world languages

high school exams

honors classes

college classes

hs classes