upgrade
upgrade

🧬Bioinformatics

Major Bioinformatics File Formats

Study smarter with Fiveable

Get study guides, practice questions, and cheatsheets for all your subjects. Join 500,000+ students with a 96% pass rate.

Get Started

Why This Matters

Every bioinformatics pipeline you'll encounter—from basic sequence searches to complex variant analysis—depends on understanding how biological data is structured and stored. These file formats aren't arbitrary; each one evolved to solve a specific computational problem, whether that's storing raw sequencing reads with quality metrics, representing alignments against a reference genome, or encoding three-dimensional protein structures. You're being tested on your ability to recognize which format fits which analysis task, how data flows between formats in a pipeline, and what information each format captures or loses.

Don't just memorize file extensions—know what biological question each format helps answer. When you see a question about quality-aware sequence data, you should immediately think FASTQ. When asked about storing genomic coordinates for visualization, BED should come to mind. Understanding the conceptual purpose behind each format will help you troubleshoot pipelines, choose appropriate tools, and answer exam questions that test your grasp of data flow in genomic analysis.


Sequence Storage Formats

These formats represent the fundamental building blocks: raw biological sequences. The key distinction is whether quality information is preserved alongside the sequence itself.

FASTA

  • Header-sequence structure—each entry begins with > followed by an identifier, with the sequence on subsequent lines
  • No quality information—stores only the sequence itself, making files compact but unsuitable for raw NGS data
  • Universal compatibility—the default input for BLAST searches, multiple sequence alignment, and database construction

FASTQ

  • Four-line entry structure—identifier, sequence, separator (+), and ASCII-encoded quality scores in strict order
  • Phred quality scores—each base gets a quality value representing the probability of sequencing error, typically Q=10log10(Perror)Q = -10 \log_{10}(P_{error})
  • NGS data standard—the primary output format from Illumina, Ion Torrent, and other high-throughput sequencing platforms

Compare: FASTA vs. FASTQ—both store sequences, but FASTQ adds per-base quality scores essential for filtering low-confidence reads. If asked which format is appropriate for raw sequencing data, FASTQ is always the answer; FASTA is for processed or reference sequences.


Alignment and Mapping Formats

Once sequences are generated, they must be aligned to a reference. These formats encode where reads map, how confidently, and what differences exist.

SAM/BAM

  • SAM is human-readable text; BAM is compressed binary—identical information, but BAM files are ~70% smaller and indexed for fast access
  • CIGAR strings—encode alignment operations (matches, insertions, deletions) in compact notation like 50M2I48M
  • Mandatory fields include mapping quality (MAPQ), reference position, and bitwise flags indicating read properties (paired, unmapped, reverse complement)

VCF (Variant Call Format)

  • One variant per line—records chromosome, position, reference allele, alternate allele(s), and quality score
  • Genotype fields—stores individual-level data for population studies, including zygosity and read depth supporting each call
  • Essential for GWAS and clinical genomics—the standard format for SNPs, indels, and structural variants across all major variant callers

Compare: SAM/BAM vs. VCF—SAM/BAM stores where reads align, while VCF stores where alignments differ from the reference. In a typical pipeline, you generate BAM first, then call variants to produce VCF. Understanding this data flow is critical for pipeline questions.


Genomic Annotation Formats

These formats describe what features exist at specific genomic coordinates. They're essential for connecting raw data to biological meaning.

BED (Browser Extensible Data)

  • Minimal three-column structure—chromosome, start position (0-based), and end position define each feature
  • Optional columns—name, score, strand, and display parameters can extend the format for richer annotation
  • Genome browser visualization—the standard for uploading custom tracks to UCSC Genome Browser or IGV

GFF/GTF (General Feature Format/Gene Transfer Format)

  • Nine-column structure—includes source, feature type (exon, CDS, gene), and a flexible attributes column
  • GTF is gene-centric—a stricter GFF variant requiring gene_id and transcript_id attributes for RNA-seq analysis
  • Hierarchical relationships—can represent gene → transcript → exon structures through parent-child attribute links

Compare: BED vs. GFF/GTF—BED is simpler and uses 0-based coordinates; GFF/GTF is more detailed with 1-based coordinates and standardized attribute fields. Use BED for quick region definitions, GFF/GTF for full gene annotations. Coordinate system differences are a common source of off-by-one errors.


Structural Biology Formats

Protein and nucleic acid structures require specialized formats that capture three-dimensional atomic positions. These enable molecular modeling, docking, and structure-function analysis.

PDB (Protein Data Bank)

  • Atomic coordinate records—each ATOM line specifies x, y, z coordinates, atom type, residue, and chain identifier
  • Fixed-column format—legacy design with strict character positions, now supplemented by mmCIF for large structures
  • Experimental metadata—includes resolution, R-factor, and method (X-ray, NMR, cryo-EM) critical for assessing structure quality

Compare: PDB vs. sequence formats—PDB captures spatial arrangement while FASTA captures linear sequence. You can derive sequence from structure but not vice versa. Structure prediction tools like AlphaFold bridge this gap by predicting PDB-style coordinates from FASTA input.


Phylogenetic and Evolutionary Formats

Evolutionary analysis requires formats that represent both aligned sequences and tree topologies. These formats encode relationships between taxa.

PHYLIP

  • Header line specifies dimensions—first line contains number of taxa and sequence length
  • Interleaved or sequential—supports both layouts for multiple sequence alignments
  • Distance matrix support—can store pairwise evolutionary distances for distance-based tree construction methods

Newick

  • Parenthetical tree notation—nested parentheses represent branching patterns, e.g., ((A,B),C); shows A and B as sister taxa
  • Branch lengths optional—colons followed by numbers indicate evolutionary distance, e.g., (A:0.1,B:0.2):0.3
  • Compact and portable—entire tree topologies fit on a single line, making them easy to share and parse programmatically

Compare: PHYLIP vs. Newick—PHYLIP stores sequence alignments used to build trees; Newick stores the resulting tree topology. A typical phylogenetic workflow uses PHYLIP-formatted alignments as input and produces Newick-formatted trees as output.


Analysis Output Formats

Specialized formats capture results from specific bioinformatics algorithms. Understanding output options helps you choose the right downstream tools.

BLAST Output Formats

  • Multiple format options—tabular (-outfmt 6), XML (-outfmt 5), and pairwise alignment (-outfmt 0) serve different parsing needs
  • Tabular format fields—customizable columns including percent identity, alignment length, e-value, and bit score
  • E-value interpretation—represents expected number of hits by chance; lower values indicate more significant matches

Quick Reference Table

ConceptBest Examples
Raw sequence storageFASTA, FASTQ
Quality-aware sequencing dataFASTQ
Read alignment to referenceSAM, BAM
Variant representationVCF
Genomic region annotationBED, GFF, GTF
3D molecular structurePDB
Phylogenetic treesNewick
Sequence alignment for phylogeneticsPHYLIP
Sequence similarity search resultsBLAST output formats

Self-Check Questions

  1. You receive raw Illumina sequencing data and need to assess read quality before alignment. Which file format contains the information you need, and what mathematical relationship defines the quality scores?

  2. Compare SAM/BAM and VCF: at what stage of a variant-calling pipeline would you encounter each, and what biological question does each format help answer?

  3. A collaborator sends you genomic coordinates, but your analysis is off by one base pair. Which two annotation formats might be involved, and how do their coordinate systems differ?

  4. You need to visualize a custom set of regulatory regions in a genome browser and also annotate full gene structures with exon boundaries. Which format would you choose for each task, and why?

  5. Describe the data flow in a typical phylogenetic analysis: which format would store your input multiple sequence alignment, and which format would represent your final tree topology?