
Major Bioinformatics File Formats


Why This Matters

Every bioinformatics pipeline you'll encounter, from basic sequence searches to complex variant analysis, depends on understanding how biological data is structured and stored. These file formats aren't arbitrary; each one evolved to solve a specific computational problem, whether that's storing raw sequencing reads with quality metrics, representing alignments against a reference genome, or encoding three-dimensional protein structures. You're being tested on your ability to recognize which format fits which analysis task, how data flows between formats in a pipeline, and what information each format captures or loses.

Don't just memorize file extensions. Know what biological question each format helps answer. When you see a question about quality-aware sequence data, you should immediately think FASTQ. When asked about storing genomic coordinates for visualization, BED should come to mind. Understanding the conceptual purpose behind each format will help you troubleshoot pipelines, choose appropriate tools, and answer exam questions that test your grasp of data flow in genomic analysis.


Sequence Storage Formats

These formats represent the fundamental building blocks: raw biological sequences. The key distinction is whether quality information is preserved alongside the sequence itself.

FASTA

FASTA is the simplest and most widely used format for representing biological sequences. Each entry starts with a header line beginning with >, followed by an identifier and optional description. The sequence itself appears on the subsequent lines.

  • No quality information stored, which keeps files compact but makes the format unsuitable for raw NGS data where you need to know how confident each base call is
  • Universal compatibility across tools: FASTA is the default input for BLAST searches, multiple sequence alignment programs (like ClustalW or MUSCLE), and database construction

Because it's so stripped-down, FASTA works best for reference genomes, protein databases, and any situation where the sequence is already trusted.
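To make the header-plus-sequence layout concrete, here is a minimal sketch of a FASTA reader in Python. The function name `parse_fasta` and the two example records are hypothetical, introduced only for illustration; for real work an established parser (for example, the one in Biopython) is usually the safer choice.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {identifier: sequence} from an iterable of FASTA lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The identifier is the first whitespace-delimited token after '>';
            # anything after it is the optional description.
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[current_id] += line
    return records


# Hypothetical example input: two records, the first wrapped over two lines.
fasta_example = """>seq1 example record
ACGTACGT
TTGA
>seq2
GGCC
""".splitlines()

fasta_records = parse_fasta(fasta_example)
print(fasta_records)
```

Note that nothing in the file says how reliable any base is, which is exactly the gap FASTQ fills.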

FASTQ

FASTQ extends FASTA by pairing every base with a quality score. Each read follows a strict four-line structure:

  1. Line 1: Header starting with @, containing the read identifier
  2. Line 2: The nucleotide sequence
  3. Line 3: A separator line (just a +, sometimes followed by the identifier again)
  4. Line 4: ASCII-encoded quality scores, one character per base

Each character on line 4 maps to a Phred quality score, defined as:

$$Q = -10 \log_{10}(P_{\text{error}})$$

So a Phred score of 30 means a 1-in-1,000 chance that base call is wrong (99.9% accuracy). A score of 20 means 1-in-100 (99% accuracy). These scores are critical for downstream filtering and trimming of low-confidence reads.
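The Phred relationship can be checked with a few lines of Python. This is a sketch under one stated assumption: the quality characters use the Phred+33 encoding (ASCII code minus 33), which is common for current Illumina data but not universal. The function names are hypothetical.

```python
import math
from typing import List


def phred_to_error_prob(q: float) -> float:
    """Invert Q = -10 * log10(P_error) to get the error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Apply Q = -10 * log10(P_error)."""
    return -10 * math.log10(p)


def decode_quality(qual: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to Phred scores (offset 33 is assumed)."""
    return [ord(ch) - offset for ch in qual]


# 'I' is ASCII 73, '?' is ASCII 63, '5' is ASCII 53
scores = decode_quality("I?5")
print(scores)
print([phred_to_error_prob(q) for q in scores])
```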

FASTQ is the primary output from Illumina, Ion Torrent, and other high-throughput sequencing platforms.
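The strict four-line structure means a reader can consume records in groups of four instead of searching for @ characters (which can also appear as quality symbols). The sketch below assumes unwrapped four-line records and Phred+33 quality encoding; the names `read_fastq` and `mean_phred` and the two reads are hypothetical. Averaging Phred scores is a simplification used here only to illustrate filtering.

```python
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, "").rstrip("\n")
        plus = next(it, "").rstrip("\n")
        qual = next(it, "").rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


def mean_phred(qual: str, offset: int = 33) -> float:
    """Arithmetic mean of the Phred scores in a quality string."""
    return sum(ord(ch) - offset for ch in qual) / len(qual)


# Hypothetical reads: 'I' encodes Q40 and '!' encodes Q0 under Phred+33.
fastq_text = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n"
kept = [
    name
    for name, seq, qual in read_fastq(fastq_text.splitlines())
    if mean_phred(qual) >= 20
]
print(kept)
```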

Compare: FASTA vs. FASTQ: both store sequences, but FASTQ adds per-base quality scores essential for filtering low-confidence reads. If asked which format is appropriate for raw sequencing data, FASTQ is always the answer; FASTA is for processed or reference sequences.


Alignment and Mapping Formats

Once sequences are generated, they must be aligned to a reference. These formats encode where reads map, how confidently, and what differences exist.

SAM/BAM

After aligning reads to a reference genome (using tools like BWA or Bowtie2), the results are stored in SAM or BAM format. SAM (Sequence Alignment/Map) is human-readable text; BAM is its compressed binary equivalent. They contain identical information, but BAM files are substantially smaller (the exact saving depends on the data) and can be indexed for fast random access to specific genomic regions.

Each alignment record includes several key fields:

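  • CIGAR strings encode the alignment operations in compact notation. For example, 50M2I48M means 50 aligned bases (each either matching or mismatching the reference), then a 2-base insertion in the read, then 48 more aligned bases. The read is therefore 100 bases long but covers only 98 reference positions.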
  • MAPQ (mapping quality) indicates how confidently the read maps to that particular location. A MAPQ of 0 means the read maps equally well to multiple locations.
  • Bitwise FLAG field encodes read properties using a single integer. Different bits indicate whether the read is paired, unmapped, on the reverse strand, etc. For instance, a FLAG of 99 (1 + 2 + 32 + 64) tells you the read is paired, aligned as part of a proper pair, has its mate on the reverse strand, and is the first read of the pair.
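Both the CIGAR string and the FLAG integer can be unpacked with very little code. The sketch below uses the operation letters and bit values from the SAM specification; the function names and the short labels given to each bit are this example's own choices, not official names.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Reference bases covered: M, D, N, = and X consume the reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar: str) -> int:
    """Read bases described: M, I, S, = and X consume the read."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


# Bit values from the SAM specification; labels are informal.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> List[str]:
    """List the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


print(parse_cigar("50M2I48M"))
print(reference_span("50M2I48M"), read_length("50M2I48M"))
print(decode_flag(99))
```

The insertion adds to the read length but not to the reference span, which is why the two totals differ by 2.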

BAM files are typically sorted by genomic coordinate and indexed (producing a .bai file) so that tools like IGV or samtools can quickly retrieve reads from a specific region without scanning the entire file.

VCF (Variant Call Format)

VCF stores the differences between your sample(s) and a reference genome. Where SAM/BAM tells you where reads landed, VCF tells you where those reads disagree with the reference.

  • One variant per line, recording chromosome, position, reference allele, alternate allele(s), and a quality score
  • Genotype fields store individual-level data for population studies, including zygosity (homozygous vs. heterozygous) and the read depth supporting each allele call
  • Essential for GWAS and clinical genomics: VCF is the standard format for SNPs, indels, and structural variants across all major variant callers (GATK, FreeBayes, DeepVariant, etc.)
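As an illustration of the one-variant-per-line layout, the sketch below splits a single VCF data line into its fixed columns and per-sample genotype fields. The data line is invented for this example, the function names are hypothetical, and the parser ignores header lines and INFO parsing; a dedicated VCF library is a better fit for real files.

```python
import re
from typing import Any, Dict


def parse_vcf_line(line: str) -> Dict[str, Any]:
    """Parse one tab-separated VCF data line (not a '#' header line)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    record: Dict[str, Any] = {
        "chrom": chrom,
        "pos": int(pos),            # VCF positions are 1-based
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),      # multiple alternate alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info,
        "samples": [],
    }
    if len(fields) > 9:
        keys = fields[8].split(":")  # FORMAT column, e.g. GT:DP
        for sample in fields[9:]:
            record["samples"].append(dict(zip(keys, sample.split(":"))))
    return record


def zygosity(gt: str) -> str:
    """Classify a GT value such as '0/1' (unphased) or '1|1' (phased)."""
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return "missing"
    if len(set(alleles)) == 1:
        return "homozygous_ref" if alleles[0] == "0" else "homozygous_alt"
    return "heterozygous"


# Invented example line with two samples.
vcf_line = "chr1\t12345\t.\tA\tG,T\t50\tPASS\tDP=42\tGT:DP\t0/1:30\t1|1:12"
variant = parse_vcf_line(vcf_line)
print(variant["chrom"], variant["pos"], variant["ref"], variant["alt"])
print([zygosity(s["GT"]) for s in variant["samples"]])
```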

Compare: SAM/BAM vs. VCF: SAM/BAM stores where reads align, while VCF stores where alignments differ from the reference. In a typical pipeline, you generate BAM first (alignment step), then call variants to produce VCF (variant calling step). Understanding this data flow is critical for pipeline questions.


Genomic Annotation Formats

These formats describe what features exist at specific genomic coordinates. They're essential for connecting raw data to biological meaning.

BED (Browser Extensible Data)

BED is a lightweight format for defining genomic regions. At minimum, each line has just three tab-separated columns:

  1. Chromosome (e.g., chr1)
  2. Start position (0-based, so the first base of a chromosome is position 0)
  3. End position (exclusive, so a feature covering the first 100 bases would be 0–100)

Optional columns can add a feature name, score, strand, and display parameters. BED is the standard for uploading custom tracks to the UCSC Genome Browser or IGV, and it's commonly used to define regions of interest (promoters, peaks from ChIP-seq, etc.).
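A short sketch shows how the three required columns and the first few optional ones can be read, and why the half-open convention makes length arithmetic simple. The function name and the example line are hypothetical.

```python
from typing import Any, Dict


def parse_bed_line(line: str) -> Dict[str, Any]:
    """Parse a tab-separated BED line with 3 to 6 columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("BED requires at least chrom, start and end")
    return {
        "chrom": fields[0],
        "start": int(fields[1]),  # 0-based, inclusive
        "end": int(fields[2]),    # exclusive
        "name": fields[3] if len(fields) > 3 else None,
        "score": fields[4] if len(fields) > 4 else None,
        "strand": fields[5] if len(fields) > 5 else None,
    }


# Hypothetical feature covering the first 100 bases of chr1.
region = parse_bed_line("chr1\t0\t100\tpromoter_A\t0\t+")

# With half-open coordinates, the length is simply end - start.
region_length = region["end"] - region["start"]
print(region["name"], region_length)
```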

GFF/GTF (General Feature Format / Gene Transfer Format)

GFF and GTF are richer annotation formats with a nine-column structure that includes source, feature type (exon, CDS, gene), start/end positions, score, strand, reading frame, and a flexible attributes column.

  • GTF is a stricter variant of GFF (specifically GFF2-like), requiring gene_id and transcript_id attributes. This makes it the go-to format for RNA-seq quantification tools like featureCounts and HTSeq.
  • Hierarchical relationships can be represented through parent-child attribute links, capturing gene → transcript → exon structures in a single file.
  • GFF3 (the current GFF standard) uses explicit Parent= attributes to define these hierarchies, while GTF relies on shared gene_id/transcript_id values.
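The difference between the two attribute styles is easiest to see side by side. The sketch below parses the ninth column of a GTF line (quoted `key "value";` pairs) and of a GFF3 line (`key=value` pairs separated by semicolons). The identifiers G1, T1, and exon1 are made up, the function names are hypothetical, and the GTF parser only handles quoted values.

```python
import re
from typing import Dict
from urllib.parse import unquote


def parse_gtf_attributes(attr: str) -> Dict[str, str]:
    """Parse GTF-style attributes such as: gene_id "G1"; transcript_id "T1";"""
    return {m.group(1): m.group(2)
            for m in re.finditer(r'(\S+)\s+"([^"]*)"\s*;', attr)}


def parse_gff3_attributes(attr: str) -> Dict[str, str]:
    """Parse GFF3-style attributes such as: ID=exon1;Parent=T1"""
    out: Dict[str, str] = {}
    for pair in attr.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        out[key] = unquote(value)  # GFF3 percent-encodes reserved characters
    return out


gtf_attrs = parse_gtf_attributes('gene_id "G1"; transcript_id "T1"; exon_number "2";')
gff3_attrs = parse_gff3_attributes("ID=exon1;Parent=T1")

# GTF groups exons by shared IDs; GFF3 points to the parent feature explicitly.
print(gtf_attrs["gene_id"], gtf_attrs["transcript_id"])
print(gff3_attrs["Parent"])
```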

Compare: BED vs. GFF/GTF: BED is simpler and uses 0-based, half-open coordinates; GFF/GTF is more detailed with 1-based, fully closed coordinates and standardized attribute fields. Use BED for quick region definitions, GFF/GTF for full gene annotations. Coordinate system differences are a common source of off-by-one errors, so always check which system a tool expects.
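The off-by-one risk comes down to a single adjustment of the start coordinate. A minimal sketch of the conversion, with hypothetical function names:

```python
from typing import Tuple


def bed_to_gff(start: int, end: int) -> Tuple[int, int]:
    """0-based half-open (BED) -> 1-based fully closed (GFF/GTF)."""
    return start + 1, end


def gff_to_bed(start: int, end: int) -> Tuple[int, int]:
    """1-based fully closed (GFF/GTF) -> 0-based half-open (BED)."""
    return start - 1, end


# The first 100 bases of a chromosome in each system.
gff_start, gff_end = bed_to_gff(0, 100)
print(gff_start, gff_end)

# The feature length is the same either way, but the formula differs.
bed_length = 100 - 0                    # end - start
gff_length = gff_end - gff_start + 1    # end - start + 1
print(bed_length, gff_length)
```

Only the start changes; the end value is numerically identical in both systems because BED's exclusive end equals GFF's inclusive end.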


Structural Biology Formats

Protein and nucleic acid structures require specialized formats that capture three-dimensional atomic positions. These enable molecular modeling, docking, and structure-function analysis.

PDB (Protein Data Bank)

The PDB format stores atomic-level 3D coordinates for macromolecules. Each ATOM record specifies:

  • x, y, z coordinates of the atom in angstroms
  • Atom type and name (e.g., CA for alpha carbon)
  • Residue name, chain identifier, and residue number

PDB uses a fixed-column format, meaning each piece of data must occupy specific character positions in the line. This legacy design works for most structures but struggles with very large complexes (over 99,999 atoms or 62 chains). For those cases, the newer mmCIF (macromolecular Crystallographic Information File) format is used instead.
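Because the format is fixed-column, an ATOM record is read by slicing character positions rather than splitting on whitespace. The sketch below uses the column ranges from the PDB format specification (for example, x in columns 31 to 38) and covers only a subset of fields. The example atom and its coordinates are invented, and the line is built with a format string so the column alignment is explicit; the names are hypothetical.

```python
from typing import Any, Dict


def parse_atom_record(line: str) -> Dict[str, Any]:
    """Extract selected fields from a PDB ATOM/HETATM line by column position."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return {
        "serial": int(line[6:11]),         # columns 7-11
        "atom_name": line[12:16].strip(),  # columns 13-16
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain_id": line[21],              # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38, angstroms
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
    }


# Build an invented alpha-carbon record with exact column widths.
atom_line = "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   {:>8.3f}{:>8.3f}{:>8.3f}".format(
    "ATOM", 2, " CA ", "", "ALA", "A", 1, "", 11.104, 6.134, -6.504
)
atom = parse_atom_record(atom_line)
print(atom)
```

The five-character serial field and the single-character chain field are where the atom and chain limits mentioned above come from.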

PDB files also contain experimental metadata like resolution, R-factor, and the method used to determine the structure (X-ray crystallography, NMR, or cryo-EM). These details are critical for assessing how much you should trust the atomic coordinates.

Compare: PDB vs. sequence formats: PDB captures spatial arrangement while FASTA captures linear sequence. You can derive a sequence from a structure (just read off the residues), but you can't go the other direction without computational prediction. Structure prediction tools like AlphaFold bridge this gap by predicting PDB-style coordinates from FASTA input.


Phylogenetic and Evolutionary Formats

Evolutionary analysis requires formats that represent both aligned sequences and tree topologies. These formats encode relationships between taxa.

PHYLIP

PHYLIP format was designed for the PHYLIP software suite but is now accepted by many phylogenetic programs. The first line is a header specifying the number of taxa and the alignment length (e.g., 5 500 for 5 sequences of 500 positions).

  • Supports both interleaved (sequences broken across multiple blocks) and sequential (one full sequence per taxon) layouts
  • Can also store pairwise distance matrices for distance-based tree construction methods like neighbor-joining
  • One quirk to watch for: the classic PHYLIP format truncates taxon names to 10 characters, which can cause problems with longer identifiers
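The header line and the 10-character name field can be illustrated with a small writer and reader. This is a simplified sketch of the classic sequential layout with one line per taxon; it does not handle interleaved files or distance matrices. The function names and the two sequences are hypothetical.

```python
from typing import Dict


def write_phylip(seqs: Dict[str, str]) -> str:
    """Write aligned sequences in classic sequential PHYLIP layout."""
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    lines = [f"{len(seqs)} {lengths.pop()}"]  # header: number of taxa, alignment length
    for name, seq in seqs.items():
        # Names are truncated and padded to exactly 10 characters.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines)


def parse_phylip(text: str) -> Dict[str, str]:
    """Read a sequential PHYLIP alignment with one line per taxon."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_taxa, n_sites = (int(x) for x in lines[0].split())
    records: Dict[str, str] = {}
    for line in lines[1:1 + n_taxa]:
        name, seq = line[:10].strip(), line[10:].replace(" ", "")
        if len(seq) != n_sites:
            raise ValueError(f"{name!r} has {len(seq)} sites, expected {n_sites}")
        records[name] = seq
    if len(records) != n_taxa:
        raise ValueError("missing taxa or duplicate names after truncation")
    return records


alignment = {"Homo_sapiens_1": "ACGTACGT", "Mus": "ACGTTCGT"}
phylip_text = write_phylip(alignment)
print(phylip_text)
parsed_alignment = parse_phylip(phylip_text)
print(list(parsed_alignment))
```

Two long names that share their first 10 characters would collide after truncation, which is the failure the final check in `parse_phylip` is meant to catch.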

Newick

Newick format represents tree topologies as a single line of text using nested parentheses. For example:

((A:0.1,B:0.2):0.3,C:0.4);

This tree shows A and B as sister taxa (sharing a common ancestor), with C as the outgroup. The numbers after colons represent branch lengths (evolutionary distances). The entire tree always ends with a semicolon.

  • Branch lengths are optional: ((A,B),C); is valid and just shows the topology without distances
  • Compact and portable: entire tree topologies fit on a single line, making them easy to share and parse programmatically
  • Most tree-building software (RAxML, IQ-TREE, MrBayes) outputs trees either as plain Newick or inside a NEXUS file, a block-structured format that embeds Newick tree strings
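The nested-parentheses grammar lends itself to a short recursive parser. The sketch below handles the basic form used in the examples above (unquoted names, optional branch lengths, no whitespace or comments); it is an illustration with hypothetical function names, not a full Newick implementation.

```python
from typing import Any, Dict, List


def parse_newick(text: str) -> Dict[str, Any]:
    """Parse a basic Newick string into nested dicts with name, length, children."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick tree must end with a semicolon")
    pos = 0

    def parse_node() -> Dict[str, Any]:
        nonlocal pos
        children: List[Dict[str, Any]] = []
        if text[pos] == "(":
            pos += 1  # consume '('
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ')'
        # Optional node name: everything up to the next delimiter.
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        # Optional branch length after ':'.
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node: Dict[str, Any]) -> List[str]:
    """Collect leaf names from left to right."""
    if not node["children"]:
        return [node["name"]]
    return [n for child in node["children"] for n in leaf_names(child)]


tree = parse_newick("((A:0.1,B:0.2):0.3,C:0.4);")
print(leaf_names(tree))
print(tree["children"][0]["length"])
```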

Compare: PHYLIP vs. Newick: PHYLIP stores sequence alignments used to build trees; Newick stores the resulting tree topology. A typical phylogenetic workflow uses PHYLIP-formatted alignments as input and produces Newick-formatted trees as output.


Analysis Output Formats

Specialized formats capture results from specific bioinformatics algorithms. Understanding output options helps you choose the right downstream tools.

BLAST Output Formats

BLAST supports multiple output formats, selected with the -outfmt flag:

  • Pairwise (-outfmt 0): the default human-readable format showing full alignments between query and subject
  • XML (-outfmt 5): structured and detailed, good for programmatic parsing with libraries like BioPython
  • Tabular (-outfmt 6): the most commonly used for scripting, with customizable tab-separated columns including percent identity, alignment length, e-value, and bit score

The e-value is the metric you'll use most for judging hit significance. It represents the expected number of hits with that score or better that you'd find by chance in a database of that size. An e-value of $1 \times 10^{-50}$ is extremely significant; an e-value of 10 means you'd expect 10 hits that good just by random chance. Lower is better, and the threshold you choose depends on your application (typically $< 1 \times 10^{-5}$ for homology searches).
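Tabular output is popular precisely because filtering on e-value takes only a few lines. The sketch below assumes the default 12-column layout of -outfmt 6 (no custom column list) and uses two invented hits; the function name and the threshold variable are hypothetical.

```python
from typing import Any, Dict, Iterable, List

# Default column order for BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")
INT_COLUMNS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")


def parse_blast6(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse default-layout tabular BLAST output into a list of dicts."""
    hits: List[Dict[str, Any]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        hit: Dict[str, Any] = dict(zip(BLAST6_COLUMNS, line.split("\t")))
        for key in FLOAT_COLUMNS:
            hit[key] = float(hit[key])
        for key in INT_COLUMNS:
            hit[key] = int(hit[key])
        hits.append(hit)
    return hits


# Two invented hits for one query: one strong, one no better than chance.
blast_lines = [
    "q1\ts1\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-50\t370",
    "q1\ts2\t35.0\t60\t39\t0\t10\t69\t100\t159\t10\t22.3",
]
EVALUE_CUTOFF = 1e-5  # the homology-search threshold mentioned above
significant = [h["sseqid"] for h in parse_blast6(blast_lines) if h["evalue"] < EVALUE_CUTOFF]
print(significant)
```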


Quick Reference Table

Concept                                 Best Examples
Raw sequence storage                    FASTA, FASTQ
Quality-aware sequencing data           FASTQ
Read alignment to reference             SAM, BAM
Variant representation                  VCF
Genomic region annotation               BED, GFF, GTF
3D molecular structure                  PDB, mmCIF
Phylogenetic trees                      Newick
Sequence alignment for phylogenetics    PHYLIP
Sequence similarity search results      BLAST output formats

Self-Check Questions

  1. You receive raw Illumina sequencing data and need to assess read quality before alignment. Which file format contains the information you need, and what mathematical relationship defines the quality scores?

  2. Compare SAM/BAM and VCF: at what stage of a variant-calling pipeline would you encounter each, and what biological question does each format help answer?

  3. A collaborator sends you genomic coordinates, but your analysis is off by one base pair. Which two annotation formats might be involved, and how do their coordinate systems differ?

  4. You need to visualize a custom set of regulatory regions in a genome browser and also annotate full gene structures with exon boundaries. Which format would you choose for each task, and why?

  5. Describe the data flow in a typical phylogenetic analysis: which format would store your input multiple sequence alignment, and which format would represent your final tree topology?
