Every bioinformatics pipeline you'll encounter, from basic sequence searches to complex variant analysis, depends on understanding how biological data is structured and stored. These file formats aren't arbitrary; each one evolved to solve a specific computational problem, whether that's storing raw sequencing reads with quality metrics, representing alignments against a reference genome, or encoding three-dimensional protein structures. You're being tested on your ability to recognize which format fits which analysis task, how data flows between formats in a pipeline, and what information each format captures or loses.
Don't just memorize file extensions. Know what biological question each format helps answer. When you see a question about quality-aware sequence data, you should immediately think FASTQ. When asked about storing genomic coordinates for visualization, BED should come to mind. Understanding the conceptual purpose behind each format will help you troubleshoot pipelines, choose appropriate tools, and answer exam questions that test your grasp of data flow in genomic analysis.
These formats represent the fundamental building blocks: raw biological sequences. The key distinction is whether quality information is preserved alongside the sequence itself.
FASTA is the simplest and most widely used format for representing biological sequences. Each entry starts with a header line beginning with >, followed by an identifier and optional description. The sequence itself appears on the subsequent lines.
Because it's so stripped-down, FASTA works best for reference genomes, protein databases, and any situation where the sequence is already trusted.
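To make the header-plus-sequence layout concrete, here is a minimal sketch of a FASTA reader in Python. The function name `parse_fasta` and the example records are hypothetical, and the sketch assumes the identifier is the first whitespace-delimited token after `>`.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {identifier: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: identifier first, optional description after it
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines are concatenated until the next header
            records[current_id] += line
    return records


# Hypothetical two-record example (one nucleotide, one protein)
example = """>seq1 hypothetical example
ACGTAC
GTTA
>seq2
MKV
""".splitlines()

records = parse_fasta(example)
```

Note that nothing in the file says how reliable each base is, which is exactly the gap FASTQ fills.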
FASTQ extends FASTA by pairing every base with a quality score. Each read follows a strict four-line structure:
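1. Header line (begins with @, containing the read identifier)
2. Sequence line (the raw base calls)
3. Separator line (begins with +, sometimes followed by the identifier again)
4. Quality line (one ASCII character per base in the sequence line)

Each character on line 4 maps to a Phred quality score, defined as:

$$Q = -10 \log_{10}(P)$$

where $P$ is the estimated probability that the base call is wrong.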
So a Phred score of 30 means a 1-in-1,000 chance that base call is wrong (99.9% accuracy). A score of 20 means 1-in-100 (99% accuracy). These scores are critical for downstream filtering and trimming of low-confidence reads.
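The arithmetic above can be checked with a short Python sketch. It assumes the Phred+33 ASCII encoding (an assumption not stated in this guide; some older files use a different offset), and the function names are hypothetical.

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred score Q into the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_quality(quality_line: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string into integer Phred scores.

    The default offset of 33 is an assumption (Phred+33 encoding).
    """
    return [ord(ch) - offset for ch in quality_line]


# 'I' is ASCII 73, '5' is ASCII 53, '+' is ASCII 43
scores = decode_quality("I5+")
error_probs = [phred_to_error_prob(q) for q in scores]
```

With this encoding, a quality filter is just a threshold on the decoded integers, for example keeping reads whose mean score is at least 20.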
FASTQ is the primary output from Illumina, Ion Torrent, and other high-throughput sequencing platforms.
Compare: FASTA vs. FASTQ: both store sequences, but FASTQ adds per-base quality scores essential for filtering low-confidence reads. If asked which format is appropriate for raw sequencing data, FASTQ is always the answer; FASTA is for processed or reference sequences.
Once sequences are generated, they must be aligned to a reference. These formats encode where reads map, how confidently, and what differences exist.
After aligning reads to a reference genome (using tools like BWA or Bowtie2), the results are stored in SAM or BAM format. SAM (Sequence Alignment/Map) is human-readable text; BAM is its compressed binary equivalent. They contain identical information, but BAM files are roughly 70% smaller and can be indexed for fast random access to specific genomic regions.
Each alignment record includes several key fields:
- CIGAR string: a compact encoding of how the read aligns to the reference. For example, 50M2I48M means 50 bases match/mismatch the reference, then a 2-base insertion in the read, then 48 more match/mismatch bases.

BAM files are typically sorted by genomic coordinate and indexed (producing a .bai file) so that tools like IGV or samtools can quickly retrieve reads from a specific region without scanning the entire file.
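The CIGAR string 50M2I48M can be unpacked with a few lines of Python. This is a minimal sketch with hypothetical function names; it relies on the SAM convention that M, D, N, = and X consume reference positions while M, I, S, = and X consume read bases.

```python
import re

# One or more digits followed by a single CIGAR operation code
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar: str) -> int:
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


ops = parse_cigar("50M2I48M")
```

For 50M2I48M the read is 100 bases long but covers only 98 reference positions, because the 2 inserted bases exist in the read and not in the reference.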
VCF stores the differences between your sample(s) and a reference genome. Where SAM/BAM tells you where reads landed, VCF tells you where those reads disagree with the reference.
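As a rough illustration of what a variant record looks like, the sketch below reads the first five tab-separated columns of a VCF data line (CHROM, POS, ID, REF, ALT) and classifies the change by comparing allele lengths. The function names and the example line are hypothetical, and the classification is deliberately simplistic.

```python
def parse_vcf_line(line: str) -> dict:
    """Extract the core fields from one VCF data line (not a '#' header line)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt = fields[:5]
    return {
        "chrom": chrom,
        "pos": int(pos),        # VCF positions are 1-based
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),  # several alternate alleles may be listed
    }


def variant_type(ref: str, alt: str) -> str:
    """Classify a variant by comparing reference and alternate allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "complex"


# Hypothetical record: an A-to-G substitution on chr1
record = parse_vcf_line("chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30")
```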
Compare: SAM/BAM vs. VCF: SAM/BAM stores where reads align, while VCF stores where alignments differ from the reference. In a typical pipeline, you generate BAM first (alignment step), then call variants to produce VCF (variant calling step). Understanding this data flow is critical for pipeline questions.
These formats describe what features exist at specific genomic coordinates. They're essential for connecting raw data to biological meaning.
BED is a lightweight format for defining genomic regions. At minimum, each line has just three tab-separated columns: the chromosome name, the start position, and the end position.
Optional columns can add a feature name, score, strand, and display parameters. BED is the standard for uploading custom tracks to the UCSC Genome Browser or IGV, and it's commonly used to define regions of interest (promoters, peaks from ChIP-seq, etc.).
GFF and GTF are richer annotation formats with a nine-column structure that includes source, feature type (exon, CDS, gene), start/end positions, score, strand, reading frame, and a flexible attributes column.
- GTF requires gene_id and transcript_id attributes on its feature lines. This makes it the go-to format for RNA-seq quantification tools like featureCounts and HTSeq.
- GFF3 uses Parent= attributes to define gene, transcript, and exon hierarchies, while GTF relies on shared gene_id/transcript_id values.

Compare: BED vs. GFF/GTF: BED is simpler and uses 0-based, half-open coordinates; GFF/GTF is more detailed with 1-based, fully closed coordinates and standardized attribute fields. Use BED for quick region definitions, GFF/GTF for full gene annotations. Coordinate system differences are a common source of off-by-one errors, so always check which system a tool expects.
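The coordinate difference is easiest to see in code. In this minimal sketch (function names are hypothetical), only the start value shifts between the two systems; the end value is numerically the same because BED's end is exclusive and GFF/GTF's end is inclusive.

```python
def bed_to_gff(start: int, end: int) -> tuple:
    """Convert 0-based, half-open BED coordinates to 1-based, closed GFF/GTF."""
    return start + 1, end


def gff_to_bed(start: int, end: int) -> tuple:
    """Convert 1-based, closed GFF/GTF coordinates to 0-based, half-open BED."""
    return start - 1, end


def bed_length(start: int, end: int) -> int:
    """Feature length under BED conventions."""
    return end - start


def gff_length(start: int, end: int) -> int:
    """Feature length under GFF/GTF conventions."""
    return end - start + 1


# The first 100 bases of a chromosome in each system
bed_interval = (0, 100)
gff_interval = bed_to_gff(*bed_interval)
```

Forgetting the shift moves every feature by one base, and using the wrong length formula changes every feature's size by one.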
Protein and nucleic acid structures require specialized formats that capture three-dimensional atomic positions. These enable molecular modeling, docking, and structure-function analysis.
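The PDB format stores atomic-level 3D coordinates for macromolecules. Each ATOM record specifies the atom serial number and name, the residue name and number, the chain identifier, the x, y, and z coordinates in ångströms, and the occupancy and temperature factor (B-factor).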
PDB uses a fixed-column format, meaning each piece of data must occupy specific character positions in the line. This legacy design works for most structures but struggles with very large complexes (over 99,999 atoms or 62 chains). For those cases, the newer mmCIF (macromolecular Crystallographic Information File) format is used instead.
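Fixed-column parsing means slicing each line by character position instead of splitting on whitespace. The sketch below assumes the standard ATOM column layout (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, coordinates in 31-54); the function name and the coordinate values are hypothetical.

```python
def parse_atom_record(line: str) -> dict:
    """Slice one PDB ATOM record by fixed character positions."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
    }


# Hypothetical record, assembled piece by piece to show the column widths
example = (
    "ATOM  "    # columns 1-6: record name
    "    1"     # columns 7-11: atom serial number
    " "         # column 12: blank
    " N  "      # columns 13-16: atom name
    " "         # column 17: alternate location indicator
    "MET"       # columns 18-20: residue name
    " "         # column 21: blank
    "A"         # column 22: chain identifier
    "   1"      # columns 23-26: residue sequence number
    "    "      # columns 27-30: insertion code and blanks
    "  27.340"  # columns 31-38: x coordinate
    "  24.430"  # columns 39-46: y coordinate
    "   2.614"  # columns 47-54: z coordinate
    "  1.00"    # columns 55-60: occupancy
    "  9.67"    # columns 61-66: temperature factor
)

atom = parse_atom_record(example)
```

The five-character serial field is why the format tops out at 99,999 atoms, and the single-character chain field is why chain identifiers run out.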
PDB files also contain experimental metadata like resolution, R-factor, and the method used to determine the structure (X-ray crystallography, NMR, or cryo-EM). These details are critical for assessing how much you should trust the atomic coordinates.
Compare: PDB vs. sequence formats: PDB captures spatial arrangement while FASTA captures linear sequence. You can derive a sequence from a structure (just read off the residues), but you can't go the other direction without computational prediction. Structure prediction tools like AlphaFold bridge this gap by predicting PDB-style coordinates from FASTA input.
Evolutionary analysis requires formats that represent both aligned sequences and tree topologies. These formats encode relationships between taxa.
PHYLIP format was designed for the PHYLIP software suite but is now accepted by many phylogenetic programs. The first line is a header specifying the number of taxa and the alignment length (e.g., 5 500 for 5 sequences of 500 positions).
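The header makes PHYLIP files easy to validate. The sketch below assumes a relaxed, sequential layout in which each taxon sits on one line as a name, whitespace, then the sequence (strict PHYLIP instead pads names to a fixed width). The function name and the example alignment are hypothetical.

```python
def parse_relaxed_phylip(text: str) -> dict:
    """Parse a relaxed sequential PHYLIP alignment and check it against its header."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Header: number of taxa, then alignment length
    n_taxa, n_sites = (int(x) for x in lines[0].split())
    alignment = {}
    for line in lines[1:]:
        name, seq = line.split(None, 1)
        alignment[name] = seq.replace(" ", "")
    if len(alignment) != n_taxa:
        raise ValueError("number of sequences does not match the header")
    for name, seq in alignment.items():
        if len(seq) != n_sites:
            raise ValueError(f"{name} does not match the header alignment length")
    return alignment


# Hypothetical alignment: 3 taxa, 8 positions each
example = """3 8
taxonA ACGTACGT
taxonB ACGTACGA
taxonC ACGAACGT
"""

alignment = parse_relaxed_phylip(example)
```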
Newick format represents tree topologies as a single line of text using nested parentheses. For example:
((A:0.1,B:0.2):0.3,C:0.4);
This tree shows A and B as sister taxa (sharing a common ancestor), with C as the outgroup. The numbers after colons represent branch lengths (evolutionary distances). The entire tree always ends with a semicolon.
- Branch lengths are optional: ((A,B),C); is valid and just shows the topology without distances.

Compare: PHYLIP vs. Newick: PHYLIP stores sequence alignments used to build trees; Newick stores the resulting tree topology. A typical phylogenetic workflow uses PHYLIP-formatted alignments as input and produces Newick-formatted trees as output.
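The nested-parentheses structure of Newick maps naturally onto a recursive parser. This is a minimal sketch with hypothetical function names; it handles plain labels and optional branch lengths only, not quoted labels or comments.

```python
def parse_newick(text: str) -> tuple:
    """Parse a Newick string into nested (name, branch_length, children) tuples."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick tree must end with a semicolon")
    pos = 0

    def parse_node() -> tuple:
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume '('
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1  # consume ')'
        # Optional node label
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        # Optional branch length after a colon
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return (name, length, children)

    return parse_node()


def leaf_names(node: tuple) -> list:
    """Collect leaf labels from left to right."""
    name, _, children = node
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]


tree = parse_newick("((A:0.1,B:0.2):0.3,C:0.4);")
topology_only = parse_newick("((A,B),C);")
```

Internal nodes come back with an empty name, and nodes without a branch length come back with `None`.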
Specialized formats capture results from specific bioinformatics algorithms. Understanding output options helps you choose the right downstream tools.
BLAST supports multiple output formats, selected with the -outfmt flag:
- Pairwise (-outfmt 0): the default human-readable format showing full alignments between query and subject
- XML (-outfmt 5): structured and detailed, good for programmatic parsing with libraries like Biopython
- Tabular (-outfmt 6): the most commonly used for scripting, with customizable tab-separated columns including percent identity, alignment length, e-value, and bit score

The e-value is the metric you'll use most for judging hit significance. It represents the expected number of hits with that score or better that you'd find by chance in a database of that size. An e-value close to zero is extremely significant; an e-value of 10 means you'd expect 10 hits that good just by random chance. Lower is better, and the threshold you choose depends on your application (homology searches typically use a stringent cutoff well below 1).
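Tabular output is popular because it takes very little code to filter. The sketch below assumes the default twelve columns of -outfmt 6 with no custom column list; the function names, the example hits, and the cutoff of 1e-5 are hypothetical choices for illustration, not recommendations.

```python
import csv
import io

# Default column order for -outfmt 6 when no custom fields are requested
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
FLOAT_FIELDS = ("pident", "evalue", "bitscore")
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")


def parse_blast_tabular(handle):
    """Yield one dict per hit from default tabular BLAST output."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        hit = dict(zip(BLAST6_COLUMNS, row))
        for key in FLOAT_FIELDS:
            hit[key] = float(hit[key])
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        yield hit


def significant_hits(hits, max_evalue: float) -> list:
    """Keep hits whose e-value is at or below the chosen cutoff."""
    return [h for h in hits if h["evalue"] <= max_evalue]


# Two hypothetical hits: one strong match, one that could easily be chance
example = (
    "query1\tsubjA\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-100\t365\n"
    "query1\tsubjB\t35.00\t60\t39\t0\t5\t64\t100\t159\t4.2\t25.4\n"
)

kept = significant_hits(parse_blast_tabular(io.StringIO(example)), max_evalue=1e-5)
```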
| Concept | Best Examples |
|---|---|
| Raw sequence storage | FASTA, FASTQ |
| Quality-aware sequencing data | FASTQ |
| Read alignment to reference | SAM, BAM |
| Variant representation | VCF |
| Genomic region annotation | BED, GFF, GTF |
| 3D molecular structure | PDB, mmCIF |
| Phylogenetic trees | Newick |
| Sequence alignment for phylogenetics | PHYLIP |
| Sequence similarity search results | BLAST output formats |
You receive raw Illumina sequencing data and need to assess read quality before alignment. Which file format contains the information you need, and what mathematical relationship defines the quality scores?
Compare SAM/BAM and VCF: at what stage of a variant-calling pipeline would you encounter each, and what biological question does each format help answer?
A collaborator sends you genomic coordinates, but your analysis is off by one base pair. Which two annotation formats might be involved, and how do their coordinate systems differ?
You need to visualize a custom set of regulatory regions in a genome browser and also annotate full gene structures with exon boundaries. Which format would you choose for each task, and why?
Describe the data flow in a typical phylogenetic analysis: which format would store your input multiple sequence alignment, and which format would represent your final tree topology?