Every bioinformatics pipeline you'll encounter—from basic sequence searches to complex variant analysis—depends on understanding how biological data is structured and stored. These file formats aren't arbitrary; each one evolved to solve a specific computational problem, whether that's storing raw sequencing reads with quality metrics, representing alignments against a reference genome, or encoding three-dimensional protein structures. You're being tested on your ability to recognize which format fits which analysis task, how data flows between formats in a pipeline, and what information each format captures or loses.
Don't just memorize file extensions—know what biological question each format helps answer. When you see a question about quality-aware sequence data, you should immediately think FASTQ. When asked about storing genomic coordinates for visualization, BED should come to mind. Understanding the conceptual purpose behind each format will help you troubleshoot pipelines, choose appropriate tools, and answer exam questions that test your grasp of data flow in genomic analysis.
These formats represent the fundamental building blocks: raw biological sequences. The key distinction is whether quality information is preserved alongside the sequence itself.
A FASTA record starts with a header line beginning with > followed by an identifier, with the sequence on subsequent lines. A FASTQ record has four lines in strict order: an identifier line (@), the sequence, a separator line (+), and ASCII-encoded quality scores.
Compare: FASTA vs. FASTQ. Both store sequences, but FASTQ adds per-base quality scores, which are essential for filtering low-confidence reads. If asked which format is appropriate for raw sequencing data, FASTQ is the expected answer; FASTA is for processed or reference sequences.
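The four-line FASTQ layout and its quality encoding can be sketched in a few lines of Python. This is a minimal illustration, not a production parser: it assumes the Phred+33 offset (the read name and quality string below are made up for the example), and the function names are hypothetical. The quality score relates to the per-base error probability p by Q = -10 * log10(p).

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into (identifier, sequence, quality)."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a well-formed FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    return header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Decode ASCII quality characters to integer scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Invert Q = -10 * log10(p) to recover the error probability p."""
    return 10 ** (-q / 10)


# Hypothetical record used only for illustration.
record = ["@read1", "GATTACA", "+", "IIIII5!"]
name, seq, qual = parse_fastq_record(record)
scores = phred_scores(qual)
low_confidence = [base for base, q in zip(seq, scores) if q < 20]
```

A score of 20 corresponds to a 1-in-100 chance that the base call is wrong, and a score of 30 to 1 in 1,000, which is why thresholds around those values are common in read filtering.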
Once sequences are generated, they must be aligned to a reference. These formats encode where reads map, how confidently, and what differences exist.
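In SAM/BAM, a CIGAR string such as 50M2I48M summarizes how a read aligns: 50 aligned bases (matches or mismatches), a 2-base insertion relative to the reference, then 48 more aligned bases.
Compare: SAM/BAM vs. VCF. SAM/BAM stores where reads align, while VCF stores where alignments differ from the reference. In a typical pipeline, you generate BAM first, then call variants to produce VCF. Understanding this data flow is critical for pipeline questions.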
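To make the CIGAR example concrete, the sketch below splits a CIGAR string into operations and counts how many bases it consumes on the read and on the reference. The operation sets follow the SAM specification; the function names are hypothetical and chosen for this illustration only.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")    # operations that advance along the reference
QUERY_CONSUMING = set("MIS=X")  # operations that advance along the read


def parse_cigar(cigar):
    """Return a list of (length, operation) tuples for a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Rebuild the string to reject input containing unrecognized characters.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(cigar):
    """Return (reference_span, read_length) implied by a CIGAR string."""
    ops = parse_cigar(cigar)
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    query_len = sum(n for n, op in ops if op in QUERY_CONSUMING)
    return ref_len, query_len


ops = parse_cigar("50M2I48M")
ref_span, read_len = cigar_lengths("50M2I48M")
```

For 50M2I48M the read is 100 bases long but covers only 98 reference positions, because the 2 inserted bases exist in the read and not in the reference.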
These formats describe what features exist at specific genomic coordinates. They're essential for connecting raw data to biological meaning.
GTF records carry gene_id and transcript_id attributes, which RNA-seq analysis relies on.
Compare: BED vs. GFF/GTF. BED is simpler and uses 0-based, half-open coordinates; GFF/GTF is more detailed, with 1-based, inclusive coordinates and standardized attribute fields. Use BED for quick region definitions and GFF/GTF for full gene annotations. Coordinate system differences are a common source of off-by-one errors.
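The off-by-one issue comes down to a single shift of the start coordinate. The sketch below assumes the usual conventions (BED intervals are 0-based and half-open, GFF/GTF intervals are 1-based and inclusive); the helper names are hypothetical.

```python
def bed_to_gff(start, end):
    """Convert 0-based half-open [start, end) to 1-based inclusive [start, end]."""
    return start + 1, end


def gff_to_bed(start, end):
    """Convert 1-based inclusive [start, end] to 0-based half-open [start, end)."""
    return start - 1, end


def bed_length(start, end):
    """Number of bases covered by a BED interval."""
    return end - start


def gff_length(start, end):
    """Number of bases covered by a GFF/GTF interval."""
    return end - start + 1


bed_interval = (99, 200)                  # hypothetical region in BED coordinates
gff_interval = bed_to_gff(*bed_interval)  # same bases, GFF/GTF coordinates
```

Only the start changes between the two systems; the end value is numerically the same, and the number of bases covered is identical once each format's own length formula is used.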
Protein and nucleic acid structures require specialized formats that capture three-dimensional atomic positions. These enable molecular modeling, docking, and structure-function analysis.
Compare: PDB vs. sequence formats. PDB captures spatial arrangement while FASTA captures linear sequence. You can read the sequence directly from a structure file, but a sequence file alone contains no coordinates. Structure prediction tools like AlphaFold bridge this gap by predicting PDB-style coordinates from FASTA input.
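Reading a sequence out of a structure file can be sketched as follows. This is a simplified illustration that assumes the fixed-column ATOM record layout of the PDB format, takes one alpha-carbon (CA) atom per residue, and ignores alternate locations, HETATM records, and missing residues. The three sample records, including their coordinates, are invented for the example, and the function name is hypothetical.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def sequence_from_pdb(lines, chain="A"):
    """Build a one-letter sequence from the CA atoms of one chain."""
    residues = []
    for line in lines:
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()  # columns 13-16: atom name
        res_name = line[17:20]           # columns 18-20: residue name
        chain_id = line[21]              # column 22: chain identifier
        if atom_name == "CA" and chain_id == chain:
            # Unknown or modified residues are reported as "X".
            residues.append(THREE_TO_ONE.get(res_name, "X"))
    return "".join(residues)


# Invented ATOM records, laid out in PDB fixed columns.
pdb_lines = [
    "ATOM      1  CA  MET A   1      27.340  24.430   2.614  1.00  9.67           C",
    "ATOM      9  CA  GLY A   2      30.110  25.000   3.500  1.00 10.00           C",
    "ATOM     13  CA  ALA A   3      32.500  27.200   5.100  1.00 11.20           C",
]
sequence = sequence_from_pdb(pdb_lines)
```

The reverse direction has no equivalent few-line function, which is the asymmetry the comparison describes: going from sequence to coordinates requires experimental data or a prediction method.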
Evolutionary analysis requires formats that represent both aligned sequences and tree topologies. These formats encode relationships between taxa.
In Newick, ((A,B),C); shows A and B as sister taxa, and branch lengths follow a colon, as in (A:0.1,B:0.2):0.3.
Compare: PHYLIP vs. Newick. PHYLIP stores sequence alignments used to build trees; Newick stores the resulting tree topology. A typical phylogenetic workflow uses PHYLIP-formatted alignments as input and produces Newick-formatted trees as output.
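The nested-parentheses notation maps naturally onto a recursive parser. The sketch below is a minimal one written for illustration: it handles unquoted labels and optional branch lengths only (no quoted names, comments, or embedded whitespace), and the dictionary layout and function names are assumptions of this example rather than any library's API.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts with name, length, children."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # step past "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # step past "," between siblings
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # step past ")"
        start = pos
        while text[pos] not in ",:();":
            pos += 1
        name = text[start:pos]
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node):
    """Collect leaf (taxon) names in left-to-right order."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]


tree = parse_newick("((A:0.1,B:0.2):0.3,C:0.4);")
clade_ab = tree["children"][0]  # internal node grouping A and B
```

In the parsed tree, A and B share the same parent node, which is what "sister taxa" means structurally, and that parent's own branch length is the 0.3 written after the closing parenthesis.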
Specialized formats capture results from specific bioinformatics algorithms. Understanding output options helps you choose the right downstream tools.
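For BLAST, the tab-separated layout produced by -outfmt 6 is the easiest of these outputs to parse. The sketch below assumes the default twelve columns and uses their commonly cited names (some BLAST+ versions label the first two qaccver and saccver instead); the sample hit line is invented, and the function name is hypothetical.

```python
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_COLUMNS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")


def parse_blast6_line(line):
    """Turn one default-layout tabular BLAST line into a typed dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(BLAST6_COLUMNS):
        raise ValueError(f"expected 12 tab-separated fields, got {len(fields)}")
    hit = dict(zip(BLAST6_COLUMNS, fields))
    for key in INT_COLUMNS:
        hit[key] = int(hit[key])
    for key in FLOAT_COLUMNS:
        hit[key] = float(hit[key])
    return hit


# Invented hit line for illustration only.
sample = "query1\tsubjectA\t98.50\t200\t3\t0\t1\t200\t51\t250\t1e-100\t360"
hit = parse_blast6_line(sample)
is_strong_hit = hit["evalue"] < 1e-10 and hit["pident"] >= 95.0
```

The same fields are present in the XML and pairwise outputs, but the tabular form is the one most downstream scripts expect because each hit is a single line.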
For BLAST, tabular (-outfmt 6), XML (-outfmt 5), and pairwise alignment (-outfmt 0) outputs serve different parsing needs.

| Concept | Best Examples |
|---|---|
| Raw sequence storage | FASTA, FASTQ |
| Quality-aware sequencing data | FASTQ |
| Read alignment to reference | SAM, BAM |
| Variant representation | VCF |
| Genomic region annotation | BED, GFF, GTF |
| 3D molecular structure | PDB |
| Phylogenetic trees | Newick |
| Sequence alignment for phylogenetics | PHYLIP |
| Sequence similarity search results | BLAST output formats |
You receive raw Illumina sequencing data and need to assess read quality before alignment. Which file format contains the information you need, and what mathematical relationship defines the quality scores?
Compare SAM/BAM and VCF: at what stage of a variant-calling pipeline would you encounter each, and what biological question does each format help answer?
A collaborator sends you genomic coordinates, but your analysis is off by one base pair. Which two annotation formats might be involved, and how do their coordinate systems differ?
You need to visualize a custom set of regulatory regions in a genome browser and also annotate full gene structures with exon boundaries. Which format would you choose for each task, and why?
Describe the data flow in a typical phylogenetic analysis: which format would store your input multiple sequence alignment, and which format would represent your final tree topology?