
🧬Bioinformatics Unit 12 Review


12.1 Unix and command-line tools


Written by the Fiveable Content Team • Last updated August 2025

Unix and command-line tools are essential for bioinformatics data processing. They offer powerful text manipulation capabilities and modular design, letting researchers build efficient workflows for complex genomic analyses.

This section covers Unix basics, file system navigation, text processing tools, and bioinformatics-specific software. It also introduces shell scripting, version control, and high-performance computing concepts for managing large-scale genomic datasets.

Introduction to Unix

Unix provides the command-line tools and scripting capabilities that underpin most bioinformatics work. Its emphasis on modularity, flexibility, and interoperability makes it possible to chain together small programs into custom workflows for genomic data manipulation.

Unix philosophy

The core idea is "do one thing and do it well." Each Unix tool is designed to handle a specific task, and you combine them to accomplish complex goals. Programs communicate through plain text, which keeps things simple and interoperable. The pipe operator (|) is what makes this practical: it lets you feed the output of one program directly into the input of another, forming pipelines.

Unix vs other operating systems

  • Offers superior text processing compared to Windows, which matters when you're handling multi-gigabyte genomic datasets
  • Provides a more standardized command-line interface across Unix-like systems (Linux, macOS), so skills transfer easily
  • Supports scripting languages (Bash, Perl, Python) that are standard in bioinformatics workflows
  • Generally delivers better performance and resource management for computationally intensive tasks

Command-line interface basics

A command-line interface (CLI) gives you direct access to system functions through text-based commands. For bioinformatics, CLIs offer far greater control and automation than graphical interfaces, especially when you need to process thousands of files or run batch analyses.

Terminal emulators

Terminal emulators are applications that simulate a text-based terminal on your modern OS. Common examples include xterm, iTerm2 (macOS), and PuTTY (Windows). They support features like multiple tabs, split panes, and customizable color schemes. Critically, they also let you connect to remote Unix-based servers through SSH (Secure Shell), which is how you'll typically access HPC clusters.

Shell types

The shell is the program that interprets your commands. Several shells exist, each with its own syntax and features:

  • Bash (Bourne Again Shell): The most common default shell on Linux systems. Most bioinformatics tutorials assume Bash.
  • Zsh (Z Shell): Now the default on macOS. Offers better tab completion and theming.
  • Fish (Friendly Interactive Shell): Provides autosuggestions and syntax highlighting out of the box.
  • Tcsh (TENEX C Shell): Still used in some scientific computing environments.

For scripting, Bash remains the standard in bioinformatics. Interactive use is more a matter of personal preference.

File system navigation

Knowing how to move around the file system efficiently is fundamental. Bioinformatics projects involve many directories for raw data, scripts, results, and configuration files.

Directory structure

The Unix file system is a tree rooted at / (the root directory). Key locations include:

  • /: Root directory, the top of the hierarchy
  • ~: Your home directory, where user-specific files and configurations live
  • /bin: Essential system binaries
  • /etc: System configuration files
  • /home: Contains all user home directories

In bioinformatics projects, you'll typically organize project subdirectories such as data/ (raw sequencing data), results/ (analysis outputs), and scripts/ (custom analysis scripts).

Two essential commands: ls lists directory contents, and pwd prints your current working directory.

File paths

  • Absolute paths start from root and give the full location: /usr/local/bin/python
  • Relative paths specify location relative to where you are now: ../data/sequences.fasta
  • . refers to the current directory; .. refers to the parent directory
  • ~ expands to your home directory
  • Wildcards: * matches any number of characters, ? matches exactly one character (e.g., sample_*.fastq)
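
The path rules above can be tried out directly in a shell. The project layout and file names here are made up for illustration:

```bash
# Hypothetical project layout to demonstrate relative paths and wildcards
mkdir -p project/data
touch project/data/sample_01.fastq project/data/sample_02.fastq

ls project/data/sample_*.fastq   # * matches any run of characters
ls project/data/sample_0?.fastq  # ? matches exactly one character
```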

File manipulation commands

These commands form the foundation for managing and processing bioinformatics data. You'll use them constantly for organizing files, previewing datasets, and setting up analyses.

Creating and editing files

  • touch creates empty files or updates timestamps on existing ones
  • Text editors like nano (beginner-friendly), vim (powerful but steep learning curve), and emacs let you create and modify files
  • echo writes text to files when combined with redirection: echo "hello" > output.txt
  • cat displays file contents and can concatenate multiple files together
  • head and tail show the beginning and end of files, which is invaluable for previewing large datasets without loading the entire file
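
A quick sketch of previewing a dataset with these commands; the two-record FASTQ file here is fabricated for the example:

```bash
# Build a tiny FASTQ file (two 4-line records) just for demonstration
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nFFFF\n' > reads.fastq

head -n 4 reads.fastq   # first record only
tail -n 4 reads.fastq   # last record only
wc -l reads.fastq       # 8 lines = 2 records
```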

Moving and copying files

  • mv moves or renames files and directories
  • cp copies files; use the -r flag to copy directories recursively
  • rsync provides advanced synchronization and transfer, especially useful for large datasets across systems
  • Wildcards work with these commands: cp *.fastq /data/raw/ copies all FASTQ files at once

File permissions

Unix controls access with three permission types: read (4), write (2), and execute (1). These are assigned to three categories: owner, group, and others.

  • chmod changes permissions. For example, chmod 755 script.sh gives the owner full permissions (7 = 4+2+1) and everyone else read/execute (5 = 4+1).
  • chown changes file ownership
  • ls -l displays detailed file info including permissions (the -rwxr-xr-x string at the start of each line)
  • Special permissions (setuid, setgid, sticky bit) exist for advanced access control scenarios
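
A minimal sketch of the chmod workflow, using a throwaway script name:

```bash
# Create a trivial script, then make it executable for everyone
printf '#!/bin/sh\necho hello\n' > myscript.sh
chmod 755 myscript.sh   # owner: rwx (7), group/others: r-x (5)
ls -l myscript.sh       # permission string starts with -rwxr-xr-x
./myscript.sh           # now runnable directly
```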

Text processing tools

Text processing is where Unix truly shines for bioinformatics. Genomic data formats like FASTA, FASTQ, SAM/BAM, and VCF are all text-based, so tools that filter, extract, and transform text are central to your workflow.


grep for pattern matching

grep searches for patterns in text files using regular expressions. Key flags:

  • -i: Case-insensitive matching
  • -v: Invert the match (show lines that don't contain the pattern)
  • -r: Search recursively through directories
  • -c: Count matching lines instead of printing them

In bioinformatics, you'll use grep to filter sequence headers, find specific motifs, or extract lines matching a gene name from annotation files.
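
For instance, counting and filtering sequences in a FASTA file (the file and gene names below are invented):

```bash
# Tiny FASTA file for demonstration
printf '>seq1 gene=BRCA1\nATGC\n>seq2 gene=TP53\nGGTA\n' > seqs.fasta

grep -c '^>' seqs.fasta     # count sequences by counting header lines
grep -i 'brca1' seqs.fasta  # case-insensitive search in headers
grep -v '^>' seqs.fasta     # invert: show only the sequence lines
```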

sed for stream editing

sed performs text transformations on an input stream or file. The most common operation is substitution:

```bash
sed 's/pattern/replacement/' file.txt
```
  • -i edits files in-place (be careful with this)
  • Supports regular expressions for complex pattern matching
  • Useful for modifying sequence headers, reformatting data files, or doing batch find-and-replace across files
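
One common use is trimming FASTA headers down to just the sequence ID; the input file here is made up:

```bash
printf '>seq1 extra description text\nATGC\n' > in.fasta
# Keep only the first whitespace-delimited token of each header line
sed 's/^\(>[^ ]*\) .*/\1/' in.fasta > out.fasta
head -n 1 out.fasta   # header is now just >seq1
```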

awk for data extraction

awk is a powerful tool for processing structured, column-based text data. It reads input line by line and automatically splits each line into fields.

  • $0 represents the entire line
  • $1, $2, etc. represent individual fields (columns)
  • Supports variables, conditionals, and loops

For example, to print the second column of a tab-separated file: awk '{print $2}' data.tsv. You can also use it to calculate statistics on sequence lengths or extract specific columns from tabular bioinformatics output.
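
A small sketch with a fabricated table of sequence lengths:

```bash
# Hypothetical tab-separated table: sequence name, length
printf 'seq1\t100\nseq2\t250\nseq3\t400\n' > lengths.tsv

awk '{print $2}' lengths.tsv                        # extract the length column
awk '{sum += $2} END {print sum / NR}' lengths.tsv  # mean length
```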

Pipes and redirection

Pipes and redirection are what turn individual Unix commands into powerful data processing pipelines. They're essential for building efficient bioinformatics workflows that handle large datasets without creating unnecessary intermediate files.

Input/output streams

Every Unix command has three standard streams:

  • stdin (standard input): Default input, usually the keyboard
  • stdout (standard output): Default output, usually the terminal
  • stderr (standard error): A separate stream for error messages

Redirection operators control where these streams go:

  • > writes stdout to a file (overwrites)
  • >> appends stdout to a file
  • < reads input from a file
  • 2> redirects stderr
  • &> redirects both stdout and stderr
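
These operators can be combined to build a simple run log; the messages and file names are illustrative:

```bash
echo "run started"  > run.log        # > overwrites the file
echo "step 1 done" >> run.log        # >> appends
ls no_such_file 2>> run.log || true  # 2>> sends the error message to the log
wc -l < run.log                      # < feeds the file to stdin
```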

Combining commands

The pipe operator (|) connects the stdout of one command to the stdin of the next. This lets you build multi-step pipelines without intermediate files.

Here's a real bioinformatics example:

```bash
zcat data.fastq.gz | awk 'NR % 4 == 2 {print length($0)}' | sort -n | uniq -c
```

This pipeline decompresses a FASTQ file, picks out the sequence line of each four-line record with awk (quality lines can also start with @, so filtering headers with grep alone is unreliable), computes each sequence's length, sorts the lengths numerically, and counts occurrences of each length. Each tool does one thing, and the pipe chains them together.

Shell scripting fundamentals

Shell scripting lets you automate repetitive tasks and create reproducible bioinformatics workflows. Instead of typing the same sequence of commands every time, you write them into a script that can be run, shared, and version-controlled.

Variables and control structures

  • Variables store data and are referenced with $: NAME="sample_01"
  • Environment variables like $PATH and $HOME provide system-wide configuration
  • If-else statements handle conditional execution
  • Loops (for, while) iterate over files or data. A for loop over FASTQ files is one of the most common patterns in bioinformatics scripting.
  • Case statements allow multiple conditional branches
  • Command substitution $() captures the output of a command into a variable: COUNT=$(wc -l < file.txt)
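
The classic loop-over-samples pattern might look like this (directory and file names are hypothetical):

```bash
# Fake inputs standing in for real sequencing files
mkdir -p fq
touch fq/sampleA.fastq fq/sampleB.fastq

for f in fq/*.fastq; do
    sample=$(basename "$f" .fastq)   # command substitution strips path and suffix
    echo "processing $sample"
done
```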

Functions in shell scripts

Functions are reusable code blocks that improve organization and readability. They can accept parameters and return values, and local variables limit scope within the function.

Here's a practical example that converts FASTQ to FASTA format:

```bash
fastq_to_fasta() { sed -n '1~4s/^@/>/p;2~4p' "$1"; }
```

This function takes a filename as its first argument ($1) and uses sed step addresses (a GNU sed extension): 1~4 matches every 4th line starting from line 1 (the header, whose leading @ is replaced with >), and 2~4 matches every 4th line starting from line 2 (the sequence).
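
Calling it might look like this, on a one-record FASTQ file invented for the demo:

```bash
fastq_to_fasta() { sed -n '1~4s/^@/>/p;2~4p' "$1"; }   # 1~4 step addressing is GNU sed

printf '@read1\nACGT\n+\nIIII\n' > r.fastq
fastq_to_fasta r.fastq   # emits a two-line FASTA record: >read1 then ACGT
```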

Package management

Package management systems simplify installing and maintaining bioinformatics software. They also help ensure reproducibility by tracking software versions and dependencies.

Installing software

  • System package managers like apt (Debian/Ubuntu) and yum (Red Hat/CentOS) handle system-wide software
  • Conda is the most popular package manager in bioinformatics because it creates isolated environments without requiring admin privileges
  • Bioconda is a Conda channel with thousands of pre-compiled bioinformatics tools
  • Compiling from source using make is sometimes necessary for cutting-edge or niche tools
  • Container technologies (Docker, Singularity) package an entire software environment, making it fully portable

Managing dependencies

  • Package managers handle dependency resolution automatically
  • Virtual environments (venv for Python, conda env for Conda) isolate project-specific dependencies so different projects don't conflict
  • Version pinning ensures reproducibility: requirements.txt (pip) or environment.yml (Conda) document exact versions
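
A version-pinned environment file might look like the sketch below; the environment name and pinned versions are arbitrary examples (bwa and samtools are real Bioconda packages):

```bash
# Write a hypothetical Conda environment spec with pinned versions
cat > environment.yml <<'EOF'
name: align
channels:
  - bioconda
  - conda-forge
dependencies:
  - bwa=0.7.17
  - samtools=1.19
EOF
# A collaborator would recreate it with: conda env create -f environment.yml
grep -c '=' environment.yml   # two pinned tool versions
```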
  • Containerization captures the entire software stack, including OS-level dependencies, for maximum reproducibility

Version control with Git

Version control tracks changes in your code and documentation over time. Git is the standard tool for this, and it's essential for collaborative development of bioinformatics pipelines.


Basic Git commands

A typical Git workflow follows these steps:

  1. git init initializes a new repository
  2. git add <file> stages changes for the next commit
  3. git commit -m "message" records staged changes with a descriptive message
  4. git status shows what's changed since the last commit
  5. git log displays commit history
  6. git diff shows line-by-line differences between commits or the working directory
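
The steps above, run end to end on a throwaway repository (file names are illustrative):

```bash
mkdir repo
git -C repo init -q
git -C repo config user.email "you@example.com"   # identity is required for commits
git -C repo config user.name  "You"

echo 'echo running alignment' > repo/align.sh
git -C repo add align.sh
git -C repo commit -q -m "Add alignment script"
git -C repo log --oneline    # one commit now in the history
```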

Collaborative workflows

  • Branching lets you develop features or test experiments in parallel without affecting the main codebase
  • Merging combines changes from different branches
  • Pull requests (on GitHub/GitLab) facilitate code review and discussion before merging
  • Forking creates a personal copy of someone else's repository for independent development
  • Continuous Integration (CI) automates testing so you catch errors before they reach the main branch

Bioinformatics-specific tools

The Unix environment hosts a large ecosystem of specialized bioinformatics tools. Knowing which tool to reach for is just as important as knowing how to use it.

BLAST and sequence alignment

BLAST (Basic Local Alignment Search Tool) compares a query sequence against a database to find similar sequences. Different variants handle different input types:

  • blastn: Nucleotide vs. nucleotide
  • blastp: Protein vs. protein
  • blastx: Translated nucleotide query vs. protein database

The BLAST+ suite provides command-line tools for both local and remote searches. Beyond BLAST, read mapping tools like Bowtie2 and BWA align sequencing reads to reference genomes, while MUSCLE and MAFFT perform multiple sequence alignments for evolutionary analysis.

File format conversion

Bioinformatics involves many file formats, and converting between them is a routine task:

  • seqtk: Converts between FASTA and FASTQ formats
  • samtools: Converts between SAM and BAM formats, plus sorting and indexing alignments
  • bedtools: Manipulates genomic interval files (BED, GFF, VCF)
  • bcftools: Handles variant call format (VCF) files
  • awk and sed: Often used for custom format conversions when no dedicated tool exists

High-performance computing

Large-scale bioinformatics analyses (whole-genome sequencing, metagenomics, large BLAST searches) require more compute power than a laptop provides. HPC clusters, which are almost universally Unix-based, fill this role.

Job scheduling systems

HPC clusters use job schedulers to manage shared resources among many users:

  • Slurm: The most common scheduler in academic environments
  • PBS (Portable Batch System) and SGE (Sun Grid Engine) are also widely used
  • You submit jobs via scripts that specify resource requirements (CPUs, memory, time) and execution commands
  • Queue systems manage job priorities and resource allocation
  • Array jobs let you run the same script on many different inputs in parallel
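
A submission script combining these ideas might look like the sketch below; the job name, resource values, and input naming scheme are all hypothetical, and the exact #SBATCH flags depend on your cluster:

```bash
#!/bin/bash
#SBATCH --job-name=blast_run   # hypothetical job name
#SBATCH --cpus-per-task=8      # CPUs for this task
#SBATCH --mem=16G              # memory request
#SBATCH --time=02:00:00        # wall-clock limit
#SBATCH --array=1-10           # array job: one task per input file

# Each array task processes its own (hypothetical) input file
blastn -query sample_${SLURM_ARRAY_TASK_ID}.fasta -db nt \
       -num_threads "$SLURM_CPUS_PER_TASK" \
       -out sample_${SLURM_ARRAY_TASK_ID}.out
```

Submitted with sbatch script.sh, the scheduler queues the ten tasks and runs them as resources become available.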

Parallel processing

  • MPI (Message Passing Interface): Distributed memory parallelism across multiple nodes
  • OpenMP: Shared memory parallelism within a single node
  • GNU Parallel: A simple tool for parallelizing command-line operations
  • Many bioinformatics tools (BLAST+, BWA) have built-in multi-threading options via flags like -num_threads or -t
  • Workflow managers like Snakemake and Nextflow orchestrate complex pipelines with automatic parallelization and dependency tracking
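
GNU Parallel isn't always installed, but the same fan-out can be sketched with xargs -P, which is available on most systems (the input list is invented, and each "job" is just an echo):

```bash
# Pretend input list standing in for real alignment inputs
printf 's1.fasta\ns2.fasta\ns3.fasta\n' > inputs.txt

# Run up to 2 jobs at a time, substituting each input for {}
xargs -P 2 -I {} echo "aligning {}" < inputs.txt
```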

Data management

Genomic datasets can be enormous (a single whole-genome sequencing run can produce hundreds of gigabytes). Unix provides tools for efficient storage, transfer, and organization of this data.

Compression techniques

  • gzip: The most common compression for bioinformatics files (.gz extension). Fast and widely supported.
  • bzip2: Higher compression ratios than gzip but slower to compress/decompress
  • xz: Even higher compression, at the cost of more CPU usage
  • CRAM: A specialized format for compressed alignment data that can be much smaller than BAM
  • Compression-aware tools like zcat and zgrep let you work with compressed files directly, without manually decompressing first
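
For example, a gzipped FASTA file can be inspected in place (the file is fabricated here):

```bash
printf '>seq1\nATGC\n' | gzip > seqs.fasta.gz

zcat seqs.fasta.gz | head -n 1   # stream the contents; no decompressed copy on disk
zgrep -c '^>' seqs.fasta.gz      # grep directly on the compressed file
```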

Archiving and backup

  • tar creates and extracts archive files (.tar, .tar.gz). Commonly used to bundle directories for transfer.
  • rsync efficiently synchronizes files between systems, transferring only what's changed
  • cron jobs automate regular backups and maintenance tasks on a schedule
  • RAID configurations provide hardware-level redundancy for critical data storage
  • Off-site backups (cloud storage, tape archives) protect against catastrophic data loss
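
Bundling a results directory for transfer might look like this (the directory contents are made up):

```bash
mkdir -p results
echo "analysis complete" > results/summary.txt

tar -czf results.tar.gz results/   # c=create, z=gzip, f=archive file name
tar -tzf results.tar.gz            # t=list contents without extracting
```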

Troubleshooting and debugging

Things will break. Effective troubleshooting saves hours of frustration and is a skill worth developing early.

Error messages interpretation

  • Error messages appear on stderr, which is separate from normal output. This means you can redirect errors to a log file with 2>.
  • Common error types: syntax errors (typos in commands), runtime errors (file not found, permission denied), and logical errors (command runs but produces wrong results)
  • Many tools have verbose (-v) or debug (-d) flags that produce more detailed output
  • man pages (man grep, man awk) are the built-in documentation. Stack Overflow and Biostars are valuable for bioinformatics-specific issues.
  • Read error messages from top to bottom. The first error is usually the root cause; later errors are often cascading failures.

Logging and monitoring

  • tee splits output to both a file and the screen simultaneously, so you can watch progress while saving a log: command | tee output.log
  • nohup lets processes continue running after you disconnect from the terminal, which is critical for long-running analyses
  • top and htop monitor CPU and memory usage in real time
  • ps shows running processes and their status
  • logrotate manages log file growth so they don't fill up your disk
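
The tee pattern in particular is worth internalizing; here is a toy version with made-up log messages:

```bash
# Watch output live while also capturing it to a log file
{ echo "step 1 ok"; echo "step 2 ok"; } | tee pipeline.log

wc -l < pipeline.log   # both lines were captured
```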