Unix and command-line tools are essential for bioinformatics data processing. They offer powerful text manipulation capabilities and modular design, letting researchers build efficient workflows for complex genomic analyses.
This section covers Unix basics, file system navigation, text processing tools, and bioinformatics-specific software. It also introduces shell scripting, version control, and high-performance computing concepts for managing large-scale genomic datasets.
Introduction to Unix
Unix provides the command-line tools and scripting capabilities that underpin most bioinformatics work. Its emphasis on modularity, flexibility, and interoperability makes it possible to chain together small programs into custom workflows for genomic data manipulation.
Unix philosophy
The core idea is "do one thing and do it well." Each Unix tool is designed to handle a specific task, and you combine them to accomplish complex goals. Programs communicate through plain text, which keeps things simple and interoperable. The pipe operator (|) is what makes this practical: it lets you feed the output of one program directly into the input of another, forming pipelines.
Unix vs other operating systems
- Offers superior text processing compared to Windows, which matters when you're handling multi-gigabyte genomic datasets
- Provides a more standardized command-line interface across Unix-like systems (Linux, macOS), so skills transfer easily
- Supports scripting languages (Bash, Perl, Python) that are standard in bioinformatics workflows
- Generally delivers better performance and resource management for computationally intensive tasks
Command-line interface basics
A command-line interface (CLI) gives you direct access to system functions through text-based commands. For bioinformatics, CLIs offer far greater control and automation than graphical interfaces, especially when you need to process thousands of files or run batch analyses.
Terminal emulators
Terminal emulators are applications that simulate a text-based terminal on your modern OS. Common examples include xterm, iTerm2 (macOS), and PuTTY (Windows). They support features like multiple tabs, split panes, and customizable color schemes. Critically, they also let you connect to remote Unix-based servers through SSH (Secure Shell), which is how you'll typically access HPC clusters.
Shell types
The shell is the program that interprets your commands. Several shells exist, each with its own syntax and features:
- Bash (Bourne Again Shell): The most common default shell on Linux systems. Most bioinformatics tutorials assume Bash.
- Zsh (Z Shell): Now the default on macOS. Offers better tab completion and theming.
- Fish (Friendly Interactive Shell): Provides autosuggestions and syntax highlighting out of the box.
- Tcsh (TENEX C Shell): Still used in some scientific computing environments.
For scripting, Bash remains the standard in bioinformatics. Interactive use is more a matter of personal preference.
File system navigation
Knowing how to move around the file system efficiently is fundamental. Bioinformatics projects involve many directories for raw data, scripts, results, and configuration files.
Directory structure
The Unix file system is a tree rooted at / (the root directory). Key locations include:
- /: Root directory, the top of the hierarchy
- ~: Your home directory, where user-specific files and configurations live
- /bin: Essential system binaries
- /etc: System configuration files
- /home: Contains all user home directories
In bioinformatics projects, you'll typically organize your own subdirectories such as data/ (raw sequencing data), results/ (analysis outputs), and scripts/ (custom analysis scripts).
Two essential commands: ls lists directory contents, and pwd prints your current working directory.
File paths
- Absolute paths start from the root and give the full location: /usr/local/bin/python
- Relative paths specify a location relative to your current directory: ../data/sequences.fasta
- . refers to the current directory, .. refers to the parent directory, and ~ expands to your home directory
- Wildcards: * matches any number of characters and ? matches exactly one character (e.g., sample_*.fastq)
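A quick sketch of these path and wildcard conventions in action (all file and directory names here are illustrative):

```shell
# Build a small illustrative layout, then navigate it
mkdir -p pathdemo/data
cd pathdemo
touch data/sample_01.fastq data/sample_02.fastq data/notes.txt
ls data/sample_*.fastq     # * matches both sample files
ls data/sample_0?.fastq    # ? matches exactly one character
ls ./data                  # . refers to the current directory
ls ~                       # ~ expands to your home directory
```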
File manipulation commands
These commands form the foundation for managing and processing bioinformatics data. You'll use them constantly for organizing files, previewing datasets, and setting up analyses.
Creating and editing files
- touch creates empty files or updates timestamps on existing ones
- Text editors like nano (beginner-friendly), vim (powerful but with a steep learning curve), and emacs let you create and modify files
- echo writes text to files when combined with redirection: echo "hello" > output.txt
- cat displays file contents and can concatenate multiple files together
- head and tail show the beginning and end of files, which is invaluable for previewing large datasets without loading the entire file
Moving and copying files
- mv moves or renames files and directories
- cp copies files; use the -r flag to copy directories recursively
- rsync provides advanced synchronization and transfer, especially useful for moving large datasets between systems
- Wildcards work with these commands: cp *.fastq /data/raw/ copies all FASTQ files at once
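A minimal sketch of these commands together (file and directory names are hypothetical):

```shell
# Set up two illustrative directories and two empty FASTQ files
mkdir -p raw backup
touch sample_A.fastq sample_B.fastq
cp *.fastq raw/                                    # wildcard copy of all FASTQ files
mv raw/sample_A.fastq raw/sample_A_renamed.fastq   # mv also renames files
cp -r raw backup/                                  # -r copies the directory recursively
```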
File permissions
Unix controls access with three permission types: read (4), write (2), and execute (1). These are assigned to three categories: owner, group, and others.
- chmod changes permissions. For example, chmod 755 script.sh gives the owner full permissions (7 = 4+2+1) and everyone else read/execute (5 = 4+1).
- chown changes file ownership
- ls -l displays detailed file info including permissions (the -rwxr-xr-x string at the start of each line)
- Special permissions (setuid, setgid, sticky bit) exist for advanced access control scenarios
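The octal notation can be seen directly in the permission string that ls -l prints (the script name is illustrative):

```shell
touch run_pipeline.sh
chmod 755 run_pipeline.sh    # 7 = rwx for the owner, 5 = r-x for group and others
ls -l run_pipeline.sh        # the line begins with -rwxr-xr-x
```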
Text processing tools
Text processing is where Unix truly shines for bioinformatics. Genomic data formats like FASTA, FASTQ, SAM/BAM, and VCF are all text-based, so tools that filter, extract, and transform text are central to your workflow.

grep for pattern matching
grep searches for patterns in text files using regular expressions. Key flags:
- -i: Case-insensitive matching
- -v: Invert the match (show lines that don't contain the pattern)
- -r: Search recursively through directories
- -c: Count matching lines instead of printing them
In bioinformatics, you'll use grep to filter sequence headers, find specific motifs, or extract lines matching a gene name from annotation files.
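A small worked example of those flags on a FASTA file (sequences and gene names are made up):

```shell
# Build a tiny FASTA file, then filter it with grep
printf '>seq1 BRCA1\nATGCATGC\n>seq2 TP53\nGGCATTAC\n' > genes.fasta
grep '^>' genes.fasta        # show all sequence headers
grep -c '^>' genes.fasta     # count sequences instead of printing them
grep -i 'brca1' genes.fasta  # case-insensitive match on a gene name
grep -v '^>' genes.fasta     # invert: show only the sequence lines
```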
sed for stream editing
sed performs text transformations on an input stream or file. The most common operation is substitution:
```shell
sed 's/pattern/replacement/' file.txt
```
- -i edits files in place (be careful with this)
- Supports regular expressions for complex pattern matching
- Useful for modifying sequence headers, reformatting data files, or doing batch find-and-replace across files
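For example, renaming FASTA headers might look like this (file contents are illustrative; the .bak suffix tells -i to keep a backup copy, which is safer than editing in place with no backup):

```shell
printf '>seq1 raw\nATGC\n>seq2 raw\nGGCA\n' > headers.fasta
sed 's/raw/trimmed/' headers.fasta          # substitution, printed to stdout
sed -i.bak 's/raw/trimmed/' headers.fasta   # in-place edit, original saved as headers.fasta.bak
```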
awk for data extraction
awk is a powerful tool for processing structured, column-based text data. It reads input line by line and automatically splits each line into fields.
- $0 represents the entire line
- $1, $2, etc. represent individual fields (columns)
- Supports variables, conditionals, and loops
For example, to print the second column of a tab-separated file: awk '{print $2}' data.tsv. You can also use it to calculate statistics on sequence lengths or extract specific columns from tabular bioinformatics output.
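A short sketch of both patterns, filtering rows and computing a summary statistic (the gene names and lengths are made up):

```shell
# Tab-separated table of gene lengths
printf 'geneA\t120\ngeneB\t300\ngeneC\t90\n' > lengths.tsv
awk '$2 > 100 {print $1}' lengths.tsv                # names of genes longer than 100
awk '{sum += $2} END {print sum / NR}' lengths.tsv   # average length across all rows
```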
Pipes and redirection
Pipes and redirection are what turn individual Unix commands into powerful data processing pipelines. They're essential for building efficient bioinformatics workflows that handle large datasets without creating unnecessary intermediate files.
Input/output streams
Every Unix command has three standard streams:
- stdin (standard input): Default input, usually the keyboard
- stdout (standard output): Default output, usually the terminal
- stderr (standard error): A separate stream for error messages
Redirection operators control where these streams go:
- > writes stdout to a file (overwrites)
- >> appends stdout to a file
- < reads input from a file
- 2> redirects stderr
- &> redirects both stdout and stderr
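The operators in combination (file names are illustrative):

```shell
echo "run started" > run.log            # > creates or overwrites the log
echo "step 1 done" >> run.log           # >> appends a second line
ls no_such_dir 2> errors.log || true    # 2> captures only the error message
wc -l < run.log                         # < feeds the file to stdin (prints 2)
```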
Combining commands
The pipe operator (|) connects the stdout of one command to the stdin of the next. This lets you build multi-step pipelines without intermediate files.
Here's a real bioinformatics example:
```shell
zcat data.fastq.gz | awk 'NR % 4 == 2 {print length($0)}' | sort -n | uniq -c
```
This pipeline decompresses a FASTQ file, extracts the sequence line from each four-line record (line 2 of every 4), calculates each sequence's length, sorts the lengths numerically, and counts occurrences of each length. Each tool does one thing, and the pipes chain them together.
Shell scripting fundamentals
Shell scripting lets you automate repetitive tasks and create reproducible bioinformatics workflows. Instead of typing the same sequence of commands every time, you write them into a script that can be run, shared, and version-controlled.
Variables and control structures
- Variables store data and are referenced with $: NAME="sample_01"
- Environment variables like $PATH and $HOME provide system-wide configuration
- If-else statements handle conditional execution
- Loops (for, while) iterate over files or data. A for loop over FASTQ files is one of the most common patterns in bioinformatics scripting.
- Case statements allow multiple conditional branches
- Command substitution $() captures the output of a command so it can be stored in a variable: COUNT=$(wc -l < file.txt)
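Putting variables, a for loop, and command substitution together in the classic loop-over-FASTQ pattern (the files here are tiny stand-ins created for the demo):

```shell
# Create two hypothetical FASTQ files, then loop over them
printf '@r1\nAT\n+\nII\n' > s1.fastq
printf '@r1\nAT\n+\nII\n@r2\nGC\n+\nII\n' > s2.fastq
for FQ in *.fastq; do
    COUNT=$(wc -l < "$FQ")       # command substitution into a variable
    echo "$FQ: $COUNT lines"
done
```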
Functions in shell scripts
Functions are reusable code blocks that improve organization and readability. They can accept parameters and return values, and local variables limit scope within the function.
Here's a practical example that converts FASTQ to FASTA format:
```shell
fastq_to_fasta() { sed -n '1~4s/^@/>/p;2~4p' "$1"; }
```
This function takes a filename as its first argument ($1) and uses sed to print every 4th line starting from line 1 with the leading @ replaced by > (converting the header), and every 4th line starting from line 2 (the sequence). Note that the 1~4 step address syntax is a GNU sed extension.
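Usage might look like this (the read and file names are made up; requires GNU sed for the 1~4 syntax):

```shell
# Define the converter as above, then run it on a one-read FASTQ file
fastq_to_fasta() { sed -n '1~4s/^@/>/p;2~4p' "$1"; }
printf '@read1\nATGCATGC\n+\nIIIIIIII\n' > one_read.fastq
fastq_to_fasta one_read.fastq > one_read.fasta
cat one_read.fasta
```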
Package management
Package management systems simplify installing and maintaining bioinformatics software. They also help ensure reproducibility by tracking software versions and dependencies.
Installing software
- System package managers like apt (Debian/Ubuntu) and yum (Red Hat/CentOS) handle system-wide software
- Conda is the most popular package manager in bioinformatics because it creates isolated environments without requiring admin privileges
- Bioconda is a Conda channel with thousands of pre-compiled bioinformatics tools
- Compiling from source using make is sometimes necessary for cutting-edge or niche tools
- Container technologies (Docker, Singularity) package an entire software environment, making it fully portable
Managing dependencies
- Package managers handle dependency resolution automatically
- Virtual environments (venv for Python, conda env for Conda) isolate project-specific dependencies so different projects don't conflict
- Version pinning ensures reproducibility: requirements.txt (pip) or environment.yml (Conda) document exact versions
- Containerization captures the entire software stack, including OS-level dependencies, for maximum reproducibility
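A minimal environment.yml sketch with pinned versions (the environment name and version numbers are illustrative, not a recommendation):

```yaml
name: rnaseq-env          # hypothetical project environment
channels:
  - bioconda
  - conda-forge
dependencies:
  - samtools=1.19         # illustrative pinned versions
  - bwa=0.7.17
  - python=3.11
```

Recreating the environment from this file with conda env create -f environment.yml gives collaborators the same tool versions you used.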
Version control with Git
Version control tracks changes in your code and documentation over time. Git is the standard tool for this, and it's essential for collaborative development of bioinformatics pipelines.

Basic Git commands
A typical Git workflow follows these steps:
- git init initializes a new repository
- git add <file> stages changes for the next commit
- git commit -m "message" records staged changes with a descriptive message
- git status shows what's changed since the last commit
- git log displays commit history
- git diff shows line-by-line differences between commits or against the working directory
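The full cycle on a fresh repository might look like this (directory and file names are illustrative; the identity flags are set inline only so the demo commit works on an unconfigured machine):

```shell
mkdir repo_demo && cd repo_demo
git init -q                                   # start a new repository
echo 'echo "hello"' > analysis.sh
git add analysis.sh                           # stage the new script
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "Add analysis script"        # record the change
git log --oneline                             # shows the single commit
git status --short                            # clean working tree: no output
```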
Collaborative workflows
- Branching lets you develop features or test experiments in parallel without affecting the main codebase
- Merging combines changes from different branches
- Pull requests (on GitHub/GitLab) facilitate code review and discussion before merging
- Forking creates a personal copy of someone else's repository for independent development
- Continuous Integration (CI) automates testing so you catch errors before they reach the main branch
Bioinformatics-specific tools
The Unix environment hosts a large ecosystem of specialized bioinformatics tools. Knowing which tool to reach for is just as important as knowing how to use it.
BLAST and sequence alignment
BLAST (Basic Local Alignment Search Tool) compares a query sequence against a database to find similar sequences. Different variants handle different input types:
- blastn: Nucleotide vs. nucleotide
- blastp: Protein vs. protein
- blastx: Translated nucleotide query vs. protein database
The BLAST+ suite provides command-line tools for both local and remote searches. Beyond BLAST, read mapping tools like Bowtie2 and BWA align sequencing reads to reference genomes, while MUSCLE and MAFFT perform multiple sequence alignments for evolutionary analysis.
File format conversion
Bioinformatics involves many file formats, and converting between them is a routine task:
- seqtk: Converts between FASTA and FASTQ formats
- samtools: Converts between SAM and BAM formats, plus sorting and indexing alignments
- bedtools: Manipulates genomic interval files (BED, GFF, VCF)
- bcftools: Handles variant call format (VCF) files
- awk and sed: Often used for custom format conversions when no dedicated tool exists
High-performance computing
Large-scale bioinformatics analyses (whole-genome sequencing, metagenomics, large BLAST searches) require more compute power than a laptop provides. HPC clusters, which are almost universally Unix-based, fill this role.
Job scheduling systems
HPC clusters use job schedulers to manage shared resources among many users:
- Slurm: The most common scheduler in academic environments
- PBS (Portable Batch System) and SGE (Sun Grid Engine) are also widely used
- You submit jobs via scripts that specify resource requirements (CPUs, memory, time) and execution commands
- Queue systems manage job priorities and resource allocation
- Array jobs let you run the same script on many different inputs in parallel
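A sketch of what such a submission script might look like under Slurm (resource values, tool, and file names are all illustrative; #SBATCH lines are directives read by the scheduler, not executed by the shell):

```shell
#!/bin/bash
#SBATCH --job-name=align_demo      # job name shown in the queue
#SBATCH --cpus-per-task=8          # CPUs requested per task
#SBATCH --mem=16G                  # memory requested
#SBATCH --time=02:00:00            # wall-clock limit
#SBATCH --array=1-10               # array job: 10 tasks run the same script

# Each array task selects its own input by index (hypothetical files)
bwa mem -t 8 ref.fa "sample_${SLURM_ARRAY_TASK_ID}.fastq" \
    > "aligned_${SLURM_ARRAY_TASK_ID}.sam"
```

You would submit this with sbatch and the scheduler runs the ten tasks as resources become available.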
Parallel processing
- MPI (Message Passing Interface): Distributed memory parallelism across multiple nodes
- OpenMP: Shared memory parallelism within a single node
- GNU Parallel: A simple tool for parallelizing command-line operations
- Many bioinformatics tools (BLAST+, BWA) have built-in multi-threading options via flags like -num_threads or -t
- Workflow managers like Snakemake and Nextflow orchestrate complex pipelines with automatic parallelization and dependency tracking
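GNU Parallel isn't always installed, but the same fan-out pattern works with the standard xargs -P flag (the files here are empty stand-ins; gzip is just a cheap per-file command for the demo):

```shell
# Compress four illustrative files, running up to two gzip processes at once
touch s1.fq s2.fq s3.fq s4.fq
printf '%s\n' *.fq | xargs -P 2 -n 1 gzip
ls *.fq.gz      # all four files are now compressed
```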
Data management
Genomic datasets can be enormous (a single whole-genome sequencing run can produce hundreds of gigabytes). Unix provides tools for efficient storage, transfer, and organization of this data.
Compression techniques
- gzip: The most common compression for bioinformatics files (.gz extension). Fast and widely supported.
- bzip2: Higher compression ratios than gzip but slower to compress/decompress
- xz: Even higher compression, at the cost of more CPU usage
- CRAM: A specialized format for compressed alignment data that can be much smaller than BAM
- Compression-aware tools like zcat and zgrep let you work with compressed files directly, without manually decompressing first
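A quick sketch of working with a compressed FASTQ file without ever decompressing it on disk (the read is made up):

```shell
printf '@read1\nATGC\n+\nIIII\n' > reads.fastq
gzip reads.fastq                      # produces reads.fastq.gz
zcat reads.fastq.gz | head -n 2       # preview the first record's header and sequence
zgrep -c '^@read' reads.fastq.gz      # search inside the compressed file directly
```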
Archiving and backup
- tar creates and extracts archive files (.tar, .tar.gz). Commonly used to bundle directories for transfer.
- rsync efficiently synchronizes files between systems, transferring only what's changed
- cron jobs automate regular backups and maintenance tasks on a schedule
- RAID configurations provide hardware-level redundancy for critical data storage
- Off-site backups (cloud storage, tape archives) protect against catastrophic data loss
Troubleshooting and debugging
Things will break. Effective troubleshooting saves hours of frustration and is a skill worth developing early.
Error messages interpretation
- Error messages appear on stderr, which is separate from normal output. This means you can redirect errors to a log file with 2>.
- Common error types: syntax errors (typos in commands), runtime errors (file not found, permission denied), and logical errors (command runs but produces wrong results)
- Many tools have verbose (-v) or debug (-d) flags that produce more detailed output
- man pages (man grep, man awk) are the built-in documentation. Stack Overflow and Biostars are valuable for bioinformatics-specific issues.
- Read error messages from top to bottom. The first error is usually the root cause; later errors are often cascading failures.
Logging and monitoring
- tee splits output to both a file and the screen simultaneously, so you can watch progress while saving a log: command | tee output.log
- nohup lets processes continue running after you disconnect from the terminal, which is critical for long-running analyses
- top and htop monitor CPU and memory usage in real time
- ps shows running processes and their status
- logrotate manages log file growth so they don't fill up your disk
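The tee pattern in miniature (the message and log name are illustrative; in practice the echo would be a long-running analysis command):

```shell
# Save a step's output to a log while still seeing it on screen
echo "alignment finished" | tee pipeline.log
cat pipeline.log     # the same line was captured in the file
```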