Running Bioinformatics on RCAC
Prerequisites
- RCAC cluster account (apply here)
- Basic command-line familiarity (cd, ls, mkdir, nano/vim)
- Access to Negishi, Gautschi, or Bell
What you will learn
- Find and load bioinformatics software via biocontainers
- Understand how containerized wrappers work
- Create and manage Conda environments
- Write and submit SLURM batch jobs
- Debug common failures
This guide covers the practical skills you need to run bioinformatics software on RCAC clusters. RCAC deploys bioinformatics tools as BioContainers (pre-built Apptainer containers) accessed through the Lmod module system. For tools not in the RCAC collection, you can use Conda environments or pull your own containers. Most production work runs as batch jobs through the SLURM scheduler.
By the end of this guide you will be able to find any bioinformatics tool on RCAC, run it correctly, and submit efficient batch jobs.
Getting a Terminal
You need a terminal session on the cluster before running any commands. There are two options:
Option 1: Open OnDemand (browser)

- Go to gateway.negishi.rcac.purdue.edu and log in with your Purdue credentials. For other clusters, replace `negishi` with the cluster name (e.g., gateway.gautschi.rcac.purdue.edu, gateway.bell.rcac.purdue.edu, gateway.gilbreth.rcac.purdue.edu).
- Click Clusters in the top menu, then select the cluster shell (e.g., Negishi Shell Access).
- A terminal opens in your browser. You are on a login node.

Option 2: SSH

- Connect from your own terminal, replacing <boilerid> with your Purdue career account username. On Gautschi, use gautschi.rcac.purdue.edu.
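If you prefer a local terminal, the SSH route follows the standard RCAC hostname pattern (swap the hostname for your cluster):

```shell
# Replace <boilerid> with your Purdue career account username
ssh <boilerid>@negishi.rcac.purdue.edu
```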
Warning
Login nodes are shared. Do not run computationally intensive programs on login nodes.
Use sinteractive for quick tests or sbatch for real work.
Finding Software with Modules
RCAC uses the Lmod module system to manage software. Bioinformatics tools are deployed as pre-built BioContainers (Apptainer containers) and accessed through the biocontainers module.
Searching for a tool
First, load the biocontainers module to make bioinformatics tools visible:
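A sketch, with `samtools` standing in for whatever tool you need:

```shell
module load biocontainers
module spider samtools
```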
module spider searches all modules, including those not yet visible. It shows available versions and any prerequisite modules.
To get loading instructions for a specific version:
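Pass the name and version to `module spider` (the version here is illustrative; use one that the previous search listed):

```shell
module spider samtools/1.17
```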
To list all available biocontainer modules:
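With `biocontainers` loaded, `module avail` lists everything it provides:

```shell
module avail
```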
Loading a module
The biocontainers module unlocks all bioinformatics software. You must load it before any tool module becomes visible.
Tip
When you load biocontainers, you will see a message pointing to the user guides. This is a great resource for tool-specific documentation and examples.
Use `module --force purge` (not just `module purge`) to remove sticky modules like `xalt`, which uses a newer glibc and conflicts with containerized tools. A plain `module purge` leaves those sticky modules loaded, so always start with `module --force purge`.
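A sketch of the difference (tool and version are illustrative):

```shell
module purge            # sticky modules such as xalt survive a plain purge
module list             # xalt still appears here and can break containerized tools

module --force purge    # removes sticky modules too
module load biocontainers
module load samtools/1.17
```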
After loading, run the tool as usual -- the output will appear as if the tool is installed natively.
Behind the scenes, RCAC creates shell functions that wrap each tool in an apptainer/singularity container call. When you type bwa, the function runs singularity run <container.sif> bwa for you.
Understanding the wrapper
Because the tools are containerized, which and type will show the shell function that wraps the container call, not the actual executable:
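For example (the exact function body varies by tool and cluster):

```shell
type samtools    # prints a shell function definition, not a file path
which samtools   # likewise resolves to the wrapper function
```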
For most use cases this does not matter -- the function handles everything transparently. However, if a pipeline or workflow checks the executable path (e.g., which bwa to verify the installation), it will get the function definition instead of a file path. In that case, you may need to either:
- Write a custom wrapper script that satisfies the pipeline's path check
- Contact rcac-help@purdue.edu for assistance with pipeline-specific setups
The $BIOC_IMAGE_DIR environment variable
Loading the biocontainers module sets the $BIOC_IMAGE_DIR environment variable, which points to the directory containing all container images.
You can use this to run containers directly with singularity run or apptainer exec when you need more control (e.g., custom bind mounts, GPU flags, or piping between containerized tools):
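A sketch; the image filename and input are illustrative, so list the directory to find exact names:

```shell
ls $BIOC_IMAGE_DIR                       # see which images are available
apptainer exec $BIOC_IMAGE_DIR/samtools.sif samtools flagstat sample.bam
```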
Verifying the installation
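After loading a tool module, a quick version check confirms the wrapper works (using samtools as an example):

```shell
samtools --version
```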
Resetting your environment
Use module --force purge at the top of every SLURM script and whenever you hit a module conflict. The --force flag is important because it also removes sticky modules (like xalt) that a plain module purge would leave behind.
Handling module conflicts
If you try to load two modules built with different compiler toolchains, Lmod will refuse with an error. The fix:
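For example (tool names and versions are illustrative):

```shell
module --force purge
module load biocontainers samtools/1.17 bwa/0.7.17
```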
Loading both in a single command lets Lmod resolve the dependency tree.
Tip
Always specify the version in every module load command.
module load biocontainers samtools may give you a different version tomorrow, and your collaborator may get yet another version.
Explicit versions are essential for reproducibility.
Pulling Custom Containers
All bioinformatics modules on RCAC are already containerized via BioContainers (see Finding Software with Modules above). However, if a tool is not in the RCAC collection, you can pull your own container.
When to pull a custom container
- The tool or version is not available via `module spider` after loading `biocontainers`
- You need a specific build variant or tag not deployed by RCAC
- You are developing or testing a custom pipeline image
Pulling from a registry
If a tool is not in the biocontainers collection, pull it from a container registry:
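For example, pulling with apptainer; the image name and tag here are illustrative, so check the registry for the exact tag you need:

```shell
apptainer pull docker://quay.io/biocontainers/seqkit:2.5.1--h9ee0642_0
```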
This creates a .sif file in the current directory. Run commands inside it with apptainer exec:
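For example (the `.sif` filename and tool are illustrative):

```shell
apptainer exec mytool_1.0.sif mytool --help
```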
Bind paths
RCAC auto-binds /home, /scratch, /depot, and /tmp into containers. For data in non-standard locations, bind manually:
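For example (paths, image, and tool are illustrative):

```shell
apptainer exec --bind /mylab/data:/mylab/data mytool_1.0.sif \
    mytool --input /mylab/data/sample.fastq.gz
```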
Conda Environments
Conda is useful for niche Python or R packages and tools with complex dependency trees that are not available as modules or containers.
Creating an environment
Danger
Never install packages into the base environment.
Always create a named environment with conda create -n <name>.
Polluting the base environment causes hard-to-debug conflicts.
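A sketch; the environment name and version pin are illustrative, the bioconda/conda-forge channel pair is the usual pattern:

```shell
conda create -n align-env -c bioconda -c conda-forge samtools=1.17
conda activate align-env
```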
Redirecting Conda storage
Conda environments are large (often 2--10 GB). Your Home directory is only ~25 GB.
Redirect Conda storage to Scratch by creating a `.condarc` file and the directories it points to:
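A sketch of both steps, using `${RCAC_SCRATCH}` so the same file works on any cluster (the `.conda` directory name is a convention, not a requirement):

```shell
# Write ~/.condarc pointing environments and the package cache at Scratch
cat > ~/.condarc <<EOF
envs_dirs:
  - ${RCAC_SCRATCH}/.conda/envs
pkgs_dirs:
  - ${RCAC_SCRATCH}/.conda/pkgs
EOF

mkdir -p "${RCAC_SCRATCH}/.conda/envs" "${RCAC_SCRATCH}/.conda/pkgs"
```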
Warning
Scratch is purged after 60 days of inactivity. If your Conda environment sits untouched on Scratch, it will be deleted. For long-lived environments, consider using Depot storage instead.
How Do I Find and Run Software X?
Use this decision table to pick the right method:
| Step | Action | Command |
|---|---|---|
| 1 | Load biocontainers | module --force purge && module load biocontainers |
| 2 | Search for the tool | module spider <toolname> |
| 3 | If found | module load biocontainers <tool>/<version> |
| 4 | If not, search Conda | conda search -c bioconda <toolname> |
| 5 | If found in Conda | conda create -n <env> -c bioconda -c conda-forge <tool>=<ver> |
| 6 | If not found anywhere | Pull a Docker/Apptainer container or build from source |
Comparison: Biocontainers (Modules) vs Conda vs Custom Container
| | Biocontainers (Modules) | Conda | Custom Container |
|---|---|---|---|
| Maintained by | RCAC | You | You |
| Install effort | None | Medium | High |
| Reproducibility | Excellent (immutable image) | Fragile (solver can change) | Excellent |
| Storage cost | None | High (2--10 GB per env) | Medium (0.5--2 GB per image) |
| Speed | Near-native | Native | Near-native |
| Updates | RCAC manages | You manage | You manage |
| Best for | Most bioinformatics tools | Niche packages, R/Python envs | Full control, custom builds |
Submitting SLURM Jobs
Login nodes are for editing files and submitting jobs. All computation should happen on compute nodes through SLURM.
Interactive sessions
For quick testing, request an interactive session:
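For example (the account name is a placeholder; check yours with `mybalance`):

```shell
sinteractive -A <account-name> -n 4 --time=1:00:00
```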
This gives you a shell on a compute node where you can load modules and test commands. Type exit when done.
Anatomy of a SLURM script
A SLURM batch script has three parts:
- Shebang: `#!/bin/bash`
- `#SBATCH` directives: resource requests parsed by SLURM (not executed by bash)
- Your commands: module loads, tool invocations, file operations
Example: BWA-MEM2 alignment
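A sketch of a complete batch script; the account, partition, resource values, and file names are all placeholders to adapt:

```shell
#!/bin/bash
#SBATCH --account=<account-name>
#SBATCH --partition=<partition>
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
#SBATCH --time=4:00:00
#SBATCH --job-name=bwa_align
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

module --force purge
module load biocontainers
module load bwa-mem2 samtools    # pin versions in real scripts

cd "${RCAC_SCRATCH}/myproject"

# Align paired-end reads, sort, and index the result
bwa-mem2 mem -t "${SLURM_CPUS_PER_TASK}" ref.fa sample_R1.fastq.gz sample_R2.fastq.gz \
    | samtools sort -@ "${SLURM_CPUS_PER_TASK}" -o sample.sorted.bam
samtools index sample.sorted.bam
```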
Submitting and monitoring
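The typical submit-and-watch cycle (script name and job ID are placeholders):

```shell
sbatch align.sh          # submit; prints the job ID
squeue -u $USER          # your pending and running jobs
scancel <jobid>          # cancel a job if needed
sacct -j <jobid>         # accounting once it finishes
```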
SLURM Quick Reference

| Directive | Description | Typical value |
|---|---|---|
| `--account` | Allocation/account name | Check with `mybalance` |
| `--partition` | Queue/partition | Cluster-specific |
| `--nodes` | Number of nodes | 1 (almost always for bioinformatics) |
| `--ntasks` | Number of processes | 1 for single tools |
| `--cpus-per-task` | Threads per process | Match tool's `-t` flag (4--32) |
| `--time` | Wall clock limit | Start generous, tighten after `sacct` |
| `--mem` | Total memory | Check tool docs; start with 16--32G |
| `--job-name` | Name shown in `squeue` | Short, descriptive |
| `--output` | stdout file | `%x_%j.out` (name + job ID) |
| `--error` | stderr file | `%x_%j.err` |
| `--array` | Array job indices | `0-N` for batch processing |
Array jobs
When running the same tool on multiple input files, use array jobs instead of submitting separate scripts.
Each array task gets a unique SLURM_ARRAY_TASK_ID (0, 1, 2, ...) that you use to select the input file.
Create the file list first:
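A sketch of the pattern. A tiny hand-made list stands in here for the `ls` output you would generate on the cluster; the indexing line is exactly what goes in the array script:

```shell
# On the cluster: ls ${RCAC_SCRATCH}/myproject/*.fastq.gz > samples.txt
# For illustration, a tiny hand-made list:
printf 'sampleA.fastq.gz\nsampleB.fastq.gz\nsampleC.fastq.gz\n' > samples.txt

# In the job script (with #SBATCH --array=0-2):
# SLURM_ARRAY_TASK_ID is 0-based, sed line numbers are 1-based
SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID:-0}    # set by SLURM inside the job
SAMPLE=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" samples.txt)
echo "task ${SLURM_ARRAY_TASK_ID} processes ${SAMPLE}"
```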
Using Conda in SLURM
Conda requires shell initialization inside batch scripts. Without it, conda activate will fail:
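The pattern inside the batch script (the environment name is illustrative):

```shell
eval "$(conda shell.bash hook)"   # initialize Conda for this non-interactive shell
conda activate align-env
```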
The key line is eval "$(conda shell.bash hook)" -- this initializes Conda for the non-interactive bash shell that SLURM uses.
Resource estimation
After a job completes, check what it actually used:
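For example (the job ID is a placeholder):

```shell
sacct -j <jobid> --format=JobID,State,MaxRSS,Elapsed
```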
- MaxRSS: peak memory. Use this to right-size `--mem` next time.
- Elapsed: actual wall time. Use this to right-size `--time`.
Start generous, then tighten. Over-requesting wastes your allocation priority but under-requesting kills your job.
Common Pitfalls
"Command not found"
Problem: You run a tool and get bash: samtools: command not found.
Diagnosis: The module is not loaded, or you loaded biocontainers but forgot the tool module.
Fix:
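The usual sequence, with samtools as the example tool:

```shell
module --force purge
module load biocontainers
module load samtools      # pin a version for reproducibility
samtools --version        # confirm it now resolves
```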
If module spider cannot find the tool after loading biocontainers, it may not be installed on this cluster. Try Conda or pull a custom container.
SLURM job disappears with no output
Problem: Your job vanishes from squeue but produced no output files.
Diagnosis: Check the .err file and sacct:
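For example (job name and ID are placeholders):

```shell
cat <jobname>_<jobid>.err
sacct -j <jobid> --format=JobID,State,ExitCode,MaxRSS,Elapsed
```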
Common states:
| State | Meaning |
|---|---|
| `COMPLETED` | Finished successfully (exit code 0:0) |
| `FAILED` | Your script had an error |
| `OUT_OF_MEMORY` | Exceeded `--mem` request |
| `TIMEOUT` | Exceeded `--time` request |
| `CANCELLED` | Manually cancelled or preempted |
Out of memory (OOM)
Problem: sacct shows OUT_OF_MEMORY.
Fix: Increase --mem. Check MaxRSS of the failed job to see peak usage, then request 20--30% more.
Scratch data disappeared
Problem: Your files on /scratch are gone.
Diagnosis: Scratch is purged after 60 days of inactivity. There is no warning and no recovery.
Fix: Move important results to Home or Depot promptly. For active projects, periodic access resets the clock.
Module conflicts
Problem: Lmod has detected the following error: ... when loading modules.
Fix: Start fresh:
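For example (tools and versions are placeholders):

```shell
module --force purge
module load biocontainers <tool1>/<version> <tool2>/<version>
```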
Conda won't activate in SLURM
Problem: CommandNotFoundError: Your shell has not been properly configured...
Fix: Add shell initialization before conda activate:
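The two lines that matter (the environment name is illustrative):

```shell
eval "$(conda shell.bash hook)"
conda activate align-env
```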
Wrong cluster, wrong paths
Problem: Script fails because /scratch/negishi/ does not exist on Gautschi.
Fix: Use ${RCAC_SCRATCH} instead of hardcoding the cluster name:
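For example (the project directory is a placeholder):

```shell
# Portable: ${RCAC_SCRATCH} resolves to /scratch/<cluster>/<user> on every RCAC cluster
cd "${RCAC_SCRATCH}/myproject"
```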
This resolves to the correct path on any RCAC cluster.
Debugging Workflow
When a job fails, follow these steps:
- Check the exit code: `echo $?` (0 = success, non-zero = failure)
- Read the error log: `cat <jobname>_<jobid>.err`
- Check job accounting: `sacct -j <jobid> --format=JobID,State,ExitCode,MaxRSS,Elapsed`
- Reproduce interactively: `sinteractive -A <account-name> -n 4 --time=1:00:00`, load modules, run the failing command
- Search the error message: Google, Biostars, or the tool's GitHub Issues
- Ask for help: Email rcac-help@purdue.edu with your job ID and the error message
Conda Configuration Template
Copy this .condarc to your home directory to redirect Conda storage off of Home:
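A sketch of the template; the directory layout under `.conda` is a convention, and `<boilerid>` is your username:

```yaml
envs_dirs:
  - /scratch/negishi/<boilerid>/.conda/envs
pkgs_dirs:
  - /scratch/negishi/<boilerid>/.conda/pkgs
```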
Note
On Gautschi, replace /scratch/negishi/ with /scratch/gautschi/ or use ${RCAC_SCRATCH}/.conda/.
Resources
- RCAC Bioinformatics Tutorials: rcac-bioinformatics.github.io/guide/
- RCAC Knowledge Base: www.rcac.purdue.edu/knowledge
- Open OnDemand: gateway.negishi.rcac.purdue.edu (replace `negishi` with your cluster name)
- Discord: discord.gg/zEF2nzhXdC
- Email: rcac-help@purdue.edu (include "bioinformatics support" in the subject line)
What's Next
Session 6: QC for Genomics -- April 7, 2026, 11:00 AM -- 12:00 PM ET
Topics: FastQC interpretation, fastp trimming, MultiQC aggregation, quality control strategies for different sequencing platforms.