Project Organization
A well-organized project is the difference between finishing a paper in a week and spending a week just finding your files. This guide gives you a concrete, opinionated system for structuring bioinformatics projects on RCAC clusters. Follow it from day one and you will save yourself hours of confusion later.
Why this matters¶
Three problems kill bioinformatics projects:
- Reproducibility: You re-run an analysis six months later and get different results because you cannot remember which parameters, software versions, or reference genome you used.
- Collaboration: A labmate asks to see your variant calls. You send them a path to a directory with 400 files and no explanation.
- Quota and storage: Your pipeline crashes at 3 AM because scratch filled up with intermediate BAMs nobody cleaned.
A standard directory layout, consistent naming, and a few habits solve all three. The rest of this guide shows you exactly how.
Recommended directory structure¶
Use this layout for every new project. The numbered prefixes force a logical reading order and sort correctly in ls.
| Directory | Purpose |
|---|---|
| `00_meta/` | Project documentation: README, sample manifest, methods notes, software versions. The first place anyone should look. |
| `01_data/raw/` | Untouched input files from sequencing or collaborators. Treat as read-only. |
| `01_data/processed/` | Cleaned or reformatted data (trimmed reads, filtered VCFs used as inputs to later steps). |
| `02_scripts/` | All analysis code. Numbered in execution order so the pipeline is self-documenting. |
| `03_analysis/` | Working outputs from each pipeline step. Use lettered subdirectories to separate stages. Create a new version folder (e.g., `c_alignment.v2/`) when re-running with different parameters rather than overwriting. |
| `04_results/` | Publication-ready figures, summary tables, and reports. Only final, polished outputs go here. |
| `99_logs/` | SLURM stdout/stderr logs. Essential for debugging and documenting resource usage. |
Create this structure in one command:
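A single `mkdir -p` with bash brace expansion builds the whole tree; the project name below is an example:

```shell
# One command creates every directory in the layout (bash brace expansion)
mkdir -p 20250505_AirwayStudy_RNAseq/{00_meta,01_data/raw,01_data/processed,02_scripts,03_analysis,04_results,99_logs}
```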
Tip
Name the top-level directory as YYYYMMDD_ProjectID_AnalysisType. The ISO date prefix sorts projects chronologically in ls and removes any ambiguity about when work started.
RCAC storage tiers and what goes where¶
RCAC provides several storage tiers. Using them correctly prevents quota crashes and data loss.
| Location | Path | Capacity | Persistence | Best for |
|---|---|---|---|---|
| Home | `$HOME` | ~25 GB | Backed up, permanent | Scripts, configs, small logs, `.bashrc` |
| Scratch | `$RCAC_SCRATCH` | ~100 TB (shared) | Purged after inactivity | Active analysis, intermediates, temp files |
| Depot | `/depot/<group>/` | Group allocation | Permanent | Raw data, final results, shared references, archived projects |
Practical rules¶
- Raw sequencing data goes on Depot (or Scratch with a backup on Depot). Never keep the only copy on Scratch.
- Intermediate files (BAMs, unsorted SAMs, temp indexes) go on Scratch. They are regenerable; treat them as disposable.
- Scripts, configs, and READMEs go in `$HOME` or Depot. They are small and irreplaceable.
- Final results (figures, summary tables, reports) go on Depot.
Symlink strategy¶
Keep a unified project tree on Scratch for active work, but symlink to Depot for raw data and long-term storage:
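A sketch, assuming your raw data already lives at a Depot path like `/depot/mylab/data/airway_2025/raw` (hypothetical; substitute your group's path):

```shell
# Work from the project tree on Scratch; the raw data itself stays on Depot
mkdir -p 20250505_AirwayStudy_RNAseq/01_data
cd 20250505_AirwayStudy_RNAseq
ln -s /depot/mylab/data/airway_2025/raw 01_data/raw   # hypothetical Depot path
ls -l 01_data/                                        # 'raw' shows as a symlink
```

Reads through the symlink hit Depot, and deleting the link never touches the underlying data.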
Warning
Scratch is purged after a period of inactivity. Never store your only copy of anything on Scratch. Run myquota regularly to monitor usage across all filesystems.
File and directory naming conventions¶
Bad file names are the most common source of silent errors in bioinformatics pipelines. Follow these rules without exception.
Do this¶
- Use lowercase with underscores or hyphens: `sample_A_R1.fastq.gz`
- Use ISO 8601 dates (`YYYYMMDD` or `YYYY-MM-DD`): `20250505_results.tsv`
- Include the sample ID and data type in the filename: `sampleA_aligned.bam`
- Number scripts in execution order: `01_qc-fastqc.sh`, `02_trim-fastp.sh`
- Name SLURM logs with the script name and job ID: `01_qc_%j.out` (SLURM expands `%j` to the job ID)
Do not do this¶
| Bad | Why | Better |
|---|---|---|
| `final_FINAL_v2 (copy).bam` | Ambiguous, spaces, no versioning scheme | `sampleA_aligned.v2.bam` |
| `data 2025.fastq` | Spaces break shell scripts | `data_20250505.fastq` |
| `results.txt` | What results? From which step? | `deseq2_differential_expression.tsv` |
| `Bob's analysis/` | Apostrophes and spaces cause quoting nightmares | `bobs_analysis/` |
| `03/15/2025_run` | Ambiguous date format (US vs EU) | `20250315_run` |
Note
If you work with sample IDs from a sequencing core, keep their original identifiers in a sample_manifest.tsv and use short, consistent aliases (e.g., sampleA, sampleB) in filenames. Map between the two in your manifest.
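A minimal `sample_manifest.tsv` sketch (column names, IDs, and conditions are illustrative):

```
sample_id	core_id	condition	fastq_r1	fastq_r2
sampleA	SC2025-0114-03	treated	sampleA_R1.fastq.gz	sampleA_R2.fastq.gz
sampleB	SC2025-0114-04	control	sampleB_R1.fastq.gz	sampleB_R2.fastq.gz
```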
README and metadata practices¶
Every project gets a README.md in 00_meta/. Write it on day one and update it as the project evolves. Here is a template you can copy directly:
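A minimal starting point covering the essentials (project names, people, and pipeline steps below are placeholders):

```markdown
# 20250505_AirwayStudy_RNAseq

- **Date started:** 2025-05-05
- **PI / contact:** Jane Doe (jdoe@example.edu)
- **Goal:** One sentence on the scientific question this project answers.

## Samples
6 RNA-seq libraries (3 treated, 3 control); see `sample_manifest.tsv` for the
full mapping of core IDs to aliases, conditions, and filenames.

## Pipeline overview
1. QC: `02_scripts/01_qc-fastqc.sh`
2. Trimming: `02_scripts/02_trim-fastp.sh`
3. Alignment, quantification, differential expression (numbered scripts)

## Software versions
Recorded in `software_versions.txt` and in each SLURM log under `99_logs/`.
```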
Recording software versions¶
Capture exact versions at the start of every project. Future you will thank present you.
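One way to snapshot versions into `00_meta/` (the tool list is an example; the `command -v` guard simply skips tools not on your PATH):

```shell
# Snapshot tool versions at project start (sketch; edit the tool list)
mkdir -p 00_meta
{
    echo "Recorded: $(date +%Y-%m-%d)"
    for tool in samtools bcftools fastqc fastp; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s: ' "$tool"
            "$tool" --version 2>&1 | head -n 1
        else
            echo "$tool: not found on this node"
        fi
    done
} > 00_meta/software_versions.txt
cat 00_meta/software_versions.txt
```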
Tip
Add module list and version-printing commands (e.g., samtools --version) to the top of every SLURM script. The output goes straight into your log files, giving you an automatic record per run.
Managing large intermediate files¶
Bioinformatics pipelines generate enormous intermediate files. A single whole-genome alignment can produce a 50-100 GB unsorted BAM before you even start variant calling. Managing these files proactively prevents quota disasters.
What to keep vs. delete¶
| File type | Keep? | Reason |
|---|---|---|
| Raw FASTQ | Always | Irreplaceable input |
| Trimmed FASTQ | Delete after alignment | Regenerable from raw in minutes |
| Unsorted SAM/BAM | Delete immediately | Sort and index, then remove the unsorted version |
| Sorted, indexed BAM | Keep during project | Needed for downstream analysis |
| VCF/GFF3 (final) | Always | Primary results |
| Index files (`.bai`, `.fai`, `.idx`) | Regenerate as needed | Trivial to recreate |
| MultiQC reports | Always | Small, high-value summaries |
Estimate disk usage before running a pipeline¶
A rough rule: expect 3-5x your raw data size in intermediates during active analysis. For 100 GB of FASTQ files, budget 300-500 GB of scratch space.
Auto-clean intermediates in SLURM scripts¶
Add cleanup steps at the end of your job scripts so temporary files do not accumulate:
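A sketch of the pattern with hypothetical file names: the sorted BAM must exist before its unsorted parent is deleted.

```shell
# End-of-job cleanup: remove the unsorted BAM only after verifying that its
# sorted replacement exists (sample name and paths are placeholders)
SAMPLE=sampleA
SORTED=03_analysis/c_alignment/${SAMPLE}.sorted.bam
UNSORTED=03_analysis/c_alignment/${SAMPLE}.unsorted.bam

if [ -f "$SORTED" ]; then
    rm -f "$UNSORTED"
    echo "Removed intermediate: $UNSORTED"
else
    echo "Sorted BAM missing for ${SAMPLE}; keeping intermediates for debugging" >&2
fi
```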
Warning
Always use a conditional check (if [ -f ... ]) before deleting intermediates in automated scripts. If the upstream step failed silently, an unconditional rm destroys your only evidence of what went wrong.
Find the biggest space consumers¶
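GNU `du` piped through `sort -h` gives a quick ranking from the project root:

```shell
# Largest first: directory sizes one level deep (GNU coreutils)
du -h --max-depth=1 . | sort -hr | head -n 10
```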
Archiving completed projects¶
When a project is published or shelved, clean it up and move it to Depot to free Scratch space.
What to keep¶
- Raw data (if not already on Depot permanently)
- Final results (`04_results/`)
- Scripts and metadata (`00_meta/`, `02_scripts/`)
- Key analysis outputs (final BAMs, VCFs, count matrices)
What to discard before archiving¶
- Unsorted/intermediate BAMs and SAMs
- Trimmed FASTQ files (regenerable from raw)
- Index files (`.bai`, `.fai`, `.tbi`): trivial to recreate
- Temporary directories, `.snakemake/`, `work/` (Nextflow)
Create a manifest and tarball¶
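A sketch, assuming your cleaned project directory sits in the current Scratch folder; the Depot destination is a placeholder:

```shell
PROJECT=20250505_AirwayStudy_RNAseq
mkdir -p "$PROJECT/00_meta"            # (your real, cleaned project already exists)

# Manifest first, kept OUTSIDE the tarball
find "$PROJECT" -type f | sort > "${PROJECT}_manifest.txt"

# Then the archive itself
tar -czf "${PROJECT}.tar.gz" "$PROJECT"

# Finally, move both to Depot (placeholder path):
# mv "${PROJECT}.tar.gz" "${PROJECT}_manifest.txt" /depot/<group>/archives/
```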
Tip
Keep a copy of archive_manifest.txt outside the tarball at /depot/<group>/archives/20250505_AirwayStudy_RNAseq_manifest.txt. This lets you check what is inside an archive with grep without extracting the whole thing.
Version control for scripts and configs¶
Use Git for your scripts and documentation. Do not use Git for data files.
Initialize a repository for your project¶
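A self-contained sketch (the directory and identity values are placeholders; skip the `git config` lines if you already set a global identity on the cluster):

```shell
mkdir -p 20250505_AirwayStudy_RNAseq/{00_meta,02_scripts}  # stand-in for your project root
cd 20250505_AirwayStudy_RNAseq
git init
git config user.email "you@purdue.edu"   # per-repo identity if none is set globally
git config user.name  "Your Name"
echo "# Airway RNA-seq study" > 00_meta/README.md
git add 00_meta/ 02_scripts/
git commit -m "Initial project structure: metadata and scripts"
```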
A .gitignore for bioinformatics¶
Place this in your project root:
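A starting point built from the file types this guide flags as too large to track; extend it for your own pipeline:

```
# Sequence data and alignments
*.fastq
*.fastq.gz
*.fq.gz
*.sam
*.bam
*.bai
*.cram

# Variants and references
*.vcf
*.vcf.gz
*.tbi
*.fa
*.fasta
*.fai
*.gtf
*.gff3

# Logs and workflow caches
99_logs/
.snakemake/
work/
```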
What to commit¶
- Always: SLURM scripts, R/Python analysis scripts, config files, READMEs, sample manifests, `environment.yml`
- Never: FASTQ, BAM, VCF, reference genomes, large CSVs, anything over ~50 MB
Typical daily workflow¶
You do not need to be a Git expert. These commands handle 95% of bioinformatics version control:
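A self-contained sketch of one cycle; the demo repo, file name, and commit message are stand-ins for your project (in practice you run only the last four commands, from inside your repo):

```shell
# Setup for the sketch only
git init -q daily_demo && cd daily_demo
git config user.email "you@purdue.edu" && git config user.name "Your Name"
mkdir -p 02_scripts && echo '#!/bin/bash' > 02_scripts/02_trim-fastp.sh

git status                                # what changed since the last commit?
git add 02_scripts/02_trim-fastp.sh       # stage exactly what you edited
git commit -m "Add fastp trimming step with --qualified_quality_phred 20"
git log --oneline -n 3                    # recent history at a glance
```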
Commit after every meaningful change: finishing a script, fixing a bug, changing parameters. Small, frequent commits are better than one giant commit at the end of the week.
Push to a remote repository¶
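A self-contained sketch in which a local bare repository stands in for GitHub or GitLab; swap the `origin` URL for your real remote:

```shell
# A bare repo plays the role of the hosted remote in this sketch
git init -q --bare remote_demo.git
git init -q push_demo && cd push_demo
git config user.email "you@purdue.edu" && git config user.name "Your Name"
echo "demo" > README.md
git add README.md && git commit -q -m "First commit"

git remote add origin ../remote_demo.git   # e.g. git@github.com:you/project.git
git branch -M main
git push -u origin main
```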
After this initial setup, git push is all you need to sync future commits.
Best practices for Git in bioinformatics¶
Use this as a quick reference:
- Symlink results from previous steps instead of copying them into new directories. This avoids duplicating large files and keeps your repo clean.
- Do not commit large files (FASTQ, BAM, SAM, VCF, reference genomes). They bloat the repository permanently; even deleting them later does not reclaim space in Git history.
- Do not commit too many files at once (e.g., hundreds of log files or analysis outputs). Large numbers of tracked files slow down every `git status`, `git add`, and `git push`.
- Use `.gitignore` from the start. Add it before your first commit so data files and logs never enter the history.
- Write meaningful commit messages. `"Updated script"` is useless six months later. `"Added fastp trimming step with --qualified_quality_phred 20"` tells you exactly what changed.
- Commit only `00_meta/` and `02_scripts/` by default. These are small, text-based, and irreplaceable. Everything else is either too large or regenerable.
- Keep the repository small. A bioinformatics project repo should be well under 100 MB. If `git` feels slow, run `git count-objects -vH` to check the repo size.
- One project, one repository. Do not put multiple unrelated projects in a single repo, and do not scatter one project's scripts across multiple repos.
Note
If you have a small binary file that must be tracked (e.g., a 5 MB curated BED file), Git can handle it. But anything larger than ~50 MB belongs on Depot; document its path in your README instead of committing it.
Quick-start checklist¶
Copy this checklist and run through it at the start of every new project.
- [ ] **Create the project directory on Scratch**
- [ ] **Symlink raw data from Depot** (do not copy)
- [ ] **Write the README.** Create `00_meta/README.md` with: project name, date, PI, goal, sample summary, and pipeline overview.
- [ ] **Create a sample manifest.** Create `00_meta/sample_manifest.tsv` mapping sample IDs to filenames, conditions, and any metadata.
- [ ] **Initialize Git**
- [ ] **Record software versions**
- [ ] **Write your first script in `02_scripts/`.** Name it `01_qc-fastqc.sh` and point SLURM logs to `99_logs/`.
- [ ] **Check quota before starting**
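A minimal `01_qc-fastqc.sh` sketch for that first script. The module name, resources, and paths are assumptions for your cluster, and the `command -v` guards only let the script dry-run off-cluster. One gotcha worth knowing: SLURM does not create the `--output` directory for you, so make `99_logs/` before the first `sbatch`.

```shell
#!/bin/bash
#SBATCH --job-name=01_qc
#SBATCH --output=99_logs/01_qc_%j.out    # create 99_logs/ BEFORE submitting
#SBATCH --error=99_logs/01_qc_%j.err
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

mkdir -p 03_analysis/a_qc

# Record the software environment straight into the log
if command -v module >/dev/null 2>&1; then
    module load fastqc       # module name may differ on your cluster
    module list
fi

if command -v fastqc >/dev/null 2>&1; then
    fastqc --version
    fastqc -t 4 -o 03_analysis/a_qc 01_data/raw/*.fastq.gz
else
    echo "fastqc not found; run on the cluster after 'module load'" >&2
fi
```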
That is the entire setup. It takes five minutes at the start of a project and saves days of confusion later. The key principle is simple: keep your raw data safe, keep your scripts under version control, and keep your results organized, so that anyone, including future you, can understand and reproduce your work.