Juicer on Negishi cluster¶
Prerequisites
- Hi-C paired-end FASTQ files
- Know your reference genome
- Know the restriction enzyme used
- RCAC HPC account (Negishi or Bell)
Objective
Generate .hic contact maps using the Juicer Hi-C processing pipeline on the Negishi cluster.
Juicer is a pipeline for analyzing Hi-C data, including alignment, filtering, deduplication, and generation of .hic contact matrices. On Negishi, Juicer runs using a Singularity container with all required dependencies pre-installed (BWA, SAMtools, Java, etc.).
Warning
The Negishi cluster does not have GPUs, so HiCCUPS (loop calling) is configured to run on CPUs in this installation. Arrowhead and other CPU-based steps will still run normally (provided you supply motifs). There are better and faster alternatives, such as mustache, which you can run on the .hic files after Juicer finishes.
Reference Genomes¶
Pre-built reference genomes are available in:
Currently, hg19 is the only genome available (we will add more genomes upon request). The reference directory contains:
- Reference FASTA (genome.fa)
- BWA index files (.bwt, .pac, .ann, .amb and .sa)
- Chromosome sizes file (chrom.sizes)
- Restriction enzyme site positions (e.g., hg19_MboI.txt)
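Before pointing Juicer at a reference, it can help to confirm that the BWA index files listed above sit next to the FASTA. The following is a minimal sketch: check_bwa_index is a hypothetical helper (not part of Juicer), and it assumes the index files are named after the FASTA with the usual BWA suffixes (for example genome.fa.bwt).

```shell
# Sketch only: check_bwa_index is a hypothetical helper, not a Juicer command.
# It assumes index files are named <fasta>.<suffix>, as written by "bwa index".
check_bwa_index() {
    fasta="$1"
    missing=0
    for ext in bwt pac ann amb sa; do
        # -s is true only when the file exists and is not empty
        if [ ! -s "${fasta}.${ext}" ]; then
            echo "missing: ${fasta}.${ext}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (the path is a placeholder):
# check_bwa_index /path/to/references/genome.fa && echo "BWA index complete"
```

The function returns non-zero if any of the five index files is absent or empty, so it can guard a submission script.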
Note
If you need a different genome, please send a request to rcac-help and include the genome version, the restriction enzyme, and the source of the genome. Include "bioinformatics support: juicer reference genome" in the subject line.
How to run Juicer?¶
Juicer is deployed as a biocontainer on the Negishi cluster. To run Juicer, follow these steps.
Organize your data in a directory structure like this:
Your directory structure should look like this:
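A minimal sketch of setting up that layout, based on the requirement that the top-level directory must contain fastq/. The project name hic_project and the sample file names are placeholders. The _R1/_R2 naming follows upstream Juicer's default convention, so confirm it matches your files.

```shell
# Sketch only: "hic_project" and the sample file names are placeholders.
topdir="${TOPDIR:-$PWD/hic_project}"

# juicer.sh expects the reads under fastq/ inside the top-level directory;
# splits/ and aligned/ are created by the pipeline itself.
mkdir -p "${topdir}/fastq"

# Link (or copy) the paired-end reads into fastq/, for example:
# ln -s /path/to/sample_R1.fastq.gz "${topdir}/fastq/"
# ln -s /path/to/sample_R2.fastq.gz "${topdir}/fastq/"

ls -l "${topdir}"
```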
The Juicer pipeline works by creating a series of batch jobs and submitting them all at once using job dependencies. The main script is very lightweight and must be run on the login node.
Load the module
You can run Juicer using the juicer.sh script.
The arguments to the script are:
* -q: Queue name for alignments (e.g., testpbs)
* -Q: Walltime for alignments (e.g., 2:00:00)
* -l: Queue name for the rest of the pipeline (e.g., testpbs)
* -L: Walltime for the rest of the pipeline (e.g., 8:00:00)
* -A: Account name (e.g., testpbs)
The rest of the pipeline uses the following default arguments:
* -g: Genome ID (hg19)
* -z: Genome FASTA file (${JUICER_DIR}/references/Homo_sapiens_assembly19.fasta)
* -y: Restriction sites (${JUICER_DIR}/restriction_site/hg19_MboI.txt)
* -D: Juicer scripts (default: ${JUICER_DIR} or /depot/itap/datasets/juicer/2.0.1)
* -s: Restriction enzyme (MboI)
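Putting these flags together, a submission might look like the sketch below. It is illustrative rather than the exact command used on Negishi: submit_juicer and DRY_RUN are hypothetical names introduced here, myaccount is a placeholder account, and the standby queue and reference paths are the defaults named in this guide. It also assumes juicer.sh is already on your PATH after loading the module.

```shell
# Sketch only: submit_juicer and DRY_RUN are hypothetical names, and
# "myaccount" is a placeholder. Paths and defaults come from this guide.
JUICER_DIR="${JUICER_DIR:-/depot/itap/datasets/juicer/2.0.1}"

submit_juicer() {
    # $1 = top-level directory (must contain fastq/), $2 = SLURM account.
    # With DRY_RUN set, the command is printed instead of executed.
    ${DRY_RUN:+echo} juicer.sh \
        -d "$1" \
        -g hg19 \
        -s MboI \
        -z "${JUICER_DIR}/references/Homo_sapiens_assembly19.fasta" \
        -y "${JUICER_DIR}/restriction_site/hg19_MboI.txt" \
        -D "${JUICER_DIR}" \
        -q standby -Q 2:00:00 \
        -l standby -L 8:00:00 \
        -A "$2"
}

# Print the command first; drop DRY_RUN=1 to submit from the login node.
DRY_RUN=1 submit_juicer "$PWD" myaccount
```

Printing the command before submitting makes it easy to check paths and walltimes, since a single call queues the whole chain of dependent jobs.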
When this command is run, it will create a series of jobs in the specified queue.
What does the stdout look like for a successful run?
You can check the running jobs with the squeue command:
It looks something like this:
Juicer arguments¶
The full list of arguments for juicer.sh is:
Input/path arguments¶
| Option | Description |
|---|---|
| -g genomeID | Genome ID (e.g., hg19, mm10) defined internally or via -z |
| -d topDir | Top-level working directory. Must contain fastq/; creates splits/, aligned/ |
| -z reference-genome | Path to genome FASTA file; BWA index files must be in the same directory |
| -p chrom.sizes | Path to chrom.sizes file (can also use genome name like hg38) |
| -y restriction-site-file | File with positions of restriction sites (e.g., from generate_site_positions.py) |
| -D juicerDir | Path to Juicer scripts directory (default: /depot/itap/datasets/juicer/2.0.1) |
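If you supply your own genome with -z, you will usually also need a matching chrom.sizes file for -p. A chrom.sizes file is a two-column, tab-separated list of sequence names and lengths. The sketch below derives one directly from a FASTA with awk; fasta_to_chrom_sizes is a hypothetical helper name, and it assumes an uncompressed FASTA with Unix line endings.

```shell
# Sketch only: fasta_to_chrom_sizes is a hypothetical helper.
# Assumes an uncompressed FASTA with Unix line endings.
fasta_to_chrom_sizes() {
    awk '
        /^>/ {
            # New record: emit the previous sequence, then reset the counter.
            if (name != "") print name "\t" len
            name = substr($1, 2)   # header up to first whitespace, minus ">"
            len = 0
            next
        }
        { len += length($0) }      # add the bases on each sequence line
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Usage (paths are placeholders):
# fasta_to_chrom_sizes /path/to/genome.fa > chrom.sizes
```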
Cluster-specific options¶
| Option | Description |
|---|---|
| -q queue | SLURM queue for alignment jobs (default: standby) |
| -l long queue | SLURM queue for long jobs such as .hic creation (default: standby) |
| -Q queue time | Time limit for short jobs (e.g., -Q 4:00:00 for 4 hours) |
| -L long queue time | Time limit for long jobs (e.g., -L 168:00:00 for one week) |
| -A account | SLURM account name for job submission |
Experiment-specific options¶
| Option | Description |
|---|---|
| -s site | Restriction enzyme (e.g., MboI, HindIII) |
| -a about | Free-text experiment description (enclosed in single quotes) |
| -i sample | Sample name, added to SM: in read group |
| -k library | Library name, added to LB: in read group |
| -b ligation | Ligation junction sequence (used in counting) |
Performance options¶
| Option | Description |
|---|---|
| -t threads | Number of threads for BWA alignment |
| -T threadsHic | Number of threads for .hic file creation |
| -C chunk size | Number of lines per split file (default: 90,000,000; must be multiple of 4) |
| -w wobble | Wobble distance for deduplication (default: 4) |
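The multiple-of-4 rule for -C follows from the FASTQ format, where each read occupies four lines, so a split must not cut a record in half. A small sketch for checking a value before passing it to juicer.sh (valid_chunk_size is a hypothetical helper, not a Juicer command):

```shell
# Sketch only: valid_chunk_size is a hypothetical helper.
valid_chunk_size() {
    # Reject empty input, non-digits, and values with a leading zero.
    case "$1" in
        ''|0*|*[!0-9]*) return 1 ;;
    esac
    # A FASTQ record is 4 lines, so the line count must divide evenly by 4.
    [ $(( $1 % 4 )) -eq 0 ]
}

chunk=90000000   # the documented default
if valid_chunk_size "$chunk"; then
    echo "-C ${chunk} is valid: $(( chunk / 4 )) reads per split file"
else
    echo "-C ${chunk} is not a positive multiple of 4" >&2
fi
```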
Stage options¶
| Option | Description |
|---|---|
| -S stage | Start from a given stage: chimeric, merge, dedup, afterdedup, final, postproc, early |
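When a run fails partway, -S lets you resume without repeating earlier stages. The sketch below only validates a stage name against the list above and prints the flag to append; valid_stage is a hypothetical helper, and the choice of dedup is just an example.

```shell
# Sketch only: valid_stage is a hypothetical helper; the stage names are
# the ones listed for -S in this guide.
valid_stage() {
    case "$1" in
        chimeric|merge|dedup|afterdedup|final|postproc|early) return 0 ;;
        *) return 1 ;;
    esac
}

stage="dedup"   # example: resume at deduplication
if valid_stage "$stage"; then
    echo "append to the original juicer.sh command: -S ${stage}"
else
    echo "unknown stage: ${stage}" >&2
fi
```

Keep the other arguments identical to the original run so the resumed jobs find the existing splits/ and aligned/ directories.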
Boolean options¶
| Flag | Description |
|---|---|
| -j | Use only exact duplicates during deduplication (disables wobble) |
| -e | Exit early before .hic file creation |
| -f | Include fragment-delimited maps in .hic output |
| -u | Use single-end mode for alignment |
| -m | Process methylation + Hi-C library |
| --assembly | Early exit after deduplication (for 3D-DNA input) |
| --cleanup | Remove intermediate files if pipeline completes |
| --qc_apa | Run APA-based QC |
| --qc | Downsample to 1 kb, skip annotation |
| --in-situ | Limit to 1 kb map resolution (no annotation) |
| -h, --help | Display usage help and exit |
Output¶
Upon completion, the main output file is:
You can visualize this file using Juicebox.
The other output files in aligned/ include: