Optimizing Trinity
When running Trinity on RCAC clusters, you will most likely hit the file limit if you run it on /scratch on the Bell or Negishi clusters. Here are a few recommendations to improve the performance of your runs:
1. Using the --workdir option
During the first two steps of Trinity (inchworm and chrysalis), few files are created. The second phase (phase 2, butterfly), however, creates a large number of files. You can run this part of the analysis on /dev/shm, a memory-based file system, by passing a path under /dev/shm to --workdir. It is faster than /scratch (a network drive) and has no file limit.
A typical command would look like this:
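A minimal sketch of such a command, assuming paired-end FASTQ input: the read file names, CPU count and memory limit are placeholders, and the hypothetical `run_trinity_shm` wrapper is only there so the call can be pasted into a job script and invoked once.

```shell
# Sketch only: file names and resource values are placeholders.
run_trinity_shm() {
    # Phase 2 working directory on the memory-backed file system
    workdir="/dev/shm/${USER:-$(id -un)}/trinity_workdir"
    mkdir -p "$workdir"

    Trinity --seqType fq \
        --left reads_1.fq.gz \
        --right reads_2.fq.gz \
        --CPU 32 \
        --max_memory 100G \
        --output "${PWD}/trinity_out_dir" \
        --workdir "$workdir"
}
```

Because /dev/shm is backed by RAM, the space used there counts against the node's memory, so it is worth leaving some headroom beyond the value given to --max_memory.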
Note
If the job runs out of walltime, the files in /dev/shm will be lost, so be sure to request enough walltime for this job.
2. Cleaning intermediates
Depending on your downstream analyses, you may not need to keep the intermediate files generated by Trinity. You can use the --full_cleanup option to delete them. If any runs fail in the butterfly step, the intermediate files are not cleaned up, so you can resume the run from there.
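As a sketch, this is an otherwise ordinary run with --full_cleanup added. The read file names and resource values are placeholders, and `run_trinity_cleanup` is a hypothetical wrapper name.

```shell
# Sketch only: file names and resource values are placeholders.
run_trinity_cleanup() {
    Trinity --seqType fq \
        --left reads_1.fq.gz \
        --right reads_2.fq.gz \
        --CPU 32 \
        --max_memory 100G \
        --output "${PWD}/trinity_out_dir" \
        --full_cleanup
}
```

On success, Trinity should leave only the assembled FASTA, named after the output directory (trinity_out_dir.Trinity.fasta in this sketch).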
3. Normalization of reads
By default, Trinity enables in silico normalization of reads. This is especially useful for very large datasets. If you are using --no_normalize_reads, consider removing this option. Normalization not only reduces memory usage and runtime but also improves the assembly of over-sampled transcripts. In fact, having too many reads can degrade the quality of the assembly.
4. Running Trinity stepwise
Finally, if you prefer more control, you can run the steps one at a time, as shown in the Trinity documentation. This lets you monitor the progress of the run, restart from a failed step if needed, and archive the intermediate files as you go.
- Run the initial in silico normalization step and k-mer counting.
- Run through inchworm, stop before chrysalis.
- Run through chrysalis, stop before Trinity phase 2 parallel assembly of clustered reads.
- Finish the job, running all phase 2 mini-assemblies in parallel.
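The four steps above can be sketched as follows, using the stop-point flags from the Trinity documentation (--no_run_inchworm, --no_run_chrysalis, --no_distributed_trinity_exec). The shared options are placeholders, and `trinity_base` and `run_trinity_stepwise` are hypothetical helper names. Each call reuses the same output directory so that Trinity resumes from where the previous step stopped.

```shell
# Shared options (placeholders); extra flags are appended per step.
trinity_base() {
    Trinity --seqType fq \
        --left reads_1.fq.gz \
        --right reads_2.fq.gz \
        --CPU 32 \
        --max_memory 100G \
        --output "${PWD}/trinity_out_dir" \
        "$@"
}

run_trinity_stepwise() {
    # 1. In silico normalization and k-mer counting only
    trinity_base --no_run_inchworm || return 1
    # 2. Run through inchworm, stop before chrysalis
    trinity_base --no_run_chrysalis || return 1
    # 3. Run through chrysalis, stop before phase 2
    trinity_base --no_distributed_trinity_exec || return 1
    # 4. Finish the job: all phase 2 mini-assemblies in parallel
    trinity_base
}
```

In practice each step could also be submitted as its own job, which makes it easier to archive intermediate files between steps.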
5. Using node-local storage
If you are running Trinity on a smaller dataset (<100Gb), you can run it on the local storage of the compute node. This will be faster than running on /scratch. You will need to copy the output files back once the run finishes.
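A minimal sketch of this pattern, assuming the node-local disk is reachable through `$TMPDIR` or /tmp (check what your cluster provides). The read file names and resource values are placeholders, and `run_trinity_local` is a hypothetical helper name.

```shell
# Sketch only: assumes $TMPDIR or /tmp is node-local storage.
run_trinity_local() {
    dest="${1:-$PWD}"    # where the results should end up
    local_dir="${TMPDIR:-/tmp}/${USER:-$(id -un)}/trinity_local.$$"
    mkdir -p "$local_dir" "$dest" || return 1

    Trinity --seqType fq \
        --left "${PWD}/reads_1.fq.gz" \
        --right "${PWD}/reads_2.fq.gz" \
        --CPU 32 \
        --max_memory 100G \
        --output "${local_dir}/trinity_out_dir" || return 1

    # Copy the results back, then free the node-local space
    cp -r "${local_dir}"/trinity_out_dir* "$dest"/ && rm -rf "$local_dir"
}
```

Calling `run_trinity_local /path/to/results` at the end of a job script would leave the output under that directory. If Trinity fails, the function returns before the copy and the local files stay in place for inspection until the job ends.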
A simple job script
A job script to run Trinity on the Bell or Negishi cluster is shown below. You can modify the script to suit your needs.
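A sketch of such a script for Slurm, combining the --workdir and --full_cleanup recommendations. The module names, resource requests, walltime and read file names are assumptions to be adapted, not values confirmed for these clusters.

```shell
#!/bin/bash
#SBATCH --job-name=trinity
#SBATCH --partition=<partition-name>
#SBATCH --account=<account-name>
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --time=48:00:00

# Module names are an assumption; check `module avail` on your cluster.
module --force purge
module load biocontainers trinity

# Phase 2 working directory on the memory-backed file system
workdir="/dev/shm/${USER}/trinity_workdir"
mkdir -p "$workdir"

# Read file names and --max_memory are placeholders.
Trinity --seqType fq \
    --left reads_1.fq.gz \
    --right reads_2.fq.gz \
    --CPU "${SLURM_CPUS_PER_TASK}" \
    --max_memory 100G \
    --output "${SLURM_SUBMIT_DIR}/trinity_out_dir" \
    --workdir "$workdir" \
    --full_cleanup

# Release the memory held in /dev/shm once the run is over
rm -rf "$workdir"
```

The script would be submitted with `sbatch`, and `SLURM_CPUS_PER_TASK` keeps the Trinity thread count in line with the CPUs requested in the header.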
Note
Replace <partition-name> and <account-name> with appropriate values.
Frequently Asked Questions
How can I count the number of files in a directory?
You can use the find command to count the number of files, e.g.:
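One way to do this, as a sketch, is to pipe `find` into `wc -l`. The `count_files` wrapper is a hypothetical convenience name; it counts regular files recursively and defaults to the current directory.

```shell
# Count regular files under a directory (defaults to the current one)
count_files() {
    find "${1:-.}" -type f | wc -l
}
```

For example, `count_files trinity_out_dir` reports how many files a run has created so far. File names containing newlines would be over-counted, which is rarely a concern for Trinity output.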