Downloading SRA Data
The NCBI Sequence Read Archive (SRA) is the largest publicly available repository of high-throughput sequencing data. This guide walks you through downloading FASTQ files from SRA on RCAC clusters using the SRA Toolkit.
Quick Start
If you're familiar with HPC (Negishi/Gautschi/Bell) and just need the commands:
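A minimal sketch of the whole workflow follows. The helper name fetch_fastq and the module names in the trailing comment are assumptions for illustration, so confirm the module name with module avail on your cluster.

```shell
# Hypothetical helper: download one run, convert it, and compress the result.
# Usage: fetch_fastq <accession> [output_dir]
fetch_fastq() {
    local acc="$1" outdir="${2:-.}"
    mkdir -p "$outdir" || return 1
    # Step 1: download the .sra file into <output_dir>/<accession>/
    prefetch --output-directory "$outdir" "$acc" || return 1
    # Step 2: convert to FASTQ (paired reads split, 8 threads, progress shown)
    fasterq-dump --split-files -e 8 -p -O "$outdir" "$outdir/$acc" || return 1
    # Step 3: compress every FASTQ file produced for this accession
    gzip "$outdir/$acc"*.fastq
}

# On the cluster (module names are an assumption, not confirmed here):
#   module load biocontainers sra-tools
#   fetch_fastq <accession> "$RCAC_SCRATCH/sra_data"
```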
For detailed instructions and best practices, continue reading below.
Understanding SRA Downloads
Before downloading, it's important to understand the two-step workflow:
- prefetch - Downloads the compressed .sra file from NCBI to your local cache
- fasterq-dump - Converts the .sra file to FASTQ format
Tip
Always use prefetch before fasterq-dump. Direct downloads with fasterq-dump alone are slower and prone to network failures.
Step-by-Step Guide
- Load the SRA Toolkit module
Verify the installation by checking that the tools report a version, for example with fasterq-dump --version.
- Configure your cache directory (first time only)
By default, SRA Toolkit caches files in your home directory, which has limited space. Configure it to use scratch space instead:
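A minimal sketch of that configuration, using the same vdb-config setting this guide lists under Troubleshooting. The helper name configure_sra_cache is hypothetical, and the default location assumes that RCAC_SCRATCH is set in your environment.

```shell
# Hypothetical helper: point the SRA Toolkit cache at a scratch directory.
# Usage: configure_sra_cache [cache_dir]   (defaults to $RCAC_SCRATCH/ncbi)
configure_sra_cache() {
    local cache_dir="${1:-}"
    if [ -z "$cache_dir" ]; then
        # Fall back to scratch, but refuse to guess if the variable is unset
        [ -n "${RCAC_SCRATCH:-}" ] || { echo "RCAC_SCRATCH is not set" >&2; return 1; }
        cache_dir="$RCAC_SCRATCH/ncbi"
    fi
    mkdir -p "$cache_dir" || return 1
    # Store the cache root in the toolkit's user configuration
    vdb-config -s "/repository/user/main/public/root=$cache_dir"
}

# Usage:
#   configure_sra_cache
```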
- Find your SRA accession numbers
SRA accessions typically start with:
- SRR - Individual run (most common)
- SRP - Study/Project
- SRX - Experiment
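The prefix check can be sketched as a small shell function. The name accession_type is hypothetical, the accession in the example call is a made-up placeholder, and only the three prefixes listed above are recognized.

```shell
# Hypothetical helper: report what kind of record an accession refers to,
# based only on the three prefixes listed above.
accession_type() {
    case "$1" in
        SRR[0-9]*) echo "run" ;;
        SRP[0-9]*) echo "study" ;;
        SRX[0-9]*) echo "experiment" ;;
        *)         echo "unknown" ;;
    esac
}

accession_type SRR1234567   # prints: run
```

The commands in this guide are shown with run (SRR) accessions. For a study or experiment, the SRA Run Selector listed under Additional Resources can produce the matching list of runs.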
- Download the SRA file using prefetch
For a single accession, pass it to prefetch as the only argument.
For multiple accessions from a file:
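A minimal sketch that loops over accession_list.txt and reports any accession that fails to download. The helper name prefetch_list is hypothetical. prefetch also has an --option-file flag that reads its arguments from a file, which may be simpler if you do not need per-accession failure reporting.

```shell
# Hypothetical helper: prefetch every accession listed in a file
# (one per line) and report the ones that failed.
# Usage: prefetch_list accession_list.txt
prefetch_list() {
    local list="$1" acc failed=0
    [ -r "$list" ] || { echo "cannot read $list" >&2; return 1; }
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -n "$acc" ] || continue              # skip blank lines
        # </dev/null keeps prefetch from consuming the list on stdin
        if ! prefetch "$acc" < /dev/null; then
            echo "FAILED: $acc" >&2
            failed=$((failed + 1))
        fi
    done < "$list"
    [ "$failed" -eq 0 ]                        # non-zero exit if anything failed
}

# Usage:
#   prefetch_list accession_list.txt
```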
Where accession_list.txt contains one accession per line.
- Convert to FASTQ using fasterq-dump
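For paired-end data, run fasterq-dump with --split-files so that each mate is written to its own file.
For single-end data, run fasterq-dump on the accession without --split-files.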
Key options:
- --split-files - Separates paired reads into _1.fastq and _2.fastq
- -e 8 - Use 8 threads (adjust based on your allocation)
- -p - Show progress
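The options above can be combined as in the sketch below. The helper name sra_to_fastq and its paired/single argument are hypothetical conveniences, not part of the SRA Toolkit.

```shell
# Hypothetical helper: convert a prefetched run to FASTQ.
# Usage: sra_to_fastq <accession> <paired|single> [threads]
sra_to_fastq() {
    local acc="$1" layout="$2" threads="${3:-8}"
    case "$layout" in
        # Paired-end: write mates to <accession>_1.fastq and <accession>_2.fastq
        paired) fasterq-dump --split-files -e "$threads" -p "$acc" ;;
        # Single-end: no splitting needed
        single) fasterq-dump -e "$threads" -p "$acc" ;;
        *) echo "layout must be 'paired' or 'single'" >&2; return 2 ;;
    esac
}

# Usage:
#   sra_to_fastq <accession> paired 8
```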
- Compress the FASTQ files
FASTQ files are large. Compress them to save space:
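A minimal sketch using plain gzip. The helper name compress_fastq is hypothetical. gzip replaces each file with a .gz version in place.

```shell
# Hypothetical helper: gzip every uncompressed FASTQ file in a directory.
# Usage: compress_fastq [dir]   (defaults to the current directory)
compress_fastq() {
    local dir="${1:-.}" f
    for f in "$dir"/*.fastq; do
        [ -e "$f" ] || continue   # nothing matched: the glob stays literal
        gzip "$f" || return 1
    done
}

# Usage:
#   compress_fastq "$RCAC_SCRATCH/sra_data"
```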
SLURM Batch Script
For large downloads, submit a batch job rather than running interactively.
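The block below writes an example batch script to download_sra.sh in the current directory. It is a sketch under several assumptions: the filename, job name, resource values, and time limit are placeholders; the module names are not confirmed for your cluster; and the job expects accession_list.txt in the directory you submit from, which should be on scratch.

```shell
# Write the example batch script to download_sra.sh (hypothetical filename).
cat > download_sra.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=sra_download
#SBATCH --account=your_account
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=12:00:00
#SBATCH --output=sra_download_%j.log

# Module names are an assumption; confirm them with `module avail`.
module load biocontainers sra-tools

# accession_list.txt: one run accession per line.
while IFS= read -r acc; do
    [ -n "$acc" ] || continue
    # Download first, then convert, then compress; skip to the next run on failure.
    prefetch "$acc" || { echo "prefetch failed: $acc" >&2; continue; }
    fasterq-dump --split-files -e "$SLURM_CPUS_PER_TASK" -p "$acc" \
        || { echo "fasterq-dump failed: $acc" >&2; continue; }
    gzip "$acc"*.fastq
done < accession_list.txt
EOF
```

Using SLURM_CPUS_PER_TASK for the thread count keeps fasterq-dump in step with the --cpus-per-task request, so only one value needs changing when you resize the job.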
Submit the job with sbatch, passing the script's filename as the argument.
Note
Replace your_account with your actual SLURM account. Find available accounts with slist.
Verification Steps
After downloading, verify your files are complete and uncorrupted:
- Check file sizes
List the files with ls -lh and check their sizes. FASTQ files should be reasonably sized (typically 1-50 GB for most runs).
- Count reads
Count the number of reads in each file:
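A minimal sketch that counts reads as the line count divided by four, assuming the standard four-line FASTQ record. The helper names count_reads and check_pair are hypothetical.

```shell
# Hypothetical helper: print the number of reads in a FASTQ file
# (plain or gzipped), assuming four lines per record.
count_reads() {
    local lines
    case "$1" in
        *.gz) lines=$(gzip -dc "$1" | wc -l | tr -d ' ') ;;
        *)    lines=$(wc -l < "$1" | tr -d ' ') ;;
    esac
    echo $((lines / 4))
}

# Hypothetical helper: succeed only if both mate files hold the same number of reads.
check_pair() {
    [ "$(count_reads "$1")" -eq "$(count_reads "$2")" ]
}

# Usage:
#   count_reads <accession>_1.fastq.gz
#   check_pair <accession>_1.fastq.gz <accession>_2.fastq.gz && echo "pair OK"
```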
For paired-end data, both files should have the same read count.
- Check file integrity
Verify gzip compression is intact:
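A minimal sketch built on gzip -t, which tests a compressed file without extracting it. The helper name check_gzip_dir is hypothetical.

```shell
# Hypothetical helper: test every .fastq.gz file in a directory and
# report the ones that fail gzip's integrity check.
# Usage: check_gzip_dir [dir]   (defaults to the current directory)
check_gzip_dir() {
    local dir="${1:-.}" f bad=0
    for f in "$dir"/*.fastq.gz; do
        [ -e "$f" ] || continue   # nothing matched: the glob stays literal
        if gzip -t "$f" 2>/dev/null; then
            echo "OK      $f"
        else
            echo "CORRUPT $f"
            bad=$((bad + 1))
        fi
    done
    [ "$bad" -eq 0 ]              # non-zero exit if any file is corrupt
}
```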
- Inspect first few reads
Print the first 8 lines of a file (for example, gzip -dc <file>.fastq.gz | head -n 8) and check that the FASTQ format looks correct.
You should see blocks of 4 lines: header (@), sequence, separator (+), and quality scores.
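The four-line layout can also be checked mechanically. The sketch below inspects the first records of a file with awk. The helper name check_fastq_format and its default of 1000 records are hypothetical choices, and passing this check does not replace a full quality assessment.

```shell
# Hypothetical helper: check the first N records (default 1000) of a
# FASTQ file, plain or gzipped, against the four-line layout.
# Usage: check_fastq_format <file> [records]
check_fastq_format() {
    local file="$1" n="${2:-1000}"
    case "$file" in
        *.gz) gzip -dc "$file" 2>/dev/null ;;
        *)    cat "$file" ;;
    esac | head -n $((n * 4)) | awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { print "line " NR ": header must start with @"; bad = 1 }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" { print "line " NR ": separator must start with +"; bad = 1 }
        NR % 4 == 0 && length($0) != seqlen { print "line " NR ": quality length differs from sequence length"; bad = 1 }
        END {
            if (NR % 4 != 0) { print "incomplete final record"; bad = 1 }
            exit (bad ? 1 : 0)
        }
    '
}
```

The function prints one message per problem and exits non-zero if any record fails, so it can gate later steps in a script.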
- Run FastQC (optional)
For comprehensive quality assessment, run fastqc on the FASTQ files; it accepts gzipped input directly.
Expected Output
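After successful download and conversion, you should have two gzipped files per run for paired-end data, named <accession>_1.fastq.gz and <accession>_2.fastq.gz.
For single-end data, expect a single gzipped FASTQ file per run.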
Troubleshooting
Download fails with network timeout
Try these solutions:
- Use prefetch with resume capability: it automatically resumes interrupted downloads
- Download during off-peak hours
- Check your network connection with ping www.ncbi.nlm.nih.gov
Disk quota exceeded error
- Ensure your cache is set to scratch: vdb-config -s /repository/user/main/public/root=$RCAC_SCRATCH/ncbi
- Clean up old cached files: rm -rf $RCAC_SCRATCH/ncbi/sra/*.sra
- Check your quota with myquota
fasterq-dump runs out of memory
- Request more memory in your SLURM script (--mem=32G or higher)
- Reduce the number of threads (-e 4 instead of -e 8)
- Use the --temp flag to specify a temp directory on scratch
Files are empty or truncated
- Re-run prefetch to re-download the SRA file
- Verify the accession number is correct
- Check if the SRA record is still available on NCBI
FAQs
How long do downloads typically take?
Download times vary based on file size and network conditions:
- Small datasets (< 5 GB): 15-30 minutes
- Medium datasets (5-20 GB): 1-3 hours
- Large datasets (> 20 GB): 3+ hours
The prefetch step is typically the bottleneck as it depends on network speed.
Can I download directly without prefetch?
Technically yes, fasterq-dump accepts an accession directly and downloads the data as it converts, but it's not recommended.
Using prefetch first is faster, more reliable, and allows resuming interrupted downloads.
How do I download data from ENA instead?
ENA (European Nucleotide Archive) often has faster downloads and serves FASTQ files directly, so you can fetch them with wget or curl.
The exact URL structure varies. Find the correct URLs on the ENA Browser.
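As an alternative to looking up URLs by hand, the sketch below asks ENA's Portal API for a run's file report and turns its fastq_ftp column into download URLs. The helper names are hypothetical, and the endpoint, the field name, and the ftp:// scheme are assumptions here, so check them against the ENA documentation before relying on them.

```shell
# Hypothetical helper: build the ENA Portal API file-report URL for a run.
# The endpoint and field name are assumptions; verify against the ENA docs.
ena_report_url() {
    echo "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=$1&result=read_run&fields=fastq_ftp"
}

# Hypothetical helper: read a file report (TSV with a header row) on stdin
# and print one download URL per FASTQ file. Assumes the fastq_ftp column
# holds semicolon-separated paths without a scheme.
ena_fastq_urls() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "fastq_ftp") col = i; next }
        col && $col != "" {
            n = split($col, parts, ";")
            for (j = 1; j <= n; j++) print "ftp://" parts[j]
        }
    '
}

# Usage (needs network access):
#   curl -s "$(ena_report_url <accession>)" | ena_fastq_urls | xargs -n 1 wget
```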
What's the difference between split-files and split-3?
- --split-files: Creates _1.fastq and _2.fastq for paired data
- --split-3: Creates _1.fastq, _2.fastq, and an additional file for orphaned reads
For most analyses, --split-files is sufficient.
Additional Resources
- NCBI SRA Toolkit Documentation
- SRA Run Selector - Find and download accession lists
- ENA Browser - Alternative download source