AI Datasets
AI datasets in this collection cover computer vision, physical AI, and robotics. They support tasks such as detection, segmentation, tracking, control, reinforcement learning, and large-scale model pretraining and evaluation across domains including everyday objects, smart spaces, and embodied physical AI.
This page provides access to curated AI datasets available on RCAC clusters through two storage options:
- compressed formats on the POSIX file system, and
- raw/extracted data on S3-compatible object storage.
Accessing AI Datasets
Choose your preferred access method based on your workflow needs.
Which option should I choose?
- Use POSIX if you are running training jobs on RCAC clusters and need maximum I/O performance with compressed formats.
- Use S3 if you need to browse or download individual files, or stream data without downloading entire archives.
Compressed formats like LMDB and SquashFS are optimized for high-performance access on RCAC clusters. These are ideal for training jobs that need fast, local access to preprocessed data.
Quick Start:
Working with Datasets
Once you have loaded a dataset module, the following environment variables are automatically set to simplify access:
| Variable | Description |
|---|---|
| $<DATASET_NAME>_ROOTDIR | Root directory of the dataset |
| $<DATASET_NAME>_HOME | Dataset home path |
| $RCAC_<DATASET_NAME>_ROOT | RCAC-specific root path |
| $RCAC_<DATASET_NAME>_VERSION | Dataset version |
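As a rough illustration of how a training script might pick up these variables, the sketch below reads all four for one dataset. The helper name `dataset_env` is hypothetical, and the sketch assumes you pass the name exactly as it appears in the variable (the casing and any character substitutions used by a given module are not specified here, so check with `env` after loading the module).

```python
import os


def dataset_env(dataset_name):
    """Return the module-provided variables for one dataset (None if unset).

    `dataset_name` is substituted verbatim for <DATASET_NAME>; how a module
    spells that name is an assumption you should verify on the cluster.
    """
    templates = {
        "rootdir": "{name}_ROOTDIR",
        "home": "{name}_HOME",
        "rcac_root": "RCAC_{name}_ROOT",
        "rcac_version": "RCAC_{name}_VERSION",
    }
    return {
        key: os.environ.get(template.format(name=dataset_name))
        for key, template in templates.items()
    }


# Example: look up a placeholder dataset name and fail early if its module
# has not been loaded in the current shell.
# info = dataset_env("EXAMPLE")
# if info["rootdir"] is None:
#     raise RuntimeError("Dataset module does not appear to be loaded")
```

Returning `None` for unset variables lets the caller decide whether a missing module is an error or whether to fall back to another data source.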
Raw and extracted datasets are available via S3-compatible object storage. This is useful for workflows that need direct access to individual files without decompressing archives, or for transferring data to other systems.
| Parameter | Value |
|---|---|
| Endpoint | https://s3.anvil.rcac.purdue.edu |
| Bucket | ai-datasets |
| Access | Public read-only |
Tools: You can use any S3-compatible tool, such as rclone, s3cmd, or the Python boto3 library.
For detailed instructions, see the Anvil Object Storage documentation and User Tools guide.
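For example, a minimal boto3 sketch for browsing the bucket could look like the following. It uses the endpoint and bucket from the table above, and it assumes that "public read-only" means unsigned (anonymous) requests are accepted. The function names are illustrative, and the layout of prefixes inside the bucket is not described here, so the listing starts at the bucket root.

```python
ENDPOINT_URL = "https://s3.anvil.rcac.purdue.edu"
BUCKET = "ai-datasets"


def make_anonymous_client():
    """Create an S3 client that sends unsigned requests (assumed to be allowed)."""
    # Imported lazily so this module loads even where boto3 is not installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    return boto3.client(
        "s3",
        endpoint_url=ENDPOINT_URL,
        config=Config(signature_version=UNSIGNED),
    )


def list_one_level(client, prefix="", max_keys=100):
    """List the folder-like prefixes and object keys directly under `prefix`."""
    response = client.list_objects_v2(
        Bucket=BUCKET, Prefix=prefix, Delimiter="/", MaxKeys=max_keys
    )
    folders = [item["Prefix"] for item in response.get("CommonPrefixes", [])]
    files = [item["Key"] for item in response.get("Contents", [])]
    return folders, files


# Example usage (requires network access and boto3):
# client = make_anonymous_client()
# folders, files = list_one_level(client)
# print(folders, files)
```

Only the first page of results is returned. For larger listings, boto3's `list_objects_v2` paginator can be used in place of the single call.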
Available AI Datasets
| Dataset | Description |
|---|---|
| COCO | Common Objects in Context - object detection, segmentation, and captioning |
| LVIS | Large Vocabulary Instance Segmentation |
| VisualGenome | Visual knowledge base with structured image annotations |
| commoncrawl | Web crawl data for pretraining language models |
| fast.ai | Practical deep learning datasets and models |
| PhysicalAI-Robotics-GR00T-Teleop-Sim | GR00T teleoperation simulation data |
| PhysicalAI-Robotics-GR00T-X-Embodiment-Sim | GR00T cross-embodiment simulation |
| PhysicalAI-Robotics-Manipulation-SingleArm | Single-arm manipulation datasets |
| PhysicalAI-SmartSpaces | Smart environment interaction data |