AI Datasets

The AI datasets in this collection cover computer vision, physical AI, and robotics. They support tasks such as detection, segmentation, tracking, control, reinforcement learning, and large-scale model pretraining and evaluation, across domains including everyday objects, smart spaces, and embodied physical AI.

This page provides access to curated AI datasets available on RCAC clusters through two storage options:

  • compressed formats on the POSIX file system, and
  • raw/extracted data on S3-compatible object storage.

Accessing AI Datasets

Choose your preferred access method based on your workflow needs.

Which option should I choose?

  • Use POSIX if you are running training jobs on RCAC clusters and need maximum I/O performance with compressed formats.
  • Use S3 if you need to browse or download individual files, or stream data without downloading entire archives.

On the POSIX file system, datasets are stored in compressed formats such as LMDB and SquashFS, which are optimized for high-performance access on RCAC clusters. They are ideal for training jobs that need fast, local access to preprocessed data.

Quick Start:

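$ module avail                     # list all available modules
$ module load datasets             # make the dataset modules visible
$ module avail ai                  # list the AI dataset modules
$ module load ai/<dataset-name>    # load a specific dataset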

Working with Datasets

Once you have loaded a dataset module, the following environment variables are automatically set to simplify access:

Variable                        Description
$<DATASET_NAME>_ROOTDIR         Root directory of the dataset
$<DATASET_NAME>_HOME            Dataset home path
$RCAC_<DATASET_NAME>_ROOT       RCAC-specific root path
$RCAC_<DATASET_NAME>_VERSION    Dataset version
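
As a minimal sketch of how these variables might be inspected after loading a module, the helper below prints all four for a given prefix. The function name show_dataset_vars and the example prefix COCO are hypothetical; how a particular module spells <DATASET_NAME> is an assumption, so check the output of `module show ai/<dataset-name>` for the actual names.

```shell
#!/bin/sh
# Hypothetical helper: print the four variables a dataset module is
# described as setting. The prefix (for example COCO) is an assumption
# about how a given module spells <DATASET_NAME>.
show_dataset_vars() {
    prefix="$1"
    # Accept only characters valid in a variable name, since eval is used below.
    case "$prefix" in
        ''|*[!A-Za-z0-9_]*) echo "invalid prefix: $prefix" >&2; return 1 ;;
    esac
    for name in "${prefix}_ROOTDIR" "${prefix}_HOME" \
                "RCAC_${prefix}_ROOT" "RCAC_${prefix}_VERSION"; do
        # Indirect lookup: read the variable whose name is stored in $name.
        eval "value=\${$name-}"
        printf '%s=%s\n' "$name" "${value:-<unset>}"
    done
}
```

After `module load ai/<dataset-name>`, calling `show_dataset_vars COCO` would list each variable with its value, or `<unset>` if the module does not define it under that prefix. A training script can then read the dataset location from the root-directory variable instead of hard-coding a path.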

Raw and extracted datasets are available via S3-compatible object storage. This is useful for workflows that need direct access to individual files without decompressing archives, or for transferring data to other systems.

Parameter    Value
Endpoint     https://s3.anvil.rcac.purdue.edu
Bucket       ai-datasets
Access       Public read-only

Tools: You can use any S3-compatible tool such as rclone, s3cmd, or Python boto3.

$ module load rclone   # rclone requires module load; s3cmd is available by default

For detailed instructions, see the Anvil Object Storage documentation and User Tools guide.
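
The following sketch shows one way the endpoint and bucket listed above might be used with rclone without creating a configuration file. The function names ai_datasets_ls and ai_datasets_copy are hypothetical, the object paths inside the bucket are not documented here, and `--s3-provider Other` is an assumption about the storage backend. It also assumes that rclone falls back to anonymous access when no credentials are supplied, which suits a public read-only bucket.

```shell
#!/bin/sh
# Endpoint and bucket as listed in the parameter table.
AI_DATASETS_ENDPOINT="https://s3.anvil.rcac.purdue.edu"
AI_DATASETS_BUCKET="ai-datasets"

# Hypothetical wrapper: list objects under an optional path in the bucket.
# Uses an on-the-fly ":s3:" remote, so no rclone config file is required.
# "--s3-provider Other" is an assumption about the storage backend.
ai_datasets_ls() {
    rclone lsf \
        --s3-provider Other \
        --s3-endpoint "$AI_DATASETS_ENDPOINT" \
        ":s3:${AI_DATASETS_BUCKET}/${1-}"
}

# Hypothetical wrapper: copy a path from the bucket to a local directory.
ai_datasets_copy() {
    rclone copy \
        --s3-provider Other \
        --s3-endpoint "$AI_DATASETS_ENDPOINT" \
        ":s3:${AI_DATASETS_BUCKET}/$1" "$2"
}
```

With rclone loaded, `ai_datasets_ls` would list the top level of the bucket, and `ai_datasets_copy <path-in-bucket> ./local-dir` would fetch only that path rather than a whole archive.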

Available AI Datasets

Dataset                                       Description
COCO                                          Common Objects in Context - object detection, segmentation, and captioning
LVIS                                          Large Vocabulary Instance Segmentation
VisualGenome                                  Visual knowledge base with structured image annotations
commoncrawl                                   Web crawl data for pretraining language models
fast.ai                                       Practical deep learning datasets and models
PhysicalAI-Robotics-GR00T-Teleop-Sim          GR00T teleoperation simulation data
PhysicalAI-Robotics-GR00T-X-Embodiment-Sim    GR00T cross-embodiment simulation
PhysicalAI-Robotics-Manipulation-SingleArm    Single-arm manipulation datasets
PhysicalAI-SmartSpaces                        Smart environment interaction data