Why Dataset Format Matters

Robot training data format is not a detail you can defer. The format you choose on day one determines three things that will affect your project for months:

1. Training Framework Compatibility

Each major training framework expects data in a specific format. ACT and Diffusion Policy read HDF5 natively. Octo and the Open X-Embodiment data mix scripts expect RLDS/TFRecord. The LeRobot training library reads LeRobot Parquet. If your data is in the wrong format, you are writing conversion scripts before you can train — and conversion scripts are where subtle data corruption bugs hide.

2. Storage Efficiency and Access Patterns

A 500-episode dataset with 3 cameras at 30 fps occupies 25-50 GB in raw HDF5, 15-30 GB in compressed HDF5, 3-8 GB in LeRobot (MP4 video), or 20-40 GB in RLDS/TFRecord. The storage difference matters for cloud hosting costs, download times, and training data loading speed. But storage efficiency trades off against data fidelity: LeRobot's MP4 compression is lossy, while HDF5 and RLDS preserve exact pixel values.

3. Community and Sharing

If you want to share your dataset publicly, LeRobot format gives you one-command upload to Hugging Face Hub with built-in web visualization. RLDS gives you compatibility with the Open X-Embodiment ecosystem (50+ datasets, 22 robot types). HDF5 gives you maximum flexibility but no standardized sharing platform.

Our recommendation: Use HDF5 as your source-of-truth collection and storage format. Convert to LeRobot for sharing and to RLDS for cross-embodiment training. This gives you the best of all three ecosystems without the downsides of any single format lock-in.

HDF5: The Gold Standard for Robot Data Storage

HDF5 (Hierarchical Data Format 5) stores data in a filesystem-like hierarchy of groups (directories) and datasets (arrays). It was originally developed for scientific computing and has become the de facto standard for robot demonstration data thanks to its flexibility, mature tooling, and efficient random access.

Episode Structure

The standard ACT/ALOHA HDF5 layout stores each episode in its own file, with observations, actions, and metadata attributes at the root:

episode_0.hdf5
    observations/
        images/
            cam_high          # uint8 [T x 480 x 640 x 3]   overhead camera
            cam_wrist_left    # uint8 [T x 480 x 640 x 3]   left wrist camera
            cam_wrist_right   # uint8 [T x 480 x 640 x 3]   right wrist camera
        qpos                  # float32 [T x 14]  joint positions (7 per arm)
        qvel                  # float32 [T x 14]  joint velocities
    action                    # float32 [T x 14]  leader arm positions (supervision signal)
    attrs:
        task = "pick_cube_bimanual"
        operator_id = "op_03"
        success = True
        num_timesteps = 450
        timestamp = "2026-04-10T14:32:00Z"
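
As a minimal sketch, the layout above can be written with h5py as follows (dummy data; T is shortened here, while a real ALOHA episode runs ~450 steps at 30 fps):

```python
import h5py
import numpy as np

T = 50  # timesteps; shortened for the sketch

with h5py.File("episode_0.hdf5", "w") as f:
    # h5py creates intermediate groups automatically from path-style names
    images = f.create_group("observations/images")
    # Dummy frames stand in for real camera captures
    for cam in ("cam_high", "cam_wrist_left", "cam_wrist_right"):
        images.create_dataset(cam, data=np.zeros((T, 480, 640, 3), dtype=np.uint8),
                              chunks=(1, 480, 640, 3))
    f.create_dataset("observations/qpos", data=np.zeros((T, 14), dtype=np.float32))
    f.create_dataset("observations/qvel", data=np.zeros((T, 14), dtype=np.float32))
    f.create_dataset("action", data=np.zeros((T, 14), dtype=np.float32))
    # Episode metadata as file-level attributes
    f.attrs["task"] = "pick_cube_bimanual"
    f.attrs["operator_id"] = "op_03"
    f.attrs["success"] = True
    f.attrs["num_timesteps"] = T
    f.attrs["timestamp"] = "2026-04-10T14:32:00Z"
```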

Reading HDF5 with Python

Reading episodes with h5py is straightforward. Here is a complete example that loads an episode's observations and actions:

import h5py
import numpy as np

# Open a single episode file
with h5py.File("episode_0.hdf5", "r") as f:
    # Read joint positions and actions
    qpos = f["/observations/qpos"][:]        # shape: [T, 14]
    action = f["/action"][:]                  # shape: [T, 14]

    # Read a specific camera frame (random access)
    frame_100 = f["/observations/images/cam_high"][100]  # shape: [480, 640, 3]

    # Read all frames for a camera
    all_frames = f["/observations/images/cam_high"][:]   # shape: [T, 480, 640, 3]

    # Read metadata
    task = f.attrs.get("task", "unknown")
    success = f.attrs.get("success", False)

    print(f"Task: {task}, Success: {success}")
    print(f"Episode length: {qpos.shape[0]} timesteps")
    print(f"Joint positions range: [{qpos.min():.3f}, {qpos.max():.3f}]")

HDF5 Best Practices

  • Chunking: Always chunk datasets along the time axis. Use chunk_size=1 for random access (debugging, visualization) or chunk_size=32 for sequential read efficiency (training). Never store unchunked image data — it loads as a single monolithic block.
  • Compression: Use LZF for image data (3-5x faster than GZIP at similar ratios for camera frames). Use GZIP level 4 for joint trajectories (higher ratio, speed not critical). Do not compress images at collection time — apply compression in the final archive after QA validation.
  • Metadata attributes: Store episode metadata as HDF5 group attributes: episode.attrs['success'], episode.attrs['task'], episode.attrs['operator_id'], episode.attrs['robot_serial']. Include a schema_version attribute on the file root to track format changes.
  • One file per episode vs. one file per dataset: For datasets under 1,000 episodes, one HDF5 file per episode is simpler for parallel processing and partial re-collection. For larger datasets, consider packing 50-100 episodes per file to reduce filesystem overhead.
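
The chunking and compression settings above look like this in h5py (a sketch; dataset names and sizes are illustrative):

```python
import h5py
import numpy as np

T = 64
frames = np.random.randint(0, 256, size=(T, 480, 640, 3), dtype=np.uint8)
qpos = np.random.randn(T, 14).astype(np.float32)

with h5py.File("archived_episode.hdf5", "w") as f:
    # Images: one frame per chunk for random access, LZF for encode speed
    f.create_dataset("observations/images/cam_high", data=frames,
                     chunks=(1, 480, 640, 3), compression="lzf")
    # Joint trajectory: 32-step chunks for sequential reads, GZIP level 4 for ratio
    f.create_dataset("observations/qpos", data=qpos,
                     chunks=(32, 14), compression="gzip", compression_opts=4)
    # Track format changes at the file root
    f.attrs["schema_version"] = "1.0.0"
```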

HDF5 Pros and Cons

Pros
  • Mature library support (h5py, HDFView, Julia, C++)
  • Efficient random access to any frame
  • Flexible schema — add custom sensor types freely
  • Lossless storage preserves exact pixel values
  • Native to ACT, ALOHA, and Diffusion Policy
  • Human-inspectable with HDFView GUI
Cons
  • No built-in versioning or provenance tracking
  • Not cloud-streamable (must download full file)
  • Large file sizes without video compression
  • Schema inconsistencies between labs
  • No standardized sharing platform
  • Concurrent writes require careful locking

RLDS: The Open X-Embodiment Standard

RLDS (Reinforcement Learning Datasets) is the format used by the Open X-Embodiment dataset — the largest collection of robot manipulation data, with over 1 million real-robot trajectories spanning 22 robot embodiments and 527 skills. It serializes data as TFRecord files processed via TensorFlow Datasets (TFDS).

RLDS Schema

Each RLDS dataset is defined by a TensorFlow DatasetBuilder that specifies the features schema. Episodes are represented as sequences of steps, where each step contains:

# Standard RLDS step structure
step = {
    "observation": {
        "image": tf.uint8,         # shape: [H, W, C]
        "state": tf.float32,       # shape: [D]  (joint positions + gripper)
        "wrist_image": tf.uint8,   # shape: [H, W, C]  (optional)
    },
    "action": tf.float32,          # shape: [D]
    "reward": tf.float32,          # scalar
    "discount": tf.float32,        # scalar (typically 1.0)
    "is_terminal": tf.bool,        # True on terminal state
    "is_first": tf.bool,           # True on first step
    "is_last": tf.bool,            # True on last step
    "language_instruction": tf.string,  # natural language task description
}

Loading RLDS Data

import tensorflow_datasets as tfds

# Load an Open X-Embodiment dataset (OXE datasets are hosted on a public
# GCS bucket, so point data_dir there rather than the local TFDS catalog)
dataset = tfds.load("berkeley_autolab_ur5", data_dir="gs://gresearch/robotics",
                    split="train")

# Iterate over episodes
for episode in dataset.take(5):
    steps = episode["steps"]
    for step in steps:
        image = step["observation"]["image"].numpy()    # [H, W, 3]
        state = step["observation"]["state"].numpy()     # [D]
        action = step["action"].numpy()                  # [D]
        instruction = step["language_instruction"].numpy().decode()
        print(f"Instruction: {instruction}")
        print(f"State shape: {state.shape}, Action shape: {action.shape}")
        break  # just first step

RLDS Pros and Cons

Pros
  • Standardized schema enables cross-dataset training
  • Efficient streaming via tf.data pipelines
  • Cloud-native (stream from GCS/S3 without download)
  • 50+ datasets available in compatible format
  • Native to Octo, RT-2, and OXE data mix
  • Built-in language instruction field
Cons
  • TensorFlow dependency (heavy for PyTorch teams)
  • Sequential access only (no efficient random frame)
  • Rigid schema — custom sensors need DatasetBuilder
  • Writing a DatasetBuilder takes 2-4 hours
  • Inspection requires TF tooling
  • Less intuitive than HDF5 for debugging

LeRobot: The Hugging Face Ecosystem

LeRobot, developed by Hugging Face, uses Parquet files for tabular data (joint positions, actions, metadata) and MP4 video files for camera observations. It is designed for the open-source research workflow: collect locally, push to Hugging Face Hub, train with the LeRobot library, share results with the community.

LeRobot Dataset Structure

A LeRobot dataset on Hugging Face Hub contains:

my_dataset/
    data/
        train-00000-of-00001.parquet   # tabular data (all episodes)
    videos/
        observation.images.cam_high/
            episode_000000.mp4          # overhead camera video
            episode_000001.mp4
        observation.images.cam_wrist/
            episode_000000.mp4          # wrist camera video
            episode_000001.mp4
    meta/
        info.json                       # dataset metadata, features schema
        episodes.jsonl                  # per-episode metadata
        stats.json                      # per-feature mean/std/min/max

The Parquet file contains one row per timestep with columns for episode_index, frame_index, timestamp, observation.state (joint positions), action, and references to the corresponding video frame index.

Loading LeRobot Data

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Load a dataset from Hugging Face Hub
dataset = LeRobotDataset("lerobot/aloha_sim_transfer_cube_human")

# Access a single frame (returns a dict of tensors)
frame = dataset[0]
print(f"State: {frame['observation.state'].shape}")      # [D]
print(f"Action: {frame['action'].shape}")                 # [D]
print(f"Image: {frame['observation.images.cam_high'].shape}")  # [C, H, W]

# Get episode-level info
print(f"Number of episodes: {dataset.num_episodes}")
print(f"Number of frames: {dataset.num_frames}")
print(f"FPS: {dataset.fps}")

LeRobot Pros and Cons

Pros
  • One-command upload to Hugging Face Hub
  • Built-in web visualization at hf.co/datasets/
  • Compact storage (MP4 video 5-10x smaller than raw)
  • 300+ public datasets and growing fast
  • Native ACT and Diffusion Policy training support
  • Statistics (mean/std) computed automatically
Cons
  • MP4 compression is lossy — not source-of-truth quality
  • Video decoding adds latency during training
  • Parquet not ideal for variable-length episodes
  • Schema changes require full dataset rebuild
  • Newer format with evolving tooling
  • No random frame access without decoding video

Format Comparison Table

Feature                         | HDF5                             | RLDS / TFRecord                  | LeRobot / Parquet
--------------------------------|----------------------------------|----------------------------------|-------------------------------------
Native frameworks               | ACT, Diffusion Policy, custom    | Octo, RT-2, OXE data mix         | LeRobot, ACT (via lib), DP (via lib)
Storage size (500 eps, 3 cams)  | 15-30 GB (compressed)            | 20-40 GB                         | 3-8 GB (MP4)
Image fidelity                  | Lossless (raw uint8)             | Lossless (raw uint8)             | Lossy (MP4 H.264/H.265)
Random frame access             | Efficient (chunked)              | Inefficient (sequential)         | Requires video decode
Cloud streaming                 | No (download required)           | Yes (tf.data from GCS/S3)        | Yes (HF Hub streaming)
Schema flexibility              | High (any structure)             | Low (fixed DatasetBuilder)       | Medium (Parquet columns)
Sharing platform                | None (manual hosting)            | TFDS catalog                     | Hugging Face Hub
Community datasets              | Many (no central catalog)        | 50+ (Open X-Embodiment)          | 300+ (Hugging Face Hub)
Python tooling                  | h5py (mature, lightweight)       | tensorflow-datasets (heavy)      | lerobot, datasets (growing)
Recommended for                 | Primary storage, ACT/DP training | Cross-embodiment, Octo training  | Sharing, community, quick start

Converting Between Formats

You will eventually need data in multiple formats. Here is the practical guide to conversion, with the tools and estimated effort for each path.

HDF5 to LeRobot

The LeRobot library provides native conversion for ALOHA-style HDF5 datasets:

# Convert ALOHA HDF5 to LeRobot format and push to Hub
python -m lerobot.scripts.push_dataset_to_hub \
    --raw-dir /path/to/hdf5/episodes \
    --raw-format aloha_hdf5 \
    --repo-id your-org/dataset-name \
    --push-to-hub 1

For custom HDF5 schemas (not ALOHA), you need to write a small adapter function that maps your key names to LeRobot's expected schema. This typically takes 30-60 minutes.

HDF5 to RLDS

Converting to RLDS requires writing a custom TensorFlow DatasetBuilder. This is the most labor-intensive conversion (2-4 hours for a new schema) but is a one-time cost per dataset format:

# Skeleton RLDS DatasetBuilder (simplified)
import h5py
import tensorflow as tf
import tensorflow_datasets as tfds

class MyRobotDataset(tfds.core.GeneratorBasedBuilder):
    VERSION = tfds.core.Version("1.0.0")

    def _info(self):
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({
                "steps": tfds.features.Dataset({
                    "observation": tfds.features.FeaturesDict({
                        "image": tfds.features.Image(shape=(480, 640, 3)),
                        "state": tfds.features.Tensor(shape=(14,), dtype=tf.float32),
                    }),
                    "action": tfds.features.Tensor(shape=(14,), dtype=tf.float32),
                    "is_terminal": tf.bool,
                    "is_first": tf.bool,
                    "is_last": tf.bool,
                    "language_instruction": tfds.features.Text(),
                }),
            }),
        )

    def _generate_examples(self, path):
        # Read from your HDF5 files and yield one (key, episode) pair per file
        for episode_path in sorted(path.glob("*.hdf5")):
            with h5py.File(episode_path, "r") as f:
                steps = []  # map each HDF5 timestep to an RLDS step dict here
                yield episode_path.stem, {"steps": steps}

RLDS to LeRobot

LeRobot provides a built-in converter for RLDS datasets, including all Open X-Embodiment datasets:

# Convert any RLDS dataset to LeRobot format
python -m lerobot.scripts.push_dataset_to_hub \
    --raw-dir /path/to/rlds/dataset \
    --raw-format rlds \
    --repo-id your-org/converted-dataset \
    --push-to-hub 1

LeRobot to HDF5

There is no official tool for this direction, but it is straightforward to write (30-60 minutes):

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
import h5py
import numpy as np

dataset = LeRobotDataset("your-org/dataset-name")

for ep_idx in range(dataset.num_episodes):
    # Gather this episode's frames in a single pass (fine for small datasets;
    # indexing the dataset decodes video frames, so avoid doing it twice)
    ep_frames = [frame for frame in (dataset[i] for i in range(len(dataset)))
                 if frame["episode_index"].item() == ep_idx]

    with h5py.File(f"episode_{ep_idx:05d}.hdf5", "w") as f:
        qpos = np.stack([fr["observation.state"].numpy() for fr in ep_frames])
        action = np.stack([fr["action"].numpy() for fr in ep_frames])
        f.create_dataset("observations/qpos", data=qpos, chunks=(1, qpos.shape[1]))
        f.create_dataset("action", data=action, chunks=(1, action.shape[1]))
        # Decode and store video frames as image arrays
        # ... (video decode step adds complexity)

Important caveat: Converting from LeRobot back to HDF5 cannot recover the original pixel-level fidelity because LeRobot stores images as lossy MP4 video. The converted HDF5 will contain decoded MP4 frames, not the original raw images.
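
If you need to quantify the loss, a simple check is to compare a decoded frame against its raw original, for example via PSNR. A minimal sketch (the toy noise below stands in for real compression artifacts; in practice you would compare a frame from the HDF5 master against the same frame decoded from the MP4):

```python
import numpy as np

def psnr(original: np.ndarray, decoded: np.ndarray) -> float:
    """Peak signal-to-noise ratio in dB for uint8 images (higher is better)."""
    mse = np.mean((original.astype(np.float64) - decoded.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10 * np.log10(255.0 ** 2 / mse)

# Toy example: a raw frame vs. the same frame with mild compression-like noise
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
noisy = np.clip(raw.astype(np.int16) + rng.integers(-2, 3, raw.shape),
                0, 255).astype(np.uint8)
print(f"PSNR: {psnr(raw, noisy):.1f} dB")
```

As a rough reference point, good-quality H.264 output typically lands in the high-30s to 40s of dB against the source.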

Conversion Summary Table

From → To       | Tool                                                  | Effort     | Notes
----------------|-------------------------------------------------------|------------|--------------------------------------------------
HDF5 → LeRobot  | lerobot.scripts.push_dataset_to_hub                   | 30 min     | Native ALOHA support; custom schemas need adapter
HDF5 → RLDS     | Custom DatasetBuilder                                 | 2-4 hours  | One-time per schema; requires TF knowledge
RLDS → LeRobot  | lerobot.scripts.push_dataset_to_hub --raw-format rlds | 15 min     | Works for all OXE datasets
LeRobot → HDF5  | Custom script                                         | 30-60 min  | Lossy: MP4 frames, not original raw images
Any → Any       | SVRC Platform                                         | 5 min      | Upload once, export to any format via UI

How SVRC Delivers Your Data

When you engage SVRC for a data collection campaign, here is how we handle format delivery:

Collection Format

We always collect in HDF5 as our source of truth. Raw sensor data is stored losslessly with per-frame timestamps, full metadata, and chunked datasets for efficient access. This master copy is retained for the duration of your project.

Delivery Format

You specify your target format in the project brief. We support:

  • HDF5: Direct delivery of the source-of-truth files. Includes schema documentation and a Python example script for loading.
  • RLDS / TFRecord: Converted with a custom DatasetBuilder matched to your schema. Includes the DatasetBuilder source code so you can re-run the conversion yourself.
  • LeRobot / Parquet: Pushed to a private Hugging Face Hub repository under your organization. Includes dataset card with full metadata, statistics, and visualization.
  • Custom formats: ROS bag, CSV, JSON-lines, or proprietary schemas. We write the export adapter and include it in the delivery.

What Is Included

Every dataset delivery includes:

  • The dataset files in your requested format
  • A data manifest (JSON) listing all episodes with metadata, quality scores, and statistics
  • Schema documentation describing every field, data type, and unit
  • A Python example script that loads one episode and prints shapes and ranges
  • Per-feature statistics (mean, std, min, max) for normalization during training
  • QA report summarizing quality metrics across the full dataset
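
The per-feature statistics in the list above take only a few lines of NumPy to compute (a sketch, assuming each episode is a [T, D] array and episodes may differ in length):

```python
import numpy as np

def feature_stats(episodes: list) -> dict:
    """Compute normalization statistics over all timesteps of all episodes."""
    stacked = np.concatenate(episodes, axis=0)  # [sum(T_i), D]
    return {
        "mean": stacked.mean(axis=0),
        "std": stacked.std(axis=0),
        "min": stacked.min(axis=0),
        "max": stacked.max(axis=0),
    }

# Example: two episodes of different lengths, 14-dim joint state
eps = [np.random.randn(450, 14).astype(np.float32),
       np.random.randn(300, 14).astype(np.float32)]
stats = feature_stats(eps)
print({k: v.shape for k, v in stats.items()})  # each value is a [14] vector
```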

SVRC Platform Export

If you use the SVRC Fearless Platform, you can upload datasets in any format and export to any other format through the web UI. The platform handles schema normalization, statistics computation, and format-specific encoding (MP4 for LeRobot, TFRecord for RLDS) automatically. Upload once, export as many times as you need.

Frequently Asked Questions

Which format should I use if I am just getting started?

Start with HDF5. It has the simplest tooling (just h5py), the most flexible schema, and is native to the most popular training frameworks (ACT, Diffusion Policy). You can always convert to LeRobot or RLDS later. If you want to share your dataset immediately on Hugging Face Hub, use LeRobot from the start — but keep the raw HDF5 as your backup.

Is LeRobot's MP4 compression a problem for training?

For most manipulation tasks, no. The visual artifacts from H.264 compression at reasonable quality settings (CRF 20-23) are below the noise level of typical camera sensors. However, for tasks where pixel-level accuracy matters — visual servoing to sub-millimeter targets, detecting thin wires or threads, or research that analyzes compression artifacts — use lossless HDF5 as your training source. The LeRobot team is exploring lossless video codecs (FFV1) for future versions.

Can I mix datasets from different formats for training?

Yes, but you need to normalize them to a common format first. The most practical approach is to convert everything to a single format before training. If you are training with Octo or doing cross-embodiment experiments, convert everything to RLDS. If you are training with the LeRobot library, convert everything to LeRobot format. The SVRC Platform can normalize and export mixed-format uploads into a unified dataset.

How do I version-control my robot datasets?

For LeRobot datasets on Hugging Face Hub, versioning is built in via git-lfs. For HDF5, use a data manifest file (JSON) alongside your HDF5 files that records schema_version, creation date, episodes list, and statistics. Bump the schema version when you change sensor configuration. For production workflows, the SVRC Platform provides full dataset versioning with rollback.
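
For the HDF5 manifest approach, a minimal sketch (the field names here are illustrative suggestions, not a standard):

```python
import json
from datetime import datetime, timezone

# Illustrative manifest schema for versioning an HDF5 dataset
manifest = {
    "schema_version": "1.2.0",  # bump when the sensor configuration changes
    "created": datetime.now(timezone.utc).isoformat(),
    "episodes": [
        {"file": "episode_00000.hdf5", "task": "pick_cube_bimanual",
         "success": True, "num_timesteps": 450},
    ],
    "statistics": {"qpos_mean": [0.0] * 14},  # per-feature stats for normalization
}

with open("manifest.json", "w") as fh:
    json.dump(manifest, fh, indent=2)
```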

What about ROS bag format?

ROS bag (rosbag2 in ROS2) is excellent for data recording during collection because it captures all ROS topics with timestamps natively. However, it is not well-suited as a training format because it requires ROS2 libraries to read, has no random access, and stores data in a format optimized for replay rather than ML training. The standard workflow is: record in ROS bag during collection, then convert to HDF5 (or LeRobot/RLDS) for training and sharing. This conversion step also serves as a data cleaning and validation checkpoint.