What Scaling Laws Are and Why They Matter
Scaling laws describe the predictable relationship between resource inputs (data, compute, model parameters) and model performance. In large language models, the Chinchilla paper (Hoffmann et al., 2022) established that optimal training requires scaling model size and dataset size in proportion: doubling parameters without doubling data produces a suboptimal model. This framework gave ML teams a principled way to allocate budgets and predict performance before committing resources.
The practical value of scaling laws is not academic. They tell you whether spending another $100K on data collection or on a larger GPU cluster will produce more capability improvement. They tell you when your current approach has hit a ceiling and needs a structural change rather than more resources. And they tell you whether the trajectory you are on will reach your performance target, or whether you need a fundamentally different approach.
The central question for robotics: do similar laws apply, and if so, what do they predict? By 2026, we have enough empirical evidence to give a nuanced but actionable answer.
How Robot Scaling Differs from Language Model Scaling
Before examining the evidence, it is important to understand why robot learning scaling laws cannot be a simple copy of LLM scaling dynamics. Three structural differences change the game fundamentally.
Data collection is expensive and slow. LLM training data (text from the internet) costs essentially nothing to acquire -- the bottleneck is compute. Robot demonstration data costs $5-50 per episode to collect (operator time, hardware wear, quality assurance), and collection throughput is limited by the number of physical robot stations available. This inverts the optimization problem: for robot learning, the question is not "how much compute can we afford?" but "how much diverse data can we afford to collect?"
The data distribution is multimodal and physics-coupled. Language model training data is sampled from a relatively uniform distribution (text on the internet). Robot learning data is fundamentally multimodal: it couples visual observations, proprioceptive state, force feedback, and temporal dynamics. The policy must learn a mapping from this high-dimensional input space to motor commands that are physically valid. Adding more text almost always helps an LLM; adding more demonstrations helps a robot policy only if they cover new regions of the input space.
Evaluation is physically grounded. LLM performance can be evaluated on held-out text benchmarks at scale. Robot policy performance can only be truly evaluated by running the policy on physical hardware, which is slow, expensive, and subject to hardware variability. This makes it harder to characterize scaling curves precisely and introduces noise into empirical measurements.
The Evidence: Open X-Embodiment and RT-X
The Open X-Embodiment project (Padalkar et al., 2023) assembled over 1 million robot manipulation episodes from 22 different robot embodiments across 33 research institutions. The RT-X models trained on this data provided the first large-scale evidence for cross-embodiment scaling.
The key finding was striking: RT-1-X, trained on the full multi-embodiment dataset, outperformed the original single-robot specialist policies by roughly 50% on average across the contributing labs' evaluation tasks. This demonstrated that data from different robot types is not just noise to the model; it contains transferable manipulation knowledge that improves generalization to new robots and new tasks. The scaling was not smooth and predictable like LLM scaling curves, but the direction was clear and consistent.
More specifically, the RT-X results revealed a hierarchy of what transfers. Visual scene understanding transferred most readily across embodiments. High-level task concepts (pick, place, open, close) transferred moderately well. Low-level motor control (specific joint trajectories, precise contact timing) transferred poorly and still required embodiment-specific fine-tuning.
The Evidence: DROID and Data Scaling
The DROID dataset (Khazatsky et al., 2024) is the most comprehensive empirical study of data scaling for robot manipulation to date. With 76,000 demonstrations collected across 564 distinct scenes and 86 task categories on a standardized Franka arm platform, DROID allowed researchers to test how policy performance scales with data quantity and data diversity independently.
The critical result: data diversity scales performance more reliably than data volume. When the authors trained ACT and Diffusion Policy on increasing subsets of DROID data, they observed consistent improvement up to roughly 10,000 demonstrations, diminishing returns at 30,000, and near-plateau at 50,000 for within-distribution performance. But adding diverse data (new environments, new object categories) continued to improve out-of-distribution performance long after adding more demonstrations of already-seen scenarios stopped helping.
This is a fundamental structural difference from language model scaling. More web text from the same distribution continues to improve language models. More demonstrations of the same task in the same environment do not continue to improve robot policies past a relatively low ceiling. The bottleneck is diversity, not volume.
What Scales Well in Robot Learning
Visual representation quality. Larger pre-trained visual encoders (DINOv2-Large vs. DINOv2-Base, SigLIP So400m vs. SigLIP Base) consistently produce better visual features for manipulation policies. The scaling here follows a familiar pattern from computer vision: bigger models learn richer, more generalizable visual representations. This is the most LLM-like scaling dynamic in robot learning.
Task diversity in pre-training data. Policies pre-trained on more task categories generalize better to new tasks. The Open X-Embodiment result on this axis is robust and has been replicated in subsequent work. Each additional task category added to the pre-training mixture improves zero-shot performance on held-out tasks, with diminishing but positive returns observed up to the limits of current datasets (hundreds of task categories).
Cross-embodiment transfer. Adding data from additional robot types to the pre-training set improves generalization to a new target robot, even when the target robot is kinematically dissimilar to any robot in the training set. The mechanism is that diverse embodiment data forces the model to learn embodiment-agnostic manipulation representations rather than overfitting to one robot's kinematics. This scales reliably with the number of embodiments in the training set.
Environment diversity. Policies trained on demonstrations from more physical environments (different rooms, lighting conditions, table surfaces) generalize better to novel deployment environments. This scales approximately linearly with the number of distinct environments up to roughly 20 environments, then with diminishing returns.
What Does Not Scale as Cleanly
Precise dexterous manipulation. Tasks requiring sub-centimeter precision (threading, insertion, screw driving) do not improve predictably with more data or larger models. The bottleneck is not visual representation quality or task knowledge; it is that precise contact dynamics are sensitive to object-specific properties (friction, compliance, geometry) that vary between instances in ways that defeat generalization. More data helps, but the scaling curve is shallow and noisy compared to gross manipulation tasks.
Contact dynamics. Policies that need to modulate grip force based on object properties (fragile vs. sturdy, rigid vs. deformable) show limited improvement from data scaling alone. The issue is that the contact-relevant properties are not visible to cameras: you cannot see an object's friction coefficient or compliance from an RGB image. Force-torque sensor data helps but introduces its own distribution shift problems. This is an area where hybrid approaches (learned perception + classical force control) outperform pure scaling of learned policies.
Long-horizon task planning. For tasks requiring 10+ sequential subtask executions, success rates scale poorly with data. The compounding error problem means that even a policy with 95% per-step success achieves only 60% success on a 10-step task and 36% on a 20-step task. More data improves per-step reliability but does not change the compounding arithmetic. Hierarchical policies (high-level planner + low-level skills) mitigate this, but the scaling dynamics of the high-level planner are less well-characterized than the skill-level scaling.
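The compounding arithmetic is easy to verify; a minimal sketch, assuming independent per-step failures (the function name is ours):

```python
def long_horizon_success(per_step: float, n_steps: int) -> float:
    """Overall task success rate when each step fails independently."""
    return per_step ** n_steps

for n in (10, 20):
    print(f"{n} steps at 95% per step -> {long_horizon_success(0.95, n):.0%} overall")
```

This reproduces the 60% and 36% figures quoted above and makes clear why better per-step data cannot beat the exponent: only shortening the chain (hierarchy, sub-policies) changes the arithmetic.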
Raw model parameter count for fine-tuned policies. Unlike language models where bigger is consistently better, the relationship between model size and manipulation performance after task-specific fine-tuning is non-monotonic. OpenVLA (7B parameters) outperforms Octo (93M) on zero-shot tasks but offers diminishing advantage after fine-tuning on 200+ task-specific demonstrations. For teams with dedicated data collection budgets, a well-fine-tuned smaller model is often more practical than a lightly-adapted larger one.
Why the Industry Is Racing to Collect Robot Data
The scaling evidence explains why every major robotics company and research lab in 2026 is investing heavily in real-world robot data collection infrastructure. If diversity is the bottleneck, and diversity requires collecting data in many environments with many objects on many robots, then the competitive advantage accrues to organizations that can generate diverse robot data at scale.
Physical Intelligence, Google DeepMind, Toyota Research Institute, and several well-funded startups are building or have built large-scale data collection facilities with dozens of robot stations, standardized task protocols, and professional operators. The logic is direct: whoever has the most diverse robot manipulation data will train the best foundation models, and the best foundation models will require the least customer-specific fine-tuning to deploy.
SVRC occupies a specific role in this ecosystem. We provide data collection infrastructure and professional operators to teams that need diverse, high-quality manipulation data but do not want to build and staff their own collection facility. Our data services are structured around the scaling dynamics described in this article: maximizing diversity of objects, environments, and task configurations rather than simply maximizing episode count.
What This Means for Research Labs
Academic research labs face a scaling challenge that industry does not. A university lab typically has 1-3 robot arms, one room, and a limited budget for object procurement and operator hours. The scaling evidence says that the most impactful dimension to invest in is diversity, not volume. Practical implications:
- Use foundation models. Pre-trained models (Octo, OpenVLA) provide the multi-embodiment, multi-task base that a single lab cannot replicate. Fine-tune on your specific task rather than training from scratch.
- Maximize object diversity per dollar. Buy 30 different objects from a thrift store rather than 3 expensive precision objects. The generalization return on diverse everyday objects far exceeds the return on precise but homogeneous objects.
- Collect in multiple rooms. Moving your setup between a lab, a kitchen, and an office, even if inconvenient, provides environmental diversity that dramatically improves policy robustness. Three environments with 100 demos each outperform one environment with 300 demos for out-of-distribution generalization.
- Contribute to and draw from open datasets. Open X-Embodiment, DROID, and Bridge V2 are free scaling you can add to your training pipeline without collecting a single additional demonstration.
What This Means for Companies
Companies building production robot systems face a different scaling challenge. They typically have a specific deployment environment and a specific task portfolio. The scaling evidence suggests a phased approach:
- Phase 1: Foundation model selection. Start with the best available pre-trained model for your robot type and task domain. This gives you the benefit of community-scale data diversity without collection cost.
- Phase 2: Targeted fine-tuning. Collect 200-500 demonstrations of your specific tasks in your specific deployment environment. Focus on object diversity within your target SKU categories and position diversity within your workspace.
- Phase 3: Continuous improvement. Deploy the fine-tuned policy and use logged deployment data (with human success/failure labels) to identify failure modes. Collect targeted data to address those failures. This is where the diversity-over-volume principle matters most: do not re-collect data you already have; collect data that covers the gaps.
- Phase 4: Scale across sites. Each new deployment site provides environmental diversity that improves the base policy. Structure your deployment data pipeline to flow data from all sites back to a central training pool. This turns your deployment scale into a data scaling advantage.
Data Quantity vs. Model Size vs. Performance: The Numbers
| Configuration | Model Size | Training Data | In-Distribution Success | OOD Success |
|---|---|---|---|---|
| ACT from scratch | ~30M params | 100 demos | 75-85% | 20-35% |
| ACT from scratch | ~30M params | 500 demos | 88-93% | 35-50% |
| Octo (fine-tuned) | 93M params | OXE + 200 task demos | 85-92% | 45-60% |
| OpenVLA (fine-tuned) | 7B params | OXE + 200 task demos | 88-95% | 55-70% |
| OpenVLA (fine-tuned) | 7B params | OXE + 500 task demos | 92-97% | 60-75% |
The key takeaway from these numbers: a fine-tuned foundation model with 200 task-specific demonstrations typically matches or exceeds a from-scratch model with 500 demonstrations, both in-distribution and out-of-distribution. The foundation model's advantage is concentrated in OOD generalization -- it provides a 20-30 percentage point improvement on novel objects and environments. The in-distribution advantage is smaller (5-10 points) and often not statistically significant with small evaluation sets.
Cost Projections: Data Collection vs. Compute
| Resource | 100 Demos | 500 Demos | 2,000 Demos | 10,000 Demos |
|---|---|---|---|---|
| Data collection (SVRC rates) | $2,500 | $6,000 | $18,000 | $65,000 |
| Training compute (ACT, single A100) | $5 (2 hrs) | $10 (4 hrs) | $25 (10 hrs) | $60 (24 hrs) |
| Training compute (OpenVLA fine-tune, 4x A100) | $50 (5 hrs) | $100 (10 hrs) | $300 (30 hrs) | $800 (80 hrs) |
| Data-to-compute cost ratio | 50:1 to 500:1 | 60:1 to 600:1 | 60:1 to 720:1 | 80:1 to 1083:1 |
The data-to-compute cost ratio for robot learning is dramatically different from language model training, where compute is the dominant cost. In robot learning, data collection costs 50-1000x more than the compute to train on that data. This has a direct strategic implication: every marginal dollar should be spent improving data quality and diversity, not on larger models or longer training runs. A $5,000 budget produces more capability improvement when spent on 200 diverse demonstrations than on 10x more compute for training on a less diverse dataset.
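The ratio row can be re-derived from the other rows of the table; a quick sanity check, transcribing the table above (variable names are ours):

```python
# Collection and training costs from the cost table, keyed by demo count.
collection = {100: 2_500, 500: 6_000, 2_000: 18_000, 10_000: 65_000}
compute_act = {100: 5, 500: 10, 2_000: 25, 10_000: 60}      # ACT, single A100
compute_vla = {100: 50, 500: 100, 2_000: 300, 10_000: 800}  # OpenVLA, 4x A100

for demos, cost in collection.items():
    # Low end of the ratio uses the expensive OpenVLA run, high end the cheap ACT run.
    print(f"{demos:>6} demos: {cost / compute_vla[demos]:.0f}:1 "
          f"to {cost / compute_act[demos]:.0f}:1")
```

The printed ratios match the table's ratio row to rounding, which is the point: even at the most compute-heavy configuration, data collection dominates by more than an order of magnitude.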
Scaling Curve Data Points: Empirical Performance vs. Dataset Size
The following table compiles performance data points from published papers and SVRC internal evaluations, showing how task success rate scales with dataset size for different architectures. All numbers are for single-task tabletop pick-and-place evaluated on held-out object positions.
| Demo Count | ACT (from scratch) | Diffusion Policy | Octo (fine-tuned) | OpenVLA (fine-tuned) |
|---|---|---|---|---|
| 10 | 15-25% | 10-20% | 40-50% | 45-55% |
| 25 | 30-40% | 25-35% | 55-65% | 60-70% |
| 50 | 55-65% | 45-55% | 70-78% | 72-80% |
| 100 | 72-82% | 65-75% | 82-88% | 84-90% |
| 200 | 82-90% | 78-86% | 86-92% | 88-94% |
| 500 | 88-93% | 85-92% | 90-95% | 92-96% |
| 1000 | 90-94% | 88-94% | 91-95% | 93-97% |
| 2000+ | 91-95% | 90-95% | 92-96% | 94-97% |
Key observations: (1) Foundation model fine-tuning (Octo, OpenVLA) provides its largest advantage at low demo counts (10-100), where the pre-trained visual and behavioral priors compensate for limited task-specific data. (2) At 500+ demos, from-scratch models close the gap significantly, and the advantage of foundation models narrows to 2-5 percentage points. (3) All architectures show diminishing returns past 500 demos for in-distribution performance, reinforcing the diversity-over-volume principle. (4) ACT reaches competitive performance with fewer demos than Diffusion Policy, making it the better choice for data-limited teams.
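Observation (3) can be made concrete by computing the marginal gain between successive rows; a sketch using the midpoint of each ACT (from scratch) range from the table above:

```python
# Midpoint of each ACT (from scratch) success range, keyed by demo count.
act_mid = {10: 20.0, 25: 35.0, 50: 60.0, 100: 77.0,
           200: 86.0, 500: 90.5, 1000: 92.0, 2000: 93.0}
counts = sorted(act_mid)
for prev, curr in zip(counts, counts[1:]):
    print(f"{prev:>4} -> {curr:>4} demos: +{act_mid[curr] - act_mid[prev]:.1f} pts")
```

The per-increment gain collapses from double digits below 100 demos to roughly a point per doubling past 500, which is the in-distribution plateau the text describes.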
Practical Recommendations by Team Scale
The scaling evidence translates into concrete guidance depending on your resources.
Solo researcher / hobbyist (budget: < $5,000). Use a foundation model (Octo). Collect 50-100 demonstrations on a single task. Focus on learning the data collection pipeline rather than optimizing performance. Expected result: 70-80% success rate on in-distribution conditions, 30-40% OOD. This is sufficient for a research prototype or demonstration. Hardware: OpenArm 101 ($4,500) or leased via SVRC.
Small research lab (budget: $10,000-50,000). Fine-tune OpenVLA or Octo on 200-500 demonstrations collected across 10-20 objects and 2-3 environments. Invest in systematic evaluation (held-out objects, held-out positions). Expected result: 85-92% in-distribution, 50-65% OOD. This is deployment-ready for controlled environments. SVRC's $2,500 pilot covers the first 200 demos; a $8,000 campaign covers 500+ demos with diversity protocols.
Startup / company (budget: $50,000-500,000/year). Build continuous data collection infrastructure. Collect 1,000-5,000 demonstrations per task across 30+ objects and 5+ environments. Implement automated retraining and evaluation pipelines. Target: 90-95% in-distribution, 70-80% OOD. At this scale, the data-to-compute cost ratio means your budget should be approximately 90% data collection and 10% compute. Invest in collection efficiency (better teleoperation tools, trained operators, streamlined QA) rather than larger models.
Large enterprise (budget: > $500,000/year). Establish multi-site collection with standardized protocols. Train multi-task foundation models on your proprietary data pool. Each deployment site feeds data back to the central training pipeline. At this scale, cross-site diversity becomes your primary competitive advantage. Expected result: 95%+ in-distribution, 80%+ OOD across deployment environments. Contact SVRC for enterprise data collection partnerships.
When to Scale Data vs. Scale Model
A practical decision framework based on the scaling evidence:
- Scale data (more demonstrations) when: Your in-distribution success rate is below 85%. Your policy fails on specific object positions or orientations that are under-represented in training. Your task has multiple viable strategies and you are seeing mode collapse (policy converges to one strategy and fails when that strategy is not applicable).
- Scale diversity (more objects/environments) when: Your in-distribution success rate is above 85% but OOD performance is below 50%. Your policy fails on novel objects that are functionally similar to training objects. Your deployment environment differs visually from your training environment.
- Scale model (larger architecture) when: You have abundant diverse data (>2,000 demonstrations across >10 object categories) and performance has plateaued. Your task requires understanding complex spatial relationships or long-horizon planning. You have access to pre-trained foundation model weights that you can fine-tune rather than training from scratch.
- Change approach when: Success rate has plateaued at <70% despite 1,000+ diverse demonstrations. Your task requires precise force control that visual policies cannot learn from images alone. Your task has >10 sequential steps where compounding error dominates.
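The decision framework above can be sketched as a triage function. The thresholds are the ones quoted in the bullets; the function name, argument names, and rule ordering are our assumptions, not a published procedure:

```python
def scaling_recommendation(in_dist: float, ood: float,
                           demos: int, diverse: bool) -> str:
    """Map evaluation metrics to the intervention suggested above.

    `diverse` means the existing demos already span many objects/environments.
    """
    if in_dist < 0.70 and demos >= 1_000 and diverse:
        return "change approach"     # plateaued despite diverse data
    if in_dist < 0.85:
        return "scale data"          # collect more demonstrations
    if ood < 0.50:
        return "scale diversity"     # new objects and environments
    if demos > 2_000 and diverse:
        return "scale model"         # larger architecture
    return "stay the course"

print(scaling_recommendation(0.92, 0.40, demos=500, diverse=False))  # scale diversity
```

The ordering matters: diversity problems masquerade as data-volume problems, so the in-distribution check comes first to rule out a basic data shortfall before blaming the distribution.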
Cross-Embodiment Scaling: How Many Robots Do You Need?
The Open X-Embodiment results show that adding data from additional robot types improves generalization on a target robot. But how many embodiments provide meaningful benefit? Based on published ablations and SVRC's own multi-robot training experiments:
- 1 embodiment: Baseline. Single-robot specialist policy.
- 3-5 embodiments: The steepest improvement region. Adding data from 3-5 distinct robot types (different kinematics, different grippers) improves OOD generalization by 15-25% on the target robot.
- 5-10 embodiments: Continued improvement but with diminishing returns. Each additional embodiment adds 2-5% OOD improvement.
- 10+ embodiments: Marginal per-embodiment improvement is small (<2%), but the cumulative effect is substantial. The current OXE dataset with 22 embodiments provides a strong foundation for cross-embodiment training.
For teams with a single target robot, the practical advice is clear: use a foundation model pre-trained on OXE or DROID (which provides the cross-embodiment diversity for free) and fine-tune on your task-specific data. Building your own cross-embodiment dataset from scratch is not cost-effective unless you have access to 5+ distinct robot platforms.
Compute Budget Allocation: A Practical Framework
Given the extreme data-to-compute cost ratio in robot learning, teams need a principled way to allocate their total budget between data collection, compute, and engineering time. Based on the scaling evidence, here is the recommended allocation by project phase.
| Project Phase | Data Collection | Compute (Training) | Engineering / Evaluation | Rationale |
|---|---|---|---|---|
| Prototype (0-50 demos) | 40% | 10% | 50% | Focus on pipeline setup; data collection validates hardware and workflow |
| Development (50-500 demos) | 70% | 10% | 20% | Maximize data diversity; compute costs are negligible relative to collection |
| Production (500-5000 demos) | 60% | 15% | 25% | Larger model fine-tuning justified; evaluation and deployment engineering critical |
| Scale (5000+ demos) | 50% | 20% | 30% | Continuous collection from deployments; multi-task training requires more compute |
The consistent message across all phases: data collection dominates the budget. At no stage does compute account for more than 20% of the total. Teams that allocate budgets based on LLM training intuitions (where compute is the majority cost) will systematically under-invest in data and over-invest in compute for robot learning projects.
The Diversity Multiplier: Quantifying the Value of Varied Data
The strongest practical finding from scaling research is the diversity multiplier: the outsized performance gain from diverse data compared to homogeneous data of the same volume. Here is the empirical evidence distilled into an actionable formula.
For a standard pick-and-place task, define the diversity score as: D = (number of unique objects) × (number of distinct positions per object) × (number of lighting conditions) × (number of operators). In SVRC evaluations across multiple tasks:
- 200 demos with D=200 (10 objects, 5 positions, 2 lighting, 2 operators) achieved 85% OOD success.
- 200 demos with D=20 (1 object, 5 positions, 2 lighting, 2 operators) achieved 40% OOD success.
- 1000 demos with D=20 achieved 55% OOD success -- still 30 points below the diverse 200-demo set.
The implication is stark: 200 diverse demonstrations outperform 1,000 homogeneous demonstrations for out-of-distribution generalization. This is why SVRC's collection protocols are built around diversity targets rather than volume targets. The $2,500 pilot package collects 200 demonstrations designed around maximum diversity within the task specification. A collection campaign that targets 200 demos with D>150 will consistently produce a more deployable policy than one targeting 500 demos with D<50.
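The diversity score is trivial to compute during collection planning; a sketch with the two configurations above (the function name is ours):

```python
def diversity_score(objects: int, positions: int,
                    lighting: int, operators: int) -> int:
    """D as defined above: the product of the four diversity axes."""
    return objects * positions * lighting * operators

print(diversity_score(10, 5, 2, 2))  # D = 200, the diverse configuration
print(diversity_score(1, 5, 2, 2))   # D = 20, the homogeneous configuration
```

Because D is a product, the cheapest axis to widen is usually the most effective: doubling the object count doubles D at thrift-store prices, while doubling operator count doubles labor cost.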
Failure Mode Analysis: Why Policies Plateau at Different Levels
Understanding why a policy plateaus requires diagnosing which component of the learning pipeline is the bottleneck. Different plateau signatures point to different root causes, each requiring a different intervention.
| Plateau Signature | In-Dist. Rate | OOD Rate | Root Cause | Intervention |
|---|---|---|---|---|
| Low ceiling | <70% | <30% | Data quality issue (inconsistent demonstrations, high noise) | Audit and clean existing data; retrain operators; tighten QA gates |
| High in-dist, low OOD | >90% | <50% | Diversity bottleneck (overfitting to training distribution) | Add diverse environments and objects; use data augmentation; switch to foundation model |
| Mode collapse | 70-85% | 30-50% | Architecture limitation (behavior cloning averaging over multimodal demonstrations) | Switch from vanilla BC to Diffusion Policy or ACT; add action chunking |
| Precision failures | 80-90% | 60-75% | Observation gap (task requires force/tactile data not in obs space) | Add F/T sensor; add wrist camera for close-up view; use hybrid classical+learned control |
| Long-horizon decay | 85-95% per step | N/A | Compounding error over 10+ steps | Decompose into sub-policies; add hierarchical planning; use DAgger |
| Random failures | 85-92% | 70-80% | Stochastic environment factors (object slip, lighting flicker) | Add retry mechanisms; increase action frequency; use closed-loop control |
The diagnostic process: run 50 evaluation episodes, categorize each failure, and identify whether failures cluster at a specific task phase (approach, grasp, transport, placement) or appear uniformly distributed. Clustered failures indicate a specific skill gap that targeted data can address. Uniformly distributed failures suggest a systemic issue (data quality, architecture mismatch, or sensor limitation) that more data of the same type will not fix.
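The categorization step can be sketched in a few lines of Python. The phase labels come from the text above, but the clustering threshold (at least half of all failures in one phase) is our assumption, not a published standard:

```python
from collections import Counter

def diagnose_failures(failure_phases: list[str]) -> str:
    """Classify a list of per-failure phase labels (approach, grasp,
    transport, placement) as clustered or uniformly distributed."""
    phase, count = Counter(failure_phases).most_common(1)[0]
    if count >= len(failure_phases) / 2:  # assumed clustering threshold
        return f"clustered at '{phase}': collect targeted data for this phase"
    return "uniformly distributed: suspect data quality, architecture, or sensing"

print(diagnose_failures(["grasp"] * 8 + ["approach", "transport"]))
```

In practice the labels would come from reviewing the 50 evaluation episodes; the point of the sketch is that the clustered-vs-uniform distinction, not the raw failure count, determines whether more data will help.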
Practical Scaling Experiment Template
For teams who want to characterize their own scaling curve before committing to a large data collection campaign, here is the experimental protocol SVRC uses with clients.
    # Scaling experiment protocol
    # Collect demos in batches, train and evaluate after each batch.
    # `train_policy` and `evaluate` are placeholders for your own training
    # and hardware-evaluation pipeline.
    import json
    from pathlib import Path

    def run_scaling_experiment(task_name, max_demos=500, batch_size=50):
        """
        Protocol:
        1. Collect batch_size demos
        2. Train policy on cumulative dataset
        3. Evaluate on fixed held-out test set (20 episodes)
        4. Log results and decide whether to continue
        """
        results = []
        cumulative_demos = 0
        for batch_idx in range(max_demos // batch_size):
            cumulative_demos += batch_size
            # Train on all demos collected so far
            model = train_policy(
                dataset_path=f"data/{task_name}",
                num_demos=cumulative_demos,
                architecture="act",  # or "diffusion_policy"
                epochs=2000,
            )
            # Evaluate on fixed test sets (same 20 configs every time)
            in_dist_rate = evaluate(model, test_set="in_distribution", n=20)
            ood_rate = evaluate(model, test_set="out_of_distribution", n=20)
            results.append({
                "demos": cumulative_demos,
                "in_dist": in_dist_rate,
                "ood": ood_rate,
                "delta_in_dist": in_dist_rate - (results[-1]["in_dist"] if results else 0.0),
                "delta_ood": ood_rate - (results[-1]["ood"] if results else 0.0),
            })
            # Stop if both metrics have plateaued (<2% improvement for 2 batches)
            if len(results) >= 3:
                recent = results[-2:]
                if all(r["delta_in_dist"] < 0.02 for r in recent) and \
                   all(r["delta_ood"] < 0.02 for r in recent):
                    print(f"Plateau detected at {cumulative_demos} demos")
                    break
        out_path = Path(f"results/{task_name}_scaling.json")
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_text(json.dumps(results, indent=2))
        return results
This protocol typically requires 3-5 collection-training-evaluation cycles (150-250 demos) to identify the approximate plateau for a given task. The key outputs are: (1) the minimum demo count for deployment-grade in-distribution performance (typically 85-90%), (2) the point of diminishing returns for OOD performance, and (3) whether the bottleneck is data volume, data diversity, or architecture. SVRC runs this protocol as part of the $2,500 pilot engagement, providing clients with a data-driven scaling curve specific to their task before committing to a full collection campaign.
Multi-Task Scaling: How Task Count Affects Per-Task Data Requirements
An important and counterintuitive finding from recent work: training a policy on multiple tasks simultaneously can reduce the per-task data requirement. When tasks share visual or motor primitives (all involve grasping, all use the same gripper, all occur on the same table), the shared learning across tasks provides a regularization effect that improves data efficiency for each individual task.
Empirical observations from SVRC multi-task training:
- 1 task, 200 demos: 85% in-distribution success. Baseline.
- 5 related tasks, 200 demos each (1,000 total): 88% average in-distribution success and 62% OOD. The 3-point in-distribution improvement and strong OOD performance come "for free" from the cross-task transfer.
- 5 related tasks, 100 demos each (500 total): 84% average in-distribution success and 55% OOD. Nearly matches the 200-demo single-task baseline with half the per-task data.
- 5 unrelated tasks, 200 demos each (1,000 total): 83% average in-distribution success. Slight degradation compared to single-task training because the tasks do not share useful primitives.
The rule of thumb: if your tasks share at least 50% of their motion primitives (same workspace, same gripper, similar objects), multi-task training with language conditioning reduces per-task data requirements by 30-50%. If tasks are unrelated (different workspaces, different grippers), train separate single-task policies. SVRC's data collection campaigns for multi-task deployments are structured to maximize cross-task shared primitives, reducing total collection cost by 20-40% compared to independent single-task campaigns.
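Under the rule of thumb above, a per-task demo budget can be estimated directly; the 40% default below is our assumed midpoint of the quoted 30-50% reduction range:

```python
def per_task_demo_budget(single_task_demos: int, reduction: float = 0.4) -> int:
    """Demos needed per task in a related multi-task set, given the
    single-task baseline and the multi-task reduction factor."""
    return round(single_task_demos * (1 - reduction))

print(per_task_demo_budget(200))  # ~120 demos per task for related tasks
```

For a five-task portfolio this turns a naive 1,000-demo plan into roughly 600 demos, which is where the 20-40% campaign-cost reduction quoted above comes from.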
Compute Scaling: When to Train Bigger Models vs. Collect More Data
Robot learning faces a different compute-data tradeoff than language modeling. In language modeling, scaling model size with fixed data improves performance predictably. In robot learning, the evidence suggests that data diversity matters more than model size for most practical tasks.
| Model Size | Architecture Example | Demos for Peak In-Dist | In-Dist Success | OOD Success | Training Cost |
|---|---|---|---|---|---|
| Small (5-20M params) | ACT (ResNet-18 backbone) | 200-300 | 85-92% | 40-55% | $5-15 |
| Medium (50-200M params) | ACT (DINOv2-B backbone) | 300-500 | 88-94% | 55-70% | $50-200 |
| Large (1-7B params) | OpenVLA (Llama-7B backbone) | 100-200 (fine-tune) | 90-96% | 65-80% | $500-3,000 |
The key finding: medium-sized models with diverse data often match large models with less diverse data on OOD evaluation. A 100M parameter ACT with DINOv2 backbone trained on 500 diverse demonstrations can achieve 65-70% OOD success, comparable to a 7B parameter OpenVLA fine-tuned on 150 less diverse demonstrations. The large model has higher peak in-distribution performance, but the diversity of the training data is the dominant factor for generalization.
Budget allocation rule of thumb: If your total project budget is X, spend 70% on data collection (emphasizing diversity), 20% on compute (training and evaluation), and 10% on engineering (pipeline, integration). Teams that over-invest in compute at the expense of data diversity consistently underperform teams that prioritize diverse data collection.
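The allocation rule is one line of arithmetic; a minimal sketch (the function name is ours):

```python
def allocate_budget(total: float) -> dict[str, float]:
    """Split a total project budget per the 70/20/10 rule above."""
    return {"data": 0.70 * total,
            "compute": 0.20 * total,
            "engineering": 0.10 * total}

print(allocate_budget(50_000))
```

For a $50,000 project this allocates $35,000 to diverse data collection, leaving $10,000 for training and evaluation compute and $5,000 for pipeline engineering.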
Scaling with Simulation Data: How Much Real Data Can Sim Replace?
Simulation data scales cheaply but transfers imperfectly. The practical question is how much real-world data collection can be offset by simulation.
- Visual diversity: Sim can provide unlimited visual diversity through domain randomization. For visual generalization specifically (novel objects, lighting, backgrounds), 10K sim episodes with domain randomization provide equivalent benefit to 500-1000 real episodes. This is the strongest use case for sim data in a scaling context.
- Motor control precision: Sim data transfers poorly for precision tasks (sub-mm insertion, deformable manipulation) because physics simulation errors compound. For these tasks, sim data provides 0-20% of the value of equivalent real data. Always validate with real-world fine-tuning.
- Practical mixing ratio: The most effective approach is 80-90% sim data (for visual diversity and coarse behavior) plus 10-20% real data (for physics grounding and precision). This hybrid approach typically reaches 85-90% of full real-data performance at 20-30% of the real-data collection cost.
- Diminishing returns in sim: Sim data shows its own plateau at 10K-50K episodes for most tasks. Beyond 50K sim episodes, additional sim data provides negligible improvement -- the bottleneck shifts to sim-to-real gap, not data volume. Invest in better domain randomization rather than more episodes.
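The exchange rates in the bullets above can be folded into a rough "real-data equivalence" estimate for a hybrid dataset. The midpoint rates below (0.075 real episodes per sim episode for visual tasks, 10% value for precision tasks) are our assumptions drawn from the ranges quoted; `real_equivalent` is a hypothetical helper, not a measured model.

```python
# Midpoint exchange rates taken from the ranges above (assumptions):
# visual: 10K sim episodes ~ 500-1,000 real -> ~0.075 real per sim episode
# precision: sim worth 0-20% of a real episode -> ~0.10 midpoint
SIM_EXCHANGE = {"visual": 750 / 10_000, "precision": 0.10}
SIM_PLATEAU = 50_000  # sim value flattens beyond ~50K episodes


def real_equivalent(sim_episodes: int, real_episodes: int,
                    task: str = "visual") -> float:
    """Estimate the real-data-equivalent size of a sim+real dataset.

    Sim episodes beyond the ~50K plateau contribute nothing; at that
    point invest in better domain randomization, not more episodes."""
    effective_sim = min(sim_episodes, SIM_PLATEAU)
    return real_episodes + effective_sim * SIM_EXCHANGE[task]


# 10K sim + 100 real episodes for a visual-generalization task:
print(real_equivalent(10_000, 100, "visual"))  # ~850 real-equivalent
```

Note how the plateau cap makes 100K sim episodes no more valuable than 50K, which matches the diminishing-returns bullet above.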
SVRC supports hybrid sim-real data pipelines through our data services. We can advise on the optimal sim-real mixing ratio for your specific task and provide the real-world fine-tuning data that bridges the sim-to-real gap. See our sim-to-real guide for implementation details.
Open Questions and What We Still Do Not Know
Several important questions about robot learning scaling remain unanswered in 2026. It is unclear whether continuous data collection (rolling deployment plus periodic retraining) follows the same scaling curves as offline batched training. Whether cross-modal data (video of human manipulation, simulated demonstrations, language descriptions) provides the same scaling benefits as real robot data is actively debated; preliminary evidence suggests it helps with visual representations but not with precise motor control.
Data Collection Velocity: How Fast Can You Scale?
Even with the right budget allocation, data collection velocity is constrained by physical realities. Understanding these constraints helps teams set realistic timelines.
| Configuration | Demos/Hour | Demos/Day (8hr) | Time to 500 Demos |
|---|---|---|---|
| 1 arm, simple pick-place, expert operator | 60-120 | 400-800 | 1 day |
| 1 arm, multi-step assembly, expert | 15-30 | 100-200 | 3-5 days |
| Bimanual (ALOHA), simple task | 30-60 | 200-400 | 2 days |
| Bimanual, complex task, novice operator | 8-15 | 50-100 | 5-10 days |
| Remote teleop (100ms+ latency) | 10-30 | 80-200 | 3-6 days |
Parallelization: The fastest way to scale data collection is to run multiple robot stations simultaneously. Each additional station provides near-linear throughput increase (minus 10-15% coordination overhead). SVRC operates 4-8 collection stations in parallel for large campaigns, achieving 1,000-2,000+ demos per day for simple tasks. The cost of additional stations is amortized across campaigns, making parallel collection more cost-effective than running a single station for longer.
These rates assume a mature collection pipeline with no hardware issues. For first-time setups, reduce throughput expectations by 50% for the first week while the operator and pipeline are being debugged. SVRC's facility achieves the "expert operator" rates consistently because our operators have 500+ hours of teleoperation experience across multiple task categories.
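The throughput figures above can be combined into a campaign-length estimate. This is a sketch using the article's rough numbers: ~10-15% multi-station coordination overhead (0.125 midpoint) and a 50% throughput reduction during a first-time setup's first week; `campaign_days` is a hypothetical helper.

```python
import math


def campaign_days(target_demos: int, demos_per_hour: float,
                  stations: int = 1, hours_per_day: float = 8.0,
                  coord_overhead: float = 0.125,
                  first_week_ramp: bool = False) -> int:
    """Estimate working days needed to hit a demo target.

    coord_overhead (~10-15%) applies only with multiple stations;
    first_week_ramp halves throughput for the first 5 working days
    of a first-time setup while operator and pipeline are debugged."""
    daily = demos_per_hour * hours_per_day * stations
    if stations > 1:
        daily *= (1 - coord_overhead)
    if not first_week_ramp:
        return math.ceil(target_demos / daily)
    ramp_days = 5
    ramp_output = daily * 0.5 * ramp_days
    if target_demos <= ramp_output:
        return math.ceil(target_demos / (daily * 0.5))
    return ramp_days + math.ceil((target_demos - ramp_output) / daily)


# 500 demos of multi-step assembly at ~22.5 demos/hour (expert, 1 arm):
print(campaign_days(500, 22.5))  # 3 days, matching the table above
```

Running the same estimate with `first_week_ramp=True` shows why first-time setups should budget roughly double the calendar time for early milestones.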
The Plateau Problem: When More Data Stops Helping
Every team eventually hits a performance plateau where adding more demonstrations produces negligible improvement. Recognizing when you have hit a plateau -- vs. when data quality or diversity is the bottleneck -- is critical for avoiding wasted collection effort.
Signs you have hit a genuine plateau: Success rate has been within +/- 2% for the last 3 collection-training cycles. Adding 200 more demonstrations with systematic diversity variation did not improve OOD performance. Validation loss has flattened and does not decrease with more training epochs.
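The first plateau sign is mechanical enough to check in code. A minimal sketch, assuming success rates are recorded per collection-training cycle as fractions; `is_genuine_plateau` is a hypothetical name.

```python
def is_genuine_plateau(success_history: list, window: int = 3,
                       band: float = 0.02) -> bool:
    """Check whether success rate has stayed within +/- 2 percentage
    points (a 4-point total spread) over the last 3 collection-training
    cycles, per the plateau sign above. Returns False if fewer than
    `window` cycles have been recorded."""
    recent = success_history[-window:]
    return len(recent) == window and max(recent) - min(recent) <= 2 * band


print(is_genuine_plateau([0.55, 0.68, 0.70, 0.71, 0.70]))  # True
print(is_genuine_plateau([0.55, 0.60, 0.68, 0.72]))        # False
```

A positive result here is necessary but not sufficient: also run the diversity-vs-volume test before concluding the plateau is genuine.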
Differentiating volume plateau from diversity plateau: The most common mistake is interpreting a diversity plateau as a volume plateau. To test which you are hitting: collect 50 additional demonstrations with maximum diversity (new objects, new positions, new lighting) and compare policy improvement against 50 additional demonstrations with the same diversity profile as existing data. If the diverse batch improves performance and the uniform batch does not, you have a diversity problem. If neither improves performance, you have hit a genuine volume plateau and should consider architectural changes.
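The 50-demo A/B test above reduces to a three-way decision. In this sketch, `threshold` is our hypothetical minimum improvement (3 percentage points) to count as signal; tune it to your evaluation noise.

```python
def diagnose_plateau(base_success: float, diverse_batch_success: float,
                     uniform_batch_success: float,
                     threshold: float = 0.03) -> str:
    """Interpret the diverse-vs-uniform 50-demo A/B test.

    base_success: success rate before either batch was added.
    The two batch arguments are success rates after retraining with
    the diverse batch or the same-profile batch, respectively."""
    diverse_gain = diverse_batch_success - base_success
    uniform_gain = uniform_batch_success - base_success
    if diverse_gain > threshold and uniform_gain <= threshold:
        return "diversity bottleneck: collect more varied data"
    if diverse_gain <= threshold and uniform_gain <= threshold:
        return "genuine volume plateau: consider architectural changes"
    return "data volume still helping: continue collecting"


print(diagnose_plateau(0.70, 0.78, 0.71))
# diversity bottleneck: collect more varied data
```

With noisy 20-trial evaluations, a 3-point threshold is borderline; run more evaluation trials or raise the threshold before acting on the result.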
Signs the plateau is artificial (fixable): In-distribution performance is strong (>90%) but OOD performance is weak (<50%) -- this indicates a diversity bottleneck, not a data volume bottleneck. Failures are concentrated at specific conditions (one object type, one workspace region) -- targeted data collection at those conditions will help. The policy shows mode averaging (hesitating between strategies) -- architectural changes (switch from BC to Diffusion Policy) or data cleaning will help more than more data.
When you hit a genuine plateau, the options are: (1) switch to a foundation model with cross-embodiment pre-training, which provides a new baseline above the from-scratch plateau; (2) add modalities (F/T sensing, depth) that provide information the policy cannot extract from existing observations; or (3) decompose the task into sub-policies, each of which can be independently improved past the whole-task plateau.
Practical Checkpoints: When to Stop Collecting Data
One of the most common questions teams ask is "how many demonstrations do I need?" The honest answer depends on your task and target success rate, but these checkpoints provide decision points.
- 50 demos: Pipeline validation checkpoint. Train a policy and evaluate 20 trials. If success rate is below 30%, there is likely a pipeline bug (data format, action normalization, camera calibration) rather than a data volume issue. Fix the pipeline before collecting more data.
- 100 demos: Feasibility checkpoint. If the policy achieves 50%+ success in-distribution, the task is learnable with your current architecture and data quality. If below 50% after debugging, consider whether the task requires additional modalities (F/T sensing, depth) or a different algorithm.
- 200 demos: Diversity checkpoint. Evaluate on held-out objects/positions. If in-distribution success is 80%+ but OOD success is below 40%, the issue is diversity, not volume. Increase the diversity of the next 100 demos rather than collecting more of the same.
- 500 demos: Diminishing returns checkpoint. For single-task BC on L2 tasks, most teams see less than 3% improvement from demos 300-500. If you have not reached your target success rate by 500 demos, the bottleneck is likely architectural (switch to Diffusion Policy, add language conditioning) rather than data volume.
- 1,000+ demos: Only justified for L3/L4 tasks (complex assembly, multi-step sequences) or multi-task training where the total demo count is split across 5-10 tasks. Collect in batches of 200 and evaluate between batches to avoid waste.
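The checkpoint list above can be encoded as a simple decision function. Thresholds come directly from the bullets; demo counts between checkpoints fall back to the nearest lower one, and `checkpoint_advice` is a hypothetical helper, not a substitute for judgment about your specific task.

```python
def checkpoint_advice(n_demos: int, in_dist_success: float,
                      ood_success: float = None) -> str:
    """Map demo count and evaluation results to the checkpoint
    recommendations above. Success rates are fractions in [0, 1]."""
    if n_demos < 100:
        # 50-demo pipeline validation checkpoint
        return ("fix pipeline" if in_dist_success < 0.30
                else "keep collecting")
    if n_demos < 200:
        # 100-demo feasibility checkpoint
        return ("reconsider modalities/algorithm" if in_dist_success < 0.50
                else "keep collecting")
    if n_demos < 500:
        # 200-demo diversity checkpoint (needs an OOD evaluation)
        if (ood_success is not None and in_dist_success >= 0.80
                and ood_success < 0.40):
            return "increase diversity of next 100 demos"
        return "keep collecting"
    # 500-demo diminishing-returns checkpoint
    return ("likely architectural bottleneck: consider Diffusion Policy "
            "or language conditioning")


print(checkpoint_advice(250, 0.85, ood_success=0.35))
# increase diversity of next 100 demos
```

The 1,000+ case is deliberately omitted: whether to continue past 500 demos is an L3/L4 or multi-task judgment call, not a threshold check.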
The most consequential open question may be whether the DROID plateau at 50,000 demonstrations reflects a fundamental ceiling for current architectures, a limitation of the specific models tested (ACT, Diffusion Policy), or an artifact of the dataset's diversity profile. Larger datasets with greater diversity may push this plateau higher. Alternatively, architectural innovations (better action representations, improved temporal modeling, or more effective use of force-torque data) may be required to break through the current scaling ceiling.
The SVRC data platform is designed to support teams navigating these uncertainties. Our structured diversity protocols, standardized evaluation benchmarks, and flexible data export formats let you iterate quickly as the field's understanding of scaling dynamics continues to evolve.
Related Reading
- Open X-Embodiment: The Robot Dataset That Changed Everything
- The Generalization Challenge: Why Robot Policies Still Fail in 2026
- Robot Learning: Cost per Demonstration Analysis
- Robot Learning from Video: State of the Art in 2026
- ACT vs. Diffusion Policy: When to Use Which
- SVRC Data Collection Services