What Is Robot Teleoperation?
Teleoperation means a human operator controls a robot using a control interface -- transmitting commands over a local or remote network while receiving sensory feedback (video, force, position) in return. In the context of robot learning, teleoperated demonstrations are the gold standard for collecting training data because they encode natural human strategies that are difficult to program manually. SVRC's teleop platform supports both local and remote operation across multiple robot platforms.
The teleoperation loop has four stages: (1) the operator observes the robot's environment through camera feeds and proprioceptive data, (2) the operator generates commands through a control device, (3) commands are transmitted to the robot and executed, (4) all observations and actions are logged synchronously for later use as training data. The quality of each stage directly affects both real-time control performance and the downstream utility of the recorded dataset.
In 2026, teleoperation serves two distinct purposes that increasingly converge. The first is direct remote operation -- a human controls a robot to perform useful work at a distance (inspection, maintenance, hazardous environment tasks). The second is demonstration collection -- a human operates the robot specifically to generate training data for imitation learning policies like ACT or Diffusion Policy. The hardware and software requirements overlap substantially, which is why teams that start with data collection often transition naturally into remote deployment capabilities.
Hardware You Need to Get Started
A basic teleoperation setup requires five components: a robot arm or mobile platform, cameras, a control device, a compute node for streaming and control, and a logging system to capture synchronized observations and actions. SVRC's leased hardware packages come preconfigured with all required components. For teams using OpenArm, the SVRC leader-follower setup takes under 30 minutes to assemble.
Control Device Comparison
The control device is the most consequential hardware choice because it determines the quality ceiling of your demonstrations. Here is how the four main options compare in 2026:
| Control Device | Latency | DOF Mapped | Best For | Cost | Training Curve |
|---|---|---|---|---|---|
| Leader arm (ALOHA-style) | <2ms local | Full joint-space (6-7 DOF) | Bimanual manipulation, ACT data | $2,000-5,000 per pair | 2-4 hours |
| 3D SpaceMouse | 5-10ms | 6 DOF (Cartesian + rotation) | Single-arm pick-place, slow precision | $200-500 | 1-2 days |
| Data glove (e.g., Paxini, Manus) | 8-15ms | 15-22 DOF (hand + wrist) | Dexterous hand control, finger tasks | $5,000-15,000 | 4-8 hours |
| VR controller (Quest 3, VIVE) | 20-40ms | 6 DOF + finger triggers | Mobile base + arm, humanoid whole-body | $300-1,000 | 30 minutes |
Our recommendation: For teams collecting imitation learning data on tabletop manipulation tasks, the leader arm is the clear winner. The 1:1 kinematic mapping means the operator's muscle memory transfers directly to the robot, producing smoother demonstrations with fewer retakes. The ALOHA-style leader-follower configuration (identical kinematic chain for leader and follower arms) eliminates the retargeting problem entirely -- joint positions from the leader arm map directly to joint commands on the follower arm with no inverse kinematics step. This is why ACT works so well on ALOHA data: the demonstrations are mechanically clean.
For humanoid whole-body teleoperation (Unitree G1, Booster K1), VR controllers paired with body tracking are the current standard. The Quest 3 with hand tracking provides adequate finger mapping for simple grasps, but precision finger tasks still require dedicated glove hardware. SVRC supports all four interfaces through the teleop control platform.
Why Latency Matters: The 50ms Threshold
End-to-end latency -- the time from operator input to robot motion -- is the single most important performance metric in teleoperation. Three critical thresholds (20ms, 50ms, and 150ms) divide operation into four regimes:
- <20ms: Operator perceives no delay. Robot feels like a direct extension of the hand. This is achievable with leader arms on a local USB/serial connection. Data collected at this latency contains the most natural, fluid human behavior.
- 20-50ms: Perceptible but manageable. Operators adapt within minutes and produce good demonstration data. This is typical for local network teleoperation (operator in the same building, robot connected via Ethernet).
- 50-150ms: Operators slow down significantly to compensate for delay. Demonstrations become cautious and jerky -- the operator moves, waits for feedback, adjusts, waits again. This "move-and-wait" pattern trains policies that are slow and hesitant. Data quality degrades substantially above 50ms.
- >150ms: Fine manipulation becomes impractical. Only gross positioning and navigation tasks work reliably. This is typical for internet-based remote teleoperation across time zones.
Latency has four components, each of which must be measured and minimized independently:
| Component | Typical Range | How to Minimize |
|---|---|---|
| Control device read | 0.5-5ms | USB polling at 1kHz, avoid Bluetooth |
| Network transport | 0.1ms (local) to 200ms (cross-continent) | Wired Ethernet for local; WebRTC with STUN for remote |
| Command processing | 1-10ms | Dedicated real-time thread, avoid Python GIL on control loop |
| Motor execution | 2-20ms | Use position servo mode, not velocity; 500Hz+ servo rate |
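The component budget above can be sanity-checked with a few lines. This is an illustrative sketch, not SVRC tooling: `latency_budget` is a hypothetical helper, the component values are example numbers for a local leader-arm setup rather than measurements, and the classifier simplifies the four regimes above into three bands.

```python
DIRECT_FEEL_MS = 20   # below this, the robot feels like a direct extension
DEGRADED_MS = 50      # above this, demonstration quality drops

def latency_budget(components_ms):
    """Sum per-component latencies (ms) and classify the total."""
    total = round(sum(components_ms.values()), 1)
    if total < DIRECT_FEEL_MS:
        rating = "direct"
    elif total <= DEGRADED_MS:
        rating = "manageable"
    else:
        rating = "degraded"   # expect move-and-wait behavior
    return total, rating

total, rating = latency_budget({
    "device_read": 1.0,   # USB leader arm polled at 1 kHz
    "network": 0.1,       # wired local link
    "processing": 2.0,    # dedicated control thread
    "motor": 5.0,         # position servo at 500 Hz
})
print(total, rating)  # 8.1 direct
```

Measuring each component independently (rather than only end-to-end) tells you which stage to attack first when the total creeps past 50ms.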
Camera Configuration: The Three-Camera Standard
Camera placement is the second most impactful decision after control device choice. The current community standard for manipulation data collection is a three-camera setup:
- Overhead camera (fixed): Provides top-down workspace view. Essential for spatial reasoning -- where objects are relative to each other. Mount at 0.8-1.2m above the table surface pointing straight down. Resolution: 640x480 is sufficient; higher resolutions add bandwidth without proportional policy improvement.
- Side camera (fixed): Captures the arm-object spatial relationship and approach angle. Mount at table height, 0.6-1.0m from the workspace center, angled 15-30 degrees down. This camera captures depth cues that the overhead view misses.
- Wrist camera (mounted on end-effector): Captures the grasp contact zone from millimeters away. This is the single most impactful camera for fine manipulation -- studies by the ALOHA team and others show that adding a wrist camera improves policy success rates by 20-40% on contact-rich tasks. Use a small USB camera (e.g., ELP 120fps fisheye module, ~$30) with a 3D-printed mount.
Camera synchronization matters. If observation timestamps are off by more than 10ms from action timestamps, you are training the policy on misaligned data -- it sees the world at time t but associates it with the action from time t+delta. On fast tasks (grasps completing in 200-400ms), even 20ms of misalignment degrades policy performance measurably. Use hardware triggers or timestamp-based software synchronization.
Local vs Remote Teleoperation
Local teleoperation (operator in the same room) achieves the lowest latency -- typically under 5ms end-to-end -- and is the standard for data collection. Remote teleoperation (operator at a different location) introduces network latency but enables deployment scenarios like remote inspection, facility management, and distributed data collection. SVRC's data platform supports both modes with adaptive stream compression and latency monitoring built in.
For remote teleoperation, the video stream is the bottleneck -- not the control commands. Control commands are tiny (<100 bytes per packet at 50Hz), but uncompressed 640x480 RGB at 30fps is 28 MB/s per camera. With three cameras, that is 84 MB/s before compression. H.264 hardware encoding on the robot side reduces this to 2-5 Mbps per stream with acceptable quality, but encoding adds 15-30ms of latency. The SVRC remote teleop stack uses hardware-accelerated encoding on NVIDIA Jetson with tuned parameters for minimum latency (zerolatency preset, slice threading, intra-refresh).
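The bandwidth arithmetic above can be checked in a few lines. This is a back-of-envelope sketch only; the 4 Mbps figure is an assumed mid-range of the 2-5 Mbps quoted above, not a measured bitrate.

```python
def raw_bandwidth_mb_s(width=640, height=480, fps=30, bytes_per_px=3):
    """Uncompressed RGB bandwidth in MB/s (1 MB = 10^6 bytes)."""
    return width * height * bytes_per_px * fps / 1e6

per_cam = raw_bandwidth_mb_s()          # ~27.6 MB/s per camera
total = 3 * per_cam                     # three-camera rig, ~83 MB/s raw
h264_mbps = 4                           # assumed mid-range H.264 bitrate
compression = per_cam * 8 / h264_mbps   # ~55x reduction per stream
print(f"{per_cam:.1f} MB/s per camera, {total:.0f} MB/s raw, "
      f"~{compression:.0f}x compression")
```

The roughly 55x reduction is why the 15-30ms of encoder latency is worth paying on any link slower than gigabit Ethernet.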
The ACT Data Format and Recording Pipeline
Most modern imitation learning frameworks -- ACT, Diffusion Policy, LeRobot -- consume training data in HDF5 format with a specific structure. Each episode is stored as a group containing synchronized arrays for observations (images, joint positions, gripper state) and actions (target joint positions or velocities). Here is the standard schema:
```
episode_0/
  observations/
    images/
      cam_high   (T, 480, 640, 3) uint8   # overhead camera
      cam_low    (T, 480, 640, 3) uint8   # side camera
      cam_wrist  (T, 480, 640, 3) uint8   # wrist camera
    qpos         (T, 7)  float32          # joint positions + gripper
    qvel         (T, 7)  float32          # joint velocities
  actions        (T, 7)  float32          # target joint positions
  timestamps     (T,)    float64          # Unix timestamps in seconds
  metadata/
    success      bool                     # did episode achieve goal?
    task_name    string                   # e.g., "pick_cube_place_bin"
    operator_id  string                   # for tracking operator quality
```
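Before training, it is worth verifying that every stream in an episode shares the same length T and that timestamps are monotonic. Below is a minimal shape checker against the schema above -- `validate_episode` is an illustrative helper (pure NumPy, assuming the episode has already been loaded from HDF5 into arrays), not part of any framework.

```python
import numpy as np

CAMERAS = ("cam_high", "cam_low", "cam_wrist")
IMG_SHAPE = (480, 640, 3)
VECTOR_KEYS = {"qpos": (7,), "qvel": (7,), "actions": (7,)}

def validate_episode(ep: dict) -> int:
    """Check an in-memory episode against the schema above; return T."""
    T = ep["qpos"].shape[0]
    for cam in CAMERAS:
        img = ep["images"][cam]
        assert img.shape == (T, *IMG_SHAPE) and img.dtype == np.uint8, cam
    for key, trailing in VECTOR_KEYS.items():
        assert ep[key].shape == (T, *trailing), key
    assert ep["timestamps"].shape == (T,)
    # Timestamps must be strictly increasing (see the failure-mode table)
    assert np.all(np.diff(ep["timestamps"]) > 0), "non-monotonic timestamps"
    return T

# Synthetic five-step episode for illustration
T = 5
ep = {
    "images": {c: np.zeros((T, *IMG_SHAPE), np.uint8) for c in CAMERAS},
    "qpos": np.zeros((T, 7), np.float32),
    "qvel": np.zeros((T, 7), np.float32),
    "actions": np.zeros((T, 7), np.float32),
    "timestamps": np.arange(T, dtype=np.float64),
}
print(validate_episode(ep))  # 5
```

Running a check like this at episode-save time catches dropped frames and clock glitches before they reach the training set.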
Below is a minimal Python recording script that captures synchronized data from an OpenArm with three USB cameras. This is a simplified version of what SVRC's recording pipeline runs -- the production version adds automatic success detection, real-time quality metrics, and cloud upload.
```python
import time

import cv2
import h5py
import numpy as np
from openarm_sdk import OpenArm  # SVRC OpenArm Python SDK

# Initialize hardware
arm = OpenArm(port="/dev/ttyUSB0", baudrate=1000000)
cameras = {
    "cam_high": cv2.VideoCapture(0),
    "cam_low": cv2.VideoCapture(2),
    "cam_wrist": cv2.VideoCapture(4),
}
for cam in cameras.values():
    cam.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
    cam.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
    cam.set(cv2.CAP_PROP_FPS, 30)

CONTROL_HZ = 50   # 50 Hz control loop
MAX_STEPS = 500   # 10 seconds at 50 Hz
dt = 1.0 / CONTROL_HZ


def record_episode(episode_id: int, hdf5_path: str):
    """Record one teleoperation episode to HDF5."""
    images = {k: [] for k in cameras}
    qpos_list, qvel_list, action_list, ts_list = [], [], [], []
    print(f"Recording episode {episode_id}... Press Ctrl+C to stop.")
    try:
        for step in range(MAX_STEPS):
            t_start = time.time()

            # Read leader arm (control device) as target action
            leader_state = arm.read_leader()
            action = leader_state.joint_positions  # 7-dim: 6 joints + gripper

            # Read follower arm (robot) current state
            follower_state = arm.read_follower()
            qpos = follower_state.joint_positions
            qvel = follower_state.joint_velocities

            # Send action to follower
            arm.command_follower(action)

            # Capture images from all cameras. On a dropped frame, repeat
            # the previous one so every stream keeps the same length T --
            # otherwise images and actions silently fall out of alignment.
            for name, cap in cameras.items():
                ret, frame = cap.read()
                if ret:
                    images[name].append(frame[:, :, ::-1])  # BGR to RGB
                elif images[name]:
                    images[name].append(images[name][-1])
                else:
                    raise RuntimeError(f"{name}: no frame on first step")

            qpos_list.append(qpos)
            qvel_list.append(qvel)
            action_list.append(action)
            # Wall-clock Unix time, per the schema; for relative timing,
            # time.monotonic() is immune to NTP clock adjustments.
            ts_list.append(time.time())

            # Maintain control frequency
            elapsed = time.time() - t_start
            if elapsed < dt:
                time.sleep(dt - elapsed)
    except KeyboardInterrupt:
        pass

    # Write to HDF5
    T = len(qpos_list)
    with h5py.File(hdf5_path, "a") as f:
        ep = f.create_group(f"episode_{episode_id}")
        obs = ep.create_group("observations")
        img_grp = obs.create_group("images")
        for name in cameras:
            img_grp.create_dataset(name, data=np.array(images[name]),
                                   chunks=(1, 480, 640, 3), compression="gzip")
        obs.create_dataset("qpos", data=np.array(qpos_list, dtype=np.float32))
        obs.create_dataset("qvel", data=np.array(qvel_list, dtype=np.float32))
        ep.create_dataset("actions", data=np.array(action_list, dtype=np.float32))
        ep.create_dataset("timestamps", data=np.array(ts_list, dtype=np.float64))
    print(f"Episode {episode_id}: {T} steps saved ({T / CONTROL_HZ:.1f}s)")
```
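Once recorded, the `actions` array feeds directly into chunked policy training: ACT-style policies predict a chunk of future actions from a single observation, with a chunk size of 100 commonly used at 50Hz. The sketch below shows the indexing; `sample_chunk` is an illustrative helper, not part of any SDK.

```python
import numpy as np

def sample_chunk(actions: np.ndarray, t: int, chunk_size: int = 100):
    """Return the action chunk starting at step t, padded at episode end.

    actions: (T, 7) array as recorded above. Past the end of the episode,
    the final action is repeated so every chunk has the same shape.
    """
    T = actions.shape[0]
    end = min(t + chunk_size, T)
    chunk = actions[t:end]
    if end - t < chunk_size:   # pad by repeating the final action
        pad = np.repeat(chunk[-1:], chunk_size - (end - t), axis=0)
        chunk = np.concatenate([chunk, pad], axis=0)
    return chunk

actions = np.random.randn(500, 7).astype(np.float32)  # one 10 s episode at 50 Hz
chunk = sample_chunk(actions, t=450, chunk_size=100)
print(chunk.shape)  # (100, 7)
```

Padding with the final action (rather than zeros) keeps the padded targets physically plausible: the robot simply holds its last commanded position.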
Collecting High-Quality Demonstrations
Good demonstrations are consistent, diverse, and successful. Consistency means following a defined protocol for object placement, approach strategy, and task completion. Diversity means varying object positions, orientations, and environmental conditions across episodes. Quality control at SVRC includes real-time episode review, automated outlier detection, and retake triggers when a demonstration falls outside quality bounds. Learn more about quality standards in our robot training data guide.
Operator Training Protocol
Operator quality is the most underappreciated factor in demonstration data. An experienced operator produces demonstrations that are 2-3x more consistent than a novice, which translates directly to policy performance. At SVRC, new operators go through a structured training protocol:
- Familiarization (30 min): Free exploration with the control device and robot. No data recording. Goal: build kinesthetic intuition for the robot's workspace, speed limits, and force limits.
- Practice episodes (20-30 episodes): Perform the target task with feedback from an experienced operator. These episodes are not used for training -- they are warm-up data.
- Calibration test (10 episodes): Record 10 episodes and measure success rate, completion time, and trajectory smoothness (mean jerk magnitude). Operators must achieve >90% success rate and trajectory smoothness within 2 standard deviations of the reference set before moving to production recording.
- Production recording: Record demonstrations for training data. Every 50 episodes, re-check quality metrics for drift.
Episode Quality Metrics
For each recorded episode, SVRC's pipeline computes and logs these quality signals:
- Completion time: Should be within 1.5x of the reference mean. Abnormally fast episodes often indicate skipped steps; abnormally slow episodes indicate operator hesitation.
- Trajectory smoothness: Mean jerk (third derivative of joint position) should be below a task-specific threshold. High jerk indicates corrections and hesitations that will confuse the policy.
- Grasp duration: Time from first contact to stable grasp. Prolonged fumbling (>2s for a simple grasp) indicates a demonstration that should be retaken.
- Task success: Binary -- did the episode achieve the goal state? Failed episodes are logged separately and can be used for negative examples or recovery behavior training, but should not be included in the primary training set.
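The trajectory smoothness metric can be computed directly from the recorded `qpos` stream by finite differences. A minimal sketch, assuming the 50Hz sampling used above -- `mean_jerk` is an illustrative helper, and the task-specific threshold it would be compared against is not shown:

```python
import numpy as np

def mean_jerk(qpos: np.ndarray, hz: float = 50.0) -> float:
    """Mean jerk magnitude: third finite difference of joint position.

    qpos: (T, DOF) joint positions sampled at `hz`; needs T >= 4.
    """
    dt = 1.0 / hz
    jerk = np.diff(qpos, n=3, axis=0) / dt**3   # (T-3, DOF), units rad/s^3
    return float(np.mean(np.linalg.norm(jerk, axis=1)))

# A smooth 7-DOF sinusoidal trajectory vs. the same trajectory with
# small jitter: the jitter dominates the third derivative.
t = np.linspace(0, 10, 500)[:, None]            # 10 s at 50 Hz
smooth = np.sin(t) * np.ones((1, 7))
jitter = np.random.default_rng(0).normal(0, 0.01, smooth.shape)
print(mean_jerk(smooth) < mean_jerk(smooth + jitter))  # True
```

Because differentiating three times amplifies high-frequency noise, even millimeter-scale operator corrections show up orders of magnitude above a smooth baseline -- which is exactly why mean jerk works as a hesitation detector.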
Common Failure Modes and How to Fix Them
After supporting hundreds of teleoperation programs, SVRC has cataloged the most frequent failure modes that teams encounter when setting up teleoperation for the first time:
| Symptom | Root Cause | Fix |
|---|---|---|
| Robot jerks or oscillates during teleoperation | Control loop rate too low (<30Hz) or PD gains too high | Increase control loop to 50Hz+. Reduce P gain by 30% and add velocity damping. Check USB polling rate. |
| Policy trained on teleop data fails at deployment | Camera moved between collection and deployment, or lighting changed | Use rigid camera mounts with alignment markers. Log camera intrinsics/extrinsics per session. Match lighting. |
| Video feed lags during remote teleop | H.264 encoder buffering frames for quality, not latency | Use -tune zerolatency and -preset ultrafast in ffmpeg/GStreamer. Reduce resolution to 480p. |
| Gripper commands lag behind arm commands | Gripper on separate serial bus with different polling rate | Synchronize arm and gripper commands in the same control loop iteration. Use a single serial bus if possible. |
| Timestamps in HDF5 not monotonically increasing | NTP time sync adjusting system clock during recording | Use time.monotonic() for relative timestamps. Store absolute time only in episode metadata. |
| Operator fatigue after 30 minutes of collection | Ergonomic issues with leader arm position or weight | Mount leader arm at elbow height. Add arm rest. Enforce 10-min break every 45 min. Rotate operators. |
Software Stack: What Runs Where
A production teleoperation setup runs software on three machines. Understanding this architecture helps you debug latency and synchronization issues:
- Robot-side compute (e.g., Jetson Orin Nano, $249): Runs the real-time control loop (50-500Hz), camera capture and encoding, and action execution. This machine must run a real-time kernel (PREEMPT_RT) or at minimum a low-latency kernel to avoid jitter. Python is acceptable for the logging path but the inner control loop should be C++ or use a Python extension that releases the GIL.
- Operator workstation: Runs the control interface driver, video decoder, and operator UI. For local teleoperation, this can be the same machine as the robot-side compute. For remote teleoperation, it is a separate machine connected via WebRTC or a custom UDP streaming protocol.
- Data server / cloud: Runs episode storage, quality metrics computation, dataset management, and training pipeline coordination. The SVRC data platform serves this role as a managed service.
How to Start a Teleoperation Program with SVRC
There are three paths depending on your starting point:
- Full-service data collection ($2,500 pilot / $8,000 campaign): You describe your task and target robot -- SVRC provides the hardware, trained operators, lab environment, and post-processed dataset in HDF5/RLDS format. Best for teams that need training data without building teleop infrastructure. Start through data services.
- Platform + your hardware: You own or lease the robot -- SVRC provides the teleop software stack, recording pipeline, dataset management, and quality metrics through the data platform. Monthly subscription, no hardware purchase required.
- Hardware lease + platform: Lease an OpenArm or other platform through SVRC's leasing program and get the full software stack included. Fastest path from zero to collecting data -- most teams record their first episode within 4 hours of hardware delivery.
For teams who want to explore the interface before committing, try the virtual teleoperation sandbox -- a browser-based simulator that runs the full SVRC teleop UI against a simulated robot.