What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents
Google DeepMind
Abstract
Hierarchical vision-language-action (Hi-VLA) systems have emerged as a promising paradigm for complex robot manipulation: high-level VLM planners decompose tasks into language subgoals, which low-level VLA controllers then execute. Despite recent empirical progress, unified design principles for these systems are lacking. Existing Hi-VLA systems differ in how they choose and connect planners and controllers, how they switch between the two, and how they represent observations and memory for the planner.
We unify representative Hi-VLA agents under an options-style control framework and benchmark core design choices across short-horizon, long-horizon, and reasoning-intensive tasks. Our analysis distills practical principles for building Hi-VLA systems, showing how model choices and interface mechanisms jointly shape performance. Applying these principles yields a substantially stronger system than either flat VLA control or a naively designed hierarchy, both in simulation and on a real ALOHA robot.
Key Results
The "best hierarchy" agent, which combines the strongest orchestration choices from our study, substantially outperforms both flat VLA control and a naively designed hierarchy, especially on long-horizon and reasoning-intensive tasks.
Best hierarchy success rate, compared with 69.57% for naive hierarchy and 69.63% for flat VLA.
Best hierarchy success rate, compared with 40.56% for naive hierarchy and 25.30% for flat VLA.
Best hierarchy success rate, compared with 66.49% for naive hierarchy and 50.90% for flat VLA.
Correct fruit placements with the best hierarchy, compared with 9 / 15 and 3 / 15 for baselines.
Real Robot Experiment
Best Hierarchy
The “best hierarchy” allows the robot to reason about the task, break it down into sub-goals, and recover from a mistake made by the low-level VLA (misplacing the red grape).
Naive Hierarchy
The “naive hierarchy” also makes reasonable progress, but it cannot recover from the grape misplacement.
Flat VLA
The flat VLA struggles to convert the task instruction into reasonable low-level actions and makes little progress.
What Matters
Here, we show how some of the key design choices in our study affect the performance of Hi-VLA agents. Each table reports the change relative to its first column, which serves as the baseline (+0.0).
VLM Policy
| Task | Lite | Lite (no thinking) | Flash | Pro |
|---|---|---|---|---|
| Short | +0.0 | -4.0 | +1.4 | -4.3 |
| Long | +0.0 | -9.5 | -5.9 | -5.1 |
| Reasoning | +0.0 | -16.7 | -2.6 | -0.8 |
Reasoning beats scale for the high-level VLM.
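Increasing the size of the VLM policy (Lite to Flash to Pro) does not consistently improve performance; the only gain is +1.4 for Flash on short-horizon tasks. By contrast, enabling VLM thinking helps: disabling it lowers performance on every task type, most of all on reasoning tasks (-16.7).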
VLA Policy
| Task | GROD-S | GROD-S FT | GROD-L |
|---|---|---|---|
| Short | +0.0 | -8.8 | +12.4 |
| Long | +0.0 | -33.8 | +11.1 |
| Reasoning | +0.0 | -23.9 | +5.7 |
Steerable low-level control is a bottleneck.
Larger VLAs perform better because they follow instructions more reliably, while fine-tuning can hurt performance significantly by degrading instruction following.
Termination
| Task | Fixed | VLM | Success |
|---|---|---|---|
| Short | +0.0 | -3.7 | -1.2 |
| Long | +0.0 | -8.9 | +5.0 |
| Reasoning | +0.0 | -0.4 | +8.3 |
Switching policy is high leverage.
Good handoff timing prevents wasted execution. Success detection is an effective termination condition, particularly on long-horizon and reasoning tasks, and moderate fixed execution horizons reduce VLM calls without a large performance loss.
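To make the handoff trade-off concrete, here is a minimal Python sketch of an options-style control loop with a pluggable termination condition. It is a toy under stated assumptions, not the paper's implementation: `plan_subgoal`, `act`, `detect_success`, and the integer "state" are hypothetical stand-ins for the VLM planner, the VLA controller, a success detector, and the robot environment. A VLM-based termination check would be one more callable of the same shape and is not shown.

```python
from typing import Callable, Tuple

# Hypothetical toy world: the state is an integer, each subgoal asks the
# controller to reach the next multiple of 3, and each action adds 1.

def plan_subgoal(state: int) -> int:
    """Stand-in for the high-level VLM planner: pick the next target."""
    return (state // 3 + 1) * 3

def act(state: int, subgoal: int) -> int:
    """Stand-in for one low-level VLA control step."""
    return state + 1 if state < subgoal else state

def detect_success(state: int, subgoal: int) -> bool:
    """Stand-in for a success detector on the current subgoal."""
    return state >= subgoal

# A termination condition sees (state, subgoal, steps spent in this option).
Termination = Callable[[int, int, int], bool]

def fixed_horizon(h: int) -> Termination:
    """Hand control back to the planner after exactly h low-level steps."""
    return lambda state, subgoal, steps: steps >= h

def on_success(max_steps: int) -> Termination:
    """Hand control back once the subgoal is detected as done (or on timeout)."""
    return lambda state, subgoal, steps: (
        detect_success(state, subgoal) or steps >= max_steps
    )

def run_episode(goal: int, terminate: Termination,
                max_total_steps: int = 100) -> Tuple[int, int, int]:
    """Return (final state, planner calls, low-level steps)."""
    state, planner_calls, total_steps = 0, 0, 0
    while state < goal and total_steps < max_total_steps:
        subgoal = plan_subgoal(state)
        planner_calls += 1
        steps = 0
        while True:
            state = act(state, subgoal)
            steps += 1
            total_steps += 1
            if terminate(state, subgoal, steps) or total_steps >= max_total_steps:
                break
    return state, planner_calls, total_steps

if __name__ == "__main__":
    print(run_episode(9, fixed_horizon(5)))  # long fixed horizon
    print(run_episode(9, on_success(10)))    # success-based handoff
    print(run_episode(9, fixed_horizon(2)))  # short fixed horizon
```

In this toy setting, a horizon that is too long keeps executing after the subgoal is already reached, while one that is too short calls the planner more often. The sketch only illustrates that trade-off and says nothing about the magnitudes reported in the table.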
Observation
| Task | Raw | Desc. | +Box | +Contact |
|---|---|---|---|---|
| Short | +0.0 | +0.4 | +6.4 | +8.2 |
| Long | +0.0 | -3.1 | +9.1 | +13.5 |
| Reasoning | +0.0 | -6.4 | -0.7 | +3.4 |
Observation representation matters.
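Structured image representations help the VLM perceive the scene and make better decisions. Bounding boxes help on short- and long-horizon tasks, and adding contact information gives the largest gains on all three task types, whereas a text description alone does not help on long-horizon or reasoning tasks.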
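As a rough illustration of what a structured observation could look like, the sketch below serializes per-object bounding boxes and contact flags into text for the planner prompt. This is an assumption made for illustration: the page does not say whether boxes and contact information are drawn on the image or passed as text, and the `Detection` class, its field names, and `format_observation` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Detection:
    """Hypothetical per-object annotation produced by some perception module."""
    label: str
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    in_contact_with_gripper: bool = False

def format_observation(description: str, detections: List[Detection],
                       with_boxes: bool = True,
                       with_contact: bool = True) -> str:
    """Build a text block for the planner; flags toggle each annotation type."""
    lines = [f"Scene: {description}"]
    for det in detections:
        parts = [det.label]
        if with_boxes:
            parts.append(f"box={det.box}")
        if with_contact:
            parts.append(
                "contact=yes" if det.in_contact_with_gripper else "contact=no"
            )
        lines.append("- " + ", ".join(parts))
    return "\n".join(lines)

if __name__ == "__main__":
    detections = [
        Detection("red grape", (10, 20, 50, 60), in_contact_with_gripper=True),
        Detection("bowl", (100, 80, 220, 200)),
    ]
    print(format_observation("fruit on a table", detections))
```

The two flags loosely mirror the table's columns: description only, description plus boxes, and description plus boxes and contact.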
Memory Length
| Task | 1 | 3 | 5 | Full |
|---|---|---|---|---|
| Short | +0.0 | -1.0 | -0.7 | -0.2 |
| Long | +0.0 | -1.7 | -2.1 | -0.9 |
| Reasoning | +0.0 | -1.6 | -2.1 | -1.5 |
Raw memory is not enough.
Raw in-episode context provides little benefit by itself, suggesting that useful robot memory needs processing rather than simple context accumulation.
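The memory-length setting can be pictured as a sliding window over the episode history. The sketch below is a minimal illustration under that assumption; `build_context` and the string-valued history entries are hypothetical, and `None` stands in for the table's "Full" column.

```python
from collections import deque
from typing import List, Optional

def build_context(history: List[str],
                  memory_length: Optional[int] = None) -> List[str]:
    """Return the most recent `memory_length` entries, or all of them if None."""
    if memory_length is None:
        return list(history)
    # A bounded deque keeps only the last `memory_length` items it is given.
    return list(deque(history, maxlen=memory_length))

if __name__ == "__main__":
    history = ["pick grape", "place grape", "pick banana", "place banana"]
    print(build_context(history, memory_length=1))
    print(build_context(history, memory_length=3))
    print(build_context(history))
```

Widening the window only appends more raw entries to the planner's context, which is the kind of plain accumulation that the table suggests is not useful on its own.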
Memory Summary
| Task | None | 1-Step | Episode | Prev. |
|---|---|---|---|---|
| Short | +0.0 | -1.2 | -4.2 | +3.6 |
| Long | +0.0 | +0.2 | -2.2 | +7.6 |
| Reasoning | +0.0 | +0.2 | +3.1 | +7.7 |
Cross-episode reflection helps.
Summarization and reflection let the VLM compress long memory. Summaries of the current episode (1-Step, Episode) help little on their own, but summarizing experience from previous episodes (Prev.) notably boosts performance on all three task types.
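One way to picture cross-episode reflection is a small store of per-episode notes that is prepended to the planner prompt at the start of the next episode. The sketch below is a hypothetical illustration, not the paper's mechanism: `ReflectionMemory`, its methods, and the template-based summary are assumptions, and a real system would presumably ask the VLM itself to write the reflection.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReflectionMemory:
    """Hypothetical store of short notes carried across episodes."""
    max_notes: int = 3
    notes: List[str] = field(default_factory=list)

    def end_episode(self, task: str, success: bool, subgoals: List[str]) -> None:
        # Placeholder summarizer: a fixed template instead of a VLM call.
        outcome = "succeeded" if success else "failed"
        last = subgoals[-1] if subgoals else "none"
        self.notes.append(
            f"{task}: {outcome} after {len(subgoals)} subgoals; last subgoal: {last}"
        )
        # Keep only the most recent notes so the prompt stays short.
        self.notes = self.notes[-self.max_notes:]

    def prompt_prefix(self) -> str:
        """Text to prepend to the planner prompt in the next episode."""
        if not self.notes:
            return ""
        bullet_lines = "\n".join(f"- {note}" for note in self.notes)
        return "Lessons from previous episodes:\n" + bullet_lines

if __name__ == "__main__":
    memory = ReflectionMemory(max_notes=2)
    memory.end_episode("sort fruit", False, ["pick grape", "place grape"])
    memory.end_episode("sort fruit", True, ["pick grape", "place grape", "pick banana"])
    print(memory.prompt_prefix())
```

Unlike a raw history window, the notes are bounded in length and persist after the episode ends, so the planner can start the next attempt with the earlier outcome in view.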
Check out our paper for more detailed results and analysis!
BibTeX
@article{hu2026matters,
title = {What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents},
author = {Hu, Jiaheng and Shridhar, Mohit and Lu, Caden and Shah, Dhruv and Chiang, Hao-Tien Lewis and Tan, Jie and Xie, Annie},
journal = {arXiv preprint arXiv:2606.10267},
year = {2026}
}