What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents

Jiaheng Hu, Mohit Shridhar, Caden Lu, Dhruv Shah, Hao-Tien Lewis Chiang, Jie Tan, Annie Xie

Google DeepMind

Abstract

Hierarchical vision-language-action (Hi-VLA) systems have emerged as a promising paradigm for complex robot manipulation: a high-level VLM planner decomposes a task into language subgoals, which a low-level VLA controller then executes. Despite recent empirical progress, unified design principles for these systems are lacking. Existing Hi-VLA systems differ in their choice of planner and controller, in the mechanism for switching between the two, and in how observations and memory are represented for the planner.

We unify representative Hi-VLA agents under an options-style control framework and benchmark core design choices across short-horizon, long-horizon, and reasoning-intensive tasks. Our analysis distills practical principles for building Hi-VLA systems, showing how model choices and interface mechanisms jointly shape performance. Applying these principles yields a substantially stronger system than either flat VLA control or a naively designed hierarchy, across experiments both in simulation and on a real ALOHA robot.
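As a rough illustration of what an options-style control loop can look like, here is a minimal sketch. It is not the paper's implementation: `HiVLAAgent`, `run_episode`, the three callables, the string-valued observations, and the `"done"` sentinel are all hypothetical names and conventions chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class HiVLAAgent:
    # All three callables are hypothetical stand-ins, not the paper's API.
    plan: Callable[[str, str, List[str]], str]         # (task, obs, memory) -> subgoal
    act: Callable[[str, str], str]                     # (subgoal, obs) -> action
    should_terminate: Callable[[str, str, int], bool]  # (subgoal, obs, option_steps) -> bool
    memory: List[str] = field(default_factory=list)


def run_episode(
    agent: HiVLAAgent,
    env_step: Callable[[str], str],
    task: str,
    obs: str,
    max_steps: int = 20,
) -> List[Tuple[str, int]]:
    """Pick a subgoal, run the controller until the option terminates, then replan."""
    trace: List[Tuple[str, int]] = []
    steps = 0
    while steps < max_steps:
        # High level: the planner chooses the next language subgoal (the "option").
        subgoal = agent.plan(task, obs, agent.memory)
        if subgoal == "done":  # assumed sentinel meaning the planner thinks the task is finished
            break
        option_steps = 0
        # Low level: the controller acts until the termination condition fires.
        while steps < max_steps:
            action = agent.act(subgoal, obs)
            obs = env_step(action)
            steps += 1
            option_steps += 1
            if agent.should_terminate(subgoal, obs, option_steps):
                break
        agent.memory.append(subgoal)
        trace.append((subgoal, option_steps))
    return trace
```

In this sketch, the design choices studied on this page map onto the three callables and the memory list: `plan` is the VLM policy, `act` is the VLA policy, `should_terminate` is the switching mechanism, and `obs` and `memory` are what the planner gets to see.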

Key Results

The "best hierarchy" agent, which combines the strongest orchestration choices from our study, substantially outperforms both flat VLA control and a naively designed hierarchy, especially on long-horizon and reasoning-intensive tasks.

Short-Horizon 78.22%

Best hierarchy success rate, compared with 69.57% for naive hierarchy and 69.63% for flat VLA.

Long-Horizon 67.08%

Best hierarchy success rate, compared with 40.56% for naive hierarchy and 25.30% for flat VLA.

Reasoning 80.89%

Best hierarchy success rate, compared with 66.49% for naive hierarchy and 50.90% for flat VLA.

Real ALOHA 12 / 15

Correct fruit placements with the best hierarchy, compared with 9 / 15 for naive hierarchy and 3 / 15 for flat VLA.

Configuration	Short-Horizon	Long-Horizon	Reasoning	Real ALOHA
Best hierarchy	78.22%	67.08%	80.89%	12 / 15
Naive hierarchy	69.57%	40.56%	66.49%	9 / 15
Flat VLA	69.63%	25.30%	50.90%	3 / 15
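To make the size of the gaps concrete, the short sketch below computes the best hierarchy's gain over each baseline in percentage points, using only the numbers reported above. The names `SUCCESS`, `ALOHA_CORRECT`, and `gain_over_baselines` are made up for this example.

```python
# Simulation success rates (%) copied from the results reported above.
SUCCESS = {
    "Short": {"best": 78.22, "naive": 69.57, "flat": 69.63},
    "Long": {"best": 67.08, "naive": 40.56, "flat": 25.30},
    "Reasoning": {"best": 80.89, "naive": 66.49, "flat": 50.90},
}

# Real ALOHA: correct fruit placements out of 15.
ALOHA_CORRECT = {"best": 12, "naive": 9, "flat": 3}


def gain_over_baselines(success):
    """Percentage-point gain of the best hierarchy over each baseline, per task type."""
    return {
        task: {
            "vs_naive": round(rates["best"] - rates["naive"], 2),
            "vs_flat": round(rates["best"] - rates["flat"], 2),
        }
        for task, rates in success.items()
    }


if __name__ == "__main__":
    for task, gain in gain_over_baselines(SUCCESS).items():
        print(f"{task}: +{gain['vs_naive']} vs naive, +{gain['vs_flat']} vs flat")
    for name, correct in ALOHA_CORRECT.items():
        print(f"ALOHA {name}: {correct / 15:.0%}")
```

The largest gap is on long-horizon tasks, where the best hierarchy is 41.78 points above flat VLA and 26.52 points above the naive hierarchy. On short-horizon tasks the naive hierarchy and flat VLA are nearly tied.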

Real Robot Experiment

ALOHA robot fruit-sorting setup with colored plates. Real-world evaluation: place fruits onto plates of matching color.

Best Hierarchy

The “best hierarchy” lets the robot reason about the task, break it down into sub-goals, and recover from a mistake made by the low-level VLA (a misplaced red grape).

Naive Hierarchy

The “naive hierarchy” also makes reasonable progress, but it is unable to recover from the grape misplacement.

Flat VLA

The flat VLA has difficulty converting the task instruction into reasonable low-level actions and struggles to make progress.

What Matters

Here, we present how some of the key design choices in our study affect the performance of Hi-VLA agents.

VLM Policy

Task	Lite	Lite (no thinking)	Flash	Pro
Short	+0.0	-4.0	+1.4	-4.3
Long	+0.0	-9.5	-5.9	-5.1
Reasoning	+0.0	-16.7	-2.6	-0.8

Reasoning beats scale for the high-level VLM.

Increasing the number of parameters of the VLM policy does not improve performance: neither Flash nor Pro consistently beats Lite. By contrast, enabling VLM thinking helps: turning it off for Lite lowers performance on all three task types.

VLA Policy

Task	GROD-S	GROD-S FT	GROD-L
Short	+0.0	-8.8	+12.4
Long	+0.0	-33.8	+11.1
Reasoning	+0.0	-23.9	+5.7

Steerable low-level control is a bottleneck.

Larger VLAs perform better because they follow instructions more reliably, while fine-tuning can hurt performance significantly by degrading instruction following.

Termination

Task	Fixed	VLM	Success
Short	+0.0	-3.7	-1.2
Long	+0.0	-8.9	+5.0
Reasoning	+0.0	-0.4	+8.3

Switching policy is high leverage.

Good handoff timing prevents wasted execution. Success detection is an effective termination condition, and moderate fixed execution horizons reduce VLM calls without large performance loss.
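The three termination strategies compared above can be sketched as interchangeable predicates. This is one possible reading of the column names, not the paper's definitions: the function names, the shared `(subgoal, obs, steps)` signature, and the `ask_vlm` and `is_success` callables are assumptions made for this example.

```python
from typing import Callable

# Shared signature (assumed): (subgoal, observation, steps_in_option) -> bool
Termination = Callable[[str, object, int], bool]


def fixed_horizon(horizon: int) -> Termination:
    """Hand control back to the planner after a fixed number of low-level steps."""
    def cond(subgoal: str, obs: object, steps: int) -> bool:
        return steps >= horizon
    return cond


def vlm_decides(
    ask_vlm: Callable[[str, object], bool], every: int, max_steps: int
) -> Termination:
    """Query a (hypothetical) VLM every `every` steps; stop when it asks to switch."""
    def cond(subgoal: str, obs: object, steps: int) -> bool:
        if steps >= max_steps:  # safety cap so an option cannot run forever
            return True
        return steps % every == 0 and bool(ask_vlm(subgoal, obs))
    return cond


def success_detection(
    is_success: Callable[[str, object], bool], max_steps: int
) -> Termination:
    """Stop as soon as a (hypothetical) success detector reports the subgoal is done."""
    def cond(subgoal: str, obs: object, steps: int) -> bool:
        return steps >= max_steps or bool(is_success(subgoal, obs))
    return cond
```

Any of these can be passed wherever a control loop expects a termination predicate. The `every` and `horizon` parameters control how often the planner is called.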

Observation

Task	Raw	Desc.	+Box	+Contact
Short	+0.0	+0.4	+6.4	+8.2
Long	+0.0	-3.1	+9.1	+13.5
Reasoning	+0.0	-6.4	-0.7	+3.4

Observation representation matters.

Structured image representations, such as bounding boxes and contact information, help the VLM perceive the scene and make better decisions.
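As a toy illustration of a structured observation, the sketch below outlines object boxes on an image and builds a matching text cue with per-object contact flags. The grid-of-integers image, the `Box` format, the `draw_box` and `annotate` helpers, and the idea of encoding contact as one boolean per object are all assumptions for this example. The page does not specify how the boxes or contact information are produced or rendered.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), inclusive pixel coordinates (assumed format)


def draw_box(image: List[List[int]], box: Box, value: int = 1) -> None:
    """Draw a one-pixel rectangle outline in place on a 2D image (rows of pixels)."""
    x0, y0, x1, y1 = box
    for x in range(x0, x1 + 1):  # top and bottom edges
        image[y0][x] = value
        image[y1][x] = value
    for y in range(y0, y1 + 1):  # left and right edges
        image[y][x0] = value
        image[y][x1] = value


def annotate(
    image: List[List[int]], boxes: Dict[str, Box], contacts: Dict[str, bool]
) -> str:
    """Overlay every box on the image and return a text cue listing boxes and contacts."""
    lines = []
    for name, box in boxes.items():
        draw_box(image, box)
        touching = "in contact" if contacts.get(name, False) else "no contact"
        lines.append(f"{name}: box={box}, {touching}")
    return "\n".join(lines)
```

For example, `annotate(image, {"grape": (0, 0, 3, 3)}, {"grape": True})` outlines a 4x4 box and returns `grape: box=(0, 0, 3, 3), in contact`, which could accompany the annotated image in the planner prompt.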

Memory Length

Task	1	3	5	Full
Short	+0.0	-1.0	-0.7	-0.2
Long	+0.0	-1.7	-2.1	-0.9
Reasoning	+0.0	-1.6	-2.1	-1.5

Raw memory is not enough.

Raw in-episode context provides little benefit by itself, suggesting that useful robot memory needs processing rather than simple context accumulation.
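The memory-length settings above (1, 3, 5, Full) can be thought of as a sliding window over the in-episode history. A minimal sketch follows, assuming the history is a list of per-step records and that "Full" means no truncation. `select_context` is a hypothetical helper, not part of the paper's code.

```python
from typing import List, Optional


def select_context(history: List[str], length: Optional[int]) -> List[str]:
    """Return the last `length` records, or the whole history when `length` is None ("Full")."""
    if length is None:
        return list(history)
    if length <= 0:
        return []
    return history[-length:]
```

With a four-record history, `select_context(history, 1)` keeps only the latest record, and `select_context(history, 5)` returns all four because the window is longer than the history.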

Memory Summary

Task	None	1-Step	Episode	Prev.
Short	+0.0	-1.2	-4.2	+3.6
Long	+0.0	+0.2	-2.2	+7.6
Reasoning	+0.0	+0.2	+3.1	+7.7

Cross-episode reflection helps.

Summarization and reflection let the VLM encode long memory compactly. In-episode summaries give little benefit on their own, but summarizing experience from previous episodes notably boosts performance.
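One simple way to carry reflections across episodes is to summarize each finished episode and prepend the most recent summaries to the next planner prompt. The sketch below illustrates that idea only. `ReflectionMemory`, its methods, the `summarize` callable (which would be a VLM call in practice), and the prompt wording are all hypothetical.

```python
from typing import Callable, List


class ReflectionMemory:
    """Keeps short summaries of previous episodes for reuse in later planner prompts."""

    def __init__(self, summarize: Callable[[List[str]], str], max_notes: int = 3):
        self.summarize = summarize  # hypothetical summarizer, e.g. a VLM call
        self.max_notes = max_notes  # how many past-episode summaries to keep
        self.notes: List[str] = []

    def end_episode(self, episode_log: List[str]) -> None:
        """Compress the finished episode into one note and drop the oldest notes."""
        self.notes.append(self.summarize(episode_log))
        self.notes = self.notes[-self.max_notes:]

    def build_prompt(self, task: str) -> str:
        """Prepend lessons from previous episodes, if any, to the task instruction."""
        if not self.notes:
            return f"Task: {task}"
        lessons = "\n".join(f"- {note}" for note in self.notes)
        return f"Lessons from previous episodes:\n{lessons}\nTask: {task}"
```

Capping the number of notes keeps the prompt short, so the planner sees processed lessons rather than an ever-growing raw context.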

Check out our paper for more detailed results and analysis!

BibTeX

@article{hu2026matters,
  title   = {What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents},
  author  = {Hu, Jiaheng and Shridhar, Mohit and Lu, Caden and Shah, Dhruv and Chiang, Hao-Tien Lewis and Tan, Jie and Xie, Annie},
  journal = {arXiv preprint arXiv:2606.10267},
  year    = {2026}
}