PLanAR: Planning-Language-Grounded Agentic Reasoning for Robot Manipulation

Pengyuan Guo¹,* Zhonghao Mai¹,* Zhengtong Xu¹,* Kaidi Zhang¹ Quan Khanh Luu¹ Heng Zhang²

Zichen Miao¹ Arash Ajoudani² Zachary Kingston¹ Qiang Qiu¹ Yu She¹

¹Purdue University ²Istituto Italiano di Tecnologia *Equal contribution

The full hardware–software stack of PLanAR will be released.

Key Contributions

Closed-loop Vision-Language Robot Agent

We introduce a robot agent pipeline for open-world, open-vocabulary manipulation with vision-language closed-loop reasoning.

Model-Agnostic VLM Interface

Different VLMs (e.g., Gemini, GPT, Qwen) can be seamlessly swapped through a unified interface, enabling controlled evaluation without model-specific engineering.
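As a rough illustration of what a unified interface could look like, the sketch below has pipeline code depend only on a small protocol, so backends can be exchanged without touching the caller. The names `VLMBackend`, `EchoBackend`, and `ask` are hypothetical and are not taken from the PLanAR release; `EchoBackend` is a stand-in that only formats a string where a real backend would call a model API.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


class VLMBackend(Protocol):
    """Hypothetical interface: any backend mapping (prompt, images) to text."""

    def generate(self, prompt: str, images: Sequence[bytes]) -> str: ...


@dataclass
class EchoBackend:
    """Stand-in backend for illustration; a real one would call a model API."""

    name: str

    def generate(self, prompt: str, images: Sequence[bytes]) -> str:
        return f"[{self.name}] {len(images)} image(s): {prompt}"


def ask(backend: VLMBackend, prompt: str, images: Sequence[bytes] = ()) -> str:
    """Pipeline code depends only on the protocol, not on a specific model."""
    return backend.generate(prompt, images)


# Swapping models means swapping the object passed in; the caller is unchanged.
backends = [EchoBackend("model-a"), EchoBackend("model-b")]
replies = [ask(b, "Which objects are food?") for b in backends]
```

Under this assumed design, every model receives the same prompts and images, which is what makes a side-by-side comparison controlled.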

Real-World Embodied Evaluation

Our evaluation measures grounded perception, spatial reasoning, and long-horizon decision-making under closed-loop execution on physical robots.

Open-Source, Deployable Platform

We release a reproducible hardware–software stack that integrates sensing, control, and agentic reasoning, enabling direct deployment of VLM-based agents in the wild.

Real-World Robot Manipulation

We design a suite of five real-world robot manipulation tasks to evaluate embodied vision-language reasoning under challenging conditions.

The open-vocabulary prompts used for each task are displayed below.

Sorting

The agent categorizes objects into designated bins based on semantic attributes, demonstrating open-vocabulary grounding and compositional reasoning.

Move the food to the bowl.

Sort the toys to the blue bin.

Stacking

Arranging objects vertically in a specified order, which requires precise placement and sequential planning with strong dependencies between steps.

Stack the cubes on the pink plate from bottom to top: orange, yellow, green and blue.

Stack the cubes on the pink plate from bottom to top: orange, blue, yellow, and green.

Stack the cubes on the orange plate from bottom to top: orange, blue, yellow, and green.

Stack the cubes on the blue plate from bottom to top: orange, blue, green, and yellow.

Crossword

Arranging letter blocks to form intersecting words on a grid, combining world knowledge with fine-grained spatial placement.

Solution

Fill the numbered slots using the provided blocks to solve the crossword puzzle. You do not need to use all blocks or all slots.

Reorientation

Adjusting object poses to satisfy language-specified orientation constraints, emphasizing spatial understanding beyond 2D placement.

Pick up the bottles and place them on the plates.

Pick up the bottle and place it on the plate.

Kitchen

A long-horizon rearrangement task in which items are placed into context-appropriate containers, requiring sustained grounding and error recovery.

Open the pot, put the potato into the bowl, then take out the cup in the top drawer, place it on the plate, and close the drawer.

Close the pot, put the spice bottle into the top drawer, and close the drawer.

Put the potato into the pot, close the pot, then put the salt bottle in the top drawer and close the drawer.

Close the pot, put the salt bottle into the top drawer, and close the drawer.

Method

PLanAR executes a closed-loop agentic reasoning pipeline for manipulation that integrates task parsing, grounding, planning, execution, verification, and replanning. The system operates through iterative perception and action cycles, enabling robust performance in unstructured environments.


The pipeline uses multi-view RGB-D observations for open-vocabulary grounding, VLM-based reasoning for task decomposition and verification, and primitive-based execution with closed-loop feedback. This modular design allows different VLMs to be swapped in through a unified interface, enabling fair evaluation without model-specific engineering.
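The plan, execute, verify, and replan cycle described above can be sketched as a simple control loop. This is a minimal sketch under stated assumptions, not PLanAR's implementation: `run_closed_loop`, its callback signatures, and the dictionary `world` are all hypothetical, and the toy callbacks replace the VLM reasoning, grounding, and robot primitives with plain Python. A single simulated grasp failure triggers one replan.

```python
from typing import Callable, List


def run_closed_loop(
    instruction: str,
    plan: Callable[[str, dict], List[str]],
    execute: Callable[[str, dict], None],
    verify: Callable[[str, dict], bool],
    goal_reached: Callable[[str, dict], bool],
    world: dict,
    max_replans: int = 3,
) -> bool:
    """Plan, execute each step, verify it, and replan when a step fails."""
    for _ in range(max_replans + 1):
        steps = plan(instruction, world)  # replanning uses the current state
        for step in steps:
            execute(step, world)
            if not verify(step, world):
                break  # step failed: drop the rest of the plan and replan
        else:
            if goal_reached(instruction, world):
                return True
    return False


# Toy world (hypothetical): the first grasp slips, so the first plan fails.
world = {"on_table": ["apple", "banana"], "in_bowl": [], "slips_left": 1}


def plan(instruction: str, w: dict) -> List[str]:
    return list(w["on_table"])  # one step per object still on the table


def execute(item: str, w: dict) -> None:
    if w["slips_left"] > 0:
        w["slips_left"] -= 1  # simulated failed grasp: nothing moves
        return
    w["on_table"].remove(item)
    w["in_bowl"].append(item)


def verify(item: str, w: dict) -> bool:
    return item in w["in_bowl"]


def goal_reached(instruction: str, w: dict) -> bool:
    return not w["on_table"]


success = run_closed_loop(
    "Move the food to the bowl.", plan, execute, verify, goal_reached, world
)
```

In this toy run the first attempt stops after the slipped grasp, and the second plan, built from the updated state, moves both objects.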

VLMs as Robot Agent Evaluations

1. Can a Single VLM Drive a Robot Agent?

Single-VLM Pipeline Comparison

We compare single-VLM baselines under the same agent pipeline to isolate model effects.

Failure mode breakdown for single-VLM pipelines on the sorting task.

2. Module Evaluation

Modular Evaluation Radar Chart

Normalized per-module performance across the pipeline (higher is better). Each axis corresponds to a module score.

Latency Comparison Across VLMs

Token Usage Comparison Across VLMs

3. Compositional Pipeline vs. Single VLM

We compare a Gemini Flash single-VLM baseline with our compositional pipeline across five manipulation tasks. Performance is measured by a task progress score capturing partial completion.
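The page does not define the task progress score, so the following is only one plausible reading: the fraction of subgoals completed. The function name `progress_score` and the decomposition of a kitchen prompt into five subgoals are assumptions made for illustration and may differ from the metric used in the paper.

```python
from typing import Sequence


def progress_score(subgoals_done: Sequence[bool]) -> float:
    """Fraction of subgoals completed (assumed definition, for illustration)."""
    if not subgoals_done:
        raise ValueError("at least one subgoal is required")
    return sum(subgoals_done) / len(subgoals_done)


# Hypothetical subgoals for one kitchen prompt: pot opened, potato in bowl,
# cup taken out of the drawer, cup on plate, drawer closed.
kitchen_run = [True, True, True, False, False]
score = progress_score(kitchen_run)  # three of five subgoals completed
```

A score of this kind separates a run that fails on the last step from one that fails on the first, which a binary success rate cannot do.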

4. Off-the-Shelf VLA vs. PLanAR

Performance Comparison

We compare a π0.5 VLA fine-tuned with 40 demonstrations for sorting and 30 for stacking against the PLanAR pipeline on both tasks.

5. PLanAR vs. TiPToP

Performance Comparison

Cross-embodiment evaluation on the Franka robot and comparison with TiPToP on fruit sorting with human disturbance.

Ablation Study

We analyze two critical components that enable reliable closed-loop robot manipulation in PLanAR.

Action Checker for Vision-Language Closed-loop Reasoning

Comparing no checker, goal checker only, and full action checker to evaluate robustness against disturbances.

Grasp Planner for Active Perception

Evaluating how grasp pose verification using local point clouds improves semantic correctness and physical feasibility.

BibTeX


@article{guo2026planar,
  title={PLanAR: Planning-Language-Grounded Agentic Reasoning for Robot Manipulation},
  author={Guo, Pengyuan and Mai, Zhonghao and Xu, Zhengtong and Zhang, Kaidi and Luu, Quan Khanh and Zhang, Heng and Miao, Zichen and Ajoudani, Arash and Kingston, Zachary and Qiu, Qiang and She, Yu},
  journal={arXiv preprint arXiv:2602.01662},
  year={2026}
}
