Intent at a Glance: Gaze-Guided Robotic Manipulation via Foundation Models

Author 1, Author 2, Author 3 ...

Abstract

Designing intuitive interfaces for robotic control remains a central challenge in enabling effective human-robot interaction, particularly in assistive care settings. Eye gaze offers a fast, non-intrusive, and intent-rich input modality, making it an attractive channel for conveying user goals. In this work, we present GAMMA (Gaze Assisted Manipulation for Modular Autonomy), a system that leverages ego-centric gaze tracking and a vision-language foundation model to infer user intent and autonomously execute robotic manipulation tasks. By contextualizing gaze fixations within the scene, the system maps visual attention to high-level semantic understanding, enabling skill selection and parameterization without task-specific training. We evaluate GAMMA on a range of table-top manipulation tasks and compare it against baseline gaze-based control without reasoning. Results demonstrate that GAMMA provides robust, intuitive, and generalizable control, highlighting the potential of combining foundation models and gaze for natural and scalable robot autonomy.

System Overview
Overview of GAMMA. A user wearing smart glasses uses their gaze to specify a manipulation task. In this example, the user wants the robot to pick up a plant under the lamp and place it in a tray. GAMMA transforms the gaze fixations into the robot's view and prompts the VLM to predict user intent. Given the predicted intent, GAMMA calls the corresponding functions for perception, planning, and execution. GAMMA then prompts a VLM to select a proper grasping pose that takes the task context into consideration (e.g., not colliding with the lamp).

Method

System Pipeline
Functional Modules of GAMMA. GAMMA consists of various sensing & perception modules that leverage pretrained vision models (bottom left), and VLM-based reasoning modules at both the task level and the grasp selection level (top right).

We introduce GAMMA (Gaze Assisted Manipulation for Modular Autonomy), a robotic manipulation system leveraging gaze-tracking combined with vision-language foundation models (VLMs). The user's gaze points, captured by wearable smart glasses, are mapped onto the robot's perspective. VLMs interpret these gaze points to predict user intent, generating appropriate robotic commands for perception, planning, and task execution. GAMMA utilizes pretrained models to achieve zero-shot generalization without task-specific training, enabling flexible and scalable robotic autonomy.
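The overall control flow can be sketched as follows. This is a minimal illustration under assumed interfaces (glasses, robot_cam, vlm, skills are hypothetical stand-ins), not GAMMA's actual implementation.

```python
# Minimal sketch of the GAMMA control loop. All interfaces here (glasses,
# robot_cam, vlm, skills) are hypothetical stand-ins, not the released code.

def run_gamma_episode(glasses, robot_cam, vlm, skills):
    """One gaze-to-action episode: sense, reason, act."""
    # 1. Collect a short sequence of gaze fixations from the smart glasses
    #    (pixel coordinates in the egocentric camera, in temporal order).
    fixations = glasses.get_gaze_fixations()

    # 2. Map each fixation into the robot camera's view using known calibration.
    robot_points = [robot_cam.project_from_egocentric(f) for f in fixations]

    # 3. Overlay numbered gaze markers on the robot's RGB image and ask the VLM
    #    for the intended action and the objects each marker refers to.
    annotated = robot_cam.annotate_rgb(robot_points)
    intent = vlm.predict_intent(annotated)   # e.g. {"action": "pick_and_place",
                                             #       "1": "plant", "2": "tray"}

    # 4. Dispatch the corresponding perception, planning, and execution skills.
    skills[intent["action"]](intent, robot_points)
```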

GAMMA

GAMMA integrates real-time gaze tracking from Meta's Project Aria glasses, transforming egocentric gaze data into actionable insights. It leverages SAM2 for object segmentation and Contact-GraspNet for grasp prediction, augmented by multi-viewpoint grasp validation. High-level task inference and context-aware grasp selection are performed by specialized VLMs, employing chain-of-thought reasoning prompted through visual and textual cues.
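As one simplified illustration of the gaze transformation step, the sketch below back-projects an egocentric gaze pixel into 3D and re-projects it into the robot camera. It assumes a depth estimate at the gaze pixel and calibrated extrinsics between the Aria glasses and the robot camera; this is only one possible way to realize the mapping and is not taken from GAMMA's released code.

```python
import numpy as np

# Sketch of mapping an egocentric gaze pixel into the robot camera image.
# Assumptions (not specified above): a depth value at the gaze pixel is
# available, and the Aria-to-robot-camera extrinsics T_robot_aria are known.

def gaze_to_robot_pixel(gaze_uv, depth, K_aria, K_robot, T_robot_aria):
    u, v = gaze_uv
    # Back-project the gaze pixel into a 3D point in the Aria camera frame.
    p_aria = depth * np.linalg.inv(K_aria) @ np.array([u, v, 1.0])
    # Move the point into the robot camera frame (4x4 homogeneous transform).
    p_robot = (T_robot_aria @ np.append(p_aria, 1.0))[:3]
    # Project into the robot camera image plane.
    uv = K_robot @ (p_robot / p_robot[2])
    return uv[:2]
```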

Intent Reasoning
Gaze-based Intent Reasoning Tasks. We designed 30 intent-reasoning scenes with diverse difficulty levels across 3 tasks. Easy scenes are relatively clean; medium scenes are cluttered or contain longer sequences of gaze points; hard scenes involve visual attacks.
Task / VLM   Gemini2.0F   Gemini2.5F   Gemini Pro   Llama4-Maverick   GPT-4o
Plant        1.00         1.00         1.00         0.80              1.00
Basket       0.78         0.72         0.83         0.44              0.33
Coffee       1.00         1.00         1.00         1.00              1.00
Average      0.93         0.91         0.94         0.75              0.78
Time (s)     1.89         1.26         4.22         11.63             2.60
VLM inference accuracies for predicting the user intent. Each intent involves a sequence of actions and corresponding objects (e.g. pick up nail polish, place in basket).
Intent Recognition Prompt
You are an expert in recognizing objects in a scene and guessing what people want to do with those objects. Given an image, follow the INSTRUCTIONS below to infer the intent. Determine a sequence of the following functions that are necessary to complete the intended task. Include the appropriate gaze point numbers as parameters to the function.

INSTRUCTIONS:
1. Input description: There will be red dots marked with numbers in the images. The number represents the order of the red dots.
2. Detect which objects the dots are exactly on.
   - You have to choose an object for each dot from the object pool: [watering can, yellow beaker, potted plant, basket, flip-top candy, drawer, box, wooden shelf, table, jar-shaped nail polish, coffee maker, coffee pod, cup]
   - It is possible that objects are the same.
   - If you think the dot is on the background table surface, select the object that is closest to the dot.
   - If you are not sure, just give your best guess.
3. Choose the action that is most reasonably happening between the two detected objects.
   - You have to choose an action from the action pool: [pick_and_place, pouring]
4. Output description: You MUST output ONLY 1 sentence in the specified format as a STRING: "action: selected_action, 1: object_name_1, 2: object_name_2"

EXAMPLE:
Input: An image with 2 dots. Dot number 1 is on a bread and dot number 2 is on a basket.
Your thinking process should be:
- First, you will detect that dot 1 is on "bread" and dot 2 is on "basket".
- Second, based on "bread" and "basket", the most likely action for them would be "pick_and_place".
- Third, output in the specified string format.
Output: "action: pick_and_place, 1: bread, 2: basket"
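Because the prompt constrains the response to a single comma-separated string, downstream skill dispatch only needs a lightweight parser. The sketch below is illustrative (the parse_intent helper is not part of GAMMA's released code):

```python
# Sketch of parsing the one-line intent string produced under the prompt above.
# The helper name and error handling are illustrative.

def parse_intent(response: str) -> dict:
    # Expected form: "action: pick_and_place, 1: bread, 2: basket"
    fields = [part.strip() for part in response.strip().strip('"').split(",")]
    intent = {}
    for field in fields:
        key, value = field.split(":", 1)
        intent[key.strip()] = value.strip()
    return intent

# parse_intent('action: pick_and_place, 1: bread, 2: basket')
# -> {'action': 'pick_and_place', '1': 'bread', '2': 'basket'}
```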
Grasp Selection
Grasp Selection Visual Prompts. We evaluated different visual representations for grasp selection. To provide enough information for inferring 3D grasping poses, we use both numbered multi-view image prompts (top) and a short video clip of a camera hovering around color-coded grasp pose candidates (bottom). Different visual representations resulted in the VLM (Gemini 2.5 Pro) making different predictions.
Prompt / VLM     Gemini2.0F   Gemini2.5F   Gemini Pro   Llama4-Maverick   GPT-4o
Image Average    0.20         0.47         0.60         0.13              0.00
Image Time (s)   6.42         5.44         24.83        5.41              8.18
Video Average    0.27         0.27         0.33         –                 –
Video Time (s)   15.96        15.18        32.06        –                 –
VLM grasp selection performance across visual prompts. "–" indicates missing or unsupported data.
Grasp Selection Prompt
You are a pose selector for assisting a robot to grasp an object in pick-and-place tasks.

TASK: Given a video of a 3D point cloud with 9 possible grasping poses and the pick-and-place task description, follow the INSTRUCTIONS to select the one pose that is most likely to help the robot finish the task successfully.

INSTRUCTIONS:
1. Input description: The video shows a robot arm base, an object to pick, an object to place, and nine potential grasp poses indicated by the coloured rendered arms. A sentence of the task will be provided in the format of "Pick the OBJECT_TO_PICK and place it at OBJECT_TO_PLACE."
2. Your goal is to identify the single optimal grasp pose to complete a task, considering the entire environment and the object's properties. Note that the robot end effector cannot change orientation once an object is grabbed. In other words, the grasp pose is the pose that the end effector will be in throughout the entire motion. Factor this into your analysis of each pose and your decision making process.
3. Observe the scene. Identify the robot arm, the target object, the grasp markers, and *all other objects and significant environmental features* (e.g., containers, fixtures, surfaces, walls, potential obstructions). Note their spatial relationships.
4. Object analysis.
   - Think about the characteristics of the OBJECT_TO_PICK and OBJECT_TO_PLACE.
   - Think about whether the OBJECT_TO_PLACE's shape suggests a specific orientation for placement or use in this environment.
5. Analyze the environmental constraints and placement/interaction restrictions.
   - For OBJECT_TO_PLACE, is it constrained? Is there a required entry angle or orientation for placing/inserting the object? Are there obstructions *near this target area* that the robot arm, gripper, or the *object being held* might collide with during the final placement/interaction motion? (e.g., narrow opening, overhead structure, surrounding items).
   - Consider the approach path for both picking *and* placing/interacting. Are there obstacles the arm needs to avoid during transit or the final approach?
   - Consider the orientation with respect to the robot base. Would it result in a constrained joint angle to reach that point?
6. Evaluate the given poses from the following aspects:
   - Placement Feasibility (CRITICAL): Assuming this SAME pose is used for the final placement or interaction, can the robot successfully complete the inferred task? Consider the required entry angle/orientation for the target location, potential collisions with the environment *at the target location* (e.g., hitting the sides of a container, an overhead constraint), and clearance for the gripper and arm during the placing/interaction motion. Explicitly state *why* the environment makes this pose suitable or unsuitable for the final step of the task.
   - Pick Feasibility: Can the robot arm physically reach this pose without colliding with the environment during the approach to pick? Does it result in awkward joint angles?
   - Grasp Stability: Is this a stable grasp on the object itself? Does it respect the object's shape, material, and potential fragility?
7. Select the optimal pose:
   - Choose the most stable, collision-free pose based on your analysis above that will result in the successful completion of the entire task (pick, transit, place/interact), and that is best suited for the inferred task's final requirements based on the environmental analysis.
8. Output description:
   - Your response MUST be a single, valid JSON object.
   - Do NOT output any text before or after the JSON object.
   - DO NOT include your thinking process in the output.
   - The output JSON should include "detected_poses", "pose_analysis", "potential_poses", "final_selected_pose" and "justification" sections. You should have corresponding content for each pose.
   - "detected_poses": a list of dicts, each dict has "pose_id" (color name) and a corresponding description of this pose.
   - "pose_analysis": a list of dicts, each dict has "pose_id" (color name) and a corresponding analysis of this pose following the instructions above.
   - "potential_poses": a list of all possible poses' color names.
   - "final_selected_pose": the color name of the optimal pose.
   - "justification": your reason for selecting the final_selected_pose as the best one.
   - Strictly adhere to the JSON format shown below.
   - The "pose_analysis" part for each pose MUST include reasoning about environmental factors and the feasibility of the ENTIRE inferred task (especially placement or final interaction).
   - The "justification" part MUST explicitly explain why the chosen pose is optimal in the context of the environment and the task, highlighting how it addresses potential environmental constraints or placement requirements better than rejected poses.
   - JSON format:

```json
{
  "detected_poses": [
    { "pose_id": "Pink", "description": "Description on this pose." },
    { "pose_id": "Yellow", "description": "Description on this pose." },
    { "pose_id": "Orange", "description": "Description on this pose." },
    { "pose_id": "Purple", "description": "Description on this pose." },
    { "pose_id": "Blue", "description": "Description on this pose." },
    { "pose_id": "Green", "description": "Description on this pose." },
    { "pose_id": "Red", "description": "Description on this pose." },
    { "pose_id": "Brown", "description": "Description on this pose." },
    { "pose_id": "Black", "description": "Description on this pose." }
  ],
  "pose_analysis": [
    { "pose_id": "Pink", "analysis": "Reasoning for this pose." },
    { "pose_id": "Yellow", "analysis": "Reasoning for this pose." },
    { "pose_id": "Orange", "analysis": "Reasoning for this pose." },
    { "pose_id": "Purple", "analysis": "Reasoning for this pose." },
    { "pose_id": "Blue", "analysis": "Reasoning for this pose." },
    { "pose_id": "Green", "analysis": "Reasoning for this pose." },
    { "pose_id": "Red", "analysis": "Reasoning for this pose." },
    { "pose_id": "Brown", "analysis": "Reasoning for this pose." },
    { "pose_id": "Black", "analysis": "Reasoning for this pose." }
  ],
  "potential_poses": A_LIST_OF_ALL_POSSIBLE_POSES,
  "final_selected_pose": THE_POSE_ID_OF_THE_FINAL_SELECTED_POSE,
  "justification": JUSTIFY YOUR SELECTION.
}
```

EXAMPLE:
Example Input: Pick the plant and place it at the box. A video showing 9 possible poses to grasp the plant and place it into a box.
Example Output:

```json
{
  "detected_poses": {
    "Pink": "top-down grasp",
    "Yellow": "angled side grasp",
    "Orange": "slanted grasp, angled from robot base",
    "Purple": "slanted grasp, angled towards plant base",
    "Blue": "slanted grasp, angled from plant front",
    "Green": "angled side grasp",
    "Red": "slanted grasp, angled from plant front",
    "Brown": "slanted grasp, angled from plant front",
    "Black": "side grasp"
  },
  "pose_analysis": [
    { "pose_id": "Pink", "analysis": "The pose is a top-down grasp on the plant. While potentially stable for picking, this top-down grasp would likely cause the robot's gripper or arm to collide with the overhead element above the plant. Therefore, this pose is unsuitable for the placement part of the inferred task." },
    { "pose_id": "Yellow", "analysis": "The pose is an angled side grasp. This approach seems to avoid collision with the overhead element identified near the placement container. The grasp appears stable on the plant's side. This pose appears feasible for pick and place." },
    { "pose_id": "Orange", "analysis": "The pose is an angled grasp from the robot base. While potentially avoiding the overhead collision, the stability could be lower than a direct side grasp given the plant's shape. Therefore, it is less optimal due to instability." },
    { "pose_id": "Purple", "analysis": "The pose is an angled grasp approaching the plant base. While this avoids the overhead element, it will result in joint constraints for the robot arm to get to that grasp. Furthermore, the gripper seems to be colliding into the plant and will likely cause it to topple when approaching from this angle. So, this pose is unsuitable." },
    { "pose_id": "Blue", "analysis": "The pose is an angled grasp from the plant front. This avoids the overhead element but is very far from the robot base and is unlikely to be reached. So, this pose is unsuitable." },
    { "pose_id": "Green", "analysis": "The pose is an angled side grasp. This may work since it will not be obstructed, assuming that we do not rotate the end effector in any way throughout the motion." },
    { "pose_id": "Red", "analysis": "The pose is an angled grasp from the plant front. Referring to the end effector, it seems that the gripper would miss the plant and is therefore invalid." },
    { "pose_id": "Brown", "analysis": "The pose is an angled grasp from the plant front. This is angled more vertically allowing it to be a reachable pose and potentially avoids the overhead element, making it a possible pose. However, it is not optimal." },
    { "pose_id": "Black", "analysis": "The pose is a side grasp. While this avoids the overhead element, knowing that the end effector's grasp pose will be the same throughout the entire motion, this will be problematic and will cause collision when placing the plant into the box. So, this pose is unsuitable." }
  ],
  "potential_poses": ["Yellow", "Orange", "Green", "Brown"],
  "final_selected_pose": "Yellow",
  "justification": "The pose is selected as optimal. It provides a stable angled side grasp on the plant and, crucially, allows for placement into the box without colliding with the overhead environmental structure identified as a key constraint. Top-down grasps are infeasible due to this collision risk during placement. This angled side grasp was preferred as it is closer to the robot base, being more reachable."
}
```

INPUT: Pick the {input_target_pick} and place it at {input_target_place}.
OUTPUT: Return just the formatted content without any other words.

Baseline

The baseline method uses a gaze-controlled panel, where users select visual markers on-screen to directly control the robot's 6-DoF arm movements and gripper operations. This method provides direct user control but demands a higher cognitive load and longer interaction time.
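For concreteness, the panel can be thought of as a mapping from gaze-selected markers to small end-effector jogs. The sketch below is purely illustrative; the button set, step size, and robot API are assumptions rather than the actual interface used in the study.

```python
# Illustrative sketch of a panel-to-command mapping for the baseline.
# Button names, step size, and the robot API are hypothetical.

PANEL_COMMANDS = {
    "+x": ( 1, 0, 0, 0, 0, 0), "-x": (-1, 0, 0, 0, 0, 0),
    "+y": ( 0, 1, 0, 0, 0, 0), "-y": ( 0,-1, 0, 0, 0, 0),
    "+z": ( 0, 0, 1, 0, 0, 0), "-z": ( 0, 0,-1, 0, 0, 0),
    "+roll": (0, 0, 0, 1, 0, 0), "-roll": (0, 0, 0,-1, 0, 0),
    # ... remaining rotation axes
}

def on_gaze_dwell(button: str, robot, step: float = 0.01):
    # Each dwell on a panel marker jogs the end effector by a small fixed step,
    # or opens/closes the gripper.
    if button in ("open", "close"):
        robot.set_gripper(button == "open")
    else:
        delta = [step * c for c in PANEL_COMMANDS[button]]
        robot.jog_end_effector(delta)   # translation (m) and rotation (rad) deltas
```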

User Study
Experimental setup. Left: the baseline 2D gaze control panel. Right: the user-study tasks span different difficulty levels. The more constrained the picking pose is, the harder the reasoning and execution become.

Results

The experimental evaluation demonstrated that GAMMA significantly reduced task completion time compared to the gaze-panel baseline, requiring less cognitive and physical effort from users. However, GAMMA showed variability in grasp success due to challenges in precise grasp prediction. While objectively more efficient, user feedback indicated a preference for the baseline method's direct control, highlighting the importance of balancing automation with user agency.

User Study Results
User Study Results. We present the subjective evaluation as average Likert-scale ratings from users, together with the objective measures of time spent and success rate. While users found GAMMA less demanding and spent less time on the task, their performance was higher when they had full control of the robot, as GAMMA often required an additional trial to correct a wrong prediction.
Trajectories
Visualization of sample trajectories. GAMMA trajectories are shorter and cleaner than gaze-panel trajectories.

Conclusion

GAMMA presents a promising framework for intuitive and efficient gaze-guided robotic manipulation. By integrating advanced foundation models with real-time gaze tracking, GAMMA achieves robust, scalable autonomy without task-specific training. The user study results underscore a nuanced trade-off between automation efficiency and user preference for direct control, suggesting future work should explore hybrid interfaces that combine intuitive automation with opportunities for user intervention.

BibTeX

@misc{gamma2025,
  title={Intent at a Glance: Gaze-Guided Robotic Manipulation via Foundation Models},
  author={Author 1 and Author 2 and Author 3 and ...},
  year={2025},
  url={https://arxiv.org/abs/xxxx.xxxxx}
}