We introduce GAMMA (Gaze Assisted Manipulation for Modular Autonomy), a robotic manipulation system that combines gaze tracking with vision-language foundation models (VLMs). The user's gaze points, captured by wearable smart glasses, are mapped onto the robot's perspective. VLMs interpret these gaze points to infer user intent and generate the corresponding robotic commands for perception, planning, and task execution. GAMMA relies on pretrained models to achieve zero-shot generalization without task-specific training, enabling flexible and scalable robotic autonomy.
GAMMA integrates real-time gaze tracking from Meta's Project Aria glasses, transforming egocentric gaze data into target points in the robot's frame. It uses SAM2 for object segmentation and Contact-GraspNet for grasp prediction, augmented by multi-viewpoint grasp validation. High-level task inference and context-aware grasp selection are performed by specialized VLMs, which apply chain-of-thought reasoning prompted through visual and textual cues.
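To make the flow of these components concrete, the sketch below outlines the pipeline's control flow in Python. It is a minimal illustration only: the function names, signatures, and placeholder values are assumptions for this sketch, not the actual GAMMA implementation or the APIs of SAM2 and Contact-GraspNet.

```python
# Hypothetical sketch of the GAMMA control flow described above; the function
# names, signatures, and placeholder values are illustrative, not the real GAMMA API.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Grasp:
    pose: Tuple[float, ...]   # candidate 6-DoF grasp pose
    score: float              # e.g. a Contact-GraspNet confidence score


def map_gaze_to_robot_frame(gaze_px: Tuple[float, float]) -> Tuple[float, float]:
    """Project an egocentric (Aria) gaze pixel into the robot camera image (stub)."""
    return gaze_px  # a calibrated glasses-to-robot transform would be applied here


def infer_intent(gaze_points: List[Tuple[float, float]]) -> dict:
    """Ask the intent VLM which objects the gaze dots lie on and what action links them (stub)."""
    return {"action": "pick_and_place", "objects": ["cup", "basket"]}


def select_grasp(candidates: List[Grasp], intent: dict) -> Grasp:
    """Ask the grasp-selection VLM for the pose that best completes the whole task (stub)."""
    return max(candidates, key=lambda g: g.score)


def run_pipeline(raw_gaze: List[Tuple[float, float]]) -> None:
    gaze_points = [map_gaze_to_robot_frame(g) for g in raw_gaze]
    intent = infer_intent(gaze_points)            # high-level task inference
    # In GAMMA, SAM2 segments the gazed object and Contact-GraspNet proposes
    # grasps on it; a single dummy candidate stands in for that output here.
    candidates = [Grasp(pose=(0.0,) * 6, score=0.5)]
    grasp = select_grasp(candidates, intent)      # context-aware grasp selection
    print(f"Executing {intent['action']} with grasp score {grasp.score}")


if __name__ == "__main__":
    run_pipeline([(612.0, 340.0), (880.0, 410.0)])
```

The two prompts used for high-level task inference and for grasp selection are reproduced in full below.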
You are an expert in recognizing objects in a scene and guessing what people want to do with those objects.
Given an image, follow the INSTRUCTIONS below to infer what the intent is.
Determine the sequence of the functions listed below that is necessary to complete the intended task.
Include the appropriate gaze point numbers as parameters to each function.
INSTRUCTIONS:
1. Input description: There will be red dots marked with numbers in the image. Each number represents the order of that red dot.
2. Detect exactly which object each dot is on.
- You have to choose an object for each dot from the object pool: [watering can, yellow beaker, potted plant, basket, flip-top candy, drawer, box, wooden shelf, table, jar-shaped nail polish, coffee maker, coffee pod, cup]
- It is possible that objects are the same.
- If you think the dot is on the background table surface, select an object that is closest to the dot.
- If you are not sure, just give your best guess.
3. Choose the action that most reasonably applies between the two detected objects.
- You have to choose an action from the action pool: [pick_and_place, pouring]
4. Output description: You MUST output ONLY 1 sentence in the specified format as a STRING: "action: selected_action, 1: object_name_1, 2: object_name_2"
EXAMPLE:
Input: An image with 2 dots in the image. Dot number 1 is on a piece of bread and dot number 2 is on a basket.
Your thinking process should be:
- First, you will detect that dot 1 is on "bread" and dot 2 is on "basket".
- Second, based on "bread" and "basket", the most plausible action between them would be "pick_and_place".
- Third, output in the specified string format.
Output:
"action: pick_and_place, 1: bread, 2: basket"
You are a pose selector for assisting a robot to grasp an object in pick-and-place tasks.
TASK:
Given a video of a 3D point cloud with 9 possible grasping poses and the pick-and-place task description, follow the INSTRUCTIONS to select the one pose that is most likely to help the robot complete the task successfully.
INSTRUCTIONS:
1. Input description: The video shows a robot arm base, an object to pick, an object to place, and nine potential grasp poses indicated by the coloured rendered arms. A sentence describing the task will be provided in the format "Pick the OBJECT_TO_PICK and place it at OBJECT_TO_PLACE."
2. Your goal is to identify the single optimal grasp pose to complete a task, considering the entire environment and the object's properties. Note that the robot end effector cannot change orientation once an object is grabbed. In other words, the grasp pose is the pose that the end effector will be in throughout the entire motion. Factor this into your analysis of each pose and your decision making process.
3. Observe the scene. Identify the robot arm, the target object, the grasp markers, and *all other objects and significant environmental features* (e.g., containers, fixtures, surfaces, walls, potential obstructions). Note their spatial relationships.
4. Object analysis.
- Think about the characteristics of OBJECT_TO_PICK and OBJECT_TO_PLACE.
- Consider whether the shape of OBJECT_TO_PLACE suggests a specific orientation for placement or use in this environment.
5. Analyze the environmental constraints and placement/interaction restrictions.
- For OBJECT_TO_PLACE, is it constrained? Is there a required entry angle or orientation for placing/inserting the object? Are there obstructions *near this target area* that the robot arm, gripper, or the *object being held* might collide with during the final placement/interaction motion? (e.g., narrow opening, overhead structure, surrounding items).
- Consider the approach path for both picking *and* placing/interacting. Are there obstacles the arm needs to avoid during transit or the final approach?
- Consider the orientation with respect to the robot base. Would it result in a constrained joint angle to reach that point?
6. Evaluate the given poses from the following aspects:
- Placement Feasibility (CRITICAL): Assuming this SAME pose is used for the final placement or interaction, can the robot successfully complete the inferred task? Consider the required entry angle/orientation for the target location, potential collisions with the environment *at the target location* (e.g., hitting the sides of a container, an overhead constraint), and clearance for the gripper and arm during the placing/interaction motion. Explicitly state *why* the environment makes this pose suitable or unsuitable for the final step of the task.
- Pick Feasibility: Can the robot arm physically reach this pose without colliding with the environment during the approach to pick? Does it result in awkward joint angles?
- Grasp Stability: Is this a stable grasp on the object itself? Does it respect the object's shape, material, and potential fragility?
7. Select the optimal pose:
- Based on your analysis above, choose the most stable, collision-free pose that will result in successful completion of the entire task (pick, transit, place/interact) and is best suited to the task's final requirements given the environmental analysis.
8. Output description:
- Your response MUST be a single, valid JSON object.
- Do NOT output any text before or after the JSON object.
- DO NOT include your thinking process in the output.
- The output JSON should include "detected_poses", "pose_analysis", "potential_poses", "final_selected_pose" and "justification" sections. You should have corresponding content for each pose.
- "detected_poses": a list of dicts, each dict has "pose_id" (color name) and corresponding description of this pose.
- "pose_analysis": a list of dicts, each dict has "pose_id" (color name) and corresponding analysis of this pose following the instruction above.
- "potential_poses": a list of all possible poses' color names.
- "final_selected_pose": the color name of the optimal pose.
- "justification": your reason for selecting the final_selected_pose as the best one.
- Strictly adhere to the JSON format shown below.
- The "pose_analysis" part for each pose MUST include reasoning about environmental factors and the feasibility of the ENTIRE inferred task (especially placement or final interaction).
- The "justification" part MUST explicitly explain why the chosen pose is optimal in the context of the environment and the task, highlighting how it addresses potential environmental constraints or placement requirements better than rejected poses.
- JSON format:
```json
{
"detected_poses": [
{
"pose_id": "Pink",
"description": "Description on this pose."
},
{
"pose_id": "Yellow",
"description": "Description on this pose."
},
{
"pose_id": "Orange",
"description": "Description on this pose."
},
{
"pose_id": "Purple",
"description": "Description on this pose."
},
{
"pose_id": "Blue",
"description": "Description on this pose."
},
{
"pose_id": "Green",
"description": "Description on this pose."
},
{
"pose_id": "Red",
"description": "Description on this pose."
},
{
"pose_id": "Brown",
"description": "Description on this pose."
},
{
"pose_id": "Black",
"description": "Description on this pose."
}
],
"pose_analysis": [
{
"pose_id": "Pink",
"analysis": "Reasoning for this pose."
},
{
"pose_id": "Yellow",
"analysis": "Reasoning for this pose."
},
{
"pose_id": "Orange",
"analysis": "Reasoning for this pose."
},
{
"pose_id": "Purple",
"analysis": "Reasoning for this pose."
},
{
"pose_id": "Blue",
"analysis": "Reasoning for this pose."
},
{
"pose_id": "Green",
"analysis": "Reasoning for this pose."
},
{
"pose_id": "Red",
"analysis": "Reasoning for this pose."
},
{
"pose_id": "Brown",
"analysis": "Reasoning for this pose."
},
{
"pose_id": "Black",
"analysis": "Reasoning for this pose."
}
],
"potential_poses": A_LIST_OF_ALL_POSSIBLE_POSES,
"final_selected_pose": THE_POSE_ID_OF_THE_FINAL_SELECTED_POSE,
"justification": JUSTIFY YOUR SELECTION.
}
```
EXAMPLE:
Example Input:
Pick the plant and place it at the box. A video showing 9 possible poses to grasp the plant and place it into a box.
Example Output:
```json
{
"detected_poses": {
"Pink": "top-down grasp", --
"Yellow": "angled side grasp",
"Orange": "slanted grasp, angled from robot base",---
"Purple": "slanted grasp, angled towards plant base",--
"Blue": "slanted grasp, angled from plant front",
"Green": "angled side grasp",--
"Red": "slanted grasp, angled from plant front",--
"Brown": "slanted grasp, angled from plant front",
"Black": "side grasp"--
},
"pose_analysis": [
{
"pose_id": "Pink",
"analysis": "The pose is a top-down grasp on the plant. While potentially stable for picking, this top-down grasp would likely cause the robot's gripper or arm to collide with the overhead element above the plant. Therefore, this pose is unsuitable for the placement part of the inferred task."
},
{
"pose_id": "Yellow",
"analysis": "The pose is an angled side grasp. This approach seems to avoid collision with the overhead element identified near the placement container. The grasp appears stable on the plant's side. This pose appears feasible for pick and place."
},
{
"pose_id": "Orange",
"analysis": "The pose is an angled grasp from the robot base. While potentially avoiding the overhead collision, the stability could be lower than a direct side grasp given the plant's shape. Therefore, it is less optimal due to instability."
},
{
"pose_id": "Purple",
"analysis": "The pose is an angled grasp approaching the plant base. While this avoids the overhead element, it will result in joint constraints for the robot arm to get to that grasp. Furthermore, the gripper seems to be colliding into the plant and will likely cause it to topple when approaching from this angle. So, this pose is unsuitable."
},
{
"pose_id": "Blue",
"analysis": "The pose is an angled grasp from the plant front. This avoids the overhead element but is very far from the robot base and is unlikely to be reached. So, this pose is unsuitable."
},
{
"pose_id": "Green",
"analysis": "The pose is an angled side grasp. This may work since it will not be obstructed, assuming that we do not rotate the end effector in any way throughout the motion."
},
{
"pose_id": "Red",
"analysis": "The pose is an angled grasp from the plant front. Referring to the end effector, it seems that the gripper would miss the plant and is therefore invalid."
},
{
"pose_id": "Brown",
"analysis": "The pose is an angled grasp from the plant front. This is angled more vertically allowing it to be a reachable pose and potentially avoids the overhead element, making it a possible pose. However, it is not optimal."
},
{
"pose_id": "Black",
"analysis": "The pose is a side grasp. While this avoids the overhead element, knowing that the end effector's grasp pose will be the same throughout the entire motion, this will be problematic and will cause collision when placing the plant into the box. So, this pose is unsuitable."
}
],
"potential_poses": ["Yellow", "Orange", "Green", "Brown"],
"final_selected_pose": "Yellow",
"justification": "The pose is selected as optimal. It provides a stable angled side grasp on the plant and, crucially, allows for placement into the box without colliding with the overhead environmental structure identified as a key constraint. Top-down grasps are infeasible due to this collision risk during placement. This angled side grasp was preferred as it is closer to the robot base, being more reachable."
}
```
INPUT:
Pick the {input_target_pick} and place it at {input_target_place}.
OUTPUT:
Return just the formatted content without any other words.
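Downstream code ultimately needs only the selected pose, but the model's reply can be sanity-checked against the schema defined in the prompt. The helper below is a hedged sketch: the required keys and the nine colour names come from the prompt above, while the function itself and its error handling are assumptions.

```python
import json

REQUIRED_KEYS = {"detected_poses", "pose_analysis", "potential_poses",
                 "final_selected_pose", "justification"}
POSE_IDS = {"Pink", "Yellow", "Orange", "Purple", "Blue",
            "Green", "Red", "Brown", "Black"}


def validate_pose_response(raw: str) -> str:
    """Check a grasp-selector reply against the schema above and return the chosen pose.

    Illustrative sketch; real error handling would depend on the integration.
    """
    data = json.loads(raw)                  # the prompt requires bare JSON, no extra text
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    analysed = {entry["pose_id"] for entry in data["pose_analysis"]}
    if analysed != POSE_IDS:
        raise ValueError("pose_analysis must cover all nine colour-coded poses")
    chosen = data["final_selected_pose"]
    if chosen not in data["potential_poses"]:
        raise ValueError("final_selected_pose should appear in potential_poses")
    return chosen
```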
The baseline method uses a gaze-controlled panel in which users select on-screen visual markers to directly command the robot's 6-DoF arm movements and gripper operations. This gives the user direct control but imposes a higher cognitive load and longer interaction times.
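For a concrete picture of this baseline, the snippet below sketches one way such a panel could map gaze-selected markers to Cartesian jog commands; the marker names, step size, and callbacks are hypothetical and are not the baseline's actual interface.

```python
# Hypothetical sketch of a marker-to-command mapping for a gaze-controlled panel;
# marker names, step sizes, and callbacks are illustrative, not the actual baseline UI.
JOG_STEP_M = 0.02  # Cartesian step per marker selection, in metres

PANEL_COMMANDS = {
    "x+": ( JOG_STEP_M, 0.0, 0.0),
    "x-": (-JOG_STEP_M, 0.0, 0.0),
    "y+": (0.0,  JOG_STEP_M, 0.0),
    "y-": (0.0, -JOG_STEP_M, 0.0),
    "z+": (0.0, 0.0,  JOG_STEP_M),
    "z-": (0.0, 0.0, -JOG_STEP_M),
    # rotation markers (roll/pitch/yaw) would follow the same pattern
}


def on_marker_selected(marker, send_translation, toggle_gripper):
    """Dispatch a gaze-selected panel marker to the robot (illustrative)."""
    if marker == "gripper":
        toggle_gripper()
    else:
        send_translation(*PANEL_COMMANDS[marker])
```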