Overview
The 1st Social Mobile Manipulation Challenge, organized around the insights presented in InfiniteWorld, focuses on developing embodied AI agents capable of performing long sequences of complex tasks through social interaction.
Competition Description
The challenge includes two interaction modes: human-robot and robot-robot. The goal is to advance the community's capabilities in areas such as human intention reasoning, long-horizon task planning, and effective social interaction between embodied agents.
Track 1: Vertical Interaction
This track focuses on vision-language navigation. The participating agent receives a partial scene graph as a prompt to perform tasks such as "pick up an object from one place and place it at another." The task descriptions are simple yet ambiguous, challenging the agent's understanding of both the instructions and the scene.
Task and Data Format
- Task instruction: e.g., "Please take the wreath from the living room and place it on the kitchen island."
- Task decomposition: includes sub-tasks like moving to the target location, grasping, moving, and releasing.
- Provided information: scene ID, target details, target-centered scene graph, and the optimal navigation path.
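To make the data format above concrete, here is a minimal sketch of what one Track 1 task record might look like. The field names (`scene_id`, `decomposition`, `optimal_path`, etc.) and the values are illustrative assumptions, not the challenge's official schema.

```python
# Hypothetical Track 1 task record covering the fields listed above:
# task instruction, decomposition, scene ID, target-centered scene graph,
# and optimal navigation path. All names/values are illustrative.
task = {
    "scene_id": "scene_0042",
    "instruction": ("Please take the wreath from the living room "
                    "and place it on the kitchen island."),
    # Sub-tasks: move to the target, grasp, move, release.
    "decomposition": [
        ("move_to", "living_room"),
        ("grasp", "wreath"),
        ("move_to", "kitchen_island"),
        ("release", "wreath"),
    ],
    # Target-centered scene graph as an adjacency mapping.
    "scene_graph": {
        "wreath": ["living_room"],
        "living_room": ["hallway"],
        "hallway": ["kitchen"],
    },
    # Optimal navigation path as (x, y) waypoints in metres.
    "optimal_path": [(0.0, 0.0), (1.2, 0.0), (1.2, 2.4)],
}

print([step[0] for step in task["decomposition"]])
```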
Model Input and Output
- Model Input: Task instruction, RGB-D observations from three viewpoints (-90°, 0°, +90°), and the current pose.
- Model Output: Atomic actions with corresponding numerical values, including “MoveForward”, “TurnLeft”, “TurnRight”, and “Stop”.
Actions and Parameters
Action | Description | Action Value (Range)
---|---|---
MoveForward | Move forward by a specified distance | 0 m ~ 1.50 m
TurnLeft | Rotate left by a specified angle | 0° ~ 90°
TurnRight | Rotate right by a specified angle | 0° ~ 90°
Stop | End the current navigation episode | 0
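The atomic actions and limits above can be sketched as a small validator that clamps each action value to its allowed range. The `Action` dataclass and `clamp_action` helper are illustrative assumptions; the limits follow the table.

```python
# Sketch of validating atomic actions against the ranges in the table.
# Action names and limits come from the table; the helper itself is
# an assumption, not part of the official challenge API.
from dataclasses import dataclass

LIMITS = {
    "MoveForward": (0.0, 1.50),  # metres
    "TurnLeft": (0.0, 90.0),     # degrees
    "TurnRight": (0.0, 90.0),    # degrees
    "Stop": (0.0, 0.0),          # no parameter
}

@dataclass
class Action:
    name: str
    value: float

def clamp_action(name: str, value: float) -> Action:
    """Clip the requested value into the action's allowed range."""
    lo, hi = LIMITS[name]
    return Action(name, min(max(value, lo), hi))

print(clamp_action("MoveForward", 2.0))  # value clipped to 1.50 m
print(clamp_action("TurnLeft", 45.0))
```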
Evaluation Metrics
- SR (Success Rate): Task completion rate
- SPL (Success weighted by Path Length): Task completion efficiency
- CR (Collision Rate): Collision rate
- Avg Steps (Average Steps): Average number of steps
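The four metrics above can be computed per batch of episodes as sketched below. SPL follows the standard definition (success weighted by the ratio of shortest-path length to actual path length); the episode field names, and the reading of CR as the fraction of episodes with at least one collision, are assumptions for illustration.

```python
# Sketch of SR, SPL, CR, and Avg Steps over a batch of episodes.
# Episode field names are illustrative; CR here counts episodes with
# at least one collision, which may differ from the official metric.
def evaluate(episodes):
    n = len(episodes)
    sr = sum(e["success"] for e in episodes) / n
    # SPL: success weighted by shortest_path / max(path_length, shortest_path)
    spl = sum(
        e["success"] * e["shortest_path"] / max(e["path_length"], e["shortest_path"])
        for e in episodes
    ) / n
    cr = sum(e["collided"] for e in episodes) / n
    avg_steps = sum(e["steps"] for e in episodes) / n
    return {"SR": sr, "SPL": spl, "CR": cr, "AvgSteps": avg_steps}

episodes = [
    {"success": 1, "shortest_path": 5.0, "path_length": 10.0, "collided": 0, "steps": 40},
    {"success": 0, "shortest_path": 6.0, "path_length": 6.0, "collided": 1, "steps": 80},
]
print(evaluate(episodes))  # SR=0.5, SPL=0.25, CR=0.5, AvgSteps=60.0
```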
Challenges
- Complex tasks: The agent must locate multiple targets simultaneously.
- Ambiguous instructions: Simple task descriptions require strong comprehension.
- The design of high-level prompts is crucial for task success.
Track 2: Horizontal Interaction
This track focuses on a multi-robot cooperative scenario. Two robots operate independently in the same scene but periodically share their observed scene maps to improve task efficiency. The robots use the Stretch-3 model and perform actions via atomic commands.
Robot and Actions
- Robot Model: Stretch-3.
- Atomic Actions: Move forward, turn left, and turn right. Each action requires specifying an action value (e.g., distance or angle) within predefined limits.
Observations and Task Instructions
- Each robot receives a task instruction, RGB-D observations from three viewpoints, and its current position state.
Scene Graph Sharing
After a certain number of steps, the two robots exchange their respective scene graphs (node graphs) to complete tasks more quickly. For example, a node table might include:
Node ID | Type | Description | Connected Nodes
---|---|---|---
N1 | Room | Starting area | N2, N3 |
N2 | Hallway | Corridor to other areas | N1, N4 |
N3 | Object | Detected table | N1 |
N4 | Door | Exit leading to another room | N2, N5 |
N5 | Room | Adjacent room | N4 |
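The periodic exchange described above amounts to merging two node tables like the one shown: a union of nodes plus a union of each node's connections. The dictionary representation below is an assumption for illustration, not the challenge's wire format.

```python
# Sketch of merging two robots' scene graphs (node tables like the one
# above) into one shared map. The graph representation is an assumption.
def merge_graphs(g1, g2):
    merged = {}
    for graph in (g1, g2):
        for node_id, info in graph.items():
            if node_id not in merged:
                merged[node_id] = {"type": info["type"], "edges": set(info["edges"])}
            else:
                # Same node seen by both robots: union the connections.
                merged[node_id]["edges"] |= set(info["edges"])
    return merged

robot_a = {"N1": {"type": "Room", "edges": {"N2", "N3"}},
           "N2": {"type": "Hallway", "edges": {"N1", "N4"}}}
robot_b = {"N2": {"type": "Hallway", "edges": {"N1"}},
           "N4": {"type": "Door", "edges": {"N2", "N5"}}}

merged = merge_graphs(robot_a, robot_b)
print(sorted(merged))  # ['N1', 'N2', 'N4']
```

A set union keeps the merge idempotent, so repeated exchanges after every sharing interval do not duplicate edges.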
Evaluation Metrics
- SR (Success Rate): Task completion rate
- SPL (Success weighted by Path Length): Task completion efficiency
- CR (Collision Rate): Collision rate
- Avg Steps (Average Steps): Average number of steps