Overview
The 1st Social Mobile Manipulation Challenge, organized around the insights presented in InfiniteWorld, focuses on developing embodied AI agents capable of performing long sequences of complex tasks through social interaction.
Competition Description
The challenge includes two interaction modes: human-robot and robot-robot. The goal is to advance the community's capabilities in areas such as human intention reasoning, long-horizon task planning, and effective social interaction between embodied agents.
Track 1: Vertical Interaction
This track focuses on vision-language navigation. The participating agent receives a partial scene graph as a prompt to perform tasks such as "pick up an object from one place and place it at another." The task descriptions are simple yet ambiguous, challenging the agent's understanding of both the instructions and the scene.
Task and Data Format
- Task instruction: e.g., "Please take the wreath from the living room and place it on the kitchen island."
- Task decomposition: includes sub-tasks like moving to the target location, grasping, moving, and releasing.
- Provided information: scene ID, target details, target-centered scene graph, and the optimal navigation path.
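To make the data format above concrete, here is a minimal sketch of what one Track 1 task record might look like. The field names (`scene_id`, `decomposition`, `optimal_path`, etc.) and the values are illustrative assumptions, not the challenge's official schema.

```python
# Hypothetical Track 1 task record covering the fields listed above:
# task instruction, decomposition, scene ID, target-centered scene graph,
# and optimal navigation path. All names/values are illustrative.
task = {
    "scene_id": "scene_0042",
    "instruction": ("Please take the wreath from the living room "
                    "and place it on the kitchen island."),
    # Sub-tasks: move to the target, grasp, move, release.
    "decomposition": [
        ("move_to", "living_room"),
        ("grasp", "wreath"),
        ("move_to", "kitchen_island"),
        ("release", "wreath"),
    ],
    # Target-centered scene graph as an adjacency mapping.
    "scene_graph": {
        "wreath": ["living_room"],
        "living_room": ["hallway"],
        "hallway": ["kitchen"],
    },
    # Optimal navigation path as (x, y) waypoints in metres.
    "optimal_path": [(0.0, 0.0), (1.2, 0.0), (1.2, 2.4)],
}

print([step[0] for step in task["decomposition"]])
```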
Model Input and Output
- Model Input: Task instruction, RGB-D observations from three viewpoints (-90°, 0°, +90°), and the current pose.
- Model Output: Atomic actions with corresponding numerical values, including “MoveForward”, “TurnLeft”, “TurnRight”, and “Stop”.
Actions and Parameters
Action | Description | Action Value (Range)
---|---|---
MoveForward | Move forward by a specified distance | 0 m ~ 1.50 m
TurnLeft | Rotate left by a specified angle | 0° ~ 90°
TurnRight | Rotate right by a specified angle | 0° ~ 90°
Stop | End the current navigation episode | 0
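The atomic actions and limits above can be sketched as a small validator that clamps each action value to its allowed range. The `Action` dataclass and `clamp_action` helper are illustrative assumptions; the limits follow the table.

```python
# Sketch of validating atomic actions against the ranges in the table.
# Action names and limits come from the table; the helper itself is
# an assumption, not part of the official challenge API.
from dataclasses import dataclass

LIMITS = {
    "MoveForward": (0.0, 1.50),  # metres
    "TurnLeft": (0.0, 90.0),     # degrees
    "TurnRight": (0.0, 90.0),    # degrees
    "Stop": (0.0, 0.0),          # no parameter
}

@dataclass
class Action:
    name: str
    value: float

def clamp_action(name: str, value: float) -> Action:
    """Clip the requested value into the action's allowed range."""
    lo, hi = LIMITS[name]
    return Action(name, min(max(value, lo), hi))

print(clamp_action("MoveForward", 2.0))  # value clipped to 1.50 m
print(clamp_action("TurnLeft", 45.0))
```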
Evaluation Metrics
- SR (Success Rate): Task completion rate
- SPL (Success weighted by Path Length): Task completion efficiency
- CR (Collision Rate): Collision rate
- Avg Steps (Average Steps): Average number of steps
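The four metrics above can be computed per batch of episodes as sketched below. SPL follows the standard definition (success weighted by the ratio of shortest-path length to actual path length); the episode field names, and the reading of CR as the fraction of episodes with at least one collision, are assumptions for illustration.

```python
# Sketch of SR, SPL, CR, and Avg Steps over a batch of episodes.
# Episode field names are illustrative; CR here counts episodes with
# at least one collision, which may differ from the official metric.
def evaluate(episodes):
    n = len(episodes)
    sr = sum(e["success"] for e in episodes) / n
    # SPL: success weighted by shortest_path / max(path_length, shortest_path)
    spl = sum(
        e["success"] * e["shortest_path"] / max(e["path_length"], e["shortest_path"])
        for e in episodes
    ) / n
    cr = sum(e["collided"] for e in episodes) / n
    avg_steps = sum(e["steps"] for e in episodes) / n
    return {"SR": sr, "SPL": spl, "CR": cr, "AvgSteps": avg_steps}

episodes = [
    {"success": 1, "shortest_path": 5.0, "path_length": 10.0, "collided": 0, "steps": 40},
    {"success": 0, "shortest_path": 6.0, "path_length": 6.0, "collided": 1, "steps": 80},
]
print(evaluate(episodes))  # SR=0.5, SPL=0.25, CR=0.5, AvgSteps=60.0
```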
Challenges
- Complex tasks: The agent must locate multiple targets simultaneously.
- Ambiguous instructions: Simple task descriptions require strong comprehension.
- The design of high-level prompts is crucial for task success.
Track 2: Horizontal Interaction
This track focuses on a multi-robot cooperative scenario. Two robots operate independently in the same scene but periodically share their observed scene maps to improve task efficiency. The robots use the Stretch-3 model and perform actions via atomic commands.
Robot and Actions
- Robot Model: Stretch-3.
- Atomic Actions: Move forward, turn left, and turn right. Each action requires specifying an action value (e.g., distance or angle) within predefined limits.
Observations and Task Instructions
- Each robot receives a task instruction, RGB-D observations from three viewpoints, and its current position state.
Scene Graph Sharing
After a certain number of steps, the two robots exchange their respective scene graphs (node graphs) to complete tasks more quickly. For example, a node table might include:
Node ID | Type | Description | Connected Nodes
---|---|---|---
N1 | Room | Starting area | N2, N3 |
N2 | Hallway | Corridor to other areas | N1, N4 |
N3 | Object | Detected table | N1 |
N4 | Door | Exit leading to another room | N2, N5 |
N5 | Room | Adjacent room | N4 |
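The periodic exchange described above amounts to merging two node tables like the one shown: a union of nodes plus a union of each node's connections. The dictionary representation below is an assumption for illustration, not the challenge's wire format.

```python
# Sketch of merging two robots' scene graphs (node tables like the one
# above) into one shared map. The graph representation is an assumption.
def merge_graphs(g1, g2):
    merged = {}
    for graph in (g1, g2):
        for node_id, info in graph.items():
            if node_id not in merged:
                merged[node_id] = {"type": info["type"], "edges": set(info["edges"])}
            else:
                # Same node seen by both robots: union the connections.
                merged[node_id]["edges"] |= set(info["edges"])
    return merged

robot_a = {"N1": {"type": "Room", "edges": {"N2", "N3"}},
           "N2": {"type": "Hallway", "edges": {"N1", "N4"}}}
robot_b = {"N2": {"type": "Hallway", "edges": {"N1"}},
           "N4": {"type": "Door", "edges": {"N2", "N5"}}}

merged = merge_graphs(robot_a, robot_b)
print(sorted(merged))  # ['N1', 'N2', 'N4']
```

A set union keeps the merge idempotent, so repeated exchanges after every sharing interval do not duplicate edges.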
Evaluation Metrics
- SR (Success Rate): Task completion rate
- SPL (Success weighted by Path Length): Task completion efficiency
- CR (Collision Rate): Collision rate
- Avg Steps (Average Steps): Average number of steps