GarmentPile++: Affordance-Driven Cluttered Garments Retrieval with Vision-Language Reasoning
AI Summary
Proposes a vision-language-reasoning-based retrieval scheme for cluttered garment piles, enabling safe and accurate single-garment retrieval.
Key Contributions
- Proposes a garment retrieval pipeline based on vision-language reasoning
- Uses SAM2 for garment segmentation to enhance the VLM's awareness of each garment's state
- Introduces a dual-arm cooperation framework to handle challenging garments
Methodology
Combines visual segmentation (SAM2), VLM reasoning, and visual affordance perception to achieve safe, single-garment retrieval from cluttered piles.
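The high-level flow above (segment the pile, reason with a VLM over per-garment cues, then hand off to an affordance module) can be sketched as follows. This is a minimal illustration, not the authors' implementation; every function and field (`vlm_select_garment`, `occlusion`, `retrieve`) is a hypothetical placeholder standing in for SAM2, the VLM, and the affordance model.

```python
from dataclasses import dataclass

@dataclass
class Garment:
    mask_id: int        # segmentation mask index (stand-in for SAM2 output)
    description: str    # caption of the garment visible in the mask
    occlusion: float    # fraction covered by other garments, 0.0 to 1.0

def vlm_select_garment(garments, instruction):
    """Stand-in for VLM reasoning: pick the garment matching the language
    instruction, preferring less-occluded ones so that exactly one garment
    can be retrieved cleanly per attempt."""
    matches = [g for g in garments if instruction.lower() in g.description.lower()]
    if not matches:
        return None
    return min(matches, key=lambda g: g.occlusion)

def retrieve(garments, instruction):
    """One retrieval attempt: select a target, then (in the real pipeline)
    an affordance model would map its mask to a grasp point."""
    target = vlm_select_garment(garments, instruction)
    if target is None:
        return None
    return target.mask_id
```

In the actual pipeline, segmentation masks come from SAM2 and grasp points from a learned affordance model; this sketch only shows how the reasoning step sits between them.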
Original Abstract
Garment manipulation has attracted increasing attention due to its critical role in home-assistant robotics. However, the majority of existing garment manipulation works assume an initial state consisting of only one garment, while piled garments are far more common in real-world settings. To bridge this gap, we propose a novel garment retrieval pipeline that can not only follow language instruction to execute safe and clean retrieval but also guarantee exactly one garment is retrieved per attempt, establishing a robust foundation for the execution of downstream tasks (e.g., folding, hanging, wearing). Our pipeline seamlessly integrates vision-language reasoning with visual affordance perception, fully leveraging the high-level reasoning and planning capabilities of VLMs alongside the generalization power of visual affordance for low-level actions. To enhance the VLM's comprehensive awareness of each garment's state within a garment pile, we employ a visual segmentation model (SAM2) to execute object segmentation on the garment pile, aiding VLM-based reasoning with sufficient visual cues. A mask fine-tuning mechanism is further integrated to address scenarios where the initial segmentation results are suboptimal. In addition, a dual-arm cooperation framework is deployed to address cases involving large or long garments, as well as excessive garment sagging caused by incorrect grasping point determination, both of which are strenuous for a single arm to handle. The effectiveness of our pipeline is consistently demonstrated across diverse tasks and varying scenarios in both real-world and simulation environments. Project page: https://garmentpile2.github.io/.