Learning skills in open-world environments is essential for developing agents that can handle a variety of tasks by composing basic skills. Online demonstration videos are typically long but unsegmented, making it difficult to divide them into skill-consistent clips and label each with a skill identifier. Unlike existing methods that rely on random splitting or human labeling, we develop a self-supervised approach that segments these long videos into a series of semantic-aware and skill-consistent segments. Drawing inspiration from human cognitive event segmentation theory, we introduce Skill Boundary Detection (SBD), an annotation-free temporal video segmentation algorithm. SBD detects skill boundaries in a video by leveraging prediction errors from a pretrained unconditional action-prediction model, based on the assumption that a significant increase in prediction error indicates a shift in the skill being executed. We evaluate our method in Minecraft, a rich open-world simulator with extensive gameplay videos available online. Compared to unsegmented baselines, SBD-generated segments yield relative performance improvements of 63.7% and 52.1% for conditioned policies on short-term atomic tasks, and 11.3% and 20.8% for their corresponding hierarchical agents on long-horizon tasks. Our method makes it possible to leverage diverse YouTube videos to train instruction-following agents. The project page is at https://craftjarvis.github.io/SkillDiscovery/.
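The core idea behind SBD can be sketched as thresholding the per-frame errors of an action-prediction model: frames where the error spikes are taken as skill boundaries. The sketch below is a minimal illustration of that idea, not the paper's implementation; the function name, the input format, and the mean-plus-k-standard-deviations threshold are all assumptions for illustration.

```python
import numpy as np

def detect_skill_boundaries(errors, k=2.0):
    """Flag frame t as a skill boundary when its action-prediction error
    spikes above a simple statistical threshold (mean + k * std).

    errors: per-frame prediction errors from a pretrained unconditional
    action-prediction model (hypothetical input; the paper's actual
    boundary criterion may differ from this threshold rule).
    """
    errors = np.asarray(errors, dtype=float)
    mu, sigma = errors.mean(), errors.std()
    # A significant jump in prediction error is assumed to signal
    # a shift in the skill being executed.
    return [t for t, e in enumerate(errors) if e > mu + k * sigma]

# Toy example: low errors within a skill, one spike at the transition.
errs = [0.1, 0.12, 0.09, 0.11, 0.95, 0.1, 0.08]
print(detect_skill_boundaries(errs))  # the spike at index 4 is flagged: [4]
```

Detected boundary indices would then cut the long video into skill-consistent segments for training conditioned policies.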