eScholarship
Open Access Publications from the University of California


UCLA Electronic Theses and Dissertations

Scene Abstraction for Generalizable Long-horizon Robot Planning

Abstract

Humans excel at abstracting raw information into meaningful high-level representations, an ability that forms the foundation for understanding complex situations and making sophisticated decisions in novel scenarios. In contrast, robots often struggle to solve complex tasks and generalize to unseen situations due to their limited abstraction capabilities. This dissertation presents a novel scene abstraction perspective and a holistic framework for robots to perform long-horizon tasks in unseen real-world scenarios by: (i) perceiving scenes as abstract states, (ii) acquiring world models that predict the applicability and consequences of actions on abstract states, and (iii) planning to reach novel goals within the abstract state space using these world models. We advocate for a scene graph-based representation that abstracts objects and their relations as symbols, allowing for strong compositional generalization to novel objects and goals in planning. The dissertation is structured in three parts, focusing on perception, planning, and learning, respectively. In the first part, we introduce a manually defined contact graph representation that preserves the kinematic state of the environment for task and motion planning. We develop a scene reconstruction system that recovers this representation from RGB-D streams, enabling the creation of functionally equivalent digital twins for simulating robot interaction. In the second part, we demonstrate closed-loop reasoning and planning using contact graphs and other forms of feedback, leveraging the internal world knowledge of language models. We show that a Vision-Language Model can enable closed-loop mobile manipulation in the real world with feedback from the contact graph and images from the robot's wrist camera. We also show that a Large Language Model can propose task and motion planning solutions and make corrections by reasoning over motion planner feedback. The third part focuses on learning task-relevant symbolic abstractions and world models that generalize to novel object configurations. We present an interactive framework that learns PDDL-style symbolic predicates and operators from interaction data and language feedback. Additionally, we propose a probabilistic framework that learns object symbols and a stochastic grammar capturing state transitions in the context of object cutting. We demonstrate that these learned symbolic representations and world models can be used to solve complex tasks with novel objects and unseen goals through planning. By placing abstraction at its core, this dissertation seeks to unify perception, planning, and learning to build more capable and generalizable embodied intelligence.
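
To make the idea of planning over abstract states concrete, the following is a minimal sketch, not the dissertation's actual code: an abstract state is modeled as a set of symbolic relation tuples over object symbols, operators follow a PDDL-style precondition/add/delete format, and planning is a plain breadth-first search in the abstract state space. The relation names, the Operator fields, and the toy mug-and-cabinet scenario are illustrative assumptions.

# Minimal, illustrative sketch (not the dissertation's implementation).
# Abstract state: a frozenset of symbolic relation tuples over object symbols.
# Operators: PDDL-style preconditions, add effects, and delete effects.
# Planner: breadth-first search in the abstract state space.
from collections import deque
from dataclasses import dataclass

State = frozenset  # e.g. frozenset({("on", "mug", "table"), ("open", "cabinet")})

@dataclass(frozen=True)
class Operator:
    name: str
    preconditions: frozenset  # relations that must hold before acting
    add_effects: frozenset    # relations the action makes true
    del_effects: frozenset    # relations the action makes false

def apply(op, state):
    """Return the successor abstract state, or None if op is not applicable."""
    if not op.preconditions <= state:
        return None
    return State((state - op.del_effects) | op.add_effects)

def plan(init, goal, operators, max_depth=10):
    """Breadth-first search over abstract states until all goal relations hold."""
    frontier = deque([(init, [])])
    visited = {init}
    while frontier:
        state, steps = frontier.popleft()
        if goal <= state:
            return steps
        if len(steps) >= max_depth:
            continue
        for op in operators:
            nxt = apply(op, state)
            if nxt is not None and nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, steps + [op.name]))
    return None

# Toy scenario: move a mug from the table into an already-open cabinet.
init = State({("on", "mug", "table"), ("open", "cabinet")})
goal = frozenset({("inside", "mug", "cabinet")})
ops = [Operator(
    name="place_mug_in_cabinet",
    preconditions=frozenset({("on", "mug", "table"), ("open", "cabinet")}),
    add_effects=frozenset({("inside", "mug", "cabinet")}),
    del_effects=frozenset({("on", "mug", "table")}),
)]
print(plan(init, goal, ops))  # -> ['place_mug_in_cabinet']

In the dissertation's terms, the perception systems of Part I would produce states like init from RGB-D input, and the learning methods of Part III would acquire the predicates and operator definitions rather than having them hand-written as in this sketch.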
