With the recent advancements in artificial intelligence (AI), researchers are striving to build an AI that can understand humans, collaborate with them, and help or guide them to accomplish everyday chores. Realizing such an assistant AI poses several challenges, including planning over events, comprehending human instructions, multimodal understanding, and grounded conversational ability.
Imagine a scenario in which one wishes to perform a task, such as “making a plate of fried rice” or “purchasing a suitable sofa bed”, which can require multiple action steps and the manipulation of certain objects. How would an assistant AI collaborate with humans to accomplish such tasks? One crucial capability of such a system is understanding how and when to take a certain action, which is often learned by interpreting and following guidance: a resource that encodes knowledge about how to accomplish the task and, potentially, the events that may occur during task completion. Such guidance can come from human verbal interactions (e.g., in the form of a conversation or a question) or from static written instructional manuals.
In the first part of this thesis, I will decompose the proposed system framework into three foundational components: (1) task-step sequencing and planning, where the AI needs to understand the appropriate sequential procedure for performing each sub-task to accomplish the whole task, especially when the task knowledge is learned from online instructional resources, which are often numerous and do not always come consolidated with a proper ordering; (2) action-dependency understanding, where an agent should be able to infer the dependencies of performing an action and the outcomes of executing it, in order to assess the situation and adjust the plan for accomplishing the task; and (3) multimodal grounding and active perception, where we equip the AI with the ability to actively ground its visually perceived surroundings in the textual instructions (or verbal interactions) and to reason over multimodal information throughout task completion.
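To make components (1) and (2) concrete, below is a minimal, hypothetical sketch, assuming task steps can be represented with symbolic preconditions and effects (a deliberate simplification; in the thesis these components are learned from data). The names here (Step, plan, and the fried-rice example steps) are illustrative, not taken from the thesis: a valid step ordering is recovered by topologically sorting the dependency graph that the preconditions and effects induce.

```python
# Hypothetical sketch: task steps carry symbolic preconditions and effects;
# a valid execution order is recovered by topologically sorting the
# dependency graph they induce.
from dataclasses import dataclass
from graphlib import TopologicalSorter  # Python 3.9+


@dataclass(frozen=True)
class Step:
    name: str
    preconditions: frozenset  # world states required before this step runs
    effects: frozenset        # world states that hold after this step runs


def plan(steps):
    """Order steps so every precondition is produced by an earlier step."""
    graph = {}
    for step in steps:
        # Component (2): a step depends on any other step whose effects
        # satisfy one of its preconditions.
        graph[step.name] = {
            other.name
            for other in steps
            if other is not step and step.preconditions & other.effects
        }
    # Component (1): task-step sequencing via topological ordering.
    return list(TopologicalSorter(graph).static_order())


steps = [
    Step("cook rice", frozenset(), frozenset({"rice cooked"})),
    Step("chop vegetables", frozenset(), frozenset({"vegetables chopped"})),
    Step("stir-fry", frozenset({"rice cooked", "vegetables chopped"}),
         frozenset({"fried rice ready"})),
]
print(plan(steps))  # e.g., ['cook rice', 'chop vegetables', 'stir-fry']
```

In a realistic assistant, such dependencies would be inferred by a learned model from noisy instructional text and perception rather than hand-specified, but the same sequencing logic applies.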
Combining the two parts, namely the foundational components and the established novel and challenging benchmarks, this thesis aims to provide a comprehensive research road map for next-generation (multimodal) AI assistants.