Evaluating and improving the reasoning capabilities of large vision-language models (LVLMs) is essential for advancing their ability to handle complex, real-world tasks. Among these tasks, reasoning jointly over text and visual elements in context-rich environments, such as navigating public spaces or interpreting infographics, is particularly critical. Such tasks demand context-sensitive, text-rich visual reasoning, where the interplay between textual and visual components within an image is key to understanding. However, existing datasets fall short of benchmarking state-of-the-art multimodal models' capabilities in this domain. To address this gap, we introduce ConTextual, a novel dataset of human-crafted instructions tailored to text-rich images. Our study evaluates 14 foundation models, including GPT-4V, Gemini-Pro-Vision, and LLaVA-Next, and establishes a human performance baseline. The results reveal a significant 30.8% performance gap between GPT-4V, the current best-performing large multimodal model, and human-level reasoning. A fine-grained analysis shows that while GPT-4V demonstrates competence in abstract visual contexts such as memes and quotes, it struggles with time-related data and infographics. Additionally, qualitative analysis uncovers shortcomings, including imprecise visual perception and hallucinations.

In parallel, we focus on improving the reasoning capabilities of LVLMs through preference fine-tuning. While techniques such as Direct Preference Optimization (DPO) have shown promise when trained on AI-generated feedback, they often fail to address the noise inherent in synthetic annotations, such as stylistic and length biases. To overcome this limitation, we propose a hard-negative response generation framework that produces rejected responses containing targeted errors while preserving stylistic and length consistency with the accepted responses. Using this methodology, we develop the VaPR dataset, comprising 10,000 high-quality preference samples for refining reasoning in LVLMs.
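For reference, preference fine-tuning of this kind optimizes the standard DPO objective, which contrasts the policy $\pi_\theta$ with a frozen reference model $\pi_{\mathrm{ref}}$ over accepted/rejected response pairs $(y_w, y_l)$; the formulation below is the original one from Rafailov et al. (2023), not a VaPR-specific variant:
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
\]
Because the hard-negative framework constructs $y_l$ to match $y_w$ in style and length, the contrast inside the sigmoid is driven by the injected error rather than by superficial stylistic or length cues.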
We fine-tune two LVLM families, LLaVA-V1.5 and Qwen2VL, on the VaPR dataset, yielding significant improvements across nine benchmarks. Our smallest model, Qwen2VL-VaPR-2B, achieves an average gain of 4.6%, while our largest, LLaVA-VaPR-13B, achieves an average improvement of 6.7%. The VaPR-tuned models excel across spatial, textual, general visual, and adversarial reasoning tasks, and they mitigate persistent issues in LVLMs, such as the tendency to over-answer "Yes" on binary questions. This work contributes a comprehensive approach to advancing LVLM reasoning by addressing both performance gaps and methodological challenges.
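To make the training data format concrete, the following is a minimal, hypothetical Python sketch of how an accepted response and a hard-negative rejected response could be paired under a simple length-consistency filter; the field names, the word-count heuristic, and the 20% tolerance are illustrative assumptions, not the released VaPR pipeline:

from dataclasses import dataclass

@dataclass
class PreferencePair:
    image_id: str
    prompt: str
    chosen: str    # accepted (correct) response
    rejected: str  # hard negative containing a targeted, localized error

def length_consistent(chosen: str, rejected: str, tolerance: float = 0.2) -> bool:
    """Keep only pairs whose word counts differ by at most `tolerance` (relative)."""
    n_chosen, n_rejected = len(chosen.split()), len(rejected.split())
    return abs(n_chosen - n_rejected) / max(n_chosen, 1) <= tolerance

def build_pair(image_id: str, prompt: str, chosen: str, candidate_rejected: str):
    """Pair a candidate hard negative with the accepted response, discarding it
    if the two differ too much in length; stylistic consistency is assumed to be
    handled upstream by the model that edits the error into the response."""
    if not length_consistent(chosen, candidate_rejected):
        return None  # drop pairs where length alone could give the preference away
    return PreferencePair(image_id, prompt, chosen, candidate_rejected)

Filtering of this kind is one simple way to keep the rejected response from being distinguishable by surface features, so that the preference signal reflects the targeted error itself.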
Finally, we release all datasets, models, and code to foster further research and collaboration. The introduction of the ConTextual and VaPR datasets, coupled with rigorous benchmarking and fine-tuning strategies, provides a significant step toward enhancing the reasoning capabilities of large vision-language models in real-world applications.