Imperfect Label Information in Multimodal Human-Centric Machine Learning
Multimodal machine learning studies how to use multiple streams of input data to predict an output. The classic notion is that multiple input streams allow better predictions by accounting for multiple contexts. Applications include audio-visual speech recognition and emotion prediction, among others. While this research has enabled novel and effective ways to fuse data for improved modeling performance, few works have examined how highly uncertain and varied human opinions and behavior affect model performance.
Accounting for the variability in human opinions is important for multimodal machine learning because in many human-centric applications the labels carry a high degree of uncertainty. One notable example is the prediction of human sentiment or emotion. Current datasets do not give a complete picture of the variability of human opinions. This is further complicated by the fact that including additional modalities increases the number of discriminating features, causing models to fit imperfect data faster.
This thesis lays a foundation for examining the effect of label variability on multimodal algorithms and datasets. We propose and develop novel techniques for unimodal label tolerance and strive to extend them to the multimodal domain. The goal is that, by explicitly accounting for ambiguities in the output, we can improve both the effectiveness of models under label noise and our understanding of label noise in a multimodal domain.