The explosion of massive media data, induced by the proliferation of digital cameras and mobile devices as well as the emergence of online media websites, has led us into the era of big data. Accurate and effective analysis of big multimedia data to support semantically enriched representations in terms of events, activities, and entities can bring transformative improvements to a variety of application domains. The basic form of multimedia analysis for more sophisticated interpretation is characterized by questions such as ``who, what, where, when" that identify the subjects, activities, locations, and times associated with images and video segments. In this thesis, we primarily focus on the ``who" question, which is referred to as the person identification problem in multimedia data.
While advances in image processing and computer vision have resulted in powerful techniques for person identification, such techniques, which are based on facial appearance representations, are usually prone to errors due to a variety of factors including noise, poor signal quality, and occlusion. It is widely recognized in the multimedia research community that additional contextual features can be leveraged to bring significant improvements to such tasks. Nevertheless, how to systematically utilize such heterogeneous contextual information still poses a big challenge. Moreover, person identification is conventionally performed in an ``offline" setting, where the typical goal is to completely annotate the whole collection before further applications. Such an ``offline" process is not tenable when dealing with big multimedia data, since limited computational resources and manpower do not allow us to process every image with each possible tag and clean up every potentially noisy result.
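As a concrete illustration of why contextual features can help, consider fusing an appearance-based match score with a contextual prior (for example, how often a candidate co-occurs with people already identified in the scene). The following sketch is purely illustrative -- the scoring function, fixed weight, and identities are assumptions for exposition, not the method developed in this thesis:

```python
def identify(face_scores, context_scores, w=0.7):
    """Fuse appearance and contextual evidence with a fixed weight
    (illustrative) and return the highest-scoring candidate identity.

    face_scores:    dict identity -> appearance match score in [0, 1]
    context_scores: dict identity -> contextual prior in [0, 1]
    """
    fused = {person: w * face_scores[person]
                     + (1 - w) * context_scores.get(person, 0.0)
             for person in face_scores}
    return max(fused, key=fused.get)

# Appearance alone favors "alice", but contextual evidence (e.g.,
# co-occurrence with people already tagged) flips the decision to "bob".
best = identify({"alice": 0.60, "bob": 0.55},
                {"alice": 0.10, "bob": 0.90})
```

With equally uninformative context the decision falls back to the appearance score alone, which is the behavior one would expect from such a fusion scheme.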
We note that similar challenges also arise in the database domain, especially in the entity resolution task. To address these challenges, recent entity resolution research has explored a series of powerful methods, including techniques that exploit relationships, contextual information, and domain semantics in the form of constraints and ontologies, to resolve references in structured, semi-structured, and unstructured textual data. Additionally, query-driven data cleaning techniques have been proposed and explored to address the challenges of big data.
In this thesis, we aim to explore how such advanced entity resolution techniques can be exploited to improve the semantic interpretation of multimedia data, specifically for the person identification problem. We first explore how automatic data cleaning techniques that exploit relationships, contextual information, domain semantics, and constraints can enhance the performance of face clustering and recognition. Then, we propose a new paradigm for face clustering/tagging suited for big data, in which image enrichment is seamlessly integrated into the image retrieval/analysis process -- we refer to this new paradigm as ``query-driven image enrichment".
Specifically, we first study the person identification problem in the context of surveillance videos and propose a context-based framework for low-quality data, which integrates multiple sources of contextual information by leveraging the entity resolution framework RelDC to improve the performance of person identification. Motivated by the significant improvement in results, we then investigate the face clustering problem and propose a unified framework that employs bootstrapping to automatically learn adaptive rules for integrating heterogeneous contextual information. Furthermore, we exploit a human-in-the-loop mechanism, leveraging human interaction to achieve high-quality clustering results. Finally, to address the challenges of big multimedia data, we propose a query-driven approach to face clustering/tagging that investigates query-driven active learning strategies to achieve accurate query answers with minimum user participation.
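The query-driven idea can be pictured as an uncertainty-sampling loop: confident automatic decisions are accepted, and the limited human labeling budget is spent only on the faces whose match scores are most ambiguous for the current query. Everything in the sketch below (the one-dimensional toy features, the scoring function, and the thresholds) is an illustrative assumption, not the actual algorithm developed in this thesis:

```python
def match_score(face, query_center):
    """Toy automatic matcher: confidence that `face` depicts the queried
    person, using 1-D features for illustration."""
    return 1.0 / (1.0 + abs(face - query_center))

def query_driven_tagging(faces, query_center, oracle, budget,
                         accept=0.8, reject=0.4):
    """Answer a tag query, asking the human oracle about the most
    ambiguous faces first (uncertainty sampling), within a budget."""
    labels = {}
    pending = []
    for i, face in enumerate(faces):
        s = match_score(face, query_center)
        if s >= accept:            # confident match: tag automatically
            labels[i] = True
        elif s <= reject:          # confident non-match: skip
            labels[i] = False
        else:                      # ambiguous: candidate for human review
            pending.append((abs(s - 0.5), i))
    pending.sort()                 # most uncertain (closest to 0.5) first
    for _, i in pending[:budget]:
        labels[i] = oracle(faces[i])
    for _, i in pending[budget:]:  # budget exhausted: automatic fallback
        labels[i] = match_score(faces[i], query_center) >= 0.5
    return labels
```

The key property, in contrast to the ``offline" setting, is that human effort is directed only at the data relevant to the query and is bounded by an explicit budget rather than by the size of the collection.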