Deep Neural Networks (DNNs) have transformed multimedia generation and recognition by replacing traditional hand-engineered systems in domains such as vision, speech, and text. Because DNNs can be trained end-to-end and model complex dependencies, they achieve state-of-the-art results on many generation and recognition benchmarks. However, three key challenges must be addressed for the practical, secure, and reliable deployment of DNN-based media processing systems: 1) robustness: DNNs are vulnerable to adversarial attacks; 2) data requirements: DNNs often need large amounts of labelled data; and 3) compute efficiency: DNNs demand extensive computational resources.
My research addresses these three challenges for DNN-based multimedia generation and recognition systems. On the robustness side, I first analyze practical vulnerabilities of DNN-based recognition systems and then propose a defense framework that reliably identifies adversarial inputs using perceptually informed input transformations (a simplified sketch of this idea follows below). To reduce data requirements, I develop training frameworks that adapt foundation models pretrained with self-supervised learning to recognition and synthesis tasks in a data-efficient manner. Finally, to improve compute efficiency, I propose acceleration methods based on hardware-software codesign that significantly reduce latency and resource requirements while preserving the synthesis quality of DNN generators.
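To make the transformation-consistency idea concrete, the following is a minimal PyTorch sketch, not the exact defense framework described above: it flags an input as adversarial when the model's predicted distribution shifts markedly under mild, perceptually motivated transformations. The particular transforms (Gaussian blur and bit-depth reduction), the total-variation score, and the detection threshold are illustrative assumptions.

```python
# Minimal sketch of adversarial-input detection via prediction consistency
# under perceptual input transformations. All choices below (transforms,
# distance measure, threshold) are assumptions for illustration only.
import torch
import torch.nn.functional as F


def gaussian_blur(x: torch.Tensor, kernel_size: int = 5, sigma: float = 1.0) -> torch.Tensor:
    """Depthwise Gaussian blur over an NCHW image batch."""
    coords = torch.arange(kernel_size, dtype=x.dtype, device=x.device) - kernel_size // 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g = g / g.sum()
    kernel = torch.outer(g, g).repeat(x.shape[1], 1, 1, 1)  # (C, 1, k, k)
    return F.conv2d(x, kernel, padding=kernel_size // 2, groups=x.shape[1])


def requantize(x: torch.Tensor, levels: int = 32) -> torch.Tensor:
    """Reduce bit depth, assuming inputs are scaled to [0, 1]."""
    return torch.round(x * (levels - 1)) / (levels - 1)


@torch.no_grad()
def is_adversarial(model: torch.nn.Module, x: torch.Tensor, threshold: float = 0.2) -> torch.Tensor:
    """Return a boolean per example: True if the softmax output shifts
    substantially under the perceptual transformations (hypothetical rule)."""
    p_clean = F.softmax(model(x), dim=-1)
    shift = torch.zeros(x.shape[0], device=x.device)
    for transform in (gaussian_blur, requantize):
        p_t = F.softmax(model(transform(x)), dim=-1)
        # Total variation distance between the clean and transformed predictions.
        shift = torch.maximum(shift, 0.5 * (p_clean - p_t).abs().sum(dim=-1))
    return shift > threshold
```

The underlying intuition is that adversarial perturbations tend to be brittle: small, perceptually benign changes to the input disproportionately disturb the model's prediction, whereas predictions on natural inputs remain stable under the same transformations.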