The rapid evolution of machine learning (ML) presents significant evaluation challenges, particularly around data quality issues ("data bugs") and the limited applicability of traditional software testing metrics. Standard evaluation often overlooks crucial aspects such as nuanced data diversity, subjective user experience, and content provenance. This dissertation argues for a paradigm shift in ML evaluation, moving beyond narrow structural metrics toward holistic frameworks. The central thesis is that robust and meaningful assessment requires integrating complementary perspectives: systematically enhancing data diversity, incorporating human-centric subjective evaluation, and ensuring reliable authorship verification.
This work investigates these pillars through empirical studies. Initial analysis reveals the limitations of structural metrics such as neuron coverage (NC) for effectively guiding diverse test generation in deep neural networks (RQ1). Subsequently, we demonstrate that aggressively expanding input diversity via label-altering transformations (sibylvariance) significantly improves model generalization, defect detection, and robustness (RQ2). To refine augmentation strategies, we introduce feature-aware data augmentation (FADA), which optimizes augmentation policies by balancing diversity against text quality, alongside INSPECTOR, a human-in-the-loop system that uses transformation provenance for efficient data validation (RQ3). Shifting focus to output quality, we propose the Psychological Depth Scale (PDS) to measure subjective, human-centric dimensions such as empathy and engagement, and find that current language models can achieve human-comparable levels of psychological depth (RQ4). Recognizing the authorship-attribution challenge this poses, we analyze watermarking techniques, demonstrating their practical resilience against attacks and affirming their necessity for authorship verification in the age of advanced AI (RQ5).
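To make the structural metric examined in RQ1 concrete, the sketch below shows how neuron coverage is typically computed: the fraction of neurons whose activation exceeds a fixed threshold on at least one test input. This is a minimal illustration under assumed shapes, threshold, and function names, not the dissertation's implementation.

```python
# Minimal sketch (not the dissertation's implementation) of neuron coverage (NC):
# the fraction of neurons whose activation exceeds a fixed threshold on at least
# one test input. Shapes, threshold, and names are illustrative assumptions.
import numpy as np

def neuron_coverage(layer_activations, threshold=0.5):
    """layer_activations: list of (num_inputs, num_neurons) arrays, one per layer."""
    covered = 0
    total = 0
    for acts in layer_activations:
        # A neuron counts as covered if any test input drives it above the threshold.
        covered += int(np.any(acts > threshold, axis=0).sum())
        total += acts.shape[1]
    return covered / total

# Toy example: random activations for 8 inputs over two layers.
rng = np.random.default_rng(0)
activations = [rng.random((8, 16)), rng.random((8, 32))]
print(f"NC = {neuron_coverage(activations):.2f}")
```

Even in this toy setting, coverage approaches its maximum with a handful of random inputs while revealing nothing about their semantic diversity, which is the kind of limitation RQ1 examines.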
Collectively, these findings underscore the need for a multi-faceted evaluation approach. This research contributes components toward a unified framework that synthesizes structural rigor, data-centric diversity, human judgment of quality and subjective experience, and technical methods for provenance. By adopting such holistic evaluation, drawing on software engineering, machine learning, and human-computer interaction, we can better assess ML systems for real-world reliability, utility, and trustworthiness, fostering the development of more beneficial and accountable AI.