This dissertation explores deep learning-based methodologies for representing and predicting Head-Related Transfer Functions (HRTFs). HRTFs describe the source-location-specific acoustic transformations that sound signals undergo through interactions with the outer body structures, such as the torso, head, and pinna. HRTFs are understood to be crucial for sound localization, particularly when earphones or headphones are used to reproduce directional audio: non-directional mono sources are filtered with the listener's left- and right-ear HRTFs for the target direction to achieve the intended directionality.
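The binaural rendering described above can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: a mono signal is convolved with the left- and right-ear head-related impulse responses (HRIRs, the time-domain counterparts of HRTFs) for one direction. The HRIRs here are random placeholders standing in for measured data.

```python
import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    """Convolve a mono signal with left/right HRIRs to obtain a stereo pair."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right])

# Toy example: a short noise burst and 64-tap placeholder HRIRs.
rng = np.random.default_rng(0)
mono = rng.standard_normal(1024)
hrir_l = rng.standard_normal(64)   # placeholder, not a measured HRIR
hrir_r = rng.standard_normal(64)   # placeholder, not a measured HRIR
stereo = render_binaural(mono, hrir_l, hrir_r)
print(stereo.shape)  # (2, 1087): full convolution length 1024 + 64 - 1
```

In practice the HRIRs would be selected (or interpolated) for the desired source direction from the listener's personalized set.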
As HRTFs depend on individual anthropometry, there is a pressing need for efficient methods to predict personalized HRTFs without cumbersome user involvement. The fast-developing field of deep learning presents a promising avenue to address this challenge; however, the scarcity of comprehensive HRTF datasets poses a significant obstacle. This dissertation first investigates techniques for deriving low-dimensional latent representations of HRTFs, reducing data dimensionality and complexity and thereby alleviating the load on prediction models. The latent representations are derived by capitalizing on both spatial and spectral interactions between neighboring HRTFs, in conjunction with contrastive learning methodologies aimed at disentangling pertinent features inherent in HRTFs. Visualizations of the resulting latent space demonstrate the favorable characteristics of the derived representations. The study then extends to predicting spatially dense HRTFs by upsampling spatially sparse measurements via the learned latent representations. The effectiveness of the proposed methodologies is validated using the mean Log Spectral Distortion (LSD) between the ground-truth and predicted HRTFs as an objective metric, yielding state-of-the-art results. Reconstruction plots further show that the peaks and notches of the HRTFs are preserved in the reconstructions; these are understood to be crucial, if not the most dominant, perceptual cues.
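The LSD metric mentioned above can be computed as in the following sketch. Averaging conventions vary between studies, so this follows one common formulation (RMS of the log-magnitude error in dB per direction, then the mean over directions); the array shapes and function name are illustrative assumptions, not the dissertation's code.

```python
import numpy as np

def log_spectral_distortion(h_true, h_pred, eps=1e-12):
    """Mean LSD in dB between ground-truth and predicted HRTF magnitudes.

    h_true, h_pred: arrays of shape (directions, frequency_bins) holding
    magnitude responses. eps guards against log of zero.
    """
    err_db = 20.0 * np.log10((np.abs(h_true) + eps) / (np.abs(h_pred) + eps))
    per_direction = np.sqrt(np.mean(err_db ** 2, axis=-1))
    return float(np.mean(per_direction))

# Identical spectra give zero distortion; a uniform factor-of-two magnitude
# error gives 20*log10(2) ~ 6.02 dB at every bin, hence 6.02 dB overall.
h = np.ones((4, 128))
print(round(log_spectral_distortion(h, h), 6))      # 0.0
print(round(log_spectral_distortion(2 * h, h), 2))  # 6.02
```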