Neural networks have been widely studied and used in recent years due to their high
classification accuracy and training efficiency. As network depth increases, however,
the models become less well calibrated, meaning that their output probabilities do not reflect
the true probabilities. In many applications, such as medical diagnosis, facial recognition and
self-driving cars, calibrated output probabilities are of critical importance. Understanding
the causes of miscalibration in deep neural networks is therefore of great interest.
The influence of model structure on output calibration has already been explored.
However, the impact of training dataset quality and heterogeneity, such as dataset size
and label noise, remains unclear. In this thesis, the impact of data quality and heterogeneity
on output calibration is investigated theoretically and experimentally. Afterwards, the
shortcomings of calibration methods that use a single global parameter are discussed. To overcome
the calibration issues resulting from dataset heterogeneity, we propose an improved
calibration technique that achieves better calibration performance.