Fraud detection is a highly imbalanced binary classification problem in which the positive class (fraudulent instances) is far smaller than the negative class (normal instances). Any fraud detection model faces a trade-off between two types of classification mistakes: type I error (also known as the false positive rate, or FPR) and type II error (also known as the false negative rate, or FNR). One common way to handle this trade-off is to set a pre-determined maximum acceptable type I error, $\alpha$, and minimize the type II error subject to that constraint. However, an empirical type I error below $\alpha$ on the test set does not guarantee that the population type I error is also below $\alpha$. This problem is addressed by the Neyman-Pearson (NP) classification algorithm \citep{np}, which yields a high-probability upper bound on the population type I error. In this study, we use simulation to illustrate how the empirical and population type I errors can differ, and how NP classification effectively controls the population type I error. We then build credit card fraud prediction models based on classical and NP algorithms and compare their performance.
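To make this guarantee concrete, the sketch below implements the order-statistic threshold rule of the NP umbrella algorithm \citep{np}: given the scores a base classifier assigns to a left-out sample of $n$ class-0 (normal) observations, the population type I error of the classifier thresholded at the $k$-th order statistic exceeds $\alpha$ with probability at most $\sum_{j=k}^{n}\binom{n}{j}(1-\alpha)^{j}\alpha^{n-j}$, so the rule picks the smallest $k$ for which this bound is at most a violation tolerance $\delta$. This is a minimal illustration, not the implementation used in this study; the names \texttt{np\_threshold}, \texttt{scores\_class0}, and \texttt{delta} are ours.

\begin{verbatim}
import numpy as np
from scipy.stats import binom

def np_threshold(scores_class0, alpha=0.05, delta=0.05):
    """Threshold whose population type I error exceeds `alpha`
    with probability at most `delta` (NP umbrella algorithm).

    scores_class0: base-classifier scores on a left-out sample of
    class-0 (normal) observations; class 1 (fraud) is predicted
    when a new score exceeds the returned threshold.
    """
    s = np.sort(np.asarray(scores_class0))
    n = s.size
    # Thresholding at the k-th order statistic violates the alpha
    # constraint with probability at most P(Binomial(n, 1-alpha) >= k).
    ks = np.arange(1, n + 1)
    violation = binom.sf(ks - 1, n, 1.0 - alpha)
    valid = ks[violation <= delta]
    if valid.size == 0:
        # Even the largest order statistic cannot certify the bound;
        # this requires (1 - alpha)^n <= delta, i.e., a larger sample.
        raise ValueError("class-0 sample too small for this alpha/delta")
    k_star = valid.min()      # smallest certified order statistic
    return s[k_star - 1]      # convert 1-indexed k to array index
\end{verbatim}

Note that the bound is attainable only for sufficiently large $n$, since even $k = n$ requires $(1-\alpha)^{n} \le \delta$; with $\alpha = \delta = 0.05$ this means at least $59$ left-out class-0 observations.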