Modern commercial antivirus systems increasingly rely on machine learning (ML) to
keep up with the rapid proliferation of new malware. However, it is well known
that ML models are vulnerable to adversarial examples (AEs).
Previous works have shown that ML malware classifiers are fragile under
white-box adversarial attacks. However, the ML models used in commercial
antivirus products are usually inaccessible to attackers and return only hard
classification labels. Therefore, it is more practical to evaluate the
robustness of ML models and real-world AVs in a pure black-box manner. We
propose a black-box Reinforcement Learning (RL)-based framework that generates
AEs for PE malware classifiers and AV engines. It casts the adversarial attack
as a multi-armed bandit problem, seeking an optimal balance between exploiting
previously successful patterns and exploring new variations.
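To make the bandit formulation concrete, the following minimal sketch runs an
epsilon-greedy bandit over a set of functionality-preserving PE manipulations.
The action names, apply_action, and the hard-label classifier_evades oracle are
simplified placeholders for illustration, not our actual implementation:
\begin{verbatim}
import random

# Placeholder arm set: functionality-preserving PE manipulations.
# (Illustrative names only; the real action set may differ.)
ACTIONS = ["append_overlay", "add_section", "rename_section", "pad_imports"]

def apply_action(sample: bytes, action: str) -> bytes:
    """Placeholder: apply one functionality-preserving PE transformation."""
    raise NotImplementedError

def classifier_evades(sample: bytes) -> bool:
    """Placeholder: one hard-label query against the black-box detector."""
    raise NotImplementedError

def epsilon_greedy_attack(malware: bytes, budget: int = 100,
                          eps: float = 0.1):
    """Stateless bandit: each pull picks an arm from its empirical reward,
    without conditioning on the sequence of actions applied so far."""
    pulls = {a: 0 for a in ACTIONS}   # times each arm was tried
    wins = {a: 0 for a in ACTIONS}    # times the arm produced an evasion
    sample = malware
    for _ in range(budget):
        if random.random() < eps:     # explore a random arm
            arm = random.choice(ACTIONS)
        else:                         # exploit the best empirical arm;
            arm = max(ACTIONS,        # untried arms start optimistic (1.0)
                      key=lambda a: wins[a] / pulls[a] if pulls[a] else 1.0)
        sample = apply_action(sample, arm)
        pulls[arm] += 1
        if classifier_evades(sample):
            wins[arm] += 1
            return sample             # adversarial example found
    return None                       # query budget exhausted
\end{verbatim}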
Compared to other frameworks, our improvements lie in three points:
1) we limit the exploration space by modeling the generation process as a
stateless process, avoiding combinatorial explosion; 2) because the payload
plays a critical role in AE generation, we reuse successful payloads in our
modeling; 3) we minimize the changes applied to each AE so that rewards can be
assigned to the correct actions during RL training, which also helps identify
the root cause of each evasion.
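The minimization step in point 3 can be sketched as a greedy pass that rebuilds
the AE with each applied action left out and keeps the removal whenever the
sample still evades; the surviving actions are the ones the evasion actually
depends on. This reuses the placeholder apply_action and classifier_evades from
the sketch above and illustrates the idea rather than the exact algorithm:
\begin{verbatim}
def minimize(malware: bytes, applied: list[str]) -> list[str]:
    """Greedily drop applied actions while the sample still evades.

    The surviving actions localize the root cause of the evasion and
    let rewards be assigned only to the actions that mattered.
    """
    essential = list(applied)
    i = 0
    while i < len(essential):
        trial = essential[:i] + essential[i + 1:]
        candidate = malware
        for a in trial:                   # rebuild AE without action i
            candidate = apply_action(candidate, a)
        if classifier_evades(candidate):  # still evades: action i unneeded
            essential = trial
        else:
            i += 1                        # action i is essential; keep it
    return essential
\end{verbatim}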
As a result, our framework achieves much higher black-box evasion rates than
other off-the-shelf frameworks: 74\%--97\% against two state-of-the-art ML
detectors and 32\%--48\% against commercial AVs in a pure black-box setting. We
also demonstrate that the transferability of adversarial attacks among ML-based
classifiers is higher than that between purely ML-based classifiers and
commercial AVs.