Contemporary methods for benchmarking noisy quantum processors typically measure average error rates or process infidelities. However, thresholds for fault-tolerant quantum error correction are given in terms of worst-case error rates—defined via the diamond norm—which can differ from average error rates by orders of magnitude. One method for resolving this discrepancy is to randomize the physical implementation of quantum gates, using techniques like randomized compiling (RC). In this work, we use gate set tomography to perform precision characterization of a set of two-qubit logic gates to study RC on a superconducting quantum processor. We find that, under RC, gate errors are accurately described by a stochastic Pauli noise model without coherent errors, and that spatially correlated coherent errors and non-Markovian errors are strongly suppressed. We further show that the average and worst-case error rates are equal for randomly compiled gates, and measure a maximum worst-case error of 0.0197(3) for our gate set. Our results show that randomized benchmarks are a viable route to both verifying that a quantum processor’s error rates are below a fault-tolerance threshold, and to bounding the failure rates of near-term algorithms, if—and only if—gates are implemented via randomization methods which tailor noise.