As ever-increasing amounts of renewable electricity enter the energy supply mix on a regional, national and international basis, greater emphasis is being placed on energy conversion and storage technologies to deal with the fluctuations, excess and shortage of electricity. Hydrogen generation via proton exchange membrane water electrolysis (PEMWE) is one technology that offers a pathway to store large amounts of electricity in the form of hydrogen. The challenges to widespread adoption of PEM water electrolyzers lie in their high capital and operating costs, both of which need to be reduced through R&D. An evaluation of reported PEMWE performance data in the literature reveals excessive variation in in situ performance results, which makes it difficult to draw conclusions on the path toward performance optimization and on future R&D directions. To enable meaningful comparison of in situ performance evaluations across laboratories, there is a clear need for standardization of materials and testing protocols. Herein, we address this need by reporting the results of a round robin test effort conducted at the laboratories of five contributors to the IEA Electrolysis Annex 30. For this effort, a method and equipment framework was first developed and then verified with respect to its feasibility for measuring water electrolysis performance accurately across the various laboratories. The effort utilized identical sets of test articles, materials, and test cells, and employed a set of shared test protocols. It further defined a minimum skeleton of requirements for the test station equipment. The maximum observed deviation between laboratories at 1 A cm⁻² at cell temperatures of 60 °C and 80 °C was 27 and 20 mV, respectively. The deviation of the results from laboratory to laboratory was 2–3 times higher than the lowest deviation observed at a single laboratory and test station.
However, the highest deviations observed were one-tenth of those extracted from a literature survey on similar material sets. The work endorses the urgent need to identify one or more reference sets of materials, in addition to the method and equipment framework introduced here, to enable accurate comparison of results across the entire community. The results further suggest that cell temperature control is the most significant source of deviation between results, and that care must be taken with respect to break-in conditions and cell electrical connections to obtain meaningful performance data.