eScholarship
Open Access Publications from the University of California

Total Cost of Ownership and Evaluation of Google Cloud Resources for the ATLAS Experiment at the LHC

(2025)

Abstract: The ATLAS Google Project was established as part of an ongoing evaluation of the use of commercial clouds by the ATLAS Collaboration, in anticipation of the potential future adoption of such resources by WLCG grid sites to fulfil or complement their computing pledges. Seamless integration of Google cloud resources into the worldwide ATLAS distributed computing infrastructure was achieved at large scale and for an extended period of time, and hence cloud resources are shown to be an effective mechanism to provide additional, flexible computing capacity to ATLAS. For the first time a total cost of ownership analysis has been performed, to identify the dominant cost drivers and explore effective mechanisms for cost control. Network usage significantly impacts the costs of certain ATLAS workflows, underscoring the importance of implementing such mechanisms. Resource bursting has been successfully demonstrated, whilst exposing the true cost of this type of activity. A follow-up to the project is underway to investigate methods for improving the integration of cloud resources in data-intensive distributed computing environments and reducing costs related to network connectivity, which represents the primary expense when extensively utilising cloud resources.
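As a rough illustration of the kind of cost breakdown such an analysis involves, the sketch below sums compute, storage, and network egress charges for a hypothetical cloud burst. All rates, quantities, and function names are invented for illustration and do not reflect Google Cloud pricing or the project's actual accounting.

    # Illustrative total-cost-of-ownership estimate for a cloud compute burst.
    # All rates and workload figures are hypothetical placeholders, not actual
    # Google Cloud pricing or ATLAS accounting.

    def tco_estimate(core_hours, storage_tb_months, egress_tb,
                     core_hour_rate=0.01, storage_rate=20.0, egress_rate=80.0):
        """Return a per-component and total cost breakdown in arbitrary currency units."""
        compute = core_hours * core_hour_rate
        storage = storage_tb_months * storage_rate
        network = egress_tb * egress_rate          # egress often dominates data-intensive workflows
        return {"compute": compute, "storage": storage,
                "network": network, "total": compute + storage + network}

    if __name__ == "__main__":
        breakdown = tco_estimate(core_hours=5_000_000, storage_tb_months=500, egress_tb=1_000)
        for item, cost in breakdown.items():
            print(f"{item:>8}: {cost:,.0f}")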


I/O in Machine Learning Applications on HPC Systems: A 360-degree Survey

(2025)

Growing interest in Artificial Intelligence (AI) has resulted in a surge in demand for faster methods of Machine Learning (ML) model training and inference. This demand for speed has prompted the use of high performance computing (HPC) systems that excel in managing distributed workloads. Because data is the main fuel for AI applications, the performance of the storage and I/O subsystem of HPC systems is critical. In the past, HPC applications accessed large portions of data written by simulations or experiments, or ingested data for visualization or analysis tasks. By contrast, ML workloads perform many small reads spread across a large number of random files. This shift in I/O access patterns poses several challenges to modern parallel storage systems. In this article, we survey I/O in ML applications on HPC systems, targeting literature within a six-year window from 2019 to 2024. We define the scope of the survey, provide an overview of the common phases of ML, review available profilers and benchmarks, examine the I/O patterns encountered during offline data preparation, training, and inference, and explore I/O optimizations used in modern ML frameworks and proposed in recent literature. Lastly, we seek to expose research gaps that could spawn further R&D.
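The access-pattern shift described above can be illustrated with a small, self-contained Python sketch that times many random small-file reads against one sequential read of the same samples packed into a single shard; the file counts and sizes are arbitrary and the timings only indicative.

    # Minimal sketch contrasting the two I/O patterns discussed above: many small
    # random reads (typical of ML training samples) versus one large sequential
    # read of a packed shard (typical of simulation output or sharded datasets).
    import os, random, tempfile, time

    def write_dataset(root, n_files=1000, size=4096):
        paths = []
        for i in range(n_files):
            p = os.path.join(root, f"sample_{i}.bin")
            with open(p, "wb") as f:
                f.write(os.urandom(size))
            paths.append(p)
        return paths

    with tempfile.TemporaryDirectory() as root:
        paths = write_dataset(root)

        t0 = time.perf_counter()
        for p in random.sample(paths, len(paths)):   # random small reads, one file per sample
            with open(p, "rb") as f:
                f.read()
        print(f"random small reads: {time.perf_counter() - t0:.3f} s")

        big = os.path.join(root, "packed.bin")
        with open(big, "wb") as packed:              # pack all samples into one shard,
            for p in paths:                          # as many ML I/O optimizations do
                with open(p, "rb") as f:
                    packed.write(f.read())

        t1 = time.perf_counter()
        with open(big, "rb") as f:                   # single sequential read of the shard
            f.read()
        print(f"sequential packed read: {time.perf_counter() - t1:.3f} s")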

Regen: An object layout regenerator on large-scale production HPC systems

(2025)

This article proposes an object layout regenerator called Regen, which dynamically regenerates and removes object layouts to improve the read performance of applications. Regen first detects frequent access patterns from the I/O requests of the applications. Second, Regen reorganizes the objects and regenerates or preallocates new object layouts according to the identified access patterns. Finally, Regen removes or reuses the obsolete or regenerated object layouts as necessary. As a result, Regen accelerates access to objects by providing a flexible object layout. We implement Regen as a framework on top of Proactive Data Container (PDC) and evaluate it on the Cori supercomputer, a production-scale HPC system, using realistic HPC I/O benchmarks. The experimental results show that Regen improves I/O performance by up to 16.92× compared with an existing system.
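The three steps described in the abstract (pattern detection, layout regeneration, and layout removal or reuse) can be sketched schematically as below. The class and method names are hypothetical and written against a toy in-memory store; they are not the actual Regen or PDC APIs.

    # Schematic sketch of the three Regen steps, against a toy in-memory object
    # store; names are illustrative, not the actual Regen or PDC interfaces.
    from collections import Counter

    class LayoutRegenerator:
        def __init__(self, store, threshold=3):
            self.store = store            # maps object id -> bytes
            self.pattern_counts = Counter()
            self.threshold = threshold
            self.layouts = {}             # access pattern (tuple of ids) -> packed layout

        def record_access(self, object_ids):
            """Step 1: detect frequent access patterns from incoming I/O requests."""
            pattern = tuple(object_ids)
            self.pattern_counts[pattern] += 1
            if self.pattern_counts[pattern] >= self.threshold and pattern not in self.layouts:
                self.regenerate(pattern)

        def regenerate(self, pattern):
            """Step 2: reorganize the objects into a contiguous, pre-packed layout."""
            self.layouts[pattern] = b"".join(self.store[oid] for oid in pattern)

        def read(self, object_ids):
            """Serve a request from a regenerated layout when one exists."""
            return self.layouts.get(tuple(object_ids)) or b"".join(self.store[o] for o in object_ids)

        def evict(self, pattern):
            """Step 3: remove an obsolete layout once its pattern is no longer hot."""
            self.layouts.pop(pattern, None)

    store = {"a": b"AA", "b": b"BB", "c": b"CC"}
    regen = LayoutRegenerator(store)
    for _ in range(3):
        regen.record_access(["a", "c"])          # pattern becomes hot, layout is regenerated
    print(regen.read(["a", "c"]))                # served from the packed layout: b'AACC'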


Data Readiness for AI: A 360-Degree Survey

(2025)

Artificial Intelligence (AI) applications critically depend on data. Poor-quality data produces inaccurate and ineffective AI models that may lead to incorrect or unsafe use. Evaluation of data readiness is a crucial step in improving the quality and appropriateness of data usage for AI. Considerable R&D effort has been spent on improving data quality; however, standardized metrics for evaluating data readiness for use in AI training are still evolving. In this study, we perform a comprehensive survey of metrics used to verify data readiness for AI training. This survey examines more than 140 papers from the ACM Digital Library, IEEE Xplore, journals such as Nature, publishers such as Springer and ScienceDirect, and online articles published by prominent AI experts. The survey aims to propose a taxonomy of data readiness for AI (DRAI) metrics for structured and unstructured datasets. We anticipate that this taxonomy will lead to new standards for DRAI metrics that enhance the quality, accuracy, and fairness of AI training and inference.
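As a minimal illustration of what data-readiness metrics can look like in practice, the sketch below computes three simple checks (completeness, duplication, and class balance) with pandas. The metric names and the example data are placeholders, not the taxonomy proposed in the survey.

    # A minimal sketch of a few common data-readiness checks; the metric names,
    # thresholds, and example data are illustrative only.
    import pandas as pd

    def readiness_report(df: pd.DataFrame, label_column: str) -> dict:
        completeness = 1.0 - df.isna().mean().mean()             # fraction of non-missing cells
        duplication = df.duplicated().mean()                     # fraction of exact duplicate rows
        class_counts = df[label_column].value_counts(normalize=True)
        class_balance = class_counts.min() / class_counts.max()  # 1.0 means perfectly balanced
        return {"completeness": completeness,
                "duplication": duplication,
                "class_balance": class_balance}

    if __name__ == "__main__":
        df = pd.DataFrame({"feature": [1.0, 2.0, None, 2.0],
                           "label":   ["a", "b", "a", "b"]})
        print(readiness_report(df, label_column="label"))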

Measurement of the top quark mass with the ATLAS detector using tt̄ events with a high transverse momentum top quark

(2025)

The mass of the top quark is measured using top-quark-top-antiquark pair events with high transverse momentum top quarks. The dataset, collected with the ATLAS detector in proton–proton collisions at √s = 13 TeV delivered by the Large Hadron Collider, corresponds to an integrated luminosity of 140 fb−1. The analysis targets events in the lepton-plus-jets decay channel, with an electron or muon from a semi-leptonically decaying top quark and a hadronically decaying top quark that is sufficiently energetic to be reconstructed as a single large-radius jet. The mean of the invariant mass of the reconstructed large-radius jet provides the sensitivity to the top quark mass and is simultaneously fitted with two additional observables to reduce the impact of the systematic uncertainties. The top quark mass is measured to be mt = 172.95 ± 0.53 GeV, which is the most precise ATLAS measurement from a single channel.
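For readers unfamiliar with the key observable, the sketch below computes the invariant mass of a large-radius jet from its constituents' four-momenta, m² = E² − |p|². The constituent values are invented, and this is only the basic kinematic quantity, not the analysis's calibrated reconstruction or fit.

    # Sketch of the basic quantity behind the measurement: the invariant mass of
    # a large-radius jet computed from its constituents' four-momenta. The
    # constituent values below are made up for illustration.
    import numpy as np

    def invariant_mass(px, py, pz, e):
        """Invariant mass of the summed four-vector, m^2 = E^2 - |p|^2."""
        px, py, pz, e = (np.sum(np.asarray(v)) for v in (px, py, pz, e))
        m2 = e**2 - (px**2 + py**2 + pz**2)
        return np.sqrt(max(m2, 0.0))

    # Hypothetical constituents of one reconstructed large-R jet, in GeV.
    px = [120.0, 95.0, 40.0]
    py = [30.0, -15.0, 10.0]
    pz = [200.0, 180.0, 60.0]
    e  = [240.0, 205.0, 75.0]
    print(f"jet mass: {invariant_mass(px, py, pz, e):.1f} GeV")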

Observation of VVZ production at √s = 13 TeV with the ATLAS detector

(2025)

A search for the production of three massive vector bosons, VVZ (V = W, Z), in proton–proton collisions at √s = 13 TeV is performed using data with an integrated luminosity of 140 fb−1 recorded by the ATLAS detector at the Large Hadron Collider. Events produced in the leptonic final states WWZ→ℓνℓνℓℓ (ℓ=e,μ), WZZ→ℓνℓℓℓℓ, ZZZ→ℓℓℓℓℓℓ, and the semileptonic final states WWZ→qqℓνℓℓ and WZZ→ℓνqqℓℓ, are analysed. The measured cross section for the pp→VVZ process is 660 +93/−90 (stat.) +88/−81 (syst.) fb, and the observed (expected) significance is 6.4 (4.7) standard deviations, representing the observation of VVZ production. In addition, the measured cross section for the pp→WWZ process is 442 ± 94 (stat.) +60/−52 (syst.) fb, and the observed (expected) significance is 4.4 (3.6) standard deviations, representing evidence of WWZ production. The measured cross sections are consistent with the Standard Model predictions. Constraints on physics beyond the Standard Model are also derived in the effective field theory framework by setting limits on Wilson coefficients for dimension-8 operators describing anomalous quartic gauge boson couplings.
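As context for the quoted significances, the sketch below evaluates the standard asymptotic significance Z = √(2[n ln(n/b) − (n − b)]) for a simple counting experiment with n observed events over an expected background b. The numbers are placeholders, and this simplified formula is not the statistical model used in the analysis.

    # Simplified illustration of quoting a counting-experiment significance in
    # standard deviations; inputs are placeholders, not the analysis yields.
    import math

    def asymptotic_significance(n_obs: float, b_exp: float) -> float:
        if n_obs <= b_exp:
            return 0.0
        return math.sqrt(2.0 * (n_obs * math.log(n_obs / b_exp) - (n_obs - b_exp)))

    print(f"Z = {asymptotic_significance(n_obs=85, b_exp=50):.1f} standard deviations")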

Erratum: Measurement of t-channel production of single top quarks and antiquarks in pp collisions at 13 TeV using the full ATLAS Run 2 data sample

(2025)

The performance of missing transverse momentum reconstruction and its significance with the ATLAS detector using 140 fb−1 of √s = 13 TeV pp collisions

(2025)

Abstract: This paper presents the reconstruction of missing transverse momentum (pTmiss) in proton–proton collisions at a center-of-mass energy of 13 TeV. This is a challenging task involving many detector inputs, combining fully calibrated electrons, muons, photons, hadronically decaying τ-leptons, hadronic jets, and soft activity from remaining tracks. Possible double counting of momentum is avoided by applying a signal ambiguity resolution procedure which rejects detector inputs that have already been used. Several pTmiss 'working points' are defined with varying stringency of selections, the tightest improving the resolution at high pile-up by up to 39% compared to the loosest. The pTmiss performance is evaluated using data and Monte Carlo simulation, with an emphasis on understanding the impact of pile-up, primarily using events consistent with leptonic Z decays. The studies use 140 fb−1 of data, collected by the ATLAS experiment at the Large Hadron Collider between 2015 and 2018. The results demonstrate that pTmiss reconstruction, and its associated significance, are well understood and reliably modelled by simulation. Finally, the systematic uncertainties on the soft pTmiss component are calculated. After various improvements the scale and resolution uncertainties are reduced by up to 76% and 51%, respectively, compared to the previous calculation at a lower luminosity.
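The basic definition behind the paper, pTmiss as the negative vector sum of the transverse momenta of all selected objects plus a soft term, can be sketched in a few lines; the object momenta below are invented, and the real reconstruction additionally applies the calibration and ambiguity-resolution steps described above.

    # Minimal sketch of the missing transverse momentum definition: the negative
    # vector sum of the selected objects' transverse momenta plus the soft term.
    # Object values are invented for illustration.
    import numpy as np

    def missing_pt(objects_px, objects_py, soft_px=0.0, soft_py=0.0):
        """Return (pTmiss magnitude, azimuthal angle) from calibrated objects and the soft term."""
        mpx = -(np.sum(objects_px) + soft_px)
        mpy = -(np.sum(objects_py) + soft_py)
        return np.hypot(mpx, mpy), np.arctan2(mpy, mpx)

    # Hypothetical calibrated objects (electrons, muons, photons, taus, jets), in GeV.
    px = np.array([45.0, -30.0, 12.0, -60.0])
    py = np.array([20.0, 55.0, -8.0, -35.0])
    pt_miss, phi = missing_pt(px, py, soft_px=3.0, soft_py=-4.0)
    print(f"pTmiss = {pt_miss:.1f} GeV, phi = {phi:.2f}")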


The DESI One-Percent Survey: Modelling the clustering and halo occupation of all four DESI tracers with UCHUU

(2025)

We present results from a set of mock lightcones for the DESI One-Percent Survey, created from the UCHUU simulation. This 8 h⁻³ Gpc³ N-body simulation comprises 2.1 trillion particles and provides high-resolution dark matter (sub)haloes in the framework of the Planck-based ΛCDM cosmology. Employing the subhalo abundance matching (SHAM) technique, we populated the UCHUU (sub)haloes with all four DESI tracers - Bright Galaxy Survey (BGS), luminous red galaxies (LRGs), emission line galaxies (ELGs), and quasars (QSOs) - out to z = 2.1. Our method accounts for redshift evolution as well as the clustering dependence on luminosity and stellar mass. The two-point clustering statistics of the DESI One-Percent Survey generally agree with predictions from UCHUU across scales ranging from 0.3 h⁻¹ Mpc to 100 h⁻¹ Mpc for the BGS and from 5 h⁻¹ Mpc to 100 h⁻¹ Mpc for the other tracers. We observe some differences in the clustering statistics that can be attributed to incompleteness at the massive end of the stellar mass function of LRGs, our use of a simplified galaxy-halo connection model for ELGs and QSOs, and cosmic variance. At the high precision of UCHUU, we find that the shape of the halo occupation distribution (HOD) of the BGS and LRG samples corresponds to slightly smaller bias values, likely due to cosmic variance. The bias dependence on absolute magnitude, stellar mass, and redshift aligns with that of previous surveys. These results provide DESI with tools to generate high-fidelity lightcones for the remainder of the survey and enhance our understanding of the galaxy-halo connection.
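The core SHAM step, matching subhaloes ranked by a mass proxy to galaxies ranked by luminosity, can be sketched as below with synthetic inputs. Scatter, redshift evolution, and incompleteness, which the full method accounts for, are ignored here, and the variable names are illustrative.

    # A toy sketch of the rank-ordered subhalo abundance matching (SHAM) step:
    # subhaloes sorted by a mass proxy are matched one-to-one to galaxies sorted
    # by luminosity; all inputs are synthetic.
    import numpy as np

    rng = np.random.default_rng(42)

    def abundance_match(vpeak, luminosity):
        """Assign each subhalo a galaxy luminosity by matching sorted ranks."""
        halo_order = np.argsort(vpeak)[::-1]          # most massive subhalo first
        lum_sorted = np.sort(luminosity)[::-1]        # brightest galaxy first
        assigned = np.empty_like(lum_sorted)
        assigned[halo_order] = lum_sorted             # brightest galaxy -> most massive subhalo
        return assigned

    vpeak = rng.lognormal(mean=5.0, sigma=0.5, size=1000)        # subhalo Vpeak proxy
    luminosity = rng.lognormal(mean=10.0, sigma=1.0, size=1000)  # mock galaxy luminosities
    galaxy_lum = abundance_match(vpeak, luminosity)
    print(galaxy_lum[:5])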

Search for tt̄H/A→tt̄tt̄ production in proton–proton collisions at √s = 13 TeV with the ATLAS detector

(2025)

Abstract: A search is presented for a heavy scalar (H) or pseudo-scalar (A) predicted by the two-Higgs-doublet models, where the H/A is produced in association with a top-quark pair (tt̄H/A), and with the H/A decaying into a tt̄ pair. The full LHC Run 2 proton–proton collision data collected by the ATLAS experiment is used, corresponding to an integrated luminosity of 139 fb−1. Events are selected requiring exactly one or two opposite-charge electrons or muons. Data-driven corrections are applied to improve the modelling of the tt̄+jets background in the regime with high jet and b-jet multiplicities. These include a novel multi-dimensional kinematic reweighting based on a neural network trained using data and simulations. An H/A-mass parameterised graph neural network is trained to optimise the signal-to-background discrimination. In combination with the previous search performed by the ATLAS Collaboration in the multilepton final state, the observed upper limits on the tt̄H/A → tt̄tt̄ production cross-section at 95% confidence level range between 14 fb and 5.0 fb for an H/A with mass between 400 GeV and 1000 GeV, respectively. Assuming that both the H and A contribute to the tt̄tt̄ cross-section, tan β values below 1.7 or 0.7 are excluded for a mass of 400 GeV or 1000 GeV, respectively. The results are also used to constrain a model predicting the pair production of a colour-octet scalar, with the scalar decaying into a tt̄ pair.
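The mass-parameterised classifier idea, training a single model that takes the hypothesised H/A mass as an extra input so it covers the whole mass range, can be sketched with a small dense network standing in for the analysis's graph neural network. All data and parameter choices below are synthetic and illustrative.

    # Toy sketch of a mass-parameterised classifier: the hypothesised mass is
    # appended as an input feature, with background events assigned masses drawn
    # from the signal grid. A small dense network stands in for the analysis's
    # graph neural network; all data are synthetic.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    mass_grid = np.array([400.0, 700.0, 1000.0])

    n = 3000
    is_signal = rng.integers(0, 2, size=n)
    mass = rng.choice(mass_grid, size=n)                      # mass hypothesis (random for background)
    # Synthetic discriminating feature whose mean shifts with the signal mass.
    feature = rng.normal(loc=np.where(is_signal, mass / 200.0, 2.0), scale=1.0)

    X = np.column_stack([feature, mass / 1000.0])             # kinematic feature + parameterising mass
    model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
    model.fit(X, is_signal)

    # Evaluate the same event under two different mass hypotheses.
    event = np.array([[4.5, 0.4], [4.5, 1.0]])
    print(model.predict_proba(event)[:, 1])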