In this paper we demonstrate the practical portability of a simple version of matrix multiplication designed to exploit maximal and predictable locality at all levels of the memory hierarchy, with no a priori knowledge of the specific organization of the memory system for any particular machine. We show that memory hierarchies portability does not sacrifice floating point performance, which is always a significant fraction of peak and, at least on one machine, is higher than ATLAS and vendor multiplication.
We present a proof of concept of the fact that the theoretical conclusions on locality exploitation yield practical implementations with the desired properties.