Open Access Publications from the University of California

## Order statistics and variability in data streams

Abstract

High-volume data streams are too large, and grow too quickly, to store entirely in working memory, introducing new challenges to what might otherwise be simple calculations. For one, nearly all interesting statistics of a stream can only be estimated without recalling expired data, which can be costly or impossible. For another, even estimating a statistic can be infeasible if the space required to do so grows too quickly with the size of the stream.

One such set of fundamental statistics is the set of quantiles (order statistics) of a dataset defined by an aggregating stream of items. We develop an approximate quantile summary with probabilistic guarantees that yields a new upper bound on the space needed for such a summary when the items are subject only to comparison operations.
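As a point of reference for the comparison-only setting, a sketch of the simplest probabilistic approach is shown below: reservoir sampling followed by an order-statistic lookup, with the sample size chosen from the standard DKW/Chernoff-style bound. This is a generic illustration of an approximate quantile summary with probabilistic guarantees, not the summary developed in this work; the function name and parameters are ours.

```python
import math
import random

def approx_quantile(stream, q, eps=0.05, delta=0.01):
    """Estimate the q-th quantile of a stream using only comparisons.

    A uniform reservoir sample of k = O(log(1/delta) / eps^2) items is
    kept; with probability at least 1 - delta, the returned item's rank
    is within eps*n of q*n.  Memory is O(k), independent of the stream
    length n.  (Illustrative baseline only, not this work's summary.)
    """
    k = math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))
    reservoir, n = [], 0
    for item in stream:
        n += 1
        if len(reservoir) < k:
            reservoir.append(item)
        else:
            # Replace a reservoir slot with probability k/n.
            j = random.randrange(n)
            if j < k:
                reservoir[j] = item
    reservoir.sort()  # only comparison operations are applied to items
    idx = min(len(reservoir) - 1, int(q * len(reservoir)))
    return reservoir[idx]
```

Even this naive baseline shows the shape of the problem: the guarantee is probabilistic, and the space bound depends on the accuracy parameters rather than on $n$. The summaries studied here improve on such bounds in the comparison model.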

The difficulties of data streams are compounded when streams may arrive distributed across several locations, as would occur in sensor networks and telecommunication networks; in addition to handling the high volume of these streams we must also coordinate among the sites to ensure consistency for any statistics we wish to track, ideally with a minimum of communication.

We introduce a new, natural parameter for data streams, the "variability" $v$, that permits us to easily extend existing algorithms for the aggregating streaming model to one in which streams are composed of insertion and deletion transactions. For this second model, our definition refines existing worst-case communication bounds from $O(n)$ to $\tilde{O}(v)$ for a host of problems. We further show that the variabilities of many streams of interest grow slowly with respect to the size of the streams.
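The transaction model can be sketched as follows. The formal definition of the variability $v$ is given in the body of this work; the code below only illustrates the stream model itself, and why the stream length $n$ can badly overestimate the live state, which is the gap a parameter like $v$ is meant to capture. The function name and the $(op, item)$ encoding are our own illustrative choices.

```python
from collections import Counter

def process(transactions):
    """Replay a stream of (op, item) transactions, where op is +1 for
    an insertion and -1 for a deletion.  Returns the stream length n
    and the peak number of live items.  (Illustrates the model only;
    the peak live count is not the formal variability v.)"""
    live = Counter()
    n = peak = 0
    for op, item in transactions:
        n += 1
        live[item] += op
        if live[item] == 0:
            del live[item]
        peak = max(peak, sum(live.values()))
    return n, peak

# A stream that repeatedly inserts and then deletes the same item:
stream = [(+1, "x"), (-1, "x")] * 500
# n = 1000 transactions, but at most 1 item is ever live.
```

Streams like this one make the point behind the $\tilde{O}(v)$ refinement concrete: a worst-case bound in terms of $n$ charges for every transaction, even when the underlying dataset barely changes.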