If you want to uniformly sample a handful of elements from a very large stream of data you probably don’t want to read it all into memory first. It would be ideal if you could sample while streaming the data. For this, reservoir sampling provides one answer. The algorithm is summarized thusly
- Let
be the number of samples required (without replacement),
the array to hold the
samples, and
the stream of elements.
- Fill
with the first
elements from
- for each
, randomly sample
, and set
if
.
- The sample set is the final state of
Can we show that this algorithm does indeed uniformly sample the elements (without replacement), that is, the probability of sampling any subset of elements is
?
Suppose ,
, and
. The probability of picking
is the probability of picking it at step
when
and then making sure in the later steps it doesn’t get replaced by another element
Thus, the sampling is uniform when . In general, for sampling
elements
the probability due to step
is
That is the probability that each element is put into its own slot. The probability that doesn’t get overwritten by the upcoming elements (except for
which have already been picked by step
) is
Thus, the probability of selecting according to the algorithm is