Cammie Xiaotu: Reservoir Sampling
Distributed Reservoir Sampling VariationThis is the problem that got me looking at the weighted sample above. In both of the above algorithms, I can process the stream in O(N) time where N is length of the stream, in other words: in a single pass. If I want to break break up the problem on say 10 machines and solve it close to 10 times faster, how can I do that?
The answer is to have each of the 10 machines take roughly 1/10th of the input to process and generate their own reservoir sample from their subset of the data. Then, a final process must take the 10 output reservoirs and merge them with another sample. The trick is that the final process must take into account the reservoir weights of each of the input reservoirs. Basically, the final process should reservoir sample every element in the 10 input reservoirs but assign each element a weight of weight[x] * seen_weight / reservoir_weight where weight[x] is the element's original weight, seen_weight is the total weight counted in generating that particular reservoir and reservoir_weight is the sum of the weights of the elements in the reservoir. Proving this works is a little harder, so I will frustrate you by leaving it as an excercise for the dear reader.
Read full article from Cammie Xiaotu: Reservoir Sampling
No comments:
Post a Comment