All About Programming: Spark streaming with Checkpoint

Spark streaming with Checkpoint - Stack Overflow

In streaming scenarios, holding 24 hours of data is usually too much. To solve that, you use probabilistic methods instead of exact measures for the streaming part and, if needed, run a batch computation later to get the exact numbers.

In your case, to get a distinct count, you can use an algorithm called HyperLogLog. You can see an example of using Twitter's implementation of HyperLogLog (part of a library called Algebird) from Spark Streaming here
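
To make the idea concrete, here is a minimal, self-contained sketch of HyperLogLog in plain Python. It is not Algebird's implementation and does not use Spark; the class name `HyperLogLog`, the parameter `p`, and the choice of SHA-1 as the hash are assumptions made for illustration. It only shows why the approach fits streaming: the state is a fixed-size array of registers, and two sketches combine by taking the register-wise maximum.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog: approximate distinct count in fixed memory."""

    def __init__(self, p=12):
        self.p = p                       # number of hash bits used as register index
        self.m = 1 << p                  # number of registers (4096 for p=12)
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Take a 64-bit hash from the first 8 bytes of SHA-1 (an assumption;
        # any well-mixed 64-bit hash would do).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # top p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Sketches with the same p combine by register-wise maximum.
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different p")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    # Two "batches" with overlapping ids, merged into one running sketch.
    running, batch = HyperLogLog(), HyperLogLog()
    for i in range(0, 6000):
        running.add(f"user-{i}")
    for i in range(4000, 10000):
        batch.add(f"user-{i}")
    running.merge(batch)
    print(round(running.count()))  # an estimate of the 10000 distinct ids
```

With `p=12` the sketch keeps 4096 small registers regardless of how many items pass through, and the typical relative error is about 1.04 / sqrt(4096), roughly 1.6%. In a streaming job, the state carried across batches (and written to a checkpoint) would be these registers rather than 24 hours of raw events; how that state is wired into Spark is left out here.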

