Spark streaming with Checkpoint - Stack Overflow
In streaming scenarios, holding 24 hours of data is usually too much. To solve that, you use probabilistic methods instead of exact measures on the streaming path, and perform a later batch computation to get the exact numbers (if needed).
In your case, to get a distinct count you can use an algorithm called HyperLogLog, which estimates the number of distinct items in a fixed amount of memory. The original answer links to an example that uses Twitter's implementation of HyperLogLog (part of a library called Algebird) from Spark Streaming.
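To make the idea concrete, here is a minimal, self-contained HyperLogLog sketch in plain Python. It is only an illustration of the technique, not Algebird's implementation and not Spark code: the class name `HyperLogLog`, the precision parameter `p`, the choice of SHA-1 as the hash, and the `user-N` keys are all assumptions made for the example. The `merge` method shows why the structure suits streaming: each batch can build its own small sketch, and the sketches combine without keeping the raw data.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch (hypothetical helper, not Algebird's API)."""

    def __init__(self, p=12):
        self.p = p                      # number of hash bits used as register index
        self.m = 1 << p                 # number of registers (4096 for p=12)
        self.registers = [0] * self.m
        # Bias-correction constant from the HyperLogLog paper, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Derive a 64-bit hash from the first 8 bytes of SHA-1 (an assumption;
        # any well-mixed 64-bit hash would do).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        idx = h >> width                       # top p bits select a register
        rest = h & ((1 << width) - 1)          # remaining bits
        rank = width - rest.bit_length() + 1   # position of the leftmost 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Union of two sketches: element-wise maximum of the registers.
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different precision")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    # Two "batches" with overlapping users, merged into one running sketch.
    total = HyperLogLog()
    for start in (0, 30000):
        batch = HyperLogLog()
        for i in range(start, start + 50000):
            batch.add(f"user-{i}")
        total.merge(batch)
    print(f"estimated distinct users: {total.count():.0f} (true value 80000)")
```

With `p=12` the sketch keeps 4096 small registers regardless of how many events pass through it, and the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%. If exact figures are required, they can still come from the later batch computation.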