As I work more in distributed systems, it feels like streams of data are almost everywhere. The natural follow-on is that, if you are successful, you will likely need to implement random sampling in several places for monitoring, logging, or data analysis. For streams where the full size of the dataset is unknown, achieving a truly random sample can be complicated.
Enter a family of algorithms called reservoir sampling. The gist is that they help you sample k elements uniformly at random from a stream of n elements, where n is not known in advance.
I have loved the most efficient version of the algorithm ever since reading about it here.
In pseudocode, it goes like this:
1. Keep an array of size k called a.
2. While i <= k, add the i-th element to a.
3. When i > k, pick a random integer j between 1 and i; if j <= k, replace a[j] with the new element.
That's it! The algorithm does O(1) work per element with only O(k) memory, and at any point the array a is a uniform random sample of everything seen so far.
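Here is a minimal sketch of that procedure in Python. The function name `reservoir_sample` and the iterable-based interface are my own choices for illustration:

```python
import random

def reservoir_sample(stream, k):
    """Return k items sampled uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # Fill the reservoir with the first k elements.
            reservoir.append(item)
        else:
            # With probability k/i, replace a uniformly chosen slot.
            j = random.randint(1, i)
            if j <= k:
                reservoir[j - 1] = item
    return reservoir
```

For example, `reservoir_sample(range(1_000_000), 5)` returns five values, and each element of the range has the same 5/1,000,000 chance of ending up in the sample, even though the function never needed to know the stream's length up front.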