You are given an infinite or extremely large data stream (a stream of integers, say) of unknown total size that cannot fit into memory all at once. You are required to draw k data points from it uniformly at random.
Constraints:
- The total length of the stream is unknown in advance.
- The data cannot all be held in memory, so each element can be inspected only once as it streams past.

This problem is usually divided into two parts: first, implementing the sampling itself; second, verifying that the resulting sample is uniform. [Source: darkinterview.com]

Since the total length of the data stream is unknown, we cannot use conventional tools like random.choice, which require a sequence of known length. We must use the Reservoir Sampling algorithm.
Algorithm Logic:
1. Fill the reservoir with the first k elements of the stream.
2. For the i-th element (1-indexed, i > k), keep it with probability k / i; if kept, it overwrites a uniformly random slot in the reservoir.
3. When the stream ends (or is cut off), the reservoir is the sample.

Why is this uniform? We can prove it by induction: after processing i elements, every element seen so far sits in the reservoir with probability exactly k / i. [Source: darkinterview.com]
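For intuition, the invariant can also be checked directly with a telescoping product (a sketch, writing n for the final stream length): an element arriving at position i > k enters the reservoir with probability k/i, and each later element j evicts it with probability (k/j) * (1/k) = 1/j, so

\[
P(\text{element } i \text{ in final sample})
  = \frac{k}{i}\prod_{j=i+1}^{n}\left(1 - \frac{1}{j}\right)
  = \frac{k}{i}\prod_{j=i+1}^{n}\frac{j-1}{j}
  = \frac{k}{i}\cdot\frac{i}{n}
  = \frac{k}{n}.
\]

The same product, started at j = k + 1 with entry probability 1, gives k/n for the first k elements as well.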
Perplexity has a strong preference for Python; using Python's standard library is recommended.
import random
from typing import Iterator, List

def reservoir_sampling(stream: Iterator[int], k: int) -> List[int]:
    """
    Randomly samples k elements from a stream of unknown length.
    """
    reservoir: List[int] = []
    for i, item in enumerate(stream):
        # index i starts from 0, corresponding to the (i + 1)-th element
        current_count = i + 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k elements
            reservoir.append(item)
        else:
            # Once the reservoir is full, keep the current element
            # with probability k / current_count
            # random.random() generates a float in [0.0, 1.0)
            if random.random() < k / current_count:
                # Randomly choose a position in the reservoir to replace
                replace_idx = random.randint(0, k - 1)
                reservoir[replace_idx] = item
    return reservoir
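A quick smoke test (the generator and the choice of k = 10 here are just illustrative):

# Example usage: sample 10 values from a simulated large stream.
# A generator is used so the full range never resides in memory at once.
if __name__ == "__main__":
    big_stream = (x for x in range(1_000_000))
    print(reservoir_sampling(big_stream, k=10))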
The interviewer will follow up: "How do you prove that your algorithm is uniformly distributed? That is, how do you verify that each element has an equal probability of being selected?"
Note: Interviewers usually do not require a rigorous mathematical derivation but instead ask for a code implementation that verifies the claim. You need to demonstrate how to check the results by simulating many samplings and applying statistical methods. [Source: darkinterview.com]
Steps:
1. Fix a stream of n known elements (e.g., the integers 0 to n-1) and run the sampling a large number of times (e.g., num_trials = 100,000).
2. Count how many times each element appears across all trials, e.g., with collections.Counter.
3. Compare each observed count against the expected count num_trials * (k / n); under uniformity they should agree up to statistical noise, which a chi-square test can quantify.

Proficiency with collections.Counter and scipy.stats (if third-party libraries are allowed), or with implementing simple statistical calculations manually, is required.
from collections import Counter
from scipy import stats  # This is a bonus, showing knowledge of statistical libraries

def verify_sampling(n: int, k: int, num_trials: int = 100000):
    """
    Verifies the uniformity of Reservoir Sampling.
    n: Total size of the stream (e.g., numbers from 0 to n-1 in the stream)
    k: Sample size
    num_trials: Number of simulations
    """
    counts = Counter()
    for _ in range(num_trials):
        # Generate a new stream (0 to n-1) for each trial
        stream = iter(range(n))
        sample = reservoir_sampling(stream, k)
        for num in sample:
            counts[num] += 1

    # --- Analyze Results ---
    print(f"Total elements: {n}, Sample size: {k}, Trials: {num_trials}")
    expected_count = num_trials * (k / n)
    print(f"Expected count per element: {expected_count}")

    # Prepare data for Chi-Square Test
    observed_frequencies = []
    expected_frequencies = []
    max_error = 0

    print("\n--- Sample Frequencies ---")
    # Check the frequency of each number (0 to n-1)
    for i in range(n):
        obs = counts[i]
        observed_frequencies.append(obs)
        expected_frequencies.append(expected_count)
        error_pct = abs(obs - expected_count) / expected_count * 100
        max_error = max(max_error, error_pct)
        # Print only the first and last few elements to keep the output short
        if i < 5 or i >= n - 5:
            print(f"Element {i}: observed={obs}, expected={expected_count:.1f}, error={error_pct:.2f}%")

    print(f"\nMax deviation from expected count: {max_error:.2f}%")

    chi2_stat, p_val = stats.chisquare(f_obs=observed_frequencies, f_exp=expected_frequencies)
    print("\n--- Chi-Square Test ---")
    print(f"Chi-square statistic: {chi2_stat:.4f}")
    print(f"p-value: {p_val:.4f}")
    # 0.05 is the conventional significance level
    if p_val > 0.05:
        print("PASS: No significant deviation from uniformity detected.")
    else:
        print("FAIL: Observed frequencies deviate significantly from uniform.")
Why Chi-Square? The chi-square goodness-of-fit test is designed for exactly this situation: comparing observed frequencies of categorical outcomes against the frequencies expected under a hypothesized distribution (here, uniform). A large p-value means the observed deviations are consistent with random noise; a small one suggests the sampler is biased. [Source: darkinterview.com]
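If third-party libraries are not allowed, the statistic itself is easy to compute by hand. A minimal sketch (it yields only the statistic; getting a p-value without scipy would require the chi-square CDF):

def chi_square_statistic(observed, expected):
    # Goodness-of-fit statistic: sum of (O - E)^2 / E over all categories
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Compare the result against a critical value from a chi-square table with
# n - 1 degrees of freedom (e.g., about 16.92 for df = 9 at alpha = 0.05).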
Memory Complexity: O(k), regardless of how long the stream is; only the reservoir itself is ever held in memory. Time complexity is O(n) for a stream of n elements, with O(1) work per element.
Randomness Quality:
Python's random module uses the Mersenne Twister algorithm, which is good enough for most simulations but is not cryptographically secure. If cryptographic security is needed, the secrets module should be used (though it is usually not necessary for this type of interview question).
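For completeness, a sketch of the secure variant (random.SystemRandom draws from the OS entropy pool, which is what the secrets module uses under the hood):

import random

secure_rng = random.SystemRandom()  # OS-backed, cryptographically secure

# Drop-in replacements inside reservoir_sampling:
#   secure_rng.random()          instead of random.random()
#   secure_rng.randint(0, k - 1) instead of random.randint(0, k - 1)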