You are given an infinite or extremely large data stream (a stream of integers, say) of unknown total size that cannot fit into memory all at once. You are required to draw k data points from it uniformly at random.
Constraints:
- The total length of the stream is unknown in advance.
- The data cannot all be held in memory, so each element can be inspected only once as it streams past.

This problem is usually divided into two parts: first, implementing the sampling itself; second, verifying that the resulting sample is uniform. [Source: darkinterview.com]

Since the total length of the data stream is unknown, we cannot use conventional tools like random.choice, which require a sequence of known length. We must use the Reservoir Sampling algorithm.
Algorithm Logic:
1. Fill the reservoir with the first k elements of the stream.
2. For the i-th element (1-indexed, i > k), keep it with probability k / i; if kept, it overwrites a uniformly random slot in the reservoir.
3. When the stream ends (or is cut off), the reservoir is the sample.

Why is this uniform? We can prove it by induction: after processing i elements, every element seen so far sits in the reservoir with probability exactly k / i. [Source: darkinterview.com]
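For intuition, the invariant can also be checked directly with a telescoping product (a sketch, writing n for the final stream length): an element arriving at position i > k enters the reservoir with probability k/i, and each later element j evicts it with probability (k/j) * (1/k) = 1/j, so

\[
P(\text{element } i \text{ in final sample})
  = \frac{k}{i}\prod_{j=i+1}^{n}\left(1 - \frac{1}{j}\right)
  = \frac{k}{i}\prod_{j=i+1}^{n}\frac{j-1}{j}
  = \frac{k}{i}\cdot\frac{i}{n}
  = \frac{k}{n}.
\]

The same product, started at j = k + 1 with entry probability 1, gives k/n for the first k elements as well.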
Perplexity has a strong preference for Python; using Python's standard library is recommended.
import random
from typing import Iterator, List

def reservoir_sampling(stream: Iterator[int], k: int) -> List[int]:
    """
    Randomly samples k elements from a stream of unknown length.
    """
    reservoir: List[int] = []
    for i, item in enumerate(stream):
        # index i starts from 0, corresponding to the (i + 1)-th element
        current_count = i + 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k elements
            reservoir.append(item)
        else:
            # Once the reservoir is full, keep the current element
            # with probability k / current_count
            # random.random() generates a float in [0.0, 1.0)
            if random.random() < k / current_count:
                # Randomly choose a position in the reservoir to replace
                replace_idx = random.randint(0, k - 1)
                reservoir[replace_idx] = item
    return reservoir
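A quick smoke test (the generator and the choice of k = 10 here are just illustrative):

# Example usage: sample 10 values from a simulated large stream.
# A generator is used so the full range never resides in memory at once.
if __name__ == "__main__":
    big_stream = (x for x in range(1_000_000))
    print(reservoir_sampling(big_stream, k=10))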
The interviewer will follow up: "How do you prove that your algorithm is uniformly distributed? That is, how do you verify that each element has an equal probability of being selected?"
Note: Interviewers usually do not require a rigorous mathematical derivation but instead ask for a code implementation that verifies the claim. You need to demonstrate how to check the results by simulating many samplings and applying statistical methods. [Source: darkinterview.com]
Steps:
1. Fix a stream of n known elements (e.g., the integers 0 to n-1) and run the sampling a large number of times (e.g., num_trials = 100,000).
2. Count how many times each element appears across all trials, e.g., with collections.Counter.
3. Compare each observed count against the expected count num_trials * (k / n); under uniformity they should agree up to statistical noise, which a chi-square test can quantify.

Proficiency with collections.Counter and scipy.stats (if third-party libraries are allowed), or with implementing simple statistical calculations manually, is required.
from collections import Counter
from scipy import stats  # This is a bonus, showing knowledge of statistical libraries

def verify_sampling(n: int, k: int, num_trials: int = 100000):
    """
    Verifies the uniformity of Reservoir Sampling.
    n: Total size of the stream (e.g., numbers from 0 to n-1 in the stream)
    k: Sample size
    num_trials: Number of simulations
    """
    counts = Counter()
    for _ in range(num_trials):
        # Generate a new stream (0 to n-1) for each trial
        stream = iter(range(n))
        sample = reservoir_sampling(stream, k)
        for num in sample:
            counts[num] += 1

    # --- Analyze Results ---
    print(f"Total elements: {n}, Sample size: {k}, Trials: {num_trials}")
    expected_count = num_trials * (k / n)
    print(f"Expected count per element: {expected_count}")

    # Prepare data for Chi-Square Test
    observed_frequencies = []
    expected_frequencies = []
    max_error = 0

    print("\n--- Sample Frequencies ---")
    # Check the frequency of each number (0 to n-1)
    for i in range(n):
        obs = counts[i]
        observed_frequencies.append(obs)
        expected_frequencies.append(expected_count)
        error_pct = abs(obs - expected_count) / expected_count * 100
        max_error = max(max_error, error_pct)
        # Print only the first and last few elements to keep the output short
        if i < 5 or i >= n - 5:
            print(f"Element {i}: observed={obs}, expected={expected_count:.1f}, error={error_pct:.2f}%")

    print(f"\nMax deviation from expected count: {max_error:.2f}%")

    chi2_stat, p_val = stats.chisquare(f_obs=observed_frequencies, f_exp=expected_frequencies)
    print("\n--- Chi-Square Test ---")
    print(f"Chi-square statistic: {chi2_stat:.4f}")
    print(f"p-value: {p_val:.4f}")
    # 0.05 is the conventional significance level
    if p_val > 0.05:
        print("PASS: No significant deviation from uniformity detected.")
    else:
        print("FAIL: Observed frequencies deviate significantly from uniform.")
Why Chi-Square? The chi-square goodness-of-fit test is designed for exactly this situation: comparing observed frequencies of categorical outcomes against the frequencies expected under a hypothesized distribution (here, uniform). A large p-value means the observed deviations are consistent with random noise; a small one suggests the sampler is biased. [Source: darkinterview.com]
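If third-party libraries are not allowed, the statistic itself is easy to compute by hand. A minimal sketch (it yields only the statistic; getting a p-value without scipy would require the chi-square CDF):

def chi_square_statistic(observed, expected):
    # Goodness-of-fit statistic: sum of (O - E)^2 / E over all categories
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Compare the result against a critical value from a chi-square table with
# n - 1 degrees of freedom (e.g., about 16.92 for df = 9 at alpha = 0.05).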
Memory Complexity: O(k), regardless of how long the stream is; only the reservoir itself is ever held in memory. Time complexity is O(n) for a stream of n elements, with O(1) work per element.
Randomness Quality:
Python's random module uses the Mersenne Twister algorithm, which is good enough for most simulations but is not cryptographically secure. If cryptographic security is needed, the secrets module should be used (though it is usually not necessary for this type of interview question).
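For completeness, a sketch of the secure variant (random.SystemRandom draws from the OS entropy pool, which is what the secrets module uses under the hood):

import random

secure_rng = random.SystemRandom()  # OS-backed, cryptographically secure

# Drop-in replacements inside reservoir_sampling:
#   secure_rng.random()          instead of random.random()
#   secure_rng.randint(0, k - 1) instead of random.randint(0, k - 1)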