Stream Processing with Stop Words
Problem Overview
Given an infinite character stream (or a very large text stream that cannot fit into memory) and a list of stop words (sensitive words), return the substring that appears before the first occurrence of any stop word.
Constraints:
- Memory-Efficient: The input is extremely large and cannot be loaded into memory all at once; it must be read in chunks.
- Python Generator: Must use the yield keyword to implement streaming processing.
- Cross-Chunk Handling: A stop word may be split across two consecutive chunks, and the system must correctly identify it.
Example: [Source: darkinterview.com]
stop_words = ["<stop>", "<end>"]
stream_chunks = ["This is a te", "st<st", "op> message"]
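With this input, "<stop>" is split across the second and third chunks ("st<st" + "op> message"), so the expected result is "This is a test".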
Part 1: Core Algorithm
Key Challenge: Cross-Chunk Stop Words
The most critical difficulty in this problem is handling stop words that are split across chunk boundaries. For example:
- Chunk 1: "...te"
- Chunk 2: "st<st"
- Chunk 3: "op>..."
The word <stop> is actually split across chunks 2 and 3. If we only check within each chunk, we will miss it.
Solution: Buffer-Based Sliding Window
Core Idea: [Source: darkinterview.com]
- Maintain a buffer to store unconsumed characters from the previous chunk.
- When processing each new chunk, prepend the buffer to it.
- After processing, save the last few characters (length = max_stop_word_length - 1) to the buffer for the next iteration.
- This ensures that stop words spanning chunk boundaries can be detected.
Why max_stop_word_length - 1?
- If the longest stop word is 6 characters, we only need to keep the last 5 characters from the current chunk.
- When combined with the first character of the next chunk, we can form a complete 6-character window to check for stop words.
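For instance, here is the carry-over rule in isolation (a minimal sketch, not from the source, assuming max_stop_len = 6 for "<stop>"):
text = buffer + chunk                    # e.g. "" + "This is a te"
safe = max(0, len(text) - (6 - 1))       # 12 - 5 = 7 characters are safe to emit
emit, buffer = text[:safe], text[safe:]  # emit "This is", keep " a te" for the next chunk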
Part 2: Implementation
Generator-Based Stream Processor
from typing import Iterator, List, Optional

def process_stream_with_stopwords(
    stream: Iterator[str],
    stop_words: List[str]
) -> Iterator[str]:
    """
    Process a character stream and yield characters until a stop word is found.

    Args:
        stream: An iterator yielding string chunks
        stop_words: A list of stop words to detect

    Yields:
        Characters before the first stop word
    """
    # No stop words: pass every chunk through unchanged.
    if not stop_words:
        for chunk in stream:
            yield chunk
        return

    max_stop_len = max(len(word) for word in stop_words)
    buffer = ""

    for chunk in stream:
        text = buffer + chunk

        # Find the earliest occurrence of any stop word in the combined text.
        earliest_pos = len(text)
        found_stop_word = None
        for stop_word in stop_words:
            pos = text.find(stop_word)
            if pos != -1 and pos < earliest_pos:
                earliest_pos = pos
                found_stop_word = stop_word

        if found_stop_word:
            # Emit everything before the stop word, then stop.
            if earliest_pos > 0:
                yield text[:earliest_pos]
            return

        # No match: everything except the last (max_stop_len - 1) characters
        # is safe to emit; the retained tail may be the start of a stop word
        # that continues in the next chunk.
        safe_to_yield_len = max(0, len(text) - (max_stop_len - 1))
        if safe_to_yield_len > 0:
            yield text[:safe_to_yield_len]
            buffer = text[safe_to_yield_len:]
        else:
            buffer = text

    # Stream exhausted without a stop word: flush whatever is left.
    if buffer:
        yield buffer

def extract_text_before_stopword(stream: Iterator[str], stop_words: List[str]) -> str:
    """Convenience wrapper: collect the generator's output into one string."""
    return ''.join(process_stream_with_stopwords(stream, stop_words))
Usage Example
def create_stream_1():
    chunks = ["This is a te", "st<st", "op> message"]
    for chunk in chunks:
        yield chunk
stop_words = ["<stop>", "<end>"]
result = extract_text_before_stopword(create_stream_1(), stop_words)
print(result)
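# Expected output: "This is a test" ("<stop>" spans the chunk boundary)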
def create_stream_2():
    chunks = ["Hello world", " <stop> more", " text"]
    for chunk in chunks:
        yield chunk
result = extract_text_before_stopword(create_stream_2(), stop_words)
print(result)
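# Expected output: "Hello world " (the trailing space before "<stop>" is preserved)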
def create_stream_3():
    chunks = ["This is ", "a normal ", "text"]
    for chunk in chunks:
        yield chunk
result = extract_text_before_stopword(create_stream_3(), stop_words)
print(result)
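# Expected output: "This is a normal text" (no stop word, buffer flushed at the end)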
Part 3: Edge Cases & Optimizations
Critical Edge Cases
- Empty Stream
stream = iter([])
result = extract_text_before_stopword(stream, ["<stop>"])  # result == ""
- Stop Word at Beginning [Source: darkinterview.com]
stream = iter(["<stop>", "text"])
result = extract_text_before_stopword(stream, ["<stop>"])  # result == ""
- Multiple Overlapping Stop Words
stream = iter(["test<st", "op><end>more"])
stop_words = ["<stop>", "<end>"]
result = extract_text_before_stopword(stream, stop_words)  # "test" ("<stop>" is the earliest match)
- Very Small Chunks
stream = iter(["<", "s", "t", "o", "p", ">"])
result = extract_text_before_stopword(stream, ["<stop>"])  # result == ""
- Stop Word Longer Than Chunk Size [Source: darkinterview.com]
stream = iter(["a", "b", "<", "s", "t", "o", "p", ">"])
stop_words = ["<stop>"]
result = extract_text_before_stopword(stream, stop_words)  # "ab"
Use Trie Data Structure for Multiple Stop Words
If the number of stop words is very large (e.g., thousands), using a Trie (prefix tree) can significantly speed up the search:
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_word = False

class Trie:
    def __init__(self, words: List[str]):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word: str):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.is_end_of_word = True
    def find_earliest_match(self, text: str) -> Optional[int]:
        """Find the position of the earliest stop word in text."""
        for start_pos in range(len(text)):
            node = self.root
            # Walk the trie as long as characters keep matching.
            for i in range(start_pos, len(text)):
                char = text[i]
                if char not in node.children:
                    break
                node = node.children[char]
                if node.is_end_of_word:
                    return start_pos
        return None
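A quick sanity check of the matcher (usage sketch, not from the source):
trie = Trie(["<stop>", "<end>"])
print(trie.find_earliest_match("test<stop>more"))  # 4
print(trie.find_earliest_match("clean text"))      # None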
Memory Complexity Analysis
- Buffer Size: O(L) where L is the length of the longest stop word
- Stop Words Storage: O(m×k) where m = number of stop words, k = average length
- Total Space: O(L + m×k), independent of the stream size
This is critical for handling streams that are gigabytes or terabytes in size. [Source: darkinterview.com]
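For example, with a longest stop word of 6 characters and 1,000 stop words averaging 8 characters each, the working set is roughly 5 + 8,000 characters, whether the stream is 1 MB or 1 TB.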
Key Discussion Points in Interview
1. Why Use Generators?
Memory Efficiency:
- Generators process data lazily (on-demand) using yield.
- Only one chunk is in memory at a time, plus a small buffer.
- This allows processing of unlimited stream sizes.
Comparison:
def bad_approach(stream):
    # Materializes the entire stream in memory - fails on huge inputs.
    full_text = ''.join(stream)
    return full_text

def good_approach(stream):
    # Lazily handles one chunk at a time; memory stays constant.
    for chunk in stream:
        yield process(chunk)  # process() is a placeholder for per-chunk work
2. Buffer Size Calculation
Question: "Why is the buffer size max_stop_len - 1?" [Source: darkinterview.com]
Answer:
- Suppose the longest stop word is "<stop>" (6 characters).
- To detect it across chunk boundaries, we need to ensure we can see at least 6 consecutive characters.
- By keeping the last 5 characters from chunk n and prepending them to chunk n+1, we guarantee any 6-character pattern spanning the boundary will be visible.
Visual Example:
[Source: https://darkinterview.com/collections/j6h3k9n2/questions/bcc7bdca-d055-44e5-a270-0d98d2148590]
Chunk 1: "...ABC" → buffer = "ABC" (assuming max_stop_len = 4)
Chunk 2: "DEF..." → text = "ABCDEF..."
Now we can detect "ABCD", "BCDE", "CDEF" spanning the boundary
3. Python-Specific Tricks
Interviewer might ask: "How would you implement this in production?" [Source: darkinterview.com]
Key Points:
- Use collections.deque for the buffer if frequent append/pop operations are needed (though for strings it is not necessary here).
- Consider using re.search() for regex-based stop word matching.
- Use io.StringIO for testing with mock streams.
- Handle character encoding properly (UTF-8, especially for non-English text).
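For example, a mock stream built on io.StringIO might look like this (a sketch; read_chunks and the chunk size are illustrative, not from the source):
import io

def read_chunks(fileobj, chunk_size=4096):
    # Yield fixed-size chunks until the file object is exhausted.
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk

mock = io.StringIO("This is a test<stop> message")
chunks = read_chunks(mock, chunk_size=4)
print(''.join(process_stream_with_stopwords(chunks, ["<stop>", "<end>"])))  # This is a test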
4. Alternative Approaches
Regex-Based Solution:
import re
def process_with_regex(stream, stop_words):
    # Build one alternation pattern from all (escaped) stop words.
    escaped = [re.escape(word) for word in stop_words]
    pattern = '|'.join(escaped)
    regex = re.compile(pattern)
    buffer = ""
    max_stop_len = max(len(word) for word in stop_words)
    for chunk in stream:
        text = buffer + chunk
        match = regex.search(text)
        if match:
            yield text[:match.start()]
            return
        safe_len = max(0, len(text) - (max_stop_len - 1))
        if safe_len > 0:
            yield text[:safe_len]
            buffer = text[safe_len:]
        else:
            buffer = text
    if buffer:
        yield buffer
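Usage mirrors the generator version above (sketch, reusing create_stream_1 from the earlier example):
result = ''.join(process_with_regex(create_stream_1(), ["<stop>", "<end>"]))
print(result)  # This is a test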
Trade-offs: [Source: darkinterview.com]
- Pros: More concise, handles regex patterns natively
- Cons: Regex compilation overhead, slightly harder to debug
5. Related Problems & Applications
- LeetCode 76: Minimum Window Substring (similar sliding window concept)
- LeetCode 438: Find All Anagrams in a String (character window)
- Text Streaming Processing: Log file parsing, real-time text analysis
- Tokenization: Breaking text into words/sentences with delimiters