Given an infinite character stream (or a very large text stream that cannot fit into memory) and a list of stop words (sensitive words), return the substring that appears before the first occurrence of any stop word.
Constraints:
Use the yield keyword to implement streaming processing.
Example:
stop_words = ["<stop>", "<end>"]
stream_chunks = ["This is a te", "st<st", "op> message"]
# Expected output: "This is a test"
# Reason: "<stop>" is split across chunks 2 and 3
The most critical difficulty in this problem is handling stop words that are split across chunk boundaries. For example:
"...te""st<st""op>..."The word <stop> is actually split across chunks 2 and 3. If we only check within each chunk, we will miss it.
Core Idea:
Keep a buffer: after scanning each combined (buffer + chunk) text for stop words, yield only the characters that are guaranteed safe and carry the last (max_stop_word_length - 1) characters over to the buffer for the next iteration.
Why max_stop_word_length - 1? Because a stop word split across a chunk boundary can leave at most max_stop_word_length - 1 of its characters in the earlier chunk, so keeping that many characters ensures the full word becomes visible once the next chunk arrives.
from typing import Iterator, List, Optional

def process_stream_with_stopwords(
    stream: Iterator[str],
    stop_words: List[str]
) -> Iterator[str]:
    """
    Process a character stream and yield characters until a stop word is found.

    Args:
        stream: An iterator yielding string chunks
        stop_words: A list of stop words to detect

    Yields:
        Characters before the first stop word
    """
    if not stop_words:
        # If no stop words, yield the entire stream
        for chunk in stream:
            yield chunk
        return

    # Calculate the maximum stop word length
    max_stop_len = max(len(word) for word in stop_words)

    # Buffer to store characters carried over from the previous chunk
    buffer = ""

    for chunk in stream:
        # Combine buffer with current chunk
        text = buffer + chunk

        # Check for stop words in the combined text
        earliest_pos = len(text)  # Initialize to end of text
        found_stop_word = None
        for stop_word in stop_words:
            pos = text.find(stop_word)
            if pos != -1 and pos < earliest_pos:
                earliest_pos = pos
                found_stop_word = stop_word

        if found_stop_word:
            # Stop word found - yield characters before it and stop
            if earliest_pos > 0:
                yield text[:earliest_pos]
            return

        # No stop word yet: only the characters that cannot be part of a
        # stop word straddling the chunk boundary are safe to yield
        safe_to_yield_len = max(0, len(text) - (max_stop_len - 1))
        if safe_to_yield_len > 0:
            yield text[:safe_to_yield_len]
            buffer = text[safe_to_yield_len:]
        else:
            buffer = text

    # Stream exhausted without finding a stop word: flush the remaining buffer
    if buffer:
        yield buffer

def extract_text_before_stopword(stream: Iterator[str], stop_words: List[str]) -> str:
    """Convenience wrapper: collect the generator output into a single string."""
    return ''.join(process_stream_with_stopwords(stream, stop_words))
# Example 1: Stop word split across chunks
def create_stream_1():
chunks = ["This is a te", "st<st", "op> message"]
for chunk in chunks:
yield chunk
stop_words = ["<stop>", "<end>"]
result = extract_text_before_stopword(create_stream_1(), stop_words)
print(result) # Output: "This is a test"
# Example 2: Stop word entirely within one chunk
def create_stream_2():
chunks = ["Hello world", " <stop> more", " text"]
for chunk in chunks:
yield chunk
result = extract_text_before_stopword(create_stream_2(), stop_words)
print(result) # Output: "Hello world "
# Example 3: No stop word found
def create_stream_3():
chunks = ["This is ", "a normal ", "text"]
for chunk in chunks:
yield chunk
result = extract_text_before_stopword(create_stream_3(), stop_words)
print(result) # Output: "This is a normal text"
Empty Stream
stream = iter([])
result = extract_text_before_stopword(stream, ["<stop>"])
# Expected: ""
Stop Word at Beginning
stream = iter(["<stop>", "text"])
result = extract_text_before_stopword(stream, ["<stop>"])
# Expected: ""
Multiple Overlapping Stop Words
stream = iter(["test<st", "op><end>more"])
stop_words = ["<stop>", "<end>"]
# Should stop at "<stop>", not "<end>"
# Expected: "test"
Very Small Chunks
stream = iter(["<", "s", "t", "o", "p", ">"])
# Each character is a separate chunk
# Must still detect "<stop>"
Stop Word Longer Than Chunk Size
stream = iter(["a", "b", "<", "s", "t", "o", "p", ">", "c"])
stop_words = ["<stop>"]
# Expected: "ab"
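These edge cases can be checked directly against the extract_text_before_stopword helper defined above; the short test below simply asserts the expectations listed in this section.

def run_edge_case_checks():
    # Empty stream -> empty result
    assert extract_text_before_stopword(iter([]), ["<stop>"]) == ""
    # Stop word at the very beginning -> nothing precedes it
    assert extract_text_before_stopword(iter(["<stop>", "text"]), ["<stop>"]) == ""
    # Multiple overlapping stop words: "<stop>" starts before "<end>"
    assert extract_text_before_stopword(iter(["test<st", "op><end>more"]), ["<stop>", "<end>"]) == "test"
    # Very small chunks: the stop word is assembled one character at a time
    assert extract_text_before_stopword(iter(["<", "s", "t", "o", "p", ">"]), ["<stop>"]) == ""
    # Stop word longer than the chunk size
    assert extract_text_before_stopword(iter(["a", "b", "<", "s", "t", "o", "p", ">", "c"]), ["<stop>"]) == "ab"
    print("All edge case checks passed")

run_edge_case_checks()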
Use Trie Data Structure for Multiple Stop Words
If the number of stop words is very large (e.g., thousands), using a Trie (prefix tree) can significantly speed up the search:
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_word = False

class Trie:
    def __init__(self, words: List[str]):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word: str):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.is_end_of_word = True

    def find_earliest_match(self, text: str) -> Optional[int]:
        """Find the start position of the earliest stop word in text, or None."""
        for start_pos in range(len(text)):
            node = self.root
            for i in range(start_pos, len(text)):
                char = text[i]
                if char not in node.children:
                    break  # no stop word starts at start_pos with this prefix
                node = node.children[char]
                if node.is_end_of_word:
                    return start_pos
        return None
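A sketch of how the Trie slots into the streaming solution (the process_stream_with_trie name is illustrative, not part of the original write-up): the per-chunk loop over stop words is replaced by a single find_earliest_match call, while the max_stop_len - 1 buffering stays the same.

def process_stream_with_trie(stream: Iterator[str], stop_words: List[str]) -> Iterator[str]:
    # Illustrative variant of process_stream_with_stopwords built on the Trie above
    trie = Trie(stop_words)
    max_stop_len = max(len(word) for word in stop_words)
    buffer = ""
    for chunk in stream:
        text = buffer + chunk
        pos = trie.find_earliest_match(text)
        if pos is not None:
            # Stop word found: emit everything before it and stop
            if pos > 0:
                yield text[:pos]
            return
        # Keep the last max_stop_len - 1 characters for the next iteration
        safe_len = max(0, len(text) - (max_stop_len - 1))
        if safe_len > 0:
            yield text[:safe_len]
        buffer = text[safe_len:]
    if buffer:
        yield buffer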
Processing the stream chunk by chunk is critical for handling streams that are gigabytes or terabytes in size.
Memory Efficiency:
The solution holds only the current chunk plus a small buffer (at most max_stop_len - 1 characters) in memory at any time, because results are produced incrementally with yield.
Comparison:
# BAD: Loads entire stream into memory
def bad_approach(stream):
full_text = ''.join(stream) # Out of memory for large streams!
# ... process full_text
# GOOD: Processes incrementally
def good_approach(stream):
for chunk in stream:
yield process(chunk) # Only one chunk in memory
Question: "Why is the buffer size max_stop_len - 1?"
Answer:
"<stop>" (6 characters).Visual Example:
Chunk 1: "...ABC" → buffer = "ABC" (assuming max_stop_len = 4)
Chunk 2: "DEF..." → text = "ABCDEF..."
Now we can detect "ABCD", "BCDE", "CDEF" spanning the boundary
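A minimal check of this boundary scenario, using a hypothetical 4-character stop word "BCDE" (so max_stop_len = 4, matching the diagram) and the extract_text_before_stopword helper defined above:

chunks = ["xxABC", "DEFyy"]
print(extract_text_before_stopword(iter(chunks), ["BCDE"]))  # prints "xxA"
# After chunk 1 the buffer holds the last 3 characters ("ABC"),
# so "BCDE" is still detected when "DEF..." arrives.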
Interviewer might ask: "How would you implement this in production?"
Key Points:
Use collections.deque for the buffer if frequent append/pop operations are needed (though for a string buffer it is not necessary here).
Use re.search() for regex-based stop word matching.
Use io.StringIO for testing with mock streams (see the sketch after the regex solution below).
Regex-Based Solution:
import re
def process_with_regex(stream, stop_words):
# Escape special regex characters
escaped = [re.escape(word) for word in stop_words]
pattern = '|'.join(escaped) # "word1|word2|word3"
regex = re.compile(pattern)
buffer = ""
max_stop_len = max(len(word) for word in stop_words)
for chunk in stream:
text = buffer + chunk
match = regex.search(text)
if match:
yield text[:match.start()]
return
safe_len = max(0, len(text) - (max_stop_len - 1))
if safe_len > 0:
yield text[:safe_len]
buffer = text[safe_len:]
else:
buffer = text
if buffer:
yield buffer
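To test with a mock stream, as mentioned in the key points above, a file-like object such as io.StringIO can be read in fixed-size chunks and fed to either generator. The chunked_reader helper and the chunk size below are illustrative assumptions, not part of the original solution.

import io

def chunked_reader(file_obj, chunk_size=4):
    # Read a file-like object in small pieces to simulate a stream
    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:
            break
        yield chunk

mock = io.StringIO("This is a test<stop> message")
result = ''.join(process_with_regex(chunked_reader(mock), ["<stop>", "<end>"]))
print(result)  # "This is a test"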
Trade-offs:
The regex version is more concise and searches for all stop words in one pass, at the cost of compiling the pattern and escaping special characters; the explicit find loop is simpler to reason about when there are only a few stop words.
Time Complexity:
With K stop words, the straightforward solution runs one substring search per stop word over each buffer-plus-chunk text, so the per-chunk cost grows linearly with K; memory stays bounded by one chunk plus at most max_stop_len - 1 buffered characters.
Aho-Corasick Algorithm
For production systems with many stop words, the Aho-Corasick algorithm provides O(n + m + z) complexity, where n is the length of the text, m is the total length of all stop words, and z is the number of matches. This is optimal for multi-pattern matching.
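A sketch of this approach, assuming the third-party pyahocorasick package (import ahocorasick), which is not part of the standard library; the function name is illustrative. The automaton replaces the per-chunk search while the max_stop_len - 1 buffering scheme stays unchanged.

import ahocorasick

def process_stream_aho_corasick(stream, stop_words):
    # Build the automaton once; each word is stored as its own value
    automaton = ahocorasick.Automaton()
    for word in stop_words:
        automaton.add_word(word, word)
    automaton.make_automaton()

    max_stop_len = max(len(word) for word in stop_words)
    buffer = ""
    for chunk in stream:
        text = buffer + chunk
        # iter() yields (end_index, value) pairs; convert them to start positions
        starts = [end - len(word) + 1 for end, word in automaton.iter(text)]
        if starts:
            earliest = min(starts)
            if earliest > 0:
                yield text[:earliest]
            return
        safe_len = max(0, len(text) - (max_stop_len - 1))
        if safe_len > 0:
            yield text[:safe_len]
        buffer = text[safe_len:]
    if buffer:
        yield buffer

Because the automaton matches all stop words in a single left-to-right pass, the per-chunk cost no longer grows with the number of stop words.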