Web Crawler
Basic Problem
You are given:
- A starting URL startUrl
- An interface HtmlParser that can fetch all URLs from a given web page
Implement a web crawler that returns all URLs reachable from startUrl that share the same hostname as startUrl.
The order of URLs in the result does not matter.
Below is the interface for HtmlParser:
interface HtmlParser {
    public List<String> getUrls(String url);
}
Your function will be called like List<String> crawl(String startUrl, HtmlParser htmlParser).
Requirements
Your crawler should:
- Start from startUrl.
- Use HtmlParser.getUrls(url) to obtain all links from a page.
- Avoid duplicates — never crawl the same URL twice.
- Restrict scope — only follow URLs whose hostname matches that of startUrl.
- Assume all URLs use the http protocol and do not include a port.
Note for Candidate
- URL Fragments: Consider how to handle URL fragments (e.g., http://example.com/page#section1). Should these be treated as the same URL or different? Clarify with the interviewer if needed.
- URL Normalization: Assume URL normalization is NOT required for the basic problem.
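Although the prompt shows a Java-style interface, the solution later in this document is written in JavaScript, so here is a minimal single-threaded sketch in the same language for reference. It assumes htmlParser.getUrls(url) returns an array of URL strings (or a promise of one) and performs a plain BFS with a visited set; fragments and normalization are ignored, as the notes above allow.
// Minimal single-threaded sketch: BFS from startUrl, restricted to the start hostname.
// Assumes htmlParser.getUrls(url) yields an array of URL strings (possibly via a promise).
async function crawlBasic(startUrl, htmlParser) {
    const startHostname = new URL(startUrl).hostname;
    const visited = new Set([startUrl]);
    const queue = [startUrl];
    while (queue.length > 0) {
        const url = queue.shift();
        const nextUrls = await htmlParser.getUrls(url);
        for (const nextUrl of nextUrls) {
            // Follow only same-hostname links that have not been seen yet.
            if (new URL(nextUrl).hostname === startHostname && !visited.has(nextUrl)) {
                visited.add(nextUrl);
                queue.push(nextUrl);
            }
        }
    }
    return [...visited];
}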
Follow-up: Multithreaded/Concurrent Implementation (Important!!)
After implementing the basic single-threaded version, implement a multithreaded or concurrent version of the web crawler to improve performance.
Requirements
- Parallelize the crawling — multiple URLs should be fetched concurrently
- Thread safety — ensure no race conditions when accessing shared data structures (visited set, result list)
- Avoid duplicates — even with multiple threads, each URL should only be crawled once
- Maintain hostname restriction — only crawl URLs with matching hostnames
Note for Candidate
Use a thread pool to control the total number of threads.
- Do NOT create unbounded threads (e.g., one thread per URL)
- A thread pool limits concurrency and prevents resource exhaustion
- Common approach: Use a fixed-size thread pool (e.g., 10-20 threads) with a task queue
Solution: JavaScript Multithreaded Implementation
Below is a complete JavaScript implementation of the concurrent web crawler, using an async task pool in place of OS threads (JavaScript runs a single-threaded event loop), with concurrency control and URL normalization.
ThreadPool Implementation
class ThreadPool {
    constructor(maxConcurrency) {
        this.maxConcurrency = maxConcurrency;
        this.queue = [];
        this.runningTasks = 0;
        this.waitForFinishPromise = null;
        this.waitForFinishResolve = null;
    }

    // Enqueue an async task; returns a promise that settles with the task's result.
    execute(task) {
        return new Promise((resolve, reject) => {
            this.queue.push({ task, resolve, reject });
            this._processQueue();
        });
    }

    _processQueue() {
        if (this.runningTasks >= this.maxConcurrency) {
            return;
        }
        if (this.queue.length === 0) {
            // Nothing queued and nothing running: the pool is drained,
            // so resolve any pending waitForFinish() promise.
            if (this.runningTasks === 0) {
                this.waitForFinishResolve && this.waitForFinishResolve();
                this.waitForFinishPromise = null;
                this.waitForFinishResolve = null;
            }
            return;
        }
        const { task, resolve, reject } = this.queue.shift();
        this.runningTasks++;
        Promise.resolve()
            .then(task)
            .then(resolve, reject)
            .finally(() => {
                this.runningTasks--;
                this._processQueue();
            });
    }

    // Resolves once the queue is empty and no tasks are running.
    waitForFinish() {
        if (this.runningTasks === 0 && this.queue.length === 0) {
            return Promise.resolve();
        }
        if (!this.waitForFinishPromise) {
            this.waitForFinishPromise = new Promise(resolve => {
                this.waitForFinishResolve = resolve;
            });
        }
        return this.waitForFinishPromise;
    }
}
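A quick standalone usage sketch of the pool (the task bodies here are placeholders): at most three tasks run at a time, and waitForFinish resolves once the queue drains.
// Illustrative only: ten simulated I/O tasks, at most 3 in flight at once.
const pool = new ThreadPool(3);
for (let i = 0; i < 10; i++) {
    pool.execute(async () => {
        await new Promise(resolve => setTimeout(resolve, 100)); // simulated I/O
        console.log("task", i, "done");
    });
}
pool.waitForFinish().then(() => console.log("all tasks finished"));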
URL Normalization Helper
function getProcessedUrlObject(url) {
    const newUrl = new URL(url);
    newUrl.hash = ""; // drop the fragment so page#a and page#b map to the same URL
    return newUrl;
}
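For example, http://example.com/page#section1 and http://example.com/page#section2 both normalize to http://example.com/page. Note that the WHATWG URL parser also adds a trailing slash to a bare hostname, so a start URL of http://news.yahoo.com is recorded as http://news.yahoo.com/.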
Main Crawler Function
async function crawl(startUrl, htmlParser, maxConcurrency = 10) {
    const processedStartUrlObject = getProcessedUrlObject(startUrl);
    const startHostName = processedStartUrlObject.hostname;
    const result = [processedStartUrlObject.toString()];
    const visited = new Set([processedStartUrlObject.toString()]);
    const pool = new ThreadPool(maxConcurrency);

    const crawlUrl = async (url) => {
        try {
            const nextUrls = await htmlParser.getUrls(url);
            for (const nextUrl of nextUrls) {
                const nextUrlProcessedObject = getProcessedUrlObject(nextUrl);
                const nextUrlProcessedString = nextUrlProcessedObject.toString();
                if (nextUrlProcessedObject.hostname === startHostName
                    && !visited.has(nextUrlProcessedString)) {
                    // Mark as visited before scheduling so the same URL is never queued twice.
                    visited.add(nextUrlProcessedString);
                    result.push(nextUrlProcessedString);
                    pool.execute(() => crawlUrl(nextUrl)).catch(err => {
                        console.error("Failed to crawl", nextUrl, err);
                    });
                }
            }
        } catch (error) {
            console.error("Failed to fetch", url, error);
        }
    };

    pool.execute(() => crawlUrl(startUrl));
    await pool.waitForFinish();
    return result;
}
Usage Example
class HtmlParser {
    constructor(urls) {
        this.urls = urls;
    }

    async getUrls(url) {
        await new Promise(resolve => setTimeout(resolve, Math.random() * 15));
        return this.urls[url] || [];
    }
}

const urls = {
    "http://news.yahoo.com": [
        "http://news.yahoo.com/news/topics/",
        "http://news.yahoo.com/news"
    ],
    "http://news.yahoo.com/news/topics/": [
        "http://news.yahoo.com/news",
        "http://news.yahoo.com/news/sports"
    ],
    "http://news.yahoo.com/news": [
        "http://news.google.com"
    ],
    "http://news.yahoo.com/news/sports": []
};

const parser = new HtmlParser(urls);
const result = await crawl("http://news.yahoo.com", parser, 2);
console.log("Crawled URLs:", result);
Key Design Decisions
- ThreadPool: Implements a custom async task queue with configurable max concurrency
- URL Normalization: Removes hash fragments to avoid treating the same page as different URLs
- Thread Safety: Uses a Set for visited tracking; JavaScript's single-threaded event loop ensures no race conditions
- Error Handling: Catches and logs errors without crashing the entire crawl
- Concurrency Control: Limits parallel requests to avoid overwhelming servers
Additional Discussion Topics
The following questions are for verbal discussion. The interviewer does not expect you to code solutions, but be prepared to explain your approach and reasoning.
1. Concurrency Models: Threads vs Processes
Question: What are the differences between threads and processes? When would you use one over the other for web crawling?
Discussion Points:
- Threads: Lightweight, share memory, lower overhead, suitable for I/O-bound tasks like web crawling
- Processes: Heavyweight, isolated memory, better for CPU-bound tasks, avoid GIL issues in Python
- Trade-offs: Memory usage, communication overhead, fault isolation
- For web crawling: Threads are typically preferred due to I/O-bound nature and shared state needs (see the Node.js sketch below)
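For a JavaScript-specific angle on this trade-off: the crawl itself is I/O-bound and fits the single event loop, but CPU-heavy work such as HTML parsing can be pushed onto Node.js worker_threads so it does not block fetching. The sketch below is a hypothetical illustration; the inline worker body just counts anchor tags as a stand-in for real parsing.
// Offload CPU-bound "parsing" to a worker so the event loop stays free for I/O.
const { Worker } = require("worker_threads");

function parseInWorker(html) {
    const workerCode = `
        const { parentPort } = require("worker_threads");
        parentPort.on("message", html => {
            const count = (html.match(/<a /g) || []).length; // CPU-bound stand-in
            parentPort.postMessage(count);
        });
    `;
    return new Promise((resolve, reject) => {
        const worker = new Worker(workerCode, { eval: true });
        worker.once("message", result => {
            resolve(result);
            worker.terminate();
        });
        worker.once("error", reject);
        worker.postMessage(html);
    });
}

parseInWorker('<a href="/a">a</a> <a href="/b">b</a>').then(count =>
    console.log("links:", count)
);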
2. Distributed Crawling System
Question: Our list of seed URLs is now in the millions, and a single machine can't handle the workload. How would you design a distributed crawling system across multiple machines?
Discussion Points:
- URL Distribution: How to partition and assign URLs to different machines
- Consistent hashing based on hostname/domain (sketched after this list)
- Central coordinator/queue system
- Each machine owns specific domains
- Coordination: How machines communicate and avoid duplicates
- Shared distributed cache (Redis, Memcached)
- Central URL frontier/queue
- Partition by URL hash
- Load Balancing: Ensuring even distribution of work
- Fault Tolerance: Handling machine failures and retries
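As a rough illustration of hash-based URL distribution, the sketch below assigns each hostname to a machine by hashing it. The machine names and the simple hash-modulo scheme are illustrative assumptions; a production system would typically use a consistent-hash ring so that adding or removing a machine remaps only a small fraction of hostnames.
// Hypothetical sketch: route each URL to the crawler machine that "owns" its hostname.
const crypto = require("crypto");

function ownerMachine(url, machines) {
    const hostname = new URL(url).hostname;
    // Hash the hostname so every URL of a domain lands on the same machine.
    const digest = crypto.createHash("sha256").update(hostname).digest();
    const index = digest.readUInt32BE(0) % machines.length;
    return machines[index];
}

const machines = ["crawler-0", "crawler-1", "crawler-2"];
// Both calls map to the same machine, whichever one that is.
console.log(ownerMachine("http://news.yahoo.com/news", machines));
console.log(ownerMachine("http://news.yahoo.com/news/sports", machines));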
3. Politeness Policy and Rate Limiting
Question: Our current crawler is too aggressive and might overload the servers we are crawling. How would you implement a politeness policy to ensure we don't send too many requests in a short period?
Discussion Points:
- robots.txt compliance: Respect crawl delays and disallowed paths
- Rate limiting per domain: Limit requests per second to each hostname
- Per-domain request queues
- Delay between requests to same domain (e.g., 1-2 seconds); see the sketch after this list
- Adaptive throttling: Slow down if server responds slowly or returns errors
- Distributed rate limiting: Coordinate across multiple crawler machines
- Time-based scheduling: Crawl during off-peak hours if possible
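As a sketch of per-domain rate limiting, the scheduler below enforces a minimum delay between requests to the same hostname. The class name PolitenessScheduler and the delay values are illustrative assumptions, not part of the original solution.
// Minimal per-domain politeness sketch: callers await acquire(hostname) before fetching.
class PolitenessScheduler {
    constructor(minDelayMs = 1000) {
        this.minDelayMs = minDelayMs;
        this.nextAllowedTime = new Map(); // hostname -> earliest timestamp for the next request
    }

    // Resolves when a request to this hostname is allowed, honoring the per-domain delay.
    async acquire(hostname) {
        const now = Date.now();
        const earliest = this.nextAllowedTime.get(hostname) || now;
        const startAt = Math.max(now, earliest);
        this.nextAllowedTime.set(hostname, startAt + this.minDelayMs);
        if (startAt > now) {
            await new Promise(resolve => setTimeout(resolve, startAt - now));
        }
    }
}

const scheduler = new PolitenessScheduler(1500); // 1.5 s between hits to the same domain
// Inside the crawl worker (sketch):
//   await scheduler.acquire(new URL(url).hostname);
//   const nextUrls = await htmlParser.getUrls(url);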
4. Duplicate Content Detection
Question: We are finding that many different URLs point to the same or very similar content (mirrors, URL parameters, etc.). How would you detect and handle this to save on storage and processing?
Discussion Points:
- Content Fingerprinting:
- Hash the page content (MD5, SHA-256); sketched after this list
- Store hash to detect exact duplicates
- Simhash for near-duplicate detection
- URL Normalization:
- Canonicalize URLs (remove tracking parameters, sort query params)
- Follow canonical tags in HTML
- Content Similarity:
- Shingling/n-grams for fuzzy matching
- Jaccard similarity
- Trade-offs:
- Storage cost vs computation cost
- False positives vs false negatives
- When to fetch vs when to skip
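As a sketch of exact-duplicate detection via content fingerprinting, the helper below hashes a lightly normalized page body and reports bodies whose hash has already been seen. This only catches exact (post-normalization) duplicates; near-duplicates would need simhash or shingling as noted above. The function name and the whitespace normalization are illustrative assumptions.
// Fingerprint page bodies with SHA-256 and skip exact duplicates.
const crypto = require("crypto");

const seenFingerprints = new Set();

function isDuplicateContent(html) {
    // Light normalization so trivial whitespace differences do not change the hash.
    const normalized = html.replace(/\s+/g, " ").trim();
    const fingerprint = crypto.createHash("sha256").update(normalized).digest("hex");
    if (seenFingerprints.has(fingerprint)) {
        return true;
    }
    seenFingerprints.add(fingerprint);
    return false;
}

console.log(isDuplicateContent("<html>  <body>hello</body>\n</html>")); // false: first time seen
console.log(isDuplicateContent("<html> <body>hello</body> </html>"));   // true: same after normalization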