You are given:
- a URL startUrl
- an HtmlParser interface that can fetch all URLs from a given web page

Implement a web crawler that returns all URLs reachable from startUrl that share the same hostname as startUrl.
The order of URLs in the result does not matter.
Below is the interface for HtmlParser:
interface HtmlParser {
  // Returns all URLs from a given page URL.
  public List<String> getUrls(String url);
}
Your function will be called like List<String> crawl(String startUrl, HtmlParser htmlParser).
Your crawler should:
- Start from the page startUrl.
- Use HtmlParser.getUrls(url) to obtain all links from a page.
- Not crawl the same URL twice.
- Return only URLs under the same hostname as startUrl.
- Assume all URLs use the http protocol and do not include a port.
- Handle URLs that may contain hash fragments (e.g., http://example.com/page#section1). Should these be treated as the same URL or different? Clarify with the interviewer if needed.

After implementing the basic single-threaded version, implement a multithreaded or concurrent version of the web crawler to improve performance.
Use a thread pool to control the total number of threads.
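As a reference point, here is a minimal single-threaded sketch. The function name crawlSingleThreaded and its inline normalization helper are illustrative, not part of the original solution; it assumes htmlParser.getUrls returns a promise of URL strings, as the mock parser further below does.

async function crawlSingleThreaded(startUrl, htmlParser) {
  // Normalize by stripping the hash fragment, so URLs that differ only by
  // fragment are treated as the same page (an assumption; confirm with the interviewer)
  const normalize = (url) => {
    const u = new URL(url);
    u.hash = "";
    return u;
  };

  const start = normalize(startUrl);
  const hostname = start.hostname;
  const visited = new Set([start.toString()]);
  const queue = [start.toString()];

  // Iterative BFS: pull one URL at a time and enqueue unseen same-host links
  while (queue.length > 0) {
    const url = queue.shift();
    const links = await htmlParser.getUrls(url);
    for (const link of links) {
      const next = normalize(link);
      const key = next.toString();
      if (next.hostname === hostname && !visited.has(key)) {
        visited.add(key);
        queue.push(key);
      }
    }
  }
  return [...visited];
}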
Below is a complete JavaScript implementation of the multithreaded web crawler with proper concurrency control and URL normalization.
class ThreadPool {
  constructor(maxConcurrency) {
    this.maxConcurrency = maxConcurrency;
    this.queue = [];
    this.runningTasks = 0;
    this.waitForFinishPromise = null;
    this.waitForFinishResolve = null;
  }

  // Add a task to the queue and start processing
  execute(task) {
    return new Promise((resolve, reject) => {
      this.queue.push({ task, resolve, reject });
      this._processQueue();
    });
  }

  _processQueue() {
    // Don't start new tasks if at max concurrency
    if (this.runningTasks >= this.maxConcurrency) {
      return;
    }
    // If queue is empty, check if all work is done
    if (this.queue.length === 0) {
      if (this.runningTasks === 0) {
        this.waitForFinishResolve && this.waitForFinishResolve();
        this.waitForFinishResolve = null;
        this.waitForFinishPromise = null;
      }
      return;
    }
    const { task, resolve, reject } = this.queue.shift();
    this.runningTasks++;
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        this.runningTasks--;
        this._processQueue();
      });
  }

  // Returns a promise that resolves once the queue is drained and all running tasks finish
  waitForFinish() {
    if (this.runningTasks === 0 && this.queue.length === 0) {
      return Promise.resolve();
    }
    if (!this.waitForFinishPromise) {
      this.waitForFinishPromise = new Promise((resolve) => {
        this.waitForFinishResolve = resolve;
      });
    }
    return this.waitForFinishPromise;
  }
}
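To see the pool in isolation, here is a small illustrative snippet (not part of the original solution) that runs five delayed tasks with at most two in flight at a time:

// Illustrative usage: 5 tasks, at most 2 running concurrently
const demoPool = new ThreadPool(2);
for (let i = 0; i < 5; i++) {
  demoPool
    .execute(() => new Promise((resolve) => setTimeout(() => resolve(i), 50)))
    .then((n) => console.log("task", n, "done"));
}
await demoPool.waitForFinish();
console.log("all tasks finished");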
function getProcessedUrlObject(url) {
  const newUrl = new URL(url);
  // Remove hash fragments to treat http://example.com/page#section1
  // and http://example.com/page#section2 as the same URL
  newUrl.hash = "";
  return newUrl;
}
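For example, two URLs that differ only in their fragment normalize to the same string:

// Both log "http://example.com/page"
console.log(getProcessedUrlObject("http://example.com/page#section1").toString());
console.log(getProcessedUrlObject("http://example.com/page#section2").toString());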
async function crawl(startUrl, htmlParser, maxConcurrency = 10) {
  const processedStartUrlObject = getProcessedUrlObject(startUrl);
  const startHostName = processedStartUrlObject.hostname;
  const result = [processedStartUrlObject.toString()];
  const visited = new Set([processedStartUrlObject.toString()]);
  const pool = new ThreadPool(maxConcurrency);

  const crawlUrl = async (url) => {
    try {
      const nextUrls = await htmlParser.getUrls(url);
      for (const nextUrl of nextUrls) {
        const nextUrlProcessedObject = getProcessedUrlObject(nextUrl);
        const nextUrlProcessedString = nextUrlProcessedObject.toString();
        // Only crawl if same hostname and not yet visited
        if (nextUrlProcessedObject.hostname === startHostName
            && !visited.has(nextUrlProcessedString)) {
          visited.add(nextUrlProcessedString);
          result.push(nextUrlProcessedString);
          // Queue the URL for crawling (fire and forget)
          pool.execute(() => crawlUrl(nextUrl)).catch((err) => {
            console.error("Error crawling", nextUrl, err);
          });
        }
      }
    } catch (error) {
      console.error("Error fetching URLs from", url, error);
    }
  };

  pool.execute(() => crawlUrl(startUrl));
  await pool.waitForFinish();
  return result;
}
// Example with HtmlParser mock
class HtmlParser {
  constructor(urls) {
    this.urls = urls;
  }

  async getUrls(url) {
    // Simulate network delay
    await new Promise(resolve => setTimeout(resolve, Math.random() * 15));
    return this.urls[url] || [];
  }
}
const urls = {
  "http://news.yahoo.com": [
    "http://news.yahoo.com/news/topics/",
    "http://news.yahoo.com/news"
  ],
  "http://news.yahoo.com/news/topics/": [
    "http://news.yahoo.com/news",
    "http://news.yahoo.com/news/sports"
  ],
  "http://news.yahoo.com/news": [
    "http://news.google.com"
  ],
  "http://news.yahoo.com/news/sports": []
};
const parser = new HtmlParser(urls);
const result = await crawl("http://news.yahoo.com", parser, 2);
console.log("Crawled URLs:", result);
// Expected output: all news.yahoo.com URLs (http://news.google.com is excluded)
A Set is used for visited tracking, and JavaScript's single-threaded event loop ensures there are no race conditions when the shared visited set and result array are updated.

The following questions are for verbal discussion. The interviewer does not expect you to code solutions, but be prepared to explain your approach and reasoning.
Question: What are the differences between threads and processes? When would you use one over the other for web crawling?
Discussion Points:
Question: Our list of seed URLs is now in the millions, and a single machine can't handle the workload. How would you design a distributed crawling system across multiple machines?
Discussion Points:
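One way to make the partitioning idea concrete (an illustrative sketch, not the only possible design): hash the hostname to pick the worker responsible for a URL, so every page of a site lands on the same machine and per-host state such as politeness timers stays local.

// Illustrative sketch (Node.js): route a URL to one of N crawler workers by hostname hash
const crypto = require("crypto");

function workerForUrl(url, numWorkers) {
  const hostname = new URL(url).hostname;
  const digest = crypto.createHash("md5").update(hostname).digest();
  // Interpret the first 4 bytes of the digest as an unsigned integer
  return digest.readUInt32BE(0) % numWorkers;
}

console.log(workerForUrl("http://news.yahoo.com/news", 8)); // same worker for every news.yahoo.com page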
Question: Our current crawler is too aggressive and might overload the servers we are crawling. How would you implement a politeness policy to ensure we don't send too many requests in a short period?
Discussion Points:
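As a concrete illustration of one discussion point (a hypothetical helper, not part of the original solution), a simple per-host delay can be enforced like this:

// Illustrative per-host politeness sketch: keep at least delayMs between requests to the same host
const lastRequestAt = new Map(); // hostname -> timestamp of the last scheduled request

async function politeFetch(url, fetchFn, delayMs = 1000) {
  const hostname = new URL(url).hostname;
  // Reserve the next slot for this host before waiting, so concurrent callers
  // for the same host queue up behind each other (safe because JS is single-threaded)
  const scheduledAt = Math.max(Date.now(), (lastRequestAt.get(hostname) || 0) + delayMs);
  lastRequestAt.set(hostname, scheduledAt);
  const wait = scheduledAt - Date.now();
  if (wait > 0) {
    await new Promise((resolve) => setTimeout(resolve, wait));
  }
  return fetchFn(url);
}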
Question: We are finding that many different URLs point to the same or very similar content (mirrors, URL parameters, etc.). How would you detect and handle this to save on storage and processing?
Discussion Points:
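To make one possible answer concrete (an illustrative sketch): exact duplicates can be caught by fingerprinting the page body, while near-duplicates usually need something like SimHash or MinHash over shingles, which is worth mentioning verbally.

// Illustrative sketch (Node.js): skip pages whose content fingerprint was already seen
const crypto = require("crypto");
const seenFingerprints = new Set();

function isDuplicateContent(html) {
  const fingerprint = crypto.createHash("sha256").update(html).digest("hex");
  if (seenFingerprints.has(fingerprint)) {
    return true;
  }
  seenFingerprints.add(fingerprint);
  return false;
}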