Web Crawler
Basic Problem
You are given:
- A starting URL startUrl
- An interface HtmlParser that can fetch all URLs from a given web page
Implement a web crawler that returns all URLs reachable from startUrl that share the same hostname as startUrl.
The order of URLs in the result does not matter.
Below is the interface for HtmlParser:
interface HtmlParser {
    public List<String> getUrls(String url);
}
Your function will be called like List<String> crawl(String startUrl, HtmlParser htmlParser).
Requirements
Your crawler should:
- Start from startUrl.
- Use HtmlParser.getUrls(url) to obtain all links from a page.
- Avoid duplicates — never crawl the same URL twice.
- Restrict scope — only follow URLs whose hostname matches that of startUrl.
- Assume all URLs use the http protocol and do not include a port.
Note for Candidate
- URL Fragments: Consider how to handle URL fragments (e.g., http://example.com/page#section1). Should these be treated as the same URL or different? Clarify with the interviewer if needed.
- URL Normalization: Assume URL normalization is NOT required for the basic problem.
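Although the prompt shows a Java-style interface, the solution later in this document is written in JavaScript, so here is a minimal single-threaded sketch in the same language for reference. It assumes htmlParser.getUrls(url) returns an array of URL strings (or a promise of one) and performs a plain BFS with a visited set; fragments and normalization are ignored, as the notes above allow.
// Minimal single-threaded sketch: BFS from startUrl, restricted to the start hostname.
// Assumes htmlParser.getUrls(url) yields an array of URL strings (possibly via a promise).
async function crawlBasic(startUrl, htmlParser) {
    const startHostname = new URL(startUrl).hostname;
    const visited = new Set([startUrl]);
    const queue = [startUrl];
    while (queue.length > 0) {
        const url = queue.shift();
        const nextUrls = await htmlParser.getUrls(url);
        for (const nextUrl of nextUrls) {
            // Follow only same-hostname links that have not been seen yet.
            if (new URL(nextUrl).hostname === startHostname && !visited.has(nextUrl)) {
                visited.add(nextUrl);
                queue.push(nextUrl);
            }
        }
    }
    return [...visited];
}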
Follow-up: Multithreaded/Concurrent Implementation (Important!!)
After implementing the basic single-threaded version, implement a multithreaded or concurrent version of the web crawler to improve performance.
Requirements
- Parallelize the crawling — multiple URLs should be fetched concurrently
- Thread safety — ensure no race conditions when accessing shared data structures (visited set, result list)
- Avoid duplicates — even with multiple threads, each URL should only be crawled once
- Maintain hostname restriction — only crawl URLs with matching hostnames
Note for Candidate
Use a thread pool to control the total number of threads.
- Do NOT create unbounded threads (e.g., one thread per URL)
- A thread pool limits concurrency and prevents resource exhaustion
- Common approach: Use a fixed-size thread pool (e.g., 10-20 threads) with a task queue
Solution: JavaScript Multithreaded Implementation
Below is a complete JavaScript implementation of the concurrent web crawler, using an async task pool in place of OS threads (JavaScript runs a single-threaded event loop), with concurrency control and URL normalization.
ThreadPool Implementation
class ThreadPool {
    constructor(maxConcurrency) {
        this.maxConcurrency = maxConcurrency;
        this.queue = [];
        this.runningTasks = 0;
        this.waitForFinishPromise = null;
        this.waitForFinishResolve = null;
    }

    // Enqueue an async task; returns a promise that settles with the task's result.
    execute(task) {
        return new Promise((resolve, reject) => {
            this.queue.push({ task, resolve, reject });
            this._processQueue();
        });
    }

    _processQueue() {
        if (this.runningTasks >= this.maxConcurrency) {
            return;
        }
        if (this.queue.length === 0) {
            // Nothing queued and nothing running: the pool is drained,
            // so resolve any pending waitForFinish() promise.
            if (this.runningTasks === 0) {
                this.waitForFinishResolve && this.waitForFinishResolve();
                this.waitForFinishPromise = null;
                this.waitForFinishResolve = null;
            }
            return;
        }
        const { task, resolve, reject } = this.queue.shift();
        this.runningTasks++;
        Promise.resolve()
            .then(task)
            .then(resolve, reject)
            .finally(() => {
                this.runningTasks--;
                this._processQueue();
            });
    }

    // Resolves once the queue is empty and no tasks are running.
    waitForFinish() {
        if (this.runningTasks === 0 && this.queue.length === 0) {
            return Promise.resolve();
        }
        if (!this.waitForFinishPromise) {
            this.waitForFinishPromise = new Promise(resolve => {
                this.waitForFinishResolve = resolve;
            });
        }
        return this.waitForFinishPromise;
    }
}
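A quick standalone usage sketch of the pool (the task bodies here are placeholders): at most three tasks run at a time, and waitForFinish resolves once the queue drains.
// Illustrative only: ten simulated I/O tasks, at most 3 in flight at once.
const pool = new ThreadPool(3);
for (let i = 0; i < 10; i++) {
    pool.execute(async () => {
        await new Promise(resolve => setTimeout(resolve, 100)); // simulated I/O
        console.log("task", i, "done");
    });
}
pool.waitForFinish().then(() => console.log("all tasks finished"));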
URL Normalization Helper
function getProcessedUrlObject(url) {
    const newUrl = new URL(url);
    newUrl.hash = ""; // drop the fragment so page#a and page#b map to the same URL
    return newUrl;
}
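For example, http://example.com/page#section1 and http://example.com/page#section2 both normalize to http://example.com/page. Note that the WHATWG URL parser also adds a trailing slash to a bare hostname, so a start URL of http://news.yahoo.com is recorded as http://news.yahoo.com/.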
Main Crawler Function
async function crawl(startUrl, htmlParser, maxConcurrency = 10) {
    const processedStartUrlObject = getProcessedUrlObject(startUrl);
    const startHostName = processedStartUrlObject.hostname;
    const result = [processedStartUrlObject.toString()];
    const visited = new Set([processedStartUrlObject.toString()]);
    const pool = new ThreadPool(maxConcurrency);

    const crawlUrl = async (url) => {
        try {
            const nextUrls = await htmlParser.getUrls(url);
            for (const nextUrl of nextUrls) {
                const nextUrlProcessedObject = getProcessedUrlObject(nextUrl);
                const nextUrlProcessedString = nextUrlProcessedObject.toString();
                if (nextUrlProcessedObject.hostname === startHostName
                    && !visited.has(nextUrlProcessedString)) {
                    // Mark as visited before scheduling so the same URL is never queued twice.
                    visited.add(nextUrlProcessedString);
                    result.push(nextUrlProcessedString);
                    pool.execute(() => crawlUrl(nextUrl)).catch(err => {
                        console.error("Failed to crawl", nextUrl, err);
                    });
                }
            }
        } catch (error) {
            console.error("Failed to fetch", url, error);
        }
    };

    pool.execute(() => crawlUrl(startUrl));
    await pool.waitForFinish();
    return result;
}
Usage Example
class HtmlParser {
    constructor(urls) {
        this.urls = urls;
    }

    async getUrls(url) {
        await new Promise(resolve => setTimeout(resolve, Math.random() * 15));
        return this.urls[url] || [];
    }
}

const urls = {
    "http://news.yahoo.com": [
        "http://news.yahoo.com/news/topics/",
        "http://news.yahoo.com/news"
    ],
    "http://news.yahoo.com/news/topics/": [
        "http://news.yahoo.com/news",
        "http://news.yahoo.com/news/sports"
    ],
    "http://news.yahoo.com/news": [
        "http://news.google.com"
    ],
    "http://news.yahoo.com/news/sports": []
};

const parser = new HtmlParser(urls);
const result = await crawl("http://news.yahoo.com", parser, 2);
console.log("Crawled URLs:", result);
Key Design Decisions
- ThreadPool: Implements a custom async task queue with configurable max concurrency
- URL Normalization: Removes hash fragments to avoid treating the same page as different URLs
- Thread Safety: Uses a Set for visited tracking; JavaScript's single-threaded event loop ensures no race conditions
- Error Handling: Catches and logs errors without crashing the entire crawl
- Concurrency Control: Limits parallel requests to avoid overwhelming servers
Additional Discussion Topics
The following questions are for verbal discussion. The interviewer does not expect you to code solutions, but be prepared to explain your approach and reasoning.
1. Concurrency Models: Threads vs Processes
Question: What are the differences between threads and processes? When would you use one over the other for web crawling?
Discussion Points:
- Threads: Lightweight, share memory, lower overhead, suitable for I/O-bound tasks like web crawling
- Processes: Heavyweight, isolated memory, better for CPU-bound tasks, avoid GIL issues in Python
- Trade-offs: Memory usage, communication overhead, fault isolation
- For web crawling: Threads are typically preferred due to I/O-bound nature and shared state needs (see the Node.js sketch below)
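For a JavaScript-specific angle on this trade-off: the crawl itself is I/O-bound and fits the single event loop, but CPU-heavy work such as HTML parsing can be pushed onto Node.js worker_threads so it does not block fetching. The sketch below is a hypothetical illustration; the inline worker body just counts anchor tags as a stand-in for real parsing.
// Offload CPU-bound "parsing" to a worker so the event loop stays free for I/O.
const { Worker } = require("worker_threads");

function parseInWorker(html) {
    const workerCode = `
        const { parentPort } = require("worker_threads");
        parentPort.on("message", html => {
            const count = (html.match(/<a /g) || []).length; // CPU-bound stand-in
            parentPort.postMessage(count);
        });
    `;
    return new Promise((resolve, reject) => {
        const worker = new Worker(workerCode, { eval: true });
        worker.once("message", result => {
            resolve(result);
            worker.terminate();
        });
        worker.once("error", reject);
        worker.postMessage(html);
    });
}

parseInWorker('<a href="/a">a</a> <a href="/b">b</a>').then(count =>
    console.log("links:", count)
);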
2. Distributed Crawling System
Question: Our list of seed URLs is now in the millions, and a single machine can't handle the workload. How would you design a distributed crawling system across multiple machines?
Discussion Points:
- URL Distribution: How to partition and assign URLs to different machines
- Consistent hashing based on hostname/domain (sketched after this list)
- Central coordinator/queue system
- Each machine owns specific domains
- Coordination: How machines communicate and avoid duplicates
- Shared distributed cache (Redis, Memcached)
- Central URL frontier/queue
- Partition by URL hash
- Load Balancing: Ensuring even distribution of work
- Fault Tolerance: Handling machine failures and retries
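As a rough illustration of hash-based URL distribution, the sketch below assigns each hostname to a machine by hashing it. The machine names and the simple hash-modulo scheme are illustrative assumptions; a production system would typically use a consistent-hash ring so that adding or removing a machine remaps only a small fraction of hostnames.
// Hypothetical sketch: route each URL to the crawler machine that "owns" its hostname.
const crypto = require("crypto");

function ownerMachine(url, machines) {
    const hostname = new URL(url).hostname;
    // Hash the hostname so every URL of a domain lands on the same machine.
    const digest = crypto.createHash("sha256").update(hostname).digest();
    const index = digest.readUInt32BE(0) % machines.length;
    return machines[index];
}

const machines = ["crawler-0", "crawler-1", "crawler-2"];
// Both calls map to the same machine, whichever one that is.
console.log(ownerMachine("http://news.yahoo.com/news", machines));
console.log(ownerMachine("http://news.yahoo.com/news/sports", machines));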
3. Politeness Policy and Rate Limiting
Question: Our current crawler is too aggressive and might overload the servers we are crawling. How would you implement a politeness policy to ensure we don't send too many requests in a short period?
Discussion Points:
- robots.txt compliance: Respect crawl delays and disallowed paths
- Rate limiting per domain: Limit requests per second to each hostname
- Per-domain request queues
- Delay between requests to same domain (e.g., 1-2 seconds); see the sketch after this list
- Adaptive throttling: Slow down if server responds slowly or returns errors
- Distributed rate limiting: Coordinate across multiple crawler machines
- Time-based scheduling: Crawl during off-peak hours if possible
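As a sketch of per-domain rate limiting, the scheduler below enforces a minimum delay between requests to the same hostname. The class name PolitenessScheduler and the delay values are illustrative assumptions, not part of the original solution.
// Minimal per-domain politeness sketch: callers await acquire(hostname) before fetching.
class PolitenessScheduler {
    constructor(minDelayMs = 1000) {
        this.minDelayMs = minDelayMs;
        this.nextAllowedTime = new Map(); // hostname -> earliest timestamp for the next request
    }

    // Resolves when a request to this hostname is allowed, honoring the per-domain delay.
    async acquire(hostname) {
        const now = Date.now();
        const earliest = this.nextAllowedTime.get(hostname) || now;
        const startAt = Math.max(now, earliest);
        this.nextAllowedTime.set(hostname, startAt + this.minDelayMs);
        if (startAt > now) {
            await new Promise(resolve => setTimeout(resolve, startAt - now));
        }
    }
}

const scheduler = new PolitenessScheduler(1500); // 1.5 s between hits to the same domain
// Inside the crawl worker (sketch):
//   await scheduler.acquire(new URL(url).hostname);
//   const nextUrls = await htmlParser.getUrls(url);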
4. Duplicate Content Detection
Question: We are finding that many different URLs point to the same or very similar content (mirrors, URL parameters, etc.). How would you detect and handle this to save on storage and processing?
Discussion Points:
- Content Fingerprinting:
- Hash the page content (MD5, SHA-256); sketched after this list
- Store hash to detect exact duplicates
- Simhash for near-duplicate detection
- URL Normalization:
- Canonicalize URLs (remove tracking parameters, sort query params)
- Follow canonical tags in HTML
- Content Similarity:
- Shingling/n-grams for fuzzy matching
- Jaccard similarity
- Trade-offs:
- Storage cost vs computation cost
- False positives vs false negatives
- When to fetch vs when to skip
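As a sketch of exact-duplicate detection via content fingerprinting, the helper below hashes a lightly normalized page body and reports bodies whose hash has already been seen. This only catches exact (post-normalization) duplicates; near-duplicates would need simhash or shingling as noted above. The function name and the whitespace normalization are illustrative assumptions.
// Fingerprint page bodies with SHA-256 and skip exact duplicates.
const crypto = require("crypto");

const seenFingerprints = new Set();

function isDuplicateContent(html) {
    // Light normalization so trivial whitespace differences do not change the hash.
    const normalized = html.replace(/\s+/g, " ").trim();
    const fingerprint = crypto.createHash("sha256").update(normalized).digest("hex");
    if (seenFingerprints.has(fingerprint)) {
        return true;
    }
    seenFingerprints.add(fingerprint);
    return false;
}

console.log(isDuplicateContent("<html>  <body>hello</body>\n</html>")); // false: first time seen
console.log(isDuplicateContent("<html> <body>hello</body> </html>"));   // true: same after normalization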