Design distributed throttling
You have n machines that are crawling websites. Implement throttling such that each crawler does not crawl more than x pages from a particular host in a given time period.
Solution
- Use a permission server in addition to the crawler servers
- The permission server keeps track of the number of pages downloaded from each host by any crawler during a given time period, and approves or denies a new crawl request accordingly
- Every crawler must first acquire permission before crawling a page; if permission is denied, it should wait and retry before downloading (see the crawler-side sketch after this list)
- To prevent the permission server from becoming a bottleneck, we can run multiple permission servers and hash on the host name to distribute requests, so all requests for the same host always land on the same server
- The permission algorithm could be a simple TTL-based key that stores the last time a page was downloaded from a host; if the TTL has not expired, permission is denied (a minimal sketch follows this list)
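A minimal sketch of the TTL-based permission check described above, assuming an in-memory map on each permission server. The class name `PermissionServer`, the method `request_permission`, and the idea of deriving the TTL as `period / x` are illustrative assumptions, not a prescribed API; a production version would likely use a shared store such as Redis with atomic check-and-set.

```python
import time


class PermissionServer:
    """Grants or denies crawl permission per host using a TTL-based key.

    Each host key stores the time of the last approved crawl; a new
    request is approved only if at least `ttl_seconds` have elapsed.
    To allow at most x pages per period, set ttl_seconds = period / x
    (an assumption about how the TTL is chosen).
    """

    def __init__(self, ttl_seconds: float):
        self.ttl_seconds = ttl_seconds
        self.last_crawl: dict[str, float] = {}  # host -> last approved time

    def request_permission(self, host: str) -> bool:
        now = time.monotonic()
        last = self.last_crawl.get(host)
        if last is not None and now - last < self.ttl_seconds:
            return False  # TTL has not expired yet: deny this crawl
        self.last_crawl[host] = now  # record approval, restarting the TTL
        return True
```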
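And a sketch of the crawler side: hashing the host to pick one of several permission servers, then waiting and retrying when permission is denied. `permission_clients` (a map from server address to an RPC client exposing `request_permission`) and `fetch_page` are hypothetical placeholders for the transport and download logic, which the original solution does not specify.

```python
import hashlib
import time


def pick_permission_server(host: str, servers: list[str]) -> str:
    """Map a host to one permission server by hashing the host name,
    so every request for the same host hits the same server."""
    digest = hashlib.sha256(host.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


def crawl_with_permission(url: str, host: str, servers: list[str],
                          permission_clients, fetch_page,
                          retry_delay: float = 1.0) -> None:
    """Ask the host's permission server before crawling; if denied,
    back off and ask again before downloading."""
    server = pick_permission_server(host, servers)
    while not permission_clients[server].request_permission(host):
        time.sleep(retry_delay)  # denied: wait, then retry
    fetch_page(url)
```

Hashing on the host (rather than round-robin) matters here: it keeps all bookkeeping for a given host on a single permission server, so no cross-server coordination is needed to enforce the limit.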