Design distributed throttling

You have n machines crawling websites. Implement throttling such that each crawler does not crawl more than x pages from a particular host.

Solution

  • Use a permission server alongside the crawler machines
  • The permission server keeps track of how many pages have been downloaded from each host by any crawler during a given time period, and approves or denies each new crawl request accordingly
  • Every crawler must first acquire permission before crawling a page; if permission is denied, it should wait before retrying the download
  • To prevent the permission server from becoming a bottleneck, we can run multiple permission servers and use a hash (e.g. of the host name) to distribute the requests among them (see the routing sketch after this list)
  • The permission algorithm could be as simple as a TTL-based key recording the last time a page was downloaded from a host; if the TTL has not yet expired, permission is denied (a minimal sketch follows below)
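A minimal sketch of the TTL-based permission check, assuming an in-memory store and a per-host page quota per window; the class and method names (`PermissionServer`, `request_permission`) are illustrative, not part of the original design:

```python
import threading
import time


class PermissionServer:
    """Grants or denies crawl permission for a host.

    Sketch of the TTL idea: each host gets a time window of `ttl_seconds`,
    and at most `max_pages` crawls are approved per window.
    """

    def __init__(self, ttl_seconds: float = 1.0, max_pages: int = 1):
        self.ttl_seconds = ttl_seconds
        self.max_pages = max_pages
        self._window_start = {}    # host -> time the current window began
        self._pages_granted = {}   # host -> pages approved in the current window
        self._lock = threading.Lock()

    def request_permission(self, host: str) -> bool:
        """Return True if one more page may be crawled from `host` now."""
        now = time.monotonic()
        with self._lock:
            start = self._window_start.get(host)
            if start is None or now - start >= self.ttl_seconds:
                # TTL expired (or first request for this host): open a fresh window.
                self._window_start[host] = now
                self._pages_granted[host] = 1
                return True
            if self._pages_granted[host] < self.max_pages:
                self._pages_granted[host] += 1
                return True
            # Quota exhausted for this window: the crawler should back off and retry.
            return False
```

A crawler would call `request_permission(host)` before each fetch and sleep briefly before retrying when it gets `False`. In a real deployment the counters would live in a shared store with native TTL support (such as Redis) rather than in process memory.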

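To shard the permission layer, each crawler can hash the host name to pick which permission server to ask, so that all counters for one host stay on one server and no cross-server coordination is needed. A small sketch, with hypothetical server addresses:

```python
import hashlib


def pick_permission_server(host: str, servers: list[str]) -> str:
    """Route all permission requests for a host to the same permission server."""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


# Example: three permission servers behind the crawler fleet (hypothetical addresses).
servers = ["perm-1:7000", "perm-2:7000", "perm-3:7000"]
print(pick_permission_server("example.com", servers))  # same server every time
```

If permission servers are added or removed, a consistent-hashing ring would limit how many hosts get remapped; the simple modulo hash above reshuffles most of them.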