Building a Polite Web Crawler - Turnerj (aka. James Turner)

Web crawling is the act of having a program or script accessing a website, capturing content and discovering any pages linked to from that content. On the surface it really is only performing HTTP requests and parsing HTML, both things that can be quite easily accomplished in a variety of languages and frameworks.

Web crawling is an extremely important tool for search engines or anyone wanting to perform analysis of a website. The act of crawling a site though can consume a lot of resources for the site operator depending how the site is crawled.

For example, if you crawl an 1000 page site in a few seconds, you've likely caused a not insignificant amount of server load for low-bandwidth hosting. What if you crawled a slow-loading page but your crawler didn't handle it properly, continuously re-querying the same page. What if you are just crawling pages that shouldn't be crawled. These things can lead to very upset website operators.

The character Maurice Moss from the TV Show "IT Crowd" throwing his computer monitor.

In a previous article, I wrote about the Robots.txt file and how that can help address these problems from the website operator's perspective. Web crawlers should (but don't have to) abide by the rules governed in that file to prevent getting blocked. In addition to the Robots.txt file, there are some other things crawlers should do to avoid being blocked.

When crawling a website on a large scale, especially for commercial purposes, it is a good idea to provide a custom User Agent, allowing website operators a chance to restrict what pages can be crawled.

Crawl frequency is another aspect you will want to to refine to allow you to crawl a site fast enough without being a performance burden. It is highly likely you will want to limit crawling to a handful of requests a second. It is also a good idea to track how long requests are taking and to start throttling the crawler to compensate for potential site load issues.

Web crawling with manners

I spend my days programming in the world of .NET and had a need for a web crawler for a project of mine. There are some popular web crawlers already out there including Abot and DotnetSpider however for different reasons they didn't suit my needs.

I originally did have Abot setup in my project however I have been porting my project to .NET Core and it didn't support it. The library also uses a no longer support version of a library that does parsing of Robots.txt files.

With DotnetSpider, it does support .NET Core but it is designed around an entire different process of using it with message queues, model binding and built-in DB writing. These are cool features but excessive for my own needs.

I wanted a simple crawler, supporting async/await, with .NET Core support thus InfinityCrawler was born!

I'll be honest, I don't know why I called it InfinityCrawler - it sounded cool at the time so I just went with it.

This crawler is in .NET Standard and builds upon both my SitemapTools and RobotsExclusionTools libraries. It uses the Sitemap library to help seed the list of URLs it should start crawling.

It has built in support for crawl frequency including obeying frequency defined in the Robots.txt file. It can detect slow requests and auto throttle itself to avoid thrashing the website as well as detect when performance improves and return back to normal.

using InfinityCrawler;

var crawler = new Crawler();
var results = await crawler.Crawl(siteUri, new CrawlSettings
{
	UserAgent = "Your Awesome Crawler User Agent Here"
});

InfinityCrawler, while available for use in any .NET project, it still is in its early stages. I am happy with its core functionality but likely will go through a few stages of restructure as well as expanding on the testing.

I am personally pretty proud of how I implemented the async/await part but would love to talk to anyone that is an expert in this area with .NET to check my implementation and give pointers on how to improve it.