Turnerj // TODO: Do Work

No Robots Allowed

Jan 7, 2019

You have probably heard of web crawlers (also called spiders or bots), usually in the context of a search engine indexing a site so that it can appear in search results.

This relationship between a search engine and a website operator is a delicate one. The website operator wants traffic to come to their site when people are searching for related phrases. The search engine wants to index the site so that it can get people to the most relevant content.

Website operators, however, do not like it when a crawler hits the site so hard that it goes down, nor do they like it when pages they didn't want displayed show up in search results.

A website operator has a powerful tool in their arsenal: they can block the crawler from scanning the site at all. I mean, if their site was going down often because it was being crawled too heavily, or the crawler was indexing pages that they REALLY didn't want indexed, that is their only choice, right?

But what if it wasn't...

[Image: The Grinder TV show quote, Dean saying "But what if it wasn't?"]

Welcome to the ring, robots.txt

The original specification for the "robots.txt" file was written in 1994 with the aim of giving website operators some level of control over how web crawlers access their sites.

It can look something like this:

# robots.txt for http://www.example.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html

It is a fairly simple format: define the user-agent (or * for any) that the following rules apply to, then add rules disallowing particular paths. The file also supports comments after a # symbol.
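To make the format a little more concrete, here is a minimal sketch of how a crawler could read the example file above. It uses Python's standard urllib.robotparser module purely as an illustration (it is not the .NET library discussed later in this post), and the "ExampleBot" user-agent and the paths being checked are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from above, inlined so nothing is fetched over the network
ROBOTS_TXT = """\
# robots.txt for http://www.example.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a hypothetical crawler name; it falls under the * block
for path in ("/index.html", "/tmp/cache.dat", "/foo.html"):
    allowed = parser.can_fetch("ExampleBot", "http://www.example.com" + path)
    print(path, "allowed" if allowed else "disallowed")
```

Each Disallow rule here is treated as a simple path prefix, so anything under /tmp/ is off limits while /index.html is fine.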

Later documents, such as the NoRobots RFC, expanded on this specification to include Allow rules and multiple user-agents per block.

While there is no official documentation for them, various web crawlers also support wildcard paths, using $ to match the end of the path, Crawl-delay, and specifying sitemaps (via Sitemap).
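Because these extensions are not formally specified, any implementation is an interpretation. As a rough sketch, assuming * means "any run of characters" and a trailing $ means "the path must end here", a wildcard rule could be matched in Python like this. The rule_matches name and the sample paths are hypothetical and only for illustration.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Check whether a robots.txt path pattern matches the start of `path`.

    Assumed semantics: '*' matches any run of characters and a trailing '$'
    anchors the pattern to the end of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match only matches at the start of the path, like a prefix rule
    return re.match(regex, path) is not None

print(rule_matches("/tmp/", "/tmp/file.txt"))
print(rule_matches("/private/*.pdf$", "/private/docs/report.pdf"))
print(rule_matches("/private/*.pdf$", "/private/report.pdf?download=1"))
```

The first two checks match and the third does not, because the $ requires the path to end in .pdf.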

For example, here is dev.to's robots file:

# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file
# To ban all spiders from the entire site uncomment the next two lines:
# User-agent: *
# Disallow: /

Sitemap: https://thepracticaldev.s3.amazonaws.com/sitemaps/sitemap.xml.gz

Since every rule is commented out and there are no active Disallow rules, this indicates that web crawlers can crawl any page they find.
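As a quick check of that reading, the sketch below feeds the same file to Python's standard urllib.robotparser module. The "ExampleBot" name and the article URL are invented for the example, and site_maps() needs Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# dev.to's robots file as quoted above, inlined to avoid a network request
DEV_TO_ROBOTS = """\
# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file
# To ban all spiders from the entire site uncomment the next two lines:
# User-agent: *
# Disallow: /

Sitemap: https://thepracticaldev.s3.amazonaws.com/sitemaps/sitemap.xml.gz
"""

dev_parser = RobotFileParser()
dev_parser.parse(DEV_TO_ROBOTS.splitlines())

# Hypothetical crawler name and URL; no rule applies, so access is granted
can_crawl = dev_parser.can_fetch("ExampleBot", "https://dev.to/some/article")
sitemaps = dev_parser.site_maps()  # list of Sitemap URLs (Python 3.8+)
print(can_crawl, sitemaps)
```

The commented-out lines are ignored entirely, so the only thing the parser picks up is the Sitemap entry.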

So the next question: Is having a "robots.txt" file a guarantee that all web crawlers will behave?

No (sorry)

[Image: The Grinder TV show quote, Dean Sr. hitting the table]

While it is true that there is no guarantee a web crawler will actually obey the robots file, it is still in the crawler's best interest to do so; otherwise it might end up being blocked.

The big search engines will obey the rules because they need to; again, their job is to get relevant content. It is the random other web crawlers that people write into their applications that you need to watch out for.

I am one of those people writing a web crawler and wanted to properly respect the websites I am crawling. While others have written libraries that can already do this, I wanted a better solution than what I found available.

My Library

Robots Exclusion Tools - an extremely creative name if I say so myself.

With NRobots being "an unofficial and unsupported fork" for robots file parsing, I wrote my own from scratch targeting .NET Standard 2.0. It supports all of the previously described rules while allowing flexibility to be extended later.

The core of my library is a custom tokenizer based on Jack Vanlightly's "Simple Tokenizer" article. On top of it, I wrote a validation layer that checks the token patterns to make sure they adhere to the NoRobots RFC.
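To give a feel for the tokenize-then-validate idea, here is a deliberately tiny sketch in Python. It is not the library's actual code (which is written for .NET), and the Token, tokenize and validate names, the token kinds and the single validation rule are all assumptions made up for illustration.

```python
from typing import List, NamedTuple

class Token(NamedTuple):
    kind: str  # "FIELD", "VALUE", "COMMENT" or "NEWLINE"
    text: str

def tokenize(robots_txt: str) -> List[Token]:
    """Split a robots file into a flat list of tokens, line by line."""
    tokens: List[Token] = []
    for raw_line in robots_txt.splitlines():
        # Everything after the first '#' is a comment
        body, hash_sign, comment = raw_line.partition("#")
        # The first ':' separates the field name from its value
        field, colon, value = body.partition(":")
        if colon:
            tokens.append(Token("FIELD", field.strip()))
            tokens.append(Token("VALUE", value.strip()))
        if hash_sign:
            tokens.append(Token("COMMENT", comment.strip()))
        tokens.append(Token("NEWLINE", ""))
    return tokens

def validate(tokens: List[Token]) -> List[str]:
    """Return an error for each Allow/Disallow that precedes any User-agent."""
    errors: List[str] = []
    seen_user_agent = False
    for token in tokens:
        if token.kind != "FIELD":
            continue
        name = token.text.lower()
        if name == "user-agent":
            seen_user_agent = True
        elif name in ("allow", "disallow") and not seen_user_agent:
            errors.append(f"{token.text} rule appears before any User-agent line")
    return errors

print([t.kind for t in tokenize("User-agent: *\nDisallow: /tmp/ # temp files\n")])
print(validate(tokenize("Disallow: /tmp/\nUser-agent: *\n")))
```

Keeping the two steps separate means the tokenizer stays dumb and forgiving, while the rules about what a valid file looks like live in one place.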

I probably do have a bit of Not-Invented-Here syndrome, but I think this library is a genuine step forward for anyone needing to parse robots files in .NET.

In a future post, I will go into how I use this library in two other libraries I have written.

More Information