Crawlers, also known as bots or spiders, navigate websites to index their contents. They read content much faster than human users, which can significantly impact server load. To manage this, you can instruct crawlers on which URLs to avoid.
Robots.txt
Using a robots.txt file, you can specify a list of URLs or URL paths that crawlers should not access.
Most reputable crawlers respect these instructions.
Example:
User-Agent: *
Disallow: */search/
This tells all crawlers (User-Agent: *) to skip any URL containing /search/.
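You can also block several paths for all crawlers at once. A minimal sketch, using hypothetical /cart/ and /checkout/ paths for illustration:
Example:
User-Agent: *
Disallow: /search/
Disallow: /cart/
Disallow: /checkout/
Each Disallow line blocks one URL path prefix.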
For more information, visit: https://www.robotstxt.org/robotstxt.html
HTML
In HTML, you can add the rel="nofollow" attribute to anchor elements to indicate that crawlers should not follow certain links. This is especially useful for dynamically generated links, whose URLs can't always be listed in advance in a robots.txt file.
Example:
<a rel="nofollow" href="https://www.savvii.com/managed-e-commerce-hosting/">Managed e-commerce hosting</a>
On e-commerce sites, for example, category filters create many URLs that all lead to the same content.
Because each filtered view produces a new, unique URL, crawlers can end up processing an excessive number of pages.
This inflates server load and usually can't be mitigated through caching, since each unique URL is a separate cache entry that is rarely requested more than once.
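For instance, a filter link on a category page could be marked up like this (the path and query parameters are hypothetical):
Example:
<a rel="nofollow" href="/shoes/?color=red&size=42">Red, size 42</a>
Crawlers that honor nofollow will skip the filtered view, while human visitors can still use the link normally.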
Headers / Meta Tags
You can also use response headers or meta tags to instruct crawlers to skip all links on the page.
Examples:
X-Robots-Tag: nofollow
<meta name="robots" content="nofollow">
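How you send the X-Robots-Tag header depends on your web server. A minimal sketch, assuming an Nginx server and a hypothetical /search/ path:
Example:
location /search/ {
    add_header X-Robots-Tag "nofollow";
}
On Apache, the mod_headers equivalent is Header set X-Robots-Tag "nofollow". Note that the meta tag only covers a single HTML page, whereas the header can also be applied to non-HTML responses.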
By implementing these strategies, you help crawlers navigate your website more efficiently: all relevant information is still indexed, while crawlers spend less time on your site.
As a result, server load decreases because crawlers only visit pertinent URLs, and a larger share of requests can be served from cache, improving response times.