Creating a robots.txt for web crawlers

Modified on Thu, 3 Feb, 2022 at 6:51 PM

Standard Rules

A robots.txt file is placed in the webroot. In most cases this will be /wordpress/current/
The content of the file determines the rules for user agents (web crawlers).
Different crawlers support different rules. Standard regular expressions do not apply. However, some special characters are widely supported:

* = Wildcard 
$ = End of URL
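
To make the two special characters concrete, here is a minimal Python sketch of how a crawler might interpret them. The function names robots_pattern_to_regex and path_matches are hypothetical helpers made up for this illustration, and real crawlers may differ in details such as precedence between rules.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regular expression."""
    # A trailing "$" anchors the pattern to the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal part, then let each "*" match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    """Return True if the URL path is covered by the robots.txt pattern."""
    # re.match anchors at the start, so an unanchored pattern acts as a prefix.
    return robots_pattern_to_regex(pattern).match(path) is not None


print(path_matches("/uploads/*.pdf$", "/uploads/2022/report.pdf"))  # True
print(path_matches("/contact$", "/contact/form"))                   # False
print(path_matches("/contact", "/contact/form"))                    # True
```

Without the $ sign a rule works as a prefix, which is why /contact also covers /contact/form in the last line.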

The standard notation of a robots.txt file:

User-agent: [user-agent name]
Crawl-Delay: [Delay in seconds between requests; not supported by every crawler]
Disallow: [URL path to exclude from crawling]

Example of a robots.txt

Blocking Seekport from ALL pages 

User-agent: Seekport 
Disallow: /
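
A rule like this can be checked locally before it is deployed. The sketch below uses Python's standard urllib.robotparser module; the host example.com and the agent name SomeOtherBot are placeholders chosen for illustration, not values from this article.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Seekport
Disallow: /
"""

parser = RobotFileParser()
# parse() accepts the file content as a list of lines, so no network access is needed.
parser.parse(rules.splitlines())

# example.com is a placeholder host used only for this check.
seekport_allowed = parser.can_fetch("Seekport", "https://example.com/any/page")
other_allowed = parser.can_fetch("SomeOtherBot", "https://example.com/any/page")

print(seekport_allowed)  # False: Seekport is blocked everywhere
print(other_allowed)     # True: no rule applies to other crawlers
```

Note that this standard-library parser does plain prefix matching and, as far as we know, does not interpret the * and $ signs, so it is best suited to checking simple rules like the one above.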

Set a crawl delay of 120 for Yahoo (Slurp) and block the /contact page. Crawlers that honor Crawl-Delay generally read the value as seconds, not milliseconds.

User-agent: Slurp
Crawl-Delay: 120
Disallow: /contact$
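
As a rough illustration of how a client could read this delay, the sketch below parses the same rule with Python's standard urllib.robotparser module. It assumes the crawler treats the value as a number of seconds to wait between requests; the variable names are our own.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Slurp
Crawl-Delay: 120
Disallow: /contact$
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# crawl_delay() returns the delay for the matching user agent, or None if none is set.
slurp_delay = parser.crawl_delay("Slurp")
other_delay = parser.crawl_delay("SomeOtherBot")  # hypothetical agent with no rules

print(slurp_delay)  # 120
print(other_delay)  # None
```

A polite client would then pause for that many seconds between requests, for example with time.sleep(slurp_delay).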

Block msnbot from crawling all PDF files under /uploads/.

User-agent: msnbot
Disallow: /uploads/*.pdf$

Multiple URLs for a single user agent.

User-agent: Slurp
Disallow: /example/$
Disallow: /contact/$
Disallow: /hidden/$

Multiple user agents. Separate each group with an empty line. Note that Googlebot ignores the Crawl-Delay directive.

User-agent: Ahrefsbot
Crawl-Delay: 120
Disallow: /contact$

User-agent: Googlebot
Crawl-Delay: 120
Disallow: /contact$

User-agent: Slurp
Crawl-Delay: 120
Disallow: /contact$
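
When many user agents share similar rules, writing the groups by hand becomes repetitive. The following Python sketch generates the file from a dictionary instead. The function build_robots_txt and the dictionary layout are assumptions made for this example, not part of any existing tool.

```python
def build_robots_txt(groups: dict) -> str:
    """Render one rule group per user agent, separated by empty lines."""
    blocks = []
    for agent, rules in groups.items():
        lines = [f"User-agent: {agent}"]
        # Crawl-Delay is optional because not every crawler supports it.
        delay = rules.get("crawl_delay")
        if delay is not None:
            lines.append(f"Crawl-Delay: {delay}")
        # One Disallow line per path, in the order given.
        for path in rules.get("disallow", []):
            lines.append(f"Disallow: {path}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


robots_txt = build_robots_txt({
    "Ahrefsbot": {"crawl_delay": 120, "disallow": ["/contact$"]},
    "Slurp": {"disallow": ["/example/$", "/hidden/$"]},
})
print(robots_txt)
```

The resulting text can be saved as robots.txt in the webroot.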
