Identify and block bots by user-agent string

The first way to identify SEO bots is to use their user-agent string. Each web client (web browser or bot) sends a special string with every request; this string is supposed to be unique to that client, so we can identify which bot is crawling our website and block those that are not welcome. To block the bots listed above, put the following at the top of your .htaccess file:

RewriteEngine On
#moz.com
RewriteCond %{HTTP_USER_AGENT} rogerbot [NC,OR]
#majestic.com
RewriteCond %{HTTP_USER_AGENT} MJ12bot [NC,OR]
#moz.com
RewriteCond %{HTTP_USER_AGENT} dotbot [NC,OR]
#gigablast.com
RewriteCond %{HTTP_USER_AGENT} gigabot [NC,OR]
#ahrefs.com
RewriteCond %{HTTP_USER_AGENT} AhrefsBot [NC]
RewriteRule .* - [F]

Basically, this tells Apache to send a “403 Forbidden” response to every client whose user-agent string contains one of the specified substrings (the [NC] flag makes the match case-insensitive, since some of these bots capitalize their names, e.g. “DotBot”).
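
If you prefer not to use mod_rewrite, roughly the same user-agent blocking can be done with mod_setenvif and mod_authz_host. This is only a sketch, assuming both modules are enabled (and, on Apache 2.4, that mod_access_compat is available for the Order/Deny syntax):

#flag unwanted crawlers with an environment variable (case-insensitive regex)
BrowserMatchNoCase "rogerbot|MJ12bot|dotbot|gigabot|AhrefsBot" bad_bot
#refuse any request that carries the bad_bot flag
Order Allow,Deny
Allow from all
Deny from env=bad_bot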

This method has disadvantages, though. User-agent strings are very easy to spoof: anyone with access to curl, or to the “User-Agent Switcher” Chrome extension, can send requests to your website using AhrefsBot’s user-agent string, for example. In addition, SEO bots might use “unofficial” user-agent strings to bypass this blocking method.

Identify and block bots by hostname

Another method is to block bots by their hostname. This is very similar to blocking by IP address, but more reliable and easier to implement, because you don’t have to maintain a long (and constantly changing) list of IP addresses used by the crawlers you want to block. This method has several disadvantages too, though. First, we cannot block all bots this way; Majestic SEO, for example, claims to be a distributed search engine, which means it crawls from a wide range of users’ IP addresses. Other bots don’t have known hostname patterns, or change the datacenters they use quite frequently. However, we can block some of them with this technique. Below is an example .htaccess snippet (again, this should go at the top of the .htaccess file, before the WordPress rewrite rules, for example).

Deny from .ahrefs.com
Deny from .dotnetdotcom.org

As you can see, here we block Ahrefs’ crawlers and dotnetdotcom.org’s crawlers (that company was purchased by moz.com, and Moz now uses their data too).
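
Note that Deny is the Apache 2.2 syntax (on Apache 2.4 it only works when mod_access_compat is loaded), and hostname-based matching requires Apache to perform reverse DNS lookups. If your server runs Apache 2.4 with the newer authorization directives, a rough equivalent sketch of the two Deny lines above would be:

#Apache 2.4 equivalent of the Deny lines above (mod_authz_host)
<RequireAll>
Require all granted
Require not host ahrefs.com
Require not host dotnetdotcom.org
</RequireAll>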

Moz.com’s bot, rogerbot, currently uses Amazon Web Services (https://aws.amazon.com/) infrastructure for its crawling, so it is hard to block this bot by hostname. The issue is that if you block AWS servers, you can accidentally block some legitimate clients such as RSS feed readers, some minor search engines, etc. However, it won’t affect your site visitors or major search engines like Google and Bing, so it’s probably safe to block requests from AWS too. If that’s acceptable to you, add “.amazonaws.com” to the list of blocked hosts, so the final version looks like this:

Deny from .ahrefs.com
Deny from .dotnetdotcom.org
Deny from .amazonaws.com

For maximum reliability, use both blocking techniques:

RewriteEngine On
#moz.com
RewriteCond %{HTTP_USER_AGENT} rogerbot [NC,OR]
#majestic.com
RewriteCond %{HTTP_USER_AGENT} MJ12bot [NC,OR]
#moz.com
RewriteCond %{HTTP_USER_AGENT} dotbot [NC,OR]
#gigablast.com
RewriteCond %{HTTP_USER_AGENT} gigabot [NC,OR]
#ahrefs.com
RewriteCond %{HTTP_USER_AGENT} AhrefsBot [NC]
RewriteRule .* - [F]
Deny from .ahrefs.com
Deny from .dotnetdotcom.org
#please remove the line below if you don’t want to block requests from AWS
Deny from .amazonaws.com

Keep in mind that you don't need more than one RewriteEngine On directive in your .htaccess file. Also, these rules must come before any rule that uses the [L] flag (such as the WordPress rewrite rules), because [L] stops rewrite processing for a matching request, so rules placed after such a rule would never be reached for most URLs.
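
To illustrate the placement, assuming a standard WordPress install, the overall layout of the .htaccess file would look roughly like this (only the structure matters here; the full bot-blocking rules shown earlier go where the first comment indicates):

#1) bot-blocking rules first
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} AhrefsBot [NC]
RewriteRule .* - [F]
Deny from .ahrefs.com
Deny from .dotnetdotcom.org
Deny from .amazonaws.com

#2) WordPress (or other CMS) rewrite rules below, ending with rules that use [L]
# BEGIN WordPress
# ... stock WordPress rules ...
# END WordPress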

In case you would like to block more bots, here is an extensive robots database.