Robots.txt Guide for WordPress Sites

WordPress automatically creates pages and URLs that you may not want search engines to index. The robots.txt file helps control how search engines and other bots crawl your site. Here’s everything you need to know, from the basics to advanced use cases. Blue Raven can help you edit these files, or you can use the default template as a starting point.

Basics on Robots.txt

robots.txt is a simple text file located in your site’s root directory (e.g., https://yoursite.com/robots.txt). It tells search engines which pages they can and cannot crawl.

  • robots.txt is for guidance, not enforcement.
  • It helps manage search engine indexing and reduces server load.
  • It cannot protect sensitive content — for security, use proper authentication, permissions, or server rules.

Example:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
  • User-agent: * → Applies to all bots.
  • Disallow: /wp-admin/ → Prevents bots from crawling the admin area.

Use cases:

  • Keep private areas hidden from search engines.
  • Avoid indexing duplicate content (e.g., search results pages).
  • Improve crawl efficiency by directing bots to important content first.
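If you want to check how rules like the example above apply to a given URL, Python’s standard-library urllib.robotparser can evaluate a robots.txt policy. A small sketch (the rules string simply mirrors the example above, and the URLs are illustrative):

```python
from urllib.robotparser import RobotFileParser

# Rules mirroring the example above
rules = """User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# The admin area is disallowed for every bot...
print(rp.can_fetch("*", "https://yoursite.com/wp-admin/"))   # False
# ...while ordinary content stays crawlable.
print(rp.can_fetch("*", "https://yoursite.com/my-post/"))    # True
```

This is handy for sanity-checking a robots.txt before you deploy it.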

Crawlers & Bots

Not all bots are created equal. Some respect robots.txt, some don’t.

Key points:

  • Search engine bots (Google, Bing) respect robots.txt and follow rules.
  • Bad bots (spam crawlers, scrapers) may ignore it.

Example: Block a specific bot:

User-agent: StealthRocin
Disallow: /
  • This tells a bot named “StealthRocin” to stay away from your entire site.
    • If StealthRocin is a “good bot” and follows standards → it will stay away.
    • If it’s a rogue bot → it will ignore this completely.
  • You can also allow certain bots while blocking others:
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
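This allow-one, block-the-rest split can be verified programmatically with urllib.robotparser; in the sketch below, the bot name SomeOtherBot is made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Rules mirroring the example above: Googlebot may crawl, everyone else may not.
rules = """User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Googlebot matches its own group and is allowed everywhere...
print(rp.can_fetch("Googlebot", "https://yoursite.com/any-page/"))     # True
# ...every other bot falls into the "*" group and is blocked.
print(rp.can_fetch("SomeOtherBot", "https://yoursite.com/any-page/"))  # False
```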

Even though robots.txt is voluntary, there are still good reasons to disallow certain bots:

  • Signal your intent – Search engines and well-behaved bots respect your preferences. For example, you might not want duplicate content, admin pages, or staging environments indexed.
  • Filter legitimate bots from malicious ones – If you define rules, you can at least identify rogue bots that ignore them in your logs. This helps you detect crawlers that might scrape content or attempt attacks.
  • Reduce server load – Well-behaved bots won’t crawl disallowed pages, which saves bandwidth and CPU.

What to do about bots that break the rules

If a bot ignores robots.txt, you need real enforcement methods:

  • Restrict access by user-agent or IP.
    • Block via .htaccess or Nginx rules
    • Example in Apache (.htaccess):
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} StealthRocin [NC]
RewriteRule .* - [F,L]
  • Firewall / security plugins
    • Use WordPress security plugins like Wordfence or Sucuri to block malicious bots automatically.
  • Rate limiting / bot management services
    • Services like Cloudflare can detect bad bots and block or challenge them before they reach your server.
  • Monitor your logs
    • Regularly check your server logs for bots ignoring robots.txt and take action if needed.
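As a sketch of that log check, the script below scans access-log lines in the common Apache/Nginx “combined” format for requests to paths your robots.txt disallows. The disallowed paths, the sample log line, and the bot name are illustrative assumptions — adapt them to your own robots.txt and logs:

```python
import re

# Combined-log-format pattern (the Apache/Nginx default); captures the client IP,
# the requested path, and the User-Agent string.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (?P<path>\S+)[^"]*" '
    r'\d+ \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Paths your robots.txt disallows (prefix match, as in robots.txt itself)
DISALLOWED = ("/wp-admin/", "/staging/")

def find_rule_breakers(log_lines):
    """Return (ip, agent, path) tuples for requests that hit disallowed paths."""
    hits = []
    for line in log_lines:
        m = LOG_PATTERN.match(line)
        if m and m.group("path").startswith(DISALLOWED):
            hits.append((m.group("ip"), m.group("agent"), m.group("path")))
    return hits

# Example with a made-up log line from a bot ignoring robots.txt:
sample = ['203.0.113.9 - - [01/Jan/2025:00:00:00 +0000] '
          '"GET /wp-admin/admin-ajax.php HTTP/1.1" 200 512 "-" "StealthRocin/1.0"']
print(find_rule_breakers(sample))
```

Any user-agent that shows up repeatedly in this report is a candidate for a firewall or .htaccess block.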

Advanced Uses For Robots.txt

Use robots.txt to fine-tune crawling behavior:

  • Crawl-delay: Slow down bot requests to reduce server load (support varies: Bing and Yandex honor this directive, but Googlebot ignores it):
User-agent: *
Crawl-delay: 10
  • Sitemap reference: Help search engines find your XML sitemap:
Sitemap: https://yoursite.com/sitemap_index.xml

Blue Raven can help with this – more detail is covered in our Sitemap and IndexNow sections.

  • Disallow specific file types:
User-agent: *
Disallow: /*.pdf$
Disallow: /*.zip$

Use case:

  • Sometimes your site has files like PDFs, ZIPs, or other downloads that you don’t want search engines to index.
  • Example: You might have private reports, downloadable resources, or old backups in .zip or .pdf format.
  • This keeps your search results clean, reduces duplicate content, and prevents search engines from wasting crawl budget on non-HTML files.

How it works:

  • * = all bots
  • /*.pdf$ = matches any URL ending in .pdf
  • Block URL parameters: Prevent indexing of dynamic content:
Disallow: /*?replytocom

What replytocom is:

  • WordPress automatically adds a URL parameter called replytocom when someone clicks “reply” on a comment.
  • Example URL:
https://example.com/my-post/?replytocom=42
  • Each comment gets its own replytocom URL.

Problem:

  • These URLs create duplicate content because the page is essentially the same as the main post.
  • Search engines might index hundreds of URLs for the same post, which wastes crawl budget and can harm SEO.

Use case:

  • By disallowing ?replytocom, you prevent search engines from crawling these duplicate comment URLs.
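The wildcard syntax used above (“*” and a trailing “$”) is a Google/Bing extension to the original robots.txt standard, and Python’s urllib.robotparser does not implement it. A minimal sketch of how such patterns map to regular expressions, so you can test what a rule would cover (the helper name is our own):

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the match
    at the end of the URL (Google/Bing semantics).
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore '*' as a wildcard
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile(body + ("$" if anchored else ""))

pdf_rule = robots_pattern_to_regex("/*.pdf$")
reply_rule = robots_pattern_to_regex("/*?replytocom")

print(bool(pdf_rule.match("/downloads/report.pdf")))      # True
print(bool(pdf_rule.match("/downloads/report.pdf?v=2")))  # False: '$' anchors the end
print(bool(reply_rule.match("/my-post/?replytocom=42")))  # True
```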

Security Tools and Methods For Robots.txt

While robots.txt is not a security tool, it can help reduce accidental exposure:

  • Block sensitive folders:
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-config.php

 ⚠️ Note: Anyone can still view disallowed URLs if they know the link, and robots.txt itself is publicly readable, so listing sensitive paths can actually advertise them. Use proper WordPress security measures (file permissions, authentication, server rules) to protect files and folders.

  • Prevent indexing of staging or test environments:
User-agent: *
Disallow: /

What it does:

  • User-agent: * → applies to all bots
  • Disallow: / → tells bots not to crawl anything on the site

In short, it tells all compliant bots not to crawl anything on the site. Note that pages blocked from crawling can still appear in search results if other sites link to them.

✅ Tip: For extra security, also password-protect staging sites — robots.txt is not a security measure, only a guideline.

Next Level Pro Websites

For large or complex WordPress sites, robots.txt can be a powerful SEO and site management tool:

  • Segment content for different bots: Allow Google to crawl everything but restrict other bots.
  • Boost important pages: Disallow low-value pages to focus crawl budget on high-value pages.
  • Custom rules per subdirectory: For multisite or eCommerce setups, you can disallow entire sections selectively.

Example:

User-agent: *
Disallow: /cart/
Disallow: /checkout/

User-agent: Googlebot
Allow: /
  • Combine robots.txt with meta robots tags for precise control.
  • Use robots.txt analytics tools to monitor bot activity and identify unwanted crawlers.
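As mentioned above, a meta robots tag gives page-level control that robots.txt cannot: robots.txt governs crawling, while the meta tag tells compliant crawlers not to index a page they do fetch. For example, placed in a page’s <head>:

```html
<!-- Allow crawlers to follow links, but keep this page out of the index -->
<meta name="robots" content="noindex, follow">
```

A bot has to crawl the page to see the tag, so don’t also disallow that URL in robots.txt, or the tag will never be read.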

A Simple but Effective robots.txt for WordPress

# Block sensitive areas from all bots
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /staging/
Disallow: /*?replytocom
Disallow: /*.pdf$
Disallow: /*.zip$

# Allow Googlebot to crawl everything else
User-agent: Googlebot
Allow: /

# Reference XML sitemap
Sitemap: https://example.com/sitemap_index.xml

Key Takeaways

  • robots.txt guides bots, but does not prevent access.
  • Use it for SEO, crawl management, and to reduce exposure of non-public areas.
  • For real security, rely on passwords, permissions, and server rules.
  • Advanced configurations help pro sites optimize performance, SEO, and server load.