Robots.txt for the AI Age: Allow, Block, or A/B Test?
TL;DR
- Robots.txt controls which bots can access which parts of your site – but it is a request, not enforcement.
- Blocking AI crawlers means your products will not appear in AI-generated recommendations.
- The optimal strategy is not all-or-nothing – it is selective access with measurement.
- A/B testing robots.txt (botjar's unique capability) lets you measure the revenue impact of different crawler policies.
What Robots.txt Actually Does
The robots.txt file lives at the root of your website (e.g., https://yoursite.com/robots.txt) and tells web crawlers which pages or sections of your site they may access. It uses a simple text-based convention, the Robots Exclusion Protocol, in use since 1994 and formally standardized as RFC 9309 in 2022.
Here is the most basic example of a robots.txt file that allows all crawlers to access all pages:
User-agent: *
Allow: /
And here is one that blocks a specific crawler from your entire site:
User-agent: GPTBot
Disallow: /
The critical thing to understand is that robots.txt is advisory. Well-behaved crawlers from reputable companies (Google, OpenAI, Anthropic) respect it. Malicious bots ignore it entirely. This means robots.txt is not a security mechanism – it is a communication channel between you and legitimate crawlers.
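You can see how a well-behaved crawler interprets these rules with Python's standard-library robots.txt parser. A minimal sketch (the policy below is illustrative, not a recommendation):

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: block GPTBot site-wide, say nothing about other bots.
rules = """
User-agent: GPTBot
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# A compliant GPTBot skips this URL; a bot with no matching rules
# falls back to the default, which is "allowed".
print(parser.can_fetch("GPTBot", "/products/widget"))     # False
print(parser.can_fetch("Googlebot", "/products/widget"))  # True
```

Note that this only models what a cooperative crawler would do; nothing in the protocol forces a bot to call the equivalent of can_fetch before requesting a page.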
The Default State: What Happens Without Explicit Rules
If your robots.txt does not mention a specific crawler, the default behavior for most bots is to crawl everything that is not explicitly disallowed. This means that unless you have added specific Disallow directives for AI crawlers, they are already accessing your entire site.
Many site operators are unaware of this. They assume that because they never explicitly invited AI crawlers, their content is not being used. In reality, the opposite is true: by default, your content is being crawled, processed, and potentially used for AI training and recommendations. The question is not whether AI crawlers are visiting – it is whether you are making the most of those visits.
If you do not have a robots.txt file at all, the behavior is the same: crawlers treat the entire site as open. Having no robots.txt is functionally identical to having one that contains User-agent: * followed by Allow: /.
Should You Block AI Crawlers?
This is the most common question in Bot CRO, and the answer depends on your business model. There are legitimate reasons to block certain AI crawlers: you may not want your proprietary content used to train commercial AI models, you may have licensing concerns, or you may want to reduce server load from aggressive crawlers like Bytespider.
However, for ecommerce sites that sell products, blocking AI crawlers is almost always a mistake. When you block GPTBot, your products disappear from ChatGPT recommendations. When you block PerplexityBot, you lose both the recommendation and the referral traffic (Perplexity cites sources with links). When you block ClaudeBot, Claude cannot reference your product information in conversations.
The nuanced approach is selective access. Here is a robots.txt configuration that allows major AI crawlers to access your product pages while restricting them from admin, checkout, and user-generated content areas:
# Allow AI crawlers on product and category pages
User-agent: GPTBot
Allow: /products/
Allow: /collections/
Allow: /pages/
Disallow: /admin/
Disallow: /checkout/
Disallow: /account/

User-agent: ClaudeBot
Allow: /products/
Allow: /collections/
Allow: /pages/
Disallow: /admin/
Disallow: /checkout/
Disallow: /account/

User-agent: PerplexityBot
Allow: /products/
Allow: /collections/
Allow: /pages/
Disallow: /admin/
Disallow: /checkout/
Disallow: /account/

# Block aggressive crawler with no clear benefit
User-agent: Bytespider
Disallow: /
This configuration gives you the best of both worlds: your products appear in AI recommendations while your sensitive pages remain private. You block Bytespider because its aggressive crawling adds server load without a clear revenue benefit for most ecommerce brands.
Fine-Tuning Access: Allow Specific Paths per Crawler
The robots.txt protocol supports path-level control. You can allow one crawler to access your blog while blocking it from your product pages, or vice versa. Rule precedence matters: Google and other RFC 9309-compliant crawlers apply the most specific (longest) matching rule, while some simpler parsers apply rules in the order they appear in the file.
You can also combine robots.txt with the Sitemap directive to guide crawlers toward your most important pages:
User-agent: GPTBot
Allow: /products/
Allow: /collections/
Allow: /blog/
Disallow: /

Sitemap: https://yoursite.com/sitemap.xml
In this configuration, GPTBot can only access product pages, collection pages, and blog posts. Everything else is disallowed. The Sitemap directive helps the crawler discover all available URLs within those allowed paths.
One common mistake is assuming every crawler resolves Allow/Disallow conflicts the same way. A parser that does not support the Allow extension, or that resolves conflicts differently than you expect, may treat a trailing Disallow: / as blocking everything, including your allowed paths. Always test your configuration with a robots.txt validator and check your server logs to verify that crawlers are behaving as expected.
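One way to sanity-check precedence locally is again the standard-library parser, which applies rules in file order (first match wins), whereas Google applies the longest matching rule; the configuration below yields the same result under both interpretations:

```python
from urllib.robotparser import RobotFileParser

# Same shape as the GPTBot example above: specific Allow rules
# listed before a catch-all Disallow.
rules = """
User-agent: GPTBot
Allow: /products/
Allow: /collections/
Allow: /blog/
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("GPTBot", "/products/running-shoes"))  # True
print(parser.can_fetch("GPTBot", "/checkout/"))               # False
```

A local check like this catches syntax errors and ordering surprises before deployment; it does not replace watching the actual crawl patterns in your logs.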
A/B Testing Robots.txt: Measure Before You Commit
Here is the problem with traditional robots.txt management: you make a change, wait weeks or months, and hope the outcome is positive. There is no controlled experiment. There is no baseline. There is no way to isolate the impact of the change from other variables (seasonality, algorithm updates, competitive shifts).
This is where botjar introduces a fundamentally new capability: A/B testing for robots.txt. The concept is simple but powerful. You define two robots.txt variants – for example, one that allows GPTBot on product pages and one that blocks it. Botjar serves each variant to the crawler on alternating crawl sessions while tracking which product pages get crawled, how often they appear in AI recommendations, and the downstream traffic and revenue impact.
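Botjar's internals are not public, but the core mechanic can be sketched in a few lines: deterministically bucket each crawl session into one of two robots.txt variants, so assignment is stable and roughly 50/50. The session_key input (some identifier for a crawl session) and both variant policies are hypothetical stand-ins:

```python
import hashlib

# Two hypothetical policies under test for GPTBot.
VARIANT_A = "User-agent: GPTBot\nAllow: /products/\nDisallow: /admin/\n"
VARIANT_B = "User-agent: GPTBot\nDisallow: /\n"

def robots_variant(session_key: str) -> str:
    """Stable 50/50 bucketing: the same session always sees the same variant."""
    digest = hashlib.sha256(session_key.encode("utf-8")).hexdigest()
    return VARIANT_A if int(digest, 16) % 2 == 0 else VARIANT_B
```

Logging which variant each session received is what makes the downstream comparison of crawl activity, AI visibility, and revenue possible.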
For the first time, you can answer questions like: does allowing GPTBot to crawl my product pages actually increase ChatGPT recommendations? Does blocking Bytespider reduce server costs without affecting AI visibility? Does giving PerplexityBot access to my blog posts drive measurable referral traffic? These are questions that no one could answer before because no tool offered controlled experiments at the crawler policy level.
The A/B testing approach eliminates guesswork from crawler policy decisions. Instead of relying on industry advice or best guesses, you make data-driven decisions based on your own site's data. A policy that works for a large media publisher may not work for a DTC ecommerce brand, and vice versa. Testing reveals the answer for your specific situation.
Common Robots.txt Mistakes
Even experienced technical teams make robots.txt errors that have outsized consequences. Here are the most frequent mistakes to avoid.
Blocking All Bots With a Wildcard
Adding User-agent: * followed by Disallow: / blocks every well-behaved bot, including Googlebot. This removes your site from search results and AI recommendations simultaneously. It is the nuclear option and is almost never the right choice for a production site.
Treating Robots.txt as Security
Robots.txt does not prevent access – it requests that crawlers not access certain paths. Malicious bots ignore it. Sensitive pages (admin panels, user data, API endpoints) must be protected with authentication, not robots.txt directives. Relying on robots.txt for security is like putting a "please do not enter" sign on an unlocked door.
Forgetting Google-Extended Is Separate From Googlebot
Many site owners add a blanket block for Google-Extended thinking it will affect search indexing. It will not. Google-Extended controls AI training data only. Googlebot (which handles search) is a completely separate user-agent. Blocking Google-Extended while allowing Googlebot is a valid and common configuration. Refer to our AI crawlers guide for the full list of user-agent strings and their purposes.
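You can confirm the separation with the standard-library parser: a site-wide block on Google-Extended has no effect on Googlebot, because they are matched as distinct user-agents. A sketch:

```python
from urllib.robotparser import RobotFileParser

# Opt out of AI-training use via Google-Extended; say nothing about Googlebot.
rules = """
User-agent: Google-Extended
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("Google-Extended", "/products/"))  # False
print(parser.can_fetch("Googlebot", "/products/"))        # True
```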
Never Checking Whether Rules Are Working
You update your robots.txt, deploy it, and move on. But did the crawlers actually change their behavior? Without monitoring your server logs, you have no way to know. Syntax errors, caching delays, and edge-case parsing differences between crawlers can all cause your rules to be interpreted differently than you intended. Always verify changes by checking crawl patterns in your logs after deployment.
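A minimal log check can be as simple as counting requests per AI user-agent before and after a rule change. A sketch assuming combined-format access log lines (the sample lines are fabricated for illustration):

```python
from collections import Counter

# User-agent substrings for the crawlers discussed above.
AI_CRAWLERS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Bytespider")

def crawler_hits(log_lines):
    """Count requests per AI crawler by matching user-agent substrings."""
    counts = Counter()
    for line in log_lines:
        for bot in AI_CRAWLERS:
            if bot in line:
                counts[bot] += 1
                break  # attribute each request line to one crawler
    return counts

# Hypothetical, abbreviated access log lines.
sample = [
    '20.1.2.3 - - "GET /products/a HTTP/1.1" 200 "Mozilla/5.0; compatible; GPTBot/1.2"',
    '20.1.2.4 - - "GET /checkout/ HTTP/1.1" 403 "Mozilla/5.0 (compatible; Bytespider)"',
]
print(crawler_hits(sample))
```

If a crawler you just disallowed still shows steady hit counts days later, suspect a syntax error, a cached robots.txt, or a bot that simply ignores the file.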
Set-and-Forget Mentality
The AI crawler landscape is evolving rapidly. New crawlers appear regularly. Existing crawlers change their behavior. A robots.txt configuration that was optimal six months ago may be leaving revenue on the table today. Treat your robots.txt as a living document that gets reviewed and updated as the landscape shifts. Better yet, use continuous A/B testing to stay ahead automatically.
A/B test your robots.txt with botjar
Stop guessing which crawler policies work. Botjar lets you run controlled experiments on your robots.txt and measure the impact on AI visibility and revenue.
See Robots.txt A/B Testing – Demo