
Robots.txt Best Practices for Ecommerce in 2026

Botjar Team

Robots.txt Has Never Mattered More

For most of the web's history, robots.txt was a simple file that told Googlebot which pages to skip. You wrote it once, maybe updated it when you redesigned your site, and forgot about it. Those days are over.

In 2026, your robots.txt file is a strategic asset. It controls which AI crawlers can access your content, which pages search engines prioritize, and whether competitor scrapers can harvest your pricing data. A well-configured robots.txt can boost your AI Visibility Score, protect sensitive pages, and reduce unnecessary server load – all from a single text file.

The Basics (Quickly)

Robots.txt is a plain text file at yourdomain.com/robots.txt that provides crawling directives to web robots. It uses a simple syntax:

  • User-agent: specifies which bot the rules apply to
  • Disallow: blocks access to specified paths
  • Allow: explicitly permits access (overrides Disallow for specific paths)
  • Crawl-delay: asks a bot to wait a set number of seconds between requests (not respected by all crawlers)
  • Sitemap: points to your XML sitemap
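Put together, a minimal file using all five directives might look like this (the paths and domain are placeholders):

```
User-agent: *
Disallow: /private/
Allow: /private/press-kit/
Crawl-delay: 5

Sitemap: https://yourdomain.com/sitemap.xml
```

Blank lines separate rule groups; the Sitemap line applies to the whole file regardless of which group it sits near.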

Rules are advisory, not enforced. Well-behaved bots follow them. Malicious bots ignore them. Your robots.txt is a guideline, not a security mechanism.

The 2026 Ecommerce Template

Here is a comprehensive robots.txt configuration for a modern ecommerce site. Adapt it to your specific URL structure:

Search Engine Crawlers

Allow broad access with targeted exclusions:

  • Allow all product pages, category pages, and content pages
  • Block internal search results (/search?) to prevent index bloat
  • Block faceted navigation URLs with parameters to avoid duplicate content
  • Block cart, checkout, and account pages – these add no search value
  • Block paginated results beyond a reasonable depth
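Expressed as directives, the exclusions above might look like the following (paths are illustrative; note that "beyond a reasonable depth" has no direct robots.txt syntax, so pagination is shown as a blunt parameter pattern):

```
User-agent: Googlebot
User-agent: Bingbot
Disallow: /search?
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /cart
Disallow: /checkout
Disallow: /account
# No "depth" syntax exists; if deep pagination adds no search
# value, one blunt option is to block the page parameter entirely:
# Disallow: /*?page=
```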

AI Crawlers (GPTBot, ClaudeBot, PerplexityBot)

For most ecommerce sites, you want to allow AI crawlers access to your product and content pages while blocking operational areas:

  • Allow: /products/, /collections/, /categories/, /blog/
  • Block: /cart, /checkout, /account, /admin
  • Block: /search?, /*?sort=, /*?filter=
  • Consider blocking: /wishlist, /compare, /recently-viewed
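One way to express this, grouping the three AI crawlers under shared rules (RFC 9309 permits multiple User-agent lines per group; the paths are illustrative):

```
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
Allow: /products/
Allow: /collections/
Allow: /categories/
Allow: /blog/
Disallow: /cart
Disallow: /checkout
Disallow: /account
Disallow: /admin
Disallow: /search?
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /wishlist
Disallow: /compare
Disallow: /recently-viewed
```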

SEO Tool Crawlers

AhrefsBot, SemrushBot, and similar crawlers analyze your backlink profile and keyword rankings. Blocking them has no SEO benefit and just means your own SEO tools have incomplete data. Allow them unless you have a specific reason not to.

Aggressive or Unwanted Crawlers

Some crawlers provide no value and consume significant resources:

  • Bytespider – ByteDance's crawler can be extremely aggressive. Consider blocking if volume is problematic.
  • MJ12bot – Majestic's crawler is aggressive on some sites. Block if it is impacting server performance.
  • DotBot – Moz's crawler can be blocked if you do not use Moz tools.
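If you decide these crawlers are costing more than they return, a full block per bot is straightforward:

```
User-agent: Bytespider
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: DotBot
Disallow: /
```

Remember these directives are advisory; a bot that ignores robots.txt needs server-level blocking instead.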

Common Mistakes to Avoid

1. Blocking Everything by Default

A blanket Disallow: / for all user agents is the nuclear option. It blocks all crawlers from all pages. Some sites do this accidentally during development and forget to remove it. This is catastrophic for both SEO and AI visibility.
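The accidental version of this mistake is usually just two lines left over from a staging environment:

```
# Blocks every compliant crawler from the entire site
# Remove before launch
User-agent: *
Disallow: /
```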

2. Not Listing AI Crawlers Specifically

If you only have rules for Googlebot and a wildcard * user agent, AI crawlers will follow your wildcard rules. This might be fine, or it might accidentally block them from pages you want them to access. Be explicit about AI crawler directives.
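Under RFC 9309 matching rules, a bot uses the most specific group that matches its name and ignores the wildcard group entirely, so an explicit group must repeat any wildcard rules you still want applied. A sketch (paths are illustrative):

```
# GPTBot ignores this group once it finds its own below
User-agent: *
Disallow: /search?
Disallow: /cart

# Explicit group for GPTBot: repeat the rules you still want,
# then add anything AI-specific
User-agent: GPTBot
Disallow: /search?
Disallow: /cart
Disallow: /internal-docs/
```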

3. Blocking Sitemaps

If your Disallow rules accidentally cover your sitemap URL, crawlers cannot discover your full page inventory. Always ensure your sitemap paths are accessible.

4. Ignoring Crawl-delay

While Googlebot ignores Crawl-delay, some AI crawlers and SEO bots respect it. Setting a reasonable crawl delay (5-10 seconds) for aggressive bots reduces server load without blocking them entirely.
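For example, to throttle bots without blocking them (the values are starting points, not recommendations; verify in your logs that each bot actually honors the directive):

```
User-agent: MJ12bot
Crawl-delay: 10

User-agent: SemrushBot
Crawl-delay: 5
```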

5. Never Testing Changes

Every robots.txt change should be tested before deployment. Google Search Console has a robots.txt tester. For AI crawlers, you need to verify that your directives work as expected – which is where A/B testing your robots.txt becomes valuable.
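For a quick local check before deployment, Python's standard library includes a parser that evaluates a robots.txt against a given user agent and URL. A minimal sketch, with illustrative rules and URLs:

```python
from urllib.robotparser import RobotFileParser

# Candidate robots.txt content, tested before it goes live
rules = """
User-agent: *
Disallow: /cart
Disallow: /checkout
Allow: /products/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Verify the intent of each rule
print(parser.can_fetch("GPTBot", "https://shop.example/products/blue-mug"))  # True
print(parser.can_fetch("GPTBot", "https://shop.example/cart"))               # False
```

Running this against every rule change catches typos like a missing slash or an over-broad wildcard before crawlers ever see them.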

Advanced: Dynamic Robots.txt

Some ecommerce platforms support dynamic robots.txt generation based on conditions. This lets you:

  • Adjust crawler access based on time of day (allow aggressive crawling during off-peak)
  • Temporarily block all crawlers during sales events to prioritize human traffic
  • Serve different robots.txt rules based on the requesting IP (not recommended – it violates the spirit of the protocol)
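The time-of-day idea can be sketched as a small handler that your platform serves at /robots.txt instead of a static file. Everything here is hypothetical: the off-peak window, the paths, and the delay value would come from your own traffic data:

```python
from datetime import time

# Hypothetical off-peak window for this shop's traffic
OFF_PEAK_START = time(1, 0)   # 01:00
OFF_PEAK_END = time(6, 0)     # 06:00

def robots_txt(now: time) -> str:
    """Return a permissive robots.txt off-peak, a throttled one otherwise."""
    if OFF_PEAK_START <= now < OFF_PEAK_END:
        return "User-agent: *\nDisallow: /cart\n"
    # During business hours, ask compliant bots to slow down
    return "User-agent: *\nDisallow: /cart\nCrawl-delay: 10\n"

print(robots_txt(time(3, 0)))   # off-peak: no Crawl-delay line
```

Whatever generates the file, log every variant you serve so you can reconstruct what a crawler saw at any given time.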

Dynamic robots.txt is powerful but risky. One misconfiguration and you could accidentally block Googlebot during your biggest sales day. Test thoroughly.

Monitoring Robots.txt Compliance

Writing a great robots.txt is only half the job. You also need to verify that bots are actually following your directives. Not all crawlers respect robots.txt, and some follow it selectively.

Monitor your server logs or use a tool like botjar to verify:

  • Which bots are respecting your Disallow rules
  • Which bots are ignoring them and crawling blocked paths anyway
  • Whether any legitimate crawlers are being accidentally blocked
  • How crawl volume changes after robots.txt updates
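As a starting point for the log-based approach, a few lines of scripting can flag requests that hit paths your robots.txt disallows. The log format, bot names, and disallowed paths below are assumptions; adapt them to your server:

```python
import re

# Paths disallowed in this site's robots.txt (assumed)
DISALLOWED = ("/cart", "/checkout", "/account", "/admin")

# Minimal combined-log-style lines for illustration
LOG_LINES = [
    '1.2.3.4 - - [10/Jan/2026] "GET /products/mug HTTP/1.1" 200 "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '5.6.7.8 - - [10/Jan/2026] "GET /checkout HTTP/1.1" 200 "Bytespider"',
]

def violations(lines):
    """Yield (user_agent, path) for requests that hit a disallowed path."""
    for line in lines:
        match = re.search(r'"GET (\S+) HTTP/[\d.]+" \d+ "([^"]*)"', line)
        if match:
            path, agent = match.groups()
            if path.startswith(DISALLOWED):
                yield agent, path

for agent, path in violations(LOG_LINES):
    print(f"{agent} crawled blocked path {path}")
```

A scheduled run of something like this over yesterday's logs gives you the "which bots ignore my rules" answer without waiting for a rankings change to reveal the problem.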

Test your robots.txt with real bot data. Botjar shows you which crawlers are following your directives, which are ignoring them, and what you should change. Get your free bot audit →
