How to Block AI Crawlers and Bots with robots.txt

AI crawlers are visiting your website every day. OpenAI’s GPTBot, Anthropic’s ClaudeBot, Meta’s crawler, and dozens of others are scraping content to train large language models or power AI search features.

Unlike traditional search engine crawlers that index your site and send you traffic, many AI crawlers take your content without attribution or a link back. The good news is that most of them respect robots.txt directives, giving you control over what they can access.

Below I’ll cover which AI bots are crawling your site, how to block them selectively, and how to decide which ones to keep. This guide is part of my AEO checklist for WordPress.

Training Crawlers vs. Search and Retrieval Bots

Before blocking anything, understand the three categories of AI crawlers:

Training crawlers collect your content to train AI models. Your text becomes part of the model’s knowledge, but you get no attribution, no link, and no traffic. Examples: GPTBot, Google-Extended, ClaudeBot, Meta-ExternalAgent.

Search and retrieval bots fetch your content in real time to answer a user’s question. They typically cite your page and link back to it. Examples: OAI-SearchBot, Claude-SearchBot, PerplexityBot.

AI assistants and agents browse the web on behalf of a specific user – like ChatGPT-User, Claude-User, or the newer autonomous agents (ChatGPT Operator, Google Agent). Some of these are starting to ignore robots.txt because the request is “user-initiated.”

Blocking training crawlers protects your content from being absorbed without credit. Blocking search bots means AI assistants won’t reference your site when users ask related questions, which could cost you visibility as generative search grows. That visibility is what generative engine optimization (GEO) is about.

Think carefully before blocking citation crawlers like OAI-SearchBot and PerplexityBot. These bots drive referral traffic to your site by citing and linking to your pages. Blocking them means your content won’t appear in AI-powered answers, which are an increasingly important traffic source.

Complete List of AI Crawlers

Here are the major AI crawlers you should know about. The landscape has grown to over 140 known user-agents as of early 2026 – these are the ones that matter most:

Training Crawlers

| Company | User-Agent | Purpose |
| --- | --- | --- |
| OpenAI | GPTBot | Trains GPT models |
| Anthropic | ClaudeBot | Trains Claude models |
| Google | Google-Extended | Trains Gemini (does not affect Search rankings) |
| Apple | Applebot-Extended | Apple Intelligence / Siri |
| Meta | Meta-ExternalAgent | Trains LLaMA models (highest crawl volume after Googlebot) |
| Amazon | Amazonbot | Alexa and Rufus AI shopping assistant |
| ByteDance | Bytespider | TikTok AI training (volume dropped 85% in 2025) |
| Common Crawl | CCBot | Open dataset used by many AI labs |
| Cohere | cohere-ai | Enterprise AI models |
| DeepSeek | DeepSeekBot | LLM training (questionable robots.txt compliance) |

Search and Retrieval Bots

These bots fetch content to answer queries and typically cite your page with a link back:

| Company | User-Agent | Purpose |
| --- | --- | --- |
| OpenAI | OAI-SearchBot | Powers ChatGPT Search results |
| Anthropic | Claude-SearchBot | Claude search results |
| Perplexity | PerplexityBot | AI search engine |
| Amazon | Amzn-SearchBot | Amazon AI search |

AI Assistants (User-Triggered)

These browse on behalf of a specific user. Blocking them means the AI can’t look up your content when a user asks:

| Company | User-Agent | Purpose |
| --- | --- | --- |
| OpenAI | ChatGPT-User | Real-time browsing for ChatGPT |
| Anthropic | Claude-User | Real-time browsing for Claude |
| Perplexity | Perplexity-User | User-triggered fetching |

Watch out for ChatGPT-User: OpenAI quietly removed robots.txt compliance language from the ChatGPT-User documentation, arguing that user-initiated requests “may not” be subject to robots.txt. This means blocking ChatGPT-User in your robots.txt may no longer be effective.

This list evolves as new AI companies launch crawlers. For an up-to-date directory with 140+ entries, check the ai.robots.txt community project on GitHub or Known Agents.
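
The three tables above can be captured as a small lookup, which is handy when you want to label requests in logs or dashboards. This is a minimal sketch, not an official registry: the token lists are copied from this guide, and the names AI_CRAWLER_CATEGORIES and classify_user_agent are hypothetical. One caveat: Google documents Google-Extended as a robots.txt control token rather than a separate HTTP user-agent, so you should not expect to see it in request headers.

```python
from typing import Optional

# Tokens copied from the tables in this guide; extend the lists as new bots appear.
AI_CRAWLER_CATEGORIES = {
    "training": [
        "GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
        "Meta-ExternalAgent", "Amazonbot", "Bytespider", "CCBot",
        "cohere-ai", "DeepSeekBot",
    ],
    "search": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Amzn-SearchBot"],
    "assistant": ["ChatGPT-User", "Claude-User", "Perplexity-User"],
}


def classify_user_agent(user_agent: str) -> Optional[str]:
    """Return 'training', 'search', 'assistant', or None for a raw User-Agent header."""
    ua = user_agent.lower()
    for category, tokens in AI_CRAWLER_CATEGORIES.items():
        # Case-insensitive substring match against each known token.
        if any(token.lower() in ua for token in tokens):
            return category
    return None
```

A request whose header contains "GPTBot/1.1" would be labeled as training, while an ordinary browser string returns None. Substring matching only works for bots that identify themselves honestly.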

How to Block AI Crawlers in robots.txt

Add User-agent and Disallow directives to your robots.txt file. This file sits at the root of your site (e.g., https://yoursite.com/robots.txt).

Block All AI Training Crawlers

To block the major training crawlers while keeping citation bots allowed:

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: DeepSeekBot
Disallow: /

This is the approach most publishers take: block training crawlers that absorb your content without attribution, but keep search and retrieval bots allowed so your site can appear in AI-powered answers.
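
One way to sanity-check rules like these before deploying them is Python's standard-library urllib.robotparser. The sketch below rests on a few assumptions: ROBOTS_TXT is a shortened copy of the block above, is_allowed is a hypothetical helper name, and Python's parser is only an approximation of how each real crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the training-crawler block from this guide.
ROBOTS_TXT = """\
# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, "https://yoursite.com/blog/") else "blocked"
        print(f"{agent}: {verdict}")
```

With these rules the training crawlers come back blocked and the search bots allowed, because no group names the search bots and there is no catch-all User-agent: * group.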

Block Specific Paths Only

If you want AI crawlers to access some content but not all of it, block specific directories:

User-agent: GPTBot
Disallow: /premium-content/
Disallow: /members-only/
Allow: /blog/

User-agent: ClaudeBot
Disallow: /premium-content/
Disallow: /members-only/
Allow: /blog/

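This keeps gated or premium content out of reach while leaving the rest of the site, including your public blog posts, open to these crawlers. Because nothing else is disallowed, the Allow: /blog/ line is optional here; it simply makes the intent explicit.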
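
To see how Allow and Disallow interact on a given path, here is a simplified sketch of the precedence rule described in RFC 9309: the longest matching rule wins, and a tie goes to Allow. It assumes plain path prefixes only (no * or $ wildcards), and GUIDE_RULES and is_path_allowed are hypothetical names used for illustration.

```python
# Rules from the path-specific example above, as (kind, path prefix) pairs.
GUIDE_RULES = [
    ("disallow", "/premium-content/"),
    ("disallow", "/members-only/"),
    ("allow", "/blog/"),
]


def is_path_allowed(rules, path):
    """Longest matching prefix wins; on a tie, allow beats disallow."""
    best_len = -1
    allowed = True  # a path that matches no rule is allowed
    for kind, prefix in rules:
        if prefix and path.startswith(prefix):
            longer = len(prefix) > best_len
            tie_for_allow = len(prefix) == best_len and kind == "allow"
            if longer or tie_for_allow:
                best_len = len(prefix)
                allowed = kind == "allow"
    return allowed
```

Under this rule a page below /premium-content/ is blocked, while /blog/ posts and any path not mentioned at all stay open. The same logic is what lets a longer Allow path carve an exception out of a shorter Disallow path.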

Block All AI Crawlers (Training and Search)

If you want to block every known AI crawler entirely:

# Block all AI crawlers (training + search + assistants)
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: Claude-SearchBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: DeepSeekBot
Disallow: /
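
Long blocks like this one are easy to mistype, so you may prefer to generate them from a list. The following is a minimal sketch: ALL_AI_CRAWLERS repeats the user-agents listed above, and build_robots_block is a hypothetical helper, not part of any library.

```python
# User-agents from the "block everything" example in this guide.
ALL_AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "Google-Extended", "PerplexityBot", "Perplexity-User",
    "Applebot-Extended", "Meta-ExternalAgent", "Amazonbot",
    "Bytespider", "CCBot", "cohere-ai", "DeepSeekBot",
]


def build_robots_block(user_agents, comment="Block all AI crawlers"):
    """Return robots.txt text with one 'Disallow: /' group per user-agent."""
    lines = [f"# {comment}"]
    for agent in user_agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    # Trim the trailing blank line and end the text with a single newline.
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    print(build_robots_block(ALL_AI_CRAWLERS), end="")
```

Paste the printed output into your robots.txt, or append it to existing rules. Keeping the list in one place makes it easier to add new crawlers later.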

Crawlers That Ignore robots.txt

Not every AI crawler plays by the rules. A few major ones are effectively invisible to robots.txt:

ChatGPT Atlas (OpenAI) uses a standard Chrome user-agent string with no identifying token. It blends with normal browser traffic and cannot be distinguished via robots.txt.

Grok / xAI rotates through residential IP addresses and spoofs Safari and Chrome user-agents. Despite xAI documenting a “GrokBot” user-agent, no real traffic has been observed using it.

Bing Copilot uses the standard Bingbot user-agent, so you can’t block Copilot without also blocking Bing Search.

For these crawlers, the only effective defense is server-level blocking through a WAF (like Cloudflare’s AI Crawl Control), IP-based rules in your web server config, or rate limiting. Cloudflare now blocks known AI crawlers by default on all new domains.
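
To make the rate-limiting idea concrete, here is a rough sketch of a per-client sliding-window limiter. It is an illustration of the technique only, not what Cloudflare or any web server actually runs, and the class name, parameters, and example IPs are assumptions. A crawler that rotates through many residential IPs will slip past a per-IP limit, which is one reason a managed WAF tends to be more practical.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per client key within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client key -> recent request timestamps

    def allow(self, client_key: str, now: Optional[float] = None) -> bool:
        """Record a request and return True if it stays within the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client_key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit; the caller would answer with HTTP 429
        hits.append(now)
        return True
```

A server would call allow() with the client IP on each request and reject the request when it returns False. The now parameter exists only so the behavior can be tested with fixed timestamps.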

How to Edit robots.txt in WordPress

There are several ways to edit your robots.txt file in WordPress:

Option 1: Edit the File Directly

If you have a physical robots.txt file in your site’s root directory, edit it with any text editor and upload via FTP or your hosting file manager. This is the most reliable method.

Option 2: Use an SEO Plugin

Most SEO plugins let you edit robots.txt from the WordPress admin:

  • Yoast SEO: Go to Yoast SEO > Tools > File Editor
  • Rank Math: Go to Rank Math > General Settings > Edit robots.txt

Option 3: Use a Filter in functions.php

If WordPress generates your robots.txt dynamically (no physical file exists), you can add rules via the robots_txt filter:

// Append AI-crawler rules to the robots.txt that WordPress generates dynamically.
add_filter( 'robots_txt', function( $output ) {
    // Double-quoted strings are required so that "\n" becomes a real newline.
    $output .= "\n# Block AI training crawlers\n";
    $output .= "User-agent: GPTBot\nDisallow: /\n\n";
    $output .= "User-agent: ClaudeBot\nDisallow: /\n\n";
    $output .= "User-agent: Google-Extended\nDisallow: /\n\n";
    $output .= "User-agent: Meta-ExternalAgent\nDisallow: /\n\n";
    $output .= "User-agent: Amazonbot\nDisallow: /\n\n";
    $output .= "User-agent: Bytespider\nDisallow: /\n\n";
    $output .= "User-agent: CCBot\nDisallow: /\n";
    return $output;
}, 99 );

robots.txt vs. llms.txt vs. ai.txt

Several standards now exist for managing AI access to your content. They serve different purposes:

  • robots.txt controls crawl access – which bots can visit which pages
  • llms.txt provides context – a content map so AI systems can understand and cite your site accurately
  • ai.txt (by Spawning) declares training permissions – specifically for AI model training, with EU TDM opt-out support

A balanced approach I recommend: block training crawlers via robots.txt, provide an llms.txt so citation bots that do access your site represent you accurately, and add structured data markup so your content is easy for AI systems to parse.

How to Verify AI Crawlers Are Blocked

After updating your robots.txt, verify the rules are working:

  1. Visit https://yoursite.com/robots.txt in your browser and confirm the new directives appear
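  2. Open the robots.txt report in Google Search Console (it replaced the retired robots.txt Tester) to confirm Google can fetch the file and to see any parsing errors
  3. Watch AI referral traffic in Google Analytics, keeping in mind that crawler visits themselves rarely show up there because most bots do not run the tracking JavaScript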
  4. Check your server access logs for the user-agent strings you blocked
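
The log check in the last step can be scripted. The sketch below assumes your server writes the common "combined" access-log format; BLOCKED_AGENTS, LOG_LINE, and count_blocked_hits are hypothetical names, and the agent list is just a sample of the crawlers covered in this guide. Requests for /robots.txt itself are skipped, since a compliant crawler still has to fetch that file to learn it is blocked.

```python
import re
from collections import Counter

# Sample of blocked user-agent tokens; adjust to match your robots.txt.
BLOCKED_AGENTS = ["GPTBot", "ClaudeBot", "Meta-ExternalAgent", "Bytespider", "CCBot"]

# Combined log format: ... "METHOD path HTTP/x" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def count_blocked_hits(log_lines, agents=BLOCKED_AGENTS):
    """Count requests per blocked agent, ignoring fetches of /robots.txt."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match or match.group("path") == "/robots.txt":
            continue
        ua = match.group("ua").lower()
        for agent in agents:
            if agent.lower() in ua:
                counts[agent] += 1
    return counts
```

Pass it an open log file, for example count_blocked_hits(open("access.log")), where the file name is a placeholder for your own log path. Counts that stay above zero for a blocked agent well after the change suggest that crawler is not honoring your rules.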

Remember that robots.txt is a voluntary protocol. Well-known AI companies (OpenAI, Anthropic, Google, Apple) respect it, but smaller or less scrupulous crawlers may ignore it. For stronger enforcement, consider server-level blocking via your web server configuration or a WAF (Web Application Firewall).

FAQs

Common questions about blocking AI crawlers with robots.txt:

Will blocking AI crawlers affect my Google search rankings?
No. Blocking AI training crawlers like GPTBot, ClaudeBot, or Google-Extended does not affect your Google search rankings. These bots are separate from Googlebot, which handles search indexing. Blocking Google-Extended only prevents your content from being used to train Gemini - it does not impact your visibility in Google Search or Google AI Overviews.
What is the difference between GPTBot and ChatGPT-User?
GPTBot is OpenAI's training crawler - it collects content to improve GPT models. ChatGPT-User is the browsing agent that fetches pages in real time when a ChatGPT user asks it to search the web. Blocking GPTBot prevents your content from being used for training. Blocking ChatGPT-User prevents ChatGPT from citing or linking to your site in conversations.
Should I block Google-Extended?
Blocking Google-Extended prevents your content from being used to train Google's Gemini models. It does not affect regular Google Search indexing or your appearance in AI Overviews - those are handled by Googlebot, which is a separate user-agent. If you want to prevent AI training but keep your search presence, blocking Google-Extended is a safe choice.
Do all AI crawlers respect robots.txt?
Major AI companies like OpenAI, Anthropic, Google, Apple, and Perplexity have publicly committed to respecting robots.txt. However, robots.txt is a voluntary protocol with no technical enforcement. Smaller or unknown crawlers may ignore it. For stronger protection, combine robots.txt with server-level blocking using firewall rules or web server configuration.
Can I block AI crawlers from specific pages only?
Yes. Instead of Disallow: / (which blocks the entire site), you can block specific paths. For example, Disallow: /premium-content/ blocks only that directory. You can also use Allow: to permit access to specific paths within a blocked area. This gives you granular control over what AI systems can and cannot access.
What are Anthropic's ClaudeBot, Claude-User, and Claude-SearchBot?
Anthropic operates three separate crawlers. ClaudeBot scrapes content for model training. Claude-User fetches pages in real time when a Claude user triggers web browsing. Claude-SearchBot indexes content for Claude's search results. Each can be blocked independently in robots.txt, and Anthropic states all three respect robots.txt directives.
Can some AI crawlers bypass robots.txt?
Yes. Some AI crawlers use standard browser user-agent strings and are invisible to robots.txt. OpenAI's ChatGPT Atlas uses a normal Chrome user-agent. xAI's Grok crawler spoofs Safari and Chrome UAs with rotating IPs. Bing Copilot uses the standard Bingbot user-agent. For these, server-level blocking through a WAF like Cloudflare is the only effective approach.
Is blocking AI crawlers retroactive?
No. Blocking a crawler in robots.txt only prevents future crawling. Any content that was already scraped before you added the block may still exist in the AI model's training data. There is currently no standardized way to request removal of previously scraped content, though some companies like OpenAI offer opt-out forms for content already collected.

Summary

AI crawlers now fall into three categories: training crawlers (GPTBot, ClaudeBot, Google-Extended) that absorb your content into models, search bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot) that cite and link back to you, and AI assistants/agents that browse on behalf of users.

Most publishers block training crawlers while keeping search bots allowed. That’s the approach I use on this site. Add the User-agent and Disallow directives to your robots.txt, but be aware that some crawlers (Atlas, Grok) bypass it entirely.

For a complete strategy, combine robots.txt with an llms.txt file and structured data. Block what you don’t want, and guide the bots you allow toward accurate citations. To verify your setup, run a free AI Visibility Audit.
