AI crawlers are visiting your website every day. OpenAI’s GPTBot, Anthropic’s ClaudeBot, Meta’s crawler, and dozens of others are scraping content to train large language models or power AI search features.
Unlike traditional search engine crawlers that index your site and send you traffic, many AI crawlers take your content without attribution or a link back. The good news is that most of them respect robots.txt directives, giving you control over what they can access.
Below I’ll cover which AI bots are crawling your site, how to block them selectively, and how to decide which ones to keep. This guide is part of my AEO checklist for WordPress.
Training Crawlers vs. Search and Retrieval Bots
Before blocking anything, understand the three categories of AI crawlers:
Training crawlers collect your content to train AI models. Your text becomes part of the model’s knowledge, but you get no attribution, no link, and no traffic. Examples: GPTBot, Google-Extended, ClaudeBot, Meta-ExternalAgent.
Search and retrieval bots fetch your content in real time to answer a user’s question. They typically cite your page and link back to it. Examples: OAI-SearchBot, Claude-SearchBot, PerplexityBot.
AI assistants and agents browse the web on behalf of a specific user – like ChatGPT-User, Claude-User, or the newer autonomous agents (ChatGPT Operator, Google Agent). Some of these are starting to ignore robots.txt because the request is “user-initiated.”
Blocking training crawlers protects your content from being absorbed without credit. Blocking search bots means AI assistants won’t reference your site when users ask related questions – which could cost you visibility in the growing generative search landscape (GEO).
Think carefully before blocking citation bots like OAI-SearchBot and PerplexityBot. These bots drive referral traffic to your site by citing and linking to your pages. Blocking them means your content won’t appear in AI-powered answers, which are an increasingly important traffic source.
Complete List of AI Crawlers
Here are the major AI crawlers you should know about. The landscape has grown to over 140 known user-agents as of early 2026 – these are the ones that matter most:
Training Crawlers
| Company | User-Agent | Purpose |
|---|---|---|
| OpenAI | GPTBot | Trains GPT models |
| Anthropic | ClaudeBot | Trains Claude models |
| Google | Google-Extended | Trains Gemini (does not affect Search rankings) |
| Apple | Applebot-Extended | Apple Intelligence / Siri |
| Meta | Meta-ExternalAgent | Trains LLaMA models (highest crawl volume after Googlebot) |
| Amazon | Amazonbot | Alexa and Rufus AI shopping assistant |
| ByteDance | Bytespider | TikTok AI training (volume dropped 85% in 2025) |
| Common Crawl | CCBot | Open dataset used by many AI labs |
| Cohere | cohere-ai | Enterprise AI models |
| DeepSeek | DeepSeekBot | LLM training (questionable robots.txt compliance) |
Search and Retrieval Bots
These bots fetch content to answer queries and typically cite your page with a link back:
| Company | User-Agent | Purpose |
|---|---|---|
| OpenAI | OAI-SearchBot | Powers ChatGPT Search results |
| Anthropic | Claude-SearchBot | Claude search results |
| Perplexity | PerplexityBot | AI search engine |
| Amazon | Amzn-SearchBot | Amazon AI search |
AI Assistants (User-Triggered)
These browse on behalf of a specific user. Blocking them means the AI can’t look up your content when a user asks:
| Company | User-Agent | Purpose |
|---|---|---|
| OpenAI | ChatGPT-User | Real-time browsing for ChatGPT |
| Anthropic | Claude-User | Real-time browsing for Claude |
| Perplexity | Perplexity-User | User-triggered fetching |
Watch out for ChatGPT-User: OpenAI quietly removed robots.txt compliance language from the ChatGPT-User documentation, arguing that user-initiated requests “may not” be subject to robots.txt. This means blocking ChatGPT-User in your robots.txt may no longer be effective.
This list evolves as new AI companies launch crawlers. For an up-to-date directory with 140+ entries, check the ai.robots.txt community project on GitHub or Known Agents.
How to Block AI Crawlers in robots.txt
Add User-agent and Disallow directives to your robots.txt file. This file sits at the root of your site (e.g., https://yoursite.com/robots.txt).
Block All AI Training Crawlers
To block the major training crawlers while keeping citation bots allowed:
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: DeepSeekBot
Disallow: /
This is the approach most publishers take: block training crawlers that absorb your content without attribution, but keep search and retrieval bots allowed so your site can appear in AI-powered answers.
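Maintaining ten near-identical groups by hand invites typos, so it can help to generate the block from a single list. The following is a minimal Python sketch, not part of WordPress or any plugin: the helper name `build_block_rules` and the constant `TRAINING_CRAWLERS` are hypothetical, and the user-agent tokens are copied from the table above.

```python
# Hypothetical helper for generating robots.txt groups; names are illustrative.
TRAINING_CRAWLERS = [
    "GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
    "Meta-ExternalAgent", "Amazonbot", "Bytespider", "CCBot",
    "cohere-ai", "DeepSeekBot",
]


def build_block_rules(user_agents, comment="Block AI training crawlers"):
    """Return robots.txt text with one 'Disallow: /' group per user-agent."""
    lines = [f"# {comment}"]
    for agent in user_agents:
        lines.append(f"User-agent: {agent}")
        lines.append("Disallow: /")
        lines.append("")  # blank line between groups, purely for readability
    # Collapse the trailing blank line into a single final newline.
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    # Print the block so it can be pasted into robots.txt.
    print(build_block_rules(TRAINING_CRAWLERS), end="")
```

Adding or removing a crawler then means editing one list entry rather than a pair of directive lines.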
Block Specific Paths Only
If you want AI crawlers to access some content but not all of it, block specific directories:
User-agent: GPTBot
Disallow: /premium-content/
Disallow: /members-only/
Allow: /blog/
User-agent: ClaudeBot
Disallow: /premium-content/
Disallow: /members-only/
Allow: /blog/
This lets AI models train on your public blog posts while protecting gated or premium content.
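Before deploying path rules like these, you can sanity-check them offline. The sketch below uses `urllib.robotparser` from Python's standard library. The helper `is_allowed` is a hypothetical name, and `https://yoursite.com` is the same placeholder domain used elsewhere in this guide.

```python
from urllib.robotparser import RobotFileParser

# The path-specific rules from above, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /premium-content/
Disallow: /members-only/
Allow: /blog/

User-agent: ClaudeBot
Disallow: /premium-content/
Disallow: /members-only/
Allow: /blog/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules let user_agent fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "ClaudeBot", "Googlebot"):
        for path in ("/blog/hello-world/", "/premium-content/report/"):
            url = "https://yoursite.com" + path
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

One caveat: Python's parser applies the first matching rule in file order, while RFC 9309 and Google use the most specific (longest) match. For non-overlapping paths like these, both approaches give the same answer, but treat this as a rough check and not as proof of how any particular crawler will behave.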
Block All AI Crawlers (Training and Search)
If you want to block every known AI crawler entirely:
# Block all AI crawlers (training + search + assistants)
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-User
Disallow: /
User-agent: Claude-SearchBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: DeepSeekBot
Disallow: /
Crawlers That Ignore robots.txt
Not every AI crawler plays by the rules. A few major ones are effectively invisible to robots.txt:
ChatGPT Atlas (OpenAI) uses a standard Chrome user-agent string with no identifying token. It blends with normal browser traffic and cannot be distinguished via robots.txt.
Grok / xAI rotates through residential IP addresses and spoofs Safari and Chrome user-agents. Despite xAI documenting a “GrokBot” user-agent, no real traffic has been observed using it.
Bing Copilot uses the standard Bingbot user-agent, so you can’t block Copilot without also blocking Bing Search.
For these crawlers, the only effective defense is server-level blocking through a WAF (like Cloudflare’s AI Crawl Control), IP-based rules in your web server config, or rate limiting. Cloudflare now blocks known AI crawlers by default on all new domains.
How to Edit robots.txt in WordPress
There are several ways to edit your robots.txt file in WordPress:
Option 1: Edit the File Directly
If you have a physical robots.txt file in your site’s root directory, edit it with any text editor and upload via FTP or your hosting file manager. This is the most reliable method.
Option 2: Use an SEO Plugin
Most SEO plugins let you edit robots.txt from the WordPress admin:
- Yoast SEO: Go to Yoast SEO > Tools > File Editor
- Rank Math: Go to Rank Math > General Settings > Edit robots.txt
Option 3: Use a Filter in functions.php
If WordPress generates your robots.txt dynamically (no physical file exists), you can add rules via the robots_txt filter:
add_filter( 'robots_txt', function( $output ) {
    // Append the rules to the robots.txt that WordPress generates dynamically.
    $output .= "\n# Block AI training crawlers\n";
    $output .= "User-agent: GPTBot\nDisallow: /\n\n";
    $output .= "User-agent: ClaudeBot\nDisallow: /\n\n";
    $output .= "User-agent: Google-Extended\nDisallow: /\n\n";
    $output .= "User-agent: Meta-ExternalAgent\nDisallow: /\n\n";
    $output .= "User-agent: Amazonbot\nDisallow: /\n\n";
    $output .= "User-agent: Bytespider\nDisallow: /\n\n";
    $output .= "User-agent: CCBot\nDisallow: /\n";
    return $output;
}, 99 ); // Priority 99 runs late, after most plugins have added their own rules.
robots.txt vs. llms.txt vs. ai.txt
Several standards now exist for managing AI access to your content. They serve different purposes:
- robots.txt controls crawl access – which bots can visit which pages
- llms.txt provides context – a content map so AI systems can understand and cite your site accurately
- ai.txt (by Spawning) declares training permissions – specifically for AI model training, with EU TDM opt-out support
A balanced approach I recommend: block training crawlers via robots.txt, provide an llms.txt so citation bots that do access your site represent you accurately, and add structured data markup so your content is easy for AI systems to parse.
How to Verify AI Crawlers Are Blocked
After updating your robots.txt, verify the rules are working:
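- Visit https://yoursite.com/robots.txt in your browser and confirm the new directives appear
- Open the robots.txt report in Google Search Console (it replaced the retired robots.txt Tester) to confirm the file is fetched and parsed without errors
- Monitor AI referral traffic in Google Analytics over time; crawler requests themselves rarely show up there, because most bots do not run the tracking script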
- Check your server access logs for the user-agent strings you blocked
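The log check in the last step can be scripted. The following Python sketch counts page requests from blocked user-agents, assuming your server writes the common "combined" access log format (the Apache and Nginx default). The function name, the regular expression, and the sample log lines are all illustrative: the IP addresses come from documentation ranges and the user-agent strings are simplified stand-ins, not real crawler signatures.

```python
import re
from collections import Counter

# Tokens you disallowed in robots.txt (adjust to match your own file).
BLOCKED_AGENTS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider"]

# Combined log format tail: "METHOD PATH PROTO" status bytes "referer" "user-agent"
LINE_PATTERN = re.compile(
    r'"\S+ (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"\s*$'
)


def count_blocked_hits(log_lines, blocked_agents=BLOCKED_AGENTS):
    """Count requests from blocked agents, ignoring fetches of robots.txt itself."""
    hits = Counter()
    for line in log_lines:
        match = LINE_PATTERN.search(line)
        if not match:
            continue  # line is not in the expected format
        if match.group("path") == "/robots.txt":
            continue  # compliant bots have to fetch this file; not a violation
        user_agent = match.group("ua").lower()
        for agent in blocked_agents:
            if agent.lower() in user_agent:
                hits[agent] += 1
                break
    return hits


if __name__ == "__main__":
    # Made-up sample lines for illustration only.
    sample = [
        '203.0.113.5 - - [10/Jan/2026:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 312 "-" "sample-agent (compatible; GPTBot/1.0)"',
        '203.0.113.5 - - [10/Jan/2026:10:00:02 +0000] "GET /blog/post/ HTTP/1.1" 200 5120 "-" "sample-agent (compatible; GPTBot/1.0)"',
        '192.0.2.9 - - [10/Jan/2026:10:01:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"',
    ]
    print(dict(count_blocked_hits(sample)))
```

Crawlers may cache robots.txt for a while, so a few hits shortly after a change are not necessarily a sign of non-compliance. Counts that persist well after the change are a stronger signal that a bot is ignoring your directives.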
Remember that robots.txt is a voluntary protocol. Well-known AI companies (OpenAI, Anthropic, Google, Apple) respect it, but smaller or less scrupulous crawlers may ignore it. For stronger enforcement, consider server-level blocking via your web server configuration or a WAF (Web Application Firewall).
FAQs
Common questions about blocking AI crawlers with robots.txt:
Instead of Disallow: / (which blocks the entire site), you can block specific paths. For example, Disallow: /premium-content/ blocks only that directory. You can also use Allow: to permit access to specific paths within a blocked area. This gives you granular control over what AI systems can and cannot access.
ClaudeBot scrapes content for model training. Claude-User fetches pages in real time when a Claude user triggers web browsing. Claude-SearchBot indexes content for Claude's search results. Each can be blocked independently in robots.txt, and Anthropic states all three respect robots.txt directives.
Summary
AI crawlers now fall into three categories: training crawlers (GPTBot, ClaudeBot, Google-Extended) that absorb your content into models, search bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot) that cite and link back to you, and AI assistants/agents that browse on behalf of users.
Most publishers block training crawlers while keeping search bots allowed. That’s the approach I use on this site. Add the User-agent and Disallow directives to your robots.txt, but be aware that some crawlers (Atlas, Grok) bypass it entirely.
For a complete strategy, combine robots.txt with an llms.txt file and structured data. Block what you don’t want, and guide the bots you allow toward accurate citations. To verify your setup, run a free AI Visibility Audit.

