AI crawlers are visiting your website every day. OpenAI’s GPTBot, Anthropic’s ClaudeBot, Meta’s crawler, and dozens of others are scraping content to train large language models or power AI search features.
Unlike traditional search engine crawlers that index your site and send you traffic, many AI crawlers take your content without attribution or a link back. The good news is that most of them respect robots.txt directives, giving you control over what they can access.
In this guide, you’ll learn which AI bots are crawling your site, how to block them selectively, and how to make smart decisions about which ones to allow.
“Most AI web crawlers support being blocked via robots.txt, allowing website owners to opt-out of having their content used for AI training.” – OpenAI GPTBot documentation, 2024.
Training Crawlers vs. Citation Crawlers
Before you start blocking, it’s important to understand the two types of AI crawlers:
Training crawlers collect your content to train AI models. Your text becomes part of the model’s knowledge, but you get no attribution, no link, and no traffic. Examples include GPTBot, Google-Extended, and ClaudeBot.
Citation crawlers (also called retrieval or browsing agents) fetch your content in real time to answer a user’s question. They typically cite your page and link back to it. Examples include ChatGPT-User, PerplexityBot, and OAI-SearchBot.
Blocking training crawlers protects your content from being absorbed without credit. Blocking citation crawlers means AI assistants won’t reference your site when users ask related questions – which could cost you visibility as generative engine optimization (GEO) becomes a bigger part of how people find content.
Think carefully before blocking citation crawlers like ChatGPT-User and PerplexityBot. These bots drive referral traffic to your site by citing and linking to your pages. Blocking them means your content won’t appear in AI-powered answers, which is an increasingly important traffic source.
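To make the distinction concrete, here is a minimal robots.txt sketch that blocks one training crawler while leaving a citation crawler unrestricted (the full recommended rule sets appear later in this guide):

# Sketch: block a training crawler, explicitly allow a citation crawler
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

Strictly speaking, a crawler with no matching group in robots.txt may access everything by default, so the Allow rule is redundant – it is only there to make the intent explicit.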
Complete List of AI Crawlers
Here are the major AI crawlers you should know about, organized by company:
| Company | User-Agent | Type | Purpose |
|---|---|---|---|
| OpenAI | GPTBot | Training | Trains GPT models |
| OpenAI | ChatGPT-User | Citation | Real-time browsing for ChatGPT |
| OpenAI | OAI-SearchBot | Citation | Powers ChatGPT Search results |
| Anthropic | ClaudeBot | Training | Trains Claude models |
| Anthropic | Claude-User | Citation | Real-time browsing for Claude |
| Google | Google-Extended | Training | Trains Gemini models |
| Perplexity | PerplexityBot | Citation | AI search engine |
| Apple | Applebot-Extended | Training | Apple Intelligence / Siri |
| Meta | Meta-ExternalAgent | Training | Trains LLaMA models |
| ByteDance | Bytespider | Training | TikTok AI training |
| Common Crawl | CCBot | Training | Open dataset used by AI labs |
| Cohere | cohere-ai | Training | Enterprise AI models |
| DeepSeek | DeepSeekBot | Training | Knowledge indexing |
This list evolves as new AI companies launch their own crawlers. For an up-to-date directory, check the ai.robots.txt community project on GitHub.
How to Block AI Crawlers in robots.txt
Add User-agent and Disallow directives to your robots.txt file. This file sits at the root of your site (e.g., https://yoursite.com/robots.txt).
Block All AI Training Crawlers
To block the major training crawlers while keeping citation bots allowed:
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: DeepSeekBot
Disallow: /

This is the approach most publishers take: block training crawlers that absorb your content without attribution, but allow citation crawlers that can send traffic back.
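One detail worth calling out: these AI tokens are separate from the regular search crawlers, so blocking Google-Extended does not affect how Googlebot indexes or ranks your pages. If you want to make that explicit in your file, a sketch like this works:

# Normal search indexing continues as before
User-agent: Googlebot
Allow: /

# Only Gemini training and grounding are opted out
User-agent: Google-Extended
Disallow: /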
Block Specific Paths Only
If you want AI crawlers to access some content but not all of it, block specific directories:
User-agent: GPTBot
Disallow: /premium-content/
Disallow: /members-only/
Allow: /blog/
User-agent: ClaudeBot
Disallow: /premium-content/
Disallow: /members-only/
Allow: /blog/

This lets AI models train on your public blog posts while protecting gated or premium content.
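If your protected content is identified by file type rather than by directory, wildcard patterns can help. The * and $ characters are honored by Google-style parsers and formalized in RFC 9309, but support among individual AI crawlers isn’t guaranteed, so treat this as a sketch:

User-agent: GPTBot
Disallow: /*.pdf$
Disallow: /downloads/*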
Block All AI Crawlers (Training and Citation)
If you want to block every known AI crawler entirely:
# Block all AI crawlers (training + citation)
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-User
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: DeepSeekBot
Disallow: /

How to Edit robots.txt in WordPress
There are several ways to edit your robots.txt file in WordPress:
Option 1: Edit the File Directly
If you have a physical robots.txt file in your site’s root directory, edit it with any text editor and upload via FTP or your hosting file manager. This is the most reliable method.
Option 2: Use an SEO Plugin
Most SEO plugins let you edit robots.txt from the WordPress admin:
- Yoast SEO: Go to Yoast SEO > Tools > File Editor
- Rank Math: Go to Rank Math > General Settings > Edit robots.txt
Option 3: Use a Filter in functions.php
If WordPress generates your robots.txt dynamically (no physical file exists), you can add rules via the robots_txt filter:
// Append AI-crawler rules to WordPress's dynamically generated robots.txt.
// Priority 99 runs this callback after most other plugins have added their rules.
add_filter( 'robots_txt', function( $output ) {
	$output .= "\n# Block AI training crawlers\n";
	$output .= "User-agent: GPTBot\nDisallow: /\n\n";
	$output .= "User-agent: ClaudeBot\nDisallow: /\n\n";
	$output .= "User-agent: Google-Extended\nDisallow: /\n\n";
	$output .= "User-agent: Meta-ExternalAgent\nDisallow: /\n\n";
	$output .= "User-agent: Bytespider\nDisallow: /\n\n";
	$output .= "User-agent: CCBot\nDisallow: /\n";
	return $output;
}, 99 );
robots.txt vs. llms.txt: Different Jobs
Blocking AI crawlers with robots.txt and providing context with llms.txt are two different strategies that work together:
- robots.txt controls access – which bots can crawl which pages
- llms.txt provides context – when an AI does use your content, it knows how to cite you correctly
A balanced approach: block training crawlers via robots.txt so your content isn’t absorbed into models without credit, but provide an llms.txt file so citation crawlers that do access your site can represent you accurately.
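For reference, an llms.txt file is just a Markdown document served at your site root (e.g., https://yoursite.com/llms.txt). The outline below follows the llmstxt.org proposal; the site name, section titles, and URLs are placeholders you would replace with your own:

# Your Site Name

> One-sentence summary of what the site covers and who it is for.

## Key pages

- [Getting started guide](https://yoursite.com/guide/): overview of the core topics
- [Pricing](https://yoursite.com/pricing/): current plans and terms

## Policies

- [Citation policy](https://yoursite.com/citation-policy/): how to attribute our content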
How to Verify AI Crawlers Are Blocked
After updating your robots.txt, verify the rules are working:
- Visit https://yoursite.com/robots.txt in your browser and confirm the new directives appear
- Use the robots.txt report in Google Search Console to check that the file is fetched and parsed without errors
- Watch AI referral traffic in your analytics – if you blocked citation crawlers, referrals from AI assistants should taper off over time
- Check your server access logs for the user-agent strings you blocked (see the example commands below)
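Assuming your server keeps a standard access log (the path below is the Apache default on Debian/Ubuntu and will differ on other hosts), a quick grep shows whether the bots are still requesting pages, and curl lets you fetch your live robots.txt the way a crawler would:

# Look for recent hits from AI crawler user-agents (adjust the log path for your host)
grep -iE "GPTBot|ClaudeBot|Bytespider|CCBot" /var/log/apache2/access.log | tail -n 20

# Fetch your live robots.txt while identifying as a crawler
curl -A "GPTBot" https://yoursite.com/robots.txt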
Remember that robots.txt is a voluntary protocol. Well-known AI companies (OpenAI, Anthropic, Google, Apple) respect it, but smaller or less scrupulous crawlers may ignore it. For stronger enforcement, consider server-level blocking via your web server configuration or a WAF (Web Application Firewall).
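As one example of server-level enforcement, a WordPress site on Apache could return 403 responses to selected user-agents from .htaccess. This is a sketch, not an exhaustive list, and user-agent strings can be spoofed, so it complements rather than replaces robots.txt:

# .htaccess sketch: refuse requests from selected AI crawler user-agents
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Bytespider|CCBot) [NC]
RewriteRule .* - [F,L]
</IfModule>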
FAQs
Common questions about blocking AI crawlers with robots.txt:
Can I block AI crawlers from specific pages instead of my entire site?

Yes. Instead of Disallow: / (which blocks the entire site), you can block specific paths. For example, Disallow: /premium-content/ blocks only that directory. You can also use Allow: to permit access to specific paths within a blocked area. This gives you granular control over what AI systems can and cannot access.

Summary
AI crawlers fall into two categories: training crawlers (GPTBot, ClaudeBot, Google-Extended) that absorb your content into models, and citation crawlers (ChatGPT-User, PerplexityBot) that fetch content in real time and link back to your site.
Most publishers block training crawlers to prevent their content from being used without credit, while keeping citation crawlers allowed to maintain visibility in AI-powered search. Add the appropriate User-agent and Disallow directives to your robots.txt file to control access.
Combine robots.txt blocking with an llms.txt file for a complete AI content strategy: block the bots you don’t want, and guide the ones you do allow toward accurate citations.

