
How to Block AI Crawlers and Bots with robots.txt

AI crawlers are visiting your website every day. OpenAI’s GPTBot, Anthropic’s ClaudeBot, Meta’s crawler, and dozens of others are scraping content to train large language models or power AI search features.

Unlike traditional search engine crawlers that index your site and send you traffic, many AI crawlers take your content without attribution or a link back. The good news is that most of them respect robots.txt directives, giving you control over what they can access.

In this guide, you’ll learn which AI bots are crawling your site, how to block them selectively, and how to make smart decisions about which ones to allow.

“Most AI web crawlers support being blocked via robots.txt, allowing website owners to opt-out of having their content used for AI training.” – OpenAI GPTBot documentation, 2024.

Training Crawlers vs. Citation Crawlers

Before you start blocking, it’s important to understand the two types of AI crawlers:

Training crawlers collect your content to train AI models. Your text becomes part of the model’s knowledge, but you get no attribution, no link, and no traffic. Examples include GPTBot, Google-Extended, and ClaudeBot.

Citation crawlers (also called retrieval or browsing agents) fetch your content in real time to answer a user’s question. They typically cite your page and link back to it. Examples include ChatGPT-User, PerplexityBot, and OAI-SearchBot.

Blocking training crawlers protects your content from being absorbed without credit. Blocking citation crawlers means AI assistants won’t reference your site when users ask related questions, which can cost you visibility in generative search and undermine any generative engine optimization (GEO) efforts.

Think carefully before blocking citation crawlers like ChatGPT-User and PerplexityBot. These bots drive referral traffic to your site by citing and linking to your pages. Blocking them means your content won’t appear in AI-powered answers, which is an increasingly important traffic source.

Complete List of AI Crawlers

Here are the major AI crawlers you should know about, organized by company:

Company | User-Agent | Type | Purpose
OpenAI | GPTBot | Training | Trains GPT models
OpenAI | ChatGPT-User | Citation | Real-time browsing for ChatGPT
OpenAI | OAI-SearchBot | Citation | Powers ChatGPT Search results
Anthropic | ClaudeBot | Training | Trains Claude models
Anthropic | Claude-User | Citation | Real-time browsing for Claude
Google | Google-Extended | Training | Trains Gemini models
Perplexity | PerplexityBot | Citation | AI search engine
Apple | Applebot-Extended | Training | Apple Intelligence / Siri
Meta | Meta-ExternalAgent | Training | Trains LLaMA models
ByteDance | Bytespider | Training | TikTok AI training
Common Crawl | CCBot | Training | Open dataset used by AI labs
Cohere | cohere-ai | Training | Enterprise AI models
DeepSeek | DeepSeekBot | Training | Knowledge indexing

This list evolves as new AI companies launch their own crawlers. For an up-to-date directory, check the ai.robots.txt community project on GitHub.

How to Block AI Crawlers in robots.txt

Add User-agent and Disallow directives to your robots.txt file. This file sits at the root of your site (e.g., https://yoursite.com/robots.txt).

Block All AI Training Crawlers

To block the major training crawlers while keeping citation bots allowed:

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: DeepSeekBot
Disallow: /

This is the approach many publishers take: block training crawlers that absorb your content without attribution, but allow citation crawlers that can send traffic back.

Block Specific Paths Only

If you want AI crawlers to access some content but not all of it, block specific directories:

User-agent: GPTBot
Disallow: /premium-content/
Disallow: /members-only/
Allow: /blog/

User-agent: ClaudeBot
Disallow: /premium-content/
Disallow: /members-only/
Allow: /blog/

This lets AI models train on your public blog posts while protecting gated or premium content. Because robots.txt permits anything that isn’t explicitly disallowed, the Allow: /blog/ lines are technically optional, but they make your intent explicit.

Block All AI Crawlers (Training and Citation)

If you want to block every known AI crawler entirely:

# Block all AI crawlers (training + citation)
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: DeepSeekBot
Disallow: /

How to Edit robots.txt in WordPress

There are several ways to edit your robots.txt file in WordPress:

Option 1: Edit the File Directly

If you have a physical robots.txt file in your site’s root directory, edit it with any text editor and upload via FTP or your hosting file manager. This is the most reliable method.

Option 2: Use an SEO Plugin

Most SEO plugins let you edit robots.txt from the WordPress admin:

  • Yoast SEO: Go to Yoast SEO > Tools > File Editor
  • Rank Math: Go to Rank Math > General Settings > Edit robots.txt

Option 3: Use a Filter in functions.php

If WordPress generates your robots.txt dynamically (no physical file exists), you can add rules via the robots_txt filter:

add_filter( 'robots_txt', function( $output ) {
    $output .= "\n# Block AI training crawlers\n";
    $output .= "User-agent: GPTBot\nDisallow: /\n\n";
    $output .= "User-agent: ClaudeBot\nDisallow: /\n\n";
    $output .= "User-agent: Google-Extended\nDisallow: /\n\n";
    $output .= "User-agent: Meta-ExternalAgent\nDisallow: /\n\n";
    $output .= "User-agent: Bytespider\nDisallow: /\n\n";
    $output .= "User-agent: CCBot\nDisallow: /\n";
    return $output;
}, 99 );
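
The priority of 99 appends these rules after WordPress’s default output and after any rules added by SEO plugins. Reload https://yoursite.com/robots.txt afterwards to confirm the directives appear.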

robots.txt vs. llms.txt: Different Jobs

Blocking AI crawlers with robots.txt and providing context with llms.txt are two different strategies that work together:

  • robots.txt controls access – which bots can crawl which pages
  • llms.txt provides context – when an AI does use your content, it knows how to cite you correctly

A balanced approach: block training crawlers via robots.txt so your content isn’t absorbed into models without credit, but provide an llms.txt file so citation crawlers that do access your site can represent you accurately.
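
If you decide to add one, llms.txt is simply a Markdown file served from your site root (e.g., https://yoursite.com/llms.txt). Here is a minimal sketch following the proposed convention; the site name, summary, and URLs are placeholders, not a required structure:

# Your Site Name

> One-sentence summary of what the site covers and who it is for.

## Key pages

- [Blog](https://yoursite.com/blog/): tutorials and how-to guides
- [About](https://yoursite.com/about/): who publishes this content and how to credit it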

How to Verify AI Crawlers Are Blocked

After updating your robots.txt, verify the rules are working:

  1. Visit https://yoursite.com/robots.txt in your browser and confirm the new directives appear
  2. Use the robots.txt report in Google Search Console to confirm the file can be fetched and parsed without errors
  3. Check your server access logs for the user-agent strings you blocked; requests from compliant crawlers should taper off
  4. If you allowed citation crawlers, watch for AI referral traffic in Google Analytics; crawler hits themselves won’t show up there, because bots don’t execute analytics JavaScript

Remember that robots.txt is a voluntary protocol. The major AI companies (OpenAI, Anthropic, Google, Apple) say they respect it, but smaller or less scrupulous crawlers may ignore it. For stronger enforcement, consider server-level blocking via your web server configuration or a WAF (Web Application Firewall).
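
On a typical Apache-based WordPress host, for example, you could deny matching user agents in .htaccess before the request ever reaches WordPress. This is a minimal sketch, not a complete rule set; the bot list is illustrative, and nginx or WAF equivalents use their own syntax:

# Deny selected AI crawler user agents at the web server level (Apache .htaccess)
<IfModule mod_rewrite.c>
  RewriteEngine On
  RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Bytespider|CCBot) [NC]
  RewriteRule ^ - [F]
</IfModule>

Unlike robots.txt, this returns a 403 Forbidden response, so even crawlers that ignore the protocol are turned away, as long as they identify themselves honestly in the User-Agent header.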

FAQs

Common questions about blocking AI crawlers with robots.txt:

Will blocking AI crawlers affect my Google search rankings?
No. Blocking AI training crawlers like GPTBot, ClaudeBot, or Google-Extended does not affect your Google search rankings. These bots are separate from Googlebot, which handles search indexing. Blocking Google-Extended only prevents your content from being used to train Gemini - it does not impact your visibility in Google Search or Google AI Overviews.
What is the difference between GPTBot and ChatGPT-User?
GPTBot is OpenAI's training crawler - it collects content to improve GPT models. ChatGPT-User is the browsing agent that fetches pages in real time when a ChatGPT user asks it to search the web. Blocking GPTBot prevents your content from being used for training. Blocking ChatGPT-User prevents ChatGPT from citing or linking to your site in conversations.
Should I block Google-Extended?
Blocking Google-Extended prevents your content from being used to train Google's Gemini models. It does not affect regular Google Search indexing or your appearance in AI Overviews - those are handled by Googlebot, which is a separate user-agent. If you want to prevent AI training but keep your search presence, blocking Google-Extended is a safe choice.
Do all AI crawlers respect robots.txt?
Major AI companies like OpenAI, Anthropic, Google, Apple, and Perplexity have publicly committed to respecting robots.txt. However, robots.txt is a voluntary protocol with no technical enforcement. Smaller or unknown crawlers may ignore it. For stronger protection, combine robots.txt with server-level blocking using firewall rules or web server configuration.
Can I block AI crawlers from specific pages only?
Yes. Instead of Disallow: / (which blocks the entire site), you can block specific paths. For example, Disallow: /premium-content/ blocks only that directory. You can also use Allow: to permit access to specific paths within a blocked area. This gives you granular control over what AI systems can and cannot access.
Is blocking AI crawlers retroactive?
No. Blocking a crawler in robots.txt only prevents future crawling. Any content that was already scraped before you added the block may still exist in the AI model's training data. There is currently no standardized way to request removal of previously scraped content, though some companies like OpenAI offer opt-out forms for content already collected.

Summary

AI crawlers fall into two categories: training crawlers (GPTBot, ClaudeBot, Google-Extended) that absorb your content into models, and citation crawlers (ChatGPT-User, PerplexityBot) that fetch content in real time and link back to your site.

Many publishers block training crawlers to prevent their content from being used without credit, while keeping citation crawlers allowed to maintain visibility in AI-powered search. Add the appropriate User-agent and Disallow directives to your robots.txt file to control access.

Combine robots.txt blocking with an llms.txt file for a complete AI content strategy: block the bots you don’t want, and guide the ones you do allow toward accurate citations.
