The file robots.txt contains instructions for search engines regarding how they should crawl your website. These instructions, known as directives, are used to guide bots and tell specific search engines (or all of them) to avoid scanning certain addresses, files, or parts of your site.
In this post, I’ll take an in-depth look at this file, particularly in the context of WordPress sites. You’ll probably discover a few things you didn’t know about it.
If you just need the quick answer – here’s how to block all search engines from your entire site:
User-agent: *
Disallow: /

What is the robots.txt file?
Bots and crawlers are a necessary part of the Internet. But that doesn’t mean you want them scanning every address and piece of content on your site without restriction.
The desire to control bot behavior led to the creation of a standard known as the Robots Exclusion Protocol. The robots.txt file is the practical implementation of that protocol – it lets you guide search engine bots on how to crawl your site.
For most website owners, the benefits of robots.txt fall into two categories:
- Optimizing the crawl budget – the resources search engines allocate to crawling your site. By steering bots away from pages you don’t want in the index, you avoid wasting that budget and help search engines focus on crawling the most important pages on your site.
- Optimizing server performance and preventing overload due to scanning. This is done by blocking bots that consume unnecessary resources from scanning irrelevant addresses and content.
robots.txt is not meant for precise control over which pages get indexed
Using robots.txt, you can prevent search engines from accessing specific parts of your site, prevent the scanning of duplicate content and irrelevant content, and provide information to search engines on how to more efficiently scan your site.
The file tells search engines which pages and files they can scan on your site. But it’s not a foolproof way to control which pages appear in Google’s index.
To prevent a specific page from appearing in search results, use the noindex tag – either at the code level on the page itself or at the server level (for instance, in the .htaccess file). For WordPress specifically, you can also prevent search result pages from being indexed.
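As a sketch, the page-level approach means adding `<meta name="robots" content="noindex">` inside the page’s head. At the server level, assuming an Apache server with mod_headers enabled (adjust for your own stack), it could look like this:

```apache
# .htaccess: ask search engines not to index any PDF file,
# while still allowing them to crawl it
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
```

Unlike a robots.txt Disallow, the X-Robots-Tag header is seen by the crawler when it fetches the file, so the URL can be crawled but kept out of the index.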
Although many website administrators have heard of the term robots.txt, it doesn’t necessarily mean they understand how to use it correctly. Unfortunately, I’ve seen numerous misguided instructions on this topic.
What does a robots.txt file look like?
For a WordPress site, for example, the robots.txt file might look something like this:
User-agent: *
Disallow: /wp-admin/

Let’s explain the anatomy of the robots.txt file in the context of this example:
- User-agent – Indicates which user-agent (search engine) the following directives are relevant to.
- Asterisk (*) – Indicates that the directive is relevant for all user-agents, not specific ones.
- Disallow – This directive signifies which content should not be accessible for the specified user-agent.
- /wp-admin/ – This is the path that should not be accessible to the user-agent you’ve specified.
In summary, the example above instructs all search engine user-agents not to access the /wp-admin/ directory. Let’s delve into the various components of robots.txt…
1. User-agent in the robots.txt file
Each search engine identifies with a specific user-agent. Google’s bot is identified as Googlebot, Bing’s as BingBot, and AI crawlers like OpenAI’s as GPTBot.
The line that starts with “user-agent” marks the beginning of a set of directives. All directives between the first user-agent and the next one apply to that first user-agent.
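For example, a single file can hold separate groups of directives for different bots (the paths here are purely illustrative):

```
# This group applies only to Google's crawler
User-agent: Googlebot
Disallow: /example-private/

# This group applies to every other crawler
User-agent: *
Disallow: /example-tmp/
```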
2. Disallow directive in the robots.txt file
You can instruct search engines not to access specific files or pages, or entire sections of your site, using the Disallow directive.
The Disallow directive must be followed by a path indicating which content should not be accessible. If no path is defined, search engines will disregard the directive.
User-agent: *
Disallow: /wp-admin/

3. Allow directive in the robots.txt file
The Allow directive counteracts specific Disallow directives. Using both together lets search engines access a file or page that would otherwise be blocked.
The Allow directive must also be followed by a path. If no path is defined, search engines will ignore it.
User-agent: *
Allow: /media/terms-and-conditions.pdf
Disallow: /media/

In this example, the directive instructs all search engine user-agents not to access the /media/ directory, except for the PDF file mentioned in the Allow directive.
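To sanity-check how such a pair of rules behaves, you can feed them to Python’s built-in urllib.robotparser. Note that this parser uses simple prefix matching and first-match-wins ordering, which differs slightly from Google’s longest-match rule, but the two agree for this wildcard-free example:

```python
from urllib import robotparser

rules = """\
User-agent: *
Allow: /media/terms-and-conditions.pdf
Disallow: /media/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The explicitly allowed PDF is crawlable...
print(rp.can_fetch("*", "http://example.co.il/media/terms-and-conditions.pdf"))  # True
# ...while everything else under /media/ is blocked.
print(rp.can_fetch("*", "http://example.co.il/media/report.pdf"))  # False
```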
Important: When using these directives together, it’s advisable not to use wildcards, as this can lead to conflicting instructions.
Example of conflicting directives
User-agent: *
Allow: /directory
Disallow: *.html

In this scenario, search engines will likely be confused about the address http://domain.co.il/file.html, for example. It won’t be clear whether they should scan this file or not.
When directives conflict or are unclear, Google falls back to the least restrictive interpretation – so in this case, it will crawl the file.
4. Separate line for each directive
Each directive should appear on a separate line. Otherwise, search engines might misunderstand the instructions in the robots.txt file.
Refrain from writing in this style:
User-agent: * Disallow: /directory-1/ Disallow: /directory-2/ Disallow: /directory-3/

Instead, place each directive on its own line:

User-agent: *
Disallow: /directory-1/
Disallow: /directory-2/
Disallow: /directory-3/

5. Using wildcards (*) in the robots.txt file
In addition to using wildcards for user-agents, you can also use them for URL addresses. For example:
User-agent: *
Disallow: *?

In this example, there’s a directive instructing search engines not to access any URL address containing a question mark (?).
6. Using URL suffix ($) in the robots.txt file
You can use the dollar sign ($) to indicate a URL address suffix after the path. For example:
User-agent: *
Disallow: *.php$

In this example, the directive prevents search engines from accessing and scanning any URL with the .php suffix.
However, addresses with parameters following that will still be accessible. For instance, the address http://example.co.il/page.php?lang=he will still be accessible, as it doesn’t end with .php.
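Python’s built-in robots.txt parser does not understand the * and $ wildcards, so as an illustrative sketch, here is a tiny Google-style matcher that translates a rule path into a regular expression (the helper name robots_rule_matches is my own, not part of any standard library):

```python
import re

def robots_rule_matches(rule: str, path: str) -> bool:
    """Check a URL path against a robots.txt rule path,
    treating '*' as 'any characters' and a trailing '$' as end-of-URL."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape regex metacharacters, then turn the escaped '*' back into '.*'
    pattern = re.escape(rule).replace(r"\*", ".*")
    pattern = "^" + pattern + ("$" if anchored else "")
    return re.match(pattern, path) is not None

print(robots_rule_matches("*.php$", "/page.php"))          # True
print(robots_rule_matches("*.php$", "/page.php?lang=he"))  # False
```

This reproduces the behavior described above: the $ anchor stops the rule from matching once query parameters follow the .php suffix.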
7. Indicating the sitemap using robots.txt
Although the robots.txt file is intended to guide search engines on which pages they shouldn’t access, it’s also used to indicate where the XML Sitemap of the site is located.
You’ve probably already submitted your sitemap through Google Search Console or Bing Webmaster Tools. It’s still recommended to reference it in robots.txt as well.
Multiple Sitemap files can be referenced. Here’s an example:
User-agent: *
Disallow: /wp-admin/
Sitemap: http://example.co.il/sitemap1.xml
Sitemap: http://example.co.il/sitemap2.xml

This example tells search engines not to access the /wp-admin/ directory and, in addition, points to two Sitemap files at the specified addresses.
The reference to the Sitemap file’s address should be absolute. Additionally, note that the address doesn’t necessarily need to be on the same server as the robots.txt file.
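You can confirm that the Sitemap lines are picked up using urllib.robotparser, whose site_maps() method (available since Python 3.8) returns the listed addresses:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /wp-admin/
Sitemap: http://example.co.il/sitemap1.xml
Sitemap: http://example.co.il/sitemap2.xml
""".splitlines())

# Sitemap lines apply to the whole file, independent of any user-agent group
print(rp.site_maps())
```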
8. Adding Comments in the robots.txt File
Comments can be added using the hash symbol (#). They can appear at the beginning of a line or after the directive on the same line.
Search engines ignore everything after the hash symbol – comments are intended for humans only. Here’s an example:
# Don't allow access to the /wp-admin/ directory for all robots.
User-agent: *
Disallow: /wp-admin/

And here’s an example where the comment is on the same line as the directive:
User-agent: * # Applies to all robots
Disallow: /wp-admin/ # Don't allow access to the /wp-admin/ directory.

9. Using the Crawl-delay Directive in the robots.txt File
The Crawl-delay directive is not an official directive. It exists to prevent overloading the server during crawling due to a high frequency of requests.
If crawling overloads your server, Crawl-delay is only a temporary fix. It usually means your hosting is too weak for your site’s traffic or that your site is misconfigured.
Googlebot ignores this directive entirely. Google previously allowed setting the crawl rate via Search Console, but that option has been deprecated.
Bing does respect the Crawl-delay directive. The directive looks like this:
User-agent: BingBot
Disallow: /private/
Crawl-delay: 10

How to Create and Edit the robots.txt File for WordPress Sites
WordPress automatically generates a virtual robots.txt file for your site. Even if you don’t take any action, a default file probably exists.
You can check by adding /robots.txt after your domain name. For example, https://savvy.co.il/robots.txt will display Savvy Blog’s robots.txt file.
If the file is virtual, you won’t be able to edit it directly. To make changes, create a physical file on your server.
You can do this in several ways…
1. Creating a robots.txt File Using FTP
You can create and edit a robots.txt file using FTP software. Start by using a text editor to create an empty file named robots.txt.
Then connect to your server via FTP and upload this file to the root folder of your site. From there, you can make further modifications through your FTP client.
2. Creating and Editing the robots.txt File Using Yoast SEO
If you’re using the Yoast SEO plugin for WordPress, you can create and edit the robots.txt file directly through the plugin’s interface. Go to SEO > Tools and click Edit Files.
If the file doesn’t exist, you’ll see the option to create one. If it already exists, you can edit it along with the .htaccess file.
Typical robots.txt File for WordPress Sites
The following code is intended specifically for WordPress sites. Keep in mind that this is a recommendation only and is relevant only if:
- You don’t want the WordPress admin interface (admin) to be crawled.
- You don’t want internal search result pages to be crawled.
- You don’t want tag pages and author pages to be crawled.
- You’re using pretty permalinks as recommended…
User-agent: *
Disallow: /wp-content/plugins/ # block access to the plugins folder
Disallow: /wp-login.php # block access to the admin section
Disallow: /readme.html # block access to the readme file
Disallow: /search/ # block access to internal search result pages
Disallow: *?s=* # block access to internal search result pages
Disallow: *?p=* # block access to pages for which permalinks fail
Disallow: *&p=* # block access to pages for which permalinks fail
Disallow: *&preview=* # block access to preview pages
Disallow: /tag/ # block access to tag pages
Disallow: /author/ # block access to author pages
Sitemap: https://www.example.com/sitemap_index.xml

While this code is relevant for most WordPress sites, always perform adjustments and testing to ensure it suits your specific needs.
Blocking AI Crawlers
In recent years, AI companies have started sending dedicated crawlers to scrape web content for model training. If you want to prevent your content from being used for AI training, add these rules:
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /

These rules block the main AI training crawlers: OpenAI’s GPTBot, Google’s Google-Extended (used for Gemini training), Anthropic’s ClaudeBot, and Common Crawl’s CCBot.
Blocking Google-Extended does not affect your Google Search rankings. It only prevents your content from being used to train Google’s AI models.
Note that citation-based AI crawlers like PerplexityBot and ChatGPT-User provide backlinks and traffic. Blocking those is generally not recommended unless you have specific reasons.
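As a quick sanity check, you can verify these rules with urllib.robotparser: the blocked AI bots are denied, while a bot with no matching group (like PerplexityBot here) is allowed by default, just as robots.txt semantics dictate:

```python
from urllib import robotparser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/post/"))         # False
print(rp.can_fetch("PerplexityBot", "https://example.com/post/"))  # True
```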
For a deeper dive on this topic, check our dedicated guide on how to block AI crawlers with robots.txt. You may also be interested in understanding how nofollow, sponsored, and UGC link attributes affect SEO.
Common Actions with the robots.txt File
Here’s a table describing several common actions that can be taken with the robots.txt file, taken directly from Google’s documentation.
AI-Specific Access Control with llms.txt
While robots.txt blocks AI crawlers at the access level, a newer standard takes a different approach. The llms.txt file lets you define how AI models should present and use your content.
Think of it as a complement to robots.txt – rather than blocking AI entirely, you can guide how your content appears in AI-generated answers.
Learn how to set up and use it in our guide to llms.txt for AI engines. You can also serve full Markdown content with llms-full.txt for better AI representation.
FAQs
Here are the most common questions about robots.txt and how it works.
Can robots.txt keep a page out of Google’s index? Not reliably – to prevent a page from being indexed, use the noindex meta tag on the page itself.

Where does the robots.txt file live? At the root of the domain, for example https://example.com/robots.txt. It only applies to the specific protocol, host, and port where it is located.

How do I block AI crawlers? Add User-agent: GPTBot followed by Disallow: / to block OpenAI's crawler. The same approach works for ClaudeBot, Google-Extended, CCBot, and other AI crawlers. Blocking these does not affect your Google Search rankings.

Summary
The key takeaway: using the Disallow directive in robots.txt is not the same as using the noindex meta tag. Blocking a URL from crawling won’t necessarily keep it out of the index or search results.
You can use robots.txt to shape how search engines interact with your site. But it doesn’t give you precise control over what gets indexed.
Most WordPress site owners don’t need to touch the default virtual file. But if you’re dealing with a problematic bot, need to block AI crawlers, or want to refine how search engines access specific parts of your site, adding custom rules is worth the effort.
I recommend reviewing your robots.txt at least once a year, especially as the AI crawler landscape keeps evolving.

