What is robots.txt File & How to Use it Correctly

The file robots.txt contains instructions for search engines regarding how they should crawl your website. These instructions, known as directives, are used to guide bots and tell specific search engines (or all of them) to avoid scanning certain addresses, files, or parts of your site.

In this post, I’ll take an in-depth look at this file, particularly in the context of WordPress sites. You’ll probably discover a few things you didn’t know about it.

If you just need the quick answer – here’s how to block all search engines from your entire site:

User-agent: *
Disallow: /

What is the robots.txt file?

Bots and crawlers are a necessary part of the Internet. But that doesn’t mean you want them scanning every address and piece of content on your site without restriction.

The desire to control bot behavior led to the creation of a standard known as the Robots Exclusion Protocol. The robots.txt file is the practical implementation of that protocol – it lets you guide search engine bots on how to crawl your site.

For most website owners, the benefits of robots.txt fall into two categories:

  • Optimizing the crawl budget – the resources search engines allocate to scanning your site. By steering bots away from pages you don’t want in the index, you avoid wasting those resources and ensure search engines focus on your most important pages.
  • Optimizing server performance – blocking bots from scanning irrelevant addresses and content prevents them from consuming server resources unnecessarily and overloading your site.

robots.txt is not intended for specific control over indexed pages

Using robots.txt, you can prevent search engines from accessing specific parts of your site, prevent the scanning of duplicate content and irrelevant content, and provide information to search engines on how to more efficiently scan your site.

The file tells search engines which pages and files they can scan on your site. But it’s not a foolproof way to control which pages appear in Google’s index.

To prevent a specific page from appearing in search results, use the noindex tag – either at the code level on the page itself or at the server level (for instance, in the .htaccess file). For WordPress specifically, you can also prevent search result pages from being indexed.
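For the server-level route, a fragment like the following could go in your .htaccess file (a sketch assuming Apache with mod_headers enabled; the PDF pattern is just an illustration – adjust it to the files you want kept out of the index):

```apache
# Ask search engines not to index any PDF file served from this site.
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
```

Unlike a meta tag, this header also works for non-HTML files such as PDFs and images.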

Although many website administrators have heard of the term robots.txt, it doesn’t necessarily mean they understand how to use it correctly. Unfortunately, I’ve seen numerous misguided instructions on this topic.

What does a robots.txt file look like?

For a WordPress site, for example, the robots.txt file might look something like this:

User-agent: *
Disallow: /wp-admin/

Let’s explain the anatomy of the robots.txt file in the context of this example:

  • User-agent – Indicates which user-agent (search engine) the following directives are relevant to.
  • Asterisk (*) – Indicates that the directive is relevant for all user-agents, not specific ones.
  • Disallow – This directive signifies which content should not be accessible for the specified user-agent.
  • /wp-admin/ – This is the path that should not be accessible to the user-agent you’ve specified.

In summary, the example above instructs all search engine user-agents not to access the /wp-admin/ directory. Let’s delve into the various components of robots.txt…
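Rules like this can be sanity-checked with Python’s standard-library robots.txt parser. The snippet below is a minimal sketch (example.com is a placeholder) that parses the example above and verifies which URLs are blocked:

```python
from urllib.robotparser import RobotFileParser

# Parse the example rules directly instead of fetching them from a server.
rules = """\
User-agent: *
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# /wp-admin/ is blocked for every user-agent...
print(parser.can_fetch("Googlebot", "https://example.com/wp-admin/options.php"))  # False
# ...while the rest of the site remains crawlable.
print(parser.can_fetch("Googlebot", "https://example.com/blog/hello-world/"))     # True
```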

1. User-agent in the robots.txt file

Each search engine identifies itself with a specific user-agent. Google’s bot is identified as Googlebot, Bing’s as Bingbot, and AI crawlers like OpenAI’s as GPTBot.

The line that starts with “user-agent” marks the beginning of a set of directives. All directives between the first user-agent and the next one apply to that first user-agent.

2. Disallow directive in the robots.txt file

You can instruct search engines not to access specific files or pages, or entire sections of your site, using the Disallow directive.

The Disallow directive must be followed by a path indicating which content should not be accessible. If no path is defined, search engines will disregard the directive.

User-agent: *
Disallow: /wp-admin/

3. Allow directive in the robots.txt file

The Allow directive counteracts specific Disallow directives. Using both together lets search engines access a file or page that would otherwise be blocked.

The Allow directive must also be followed by a path. If no path is defined, search engines will ignore it.

User-agent: *
Allow: /media/terms-and-conditions.pdf
Disallow: /media/

In this example, the directive instructs all search engine user-agents not to access the /media/ directory, except for the PDF file mentioned in the Allow directive.
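You can verify this exception with Python’s `urllib.robotparser` (a minimal sketch; note that Python applies rules in file order, first match wins, so the `Allow` line must come before the broader `Disallow`, as in the example above):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Allow: /media/terms-and-conditions.pdf
Disallow: /media/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The PDF is explicitly allowed despite the directory-wide Disallow...
print(parser.can_fetch("*", "https://example.com/media/terms-and-conditions.pdf"))  # True
# ...while everything else under /media/ stays blocked.
print(parser.can_fetch("*", "https://example.com/media/brochure.pdf"))              # False
```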

Important: When using these directives together, it’s advisable not to use wildcards, as this can lead to conflicting instructions.

Example of conflicting directives

User-agent: *
Allow: /directory
Disallow: *.html

In this scenario, the rules conflict for an address like http://domain.co.il/directory.html – it matches both the Allow directive and the Disallow directive, so it isn’t clear whether search engines should scan it.

When directives conflict like this, Google applies the least restrictive one, so in this case it will scan the file.

4. Separate line for each directive

Each directive should appear on a separate line. Otherwise, search engines might misunderstand the instructions in the robots.txt file.

Refrain from writing in this style:

User-agent: * Disallow: /directory-1/ Disallow: /directory-2/ Disallow: /directory-3/

5. Using wildcards (*) in the robots.txt file

In addition to using wildcards for user-agents, you can also use them for URL addresses. For example:

User-agent: *
Disallow: *?

In this example, there’s a directive instructing search engines not to access any URL address containing a question mark (?).

6. Using URL suffix ($) in the robots.txt file

You can use the dollar sign ($) to indicate a URL address suffix after the path. For example:

User-agent: *
Disallow: *.php$

In this example, the directive prevents search engines from accessing and scanning any URL with the .php suffix.

However, addresses where parameters follow the extension will still be accessible. For instance, http://example.co.il/page.php?lang=he remains accessible, as it doesn’t end with .php.
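Python’s built-in robots.txt parser doesn’t understand these wildcards, but the matching rules are easy to sketch yourself: `*` matches any sequence of characters, and a trailing `$` anchors the pattern to the end of the URL. The helper below is a hypothetical illustration (not a library function) that translates such a pattern into a regular expression:

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any character sequence; a trailing '$' anchors the match
    to the end of the URL path. Everything else is treated literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore '*' as '.*'.
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.compile(regex + ("$" if anchored else ""))

# "Disallow: *?" matches any URL path containing a question mark.
q = robots_pattern_to_regex("*?")
print(bool(q.match("/page.php?lang=he")))  # True

# "Disallow: *.php$" matches only URLs that end in .php.
p = robots_pattern_to_regex("*.php$")
print(bool(p.match("/page.php")))          # True
print(bool(p.match("/page.php?lang=he")))  # False
```

Using `match` (rather than `search`) anchors the pattern at the start of the path, which mirrors how crawlers evaluate these rules.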

7. Indicating the sitemap using robots.txt

Although the robots.txt file is intended to guide search engines on which pages they shouldn’t access, it’s also used to indicate where the XML Sitemap of the site is located.

You’ve probably already submitted your sitemap through Google Search Console or Bing Webmaster Tools. It’s still recommended to reference it in robots.txt as well.

Multiple Sitemap files can be referenced. Here’s an example:

User-agent: *
Disallow: /wp-admin/
Sitemap: http://example.co.il/sitemap1.xml
Sitemap: http://example.co.il/sitemap2.xml

In this example, search engines are instructed not to access the /wp-admin/ directory, and in addition, two Sitemap files are referenced at the specified addresses.

The reference to the Sitemap file’s address should be absolute. Additionally, note that the address doesn’t necessarily need to be on the same server as the robots.txt file.
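If you need to read these references programmatically, Python’s `urllib.robotparser` exposes them via `site_maps()` (available since Python 3.8) – a minimal sketch:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /wp-admin/
Sitemap: http://example.co.il/sitemap1.xml
Sitemap: http://example.co.il/sitemap2.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# site_maps() returns the absolute sitemap URLs, or None if there are none.
print(parser.site_maps())
# ['http://example.co.il/sitemap1.xml', 'http://example.co.il/sitemap2.xml']
```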

8. Adding Comments in the robots.txt File

Comments can be added using the hash symbol (#). They can appear at the beginning of a line or after the directive on the same line.

Search engines ignore everything after the hash symbol – comments are intended for humans only. Here’s an example:

# Don't allow access to the /wp-admin/ directory for all robots.
User-agent: *
Disallow: /wp-admin/

And here’s an example where the comment is on the same line as the directive:

User-agent: * # Applies to all robots
Disallow: /wp-admin/ # Don't allow access to the /wp-admin/ directory.

9. Using the Crawl-delay Directive in the robots.txt File

The Crawl-delay directive is not an official directive. It exists to prevent overloading the server during crawling due to a high frequency of requests.

If crawling overloads your server, Crawl-delay is only a temporary fix. It usually means your hosting is too weak for your site’s traffic or that your site is misconfigured.

Googlebot ignores this directive entirely. Google previously allowed setting the crawl rate via Search Console, but that option has been deprecated.

Bing does respect the Crawl-delay directive. The directive looks like this:

User-agent: Bingbot
Disallow: /private/
Crawl-delay: 10
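Python’s parser also understands this directive; `crawl_delay()` returns the value for a given user-agent, or `None` when no delay applies (a minimal sketch using the example above):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Bingbot
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The delay applies to Bing's crawler (user-agent matching is case-insensitive)...
print(parser.crawl_delay("Bingbot"))    # 10
# ...but not to crawlers outside the group.
print(parser.crawl_delay("Googlebot"))  # None
```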

How to Create and Edit the robots.txt File for WordPress Sites

WordPress automatically generates a virtual robots.txt file for your site. Even if you don’t take any action, a default file probably exists.

You can check by adding /robots.txt after your domain name. For example, https://savvy.co.il/robots.txt will display Savvy Blog’s robots.txt file.

If the file is virtual, you won’t be able to edit it directly. To make changes, create a physical file on your server.

You can do this in several ways…

1. Creating a robots.txt File Using FTP

You can create and edit a robots.txt file using FTP software. Start by using a text editor to create an empty file named robots.txt.

Then connect to your server via FTP and upload this file to the root folder of your site. From there, you can make further modifications through your FTP client.

2. Creating and Editing the robots.txt File Using Yoast SEO

If you’re using the Yoast SEO plugin for WordPress, you can create and edit the robots.txt file directly through the plugin’s interface. Go to SEO > Tools and click Edit Files.

If the file doesn’t exist, you’ll see the option to create one. If it already exists, you can edit it along with the .htaccess file.

Typical robots.txt File for WordPress Sites

The following code is intended specifically for WordPress sites. Keep in mind that this is a recommendation only and is relevant only if:

  • You don’t want the WordPress admin interface (admin) to be crawled.
  • You don’t want internal search result pages to be crawled.
  • You don’t want tag pages and author pages to be crawled.
  • You’re using pretty permalinks as recommended…

User-agent: *
Disallow: /wp-content/plugins/ # block access to the plugins folder
Disallow: /wp-login.php # block access to the admin section
Disallow: /readme.html # block access to the readme file
Disallow: /search/ # block access to internal search result pages
Disallow: *?s=* # block access to internal search result pages
Disallow: *?p=* # block access to pages for which permalinks fail
Disallow: *&p=* # block access to pages for which permalinks fail
Disallow: *&preview=* # block access to preview pages
Disallow: /tag/ # block access to tag pages
Disallow: /author/ # block access to author pages
Sitemap: https://www.example.com/sitemap_index.xml

While this code is relevant for most WordPress sites, always perform adjustments and testing to ensure it suits your specific needs.

Blocking AI Crawlers

In recent years, AI companies have begun sending dedicated crawlers to scrape web content for model training. If you want to prevent your content from being used for AI training, add these rules:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

These rules block the main AI training crawlers: OpenAI’s GPTBot, Google’s Google-Extended (used for Gemini training), Anthropic’s ClaudeBot, and Common Crawl’s CCBot.

Blocking Google-Extended does not affect your Google Search rankings. It only prevents your content from being used to train Google’s AI models.

Note that citation-based AI crawlers like PerplexityBot and ChatGPT-User provide backlinks and traffic. Blocking those is generally not recommended unless you have specific reasons.
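To confirm rules like these behave as intended before deploying them, you can test them with Python’s `urllib.robotparser` (a minimal sketch; example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The listed AI crawlers are blocked everywhere...
print(parser.can_fetch("GPTBot", "https://example.com/blog/post/"))     # False
# ...but crawlers without a matching group remain unaffected.
print(parser.can_fetch("Googlebot", "https://example.com/blog/post/"))  # True
```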

For a deeper dive on this topic, check our dedicated guide on how to block AI crawlers with robots.txt. You may also be interested in understanding how nofollow, sponsored, and UGC link attributes affect SEO.

Common Actions with the robots.txt File

Here’s a table describing several common actions that can be taken with the robots.txt file. The table was taken directly from Google’s documentation.

Useful rules

Disallow crawling of the entire website
Keep in mind that in some situations URLs from the website may still be indexed, even if they haven’t been crawled.

User-agent: *
Disallow: /

Disallow crawling of a directory and its contents
Append a forward slash to the directory name to disallow crawling of a whole directory. The disallowed string may appear anywhere in the path, so Disallow: /junk/ matches https://example.com/junk/ and https://example.com/for-sale/other/junk/.
Caution: don’t use robots.txt to block access to private content; use proper authentication instead. URLs disallowed by the robots.txt file might still be indexed without being crawled, and the robots.txt file can be viewed by anyone, potentially disclosing the location of your private content.

User-agent: *
Disallow: /calendar/
Disallow: /junk/
Disallow: /books/fiction/contemporary/

Allow access to a single crawler
Only googlebot-news may crawl the whole site.

User-agent: Googlebot-news
Allow: /

User-agent: *
Disallow: /

Allow access to all but a single crawler
Unnecessarybot may not crawl the site; all other bots may.

User-agent: Unnecessarybot
Disallow: /

User-agent: *
Allow: /

Disallow crawling of a single web page
For example, disallow the useless_file.html page.

User-agent: *
Disallow: /useless_file.html

Block a specific image from Google Images
For example, disallow the dogs.jpg image.

User-agent: Googlebot-Image
Disallow: /images/dogs.jpg

Block all images on your site from Google Images
Google can’t index images and videos without crawling them.

User-agent: Googlebot-Image
Disallow: /

Disallow crawling of files of a specific file type
For example, disallow crawling of all .gif files.

User-agent: Googlebot
Disallow: /*.gif$

Disallow crawling of an entire site, but allow Mediapartners-Google
This implementation hides your pages from search results, but the Mediapartners-Google web crawler can still analyze them to decide what ads to show visitors on your site.

User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /

Use $ to match URLs that end with a specific string
For example, disallow all .xls files.

User-agent: Googlebot
Disallow: /*.xls$

AI-Specific Access Control with llms.txt

While robots.txt blocks AI crawlers at the access level, a newer standard takes a different approach. The llms.txt file lets you define how AI models should present and use your content.

Think of it as a complement to robots.txt – rather than blocking AI entirely, you can guide how your content appears in AI-generated answers.

Learn how to set up and use it in our guide to llms.txt for AI engines. You can also serve full Markdown content with llms-full.txt for better AI representation.

FAQs

Here are the most common questions about robots.txt and how it works.

Does robots.txt prevent pages from being indexed by Google?
No. Blocking a URL in robots.txt prevents crawling, but Google can still index the page if other sites link to it. To prevent indexing, use the noindex meta tag on the page itself.

Where should robots.txt be placed?
The file must be placed in the root directory of your website. For example, https://example.com/robots.txt. It only applies to the specific protocol, host, and port where it is located.

Can robots.txt block AI crawlers like GPTBot?
Yes. You can add User-agent: GPTBot followed by Disallow: / to block OpenAI's crawler. The same approach works for ClaudeBot, Google-Extended, CCBot, and other AI crawlers. Blocking these does not affect your Google Search rankings.

What is the difference between robots.txt and llms.txt?
The robots.txt file controls crawler access – it tells bots which parts of your site they can or cannot crawl. The llms.txt file is a newer standard that defines how AI models should present and use your content. They complement each other rather than replacing one another.

Does WordPress create a robots.txt file automatically?
Yes. WordPress generates a virtual robots.txt file by default. To customize it, you need to create a physical file on your server or use an SEO plugin like Yoast SEO or Rank Math to edit it through the WordPress admin.

Summary

The key takeaway: using the Disallow directive in robots.txt is not the same as using the noindex meta tag. Blocking a URL from crawling won’t necessarily keep it out of the index or search results.

You can use robots.txt to shape how search engines interact with your site. But it doesn’t give you precise control over what gets indexed.

Most WordPress site owners don’t need to touch the default virtual file. But if you’re dealing with a problematic bot, need to block AI crawlers, or want to refine how search engines access specific parts of your site, adding custom rules is worth the effort.

I recommend reviewing your robots.txt at least once a year, especially as the AI crawler landscape keeps evolving.
