
Common pitfalls when optimising for LLMs
If you’re creating Markdown versions of your content for large language models (LLMs), make sure those files don’t show up in traditional search results. You don’t want someone searching for your company name on Google to land on a raw .md version of your homepage.
There are a few ways to prevent this:
Block access to .md files in robots.txt
You can tell traditional search engines not to access Markdown files using your robots.txt file:
User-agent: *
Disallow: /*.md$
or, more specifically:
User-agent: Googlebot
Disallow: /*.md$
User-agent: Bingbot
Disallow: /*.md$
Disallowing .md files in your robots.txt will stop traditional search engines from crawling that content. LLM crawlers do not follow or respect robots.txt, so .md files will still be accessible to them (which is our actual goal). Keep in mind that this method is not fully reliable for preventing indexing; it only restricts crawling of those pages.
Use noindex headers or block crawlers entirely
You can add HTTP response headers to tell traditional search engines not to index specific pages:
X-Robots-Tag: noindex
This can be applied selectively based on the user-agent, so only traditional search engines receive the noindex directive while LLM crawlers remain unaffected. To do this, you’ll need access to your web server or a CDN that supports conditional headers.
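As a rough illustration only, here is a minimal sketch of that conditional logic, assuming an edge function such as a Cloudflare Worker in front of the site; the bot list and the .md check are simplified placeholders rather than a complete implementation:

// Hypothetical edge worker (TypeScript): add a noindex header to .md responses,
// but only when the request comes from a traditional search engine crawler.
const SEARCH_BOTS = ["googlebot", "bingbot"]; // simplified example list

export default {
  async fetch(request: Request): Promise<Response> {
    const url = new URL(request.url);
    const userAgent = (request.headers.get("user-agent") ?? "").toLowerCase();
    const isSearchBot = SEARCH_BOTS.some((bot) => userAgent.includes(bot));
    const originResponse = await fetch(request); // fetch the page from the origin as usual
    if (url.pathname.endsWith(".md") && isSearchBot) {
      // Clone the response so its headers become mutable, then add the directive.
      const response = new Response(originResponse.body, originResponse);
      response.headers.set("X-Robots-Tag", "noindex");
      return response;
    }
    // LLM crawlers and regular visitors receive the response unchanged.
    return originResponse;
  },
};

Doing the user-agent check at the CDN or edge keeps your origin configuration untouched, but the same check could equally live in your web server configuration.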
Alternatively, you can block traditional search engines entirely from accessing Markdown files. This goes a step further than noindex by preventing them from even loading the page. Again, this can be done at the server or CDN level by checking the user-agent and denying access.
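Sticking with the same assumptions (a hypothetical edge worker and a simplified bot list), the blocking variant returns an error response instead of adding a header:

// Hypothetical edge worker (TypeScript): deny traditional search engines access
// to .md files entirely instead of serving them with a noindex header.
const SEARCH_BOTS = ["googlebot", "bingbot"]; // simplified example list

export default {
  async fetch(request: Request): Promise<Response> {
    const url = new URL(request.url);
    const userAgent = (request.headers.get("user-agent") ?? "").toLowerCase();
    const isSearchBot = SEARCH_BOTS.some((bot) => userAgent.includes(bot));
    if (url.pathname.endsWith(".md") && isSearchBot) {
      // Traditional search engine crawlers never see the Markdown version at all.
      return new Response("Forbidden", { status: 403 });
    }
    // Everyone else, including LLM crawlers, is passed through to the origin.
    return fetch(request);
  },
};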