We Audited the Top 1000 Sites for AI Search Readiness

Everyone has an opinion about getting found in AI search. Vendors sell readiness audits, forums argue about whether llms.txt matters, and Google publishes guidance about how its models read the web. What almost nobody does is measure the actual state of things. So we did. We took the top 1000 websites in the world and checked, one by one, how prepared they are for AI crawlers and answer engines.
The result is a useful reality check. Only one in ten of the top 1000 is genuinely ready for AI search, and barely a third has taken any stance on AI crawlers at all. The web is far less ready than the conversation around it suggests, and the gap is not where most people expect.
How We Measured It
We started with the Tranco list, a research grade ranking of the most popular domains that combines several sources and is built to resist manipulation. You can read about the method on the Tranco project site. We took the top 1000 domains, then visited each one and recorded four concrete signals.
The first signal is llms.txt, the emerging file that tells AI models how to read and use a site. The second is robots.txt, both whether it exists and whether it explicitly names any AI crawler such as GPTBot or ClaudeBot. The third is whether the homepage ships JSON-LD structured data. The fourth is whether the site declares a sitemap. These four map directly to the signals an answer engine uses when it decides whether to trust and cite a page.
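For readers who want to reproduce the second signal, here is a minimal sketch of how one might detect whether a robots.txt file names an AI crawler. It is not the code behind this audit. The token list is an illustrative assumption (verify each name against the vendor's own documentation), and `named_ai_crawlers` is a hypothetical helper. It records only that a name is present, not whether the rule allows or blocks the bot.

```python
import re

# Illustrative subset of AI crawler user-agent tokens.
# Assumption: this is NOT the list used in the audit; check vendor docs.
AI_CRAWLER_TOKENS = (
    "GPTBot",
    "ClaudeBot",
    "CCBot",
    "Google-Extended",
    "Applebot-Extended",
    "PerplexityBot",
)

USER_AGENT_LINE = re.compile(r"user-agent\s*:\s*(.+)", re.IGNORECASE)


def named_ai_crawlers(robots_txt: str) -> set[str]:
    """Return the AI crawler tokens that appear in any User-agent line."""
    named: set[str] = set()
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        match = USER_AGENT_LINE.match(line)
        if not match:
            continue
        agent = match.group(1).strip().lower()
        for token in AI_CRAWLER_TOKENS:
            # robots.txt user-agent matching is case-insensitive.
            if token.lower() == agent:
                named.add(token)
    return named


sample = """
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /

user-agent: claudebot  # lower case still counts
Allow: /
"""
print(sorted(named_ai_crawlers(sample)))
```

A wildcard `User-agent: *` group does not count here, because it addresses every crawler and says nothing about AI specifically.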
Of the 1000 domains, 669 returned a usable response. The rest are infrastructure that does not serve a public website: certificate authorities, content delivery endpoints, DNS hosts, and tracking domains that rank high but have no homepage to read. Every percentage below uses those 669 reachable sites as the denominator, so the numbers describe real, public facing websites and not the plumbing of the internet.
A few honest limits before the findings. We read each site once, at a single point in time, so a site that changed its files the next day is frozen as we found it. We checked the homepage for structured data rather than every template, so a site with JSON-LD only on article pages reads as a miss here. And we counted whether a site names an AI crawler, not whether it allows or blocks it, because the act of naming is the signal of awareness we wanted to measure. With those caveats stated, the picture is still strikingly clear.
We start with a deliberately generous question: does a site have any AI stance at all? We count a yes if it either publishes an llms.txt file or explicitly addresses at least one AI crawler in its robots.txt. That is a low bar, no perfect setup required. Later we raise it and score every site across all four signals, but this first cut asks only whether a site has noticed that AI crawlers exist.
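The low bar described above is a simple either-or test. A sketch of it, assuming each site has already been reduced to two booleans (the `SiteSignals` type and both function names are hypothetical, and the four sites in the example are toy data, not audit results):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteSignals:
    has_llms_txt: bool       # the site publishes an llms.txt file
    names_ai_crawler: bool   # robots.txt names at least one AI crawler


def has_ai_stance(site: SiteSignals) -> bool:
    """The low bar: either signal alone is enough."""
    return site.has_llms_txt or site.names_ai_crawler


def stance_share(sites: list[SiteSignals]) -> float:
    """Percentage of sites with any AI stance, rounded to one decimal."""
    if not sites:
        return 0.0
    with_stance = sum(has_ai_stance(site) for site in sites)
    return round(100 * with_stance / len(sites), 1)


# Toy data for illustration only.
toy_sites = [
    SiteSignals(has_llms_txt=True, names_ai_crawler=False),
    SiteSignals(has_llms_txt=False, names_ai_crawler=True),
    SiteSignals(has_llms_txt=False, names_ai_crawler=False),
    SiteSignals(has_llms_txt=False, names_ai_crawler=False),
]
print(stance_share(toy_sites))
```

Because the test is an "or", a site that has both signals is counted once, which is why the stance share can be smaller than the sum of the two individual shares.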
Before the section by section read, here is the whole sample in one view. Each bar is the share of the 669 reachable sites that pass that check.
Only a Third Have Noticed AI
Across the 669 reachable sites, just 32.6 percent have any AI stance at all. Two out of every three of the most visited websites on earth have done nothing, anywhere, to signal how AI engines should treat their content. And that is only the low bar. The share that is genuinely ready, which we score later, is far smaller.
That is the number worth sitting with. These are not small business pages built in an afternoon. They are the most trafficked, most resourced, most professionally maintained sites in the world. If only a third of them have an AI stance, the real web, the long middle of sites with smaller teams, is almost certainly further behind.
It helps to remember how low this bar sits. A site clears it by publishing a single optional file, or by typing one crawler name into a text file it already has. We did not ask for good answers, complete structured data, or a thought out policy. We asked for any answer at all, and two thirds of the top sites gave none.
It is tempting to read this as a crisis. It is more honest to read it as an opportunity. The signals that make a site legible to an answer engine are not exotic. They are the same crawl health and structure fundamentals that have always underpinned search, which is the calmer reading we argued in our guide to generative engine optimization. Most sites are not behind because the work is hard. They are behind because nobody told them the work applies to them now.
llms.txt Is Still a Rounding Error
The most hyped AI readiness tactic of the last year is the llms.txt file. In our sample, 12.4 percent of sites publish one. The standard that was supposed to be the new robots.txt sits, for now, at roughly one site in eight, even among the biggest sites.
Here is the honest part. That low number is not the scandal it looks like. We have said before that llms.txt is cheap to add and fine to have, but no major AI engine has confirmed it as a ranking or citation input, and Google’s own guidance calls it unnecessary. So the 12.4 percent is less a measure of negligence and more a measure of how little the file actually buys you today. If the biggest sites in the world, with the most to gain and the most staff to do it, are not bothering, that tells you something about the real return.
The takeaway is not “rush to publish llms.txt.” It is the opposite. Spend the effort on the signals that engines have confirmed they read, and treat llms.txt as a five minute nice to have, not a priority. The sites at the top of the web have quietly voted with their time, and the vote says this file is optional.
The Bot Wars: Who Sites Actually Name
The most revealing slice of the data is which AI crawlers sites choose to address in robots.txt. When a site names a specific bot, whether to allow or block it, it is making a deliberate decision about that company’s access. Here is the share of the 669 reachable sites that name each one, with the raw count alongside.
OpenAI’s GPTBot is the most named crawler on the web, which fits its position as the bot most site owners thought about first. Common Crawl sits unusually high because it predates the AI wave and many sites blocked it years ago for unrelated reasons, then found themselves part of the AI conversation by accident. Anthropic and Google follow closely, and the long tail of newer crawlers like Apple and Cohere shows that awareness drops off fast once you leave the four names that dominate the headlines.
There is a subtlety the raw counts hide. Naming a bot can mean welcoming it or banning it, and our audit recorded the presence of the name rather than the direction of the rule. In practice both decisions come from the same place, a site owner who sat down and thought about AI access on purpose. That is why the count is a fair proxy for awareness even though it does not tell you the split between open doors and closed ones.
The pattern matters for your own robots.txt. If you decide to allow or block AI crawlers, naming only GPTBot leaves a dozen other bots unaddressed, each following its own default in the meantime. A complete stance covers the full set, which is exactly why Seodisias checks for 14 known AI crawlers in its AI Ready analysis rather than just the famous one.
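To make "covering the full set" concrete, here is a small sketch that writes one robots.txt group per crawler from a single allow-or-block decision. The token list is an illustrative assumption, not the 14 crawlers Seodisias checks, and `ai_crawler_rules` is a hypothetical helper. Confirm each user-agent name against the vendor's documentation before publishing the output.

```python
# Illustrative subset of AI crawler user-agent tokens (an assumption).
AI_CRAWLER_TOKENS = [
    "GPTBot",
    "ClaudeBot",
    "CCBot",
    "Google-Extended",
    "Applebot-Extended",
    "PerplexityBot",
]


def ai_crawler_rules(tokens: list[str], allow: bool) -> str:
    """Build one robots.txt group per crawler with the same site-wide rule."""
    rule = "Allow: /" if allow else "Disallow: /"
    groups = [f"User-agent: {token}\n{rule}" for token in tokens]
    # A blank line separates groups; end the block with a newline.
    return "\n\n".join(groups) + "\n"


# Example: block every listed crawler from the whole site.
print(ai_crawler_rules(AI_CRAWLER_TOKENS, allow=False))
```

Generating the groups from one list keeps the stance consistent: adding a newly announced crawler is a one-line change instead of a hand edit that is easy to forget.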
The Quiet Majority Says Nothing
Flip the bot data around and a bigger story appears. Only 23.8 percent of reachable sites name any AI crawler at all in robots.txt. More than three quarters say nothing. They have not allowed AI bots, they have not blocked them, they simply have not engaged with the question. Even among the sites that bother to keep a robots.txt at all, only 32.6 percent name even one AI crawler.
Silence is itself a decision, and usually the wrong one. A site that says nothing gets crawled under whatever default each AI company chooses, with no record of intent and no control over how its content feeds answer engines. For a publisher worried about scraping, that is a missed chance to set boundaries. For a business that wants citations, it is a missed chance to roll out the welcome mat. Either way, the absence of a stance means the site is reacting to AI rather than directing how AI treats it.
The fundamentals around this silence are not encouraging either. Only 49.3 percent of sites declare a sitemap, and just 33.2 percent ship JSON-LD structured data on the homepage. Structured data is the single clearest way to tell any engine, search or generative, what a page actually contains, and two thirds of the top sites skip it on their most important page. If you want a fast structural win, this is where the easy ground is, and it is the kind of issue a technical SEO audit surfaces in minutes.
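Both of these fundamentals are easy to check yourself. The sketch below uses only the Python standard library and works on text you have already downloaded. It makes two assumptions that may differ from our audit: JSON-LD is detected by a script tag with type `application/ld+json`, and a sitemap counts as declared only if robots.txt carries a `Sitemap:` line. The helper names are hypothetical.

```python
from html.parser import HTMLParser


class _JsonLdFinder(HTMLParser):
    """Flags the first <script type="application/ld+json"> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        type_value = dict(attrs).get("type") or ""
        if type_value.strip().lower() == "application/ld+json":
            self.found = True


def has_json_ld(html: str) -> bool:
    """True if the page markup contains a JSON-LD script tag."""
    finder = _JsonLdFinder()
    finder.feed(html)
    return finder.found


def declares_sitemap(robots_txt: str) -> bool:
    """True if robots.txt has a Sitemap: line (assumed definition)."""
    return any(
        line.strip().lower().startswith("sitemap:")
        for line in robots_txt.splitlines()
    )


page = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Organization"}</script></head><body></body></html>'
)
robots = "User-agent: *\nDisallow:\nSitemap: https://example.com/sitemap.xml\n"
print(has_json_ld(page), declares_sitemap(robots))
```

This only confirms that the markup is present. It does not validate the JSON-LD itself or check that the sitemap URL resolves.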
Why does the gap exist at all, on sites that can clearly afford to close it? The honest answer is that AI readiness has no owner. Search teams treat it as someone else’s job, legal teams worry about scraping without acting, and engineering has a backlog that a robots.txt edit never reaches the top of. The work is small but unassigned, and unassigned work does not happen. That is good news for anyone willing to assign it.
Finally, score every site, 25 points for each of the four signals, and sort the whole top 1000 into readiness tiers. Here the base is all 1000 domains, so the lowest tier also holds the 331 that serve no public page at all.
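The scoring rule itself is simple arithmetic, sketched below. Two assumptions are ours for illustration: the robots.txt signal is treated as "names an AI crawler", and a domain that serves no public page is modeled as failing all four checks. The tier cutoffs are not reproduced here, and the `Signals` type is hypothetical.

```python
from dataclasses import dataclass

POINTS_PER_SIGNAL = 25  # four signals, 25 points each, 100 in total


@dataclass(frozen=True)
class Signals:
    llms_txt: bool
    names_ai_crawler: bool  # assumed reading of the robots.txt signal
    json_ld: bool
    sitemap: bool


def readiness_score(signals: Signals) -> int:
    """Score a site from 0 to 100 in steps of 25."""
    passed = sum(
        (signals.llms_txt, signals.names_ai_crawler,
         signals.json_ld, signals.sitemap)
    )
    return POINTS_PER_SIGNAL * passed


# A domain with no public page fails every check in this model.
UNREACHABLE = Signals(False, False, False, False)

print(readiness_score(Signals(True, True, True, True)))
print(readiness_score(Signals(False, True, False, True)))
print(readiness_score(UNREACHABLE))
```

With equal weights, the only possible scores are 0, 25, 50, 75, and 100, so each signal a site adds moves it a full step up the scale.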
Nearly six in ten of the most popular domains on earth land in the bottom tier, and only one in ten, our line for genuinely ready, reaches the top. Among the real sites that respond, just 18 score a perfect four out of four. The middle is thin, so a site that gets even two or three of these jumps ahead of most of the web.
What to Do About It
The encouraging conclusion from a discouraging dataset is that the bar to stand out is low. You do not need a separate AI department or an expensive transformation. You need to do the few concrete things that two thirds of the biggest sites have not, and you can do all of them in an afternoon.
- Take a robots.txt stance. Decide whether you want AI crawlers to reach your content, then write that decision for the full set of known bots, not just GPTBot.
- Add JSON-LD structured data to your important pages so engines can parse what each page contains. Start with the homepage and your top templates.
- Keep a current sitemap so crawlers find everything that matters, which also feeds directly into how crawl budget works on larger sites.
- Treat llms.txt as optional. Add it if you like, but do not let it crowd out the signals above.
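For the structured data step, a homepage can start with a single Organization block. The sketch below builds the script tag with the standard library. The company name and URL are placeholders, `json_ld_script` is a hypothetical helper, and a real page would usually carry more properties than the two shown.

```python
import json


def json_ld_script(name: str, url: str) -> str:
    """Return a JSON-LD <script> tag describing an Organization."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
    }
    # Escape "</" so a value can never close the script tag early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Placeholder values; replace with your own organization details.
print(json_ld_script("Example Co", "https://example.com"))
```

Paste the output into the page head, then move on to your top templates with the schema type that fits each one, such as Article or Product.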
The reason this works is the same reason the dataset looks the way it does. The signals that make you legible to an answer engine overlap almost entirely with a technically healthy site. The channel changed when answers started rendering inside AI, but the work did not. Sites that already do solid technical SEO are most of the way to AI ready without calling it that, and sites that ignore the basics are invisible to both.
So treat the top 1000 as a mirror for the rest of the web. A third have an AI stance, an eighth publish llms.txt, and two thirds skip structured data. The opportunity is not to chase the loudest tactic but to do the quiet, confirmed work that most sites still neglect. The companies at the top got there with budgets most readers do not have, yet they left this ground uncovered, which means a small site that does the basics can look more deliberate to an answer engine than a giant that ignored them. Pick one signal, your robots.txt stance or your structured data, fix it this week, then move to the next one.
The Sites We Checked
For transparency, here are the top 100 public sites in our sample, the highest ranked domains that returned a response. We list them as plain text, without links, so the list stays neutral and hands no ranking signal to anyone. It also gives a feel for what the audit actually looked at.
google.com whatsapp.net chatgpt.com office365.com
cloudflare.com fastly.net vimeo.com t.me
gstatic.com appsflyersdk.com myfritz.net criteo.com
facebook.com netflix.com zoom.us blogspot.com
microsoft.com wordpress.org qq.com europa.eu
googleapis.com digicert.com tiktokv.com vk.com
youtube.com skype.com yandex.net b-cdn.net
amazonaws.com youtu.be baidu.com googleadservices.com
apple.com pinterest.com workers.dev github.io
instagram.com gandi.net windows.com amazon-adsystem.com
mail.ru goo.gl cloudflare-dns.com epicgames.com
fbcdn.net whatsapp.com nginx.org unity3d.com
twitter.com x.com mozilla.org snapchat.com
dzen.ru googlesyndication.com nic.ru app-measurement.com
linkedin.com yahoo.com opera.com apache.org
googletagmanager.com cloud.microsoft yandex.ru nih.gov
live.com icloud.com samsung.com mailinabox.email
office.com tiktok.com nginx.com amazonvideo.com
amazon.com msn.com sentry.io dns.google
azure.com spotify.com wordpress.com outlook.com
wikipedia.org cloudflare.net okcdn.ru kaspersky.com
github.com adobe.com reddit.com intuit.com
bing.com googledomains.com google-analytics.com app-analytics-services.com
doubleclick.net ntp.org bit.ly telekom.de
googleusercontent.com wa.me ui.com prodregistryv2.org
Check If Your Own Site Is AI Ready
We ran these four checks on a thousand sites. You can run them on yours in a few minutes. The AI Ready feature in Seodisias scores your site out of 100 across these signals and the rest of the readiness checklist, then groups what to fix by priority, with the schema and robots.txt snippets ready to paste. It works locally, handles unlimited URLs, and is free to download and use. If two thirds of the biggest sites on earth have not done this, the few hours it takes you are some of the highest leverage work on your site right now.