How to expose your data to AI with schema, feeds and entity

844K+ websites with llms.txt (BuiltWith, October 2025)

80%+ of enterprise RAG implementations use FAISS/Elastic (Applied Sciences, Dec 2025)

52% vs 37% accuracy, GPT-5 mini vs ChatGPT Search (OpenAI data)

4 mechanisms to expose data to LLMs today

Why can't your data stay only inside your website any longer?

We've advised companies with €2M to €15M in revenue that had technically solid websites, good content and reasonable domain authority. And yet, when you asked ChatGPT, Gemini or Perplexity about their category, they didn't appear. Internally we call this the "glass wall problem": your information is there, but AIs don't receive it in a format they can process with confidence.

Classic search engines crawl and index your website's HTML systematically. LLMs work differently. A model like GPT-5 or Gemini 2.5 doesn't "read" your whole website: it retrieves fragments on demand, in real time, and processes them based on how easily interpretable their content is. If your website relies heavily on JavaScript, has complex navigation or data buried in PDFs, those fragments may not arrive.
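To make the JavaScript point concrete, here is a minimal sketch of what a fetcher that doesn't execute scripts would be able to read. It assumes the retriever only sees the raw HTML response; the function names and the sample facts are illustrative, not part of any real crawler.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text that sits outside <script> and <style> tags."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(raw_html: str) -> str:
    """Return the text a non-JavaScript fetcher could read from raw HTML."""
    parser = VisibleTextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.chunks)


def missing_facts(raw_html: str, key_facts: list[str]) -> list[str]:
    """List the key facts (hypothetical examples) absent from the static text."""
    text = visible_text(raw_html).lower()
    return [fact for fact in key_facts if fact.lower() not in text]
```

Running `missing_facts` against the unrendered HTML of your main pages, with the handful of facts you most want AIs to repeat, gives a quick first signal of how much of your data only exists after JavaScript runs.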

A systematic review of 63 studies published in Applied Sciences (December 2025) confirms that more than 80% of enterprise RAG implementations rely on standard retrieval frameworks like FAISS or Elasticsearch, and that one of the main bottlenecks remains the quality and accessibility of source data. It's not the model that usually fails. It's the data layer that feeds it.
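A toy sketch of the "retrieve fragments on demand" step can show why the data layer matters. It uses plain word overlap (cosine similarity over token counts) as a stand-in for the embedding search that FAISS or Elasticsearch would perform in a real pipeline; the fragments and function names are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, fragments: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k fragments most similar to the query."""
    query_vector = tokenize(query)
    ranked = sorted(fragments, key=lambda f: cosine(query_vector, tokenize(f)), reverse=True)
    return ranked[:top_k]
```

A fragment made of cookie-banner text or an empty JavaScript shell scores zero against almost any real question, so it never reaches the model, however good the model is.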

Key fragment: LLMs don't index your website like a classic search engine. They retrieve fragments on demand and only process what's directly interpretable. If your data isn't formatted for them, the probability of your company appearing in AI-generated answers is significantly reduced.

What does "first-party data for LLMs" mean in practice?

When I talk about first-party data for LLMs I'm not referring to sharing your CRM with OpenAI. I'm referring to building a data layer you control, in formats language models can consume reliably, and that you keep updated yourself instead of waiting for a crawler to do it.

This includes four mechanisms that are already operational or in a phase of accelerated adoption:

| Mechanism | What it does | For whom |
| --- | --- | --- |
| llms.txt | Markdown file at your root that tells LLMs which pages are relevant and how to interpret them. | Any company with a website and own content. |
| Product Feed (OpenAI / Google) | Structured feed (JSON/CSV/XML) you send directly to AI platforms with products, prices and stock. | E-commerce, retailers, marketplaces. |
| Advanced schema markup | JSON-LD with entity, product, FAQ, author and relationship data. Feeds Knowledge Graphs. | All companies. Especially B2B and services. |
| MCP | Open protocol (Anthropic / Linux Foundation) for bidirectional connection between LLMs and enterprise systems. | Companies with APIs, databases and complex internal processes. |

My reading: we're facing a change similar to what sitemap.xml meant for classic SEO. Whoever builds this layer with good judgment, and before their competitors, will have an advantage that is hard to replicate, because it's not just technology: it's proprietary data, updated and verifiable. And that takes time.

Translation to GEO: For a B2B services company, implementing schema Organization + Person + sameAs on its main platforms and adding llms.txt to the root of its website can move the needle on visibility in LLMs with a reasonable effort. It's not a mega-project, but it does require judgment about which data to expose and how to structure it.

How does llms.txt work and is it worth implementing now?

The llms.txt standard was proposed by Jeremy Howard (co-founder of Answer.AI) in September 2024. The idea is straightforward: a Markdown file at the root of your website (/llms.txt) that offers LLMs a curated map of your most relevant content, with interpretation context.

Unlike a sitemap.xml — which lists all URLs for crawlers — llms.txt selects the important parts and presents them in a format a language model can process directly, without having to parse complex HTML, sidebars, cookie banners or dynamic JavaScript.
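As an illustration of the file layout, the sketch below renders an llms.txt in the structure described in the public proposal: an H1 with the site name, a blockquote summary, and H2 sections containing Markdown link lists. The class names, the company and the URLs are hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    title: str
    url: str
    note: str = ""  # optional one-line context for the model


@dataclass
class Section:
    heading: str
    links: list[Link] = field(default_factory=list)


def render_llms_txt(site_name: str, summary: str, sections: list[Section]) -> str:
    """Render an llms.txt body: H1 title, blockquote summary, H2 link sections."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for section in sections:
        lines.append(f"## {section.heading}")
        lines.append("")
        for link in section.links:
            suffix = f": {link.note}" if link.note else ""
            lines.append(f"- [{link.title}]({link.url}){suffix}")
        lines.append("")
    # End the file with exactly one trailing newline.
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    # Hypothetical company and URLs, for illustration only.
    print(render_llms_txt(
        "Example Co",
        "B2B consultancy specialised in technical SEO and GEO.",
        [Section("Services", [Link("GEO audit", "https://example.com/geo-audit.md", "Scope and deliverables")])],
    ))
```

The value is in the curation rather than the rendering: a dozen well-described links usually say more than a dump of every URL on the site.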

According to BuiltWith data (October 2025), more than 844,000 websites have already implemented it. Companies like Anthropic, Cloudflare, Stripe and Vercel use it in their documentation. LangChain performed internal benchmarks comparing four ways of giving documentation access to code agents, and the version optimized with llms.txt clearly outperformed the rest.

Now, a necessary nuance. An SE Ranking study on 300,000 domains (November 2025) found no statistical correlation between having llms.txt and being cited more by LLMs. And no major AI provider has officially confirmed that they use this file in their inference pipelines. This doesn't invalidate the standard; it contextualizes it. We're in an early adoption phase, similar to sitemap.xml's before Google formally adopted it.

My position: implementing llms.txt takes less than an hour and has asymmetric upside. If tomorrow an LLM starts looking for this file, you already have it. If it doesn't look for it, you haven't lost anything relevant. It's the kind of bet a CEO should approve without much thought.

Key fragment: llms.txt is a Markdown file at your web root that offers language models a curated map of your most relevant content. More than 844,000 websites already have it. Today there's no evidence of direct impact on citations, but the cost of implementing it is so low that the risk-benefit ratio is clearly favorable.

If you sell products: dynamic feeds for ChatGPT and Google

This section is especially relevant if you have e-commerce, marketplace or sell physical/digital products online. If your model is purely B2B/services, you can skip to the next section.

OpenAI already has an operational product feed specification that allows merchants to send structured data directly to ChatGPT: titles, prices, stock, images, variants, logistics, ratings. The documentation is public at developers.openai.com/commerce/specs/feed.

This already works. ChatGPT launched its shopping assistant in November 2025, and since September 2025 it has allowed Instant Checkout with Shopify, Etsy and Stripe through the Agentic Commerce Protocol (ACP). The model doesn't crawl your store: you send it a feed via HTTPS, and you can update it as often as every 15 minutes to keep prices and inventory close to real time. It's not vaporware.
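A minimal sketch of the preparation step before a push might look like the following. The field names are hypothetical placeholders and should be mapped to the ones in the published feed specification; the sketch only shows the idea of rejecting incomplete items and checking freshness against the 15-minute window mentioned above.

```python
import json
from datetime import datetime, timedelta

# Hypothetical field names for illustration; map them to the published spec.
REQUIRED_FIELDS = ("id", "title", "price", "currency", "availability", "link")


def validate_item(item: dict) -> list[str]:
    """Return a list of problems; an empty list means the item can be sent."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if item.get(name) in (None, "")]
    price = item.get("price")
    if price is not None and (not isinstance(price, (int, float)) or price <= 0):
        problems.append("price must be a positive number")
    return problems


def build_feed(items: list[dict], generated_at: datetime) -> str:
    """Serialize only the valid items, stamped with the generation time."""
    valid = [item for item in items if not validate_item(item)]
    payload = {"generated_at": generated_at.isoformat(), "items": valid}
    return json.dumps(payload, ensure_ascii=False, indent=2)


def is_stale(last_push: datetime, now: datetime, max_age: timedelta = timedelta(minutes=15)) -> bool:
    """True when the last successful push is older than the refresh window."""
    return now - last_push > max_age
```

Dropping invalid items silently is a design choice for the sketch; in production you would probably log them, since a product missing from the feed is a product the assistant cannot recommend.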

A data point I find relevant: according to OpenAI, the specialized GPT-5 mini model for shopping queries reaches 52% accuracy on searches with multiple constraints, compared to 37% for standard ChatGPT Search. The difference comes largely from the structured data it receives from the feed. The more complete and accurate the information, the better it responds.

Google is going in the same direction. Its Universal Commerce Protocol (UCP) works through the existing Google Merchant Center and is designed so products can appear with a direct purchase option in Google AI Mode and Gemini. If you already have Merchant Center, preparing your feed for these AI surfaces is a natural extension of what you're already doing.

Operational pattern we observe at Elevam: among the e-commerce clients we work with, those who keep clean, updated product feeds with consistent Product + Offer + AggregateRating schema between their website and sales platforms tend to receive significantly better treatment from LLMs on transactional queries. Those with outdated or inconsistent feeds simply don't appear in those answers. We don't have a controlled experiment to assert causality, but the pattern is clear and repeated enough to take it seriously.
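The consistency we look for can be sketched as a small check between the JSON-LD on the product page and the corresponding feed item. The schema.org types and properties (Product, Offer, AggregateRating) are real; the feed keys `price` and `currency` and the sample product are assumptions made for the example.

```python
def product_jsonld(name: str, sku: str, price: float, currency: str,
                   in_stock: bool, rating_value: float, review_count: int) -> dict:
    """Build Product + Offer + AggregateRating markup as a plain dict."""
    availability = "https://schema.org/InStock" if in_stock else "https://schema.org/OutOfStock"
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "sku": sku,
        "offers": {
            "@type": "Offer",
            "price": f"{price:.2f}",
            "priceCurrency": currency,
            "availability": availability,
        },
        "aggregateRating": {
            "@type": "AggregateRating",
            "ratingValue": rating_value,
            "reviewCount": review_count,
        },
    }


def feed_mismatches(markup: dict, feed_item: dict) -> list[str]:
    """Compare page markup with a feed item (hypothetical keys: price, currency)."""
    issues = []
    markup_price = markup["offers"]["price"]
    feed_price = f"{feed_item['price']:.2f}"
    if markup_price != feed_price:
        issues.append(f"price: markup {markup_price} vs feed {feed_price}")
    if markup["offers"]["priceCurrency"] != feed_item["currency"]:
        issues.append("currency differs between markup and feed")
    return issues
```

A check like this can run in the same job that generates the feed, so a price change never reaches one channel without the other.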

For all companies: entity disambiguation, the problem nobody sees

This applies both to B2B and to e-commerce, services, SaaS or any company that wants AI to identify it correctly.

You can have the best product, the best content and the best website in the sector. But if AI doesn't have clear signals about who you are as an entity, it tends not to cite you. Not out of bad will, but out of caution: models avoid asserting things they're not sure about.

Entity disambiguation is the process by which an LLM decides which "thing" in the real world a name refers to. When someone asks ChatGPT about "Apollo", the model decides if they're talking about the space program, the Greek god or the sales platform. It resolves it by probability, with the available signals.

Gartner has estimated that traditional search volume could drop around 25% in 2026 as buyers — especially in B2B — migrate to AI assistants. If that estimate comes close to reality, brands not clearly identified as entities in Knowledge Graphs will progressively lose visibility, no matter how well they rank in classic SEO.

What you need for AI to identify you without ambiguity:

Schema Organization with as many recommended properties as apply to your case: founder, slogan, areaServed, numberOfEmployees, foundingDate. Google doesn't set a mandatory minimum, but its documentation recommends being as complete as possible within what's real and verifiable.
The sameAs property pointing to LinkedIn, Crunchbase, Wikipedia (if you have an article), verified social profiles. Each URL in sameAs acts as an identity confirmation signal for the Knowledge Graph.
Lexical consistency: the same brand name, address and corporate description on all platforms. If your website says "Elevam", on LinkedIn "Elevam Digital" and on Crunchbase "Elevam S.L.", AI can treat them as different entities. It seems a minor detail, but it fractures the signal.
Verified Knowledge Panel on Google, with a description aligned with your real positioning.
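The first three points in the checklist above can be sketched in a few lines. The schema.org properties used (`foundingDate`, `founder`, `sameAs`, and optional ones such as `slogan` or `areaServed`) are real; the company, founder and URLs are placeholders, and the helper names are invented for the example.

```python
import json


def organization_jsonld(name: str, url: str, founding_date: str, founder: str,
                        same_as: list[str], **optional) -> dict:
    """Build Organization markup; empty optional properties are left out."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "foundingDate": founding_date,
        "founder": {"@type": "Person", "name": founder},
        "sameAs": sorted(set(same_as)),  # de-duplicated profile URLs
    }
    data.update({key: value for key, value in optional.items() if value})
    return data


def inconsistent_names(canonical: str, profiles: dict[str, str]) -> dict[str, str]:
    """Return the platforms whose displayed name differs from the canonical one."""
    target = canonical.strip().casefold()
    return {platform: shown for platform, shown in profiles.items()
            if shown.strip().casefold() != target}


def as_script_tag(data: dict) -> str:
    """Wrap the markup in the script tag used to embed JSON-LD in a page."""
    body = json.dumps(data, indent=2, ensure_ascii=False)
    return f'<script type="application/ld+json">\n{body}\n</script>'
```

With the example from the text, `inconsistent_names("Elevam", {"website": "Elevam", "linkedin": "Elevam Digital", "crunchbase": "Elevam S.L."})` flags the LinkedIn and Crunchbase entries as the ones to align.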

As Gianluca Fiorelli points out in his analysis of the 2025 Google Search Console updates, Google is actively using its Knowledge Graph to map social profiles to a single corporate entity. If Search Console automatically detects your social channels, it's a signal that Google has successfully disambiguated your brand. That's verifiable and actionable.

Key fragment: Entity disambiguation conditions whether an LLM cites you or omits you. To solve it you need complete schema Organization, sameAs on multiple platforms, total lexical consistency and a verified Knowledge Panel. Without these signals, your brand is a URL among many, not a recognized entity.

If you have complex internal systems: what is MCP and when does it make sense

This section is relevant if your company has internal databases, own APIs or operational processes that could benefit from an AI agent accessing them. If your case is simpler, skip directly to the implementation order.

The Model Context Protocol (MCP) was launched by Anthropic in November 2024 as an open standard, and in December 2025 it was donated to the Agentic AI Foundation under the Linux Foundation, with support from OpenAI, Block and other companies. According to Gartner, 75% of gateway providers are expected to have MCP capabilities in 2026.

The protocol allows an AI agent to connect in a standardized way to your systems: databases, APIs, business tools. Unlike RAG — which is essentially unidirectional: retrieves data to feed the model — MCP is bidirectional. The agent can query inventory, update order statuses, read support tickets or assign priorities. This changes the nature of what an agent can do in a business context.

But adoption isn't trivial. At RSA Conference 2026 multiple sessions on MCP security risks were presented: overpermissions, prompt injection through tools, data leakage from weak access controls. There's real value, but also real risk if it's not well governed.
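To make both the bidirectionality and the over-permission risk tangible, here is a simplified, standard-library-only sketch. It mimics the JSON-RPC 2.0 `tools/call` request shape that MCP uses, but it is not the official SDK, and the tool names, the in-memory data and the read/write scope model are assumptions made for illustration.

```python
import json

# In-memory stand-ins for real systems (hypothetical data).
INVENTORY = {"sku-1": 12}
ORDERS = {"ord-9": "pending"}


def get_stock(sku: str) -> str:
    """Read-only tool: report units in stock."""
    return str(INVENTORY.get(sku, 0))


def set_order_status(order_id: str, status: str) -> str:
    """Write tool: change an order's status."""
    ORDERS[order_id] = status
    return f"{order_id} -> {status}"


# Each tool declares the scope it needs, so access can be granted narrowly.
TOOLS = {
    "get_stock": (get_stock, "read"),
    "set_order_status": (set_order_status, "write"),
}


def handle(request_json: str, granted_scopes: set[str]) -> dict:
    """Dispatch a tools/call request, refusing tools outside the granted scopes."""
    request = json.loads(request_json)
    base = {"jsonrpc": "2.0", "id": request.get("id")}
    if request.get("method") != "tools/call":
        return {**base, "error": {"code": -32601, "message": "Method not found"}}
    params = request.get("params", {})
    entry = TOOLS.get(params.get("name"))
    if entry is None:
        return {**base, "error": {"code": -32602, "message": "Unknown tool"}}
    func, scope = entry
    if scope not in granted_scopes:
        denied = [{"type": "text", "text": "permission denied"}]
        return {**base, "result": {"content": denied, "isError": True}}
    text = func(**params.get("arguments", {}))
    return {**base, "result": {"content": [{"type": "text", "text": text}], "isError": False}}
```

An agent granted only `{"read"}` can query inventory but cannot change an order, which is the kind of governance decision worth making before any real connection exists.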

My strategic reading: MCP isn't for every company today. If your revenue is around €2M and you have a small technical team, your priority is schema + llms.txt + product feeds. But if you're in the €10M-20M range with distributed internal systems, you should be evaluating MCP now: not necessarily implementing it, but understanding what internal data an AI agent might need and with what governance. The competitive advantage isn't just in having AI agents; it's in them operating with your data, in real time, with real control.

Translation to GEO: Within the HSA Protocol we apply at Elevam, the evaluation of data maturity for AI is one of the first diagnostic points. Before deciding which protocol to implement, you need to know what data you have, what state it's in and which is strategic for AI.

What's the correct implementation order?

This is what we recommend to clients we advise on GEO, ordered by impact and effort. It's not a universal recipe, but it works as a reasonable starting point for most companies in the €1M-20M range:

| Order | Action | Estimated effort | What it involves |
| --- | --- | --- | --- |
| 1 | Schema Organization + Person + sameAs | 1-2 weeks | Complete on web and external platforms. Entity disambiguation. Foundation for everything else. |
| 2 | Implement llms.txt | Less than 1 day | Curated map of key content at the root. Preparation for emerging standard. Cost near zero. |
| 3 | Schema Product + Offer + advanced FAQ | 2-4 weeks | JSON-LD on product/service pages. Improves visibility in AI Overviews, ChatGPT and Perplexity. |
| 4 | Product feed (if applicable) | 3-6 weeks | E-commerce only. ChatGPT Commerce + Google Merchant Center. Direct transactional channel in LLMs. |
| 5 | MCP evaluation (if applicable) | 1-week assessment | Requires technical team and governance. Medium-large companies with complex internal systems. |

Within Elevam's CREF© methodology, this sequence fits in the Content pillar (data as structured content for AI) and in the Reputation/Entity pillar (disambiguation signals as an authority asset). They're not isolated actions: they're part of a system.

Strategic conclusion

What's happening with first-party data and AI bears a reasonable resemblance to what happened with mobile in 2012. Everyone knew it was coming, but most waited for it to "stabilize" before moving. Those who moved early didn't just win traffic: they built a structural advantage that others took years to close. I'm not saying the analogy is exact, but the dynamic is similar: available tools, standards in formation and massive adoption that hasn't happened yet.

Key fragment: The first-party data layer for LLMs isn't a one-off technical project. It's a strategic infrastructure that directly influences whether AI includes you in its answers or omits you. The companies that build it with good judgment before it's obvious to everyone will have an advantage that is hard to close.

Next reading

If you want to understand how to measure your current visibility in AI assistants and which first-party data has the most impact, start by reviewing the Elevam GEO Glossary to align terminology, and check our approach at the AI and GEO hub. If you need a concrete assessment of the state of your data and of how AI sees your entity, the HSA Protocol is the starting point.

Shall we work together?

If you want to apply this in your company with a team that combines technical SEO, GEO and paid acquisition measured against the income statement, request a no-commitment audit. You can also check real case studies or read the public GEO baselines that Elevam Labs publishes every quarter.

By

Asier López Ruiz

March 15, 2026 · 13 min
