AI Visibility

58% of AI Brand Mentions Come from Training Data. You Cannot Optimize What You Did Not Train.

Your website optimization addresses less than half the problem. A Geometriqs study reveals where AI brand mentions actually come from.

Presenc AI Research Team

January 15, 2026 · 7 min read

Most marketers assume AI assistants work like search engines: publish content, get indexed, get found. That assumption is wrong.

A 2025 study by Geometriqs analyzed how 80 of the world's largest companies appear in AI-generated responses across three major language models. The most surprising finding: 58% of all brand mentions came from the model's training data, not from live web retrieval. The remaining 42% came through RAG (retrieval-augmented generation), where the model pulls from live sources during a conversation.

This changes how you should think about AI visibility. If more than half of brand mentions are baked into the model before anyone asks a question, your current website optimization is addressing less than half the problem.

The Numbers Are Worse Than You Think

The Geometriqs Global80 report found the average AI visibility rate across 80 major companies was just 4.1%. One in five companies received zero mentions. Not low mentions. Zero.

The top 5 companies (Microsoft, Apple, Amazon, Unilever, and Alphabet) captured over 50% of all AI-generated brand mentions. That is a winner-take-most dynamic on a scale that even Google's search results don't produce.

And the industry breakdown is stark. Technology brands accounted for 39.5% of all mentions. Finance companies: 2.9%. Energy: 1.8%. Some of the world's most valuable companies, by any financial metric, are invisible to AI.

If this is happening to the Global 80, imagine the numbers for mid-market SaaS companies.

Why Training Data Matters More Than Your Website

When you ask ChatGPT a question, the model answers first from its internal knowledge, built from the text it was trained on. If it can produce a confident answer from that training data, it may not retrieve live information at all.

Brands with strong historical presence in training corpora (Wikipedia, Reddit, Stack Overflow, academic papers, Common Crawl snapshots) have a structural advantage. They're already in the model's memory. Everyone else needs to earn their way in through the RAG layer, which is smaller, more selective, and harder to influence.

Think of it this way. SEO is about making your website visible to crawlers. GEO (generative engine optimization) is partly about that too (the RAG side), but it's also about whether your brand already exists in the model's memory. And that memory was determined by training data collected months or years ago.

This is why a company ranked #8 on Google can appear in 73% of ChatGPT responses while the company ranked #1 gets nothing. The #8 company had a Wikipedia page, community discussions, and media citations baked into the training data. The #1 company had great on-page SEO and not much else.

What You Can Do About Training Data Visibility

The frustrating part: you can't retroactively change what's in a model's training data. But you can influence what goes into the next training cycle, and you can build your presence in the sources that training datasets tend to include.

These sources are surprisingly consistent across models:

  • Wikipedia (if your brand is notable enough for an article)
  • Wikidata (structured, machine-readable brand data)
  • Reddit and Stack Overflow discussions (community mentions carry weight)
  • Academic and industry papers that reference your product
  • Crunchbase and similar directories with structured company data

The common thread: these are all third-party sources. Not your website. Not your marketing blog. Places where other people and institutions talk about you.

This is a meaningful shift for marketing teams used to focusing on owned content. In the AI era, earned presence in the broader knowledge ecosystem may matter more than anything on your own domain.

The RAG Side of the Equation

The other 42% of mentions come through live retrieval. Here, your website does matter, but not in the way traditional SEO has taught you to think about it.

RAG systems don't crawl the entire web at query time. They retrieve from a search index or a curated set of trusted sources. Your content needs to be technically accessible to AI crawlers (clean HTML, proper schema.org markup, heading hierarchy) and semantically relevant to the query being asked.
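To make "proper schema.org markup" concrete, here is a minimal sketch that builds a schema.org Organization object as JSON-LD and wraps it in the script element used to embed it in a page. The brand name, URLs, and the helper names organization_jsonld and as_script_tag are hypothetical, chosen only for illustration. Which properties a given retrieval system actually reads is not something the Geometriqs study specifies.

```python
import json


def organization_jsonld(name, url, same_as, description=""):
    """Build a schema.org Organization object and return it as a JSON-LD string."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        # sameAs points at third-party pages that describe the same entity.
        "sameAs": list(same_as),
    }
    if description:
        data["description"] = description
    return json.dumps(data, indent=2)


def as_script_tag(jsonld):
    """Wrap a JSON-LD string in the script element used to embed it in HTML."""
    return '<script type="application/ld+json">\n' + jsonld + "\n</script>"


# Hypothetical brand and placeholder URLs, for illustration only.
example = organization_jsonld(
    name="Example Co",
    url="https://www.example.com",
    same_as=["https://www.example.org/profiles/example-co"],
)
print(as_script_tag(example))
```

The sameAs list is where the two layers meet: it ties your own domain to the third-party profiles that describe the same brand.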

This is where a structured audit approach helps. We use a 6-Factor AI Visibility Framework that splits the problem across both layers. Two factors (RAG Fetchability and Contextual Integrity) directly address the retrieval layer. The other four (Knowledge Presence, Semantic Authority, Entity Linking, Citations & Mentions) largely address the training data side.

Both layers matter. But most brands are only thinking about one of them, if they're thinking about AI visibility at all.

Building a Strategy That Addresses Both Layers

A practical approach:

Start by auditing your training data presence. Search for your brand name across Wikipedia, Wikidata, Crunchbase, GitHub (if applicable), and major industry forums. If you're absent from most of these, you have a training data problem that no amount of website optimization will fix.
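One piece of that audit can be scripted. The sketch below builds a query for Wikidata's public wbsearchentities API and reduces a response to the fields that matter for a presence check. The sample payload is made up, the helper names are hypothetical, and the User-Agent string is a placeholder you would replace with your own contact details. The network call is defined but not run here.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(brand, language="en", limit=5):
    """Return a wbsearchentities URL that searches Wikidata for a brand name."""
    params = {
        "action": "wbsearchentities",
        "search": brand,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urllib.parse.urlencode(params)


def summarize_matches(payload):
    """Reduce an API response to (id, label, description) tuples."""
    return [
        (item.get("id"), item.get("label", ""), item.get("description", ""))
        for item in payload.get("search", [])
    ]


def search_wikidata(brand):
    """Live lookup. Needs network access; not called in this sketch."""
    request = urllib.request.Request(
        build_search_url(brand),
        # Placeholder User-Agent; replace with your own tool name and contact.
        headers={"User-Agent": "brand-presence-audit/0.1 (you@example.com)"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return summarize_matches(json.load(response))


# Made-up payload in the shape the API returns, so the example runs offline.
sample_payload = {
    "search": [
        {"id": "Q00000000", "label": "Example Co", "description": "software company"}
    ]
}
print(build_search_url("Example Co"))
print(summarize_matches(sample_payload))
```

An empty result list for your brand name is a reasonable first signal that you are missing from at least one of the sources listed above, though a name match alone does not confirm the entity is actually your company.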

On the RAG side, run your website through a structured data check. Verify your robots.txt allows AI crawlers. Test whether your content structure gives retrieval systems clean, parsable chunks of information.
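The robots.txt check is easy to automate with Python's standard library. This is a minimal sketch, assuming the crawler tokens below are the ones you care about. They are commonly cited AI crawler tokens, but operators change them, so verify each against the operator's own documentation. The sample robots.txt and the page URL are made up.

```python
from urllib.robotparser import RobotFileParser

# Assumed list of AI crawler tokens; check each operator's documentation.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "CCBot"]


def ai_crawler_access(robots_txt, page_url, agents=AI_CRAWLERS):
    """Return {agent: True/False} for whether each agent may fetch page_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, page_url) for agent in agents}


# Made-up robots.txt that blocks one AI crawler and allows everyone else.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

report = ai_crawler_access(SAMPLE_ROBOTS, "https://www.example.com/pricing")
for agent, allowed in report.items():
    print(agent, "allowed" if allowed else "blocked")
```

To check a live site, fetch its /robots.txt, pass the text to ai_crawler_access, and test a handful of your most important URLs rather than only the homepage.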

And keep measuring. AI visibility is not a one-time fix. Models get retrained. RAG sources change. Competitors optimize. A quarterly audit is the minimum; monthly is better.
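For the measuring itself, one simple number to track between audits is the share of saved AI responses that mention your brand. This sketch assumes that definition. It is not necessarily how the Geometriqs report computes its visibility rate, and the responses below are made up for illustration.

```python
import re


def mention_rate(responses, brand):
    """Fraction of responses mentioning the brand as a whole word, ignoring case."""
    if not responses:
        return 0.0
    pattern = re.compile(r"\b" + re.escape(brand) + r"\b", re.IGNORECASE)
    hits = sum(1 for text in responses if pattern.search(text))
    return hits / len(responses)


# Made-up responses standing in for saved answers to a fixed set of prompts.
sample_responses = [
    "Popular options include Example Co and Other Corp.",
    "Many teams choose Other Corp for this.",
    "Example Co is often recommended for small teams.",
    "It depends on your budget.",
]
print(f"{mention_rate(sample_responses, 'Example Co'):.0%}")
```

Keeping the prompt set fixed from one audit to the next is what makes the number comparable over time.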

The Bottom Line

The 58% stat from the Geometriqs study is the single most important number in GEO right now. For more than half of all brand mentions in AI, the outcome was decided before the user even asked their question.

Your job is to make sure the next training cycle includes your brand in the right context. And in the meantime, make sure the RAG layer can find you too.

See where your brand stands with Presenc AI.

Measure your brand's presence across training data sources and live retrieval layers. Benchmark against competitors. Get a clear picture of what's working and what's missing.
