Jacek Białas
Information gain in the age of AI
The digital information ecosystem stands at a precipice of transformation that is arguably more significant than the introduction of the hyperlink. For the past twenty-five years, the fundamental contract of the web was navigational. Users queried a search engine, and the engine provided a list of locations where the answer might reside. This model, predicated on lexical matching and link graph authority, is rapidly eroding in favor of a new paradigm driven by generative AI and large language models (LLMs). In this emerging environment, the metric of success is no longer relevance, which has become a commodity, but information gain.
Information gain represents a fundamental restructuring of how value is assigned to digital content. It moves beyond the simplistic matching of keywords to a sophisticated analysis of novelty, entropy, and the reduction of user uncertainty. As search engines evolve into answer engines, they are no longer satisfied with retrieving ten documents that say the same thing. The algorithms powering Google, Bing, and emerging platforms like Perplexity are now designed to identify and prioritize the single document that adds net new knowledge to the user’s existing context. This shift is not merely a feature update; it is a survival mechanism for search engines facing a deluge of synthetic, homogenized content.
The theoretical physics of search optimization
To truly comprehend the trajectory of modern search, one must look past the user interface and into the mathematical principles that govern information retrieval. The concept of information gain is not a recent invention of Google engineers; it is a fundamental concept rooted in the information theory established by Claude Shannon in the mid-20th century. Understanding these physical laws of information is essential for predicting how search algorithms will evolve.
Shannon entropy and the measurement of surprise
At the core of information theory lies the concept of entropy. In thermodynamics, entropy is a measure of disorder. In information theory, however, it serves as a measure of uncertainty or “surprisal.” Shannon defined information not as the content of a message, but as the resolution of uncertainty. If an outcome is entirely predictable, communicating it conveys zero information. Conversely, if an outcome is highly improbable, communicating it conveys a significant amount of information.
The mathematical definition of Shannon entropy ($H$) for a discrete random variable $X$ is given by the formula:

$$H(X) = -\sum_{i} P(x_i) \log_2 P(x_i)$$

In this equation, $P(x_i)$ represents the probability of a specific outcome occurring. This formula has profound implications for SEO. Consider a Search Engine Results Page (SERP) as a system of messages. If a user queries “how to boil an egg,” and the top ten results all provide the exact same instructions, the probability ($P$) of encountering that specific set of instructions approaches 1. Consequently, the entropy ($H$) of the result set approaches 0. The system provides no new information after the first result is consumed.
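A minimal sketch of this calculation, treating each of the top ten results as a draw from a distribution of distinct answers (the answer labels and counts below are illustrative):

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Shannon entropy H(X) = -sum(p * log2(p)) over the distribution of labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

# Ten results that all give the same boiling instructions.
uniform_serp = ["boil-12-minutes"] * 10
# Ten results spread across genuinely different answers.
diverse_serp = ["boil-12-minutes"] * 4 + ["steam-method"] * 3 + ["sous-vide-63C"] * 2 + ["pressure-cooker"]

print(shannon_entropy(uniform_serp))   # ~0 bits: fully predictable result set
print(shannon_entropy(diverse_serp))   # ~1.85 bits: each extra click can still surprise
```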
Search engines are increasingly modeling the web as a probability distribution. Low entropy content is content that is predictable, redundant, and uniform. It confirms what is already known but adds nothing new. High entropy content, by contrast, is unpredictable. It occupies the “tails” of the probability distribution. It offers a data point, a perspective, or a narrative that differs from the consensus. In a generative AI environment, where models are trained to output the most probable next token, high entropy content is the only content that possesses intrinsic value because it is the only content the model cannot generate itself.
Kullback-Leibler divergence in ranking
The mechanism by which search engines quantify this value difference is often based on Kullback-Leibler divergence (KL divergence), also known as relative entropy. KL divergence measures the statistical distance between two probability distributions. In the context of information retrieval, the first distribution represents the user’s current knowledge state (based on their search history and the documents they have already clicked). The second distribution represents the content of a candidate document.
The search engine’s objective is to maximize the KL divergence between the user’s state before and after viewing a document. If the divergence is low, the document is redundant. If the divergence is high, the document provides significant information gain. This mathematical framework explains why “unique value” is no longer a vague marketing recommendation but a technical requirement. A document that is semantically identical to the top-ranking pages has a vector representation that overlaps with the user’s existing knowledge state. To rank, a document must occupy a distinct region in the vector space, necessitating the inclusion of novel entities, data, or relationships that push the content vector away from the consensus cluster.
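As a rough sketch of that redundancy test, treat both the user’s prior reading and a candidate document as simple topic distributions; all numbers here are illustrative, not a reconstruction of any production system:

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """D_KL(P || Q): how far the candidate's distribution P diverges from the user's prior Q."""
    return sum(pi * math.log2(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Topic mix over, say, [pricing, setup, troubleshooting, benchmarks] -- illustrative numbers.
user_prior = [0.50, 0.30, 0.15, 0.05]   # what the user has already read
clone_doc  = [0.48, 0.32, 0.15, 0.05]   # paraphrases the consensus
novel_doc  = [0.10, 0.10, 0.20, 0.60]   # leads with original benchmark data

print(kl_divergence(clone_doc, user_prior))  # ~0.002 bits -> redundant
print(kl_divergence(novel_doc, user_prior))  # ~1.8 bits  -> high information gain
```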
Information foraging theory applications
The behavior of users in this environment is best described by Information Foraging Theory, developed by Peter Pirolli and Stuart Card. This theory posits that humans seek information using the same adaptive mechanisms that animals use to forage for food. Users attempt to maximize the rate of information gain per unit of interaction cost.
In the pre-AI era, the “cost of access” involved clicking a blue link, waiting for the page to load, and scrolling past ads to find the answer. The “information value” only needed to be moderate to justify this cost. However, AI Overviews and generative summaries have reduced the cost of accessing basic information to near zero. A user can now get a definition, a list of dates, or a simple instruction set instantly without clicking.
This radically alters the foraging calculation. Because the cost of the AI summary is zero, the “information value” required to entice a user to click a link must be exponentially higher. Users will only click if they perceive that the information gain of the external document significantly exceeds what the AI has already provided. This phenomenon creates a “barbell” distribution in content value. Simple, summarizable content (the middle) loses all traffic to the AI. Complex, experiential, and data-rich content (the edges) retains value because it offers a nutrient density that the zero-cost summary cannot replicate.
Algorithmic mechanics of information gain
The transition from theoretical physics to applied computer science is codified in the patent architecture of modern search engines. Specifically, Google’s patent US20200349181A1, titled “Contextual Estimation of Link Information Gain,” provides the blueprint for how these mathematical principles are executed in code.
The Google patent architecture
The patent describes a system that moves beyond static ranking signals like PageRank to a dynamic, state-dependent scoring model. The core innovation is the recognition that the relevance of a document is not absolute but relative to what the user has already consumed. The system operates through a sequential process of analysis and reranking.
First, the system identifies a set of documents relevant to the user’s query. It then analyzes the browsing history of the user to determine which documents or “knowledge clusters” the user has likely already viewed. These viewed documents form the baseline knowledge state. The system then evaluates the remaining unviewed documents, calculating an information gain score for each. This score represents the “additional information” included in the document beyond the information contained in the previously viewed set.
This scoring allows the search engine to dynamically reorder the SERP. A document that might be ranked #10 based on traditional keyword density or backlink authority could be promoted to #1 if it contains a specific piece of information that is missing from the documents the user has just clicked. This mechanism explicitly targets content redundancy. It penalizes the common SEO strategy of creating “Skyscraper” content that merely aggregates existing top results without adding new insight. Under this patent’s logic, a summary of the top 10 results has an information gain score of zero for a user who has already engaged with the topic.
The technical implementation of this scoring relies heavily on machine learning models and vector embeddings. Documents are not analyzed as strings of text but are converted into high-dimensional vectors representing their semantic meaning. The system calculates the semantic overlap between the vectors of the user’s viewed documents and the candidate documents.
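The patent does not publish code, but the reranking step it describes can be sketched roughly as follows, assuming each document has already been reduced to an embedding vector; the scoring function is a simplification for illustration, not the production system:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def information_gain_score(candidate_vec, viewed_vecs):
    """Score a candidate by how far it sits from everything the user has already consumed.
    1.0 = completely new territory, 0.0 = semantically identical to a viewed document."""
    if not viewed_vecs:
        return 1.0
    return 1.0 - max(cosine(candidate_vec, v) for v in viewed_vecs)

def rerank(candidates, viewed_vecs):
    """Reorder an already-relevant result set so novel documents rise to the top."""
    return sorted(
        candidates,
        key=lambda doc: information_gain_score(doc["vector"], viewed_vecs),
        reverse=True,
    )

# Usage sketch ('embed' is a placeholder for whatever embedding model is used):
# viewed = [embed(doc) for doc in user_history]
# results = rerank(relevant_candidates, viewed)
```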
| Feature | Traditional Scoring | Information Gain Scoring |
| --- | --- | --- |
| Input Data | Keywords, links, meta tags | Semantic vectors, user history |
| Comparison | Query vs. document | User state vs. document |
| Goal | Relevance matching | Novelty maximization |
| Outcome | Static ranking list | Dynamic reranking |
| Redundancy | Ignored (often rewarded) | Penalized heavily |
This vector-based approach enables the system to detect redundancy even when different words are used. If Document A describes “canine obedience training” and Document B describes “dog discipline techniques” using semantically similar concepts, the vectors will overlap. If the user reads Document A, the information gain score of Document B drops, regardless of its keyword optimization. This forces content creators to focus on semantic distance. To achieve a high information gain score, a document must introduce new vectors (concepts, data, entities) that are orthogonal to the vectors of the consensus content.
The role of entity extraction
A critical component of this architecture is entity extraction. The patent and subsequent research indicate that Google does not just compare text; it extracts named entities (people, places, concepts) and maps the relationships between them. The system constructs a mini-knowledge graph of the user’s session.
If the user has read articles connecting the entity “Tesla” to the entity “Batteries,” the system looks for documents that connect “Tesla” to different entities, such as “AI Robotics” or “Regulatory Challenges.” This entity expansion is a primary signal of information gain. Content that introduces new, valid relationships between entities enriches the user’s mental model and is therefore prioritized. This shifts the SEO focus from “keywords” to “knowledge graph construction,” where the goal is to map the unexplored edges of a topic.
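One simplified way to picture this entity-expansion signal: treat the session as a set of entity-pair edges and score each candidate by the edges it would add. Entity extraction itself is assumed to happen upstream (for example via an NER model); the entities below are taken from the example above:

```python
from itertools import combinations

def session_edges(viewed_docs_entities):
    """Build the mini knowledge graph of the session: every co-occurring entity pair."""
    edges = set()
    for entities in viewed_docs_entities:
        edges |= {frozenset(pair) for pair in combinations(sorted(entities), 2)}
    return edges

def new_edge_count(candidate_entities, known_edges):
    """How many entity relationships would this document add to the session graph?"""
    candidate_edges = {frozenset(p) for p in combinations(sorted(candidate_entities), 2)}
    return len(candidate_edges - known_edges)

session = session_edges([{"Tesla", "Batteries"}, {"Tesla", "Gigafactory"}])
print(new_edge_count({"Tesla", "Batteries"}, session))                    # 0 -> redundant
print(new_edge_count({"Tesla", "AI Robotics", "Regulation"}, session))    # 3 -> expands the graph
```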
The crisis of synthetic content and model collapse
The urgency behind the adoption of information gain metrics is driven by the existential threat of synthetic content. The democratization of generative AI has lowered the marginal cost of content creation to near zero, leading to a flood of machine-generated text that threatens to degrade the quality of the web.
The mechanics of model collapse
Model collapse is a degenerative process that occurs when generative models are trained on data produced by previous generations of models. LLMs act as probabilistic compression engines. They analyze the statistical distribution of human language and output the most likely sequences of words. By definition, they sample from the center of the probability distribution, smoothing out the “tails” where rare, idiosyncratic, and highly specific data points reside.
When the web becomes saturated with AI-generated content, the training data for future models becomes this center-weighted, averaged output. The variance in the dataset decreases. The “tails” of the distribution are systematically erased. This leads to a homogenization of reality, often referred to as Model Autophagy Disorder (MAD).
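A toy simulation makes the variance loss visible: fit a normal distribution to the current “training data,” then generate the next generation only from that fit. The 0.9 shrink factor is an assumption standing in for the smoothing a generative model applies at each cycle:

```python
import numpy as np

rng = np.random.default_rng(42)

# Generation 0: "human" data with rich tails.
data = rng.normal(loc=0.0, scale=1.0, size=5000)

for generation in range(1, 6):
    # Each model only learns the mean and spread of the previous generation's output...
    mu, sigma = data.mean(), data.std()
    # ...and the next training set is sampled from that narrower, smoother estimate.
    # The 0.9 factor is an illustrative assumption for the tail-trimming per cycle.
    data = rng.normal(loc=mu, scale=sigma * 0.9, size=5000)
    print(f"generation {generation}: std = {data.std():.3f}")
# The spread shrinks every cycle: the statistical "tails" disappear from the training pool.
```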
For search engines, this presents a dire quality control problem. If a significant percentage of the web is synthetic, and that synthetic content is essentially a rehashing of existing consensus, the search index becomes bloated with “grey goo.” This is content that is grammatically correct and factually plausible but devoid of information gain. To maintain utility, search engines must aggressively filter for high entropy content that exhibits the statistical irregularities characteristic of human authorship.
Detecting the synthetic signature
Search algorithms are evolving to detect the statistical signatures of synthetic text. AI-generated content often exhibits “low perplexity” (it is very predictable) and “low burstiness” (it lacks sentence variation). Human writing, by contrast, is “spiky.” It contains abrupt shifts in tone, varied sentence structures, and idiomatic phrasing that defies statistical prediction.
Research indicates that Google and other platforms are down-ranking content that falls within the statistical “mean” of a topic. If a piece of content can be accurately predicted by a base model (i.e., the model can guess the next sentence with high accuracy), it offers no information gain. It is redundant to the model itself. To rank, content must surprise the model. It must contain low-probability sequences that validate as factually accurate. This creates a paradox for businesses using AI for content creation: the more you rely on the “default” output of an LLM, the more likely you are to be filtered out as noise.
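Burstiness has no single agreed formula, but a rough proxy is the variation in sentence length. A sketch, with the sample texts purely illustrative:

```python
import re
import statistics

def burstiness(text):
    """Rough burstiness proxy: coefficient of variation of sentence lengths.
    Machine-averaged prose tends toward uniform sentences; human prose is spikier."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths) / statistics.mean(lengths)

flat = "The update is important. The update changes ranking. The update affects sites."
spiky = ("We shipped it on a Friday. Bad idea. By Monday, three enterprise clients had "
         "opened tickets about the same obscure caching bug we thought nobody would ever hit.")

print(round(burstiness(flat), 2))   # ~0.0 -- uniform rhythm
print(round(burstiness(spiky), 2))  # ~1.0 -- varied sentence structure
```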
Pollution of the data pool
The implications of model collapse extend beyond ranking. It threatens the global knowledge ecosystem. As AI models ingest their own outputs, they reinforce their own biases and hallucinations, creating a closed loop of misinformation. This “AI-generated fog” makes it increasingly difficult for users to distinguish between original primary sources and synthetic derivatives.
In response, search engines are pivoting to prioritize provenance. They are seeking to identify the “Patient Zero” of a piece of information. The original study, the first-person account, and the raw dataset are valued exponentially higher than the derivative articles that summarize them. For content strategy, this means that aggregation is a dying business model. The only defensible position is that of the primary source creator, the entity that injects new, verified data into the pool before the AI can metabolize it.
Entity SEO and the semantic web structure
To operationalize information gain, one must master Entity SEO. The search ecosystem has transitioned from a lexical engine (matching strings) to a semantic engine (understanding things). This shift underpins the technical structure of modern optimization and is the mechanism by which machines understand novelty.
From strings to things: the knowledge graph
In the legacy era of search, Google matched the query “jaguar speed” to pages containing those character strings. Today, Google utilizes a Knowledge Graph to understand “Jaguar” as an entity. It resolves disambiguation based on context, distinguishing between the animal (Panthera onca), the luxury car manufacturer, and the Fender guitar model.
An entity is a distinct, well-defined concept that can be linked to other concepts in a graph structure. The Knowledge Graph is essentially a map of billions of these nodes and edges. For information gain, this is crucial because Google assesses value by identifying new relationships. If the graph already contains the edge “Jaguar (Animal) -> Eats -> Deer,” a new article stating this fact adds no value. However, an article that establishes a new, valid edge, such as “Jaguar (Animal) -> Population Dynamics -> Impact of Palm Oil,” provides measurable information gain. It expands the graph.
Semantic triples and machine readability
The fundamental unit of the Knowledge Graph is the semantic triple: Subject-Predicate-Object. For example, “Google (Subject) filed (Predicate) Patent US20200349181A1 (Object)”. AI models and search crawlers parse text to extract these triples. Content that is verbose, unstructured, or “fluffy” often obscures these triples, making it difficult for machines to extract facts and assign credit.
To optimize for AI readiness, content must be engineered to facilitate triple extraction. This involves using clear, declarative sentences and active voice. It means structuring data so that the relationships between entities are unambiguous.
- Low Extractability – “When we consider the various aspects of the new update, it seems that maybe the focus is shifting towards user experience.”
- High Extractability – “The Google Hidden Gems update prioritizes authentic user experiences.”
The latter sentence allows the engine to instantly log the connection: Update -> Prioritizes -> User Experience. This clarity increases the confidence score the engine assigns to the information, making it more likely to be cited in an AI Overview or knowledge panel.
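At its simplest, the machine-side view of this is a store of (subject, predicate, object) tuples: a fact only adds value if its triple is not already in the graph. A toy sketch, with triples hard-coded rather than extracted by an NLP pipeline:

```python
# Toy triple store: what the engine already "knows" about the topic.
knowledge_graph = {
    ("Google Hidden Gems update", "prioritizes", "authentic user experiences"),
    ("Google", "filed", "Patent US20200349181A1"),
}

def adds_information(triple, graph):
    """A candidate fact carries gain only if its triple is not already an edge in the graph."""
    return triple not in graph

print(adds_information(("Google Hidden Gems update", "prioritizes", "authentic user experiences"), knowledge_graph))  # False -> consensus
print(adds_information(("Google Hidden Gems update", "surfaces", "personal blogs"), knowledge_graph))                 # True  -> new edge
```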
Vector embeddings and semantic distance
Modern information retrieval relies on vector databases to store and compare these semantic units. In this model, every piece of content is converted into a vector, a long list of numbers representing its semantic position in a multi-dimensional space.
When a user searches, their query is also vectorized. The search engine looks for document vectors that are close to the query vector (relevance) but also vectors that are sufficiently distant from the vectors of documents the user has already seen (novelty). This creates a “Goldilocks zone” for content strategy. Content must be semantically close enough to the core topic to be relevant, but distinct enough in vector space to be unique.
Content that simply paraphrases the top-ranking results will have a vector representation that is nearly identical to those results. In a system designed for diversity, these “clone vectors” are discarded. To survive, a page must occupy a unique coordinate. This is achieved by introducing new vocabulary, data points, or thematic connections that shift the document’s vector representation away from the cluster of consensus content.
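This two-sided constraint is essentially what Maximal Marginal Relevance (MMR) formalizes: reward closeness to the query, penalize closeness to what the user has already seen. A compact sketch, with the weighting factor and vectors as placeholders:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def goldilocks_score(doc_vec, query_vec, seen_vecs, lam=0.6):
    """MMR-style score: stay close to the query (relevance) while staying far
    from everything already shown (novelty). lam balances the two terms."""
    relevance = cosine(doc_vec, query_vec)
    redundancy = max((cosine(doc_vec, v) for v in seen_vecs), default=0.0)
    return lam * relevance - (1 - lam) * redundancy
```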
Strategic frameworks for high entropy content
Understanding the theory of information gain is the foundation, but execution is what drives visibility. Professionals must adopt specific strategic frameworks that force the creation of high-entropy material, ensuring their content survives the filter of modern algorithms.
E-E-A-T and the primacy of experience
Google’s E-E-A-T guidelines serve as the qualitative filter for information gain. The addition of the second “E” for Experience was a direct strategic countermeasure to the rise of AI content.
AI has zero experience. It cannot taste food, feel the texture of a fabric, navigate a complex airport, or manage a difficult employee. It can only synthesize descriptions of these things based on training data. Therefore, first-hand experience is the most reliable form of information gain. It is the one signal that is currently un-fakeable by a machine.
- Strategic shift – content must move from "How to" to "How I",
- Implementation – instead of writing a generic guide on “Best Project Management Practices,” a firm should publish “How We Recovered a $2M Project Using Agile.” The specific details of that recovery, the meetings held, the errors made, the specific tools that failed, are high-entropy details that an AI cannot hallucinate convincingly.
The “Hidden Gems” algorithmic update
In late 2023, Google rolled out the Hidden Gems update, explicitly designed to surface content from forums, personal blogs, and social media. This was an admission that high-quality information often lives on low-authority domains that possess high authenticity.
This update signals that domain authority (DA) is no longer the sole gatekeeper of visibility. A low-DA site can outrank a heritage publisher if the former possesses a "gem" of information: a unique insight, a personal story, or a raw data point that the latter lacks. This democratizes search for small businesses and niche experts, provided they double down on the authenticity of their voice. It validates the strategy of being "unpolished but real" over "polished but generic."
Brand as a defensive moat
In an era where general information is a commodity provided freely by AI, brand becomes the only defensible moat. If a user asks an AI "How do I fix a leaky faucet?", the answer is a commodity. If the user asks "What does [Brand X] recommend for a leaky faucet?", the query is navigational and brand-specific.
Information gain builds this brand moat. By consistently providing unique data and perspectives, a brand teaches users (and search engines) that it is a primary source. Over time, this trains the user to seek the source rather than the topic. This is critical for B2B companies, where trust and authority are the primary drivers of conversion. The goal is to become the entity that the AI cites, rather than one of the generic results it summarizes.
The “unsummarizable” content strategy
Generative AI excels at summarization. If a 2,000-word article can be condensed into a 50-word bullet point without losing significant value, that article is “low density.” Unsummarizable content is content where the value lies in the nuance, the journey, or the specific details that cannot be compressed.
Strategies to create unsummarizable content include:
- narrative complexity – stories with non-linear structures, irony, or emotional arcs are difficult for AI to summarize without stripping away the core impact,
- high-resolution data – detailed charts, heatmaps, and raw data logs lose value when averaged into a text summary. The user must view the original file to gain the insight,
- interactive elements – tools, calculators, and dynamic visualizations cannot be fully captured by a text summary. These require user interaction, forcing a click-through,
- multimedia integration – embedding video timestamps or audio clips that correspond to specific text sections creates a multimodal experience that a text-only summary cannot replicate.
Technical architecture – schema markup and structured data
While the strategic side of information gain involves experience and nuance, the technical side involves explicit communication with machines. Schema markup (structured data) is the language used to tell search engines exactly what entities are on a page and how they relate. In the age of AI, schema is not optional; it is the API for the semantic web.
The necessity of structured data for AI citations
LLMs and search crawlers are probabilistic engines. They “guess” the meaning of text based on training patterns. Schema markup removes this guesswork. It turns unstructured text into structured data that machines can ingest with 100% confidence. This is vital for AEO (Answer Engine Optimization). If an AI is looking for a fact to cite in a generated answer, it will prioritize the source that explicitly labels that fact with schema over a source where the fact is buried in a dense, unstructured paragraph.
Dataset schema for original research
One of the most powerful yet underutilized schemas for establishing information gain is the Dataset schema. If a company conducts original research, they should not just publish a blog post summarizing the findings. They should wrap their data tables and findings in Dataset schema.
This informs Google that the page contains a structured dataset, making the content eligible for Google Dataset Search and signaling to AI models that this is “ground truth” data, not just opinion.
Implementation of Dataset Schema (JSON-LD):
```json
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "2025 State of AI Search Report",
  "description": "Comprehensive survey of 5,000 marketing professionals regarding the impact of generative AI on organic traffic.",
  "url": "https://www.example.com/research/ai-search-2025",
  "creator": {
    "@type": "Organization",
    "name": "Example Data Corp"
  },
  "variableMeasured": "Impact of generative AI on organic traffic",
  "distribution": {
    "@type": "DataDownload",
    "encodingFormat": "text/csv",
    "contentUrl": "https://www.example.com/research/data.csv"
  }
}
```
This code explicitly defines the variables measured and provides a link to the raw data, proving the existence of the research.
ClaimReview for authoritative fact-checking
In an ecosystem prone to hallucinations, validity is a form of information gain. The ClaimReview schema is utilized to explicitly fact-check a claim. While typically used by news organizations, businesses can leverage this to correct common misconceptions in their industry, establishing themselves as the arbiter of truth.
By publishing a page that explicitly “debunks” a popular industry myth using ClaimReview schema, a site positions itself as a high-authority node. This is highly attractive to AI models seeking to verify information before generating a response.
Implementation of ClaimReview Schema (JSON-LD):
```json
{
  "@context": "https://schema.org",
  "@type": "ClaimReview",
  "claimReviewed": "AI content is automatically penalized by Google algorithms.",
  "itemReviewed": {
    "@type": "CreativeWork",
    "author": {
      "@type": "Person",
      "name": "Various Industry Blogs"
    }
  },
  "reviewRating": {
    "@type": "Rating",
    "ratingValue": "1",
    "bestRating": "5",
    "alternateName": "False"
  },
  "author": {
    "@type": "Organization",
    "name": "SEO Authority Brand"
  }
}
```
Nesting and schema architecture
Advanced technical SEO involves nesting schemas to create a complete, page-level knowledge graph. A page shouldn’t just have an Article schema and a Product schema sitting side-by-side; they should be nested to show their relationship.
For example, an Article about a new software tool can contain a review property, which contains a Review object, which refers to a Product object (the software), which has an offers property. This connected graph gives AI models a complete, contextual understanding of the content, significantly increasing the likelihood of the content being used as a citation source in an AI overview.
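A minimal sketch of that nesting, written here as a Python dictionary serialized to JSON-LD; the tool name, rating, and price are placeholders:

```python
import json

nested_schema = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Hands-on with ExampleTool 3.0",  # placeholder headline
    "review": {                                    # the Review lives inside the Article...
        "@type": "Review",
        "reviewRating": {"@type": "Rating", "ratingValue": "4", "bestRating": "5"},
        "itemReviewed": {                          # ...and points at the Product it evaluates
            "@type": "Product",
            "name": "ExampleTool",
            "offers": {"@type": "Offer", "price": "49.00", "priceCurrency": "USD"},
        },
    },
}

# Emit the JSON-LD that would be embedded in a <script type="application/ld+json"> tag.
print(json.dumps(nested_schema, indent=2))
```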
The Article schema should also utilize the mentions and about properties to create hard links to entities, for example:

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "The Physics of Information Gain",
  "about": { "@type": "Thing", "name": "Information theory", "sameAs": "https://en.wikipedia.org/wiki/Information_theory" },
  "mentions": { "@type": "Thing", "name": "Claude Shannon", "sameAs": "https://en.wikipedia.org/wiki/Claude_Shannon" }
}
```
The sameAs property pointing to Wikipedia or Wikidata is critical. It acts as an “anchor,” grounding the content in the global Knowledge Graph and disambiguating the topics discussed.
Content engineering for Generative Engine Optimization (GEO)
Generative Engine Optimization (GEO) is the emerging discipline of optimizing content specifically to be cited by generative AI engines like ChatGPT, Perplexity, and Google’s AI Overviews. It requires a distinct approach to content structure and formatting.
The answer-first format and S.L.A.
AI models tend to prioritize information found at the beginning of a document or section. GEO requires an answer-first structure, often referred to as the inverted pyramid. This ensures that the core entity definitions and answers are immediately available for extraction.
A highly effective framework for this is the S.L.A. format:
- Summary – a 40-60 word direct answer at the top of the section. Use the “is” definition structure (e.g., “Information Gain is…”). This serves as “snippet bait” for the AI.
- Logic – a bulleted list or structured set of paragraphs explaining the “why” or “how.” AI models excel at parsing lists because they represent structured relationships.
- Analysis – the nuanced, human experience and examples that follow. This provides the depth required for the information gain score.
This structure allows the AI to easily extract the direct answer for its summary while retaining the depth required for long-form engagement metrics.
Semantic density and concreteness
LLMs favor concrete language over abstract language. A sentence like “We offer robust solutions for various problems” is semantically empty. It has high entropy in terms of interpretation (it could mean anything) but low information value. A sentence like “Our API handles 10,000 requests per second with 99.9% uptime” is semantically dense.
Position-Adjusted Word Count (PAWC) is a metric emerging in GEO research. It suggests that visibility depends not just on having the keyword, but on the density of relevant terms relative to their position in the text. Writers should strip away adjectives and adverbs (“fluff”) and focus on nouns and verbs (entities and actions). This increases the “information density” of the text, making it a more attractive source for an AI looking to conserve token usage while maximizing factual output.
Formatting for machine readability
AI models struggle with dense walls of text. They parse structure better than prose. To optimize for extraction, content engineers should leverage HTML semantic tags:
- lists – use <ul> and <ol> for steps or features. This explicitly tells the parser that the items are related,
- tables – use HTML <table> for comparisons. AI models are excellent at reading table rows and columns to extract relational data,
- headings – use clear, question-based H2s and H3s. Instead of "Benefits," use "What are the benefits of [the topic]?" This maps directly to the user's prompt structure.
Zero-click optimization strategy
The reality of GEO is that many users will not click through to the website. The zero-click search is becoming the norm. The strategy, therefore, must shift from “optimizing for clicks” to “optimizing for attribution.”
If a brand is cited in an AI overview, even without a click, it gains share of mind. The goal is to ensure that the AI attributes the information to the brand. This is achieved by:
- coining unique terms – naming a framework (e.g., “The Skyscraper Technique”) ensures that when the AI describes it, it must use the name,
- publishing original statistics – if a brand publishes “60% of marketers use AI,” the AI is likely to cite “According to…”
- entity consistency – ensuring that the brand entity is consistently defined across the web (Crunchbase, LinkedIn, Wikipedia) so the AI confidently associates the brand with the topic.
Measurement and the future of information retrieval
As the mechanisms of search change, so too must the metrics of success. Traditional SEO metrics like rankings and organic traffic are becoming less reliable indicators of performance as zero-click searches rise.
New KPIs for the AI era
To measure the impact of information gain strategies, organizations need new Key Performance Indicators (KPIs):
| Metric | Definition | Measurement Method |
| --- | --- | --- |
| Share of Model (SoM) | Frequency of brand mentions in AI responses for category queries. | Manual testing or specialized tools to query LLMs and track citations. |
| Information Gain Proxy | Engagement depth on long-form content. | High time-on-page and low bounce rates on deep content suggest value beyond the summary. |
| Entity Salience | The confidence score of the brand entity within a topic cluster. | Google Natural Language API analysis of content. |
| Citation Quality | Backlinks from high-entropy sources (research papers, universities). | Backlink analysis tools filtering for academic/forum domains. |
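For Share of Model, the "specialized tools" in the table boil down to something like the following loop; `ask_llm` is a placeholder for whichever model or answer engine you are auditing, and the prompts and brand names are illustrative:

```python
from collections import Counter

CATEGORY_PROMPTS = [
    "What are the best project management tools for agencies?",
    "Which project management tool should a 10-person startup use?",
]
TRACKED_BRANDS = ["ExampleBrand", "CompetitorA", "CompetitorB"]

def ask_llm(prompt: str) -> str:
    """Placeholder: call the LLM / answer engine being audited and return its response text."""
    raise NotImplementedError

def share_of_model(prompts, brands, runs_per_prompt=5):
    """Share of Model: how often each brand is mentioned across repeated category queries."""
    mentions = Counter()
    total_responses = 0
    for prompt in prompts:
        for _ in range(runs_per_prompt):
            response = ask_llm(prompt).lower()
            total_responses += 1
            for brand in brands:
                if brand.lower() in response:
                    mentions[brand] += 1
    return {brand: mentions[brand] / total_responses for brand in brands}
```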
Monitoring data decay
Information gain is temporal. What is “novel” today becomes “consensus” tomorrow as competitors copy it. Data decay is the process by which high-entropy content becomes low-entropy over time.
Companies must implement a “content refresh” cycle that goes beyond simple date updates. It involves re-injecting new data points, new examples, and updated perspectives to restore the information gain score of aging assets. A static library is a decaying asset; a dynamic library that evolves with the industry maintains its entropy.
The bifurcation of the web
The trajectory of search indicates a bifurcation of the web. One web will be for consumption by machines: a vast library of facts, definitions, and summaries, formatted in perfect schema and semantic triples. This web will feed the AI models. The other web will be for consumption by humans: a messy, chaotic, vibrant landscape of stories, opinions, and experiences. The model below illustrates this divergence, projecting the saturation of synthetic content over the next few years:
Figure 1. Projected saturation of synthetic content relative to human-authored content.
Information gain is the bridge between them. It is the mathematical proof that a piece of content contributes to the human web, which allows the machine web to value it. For professionals and developers, the “middle” is the death zone. There is no future in being a “slightly better” content farm. The future belongs to those who can generate net new information. This requires a cultural shift within organizations. Marketing teams must think less like advertisers and more like journalists or scientists. The mandate is to discover, experiment, and document reality, not just to aggregate existing content. By focusing on information gain, businesses align themselves with the fundamental mathematical incentives of the next generation of search engines. They become the signal in the noise.