
Jacek Białas

Holds a Master’s degree in Public Finance Administration and is an experienced SEO and SEM specialist with over eight years of professional practice. His expertise includes creating comprehensive digital marketing strategies, conducting SEO audits, managing Google Ads campaigns, content marketing, and technical website optimization. He has successfully supported businesses in Poland and international markets across diverse industries such as finance, technology, medicine, and iGaming.

Google Search API – A technical deep dive into ranking logic

Dec 9, 2025 | SEO

📑 Key Takeaways from the API Leak

If you don’t have time to analyze 2,500 pages of documentation, here are the 3 most important facts that reshape our understanding of SEO:

  • 1. Clicks are a ranking factor (End of Debate): The leak confirmed the existence of the NavBoost system, which uses user behavior data (clicks, returns to results, session time) to modify rankings. Google doesn’t just “read” your page; it primarily observes how users react to it – largely utilizing data from the Chrome browser.
  • 2. Website authority really exists: Contrary to years of denials from Google engineers, a siteAuthority variable exists in the code. This means domain strength (brand recognition) is a measurable signal in the algorithm. New domains find it much harder to break through this barrier (the “sandbox”), while major brands benefit from a systemic trust bonus.
  • 3. Quality is now content effort: Google has introduced metrics evaluating the effort put into creating content (the contentEffort variable). This is a direct response to the flood of cheap AI content. The algorithm looks for evidence of uniqueness, original research, and added value, demoting content that is merely derivative.

In May 2024, the search engine optimization industry gained unprecedented access to the internal mechanics of Google Search. An automated bot known as yoshi-code-bot inadvertently published over 2,500 pages of internal API documentation to a public GitHub repository. The leak, identified as the “Google Content Warehouse API,” reveals 14,014 attributes and modules that define how the world’s most dominant search engine processes, indexes, and ranks information.

Unlike previous leaks or patents, which were often theoretical or outdated, this documentation aligns with testimony provided during the United States v. Google LLC antitrust trial. It serves as a verified architectural blueprint of the system. This report provides a high-level technical analysis of these findings, focusing on AI readiness and ranking mechanics. It translates raw API definitions into actionable strategies for professionals who must navigate an ecosystem shifting from heuristic keyword matching to algorithmic intent modeling.

Infrastructure – The tiered indexing system

The leak confirms that Google does not store the web in a single monolithic index. Instead it utilizes a sophisticated tiered storage architecture managed by systems such as Alexandria and TeraGoogle. Understanding this physical segregation is critical for diagnosing indexing latency and visibility issues.

Flash memory and solid state drives

The documentation distinguishes between storage media based on content value. The most critical documents are stored in flash memory for instant retrieval. Less important content resides on solid-state drives (SSDs). The lowest tier, often referred to in the industry as “garbage” or “landfills,” stores stale content and pages with low access frequency on standard hard drives.

This physical hierarchy implies that content freshness and engagement velocity are not just ranking factors but survival factors. If a page fails to generate user interaction or lacks regular updates it risks being relegated to slower storage tiers where retrieval for long-tail queries becomes statistically less probable. The system aggressively deprioritizes content that does not justify the computational cost of high-speed storage.
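
The tier logic described above can be sketched as a simple decision function. The tier names come from the article; the thresholds for age and access frequency are invented for illustration and do not appear in the leaked documentation.

```python
def assign_storage_tier(age_days: int, monthly_accesses: int) -> str:
    """Toy tier assignment: fresh, frequently accessed documents earn
    fast storage; stale, rarely accessed ones fall to slower media.
    Thresholds are illustrative, not from the leaked documentation."""
    if age_days <= 30 and monthly_accesses >= 1000:
        return "flash"  # highest-value documents, instant retrieval
    if age_days <= 365 and monthly_accesses >= 10:
        return "ssd"    # mid-tier content
    return "hdd"        # stale, low-access "landfill" tier
```

The point of the sketch is the asymmetry: a document must keep earning its place in fast storage, or it drifts down the hierarchy.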

Composite doc – The digital DNA of a URL

At the core of Google’s data structure is the CompositeDoc. This is a protocol buffer that aggregates every signal associated with a specific URL. It is not merely a snapshot of the text on a page but a comprehensive “rap sheet” containing the document’s entire history.

The PerDocData module

Within the CompositeDoc the PerDocData module acts as the primary container for ranking signals. It stores attributes that are computed offline and attached to the document before it ever enters a live auction for a search query. This confirms that many ranking decisions are pre-computed rather than determined in real-time.

Key attributes found within this module include:

  • spamScore – a granular score indicating the likelihood of spam rather than a simple binary flag,
  • gibberishScore – a metric likely used to detect auto-generated content of low quality,
  • solidity – a measure of the document’s structural integrity and content robustness.
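
As a rough mental model, PerDocData can be pictured as a record of pre-computed scores attached to the document. The field names below are from the leak; their types, ranges, and the threshold check are assumptions for illustration — the real protocol buffer holds hundreds of fields.

```python
from dataclasses import dataclass

@dataclass
class PerDocData:
    """Sketch of a few PerDocData attributes named in the leak.
    Field semantics and ranges are assumptions."""
    spam_score: float       # graded spam likelihood, not a binary flag
    gibberish_score: float  # likelihood of low-quality auto-generated text
    solidity: float         # structural integrity / content robustness

    def looks_spammy(self, threshold: float = 0.7) -> bool:
        # Illustrative check: a high spamScore, computed offline,
        # could disqualify a document before any live query auction.
        return self.spam_score >= threshold
```

Because these values are attached offline, a page can lose a query auction it was never actually entered into.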

NavBoost – The primacy of user interaction signals

Perhaps the most significant confirmation from the leak is the existence and dominance of NavBoost. For years Google representatives downplayed the direct use of click data for ranking. The documentation and antitrust testimony now irrefutably confirm that user interaction signals are a pillar of the ranking algorithm.

NavBoost is a system that re-ranks results based on click logs. It does not just count clicks. It analyzes the quality and intent behind those clicks to adjust the information retrieval score.

The 13-month memory window

The system operates on a rolling window of data. The documentation specifies that NavBoost utilizes click data from the past 13 months. This 13-month cycle accounts for seasonality and ensures that pages must maintain consistent performance year-round. A drop in user satisfaction during a peak season will negatively impact the document’s scoring for the subsequent annual cycle.
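
A rolling window like this amounts to filtering the click log by a cutoff date. Here 13 months is approximated as 396 days; the exact boundary behavior is an assumption.

```python
from datetime import date, timedelta

def clicks_in_window(clicks: list[date], today: date,
                     window_days: int = 396) -> list[date]:
    """Keep only clicks inside a rolling window. 396 days approximates
    the 13 months named in the documentation."""
    cutoff = today - timedelta(days=window_days)
    return [c for c in clicks if c >= cutoff]
```

Clicks older than the window simply stop counting, which is why last year's peak-season performance ages out on schedule.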

Classification of clicks

NavBoost segregates user interactions into distinct categories:

  • goodClicks – a click followed by a long dwell time or successful task completion,
  • badClicks – a click followed by a rapid return to the search results page also known as pogo-sticking,
  • lastLongestClicks – the strongest signal of satisfaction. This occurs when a user clicks a result and does not return to Google to refine their query.
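
The three categories above can be sketched as a labeling function over session behavior. The category names come from the leak; the dwell-time thresholds and the "neutral" fallback are assumptions.

```python
def classify_click(dwell_seconds: float, returned_to_serp: bool,
                   last_click_in_session: bool) -> str:
    """Toy NavBoost-style click labeling. Thresholds are illustrative."""
    if returned_to_serp and dwell_seconds < 10:
        return "badClick"          # pogo-sticking back to the results page
    if not returned_to_serp and last_click_in_session:
        return "lastLongestClick"  # strongest satisfaction signal
    if dwell_seconds >= 30:
        return "goodClick"         # long dwell, likely task completion
    return "neutral"
```

The asymmetry matters: a badClick is not merely a non-vote, it is active evidence against the result.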

The implication is that user experience (UX) is a direct ranking factor. Technical SEO gets you into the index, but user satisfaction keeps you in the rankings. Optimization must focus on satisfying the query immediately to prevent badClicks.

Site authority – Validation of domain metrics

The SEO industry has long debated the existence of a “Domain Authority” metric. Google has historically denied using such a singular score. The leak however explicitly contains an attribute named siteAuthority located within the CompressedQualitySignals module.

This attribute suggests that Google does calculate a sitewide authority score that influences the ranking potential of individual pages. This score likely acts as a “credibility floor” for new content. When a high-authority domain publishes a new URL, it inherits a degree of trust that allows it to rank immediately, before page-level signals are accumulated. Conversely, a low siteAuthority acts as a dampener, requiring individual pages to work significantly harder to achieve visibility.
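
One way to picture the "credibility floor" and "dampener" roles together is a blend of page-level and site-level scores. The weighting below is entirely hypothetical; the leak names the siteAuthority attribute but not how it is combined with page signals.

```python
def effective_page_score(page_score: float, site_authority: float) -> float:
    """Hypothetical blend, both inputs in [0, 1]. siteAuthority acts as
    a floor for brand-new pages and a multiplier on weak domains."""
    floor = 0.5 * site_authority                   # inherited day-one trust
    blended = max(page_score, floor)               # new URLs start at the floor
    return blended * (0.5 + 0.5 * site_authority)  # dampen low-authority sites
```

Under any weighting of this shape, identical content performs differently depending on the domain that hosts it — which matches what the sandbox section below describes for new hosts.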

The sandbox effect

Closely related to authority is the hostAge attribute. The documentation notes that this attribute is used “to sandbox fresh spam in serving time.” This provides technical validation for the “Sandbox Effect” observed by SEOs where new domains struggle to rank for competitive keywords regardless of content quality. The system applies a probationary period to new hosts until sufficient trust signals are established.

Chrome data – The browser as a sensor

Another controversial revelation is the use of data from the Chrome browser in ranking calculations. The attribute ChromeInTotal indicates that Google tracks site-wide views and traffic through its browser ecosystem. This contradicts years of public statements claiming that Chrome data is not used for search ranking.

This means that total traffic volume matters. A site that generates significant direct traffic or traffic from other channels (social media or email) and has users browsing via Chrome is feeding positive data into the ChromeInTotal signal. This holistically boosts the domain’s authority profile. SEO can no longer be isolated from broader marketing channels. Traffic acquisition from verified users helps validate a site’s legitimacy.

Content effort – Optimization for the AI era

The leak provides insight into how Google is adapting to the explosion of generative AI content. The attribute contentEffort is described as an “LLM-based effort estimation for article pages.” This indicates that Google uses Large Language Models to algorithmically assess the amount of work, expertise and nuance present in a piece of content.

The mechanism of effort scoring

This metric likely functions as a countermeasure to “thin” or programmatic content. The system parses the text to identify:

  • Originality: Measured by originalContentScore.   
  • Depth: The presence of unique insights rather than summarized facts.
  • Multimedia: The inclusion of original images and video assets.

For AI readiness this is the most critical finding. Using AI to generate generic content results in a low contentEffort score. To future-proof content strategies, creators must use AI as a tool for leverage, not replacement. The “human in the loop” who adds unique data, personal anecdotes, or expert analysis is essential to achieve a high effort score.
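
The real contentEffort estimator is an LLM, not a formula, but the idea can be approximated with a crude proxy that rewards exactly the signals listed above — originality, proprietary data, original media, and first-hand expertise. All weights here are invented for illustration.

```python
def effort_proxy(has_original_data: bool, original_images: int,
                 expert_quotes: int, originality_score: float) -> float:
    """Crude proxy for the idea behind contentEffort: reward signals
    that are hard to fake with generic generation. Weights are invented;
    Google's actual estimator is LLM-based."""
    score = 0.4 * originality_score             # cf. originalContentScore
    score += 0.2 if has_original_data else 0.0  # proprietary research
    score += min(original_images, 5) * 0.04     # original media, capped
    score += min(expert_quotes, 4) * 0.05       # first-hand expertise
    return min(score, 1.0)
```

Note that word count appears nowhere in the proxy: a long but derivative article scores no better than a short one.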

Twiddlers – The re-ranking engine

Ranking is not a static process. The leak details the existence of Twiddlers. These are re-ranking functions that intervene after the initial document retrieval (Mustang) but before the results are served to the user.   

How Twiddlers function

Twiddlers act as filters and boosters. They can demote a result that has a high relevance score but low quality signals. Or they can boost a result based on real-time data.

  • QualityBoost – adjusts ranking based on core quality signals,   
  • RealTimeBoost – likely responsible for “Query Deserves Freshness” (QDF) adjustments boosting breaking news,  
  • NavBoost – as mentioned applied here to re-order based on click history.

Twiddlers explain why rankings can fluctuate wildly without changes to the page itself. A page might pass the initial relevance filter but be struck down by a specific Twiddler (e.g. a demotion for poor location alignment) in the final millisecond.
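
The retrieve-then-twiddle flow can be sketched as a pipeline of score-adjusting functions applied after initial scoring. The system names mirror the leak; the scoring math and document fields are illustrative.

```python
from typing import Callable

# A twiddler takes (doc, score) and returns an adjusted score.
Twiddler = Callable[[dict, float], float]

def quality_boost(doc: dict, score: float) -> float:
    # Adjust by a core quality signal (illustrative math).
    return score * (0.5 + doc.get("quality", 0.5))

def navboost(doc: dict, score: float) -> float:
    # Demote documents with a history of pogo-sticking.
    return score * 0.5 if doc.get("bad_click_rate", 0) > 0.5 else score

def rerank(docs: list[dict], twiddlers: list[Twiddler]) -> list[dict]:
    """Sketch: Ascorer assigns initial_score, then each twiddler
    adjusts it before the results are served."""
    for doc in docs:
        score = doc["initial_score"]
        for twiddle in twiddlers:
            score = twiddle(doc, score)
        doc["final_score"] = score
    return sorted(docs, key=lambda d: d["final_score"], reverse=True)
```

This structure is why a page with the best initial relevance score can still lose: any single twiddler in the chain can overturn the order.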

How Google ranking works (Simplified)

User Query → Initial Ranking (Ascorer) → Twiddlers (Re-shuffling) → Final Result

Authorship and entity recognition

The documentation reinforces the shift from strings to things. Google explicitly tracks authors as entities. Attributes like isAuthor and authorReputation confirm that the identity of the content creator is a ranking signal.

This validates the E-E-A-T framework. Google attempts to map authors to known entities in its Knowledge Graph. If an author has a high entityConfidenceScore regarding a specific topic their content is algorithmically privileged. This necessitates a strategy of explicit authorship: robust bio pages, schema markup and cross-platform verification of identity.
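
The "explicit authorship" recommendation above is usually implemented as schema.org markup. Below is a minimal sketch that emits Article/Person JSON-LD; the names and URLs are placeholders, and the sameAs links are what helps map the author to a Knowledge Graph entity.

```python
import json

def author_jsonld(name: str, bio_url: str, same_as: list[str]) -> str:
    """Minimal schema.org markup tying an article to an author entity.
    Example values are placeholders, not a guaranteed ranking recipe."""
    markup = {
        "@context": "https://schema.org",
        "@type": "Article",
        "author": {
            "@type": "Person",
            "name": name,
            "url": bio_url,     # link to a robust bio page
            "sameAs": same_as,  # cross-platform identity verification
        },
    }
    return json.dumps(markup, indent=2)
```

The sameAs array should point at profiles the author verifiably controls, so the entity graph can connect them.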

Whitelists and exception handling

The system is not purely algorithmic. The leak reveals the existence of manual or semi-manual whitelists for sensitive topics. Attributes such as isElectionAuthority and isCovidLocalAuthority indicate that for “Your Money or Your Life” (YMYL) topics Google bypasses standard ranking signals to ensure that only pre-approved authoritative sources (like government sites or major news outlets) appear at the top.

This has immense strategic implications for sites in these verticals. Ranking for broad “crisis” keywords may be algorithmically impossible for non-whitelisted domains. Strategies should instead pivot to long-tail queries where these rigid overrides may not apply.

Small personal website – A potential boost?

An attribute named smallPersonalSite was found in the codebase. While its exact function is debated it suggests that Google has a specific classifier for independent blogs and small business sites.

In the context of recent “helpful content” updates, this could be a mechanism to promote “hidden gems,” meaning authentic human voices that offer personal experience. Alternatively, it could be a containment label. However, given the push for first-hand experience, it is likely used to identify content that deserves a unique visibility path distinct from large commercial publishers.

Strategic roadmap – Future-proofing SEO

Based on these technical revelations the following strategic protocol is recommended to ensure AI readiness and ranking stability:

  1. Optimize for interaction not just keywords – the dominance of NavBoost means that a high ranking with a low click-through rate or high pogo-sticking rate is temporary. Titles and meta descriptions must be optimized for click magnetism without being misleading. Content structure must answer the query “above the fold” to secure goodClicks.
  2. Manufacture “Effort” – to combat the devaluation of AI content every page must demonstrate contentEffort. This is not about word count. It is about information gain. Include proprietary data, original photography (which has its own quality scores) and expert quotes that cannot be hallucinated by an LLM.
  3. Build a defensible brand entity – the siteAuthority and ChromeInTotal metrics reward brands that exist outside of search. Drive traffic from newsletters, social media and direct channels. A diversified traffic profile builds the “trust” signal that immunizes a site against algorithmic volatility.
  4. Audit for demotion vectors – the leak identifies specific demotion attributes such as anchorMismatchDemotion (irrelevant inbound link anchors) and exactMatchDomainDemotion. Clean up link profiles and ensure that internal linking uses semantically relevant anchors. Avoid over-optimization of exact match domains.
  5. Technical hygiene as a prerequisite – with the tiered indexing system (TeraGoogle), slow or technically flawed sites risk being stored on slower hardware. Ensure Core Web Vitals are optimized not just for a minor ranking boost but to ensure the document qualifies for the high-priority “Flash” storage tier.

In conclusion, the “black box” is now transparent. Google ranks documents based on a composite score of content effort, entity authority, and user satisfaction. The algorithm is designed to mirror human preference. Therefore, the most effective SEO strategy is no longer about tricking the bot but about convincing the user.
