Google Search’s API Documentation Leak

In recent weeks, the SEO community has been actively discussing the leaked Google Search API documentation. While these revelations don't necessarily change our daily SEO strategies, many of which are already designed to optimize toward these signals directly or indirectly, they have highlighted several contradictions from Google's PR teams over the years.
Thousands of documents were leaked, and while I have dug into the documentation and reviewed media coverage of the release, I can't claim to have read every page.
In this post, we'll break down the most notable insights we've gathered and provide high-level takeaways on what stood out the most.
How Did The Google Leak Happen?
Originally pushed to GitHub on March 13 by a bot called yoshi-code-bot, the documentation was shared with Rand Fishkin on May 5th, when he received an email from someone with access to it. In his article published on May 27th, Rand highlights several contradictions from Google's PR teams. These include repeated denials that click-centric user signals are used, that subdomains are ranked separately, that a sandbox exists for newer websites, and that a domain's age is collected or considered, among many others.
What New Things Did We Learn From The Google Data Leak?
We’ve reviewed numerous articles to better understand the implications of these leaks for the future of SEO. Here's what we've uncovered:
Google Uses Its Own Domain Authority Score
Google utilizes a form of "site authority" metric. While it may not directly correlate with Moz’s "Domain Authority" or other similar “authority” scores, Google does have a backend system to measure this. Currently, the exact calculation and the weight of this metric in ranking signals remain unclear.
Google NavBoost and Use of Click Data
Google uses a system called "NavBoost" in its ranking algorithm, as revealed during their antitrust trial. This documentation disclosed that NavBoost includes a module entirely focused on click signals, despite Google's prior claims to the contrary. It factors in metrics such as "good" clicks, "bad" clicks, "long" clicks, and more. While the extent of its impact on ranking and how clicks are rated remains unclear, it's important to note that there are safeguards in place to prevent abuse, including cookie history, logged-in Chrome data, and pattern detection.
NavBoost also evaluates queries based on user intent. For example, if users frequently click on videos for a specific query, this behavior will influence how video features are displayed in the SERP.
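To make the click terminology more concrete, here is a minimal, purely hypothetical sketch of how clicks could be bucketed into "good", "bad", and "long" and then tallied per query. The thresholds, field names, and logic below are illustrative assumptions on my part and do not come from the leaked documentation.

```python
from dataclasses import dataclass
from collections import defaultdict

# Hypothetical thresholds -- the leak does not reveal how Google actually
# rates clicks, so these cutoffs are purely illustrative.
BAD_CLICK_MAX_SECONDS = 10      # quick bounce back to the SERP
LONG_CLICK_MIN_SECONDS = 120    # user stayed long enough to look satisfied

@dataclass
class Click:
    query: str
    url: str
    dwell_seconds: float
    last_click_in_session: bool  # a "last longest click"-style signal

def classify_click(click: Click) -> str:
    """Bucket a click into the kinds of labels the leak mentions."""
    if click.dwell_seconds < BAD_CLICK_MAX_SECONDS:
        return "bad"
    if click.dwell_seconds >= LONG_CLICK_MIN_SECONDS or click.last_click_in_session:
        return "long"
    return "good"

def aggregate_by_query(clicks: list[Click]) -> dict[str, dict[str, int]]:
    """Tally click labels per query, the level at which NavBoost reportedly operates."""
    counts: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for click in clicks:
        counts[click.query][classify_click(click)] += 1
    return {query: dict(labels) for query, labels in counts.items()}

if __name__ == "__main__":
    sample = [
        Click("running shoes", "https://example.com/shoes", 4, False),
        Click("running shoes", "https://example.com/shoes", 180, True),
        Click("running shoes", "https://example.com/guide", 45, False),
    ]
    print(aggregate_by_query(sample))
```

In practice, the safeguards mentioned above (cookie history, logged-in Chrome data, pattern detection) would sit in front of any tally like this to filter out manipulated clicks.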
Subdomain vs. Subdirectory vs. Root
There is evidence suggesting that Google treats different levels of a site—such as subdomains, root domains, and individual URLs—differently. They score clicks for each level separately, indicating that click data might be weighted differently based on these distinctions.
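As a rough illustration of what "scoring clicks for each level separately" could look like, the sketch below tallies the same set of clicks at the URL, subdomain, and root-domain level. The naive domain split and every name here are assumptions for illustration, not anything taken from the leaked documentation.

```python
from collections import Counter
from urllib.parse import urlsplit

def click_levels(url: str) -> dict[str, str]:
    """Split a URL into the levels the leak suggests are scored separately.

    Real systems would resolve the registrable ("root") domain with a public
    suffix list; the naive split below is just for illustration.
    """
    host = urlsplit(url).hostname or ""
    root = ".".join(host.split(".")[-2:])  # naive registrable-domain guess
    return {"url": url, "subdomain": host, "root": root}

def tally_clicks(urls: list[str]) -> dict[str, Counter]:
    """Count clicks separately at the URL, subdomain, and root-domain level."""
    tallies = {"url": Counter(), "subdomain": Counter(), "root": Counter()}
    for url in urls:
        for level, key in click_levels(url).items():
            tallies[level][key] += 1
    return tallies

if __name__ == "__main__":
    clicks = [
        "https://blog.example.com/post-1",
        "https://blog.example.com/post-2",
        "https://shop.example.com/product",
    ]
    for level, counter in tally_clicks(clicks).items():
        print(level, dict(counter))
```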
Chrome User Data
One of the page quality modules includes a metric for views from Chrome. In my opinion, this is less about using Chrome users as a direct ranking signal and more about analyzing user behavior, clicks, and other interactions to gain overall insights.
Domain Registration Data
Google stores domain registration information, likely to monitor changes in domain ownership, expiration dates, and other related activities.
What The Leak Tells Us About How Google Treats Content
While most of the leaked information about content falls right in line with what Google has considered “best practices” for a long time, there are some new takeaways and insights:
Documents get truncated, so placing the most important content early is beneficial.
Short content is scored for originality, suggesting that heavily relying on AI is not advisable.
There is a keyword stuffing score, indicating that abusing keywords still has a negative impact on how Google measures “quality” content.
Page titles are still measured on how well they match queries, but there is no character counting measure. Therefore, longer titles could help drive rankings, though shorter titles are probably better at driving clicks. As noted above, clicks and CTR are important factors.
Dates are factored in, therefore setting a date and being consistent is important. For example, you don’t want to specify a date in a URL and then contradict it in the content itself.
Google does store “author data”, which in my opinion could be part of how they are actively measuring E-E-A-T.
Content related to YMYL (Your Money or Your Life) topics, such as health and financial well-being, is scored differently. Google also has a predictor for "fringe queries", or queries it has not seen before, to help determine whether they fall under the YMYL category. While we already knew these sites face higher scrutiny, the documentation confirms it.
Staying on topic is crucial. Google uses site embeddings and vectors to determine how well a page sticks to its topic (see the sketch after this list).
Certain industries and topics are seen as likely targets for misinformation, propaganda, and other related issues. The leaked documentation references flags for sites they consider an authority in a topic, suggesting those sites are “whitelisted” and more likely to rank highly. For instance, sites containing political and election information might be flagged as “isElectionAuthority”.
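The leak doesn't spell out how site embeddings and page vectors are compared, but a common way to measure topical fit with vectors is cosine similarity between a page's embedding and a site-level embedding. The sketch below illustrates that idea under those assumptions; the toy vectors and the mean-pooled site embedding are illustrative only, not Google's method.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Standard cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def site_embedding(page_embeddings: list[list[float]]) -> list[float]:
    """Approximate a site-level embedding as the mean of its page embeddings."""
    dims = len(page_embeddings[0])
    return [sum(p[i] for p in page_embeddings) / len(page_embeddings) for i in range(dims)]

def topicality_score(page: list[float], site_pages: list[list[float]]) -> float:
    """How closely a page's embedding matches the site's overall topic."""
    return cosine_similarity(page, site_embedding(site_pages))

if __name__ == "__main__":
    # Toy 3-dimensional embeddings; real embeddings have hundreds of dimensions.
    pages = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.85, 0.15, 0.05]]
    on_topic = [0.88, 0.12, 0.02]
    off_topic = [0.05, 0.2, 0.95]
    print("on-topic:", round(topicality_score(on_topic, pages), 3))
    print("off-topic:", round(topicality_score(off_topic, pages), 3))
```

A page whose vector sits close to the site's overall embedding scores high; an off-topic page scores low, which is the intuition behind "sticking to your topic."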
What We Know About Link Data and Google’s Algorithm
Backlinks have a long history in SEO, and Google has addressed them in algorithm updates throughout the years, with updates focused on "devaluing" or "demoting" spammy or toxic backlinks. The leak tells us a little more about various link demotions, the most relevant being:
Anchor Mismatch: The anchor text does not match the target site it's linking to (a toy illustration follows this list)
SERP Demotion: Likely measured by clicks, signals of potential user dissatisfaction
NAV Demotion: Likely applied to pages with poor navigation or UX
Location Demotion: Google likely attempts to associate pages with a location, so "global" and "super global" pages are demoted here
Additional demotions include exact match domains, product reviews, adult content, and other link demotions.
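As a toy illustration of the anchor-mismatch idea, the sketch below scores how many anchor-text terms are missing from the target page. This is a naive heuristic of my own for illustration only; it is not how Google detects mismatched anchors.

```python
def anchor_mismatch_score(anchor_text: str, target_page_text: str) -> float:
    """Share of anchor-text words not found on the target page.

    Higher values suggest the anchor describes something the target page
    is not actually about. A toy heuristic, not Google's method.
    """
    anchor_terms = {t.lower() for t in anchor_text.split()}
    page_terms = {t.lower() for t in target_page_text.split()}
    if not anchor_terms:
        return 0.0
    missing = anchor_terms - page_terms
    return len(missing) / len(anchor_terms)

if __name__ == "__main__":
    page = "Our guide to trail running shoes covers fit cushioning and grip"
    print(anchor_mismatch_score("trail running shoes", page))   # 0.0 -> anchor matches
    print(anchor_mismatch_score("cheap casino bonuses", page))  # 1.0 -> total mismatch
```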
On the flip side, there are “good” link signals. While we still don’t know exactly how signals are used, I believe these likely contribute to Google’s “site authority” factor and include:
Getting links from websites whose content Google considers "fresh" or "top tier" is more valuable. This is directly in line with what we have always known: quality links from sites with "high authority" matter more than overall link quantity.
Google can measure spikes in spam anchors, links, and more.
Google stores previous versions of pages, but only uses the last 20 changes for a given URL when analyzing links (see the sketch after this list). This is why redirects carry some link equity, but not all.
The homepage's rank and trust are considered for all pages. If a site's homepage is seen as "authoritative", the rest of its pages are more likely to rank well.
Font size matters; for instance, bolding links could still help.
There’s no documentation about “disavow” data, meaning this is likely stored elsewhere and not part of the main ranking system. Google has been saying for years that it’s not necessary to disavow links, but perhaps this data is used more to train their algorithm on what spam looks like.
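To illustrate the "last 20 changes" point, here is a tiny, hypothetical store that keeps only the most recent 20 snapshots of each URL for link analysis. The class and method names are assumptions of mine; only the 20-version bound comes from reporting on the leak.

```python
from collections import defaultdict, deque

# Reporting on the leak says only the last 20 versions of a URL are kept for
# link analysis; this toy store mimics that bound. All names are illustrative.
MAX_VERSIONS = 20

class PageHistoryStore:
    """Keep at most the last MAX_VERSIONS snapshots of each URL."""

    def __init__(self) -> None:
        self._history: dict[str, deque[str]] = defaultdict(
            lambda: deque(maxlen=MAX_VERSIONS)
        )

    def record_version(self, url: str, content_hash: str) -> None:
        """Store a new snapshot; the oldest one silently drops off past 20."""
        self._history[url].append(content_hash)

    def versions_considered_for_links(self, url: str) -> list[str]:
        """Only these snapshots would factor into link analysis."""
        return list(self._history[url])

if __name__ == "__main__":
    store = PageHistoryStore()
    for i in range(25):
        store.record_version("https://example.com/page", f"hash-{i}")
    kept = store.versions_considered_for_links("https://example.com/page")
    print(len(kept), kept[0], kept[-1])  # 20 hash-5 hash-24
```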
What This Means For You
This line from Mike King's article sums it up well: "The bottom line here is that you need to drive more successful clicks using a broader set of queries and earn more link diversity if you want to continue to rank."
Ultimately, it's important to understand that this leak isn't going to rewrite the SEO playbook. There isn't nearly enough context about how different factors are weighted, and much of what leaked is directly in line with existing SEO best practices. If your SEO strategy has involved creating great, helpful, relevant content that gives your target users the best answer to their queries and earns trustworthy backlinks, you're on the right track.
What we have learned is that Google hasn't been fully honest about the specifics of what factors into ranking a website. I have always taken Google's claims with at least a grain of salt and a healthy level of skepticism, but at the end of the day, best practices are going to continue to be best practices. Knowing some of the details Google pays attention to will undoubtedly provide more insight over time, as we dig further into the documentation and analyze the results of our own SEO strategies in this new context.
If you want to explore what these insights mean for your website, let’s chat! We'll be happy to help you understand the potential impact on your site's performance.