Google representatives have said very little about how the Penguin algorithm works. This means the Penguin algorithm is more or less a mystery to the search marketing community.
However, I believe there is enough evidence out there to define what Penguin is and how it works.
The purpose of this article is to investigate available clues and begin the process of understanding the Penguin algorithm. Additionally, I believe a patent published by Google in late 2015, briefly discussed within the SEO community and quickly forgotten, may be the key to understanding Penguin (more on this later).
Some may question the need for this. In my opinion it is literally our business as SEOs to have at least a cursory understanding of how search engines work. This is what our industry has done from day one. No part of Google has gone unexamined. So why stop at Penguin? There’s no point in working in the dark. Let’s throw some light on this bird!
What Penguin is… Not
Is Penguin a Trust Algorithm?
In order to know what something is, it helps to know what it is not. There has been speculation that Penguin is a “trust” algorithm. Is it?
The truth about trust algorithms is that they tend to be biased toward large sites, which is why the original Trust Rank research paper was superseded by another research paper, Topical Trust Rank. Topical Trust Rank was 19% to 43.1% better at finding spam than plain vanilla Trust Rank. However, the authors of that research acknowledged certain shortcomings in the algorithm and that further research was necessary.
There are statements from Googlers as early as 2007 making it clear that Google does not use Trust Rank. Additionally, in 2011 the point was made by Google that trust was not a ranking factor in itself, that the word “trust” was merely a catchall word they used for a variety of signals. The statements make it clear with absolutely no ambiguity that Google does not use a trust rank algorithm.
No patent application, no Google blog post, no Twitter tweet or Facebook post indicates that Penguin is a kind of trust-ranking algorithm. There is no evidence I can find that Penguin is a trust-based algorithm. It is therefore a reasonable observation that Penguin is not a trust rank algorithm.
Does Penguin Use Machine Learning?
Gary Illyes confirmed in October 2016 that Penguin does not use machine learning. This is an incredibly important clue.
Machine learning is, in a simplified description, a process where a computer is taught to identify something by giving it clues to what that something looks like. For a simple hypothetical example, we can teach a computer to identify a dog by giving it the clues that something is a dog. Those clues can be a tail, a dark nose, fur and a barking noise.
For machine learning, those clues are known as classifiers. The SEO industry calls those classifiers signals. A typical SEO tries to create “signals” of quality. A machine learning algorithm uses classifiers to understand if a web page meets the definition of a quality web page. This can also work in reverse for spam.
Does Penguin Use Statistical Analysis?
There is frequent reference to the possibility that statistical analysis plays a role in Penguin. Statistical analysis identifies variables that are common to normal sites and spam sites. Variables range from anchor text ratios to the ratio of inlinks pointing to the home page versus the rest of the site. When the entire web is analyzed, the abnormal (spam) pages stand out. These (spam) pages are called outliers.
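To make the idea of an outlier concrete, here is a minimal sketch. It assumes a single variable (the share of a site's inbound links that use the same keyword anchor text) and a simple z-score cutoff. The site names, the numbers, and the threshold of 2.0 are all hypothetical, and nothing here is claimed to be what Google actually computes.

```python
from statistics import mean, stdev

def find_outliers(values_by_site, threshold=2.0):
    """Return sites whose value is more than `threshold` standard deviations from the mean."""
    values = list(values_by_site.values())
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # every site looks identical, so nothing stands out
    return sorted(
        site for site, value in values_by_site.items()
        if abs(value - mu) / sigma > threshold
    )

# Hypothetical data: share of inbound links using the same keyword anchor text.
anchor_ratio = {
    "site-a.example": 0.08,
    "site-b.example": 0.10,
    "site-c.example": 0.06,
    "site-d.example": 0.12,
    "site-e.example": 0.09,
    "site-f.example": 0.07,
    "site-g.example": 0.11,
    "site-h.example": 0.05,
    "site-i.example": 0.10,
    "spammy.example": 0.90,
}

print(find_outliers(anchor_ratio))
```

Real systems would look at many variables at once and at far larger samples, but the principle is the same: the normal sites cluster together and the manipulated one sits far away from the cluster.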
The use of statistical analysis as a spam fighting technique was confirmed at Pubcon New Orleans 2005 when it was openly discussed by Google engineers at a keynote presentation. Thus, we know that statistical analysis has been a feature of Google’s spam fighting since at least 2005.
One of the most well-known research papers on statistical analysis was published by Microsoft in 2004. It is titled Spam, Damn Spam, and Statistics.
Statistical analysis has revealed that there are patterns in the way spam sites build links. These patterns are symptoms of their activities. Penguin does more than identify the symptoms of the activities.
Why it is Significant that Penguin is Not Machine Learning
The importance of this knowledge is that we can now understand that Penguin is not identifying spam links by the use of quality signals otherwise known as classifiers. Thus we can be reasonably certain that Penguin is not learning how to identify spam by statistical signals.
Examples of link-based spam features:
- Percentage of inbound links that contain an anchor text
- Ratio of inlinks to home page versus inner pages
- Ratio of outlinks to inlinks
- Edge-reciprocity (high PageRank spam sites feature low reciprocal link patterns)
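As a rough illustration of the features listed above, the sketch below computes them for one site from a small list of links. The `Link` record, the `link_features` function, and the sample data are hypothetical names made up for this example. Reciprocity is approximated here as the share of a site's outlink targets that link back, which is one simple reading of the idea rather than the definition used in any particular paper.

```python
from collections import namedtuple

# One link: who links, which site is linked to, which path, and the anchor text.
Link = namedtuple("Link", "source target path anchor")

def link_features(links, site, anchor):
    """Compute simple link-based features for `site` (illustrative only)."""
    inlinks = [l for l in links if l.target == site]
    outlinks = [l for l in links if l.source == site]
    home = sum(1 for l in inlinks if l.path == "/")
    inner = len(inlinks) - home
    linkers = {l.source for l in inlinks}   # sites that link to us
    linked = {l.target for l in outlinks}   # sites we link to
    return {
        # Share of inbound links that use the given anchor text
        "anchor_share": sum(1 for l in inlinks if l.anchor == anchor) / len(inlinks) if inlinks else 0.0,
        # Inlinks to the home page versus inner pages
        "home_to_inner": home / inner if inner else float("inf"),
        # Outlinks versus inlinks
        "out_to_in": len(outlinks) / len(inlinks) if inlinks else float("inf"),
        # Share of the sites we link to that also link back to us
        "reciprocity": len(linkers & linked) / len(linked) if linked else 0.0,
    }

# Hypothetical sample data.
links = [
    Link("a.example", "shop.example", "/", "cheap widgets"),
    Link("b.example", "shop.example", "/", "cheap widgets"),
    Link("c.example", "shop.example", "/", "cheap widgets"),
    Link("d.example", "shop.example", "/blog", "a useful post"),
    Link("shop.example", "a.example", "/", "partner"),
    Link("shop.example", "x.example", "/", "source"),
]

features = link_features(links, "shop.example", "cheap widgets")
print(features)
```

On their own these numbers mean little. They become useful only when compared against the same numbers for a very large set of sites.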
So now we have a better understanding of what it means when it’s said that Penguin is not machine learning, which sometimes involves statistical analysis.
Further reading on link-based analysis features:
- Let Web Spammers Expose Themselves
- The Connectivity Sonar: Detecting Site Functionality by Structural Patterns
- Link-Based Characterization and Detection of Web Spam
- Efficient Identification of Web Communities, 2000
- Link spam alliances, 2005
- Web Spam Taxonomy, 2005
- SpamRank – Fully Automatic Link Spam Detection
- A Large-Scale Study of Link Spam Detection by Graph Algorithms, 2007
What the Penguin Algorithm May Be…
Information retrieval research has taken many directions, but of the papers related to link analysis, there is one kind of algorithm that stands out because it represents a new direction in link spam detection. This new kind of algorithm can be referred to as a Link Ranking Algorithm or a Link Distance Ranking Algorithm. I believe that calling it a Link Ranking Algorithm is more appropriate, and I will explain why further down.
Instead of ranking web pages, this new kind of algorithm ranks links. This kind of algorithm is different from any link related algorithm that has ever preceded it. This is how Google’s patent application filed in 2006 describes this algorithm:
…a system that ranks pages on the web based on distances between the pages, wherein the pages are interconnected with links to form a link-graph. More specifically, a set of high-quality seed pages are chosen as references for ranking the pages in the link-graph, and shortest distances from the set of seed pages to each given page in the link-graph are computed.
In plain English, this means that Google selects high-quality web pages as starting points for creating a map of the web (called a link graph). In this link graph, the distance from the seed page to another web page is measured and a rank is given to the web page. The shorter the distance between a seed page and a regular web page, the more authoritative that web page is computed to be.
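The plain English version can be sketched in a few lines. This is a simplified illustration, not the patent's method: it assumes every link counts as one step and uses an ordinary breadth-first search that starts from all seed pages at once. The page names and the `seed_distances` function are hypothetical.

```python
from collections import deque

def seed_distances(graph, seeds):
    """Breadth-first search from all seeds: number of links from the nearest seed to each page."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in graph.get(page, []):
            if target not in distance:  # first visit is the shortest path in a BFS
                distance[target] = distance[page] + 1
                queue.append(target)
    return distance

# Hypothetical link graph: each page maps to the pages it links out to.
graph = {
    "seed-news": ["university", "city-blog"],
    "university": ["research-lab"],
    "city-blog": ["local-shop"],
    "research-lab": [],
    "local-shop": [],
    # Links out to good pages, but no page reachable from the seed links to it.
    "spam-page": ["seed-news", "university"],
}

distances = seed_distances(graph, ["seed-news"])
print(distances)
```

Notice that the spam page never receives a distance at all. It links to good pages, but because distances are measured by following links outward from the seed, its own outbound links do nothing to bring it closer.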
…This is Not a Trust Algorithm
Nowhere does the patent describe itself as a trust algorithm. It makes six references to “trusted” sites but that is in the context of describing the quality of a seed page, not to describe the algorithm itself. The patent uses the words “distance” and “distances” 69 times. This is important because the word distance more accurately describes what this algorithm is about.
If this patent is a description of what Penguin is, then it is incorrect to call Penguin a trust algorithm. Penguin might be more accurately described as a link ranking algorithm. A short distance link is ranked higher than a longer distance link. This quality of distance is important because the distance from a seed page is what makes a link a high-value link. There is no quality called trust, only distance. It can be referred to as a link distance algorithm or as a link ranking algorithm.
How Are Link Distances Calculated?
The patent describes the problem of calculating a distance ranking score for the entire link graph as inefficient. This is what Google published:
Generally, it is desirable to use large number of seed pages to accommodate the different languages and a wide range of fields which are contained in the fast growing web contents. Unfortunately, this variation of PageRank requires solving the entire system for each seed separately. Hence, as the number of seed pages increases, the complexity of computation increases linearly, thereby limiting the number of seeds that can be practically used.
The patent describes problems in calculating link distances for the entire link graph and proposes diversifying the seed set pages, presumably by niche topics. This makes the ranking computation easier (and it also solves the problem of bias toward big and influential sites). Here is what Google’s patent says:
…as the number of seed pages increases, the complexity of computation increases linearly, thereby limiting the number of seeds that can be practically used… Hence, what is needed is a method… for producing a ranking for pages on the web using a large number of diversified seed pages without the problems of the above-described techniques.
What does Google mean by diversified seed pages? This diversification is described first in terms of connectivity to a wide range of sites, citing the Google Directory (DMOZ) and the New York Times as examples. It further adds to that requirement by stating:
“…it would be desirable to have a largest possible set of seeds that include as many different types of seeds as possible.”
There are other link ranking and click distance ranking algorithms that make reference to diversification by niche topics. It’s a fairly common strategy for improving accuracy.
Distance Ranking Explained
The purpose of this algorithm is to create a reduced link graph that has link manipulating sites filtered out. Here’s how that is accomplished:
“The system then assigns lengths to the links based on properties of the links and properties of the pages attached to the links. The system next computes shortest distances from the set of seed pages to each page in the set of pages based on the lengths of the links between the pages. Next, the system determines a ranking score for each page in the set of pages based on the computed shortest distances.”
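The three steps in that passage (assign lengths to links, compute shortest distances from the seed set, turn distances into a score) can be sketched as follows. The passage does not give the formulas, so two assumptions are made here purely for illustration: a link's length is the number of outbound links on the page it comes from, and the score is 1 / (1 + distance). The sketch uses Dijkstra's algorithm started from every seed at once, and all page names are hypothetical.

```python
import heapq
import math

def link_length(source, graph):
    # Assumption for illustration: a link from a page with many outbound
    # links counts as a longer link than one from a page with few.
    return float(len(graph[source]))

def shortest_distances(graph, seeds):
    """Dijkstra's algorithm started from all seed pages at distance zero."""
    distance = {seed: 0.0 for seed in seeds}
    heap = [(0.0, seed) for seed in seeds]
    heapq.heapify(heap)
    while heap:
        d, page = heapq.heappop(heap)
        if d > distance.get(page, math.inf):
            continue  # stale heap entry, a shorter path was already found
        for target in graph.get(page, []):
            candidate = d + link_length(page, graph)
            if candidate < distance.get(target, math.inf):
                distance[target] = candidate
                heapq.heappush(heap, (candidate, target))
    return distance

def ranking_score(distance):
    # Assumption for illustration: shorter distance gives a higher score.
    return 1.0 / (1.0 + distance)

# Hypothetical link graph: each page maps to the pages it links out to.
graph = {
    "seed-a": ["hub", "niche"],
    "seed-b": ["forum"],
    "hub": ["niche", "shop", "forum"],
    "niche": ["shop"],
    "shop": [],
    "forum": [],
}

distances = shortest_distances(graph, ["seed-a", "seed-b"])
scores = {page: ranking_score(d) for page, d in distances.items()}
print(distances)
print(scores)
```

Because all seeds start in the same queue, adding more seeds does not require a separate pass per seed in this sketch. Each page simply ends up with the distance to whichever seed is nearest.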
Penguin in Plain English
The system creates a score that is based on the shortest distance between a seed set and the non-seed set pages. The score is used to rank these pages. So it’s basically an overlay on top of the PageRank score to help weed out manipulated links, based on the theory that manipulated links will naturally have a longer distance of link connections between themselves and the trusted set.
Good sites tend not to link to bad sites. Bad sites tend to link to good sites. The seed set distance algorithm reinforces the linking tendencies of good sites, while the linking patterns of bad sites automatically set them aside and organize them within their own (spam) neighborhoods.
Link Direction and Spam Detection
An interesting observation from 2007 (A Large-Scale Study of Link Spam Detection by Graph Algorithms) noted that the direction of links was a good indicator of spam:
“…in link spam detection, the direction of links is significantly important because spam sites often point to good sites and good sites seldom point to spam sites…”
It is the truth of that observation, that the direction of links is important, that underlies the accuracy of Penguin. The algorithm can exclude those links from the reduced link graph so that the net effect is that they cannot hurt a good site. This observation coincides with statements out of Google that low-quality links won’t hurt a non-spam site, and this is the reason why those links might not affect a normal site.
So What’s the Key Takeaway?
This touches on the usefulness of filing disavowals. A disavowal report is a file uploaded to Google to inform them of any low-quality links. Googlers have commented that disavowals are no longer necessary for Penguin, presumably because the low-quality links aren’t a factor in Penguin-related issues.
Disavowals, Penguin, and You
Let’s pause for a disavow report reality check, courtesy of Jeff Coyle, a cofounder and CRO of Market Muse, Inc. Jeff has a long and distinguished career in search marketing, notably in B2B lead generation. Here are his insights:
The ebb and flow of low quality links against a big site has little impact. On a site that is struggling to get authority or off page power of any kind, it can be a setback when an unfortunate set of malicious links enters the game.
Next, I turned to the UK, to hear from Jason Duke, CEO of The Domain Name. Jason has decades of experience in competitive search marketing. These are his thoughts on disavowal reports in the wake of the Penguin algorithm:
It is normal to have low quality links. As such non-controlled assets can do what they wish, i.e. link to you, and some of them are bad and not what you’d ideally like.
I do think there is value in a process for disavowing historical actions you or your predecessor, or indeed another party, have done to your website. But I don’t think it’s needed when you look at the web as a whole. Low quality links happen and are easily taken care of en masse as it normalises out.
Both of those opinions (and my own) conform to what Gary Illyes has stated about disavowals. Disavowals are not necessary in the context of Penguin, but could be useful outside of that context or for coming clean about low-quality links you are responsible for.
Are You In or Are You Out?
Under this algorithm, there’s no chance of ranking for meaningful keyword phrases unless the page is associated with the seed set and not heavily associated with the spam cliques. The patent references the algorithm’s resistance to link spam techniques:
“One possible variation of PageRank that would reduce the effect of these techniques is to select a few “trusted” pages (also referred to as the seed pages) and discovers other pages which are likely to be good by following the links from the trusted pages.”
Note this is different from the old Yahoo TrustRank algorithm. Yahoo TrustRank was shown to be biased toward large sites because the seed set was not diversified. A subsequent research paper demonstrated that a diversified seed set organized by niche topics was more accurate.
Not all trust algorithms are the same. Majestic’s Topical Trust Flow metric is an example of an accurate trust metric. The reason it is accurate is because it uses a diversified seed set. In my opinion Majestic’s Topical Trust Flow is a useful tool for evaluating the quality of a web page or website as part of a link building project.
Reduced Link Graph
As I understand it, this Google patent calculates distances between a trusted seed set and assigns trust/distance scores which are then used as an overlay on the regularly ranked sites, almost like a filter applied to PageRank-scored sites to weed out less authoritative sites. This results in what is known as a reduced link graph. This is very important. Let’s take a closer look at what a Reduced Link Graph means for your search marketing strategy.
“In a variation on this embodiment, the links associated with the computed shortest distances constitute a reduced link-graph.”
What this means is that there’s a map of the entire Internet commonly known as the Link Graph and then there’s a smaller version of the link graph that is populated by web pages that have had spam pages filtered out. This filtered version of the web is the reduced link graph.
- TAKEAWAY 1: Sites that primarily have inbound and outbound link relationships with pages outside of the reduced link graph will never get inside and consequently will be shut out of the top ten ranking positions. Spam links give no traction.
- TAKEAWAY 2: Because this algorithm stops spam links from having any influence (positive or negative), spam links have no effect on high-quality sites. In this algorithm, a link either helps a site rank or it does not help a site rank.
- TAKEAWAY 3: The twin effects of identifying spam sites and shutting them out are the effects inherent in the concept of the reduced link graph.
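The patent's statement that the links associated with the computed shortest distances constitute a reduced link graph can be sketched like this. For each page reachable from the seed set, the sketch keeps only the link that gave it its shortest distance and drops everything else. The link lengths, the page names, and the `reduced_link_graph` function are hypothetical, and a real system would certainly be more involved.

```python
import heapq

def reduced_link_graph(lengths, seeds):
    """Return the set of links that lie on a shortest path from the seed set."""
    outgoing = {}
    for (source, target), length in lengths.items():
        outgoing.setdefault(source, []).append((target, length))

    distance = {seed: 0 for seed in seeds}
    best_link = {}  # page -> the link that produced its shortest distance
    heap = [(0, seed) for seed in seeds]
    heapq.heapify(heap)
    while heap:
        d, page = heapq.heappop(heap)
        if d > distance[page]:
            continue  # stale heap entry
        for target, length in outgoing.get(page, []):
            if target not in distance or d + length < distance[target]:
                distance[target] = d + length
                best_link[target] = (page, target)
                heapq.heappush(heap, (d + length, target))
    return set(best_link.values())

# Hypothetical links with assumed lengths: (source, target) -> length.
lengths = {
    ("seed", "news"): 1,
    ("seed", "blog"): 2,
    ("news", "blog"): 2,
    ("news", "shop"): 1,
    ("blog", "shop"): 1,
    ("spam-1", "spam-2"): 1,
    ("spam-2", "spam-1"): 1,
    ("spam-1", "shop"): 1,  # spam link pointing at a normal site
}

reduced = reduced_link_graph(lengths, ["seed"])
print(sorted(reduced))
```

In this toy example the two spam pages link to each other and to the shop, yet none of their links survive, because no path from the seed ever reaches them. The spam link to the shop neither helps nor hurts it, which is the behavior described in Takeaway 2.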
The point of Penguin, in my opinion, is not to attach a spam label on spam sites and a trusted label on normal sites. The point is to get to the reduced link graph. The reduced link graph is the goal of Penguin because it filters out the sites that are trying to unfairly influence the algorithm.
Does This Mean Reduced Link Graphs Are New?
Reduced link graphs are not new. Reduced link graphs have likely been used in the past as part of a ranking process. The limitation of a reduced link graph is that it’s only as good as the filter used to create it. Below is a link to a PDF discussing a reduced link graph created by using statistical analysis.
“The early success of link-based ranking algorithms was predicated on the assumption that links imply merit of the target pages. However, today many links exist for purposes other than to confer authority. Such links bring noise into link analysis and harm the quality of retrieval. In order to provide high quality search results, it is important to detect them and reduce their influence… With the help of a classifier, these noisy links are detected and dropped. After that, link analysis algorithms are performed on the reduced link graph.”
More information about Reduced Link Graphs here.
Why Reduced Link Graphs are a Big Deal
What’s interesting about the concept of a reduced link graph is that it neatly fits into what we know about Penguin. Penguin excludes sites from ranking. With the Penguin algorithm, you are either in the game or you are out of the game and have no chance of ranking. A reduced link graph works just like this. If your link profile excludes you from the reduced link graph, you will never ever rank for your phrases. That’s because your site is excluded from consideration.
What is The Seed Set?
This is an important question to answer. Having a good notion of what the seed set looks like could help you set the best link acquisition targets and also help identify the wrong kinds of sites to become involved with.
Dividing topics into niche buckets is an old and trusted technique. DMOZ has been cited as an inspiration for a taxonomical organization of topics. But researchers today turn to Wikipedia when they need a comprehensive taxonomy of topics. Researchers from Google, Microsoft, and artificial intelligence scientists turn to Wikipedia when they need to classify things. I believe it’s reasonable to assume that Wikipedia’s category structure is used for creating niche categories for the seed sets.
Google’s use of Wikipedia for classifying things is not without precedent. This Google research paper titled, Classifying YouTube Channels: a Practical System describes the use of Wikipedia topic categories for automatically creating thousands of YouTube categories without any human intervention whatsoever.
Here are more examples of how researchers routinely use Wikipedia for generating topics (taxonomies):
- Are Human-Input Seeds Good Enough for Entity Set Expansion? Seeds Rewriting by Leveraging Wikipedia Semantic Knowledge (National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences)
- From the Journal of Artificial Intelligence Research (2009): Wikipedia-based Semantic Interpretation for Natural Language Processing
- A Google research paper: Classifying YouTube Channels: a Practical System
- A Google research paper detailing other uses for Wikipedia: Using Encyclopedic Knowledge for Named Entity Disambiguation
- Microsoft research that maps Wikipedia categories to user intent: Understanding User’s Query Intent with Wikipedia
Understand and Strategize
Gaining an understanding of Penguin, even in loose outlines, is important if making informed decisions for your search strategy is a priority. Search marketing has never exactly known the specific details of search algorithms, only the general outlines. Dealing with Penguin should be no different.
If this outline of how Penguin works is correct, then having a good estimate of how many clicks away a web page is from a seed site is useful information. Although the list of seed set sites is classified, we can take what we know about this algorithm and others like it and make some educated estimations.
Seed Sites – Web Connectivity
The patent application describes characteristics of a typical seed site in two terms. The first term is what it calls web connectivity. That’s another way of saying that it has many outbound links to other web pages. Here is how the patent application describes it:
…seeds… are specially selected high-quality pages which provide good web connectivity to other non-seed pages.
Examples of these seed sites are the New York Times and the “Google Directory,” a reference to Google’s DMOZ clone. These are just examples and may or may not represent actual seed sites in use by Google. We already know that Wikipedia is useful to AI and Information Retrieval science. So it’s not far-fetched to speculate that Wikipedia may be a seed site. A rebuttal to that speculation may be that all of Wikipedia’s outbound links are nofollowed, which technically means all of those links are dropped from the link graph. So how can something be a seed site in a link graph while simultaneously having zero web connectivity?
Seed Sites – Diversity
In the next section, the document states that the seed set must be diverse. What I believe they mean by diversity is choosing sites across a range of topical niches.
One approach for choosing seeds involves selecting a diverse set of trusted seeds. Choosing a more diverse set of seeds can shorten the paths from the seeds to a given page. Hence, it would be desirable to have a largest possible set of seeds that include as many different types of seeds as possible.
After that, it elaborates that the seed set must by necessity have a limit, because the authors feel that too large a seed set makes the algorithm open to spamming.
Link Building Strategy
If this is the Penguin algorithm, then these are the key elements:
- Penguin works on a reduced link graph
- Penguin doesn’t penalize. You’re either in or you’re out of the SERPs.
- Spam sites link to quality sites. Linking to .edu sites won’t save you if you’re spamming.
- Penguin, like many other link detection algorithms, focuses on link direction
- Quality sites don’t link to spammy sites. This means understanding a site’s outbound links might be important
That last one, outlink research, is interesting. Most link building/backlink tools are focused on inlink data. But if you really care about ranking, maybe it’s time to deep dive into outlink data. Xenu’s Link Sleuth can do the trick in a pinch but the reports are Spartan. The modestly priced Screaming Frog app does it faster and generates clean reports that can help you get an idea if your next link prospect is useful or counterproductive.
We can’t be certain this is the Penguin algorithm. Only one other algorithm from the appropriate time frame comes close to describing what Penguin is: a link ranking algorithm described in a research paper authored by Ryan A. Rossi in 2011. The paper claims to be a completely new direction in spam link detection with a success rate of 90.54%. It’s called Discovering Latent Graphs with Positive and Negative Links to Eliminate Spam in Adversarial Information Retrieval. It’s a fascinating algorithm and I encourage you to read it. Here is the description of that algorithm:
This paper proposes a new direction in Adversarial Information Retrieval through automatically ranking links. We use techniques based on Latent Semantic Analysis to define a novel algorithm to eliminate spam sites. Our model automatically creates, suppresses, and reinforces links.
It is a groundbreaking approach to spam links that describes a process very similar to the one described in Google’s patent application and well worth reading in order to understand the state of the art of adversarial information retrieval.
Based on what we know about Penguin, the Google patent application provides the best description to date of what the Penguin algorithm may be. Aside from the aforementioned link ranking research and to a much lesser extent a 2012 Microsoft patent on a Click Distance Ranking Algorithm, there isn’t another patent application or research paper on combating link spam that is any closer to describing the Penguin algorithm. So if we are going to pin a tail on an algorithm this is the likeliest donkey to pin it on.
Graphics made by author