There is a kind of link algorithm that isn’t discussed nearly enough. This article is meant as an introduction to link and link distance ranking algorithms. It’s something that may play a role in how sites are ranked. In my opinion it’s important to be aware of this.
Does Google Use This?
The algorithm under consideration is from a patent that was filed by Google. Google’s official statement about patents and research papers is that it produces many of them, that not all of them are used, and that sometimes they are used in a way that is different from what is described.
That said, the details of this algorithm appear to resemble the contours of what Google has officially said about how it handles links.
Complexity of Calculations
There are two sections of the patent (Producing a Ranking for Pages Using Distances in a Web-link Graph) that state how complex the calculations are:
“Unfortunately, this variation of PageRank requires solving the entire system for each seed separately. Hence, as the number of seed pages increases, the complexity of computation increases linearly, thereby limiting the number of seeds that can be practically used.”
“Hence, what is needed is a method and an apparatus for producing a ranking for pages on the web using a large number of diversified seed pages…”
The above points to the difficulty of making these calculations web wide because of the large number of data points. It states that when these are broken down by topic niche, the calculations are easier to compute.
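To make the "linear in the number of seeds" point concrete, here is a minimal sketch of a seed-based PageRank variation in which the whole system is solved once per seed. The toy graph, the page names, the damping value and the function name are all assumptions made for illustration. None of them come from the patent, and this is not a claim about how Google implements anything.

```python
def personalized_pagerank(graph, seed, damping=0.85, iterations=50):
    """Power iteration where the random surfer always teleports back to `seed`.

    `graph` maps each page to the list of pages it links to. Every link
    target must also appear as a key in `graph`.
    """
    rank = {page: (1.0 if page == seed else 0.0) for page in graph}
    for _ in range(iterations):
        nxt = {page: 0.0 for page in graph}
        for page, outlinks in graph.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    nxt[target] += share
            else:
                # Dangling page (no outlinks): hand its weight back to the seed.
                nxt[seed] += damping * rank[page]
        # Teleport step: the remaining probability mass returns to the seed.
        nxt[seed] += 1.0 - damping
        rank = nxt
    return rank


# Hypothetical toy graph with two seed pages.
graph = {
    "seed_a": ["page_1", "page_2"],
    "seed_b": ["page_2"],
    "page_1": ["page_3"],
    "page_2": ["page_3"],
    "page_3": [],
}
seeds = ["seed_a", "seed_b"]

solves = 0
scores = {}
for seed in seeds:
    # One full solve of the whole graph per seed: cost grows with len(seeds).
    scores[seed] = personalized_pagerank(graph, seed)
    solves += 1
```

With two seeds the whole graph is iterated over twice, and with ten thousand seeds it would be iterated over ten thousand times. That is the cost the patent text describes as limiting how many seeds can practically be used.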
What’s interesting about that statement is that the original Penguin algorithm was calculated once a year or longer. Sites that were penalized pretty much stayed penalized until the next seemingly random date that Google recalculated the Penguin score.
At a certain point Google’s infrastructure must have improved. Google is constantly building its own infrastructure but apparently doesn’t announce it. The Caffeine web indexing system is one of the exceptions.
Real-time Penguin rolled out in the fall of 2016.
It is notable that these calculations are difficult. It points to the possibility that Google would do a periodic calculation for the entire web, then assign scores based on the distances from the trusted sites to all the rest of the sites. Thus, one gigantic calculation, done once a year.
So when a SERP is calculated via PageRank, the distance scores are also calculated. This sounds a lot like the process we know as the Penguin Algorithm.
“The system then assigns lengths to the links based on properties of the links and properties of the pages attached to the links. The system next computes shortest distances from the set of seed pages to each page in the set of pages based on the lengths of the links between the pages. Next, the system determines a ranking score for each page in the set of pages based on the computed shortest distances.”
What is the System Doing?
The system creates a score that is based on the shortest distance between a seed set and the proposed ranked pages. The score is used to rank these pages.
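The three steps in the patent excerpt (assign lengths to links, compute shortest distances from the seed set, turn distances into scores) can be sketched with a multi-source version of Dijkstra's algorithm. This is only an illustration under stated assumptions: the link lengths below are made-up numbers, the page names are hypothetical, and `ranking_score` is a stand-in formula, since the excerpt does not say how a distance becomes a score.

```python
import heapq


def shortest_distances(links, seeds):
    """Multi-source Dijkstra.

    `links` maps a page to a list of (target, length) pairs.
    Returns the shortest distance from any seed to each reachable page.
    """
    dist = {seed: 0.0 for seed in seeds}
    heap = [(0.0, seed) for seed in seeds]
    heapq.heapify(heap)
    while heap:
        d, page = heapq.heappop(heap)
        if d > dist.get(page, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for target, length in links.get(page, []):
            candidate = d + length
            if candidate < dist.get(target, float("inf")):
                dist[target] = candidate
                heapq.heappush(heap, (candidate, target))
    return dist


def ranking_score(distance):
    """Hypothetical scoring rule: the closer to the seed set, the higher the score."""
    return 1.0 / (1.0 + distance)


# Hypothetical link graph. Lengths are invented for the example.
links = {
    "seed": [("news", 1.0), ("blog", 2.0)],
    "news": [("shop", 1.0)],
    "blog": [("shop", 0.5)],
    "spam_hub": [("spam_page", 1.0)],  # never linked to from the seed side
}

dist = shortest_distances(links, ["seed"])
ranked = sorted(dist, key=lambda page: ranking_score(dist[page]), reverse=True)
```

All seeds are placed in the queue at distance zero, so a single pass covers the whole seed set no matter how many seeds there are. Pages that cannot be reached from any seed, such as `spam_page` here, never receive a distance at all.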
So it’s basically an overlay on top of the PageRank score to help weed out manipulated links, based on the theory that manipulated links will naturally have a longer distance of link connections between the spam page and the trusted set.
Ranking a web page can be said to consist of three processes.
- Ranking Modification (usually related to personalization)
That’s an extreme reduction of the ranking process. There’s a lot more that goes on.
Interestingly, this distance ranking process happens during the ranking part of the process. Under this algorithm there’s no chance of ranking for meaningful phrases unless the page is associated with the seed set.
Here is what it says:
“One possible variation of PageRank that would reduce the effect of these techniques is to select a few “trusted” pages (also referred to as the seed pages) and discovers other pages which are likely to be good by following the links from the trusted pages.”
This is an important distinction, to know in what part of the ranking process the seed set calculation happens because it helps us formulate what our ranking strategy is going to be.
This is different from Yahoo’s TrustRank, which was shown to be biased.
Majestic’s Topical TrustFlow can be said to be an improved version, similar to a research paper that demonstrated that a seed set organized by niche topic is more accurate. Research also showed that organizing a seed set algorithm by topic is several orders better than not doing so.
Thus, it makes sense that Google’s distance ranking algorithm also organizes its seed set by niche topic buckets.
As I understand it, this Google patent calculates distances between a seed set and the rest of the pages, then assigns distance scores.
Reduced Link Graph
“In a variation on this embodiment, the links associated with the computed shortest distances constitute a reduced link-graph.”
What this means is that there’s a map of the Internet commonly known as the Link Graph, and then there’s a smaller version of the link graph from which spam pages have been filtered out. Sites that primarily obtain links outside of the reduced link graph might never get inside. Dirty links thus get no traction.
What is a Reduced Link Graph?
I’ll keep this short and sweet. The link to the document follows below.
What you really need to know is this part:
“The early success of link-based ranking algorithms was predicated on the assumption that links imply merit of the target pages. However, today many links exist for purposes other than to confer authority. Such links bring noise into link analysis and harm the quality of retrieval.
In order to provide high quality search results, it is important to detect them and reduce their influence… With the help of a classifier, these noisy links are detected and dropped. After that, link analysis algorithms are performed on the reduced link graph.”
Read this PDF for more information about Reduced Link Graphs.
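The process described in that excerpt (a classifier flags noisy links, they are dropped, and link analysis then runs on what remains) can be sketched as follows. Everything specific here is an assumption for illustration: the `is_noisy` rule is a hypothetical stand-in for a trained classifier, the site names are invented, and counting inbound links is used only as the simplest possible example of link analysis.

```python
from collections import Counter

# Hypothetical full link graph as (source, target) pairs.
full_link_graph = [
    ("news_site", "widget_shop"),
    ("hobby_blog", "widget_shop"),
    ("link_farm_1", "widget_shop"),
    ("link_farm_2", "widget_shop"),
    ("news_site", "hobby_blog"),
]

# Stand-in for a classifier's output. A real classifier would be trained on
# link and page features; here we simply assume these sources were flagged.
FLAGGED_SOURCES = {"link_farm_1", "link_farm_2"}


def is_noisy(source, target):
    """Hypothetical classifier: a link is noisy if its source was flagged."""
    return source in FLAGGED_SOURCES


def reduce_link_graph(links, classifier):
    """Drop every link the classifier marks as noisy."""
    return [(s, t) for s, t in links if not classifier(s, t)]


def inbound_link_counts(links):
    """Simplest possible link analysis: count inbound links per page."""
    return Counter(target for _, target in links)


reduced = reduce_link_graph(full_link_graph, is_noisy)
counts_full = inbound_link_counts(full_link_graph)
counts_reduced = inbound_link_counts(reduced)
```

On the full graph `widget_shop` appears to have four inbound links. On the reduced graph it has two, because the flagged links were removed before any analysis ran.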
If you’re obtaining links from sites like news organizations, it may be fair to assume they are on the inside of the reduced link graph. But are they a part of the seed set? Maybe we shouldn’t obsess over that.
Is This Why Google Says Negative SEO Doesn’t Exist?
“…the links associated with the computed shortest distances constitute a reduced link-graph”
A reduced link graph is different from a link graph. A link graph can be said to be a map of the entire Internet organized by the link relationships between sites, pages or even parts of pages.
Then there’s a reduced link graph, which is a map of everything minus certain sites that don’t meet specific criteria.
A reduced link graph can be a map of the web minus spam sites. The sites outside of the reduced link graph will have zero effect on the sites inside the reduced link graph, because they’re on the outside.
That’s probably why a spam site linking to a normal site will not cause a negative effect on a non-spam site. Because the spam site is outside of the reduced link graph, it has no effect whatsoever. The link is ignored.
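One way to picture this is to take the patent's wording literally and build the reduced link graph from the links that lie on shortest paths from the seed set. The sketch below does that under simplifying assumptions: every link has the same length (so a breadth-first search is enough), the site names are hypothetical, and the reading itself is my interpretation rather than anything Google has confirmed.

```python
from collections import deque


def hop_distances(links, seeds):
    """Breadth-first search from all seeds at once (every link has length 1)."""
    dist = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in dist:
                dist[target] = dist[page] + 1
                queue.append(target)
    return dist


def reduced_link_graph(links, dist):
    """Keep only links that lie on some shortest path from the seed set."""
    return {
        (source, target)
        for source, targets in links.items()
        for target in targets
        if source in dist and target in dist and dist[source] + 1 == dist[target]
    }


# Hypothetical graph before anyone points spam links at normal_site.
links = {"seed": ["news"], "news": ["normal_site"]}
before = hop_distances(links, ["seed"])

# The same graph after a spam site, unreachable from the seed, links in.
links_with_spam = {**links, "spam_site": ["normal_site"]}
after = hop_distances(links_with_spam, ["seed"])

reduced = reduced_link_graph(links_with_spam, after)
```

In this toy example the distances are identical before and after the spam link is added, and the spam link never enters the reduced graph, because no path from the seed set reaches `spam_site` in the first place.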
Could this be why Google is so confident that it’s catching link spam and that negative SEO does not exist?
Distance from Seed Set Equals Less Ranking Power?
I don’t think it’s necessary to try to map out what the seed set is. What’s more important, in my opinion, is to be aware of topical neighborhoods and how that relates to where you get your links.
At one time Google used to publicly display a PageRank score for every page, so I can remember what kinds of sites tended to have low scores. There is a class of sites that have low PageRank and low Moz DA, but they are closely linked to sites that in my opinion are likely a few clicks away from the seed set.
What Moz DA is measuring is an approximation of a site’s authority. It’s a good tool. However, what Moz DA is measuring may not be a distance from a seed set, which cannot be known because it’s a Google secret.
So I’m not putting down the Moz DA tool, keep using it. I’m just suggesting you may want to expand your criteria and definition of what a useful link may be.
What Does it Mean to be Close to a Seed Set?
Page 17 of a Stanford University classroom document asks, “What is a good notion of proximity?” The answers are:
- Multiple connections
- Quality of connection
- Direct & Indirect connections
- Length, Degree, Weight
That is an interesting consideration.
Many people worry about anchor text ratios and the DA/PA of inbound links, but I think those considerations are somewhat old.
The concern with DA/PA is a throwback to the hand-wringing about obtaining links from pages with a PageRank of 4 or more, which was a practice that began from a randomly chosen PageRank score, the number four.
When we talk or think about links in the context of ranking, it may be useful to consider distance ranking as a part of that conversation.
Read the patent here