Many claims are made for Latent Semantic Indexing (LSI) and “LSI Keywords” for SEO.
Some even say that Google relies on “LSI keywords” for understanding webpages.
This has been debated for nearly twenty years, and the evidence-based facts have been available the entire time.
This Is Latent Semantic Indexing
Latent semantic indexing (also referred to as Latent Semantic Analysis) is a method of analyzing a set of documents to discover statistical co-occurrences of words, which in turn give insight into the topics of those words and documents.
Two of the problems (among several) that LSI sets out to solve are the issues of synonymy and polysemy.
Synonymy is a reference to how many words can describe the same thing.
A search for “flapjack recipes” is equivalent to a search for “pancake recipes” (outside of the UK) because flapjacks and pancakes are synonyms.
Polysemy refers to words and phrases that have more than one meaning. The word jaguar can mean an animal, automobile, or an American football team.
LSI is able to predict which meaning a word represents by statistically analyzing the words that co-occur with it in a document.
If the word “jaguar” is accompanied in a document by the word “Jacksonville,” it is statistically probable that the word “jaguar” is a reference to an American football team.
By understanding how words occur together, a computer is better able to answer a query by correctly associating the right keywords to the search query.
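The co-occurrence idea above can be sketched in a few lines of code. This is a minimal, illustrative example only: it builds a term-document count matrix from a tiny invented corpus, factors it with a truncated SVD (the core of LSI), and compares terms in the reduced “latent” space. The corpus, vocabulary, and dimension count are all assumptions made up for the demo, not anything from a real search index.

```python
# Minimal LSI sketch: term-document matrix -> truncated SVD -> latent
# term vectors. The tiny corpus below is invented for illustration.
import numpy as np

docs = [
    "jaguar jacksonville football team season",
    "jaguar engine car automobile speed",
    "jacksonville football season game",
    "car automobile engine speed",
]

# Build the vocabulary and a term-document count matrix.
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[index[w], j] += 1

# Truncated SVD: keep k latent dimensions ("topics").
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]  # each row is a term in latent space

def sim(w1, w2):
    """Cosine similarity between two terms in the latent space."""
    a, b = term_vecs[index[w1]], term_vecs[index[w2]]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Co-occurrence pulls the two senses of "jaguar" apart: "jacksonville"
# lands near the football terms, not the automobile terms.
print(sim("jacksonville", "football"))
print(sim("jacksonville", "automobile"))
```

Running this shows “jacksonville” scoring far closer to “football” than to “automobile,” which is exactly the disambiguation trick LSI was patented for in 1988.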
The patent for LSI was filed on September 15, 1988. It’s an old technology that came years before the internet as we know it existed.
LSI is not new nor is it cutting edge.
It is important to understand that in 1988, LSI was advancing the state of the art of simple text matching.
LSI preceded the internet and was created in an era when a typical Apple computer was a compact all-in-one Macintosh and a popular business computer was the IBM AS/400.
LSI is a technology that goes way back.
Just like computers from 1988, the state of the art in Information Retrieval has come a long way over the past 30+ years.
LSI is Not Practical for the Web
A major shortcoming of using Latent Semantic Indexing for the entire web is that the underlying statistical calculations must be redone every time a new webpage is published and indexed.
This shortcoming is mentioned in a 2003 (non-Google) research paper about using LSI for detecting email spam (Using Latent Semantic Indexing to Filter Spam PDF).
The research paper notes:
“One issue with LSI is that it does not support the ad-hoc addition of new documents once the semantic set has been generated. Any update to any cell value will change the coefficient in every other word vector, as SVD uses all linear relations in its assigned dimensionality to induce vectors that will predict every text samples in which the word occurs…”
I asked Bill Slawski about the unsuitability of LSI for search engine information retrieval and he agreed, saying:
“LSI is an older indexing approach developed for smaller static databases. There are similarities with newer technologies such as the use of word vectors or word2Vec.
One of the limitations of LSI is that if new content is added to a corpus that indexing for the entire corpus is required, which makes it of limited usefulness for a quickly changing corpus such as the Web.”
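The limitation Slawski describes can be demonstrated directly. The sketch below, using random data purely for illustration, shows that the SVD is a global factorization: appending a single new document column and re-decomposing shifts the latent vectors of every term, including terms the new document never mentions. The matrix sizes and dimension count are arbitrary assumptions.

```python
# Why LSI scales poorly for a changing corpus: adding one document and
# re-running the SVD perturbs ALL existing term vectors. Random data is
# used purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((50, 20))  # 50 terms x 20 documents

def term_vectors(M, k=5):
    """Latent term vectors from a rank-k truncated SVD."""
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * s[:k]

before = term_vectors(A)

# "Publish" one new document: append a column and refactor everything.
new_doc = rng.random((50, 1))
after = term_vectors(np.hstack([A, new_doc]))

# Measure how far existing term vectors moved (abs() ignores the
# sign-indeterminacy of singular vectors). At web scale, redoing this
# factorization on every new page is prohibitive.
drift = np.abs(np.abs(before) - np.abs(after)).max()
print(drift)
```

For a static corpus of fifty terms this recomputation is trivial; for a web index where pages are added every second, it is not, which is the point both the spam-filtering paper and Slawski make.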
Is There a Google LSI Keywords Research Paper?
Some in the search community believe Google uses “LSI Keywords” in their search algorithm as if LSI is still a cutting-edge technology.
To prove it, some refer to a 2016 research paper called, Improving Semantic Topic Clustering for Search Queries with Word Co-occurrence and Bigraph Co-clustering (PDF).
That research paper is absolutely not an example of Latent Semantic Indexing. It’s a completely different technology.
In fact, that research paper is so far from being about LSI (a.k.a. Latent Semantic Analysis) that it cites a 1999 LSI research paper (T. Hofmann. Probabilistic latent semantic indexing. …1999) as part of an explanation of why LSI is not useful for the problem the authors are trying to solve.
Here’s what it says:
“Latent dirichlet allocation (LDA) and probabilistic latent semantic analysis (PLSA) are widely used techniques to unveil latent themes in text data. …These models learn the hidden topics by implicitly taking advantage of document level word co-occurrence patterns.
Short texts however – such as search queries, tweets or instant messages – suffer from data sparsity, which causes problems for traditional topic modeling techniques.”
It’s a mistake to use the above research paper as proof that Google uses LSI as an important ranking factor. The paper is not about LSI and it’s not even about analyzing webpages.
It’s an interesting research paper from 2016 about data mining short search queries in order to understand what they mean.
That research paper aside, we know that Google uses BERT and neural matching technologies to understand search queries in the real world.
Long story short: the use of that research paper to make a definitive statement about Google’s ranking algorithm is sketchy all around.
Does Google Use LSI Keywords?
In search marketing, there are two kinds of trustworthy and authoritative data:
- Factual ideas that are based on public documents like research papers and patents.
- SEO ideas that are based on what Googlers have revealed.
Everything else is mere opinion.
It’s important to know the difference.
Google’s John Mueller has been straightforward about debunking the concept of LSI Keywords.
There's no such thing as LSI keywords — anyone who's telling you otherwise is mistaken, sorry.
— 🍌 John 🍌 (@JohnMu) July 30, 2019
Noted search patent expert Bill Slawski has also been outspoken about the notion of Latent Semantic Indexing and SEO.
Bill Slawski Tweets His Informed Opinion on Latent Semantic Indexing
Latent Semantic Indexing has nothing to do with SEO:https://t.co/X6KcEt9vSm
— Bill Slawski ⚓ (@bill_slawski) August 18, 2020
Those terms have their own technology and processes behind how they are determined, and do not use LSI. There is nothing "latent" about them. 3/3
— Bill Slawski ⚓ (@bill_slawski) August 18, 2020
Why Google Is Associated with Latent Semantic Analysis
Despite the absence of any patents or research papers showing that LSI/LSA is an important ranking-related factor, Google is still associated with Latent Semantic Indexing.
One reason for this is Google’s 2003 acquisition of a company called Applied Semantics.
Applied Semantics had created a technology called Circa. Circa was a semantic analysis algorithm that was used in AdSense and also in Google AdWords.
According to Google’s press release:
“Applied Semantics is a proven innovator in semantic text processing and online advertising,” said Sergey Brin, Google’s co-founder and president of Technology. “This acquisition will enable Google to create new technologies that make online advertising more useful to users, publishers, and advertisers alike.
Applied Semantics’ products are based on its patented CIRCA technology, which understands, organizes, and extracts knowledge from websites and information repositories in a way that mimics human thought and enables more effective information retrieval. A key application of the CIRCA technology is Applied Semantics’ AdSense product that enables web publishers to understand the key themes on web pages to deliver highly relevant and targeted advertisements.”
Semantic Analysis & SEO
The phrase “Semantic Analysis” was a hot buzzword in the early 2000s, perhaps partially driven by Ask Jeeves’ semantic search technology.
Google’s purchase of Applied Semantics accelerated the trend of associating Google with Latent Semantic Indexing, despite there being no credible evidence.
Thus, by 2005 the search marketing community was making unsubstantiated statements such as this:
“For several months I’ve noticed changes in website rankings on Google and it was clear something had changed in their algorithm.
One of the most important changes is the likelihood that Google is now giving more weight to Latent Semantic Indexing (LSI).
This should come as no surprise considering Google purchased Applied Semantics in April 2003 and has reportedly been serving up their AdSense ads using latent semantic indexing.”
The SEO myth that Google uses LSI keywords quite possibly originated from the popularity of phrases like “Semantic Analysis,” “Semantic Indexing,” and “Semantic Search” as SEO buzzwords, a trend given life by Ask Jeeves’ semantic search technology and Google’s purchase of the semantic analysis company Applied Semantics.
The Facts About Latent Semantic Indexing
LSI is a very old method of understanding what a document is about.
It was patented in 1988, well before the internet as we know it existed.
The nature of LSI makes it unsuitable for applying across the entire internet for purposes of information retrieval.
There are no research papers that explicitly show that latent semantic indexing is an important feature of Google search ranking.
The facts presented in this article show that this absence of evidence has been the case since the early 2000s.
Rumors of Google’s use of LSI and LSA surfaced in 2003 after Google acquired Applied Semantics, the company that produced the contextual advertising product AdSense.
Yet Googlers have affirmed multiple times that Google uses no such thing as LSI Keywords.
Let me say it again louder for those at the back: There is no such thing as LSI Keywords.
Considering the overwhelming amount of evidence, it is reasonable to conclude that the concept of LSI Keywords is false.
The facts also indicate that LSI is not an important part of Google’s ranking algorithms.
Regarded in the light of recent advancements in AI, natural language processing, and BERT, the idea that Google would prominently use LSI as a ranking feature is simply not credible.