A framework for semantic analysis centric content
Regardless of what I’ve said about the whole ‘LSI’ and Google crap in the past, one thing worth bearing in mind is that all modern search engines do semantic analysis to one extent or another. It may be phrase based, using PLSA, HTMM or a hybrid. That part is really inconsequential. What is important is that we can take heart in the fact that content that is semantically flexible will do a better job of targeting the page in question.
First off, a common concept worth looking at: semantic search is NOT semantic web. This is one area that seems to get confused all too often. We’re not talking about tagging. We’re talking about the probabilistic/statistical approach to understanding the concepts/meanings of a web page/document.
The next thing to get away from is the idea that only synonyms play a role within these concepts.
Building out concepts
All too often I see people talking only about stemming and synonyms. That’s only part of the picture. We also want to work on using terms that build out the theme/concept, which we might call ‘supporting terms’. That means we can consider;
Do not be limited to delivering only those signals. We want to go further into creating a deeper theme for that space including supporting terms such as;
- Spark plug
- High Performance
And phrases related to or containing them.
As we can see, those aren’t synonyms but supporting words or phrases that further establish the semantic concepts on the page. But we’d likely be more specific in our targeting with additional elements such as;
We can look at transactional and informational modifiers as well. This helps define the type of page that we have and the type of queries we are targeting. Or, for another example, some possible terms for ‘space shuttle’;
Getting the picture here?
What we’re looking to do is create a strong semantic theme of what the page is about through the words we’re using to frame it. If one searches for ‘Jaguar’ they have a few options to choose from:
- A Car
- An Animal
- Football team (US)
- Computer Application
By using semantic themes you will enable the search engine to better understand the concepts on your page. Remember, search engines have about a 6th grade reading/understanding level. We need to play nice with them.
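To make the ‘Jaguar’ example a bit more concrete, here is a minimal Python sketch of how supporting terms could tip the scale toward one meaning. This is not how any search engine actually does it. The term lists and the `guess_sense` function are made up purely for illustration, and the scoring is a simple overlap count.

```python
import re

# Hypothetical supporting-term lists for each sense of "jaguar".
# These are illustrative guesses, not data from any search engine.
SENSE_TERMS = {
    "car": {"engine", "sedan", "dealer", "horsepower", "luxury"},
    "animal": {"cat", "rainforest", "prey", "habitat", "species"},
    "football team": {"nfl", "quarterback", "stadium", "season", "touchdown"},
    "computer application": {"software", "install", "download", "version", "user"},
}


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def guess_sense(text):
    """Return the sense whose supporting terms overlap most with the text.

    Also returns the raw overlap counts. On a tie, the sense listed
    first in SENSE_TERMS wins.
    """
    words = set(tokenize(text))
    scores = {sense: len(terms & words) for sense, terms in SENSE_TERMS.items()}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    page = "The jaguar is a big cat whose habitat is the rainforest."
    print(guess_sense(page))
```

The point is simply that the word ‘jaguar’ alone tells the function nothing; it is the surrounding supporting terms that establish the theme.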
Elements search engines may look at
The interesting part about using semantic signals/approaches in search is that they can give a wealth of information through analysis of such elements as:
- TITLE of page
- Content of page (phrase ratios)
- Prominence factors (Headings, italics, lists)
- Anchor of inbound links
- TITLE and content of pages linking in
- Spam detection
- Duplicate content detection
Each of these can be weighted/dampened to give an overall page relevance score, which can then be sent to the rest of the processing system. This scoring is based on the current seed set of documents in the system, which has a learning mechanism to continually refine the algorithms.
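The weighting/dampening idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the signal names, the weights and the way the spam and duplicate checks dampen the score are invented, since we don’t know how any engine actually combines these.

```python
# Hypothetical weights for the positive signals listed above.
# The numbers are invented for illustration and sum to 1.0.
WEIGHTS = {
    "title": 0.30,
    "content": 0.30,
    "prominence": 0.15,
    "anchors": 0.15,
    "linking_pages": 0.10,
}


def page_relevance(signals, spam_penalty=0.0, duplicate_penalty=0.0):
    """Combine per-signal scores (each assumed to be 0..1) into one score.

    Missing signals count as 0. The spam and duplicate penalties (0..1)
    act as dampeners that scale the weighted sum down.
    """
    base = sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())
    dampening = (1.0 - spam_penalty) * (1.0 - duplicate_penalty)
    return base * dampening


if __name__ == "__main__":
    page = {"title": 0.9, "content": 0.7, "prominence": 0.5, "anchors": 0.4}
    print(page_relevance(page))
    print(page_relevance(page, duplicate_penalty=0.5))
```

A learning system would be tuning something like those weights over time; here they are fixed by hand.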
Ranking the pages
Of course the obvious question remains; how are these signals used? In the more common implementations out there, machine learning is the order of the day. The search engine would start with a seed set of documents that satisfy a given term/phrase ratio or similarity measure, and compare other documents to those for future scoring. Then, using various signals such as query and click data, they can further refine the seed set on the fly.
This would ultimately be combined with other relevance scoring mechanisms and core rankings, set to whatever threshold they deem fit, to deliver the end results. While these signals may not be enough to garner great rankings on their own, they are likely useful to those playing grab and hold via QDF (query deserves freshness). Any non-link velocity related signal would be at a premium in such cases.
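As a rough illustration of comparing a document against a seed set, here is a small Python sketch. It swaps in plain term-frequency vectors and cosine similarity, which is a deliberately simple stand-in and not a phrase based, PLSA or HTMM model. The function names and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a bag-of-words term-frequency vector for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def seed_similarity(candidate, seed_docs):
    """Average cosine similarity of a candidate document to the seed set."""
    cand = term_vector(candidate)
    sims = [cosine(cand, term_vector(doc)) for doc in seed_docs]
    return sum(sims) / len(sims) if sims else 0.0


if __name__ == "__main__":
    seed = [
        "spark plug high performance engine",
        "high performance spark plug ignition",
    ]
    print(seed_similarity("spark plug for high performance", seed))
    print(seed_similarity("chocolate cake recipe", seed))
```

A real system would refine the seed set with query and click data; this sketch only shows the comparison step.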
Putting it to use
The first thing we want to do is expand on our keyword research to provide not only primary and secondary targets, but also get into semantic support terms and even semantic baskets. This will be endlessly useful for content development, site audits, link building and more. Given the many signals that can be had, having these concepts integrated into the entire SEO program can be invaluable.
When you do this at the beginning (during the KW research) it can be easily fed into every other aspect of the SEO program.
There really are no tools for this, nor can I imagine one that would work (although I did talk to the WordStream gang about it recently). It is still more an art than a science. You see, we don’t know the relevance scoring for the seed set, and the SERPs are inclusive of other ranking factors. I have found it an interesting exercise to measure term occurrences on the pages ranking in the top 10 with the least amount of link juice/authority. While not perfect, it often surfaces concept rich pages.
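For the occurrence-measuring exercise, something like the following Python sketch can do the counting once you have the page text in hand. Fetching the pages and judging their link authority is left out entirely. The stopword list, the `min_pages` threshold and the function name are my own assumptions.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; extend it for real use.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "on", "with"}


def shared_terms(pages, min_pages=2):
    """Count how many pages each term appears on.

    `pages` is a list of plain-text page bodies. Returns (term, page_count)
    pairs for terms found on at least `min_pages` pages, most common first.
    """
    doc_freq = Counter()
    for text in pages:
        # Use a set so a term counts once per page, however often it repeats.
        terms = set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS
        doc_freq.update(terms)
    return [(term, n) for term, n in doc_freq.most_common() if n >= min_pages]


if __name__ == "__main__":
    top_pages = [
        "spark plug gap and heat range",
        "spark plug wire for the engine",
        "ignition coil and spark timing",
    ]
    for term, count in shared_terms(top_pages):
        print(term, count)
```

Terms that show up across many of those pages are candidates for your supporting-term list, though you still need to eyeball them.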
Getting into the mindset
As with many things in this thing of ours, it is something you need to get a feel for in the query space in question. What is important is getting into the habit of watching how you’re framing the content. Build around the core term with not only modifiers (geo-local, informational, transactional, plurals) but also with related terms that expand on the concepts.
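If it helps to see the framing habit written down, here is a small Python sketch that builds a ‘basket’ around a core term. The modifier lists are placeholder examples and `build_basket` is a hypothetical helper, not a tool that exists anywhere. Plurals and stemming are not handled.

```python
# Hypothetical modifier groups; the values are placeholders for illustration.
DEFAULT_MODIFIERS = {
    "geo-local": ["toronto", "ontario"],
    "informational": ["how to choose", "guide to"],
    "transactional": ["buy", "cheap"],
}


def build_basket(core_term, modifiers=None, related_terms=()):
    """Group a core term with modified phrases and related supporting terms."""
    if modifiers is None:
        modifiers = DEFAULT_MODIFIERS
    basket = {"core": [core_term], "related": list(related_terms)}
    for kind, words in modifiers.items():
        # Prefix each modifier to the core term, e.g. "buy spark plugs".
        basket[kind] = [f"{word} {core_term}" for word in words]
    return basket


if __name__ == "__main__":
    basket = build_basket("spark plugs", related_terms=["ignition coil", "high performance"])
    for kind, phrases in basket.items():
        print(kind, phrases)
```

The related terms are the part you have to research by hand; the modifiers are the easy bit.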
Now, before I leave you, I dug up a ton of tools, posts and even seminars to get you into the groove. Get a feel for how search engineers think and you will find getting actionable ideas all the more efficient. I hope you got something from all this; it is an area not discussed often enough. Enjoy!
Tools to play with
- Aaron’s tool has some interesting ‘Phrase Match’ data, but it is marginally effective for this exercise and would need sorting.
- KW Map is interesting, but also is marginally effective and has no export option to speak of. Close, but no cigar.
- Vseo Tool – Also not the greatest, but certainly presents some reasonable semantic concepts and can be exported.
- WordStream – also comes close (I am helping develop a tool tho), but there is nothing by default that really groups deeper semantic relations for our purposes. Emails the list to you for sorting purposes.
- Nichebot – these guys almost have it with the poorly named ‘LSI’ tool. This produces probably some of the best lists for our purposes. Fully exportable for sorting.
- Keyword Tool – about as use(less?) as the others. It has some insights, but not deep enough for this exercise. It is easier to sort, though, and does support downloads.
- Search-based Keyword Tool – not as good as the above KW tool in the testing I did recently for this. It does support exporting though.
- Google Sets – this one isn’t obvious right away, but handy. If you look at the ‘description’ element, you can start to see some supporting terms that might come in handy (since Google is recommending them). Problem is that it doesn’t give results for granular/obscure terms (also try Google Squared).
- Onelook reverse dictionary – returns the list of related terms, each word linked to its definition (more tricks from Ann here) – does a reasonable job but doesn’t have an export function.
- Reference.com reverse dictionary – clusters related terms into groups by their meaning and gives the actual definition for each cluster: barely usable.
- Rhyme Zone – define your term and find rhymes, synonyms and antonyms. Using the ‘Find related terms’ option you can get some pretty usable lists, unfortunately they are not exportable.
Good Geeky Reading
- What you need to know about phrase based IR
- Phrase based IR one more time
- Lost Google patent on Phrase Based IR
- Google awarded another Phrase based IR patent
- Phrase based optimization resources
- Probabilistic latent semantic analysis
- Latent Dirichlet allocation
- Hidden Topic Markov Models
- Phrase Based Information Retrieval and Spam Detection
- Google Phrase Based Indexing Patent Granted
- Determining query term synonyms within query context
Domain Dictionary Creation (NLP for non-roman character sets)
- Word decompounder
- Integrating external related phrase information into a phrase based indexing IR system
- Semantic unit recognition
- Phrase-based generation of document descriptions
- Segmenting words using scaled probabilities
- Inferring search category synonyms from user logs
- System and method for identifying base noun phrases
- Consistent phrase relevance measures
- Semantic canvas
- Method and system for performing phrase/word clustering and cluster merging
- System for automatically annotating training data for a natural language understanding system
- Method for finding semantically related search engine queries
- Ranking parser for a natural language processing system
- Context-based key phrase discovery and similarity measurement utilizing search engine query logs
- Flexible keyword searching
Videos for Geeks
- Extracting Semantic Relations from Query Logs - Ricardo Baeza-Yates, Yahoo! Research
In this paper we study a large query log of more than twenty million queries with the goal of extracting the semantic relations that are implicitly captured in the actions of users submitting queries and clicking answers. Previous query log analyses were mostly done with just the queries and not the actions that followed after them.
- Machine learning and translation – Google tech talks –
This is an interesting presentation on probabilistic learning and on building a better understanding of user intent. Kind of heavy lifting for the search geeks, but still worth watching for any SEO.
- Machine Learning, Probability and Graphical Models - Sam Roweis, Department of Computer Science, University of Toronto
- What’s the future of semantic search? – Matt Cutts video discussing the differences and his take on where it’s going