Summarization, the Answer to Web Search: Interview with Dmitri Soubbotin of SenseBot

Search engines are as much the first source of information as they are the starting point for research. Summarizing the content of query results is an innovative way to obtain an intelligent response from the system. This is the concept behind SenseBot, and I am glad that Dmitri Soubbotin took the time to answer a few questions on the developments at SenseBot.

What were the key technology concepts and target areas for SenseBot?

I first came up with the concept of a summary as a type of response to a search engine query several years ago. Since then, the relevance of results returned by major search engines has improved dramatically. However, users’ expectations have grown as well. Today many search engines are trying to do more than just display 10 links on the first results page.

At a very high level, we read the sources returned by a major search engine. We perform text mining on each source, extracting key concepts. We assess similarities between the sources and drop those that are far off, i.e. not related to what the mass of the sources is about. We assign weights to the concepts and give preferential treatment to the concepts representing the query. We then perform multi-document summarization, constructing a text summary out of the documents according to a proprietary algorithm. So the actual result of the Web search turns out to be a summary on the user’s query topic.
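
The steps described here can be sketched in a few lines of Python. This is only a rough illustration, not SenseBot's proprietary algorithm: it assumes that word counts stand in for "key concepts", that cosine similarity against the other sources decides which ones are "far off", and that sentences are picked by average concept weight. All names (concepts, cosine, summarize) and the threshold and boost values are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny stopword list, chosen for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
             "on", "for", "it", "that", "with", "as", "by", "was", "at"}

def concepts(text):
    """Crude stand-in for concept extraction: word counts minus stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def summarize(docs, query, max_sentences=3, min_similarity=0.1, query_boost=3.0):
    """Return (indices of kept sources, list of summary sentences)."""
    vectors = [concepts(d) for d in docs]
    total = Counter()
    for v in vectors:
        total.update(v)

    # Drop sources that share little vocabulary with all the other sources.
    if len(docs) < 2:
        kept = list(range(len(docs)))
    else:
        kept = [i for i, v in enumerate(vectors)
                if cosine(v, total - v) >= min_similarity]

    # Weight concepts by frequency in the kept sources, favoring query terms.
    weights = Counter()
    for i in kept:
        weights.update(vectors[i])
    for term in concepts(query):
        if term in weights:
            weights[term] *= query_boost

    # Score each sentence by the average weight of its concepts.
    scored = []
    for i in kept:
        for sentence in re.split(r"(?<=[.!?])\s+", docs[i].strip()):
            terms = concepts(sentence)
            if terms:
                score = sum(weights[t] for t in terms) / len(terms)
                scored.append((score, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    summary = []
    for _, sentence in scored:
        if sentence not in summary:
            summary.append(sentence)
        if len(summary) == max_sentences:
            break
    return kept, summary

# Made-up sample sources: three on one topic, one unrelated.
docs = [
    "Solar panels convert sunlight into electricity. Panels are installed on roofs.",
    "Electricity from solar panels lowers energy bills. Sunlight is free.",
    "Solar energy uses panels to capture sunlight. Electricity is stored in batteries.",
    "Boil pasta in salted water. Add tomato sauce and basil.",
]
kept, summary = summarize(docs, "solar panels")
print(kept)
print(" ".join(summary))
```

In this toy run the unrelated fourth source shares no concepts with the others, so it is dropped before any sentence is scored. A real system would need far better concept extraction than word counts.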

The best results can be achieved on a set of documents that are indeed close to the topic, and are primarily textual. Vertical search engines and portals seem to be the best application area for us from this perspective – financial, medical, legal, libraries, etc. As for generic Web searches, some amount of “noise” is inevitable, even for the sources from the first page of results – presumably the most relevant.

What are the new updates that have been made available recently?

We have improved the weighting of the query with this latest upgrade. This will affect the cases when the results returned by Google on the user’s query are only partially related to the query topic. What we did was to add more weight to the query-related concepts, ensuring that we are looking at the document content through the focus of the query. Also, the weighting applies to multiple languages in addition to English. We are continuously working on improving our algorithms.
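
To make the effect of query weighting concrete, here is a minimal sketch, assuming that a concept's weight is simply its frequency across the results and that query-related concepts are multiplied by a boost factor. The function name top_sentence, the boost values, and the sample sentences are hypothetical; the interview does not disclose the actual weighting scheme.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def top_sentence(sentences, query, query_boost):
    """Pick the sentence with the highest average concept weight.

    Weight = frequency of the word across all sentences, multiplied by
    query_boost when the word also appears in the query (an assumption).
    """
    weights = Counter(w for s in sentences for w in tokens(s))
    query_terms = set(tokens(query))

    def score(sentence):
        words = tokens(sentence)
        if not words:
            return 0.0
        total = sum(weights[w] * (query_boost if w in query_terms else 1.0)
                    for w in words)
        return total / len(words)

    return max(sentences, key=score)

# Results that are only partially related to the query "jaguar speed".
results = [
    "The new car has a larger engine.",
    "Dealers sell the car with the engine upgrade.",
    "A jaguar can sprint at high speed.",
]
print(top_sentence(results, "jaguar speed", query_boost=1.0))
print(top_sentence(results, "jaguar speed", query_boost=4.0))
```

Without a boost, the sentence built from the most frequent words wins even though it ignores the query. With the boost, the one sentence that mentions the query terms rises to the top, which is the behavior the upgrade is described as aiming for.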

We have also rolled out a performance upgrade that now gives up to 50% improvement over previous timings. Note that most of the time is spent on reading in the Web results; the actual processing is very fast. If SenseBot were to integrate with a major search engine or a portal which hosts the documents, the users would be seeing the summaries almost instantly.

Does the engine directly search from any vertical search engine based on the assessment of the query or the keywords?

No, although it’s a great idea. At this moment we only query one of the Big 3, whichever the user prefers. For the verticals, we have separate tools that allow you to summarize sets of selected Web pages or documents.

Are there any future applications for SenseBot in the pipeline? For example, considering that the engine could summarize content missing from online encyclopedias such as Wikipedia?

I see a number of future applications, mostly integrated with search engines (major or vertical) or portals. Enterprise search would be another natural fit. Bringing out relevant content from beyond the first page of results is something we can do. As for the content areas, here are just a few examples where we are seeing a good fit:

  • Education: automatically building introductions to a particular area of knowledge or study; preparing the groundwork for an essay on a subject.
  • Libraries: research applications for librarians and library patrons.
  • Financial news and research: giving a scoop on what’s happening in a particular area of the economy, the markets, in a given sector, etc.
  • Competitive intelligence: analyzing a set of documents on a competitor, or an area that is targeted by a number of competitors.
  • Medical information: providing a digest on a medical condition or symptoms.
  • Legal information: providing a digest on a legal situation or a development of a legal concept.

As for Wikipedia, I view it as a great source of information, though I prefer to use it as one of the sources. One of our users has actually called SenseBot “a mechanical Wikipedia”. But the major difference is that we present unedited, up-to-the-minute content, based on whatever is returned as the most relevant information by search engines. Yes, our summary is sometimes rough; but the freshness, diversity, and lack of bias probably compensate for the roughness.

SenseBot takes an algorithmic approach to search when the trend seems to be more towards integrating social features such as user-ranked, user-submitted content (especially in the alternative search space). What is your view on these developments and new features that would be coming to SenseBot?

Yes, it looks like the main trend at the moment is “socializing” search. AltSearchEngines maintains a comprehensive roster and reviews of search engines of all kinds, and a lot of them involve users voting or somehow participating in figuring out the right results. But I think that with the humongous size of the Web, human participation can only help to a certain extent. You still need algorithms to mine and organize information in a meaningful way.

In the end, the goal of any search engine is user satisfaction, which can be expressed in whether the user has found an answer to his query or not, and how much time he spent searching. Having a summary of the top relevant results may give the user 80% of the answer in just a few seconds, and in many cases that 80% is enough. The summary may already satisfy the user, without the need to drill down into individual sources. If the user wants to dig deeper, the summary can give him a good idea of the quality of the sources, so that he goes directly to those that speak on the same wavelength about the topic.

For example, I have just sent a query “omaha gunman” to SenseBot, picking Google as the engine. SenseBot returned a concise digest on the shooting, with a particular focus on who the gunman was. All news networks have reported the story, in their own way, with many details. But the summary, in just a few seconds, gave me a good idea of what happened, with a few key details pulled out from different sources. It’s like watching several TV screens simultaneously, and being able to get the gist of the story.

Glancing through the summary, I also noticed that SenseBot has dropped 2 out of 8 sources that Google returned. I checked them both – they were indeed shallow on content!

So that was a news story, but the types of queries where SenseBot can really flourish are those where a user is trying to understand a new concept, or research a particular subject.

Thank you, Dmitri. Readers can try the features of the SenseBot engine from here.
