Exposing the Invisible Web to Search Engines

The Invisible Web (aka Deep Web) is that humongous slice of the Internet’s web pages that traditional search engines either have not indexed or cannot index. Often, if they cannot index a page, it’s because the page is database-driven and requires a human trigger before it is rendered in your web browser. For example, you may have to ask a question, such as “show me all the job listings for project manager,” using an HTML form to enter information. In other instances, access to a web page requires authorization such as a username and password.

The end result is that only a relatively small slice of the Internet is indexed. Google may have the noble goal of indexing all the Earth’s information, but it’s an understatement to say that this will take some time, especially when the Invisible Web’s pages already outnumber those of the Visible Web.

My own research into the Invisible Web shows that, not surprisingly, no one seems to have an accurate figure for how many pages are part of it, since the number is constantly growing and has never truly been calculated. Approximations vary greatly. (I’m not citing any here because the numbers I came across are several years old, and the mass of blogs created since has expanded the Invisible Web greatly.)

The fact is, it really doesn’t matter how many pages are invisible, just that they are. Some traditional search engines, and some of the great number of Web 2.0 search engines [Read/Write Web] (over 100 at last count), are making a noble attempt at indexing content that would otherwise remain “invisible”. There are other ways to expose the Invisible Web, too. Here are a few, which exclude password-protected pages:

  1. List important pages in some sort of sitemap or site index.
  2. Bookmark pages at social bookmarking sites and promote them at community news sites. Here are just a few (apologies that it’s not comprehensive).
  3. Build a suitable lens at Squidoo and link to relevant, current invisible pages.
  4. Link from other authority sites, where relevant, such as Wikipedia.
  5. Deep link into your own archives.

    In other words, create links to invisible pages from visible (indexed) pages wherever you can. Most spiders will follow your links at their leisure, and if they can index the currently invisible pages, they will.
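
The first suggestion, a sitemap, can be sketched in a few lines of Python. This is only a minimal illustration, assuming the XML layout of the sitemaps.org protocol (a `urlset` element containing `url`/`loc` entries). The function name `build_sitemap` and the example.com URLs are made up for the example, and a real sitemap would usually be written to a file at the site root rather than printed.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return an XML string that lists each URL in a <url><loc> entry."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical deep pages that a spider might not reach on its own.
pages = [
    "https://example.com/archives/2007/01/some-old-post",
    "https://example.com/jobs/project-manager",
]

sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

The same list of URLs could just as easily feed a plain HTML site index; the point is simply to give spiders a visible page that links to the invisible ones.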

    Of course, some engines are trying to make it easier to access currently invisible content. Enth.com offers access into database-driven content (not just web pages) by converting an English query into a database query. The datasets are currently limited and the results are not very accurate, but it’s a start.
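
To make the idea of turning an English query into a database query concrete, here is a toy sketch in Python. It is not how Enth.com works (I have no knowledge of its internals). It assumes a single hard-coded phrasing, a made-up `jobs` table in an in-memory SQLite database, and a hypothetical helper named `english_to_sql`. Anything beyond that one phrasing is exactly the hard parsing problem.

```python
import re
import sqlite3

# Hypothetical pattern: one English phrasing mapped to one table.
JOB_QUERY = re.compile(
    r"show me all the job listings for (?P<title>.+)", re.IGNORECASE
)


def english_to_sql(question):
    """Translate one known phrasing into a parameterized SQL query.

    Returns (sql, params), or None if the question is not understood.
    """
    match = JOB_QUERY.search(question.strip().rstrip("."))
    if match is None:
        return None
    title = match.group("title").strip()
    return "SELECT title, company FROM jobs WHERE title LIKE ?", (f"%{title}%",)


# Made-up data standing in for a database-driven site.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (title TEXT, company TEXT)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [("Senior Project Manager", "Acme"), ("Web Designer", "Globex")],
)

translated = english_to_sql("Show me all the job listings for project manager")
sql, params = translated
rows = conn.execute(sql, params).fetchall()
print(rows)
```

A question the pattern does not recognize simply returns `None`, which hints at why limited datasets and inaccurate results are to be expected from early attempts.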

    My Computer Science Master’s Thesis was to have been on NQLs (Natural Query Languages) with a GIS (Geographical Information System) interface. My research predated my experience with online search engines, and my preliminary conclusion back in 1994 (I didn’t finish) was in favor of English-like queries. (Actually, I’d recommend something more phonetic like Esperanto.) We would, however, need a better understanding of how to parse the queries into something usable by computers. That would be the biggest hurdle.

    Thirteen years later, I’m not entirely sure how much further we’ve gotten with NQLs for online search engines, if only because I haven’t maintained my research.

    Still, I feel strongly that we’ll get to a point where we can speak queries and have a computer respond accurately. Something like this, though, cannot come about in a single generation of technology and research.

    Whatever features and functionality today’s search engines offer us, whether voice-based or not, have to be refined in successive generations. Then, I think, much of our invisible online content will be easier to index and thus easier to retrieve through queries, making search so much better.

    • Brent Franson

      If you are going to discuss ‘the invisible web’ you need to mention Chris Sherman and/or Gary Price. These are two guys in the search industry who do not get the credit they deserve. I like the post but hate it when Chris and Gary are left out, which happens far too often.

    • Loren Baker, Editor

      More information on the Invisible Web:


      About.com Invisible Web (compiled by Sherman)

      Gary Price’s Direct Search

      UC Berkeley Library info on Invisible Web

    • David

      Most blog webpages aren’t part of the invisible web as they require no human-specific action to be displayed, for example if your frontpage is not an orphan page you can be pretty sure that your whole blog is indexed.
      Most blogs link into their archives so they are actually indexed, therefore your estimation seems way over reality.

    • Raj Dash

      @David: Actually, that’s not true. The bulk of blog pages NEVER get indexed. Thus they are in fact a part of the invisible web. The trick is to get them indexed.

      Try using “site:mydomainname.com” on a young blog, in either Google or Yahoo and then tell me if my estimation seems way over reality.

    • Raj Dash

      You’ll find that almost no permalinks are indexed for young blogs, just category and “page #” and monthly archive pages.

      These pages are typically transient in content because new text gets placed in reverse chron order, thus making it hard to find what you might be looking for. Even in PR6 blogs that I write for, you cannot always find what you seek, even though it’s there. That’s called invisible.

    • chris

      No offense but it sounds like you are just rewriting what everyone knows…
      What a waste of my time

    • Raj Dash

      No offense? Well bravo for you. I am sorry for wasting your time.

    • Robb

      “Link from other authority sites, where relevant, such as Wikipedia.”

      Links from Wikipedia will not allow your pages to be seen by search engines, because Wikipedia recently added rel="nofollow" to all of their external links. See http://en.wikipedia.org/wiki/Wikipedia:Nofollow

    • Raj Dash

      @Robb: You are right, particularly for Google. Although, as I understand it (from discussing with a few SE experts), the algorithms are subject to change, and not all of the engines behave that way. Some supposedly index Nofollow’d links but do not pass any trust. But I’ll admit I’m a bit green re nofollow. Thanks for the input.

    • Loren

      No follow does not block pages from being found or indexed by search engines. It does block the passing of PageRank however, which is irrelevant to this argument.

    • Motorcycle Guy

      I have a suspicion that nofollows from wikipedia aren’t treated the same as other nofollows.

    • San Diego SEO

      I’m curious about the spam implications of tagging and bookmarking all of my ‘deep web’ blog entries or website pages. I know Newsvine in particular are super spam cops. I have submitted completely relevant and non-promotional blog entries that have been tagged as spam.

      Any thoughts?

    • dave tribbett

      Great post. Here is a good article that adds some additional detail to the topic and a good set of links to the deep web search engines and other helpful sites.