
Google Experimenting with Technology to Index HTML Forms

Googlers Jayant Madhavan and Alon Halevy, members of the Crawling and Indexing team, recently indicated that Google has been experimenting with filling in and submitting HTML forms to discover web pages that otherwise couldn’t be found or indexed for users. By crawling through HTML forms, including drop-down boxes and select menus, Google has taken one step closer to the Deep Web.

In their blog post, the Googlers described their process:

“For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the Web page resulting from our query is valid, interesting and includes content not in our index, we may include it in our index much as we would include any other Web page.”
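The URL generation step described in the quote can be sketched roughly as follows. This is only an illustration, not Google's implementation: the function name `candidate_urls`, its parameters, and the `limit` cap are all hypothetical, and the sketch assumes the form submits via GET so that each choice of values maps to a query string.

```python
from itertools import product
from urllib.parse import urlencode, urljoin


def candidate_urls(page_url, action, select_inputs, text_inputs, site_words, limit=50):
    """Build GET URLs for combinations of form input values.

    select_inputs: dict mapping an input name to the values listed in the HTML
    text_inputs:   list of text-box names, filled with words taken from the site
    site_words:    words chosen from the site that hosts the form
    limit:         hypothetical cap so the number of URLs stays bounded
    """
    names = list(select_inputs) + list(text_inputs)
    value_lists = [select_inputs[n] for n in select_inputs]
    value_lists += [site_words for _ in text_inputs]

    # Resolve the form's action against the page that contains the form.
    base = urljoin(page_url, action)

    urls = []
    for combo in product(*value_lists):
        query = urlencode(list(zip(names, combo)))
        urls.append(base + "?" + query)
        if len(urls) >= limit:
            break
    return urls


urls = candidate_urls(
    "http://example.com/search/",
    "results",
    {"cat": ["a", "b"]},
    ["q"],
    ["foo", "bar"],
)
```

Each generated URL would then be fetched and judged on whether the resulting page is valid and adds content not already in the index; that evaluation step is not shown here.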

If you’re worried about forms being indexed that you’d rather not have included, Google said that it will adhere to any instructions or tools on a site that prevent search engines from crawling certain sections. Furthermore, it will omit forms that require password inputs, as well as those that use terms frequently associated with personal information, such as logins or user IDs.
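A crawler could apply exclusion rules like these before touching a form. The sketch below is an assumption-laden illustration, not Google's actual logic: `should_skip_form`, the `SENSITIVE_TERMS` list, and the `ExampleBot` user agent are hypothetical, and robots.txt stands in for whatever crawl instructions a site provides. It uses Python's standard `urllib.robotparser` module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical list of field-name fragments that hint at personal information.
SENSITIVE_TERMS = ("login", "userid", "user_id", "username")


def should_skip_form(inputs, form_url, robots_lines, user_agent="ExampleBot"):
    """Return True if the form should be left alone.

    inputs:       list of (name, type) pairs for the form's input fields
    form_url:     URL the form would be submitted to
    robots_lines: lines of the site's robots.txt, already downloaded
    """
    # Respect the site's crawl instructions first.
    robots = RobotFileParser()
    robots.parse(robots_lines)
    if not robots.can_fetch(user_agent, form_url):
        return True

    for name, input_type in inputs:
        # Forms that ask for a password are skipped.
        if input_type.lower() == "password":
            return True
        # So are forms whose field names suggest personal information.
        if any(term in name.lower() for term in SENSITIVE_TERMS):
            return True
    return False


robots_txt = ["User-agent: *", "Disallow: /private/"]
skip_search = should_skip_form([("q", "text")], "http://example.com/search", robots_txt)
skip_private = should_skip_form([("q", "text")], "http://example.com/private/form", robots_txt)
skip_password = should_skip_form([("pw", "password")], "http://example.com/search", robots_txt)
skip_user_id = should_skip_form([("user_id", "text")], "http://example.com/search", robots_txt)
```

Only the plain search form passes the check here; the other three are skipped.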

Concerns that this enhanced crawling method will come at the expense of regular web pages appear to be unfounded. According to Google, the method won’t affect sites that are already part of the crawl, and it won’t impact page ranking. It is aimed simply at increasing the search engine’s coverage of the web.


2 thoughts on “Google Experimenting with Technology to Index HTML Forms”

  1. Well, you knew it was coming. I also don’t feel that forms need to be indexed, but the spammers have already found a way to fill them with junk URLs and submit them, so I’m sure Google doesn’t want anyone to think that THEY can’t read forms, too.

    ;-)