InfoWorld
Google aims to penetrate Deep Web with HTML forms crawling

Google has started experimenting with technologies that will allow its search engine to index HTML forms like drop-down boxes and select menus


In a move that takes the search engine giant closer to what is commonly called the Deep Web, Google Inc. said Friday that it has started experimenting with ways for its search engine to index HTML forms like drop-down boxes and select menus.

Over the past few months, Google has been submitting queries through some HTML forms to see whether doing so could uncover Web pages that otherwise couldn't be found or indexed for users, noted Googlers Jayant Madhavan and Alon Halevy, members of the Crawling and Indexing team.

"For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML," they noted in a blog post. "Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the Web page resulting from our query is valid, interesting and includes content not in our index, we may include it in our index much as we would include any other Web page."
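As a rough illustration of the process the Googlers describe, the Python sketch below builds candidate query URLs from a form's inputs. It is not Google's implementation: the function name `candidate_urls`, its parameters, the `limit` cap, and the assumption that the form submits via GET (so the chosen values end up in the URL) are all hypothetical choices made for this example.

```python
from itertools import product
from urllib.parse import urlencode


def candidate_urls(action, select_inputs, text_inputs, site_words, limit=10):
    """Build candidate GET URLs for a form (illustrative sketch only).

    action        -- the form's action URL (assumed to accept GET requests)
    select_inputs -- dict mapping input name -> list of values found in the HTML
    text_inputs   -- list of text box names
    site_words    -- words taken from the site, tried in every text box
    limit         -- hypothetical cap on how many URLs are generated
    """
    names = list(select_inputs) + list(text_inputs)
    if not names:
        return []

    # Select menus, check boxes and radio buttons use the values in the HTML;
    # text boxes use words drawn from the site itself.
    value_lists = [select_inputs[name] for name in select_inputs]
    value_lists += [site_words for _ in text_inputs]

    urls = []
    for combo in product(*value_lists):
        query = urlencode(dict(zip(names, combo)))
        urls.append(f"{action}?{query}")
        if len(urls) >= limit:
            break
    return urls


urls = candidate_urls(
    "http://example.com/search",
    {"state": ["CA", "NY"]},
    ["q"],
    ["cars", "homes"],
)
for url in urls:
    print(url)
```

Each generated URL would then be fetched and kept only if the resulting page looks valid and adds content not already in the index, a step this sketch leaves out.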

If a site uses tools that tell search engines not to crawl it, Google will adhere to those instructions, it said. In addition, Google will omit any forms that require password input or that use terms commonly associated with personal information, such as logins or user IDs.
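The exclusions described above could be approximated with a simple filter like the Python sketch below. It is a guess at the general shape of such a check, not Google's actual rules: the `SENSITIVE_TERMS` list, the `form_is_crawlable` function, the dict layout of each input, and the `ExampleBot` user agent are hypothetical, and robots.txt is used here only as one common example of a crawl-blocking tool.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical list of field-name fragments that suggest personal information.
SENSITIVE_TERMS = ("login", "userid", "user_id", "password", "passwd")


def form_is_crawlable(form_inputs, action_url, robots_txt_lines,
                      user_agent="ExampleBot"):
    """Return True if a form looks safe to crawl (illustrative sketch only).

    form_inputs      -- list of dicts such as {"type": "text", "name": "q"}
    action_url       -- URL the form submits to
    robots_txt_lines -- lines of the site's robots.txt file
    """
    for field in form_inputs:
        # Skip any form that asks for a password.
        if field.get("type", "text").lower() == "password":
            return False
        # Skip forms whose field names hint at logins or user IDs.
        name = field.get("name", "").lower()
        if any(term in name for term in SENSITIVE_TERMS):
            return False

    # Honor the site's crawl instructions.
    parser = RobotFileParser()
    parser.parse(robots_txt_lines)
    return parser.can_fetch(user_agent, action_url)


robots = ["User-agent: *", "Disallow: /private/"]
search_form = [{"type": "select", "name": "state"}, {"type": "text", "name": "q"}]
print(form_is_crawlable(search_form, "http://example.com/search", robots))
```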

The Web pages discovered using the enhanced crawling method will not come at the expense of the regular Web pages that are already part of the crawl, so this methodology won't impact page ranking, Google noted.

"This experiment is part of Google's broader effort to increase its coverage of the Web," Google noted. "In fact, HTML forms have long been thought to be the gateway to large volumes of data beyond the normal scope of search engines. The terms Deep Web, Hidden Web, or Invisible Web have been used collectively to refer to such content that has so far been invisible to search engine users. By crawling using HTML forms, we are able to lead search engine users to documents that would otherwise not be easily found in search engines, and provide Web masters and users alike with a better and more comprehensive search experience."


Copyright © 2008. All rights reserved.