How Much Page Does Yahoo Search Index?
One of the lesser-discussed facets of Web searching is the spidering limit that each search engine imposes. Even if a search engine is a full-text engine, it may not index the entirety of a given page if the page is too large. In Google’s case the limit is 101K for HTML pages (its spider will only index the first 101K of an HTML Web page; search Google for aardvark apple zither zephyr filetype:html and look at the file sizes of the results) and unknown for PDF files. (I can’t pin down the PDF limit; if you look at http://tinyurl.com/4px8n you’ll see that about two-thirds of the pages listed in the TOC are available in Google’s HTML version. 300K limit? 500K?)
I knew that Yahoo had a larger index limit, but I didn’t know how large. I learned earlier this week that Yahoo’s limit is the first 150K of a Web page, while its PDF indexing limit is 500K.
… this is what I’m told, anyway. However, I’m finding something interesting. Search Yahoo for aardvark apple zither zephyr originurlextension:html (originurlextension: is Yahoo’s gawdawful syntax for filetype:; I’m told they’ll be fixing it soon. Propburgers to Greg Notess of http://www.searchengineshowdown.com for educating me about it) and you’ll find that file sizes are listed with the search results, and that the sizes listed run well over 150K. I see page sizes of over 800K listed here! At least one of the pages listed, at 173K, appears from its cache to be fully indexed (the headers, footers, and copyright disclaimers are all in place; it doesn’t look “cut off”), and its cached copy, copied and pasted into a text editor, weighs in at well over 200K.
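If you want to run the same sanity check on a page of your own, here is a rough Python sketch of the arithmetic. It is only an illustration: the function names are made up for this example, the limits are the figures reported above rather than anything confirmed by Google or Yahoo, and it assumes "K" means 1,024 bytes and that pasted text is measured as UTF-8.

```python
# Index limits as reported in this article (in K); not confirmed by the engines.
REPORTED_LIMITS_KB = {
    "google_html": 101,
    "yahoo_html": 150,
    "yahoo_pdf": 500,
}


def text_size_kb(text, encoding="utf-8"):
    """Size of pasted text in K, assuming 1K = 1024 bytes."""
    return len(text.encode(encoding)) / 1024


def indexed_fraction(page_bytes, limit_kb):
    """Fraction of a page that fits under a limit, if the limit really applies."""
    if page_bytes <= 0:
        return 1.0
    limit_bytes = limit_kb * 1024
    return min(1.0, limit_bytes / page_bytes)


if __name__ == "__main__":
    # An 800K page against the reported 150K Yahoo HTML limit.
    page_bytes = 800 * 1024
    share = indexed_fraction(page_bytes, REPORTED_LIMITS_KB["yahoo_html"])
    print(f"Share of page under the reported limit: {share:.1%}")
```

If a page's cached copy looks complete even though this calculation says only a fraction of it should fit, that suggests the reported limit is not the whole story, which is what the 173K page above hints at.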
The bottom line is that Yahoo indexes far more of an HTML page than Google does; if you’re running searches that are likely to land on large pages (like word-list searches that might point you to dictionaries), try Yahoo first.
Tara Calishain is writer and editor at ResearchBuzz and author of the new book Web Search Garage.