Yahoo Embraces Open Source Some More with Hadoop Inside Search Webmap

February 20, 2008

Arnold Zafra iPad News

As part of its ongoing move toward open source infrastructure, Yahoo announced that its webmap is now being processed using Apache Hadoop. The Yahoo webmap is a Hadoop application that produces the index from the billions of pages that Yahoo! Search crawls. According to Yahoo, this is by far the world’s largest application of Apache Hadoop. How big? Just look at these facts:

roughly 1 trillion links between pages in the Yahoo webmap index
over 300 TB of compressed output
over 10,000 cores used to run a single Map-Reduce job
over 5 Petabytes of raw disk used in the production cluster
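
For readers who have not met Map-Reduce before, here is a rough, single-machine sketch of the idea behind a job that works over links between pages. It is a toy illustration only: it does not use the Hadoop API and it is not Yahoo's webmap code. The function names (`map_links`, `reduce_counts`, `run_job`) and the three-page sample crawl are made up for this example.

```python
from collections import defaultdict

def map_links(page, outlinks):
    """Map step: emit a (target, 1) pair for every link found on a page."""
    for target in outlinks:
        yield target, 1

def reduce_counts(target, counts):
    """Reduce step: add up all the 1s emitted for the same target page."""
    return target, sum(counts)

def run_job(crawl):
    """Run map, group the emitted pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for page, outlinks in crawl.items():
        for key, value in map_links(page, outlinks):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())

# Hypothetical sample data: each page mapped to the pages it links to.
crawl = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
}

inbound = run_job(crawl)
print(inbound)
```

In a real Hadoop cluster the map and reduce steps run in parallel across many machines and the grouping is handled by the framework, which is what lets the same pattern stretch to many thousands of cores.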

If, like me, you just scratch your head reading those numbers, don’t fret. What they mean is that even with this much data, Yahoo now runs the same production processing in 66% of the time it took before Apache Hadoop was in use.

In other words, despite the gazillions of pages that Yahoo! Search crawls, Yahoo still manages to accomplish this task at reduced cost and with less administrative effort.

But the most important thing is that Yahoo was able to demonstrate that Apache Hadoop is gaining traction in the search market and is now ready for prime time, capable of handling massive internet-scale projects in a not-so-costly manner. Hopefully, search results were improved as well.

And of course, it’s a great step towards Yahoo’s openness strategy.

