As part of its ongoing strategy of moving to open source infrastructure, Yahoo announced that its webmap is now processed using Apache Hadoop. The Yahoo webmap is a Hadoop application that produces the index from the billions of pages that Yahoo! Search crawls. This, according to Yahoo, is by far the world’s largest application of Apache Hadoop. How big? Just look at these facts:
- roughly 1 trillion links between pages in the Yahoo webmap index
- over 300 TB of compressed output
- over 10,000 cores used to run a single Map-Reduce job
- over 5 Petabytes of raw disk used in the production cluster
If you’re like me and just scratch your head upon reading those facts, don’t fret. What this data means is that, even at this scale, Yahoo manages to run the same processing on its production cluster in 66% of the time it took before Apache Hadoop was used.
In other words, despite the gazillions of pages that Yahoo Search crawls, it still manages to accomplish this task at reduced cost and with less administrative effort.
But the most important thing is that, by using Apache Hadoop, Yahoo was able to demonstrate that Hadoop is gaining traction in the search market and is now ready for prime time, capable of handling massive internet-scale projects at a reasonable cost. Hopefully, search results have improved as well.
And of course, it’s a great step towards Yahoo’s openness strategy.
 
        