Mozdex recently announced our ambitious project to build a search engine with global reach, built on open source technology. Mozdex will operate transparently, revealing the processes and methods used to create and manage our index. We will use this forum to provide updates on the inner workings and creation of Mozdex.
The Making of a Search Engine
Designing a search engine is something of a challenge. Our goals are to provide an index that is fast, available and, most importantly, relevant. Apart from the costs involved, these are the core issues in such a project. This article describes our basic goals and outlines the critical factors behind each of them.
Mozdex runs on top of the Lucene index engine from the Apache Jakarta Project. Lucene is a fast, efficient index system written entirely in Java. As with any large index system, the critical factors for performance are cached results and load distribution.
We use modern hardware with large amounts of RAM to keep as much of the index as possible in memory. We also distribute index segments across multiple servers, so each server has less data to search per query. Together these give us quick query completion and fast search results.
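As a rough illustration of the segment distribution described above, the Python sketch below queries several in-memory segments in parallel and merges their partial results into one ranked list. It is not Mozdex or Lucene code: the `SEGMENTS` data, the function names and the scores are all hypothetical, and each dictionary stands in for an index segment held in RAM on a separate server.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory segments: each maps a term to (doc_id, score) postings.
SEGMENTS = [
    {"lucene": [("doc1", 0.9), ("doc4", 0.4)]},
    {"lucene": [("doc7", 0.7)], "java": [("doc8", 0.5)]},
    {"lucene": [("doc9", 0.95)]},
]

def search_segment(segment, term, k):
    """Return the top-k (score, doc_id) hits for a term from one segment."""
    hits = [(score, doc_id) for doc_id, score in segment.get(term, [])]
    return heapq.nlargest(k, hits)

def search_all(segments, term, k=3):
    """Query every segment in parallel, then merge the partial results."""
    workers = max(1, len(segments))  # ThreadPoolExecutor rejects 0 workers
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda seg: search_segment(seg, term, k), segments)
        merged = [hit for partial in partials for hit in partial]
    # Only k hits per segment are merged, so the final sort stays small.
    return heapq.nlargest(k, merged)

if __name__ == "__main__":
    for score, doc_id in search_all(SEGMENTS, "lucene"):
        print(doc_id, score)
```

Because each segment returns only its own top hits, the merge step handles a small amount of data no matter how large the full index grows, which is the property that makes splitting the index worthwhile.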
Compared to other search systems we use fewer servers, but each one carries a much higher memory density. Our most significant costs in this project are physical space, physical resources, and co-location and bandwidth fees. By using modern platforms, such as AMD Opteron powered servers, we are able to store up to 16 gigabytes of index data in memory and access it quickly and affordably.
Nothing is worse than visiting a site that is slow, missing data or simply not answering. We not only use open source solutions for our index, we also rely on them in our front end. All application servers are built around the Apache Jakarta Project's Tomcat Java server and are load balanced and managed behind Squid caching servers. This gives us control, performance and reliability that even many commercial products cannot match. Index availability is achieved by using switching technologies to distribute queries over replicated clusters of query servers. Each server will initially have an “A” and a “B” node, which are round-robined for performance and failed over for high availability. As our query demand grows, “C” and “D” servers (and so on) can be added quickly and easily to handle the load.
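The round-robin and failover behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: in our setup the distribution is done by switching hardware, not application code, and the `ReplicaPool` class and its method names are hypothetical.

```python
class ReplicaPool:
    """Rotate queries across replica nodes, skipping any marked as down."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.down = set()
        self._next = 0  # position of the next node to try

    def mark_down(self, node):
        self.down.add(node)

    def mark_up(self, node):
        self.down.discard(node)

    def add_node(self, node):
        # New replicas ("C", "D", ...) join the rotation immediately.
        self.nodes.append(node)

    def next_node(self):
        # Try each node at most once so a full outage cannot loop forever.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node not in self.down:
                return node
        raise RuntimeError("no healthy replicas available")

if __name__ == "__main__":
    pool = ReplicaPool(["A", "B"])
    print([pool.next_node() for _ in range(4)])  # alternates between A and B
    pool.mark_down("A")
    print([pool.next_node() for _ in range(2)])  # only B answers
    pool.add_node("C")
    print([pool.next_node() for _ in range(3)])  # B and C share the load
```

Keeping the health state separate from the rotation means a failed node can rejoin with `mark_up` without disturbing the nodes that are already serving queries.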
The key to a search is finding what you are looking for. Relevancy can be subjective: it often depends on the opinions and ideas of the searcher, as well as on the algorithms used to determine it.
You can run queries on our system and click ‘Explain’ and ‘Anchors’ to see the process by which the results are displayed and weighted. At our current index level of roughly 50 million pages, we don’t yet cover a large enough portion of the whole Internet, so the results will evolve as we get closer to 250 million or more URLs.
With this brief overview of our core issues complete, we will delve deeper into individual topics in Part 2 of this series, where we will provide PDF diagrams of server and switching details and look more closely at what our ‘Explain’ and ‘Anchors’ output means.