Motivation
Later this month we will be presenting a half-day tutorial on Open Search at SIGIR. It’ll focus on how to use open source software and cloud services for building and quickly prototyping advanced search applications. Open Search isn’t just about building a Google-like search box on a free technology stack; it’s also about encouraging the community to extend and embrace search technology to improve the relevance of any application.
For example, one non-search application of BOSS leveraged the Spelling service to spell-correct video comments before handing them off to its spam filter. The spelling correction process normalizes popular words that spammers intentionally misspell to get around spam models that rely on term statistics, and thus can increase spam detection accuracy.
We have split up our upcoming talk into two sections:
- Services: Open Search Web APIs (Yahoo! BOSS, Twitter, Bing, and Google AJAX Search), interesting mashup examples, ranking models and academic research that leverage or could benefit from such services.
- Software: How to use popular open source packages for vertical indexing of your own data.
While researching for the Software section, I was quite surprised by the number of open source vertical search solutions I found:
- Lucene (Nutch, Solr, Hounder), Sphinx, zettair, Terrier, Galago, Minnion, MG4J, Wumpus, RDBMS (mysql, sqlite), Indri, Xapian, grep …
And I was even more surprised by the lack of comparisons between these solutions. Many of these platforms advertise their performance benchmarks, but the benchmarks are run in isolation, use different data sets, and seem to be more focused on speed than on, say, relevance.
The best paper I could find that compared performance and relevance of many open source search engines was Middleton+Baeza’07, but the paper is quite old now and didn’t make its source code and data sets publicly available.
So, I developed a couple of fun, off the wall experiments to test (but mainly to build code examples for) some of the popular vertical indexing solutions. Here’s a table of the platforms I selected to study, with some high level feature breakdowns:

High level feature comparison among the vertical search solutions I studied. The support rating and scale are based on information I collected from web sites and conversations. I tested each solution's latest stable release as of this week.
One key design decision I made was not to change any numerical tuning parameters. I really wanted to test “Out of the Box” performance to simulate the common developer scenario. Plus, it takes forever to optimize parameters fairly across multiple platforms and different data sets.
Also, I tried my best to write each experiment natively for each platform using the expected library routines or binary commands.
Twitter Experiment
For the first experiment, I wanted to see how well these platforms index Twitter data. Twitter is becoming very mainstream, and its real time nature and brevity differ greatly from traditional web content (which these search platforms are overall more tailored for), so its data should make for some interesting experiments.
So I proceeded to crawl Twitter to generate a sample data set. After about a full day and night, I had downloaded ~1M tweets (~10/second).
But before indexing, I did some quick analysis of my acquired Twitter data set:
# of Tweets: 968,937
Indexable Text Size (user, name, text message): 92MB
Average Tweet Size: 12 words
Types of Tweets based on simple word filters:

Out of a 1M sample, what types of Tweets do we find? Unique Users means that there were ~600k users that authored all of the 1M tweets in this sample.
Very interesting stats here – especially the high percentage of tweets that seem to be asking questions. Could Twitter (or an application) better serve this need?
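As a rough illustration of what “simple word filters” could look like, here is a minimal Python sketch. The filter names and patterns are hypothetical (the filters actually used for the stats above aren’t listed here), and the input is assumed to be a list of (user, text) pairs.

```python
import re
from collections import Counter

# Hypothetical filters for illustration only; not the ones used for the stats above.
FILTERS = {
    "question": re.compile(r"\?"),
    "reply": re.compile(r"^@\w+"),
    "retweet": re.compile(r"\bRT\b"),
    "link": re.compile(r"https?://"),
    "hashtag": re.compile(r"#\w+"),
}


def categorize(text):
    """Return the set of filter names whose pattern matches the tweet text."""
    return {name for name, pattern in FILTERS.items() if pattern.search(text)}


def summarize(tweets):
    """tweets: iterable of (user, text) pairs. A tweet may match several filters."""
    counts = Counter()
    users = set()
    total = 0
    for user, text in tweets:
        total += 1
        users.add(user)
        counts.update(categorize(text))
    share = {name: (counts[name] / total if total else 0.0) for name in FILTERS}
    return {"total": total, "unique_users": len(users), "share": share}
```

Calling summarize on the crawled sample would give the tweet count, the number of unique authors, and the fraction of tweets matching each filter. Since the filters overlap, the fractions need not sum to 1.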
Here’s a table comparing the indexing performance over this Twitter data set across the select vertical search solutions:
Lucene was the only solution that produced an index smaller than the input data. Running it in optimize mode shaves off an additional 5 megabytes, at the cost of another ten seconds of indexing time. sphinx and zettair index the fastest. Interestingly, I ran zettair in big-and-fast mode (which sucks up 300+ megabytes of RAM), but it ran 3 seconds slower (maybe because of the nature of tweets).
I was most disappointed by Xapian, which was 5x slower than sqlite (which stores the raw input data in addition to the index) and produced by far the largest index files. The default index_text method in Xapian stores positional information, which blew the index size up to 529 megabytes. One must use index_text_without_positions to make the size more reasonable. I checked my Xapian code against the examples and documentation to see if I was doing something wrong, but I couldn’t find any discrepancies.
I also included a column about development issues I encountered. zettair was by far the easiest to use (simple command line) but required transforming the input data into a new format. I had some text issues with sqlite (which also needs to be recompiled with FTS3 enabled) and sphinx, given their strict input constraints. sphinx also requires a conf file, and full examples of one took some searching to find. Lucene, zettair, and Xapian were the most forgiving when it came to accepting text inputs (zero errors).
Measuring Relevancy: Medical Data Set
While this is a fun performance experiment for indexing short text, this test does not measure search performance and relevancy.
To measure relevancy, we need judgment data that tells us how relevant a document result is to a query. The best data set I could find that was publicly available for download (almost all of them require mailing in CDs) was from the TREC-9 Filtering track, which provides a collection of 196,403 medical journal references, totaling ~300MB of indexable text (titles, authors, abstracts, keywords) with an average of 215 tokens per record.
More importantly, this data set provides judgment data for 63 query-like tasks in the form of “<task, document, 2|1|0 rating>” (2 is very relevant, 1 is somewhat relevant, 0 is not relevant). An example task is “37 yr old man with sickle cell disease.” To turn this into a search benchmark, I treat these tasks as OR’ed queries. To measure relevancy, I compute the Average DCG across the 63 queries for results in positions 1-10.
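To make the metric concrete, here is a minimal Python sketch of Average DCG over positions 1-10. It assumes the common log2 discount, DCG = sum of rating_i / log2(i + 1) over ranks i = 1..10, and it treats documents with no judgment as rating 0. Both are assumptions: the exact DCG variant and the handling of unjudged documents aren’t specified above. The function and parameter names are made up for this example.

```python
import math


def dcg_at_k(ratings, k=10):
    """DCG over the top-k results; ratings are graded judgments (2, 1, 0) in rank order.

    Assumes the log2 discount: the result at rank i contributes rating / log2(i + 1).
    """
    return sum(r / math.log2(rank + 1) for rank, r in enumerate(ratings[:k], start=1))


def average_dcg(results_by_task, judgments, k=10):
    """Mean DCG@k across tasks.

    results_by_task: {task_id: [doc_id, ...]} in the order the engine returned them.
    judgments: {(task_id, doc_id): rating}; missing pairs count as 0 (an assumption).
    """
    scores = []
    for task_id, ranked_docs in results_by_task.items():
        ratings = [judgments.get((task_id, doc_id), 0) for doc_id in ranked_docs]
        scores.append(dcg_at_k(ratings, k))
    return sum(scores) / len(scores) if scores else 0.0
```

Each engine’s top-10 results per task would be fed in as results_by_task, with the TREC judgments loaded into the judgments dictionary. A higher average means relevant documents are appearing nearer the top.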
With this larger data set (3x larger than the Twitter one), we see zettair’s indexing performance improve (makes sense, as it’s designed more for larger corpora); zettair also dominates sphinx in all the key dimensions of performance and relevance. zettair’s search speed should probably be a bit faster than measured, because its search command line utility prints some unnecessary stats. For multi-searching in sphinx, I developed a Java client (with the hopes of making it competitive with Lucene, the one to beat) which connects to the sphinx searchd server via a socket (that’s their API model in the examples), so it probably suffers from some socket I/O overhead.
Lucene appears to be the winner here with the highest relevance, fastest search speed, and by far the smallest index size. Its index time could be improved by fiddling with its merge parameters, but I wanted to avoid numerical adjustments in this evaluation. Xapian has very similar search performance to Lucene but with significant indexing costs (both time and space > 3x). sqlite has the worst relevance because it doesn’t sort by relevance, nor does it seem to provide an ORDER BY function to do so.
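For a sense of what an OR’ed query against sqlite looks like, here is a small Python sketch using the standard sqlite3 module and an in-memory FTS3 table. This is not the benchmark code: the table name, column name, and tokenization are assumptions made for illustration. The function returns None if the local SQLite build lacks FTS3.

```python
import re
import sqlite3


def or_search(docs, task):
    """Index docs in an in-memory FTS3 table and run the task text as an OR query.

    Returns matching rowids in whatever order SQLite yields them (no relevance
    ranking is applied), or None if this SQLite build has no FTS3 module.
    """
    conn = sqlite3.connect(":memory:")
    try:
        try:
            conn.execute("CREATE VIRTUAL TABLE docs USING fts3(body)")
        except sqlite3.OperationalError:
            return None
        conn.executemany("INSERT INTO docs(body) VALUES (?)", [(d,) for d in docs])
        # Lowercase alphanumeric tokens only, so no token is read as an FTS operator.
        terms = re.findall(r"[a-z0-9]+", task.lower())
        if not terms:
            return []
        query = " OR ".join(terms)
        rows = conn.execute("SELECT rowid FROM docs WHERE docs MATCH ?", (query,))
        return [row[0] for row in rows]
    finally:
        conn.close()
```

With a task like “37 yr old man with sickle cell disease.”, this matches any record containing at least one of the task’s words. The results come back unranked, which is consistent with the low relevance score noted above.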
Conclusion & Downloads
Based on these preliminary results (but mostly on anecdotal information I’ve collected from the web and people in the field), I would probably recommend Lucene for many vertical search indexing applications, especially if you need something that works well out of the box (as that’s what I’m mainly evaluating here). Note that Lucene is an IR library; use a wrapper platform like Solr w/ Nutch if you need all the search dressings like snippets, crawlers, and servlets. If you need faster indexing, first try fiddling with Lucene’s settings (like mergeFactor), and if that’s still not quick enough, try zettair or sphinx (sphinx has more commercial support and great MySQL integration). At a glance, it seems Lucene spends more time compacting files during its indexing phase. Although more time consuming than zettair and sphinx, this process produces much smaller index files, which not only use less disk space but also help speed up searches (apparently making up for any Java v. C lag).
Keep in mind that these experiments are still very early and can be improved greatly with bigger and better data sets and tuned implementations. And like with any benchmark, please take these results with a grain of salt and still do your due diligence (in fact I am … just presenting the code examples for each platform at SIGIR).
To encourage further benchmarking, I’ve open sourced all the code here:
http://github.com/zooie/opensearch/tree/master
Happy to post any new and interesting results.






