A Comparison of Open Source Search Engines

July 6, 2009 by Vik

Motivation

Later this month we will be presenting a half day tutorial on Open Search at SIGIR. It’ll basically focus on how to use open source software and cloud services for building and quickly prototyping advanced search applications. Open Search isn’t just about building a Google-like search box on a free technology stack, but encouraging the community to extend and embrace search technology to improve the relevance of any application.

For example, one non-search application of BOSS leveraged the Spelling service to spell correct video comments before handing them off to their Spam filter. The Spelling correction process normalizes popular words that spammers intentionally misspell to get around spam models that rely on term statistics, and thus, can increase spam detection accuracy.

We have split up our upcoming talk into two sections:

  • Services: Open Search Web APIs (Yahoo! BOSS, Twitter, Bing, and Google AJAX Search), interesting mashup examples, ranking models and academic research that leverage or could benefit from such services.
  • Software: How to use popular open source packages for vertical indexing your own data.

While researching for the Software section, I was quite surprised by the number of open source vertical search solutions I found:

And I was even more surprised by the lack of comparisons between these solutions. Many of these platforms advertise their performance benchmarks, but they are in isolation, use different data sets, and seem to be more focused on speed as opposed to say relevance.

The best paper I could find that compared performance and relevance of many open source search engines was Middleton+Baeza’07, but the paper is quite old now and didn’t make its source code and data sets publicly available.

So, I developed a couple of fun, off the wall experiments to test (but mainly to build code examples for) some of the popular vertical indexing solutions. Here’s a table of the platforms I selected to study, with some high level feature breakdowns:

High level feature comparison among the vertical search solutions I studied; The support rating and scale are based on information I collected from web sites and conversations (please feel free to comment).

High level feature comparison among the vertical search solutions I studied; The support rating and scale are based on information I collected from web sites and conversations. I tested each solution's latest stable release as of this week.

One key design decision I made was not to change any numerical tuning parameters. I really wanted to test “Out of the Box” performance to simulate the common developer scenario. Plus, it takes forever to optimize parameters fairly across multiple platforms and different data sets.

Also, I tried my best to write each experiment natively for each platform using the expected library routines or binary commands.

Twitter Experiment

For the first experiment, I wanted to see how well these platforms index Twitter data. Twitter is becoming very mainstream, and its real time nature and brevity differs greatly from traditional web content (which these search platforms are overall more tailored for) so its data should make for some interesting experiments.

So I proceeded to crawl Twitter to generate a sample data set. After about a full day and night, I had downloaded ~1M tweets (~10/second).

But before indexing, I did some quick analysis of my acquired Twitter data set:

# of Tweets: 968,937

Indexable Text Size (user, name, text message): 92MB

Average Tweet Size: 12 words

Types of Tweets based on simple word filters:

Out of a 1M sample, what kind of Tweet types do we find?

Out of a 1M sample, what types of Tweets do we find? Unique Users means that there were ~600k users that authored all of the 1M tweets in this sample.

Very interesting stats here – especially the high percentage of tweets that seem to be asking questions. Could Twitter (or an application) better serve this need?

Here’s a table comparing the indexing performance over this Twitter data set across the select vertical search solutions:

Indexing 1M twitter messages on a variety of open source search solutions; measuring time and space for each.

Indexing 1M twitter messages on a variety of open source search solutions.

Lucene was the only solution that produced an index that was smaller than the input data size. Shaves an additional 5 megabytes if one runs it in optimize mode, but at the consequence of adding another ten seconds to indexing. sphinx and zettair index the fastest. Interestingly, I ran zettair in big-and-fast mode (which sucks up 300+ megabytes of RAM) but it ran slower by 3 seconds (maybe because of the nature of tweets). I was most disappointed by Xapian, which was 5x slower than sqlite (which stores the raw input data in addition to the index) and produced by far the largest index file sizes. The default index_text method in Xapian stores positional information, which blew the index size to 529 megabytes. One must use index_text_without_positions to make the size more reasonable. I checked my Xapian code against the examples and documentation to see if I was doing something wrong, but I couldn’t find any discrepancies. I also included a column about development issues I encountered. zettair was by far the easiest to use (simple command line) but required transforming the input data into a new format. I had some text issues with sqlite (also needs to be recompiled with FTS3 enabled) and sphinx given their strict input constraints. sphinx also requires a conf file which took some searching to find full examples of. Lucene, zettair, and Xapian were the most forgiving when it came to accepting text inputs (zero errors).

Measuring Relevancy: Medical Data Set

While this is a fun performance experiment for indexing short text, this test does not measure search performance and relevancy.

To measure relevancy, we need judgment data that tells us how relevant a document result is to a query. The best data set I could find that was publicly available for download (almost all of them require mailing in CD’s) was from the TREC-9 Filtering track, which provides a collection of 196,403 medical journal references – totaling ~300MB of indexable text (titles, authors, abstracts, keywords) with an average of 215 tokens per record. More importantly, this data set provides judgment data for 63 query-like tasks in the form of “<task, document, 2|1|0 rating>” (2 is very relevant, 1 is somewhat relevant, 0 is not relevant). An example task is “37 yr old man with sickle cell disease.” To turn this into a search benchmark, I treat these tasks as OR’ed queries. To measure relevancy, I compute the Average DCG across the 63 queries for results in positions 1-10.

Performance and Relevancy marks on the TREC OHSUMED Data Set; Lucene is the smallest, most relevant and fastest to search; Xapian is very close to Lucene on the search side but 3x slower on indexing and 4x bigger in index space; zettair is the fastest indexer.

Performance and Relevancy marks on the TREC-9 across select vertical search solutions.

With this larger data set (3x larger than the Twitter one), we see zettair’s indexing performance improve (makes sense as it’s more designed for larger corpora); zettair also dominates sphinx in all the key dimensions of performance and relevance. zettair’s search speed should probably be a bit faster because its search command line utility prints some unnecessary stats. For multi-searching in sphinx, I developed a Java client (with the hopes of making it competitive with Lucene – the one to beat) which connects to the sphinx searchd server via a socket (that’s their API model in the examples) so that probably suffers from some socket I/O overhead. Lucene appears to be the winner here with the highest relevance, fastest search speed, and by far the smallest index size. The index time could be improved by fiddling with its merge parameters but I wanted to avoid numerical adjustments in this evaluation. Xapian has very similar search performance to Lucene but with significant indexing costs (both time and space > 3x). sqlite has the worst relevance because it doesn’t sort by relevance nor seem to provide an ORDER BY function to do so.

Conclusion & Downloads

Based on these preliminary results (but mostly on anecdotal information I’ve collected from the web and people in the field), I would probably recommend Lucene (which is an IR library – use a wrapper platform like Solr w/ Nutch if you need all the search dressings like snippets, crawlers, servlets) for many vertical search indexing applications – especially if you need something that works well out of the box (as that’s what I’m mainly evaluating here). If you need faster indexing, first try fiddling with Lucene’s settings (like mergeFactor), and if that’s still not quick enough, try zettair or sphinx (sphinx has more commercial support and great MySQL integration). From a glance, it seems Lucene spends more time compacting files during its indexing phase. Although more time consuming than zettair and sphinx, this process produces much smaller index files, which not only uses less disk space, but also helps speed up searches (apparently making up for any Java v. C lag).

Keep in mind that these experiments are still very early and can be improved greatly with bigger and better data sets and tuned implementations. And like with any benchmark, please take these results with a grain of salt and still do your due diligence (in fact I am … just presenting the code examples for each platform at SIGIR).

To encourage further benchmarking, I’ve open sourced all the code here:

http://github.com/zooie/opensearch/tree/master

Happy to post any new and interesting results.

TweetNews (Real-Time Search) Is Back

June 25, 2009 by Vik

Here it is  tweetnews.me

And an example query  steve ballmer

About six months ago I released a simple 100 line search application called TweetNews, which basically links tweets to the freshest Yahoo! News articles. The more related tweets an article has, the higher its rank. The tweet count and messages are presented underneath each result so that a user can read the social commentary inline with the article listing. It was developed more to demonstrate the openness and power of Yahoo! BOSS (you can read more about it in my previous posts here and here). Remarkably, many users found the service useful despite its slow performance, barebones UI, lack of homepage, domain, (you name it), etc.

Interestingly, the TweetNews concept has been popping up in my recent discussions around real-time search, so I felt it was about time to polish up TweetNews to serve as a better proof of concept.

Here are some of the new features:

  • Sweet UI (kudos to Kara McCain & Aaron Wheeler for the awesome design and template)
  • Continually Updated, Fresh Homepage (aggregates & ranks feeds like Techmeme, Delicious, Digg)
  • Faster Performance
  • Improved Algorithm
  • Local Views (re-rank & link tweets from a select region)

.

Here’s a screenshot of the homepage:

TweetNews Homepage

.

And here’s an example of Local Views:

London’s View of ‘iphone’

TweetNews IPhone (London Ranking)

Los Angeles’ View of ‘iphone’

TweetNews IPhone (Los Angeles Ranking)

Striking difference between Americans (actually just SoCal) and the British right there :)

I think the Local Views concept is pretty promising, although there’s plenty of room for improvement (use BOSS region filters, access Twitter’s Firehose Feed for more granularity, etc.).

Which is why, like I did with the last version, plan to open source all the code powering this application (just need a little more time to get it reviewed).

Interestingly, the homepage system in this package is very general. Just pass it any list of RSS feeds and it’ll do the clustering, tweet linking, ranking, and page generation automatically every X minutes for you. Anyone want a fresh, personalized Techmeme? Let me know if that sounds interesting.

Please keep in mind that this is still a simple, early prototype to show how one can use BOSS to experiment with very interesting data sources like Twitter to tackle big problems like real-time search.

Twitter + BOSS = Real Time Search

January 15, 2009 by Vik

This application has been updated. Go here to learn more.

Try itsteve jobs

Update: (6/11) In case you’re bored, here’s a link to a discussion we had with Google and Twitter where we discussed Open & Real-time Search: http://jellytalks.yahoo.com/

Update: (1/19) If you have issues try again in 5-10 minutes. You can also check out the screenshots below. (1/15) App Engine limits were reached (and fast). Appreciate the love and my apologies for not fully anticipating that. Google was nice enough though to temporarily raise the quota for this application. Anyways, this was more to show a cool BOSS developer example using code libraries I released earlier, but there might be more here. Stay tuned.

Here’s a screenshot as well (which should hopefully be stale by the time you read this).

Basically this service boosts Yahoo’s freshest news search results (which typically don’t have much relevance since they are ordered by timestamp and that’s it) based on how similar they are to the emerging topics found on Twitter for the same query (hence using Twitter to determine authority for content that don’t yet have links because they are so fresh). It also overlays related tweets via an AJAX expando button (big thanks to Greg Walloch at Yahoo! for the design) under results if they exist. A nice added feature to the overlay functionality is near-duplicate removal to ensure message threads on any given result provide as much comment diversity as possible.

Freshness (especially in the context of search) is a challenging problem. Traditional PageRank style algorithms don’t really work here as it takes time for a fresh URL to garner enough links to beat an older high ranking URL. One approach is to use cluster sizes as a feature for measuring the popularity of a story (i.e. Google News). Although quite effective IMO this may not be fast enough all the time. For the cluster size to grow requires other sources to write about the same story. Traditional media can be slow however, especially on local topics. I remember when I saw breaking Twitter messages describing the California Wildfires. When I searched Google/Yahoo/Microsoft right at that moment I barely got anything (< 5 results spanning 3 search results pages). I had a similar episode when I searched on the Mumbai attacks. Specifically, the Twitter messages were providing incredible focus on the important subtopics that had yet to become popular in the traditional media and news search worlds. What I found most interesting in both of these cases was that news articles did exist on these topics, but just weren’t valued highly enough yet or not focusing on the right stories (as the majority of tweets were). So why not just do that? Order these fresh news articles (which mostly provide authority and in-depth coverage) based on the number of related fresh tweets as well as show the tweets under each. That’s this service.

To illustrate the need, here’s a quick before and after shot. I searched for ‘nba’ using Yahoo’s news search ordered by latest results (first image). Very fresh (within a minute) but subpar quality. The first result talks about teams that are in a different league of basketball than the NBA. However, search for ‘nba’ on TweetNews (second image) and you get the Kings/Warriors triple OT game highlight which was buzzing more in Twitter at that minute.

'NBA' on Y! News latest

'NBA' on Y! News latest

'NBA' on Y! News latest enhanced by Twitter

'NBA' on TweetNews

There’s something very interesting here … Twitter as a ranking signal for search freshness may prove to be very useful if constructed properly. Definitely deserves more exploration – hence this service, which took < 100 lines of code to represent all the search logic thanks to Yahoo! BOSS, Twitter’s API, and the BOSS Mashup Framework.

To sum up, the contributions of this service are: (1) Real-time search + freshness (2) Stitching social commentary to authoritative sources of information (3) Another (hopefully cool) BOSS example.

The code is packaged for general open consumption and has been ported to run on App Engine (which powers this service actually). You can download all the source here.

Yahoo Boss – Google App Engine Integrated

August 4, 2008 by Vik

Updated: I see blogs doing evaluations of the Q&A engine. I have to admit, that wasn’t my focus here. The service is merely 50 lines of code … just to demonstrate the integration of BMF and GAE.

Updated: Direct link to the example Question-Answering Service

Today I finally plugged-in the Yahoo Boss Mashup Framework into the Google App Engine environment. Google App Engine (GAE) provides a pretty sweet yet simple platform for executing Python applications on Google’s infrastructure. The Boss Mashup Framework (BMF) provides Python API’s for accessing Yahoo’s Search API’s as well remixing data a la SQL constructs. Running BMF on top of GAE is a seemingly natural progression, and quite arguably the easiest way to deploy Boss – so I spent today porting BMF to the GAE platform.

Here’s the full BMF-GAE integrated project source download.

There’s a README file included. Just unzip, put your appid’s in the config files, and you’re done. No setup or dependencies (easier than installing BMF standalone!). It’s a complete GAE project directory which includes a directory called yos which holds all the ported BMF code. Also made a number of improvements to the BMF code (SQL ‘where’ support, stopwords, yql.db refactoring, util & templates in yos namespace, yos.crawl.rest refactored & optimized, etc.).

The next natural thing to do is to develop a test application on top of this united framework. In the original BMF package, there’s an examples directory. In particular, ex6.py was able to answer some ‘when’ style questions. I simply wrapped that code as a function and referenced it as a GAE handler in main.py.

Here’s the ‘when’ q&a source code as a webpage (less than 25 lines).

The algorithm is quite easy – use the question as the search query and fetch 50 results via the Boss API. Count the dates that occur in the results’ abstracts, and simply return the most popular one.

For fun, following a similar pattern to the ‘when’ code, I developed another handler to answer ‘who’ or ‘what’ or ‘where’ style questions (finding the most popular capitalized phrase).

Here’s the complete example (just ~50 lines of code – bundled in project download):

Q&A Running Service Example

Keep in mind that this is just a quick proof of concept to hopefully showcase the power of BMF and the idea of Open Web Search.

If you’re interested in learning more about this Q&A system (or how to improve it), check out AskMSR – the original inspiration behind this example.

Also, shoutout to Sam for his very popular Yuil example, which is powered by BMF + GAE. The project download linked above is aimed to make it hopefully easier for people to build these types of web services.

Yahoo! Boss – An Insider View

July 10, 2008 by Vik

Disclaimer: This is my personal blog. The views expressed on these pages are mine alone and not those of my employer.

Boss stands for Build your Own Search Service. The goal of Boss is to open up search to enable third parties to build incredibly useful and powerful search-based applications. Several months ago I pitched this idea to the executives on how Yahoo! can specifically open up its search assets to fragment the market. It’s remarkable to finally see some of the vision (with the help of many talented people) reach the public today.

Web search is a tough business to get into. $300+ Million capex, amazing talent, infrastructure, a prayer, etc. just to get close to basic parity. Only 3 companies have really pulled it off. However, I strongly believe we need to find innovative, incremental ways to spread the search love in order to encourage fragmentation and help promising companies get to basic parity instantly so that they can leverage their unique assets (new algorithm, user data, talent) to push their search solution beyond the current baseline.

Search is all about understanding the user’s intent. If we can nail the intent, then search is pretty much a solved problem. However, the current model of a single search box for everything loses an intent focus as it aims to cater to all people and queries. Albeit, a single search box definitely makes our lives easier, but I have a hard time believing this is the *right* approach.

In my online experience, I typically visit a variety of sites: Techmeme, Digg, Techcrunch, eBay, Amazon, del.icio.us, etc. While on these pages, something almost always catches my eye, and so I proceed to the search box in my browser to find out more on the web. Why do we have this disconnected experience? I think it’s because these sites do not provide web-level comprehensiveness. It’s unfortunate, because the page that I’m on may have additional information about my intent (maybe I’m logged in so it has my user info, or it’s a techy shopping site).

The biggest goal of Boss is to help bootstrap sites like these to get comprehensiveness and basic ranking for free, as well as offer tools to re-rank, blend, and overlay the results in a way that revolutionizes the search experience.

When I’m on del.icio.us, why can’t I search in their box, get relevant del.icio.us results at the top, and also have web results backfill below? I think users should be confident that if they searched in a search box on any page in the whole wide web that they’ll get results that are just as good as Yahoo/Google and only better.

The first milestone of Boss is a simple one: Make available a clean search API that turns off the traditional restrictions so that developers can totally control presentation, re-rank results, run an unlimited number of queries, and blend in external content all without having to include any Yahoo! attribution in the resulting product(s). Want to build the example above or put news search results on a map – go for it!

Here’s a link to the API:

http://developer.yahoo.com/search/boss/

Also, check out the Boss Mashup Framework:

http://developer.yahoo.com/search/boss/mashup.html

The Boss Mashup Framework in my opinion makes the Boss Search API really useful. It lets developers use SQL like syntax for operating on heterogeneous web data sources. The idea came up as I was working on examples to showcase Boss, and realized the operations I was developing imperatively followed closely to declarative SQL like constructs. Since it’s a recent idea and implementation, there may be some bugs or weird designs lurking in there, but I strongly recommend playing around with it and viewing the examples included in the package. I’m biased of course but do think it’s a fun framework for remixing online data. One can rank web results by digg and youtube favorite counts, remove duplicates, and publish the results using a provided search results page template in less than 30 lines of code and without having to specify any parsing logic of the data sources/API’s as the framework can infer the structure and unify the data formats automatically in most cases.

The next couple of milestones for Boss I think are even more interesting and disruptive – server side services, monetization, blending ranking models, more features exposure, query classifiers, open source … so stay tuned.

Is the Facebook Application Platform Fair?

June 30, 2008 by Vik

Take a look at this stats deck from O’Reilly’s Graphing Social Patterns conference:

http://en.oreilly.com/gspeast2008/public/asset/attachment/2950

Fairly in-depth and recent [6/01/2008] analysis of the application usage in Facebook and MySpace.

As expected, lots of power law behavior.

I found the slides describing churn to be pretty interesting. Since October 2007, nine of the top fifteen most popular applications are new. However, only three of those new applications debuted after March 2008. I expect the amount of churn in the top spots to continue to drop based on the recent declining active usage trends and Facebook’s efforts to curb application spam (new UI that puts applications in a separate profile tab, app module minimizing, viral friend messaging limits, security compliance, etc).

What I would find even more interesting is a study of the number of applications users install, and how those moving averages have changed over time. Like say the number of applications a typical user installs is 4. Once the user reaches that threshold, what’s the churn like then? Specifically, what are the chances that a user will add a new app? Or maybe an even better metric: how long does it take, and how does this length of time compare to when the user had 1 app and increased to 2, or 2 apps and increased to 3, etc.? Basically, what’s the adoption rate/times based on current application counts?

I believe it becomes harder to influence a user to add or replace for a new app if the number of current apps the user has is high. I think most users, without even knowing it, have a threshold of how many total apps they are willing display on their profile – and that this threshold is based on an ongoing evaluation of the utility and efficiency of the page. Each app takes up real estate on the profile page, and a “rational” user will only show so many until page load times degrade and/or core modules (wall, general information, albums, networks) get drowned in clutter and thus become difficult for users to locate. Of course, social networks like MySpace which have very minimal profile page design constraints prove that most users are irrational ;) – but it’s this design control that greatly helped Facebook dominate the market IMO.

If this is true, then it means that first movers really, really win in the Facebook apps world. Companies like Slide and Rockyou manage many of the top applications, and given the power law market share phenomena, they control a majority stake of application usage and installs. Many of these companies had the early bird advantage, and once winners, always winners – acquisitions of emerging applications, leveraging branding and existing audiences (a.k.a monopoly) to cross promote potentially copy-cat applications faster and wider than the competition, etc. Monopolies inside Facebook have unsettling ramifications, as they block newcomers from capturing profile space. If they fail to innovate (as most monopolies) then next-gen application development may never get through.

Now, if users do have an application count threshold, and it becomes successively more difficult to replace/add a new app as this count increases, then any apps developed now have a substantially rarer chance of gaining market share. If winner’s win, first movers reap, and churn becomes improbable over time, then the early top apps have already most likely filled up users’ allocated app slots.

I find thinking of the profile page as a resource allocation problem rather fascinating. Essentially, there are finite resources on a page and we expect rational users to perform some optimization to allocate resources to maximize utility for themselves and for others (potential game theory link). Once users fill up these resources, human laziness kicks in. Another warrant for improbable churn is that users who want to add new applications after filling up their resource limit will need to remove an existing app to make space. The standards for change are higher now, as the user must compare the new app to an existing preferred app (which probably is a popular early-bird app that friends use), and so the decision will incur a trade-off.

One could also argue that with more apps available now (second slide shows that despite sluggish usage the # of app’s being developed is still growing insanely) users are burdened with more choices. Or, one could argue because most users have reached their app limit, and thus, churn has become improbable, the discoverability of new apps among friends (a critical channel for adoption) also becomes improbable.

Under this theory, especially in context of Facebook’s current efforts and app stats, the growth of new app adoption in social networks will continue to slow down.

So what can be done here?

The platform needs to encourage more churn by building a fairer market that matches users to high quality apps that satisfy their expressed intents. At the end of the day, these applications are really just web pages, but unlike the web, they do not leverage important primitives like linking and meta tags. Search engines like Google and Yahoo use these features extensively to calculate authority and relevance. In the long run, as the number of sources increases, advanced ranking algorithms and marketplaces are necessary to scale and ensure fairness to worthy tail publishers. Maybe social networks should inherit these system properties to bolster their tail applications.

Also, Facebook needs to encourage users to variate or add more applications to their profile page. Facebook’s move to put applications in its own profile tab may very well achieve this goal, but at a consequence of lowering their visibility.

Anyways, just some random thoughts about the current state of Facebook apps. It’ll be very interesting to see how their platform progresses and how it will be perceived by end users and developers in the future.

Techmeme Leaderboard 2007 – More!

October 14, 2007 by Vik

I’m an avid reader of Techmeme. Love the idea, UI, freshness, coverage, and most of all the quality of the articles.

When the Techmeme Leaderboard debuted earlier this month, lots of buzz circulated the blogosphere. Me, being a huge fan of partying on data, loved the concept, and wanted to take the analysis even further (Yuvi style, but with a search twist).

So yesterday I wrote up some code to crawl and analyze Techmeme articles over the whole year (Leaderboard shows the Top 50 sources for this month). I took a snapshot of Techmeme at 1:00PM every day between beginning January – end of September of 2007.

I computed basic statistics, like number of stories by author and source, as well as more involved measurements like the top word mentions of the year – in total and by category (used simple NLP to clean up the text and remove stopwords).

So, without further ado, here are the results:

Number of Stories by Author in 2007, Ranked

Number of Stories by Source in 2007, Ranked

Most Mentioned Words in 2007, Ranked

* words are stemmed

Most Mentioned Words, by Category, Trends in 2007, Ranked

Hope you guys find these results super interesting and useful.

Surviving a Lunch Interview

July 1, 2007 by Vik

I always found lunch interviews to be the most frustrating experiences ever. There you are, given an opportunity to pig out in a grand cafeteria on corporate expense – so naturally, you stock up the tray to get your chow down. You sit down at the table across from your interviewer, and right as you’re about to take that first scrumptious bite, your interviewer asks you a question. You of course answer it completely, but before returning to the meal you’re asked a follow-up question, and then another one, and before you know it rapid fire Q/A begins. You do your best to answer each one … as your food gets cold … as your stomach growls … and as you watch the interviewer nodding to your comments with his/her mouth filled with that savory steak and potatoes you’re dying to devour. Why can’t the interviewer just go to the bathroom or receive a cell call already?!

This isn’t the interviewer’s fault by any means. After all, it is an interview, and their role compels constant question asking (silence is awkward). Additionally, this whole food tease leading to short-term starvation isn’t the worst consequence. You can get food stuck in your teeth, pass gas, get bad breath, spill your food and drink all over your interviewer, etc. It’s probably the most dangerous, error-prone part of the interview process (actually probably not … since you typically don’t get asked technically involved questions over food).

So here’s some advice to those who find themselves in similar situations. Sadly, it took me nearly three years of lunch interviews to discover these pointers:

  1. Eat a big breakfast. Lunch should be a snack.
  2. When you do eat lunch, order the soup with bread. Warms your body and soothes your throat. Simple to eat. Nothing gets stuck to your teeth. No need to wash the hands, so hands don’t get dirty for that final handshake. It’s not greasy (like pizza) so doesn’t reflect bad diet habits to your interviewer. Also, the bread soaks in the soup to make the meal filling plus give you additional energy for the rest of the day.
  3. Eat slowly, since your interviewer probably got more food than you. You don’t want to finish earlier than him/her. It tends to rush the other person. Your goal is to make the lunch round long and fun. Keep the conversation going but don’t over do it to the point where the interviewer starts to daze off. Ask questions when the interviewer runs out of questions (also gives you more time to eat!). Make the most of lunch to learn as much as you can about the group. Their insight will be super useful in the upcoming rounds. Just think of lunch as a break before the more technical rounds.
  4. Drink water. It really is the best drink ever. No chance of an upset stomach during or after the round. If you’re starving and know the soup + bread won’t fill you up (eating slowly helps fill you up though), get an Odwalla. It’s seriously a second meal.
  5. Don’t take notes. That’s too much IMHO. Keep it informal, unless the interviewer specifies otherwise.

That’s all I got. Nothing crazy.

Anyways, hope these pointers come in handy.

How Google is putting us back into the Stone Age

April 18, 2007 by Vik

Yeah, I know – what a linkbait title. If that’s what it takes these days to get visitors and diggs then so be it. Also, just to forewarn, as you read this you might find that a better title choice for this post would have been “How Web 2.0 is putting us back into the Stone Age” since many of these thoughts generalize to Web 2.0 companies as a whole. I used Google in the title mainly because they are the big daddy in the web world, the model many web 2.0 companies strive to be like, the one to beat. Plus, the title just looks and sounds cooler with ‘Google’ in it.

Here’s the main problem I have with web applications coming from companies like Google: About 2 years ago I bought a pretty good box – which is now fairly standard and cheap these days – 2 gigs of ram, dual core AMD-64 3400+’s, 250 gigs hd, nVidia 6600 GT PCI Express, etc. It’s a beast. However, because I don’t play games, its potential isn’t being utilized – not even close. Most of the applications I use are web-based, mainly because the web provides a medium which is cross platform (all machines have a web browser), synchronized (since the data is stored server side I can access it from anywhere like the library, friend’s computer, my laptop) and it keeps my machine pretty light (no need to install anything and waste disk and risk security issues). The web UI experience for the most part isn’t too bad either – in fact, I find that the browser’s restrictions force many UI’s to be far simpler and easier to use. To me, the benefits mentioned above clearly compensate for any UI deficiencies. Unfortunately, this doesn’t mean that Web 2.0 is innovating the user’s experience. Visualizing data – search results, semantic networks, social networks, excel data sheets – is still very primitive, and a lot can be done to improve this experience by taking advantage of the user’s hardware.

My machine, and most likely yours, is very powerful and underutilized. For instance, my graphics card has tons of cores. We live in an age where GPU’s like mine can sort terabytes of data faster than the top-of-the-line Xeon based workstation (refer to Jim Gray’s GPUTerasort paper). For sorting, which is typically the bottleneck in database query plans and MapReduce jobs, it’s all about I/O – or in this case, how fast you can swap memory (for example, a 2-pass bitonic radix sort iteratively swaps the lows and the highs). Say you call memcpy in your C program on a $6,000 Xeon machine. The memory bandwidth is about 4 GB/s. Do the equivalent on a $200 graphics co-processor and you get about 50 GB/s. Holy smokes! I know I’m getting off-topic here, but why is it so much faster on a GPU? Well, in CPU world, memory access can be quite slow. You have almost these random jumps in memory, which can result in expensive TLB/cache misses, page faults, etc. You also have context switching for multi-processing. Lots of overhead going on here. Now compare this with a GPU, which has the memory almost stream directly to tons of cores. The cores on a GPU are fairly cheap, dumb processing units in comparison to the cores found in a CPU. But the GPU uses hundreds of these cores, in parallel, to drastically speed-up the overall processing. This, coupled with its specialized memory architecture, results in amazing performance bandwidth. Also, interestingly, since these cores are cheap (bad), there’s a lot of room for improvement. At the current rate, GPU advancements are occurring 3-4x faster than Moore’s law for CPU’s. Additionally, the graphical experience is near real-life quality. Current API’s enable developers to draw 3D triangles directly off the video card! This is some amazing hardware folks. GPU’s, and generally this whole notion of co-processing to optimize for operations that lag on CPU’s (memory bandwidth, I/O) promise to make future computers even faster than ever.

OK, so the basic story here is our computers are really powerful machines. The web world doesn’t take advantage of this, and considering how much time we spend there, it’s an unfortunate waste of computing potential. Because of this, I feel we are losing an appreciation for our computer’s capabilities. For example, when my friend first started using Gmail, he was non-stop clicking on the ‘Invite a friend’ drop-down. He couldn’t believe how the page could change without a browser refresh. Although this is quite an extreme example, I’ve seen this same phenomena for many users on other websites. IMHO, this is completely pathetic, especially when considering how powerful client-end applications can be in comparison.

Again, I’m not against web-based applications. I love Gmail, Google Maps, Reader, etc. However, there are applications which I do not think should be web-based. An example of this is YouOS, which is an OS accessible through the web-browser. I mean, there’s some potential here, but the way it’s currently implemented is very limiting and unnecessary.

To me, people are developing web-services with the mindset ‘can it hurt?’, when I think a better mantra is ‘will it advance computing and communication?’. Here’s the big web 2.0 problem: Just because you can make something web 2.0′ish, doesn’t mean you should. I think of this along the lines of Turing Complete, which is a notion in computer science for determining whether a system can express any computation. Basically, as long as you can process an input, store state, and return an output (i.e. a potentially stateful function), you can do any computation. Now web pages provide an input form, perform calculations server side, and can generate outputting pages – enough to do anything according to this paradigm, but with extreme limitations on visualization and performance (like with games). AJAX makes web views richer, but it is not only a terribly hacked up programming model, but for some reason compels developers to convert previously successful client-end-based applications into web-based services. Sometimes this makes sense from an end-user perspective, but consequently results in dumbing down the user experience.

We have amazing hardware that’s not being leveraged in web-based services. Browsers provide an emulation for a real application. However, given the proliferation of AJAX web 2.0 services, we’re starting to see applications only appear in the browser and not on the client. I think this current architecture view is unfortunate, because what I see in a browser is typically static content – something I could capture the essence of with a camera shot. In some sense, Web 2.0 is a surreal hack on what the real online experience should be.

I feel we really deserve truly rich applications that deliver ‘Minority Report’ style interfaces that utilize the client’s hardware. Movies predating the 1970’s predicted so much more for our current state’s user experience level. It’s up to us, the end-consumer, to encourage innovation in this space. It’s up to us, the developer, to build killer-applications that require tapping into a computer’s powerful hardware.The more we hype up web 2.0 and dumb-downed webpage experiences, the more website-based services we get – and consequently, less innovation in hardware driven UI’s.

But there’s hope. I think there exists a fair compromise between client-end applications and server-side web services. Internet is getting faster, the browser + Flash are getting fine tuned to make better use of a computer’s resources. Soon, the internet will be well-suited for thin-client computing. A great example of this already exists today, and I’m sure many of you have used it: Google Earth. It’s a client-end application – taking advantage of the computer’s graphics and processing power to make the user feel like he/she is traveling in and out of space – while being a server-side service since it gathers updated geographical data from the web. The only problem is there’s no cross-platform, preexisting layer to build applications like this. How do we make these services without forcing the user to do an interventionist, slow installation? How do we make it run over different platforms? Personally, I think Microsoft completely missed the boat here with .NET. If MS could have recognized the web phenomena early on, they could have build this layer into Vista to encourage users to develop these rich thin-client applications, while also promoting Vista. I have no reason to change my OS – this could have been my reason! Even if it was cross platform, if they had better performance it’s still a reason to prefer (providing some business case). Instead, they treated .NET as a Java-based replacement for MFC, thereby forcing developers to resort to building their cross-platform, no-installation-required services through AJAX and Flash.

Now, even if this layer existed, which would enable developers to build and instantly deploy Google Earth style applications in a cross-platform manner, there would be security concerns. I mean, one could make the case that ActiveX attempted to do this – allowing developers to run arbitrary code on the client’s machines. Unfortunately, this led to numerous viruses. Security violations and spyware scare(d) all of us – so much so that we now do traditionally client-end functions through a dumb-downed web browser interface. But, I think we made some serious inroads in security since then. The fact that we even recognize security in current development makes us readily prepared to support such a platform. I am confident that the potential security issues can be tackled.

To make a final point, I think we all really need higher expectations in the user experience front. We need to develop killer applications that push the limitations of our hardware – to promote innovation and progress. We’re currently at a standstill in my opinion. This isn’t how the internet should be. This is not how I envisioned the future to be like 5 years ago. We can do better. We can build richer applications. But to do this, we as consumers must demand it in order for companies to have a business case to further pursue it. We need developers to come up with innovative ways of visualizing the large amounts of data being generated with the use of hardware – thereby delivering long-awaited killer-applications for our idly computers. Let’s take our futuristic dreams and finally translate them into our present reality.

Google Co-op just got del.icio.us!

January 3, 2007 by Vik

Update: Sorry, link is going up and down. Worth trying, but will try to find a more stable option when time cycles free up.

This past week I decided to cook up a service (link in bold near the middle of this post) I feel will greatly assist users in developing advanced Google Custom Search Engines (CSE’s). I read through the Co-op discussion posts, digg/blog comments, reviews, emails, etc. and learned many of our users are fascinated by the refinements feature – in particular, building search engines that produce results like this:

‘linear regression” on my Machine Learning Search Engine

… but unfortunately, many do not know how to do this nor understand/want to hack up the XML. Additionally, I think it’s fair to say many users interested in building advanced CSE’s have already done similar site tagging/bookmarking through services like del.icio.us. del.icio.us really is great. Here are a couple of reasons why people should (and do) use del.icio.us:

  • It’s simple and clean
  • You can multi-tag a site quickly (comma separated field; don’t have to keep reopening the bookmarklet like with Google’s)
  • You can create new tags on the fly (don’t choose the labels from a fixed drop-down like with Google’s)
  • The bookmarklet provides auto-complete tag suggestions; shows you the popular tags others have used for that current site
  • Can have bundles (two level tag hierarchies)
  • Can see who else has bookmarked the site (can also view their comments); builds a user community
  • Generates a public page serving all your bookmarks

Understandably, we received several requests to support del.icio.us bookmark importing. My part-time role with Google just ended last Friday, so, as a non-Googler, I decided to build this project. Initially, I was planning to write a simple service to convert del.icio.us bookmarks into CSE annotations – and that’s it – but realized, as I learned more about del.icio.us, that there were several additional features I could develop that would make our users’ lives even easier. Instead of just generating the annotations, I decided to also generate the CSE contexts as well.

Ok, enough talk, here’s the final product:
http://basundi.com:8000/login.html

If you don’t have a del.icio.us account, and just want to see how it works, then shoot me an email (check the bottom of the Bio page) and I’ll send you a dummy account to play with (can’t publicize it or else people might spam it or change the password).

Here’s a quick feature list:

  • Can build a full search engine (like the machine learning one above) in two steps, without having to edit any XML, and in less than two minutes
  • Auto-generates the CSE annotations XML from your del.icio.us bookmarks and tags
  • Provides an option to auto-generate CSE annotations just for del.icio.us bookmarks that have a particular tag
  • Provides an option to Auto-calculate each annotation’s boost score (log normalizes over the max # of Others per bookmark)
  • Provides an option to Auto-expand links (appends a wildcard * to any links that point to a directory)
  • Auto-generates the CSE context XML
  • Auto-generates facet titles
  • Since there’s a four facet by five labels restriction (that’s the max that one can fit in the refinements display on the search results page), I provide two options for automatic facet/refinement generation:
    • The first uses a machine learning algorithm to find the four most frequent disjoint 5-item-sets (based on the # of del.icio.us tag co-occurrences; it then does query-expansion over the tag sets to determine good facet titles)
    • The other option returns the user’s most popular del.ico.us bundles and corresponding tags
    • Any refinements that do not make it in the top 4 facets are dumped in a fifth facet in order of popularity. If you don’t understand this then don’t worry, you don’t need to! The point is all of this is automated for you (just use the default Cluster option). If you want control over which refinements/facets get displayed, then just choose Bundle.
  • Provides help documentation links at key steps
  • And best of all … You don’t need to understand the advanced options of Google CSE/Co-op to build an advanced CSE! This seriously does all the hard, tedious work for you!

In my opinion, there’s no question that this is the easiest way to make a fancy search engine. If I make any future examples I’m using this – I can simply use del.icio.us, sign-in to this service, and voila I have a search engine with facets and multi-label support.


Please note that this tool is not officially endorsed by nor affiliated with Google or Yahoo! It was just something I wanted to work on for fun that I think will benefit many users (including myself). Also, send your feedback/issues/bugs to me or post them on this blog.