Yahoo! Boss – An Insider View

Disclaimer: This is my personal blog. The views expressed on these pages are mine alone and not those of my employer.

Boss stands for Build your Own Search Service. The goal of Boss is to open up search to enable third parties to build incredibly useful and powerful search-based applications. Several months ago I pitched this idea to the executives on how Yahoo! can specifically open up its search assets to fragment the market. It’s remarkable to finally see some of the vision (with the help of many talented people) reach the public today.

Web search is a tough business to get into. $300+ Million capex, amazing talent, infrastructure, a prayer, etc. just to get close to basic parity. Only 3 companies have really pulled it off. However, I strongly believe we need to find innovative, incremental ways to spread the search love in order to encourage fragmentation and help promising companies get to basic parity instantly so that they can leverage their unique assets (new algorithm, user data, talent) to push their search solution beyond the current baseline.

Search is all about understanding the user’s intent. If we can nail the intent, then search is pretty much a solved problem. However, the current model of a single search box for everything loses an intent focus as it aims to cater to all people and queries. Albeit, a single search box definitely makes our lives easier, but I have a hard time believing this is the *right* approach.

In my online experience, I typically visit a variety of sites: Techmeme, Digg, Techcrunch, eBay, Amazon, del.icio.us, etc. While on these pages, something almost always catches my eye, and so I proceed to the search box in my browser to find out more on the web. Why do we have this disconnected experience? I think it’s because these sites do not provide web-level comprehensiveness. It’s unfortunate, because the page that I’m on may have additional information about my intent (maybe I’m logged in so it has my user info, or it’s a techy shopping site).

The biggest goal of Boss is to help bootstrap sites like these to get comprehensiveness and basic ranking for free, as well as offer tools to re-rank, blend, and overlay the results in a way that revolutionizes the search experience.

When I’m on del.icio.us, why can’t I search in their box, get relevant del.icio.us results at the top, and also have web results backfill below? I think users should be confident that if they searched in a search box on any page in the whole wide web that they’ll get results that are just as good as Yahoo/Google and only better.

The first milestone of Boss is a simple one: Make available a clean search API that turns off the traditional restrictions so that developers can totally control presentation, re-rank results, run an unlimited number of queries, and blend in external content all without having to include any Yahoo! attribution in the resulting product(s). Want to build the example above or put news search results on a map – go for it!

Here’s a link to the API:

http://developer.yahoo.com/search/boss/

Also, check out the Boss Mashup Framework:

http://developer.yahoo.com/search/boss/mashup.html

The Boss Mashup Framework in my opinion makes the Boss Search API really useful. It lets developers use SQL like syntax for operating on heterogeneous web data sources. The idea came up as I was working on examples to showcase Boss, and realized the operations I was developing imperatively followed closely to declarative SQL like constructs. Since it’s a recent idea and implementation, there may be some bugs or weird designs lurking in there, but I strongly recommend playing around with it and viewing the examples included in the package. I’m biased of course but do think it’s a fun framework for remixing online data. One can rank web results by digg and youtube favorite counts, remove duplicates, and publish the results using a provided search results page template in less than 30 lines of code and without having to specify any parsing logic of the data sources/API’s as the framework can infer the structure and unify the data formats automatically in most cases.

The next couple of milestones for Boss I think are even more interesting and disruptive – server side services, monetization, blending ranking models, more features exposure, query classifiers, open source … so stay tuned.

46 Comments

Filed under Blog Stuff, Data Mining, Information Retrieval, Non-Technical-Read, Open, Search, Techmeme

Is the Facebook Application Platform Fair?

Take a look at this stats deck from O’Reilly’s Graphing Social Patterns conference:

http://en.oreilly.com/gspeast2008/public/asset/attachment/2950

Fairly in-depth and recent [6/01/2008] analysis of the application usage in Facebook and MySpace.

As expected, lots of power law behavior.

I found the slides describing churn to be pretty interesting. Since October 2007, nine of the top fifteen most popular applications are new. However, only three of those new applications debuted after March 2008. I expect the amount of churn in the top spots to continue to drop based on the recent declining active usage trends and Facebook’s efforts to curb application spam (new UI that puts applications in a separate profile tab, app module minimizing, viral friend messaging limits, security compliance, etc).

What I would find even more interesting is a study of the number of applications users install, and how those moving averages have changed over time. Like say the number of applications a typical user installs is 4. Once the user reaches that threshold, what’s the churn like then? Specifically, what are the chances that a user will add a new app? Or maybe an even better metric: how long does it take, and how does this length of time compare to when the user had 1 app and increased to 2, or 2 apps and increased to 3, etc.? Basically, what’s the adoption rate/times based on current application counts?

I believe it becomes harder to influence a user to add or replace for a new app if the number of current apps the user has is high. I think most users, without even knowing it, have a threshold of how many total apps they are willing display on their profile – and that this threshold is based on an ongoing evaluation of the utility and efficiency of the page. Each app takes up real estate on the profile page, and a “rational” user will only show so many until page load times degrade and/or core modules (wall, general information, albums, networks) get drowned in clutter and thus become difficult for users to locate. Of course, social networks like MySpace which have very minimal profile page design constraints prove that most users are irrational ;) – but it’s this design control that greatly helped Facebook dominate the market IMO.

If this is true, then it means that first movers really, really win in the Facebook apps world. Companies like Slide and Rockyou manage many of the top applications, and given the power law market share phenomena, they control a majority stake of application usage and installs. Many of these companies had the early bird advantage, and once winners, always winners – acquisitions of emerging applications, leveraging branding and existing audiences (a.k.a monopoly) to cross promote potentially copy-cat applications faster and wider than the competition, etc. Monopolies inside Facebook have unsettling ramifications, as they block newcomers from capturing profile space. If they fail to innovate (as most monopolies) then next-gen application development may never get through.

Now, if users do have an application count threshold, and it becomes successively more difficult to replace/add a new app as this count increases, then any apps developed now have a substantially rarer chance of gaining market share. If winner’s win, first movers reap, and churn becomes improbable over time, then the early top apps have already most likely filled up users’ allocated app slots.

I find thinking of the profile page as a resource allocation problem rather fascinating. Essentially, there are finite resources on a page and we expect rational users to perform some optimization to allocate resources to maximize utility for themselves and for others (potential game theory link). Once users fill up these resources, human laziness kicks in. Another warrant for improbable churn is that users who want to add new applications after filling up their resource limit will need to remove an existing app to make space. The standards for change are higher now, as the user must compare the new app to an existing preferred app (which probably is a popular early-bird app that friends use), and so the decision will incur a trade-off.

One could also argue that with more apps available now (second slide shows that despite sluggish usage the # of app’s being developed is still growing insanely) users are burdened with more choices. Or, one could argue because most users have reached their app limit, and thus, churn has become improbable, the discoverability of new apps among friends (a critical channel for adoption) also becomes improbable.

Under this theory, especially in context of Facebook’s current efforts and app stats, the growth of new app adoption in social networks will continue to slow down.

So what can be done here?

The platform needs to encourage more churn by building a fairer market that matches users to high quality apps that satisfy their expressed intents. At the end of the day, these applications are really just web pages, but unlike the web, they do not leverage important primitives like linking and meta tags. Search engines like Google and Yahoo use these features extensively to calculate authority and relevance. In the long run, as the number of sources increases, advanced ranking algorithms and marketplaces are necessary to scale and ensure fairness to worthy tail publishers. Maybe social networks should inherit these system properties to bolster their tail applications.

Also, Facebook needs to encourage users to variate or add more applications to their profile page. Facebook’s move to put applications in its own profile tab may very well achieve this goal, but at a consequence of lowering their visibility.

Anyways, just some random thoughts about the current state of Facebook apps. It’ll be very interesting to see how their platform progresses and how it will be perceived by end users and developers in the future.

3 Comments

Filed under Economics, Facebook, Non-Technical-Read, Social, Statistics, Trends

Techmeme Leaderboard 2007 – More!

I’m an avid reader of Techmeme. Love the idea, UI, freshness, coverage, and most of all the quality of the articles.

When the Techmeme Leaderboard debuted earlier this month, lots of buzz circulated the blogosphere. Me, being a huge fan of partying on data, loved the concept, and wanted to take the analysis even further (Yuvi style, but with a search twist).

So yesterday I wrote up some code to crawl and analyze Techmeme articles over the whole year (Leaderboard shows the Top 50 sources for this month). I took a snapshot of Techmeme at 1:00PM every day between beginning January – end of September of 2007.

I computed basic statistics, like number of stories by author and source, as well as more involved measurements like the top word mentions of the year – in total and by category (used simple NLP to clean up the text and remove stopwords).

So, without further ado, here are the results:

Number of Stories by Author in 2007, Ranked
Number of Stories by Source in 2007, Ranked
Most Mentioned Words in 2007, Ranked
* words are stemmed
Most Mentioned Words, by Category, Trends in 2007, Ranked

Hope you guys find these results super interesting and useful.

1 Comment

Filed under Blog Stuff, Data Mining, Information Retrieval, NLP, Statistics, Techmeme, Trends

Surviving a Lunch Interview

I always found lunch interviews to be the most frustrating experiences ever. There you are, given an opportunity to pig out in a grand cafeteria on corporate expense – so naturally, you stock up the tray to get your chow down. You sit down at the table across from your interviewer, and right as you’re about to take that first scrumptious bite, your interviewer asks you a question. You of course answer it completely, but before returning to the meal you’re asked a follow-up question, and then another one, and before you know it rapid fire Q/A begins. You do your best to answer each one … as your food gets cold … as your stomach growls … and as you watch the interviewer nodding to your comments with his/her mouth filled with that savory steak and potatoes you’re dying to devour. Why can’t the interviewer just go to the bathroom or receive a cell call already?!

This isn’t the interviewer’s fault by any means. After all, it is an interview, and their role compels constant question asking (silence is awkward). Additionally, this whole food tease leading to short-term starvation isn’t the worst consequence. You can get food stuck in your teeth, pass gas, get bad breath, spill your food and drink all over your interviewer, etc. It’s probably the most dangerous, error-prone part of the interview process (actually probably not … since you typically don’t get asked technically involved questions over food).

So here’s some advice to those who find themselves in similar situations. Sadly, it took me nearly three years of lunch interviews to discover these pointers:

  1. Eat a big breakfast. Lunch should be a snack.
  2. When you do eat lunch, order the soup with bread. Warms your body and soothes your throat. Simple to eat. Nothing gets stuck to your teeth. No need to wash the hands, so hands don’t get dirty for that final handshake. It’s not greasy (like pizza) so doesn’t reflect bad diet habits to your interviewer. Also, the bread soaks in the soup to make the meal filling plus give you additional energy for the rest of the day.
  3. Eat slowly, since your interviewer probably got more food than you. You don’t want to finish earlier than him/her. It tends to rush the other person. Your goal is to make the lunch round long and fun. Keep the conversation going but don’t over do it to the point where the interviewer starts to daze off. Ask questions when the interviewer runs out of questions (also gives you more time to eat!). Make the most of lunch to learn as much as you can about the group. Their insight will be super useful in the upcoming rounds. Just think of lunch as a break before the more technical rounds.
  4. Drink water. It really is the best drink ever. No chance of an upset stomach during or after the round. If you’re starving and know the soup + bread won’t fill you up (eating slowly helps fill you up though), get an Odwalla. It’s seriously a second meal.
  5. Don’t take notes. That’s too much IMHO. Keep it informal, unless the interviewer specifies otherwise.

That’s all I got. Nothing crazy.

Anyways, hope these pointers come in handy.

3 Comments

Filed under Job Stuff, Non-Technical-Read

How Google is putting us back into the Stone Age

Yeah, I know – what a linkbait title. If that’s what it takes these days to get visitors and diggs then so be it. Also, just to forewarn, as you read this you might find that a better title choice for this post would have been “How Web 2.0 is putting us back into the Stone Age” since many of these thoughts generalize to Web 2.0 companies as a whole. I used Google in the title mainly because they are the big daddy in the web world, the model many web 2.0 companies strive to be like, the one to beat. Plus, the title just looks and sounds cooler with ‘Google’ in it.

Here’s the main problem I have with web applications coming from companies like Google: About 2 years ago I bought a pretty good box – which is now fairly standard and cheap these days – 2 gigs of ram, dual core AMD-64 3400+’s, 250 gigs hd, nVidia 6600 GT PCI Express, etc. It’s a beast. However, because I don’t play games, its potential isn’t being utilized – not even close. Most of the applications I use are web-based, mainly because the web provides a medium which is cross platform (all machines have a web browser), synchronized (since the data is stored server side I can access it from anywhere like the library, friend’s computer, my laptop) and it keeps my machine pretty light (no need to install anything and waste disk and risk security issues). The web UI experience for the most part isn’t too bad either – in fact, I find that the browser’s restrictions force many UI’s to be far simpler and easier to use. To me, the benefits mentioned above clearly compensate for any UI deficiencies. Unfortunately, this doesn’t mean that Web 2.0 is innovating the user’s experience. Visualizing data – search results, semantic networks, social networks, excel data sheets – is still very primitive, and a lot can be done to improve this experience by taking advantage of the user’s hardware.

My machine, and most likely yours, is very powerful and underutilized. For instance, my graphics card has tons of cores. We live in an age where GPU’s like mine can sort terabytes of data faster than the top-of-the-line Xeon based workstation (refer to Jim Gray’s GPUTerasort paper). For sorting, which is typically the bottleneck in database query plans and MapReduce jobs, it’s all about I/O – or in this case, how fast you can swap memory (for example, a 2-pass bitonic radix sort iteratively swaps the lows and the highs). Say you call memcpy in your C program on a $6,000 Xeon machine. The memory bandwidth is about 4 GB/s. Do the equivalent on a $200 graphics co-processor and you get about 50 GB/s. Holy smokes! I know I’m getting off-topic here, but why is it so much faster on a GPU? Well, in CPU world, memory access can be quite slow. You have almost these random jumps in memory, which can result in expensive TLB/cache misses, page faults, etc. You also have context switching for multi-processing. Lots of overhead going on here. Now compare this with a GPU, which has the memory almost stream directly to tons of cores. The cores on a GPU are fairly cheap, dumb processing units in comparison to the cores found in a CPU. But the GPU uses hundreds of these cores, in parallel, to drastically speed-up the overall processing. This, coupled with its specialized memory architecture, results in amazing performance bandwidth. Also, interestingly, since these cores are cheap (bad), there’s a lot of room for improvement. At the current rate, GPU advancements are occurring 3-4x faster than Moore’s law for CPU’s. Additionally, the graphical experience is near real-life quality. Current API’s enable developers to draw 3D triangles directly off the video card! This is some amazing hardware folks. GPU’s, and generally this whole notion of co-processing to optimize for operations that lag on CPU’s (memory bandwidth, I/O) promise to make future computers even faster than ever.

OK, so the basic story here is our computers are really powerful machines. The web world doesn’t take advantage of this, and considering how much time we spend there, it’s an unfortunate waste of computing potential. Because of this, I feel we are losing an appreciation for our computer’s capabilities. For example, when my friend first started using Gmail, he was non-stop clicking on the ‘Invite a friend’ drop-down. He couldn’t believe how the page could change without a browser refresh. Although this is quite an extreme example, I’ve seen this same phenomena for many users on other websites. IMHO, this is completely pathetic, especially when considering how powerful client-end applications can be in comparison.

Again, I’m not against web-based applications. I love Gmail, Google Maps, Reader, etc. However, there are applications which I do not think should be web-based. An example of this is YouOS, which is an OS accessible through the web-browser. I mean, there’s some potential here, but the way it’s currently implemented is very limiting and unnecessary.

To me, people are developing web-services with the mindset ‘can it hurt?’, when I think a better mantra is ‘will it advance computing and communication?’. Here’s the big web 2.0 problem: Just because you can make something web 2.0′ish, doesn’t mean you should. I think of this along the lines of Turing Complete, which is a notion in computer science for determining whether a system can express any computation. Basically, as long as you can process an input, store state, and return an output (i.e. a potentially stateful function), you can do any computation. Now web pages provide an input form, perform calculations server side, and can generate outputting pages – enough to do anything according to this paradigm, but with extreme limitations on visualization and performance (like with games). AJAX makes web views richer, but it is not only a terribly hacked up programming model, but for some reason compels developers to convert previously successful client-end-based applications into web-based services. Sometimes this makes sense from an end-user perspective, but consequently results in dumbing down the user experience.

We have amazing hardware that’s not being leveraged in web-based services. Browsers provide an emulation for a real application. However, given the proliferation of AJAX web 2.0 services, we’re starting to see applications only appear in the browser and not on the client. I think this current architecture view is unfortunate, because what I see in a browser is typically static content – something I could capture the essence of with a camera shot. In some sense, Web 2.0 is a surreal hack on what the real online experience should be.

I feel we really deserve truly rich applications that deliver ‘Minority Report’ style interfaces that utilize the client’s hardware. Movies predating the 1970′s predicted so much more for our current state’s user experience level. It’s up to us, the end-consumer, to encourage innovation in this space. It’s up to us, the developer, to build killer-applications that require tapping into a computer’s powerful hardware.The more we hype up web 2.0 and dumb-downed webpage experiences, the more website-based services we get – and consequently, less innovation in hardware driven UI’s.

But there’s hope. I think there exists a fair compromise between client-end applications and server-side web services. Internet is getting faster, the browser + Flash are getting fine tuned to make better use of a computer’s resources. Soon, the internet will be well-suited for thin-client computing. A great example of this already exists today, and I’m sure many of you have used it: Google Earth. It’s a client-end application – taking advantage of the computer’s graphics and processing power to make the user feel like he/she is traveling in and out of space – while being a server-side service since it gathers updated geographical data from the web. The only problem is there’s no cross-platform, preexisting layer to build applications like this. How do we make these services without forcing the user to do an interventionist, slow installation? How do we make it run over different platforms? Personally, I think Microsoft completely missed the boat here with .NET. If MS could have recognized the web phenomena early on, they could have build this layer into Vista to encourage users to develop these rich thin-client applications, while also promoting Vista. I have no reason to change my OS – this could have been my reason! Even if it was cross platform, if they had better performance it’s still a reason to prefer (providing some business case). Instead, they treated .NET as a Java-based replacement for MFC, thereby forcing developers to resort to building their cross-platform, no-installation-required services through AJAX and Flash.

Now, even if this layer existed, which would enable developers to build and instantly deploy Google Earth style applications in a cross-platform manner, there would be security concerns. I mean, one could make the case that ActiveX attempted to do this – allowing developers to run arbitrary code on the client’s machines. Unfortunately, this led to numerous viruses. Security violations and spyware scare(d) all of us – so much so that we now do traditionally client-end functions through a dumb-downed web browser interface. But, I think we made some serious inroads in security since then. The fact that we even recognize security in current development makes us readily prepared to support such a platform. I am confident that the potential security issues can be tackled.

To make a final point, I think we all really need higher expectations in the user experience front. We need to develop killer applications that push the limitations of our hardware – to promote innovation and progress. We’re currently at a standstill in my opinion. This isn’t how the internet should be. This is not how I envisioned the future to be like 5 years ago. We can do better. We can build richer applications. But to do this, we as consumers must demand it in order for companies to have a business case to further pursue it. We need developers to come up with innovative ways of visualizing the large amounts of data being generated with the use of hardware – thereby delivering long-awaited killer-applications for our idly computers. Let’s take our futuristic dreams and finally translate them into our present reality.

7 Comments

Filed under Computer Science, Databases, Google, Hardware, UI, Web2.0

Google Co-op just got del.icio.us!

Update: Sorry, link is going up and down. Worth trying, but will try to find a more stable option when time cycles free up.

This past week I decided to cook up a service (link in bold near the middle of this post) I feel will greatly assist users in developing advanced Google Custom Search Engines (CSE’s). I read through the Co-op discussion posts, digg/blog comments, reviews, emails, etc. and learned many of our users are fascinated by the refinements feature – in particular, building search engines that produce results like this:

‘linear regression” on my Machine Learning Search Engine

… but unfortunately, many do not know how to do this nor understand/want to hack up the XML. Additionally, I think it’s fair to say many users interested in building advanced CSE’s have already done similar site tagging/bookmarking through services like del.icio.us. del.icio.us really is great. Here are a couple of reasons why people should (and do) use del.icio.us:

  • It’s simple and clean
  • You can multi-tag a site quickly (comma separated field; don’t have to keep reopening the bookmarklet like with Google’s)
  • You can create new tags on the fly (don’t choose the labels from a fixed drop-down like with Google’s)
  • The bookmarklet provides auto-complete tag suggestions; shows you the popular tags others have used for that current site
  • Can have bundles (two level tag hierarchies)
  • Can see who else has bookmarked the site (can also view their comments); builds a user community
  • Generates a public page serving all your bookmarks

Understandably, we received several requests to support del.icio.us bookmark importing. My part-time role with Google just ended last Friday, so, as a non-Googler, I decided to build this project. Initially, I was planning to write a simple service to convert del.icio.us bookmarks into CSE annotations – and that’s it – but realized, as I learned more about del.icio.us, that there were several additional features I could develop that would make our users’ lives even easier. Instead of just generating the annotations, I decided to also generate the CSE contexts as well.

Ok, enough talk, here’s the final product:
http://basundi.com:8000/login.html

If you don’t have a del.icio.us account, and just want to see how it works, then shoot me an email (check the bottom of the Bio page) and I’ll send you a dummy account to play with (can’t publicize it or else people might spam it or change the password).

Here’s a quick feature list:

  • Can build a full search engine (like the machine learning one above) in two steps, without having to edit any XML, and in less than two minutes
  • Auto-generates the CSE annotations XML from your del.icio.us bookmarks and tags
  • Provides an option to auto-generate CSE annotations just for del.icio.us bookmarks that have a particular tag
  • Provides an option to Auto-calculate each annotation’s boost score (log normalizes over the max # of Others per bookmark)
  • Provides an option to Auto-expand links (appends a wildcard * to any links that point to a directory)
  • Auto-generates the CSE context XML
  • Auto-generates facet titles
  • Since there’s a four facet by five labels restriction (that’s the max that one can fit in the refinements display on the search results page), I provide two options for automatic facet/refinement generation:
    • The first uses a machine learning algorithm to find the four most frequent disjoint 5-item-sets (based on the # of del.icio.us tag co-occurrences; it then does query-expansion over the tag sets to determine good facet titles)
    • The other option returns the user’s most popular del.ico.us bundles and corresponding tags
    • Any refinements that do not make it in the top 4 facets are dumped in a fifth facet in order of popularity. If you don’t understand this then don’t worry, you don’t need to! The point is all of this is automated for you (just use the default Cluster option). If you want control over which refinements/facets get displayed, then just choose Bundle.
  • Provides help documentation links at key steps
  • And best of all … You don’t need to understand the advanced options of Google CSE/Co-op to build an advanced CSE! This seriously does all the hard, tedious work for you!

In my opinion, there’s no question that this is the easiest way to make a fancy search engine. If I make any future examples I’m using this – I can simply use del.icio.us, sign-in to this service, and voila I have a search engine with facets and multi-label support.


Please note that this tool is not officially endorsed by nor affiliated with Google or Yahoo! It was just something I wanted to work on for fun that I think will benefit many users (including myself). Also, send your feedback/issues/bugs to me or post them on this blog.

74 Comments

Filed under AI, Co-op, CS, CSE, Google, Machine Learning, Research, Tagging

SDSS Skyserver Traffic

This past summer I worked at MSR alongside Dr. Jim Gray on analyzing the Skyserver’s (the online worldwide telescope portal) web and SQL logs. We just published our findings, which you can access here (MSR) or here (updated).

Still needs some clean-up (spelling, grammar, flow) and additional sections to tie up some loose ends, but it’s definitely presentable. Would love to hear what you guys think about the results (besides how pretty the graphs look :) .

3 Comments

Filed under CS, Databases, Education, Publications, Research, Science