Category Archives: Data Mining

Twitter + BOSS = Real Time Search

Update: (6/25) This application has been updated. Go here to learn more. The description below though still applies.

Update: (6/11) In case you’re bored, here’s a discussion we had with Google and Twitter about Open & Real-time Search.

Update: (1/19) If you have issues, try again in 5-10 minutes. You can also check out the screenshots below. (1/15) App Engine limits were reached (and fast). I appreciate the love, and my apologies for not fully anticipating that. Google was nice enough, though, to temporarily raise the quota for this application. Anyway, this was more to show a cool BOSS developer example using code libraries I released earlier, but there might be more here. Stay tuned.

Here’s a screenshot as well (which should hopefully be stale by the time you read this).

Basically, this service boosts Yahoo’s freshest news search results (which typically have little relevance ordering, since they are sorted by timestamp and that’s it) based on how similar they are to the emerging topics found on Twitter for the same query. In effect, it uses Twitter to determine authority for content that is too fresh to have links yet. It also overlays related tweets, when they exist, under each result via an AJAX expando button (big thanks to Greg Walloch at Yahoo! for the design). A nice added feature of the overlay is near-duplicate removal, which ensures that the message thread on any given result provides as much comment diversity as possible.
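To make the near-duplicate removal idea concrete, here is a minimal sketch. It is not the code that runs in the service: the Jaccard word-overlap measure, the 0.6 threshold, and the function names are all assumptions picked for illustration.

```python
import re


def tokens(text):
    """Lowercase a message and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def remove_near_duplicates(messages, threshold=0.6):
    """Keep a message only if it is not too similar to one already kept.

    The threshold is an assumed value, not one taken from the service.
    """
    kept, kept_tokens = [], []
    for msg in messages:
        t = tokens(msg)
        if all(jaccard(t, k) < threshold for k in kept_tokens):
            kept.append(msg)
            kept_tokens.append(t)
    return kept
```

Retweets and lightly edited copies share most of their words with the original, so they fall above the threshold and are dropped, while unrelated comments survive.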

Freshness (especially in the context of search) is a challenging problem. Traditional PageRank-style algorithms don’t really work here, as it takes time for a fresh URL to garner enough links to beat an older, high-ranking URL. One approach is to use cluster size as a feature for measuring the popularity of a story (e.g. Google News). Although quite effective IMO, this may not always be fast enough: for a cluster to grow, other sources have to write about the same story, and traditional media can be slow, especially on local topics.

I remember seeing breaking Twitter messages describing the California wildfires. When I searched Google/Yahoo/Microsoft right at that moment I barely got anything (< 5 results spanning 3 search results pages). I had a similar episode when I searched on the Mumbai attacks. In both cases, the Twitter messages provided incredible focus on the important subtopics that had yet to become popular in the traditional media and news search worlds. What I found most interesting was that news articles did exist on these topics, but they either weren’t valued highly enough yet or weren’t focusing on the right stories (as the majority of tweets were).

So why not just do that? Order these fresh news articles (which mostly provide authority and in-depth coverage) by the number of related fresh tweets, and show the tweets under each. That’s this service.
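Here is a rough sketch of that ordering step. It assumes the news results and tweets have already been fetched, and it treats a tweet as "related" to an article when they share at least two non-stopword terms. That relatedness test, the stopword list, and the dictionary fields are hypothetical simplifications, not the similarity measure the service uses.

```python
import re

# Tiny illustrative stopword list (an assumption, not the service's list).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "for", "at", "is"}


def terms(text):
    """Return the set of lowercase, non-stopword terms in a piece of text."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def rank_by_tweets(articles, tweets, min_shared=2):
    """Attach related tweets to each article and sort by how many it has.

    sorted() is stable, so articles with the same tweet count keep their
    incoming (freshest-first) order.
    """
    tweet_terms = [(t, terms(t)) for t in tweets]
    ranked = []
    for article in articles:
        a_terms = terms(article["title"] + " " + article.get("abstract", ""))
        related = [t for t, tt in tweet_terms if len(a_terms & tt) >= min_shared]
        ranked.append({**article, "tweets": related})
    return sorted(ranked, key=lambda a: len(a["tweets"]), reverse=True)
```

With a fresh but off-topic article first and a heavily tweeted one second, the second moves to the top and carries its matching tweets along for display.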

To illustrate the need, here’s a quick before and after shot. I searched for ‘nba’ using Yahoo’s news search ordered by latest results (first image). Very fresh (within a minute) but subpar quality. The first result talks about teams that are in a different basketball league than the NBA. However, search for ‘nba’ on TweetNews (second image) and you get the Kings/Warriors triple OT game highlight, which was buzzing more on Twitter at that minute.

[Image: 'NBA' on Y! News latest]

[Image: 'NBA' on TweetNews (Y! News latest enhanced by Twitter)]

There’s something very interesting here … Twitter as a ranking signal for search freshness may prove to be very useful if constructed properly. Definitely deserves more exploration – hence this service, which took < 100 lines of code to represent all the search logic thanks to Yahoo! BOSS, Twitter’s API, and the BOSS Mashup Framework.

To sum up, the contributions of this service are: (1) Real-time search + freshness (2) Stitching social commentary to authoritative sources of information (3) Another (hopefully cool) BOSS example.

The code is packaged for general open consumption and has been ported to run on App Engine (which powers this service actually). You can download all the source here.

Yahoo Boss – Google App Engine Integrated

Updated: I see blogs doing evaluations of the Q&A engine. I have to admit, that wasn’t my focus here. The service is merely 50 lines of code … just to demonstrate the integration of BMF and GAE.

Updated: Direct link to the example Question-Answering Service

Today I finally plugged the Yahoo Boss Mashup Framework into the Google App Engine environment. Google App Engine (GAE) provides a pretty sweet yet simple platform for executing Python applications on Google’s infrastructure. The Boss Mashup Framework (BMF) provides Python APIs for accessing Yahoo’s Search APIs as well as remixing data a la SQL constructs. Running BMF on top of GAE is a seemingly natural progression, and quite arguably the easiest way to deploy Boss – so I spent today porting BMF to the GAE platform.

Here’s the full BMF-GAE integrated project source download.

There’s a README file included. Just unzip, put your appids in the config files, and you’re done. No setup or dependencies (easier than installing BMF standalone!). It’s a complete GAE project directory which includes a directory called yos that holds all the ported BMF code. I also made a number of improvements to the BMF code (SQL ‘where’ support, stopwords, yql.db refactoring, util & templates in yos namespace, yos.crawl.rest refactored & optimized, etc.).

The next natural thing to do is to develop a test application on top of this united framework. In the original BMF package, there’s an examples directory. In particular, ex6.py was able to answer some ‘when’ style questions. I simply wrapped that code as a function and referenced it as a GAE handler in main.py.

Here’s the ‘when’ q&a source code as a webpage (less than 25 lines).

The algorithm is quite easy – use the question as the search query and fetch 50 results via the Boss API. Count the dates that occur in the results’ abstracts, and simply return the most popular one.
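A minimal sketch of the counting step, assuming the abstracts have already been fetched as plain strings. The regular expression below is my own simplification (full dates like "July 20, 1969" or bare four-digit years) and is not necessarily what ex6.py matches.

```python
import re
from collections import Counter

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Matches "Month D, YYYY" when present, otherwise a bare year (1000-2099).
DATE_RE = re.compile(
    r"\b(?:(?:%s) \d{1,2}, )?(?:1[0-9]{3}|20[0-9]{2})\b" % MONTHS
)


def most_common_date(abstracts):
    """Return the date string mentioned most often, or None if none found."""
    counts = Counter()
    for abstract in abstracts:
        counts.update(DATE_RE.findall(abstract))
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

For a question like "when did Apollo 11 land on the moon", most abstracts repeat the same date, so a simple vote is often enough.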

For fun, following a similar pattern to the ‘when’ code, I developed another handler to answer ‘who’ or ‘what’ or ‘where’ style questions (finding the most popular capitalized phrase).
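The same voting idea can be sketched for capitalized phrases. This version assumes abstracts arrive as plain strings and skips phrases made up entirely of words from the question itself; that filter and the regular expression are assumptions for illustration, not a copy of the bundled handler.

```python
import re
from collections import Counter

# One or more consecutive capitalized words, e.g. "William Shakespeare".
PHRASE_RE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b")


def most_common_phrase(question, abstracts):
    """Return the most frequent capitalized phrase not echoing the question."""
    question_words = set(re.findall(r"[a-z]+", question.lower()))
    counts = Counter()
    for abstract in abstracts:
        for phrase in PHRASE_RE.findall(abstract):
            words = phrase.lower().split()
            # Skip phrases that only repeat words from the question.
            if all(w in question_words for w in words):
                continue
            counts[phrase] += 1
    return counts.most_common(1)[0][0] if counts else None
```

Sentence-initial words such as "The" still get counted, so a real version would want a stopword filter on top of this.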

Here’s the complete example (just ~50 lines of code – bundled in project download):

Q&A Running Service Example

Keep in mind that this is just a quick proof of concept to hopefully showcase the power of BMF and the idea of Open Web Search.

If you’re interested in learning more about this Q&A system (or how to improve it), check out AskMSR – the original inspiration behind this example.

Also, shoutout to Sam for his very popular Yuil example, which is powered by BMF + GAE. The project download linked above aims to make it easier for people to build these types of web services.

Yahoo! Boss – An Insider View

Disclaimer: This is my personal blog. The views expressed on these pages are mine alone and not those of my employer.

Boss stands for Build your Own Search Service. The goal of Boss is to open up search to enable third parties to build incredibly useful and powerful search-based applications. Several months ago I pitched the executives on how Yahoo! could open up its search assets to fragment the market. It’s remarkable to finally see some of the vision (with the help of many talented people) reach the public today.

Web search is a tough business to get into. It takes $300+ million in capex, amazing talent, infrastructure, a prayer, etc. just to get close to basic parity. Only 3 companies have really pulled it off. However, I strongly believe we need to find innovative, incremental ways to spread the search love in order to encourage fragmentation and help promising companies get to basic parity instantly, so that they can leverage their unique assets (new algorithm, user data, talent) to push their search solution beyond the current baseline.

Search is all about understanding the user’s intent. If we can nail the intent, then search is pretty much a solved problem. However, the current model of a single search box for everything loses its focus on intent as it aims to cater to all people and queries. Granted, a single search box definitely makes our lives easier, but I have a hard time believing this is the *right* approach.

In my online experience, I typically visit a variety of sites: Techmeme, Digg, Techcrunch, eBay, Amazon, del.icio.us, etc. While on these pages, something almost always catches my eye, and so I proceed to the search box in my browser to find out more on the web. Why do we have this disconnected experience? I think it’s because these sites do not provide web-level comprehensiveness. It’s unfortunate, because the page that I’m on may have additional information about my intent (maybe I’m logged in so it has my user info, or it’s a techy shopping site).

The biggest goal of Boss is to help bootstrap sites like these to get comprehensiveness and basic ranking for free, as well as offer tools to re-rank, blend, and overlay the results in a way that revolutionizes the search experience.

When I’m on del.icio.us, why can’t I search in their box, get relevant del.icio.us results at the top, and also have web results backfill below? I think users should be confident that if they search in a search box on any page on the whole wide web, they’ll get results that are at least as good as Yahoo/Google, if not better.

The first milestone of Boss is a simple one: Make available a clean search API that turns off the traditional restrictions so that developers can totally control presentation, re-rank results, run an unlimited number of queries, and blend in external content all without having to include any Yahoo! attribution in the resulting product(s). Want to build the example above or put news search results on a map – go for it!
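As a rough illustration of the backfill example, here is a small sketch that puts a site's own results first and fills the remaining slots with web results, skipping URLs already shown. The result dictionaries, the `url` field, and the function name are hypothetical; fetching the two lists is left out.

```python
def blend(site_results, web_results, limit=10):
    """Site results first, then web results as backfill, deduplicated by URL."""
    seen = set()
    blended = []
    for result in list(site_results) + list(web_results):
        # Treat URLs that differ only by a trailing slash as the same page.
        url = result["url"].rstrip("/")
        if url in seen:
            continue
        seen.add(url)
        blended.append(result)
        if len(blended) == limit:
            break
    return blended
```

Because the site's results are consumed first, they always rank above the backfill, and the web list only contributes pages the site did not already return.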

Here’s a link to the API:

http://developer.yahoo.com/search/boss/

Also, check out the Boss Mashup Framework:

http://developer.yahoo.com/search/boss/mashup.html

The Boss Mashup Framework, in my opinion, makes the Boss Search API really useful. It lets developers use SQL-like syntax for operating on heterogeneous web data sources. The idea came up as I was working on examples to showcase Boss and realized that the operations I was writing imperatively closely followed declarative SQL-like constructs. Since it’s a recent idea and implementation, there may be some bugs or weird designs lurking in there, but I strongly recommend playing around with it and viewing the examples included in the package. I’m biased of course, but I do think it’s a fun framework for remixing online data. One can rank web results by digg and youtube favorite counts, remove duplicates, and publish the results using a provided search results page template in less than 30 lines of code, without having to specify any parsing logic for the data sources/APIs, as the framework can infer the structure and unify the data formats automatically in most cases.
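To give a feel for what "SQL-like constructs" means here, the sketch below uses plain Python over lists of dictionaries. It deliberately does not use the framework's own API; `left_join`, `distinct`, and `order_by` are hypothetical helper names, and the `url` and `diggs` fields are made up for the example.

```python
def left_join(rows, other, key):
    """Add fields from matching rows in `other`; keep unmatched rows as-is."""
    index = {r[key]: r for r in other}
    return [{**index.get(row[key], {}), **row} for row in rows]


def distinct(rows, key):
    """Keep only the first row seen for each value of `key`."""
    seen, out = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            out.append(row)
    return out


def order_by(rows, field, default=0):
    """Sort rows by `field`, highest first; missing values count as default."""
    return sorted(rows, key=lambda r: r.get(field, default), reverse=True)


def rank_web_by_diggs(web_rows, digg_rows):
    """Join web results to digg counts, drop duplicate URLs, rank by diggs."""
    return order_by(distinct(left_join(web_rows, digg_rows, "url"), "url"), "diggs")
```

Each helper maps to a familiar clause (JOIN, DISTINCT, ORDER BY), which is the observation that motivated expressing mashups declaratively.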

The next couple of milestones for Boss I think are even more interesting and disruptive – server side services, monetization, blending ranking models, more features exposure, query classifiers, open source … so stay tuned.

Techmeme Leaderboard 2007 – More!

I’m an avid reader of Techmeme. Love the idea, UI, freshness, coverage, and most of all the quality of the articles.

When the Techmeme Leaderboard debuted earlier this month, lots of buzz circulated around the blogosphere. Being a huge fan of partying on data, I loved the concept and wanted to take the analysis even further (Yuvi style, but with a search twist).

So yesterday I wrote up some code to crawl and analyze Techmeme articles over the whole year (the Leaderboard shows the Top 50 sources for this month). I took a snapshot of Techmeme at 1:00PM every day from the beginning of January through the end of September 2007.

I computed basic statistics, like number of stories by author and source, as well as more involved measurements like the top word mentions of the year – in total and by category (used simple NLP to clean up the text and remove stopwords).
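For anyone curious what these counts look like in code, here is a minimal sketch over already-crawled stories. The `source` and `headline` fields and the short stopword list are assumptions for illustration, and the sketch skips the stemming step.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "on", "for", "and", "is", "new"}


def stories_by(stories, field):
    """Count stories per value of `field` (e.g. 'source'), most frequent first."""
    return Counter(s[field] for s in stories).most_common()


def top_words(stories, n=10):
    """Return the n most mentioned non-stopword words across headlines."""
    counts = Counter()
    for story in stories:
        for word in re.findall(r"[a-z]+", story["headline"].lower()):
            if word not in STOPWORDS and len(word) > 2:
                counts[word] += 1
    return counts.most_common(n)
```

Grouping the same word counts by a category field would give the per-category breakdown.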

So, without further ado, here are the results:

Number of Stories by Author in 2007, Ranked
Number of Stories by Source in 2007, Ranked
Most Mentioned Words in 2007, Ranked
* words are stemmed
Most Mentioned Words, by Category, Trends in 2007, Ranked

Hope you guys find these results super interesting and useful.
