<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Vik&#039;s Blog</title>
	<atom:link href="http://zooie.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://zooie.wordpress.com</link>
	<description>&#34;Let&#039;s party on the data!&#34; -- Jim Gray</description>
	<lastBuildDate>Thu, 26 Jan 2012 03:04:57 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='zooie.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://1.gravatar.com/blavatar/fbfd4e3186e2ecd3a7ad448bf907c50f?s=96&#038;d=http%3A%2F%2Fs2.wp.com%2Fi%2Fbuttonw-com.png</url>
		<title>Vik&#039;s Blog</title>
		<link>http://zooie.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://zooie.wordpress.com/osd.xml" title="Vik&#039;s Blog" />
	<atom:link rel='hub' href='http://zooie.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Betting on UFC Fights &#8211; A Statistical Data Analysis</title>
		<link>http://zooie.wordpress.com/2011/09/21/betting-on-ufc-fights-a-statistical-data-analysis/</link>
		<comments>http://zooie.wordpress.com/2011/09/21/betting-on-ufc-fights-a-statistical-data-analysis/#comments</comments>
		<pubDate>Wed, 21 Sep 2011 08:56:47 +0000</pubDate>
		<dc:creator>Vik</dc:creator>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Blog Stuff]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Economics]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Science]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Trends]]></category>
		<category><![CDATA[Betting]]></category>
		<category><![CDATA[Finance]]></category>
		<category><![CDATA[Gambling]]></category>
		<category><![CDATA[MMA]]></category>
		<category><![CDATA[Sports]]></category>
		<category><![CDATA[Stats]]></category>
		<category><![CDATA[UFC]]></category>

		<guid isPermaLink="false">http://zooie.wordpress.com/?p=592</guid>
		<description><![CDATA[Mixed Martial Arts (MMA) is an incredibly entertaining and technical sport to watch. It&#8217;s become one of the fastest growing sports in the world. I&#8217;ve been following MMA organizations like the Ultimate Fighting Championship (UFC) for almost eight years now, and &#8230; <a href="http://zooie.wordpress.com/2011/09/21/betting-on-ufc-fights-a-statistical-data-analysis/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Mixed Martial Arts (MMA) is an incredibly entertaining and technical sport to watch. It&#8217;s become one of the fastest growing sports in the world. I&#8217;ve been following MMA organizations like the Ultimate Fighting Championship (UFC) for almost eight years now, and in that time have developed a great appreciation for MMA techniques. After watching dozens of fights, you begin to pick up on what moves win and when, and spot strengths and weaknesses in certain fighters. However, I&#8217;ve always wanted to test my knowledge against the actual stats &#8211; for example, do accomplished wrestlers really beat fighters with little wrestling experience?</p>
<p>To do this, we need fight data, so I crawled and parsed all the MMA fights from <a href="http://www.sherdog.com">Sherdog.com</a>. This data includes fighter profiles (birth date, weight, height, disciplines, training camp, location) and fight records (challenger, opponent, time, round, outcome, event). After some basic data cleaning, I had a dataset of 11,886 fight records, 1,390 of which correspond to the UFC.</p>
<p>I then trained a <a href="http://en.wikipedia.org/wiki/Random_forest">random forest classifier</a> on this data to see if a state-of-the-art machine learning model can identify any winning and losing characteristics. Under 10-fold <a href="http://en.wikipedia.org/wiki/Cross-validation_(statistics)">cross-validation</a>, the resulting model scored a surprisingly decent <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_Under_Curve">AUC</a> of 0.69; an AUC close to 0.5 would indicate that the model can&#8217;t predict winning fights any better than a fair coin flip.</p>
<p>So there may be interesting patterns in this data &#8230; Feeling motivated, I ran exhaustive searches over the data to find feature combinations that indicate winning or losing behaviors. Many hours later, I had found several dozen such insights.</p>
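<p>To make the AUC figure concrete, here is a minimal sketch of how an AUC can be computed by hand: it is the probability that a randomly chosen winner receives a higher model score than a randomly chosen loser. The labels and scores are hypothetical and are not the output of the random forest described here.</p>

```python
def auc_score(labels, scores):
    """Probability that a random winner outscores a random loser (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    total = 0.0
    for p in positives:
        for n in negatives:
            if p == n:
                total += 0.5
            elif p > n:
                total += 1.0
    return total / (len(positives) * len(negatives))


# Hypothetical labels (1 = the fighter won) and model scores, for illustration only.
labels = [1, 1, 1, 0, 0, 0]
informative = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
constant = [0.5] * 6  # a model that cannot tell fighters apart

print(auc_score(labels, informative))
print(auc_score(labels, constant))
```

<p>A model that gives every fighter the same score lands exactly on 0.5, which is why 0.5 is the coin-flip baseline.</p>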
<p>Here are the most interesting ones (stars indicate statistical significance at the 5% level):</p>
<h2>Top UFC Insights</h2>
<p>&nbsp;</p>
<div>
<h3><strong>Fighters older than 32 years of age will more likely lose</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 173 out of 277 (62%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters with more than 6 TKO victories fighting opponents older than 32 years of age will more likely win</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 47 out of 60 (78%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters from Japan will more likely lose</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 36 out of 51 (71%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters who have lost 2 or more KOs will more likely lose</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 54 out of 84 (64%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters who have 3x or more decision wins and are more than 3% taller than their opponents will more likely win</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 32 out of 38 (84%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters who have won at least 3x as many decisions as their opponents will more likely win</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 142 out of 235 (60%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters with no wrestling background facing fighters who do have one will more likely lose</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 136 out of 212 (64%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters fighting opponents who have 3x or fewer decision wins and are on a 6 fight (or better) winning streak will more likely win</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 30 out of 39 (77%) fights*</em></div>
<p>&nbsp;</p>
</div>
<div>
<h3><strong>Fighters younger than their opponents by 3 or more years in age will more likely win</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 324 out of 556 (58%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters who haven&#8217;t fought in more than 210 days will more likely lose</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 162 out of 276 (59%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters taller than their opponents by 3% will more likely win</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 159 out of 274 (58%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters who have fewer submission losses than their opponents will more likely win</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 295 out of 522 (57%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters who have lost 6 or more fights will more likely lose</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 172 out of 291 (59%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters who have 18 or more wins and have never had a 2 fight losing streak will more likely win</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 79 out of 126 (63%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters who have lost back to back fights will more likely lose</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 514 out of 906 (57%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters with 0 TKO victories will more likely lose</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 90 out of 164 (55%) fights</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters fighting opponents out of Greg Jackson&#8217;s camp will more likely lose</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 38 out of 63 (60%) fights</em></div>
<p>&nbsp;</p>
</div>
<div>
<p>&nbsp;</p>
<h2>Top Insights over All Fights</h2>
<p>&nbsp;</p>
<h3><strong>Fighters with 15 or more wins that have 50% fewer losses than their opponents will more likely win</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 239 out of 307 (78%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters fighting American opponents will more likely win</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 803 out of 1303 (62%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters with 2x (or more) the wins of their opponents, when those opponents lost their last fight, will more likely win</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 709 out of 1049 (68%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters who&#8217;ve lost their last 4 fights in a row will more likely lose</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 345 out of 501 (69%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters currently on a 5 fight (or better) winning streak will more likely win</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 1797 out of 2960 (61%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters with 3x or more wins than their opponents will more likely win</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 2831 out of 4764 (59%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters who have lost 7 or more times will more likely lose</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 2551 out of 4547 (56%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters with no jiu jitsu in their background facing fighters who do have it will more likely lose</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 334 out of 568 (59%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters who have lost by submission 5 or more times will more likely lose</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 1166 out of 1982 (59%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters in the Middleweight division who fought their last fight more recently will more likely win</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 272 out of 446 (61%) fights*</em></div>
<p>&nbsp;</p>
<h3><strong>Fighters in the Lightweight division fighting opponents who are 6 feet tall (or taller) will more likely win</strong></h3>
<div style="padding-left:30px;"><em>This was validated in 50 out of 83 (60%) fights</em></div>
<p>&nbsp;</p>
<div>
<p>Note &#8211; I separated UFC fights from all fights because regulations and rules can vary across MMA organizations.</p>
<p>Most of these insights are intuitive except for maybe the last one and an earlier one, which states that 77% of the time fighters beat opponents who are on 6 fight (or better) winning streaks but have 3x fewer decision wins.</p>
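<p>The stars on the insights above mark significance at the 5% level. The post does not say which test produced them, so the following is only a sketch of one plausible check: an exact two-sided binomial test against the assumption that the rule is a 50/50 coin flip. The function name is made up for illustration.</p>

```python
from math import comb


def binomial_two_sided_p(successes, trials):
    """Exact two-sided p-value against a 50/50 null.

    The null distribution is symmetric, so the two-sided p-value is
    twice the tail beyond the observed count, capped at 1.
    """
    k = max(successes, trials - successes)
    upper_tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * upper_tail)


# Counts taken from three of the UFC insights listed above.
for wins, fights in [(173, 277), (90, 164), (38, 63)]:
    p = binomial_two_sided_p(wins, fights)
    print(f"{wins}/{fights}: p = {p:.4f}")
```

<p>Under this assumed test, the starred 173 out of 277 result falls below 0.05 while the unstarred 90 out of 164 and 38 out of 63 results do not, which is consistent with the stars but does not confirm that this was the test actually used.</p>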
<p>Many of these insights demonstrate statistically significant winning biases. I couldn&#8217;t help but wonder &#8211; could we use these insights to effectively bet on UFC fights? For the sake of simplicity, what happens if we make bets based on just the very first insight which states that fighters older than 32 years old will more likely lose (with a 62% chance)?</p>
<p>To evaluate this betting rule, I pulled the most recent UFC fights in which at least one fighter was 33 or older. I found 52 such fights, spanning <a href="http://en.wikipedia.org/wiki/List_of_UFC_events" target="_blank">2/5/2011 &#8211; 8/14/2011</a>. I placed a $10K bet on the younger fighter in each of these fights.</p>
<p>Surprisingly, this rule calls 33 of these 52 fights correctly (63% &#8211; very close to the rule&#8217;s observed 62% overall win rate). Each fight called incorrectly results in a loss of $10,000, and for each of the fights called correctly I obtained the corresponding Bodog money line (betting odds) to compute the actual winning amount.</p>
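<p>For readers unfamiliar with money lines, here is a minimal sketch of how a winning amount follows from American-style odds: a positive line is the profit on a $100 stake, and a negative line is the stake needed to profit $100. The two lines below are hypothetical and are not taken from the Bodog data.</p>

```python
def moneyline_profit(stake, line):
    """Profit on a winning bet given American odds.

    Positive line (underdog): profit per 100 staked.
    Negative line (favorite): amount staked per 100 of profit.
    """
    if line == 0:
        raise ValueError("a money line of 0 is not valid")
    if line > 0:
        return stake * line / 100.0
    return stake * 100.0 / abs(line)


# Hypothetical lines for a $10,000 bet.
print(moneyline_profit(10_000, -200))  # betting on a favorite
print(moneyline_profit(10_000, +150))  # betting on an underdog
```

<p>This is why correct calls on underdogs matter so much to the total: the same $10,000 stake pays out more when the line favored the other fighter.</p>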
</div>
<p>I&#8217;ve compiled the betting data for these fights in this <a href="https://docs.google.com/spreadsheet/ccc?key=0AhC1tMzehH5tdHFtS0d6Z0tIdEN5WDVacHRNN0E1bnc#gid=0">Google spreadsheet</a>.</p>
<p>Note that for 6 of the fights that our rule called correctly, the money lines favored the losing fighters.</p>
<p>Let&#8217;s compute the overall return of our simple betting rule:</p>
<div style="padding-left:30px;">For each of these 52 fights, we risked $10,000, or in all $520,000</div>
<div style="padding-left:30px;">We lost 19 times, or a total of $190,000</div>
<div style="padding-left:30px;">Based on the betting odds of the 33 fights we called correctly (see spreadsheet), we won $255,565.44</div>
<div style="padding-left:30px;">Profit = $255,565.44 - $190,000 = $65,565.44</div>
<div style="padding-left:30px;">Return on investment (<strong>ROI</strong>) = 100 * 65,565.44 / 520,000 = <strong>12.6%</strong></div>
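<p>The same arithmetic as a short script, using only the figures stated above (the variable names are my own):</p>

```python
BET = 10_000
FIGHTS = 52
LOSSES = 19
WINNINGS = 255_565.44  # total won on the 33 correct calls, as reported above

risked = BET * FIGHTS
lost = BET * LOSSES
profit = WINNINGS - lost
roi_percent = 100 * profit / risked

print(f"risked ${risked:,}, lost ${lost:,}")
print(f"profit ${profit:,.2f}, ROI {roi_percent:.1f}%")
```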
<p>&nbsp;</p>
<p>That&#8217;s a very decent return.</p>
<p>For kicks, let&#8217;s compare this to investing in the stock market over the same period of time. If we buy the S&amp;P 500 with a conventional <a href="http://en.wikipedia.org/wiki/Dollar_cost_averaging">dollar cost averaging</a> strategy to spread out the $520,000 investment, then we get an <a href="http://www.buyupside.com/calculators/dollarcostaveinclude.php?symbol=SPY&amp;amount=74285.71&amp;divre=No&amp;start_month=01&amp;start_year=2011&amp;end_month=07&amp;end_year=2011&amp;submit=Calculate+Returns">ROI of -7.31%</a>. Ouch.</p>
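<p>The linked calculator uses an amount of 74,285.71 per purchase, which is the $520,000 split over seven monthly purchases. A minimal sketch of how a dollar cost averaging return is computed follows; the prices are hypothetical and are NOT actual SPY quotes, so the printed number has nothing to do with the -7.31% figure.</p>

```python
def dca_roi_percent(purchase_prices, final_price, amount_per_purchase):
    """ROI (%) from investing a fixed amount at each price, valued at final_price."""
    shares = sum(amount_per_purchase / price for price in purchase_prices)
    invested = amount_per_purchase * len(purchase_prices)
    return 100 * (shares * final_price - invested) / invested


# Hypothetical prices for illustration only.
prices = [100.0, 80.0, 125.0]
print(dca_roi_percent(prices, final_price=100.0, amount_per_purchase=1000.0))
```

<p>Because a fixed dollar amount buys more shares when the price is low, the result depends on the whole price path and not just on the first and last price.</p>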
<p>Keep in mind that we&#8217;re using a simple betting rule that&#8217;s based on a single insight. The random forest model, which optimizes over many insights, should predict better and be applicable to more fights.</p>
<p>Please note that I&#8217;m just poking fun at stocks &#8211; I&#8217;m not saying betting on UFC fights with this rule is a more sound investment strategy (risk should be thoroughly examined &#8211; the variance of the performance of the rule should be evaluated over many periods of time).</p>
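<p>One simple way to see how much uncertainty sits behind the 33 out of 52 hit rate is a confidence interval for the win rate. The sketch below uses the Wilson score interval, which is my choice for illustration and not something computed in the original analysis.</p>

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion (z = 1.96)."""
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(33, 52)
print(f"win rate 33/52: {low:.3f} to {high:.3f}")
```

<p>With only 52 fights the interval is wide, running from roughly 50% to roughly 75%, which is one concrete reason the rule should be evaluated over many more periods before trusting the return.</p>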
</div>
<p>The main goal here is to demonstrate the effectiveness of data driven approaches for better understanding the patterns in a sport like MMA. The UFC could leverage these data mining approaches for coming up with fairer matches (dismiss fights that match obvious winning and losing biases). I don&#8217;t favor this, but given many fans want to see knockouts, the UFC could even use these approaches to design fights that will likely avoid decisions or submissions.</p>
<p>Anyways, there&#8217;s so much more analysis I&#8217;ve done (and haven&#8217;t done) over this data. Will post more results when cycles permit. Stay tuned.</p>
]]></content:encoded>
			<wfw:commentRss>http://zooie.wordpress.com/2011/09/21/betting-on-ufc-fights-a-statistical-data-analysis/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/17518bf0a462f22fc174f2df8e464e69?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">zooie</media:title>
		</media:content>
	</item>
		<item>
		<title>Ranking High Schools Based On Outcomes</title>
		<link>http://zooie.wordpress.com/2011/04/18/ranking-high-schools-on-outcomes/</link>
		<comments>http://zooie.wordpress.com/2011/04/18/ranking-high-schools-on-outcomes/#comments</comments>
		<pubDate>Mon, 18 Apr 2011 09:00:28 +0000</pubDate>
		<dc:creator>Vik</dc:creator>
				<category><![CDATA[Blog Stuff]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Education]]></category>
		<category><![CDATA[Job Stuff]]></category>
		<category><![CDATA[LinkedIn]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Science]]></category>
		<category><![CDATA[Social]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[bay area]]></category>
		<category><![CDATA[education]]></category>
		<category><![CDATA[high schools]]></category>
		<category><![CDATA[linkedin]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://zooie.wordpress.com/?p=547</guid>
		<description><![CDATA[High school is arguably the most important phase of your education. Some families will move just to be in the district of the best ranked high school in the area. However, the factors that these rankings are based on, such &#8230; <a href="http://zooie.wordpress.com/2011/04/18/ranking-high-schools-on-outcomes/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>High school is arguably the most important phase of your education. Some families will move just to be in the district of the best ranked high school in the area. However, the factors that these rankings are based on, such as test scores, tuition amount, average class size, teacher to student ratio, location, etc. do not measure key outcomes such as what colleges or jobs the students get into.</p>
<p>Unfortunately, measuring outcomes is tough &#8211; there&#8217;s no data source that I know of that describes how all past high school students ended up. However, I thought it would be a fun experiment to approximate this using LinkedIn data. I took eight top high schools in the Bay Area (see the table below) and ran a whole bunch of advanced LinkedIn search queries to find graduates from these high schools while also counting up their key outcomes, like what colleges they graduated from, what companies they went on to work for, what industries they are in, what job titles they have earned, etc.</p>
<p>The results are quite interesting. Here are a few statistics:</p>
<p><strong>College Statistics</strong></p>
<ul>
<li>The top 5 high schools that have the largest share of users going to top private schools (Ivy League + Stanford + Caltech + MIT) are<span style="text-decoration:underline;"> (1) Harker (2) Gunn (3) Saratoga (4) Lynbrook (5) Bellarmine.</span></li>
<li>The top 5 high schools that have the largest share of users going to the top 3 UC&#8217;s (Berkeley, LA, San Diego) are <span style="text-decoration:underline;">(1) Mission (2) Gunn (3) Saratoga (4) Lynbrook (5) Leland</span>.</li>
<li>Although Harker has the highest share of users going to top privates (30%), their share of users going to the top UC&#8217;s is below average. It&#8217;s worth noting that Harker&#8217;s tuition is the highest at $36K a year.</li>
<li>Bellarmine, an all men&#8217;s high school with tuition of $15K a year, is below average in its share of users going on to top private universities as well as to the UC system.</li>
<li>Gunn has the highest share of users (11%) going on to Stanford. That&#8217;s more than 2x the second place high school (Harker).</li>
<li>Mission has the highest share of users (31%) going to the top 3 UC&#8217;s and to UC Berkeley alone (14%).</li>
</ul>
<p><strong>Career Statistics</strong></p>
<ul>
<li>In rank order<span style="text-decoration:underline;"> (1) Saratoga (2) Bellarmine (3) Leland</span> have the biggest share of users which hold job titles that allude to leadership positions (CEO, VP, Manager, etc.).</li>
<li>The highest share of lawyers comes from <span style="text-decoration:underline;">(1) Bellarmine (2) Lynbrook (3) Leland</span>. Gunn has 0 lawyers and Harker is second lowest at 6%.</li>
<li>Saratoga has the best overall balance of users in each industry (median share of users).</li>
<li>Hardware is fading &#8211; 5 schools (Leland, Gunn, Harker, Mission, Lynbrook) have zero users in this industry.</li>
<li>Harker has the highest share of its users in the Internet, Financial, and Medical industries.</li>
<li>Harker has the lowest percentage of Engineers and below average share of users in the Software industry.</li>
<li>Gunn has the highest share of users in the Software and Media industries.</li>
<li>Harker high school is relatively new (formed in 1998), so its graduates are still early in the workforce. Leadership takes time to earn, so the leadership statistic is unfairly biased against Harker.</li>
</ul>
<p>You can see all the stats I collected in the table below. Keep in mind that percentages correspond to the share of users from the high school that match that column&#8217;s criteria. Yellow highlights correspond to the best score; blue shaded boxes correspond to scores that are above average. There are quite a few caveats which I&#8217;ll note in more detail later, so take these results with a grain of salt. However, as someone who has lived in the Bay Area his whole life, I will say that many of these results make sense to me.</p>
<p><a href="http://zooie.files.wordpress.com/2011/04/high_schools.png"><img class="alignleft size-full wp-image-548" title="Bay Area High Schools Outcome Statistics" src="http://zooie.files.wordpress.com/2011/04/high_schools.png?w=500" alt=""   /></a></p>
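<p>Each cell in the table is a share: the number of matching LinkedIn users from a school divided by all users found for that school, with above-average cells shaded. A minimal sketch of that calculation follows. The school names and counts are hypothetical placeholders, not the numbers behind the table, and the helper name is my own.</p>

```python
def shares_and_flags(match_counts, total_counts):
    """Return {school: (share_percent, above_average)} for one table column."""
    shares = {s: 100 * match_counts[s] / total_counts[s] for s in total_counts}
    mean_share = sum(shares.values()) / len(shares)
    return {s: (share, share > mean_share) for s, share in shares.items()}


# Hypothetical counts, for illustration only.
totals = {"School A": 200, "School B": 150, "School C": 400}
stanford = {"School A": 22, "School B": 6, "School C": 20}

for school, (share, above) in shares_and_flags(stanford, totals).items():
    print(f"{school}: {share:.1f}%" + (" (above average)" if above else ""))
```

<p>Using shares instead of raw counts keeps large schools from dominating every column, but it also means small samples can swing a percentage a lot.</p>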
]]></content:encoded>
			<wfw:commentRss>http://zooie.wordpress.com/2011/04/18/ranking-high-schools-on-outcomes/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/17518bf0a462f22fc174f2df8e464e69?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">zooie</media:title>
		</media:content>

		<media:content url="http://zooie.files.wordpress.com/2011/04/high_schools.png" medium="image">
			<media:title type="html">Bay Area High Schools Outcome Statistics</media:title>
		</media:content>
	</item>
		<item>
		<title>An Evaluation of Google&#8217;s Realtime Search</title>
		<link>http://zooie.wordpress.com/2011/03/17/an-evaluation-of-googles-realtime-search/</link>
		<comments>http://zooie.wordpress.com/2011/03/17/an-evaluation-of-googles-realtime-search/#comments</comments>
		<pubDate>Thu, 17 Mar 2011 10:04:29 +0000</pubDate>
		<dc:creator>Vik</dc:creator>
				<category><![CDATA[Blog Stuff]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Social]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Twitter]]></category>
		<category><![CDATA[Wikipedia]]></category>
		<category><![CDATA[location]]></category>
		<category><![CDATA[realtime]]></category>
		<category><![CDATA[Web]]></category>
		<category><![CDATA[wikipedia]]></category>

		<guid isPermaLink="false">http://zooie.wordpress.com/?p=537</guid>
		<description><![CDATA[How timely are the results returned from Google&#8217;s Realtime (RT) Search Engine? How often do Twitter results appear in these results? Over the weekend I developed a few basic experiments to find out and published the results below. Key Findings &#8230; <a href="http://zooie.wordpress.com/2011/03/17/an-evaluation-of-googles-realtime-search/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<div>How timely are the results returned from Google&#8217;s Realtime (RT) Search Engine? How often do Twitter results appear in these results? Over the weekend I developed a few basic experiments to find out and published the results below.</div>
<div></div>
<p></p>
<div><strong>Key Findings</strong></div>
<p></p>
<div>
<div>
<ul>
<li>For location-based queries, there&#8217;s nearly a coin-flip chance (43%) that a Twitter result will be the #1 ranked result.</li>
<li>For general knowledge queries, there&#8217;s a 23% chance that a Twitter result will be #1.</li>
<li>The newest Twitter results are usually 4 seconds old. The newest Web results are 10x older (41 seconds).</li>
<li>A top ranking Twitter result for a location-based query is usually 2 minutes old (compared with Web which is 22 minutes old &#8211; again nearly 10x older).</li>
<li>When Twitter results appear, at least one of them is in the top ranked position.</li>
</ul>
</div>
</div>
<div><strong>Experiment #1 &#8211; General Knowledge</strong></div>
<p></p>
<div></div>
<div>I crawled 1,370 article titles from Wikipedia and ran each title as a query into Google RT search.</div>
<div></div>
<p></p>
<div><em>Market Shares</em></div>
<p></p>
<div></div>
<div>81% of all queries returned search results that included web page results</div>
<div>23% of all queries returned search results that included Twitter results</div>
<div>7% of all queries returned 0 search results</div>
<div></div>
<p></p>
<div>70% of all queries had a web page result in the #1 ranked position</div>
<div>When Twitter results appeared there was always at least one result in the #1 ranked position (so 23% of queries)</div>
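<p>A minimal sketch of how market shares like these can be tallied from crawled results. It assumes each query yields a rank-ordered list of result sources labeled "web" or "twitter"; that data layout and the sample records are hypothetical, not the actual crawl output.</p>

```python
def market_shares(results_per_query):
    """Percent of queries matching each condition.

    results_per_query holds one list per query with the result sources
    in rank order, for example ["twitter", "web"]; an empty list means
    the query returned no results.
    """
    n = len(results_per_query)

    def pct(predicate):
        return 100 * sum(1 for r in results_per_query if predicate(r)) / n

    return {
        "has_web": pct(lambda r: "web" in r),
        "has_twitter": pct(lambda r: "twitter" in r),
        "empty": pct(lambda r: not r),
        "web_first": pct(lambda r: bool(r) and r[0] == "web"),
        "twitter_first": pct(lambda r: bool(r) and r[0] == "twitter"),
    }


# Hypothetical crawl output for four queries.
sample = [["web", "web"], ["twitter", "web"], [], ["web"]]
print(market_shares(sample))
```

<p>The "has" shares can add up to more than 100% because one query may return both kinds of results, while the "first" shares plus the empty share add up to 100%.</p>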
<div></div>
<p></p>
<div><em>Time Lag</em></div>
<div></div>
<p></p>
<div>When a web page was the #1 ranked result, that result on average was 6736 seconds (or 1 hr and 52 minutes) old.</div>
<div>When a Tweet was the #1 ranked result, that result on average was 261 seconds (or 4 minutes and 21 seconds) old.</div>
<div></div>
<p></p>
<div>The average age of the top 10% newest web page results (across all queries) is 41 seconds</div>
<div>
<div>The average age of the top 10% newest Twitter results (across all queries) is 2 seconds</div>
<div></div>
<p></p>
</div>
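<p>The time lag figures above can be reproduced from a list of result ages. Here is a minimal sketch, assuming "top 10% newest" means the 10% of results with the smallest age in seconds; the ages in the list are hypothetical, and the helper names are my own.</p>

```python
from math import ceil


def mean_age_of_newest(ages_seconds, percent=10):
    """Mean age of the newest `percent` of results, keeping at least one result."""
    if not ages_seconds:
        raise ValueError("no results to average")
    keep = max(1, ceil(len(ages_seconds) * percent / 100))
    newest = sorted(ages_seconds)[:keep]
    return sum(newest) / len(newest)


def format_age(seconds):
    """Render a number of seconds as 'H hr M min S sec'."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours} hr {minutes} min {secs} sec"


# Hypothetical result ages in seconds.
ages = [2, 4, 30, 45, 60, 120, 300, 900, 3600, 7200,
        2, 6, 8, 10, 12, 14, 16, 18, 20, 22]
print(mean_age_of_newest(ages))
print(format_age(6736))
```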
<div><em>Tail</em></div>
<div></div>
<p></p>
<div>Query length ranged from 1 to 12 words (1-2 word queries were the most common)</div>
<div>It is worth noting that no Twitter results appeared for queries longer than 5 words</div>
<div></div>
<p></p>
<div><strong>Experiment #2 &#8211; Location</strong></div>
<div></div>
<p></p>
<div>I crawled 265 major populated U.S. cities from the U.S. Census Bureau and ran each city name as a query into Google RT search.</div>
<div></div>
<p></p>
<div>
<div><em>Market Shares</em></div>
<div></div>
<p></p>
<div>73% of all queries returned search results that included web page results</div>
<div>43% of all queries returned search results that included Twitter results</div>
<div>5% of all queries returned 0 search results</div>
<div></div>
<p></p>
<div>52% of all queries had a web page result in the #1 ranked position</div>
<div>When Twitter results appeared there was always at least one result in the #1 ranked position (so 43% of queries)</div>
<div></div>
<p></p>
<div><em>Time Lag</em></div>
<div></div>
<p></p>
<div>When a web page was the #1 ranked result, that result on average was 1341 seconds (or 22 minutes and 21 seconds) old.</div>
<div>When a Tweet was the #1 ranked result, that result on average was 138 seconds (or 2 minutes and 18 seconds) old.</div>
<div></div>
<p></p>
<div>The average age of the top 10% newest web page results (across all queries) is 41 seconds</div>
<div>
<div>The average age of the top 10% newest Twitter results (across all queries) is 4 seconds</div>
<div></div>
<p></p>
</div>
</div>
<div><em>Tail</em></div>
<div></div>
<p></p>
<div>
<div>Query length ranged from 1 to 3 words</div>
<div>It is worth noting that no Twitter results appeared for 3 word queries</div>
</div>
<p></p>
<div><strong>Implementation Details</strong></div>
<div>
<ul>
<li>Generated Wiki queries by running &#8220;site:en.wikipedia.org&#8221; searches on Google and Blekko, and extracting the titles (en.wikipedia.org/{title_is_here}) from the result links. Side point: I tried Bing but the result links had mostly one word long titles (Bing seems to really bias query length in their ranking) and I wanted more diversity to test out tail queries.</li>
<li>Crawled cities (for the location-based queries) from <a href="http://www.census.gov/popest/cities/tables/SUB-EST2009-01.csv" target="_blank">http://www.census.gov/popest/cities/tables/SUB-EST2009-01.csv</a></li>
</ul>
</div>
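<div>The title extraction described in the first bullet could look roughly like this in Python. The helper names are hypothetical, and stripping a /wiki/ path prefix is an assumption about how the result links may be shaped.</div>

```python
from urllib.parse import urlparse, unquote


def wiki_title_to_query(url):
    """Turn an en.wikipedia.org result link into a plain-text query, or None."""
    parsed = urlparse(url)
    if parsed.netloc != "en.wikipedia.org":
        return None
    path = parsed.path
    # Assumption: links may or may not carry the /wiki/ prefix.
    if path.startswith("/wiki/"):
        path = path[len("/wiki/"):]
    else:
        path = path.lstrip("/")
    # Decode percent-escapes and turn underscores back into spaces.
    title = unquote(path).replace("_", " ").strip()
    return title or None


def query_length(query):
    """Number of whitespace-separated words, for bucketing head vs. tail queries."""
    return len(query.split())


query = wiki_title_to_query("http://en.wikipedia.org/wiki/Real-time_web")
print(query, query_length(query))
```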
<p></p>
<div><strong>Caveats</strong></div>
<div></p>
<ul>
<li>I ran these experiments at 2:45 a.m. PST on Monday. The location-based queries all relate to the U.S., so probably not many people were up at that time generating up-to-date information. The time lag stats could vary depending on when these experiments are run. I did, however, re-run the experiments in the late morning and didn&#8217;t see much difference in the timings.</li>
<li>I ran all queries through Google&#8217;s normal web search engine with &#8216;Latest&#8217; on (in the left bar under Search Tools). These results are not exactly the same as those generated from the standalone Google Realtime Search portal, which seems to favor Tweets more, while the &#8216;Latest&#8217; results seem to find a middle ground between real-time Twitter results and web page results. I used &#8216;Latest&#8217; because it seems like it would be the most popular gateway to Google&#8217;s Realtime search results.</li>
</ul>
</div>
]]></content:encoded>
			<wfw:commentRss>http://zooie.wordpress.com/2011/03/17/an-evaluation-of-googles-realtime-search/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/17518bf0a462f22fc174f2df8e464e69?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">zooie</media:title>
		</media:content>
	</item>
		<item>
		<title>Does Facebook leak what profiles you click on?</title>
		<link>http://zooie.wordpress.com/2010/10/26/does-facebook-leak-what-profiles-you-click-on/</link>
		<comments>http://zooie.wordpress.com/2010/10/26/does-facebook-leak-what-profiles-you-click-on/#comments</comments>
		<pubDate>Tue, 26 Oct 2010 07:36:58 +0000</pubDate>
		<dc:creator>Vik</dc:creator>
				<category><![CDATA[Blog Stuff]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Non-Technical-Read]]></category>
		<category><![CDATA[Social]]></category>

		<guid isPermaLink="false">http://zooie.wordpress.com/?p=507</guid>
		<description><![CDATA[Check out Preview My Profile on Facebook: Account (top right) &#62; Privacy Settings &#62; Customize Settings &#62; Preview My Profile Now say you have a friend named Bob. Type &#8216;Bob&#8217; in the box at the top of Preview My Profile &#8230; <a href="http://zooie.wordpress.com/2010/10/26/does-facebook-leak-what-profiles-you-click-on/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=zooie.wordpress.com&amp;blog=31469&amp;post=507&amp;subd=zooie&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<div>Check out Preview My Profile on Facebook:</div>
<div>
<p><strong>Account </strong>(top right) &gt; <strong>Privacy Settings</strong> &gt;</p>
<p><strong>Customize Settings</strong> &gt; <strong>Preview My Profile</strong></p>
</div>
<div>
<p style="text-align:center;"><a href="http://static.businessinsider.com/image/4be964b67f8b9a6e229e0000-590-/facebook-will-show-you-what-your-friends-can-see-if-all-looks-good-then-congratulations.jpg"><img class="aligncenter" title="Preview My Profile" src="http://static.businessinsider.com/image/4be964b67f8b9a6e229e0000-590-/facebook-will-show-you-what-your-friends-can-see-if-all-looks-good-then-congratulations.jpg" alt="" width="537" height="400" /></a></p>
<p>Now say you have a friend named Bob. Type &#8216;Bob&#8217; in the box at the top of Preview My Profile to see how your profile appears to him. Take a look at the Mutual Friends section (bottom left in the screenshot above) of your profile (from Bob&#8217;s view &#8211; so still in Preview My Profile). Notice how these mutual friends seem to be biased towards those who are closest to Bob (and perhaps to you as well). This by itself is pretty interesting: I can see who my friends are closer to relative to our other mutual friends. This pattern held up well in my trials with friends I know well (their closest friends popped up more often than not in the Mutual Friends section).</p>
</div>
<div>
<p>This got me curious about how Facebook determines &#8220;closeness&#8221; between two people. In particular, does Facebook leverage your clicks on a friend&#8217;s profile in determining how close you are to that friend? To experiment, I frequently clicked on the profile and newsfeed updates of a friend (say her name is Alice) over two weeks. She&#8217;s someone I rarely communicate with. I then browsed, as I normally would, the profiles of mutual friends I share with Alice and noticed that Alice frequently showed up in the Mutual Friends section of those profiles (even when the total number of mutual friends was greater than 80 &#8211; keep in mind that the Mutual Friends section only shows 3 friends). Now, there&#8217;s definitely randomness at times, and I believe multiple ranking features are probably being used here (like perhaps the number of exchanged messages), but based on this result I have a feeling clicks might be in play as well.</p>
</div>
<div>
<p>If Preview My Profile gives you the same view over mutual friends as what you see normally when you click on a friend&#8217;s profile, and if the mutual friends ranking uses private information like clicks / messages as features, then it may be possible to infer who your friends are communicating with or clicking on more &#8211; or at the very least, find who they are closer to relative to your other mutual friends. If I view my profile through Bob&#8217;s eyes and frequently see Alice appear in the Mutual Friends section over multiple runs, it may imply a strong relationship from Bob to Alice. Also, when the number of mutual friends is high relative to the total number of friends your friend has, this result may be even more accurate.</p>
</div>
<div>
<p>This isn&#8217;t scientific by any means &#8211; I really don&#8217;t know how the ranking is done and may be completely wrong &#8211; so take it with a grain of salt. Just thought it was an interesting feature and pattern worth sharing &#8230;</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://zooie.wordpress.com/2010/10/26/does-facebook-leak-what-profiles-you-click-on/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/17518bf0a462f22fc174f2df8e464e69?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">zooie</media:title>
		</media:content>

		<media:content url="http://static.businessinsider.com/image/4be964b67f8b9a6e229e0000-590-/facebook-will-show-you-what-your-friends-can-see-if-all-looks-good-then-congratulations.jpg" medium="image">
			<media:title type="html">Preview My Profile</media:title>
		</media:content>
	</item>
		<item>
		<title>pplmatch &#8211; Find Like Minded People on LinkedIn</title>
		<link>http://zooie.wordpress.com/2010/07/20/pplmatch-find-like-minded-people-on-linkedin/</link>
		<comments>http://zooie.wordpress.com/2010/07/20/pplmatch-find-like-minded-people-on-linkedin/#comments</comments>
		<pubDate>Tue, 20 Jul 2010 07:59:19 +0000</pubDate>
		<dc:creator>Vik</dc:creator>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Blog Stuff]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[CS]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Science]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Social]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Web2.0]]></category>

		<guid isPermaLink="false">http://zooie.wordpress.com/?p=489</guid>
		<description><![CDATA[http://www.pplmatch.com Just provide a link to a public LinkedIn profile and an email address and that&#8217;s it. The system will go find other folks on LinkedIn who best match that given profile and email back a summary of the results. &#8230; <a href="http://zooie.wordpress.com/2010/07/20/pplmatch-find-like-minded-people-on-linkedin/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=zooie.wordpress.com&amp;blog=31469&amp;post=489&amp;subd=zooie&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><strong><a href="http://www.pplmatch.com">http://www.pplmatch.com</a></strong></p>
<p>Just provide a link to a public LinkedIn profile and an email address and that&#8217;s it. The system will go find other folks on LinkedIn who best match that given profile and email back a summary of the results.</p>
<p>It leverages some very useful IR techniques along with a basic machine learned model to optimize the matching quality.</p>
<p>Some use cases:</p>
<ul>
<li>If I provide a link to a star engineer, I can find a bunch of folks like that person to go try to recruit. One could also use LinkedIn / Google search to find people, but sometimes it can be difficult to formulate the right query and may be easier to just pivot off an ideal candidate.</li>
</ul>
<ul>
<li>I recently shared it with a colleague of mine who just graduated from college. He really wants to join a startup but doesn&#8217;t know of any (he just knows about the big companies like Microsoft, Google, Yahoo!, etc.). With this tool he found people who shared similar backgrounds and saw which small companies they work at.</li>
</ul>
<ul>
<li>Generally browsing the people graph based on credentials as opposed to relationships. It seems to be a fun way to find like minded people around the world and see where they ended up. I&#8217;ve recently been using it to find advisors and customers based on folks I admire.</li>
</ul>
<p>Anyways, just a fun application I developed on the side. It&#8217;s not perfect by any means but I figured it&#8217;s worth sharing.</p>
<p>It&#8217;s pretty compute intensive, so if you want to try it send mail to [contact at pplmatch dot com] to get your email address added to the list. Also, do make sure that the profiles you supply expose lots of text publicly &#8211; the more text the better the results.</p>
]]></content:encoded>
			<wfw:commentRss>http://zooie.wordpress.com/2010/07/20/pplmatch-find-like-minded-people-on-linkedin/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/17518bf0a462f22fc174f2df8e464e69?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">zooie</media:title>
		</media:content>
	</item>
		<item>
		<title>anymeme: Breaking News, Tweets in your URLs</title>
		<link>http://zooie.wordpress.com/2010/03/16/anymeme-breaking-news-tweets-in-your-urls/</link>
		<comments>http://zooie.wordpress.com/2010/03/16/anymeme-breaking-news-tweets-in-your-urls/#comments</comments>
		<pubDate>Tue, 16 Mar 2010 08:00:55 +0000</pubDate>
		<dc:creator>Vik</dc:creator>
				<category><![CDATA[Blog Stuff]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[News]]></category>
		<category><![CDATA[Social]]></category>
		<category><![CDATA[Trends]]></category>
		<category><![CDATA[Twitter]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Web2.0]]></category>

		<guid isPermaLink="false">http://zooie.wordpress.com/?p=450</guid>
		<description><![CDATA[A very basic experiment that pads URLs with messages: http://anymeme.appspot.com or more appropriately http://anymeme.appspot.com/anymeme.appspot.com Notes This is not related to any work I&#8217;ve been pursuing during my EIR gig. It&#8217;s kind of like the opposite of bit.ly (there is a &#8230; <a href="http://zooie.wordpress.com/2010/03/16/anymeme-breaking-news-tweets-in-your-urls/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=zooie.wordpress.com&amp;blog=31469&amp;post=450&amp;subd=zooie&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<div>
<p>A very basic experiment that pads URLs with messages:</p>
</div>
<div>
<p><span style="font-size:small;"><strong><a href="http://anymeme.appspot.com">http://anymeme.appspot.com</a></strong></span></p>
</div>
<div>
<p>or more appropriately <a href="http://anymeme.appspot.com/anymeme.appspot.com">http://anymeme.appspot.com/anymeme.appspot.com</a></p>
<p>Notes</p>
<ul>
<li>This is not related to any work I&#8217;ve been pursuing during my EIR gig.</li>
<li>It&#8217;s kind of like the opposite of <a href="http://bit.ly">bit.ly</a> (there is a shortener available on the site though). It&#8217;s better tailored for shorter URLs where there&#8217;s enough address bar space to display a message at the end of the URL.</li>
<li>I tested this on the <a href="http://www.quantcast.com/top-sites-1">top 30 or so sites</a> using a mix of Firefox and Chrome.</li>
<li>This could easily be the dumbest thing I&#8217;ve ever developed, but then again there are a lot of dumb things on the web. It took longer for me to write these posts describing anymeme than to develop the code for it. This is more of an experiment to:
<ul>
<li>See if users, publishers, and advertisers like it</li>
<li>Try to make URLs more interesting and valuable</li>
</ul>
</li>
<li>It would be so cool:
<ul>
<li>To generate enough cash via sponsored messages to make meaningful contributions to great causes</li>
<li>To see an important breaking news headline or an interesting tweet as you load up hulu to check for new episodes &#8211; visible in the previously half empty address bar so there&#8217;s no need to frame or change the destination page to show the content.</li>
</ul>
</li>
<li>It currently runs on Google App Engine</li>
</ul>
</div>
]]></content:encoded>
			<wfw:commentRss>http://zooie.wordpress.com/2010/03/16/anymeme-breaking-news-tweets-in-your-urls/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/17518bf0a462f22fc174f2df8e464e69?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">zooie</media:title>
		</media:content>
	</item>
		<item>
		<title>Some Stats about Twitter&#8217;s Content</title>
		<link>http://zooie.wordpress.com/2009/10/12/some-stats-about-twitters-content/</link>
		<comments>http://zooie.wordpress.com/2009/10/12/some-stats-about-twitters-content/#comments</comments>
		<pubDate>Mon, 12 Oct 2009 22:33:13 +0000</pubDate>
		<dc:creator>Vik</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Social]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Trends]]></category>
		<category><![CDATA[Twitter]]></category>
		<category><![CDATA[content]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[dedup]]></category>
		<category><![CDATA[Stats]]></category>

		<guid isPermaLink="false">http://zooie.wordpress.com/?p=371</guid>
		<description><![CDATA[Near the end of July, I crawled a sample of ~10M tweets. On my way over from Open Hack Day NYC yesterday I finally got some time to do some preliminary analysis of this data. Several posts have analyzed Twitter’s &#8230; <a href="http://zooie.wordpress.com/2009/10/12/some-stats-about-twitters-content/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=zooie.wordpress.com&amp;blog=31469&amp;post=371&amp;subd=zooie&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>Near the end of July, I crawled a sample of ~10M tweets. On my way over from <a href="http://openhacknyc.pbworks.com/">Open Hack Day NYC</a> yesterday I finally got some time to do some preliminary analysis of this data. Several posts have analyzed Twitter’s traffic stats [<a href="http://www.techcrunch.com/2009/10/05/twitter-data-analysis-an-investors-perspective/">TechCrunch</a>] [<a href="http://mashable.com/2009/09/25/twitter-traffic-ceiling/">Mashable</a>] [<a href="http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/">zooie</a>], so I thought I’d focus more on the content here.</p>
<h3>Duplication</h3>
<p>By compressing the data and comparing the before and after sizes, one can get a pretty decent understanding of the duplication factor. To do this, I extracted just the raw text messages, sorted them, and then ran gzip over the sorted set.</p>
<p>Compression ratio</p>
<p>&gt;&gt;&gt; 284023259 / 739273532 bytes</p>
<p><strong>0.38</strong>419238171778614</p>
<p>Typically, gzip-like programs compress text to around 50% of its original size without the sort (and sorting typically helps); here the sorted tweets compress to 38%. A standard text corpus consists of much larger documents, so it’s interesting to see a similar or larger duplication factor for tweets.</p>
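<p>A minimal Python sketch of the sort-then-gzip measurement, assuming the messages fit in memory (for a crawl of this size, a command-line sort piped into gzip may be more practical). The function name is hypothetical; the last line simply redoes the division shown above.</p>

```python
import gzip


def sorted_gzip_ratio(messages):
    """Compressed size divided by raw size for the sorted, newline-joined messages."""
    raw = "\n".join(sorted(messages)).encode("utf-8")
    if not raw:
        raise ValueError("need at least one non-empty message")
    compressed = gzip.compress(raw)
    return len(compressed) / len(raw)


# Heavily repeated text compresses to a small fraction of its raw size.
print(sorted_gzip_ratio(["good morning twitter"] * 1000))

# The ratio reported in the post: compressed bytes / raw bytes.
print(round(284023259 / 739273532, 4))  # 0.3842
```

<p>A lower ratio means more redundancy. Sorting first places identical and near-identical messages next to each other, which is where gzip can find them within its limited look-back window.</p>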
<p>We can dive even deeper into this area by analyzing the term overlap statistics to measure <strong>near duplication</strong>, or messages that aren’t necessarily identical but are close enough.</p>
<p>To do this, I first cleaned the text (removed stopwords, stemmed terms, normalized case). Interestingly, <strong>after cleaning</strong> the text, <strong>the average number of tokens for a message is just</strong> <strong>6.28</strong>, or 2.5x the size of a standard web search query.</p>
<p>Then, I employed consistent term sampling to select N representatives for each cleaned message and coalesced the representatives together as a single key. By comparing the total number of unique keys to messages, one can infer the near duplication factor. Also, the higher the N, the higher the threshold is to match (so N &gt;= 6, 6 being the average number of tokens per message, probably means that two messages that generate the same key are exact duplicates).</p>
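<p>One way to sketch consistent term sampling in Python: rank each term by a stable hash and keep the N lowest-ranked terms, so the same terms always get picked wherever they occur. The MD5-based ranking and the function names are assumptions for illustration; the exact sampling function used for this analysis is not stated here.</p>

```python
import hashlib


def term_rank(term):
    """Stable pseudo-random rank for a term (identical on every run and machine)."""
    return hashlib.md5(term.encode("utf-8")).hexdigest()


def sample_key(tokens, n):
    """Pick the n lowest-ranked distinct tokens and join them into a single key."""
    chosen = sorted(set(tokens), key=term_rank)[:n]
    return " ".join(chosen)


def coverage(messages, n):
    """Unique keys divided by total messages; lower means more collisions."""
    if not messages:
        raise ValueError("need at least one message")
    keys = {sample_key(tokens, n) for tokens in messages}
    return len(keys) / len(messages)


# Hypothetical, already-cleaned messages (stopwords removed, stemmed, lowercased).
cleaned = [
    ["michael", "jackson", "tribute"],
    ["tribute", "michael", "jackson"],
    ["iran", "election", "news"],
    ["michael", "jackson", "news"],
]
print(coverage(cleaned, 8))  # 0.75: only the first two messages share a key
```

<p>Because a message with fewer than N tokens contributes all of its tokens to the key, large N only matches messages with the same set of cleaned terms, while small N lets messages that merely share their lowest-ranked terms collide.</p>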
<p style="text-align:left;">You’ll notice that coverage converges around 84% for N &gt;= 6, implying that after cleaning the text, about 16% of the messages exactly match some other message. Additionally, when N = 2 (requiring 2 of 6 tokens, or 33% of the text on average, to match), about 44% of the messages collide with other messages in the corpus. At N = 2, matching often means the messages discuss the same general topic but aren’t near duplicates.</p>
<table class="alignleft" border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr style="text-align:center;">
<td valign="top" width="120"><strong>N Term Samples</strong></td>
<td valign="top" width="120"><strong>Unique Keys</strong></td>
<td valign="top" width="120"><strong>Coverage</strong></td>
</tr>
<tr style="text-align:center;">
<td valign="top" width="120"><strong>8</strong></td>
<td valign="top" width="120">8548695</td>
<td valign="top" width="120">0.8356</td>
</tr>
<tr style="text-align:center;">
<td valign="top" width="120"><strong>6</strong></td>
<td valign="top" width="120">8512672</td>
<td valign="top" width="120">0.8321</td>
</tr>
<tr style="text-align:center;">
<td valign="top" width="120"><strong>5</strong></td>
<td valign="top" width="120">8476590</td>
<td valign="top" width="120">0.8286</td>
</tr>
<tr style="text-align:center;">
<td valign="top" width="120"><strong>4</strong></td>
<td valign="top" width="120">8366391</td>
<td valign="top" width="120">0.8177</td>
</tr>
<tr style="text-align:center;">
<td valign="top" width="120"><strong>3</strong></td>
<td valign="top" width="120">8098400</td>
<td valign="top" width="120">0.7916</td>
</tr>
<tr style="text-align:center;">
<td valign="top" width="120"><strong>2</strong></td>
<td valign="top" width="120">5716566</td>
<td valign="top" width="120">0.5588</td>
</tr>
<tr style="text-align:center;">
<td valign="top" width="120"><strong>1</strong></td>
<td valign="top" width="120">1013783</td>
<td valign="top" width="120">0.0991</td>
</tr>
</tbody>
</table>
<h3>URLs</h3>
<p>URLs are present in <strong>~18%</strong> of the tweets</p>
<p>Of those, <strong>~65%</strong> of the URLs are <strong>unique</strong></p>
<p><strong>70K Unique Domains covering 2M URLs</strong></p>
<p><strong>Top Domains:</strong></p>
<p>['bit.ly', 'tinyurl.com', 'twitpic.com', 'is.gd', 'myloc.me', 'ow.ly', 'ustre.am', 'cli.gs', 'tr.im', 'plurk.com', 'ff.im', 'tumblr.com', 'yfrog.com', '140mafia.com', 'u.mavrev.com', 'twurl.nl', 'tweeterfollow.com', 'mypict.me', 'viagracan.com', 'vipfollowers.com', 'morefollowers.net', 'digg.com', 'tweeteradder.com', 'ping.fm', 'tiny.cc', 'followersnow.com', 'short.to', 'twit.ac', 'snipr.com', 'wefollow.com', 'tweet.sg', 'url4.eu', 'the-twitter-follow-train.info', 'fwix.com', 'budurl.com', 'su.pr', 'shar.es', 'tinychat.com', 'snipurl.com', 'loopt.us', 'migre.me', 'flic.kr', 'myspace.com', 'snurl.com', 'twitgoo.com', 'zshare.net', 'post.ly', 'bkite.com', 'yes.com', 'flickr.com', 'twitter.com', 'artistsforschapelle.com', '140army.com', 'youtube.com', 'x.imeem.com', 'pic.gd', 'TwitterBackgrounds.com', 'raptr.com', 'twt.gs', 'twitthis.com', 'mobypicture.com', 'tobtr.com', 'ad.vu', 'sml.vg', 'rubyurl.com', 'tinylink.com', 'redirx.com', 'a2a.me', 'eCa.sh', 'vimeo.com', 'meadd.com', 'hotjobs.yahoo.com', 'doiop.com', 'myurl.in', 'urlpire.com', 'buzzup.com', 'freead.im', 'youradder.com', 'facebook.com', 'adf.ly', 'justin.tv', 'twitvid.com', 'adjix.com', 'twcauses.com', 'lkbk.nu', 'tlre.us', 'htxt.it', 'stickam.com', 'twubs.com', 'isy.gs', 'reverbnation.com', 'news.bbc.co.uk', 'sn.im', 'twibes.com', 'ustream.tv', 'trim.su', 'hashjobs.com', 'blogtv.com', 'jobs-cb.de', 'xsaimex.com']</p>
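<p>Stats like these can be approximated with a few lines of Python. This is a sketch under assumptions: the regular expression only catches links that start with http:// or https://, the function names are hypothetical, and lowercasing domains is a choice of this sketch rather than something the list above reflects.</p>

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumption: a URL is anything from http(s):// up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def domain_of(url):
    """Lowercased host name with a leading 'www.' removed."""
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc


def url_stats(tweets, top_k=3):
    """Share of tweets with a URL, share of unique URLs, and the top domains."""
    urls = [u for text in tweets for u in URL_PATTERN.findall(text)]
    with_url = sum(1 for text in tweets if URL_PATTERN.search(text))
    domains = Counter(domain_of(u) for u in urls)
    return {
        "share_with_url": with_url / len(tweets),
        "unique_url_share": len(set(urls)) / len(urls) if urls else 0.0,
        "top_domains": [d for d, _ in domains.most_common(top_k)],
    }


# Hypothetical tweets, purely illustrative.
sample = [
    "check http://bit.ly/abc",
    "again http://bit.ly/abc now",
    "pic http://twitpic.com/1 and http://bit.ly/xyz",
    "no link here",
]
print(url_stats(sample))
```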
<h3>Retweets</h3>
<p>~<strong>4%</strong> <strong>of messages are retweets</strong></p>
<h3>Replied @Users</h3>
<p>~<strong>1M total replied-to users</strong> in this data set</p>
<p><strong>37% of tweets contain &#8216;@x&#8217; terms</strong></p>
<p><strong>Most Popular Replied-to Users </strong>(almost all celebrities):<strong> </strong></p>
<p>['@mileycyrus', '@jonasbrothers', '@ddlovato', '@mitchelmusso', '@donniewahlberg', '@souljaboytellem', '@tommcfly', '@addthis', '@officialtila', '@johncmayer', '@shanedawson', '@bowwow614', '@jordanknight', '@ryanseacrest', '@perezhilton', '@jonathanrknight', '@petewentz', '@tweetmeme', '@adamlambert', '@david_henrie', '@dealsplus', '@dwighthoward', '@iamdiddy', '@lancearmstrong', '@songzyuuup', '@imeem', '@blakeshelton', '@dannymcfly', '@lilduval', '@selenagomez', '@markhoppus', '@yelyahwilliams', '@therealpickler', '@stephenfry', '@mrtweet.', '@taylorswift13', '@michaelsarver1', '@davidarchie', '@the_real_shaq', '@tyrese4real', '@britneyspears', '@106andpark', '@ashleytisdale', '@mariahcarey', '@kimkardashian', '@wale', '@mashable', '@programapanico', '@therealjordin', '@listensto', '@misskeribaby', '@alyssa_milano', '@alexalltimelow', '@aplusk', '@thisisdavina', '@breakingnews:', '@peterfacinelli', '@truebloodhbo', '@mgiraudofficial', '@tonyspallelli', '@mtv', '@jackalltimelow', '@dfizzy', '@youngq', '@tomfelton', '@pooch_dog', '@jonaskevin', '@princesammie', '@nkotb', '@christianpior', '@cthagod', '@johnlloydtaylor', '@neilhimself', '@moontweet', '@katyperry', '@danilogentili', '@mchammer', '@rainnwilson', '@joeymcintyre', '@30secondstomars', '@phillyd', '@heidimontag', '@mrpeterandre', '@andyclemmensen', '@crystalchappell', '@kevindurant35', '@huckluciano', '@dannygokey', '@jaketaustin', '@revrunwisdom', '@jamesmoran', '@musewire', '@dannywood', '@nickiminaj', '@akgovsarahpalin', '@terrencej106', '@mashable:', '@drewryanscott', '@mrtweet', '@necolebitchie', '@lilduval:', '@willie_day26', '@kirstiealley', '@betthegame', '@radiomsn', '@alancarr', '@rafinhabastos', '@krisallen4real', '@iamjericho', '@breakingnews', '@babygirlparis', '@ladygaga', '@chris_daughtry', '@hypem', '@danecook', '@imcudi', '@jeepersmedia', '@buckhollywood', '@kimmyt22', '@giulianarancic', '@chrisbrogan', '@nasa', '@addtoany', '@nickcarter', '@debbiefletcher', '@marcoluque', 
'@shaundiviney', '@ogochocinco', '@twitter', '@eddieizzard', '@youngbillymays', '@real_ron_artest', '@pink', '@laurenconrad', '@rubarrichello', '@ianjamespoulter', '@liltwist', '@teyanataylor', '@dougiemcfly', '@theellenshow', '@robkardashian', '@sherrieshepherd', '@justinbieber', '@paulaabdul', '@jason_manford', '@jaredleto', '@tracecyrus', '@itsonalexa', '@ddlovato:', '@khloekardashian', '@revrunwisdom:', '@solangeknowles', '@allison4realzzz', '@nickjonas', '@reply', '@anarbor', '@donlemoncnn', '@gfalcone601', '@moonfrye', '@symphnysldr', '@iamspectacular', '@honorsociety', '@questlove', '@guykawasaki', '@dawnrichard', '@_maxwell_', '@somaya_reece', '@mandyyjirouxx', '@teemwilliams', '@greggarbo', '@pennjillette', '@mikeyway', '@matthardybrand', '@iamjonwalker', '@andyroddick', '@kohnt01', '@chris_gorham', '@seankingston', '@joshgroban', '@mousebudden', '@misskatieprice', '@spencerpratt', '@wilw', '@jgshock', '@swear_bot', '@joelmadden', '@techcrunch', '@americanwomannn', '@kelly__rowland', '@mionzera', '@astro_127', '@_@', '@spam', '@sookiebontemps', '@drakkardnoir', '@noh8campaign', '@kayako', '@trvsbrkr', '@qbkilla', '@mw55', '@guykawasaki:', '@donttrythis', '@cv31', '@liljjdagreat', '@tiamowry', '@nickensimontwit', '@holdemtalkradio', '@bradiewebbstack', '@nytimes', '@riskybizness23', '@radityadika', '@adrienne_bailon', '@riccklopes', '@jessicasimpson', '@sportsnation', '@jasonbradbury', '@huffingtonpost', '@oceanup', '@gilbirmingham', '@iconic88', '@the', '@thebrandicyrus', '@gordela', '@thedebbyryan', '@jessemccartney', '@?', '@caiquenogueira', '@celsoportiolli', '@shontelle_layne', '@calvinharris', '@chattyman', '@ali_sweeney', '@anamariecox', '@joshthomas87', '@emilyosment', '@nasa:', '@sevinnyne6126', '@thebiggerlights', '@theboygeorge', '@jbarsodmg', '@goldenorckus', '@warrenwhitlock', '@bobbyedner', '@myfabolouslife', '@descargaoficial', '@ochonflcinco85', '@ninabrown', '@billycurrington', '@oprah', '@junior_lima', '@asherroth', '@starbucks', 
'@jason_pollock', '@intanalwi', '@harrislacewell', '@serenajwilliams', '@kevinruddpm', '@bigbrotherhoh', '@oliviamunn', '@chamillionaire', '@tamekaraymond', '@teamwinnipeg', '@littlefletcher', '@piercethemind', '@brookandthecity', '@iranbaan:', '@tonyrobbins', '@maestro', '@glennbeck', '@1omarion', '@nadhiyamali', '@slimthugga', '@jason_mraz', '@profbrendi', '@djaaries', '@juanestwiter', '@davegorman', '@zackalltimelow', '@mamajonas', '@itschristablack', '@skydiver', '@gigva', '@currensy_spitta', '@paulwallbaby', '@rpattzproject', '@petewentz:', '@rodrigovesgo', '@drdrew', '@sportsguy33', '@cthagod:', '@hollymadison123', '@mjjnews', '@itsbignicholas', '@_supernatural_', '@santoevandro', '@demar_derozan', '@marthastewart', '@billganz62', '@oodle', '@davidleibrandt']</p>
<h3>Hashtags</h3>
<p>~<strong>7% of messages contain hashtags</strong></p>
<p><strong>Total Unique Hashtags found: ~94k</strong></p>
<p><strong>Top Hashtags:</strong></p>
<p>['#lies', '#fb', '#musicmonday', '#truth', '#iranelection', '#moonfruit', '#tendance', '#jobs', '#ihavetoadmit', '#mariomarathon', '#140mafia', '#tcot', '#zyngapirates', '#followfriday', '#spymaster', '#ff', '#1', '#sotomayor', '#turnon', '#notagoodlook', '#tweetmyjobs', '#hiring:', '#iran', '#fun140', '#jesus', '#72b381.', '#quote', '#tinychat', '#neda', '#militarymon', '#gr88', '#trueblood', '#fail', '#news', '#140army', '#livestrong', '#noh8', '#wpc09', '#music', '#turnoff', '#unacceptable', '#twables', '#masterchef', '#noh84kradison', '#writechat', '#job', '#squarespace', '#michaeljackson', '#2', '#nothingpersonal', '#iphone', '#ala2009', '#mj', '#tdf', '#blogtalkradio', '#mlb', '#1stdraftmovielines', '#p2', '#secretagent', '#tlot', '#72b381', '#honduras', '#twitter', '#jtv', '#tehran', '#gorillapenis', '#porn', '#bb11', '#sotoshow', '#brazillovesatl', '#google', '#oneandother', '#bb10', '#chucknorris', '#cmonbrazil', '#agendasource', '#travel', '#ashes', '#dumbledore', '#freeschapelle', '#tl', '#dealsplus', '#nsfw', '#entourage', '#tech', '#hottest100', '#3693dh...', '#torchwood', '#design', '#teaparty', '#love', '#dontyouhate', '#mileycyrus', '#sgp', '#harrypottersequels', '#peteandinvisiblechildren', '#stopretweets', '#tscc', '#wimbledon', '#hive', '#cubs', '#3', '#redsox', '#photography', '#voss', '#snods', '#lol', '#socialmedia', '#gop', '#health', '#esriuc', '#green', '#follow', '#echo!', '#obama', '#digg', '#shazam', '#hhrs', '#video', '#moonfruit.', '#swineflu', '#politics', '#ebuyer683', '#umad', '#quizdostandup', '#thankyoumichael', '#blogchat', '#wordpress', '#3693dh', '#haiku', '#ttparty', '#lastfm:', '#healthcare', '#hcr', '#ecgc', '#seo', '#apple', '#chuck', '#wine', '#sammie', '#h1n1', '#marketing', '#twitition', '#happybirthdaymitchel18', '#cnn', '#lie', '#rt:', '#art', '#nasa', '#blog', '#quotes', '#bruno', '#business', '#palin', '#mw2', '#hcsm', '#harrypotter', '#4', '#lastfm', '#askclegg', '#photo', '#jobfeedr', '#lgbt', '#lies:', 
'#ihavetoadmit.i', '#jamlegend,', '#truthbetold', '#mcfly', '#microsoft', '#fashion', '#tweetphoto', '#ebuyer167201', '#noh84adison', '#5', '#mets', '#china', '#bigprize', '#whythehell', '#money', '#sophiasheart', '#finance', '#michael', '#f1', '#adamlambert100k', '#web', '#urwashed', '#moonfruit!', '#1:', '#kayako', '#lies.', '#thankyouaaron', '#food', '#wow', '#moonfruit,', '#facebook', '#ebuyer291', '#ecomonday', '#ihave', '#happybdaydenise', '#postcrossing', '#ichc', '#912', '#demilovatolive', '#gijoemoviefan', '#funny', '#media', '#meowmonday', '#israel', '#blogger', '#forasarney', '#tv', '#topgear', '#chrisisadouche', '#stlcards', '#wec09', '#forex', '#aots1000', '#celebrity', '#dwarffilmtitles', '#6', '#yeg', '#slaughterhouse', '#nfl', '#photog', '#ny', '#firstdraftmovies', '#ufc', '#reddit', '#free', '#iwish', '#etsy', '#rulez', '#sports', '#icmillion', '#mmot', '#webdesign', '#deals', '#moonfruit?', '#pawpawty', '#twitterfahndung', '#billymaystribute', '#sytycd', '#runkeeper', '#scotus', '#yoconfieso', '#mariomarathon,', '#musicmondays', '#lies,', '#findbob', '#realestate', '#sohrab', '#sales', '#metal', '#runescape', '#hypem', '#threadless', '#gay', '#isyouserious', '#hollywood,', '#2:', '#ca,', '#golf', '#diadorock', '#newyork,', '#meteor', '#dailyquestion', '#photoshop', '#saveiantojones', '#musicmonday:', '#rock', '#sex', '#mlbfutures', '#ilove', '#mikemozart', '#nascar', '#indico', '#crossfitgames', '#gratitude', '#quote:', '#creativetechs', '#truth:', '#sharepoint', '#mkt', '#why', '#bigbrother', '#tam7', '#ihate', '#futureruby', '#slickrick', '#105.3', '#youareinatl', '#vegan', '#dontletmefindout', '#imustadmit', '#7', '#twitterafterdark', '#sunnyfacts', '#gilad', '#japan', '#iremember', '#97.3', '#puffdaddy', '#blogher', '#ade2009', '#aaliyah', '#alfredosms', '#95.1', '#truth,', '#twine', '#hiring']</p>
<h3>Questions</h3>
<p>It is hard to infer exactly whether a message is a question or not, so I ran three different filters:</p>
<p>5W&#8217;s, H, ? present ANYWHERE in tweet:</p>
<p><strong>0.102789281948 or 10%</strong></p>
<p>5W&#8217;s, H first token or ? last token:</p>
<p><strong>0.0238229662219 or 2%</strong></p>
<p>Just ? ANYWHERE in tweet:</p>
<p><strong>0.0040984928533 or 0.4%</strong></p>
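<p>The three filters above can be sketched roughly as follows. This is only an illustration under stated assumptions: the question-word list, the tokenizer, and all function names are hypothetical, not the code that produced the numbers above.</p>

```python
import re

# The "5 W's and H" question words; the exact word list is an assumption.
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how"}


def tokenize(tweet):
    """Lowercase the tweet and split it into word tokens and '?' marks."""
    return re.findall(r"[a-z']+|\?", tweet.lower())


def question_word_or_mark_anywhere(tweet):
    """Filter 1: a question word or a '?' appears anywhere in the tweet."""
    return any(tok in QUESTION_WORDS or tok == "?" for tok in tokenize(tweet))


def question_word_first_or_mark_last(tweet):
    """Filter 2: a question word is the first token or '?' is the last token."""
    tokens = tokenize(tweet)
    if not tokens:
        return False
    return tokens[0] in QUESTION_WORDS or tokens[-1] == "?"


def question_mark_anywhere(tweet):
    """Filter 3: a literal '?' appears anywhere in the tweet."""
    return "?" in tweet


def fraction_matching(tweets, predicate):
    """Share of tweets for which the given filter returns True."""
    tweets = list(tweets)
    if not tweets:
        return 0.0
    return sum(1 for t in tweets if predicate(t)) / len(tweets)
```

<p>Calling fraction_matching(tweets, question_word_first_or_mark_last) over a list of tweet strings gives the kind of ratio reported above; a different tokenizer or word list would shift the numbers.</p>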
<h3>Users</h3>
<p><strong>Discovered ~2M unique users</strong></p>
<p><strong>Top Sending Users</strong> (many bots):</p>
<p>['followermonitor', 'Tweet_Words', 'currentcet', 'currentutc', 'whattimeisitnow', 'ItIsNow', 'ThinkingStiff', 'otvrecorder', 'delicious50', 'Porngus', 'craigslistjobs', 'GorPen', 'hashjobs', 'TransAlchemy2', 'bot_theta', 'CHRISVOSS', 'bot_iota', 'bot_kappa', 'TIPAS', 'VeolaJBanner', 'StacyDWatson', 'LMAObot', 'SarahJSlonecker', 'AllisonMRussell', 'bot_eta', 'SandraHOakley', 'bot_psi', 'bot_tau', 'LoreleiRMercer', 'bot_zeta', 'bot_gamma', 'bot_sigma', 'bot_lambda', 'bot_pi', 'bot_epsilon', 'bot_nu', 'bot_rho', 'bot_omicron', 'bot_khi', 'LindaTYoung', 'mensrightsindia', 'bot_omega', 'bot_ksi', 'bot_delta', 'bot_alpha', 'bot_phi', 'CindaDJenkins', 'bot_mu', 'ImogeneDPetit', 'bot_upsilon', 'OPENLIST_CA', 'openlist', 'isygs', 'dq_jumon', 'gamingscoop', 'MildredSLogan', 'ObiWanKenobi_', 'pulseSearch', 'MaryEVo', 'ImeldaGMcward', 'MaryJNewman', 'SharonTForde', 'LoriJCornelius', 'BrandyWPulliam', 'RhondaTLopez', 'AprilKOropeza', 'CarolETrotman', 'SusanATouvell', 'dinoperna', 'buzzurls', '_Freelance_', 'DrSnooty', 'illstreet', 'bibliotaph_eyes', 'loc4lhost', 'bsiyo', 'BOTHOUSE', 'post_ads', 'qazkm', 'frugaldonkey', 'free_post', 'groovera', 'wonkawonkawonka', 'ForksGirlBella', 'casinopokera', 'dermdirectoryny', 'Yoowalk_chat', 'mstehr', 'hashgoogle', 'perry1949', 'ensiz_news', 'Bezplatno_net', 'timesmirror', 'work_freelance', 'cockbot', 'pdurham', 'bombtter_raw', 'ocha1', 'AlairAneko24', 'HaiIAmDelicious', 'Freshestjobs', 'fast_followers', 'LeadsForFree', 'RideOfYourLife', 'AlastairBotan30', 'helpmefast25', 'TheMLMWizard', 'uitrukken', 'adoptedALICE', 'TKATI', 'ezadsncash', 'tweetshelp', 'LAmetro_traffic', 'thinkpozzitive', 'StarrNeishaa', 'AldenCho36', 'JobHits', 'wootboot', 'smacula', 'faithclubdotnet', 'DmitriyVoronov', 'brownthumbgirl', 'NYCjobfeed', 'hfradiospacewx', 'FakeeKristenn', 'MLBDAILYTIMES', 'wildingp', 'JacksonsReview', 'EarthTimesPR', 'friedretweet', 'Wealthy23', 'RokpoolFM', 'HDOLLAZ', '_MrSpacely', 'Bestdocnyc', 'Rabidgun', 'flygatwick', 'live_china', 
'friendlinks', 'retweetinator', 'iamamro', 'thayferreira', 'AldisDai39', 'AndersHana60', 'nonstopNEWS', 'VivaLaCash', 'TravelNewsFeeds', 'vuelosplus', 'threeporcupines', 'DemiAuzziefan', 'worldofprint', 'KevinEdwardsJr', 'REDDITSPAMMOR', 'NatValentine', 'ChanelLebrun', 'nowbot', 'hollyswansonUK', 'youngrhome', 'M_Abricot', 'thefakemandyv', 'scrapbookingpas', 'Naughtytimes', 'Opcode1300_bot', 'tellsecret', 'tboogie937', 'Climber_IT', 'comlist', 'with_a_smile', 'USN_retired', 'Climber_EngJobs', 'Climber_Finance', 'Climber_HRJobs', 'intanalwi', 'Climber_Sales', 'nadhiyamali', 'wonderfulquotes', 'MRAustria', 'O2Q', 'GL0', 'SookieBonTemps', 'MRSchweiz', 'latinasabor', 'nineleal', 'casservice', 'AltonGin54', 'KulerFeed', '_cesaum', 'HFMONAIR', 'DeeOnDreeYah', 'rockstalgica', 'iamword', 'rpattzproject', 'madblackcatcom', 'ftfradio', 'marciomtc', 'SocialNetCircus', 'AnotherYearOver', 'ichig', 'tcikcik', 'HelenaMarie210', 'mrbax0', 'SWBot', 'DayTrends', '_Embry_Call_', 'eProducts24', 'The_Sims_3', 'tom_ssa', 'woxy_vintage', 'urbanmusic2000', 'dopeguhxfresh', 'erections', 'DudeBroChill', 'lookingformoney', 'drnschneider', 'MosesMaimonides', '92Blues', 'elarmelar', 'rock937fm', 'sonicfm', 'erikadotnet', 'sky0311', 'weqx', 'brandamc', 'Hot106', 'woxy_live', 'ksopthecowboy', 'vixalius', 'cogourl', 'Cashintoday', 'Andrewdaflirt', 'oodle', 'mkephart25', 'doomed', 'spotifyuri', 'mangelat', 'Cody_K', 'swayswaystacey', 'KLLY953', 'onlaa', 'Ginger_Swan', 'Call_Embry', 'conservatweet', 'weerinlelystad', 'ruhanirabin', 'tmgadops', 'wakemeupinside1', 'horaoficial', 'xstex', 'franzidee', 'tommytrc', 'khopmusic', 'tez19', 'GaryGotnought', 'UnemployKiller', 'felloff', 'Kalediscope', 'TheRealSherina', 'jasonsfreestuff', 'johnkennick', 'sel_gomezx3', 'OE3', 'AddisonMontg', '_rosieCAKES', 'neownblog', 'PrinceP23', 'ontd_fluffy', 'USofAl', 'Kacizzle88', 'somalush', 'FrankieNichelle', 'jiva_music', 'itz_cookie', 'soundOfTheTone', 'knowheremom', 'Jayme1988', 'TrafficPilot', 'tweetalot', 
'TheStation1610', 'lasvegasdivorce', '1000_LINKS_NOW2', 'KeepOnTweeting', 'uFreelance', 'ChocoKouture', 'Magic983', 'SnarkySharky', 'agthekid', 'cashinnow', 'jamokie', 'jessicastanely', 'Q103Albany', 'GPGTwit', 'xAmberNicholex', 'wjtlplaylist', 'sjAimee', 'chrisduhhh', 'failbus', '1stwave', 'RichardBejah', 'nyanko_love']</p>
<h3>Web Queries Overlap</h3>
<p>How much overlap is there between tweets and trending web search queries?</p>
<p>I took the top trending queries from Google Trends for the days of my Twitter crawl, then expanded each trending query until it was 6 tokens long so as to equalize the average lengths. Then, I simply counted how many tweets match at least 2 (cleaned) tokens of any of these query-expanded trends:</p>
<p><strong>0.0185654981775 or 2%</strong></p>
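<p>A minimal sketch of the counting step, assuming the expanded trends are already available as plain strings. The cleaning rule (lowercase alphanumeric tokens) and the function names are assumptions for illustration; the query expansion itself is not shown.</p>

```python
import re


def clean_tokens(text):
    """Lowercase and keep alphanumeric tokens (a stand-in for the cleaning step)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches_any_trend(tweet, trend_token_sets, min_shared=2):
    """True if the tweet shares at least min_shared tokens with any one trend."""
    tweet_tokens = clean_tokens(tweet)
    return any(
        len(tweet_tokens.intersection(trend)) >= min_shared
        for trend in trend_token_sets
    )


def overlap_fraction(tweets, expanded_trends, min_shared=2):
    """Fraction of tweets that match at least one expanded trend."""
    trend_token_sets = [clean_tokens(t) for t in expanded_trends]
    tweets = list(tweets)
    if not tweets:
        return 0.0
    hits = sum(1 for t in tweets if matches_any_trend(t, trend_token_sets, min_shared))
    return hits / len(tweets)
```

<p>Note that the shared tokens must come from the same trend; two tokens spread across two different trends do not count as a match in this sketch.</p>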
<p>That’s it for now. I have some more stats but need a bit more time to clean those up before publishing here.</p>
<p><em>Notes</em></p>
<p>Can&#8217;t distribute my data set unfortunately, but it shouldn’t take too long to assemble a comparable set via <a href="http://apiwiki.twitter.com/Streaming-API-Documentation#statuses/sample">Twitter’s spritzer feed</a> &#8211; that’ll probably be more useful as it’ll be more up-to-date than the one I analyzed here. Feel free to copy my stats if you find them useful (top hashtags and users are in JSON format).</p>
]]></content:encoded>
			<wfw:commentRss>http://zooie.wordpress.com/2009/10/12/some-stats-about-twitters-content/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/17518bf0a462f22fc174f2df8e464e69?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">zooie</media:title>
		</media:content>
	</item>
		<item>
		<title>Build an Automatic Tagger in 200 lines with BOSS</title>
		<link>http://zooie.wordpress.com/2009/10/09/build-an-automatic-tagger-in-200-lines-with-boss/</link>
		<comments>http://zooie.wordpress.com/2009/10/09/build-an-automatic-tagger-in-200-lines-with-boss/#comments</comments>
		<pubDate>Fri, 09 Oct 2009 15:45:59 +0000</pubDate>
		<dc:creator>Vik</dc:creator>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Boss]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[CS]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[delicious]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Social]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Talk]]></category>
		<category><![CDATA[Tutorial]]></category>
		<category><![CDATA[Yahoo]]></category>
		<category><![CDATA[api]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[hack]]></category>
		<category><![CDATA[hack day]]></category>
		<category><![CDATA[svm]]></category>
		<category><![CDATA[text]]></category>

		<guid isPermaLink="false">http://zooie.wordpress.com/?p=360</guid>
		<description><![CDATA[My colleagues and I will be giving a talk on BOSS at Yahoo!’s Hack Day in NYC on October 9. To show developers the versatility of an open search API, I developed a simple toy example (see my past ones: &#8230; <a href="http://zooie.wordpress.com/2009/10/09/build-an-automatic-tagger-in-200-lines-with-boss/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=zooie.wordpress.com&amp;blog=31469&amp;post=360&amp;subd=zooie&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>My colleagues and I will be giving a talk on BOSS at <a href="http://openhacknyc.pbworks.com/">Yahoo!’s Hack Day in NYC</a> on October 9. To show developers the versatility of an open search API, I developed a simple toy example (see my past ones: <a href="http://zooie.wordpress.com/2009/01/15/twitter-boss-real-time-search/">TweetNews</a>, <a href="http://zooie.wordpress.com/2008/08/04/yahoo-boss-google-app-engine-integrated/">Q&amp;A</a>) on the flight over that uses BOSS to generate data for training a machine learned text classifier. The resulting application basically takes two tags, some text, and tells you which tag best classifies that text. For example, you can ask the system if some piece of text is more liberal or conservative.</p>
<p>How does it work? BOSS offers delicious metadata for many search results that have been saved in delicious. This includes top tags, their frequencies, and the number of user saves. Additionally, BOSS makes available an option to retrieve extended search result abstracts. So, to generate a training set, I first build up a query list (100 popular delicious tags), search each query through BOSS (asking for 500 results per query), and filter the results to just those that have delicious tags.</p>
<p>Basically, the collection logically looks like this:</p>
<p><span style="font-family:courier;">[(result_1, delicious_tags), (result_2, delicious_tags) …]</span></p>
<p>Then, I invert the collection on the tags while retaining each result’s extended abstract and title fields (concatenated together).</p>
<p>The collection now logically looks like this:</p>
<p><span style="font-family:courier;">[(tag_1, result_1.abstract + result_1.title), (tag_2, result_1.abstract + result_1.title), …, (tag_1, result_2.abstract + result_2.title), (tag_2, result_2.abstract + result_2.title) …]</span></p>
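<p>The inversion can be sketched in a few lines of Python. The dictionary keys "abstract" and "title" and the function name are hypothetical stand-ins for whatever the BOSS response parser actually yields; a space is added between the two fields so that their tokens do not run together.</p>

```python
def invert_on_tags(results):
    """Turn [(result, tags), ...] into [(tag, text), ...], one pair per tag.

    Each result is assumed to be a dict with "abstract" and "title" keys.
    """
    pairs = []
    for result, tags in results:
        text = result["abstract"] + " " + result["title"]
        for tag in tags:
            pairs.append((tag, text))
    return pairs
```

<p>A result saved under three tags therefore contributes three pairs, all carrying the same text.</p>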
<p>To build a model comparing 2 tags, the system selects pairs from the above collection that have matching tags, converts the abstract + title text into features, and then passes the resulting pairs over to LibSVM to train a binary classification model.</p>
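<p>As a rough sketch of the feature step: the snippet below selects the pairs for two tags and writes each one as a line in LibSVM&#8217;s sparse text format (label followed by ascending index:value entries). Plain term counts are an assumption here, as are all function names; the post only says the features are simple.</p>

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_pairs(pairs, tag_a, tag_b):
    """Keep only the (tag, text) pairs labelled with one of the two tags."""
    return [(tag, text) for tag, text in pairs if tag in (tag_a, tag_b)]


def build_vocabulary(texts):
    """Map each distinct token to a 1-based feature index."""
    vocab = {}
    for text in texts:
        for token in tokenize(text):
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def to_libsvm_line(label, text, vocab):
    """Format one instance as 'label index:count ...' with ascending indices."""
    counts = Counter(vocab[t] for t in tokenize(text) if t in vocab)
    features = " ".join("%d:%d" % (i, counts[i]) for i in sorted(counts))
    return ("%+d %s" % (label, features)).strip()
```

<p>One tag is mapped to the label +1 and the other to -1; tokens missing from the vocabulary are simply dropped.</p>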
<p>Here’s how to run it:</p>
<p><span style="font-family:courier;">tagger viksi$ python gen_training_test_set.py liberal conservative</span></p>
<p><span style="font-family:courier;">tagger viksi$ python autosvm.py training_data.txt test_data.txt</span></p>
<p>__Searching / Training Best Model</p>
<p>____Trained A Better Model: 60.5263</p>
<p>____Trained A Better Model: 68.4211</p>
<p>__Predicting Test Data</p>
<p>__Evaluation</p>
<p>____Right: 16</p>
<p>____Wrong: 4</p>
<p>____Total: 20</p>
<p>____Accuracy: 0.800000</p>
<p>gen_training_test_set finds the pairs with matching tags and splits those results into a training set (80% of the pairs) and a test set (20%), saving the data as training_data.txt and test_data.txt respectively. autosvm learns the best model (brute forcing the parameters for you – could be handy by itself as a general learning tool) and then applies it to the test set, reporting how well it did. In the above case, the system achieved 80% accuracy over 20 test instances.</p>
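<p>For illustration, the 80/20 split and the brute-force parameter search could look roughly like this. It is a sketch, not the code in the repository: the parameter names C and gamma (the usual knobs of an RBF-kernel SVM) are an assumption, and score_fn is a hypothetical callback that trains a model and returns its accuracy.</p>

```python
import itertools
import random


def split_pairs(pairs, train_fraction=0.8, seed=0):
    """Shuffle the pairs and split them into (training, test) lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def brute_force_search(score_fn, c_values, gamma_values):
    """Try every (C, gamma) combination and keep the best-scoring one.

    score_fn(c, gamma) is a hypothetical callback that trains a model with
    those parameters and returns its validation accuracy.
    """
    best_params, best_score = None, float("-inf")
    for c, gamma in itertools.product(c_values, gamma_values):
        score = score_fn(c, gamma)
        if score > best_score:
            best_params, best_score = (c, gamma), score
    return best_params, best_score
```

<p>Fixing the seed keeps the split reproducible between runs, which makes accuracy numbers easier to compare when the features change.</p>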
<p>Here’s another way to use it:</p>
<p><span style="font-family:courier;">tagger viksi$ python classify.py apple microsoft bill gates steve ballmer windows vista xp</span></p>
<p>microsoft</p>
<p><span style="font-family:courier;">tagger viksi$ python classify.py apple microsoft steve jobs ipod iphone macbook</span></p>
<p>apple</p>
<p>classify combines the above steps into an application that, given two tags and some text, will return which tag more likely describes the text. Or, in command line form, ‘python classify.py [tag1] [tag2] [some free text]’ =&gt; ‘tag1’ or ‘tag2’</p>
<p>My main goal here is not to build a perfect experiment or classifier (see caveats below), but to show a proof of concept of how BOSS or open search can be leveraged to build intelligent applications. BOSS isn’t just a search API, but really a general data API for powering any application that needs to party on a lot of the world’s knowledge.</p>
<p><span style="font-size:small;">I’ve open sourced the code here:</span></p>
<p><span style="font-size:small;"> </span></p>
<p><span style="font-size:small;"><strong><a href="http://github.com/zooie/tagger">http://github.com/zooie/tagger</a></strong></span></p>
<p>Caveats</p>
<p><span style="font-size:xx-small;">Although the code totals only ~200 lines, the system is fairly state-of-the-art as it employs LibSVM for its learning model. However, this classifier setup has several caveats due to my time constraints and goals, as my main intention for this example was to show the awesomeness of the BOSS data. For example, training and testing on abstracts and titles means the top features will probably include the query itself, so the test set may be fairly easy to score well on and may not be representative of real input data. I did later add code to remove query-related features from the test set, and the accuracy seemed to dip just slightly. For classify.py, the &#8216;some free text&#8217; input needs to be fairly large (about an extended abstract&#8217;s size) to be more accurate. Another caveat is what happens when both tags have been used to label a particular search result. The current system may only choose one tag, which may incur an error depending on what’s selected in the test set. Furthermore, the features I’m using are super simple and can be greatly improved with TFIDF scaling, normalization, feature selection (mutual information gain), etc. Also, the system should be tested with more training / test instances (checking the distribution of the labels), baselines, and evaluation measures.</span></p>
<p><span style="font-size:xx-small;"> </span></p>
<p><span style="font-size:xx-small;">I could have made this code a lot cleaner and shorter if I just used LibSVM’s python interface, but I for some reason forgot about that and wrote up scripts that parsed the stdout messages of the binaries to get something working fast (but dirty).</span></p>
]]></content:encoded>
			<wfw:commentRss>http://zooie.wordpress.com/2009/10/09/build-an-automatic-tagger-in-200-lines-with-boss/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/17518bf0a462f22fc174f2df8e464e69?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">zooie</media:title>
		</media:content>
	</item>
		<item>
		<title>Delicious.com Gets Fresh</title>
		<link>http://zooie.wordpress.com/2009/08/04/delicious-com-gets-fresh/</link>
		<comments>http://zooie.wordpress.com/2009/08/04/delicious-com-gets-fresh/#comments</comments>
		<pubDate>Tue, 04 Aug 2009 14:54:41 +0000</pubDate>
		<dc:creator>Vik</dc:creator>
				<category><![CDATA[Boss]]></category>
		<category><![CDATA[delicious]]></category>
		<category><![CDATA[Non-Technical-Read]]></category>
		<category><![CDATA[Open]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Social]]></category>
		<category><![CDATA[Twitter]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Yahoo]]></category>
		<category><![CDATA[fresh]]></category>
		<category><![CDATA[homepage]]></category>

		<guid isPermaLink="false">http://zooie.wordpress.com/?p=315</guid>
		<description><![CDATA[Today we have officially released an experimental Fresh tab on the delicious.com page. Learn more about it here on the delicious blog. I won&#8217;t rehash too much of the delicious blog post as that describes the motivation and idea in &#8230; <a href="http://zooie.wordpress.com/2009/08/04/delicious-com-gets-fresh/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=zooie.wordpress.com&amp;blog=31469&amp;post=315&amp;subd=zooie&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>Today we have officially released an experimental Fresh tab on the <strong><a href="http://delicious.com">delicious.com</a></strong> page. Learn more about it<strong> <a href="http://blog.delicious.com/blog/2009/08/delicious-homepage-gets-%E2%80%9Cfresh%E2%80%9D.html">here on the delicious blog</a></strong>.</p>
<p style="text-align:center;"><a href="http://farm4.static.flickr.com/3467/3789010486_8508913f68_o.png"><img class="aligncenter" title="Delicious Fresh Homepage" src="http://farm4.static.flickr.com/3467/3789010486_8508913f68_o.png" alt="" width="594" height="293" /></a></p>
<p>I won&#8217;t rehash too much of the delicious blog post as that describes the motivation and idea in detail, but the basic idea was to advance and apply the <a href="http://zooie.wordpress.com/2009/01/15/twitter-boss-real-time-search/">TweetNews</a> model to the latest stream of delicious bookmarks. The result is what we feel to be a pretty relevant and fresh (updates every minute or so) homepage. Please check it out and bookmark it (no pun intended). This is just a simple start toward better surfacing of content on delicious &#8211; expect more updates soon.</p>
<p>delicious also greatly advanced its search experience and sharing options in this release. You can learn more about it from the release posts <a href="http://blog.delicious.com/blog/2009/08/new-and-delicious.html">here</a> and soon <a href="http://www.ysearchblog.com/2009/08/04/search-tweet-and-discover-delicious-bookmarks">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://zooie.wordpress.com/2009/08/04/delicious-com-gets-fresh/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/17518bf0a462f22fc174f2df8e464e69?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">zooie</media:title>
		</media:content>

		<media:content url="http://farm4.static.flickr.com/3467/3789010486_8508913f68_o.png" medium="image">
			<media:title type="html">Delicious Fresh Homepage</media:title>
		</media:content>
	</item>
		<item>
		<title>A Comparison of Open Source Search Engines</title>
		<link>http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/</link>
		<comments>http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/#comments</comments>
		<pubDate>Mon, 06 Jul 2009 10:45:19 +0000</pubDate>
		<dc:creator>Vik</dc:creator>
				<category><![CDATA[Blog Stuff]]></category>
		<category><![CDATA[Boss]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[CS]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Job Stuff]]></category>
		<category><![CDATA[Open]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Talk]]></category>
		<category><![CDATA[Tutorial]]></category>
		<category><![CDATA[Twitter]]></category>

		<guid isPermaLink="false">http://zooie.wordpress.com/?p=172</guid>
		<description><![CDATA[Updated: sphinx setup wasn&#8217;t exactly &#8216;out of the box&#8217;. Sphinx searches the fastest now and its relevancy increased (charts updated below). Motivation Later this month we will be presenting a half day tutorial on Open Search at SIGIR. It&#8217;ll basically &#8230; <a href="http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=zooie.wordpress.com&amp;blog=31469&amp;post=172&amp;subd=zooie&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><span style="font-size:xx-small;"><em>Updated: sphinx setup wasn&#8217;t exactly &#8216;out of the box&#8217;. Sphinx searches the fastest now and its relevancy increased (charts updated below).</em></span></p>
<p><strong>Motivation</strong></p>
<p>Later this month we will be presenting a half day tutorial on Open Search at <a href="http://sigir2009.org/">SIGIR</a>. It&#8217;ll basically focus on how to use open source software and cloud services for building and quickly prototyping advanced search applications. Open Search isn&#8217;t just about building a Google-like search box on a free technology stack, but encouraging the community to extend and embrace search technology to improve the relevance of any application.</p>
<p><span style="font-size:xx-small;">For example, one non-search application of <a href="http://zooie.wordpress.com/2008/07/10/yahoo-boss-an-insider-view/">BOSS</a> leveraged the Spelling service to spell correct video comments before handing them off to their Spam filter. The Spelling correction process normalizes popular words that spammers intentionally misspell to get around spam models that rely on term statistics, and thus, can increase spam detection accuracy.</span></p>
<p>We have split up our upcoming talk into two sections:</p>
<ul>
<li><em>Services</em>: Open Search Web APIs (<a href="http://developer.yahoo.com/search/boss/">Yahoo! BOSS</a>, <a href="http://apiwiki.twitter.com/Twitter-API-Documentation">Twitter</a>, <a href="http://www.bing.com/developers/">Bing</a>, and <a href="http://code.google.com/apis/ajaxsearch/">Google AJAX Search</a>), interesting mashup examples, ranking models and academic research that leverage or could benefit from such services.</li>
</ul>
<ul>
<li><em>Software</em>: How to use popular open source packages for vertical indexing of your own data.</li>
</ul>
<p>While researching for the Software section, I was quite surprised by the number of open source vertical search solutions I found:</p>
<ul>
<li><a href="http://lucene.apache.org/java/docs/">Lucene</a> (<a href="http://lucene.apache.org/nutch/">Nutch</a>, <a href="http://lucene.apache.org/solr/">Solr</a>, <a href="http://hounder.org/">Hounder</a>), <a href="http://sphinxsearch.com/">Sphinx</a>, <a href="http://www.seg.rmit.edu.au/zettair/">zettair</a>, <a href="http://ir.dcs.gla.ac.uk/terrier/">Terrier</a>, <a href="http://www.galagosearch.org">Galago</a>, <a href="https://minion.dev.java.net/">Minion</a>, <a href="http://mg4j.dsi.unimi.it/">MG4J</a>, <a href="http://www.wumpus-search.org/">Wumpus</a>, <a href="http://en.wikipedia.org/wiki/Relational_database_management_system">RDBMS</a> (<a href="http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html">mysql</a>, <a href="http://www.sqlite.org/cvstrac/wiki?p=FtsUsage">sqlite</a>), <a href="http://www.lemurproject.org/indri/">Indri</a>, <a href="http://xapian.org/">Xapian</a>, <a href="http://en.wikipedia.org/wiki/Grep">grep</a> &#8230;</li>
</ul>
<p>And I was even more surprised by the lack of comparisons between these solutions. Many of these platforms advertise their performance benchmarks, but each is run in isolation, on different data sets, and they seem more focused on speed than on, say, relevance.</p>
<p>The best paper I could find that compared performance and relevance of many open source search engines was <a href="http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf">Middleton+Baeza&#8217;07</a>, but the paper is quite old now and didn&#8217;t make its source code and data sets publicly available.</p>
<p>So, I developed a couple of fun, off-the-wall experiments to test some of the popular vertical indexing solutions. These are simple, quick evaluations meant for building code examples, not for SIGIR (read the disclaimer in the Conclusion section). Here&#8217;s a table of the platforms I selected to study, with some high level feature breakdowns:</p>
<div id="attachment_203" class="wp-caption aligncenter" style="width: 460px"><a href="http://zooie.files.wordpress.com/2009/07/opensearch_compare.jpg"><img class="size-full wp-image-203 " title="open_search_compare" src="http://zooie.files.wordpress.com/2009/07/opensearch_compare.jpg?w=500" alt="High level feature comparison among the vertical search solutions I studied; The support rating and scale are based on information I collected from web sites and conversations (please feel free to comment)."   /></a><p class="wp-caption-text">High level feature comparison among the vertical search solutions I studied; The support rating and scale are based on information I collected from web sites and conversations. I tested each solution&#039;s latest stable release as of this week (Indri is TODO).</p></div>
<p>One key design decision I made was not to change any numerical tuning parameters. I really wanted to test &#8220;Out of the Box&#8221; performance to simulate the common developer scenario. Plus, it takes forever to optimize parameters fairly across multiple platforms and different data sets, especially for an over-the-weekend benchmark (see the disclaimer in the Conclusion section).</p>
<p>Also, I tried my best to write each experiment natively for each platform using the expected library routines or binary commands.</p>
<p><strong>Twitter Experiment</strong></p>
<p>For the first experiment, I wanted to see how well these platforms index Twitter data. Twitter is becoming very mainstream, and its real-time nature and brevity differ greatly from traditional web content (which these search platforms are generally more tailored for), so its data should make for some interesting experiments.</p>
<p>So I proceeded to crawl Twitter to generate a sample data set. After about a full day and night, I had downloaded ~1M tweets (~10/second).</p>
<p>But before indexing, I did some quick analysis of my acquired Twitter data set:</p>
<p># of Tweets: <strong>968,937</strong></p>
<p>Indexable Text Size (user, name, text message): <strong>92MB</strong></p>
<p>Average Tweet Size: <strong>12 words</strong></p>
<p>Types of Tweets based on simple word filters:</p>
<p style="text-align:center;">
<div id="attachment_176" class="wp-caption aligncenter" style="width: 276px"><a href="http://zooie.files.wordpress.com/2009/07/twitter_1m_stats.jpg"><img class="size-full wp-image-176 " title="twitter_1m_stats" src="http://zooie.files.wordpress.com/2009/07/twitter_1m_stats.jpg?w=500" alt="Out of a 1M sample, what kind of Tweet types do we find?"   /></a><p class="wp-caption-text">Out of a 1M sample, what types of Tweets do we find? Unique Users means that there were ~600k users that authored all of the 1M tweets in this sample.</p></div>
<p>Very interesting stats here &#8211; especially the high percentage of tweets that seem to be asking questions. Could Twitter (or an application) better serve this need?</p>
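<p>As a rough illustration of what typing tweets with &#8220;simple word filters&#8221; can look like, here is a minimal Python sketch. The filter names and patterns below are hypothetical examples chosen for illustration; they are not necessarily the rules behind the chart above.</p>

```python
import re
from collections import Counter

# Hypothetical word filters (assumptions for illustration only);
# the actual rules used for the chart are not listed in this post.
FILTERS = {
    "question": re.compile(r"\?|\b(who|what|when|where|why|how)\b", re.IGNORECASE),
    "reply": re.compile(r"^@\w+"),
    "retweet": re.compile(r"\bRT\b"),
    "link": re.compile(r"https?://"),
}


def tweet_types(text):
    """Return the sorted names of every filter whose pattern matches the tweet."""
    return sorted(name for name, pattern in FILTERS.items() if pattern.search(text))


def type_percentages(tweets):
    """Percentage of tweets matched by each filter (a tweet can match several)."""
    counts = Counter()
    for text in tweets:
        counts.update(tweet_types(text))
    return {name: 100.0 * counts[name] / len(tweets) for name in FILTERS}


if __name__ == "__main__":
    sample = [
        "RT @bob: how do I index tweets? http://example.com",
        "@alice thanks",
        "good morning",
    ]
    for text in sample:
        print(tweet_types(text), text)
    print(type_percentages(sample))
```

<p>Because a tweet can match more than one filter, the percentages from a scheme like this need not sum to 100.</p>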
<p>Here&#8217;s a table comparing the indexing performance over this Twitter data set across the select vertical search solutions:</p>
<div id="attachment_173" class="wp-caption aligncenter" style="width: 460px"><a href="http://zooie.files.wordpress.com/2009/07/open_search_tweets_perf.jpg"><img class="size-full wp-image-173 " title="open_search_tweets_perf" src="http://zooie.files.wordpress.com/2009/07/open_search_tweets_perf.jpg?w=500" alt="Indexing 1M twitter messages on a variety of open source search solutions; measuring time and space for each."   /></a><p class="wp-caption-text">Indexing 1M twitter messages on a variety of open source search solutions.</p></div>
<p><span style="font-size:xx-small;">Lucene was the only solution that produced an index smaller than the input data. Running it in optimize mode shaves off an additional 5 megabytes, at the cost of another ten seconds of indexing. sphinx and zettair indexed the fastest. Interestingly, when I ran zettair in big-and-fast mode (which uses 300+ megabytes of RAM), it ran 3 seconds slower (maybe because of the nature of tweets). Xapian ran 5x slower than sqlite (which stores the raw input data in addition to the index) and produced the largest index files. The default index_text method in Xapian stores positional information, which blew the index up to 529 megabytes; one must use index_text_without_positions to make the size more reasonable. I checked my Xapian code against the examples and documentation to see if I was doing something wrong, but I couldn&#8217;t find any discrepancies. I also included a column about development issues I encountered. zettair was by far the easiest to use (simple command line) but required transforming the input data into a new format. I had some text issues with sqlite (which also needs to be recompiled with FTS3 enabled) and sphinx, given their strict input constraints. sphinx also requires a conf file, and it took some searching to find full examples of one. Lucene, zettair, and Xapian were the most forgiving when it came to accepting text inputs (zero errors).</span></p>
<p><strong>Measuring Relevancy: Medical Data Set</strong></p>
<p>While this is a fun performance experiment for indexing short text, it does not measure search performance or relevancy.</p>
<p>To measure relevancy, we need judgment data that tells us how relevant a document result is to a query. The best data set I could find that was publicly available for download (almost all of them require mailing in CDs) was from the <a href="http://trec.nist.gov/data/t9_filtering.html">TREC-9 Filtering track</a>, which provides a collection of 196,403 medical journal references &#8211; totaling ~300MB of indexable text (titles, authors, abstracts, keywords) with an average of 215 tokens per record. More importantly, this data set provides judgment data for 63 query-like tasks in the form of &#8220;&lt;task, document, 2|1|0 rating&gt;&#8221; (2 is very relevant, 1 is somewhat relevant, 0 is not rated). An example task is &#8220;37 yr old man with sickle cell disease.&#8221; To turn this into a search benchmark, I treat these tasks as OR&#8217;ed queries. To measure relevancy, I compute the Average <a href="http://en.wikipedia.org/wiki/Discounted_Cumulative_Gain">DCG</a> across the 63 queries for results in positions 1-10.</p>
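<p>For readers unfamiliar with the metric, here is a minimal Python sketch of averaging DCG over the top 10 positions. It assumes the common rating / log2(rank + 1) form of DCG and treats unjudged documents as 0; the exact variant in the benchmark code may differ, and the function names and toy data are made up for illustration.</p>

```python
import math


def dcg_at_k(ratings, k=10):
    """DCG over the top-k results.

    ratings[i] is the judged rating (2, 1 or 0) of the result at rank i + 1.
    Assumes the rating / log2(rank + 1) form of the discount.
    """
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(ratings[:k], start=1))


def average_dcg(ranked_results, judgments, k=10):
    """Mean DCG@k across tasks.

    ranked_results: {task_id: [doc_id, ...]} in ranked order.
    judgments: {(task_id, doc_id): rating}; missing pairs count as 0.
    """
    scores = []
    for task_id, doc_ids in ranked_results.items():
        ratings = [judgments.get((task_id, doc_id), 0) for doc_id in doc_ids]
        scores.append(dcg_at_k(ratings, k))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data, not taken from the TREC-9 judgments.
    ranked_results = {"task1": ["d1", "d2", "d3"], "task2": ["d4", "d5"]}
    judgments = {("task1", "d1"): 2, ("task1", "d3"): 1, ("task2", "d5"): 2}
    print(average_dcg(ranked_results, judgments))
```

<p>Since relevant documents lower in the list are discounted by the log of their rank, an engine scores higher the closer it places the highly rated documents to position 1.</p>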
<div id="attachment_187" class="wp-caption aligncenter" style="width: 460px"><a href="http://zooie.files.wordpress.com/2009/07/opensearch_ohsumed.jpg"><img class="size-full wp-image-187 " title="open_search_ohsumed_perf" src="http://zooie.files.wordpress.com/2009/07/opensearch_ohsumed.jpg?w=500" alt="Performance and Relevancy marks on the TREC OHSUMED Data Set; Lucene is the smallest, most relevant and fastest to search; Xapian is very close to Lucene on the search side but 3x slower on indexing and 4x bigger in index space; zettair is the fastest indexer."   /></a><p class="wp-caption-text">Performance and Relevancy marks on the TREC-9 data set across select vertical search solutions.</p></div>
<p><span style="font-size:xx-small;">With this larger data set (3x larger than the Twitter one), we see zettair&#8217;s indexing performance improve (which makes sense, as it&#8217;s designed more for larger corpora); zettair&#8217;s search speed should probably be a bit faster than reported, because its search command line utility prints some unnecessary stats. For multi-searching in sphinx, I developed a Java client (in the hope of making it competitive with Lucene &#8211; the one to beat) which connects to the sphinx searchd server via a socket (that&#8217;s the API model in their examples). sphinx returned search results the fastest &#8211; ~3x faster than Lucene. Its indexing time was also on par with zettair. Lucene obtained the highest relevance and the smallest index size. Its index time could probably be improved by fiddling with its merge parameters, but I wanted to avoid numerical adjustments in this evaluation. Xapian has very similar search performance to Lucene but with significant indexing costs (more than 3x in both time and space). sqlite has the worst relevance because it neither sorts by relevance nor seems to provide an ORDER BY function to do so.</span></p>
<p><strong>Conclusion &amp; Downloads</strong></p>
<p>Based on these preliminary results and anecdotal information I&#8217;ve collected from the web and people in the field (with more emphasis on the latter), I would probably recommend Lucene (which is an <a href="http://en.wikipedia.org/wiki/Information_retrieval">IR</a> library &#8211; use a wrapper platform like <a href="http://lucene.apache.org/solr/">Solr</a> w/ <a href="http://lucene.apache.org/nutch/">Nutch</a> if you need all the search dressings like snippets, crawlers, servlets) for many vertical search indexing applications &#8211; <em>especially if you need something that runs decently well out of the box (as that&#8217;s what I&#8217;m mainly evaluating here) and community support</em>.</p>
<p><em>Keep in mind that these experiments are still very early (done on a weekend budget) and can/should be improved greatly with bigger and better data sets, tuned implementations, and community support (I&#8217;d be the first one to say these are far from perfect, so I open sourced my code below). </em>It’s pretty hard to make a benchmark that everybody likes (especially in this space where there haven’t really been many … and I’m starting to see why <img src='http://s0.wp.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> ), not necessarily because there are always winners/losers and biases in benchmarks, but because there are so many different types of data sets and platform APIs and tuning parameters (at least databases support SQL!). This is just a start. I see this as a very evolutionary project that requires community support to get it right. Take the results here for what they’re worth, and still run your own tuned benchmarks.</p>
<p>To encourage further search development and benchmarks, I&#8217;ve open sourced all the code here:</p>
<p><a href="http://github.com/zooie/opensearch/tree/master">http://github.com/zooie/opensearch/tree/master</a></p>
<p>Happy to post any new and interesting results.</p>
]]></content:encoded>
			<wfw:commentRss>http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/feed/</wfw:commentRss>
		<slash:comments>141</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/17518bf0a462f22fc174f2df8e464e69?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">zooie</media:title>
		</media:content>

		<media:content url="http://zooie.files.wordpress.com/2009/07/opensearch_compare.jpg" medium="image">
			<media:title type="html">open_search_compare</media:title>
		</media:content>

		<media:content url="http://zooie.files.wordpress.com/2009/07/twitter_1m_stats.jpg" medium="image">
			<media:title type="html">twitter_1m_stats</media:title>
		</media:content>

		<media:content url="http://zooie.files.wordpress.com/2009/07/open_search_tweets_perf.jpg" medium="image">
			<media:title type="html">open_search_tweets_perf</media:title>
		</media:content>

		<media:content url="http://zooie.files.wordpress.com/2009/07/opensearch_ohsumed.jpg" medium="image">
			<media:title type="html">open_search_ohsumed_perf</media:title>
		</media:content>
	</item>
	</channel>
</rss>
