Category Archives: Twitter

An Evaluation of Google’s Realtime Search

How timely are the results returned from Google’s Realtime (RT) Search Engine? How often do Twitter results appear in these results? Over the weekend I developed a few basic experiments to find out and published the results below.

Key Findings

  • For location-based queries, there’s nearly a flip of a coin chance (43%) that a Twitter result will be the #1 ranked result.
  • For general knowledge queries, there’s a 23% chance that a Twitter result will be #1.
  • The newest Twitter results are usually 4 seconds old. The newest Web results are 10x older (41 seconds).
  • A top ranking Twitter result for a location-based query is usually 2 minutes old (compared with Web which is 22 minutes old – again nearly 10x older).
  • When Twitter results appear, at least one of them is in the top-ranked position.
Experiment #1 – General Knowledge

I crawled 1,370 article titles from Wikipedia and ran each title as a query into Google RT search.

Market Shares

81% of all queries returned search results that included web page results
23% of all queries returned search results that included Twitter results
7% of all queries returned 0 search results

70% of all queries had a web page result in the #1 ranked position
When Twitter results appeared there was always at least one result in the #1 ranked position (so 23% of queries)

Time Lag

When a web page was the #1 ranked result, that result on average was 6736 seconds (or 1 hr and 52 minutes) old.
When a Tweet was the #1 ranked result, that result on average was 261 seconds (or 4 minutes and 21 seconds) old.

The average age of the top 10% newest web page results (across all queries) is 41 seconds
The average age of the top 10% newest Twitter results (across all queries) is 2 seconds
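
A rough sketch of how a "top 10% newest" figure like the two above can be computed: sort the result ages, keep the newest tenth, and average them. The function name and the rounding rule for the 10% cut are assumptions for illustration, not details taken from the experiment.

```python
def mean_age_of_newest(ages_seconds, fraction=0.10):
    """Average age (in seconds) of the newest `fraction` of results."""
    if not ages_seconds:
        raise ValueError("need at least one result age")
    ordered = sorted(ages_seconds)                 # smallest age = newest result
    count = max(1, int(len(ordered) * fraction))   # always keep at least one result
    newest = ordered[:count]
    return sum(newest) / len(newest)
```

Because only the freshest slice is averaged, this number says how fast the engine can be, while the #1-ranked averages say how fresh the typical top result is.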

Tail

Query length was between 1 – 12 words (where 1-2 word long queries are most popular)
Worth noting that no Twitter results appear for queries with greater than 5 words

Experiment #2 – Location

I crawled 265 major populated U.S. cities from the U.S. Census Bureau and ran each city name as a query into Google RT search.

Market Shares

73% of all queries returned search results that included web page results
43% of all queries returned search results that included Twitter results
5% of all queries returned 0 search results

52% of all queries had a web page result in the #1 ranked position
When Twitter results appeared there was always at least one result in the #1 ranked position (so 43% of queries)

Time Lag

When a web page was the #1 ranked result, that result on average was 1341 seconds (or 22 minutes and 21 seconds) old.
When a Tweet was the #1 ranked result, that result on average was 138 seconds (or 2 minutes and 18 seconds) old.

The average age of the top 10% newest web page results (across all queries) is 41 seconds
The average age of the top 10% newest Twitter results (across all queries) is 4 seconds

Tail

Query length was between 1 – 3 words
Worth noting that no Twitter results appear for 3 word long queries

Implementation Details

  • Generated Wiki queries by running “site:en.wikipedia.org” searches on Google and Blekko, and extracting the titles (en.wikipedia.org/{title_is_here}) from the result links. Side point: I tried Bing but the result links had mostly one word long titles (Bing seems to really bias query length in their ranking) and I wanted more diversity to test out tail queries.
  • Crawled cities (for the location-based queries) from http://www.census.gov/popest/cities/tables/SUB-EST2009-01.csv
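
The title extraction step could look something like the sketch below. It assumes the result links use the usual en.wikipedia.org/wiki/{title} path with underscores for spaces; the function name is made up for illustration.

```python
from urllib.parse import urlparse, unquote

def title_to_query(result_url):
    """Turn an en.wikipedia.org result link into a plain-text query, or None."""
    parsed = urlparse(result_url)
    if parsed.netloc != "en.wikipedia.org":
        return None
    prefix = "/wiki/"  # assumed article path prefix
    if not parsed.path.startswith(prefix):
        return None
    # Decode percent-escapes and turn underscores back into spaces.
    title = unquote(parsed.path[len(prefix):]).replace("_", " ").strip()
    return title or None
```

For example, title_to_query("http://en.wikipedia.org/wiki/Sickle-cell_disease") gives "Sickle-cell disease", which can then be sent as a query.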

Caveats

  • I ran these experiments at 2:45 a.m. PST on Monday. The location-based queries all relate to the U.S., so probably not many people were up at that time generating up-to-date information. The time lag stats could vary depending on when these experiments are run. I did however re-run the experiments in the late morning and didn’t see much difference in the timings.
  • I ran all queries through Google’s normal web search engine with ‘Latest’ on (in the left bar under Search Tools). These results are not exactly the same as those generated from the standalone Google Realtime Search portal, which seems to bias Tweets more, while the ‘Latest’ results seem to find middle ground between real-time Twitter results and web page results. I used ‘Latest’ because it seems like it would be the most popular gateway to Google’s Realtime search results.


Filed under Blog Stuff, Computer Science, Data Mining, Google, Information Retrieval, Research, Search, Social, Statistics, Twitter, Wikipedia

anymeme: Breaking News, Tweets in your URLs

A very basic experiment that pads URLs with messages:

or more appropriately http://anymeme.appspot.com/anymeme.appspot.com

Notes

  • This is not related to any work I’ve been pursuing during my EIR gig.
  • It’s kind of like the opposite of bit.ly (there is a shortener available on the site though). It’s better tailored for shorter URLs where there’s enough address bar space to display a message at the end of the URL.
  • I tested this on the top 30 or so sites using a mix of Firefox and Chrome.
  • This could easily be the dumbest thing I’ve ever developed, but then again there are a lot of dumb things on the web. It took longer for me to write these posts describing anymeme than to develop the code for it. This is more of an experiment to see:
    • If users, publishers, and advertisers like it
    • To try to make URLs more interesting and valuable
  • It would be so cool:
    • To generate enough cash via sponsored messages to make meaningful contributions to great causes
    • To see an important breaking news headline or an interesting tweet as you load up hulu to check for new episodes – visible in the previously half empty address bar so there’s no need to frame or change the destination page to show the content.
  • It currently runs on Google App Engine


Filed under Blog Stuff, Google, News, Social, Trends, Twitter, Uncategorized, Web2.0

Some Stats about Twitter’s Content

Near the end of July, I crawled a sample of ~10M tweets. On my way over from Open Hack Day NYC yesterday I finally got some time to do some preliminary analysis of this data. Several posts have analyzed Twitter’s traffic stats [TechCrunch] [Mashable] [zooie], so I thought I’d focus more on the content here.

Duplication

By compressing the data and comparing the before and after sizes, one can get a pretty decent understanding of the duplication factor. To do this, I extracted just the raw text messages, sorted them, and then ran gzip over the sorted set.

Compression ratio

>>> 284023259 / 739273532 bytes

0.38419238171778614

Typically, for text compression, gzip-like programs can achieve around 50% without the sort (and sorting typically helps), and here we get 38%. A standard text corpus consists of much larger document sizes, so it’s interesting to see a similar or larger duplication factor for tweets.
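
The sort-then-gzip measurement is easy to reproduce on any sample of messages. Here is a minimal in-memory sketch using Python's gzip module rather than the command-line tool; the function name is an assumption, and a 10M-tweet corpus would more likely be streamed through files.

```python
import gzip

def sorted_gzip_ratio(messages):
    """Return compressed_size / raw_size after sorting the messages."""
    raw = "\n".join(sorted(messages)).encode("utf-8")
    if not raw:
        raise ValueError("no text to compress")
    # Sorting puts identical and similar messages next to each other,
    # which lets the compressor exploit the duplication.
    return len(gzip.compress(raw)) / len(raw)
```

A lower ratio means more redundancy: a sample made of one message repeated many times compresses to almost nothing, while a sample of distinct messages stays much closer to its raw size.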

We can dive even deeper into this area by analyzing the term overlap statistics to measure near duplication, or messages that aren’t necessarily identical but are close enough.

To do this, I first cleaned the text (removed stopwords, stemmed terms, normalized case). Interestingly, after cleaning the text, the average number of tokens for a message is just 6.28, or 2.5x the size of a standard web search query.

Then, I employed consistent term sampling to select N representatives for each cleaned message and coalesced the representatives together as a single key. By comparing the total number of unique keys to messages, one can infer the near duplication factor. Also, the higher the N, the higher the threshold is to match (so N >= 6, 6 being the average number of tokens per message, probably means that two messages that generate the same key are exact duplicates).
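
One way to sketch the sampling idea, with assumptions called out: each token is hashed with a fixed hash function, the N tokens with the smallest hashes become the representatives, and they are joined into a key. The choice of SHA-256, the function names, and the handling of messages shorter than N tokens (use every token) are illustrative guesses, not the exact method used for the numbers in this post.

```python
import hashlib

def term_hash(token):
    """Stable hash, so the same token always sorts to the same place."""
    return int(hashlib.sha256(token.encode("utf-8")).hexdigest(), 16)

def sample_key(tokens, n):
    """Join the n distinct tokens with the smallest hashes into one key."""
    chosen = sorted(set(tokens), key=term_hash)[:n]
    return " ".join(chosen)

def coverage(cleaned_messages, n):
    """Unique keys divided by number of messages (assumes a non-empty list)."""
    keys = {sample_key(tokens, n) for tokens in cleaned_messages}
    return len(keys) / len(cleaned_messages)
```

Because the selection depends only on token hashes and not on word order, ["obama", "speech", "tonight"] and ["tonight", "obama", "speech"] produce the same key, and lowering n makes it easier for loosely related messages to collide.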

You’ll notice N >= 6 converges around 84%, implying that after cleaning the text, 16% of the messages exactly match some other message. Additionally, when N = 2 (requiring 2 of 6 tokens, or 33% of the text on average, to match), about 44% of the messages collide with other messages in the corpus. At N = 2, matching often means the messages discuss the same general topic but aren’t necessarily near duplicates.

N Term Samples    Unique Keys    Coverage
8                 8548695        0.8356
6                 8512672        0.8321
5                 8476590        0.8286
4                 8366391        0.8177
3                 8098400        0.7916
2                 5716566        0.5588
1                 1013783        0.0991

URLs

URLs are present in ~18% of the tweets

Of those, ~65% of the URLs are unique

70K unique domains covering 2M URLs
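
Stats like these can be gathered with a single pass over the messages. The sketch below is an approximation: the URL regex, the lowercasing of domains, and the function name are assumptions and not necessarily what produced the numbers above.

```python
import re
from collections import Counter
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+")  # simple pattern; misses scheme-less links

def url_stats(tweets, top=10):
    """Share of tweets with a URL, share of unique URLs, and top domains."""
    urls, tweets_with_url = [], 0
    for text in tweets:
        found = URL_RE.findall(text)
        if found:
            tweets_with_url += 1
        urls.extend(found)
    domains = Counter(urlparse(u).netloc.lower() for u in urls)
    return {
        "share_with_url": tweets_with_url / len(tweets) if tweets else 0.0,
        "unique_url_share": len(set(urls)) / len(urls) if urls else 0.0,
        "top_domains": [d for d, _ in domains.most_common(top)],
    }
```

Short-link services dominate the domain counts in this kind of tally because the real destination is hidden behind the shortener.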

Top Domains:

['bit.ly', 'tinyurl.com', 'twitpic.com', 'is.gd', 'myloc.me', 'ow.ly', 'ustre.am', 'cli.gs', 'tr.im', 'plurk.com', 'ff.im', 'tumblr.com', 'yfrog.com', '140mafia.com', 'u.mavrev.com', 'twurl.nl', 'tweeterfollow.com', 'mypict.me', 'viagracan.com', 'vipfollowers.com', 'morefollowers.net', 'digg.com', 'tweeteradder.com', 'ping.fm', 'tiny.cc', 'followersnow.com', 'short.to', 'twit.ac', 'snipr.com', 'wefollow.com', 'tweet.sg', 'url4.eu', 'the-twitter-follow-train.info', 'fwix.com', 'budurl.com', 'su.pr', 'shar.es', 'tinychat.com', 'snipurl.com', 'loopt.us', 'migre.me', 'flic.kr', 'myspace.com', 'snurl.com', 'twitgoo.com', 'zshare.net', 'post.ly', 'bkite.com', 'yes.com', 'flickr.com', 'twitter.com', 'artistsforschapelle.com', '140army.com', 'youtube.com', 'x.imeem.com', 'pic.gd', 'TwitterBackgrounds.com', 'raptr.com', 'twt.gs', 'twitthis.com', 'mobypicture.com', 'tobtr.com', 'ad.vu', 'sml.vg', 'rubyurl.com', 'tinylink.com', 'redirx.com', 'a2a.me', 'eCa.sh', 'vimeo.com', 'meadd.com', 'hotjobs.yahoo.com', 'doiop.com', 'myurl.in', 'urlpire.com', 'buzzup.com', 'freead.im', 'youradder.com', 'facebook.com', 'adf.ly', 'justin.tv', 'twitvid.com', 'adjix.com', 'twcauses.com', 'lkbk.nu', 'tlre.us', 'htxt.it', 'stickam.com', 'twubs.com', 'isy.gs', 'reverbnation.com', 'news.bbc.co.uk', 'sn.im', 'twibes.com', 'ustream.tv', 'trim.su', 'hashjobs.com', 'blogtv.com', 'jobs-cb.de', 'xsaimex.com']

Retweets

~4% of messages are retweets

Replied @Users

~1M total replied-to users in this data set

37% of tweets contain ‘@x’ terms

Most Popular Replied-to Users (almost all celebrities):

['@mileycyrus', '@jonasbrothers', '@ddlovato', '@mitchelmusso', '@donniewahlberg', '@souljaboytellem', '@tommcfly', '@addthis', '@officialtila', '@johncmayer', '@shanedawson', '@bowwow614', '@jordanknight', '@ryanseacrest', '@perezhilton', '@jonathanrknight', '@petewentz', '@tweetmeme', '@adamlambert', '@david_henrie', '@dealsplus', '@dwighthoward', '@iamdiddy', '@lancearmstrong', '@songzyuuup', '@imeem', '@blakeshelton', '@dannymcfly', '@lilduval', '@selenagomez', '@markhoppus', '@yelyahwilliams', '@therealpickler', '@stephenfry', '@mrtweet.', '@taylorswift13', '@michaelsarver1', '@davidarchie', '@the_real_shaq', '@tyrese4real', '@britneyspears', '@106andpark', '@ashleytisdale', '@mariahcarey', '@kimkardashian', '@wale', '@mashable', '@programapanico', '@therealjordin', '@listensto', '@misskeribaby', '@alyssa_milano', '@alexalltimelow', '@aplusk', '@thisisdavina', '@breakingnews:', '@peterfacinelli', '@truebloodhbo', '@mgiraudofficial', '@tonyspallelli', '@mtv', '@jackalltimelow', '@dfizzy', '@youngq', '@tomfelton', '@pooch_dog', '@jonaskevin', '@princesammie', '@nkotb', '@christianpior', '@cthagod', '@johnlloydtaylor', '@neilhimself', '@moontweet', '@katyperry', '@danilogentili', '@mchammer', '@rainnwilson', '@joeymcintyre', '@30secondstomars', '@phillyd', '@heidimontag', '@mrpeterandre', '@andyclemmensen', '@crystalchappell', '@kevindurant35', '@huckluciano', '@dannygokey', '@jaketaustin', '@revrunwisdom', '@jamesmoran', '@musewire', '@dannywood', '@nickiminaj', '@akgovsarahpalin', '@terrencej106', '@mashable:', '@drewryanscott', '@mrtweet', '@necolebitchie', '@lilduval:', '@willie_day26', '@kirstiealley', '@betthegame', '@radiomsn', '@alancarr', '@rafinhabastos', '@krisallen4real', '@iamjericho', '@breakingnews', '@babygirlparis', '@ladygaga', '@chris_daughtry', '@hypem', '@danecook', '@imcudi', '@jeepersmedia', '@buckhollywood', '@kimmyt22', '@giulianarancic', '@chrisbrogan', '@nasa', '@addtoany', '@nickcarter', '@debbiefletcher', '@marcoluque', 
'@shaundiviney', '@ogochocinco', '@twitter', '@eddieizzard', '@youngbillymays', '@real_ron_artest', '@pink', '@laurenconrad', '@rubarrichello', '@ianjamespoulter', '@liltwist', '@teyanataylor', '@dougiemcfly', '@theellenshow', '@robkardashian', '@sherrieshepherd', '@justinbieber', '@paulaabdul', '@jason_manford', '@jaredleto', '@tracecyrus', '@itsonalexa', '@ddlovato:', '@khloekardashian', '@revrunwisdom:', '@solangeknowles', '@allison4realzzz', '@nickjonas', '@reply', '@anarbor', '@donlemoncnn', '@gfalcone601', '@moonfrye', '@symphnysldr', '@iamspectacular', '@honorsociety', '@questlove', '@guykawasaki', '@dawnrichard', '@_maxwell_', '@somaya_reece', '@mandyyjirouxx', '@teemwilliams', '@greggarbo', '@pennjillette', '@mikeyway', '@matthardybrand', '@iamjonwalker', '@andyroddick', '@kohnt01', '@chris_gorham', '@seankingston', '@joshgroban', '@mousebudden', '@misskatieprice', '@spencerpratt', '@wilw', '@jgshock', '@swear_bot', '@joelmadden', '@techcrunch', '@americanwomannn', '@kelly__rowland', '@mionzera', '@astro_127', '@_@', '@spam', '@sookiebontemps', '@drakkardnoir', '@noh8campaign', '@kayako', '@trvsbrkr', '@qbkilla', '@mw55', '@guykawasaki:', '@donttrythis', '@cv31', '@liljjdagreat', '@tiamowry', '@nickensimontwit', '@holdemtalkradio', '@bradiewebbstack', '@nytimes', '@riskybizness23', '@radityadika', '@adrienne_bailon', '@riccklopes', '@jessicasimpson', '@sportsnation', '@jasonbradbury', '@huffingtonpost', '@oceanup', '@gilbirmingham', '@iconic88', '@the', '@thebrandicyrus', '@gordela', '@thedebbyryan', '@jessemccartney', '@?', '@caiquenogueira', '@celsoportiolli', '@shontelle_layne', '@calvinharris', '@chattyman', '@ali_sweeney', '@anamariecox', '@joshthomas87', '@emilyosment', '@nasa:', '@sevinnyne6126', '@thebiggerlights', '@theboygeorge', '@jbarsodmg', '@goldenorckus', '@warrenwhitlock', '@bobbyedner', '@myfabolouslife', '@descargaoficial', '@ochonflcinco85', '@ninabrown', '@billycurrington', '@oprah', '@junior_lima', '@asherroth', '@starbucks', 
'@jason_pollock', '@intanalwi', '@harrislacewell', '@serenajwilliams', '@kevinruddpm', '@bigbrotherhoh', '@oliviamunn', '@chamillionaire', '@tamekaraymond', '@teamwinnipeg', '@littlefletcher', '@piercethemind', '@brookandthecity', '@iranbaan:', '@tonyrobbins', '@maestro', '@glennbeck', '@1omarion', '@nadhiyamali', '@slimthugga', '@jason_mraz', '@profbrendi', '@djaaries', '@juanestwiter', '@davegorman', '@zackalltimelow', '@mamajonas', '@itschristablack', '@skydiver', '@gigva', '@currensy_spitta', '@paulwallbaby', '@rpattzproject', '@petewentz:', '@rodrigovesgo', '@drdrew', '@sportsguy33', '@cthagod:', '@hollymadison123', '@mjjnews', '@itsbignicholas', '@_supernatural_', '@santoevandro', '@demar_derozan', '@marthastewart', '@billganz62', '@oodle', '@davidleibrandt']

Hashtags

~7% of messages contain hashtags

Total Unique Hashtags found: ~94k
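
A hashtag tally can be sketched as a whitespace split plus a counter. Entries such as '#moonfruit.' and '#hiring:' in the published list suggest trailing punctuation was kept as part of the tag; the optional strip_punct flag below (an assumption, like the function name) merges those variants.

```python
import string
from collections import Counter

def top_hashtags(tweets, k=10, strip_punct=True):
    """Count whitespace-delimited tokens that start with '#'."""
    counts = Counter()
    for text in tweets:
        for token in text.lower().split():
            if not token.startswith("#"):
                continue
            if strip_punct:
                # "#moonfruit." and "#moonfruit" become the same tag.
                token = "#" + token[1:].rstrip(string.punctuation)
            if len(token) > 1:  # ignore a bare "#"
                counts[token] += 1
    return counts.most_common(k)
```

With strip_punct=False the counts are split across the punctuated variants, which is closer to how the list in this post looks.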

Top Hashtags:

['#lies', '#fb', '#musicmonday', '#truth', '#iranelection', '#moonfruit', '#tendance', '#jobs', '#ihavetoadmit', '#mariomarathon', '#140mafia', '#tcot', '#zyngapirates', '#followfriday', '#spymaster', '#ff', '#1', '#sotomayor', '#turnon', '#notagoodlook', '#tweetmyjobs', '#hiring:', '#iran', '#fun140', '#jesus', '#72b381.', '#quote', '#tinychat', '#neda', '#militarymon', '#gr88', '#trueblood', '#fail', '#news', '#140army', '#livestrong', '#noh8', '#wpc09', '#music', '#turnoff', '#unacceptable', '#twables', '#masterchef', '#noh84kradison', '#writechat', '#job', '#squarespace', '#michaeljackson', '#2', '#nothingpersonal', '#iphone', '#ala2009', '#mj', '#tdf', '#blogtalkradio', '#mlb', '#1stdraftmovielines', '#p2', '#secretagent', '#tlot', '#72b381', '#honduras', '#twitter', '#jtv', '#tehran', '#gorillapenis', '#porn', '#bb11', '#sotoshow', '#brazillovesatl', '#google', '#oneandother', '#bb10', '#chucknorris', '#cmonbrazil', '#agendasource', '#travel', '#ashes', '#dumbledore', '#freeschapelle', '#tl', '#dealsplus', '#nsfw', '#entourage', '#tech', '#hottest100', '#3693dh...', '#torchwood', '#design', '#teaparty', '#love', '#dontyouhate', '#mileycyrus', '#sgp', '#harrypottersequels', '#peteandinvisiblechildren', '#stopretweets', '#tscc', '#wimbledon', '#hive', '#cubs', '#3', '#redsox', '#photography', '#voss', '#snods', '#lol', '#socialmedia', '#gop', '#health', '#esriuc', '#green', '#follow', '#echo!', '#obama', '#digg', '#shazam', '#hhrs', '#video', '#moonfruit.', '#swineflu', '#politics', '#ebuyer683', '#umad', '#quizdostandup', '#thankyoumichael', '#blogchat', '#wordpress', '#3693dh', '#haiku', '#ttparty', '#lastfm:', '#healthcare', '#hcr', '#ecgc', '#seo', '#apple', '#chuck', '#wine', '#sammie', '#h1n1', '#marketing', '#twitition', '#happybirthdaymitchel18', '#cnn', '#lie', '#rt:', '#art', '#nasa', '#blog', '#quotes', '#bruno', '#business', '#palin', '#mw2', '#hcsm', '#harrypotter', '#4', '#lastfm', '#askclegg', '#photo', '#jobfeedr', '#lgbt', '#lies:', 
'#ihavetoadmit.i', '#jamlegend,', '#truthbetold', '#mcfly', '#microsoft', '#fashion', '#tweetphoto', '#ebuyer167201', '#noh84adison', '#5', '#mets', '#china', '#bigprize', '#whythehell', '#money', '#sophiasheart', '#finance', '#michael', '#f1', '#adamlambert100k', '#web', '#urwashed', '#moonfruit!', '#1:', '#kayako', '#lies.', '#thankyouaaron', '#food', '#wow', '#moonfruit,', '#facebook', '#ebuyer291', '#ecomonday', '#ihave', '#happybdaydenise', '#postcrossing', '#ichc', '#912', '#demilovatolive', '#gijoemoviefan', '#funny', '#media', '#meowmonday', '#israel', '#blogger', '#forasarney', '#tv', '#topgear', '#chrisisadouche', '#stlcards', '#wec09', '#forex', '#aots1000', '#celebrity', '#dwarffilmtitles', '#6', '#yeg', '#slaughterhouse', '#nfl', '#photog', '#ny', '#firstdraftmovies', '#ufc', '#reddit', '#free', '#iwish', '#etsy', '#rulez', '#sports', '#icmillion', '#mmot', '#webdesign', '#deals', '#moonfruit?', '#pawpawty', '#twitterfahndung', '#billymaystribute', '#sytycd', '#runkeeper', '#scotus', '#yoconfieso', '#mariomarathon,', '#musicmondays', '#lies,', '#findbob', '#realestate', '#sohrab', '#sales', '#metal', '#runescape', '#hypem', '#threadless', '#gay', '#isyouserious', '#hollywood,', '#2:', '#ca,', '#golf', '#diadorock', '#newyork,', '#meteor', '#dailyquestion', '#photoshop', '#saveiantojones', '#musicmonday:', '#rock', '#sex', '#mlbfutures', '#ilove', '#mikemozart', '#nascar', '#indico', '#crossfitgames', '#gratitude', '#quote:', '#creativetechs', '#truth:', '#sharepoint', '#mkt', '#why', '#bigbrother', '#tam7', '#ihate', '#futureruby', '#slickrick', '#105.3', '#youareinatl', '#vegan', '#dontletmefindout', '#imustadmit', '#7', '#twitterafterdark', '#sunnyfacts', '#gilad', '#japan', '#iremember', '#97.3', '#puffdaddy', '#blogher', '#ade2009', '#aaliyah', '#alfredosms', '#95.1', '#truth,', '#twine', '#hiring']

Questions

Hard to infer exactly whether a message is a question or not, so I ran a couple of different filters:

5W’s, H, ? present ANYWHERE in tweet:

0.102789281948 or 10%

5W’s, H first token or ? last token:

0.0238229662219 or 2%

Just ? ANYWHERE in tweet:

0.0040984928533 or 0.4%
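
The three filters can be sketched as below. The exact matching rules are not spelled out above, so the tokenization, the punctuation stripping, and the substring test for '?' are assumptions; different choices here will move the percentages.

```python
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how"}  # 5 W's and H

def _tokens(text):
    return text.lower().split()

def loose_filter(text):
    """A question word or a '?' anywhere in the tweet."""
    return "?" in text or any(t.strip("?!.,") in QUESTION_WORDS for t in _tokens(text))

def strict_filter(text):
    """Question word as the first token, or '?' ending the last token."""
    toks = _tokens(text)
    if not toks:
        return False
    return toks[0].strip("?!.,") in QUESTION_WORDS or toks[-1].endswith("?")

def mark_only(text):
    """Just a '?' anywhere in the tweet."""
    return "?" in text

def question_rates(tweets):
    """Fraction of tweets passing each filter (assumes a non-empty list)."""
    n = len(tweets)
    return {f.__name__: sum(map(f, tweets)) / n
            for f in (loose_filter, strict_filter, mark_only)}
```

The loose filter overcounts statements like "I know what you did", which is why the stricter positional filter gives a much smaller share.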

Users

Discovered ~2M unique users

Top Sending Users (many bots):

['followermonitor', 'Tweet_Words', 'currentcet', 'currentutc', 'whattimeisitnow', 'ItIsNow', 'ThinkingStiff', 'otvrecorder', 'delicious50', 'Porngus', 'craigslistjobs', 'GorPen', 'hashjobs', 'TransAlchemy2', 'bot_theta', 'CHRISVOSS', 'bot_iota', 'bot_kappa', 'TIPAS', 'VeolaJBanner', 'StacyDWatson', 'LMAObot', 'SarahJSlonecker', 'AllisonMRussell', 'bot_eta', 'SandraHOakley', 'bot_psi', 'bot_tau', 'LoreleiRMercer', 'bot_zeta', 'bot_gamma', 'bot_sigma', 'bot_lambda', 'bot_pi', 'bot_epsilon', 'bot_nu', 'bot_rho', 'bot_omicron', 'bot_khi', 'LindaTYoung', 'mensrightsindia', 'bot_omega', 'bot_ksi', 'bot_delta', 'bot_alpha', 'bot_phi', 'CindaDJenkins', 'bot_mu', 'ImogeneDPetit', 'bot_upsilon', 'OPENLIST_CA', 'openlist', 'isygs', 'dq_jumon', 'gamingscoop', 'MildredSLogan', 'ObiWanKenobi_', 'pulseSearch', 'MaryEVo', 'ImeldaGMcward', 'MaryJNewman', 'SharonTForde', 'LoriJCornelius', 'BrandyWPulliam', 'RhondaTLopez', 'AprilKOropeza', 'CarolETrotman', 'SusanATouvell', 'dinoperna', 'buzzurls', '_Freelance_', 'DrSnooty', 'illstreet', 'bibliotaph_eyes', 'loc4lhost', 'bsiyo', 'BOTHOUSE', 'post_ads', 'qazkm', 'frugaldonkey', 'free_post', 'groovera', 'wonkawonkawonka', 'ForksGirlBella', 'casinopokera', 'dermdirectoryny', 'Yoowalk_chat', 'mstehr', 'hashgoogle', 'perry1949', 'ensiz_news', 'Bezplatno_net', 'timesmirror', 'work_freelance', 'cockbot', 'pdurham', 'bombtter_raw', 'ocha1', 'AlairAneko24', 'HaiIAmDelicious', 'Freshestjobs', 'fast_followers', 'LeadsForFree', 'RideOfYourLife', 'AlastairBotan30', 'helpmefast25', 'TheMLMWizard', 'uitrukken', 'adoptedALICE', 'TKATI', 'ezadsncash', 'tweetshelp', 'LAmetro_traffic', 'thinkpozzitive', 'StarrNeishaa', 'AldenCho36', 'JobHits', 'wootboot', 'smacula', 'faithclubdotnet', 'DmitriyVoronov', 'brownthumbgirl', 'NYCjobfeed', 'hfradiospacewx', 'FakeeKristenn', 'MLBDAILYTIMES', 'wildingp', 'JacksonsReview', 'EarthTimesPR', 'friedretweet', 'Wealthy23', 'RokpoolFM', 'HDOLLAZ', '_MrSpacely', 'Bestdocnyc', 'Rabidgun', 'flygatwick', 'live_china', 
'friendlinks', 'retweetinator', 'iamamro', 'thayferreira', 'AldisDai39', 'AndersHana60', 'nonstopNEWS', 'VivaLaCash', 'TravelNewsFeeds', 'vuelosplus', 'threeporcupines', 'DemiAuzziefan', 'worldofprint', 'KevinEdwardsJr', 'REDDITSPAMMOR', 'NatValentine', 'ChanelLebrun', 'nowbot', 'hollyswansonUK', 'youngrhome', 'M_Abricot', 'thefakemandyv', 'scrapbookingpas', 'Naughtytimes', 'Opcode1300_bot', 'tellsecret', 'tboogie937', 'Climber_IT', 'comlist', 'with_a_smile', 'USN_retired', 'Climber_EngJobs', 'Climber_Finance', 'Climber_HRJobs', 'intanalwi', 'Climber_Sales', 'nadhiyamali', 'wonderfulquotes', 'MRAustria', 'O2Q', 'GL0', 'SookieBonTemps', 'MRSchweiz', 'latinasabor', 'nineleal', 'casservice', 'AltonGin54', 'KulerFeed', '_cesaum', 'HFMONAIR', 'DeeOnDreeYah', 'rockstalgica', 'iamword', 'rpattzproject', 'madblackcatcom', 'ftfradio', 'marciomtc', 'SocialNetCircus', 'AnotherYearOver', 'ichig', 'tcikcik', 'HelenaMarie210', 'mrbax0', 'SWBot', 'DayTrends', '_Embry_Call_', 'eProducts24', 'The_Sims_3', 'tom_ssa', 'woxy_vintage', 'urbanmusic2000', 'dopeguhxfresh', 'erections', 'DudeBroChill', 'lookingformoney', 'drnschneider', 'MosesMaimonides', '92Blues', 'elarmelar', 'rock937fm', 'sonicfm', 'erikadotnet', 'sky0311', 'weqx', 'brandamc', 'Hot106', 'woxy_live', 'ksopthecowboy', 'vixalius', 'cogourl', 'Cashintoday', 'Andrewdaflirt', 'oodle', 'mkephart25', 'doomed', 'spotifyuri', 'mangelat', 'Cody_K', 'swayswaystacey', 'KLLY953', 'onlaa', 'Ginger_Swan', 'Call_Embry', 'conservatweet', 'weerinlelystad', 'ruhanirabin', 'tmgadops', 'wakemeupinside1', 'horaoficial', 'xstex', 'franzidee', 'tommytrc', 'khopmusic', 'tez19', 'GaryGotnought', 'UnemployKiller', 'felloff', 'Kalediscope', 'TheRealSherina', 'jasonsfreestuff', 'johnkennick', 'sel_gomezx3', 'OE3', 'AddisonMontg', '_rosieCAKES', 'neownblog', 'PrinceP23', 'ontd_fluffy', 'USofAl', 'Kacizzle88', 'somalush', 'FrankieNichelle', 'jiva_music', 'itz_cookie', 'soundOfTheTone', 'knowheremom', 'Jayme1988', 'TrafficPilot', 'tweetalot', 
'TheStation1610', 'lasvegasdivorce', '1000_LINKS_NOW2', 'KeepOnTweeting', 'uFreelance', 'ChocoKouture', 'Magic983', 'SnarkySharky', 'agthekid', 'cashinnow', 'jamokie', 'jessicastanely', 'Q103Albany', 'GPGTwit', 'xAmberNicholex', 'wjtlplaylist', 'sjAimee', 'chrisduhhh', 'failbus', '1stwave', 'RichardBejah', 'nyanko_love']

Web Queries Overlap

How much overlap is there between tweets and trending web search queries?

I took the top trending queries during the days of my twitter crawl from Google Trends, then query expanded each trending query until the length was 6 tokens so as to equalize the average lengths. Then, I simply counted how many tweets match at least 2 (cleaned) tokens of any of these query-expanded trends:

0.0185654981775 or 2%
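
The counting step can be sketched as a set intersection per tweet. The query expansion itself is not described here, so the sketch assumes the expanded trends are already available as token lists; the function names are illustrative.

```python
def matches_trend(tweet_tokens, trend_sets, min_shared=2):
    """True if the tweet shares at least min_shared tokens with any trend."""
    tweet_set = set(tweet_tokens)
    return any(len(tweet_set & trend) >= min_shared for trend in trend_sets)

def overlap_rate(cleaned_tweets, expanded_trends, min_shared=2):
    """Fraction of tweets matching some expanded trend (non-empty input assumed)."""
    trend_sets = [set(t) for t in expanded_trends]
    hits = sum(matches_trend(tokens, trend_sets, min_shared)
               for tokens in cleaned_tweets)
    return hits / len(cleaned_tweets)
```

This brute-force loop is fine for a small sample; for millions of tweets an inverted index from token to trends would avoid checking every trend against every tweet.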

That’s it for now. I have some more stats but need a bit more time to clean those up before publishing here.

Notes

Can’t distribute my data set unfortunately, but it shouldn’t take too long to assemble a comparable set via Twitter’s spritzer feed – that’ll probably be more useful as it’ll be more up-to-date than the one I analyzed here. Feel free to pull my stats off if you find them useful (top hashtags and users are in JSON format).


Filed under Data Mining, Research, Search, Social, Statistics, Trends, Twitter

Delicious.com Gets Fresh

Today we have officially released an experimental Fresh tab on the delicious.com page. Learn more about it here on the delicious blog.

I won’t rehash too much of the delicious blog post as that describes the motivation and idea in detail, but the basic idea was to advance and apply the TweetNews model to the latest stream of delicious bookmarks. The result is what we feel to be a pretty relevant and fresh (updates every minute or so) homepage. Please check it out and bookmark it (no pun intended). Just a simple start to hopefully better surfacing of content on delicious – expect more updates soon.

delicious also greatly advanced its search experience and sharing options in this release. You can learn more about it from the release posts here and soon here.


Filed under Boss, delicious, Non-Technical-Read, Open, Research, Social, Twitter, Uncategorized, Yahoo

A Comparison of Open Source Search Engines

Updated: sphinx setup wasn’t exactly ‘out of the box’. Sphinx searches the fastest now and its relevancy increased (charts updated below).

Motivation

Later this month we will be presenting a half day tutorial on Open Search at SIGIR. It’ll basically focus on how to use open source software and cloud services for building and quickly prototyping advanced search applications. Open Search isn’t just about building a Google-like search box on a free technology stack, but encouraging the community to extend and embrace search technology to improve the relevance of any application.

For example, one non-search application of BOSS leveraged the Spelling service to spell correct video comments before handing them off to their Spam filter. The Spelling correction process normalizes popular words that spammers intentionally misspell to get around spam models that rely on term statistics, and thus, can increase spam detection accuracy.

We have split up our upcoming talk into two sections:

  • Services: Open Search Web APIs (Yahoo! BOSS, Twitter, Bing, and Google AJAX Search), interesting mashup examples, ranking models and academic research that leverage or could benefit from such services.
  • Software: How to use popular open source packages for vertical indexing your own data.

While researching for the Software section, I was quite surprised by the number of open source vertical search solutions I found:

And I was even more surprised by the lack of comparisons between these solutions. Many of these platforms advertise their performance benchmarks, but they are in isolation, use different data sets, and seem to be more focused on speed as opposed to say relevance.

The best paper I could find that compared performance and relevance of many open source search engines was Middleton+Baeza’07, but the paper is quite old now and didn’t make its source code and data sets publicly available.

So, I developed a couple of fun, off-the-wall experiments to test some of the popular vertical indexing solutions. These are for building code examples; this is just a simple, quick evaluation and not for SIGIR (read the disclaimer in the conclusion section). Here’s a table of the platforms I selected to study, with some high level feature breakdowns:

High level feature comparison among the vertical search solutions I studied; The support rating and scale are based on information I collected from web sites and conversations. I tested each solution's latest stable release as of this week (Indri is TODO).

One key design decision I made was not to change any numerical tuning parameters. I really wanted to test “Out of the Box” performance to simulate the common developer scenario. Plus, it takes forever to optimize parameters fairly across multiple platforms and different data sets esp. for an over-the-weekend benchmark (see disclaimer in the Conclusion section).

Also, I tried my best to write each experiment natively for each platform using the expected library routines or binary commands.

Twitter Experiment

For the first experiment, I wanted to see how well these platforms index Twitter data. Twitter is becoming very mainstream, and its real time nature and brevity differs greatly from traditional web content (which these search platforms are overall more tailored for) so its data should make for some interesting experiments.

So I proceeded to crawl Twitter to generate a sample data set. After about a full day and night, I had downloaded ~1M tweets (~10/second).

But before indexing, I did some quick analysis of my acquired Twitter data set:

# of Tweets: 968,937

Indexable Text Size (user, name, text message): 92MB

Average Tweet Size: 12 words

Types of Tweets based on simple word filters:

Out of a 1M sample, what types of Tweets do we find? Unique Users means that there were ~600k users that authored all of the 1M tweets in this sample.

Very interesting stats here – especially the high percentage of tweets that seem to be asking questions. Could Twitter (or an application) better serve this need?

Here’s a table comparing the indexing performance over this Twitter data set across the select vertical search solutions:

Indexing 1M twitter messages on a variety of open source search solutions; measuring time and space for each.

Lucene was the only solution that produced an index smaller than the input data size. Running it in optimize mode shaves off an additional 5 megabytes, at the cost of another ten seconds of indexing. sphinx and zettair index the fastest. Interestingly, I ran zettair in big-and-fast mode (which sucks up 300+ megabytes of RAM) but it ran slower by 3 seconds (maybe because of the nature of tweets).

Xapian ran 5x slower than sqlite (which stores the raw input data in addition to the index) and produced the largest index file sizes. The default index_text method in Xapian stores positional information, which blew the index size up to 529 megabytes. One must use index_text_without_positions to make the size more reasonable. I checked my Xapian code against the examples and documentation to see if I was doing something wrong, but I couldn’t find any discrepancies.

I also included a column about development issues I encountered. zettair was by far the easiest to use (simple command line) but required transforming the input data into a new format. I had some text issues with sqlite (which also needs to be recompiled with FTS3 enabled) and sphinx given their strict input constraints. sphinx also requires a conf file, and it took some searching to find full examples of one. Lucene, zettair, and Xapian were the most forgiving when it came to accepting text inputs (zero errors).

Measuring Relevancy: Medical Data Set

While this is a fun performance experiment for indexing short text, this test does not measure search performance and relevancy.

To measure relevancy, we need judgment data that tells us how relevant a document result is to a query. The best data set I could find that was publicly available for download (almost all of them require mailing in CD’s) was from the TREC-9 Filtering track, which provides a collection of 196,403 medical journal references – totaling ~300MB of indexable text (titles, authors, abstracts, keywords) with an average of 215 tokens per record. More importantly, this data set provides judgment data for 63 query-like tasks in the form of “<task, document, 2|1|0 rating>” (2 is very relevant, 1 is somewhat relevant, 0 is not rated). An example task is “37 yr old man with sickle cell disease.” To turn this into a search benchmark, I treat these tasks as OR’ed queries. To measure relevancy, I compute the Average DCG across the 63 queries for results in positions 1-10.
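
For readers who want to reproduce the metric, here is a minimal sketch of an average DCG over positions 1-10. It assumes the common log2 position discount (rating at position i divided by log2(i + 1)) and treats unjudged documents as 0; the benchmark code may use a different DCG variant, and the function and argument names are illustrative.

```python
import math

def dcg_at_k(ratings, k=10):
    """DCG over the first k ratings; position 1 has a discount of log2(2) = 1."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(ratings[:k]))

def average_dcg(results_by_task, judgments, k=10):
    """Mean DCG@k across tasks.

    results_by_task: {task: [doc ids in ranked order]}
    judgments: {(task, doc): 2 | 1 | 0}; missing pairs count as 0.
    """
    total = 0.0
    for task, ranked_docs in results_by_task.items():
        ratings = [judgments.get((task, doc), 0) for doc in ranked_docs]
        total += dcg_at_k(ratings, k)
    return total / len(results_by_task)
```

A very relevant document in position 1 contributes a full 2 points, while the same document in position 10 contributes about 0.58, so the metric rewards engines that rank judged documents near the top.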

Performance and Relevancy marks on the TREC OHSUMED Data Set; Lucene is the smallest, most relevant and fastest to search; Xapian is very close to Lucene on the search side but 3x slower on indexing and 4x bigger in index space; zettair is the fastest indexer.

Performance and Relevancy marks on the TREC-9 across select vertical search solutions.

With this larger data set (3x larger than the Twitter one), we see zettair’s indexing performance improve (makes sense as it’s designed more for larger corpora); zettair’s search speed should probably be a bit faster because its search command line utility prints some unnecessary stats. For multi-searching in sphinx, I developed a Java client (with the hopes of making it competitive with Lucene, the one to beat) which connects to the sphinx searchd server via a socket (that’s their API model in the examples). sphinx returned searches the fastest, ~3x faster than Lucene. Its indexing time was also on par with zettair. Lucene obtained the highest relevance and smallest index size. Its index time could probably be improved by fiddling with its merge parameters, but I wanted to avoid numerical adjustments in this evaluation. Xapian has very similar search performance to Lucene but with significant indexing costs (more than 3x in both time and space). sqlite has the worst relevance because it doesn’t sort by relevance, nor does it seem to provide an ORDER BY function to do so.
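To make the sqlite point concrete, here is a small sketch of an OR’ed full-text query against an FTS3 table. It assumes the sqlite library behind Python’s sqlite3 module was built with FTS3 (as noted earlier, that is not always the case), and the table name `docs`, column name `body`, and helper names are made up for illustration. It is not the benchmark code.

```python
import re
import sqlite3

def build_index(texts):
    """Create an in-memory FTS3 table and load one row per document.
    Requires an sqlite build with FTS3 enabled."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE docs USING fts3(body)")
    conn.executemany(
        "INSERT INTO docs(docid, body) VALUES (?, ?)",
        list(enumerate(texts, start=1)),
    )
    return conn

def or_search(conn, task):
    """Run the task as an OR'ed MATCH query and return matching docids.
    There is no relevance score here: rows come back in whatever order
    the engine yields them, not ranked by how well they match."""
    query = " OR ".join(re.findall(r"\w+", task.lower()))
    rows = conn.execute("SELECT docid FROM docs WHERE body MATCH ?", (query,))
    return [row[0] for row in rows]

if __name__ == "__main__":
    conn = build_index([
        "sickle cell disease in adults",
        "cell biology overview",
        "cardiac surgery outcomes",
    ])
    print(or_search(conn, "sickle cell"))
```

A document matching both terms and a document matching only one are treated the same way by this query, which is why a DCG-style metric penalizes it.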

Conclusion & Downloads

Based on these preliminary results and anecdotal information I’ve collected from the web and people in the field (with more emphasis on the latter), I would probably recommend Lucene for many vertical search indexing applications, especially if you need something that runs decently well out of the box (as that’s what I’m mainly evaluating here) and has community support. Note that Lucene is an IR library; use a wrapper platform like Solr w/ Nutch if you need all the search dressings like snippets, crawlers, and servlets.

Keep in mind that these experiments are still very early (done on a weekend budget) and can/should be improved greatly with bigger and better data sets, tuned implementations, and community support (I’d be the first one to say these are far from perfect, so I open sourced my code below). It’s pretty hard to make a benchmark that everybody likes (especially in this space where there haven’t really been many … and I’m starting to see why :)), not necessarily because there are always winners/losers and biases in benchmarks, but because there are so many different types of data sets, platform APIs, and tuning parameters (at least databases support SQL!). This is just a start. I see this as a very evolutionary project that requires community support to get it right. Take the results here for what they’re worth and still run your own tuned benchmarks.

To encourage further search development and benchmarks, I’ve open sourced all the code here:

http://github.com/zooie/opensearch/tree/master

Happy to post any new and interesting results.


TweetNews (Real-Time Search) Is Back

Update: Twitter’s Search API seems to time out quite a bit, so many search results don’t get any tweets linked. Try again later or refer to the screenshots below. Also, delicious.com is now testing an early version of this model for its homepage ranking.

Here it is: tweetnews.appspot.com

And an example query: yahoo

About six months ago I released a simple 100-line search application called TweetNews, which basically links tweets to the freshest Yahoo! News articles. The more related tweets an article has, the higher its rank. The tweet count and messages are presented underneath each result so that a user can read the social commentary inline with the article listing. It was developed more to demonstrate the openness and power of Yahoo! BOSS (you can read more about it in my previous posts here and here). Remarkably, many users found the service useful despite its slow performance, barebones UI, and lack of a homepage, domain, you name it.

Interestingly, the TweetNews concept has been popping up in my recent discussions around real-time search, so I felt it was about time to polish up TweetNews to serve as a better proof of concept.

Here are some of the new features:

  • Sweet UI (kudos to Kara McCain & Aaron Wheeler for the awesome design and template)
  • Continually Updated, Fresh Homepage (aggregates & ranks feeds like Techmeme, Delicious, Digg)
  • Faster Performance
  • Improved Algorithm
  • Local Views (re-rank & link tweets from a select region)


Here’s a screenshot of the homepage:

TweetNews Homepage


And here’s an example of Local Views:

London’s View of ‘iphone’

TweetNews IPhone (London Ranking)

Los Angeles’ View of ‘iphone’

TweetNews IPhone (Los Angeles Ranking)

Striking difference between Americans (actually just SoCal) and the British right there :)

I think the Local Views concept is pretty promising, although there’s plenty of room for improvement (use BOSS region filters, access Twitter’s Firehose Feed for more granularity, etc.).

Which is why, like I did with the last version, I plan to open source all the code powering this application (I just need a little more time to get it reviewed).

Interestingly, the homepage system in this package is very general. Just pass it any list of RSS feeds and it’ll do the clustering, tweet linking, ranking, and page generation automatically every X minutes for you. Anyone want a fresh, personalized Techmeme? Let me know if that sounds interesting.

Please keep in mind that this is still a simple, early prototype to show how one can use BOSS to experiment with very interesting data sources like Twitter to tackle big problems like real-time search.


Twitter + BOSS = Real Time Search

Try it: yahoo

Update: (6/25) This application has been updated. Go here to learn more. The description below though still applies.

Update: (6/11) In case you’re bored, here’s a discussion we had with Google and Twitter about Open & Real-time Search.

Update: (1/19) If you have issues try again in 5-10 minutes. You can also check out the screenshots below. (1/15) App Engine limits were reached (and fast). Appreciate the love and my apologies for not fully anticipating that. Google was nice enough though to temporarily raise the quota for this application. Anyways, this was more to show a cool BOSS developer example using code libraries I released earlier, but there might be more here. Stay tuned.

Here’s a screenshot as well (which should hopefully be stale by the time you read this).

Basically, this service boosts Yahoo’s freshest news search results (which typically don’t have much relevance since they are ordered by timestamp and that’s it) based on how similar they are to the emerging topics found on Twitter for the same query. In other words, it uses Twitter to determine authority for content that doesn’t yet have links because it is so fresh. It also overlays related tweets via an AJAX expando button (big thanks to Greg Walloch at Yahoo! for the design) under results if they exist. A nice added feature of the overlay is near-duplicate removal, which ensures the message threads on any given result provide as much comment diversity as possible.
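As a rough illustration of the near-duplicate removal idea, here is one simple way it could be done: keep a tweet only if its word set is not too similar to any tweet already kept. The Jaccard measure, the 0.7 threshold, and the function names are assumptions made for this sketch; the service’s actual method may differ.

```python
import re

def word_set(text):
    """Lowercased set of word tokens in a message."""
    return set(re.findall(r"\w+", text.lower()))

def jaccard(a, b):
    """Size of the intersection over size of the union of two word sets."""
    if not a and not b:
        return 1.0  # two empty messages count as identical
    return len(a & b) / len(a | b)

def drop_near_duplicates(tweets, threshold=0.7):
    """Greedy filter: walk the tweets in order and keep one only if it is
    below the similarity threshold against every tweet kept so far."""
    kept = []
    kept_sets = []
    for tweet in tweets:
        words = word_set(tweet)
        if all(jaccard(words, seen) < threshold for seen in kept_sets):
            kept.append(tweet)
            kept_sets.append(words)
    return kept

if __name__ == "__main__":
    thread = [
        "RT @a: Kings win in triple OT",
        "Kings win in triple OT",
        "what a game tonight",
    ]
    for tweet in drop_near_duplicates(thread):
        print(tweet)
```

The greedy pass keeps the first copy of a message and drops later retweets or light rewordings, which is usually what you want when the input is already ordered by time.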

Freshness (especially in the context of search) is a challenging problem. Traditional PageRank-style algorithms don’t really work here, as it takes time for a fresh URL to garner enough links to beat an older, high-ranking URL. One approach is to use cluster sizes as a feature for measuring the popularity of a story (e.g. Google News). Although quite effective IMO, this may not always be fast enough: for the cluster size to grow, other sources have to write about the same story. Traditional media can be slow, however, especially on local topics.

I remember when I saw breaking Twitter messages describing the California Wildfires. When I searched Google/Yahoo/Microsoft right at that moment I barely got anything (< 5 results spanning 3 search results pages). I had a similar episode when I searched on the Mumbai attacks. Specifically, the Twitter messages were providing incredible focus on the important subtopics that had yet to become popular in the traditional media and news search worlds. What I found most interesting in both of these cases was that news articles did exist on these topics, but just weren’t valued highly enough yet or weren’t focusing on the right stories (as the majority of tweets were).

So why not just do that? Order these fresh news articles (which mostly provide authority and in-depth coverage) based on the number of related fresh tweets, and show the tweets under each. That’s this service.
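The ordering idea can be sketched in a few lines. The version below assumes that “related” means a tweet contains at least half of the words in an article’s title; that rule, the 0.5 threshold, and the function names are placeholders for illustration rather than the logic the service actually uses. Articles are assumed to arrive newest first, and the sort is stable, so ties keep their freshness order.

```python
import re

def word_set(text):
    """Lowercased set of word tokens in a title or tweet."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def is_related(title, tweet, threshold=0.5):
    """Assumed relatedness rule: the tweet covers at least `threshold`
    of the title's distinct words."""
    title_words = word_set(title)
    if not title_words:
        return False
    return len(title_words & word_set(tweet)) / len(title_words) >= threshold

def rank_by_tweets(titles, tweets, threshold=0.5):
    """Return (title, related_tweets) pairs, most-tweeted article first.
    `titles` is assumed to be ordered newest first."""
    linked = [
        (title, [t for t in tweets if is_related(title, t, threshold)])
        for title in titles
    ]
    return sorted(linked, key=lambda pair: len(pair[1]), reverse=True)

if __name__ == "__main__":
    titles = [
        "Local league announces schedule",
        "Kings beat Warriors in triple overtime",
    ]
    tweets = [
        "wow kings warriors triple overtime!!",
        "Kings beat Warriors, what a game",
        "lunch time",
    ]
    for title, related in rank_by_tweets(titles, tweets):
        print(len(related), title)
```

Returning the matched tweets alongside each title keeps the data needed to show the commentary under each result, as described above.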

To illustrate the need, here’s a quick before and after shot. I searched for ‘nba’ using Yahoo’s news search ordered by latest results (first image). Very fresh (within a minute) but subpar quality. The first result talks about teams that are in a different league of basketball than the NBA. However, search for ‘nba’ on TweetNews (second image) and you get the Kings/Warriors triple OT game highlight which was buzzing more in Twitter at that minute.

'NBA' on Y! News latest

'NBA' on Y! News latest enhanced by Twitter

'NBA' on TweetNews

There’s something very interesting here … Twitter as a ranking signal for search freshness may prove to be very useful if constructed properly. Definitely deserves more exploration – hence this service, which took < 100 lines of code to represent all the search logic thanks to Yahoo! BOSS, Twitter’s API, and the BOSS Mashup Framework.

To sum up, the contributions of this service are: (1) Real-time search + freshness (2) Stitching social commentary to authoritative sources of information (3) Another (hopefully cool) BOSS example.

The code is packaged for general open consumption and has been ported to run on App Engine (which powers this service actually). You can download all the source here.
