5 Things You Should Know Before Starting an Enterprise Company

Just posted a guest article on The Next Web on some of the key startup learnings my team and I have picked up while building up our company Infer. Although our company is emerging and in the enterprise space, I think you’ll find many of these insights to be broadly applicable.

Filed under Enterprise, Entrepreneurship, Startups, Venture Capital

Taking Seed Money from VCs Is A Risk Worth Taking

Here’s the link to a guest article I wrote for VentureBeat arguing the benefits of including VCs early on, and explaining why the VC “signaling effect” (negative or positive) is sometimes a good thing for entrepreneurs to experience.

Filed under Blog Stuff, Startups, VC

Infer – Partying on Business Data

Today, my co-founders and I are extremely excited to launch our company Infer. We’re applying consumer smarts (a la the science of Google) to business, specifically to help companies win more customers. We’ve been able to deliver consistent lift across the board for our customers. Learn more about what all this means and how we do it here.

Filed under Uncategorized

Betting on UFC Fights – A Statistical Data Analysis

Mixed Martial Arts (MMA) is an incredibly entertaining and technical sport to watch. It’s become one of the fastest growing sports in the world. I’ve been following MMA organizations like the Ultimate Fighting Championship (UFC) for almost eight years now, and in that time have developed a great appreciation for MMA techniques. After watching dozens of fights, you begin to pick up on what moves win and when, and spot strengths and weaknesses in certain fighters. However, I’ve always wanted to test my knowledge against the actual stats – like do accomplished wrestlers really beat fighters with little wrestling experience?

To do this, we need fight data, so I crawled and parsed all the MMA fights from Sherdog.com. This data includes fighter profiles (birth date, weight, height, disciplines, training camp, location) and fight records (challenger, opponent, time, round, outcome, event). After some basic data cleaning, I had a dataset of 11,886 fight records, 1,390 of which correspond to the UFC.

I then trained a random forest classifier on this data to see if a state-of-the-art machine learning model could identify any winning and losing characteristics. Over cross-validation with 10 folds, the resulting model scored a surprisingly decent AUC of 0.69; an AUC closer to 0.5 would indicate that the model can’t predict winning fights any better than random guessing or fair coin flips.
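To make the AUC number concrete, here is a minimal sketch of the metric itself in plain Python. It is not the random forest or the cross-validation setup used above; the function name `auc` and the toy labels and scores are assumptions made up for illustration. AUC can be read as the probability that a randomly chosen winner gets a higher model score than a randomly chosen loser, with ties counting as half.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    labels: 1 for a win, 0 for a loss. scores: model scores, higher = more
    likely to win. Ties between a positive and a negative count as half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data, not taken from the Sherdog dataset.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
print(auc(labels, scores))
```

A model that gives every fighter the same score lands at exactly 0.5 under this definition, which is the coin-flip baseline mentioned above.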

So there may be interesting patterns in this data … Feeling motivated, I ran exhaustive searches over the data to find feature combinations that indicate winning or losing behaviors. Many hours later, I had found several dozen such insights.

Here are the most interesting ones (stars indicate statistical significance at the 5% level):

Top UFC Insights

Fighters older than 32 years of age will more likely lose

This was validated in 173 out of 277 (62%) fights*

Fighters with more than 6 TKO victories fighting opponents older than 32 years of age will more likely win

This was validated in 47 out of 60 (78%) fights*

Fighters from Japan will more likely lose

This was validated in 36 out of 51 (71%) fights*

Fighters who have lost 2 or more KOs will more likely lose

This was validated in 54 out of 84 (64%) fights*

Fighters with 3x or more decision wins and are greater than 3% taller than their opponents will more likely win

This was validated in 32 out of 38 (84%) fights*

Fighters who have won 3x or more decisions than their opponent will more likely win

This was validated in 142 out of 235 (60%) fights*

Fighters with no wrestling background facing fighters who do have one will more likely lose

This was validated in 136 out of 212 (64%) fights*

Fighters fighting opponents with 3x or less decision wins and are on a 6 fight (or better) winning streak more likely win

This was validated in 30 out of 39 (77%) fights*

Fighters younger than their opponents by 3 or more years in age will more likely win

This was validated in 324 out of 556 (58%) fights*

Fighters who haven’t fought in more than 210 days will more likely lose

This was validated in 162 out of 276 (59%) fights*

Fighters taller than their opponents by 3% will more likely win

This was validated in 159 out of 274 (58%) fights*

Fighters who have lost less by submission than their opponents will more likely win

This was validated in 295 out of 522 (57%) fights*

Fighters who have lost 6 or more fights will more likely lose

This was validated in 172 out of 291 (60%) fights*

Fighters who have 18 or more wins and never had a 2 fight losing streak more likely win

This was validated in 79 out of 126 (63%) fights*

Fighters who have lost back to back fights will more likely lose

This was validated in 514 out of 906 (57%) fights*

Fighters with 0 TKO victories will more likely lose

This was validated in 90 out of 164 (55%) fights

Fighters fighting opponents out of Greg Jackson’s camp will more likely lose

This was validated in 38 out of 63 (60%) fights

 

Top Insights over All Fights

Fighters with 15 or more wins that have 50% less losses than their opponents will more likely win

This was validated in 239 out of 307 (78%) fights*

Fighters fighting American opponents will more likely win

This was validated in 803 out of 1303 (62%) fights*

Fighters with 2x more (or better) wins than their opponents and those opponents lost their last fights will more likely win

This was validated in 709 out of 1049 (68%) fights*

Fighters who’ve lost their last 4 fights in a row will more likely lose

This was validated in 345 out of 501 (68%) fights*

Fighters currently on a 5 fight (or better) winning streak will more likely win

This was validated in 1797 out of 2960 (61%) fights*

Fighters with 3x or more wins than their opponents will more likely win

This was validated in 2831 out of 4764 (59%) fights*

Fighters who have lost 7 or more times will more likely lose

This was validated in 2551 out of 4547 (56%) fights*

Fighters with no jiu jitsu in their background facing fighters who do have it will more likely lose

This was validated in 334 out of 568 (59%) fights*

Fighters who have lost by submission 5 or more times will more likely lose

This was validated in 1166 out of 1982 (59%) fights*

Fighters in the Middleweight division who fought their last fight more recently will more likely win

This was validated in 272 out of 446 (61%) fights*

Fighters in the Lightweight division fighting 6 foot tall fighters (or higher) will more likely win

This was validated in 50 out of 83 (60%) fights

 

Note – I separated UFC fights from all fights because regulations and rules can vary across MMA organizations.

Most of these insights are intuitive except for maybe the last one and an earlier one which states 77% of the time fighters beat opponents who are on 6 fight or better winning streaks but have 3x less decision wins.
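As a rough check on the stars, here is a sketch of one way to test a "validated in k out of n fights" count against a fair coin. This is an assumption about the kind of test involved, not necessarily the one used to produce the stars above; the function name `binomial_two_sided_p` is made up for illustration.

```python
from math import comb


def binomial_two_sided_p(successes, trials):
    """Exact two-sided p-value for H0: the rule is right 50% of the time.

    Because the fair-coin distribution is symmetric, the two-sided p-value
    is twice the tail probability of the more extreme side, capped at 1.
    """
    k = max(successes, trials - successes)
    upper_tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * upper_tail)


# Counts taken from the lists above.
print(binomial_two_sided_p(173, 277))  # fighters older than 32 (starred)
print(binomial_two_sided_p(90, 164))   # fighters with 0 TKO victories (no star)
```

Under this test the first rule falls below the 5% threshold and the second does not, which matches how those two lines are starred.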

Many of these insights demonstrate statistically significant winning biases. I couldn’t help but wonder – could we use these insights to effectively bet on UFC fights? For the sake of simplicity, what happens if we make bets based on just the very first insight which states that fighters older than 32 years old will more likely lose (with a 62% chance)?

To evaluate this betting rule, I pulled the most recent UFC fights where in each fight there’s a fighter that’s at least 33 years old. I found 52 such fights, spanning 2/5/2011 – 8/14/2011. I placed a $10K bet on the younger fighter in each of these fights.

Surprisingly, this rule calls 33 of these 52 fights correctly (63% – very close to the rule’s observed 62% overall win rate). Each fight called incorrectly results in a loss of $10,000, and for each of the fights called correctly I obtained the corresponding Bodog money line (betting odds) to compute the actual winning amount.

I’ve compiled the betting data for these fights in this Google spreadsheet.

Note, for 6 of the fights that our rule called correctly, the money lines favored the losing fighters.

Let’s compute the overall return of our simple betting rule:

For each of these 52 fights, we risked $10,000, or in all $520,000
We lost 19 times, or a total of $190,000
Based on the betting odds of the 33 fights we called correctly (see spreadsheet), we won $255,565.44
Profit = $255,565.44 - $190,000 = $65,565.44
Return on investment (ROI) = 100 * 65,565.44 / 520,000 = 12.6%
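The arithmetic above can be reproduced with a short sketch. The function names are made up for illustration, and `moneyline_profit` assumes the standard American money line convention (a positive line is the profit on a $100 stake, a negative line is the stake needed to win $100). The per-fight lines live in the spreadsheet, so only the totals quoted above are used here.

```python
def moneyline_profit(stake, line):
    """Profit on a winning bet under an American money line."""
    if line == 0:
        raise ValueError("a money line of 0 is not valid")
    if line > 0:
        return stake * line / 100.0   # e.g. +150: win $150 per $100 staked
    return stake * 100.0 / abs(line)  # e.g. -200: stake $200 to win $100


def roi_percent(total_won, total_lost, total_staked):
    """Return on investment as a percentage of everything put at risk."""
    return 100.0 * (total_won - total_lost) / total_staked


stake = 10_000
total_staked = 52 * stake   # $520,000 risked across 52 fights
total_lost = 19 * stake     # $190,000 lost on the 19 incorrect calls
total_won = 255_565.44      # winnings on the 33 correct calls (from the spreadsheet)

print(round(roi_percent(total_won, total_lost, total_staked), 1))  # 12.6
```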

 

That’s a very decent return.

For kicks, let’s compare this to investing in the stock market over the same period of time. If we buy the S&P 500 with a conventional dollar cost averaging strategy to spread out the $520,000 investment, then we get a ROI of -7.31%. Ouch.
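For readers unfamiliar with dollar cost averaging, here is a minimal sketch of how such a return can be computed: the budget is split into equal dollar amounts, one per purchase date, and the resulting shares are valued at the final price. The function name and the prices below are hypothetical toy numbers, not actual S&P 500 data, and this is not necessarily the exact calculation behind the figure above.

```python
def dca_roi_percent(purchase_prices, final_price, total_budget):
    """ROI of investing equal dollar amounts at each purchase price."""
    per_purchase = total_budget / len(purchase_prices)
    shares = sum(per_purchase / price for price in purchase_prices)
    final_value = shares * final_price
    return 100.0 * (final_value - total_budget) / total_budget


# Hypothetical prices: buy $100 at 100.0 and $100 at 50.0, then value at 100.0.
print(dca_roi_percent([100.0, 50.0], final_price=100.0, total_budget=200.0))  # 50.0
```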

Keep in mind that we’re using a simple betting rule that’s based on a single insight. The random forest model, which optimizes over many insights, should predict better and be applicable to more fights.

Please note that I’m just poking fun at stocks – I’m not saying betting on UFC fights with this rule is a more sound investment strategy (risk should be thoroughly examined – the variance of the performance of the rule should be evaluated over many periods of time).

The main goal here is to demonstrate the effectiveness of data driven approaches for better understanding the patterns in a sport like MMA. The UFC could leverage these data mining approaches for coming up with fairer matches (dismiss fights that match obvious winning and losing biases). I don’t favor this, but given many fans want to see knockouts, the UFC could even use these approaches to design fights that will likely avoid decisions or submissions.

Anyways, there’s so much more analysis I’ve done (and haven’t done) over this data. Will post more results when cycles permit. Stay tuned.

Filed under AI, Blog Stuff, Computer Science, Data Mining, Economics, Machine Learning, Research, Science, Statistics, Trends

Ranking High Schools Based On Outcomes

High school is arguably the most important phase of your education. Some families will move just to be in the district of the best ranked high school in the area. However, the factors that these rankings are based on, such as test scores, tuition amount, average class size, teacher to student ratio, location, etc. do not measure key outcomes such as what colleges or jobs the students get into.

Unfortunately, measuring outcomes is tough – there’s no data source that I know of that describes how all past high school students ended up. However, I thought it would be a fun experiment to approximate this using LinkedIn data. I took eight top high schools in the Bay Area (see the table below) and ran a whole bunch of advanced LinkedIn search queries to find graduates from these high schools, while also counting up their key outcomes: what colleges they graduated from, what companies they went on to work for, what industries they are in, what job titles they have earned, etc.

The results are quite interesting. Here are a few statistics:

College Statistics

  • The top 5 high schools that have the largest share of users going to top private schools (Ivy League + Stanford + Caltech + MIT) are (1) Harker (2) Gunn (3) Saratoga (4) Lynbrook (5) Bellarmine.
  • The top 5 high schools that have the largest share of users going to the top 3 UC’s (Berkeley, LA, San Diego) are (1) Mission (2) Gunn (3) Saratoga (4) Lynbrook (5) Leland.
  • Although Harker has the highest share of users going to top privates (30%), its share of users going to the top UC’s is below average. It’s worth noting that Harker’s tuition is the highest at $36K a year.
  • Bellarmine, an all men’s high school with tuition of $15K a year, is below average in its share of users going on to top private universities as well as to the UC system.
  • Gunn has the highest share of users (11%) going on to Stanford. That’s more than 2x the second place high school (Harker).
  • Mission has the highest share of users (31%) going to the top 3 UC’s and to UC Berkeley alone (14%).

Career Statistics

  • In rank order, (1) Saratoga (2) Bellarmine (3) Leland have the biggest share of users who hold job titles that allude to leadership positions (CEO, VP, Manager, etc.).
  • The highest share of lawyers come from (1) Bellarmine (2) Lynbrook (3) Leland. Gunn has 0 lawyers and Harker is second lowest at 6%.
  • Saratoga has the best overall balance of users in each industry (median share of users).
  • Hardware is fading – 5 schools (Leland, Gunn,  Harker, Mission, Lynbrook) have zero users in this industry.
  • Harker has the highest share of its users in the Internet, Financial, and Medical industries.
  • Harker has the lowest percentage of Engineers and below average share of users in the Software industry.
  • Gunn has the highest share of users in the Software and Media industries.
  • Harker high school is relatively new (formed in 1998), so its graduates are still early in the workforce. Leadership takes time to earn, so the leadership statistic is unfairly biased against Harker.

You can see all the stats I collected in the table below. Keep in mind that percentages correspond to the share of users from the high school that match that column’s criteria. Yellow highlights correspond to the best score; blue shaded boxes correspond to scores that are above average. There are quite a few caveats which I’ll note in more detail later, so take these results with a grain of salt. However, as someone who grew up in the Bay Area, I will say that many of these results make sense to me.

Filed under Blog Stuff, Data Mining, Education, Job Stuff, LinkedIn, Research, Science, Social, Statistics

An Evaluation of Google’s Realtime Search

How timely are the results returned from Google’s Realtime (RT) Search Engine? How often do Twitter results appear in these results? Over the weekend I developed a few basic experiments to find out and published the results below.

Key Findings

  • For location-based queries, there’s nearly a flip of a coin chance (43%) that a Twitter result will be the #1 ranked result.
  • For general knowledge queries, there’s a 23% chance that a Twitter result will be #1.
  • The newest Twitter results are usually 4 seconds old. The newest Web results are 10x older (41 seconds).
  • A top ranking Twitter result for a location-based query is usually 2 minutes old (compared with Web which is 22 minutes old – again nearly 10x older).
  • When Twitter results appear, at least one of them is in the top ranked position.

Experiment #1 – General Knowledge

I crawled 1,370 article titles from Wikipedia and ran each title as a query into Google RT search.

Market Shares

81% of all queries returned search results that included web page results
23% of all queries returned search results that included Twitter results
7% of all queries returned 0 search results

70% of all queries had a web page result in the #1 ranked position
When Twitter results appeared there was always at least one result in the #1 ranked position (so 23% of queries)

Time Lag

When a web page was the #1 ranked result, that result on average was 6736 seconds (or 1 hr and 52 minutes) old.
When a Tweet was the #1 ranked result, that result on average was 261 seconds (or 4 minutes and 21 seconds) old.

The average age of the top 10% newest web page results (across all queries) is 41 seconds
The average age of the top 10% newest Twitter results (across all queries) is 2 seconds

Tail

Query length was between 1 – 12 words (where 1-2 word long queries are most popular)
Worth noting that no Twitter results appear for queries with greater than 5 words

Experiment #2 – Location

I crawled 265 major populated U.S. cities from the U.S. Census Bureau and ran each city name as a query into Google RT search.

Market Shares

73% of all queries returned search results that included web page results
43% of all queries returned search results that included Twitter results
5% of all queries returned 0 search results

52% of all queries had a web page result in the #1 ranked position
When Twitter results appeared there was always at least one result in the #1 ranked position (so 43% of queries)

Time Lag

When a web page was the #1 ranked result, that result on average was 1341 seconds (or 22 minutes and 21 seconds) old.
When a Tweet was the #1 ranked result, that result on average was 138 seconds (or 2 minutes and 18 seconds) old.

The average age of the top 10% newest web page results (across all queries) is 41 seconds
The average age of the top 10% newest Twitter results (across all queries) is 4 seconds

Tail

Query length was between 1 – 3 words
Worth noting that no Twitter results appear for 3 word long queries
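The market share and time lag numbers in both experiments boil down to two small aggregations, sketched below. The record layout, the function names, and the sample values are hypothetical, and "newest 10%" is read here as the smallest tenth of all observed ages, which is one plausible interpretation of the statistic rather than a confirmed definition.

```python
from statistics import mean


def share(results, predicate):
    """Percentage of queries whose result record satisfies predicate."""
    return 100.0 * sum(1 for r in results if predicate(r)) / len(results)


def newest_decile_mean_age(ages_seconds):
    """Mean age of the newest 10% of results (always at least one result)."""
    ordered = sorted(ages_seconds)
    count = max(1, len(ordered) // 10)
    return mean(ordered[:count])


# Hypothetical records: the source of the #1 result (None if the query
# returned nothing) and that result's age in seconds.
results = [
    {"top_source": "twitter", "top_age": 4},
    {"top_source": "web", "top_age": 1341},
    {"top_source": "web", "top_age": 900},
    {"top_source": None, "top_age": None},
]

twitter_first = share(results, lambda r: r["top_source"] == "twitter")
ages = [r["top_age"] for r in results if r["top_age"] is not None]
print(twitter_first, newest_decile_mean_age(ages))
```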

Implementation Details

  • Generated Wiki queries by running “site:en.wikipedia.org” searches on Google and Blekko, and extracting the titles (en.wikipedia.org/{title_is_here}) from the result links. Side point: I tried Bing but the result links had mostly one word long titles (Bing seems to really bias query length in their ranking) and I wanted more diversity to test out tail queries.
  • Crawled cities (for the location-based queries) from http://www.census.gov/popest/cities/tables/SUB-EST2009-01.csv
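The title extraction step can be sketched with the standard library alone. This assumes result links of the usual form `https://en.wikipedia.org/wiki/Some_title`; the function names are made up for illustration and the crawling of the search results themselves is left out.

```python
from urllib.parse import unquote, urlparse


def title_from_url(url):
    """Turn an English Wikipedia article link into a plain-text query.

    Returns None for links on any other host or with no title segment.
    """
    parsed = urlparse(url)
    if parsed.netloc != "en.wikipedia.org":
        return None
    slug = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    return unquote(slug).replace("_", " ") or None


def word_count(query):
    """Number of whitespace-separated words, used to bucket tail queries."""
    return len(query.split())


query = title_from_url("https://en.wikipedia.org/wiki/Mixed_martial_arts")
print(query, word_count(query))  # Mixed martial arts 3
```

Counting words per title makes it easy to check that the query set covers both short head queries and longer tail queries.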

Caveats

  • I ran these experiments at 2:45 a.m. PST on a Monday. The location-based queries all relate to the U.S., so there were probably not many people up at that time generating up-to-date information. The time lag stats could vary depending on when these experiments are run. I did, however, re-run the experiments in the late morning and didn’t see much difference in the timings.
  • I ran all queries through Google’s normal web search engine with ‘Latest’ on (in the left bar under Search Tools). These results are not exactly the same as those generated from the standalone Google Realtime Search portal, which seems to favor Tweets more, while the ‘Latest’ results seem to find a middle ground between real-time Twitter results and web page results. I used ‘Latest’ because it seems like it would be the most popular gateway to Google’s Realtime search results.

Filed under Blog Stuff, Computer Science, Data Mining, Google, Information Retrieval, Research, Search, Social, Statistics, Twitter, Wikipedia

Does Facebook leak what profiles you click on?

Check out Preview My Profile on Facebook:

Account (top right) > Privacy Settings >

Customize Settings > Preview My Profile

Now say you have a friend named Bob. Type ‘Bob’ in the box at the top of Preview My Profile to see how your profile will be seen by him. Take a look at the Mutual Friends section (bottom left in the screenshot above) of your profile (from Bob’s view – so still in Preview My Profile). Notice how these mutual friends seem to bias towards those who are closest to Bob (and perhaps to you as well). This by itself is pretty interesting. I can see who my friends are closer to relative to our other mutual friends. This pattern seems to hold up well in my trials over my friends who I know well (I saw that their closest friends were popping up more often than not in the mutual friends section).

This got me curious about how Facebook determines “closeness” between two people. In particular, does Facebook leverage your clicks on a friend’s profile in determining how close you are to that friend? To experiment, I frequently clicked on my friend’s (say her name is Alice) profile and newsfeed updates over two weeks. She’s someone I rarely communicate with. I then normally browsed profiles of mutual friends I share with Alice and noticed that in the mutual friends section of those profiles Alice frequently showed up (even when the total number of mutual friends was greater than 80 – keep in mind that the mutual friends section only shows 3 friends). Now, there’s definitely randomness at times and I believe multiple ranking features are probably being used here (like perhaps number of exchanged messages) but I have a feeling clicks might be in play here as well based on this result.

If Preview My Profile gives you the same view over mutual friends as what you see normally when you click on a friend’s profile, and if mutual friends uses private information like clicks / messages as features in the ranking, then it may be possible to infer who your friends are communicating with or clicking on more – or at the very least, find who they are closer to relative to your other mutual friends. If I view my profile from Bob’s eyes and frequently see Alice appear in the Mutual Friends section over multiple runs it may imply a strong relationship from Bob to Alice. Also, when the number of mutual friends is high relative to the number of total friends your friend has, then this result may be even more accurate.

This isn’t scientific by any means – I really don’t know how the ranking is done and may be completely wrong – so take it with a grain of salt. Just thought it was an interesting feature and pattern worth sharing …

Filed under Blog Stuff, Facebook, Non-Technical-Read, Social