“Big Data” Analysis Becoming a Common Tool in the Developer’s Toolkit

The first recorded use of cryptography in warfare was by Julius Caesar in the first century BCE. He used a substitution cipher to encrypt his battle plans and personal correspondence, replacing each plaintext letter in the message with the letter a fixed number of positions further along in the alphabet. A might be written as D, B as E, C as F, and so on. Today this is often referred to as the Caesar cipher, thanks to Julius's fondness for the technique.
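The shift described above is simple enough to sketch in a few lines. Here's a minimal, illustrative Python version (the function name and the classic shift of 3 are my own choices, not anything from the historical record):

```python
def caesar_encrypt(plaintext: str, shift: int = 3) -> str:
    """Shift each letter a fixed number of places along the alphabet,
    wrapping around from Z back to A."""
    result = []
    for ch in plaintext.upper():
        if ch.isalpha():
            result.append(chr((ord(ch) - ord('A') + shift) % 26 + ord('A')))
        else:
            result.append(ch)  # leave spaces and punctuation untouched
    return ''.join(result)

print(caesar_encrypt("ABC"))  # → DEF, matching A→D, B→E, C→F
```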

Today any message encrypted using solely a substitution cipher would be viewed as laughably insecure. Decoding one is frequently a challenge given to children as homework and to adult fans of the newspaper puzzle section. However, as simple a form of security as it seems to us now, it was sufficient to keep Caesar's messages secret for nearly a thousand years. It wasn't until the 9th-century renaissance of mathematics in the Middle East that sufficient knowledge of linguistics, statistics, and frequency analysis existed all in one place for a systematic method of breaking the Caesar cipher to be developed.
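The frequency-analysis attack itself fits in even fewer lines than the cipher. A minimal sketch, assuming only that the most common letter in English text is E (real attacks compare the whole frequency distribution, not just the top letter, so this can misfire on short messages):

```python
from collections import Counter

def crack_shift(ciphertext: str) -> int:
    """Guess a Caesar shift by assuming the most frequent ciphertext
    letter stands for 'E', the most common letter in English."""
    letters = [ch for ch in ciphertext.upper() if ch.isalpha()]
    top = Counter(letters).most_common(1)[0][0]
    return (ord(top) - ord('E')) % 26

# "meet me near the tree" enciphered with a shift of 3
print(crack_shift("PHHW PH QHDU WKH WUHH"))  # → 3
```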

These convergences come a lot more frequently nowadays, what with so much of our learning driven by technological advances. I was thinking about this bit of history as I toured the presentations at last week’s “Big Data Science Fair” sponsored by Precog and GNIP. Fifteen CU graduate students from Professor Tom Yeh’s Big Data class proved this as they presented their semester projects, science fair style, along with about a dozen representatives of corporate enterprises.

“Big Data” was once inscrutable to the masses. The infrastructure to collect and store giant data sets was, until very recently, very costly. Tools to access and combine them were rudimentary. The expertise needed to analyze them, rare. But, like military cryptography eventually sharing space with the crossword puzzle, Big Data analysis is now within the grasp of any technical person.

The number of data sets available alone shows how much there is to work with now. The presenters used data as diverse as online video catalogs, recordings of multiplayer Xbox game sessions, logs from a biomedical instrument, websites for reuniting lost pets, as well as public data from Twitter and Google.

Computer science graduate student Greg Guyles captivated much of the room with his project "Aggregated Entertainment," which analyzed the Twitter mentions of big-event TV airings to reveal different tweeting patterns for different genres of shows. Walking Dead mentions consistently spike when something gruesome and violent happens. American Idol mentions peak every time a new contestant performs. His setup paired digital video with a graph of Twitter mentions for the show, so that users could scroll to see the action that Tweeters were talking about. Greg took home the "Most Impactful Insight" award for the evening, as well as the Judges' Choice award.


Dave Elchoness, CEO of TagWhat, showed off "Feed For iPhone: Deals Culled from Social Media Worldwide." The Feed iPhone App (available on the iTunes App store) automatically finds businesses that are offering deals and uses the iPhone's location services to deliver nearby ones to the user. The availability of Facebook's graph search and Twitter's public API makes it possible to sort through posts for special bargains without users having to follow specific businesses. Dave was awarded "The Creative Champion" award for the night.

There were more projects on display than I could fully explore before the awards were given out. If you want to see the full list of student and corporate projects, as well as all the award winners, you can find them on Precog’s website for the event. Clearly, with the tools of Big Data now open to all computer programmers, hobbyists, grad students, and researchers, we can expect to see even more creative and useful products every day.