Startup Colorado’s Big Data Dig In

Last week, Startup Colorado, in partnership with the University of Colorado, hosted the "Big Data Dig In," an event focused on the technical and legal issues surrounding big data. I attended on behalf of Quick Left and was treated to an afternoon of interesting panels and audience discussion. Here are some reflections on the event.

The Big Data Dig In was organized by the State of the Art ("START") Network, presented by Startup Colorado. The START Network helps host and organize events that "help participants get smarter faster," and this event was no exception. The goal of the day was to continue fostering relationships between startups, established businesses and research centers such as CU in order to further advance the tech industry in Colorado. Sustaining the pace of growth that Colorado has seen over the past few years will require participation and innovation from all parties, and Big Data is just one area of research and emerging technology that can act as a focal point for collaboration.

The afternoon was divided into three panels, each focusing on a different aspect of working with large data sets: the Analytics panel discussed emerging techniques and technologies for working with a large data set (once you've figured out how to store all those terabytes and petabytes); the Database Management panel focused on the technical limits of storing and sorting unprecedented amounts of data in a variety of applications and systems; and the Legal panel talked about the increasingly complex legal implications for startups that are amassing more and more private or sensitive information about their users.

Analytics Panel

Moderated by Kenneth Anderson, Associate Professor and Associate Chair in the Computer Science Department at CU, the panel included members of both the business and academic worlds. Mr. Anderson started by asking the participants to define Big Data in an attempt to bypass the buzzwords and marketing hype surrounding the topic. The sentiment that emerged from the panel was that the last 5 years have seen an interesting shift in the scale of operations: from 10 to 50 terabytes of data to hundreds or thousands of terabytes, and now into the range of 10 to 50 petabytes. That shift has ushered in an explosion of technologies, specifically around storage media and cloud computing, that make it possible for large-scale data sets to move from inside the enterprise to outside it. Because of tools like AWS and the rapid decline in storage costs, having an extremely large and ever-growing data set is no longer an expensive liability reserved for the CIOs of Fortune 500 companies.

"The disk will be dead in 5 years," declared Brad Cowdrey of Clear Peak, a Business Intelligence and data-solutions provider. The panel agreed that, spurred on by the rise of SSD storage, the shift has already started: hard disk is supplanting tape as the preferred medium for long-term storage, and SSD and in-memory operations are the way forward when tackling analytics problems on extremely large data sets. George Mathew of Alteryx noted that the economics of the problem have started to shift; AWS and open source projects such as Hadoop are enabling even small players to extract business intelligence from their data in ways that would have been out of reach (because of computational time, budget, or both) only a few years ago.

The panel moved on to talk about how research methods and the mathematics of big data are experiencing a renaissance because of these fundamental shifts in the industry. It was noted that it still takes significant amounts of research money to fuel advancements, but research institutions are nonetheless benefitting from the abundance of tech that supports big data. Shannon Hughes, an Assistant Professor in the Department of Electrical, Computer, and Energy Engineering at CU, discussed her recent research and noted that there has been a move towards machine learning and unsupervised approaches. This means that instead of directly analyzing or querying their data to solve problems, data scientists are writing algorithms that spot patterns and trends and that are capable of responding to, and learning from, the data itself. It was noted that one of the most efficient solutions to the "traveling salesman problem" turns out to be a genetic algorithm, and that approach is being mirrored in the academic community as applied to big data.
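
For readers who haven't seen one, here is a rough sketch of what a genetic algorithm for the traveling salesman problem can look like. Nothing like this was shown at the event; the function names, the parameters (population size, mutation rate and so on) and the choice of ordered crossover with swap mutation are my own assumptions, picked to keep the example small. It finds a reasonable tour on toy inputs and makes no claim to being the fastest approach.

```python
import math
import random


def tour_length(tour, cities):
    """Total length of the closed tour that visits cities in the given order."""
    return sum(
        math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def ordered_crossover(parent_a, parent_b, rng):
    """Copy a slice from parent_a, then fill the gaps in parent_b's order."""
    n = len(parent_a)
    start, end = sorted(rng.sample(range(n), 2))
    child = [None] * n
    child[start:end] = parent_a[start:end]
    kept = set(parent_a[start:end])
    fill = (city for city in parent_b if city not in kept)
    return [gene if gene is not None else next(fill) for gene in child]


def mutate(tour, rng, rate):
    """With probability `rate`, swap two cities in place."""
    if rng.random() < rate:
        i, j = rng.sample(range(len(tour)), 2)
        tour[i], tour[j] = tour[j], tour[i]


def evolve_tour(cities, population_size=60, generations=200,
                mutation_rate=0.2, seed=0):
    """Evolve a population of random tours and return the shortest one found."""
    rng = random.Random(seed)
    n = len(cities)
    population = [rng.sample(range(n), n) for _ in range(population_size)]
    for _ in range(generations):
        # Keep the shorter half of the population unchanged (elitist selection).
        population.sort(key=lambda t: tour_length(t, cities))
        survivors = population[: population_size // 2]
        # Breed children from random pairs of survivors to refill the population.
        children = []
        while len(survivors) + len(children) < population_size:
            a, b = rng.sample(survivors, 2)
            child = ordered_crossover(a, b, rng)
            mutate(child, rng, mutation_rate)
            children.append(child)
        population = survivors + children
    return min(population, key=lambda t: tour_length(t, cities))


if __name__ == "__main__":
    # Made-up coordinates, purely for illustration.
    cities = [(0, 0), (1, 5), (5, 2), (6, 6), (8, 3), (2, 9), (9, 9), (4, 4)]
    best = evolve_tour(cities)
    print(best, round(tour_length(best, cities), 2))
```

The point is the shape of the approach rather than the details: no one writes down a rule for what a good route looks like. Candidate tours are scored, the better ones are recombined and randomly tweaked, and over many generations the population drifts toward shorter routes.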

Database Management Panel

The second panel was again moderated by Mr. Anderson and featured members of the startup community here in Boulder as well as big businesses like Disney and SAP. This panel focused more on the practical problems of working with big data: how to store it, how to search through it effectively and how to find people who know how to solve the relevant problems.

The discussion quickly turned to the relevance of NoSQL vs. SQL technologies, and how the accuracy and effectiveness of each varies with the size and shape of the data. Sean Kelly of Disney Interactive expressed a somewhat agnostic view of the current crop of database technologies, favoring whatever has the best tooling for the host language. Greg Greenstreet of local data provider Gnip stated that his problems were with the speed of incoming data and with compliance with the terms of service of the data providers he supports (Twitter, Tumblr, WordPress); Cassandra and Java ended up being the best fit.

After some questions from the audience, the panel talked about how hard computer skills are still necessary for anyone who wants to work with big data. Given the grab-bag of technologies on the table, engineers need a fundamental understanding of the problem space at a low level, since the tools they're using are likely to keep changing and moving between languages as the problems evolve. Emerging stores like MongoDB are an example of tech that adapts to those constraints and to ergonomic considerations; developers are increasingly concerned with the APIs of the databases they use because those APIs strongly affect how they can go about solving a problem.

Legal Panel

The final panel focused on the legal considerations of housing and supplying data that is more and more likely to come from a social network and to contain sensitive or personal information. The panel was moderated by Bill Mooz and included lawyers from defense contractors, businesses and startups that specialize in dealing with information technology.

International concerns, and the differences between EU and US laws and compliance requirements, were an early theme to emerge, and they are now an early consideration for most startups when drafting privacy policies and software licenses. Privacy by design was also a topic that the panel touched on, and it is one thing that's giving new businesses an edge over larger, well-established companies; the cost of re-engineering privacy into existing technologies is often steep compared with engineering something to be reasonably secure and compliant with your Terms of Service or Privacy Policy from day 1. New companies are also able to capitalize on the work that's been done by other major players in the space in terms of legal language and enforceable clauses in their agreements with their users.

The panel also placed an emphasis on the newfound security concerns of storing users' personal information and the potential liability it poses if it's harvested or maliciously accessed. Even thumb drives can pose major problems for companies that house sensitive information; we're all aware of how easy it is for one disgruntled private to supply classified information to WikiLeaks.

Closing Thoughts

Two thoughts emerged as the most prescient of the day:

"Moore's law will break down in our lifetime," said George Mathew of Alteryx. "It's already happening," chimed in Elliot Turner of AlchemyAPI. "Advanced research methods have already outpaced it; just look at the current-generation GPU-driven parallel computation." This revelation was a bit stunning, but it makes sense. At some point in the near future, we're going to see the limits of what can be done by computers and databases from a processing and speed standpoint, even on a large scale. At the same time, the amount of data that's generated to solve business problems is only getting bigger. The next generation of solutions is going to be built on the research foundations that are being laid today, and an even larger emphasis on machine learning and unsupervised analytics will ultimately help to narrow the range of "big data" that we'll need to work with.

And to paraphrase Greg Greenstreet, "NoSQL vs. SQL is essentially a joke played on us by people with extremist views." That was by far the most interesting sentiment that emerged from the panels, and I think it's a fitting description of some of the problems and hype surrounding Big Data: the engineering problems are more or less the same anywhere in the stack, and things will get easier or harder depending on the tooling you choose to solve your problems.

The Big Data Dig In was an interesting event to attend and participate in. I had expected a bigger focus on the nuts and bolts of how to solve big data problems (and was surprised at the notable lack of any mention of high-performance key-value stores like Redis or bleeding-edge stores like RethinkDB).

The best take-away, however, was that Startup Colorado and the University of Colorado are both committed to fostering a relationship with the community at large in order to solve difficult problems and cut through the hype and FUD surrounding "hot" topics like big data. I'm looking forward to their next event!