Monthly Archives: February 2011



Day 3 at the O’Reilly Strata Conference

The third day of the Strata Conference was again packed with great sessions. The day started off with numerous keynotes. The first one was by Simon Rogers of The Guardian. Simon is not just a fabulous presenter; his work at The Guardian also offered great examples of how to tell stories with data, and of how The Guardian actually enhanced its news stories by sharing data with the public. Next up was an interesting panel discussion with Toby Segaran (Google), Amber Case (Geoloqi) and Bradford Cross (Flightcaster), moderated by Alistair Croll (Bitcurrent). The topic of discussion was Posthumans, Big Data and New Interfaces. After the discussion we had some good presentations by Ed Boyajian (EnterpriseDB) and then Barry Devlin (9sight consulting). Next was a very lively talk by DJ Patil (LinkedIn), who showed very convincingly that the success of working with big data at LinkedIn is only possible with a good team of talented people. Scott Yara (EMC) came next, and also gave a lively talk full of humor on how Your Data Rules The World. The closing keynote was from Carol McCall (Tenzing Health), who presented a serious problem with humor: how big data analytics can be used to improve US healthcare and turn it ‘from sickcare into healthcare’.

As my first session I chose a talk on Data Journalism, Applied Interfaces. Marshall Kirkpatrick (ReadWriteWeb) showed some really useful tools, like NeedleBase, that he uses for discovering stories on the Internet. He was followed by Simon Rogers of The Guardian again, who more or less continued his keynote, showing very compelling examples of how The Guardian uses data to tell stories, and how they use, for instance, Google Fusion Tables to publish much of their data. The last speaker of this session was Jer Thorp, and he absolutely blew me away with a beautiful interface he has created in Processing as an R&D project together with the New York Times. It’s called Cascade, and it shows a visual representation of how Twitter messages cascade across followers and links.

My next session was on ‘RealTime Analytics’ at Twitter, where Kevin Weil mainly explained RainBird, a project Twitter uses for various counting applications so that real-time analytics can easily be applied. The project will be open-sourced in the near future.

After the break I saw a session on AnySurface: Bringing Agent-based Simulation and Data Visualization to All Surfaces by Stephen Guerin (Santa Fe Complex). He showed how a projector and a table of sand can be used to enhance a data visualization for simulation purposes. As an example he showed us how projecting agent-based models and emergent phenomena in complex system dynamics can help firefighters simulate bottlenecks in escape routes. It was also very cool to see that many of his simulations are built in Processing. Next up was a session by Creve Maples (Event Horizon), and I really liked the first part of his talk, because he had a very good story on how we should keep the human brain’s capacity for processing information in mind when designing products and tools. It was really good to hear such a strong emphasis on this. The last part of his talk was mainly about some of the 3D visualizations he has done in the past that were very successful for his company, but it didn’t strike me as much as the first half.

The session on Data as Art by J.J. Toothman (NASA Ames Research Center) was a good and fun talk with many examples of infographics and visualizations. I had already seen most of them myself, but some were new. It was a great talk with lots of eye-candy. The final talk of the conference I saw was about Predicting the Future: Anticipating the World with Analytics. Three speakers gave their vision on how they do that. Christopher Ahlberg (Recorded Future) showed how his company uses time-related hints (like the mention of the word ‘tomorrow’) in existing content on the Internet to more or less predict the future. Robert McGrew (Palantir Technologies) showed how analyzing many large datasets in combination with human analysis can be used to perform effective fraud and crime prediction. Finally Rion Snow (Twitter) showed research indicating that analyzing tweets can be used effectively for stock market prediction (3 days ahead!), flu and virus spread prediction, and UK election result prediction (more accurate than exit polls). The predictive power of analyzing the Twitter crowd was really stunning.

This concluded the O’Reilly Strata Conference. The conference was fantastic, the sessions were great, and meeting all these people was probably the best part of all!

Day 2 at the O’Reilly Strata Conference

After a day of tutorials, the second day at Strata was the first of two conference days, packed with fascinating sessions. The day was kicked off with a plenary session featuring a long list of top speakers in the field of data science: Edd Dumbill of O’Reilly Media, Alistair Croll of Bitcurrent, Hilary Mason of bit.ly, James Powell of Thomson Reuters, Mark Madsen of Third Nature, Werner Vogels of Amazon.com, Zane Adam of Microsoft Corp, Abhishek Mehta of Tresata, Mike Olson of Cloudera, Rod Smith of IBM Emerging Internet Technologies and last but not least Anthony Goldbloom of Kaggle. Various topics were covered in presentations of 10 minutes each, like data without limits, data marketplace, and the mythology of big data. The shortest presentation struck me most: “the $3 Million Heritage Health Prize”, presented by Anthony Goldbloom. People are challenged to create a predictive application that uses healthcare data to predict which people are most likely to go to hospital, so that ‘US healthcare becomes healthcare instead of sickcare’. The prize is $3 Million for the one who solves this!

Next up were the individual sessions, and I was very much looking forward to the talk “Telling Great Data Stories Online” by Jock MacKinlay of Tableau. The talk itself was excellent, but for me it was all known stuff; it is highly recommended for those unfamiliar with Visual Analytics or Tableau. Being biased towards visualization-related sessions, I chose “Designing for Infinity” by Dustin Kirk of Neustar as my next session. Dustin showed 8 Design Patterns of User Interface Design, like infinite scrolling, which were really good. It reminded me of an updated version of the material in Steve Krug’s book Don’t Make Me Think.

Next up was the best talk of the day: “Small is the New Big: Lessons in Visual Economy”. Kim Rees of Periscopic showed us very good examples of effective information visualizations. I was really blown away by this presentation, mostly because she showed how creatively removing clutter and distractions can make a visualization very effective. The creative interactions that help the user work with the visualization were also compelling. Next was Philip Kromer of Infochimps on “Big Data, Lean Startup: Data Science on a Shoestring”. Though I expected Philip to explain the Lean Startup principles evangelized by Eric Ries, the talk was more about Infochimps’ approach to doing business. Some remarkable comments by Philip: “everything we do is for the purpose of programmer joy”, and “Java has many many virtues, but joy is not one of them”. Great presentation and inspiring insights!

My next session was “Visualizing Shared, Distributed Data” by Roman Stanek (GoodData), Pete Warden (OpenHeatMap) and Alon Halevy (Google). After a short presentation by each, these three guys had a panel discussion where the audience could ask questions. Their discussion revolved mostly around the fact that all three deal with data that is created and uploaded by users, and how you deal with that: do you clean it, what’s the balance between complex query functionality and ease of use, etc. My final session was “Wolfram Alpha: Answering Questions with the World’s Factual Data” by Joshua Martell. Half the talk was a demonstration of the features of WolframAlpha, and the other half was more or less a high-level talk about how WolframAlpha handles user input, how data is stored, how user analytics is performed, and more.

The day ended with a Science Fair where students, researchers and companies were showing new advancements in the field of data science. There were really interesting showcases, like a simulation tool for system dynamics. But, again biased towards visualization, the one that struck me most was Impure by Bestiario. Impure is a visual programming language that allows users to easily create their own visualizations, both simple and very advanced. It was also great to see the passion of Bestiario for their own product.

Finally, one of the best things about the conference so far has been meeting people, some of whom I had only known virtually until now. I especially enjoyed meeting all the visualization people today. It’s really great to meet so many members of the online visualization community in person.

So again, a fantastic day at Strata, and I am looking forward to tomorrow!

Day 1 at O’Reilly Strata Conference: Data Bootcamp

Today was my first day at O’Reilly Strata Conference: a full day of tutorial sessions. The session I picked was the Data Bootcamp by Joseph Adler (LinkedIn), Hilary Mason (bit.ly), Drew Conway (New York University) and Jake Hofman (Yahoo!). The purpose of this bootcamp tutorial was to turn everybody in the room into data scientists by getting our hands dirty with some real hands-on experience.

The tutorial kicked off with an introduction of the speakers and a general overview of the various aspects of working with data: getting data, cleaning data, data-intensive applications, and much more. Then Drew gave an interactive introduction to visualizing data using Python and R. The audience had to generate normally distributed random numbers in R. Although some people managed to keep up with all the examples, lots of people struggled, either because libraries were missing or simply because everything was going pretty fast, at least for R and Python newbies like myself.
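For readers who want to try the normal-distribution exercise themselves, here is a minimal sketch in Python using only the standard library. It is not the code from the tutorial (which used R); the function names `sample_normal` and `text_histogram`, the seed, and the text-based histogram are my own assumptions for illustration.

```python
import random
import statistics


def sample_normal(n, mean=0.0, stddev=1.0, seed=None):
    """Draw n pseudo-random numbers from a normal distribution."""
    rng = random.Random(seed)  # a fixed seed makes the run repeatable
    return [rng.gauss(mean, stddev) for _ in range(n)]


def text_histogram(values, bins=10, width=40):
    """Return one text row per equal-width bin (hypothetical helper)."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero if all values are equal
    counts = [0] * bins
    for v in values:
        # Clamp the maximum value into the last bin.
        idx = min(int((v - lo) / span * bins), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    rows = []
    for i, count in enumerate(counts):
        left_edge = lo + i * span / bins
        rows.append(f"{left_edge:8.2f} | {'#' * round(count / peak * width)}")
    return rows


if __name__ == "__main__":
    samples = sample_normal(10_000, seed=42)
    print(f"mean={statistics.mean(samples):.3f} "
          f"stdev={statistics.stdev(samples):.3f}")
    print("\n".join(text_histogram(samples)))
```

Running the script prints the sample mean and standard deviation followed by a rough bell-shaped bar chart in the terminal.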

Next Jake gave a great introduction to image processing, and especially to how you can cluster images based on similar features, color in our case. We used a K-Means clustering algorithm to cluster similar images based on color, and after that we classified images as either landscapes or head-shots.
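To give an idea of what K-Means does, here is a small sketch in plain Python. It is not Jake's code: I assume, purely for illustration, that each image has already been reduced to a single average (R, G, B) color, and the names `kmeans`, `squared_distance` and `mean_point` are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def kmeans(points, k, iterations=20, seed=0):
    """Cluster points into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            new_centroids.append(mean_point(members) if members else centroids[c])
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return centroids, assignments


if __name__ == "__main__":
    # Assumption: one average RGB color per image (three reddish, three bluish).
    colors = [(250, 10, 10), (240, 20, 15), (255, 0, 5),
              (10, 10, 250), (20, 15, 240), (0, 5, 255)]
    centers, labels = kmeans(colors, k=2)
    print(labels)
```

With real photos you would use richer features than one average color, but the two alternating steps (assign, then update) stay the same.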

After the break Hilary took over with a great presentation on working with text data. She started with some basic examples of extracting data from web pages using command-line tools like curl and wget, and using Python and the BeautifulSoup library. After that we turned to the main example: ‘hacking’ a Gmail account and trying to get some valuable information out of it. Hilary showed us how to classify email using probability statistics, and then Drew took over to show us how to visualize this data and turn it into network diagrams.
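As an illustration of classifying email with probabilities, here is a sketch of a naive Bayes text classifier in plain Python. This is my assumption of the kind of technique involved, not Hilary's actual code; the class name `NaiveBayes`, the labels and the example messages are all made up.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of messages
        self.vocabulary = set()

    def train(self, text, label):
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Work in log space to avoid underflow on long messages.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    model = NaiveBayes()
    # Made-up training messages, for illustration only.
    model.train("meeting agenda for the project review", "work")
    model.train("quarterly report and project deadline", "work")
    model.train("dinner with family this weekend", "personal")
    model.train("birthday party at the weekend", "personal")
    print(model.classify("project deadline meeting"))
```

The classifier simply picks the label under which the words of a new message are most probable, given the word counts seen during training.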

Last but not least Joseph gave a talk about Big Data. This was not an interactive session. Joseph shared some of his knowledge and experience of working with big data at LinkedIn, and explained the basics of Map/Reduce, Hadoop, and why and when to start thinking about big data solutions like Hadoop.
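To make the basics of Map/Reduce a bit more concrete, here is a tiny word-count sketch that runs on a single machine in plain Python. It only mimics the three stages (map, shuffle, reduce) and does not use Hadoop or any real Hadoop API; the function names are my own and it is not taken from Joseph's talk.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three stages in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big ideas"]))
```

In a real cluster the map and reduce calls run in parallel on many machines and the framework takes care of the shuffle; the appeal is that you only have to write the two small functions.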

Overall it was an interesting day, also because I’ve met really great people. It was especially great to meet Naomi (@nbrgraphs), Kim (@krees), Jerome (@jcukier) and Daniel (@danielgm). For me the Data Bootcamp was above all an inspirational tutorial with lots of ideas to try out on my own. For some people the tempo was a little too high, especially for those who had never programmed in R or Python before. And becoming a Data Scientist in just 1 day may be an illusion anyway. At least the tutorial gave me a good head start, lots of inspiration, and great lessons in how the presenters approach working with data. So for me, this was a great and successful first day, and I’m looking forward to the next two days!

The source code and slides of the Data Bootcamp are available online at: https://github.com/drewconway/strata_bootcamp