Day 3 at the O’Reilly Strata Conference

The third day of the Strata Conference was again packed with great sessions. The day started off with numerous keynotes. The first was by Simon Rogers of The Guardian. Simon is not just a fabulous presenter; the examples of his work at The Guardian also showed beautifully how to tell stories with data, and how The Guardian actually enhances its news stories by sharing data with the public. Next up was an interesting panel discussion with Toby Segaran (Google), Amber Case (Geoloqi) and Bradford Cross (Flightcaster), moderated by Alistair Croll (Bitcurrent); the topic of discussion was Posthumus, Big Data and New Interfaces. After the discussion we had good presentations by Ed Boyajian (EnterpriseDB) and Barry Devlin (9sight Consulting). Next was a very lively talk by DJ Patil (LinkedIn), who showed very convincingly that the success of working with big data at LinkedIn is only possible with a good team of talented people. Scott Yara (EMC) came next, with another lively talk, full of humor, on how Your Data Rules The World. The closing keynote was by Carol McCall (Tenzing Health), who presented a serious problem with humor: how big data analytics can be used to improve US healthcare and turn it ‘from sickcare into healthcare’.

As my first session I chose a talk on Data Journalism, Applied Interfaces. Marshall Kirkpatrick (ReadWriteWeb) showed some really useful tools, like NeedleBase, that he uses for discovering stories on the Internet. He was followed by Simon Rogers of The Guardian, who more or less continued his keynote, showing very compelling examples of how The Guardian uses data to tell stories, and how they use, for instance, Google Fusion Tables to publish much of their data. The last speaker of this session was Jer Thorp, and he absolutely blew me away with a beautiful interface he created in Processing as an R&D project together with the New York Times. It’s called Cascade, and it shows a visual representation of how Twitter messages cascade across followers and links.

My next session was on ‘Real-Time Analytics at Twitter’, where Kevin Weil mainly explained Rainbird, a project they use for various counting applications so that real-time analytics can be applied easily. The project will be open-sourced in the near future.

After the break I saw a session on AnySurface: Bringing Agent-based Simulation and Data Visualization to All Surfaces by Stephen Guerin (Santa Fe Complex). He showed how a projector and a table of sand can be used to enhance a data visualization for simulation purposes. As an example, he showed us how projecting agent-based models of emergent phenomena in complex system dynamics can help firefighters simulate bottlenecks in escape routes. It was also very cool to see that many of his simulations are built in Processing. Next up was a session by Creve Maples (Event Horizon), and I really liked the first part of his talk, because he had a very good story on how we should keep the capacity of the human brain for processing information in mind when designing products and tools. It was really good to hear such a strong emphasis on this. The last part of his talk was mainly about some of the 3D visualizations he has done in the past that were very successful for his company, but it didn’t strike me as much as the first half of his talk.

The session on Data as Art by J.J. Toothman (NASA Ames Research Center) was a good and fun talk with many examples of infographics and visualizations. I had already seen most of them myself, but some were new. It was a great talk with lots of eye candy. The final talk of the conference I saw was about Predicting the Future: Anticipating the World with Analytics. Three speakers gave their vision on how they do that: Christopher Ahlberg (Recorded Future) showed how his company uses time-related hints (like the mention of the word ‘tomorrow’) in existing content on the Internet to more or less predict the future. Robert McGrew (Palantir Technologies) showed how analyzing many large datasets in combination with human analysis can be used for effective fraud and crime prediction. Finally, Rion Snow (Twitter) showed that research has proven that analyzing tweets can be used effectively for stock market prediction (3 days ahead!), flu and virus spread prediction, and UK election result prediction (more accurate than exit polls). The predictive power of analyzing the Twitter crowd was really stunning.

This concluded the O’Reilly Strata Conference. The conference was fantastic, the sessions were great, and most of all, meeting all these people was probably the best part of all!

Day 2 at the O’Reilly Strata Conference

After a day of tutorials, the second day at Strata was the first of two conference days, packed with fascinating sessions. The day was kicked off with a plenary session with a long list of top speakers in the field of data science: Edd Dumbill of O’Reilly Media, Alistair Croll of Bitcurrent, Hilary Mason of bit.ly, James Powell of Thomson Reuters, Mark Madsen of Third Nature, Werner Vogels of Amazon.com, Zane Adam of Microsoft Corp, Abhishek Mehta of Tresata, Mike Olson of Cloudera, Rod Smith of IBM Emerging Internet Technologies and last but not least Anthony Goldbloom of Kaggle. Various topics were presented in 10-minute presentations, like data without limits, the data marketplace, and the mythology of big data. The shortest presentation struck me most: “the $3 Million Heritage Health Prize”, presented by Anthony Goldbloom. People are challenged to create a predictive application that uses healthcare data to predict which people are most likely to go to the hospital, so that ‘US healthcare becomes healthcare instead of sickcare’. The prize is $3 million for whoever solves this!

Next up were the individual sessions, and I was very much looking forward to the talk “Telling Great Data Stories Online” by Jock Mackinlay of Tableau. And though the talk itself was excellent, for me it was all familiar material; still, the talk is highly recommended for those unfamiliar with Visual Analytics or Tableau. Being biased towards visualization-related sessions, my next session was “Designing for Infinity” by Dustin Kirk of Neustar. Dustin showed 8 design patterns of user interface design, like infinite scrolling, which were really good. It reminded me of the updated version of the material in Steve Krug’s book Don’t Make Me Think.

Next up was the best talk of the day: “Small is the New Big: Lessons in Visual Economy”. Kim Rees of Periscopic showed us very good examples of effective information visualizations. I was really blown away by this presentation, mostly because she really showed how creatively removing clutter and distractions can make a visualization very effective. The creative interactions that help the user work with the visualization were compelling as well. Next was Philip Kromer of Infochimps on “Big Data, Lean Startup: Data Science on a Shoestring”. Though I expected Philip to explain the Lean Startup principles evangelized by Eric Ries, the talk was more about Infochimps’ approach to doing business. Some remarkable comments by Philip: “everything we do is for the purpose of programmer joy”, and “Java has many many virtues, but joy is not one of them”. Great presentation and inspiring insights!

My next session was “Visualizing Shared, Distributed Data” by Roman Stanek (GoodData), Pete Warden (OpenHeatMap) and Alon Halevy (Google). After short presentations by each, these three had a panel discussion where the audience could ask questions. The discussion revolved mostly around the fact that all three deal with data that is created and uploaded by users, and how you deal with that: do you clean it, what’s the balance between complex query functionality and ease of use, and so on. My final session was “Wolfram Alpha: Answering Questions with the World’s Factual Data” by Joshua Martell. Half the talk was a demonstration of the features of Wolfram Alpha, and the other half was a more or less high-level talk about how Wolfram Alpha handles user input, how data is stored, how user analytics is performed, and more.

The day ended with a Science Fair where students, researchers and companies were showing new advancements in the field of data science. There were really interesting showcases, like a simulation tool for system dynamics. But, again biased towards visualization, the one that struck me most was Impure by Bestiario. Impure is a visual programming language that allows users to easily create their own visualizations, from simple to very advanced. It was also great to see Bestiario’s passion for their own product.

Finally, one of the best things of the conference so far has been meeting people, some of whom I have only known virtually for some time now. I especially enjoyed meeting all the visualization people today. It’s really great to meet so many of the online visualization community in person.

So again, a fantastic day at Strata, and I am looking forward to tomorrow!

Day 1 at O’Reilly Strata Conference: Data Bootcamp

Today was my first day at O’Reilly Strata Conference: a full day of tutorial sessions. The session I picked was the Data Bootcamp by Joseph Adler (LinkedIn), Hilary Mason (bit.ly), Drew Conway (New York University) and Jake Hofman (Yahoo!). The purpose of this bootcamp tutorial was to turn everybody in the room into data scientists by getting our hands dirty with some real hands-on experience.

The tutorial was kicked off with an introduction of the speakers and a general overview of the various aspects of working with data: getting data, cleaning data, data-intensive applications, and much more. Then Drew gave an interactive introduction to visualizing data using Python and R. The audience had to produce a normal distribution of random numbers in R. And although some people managed to follow along with all the examples, there were also lots of people struggling, either because libraries were missing or simply because everything was going pretty fast, at least for R and Python newbies like myself.

Next Jake gave a great introduction to image processing, and especially how you can cluster images based on similar features (color in our case). We used a K-Means clustering algorithm to cluster similar images based on color, and after that we classified images as either landscapes or head-shots.
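
For illustration (this is not the bootcamp’s actual code), a minimal Python sketch of that idea, assuming a folder of JPEG images and using Pillow and scikit-learn: represent each image by its average RGB color and let K-Means group images with similar colors together.

# Represent each image by its average RGB color and cluster with K-Means.
from pathlib import Path

import numpy as np
from PIL import Image                 # Pillow, for reading the images
from sklearn.cluster import KMeans

def mean_rgb(path):
    # Average (R, G, B) of a downscaled copy of the image.
    img = Image.open(path).convert("RGB").resize((64, 64))
    return np.asarray(img, dtype=float).reshape(-1, 3).mean(axis=0)

paths = sorted(Path("images").glob("*.jpg"))      # hypothetical image folder
features = np.array([mean_rgb(p) for p in paths])

labels = KMeans(n_clusters=4, n_init=10).fit_predict(features)
for path, label in zip(paths, labels):
    print(label, path.name)

The classification step in the tutorial (landscape versus head-shot) went one step further, but the clustering part boils down to grouping simple color features like these.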

After the break Hilary took over with a great presentation on working with text data. She started with some basic examples of extracting data from web pages using command-line tools like curl and wget, and using Python with the BeautifulSoup library. After that we turned to the main example: ‘hacking’ a Gmail account and trying to get some valuable information out of it. Hilary showed us how to classify email using probability statistics, and then Drew took over to show us how to visualize this data and turn it into network diagrams.
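
As a rough illustration of that kind of scraping (not Hilary’s actual code, and example.com is just a placeholder URL), fetching a page and listing its links with requests and BeautifulSoup looks roughly like this:

# Fetch a page and print the text and target of every link on it.
import requests
from bs4 import BeautifulSoup

url = "https://www.example.com"        # placeholder URL
html = requests.get(url, timeout=10).text

soup = BeautifulSoup(html, "html.parser")
for a in soup.find_all("a", href=True):
    print(a.get_text(strip=True), "->", a["href"])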

Last but not least, Joseph gave a talk about Big Data. This was not an interactive session; Joseph shared some of his knowledge and experience of working with big data at LinkedIn, and explained the basics of MapReduce and Hadoop, and why and when to start thinking about big data solutions like these.
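
Not something from the talk itself, but to give a feel for the MapReduce idea behind Hadoop, here is a tiny pure-Python sketch: a mapper emits key/value pairs, the pairs are grouped by key (the ‘shuffle’), and a reducer aggregates each group; the classic example counts words.

from collections import defaultdict

documents = ["big data at linkedin", "big data with hadoop"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate the values for each key.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)

Hadoop does essentially the same thing, but distributes the map, shuffle and reduce steps over a cluster of machines and handles storage and failures along the way.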

Overall it was an interesting day, also because I’ve met some really great people. It was especially great to meet Naomi (@nbrgraphs), Kim (@krees), Jerome (@jcukier) and Daniel (@danielgm). For me the Data Bootcamp was above all an inspirational tutorial with lots of ideas to try out on my own. For some people the tempo was a little too high, especially if you’ve never programmed in R or Python before. And becoming a Data Scientist in just one day may be an illusion anyway. At least the tutorial gave me a good head start, lots of inspiration, and great insights into how the presenters approach working with data. So for me, this was a great and successful first day, and I’m looking forward to the next two days!

The source code and slides of the Data Bootcamp are available online at: https://github.com/drewconway/strata_bootcamp

Visualizing the World Economic Forum Global Agenda Interlinkage

The World Economic Forum (WEF) and Visualizing.org recently issued a Data Visualization contest in which interactive designers were asked to develop cutting-edge visualizations that will help elucidate the interconnectedness among issues, highlight emerging clusters and catalyze dialogue at the Summit between Councils. The data for this contest was derived from a survey among the experts of the 72 Global Agenda Councils of the WEF, who were asked the following three questions:

  • “Please select a maximum of 5 Global Agenda Councils that your Council would benefit from interacting with by order of priority”
  • “Please select a maximum of 3 Industry / Regional Agenda Councils that your Council would benefit from interacting with by order of priority”
  • “Please describe how it interlinks with your Council”

The data

The data was an Excel workbook with 3 sheets (or 3 CSV files) that contained the survey data:

  • A matrix with pre-calculated weighted links between Councils
  • A flat list of all the survey data
  • All the survey data, but in a different structure (this time by respondent Council)

The World Economic Forum Councils

The WEF consists of 3 Agendas:

  • Global Agenda (divided into 3 subgroups: Drivers and Trends, Risks and Opportunities, Policy and Institutional Responses)
  • Industry Agenda
  • Regional Agenda

The Global Agenda has 72 Councils, the Regional Agenda 10 Councils, and the Industry Agenda 14 Councils. Each Council is concerned with a specific issue (e.g. human rights or ocean governance). Each Council consists of 1 or more organizations of various types (government, NGO, business, etc.), and each organization may be located in a different country.

Visualizing the data

Since the purpose of the visualization was to find clusters and show the interconnectedness, the most obvious visualization to start out with was a network or graph visualization. I used the Force-Directed layout of Protovis to create a network of all the links between all the Councils. I also used a K-Means clustering algorithm and a community detection algorithm to find clusters, but the graph was too dense to find any sensible clusters. It appeared that almost every Council links to every other Council. So even though the visualization looked impressively complex, you could not get any valuable information from it, and I stopped pursuing this direction.
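
For those curious what this looks like in code: the actual visualization used Protovis, but a minimal Python/networkx sketch of the same idea (with made-up link data) would build a weighted Council-to-Council graph, compute a force-directed layout, and run a community detection algorithm on it:

import networkx as nx
from networkx.algorithms import community

# Made-up (respondent Council, linked Council, weight) triples.
links = [
    ("Human Rights", "Migration", 0.9),
    ("Human Rights", "Ocean Governance", 0.4),
    ("Ocean Governance", "Climate Change", 0.7),
    ("Migration", "Climate Change", 0.3),
]

G = nx.Graph()
for source, target, weight in links:
    G.add_edge(source, target, weight=weight)

# Force-directed node positions (what a layout like Protovis' computes for drawing).
positions = nx.spring_layout(G, weight="weight", seed=42)

# Modularity-based community detection; on the real, very dense graph
# this did not reveal any sensible clusters.
clusters = community.greedy_modularity_communities(G, weight="weight")
for i, cluster in enumerate(clusters):
    print(f"cluster {i}: {sorted(cluster)}")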

My next approach was to see whether a radial layout would work. I was inspired by many of the visualizations on www.visualcomplexity.com and by Circos. First I started out with just a radial layout of all the Councils, and then played with visually encoding some of the dimensions, like line thickness and color. For the image below, I filtered the data down to rank 1 and the Global Agenda only. This resulted in less data, which is easier to work with when prototyping.

This was a good start, and showed enough potential for me to continue working in this direction. One of the biggest flaws of the image above is that you don’t see who interlinks with whom (who is the respondent Council and who is the linked Council). So next I decided to create two half circles instead of one: one for respondent Councils and one for linked Councils. This appeared to be a good choice. I also worked on a better color palette and on encoding more of the data (for instance, the width of a bar shows the number of links). This is what I ended up with:

Then I added more refinements, like giving each bar extra height for stronger links, adding a filter option for the combination of rank and Agenda, and the ability to view Council links in isolation. I changed the colors from green and orange to blue and orange, to accommodate colorblind people. I also kept Edward Tufte’s ‘data-ink ratio’ in mind: remove as much (visual) clutter as possible. Martin Wattenberg once said: “if you start playing with your visualization, you know you’re going in the right direction”. And that’s exactly what happened when I added the ability to view Council links in isolation. The final result looks like this:

Challenges

One of my biggest challenges was that I didn’t understand the matrix in the data set; I couldn’t figure out the logic behind it. And when I finally thought I had worked out that not all data was shown, just the strongest links (so apparently uninteresting data was omitted), I came to realize that the data in the matrix was not normalized. So a link between Council A and B of 0.5 in the matrix was not the same as a link between Council C and D of 0.5. And because I didn’t understand the logic behind the values in the matrix, I decided not to use them, but to do my own calculations of link strength.
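
To give an idea of what such a calculation can look like (this is an illustrative sketch, not the exact formula used for the final visualization, which was implemented in Scala): weight each survey answer by its rank, with rank 1 counting most, and sum those weights per pair of Councils. The Council names and ranks below are made up.

from collections import defaultdict

# Made-up (respondent Council, linked Council, rank) survey rows.
answers = [
    ("Human Rights", "Migration", 1),
    ("Human Rights", "Ocean Governance", 2),
    ("Migration", "Human Rights", 1),
    ("Ocean Governance", "Human Rights", 3),
]

MAX_RANK = 5  # a Council could list at most 5 Global Agenda Councils

# Link strength: higher-priority ranks contribute more (rank 1 -> 5 points, rank 5 -> 1).
strength = defaultdict(float)
for respondent, linked, rank in answers:
    strength[(respondent, linked)] += MAX_RANK + 1 - rank

for (respondent, linked), value in sorted(strength.items()):
    print(f"{respondent} -> {linked}: {value}")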

The result

I’m very satisfied with the result, and at the same time I see room for improvement. I like the fact that the visualization communicates mainly visually: line thickness, line color, bar width and bar height are the main visual elements, and they make it easy to spot interesting links or Councils. Also, I hadn’t seen circular layouts like this in Protovis before, so it was fun to try something new.

A suggestion I received from Mike Bostock was to make the selection of the Councils fuzzier. Right now the bars of the Councils can become very thin or small, and selecting them may be somewhat difficult. A fuzzier selection may improve the user experience.

I have considered adding a filter for link strength as a way to reduce the number of links shown. But so far I’m not convinced that this will reveal more clusters.

The visualization omits some data that may be of interest: for instance, organization type is currently ignored, as is country. It may be interesting to see whether finding clusters would be easier if organization type or country (or a combination) were used instead of the respondent Council. Also, there is currently no back-link from the linked Council to the respondent Councils, so you cannot see whether the linked Council also wants to interact with the Councils that link to it.

Finally, I think that in order to find clusters, the survey should not give its respondents this many options. I would suggest just 2 ranks for the Global Agenda, and 1 rank for the Industry / Regional Agenda. It appears that giving Councils (or better yet, the organizations of each Council) this many options to link to other Councils results in a situation where, at some level, almost every Council links to every other Council. Using fewer ranks to choose from will probably reveal more polarized choices, and will make it easier to find clusters.

Technology

I used custom Scala code for pre-processing the data, and Protovis for visualizing it. Both technologies are very interesting, and highly recommended!
