Using JS to build bigger, better datavis to enlighten and elate by Alex Graul
I apologize a little for my voice. One of the great things about working for a newspaper is you get to work with a lot of people that go to exciting countries on a regular basis and then they come back with these exciting new variants of flu and give them to you. I might end up doing part of this as mime. We'll see how we go.
We basically work as the interactive team at the Guardian. We spend most of our time producing new ways of doing data vis, but this is not new stuff, I'm guessing. How many people have produced a chart as part of their job at some point? OK, so most of you.
Most of us now, if you're not working for an agency, or if you are working with clients that have a lot of data, if you're working for a mid to large organization, you're basically awash with tons of data. What we're seeing more and more is that these organizations have data science teams. They'll have staff specialists who will put together this stuff, and they'll find some insights, some things, hopefully empirical, that allow that company to be more efficient. More often what we see, because we also do consultancy work as well, is that when it comes to communicating that stuff to the larger organization, it falls flat on its face because there isn't the expertise in how to actually effectively communicate things about data. The role of visualization is to effectively communicate stories around data and to turn that into actionable information.
I work for the Guardian. We're quite a forward looking media organization. We've got [inaudible 01:19] content of our API, there's also some stuff in the data store, we do a lot of work with crowdsourcing, and we basically try to take advantage of technology to find new ways of telling the news.
You might know us for headlines such as this one. I work as part of the interactive team at the Guardian. We're quite a small multi-disciplinary team. We're an experiment that's been set up, so we consist of designers, journalists and developers who work very, very closely together on projects.
We look like that when we're not at work. Our job is to basically take data heavy stories and try and turn them into something that normal people can understand. This is a pretty good example in some ways. When the Murdoch story broke, this is something we put together. Basically, this was pulling all the tweets then taking out the joining words and looking at the common words that were coming out to try and get a sense of sentiment without relying on Twitter sentiment analysis, which tends to be a bit weird.
Basically, you've got frequency over at the top and then you've got most common words there. You get this sense that it's quite dull for quite a while and then it skips along a bit, and people get very, very bored and somewhere in there start making up modified Jay-Z quotes.
Then, in a second, it all kicks off. It isn't telling people something they didn't already know, but what it does do is actually surface something that we all know is there but we can't actually see. Yes, that's the Jay-Z stuff. [laughs] People got really, really bored at that point. It allows people to see something that they know is there but they've never actually been able to visualize before.
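As a rough illustration of the word-counting step described above (pull the tweets, drop the joining words, rank what is left), here is a minimal TypeScript sketch. It is not the Guardian's code: the stop-word list, the `topWords` name and the tokenizing rule are all assumptions made for the example.

```typescript
// Hypothetical stop-word list; a real one would be far longer.
const STOP_WORDS: string[] = ["the", "a", "an", "and", "of", "to", "is", "in", "it", "at"];

interface WordCount {
  word: string;
  count: number;
}

// Count the words in a batch of tweets, ignoring joining words,
// and return the most frequent ones first.
function topWords(tweets: string[], limit: number): WordCount[] {
  // Object.create(null) avoids clashes with inherited keys such as "constructor".
  const counts: Record<string, number> = Object.create(null);
  for (const tweet of tweets) {
    // Lowercase, then split on anything that is not a letter, digit or apostrophe.
    const words = tweet.toLowerCase().split(/[^a-z0-9']+/);
    for (const word of words) {
      if (word.length === 0 || STOP_WORDS.indexOf(word) !== -1) continue;
      counts[word] = (counts[word] || 0) + 1;
    }
  }
  return Object.keys(counts)
    .map((word) => ({ word, count: counts[word] }))
    // Highest count first; ties are broken alphabetically so the order is stable.
    .sort((a, b) => b.count - a.count || a.word.localeCompare(b.word))
    .slice(0, limit);
}
```

Running the same function over the tweets from each slice of time would give the "most common words" row of a chart like the one described.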
That's the really exciting stuff about this work, is that ability to surface things. We don't just do fancy Twitter stuff -- that's when he got hit with a pie and it all went completely nuts, just good fun. We also do quite a lot of heavy data analysis. This was a piece we did recently looking at USAID contracts, doing analysis around corporate ownership of aid.
We also do random 8-bit games and all sorts of fun stuff like that. We exist at the absolute pain points in terms of technology, because about 20 percent of our users are on IE 7 and 8. We're trying to do fun animated graphics, and we're trying to do it with web standards. We hit pretty much every problem that you can in this area.
We work with a really great graphics team and they produce fun things like this. As much as I can, I sit with them because they've got like 100 years' experience putting together really, really cool charts. You bring that experience to bear and you realize how much bad stuff is out there. Like it wouldn't be a vis presentation if I didn't have a god awful example of something from Fox News, not so good on the stats there.
There's a lot of subtler stuff that can kick in. My slides appear to be in slightly the wrong order, which is mildly concerning. What's happened here? Apologies for a second. [pause]
Something's gone horribly wrong there, so I'm going to skip very slightly forward. This is a really basic chart. Really, really simple. It's about as simple as they come. So you're a web developer. You think, "OK. We can implement that fairly quickly." But if you're a good developer, you're going to start wanting to generalize this content.
You're only going to write this chart once. This is where it starts going horribly wrong. Because then you start realizing how many design decisions actually come into an incredibly simple chart for it to be effective. You look at the Y scale. You've got what units they are. You've got where those numbers sit. Do they sit to the left of the tick, to the right of the tick, under the tick, above the tick?
Do the lines continue all the way out? How you do these things affects the rest of it. Basically, if you want those numbers to be out beyond the tick, then that changes the scale at the bottom. Do you want those numbers to be under or do you want them to be in the middle? Are they intimating that's the year, or the start of the year? You've got the colorization, the thickness of those lines.
Do you have the individual points on the line, or do you just show the line itself? Do you put the labels over there? Do you want a separate box? Do you want to put the titles in there? Do you want...? Actually, you look at the far right of this chart and you'll see the data actually goes off the top of the chart, as well. That's a very specific design decision.
What that graphics team do is basically make these hundreds of decisions every day. But if you're a developer and you try and generalize something that can handle all of this, it starts getting really weird really quickly. You look around the web and there are hundreds of charting libraries. Literally, hundreds of the damn things.
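To make the scale of the problem concrete, here is a hedged sketch of what a configuration type for even this one simple line chart might look like in TypeScript. Every name in it is hypothetical and not taken from any real charting library; each field just corresponds to one of the design questions listed above.

```typescript
// Each field answers one of the design questions raised for the simple line chart.
// All names are hypothetical; they do not come from a real charting library.
interface AxisOptions {
  units: string;
  tickLabelPosition: "left" | "right" | "above" | "below";
  gridLinesExtendPastAxis: boolean;   // do the lines continue all the way out?
  labelsCentredBetweenTicks: boolean; // does a label mean "the year" or "start of the year"?
}

interface LineChartOptions {
  xAxis: AxisOptions;
  yAxis: AxisOptions;
  lineWidthPx: number;
  showPoints: boolean;                // individual points, or just the line?
  legend: "inline" | "box" | "none";
  allowDataToOverflowTop: boolean;    // may the data run off the top of the chart?
}

const DEFAULT_OPTIONS: LineChartOptions = {
  xAxis: { units: "year", tickLabelPosition: "below", gridLinesExtendPastAxis: false, labelsCentredBetweenTicks: true },
  yAxis: { units: "%", tickLabelPosition: "left", gridLinesExtendPastAxis: true, labelsCentredBetweenTicks: false },
  lineWidthPx: 2,
  showPoints: false,
  legend: "inline",
  allowDataToOverflowTop: false,
};

// Shallow merge: a caller overriding `yAxis` must supply the whole axis object.
function withDefaults(overrides: Partial<LineChartOptions>): LineChartOptions {
  return { ...DEFAULT_OPTIONS, ...overrides };
}
```

Even this covers only a handful of the decisions mentioned, and it already has a dozen settings that interact with each other.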
I think there's a couple of reasons for that. The biggest one is that it's easier to write code than it is to read it a lot of the time. So people think, "This is easy. I'll write another one." They release it. [laughs] But there is another thing going on.
You take another example. Now, you're coloring your labels. You've got these at an angle. It gets more and more complex. You take another example. Now, you've got standard deviation kicking in here. Basically, now you're not just visualizing the number. You're visualizing the area around the number itself. If you're a developer, you start thinking about the number of configuration parameters you're going to need to be able to do this stuff.
It starts getting pretty scary pretty quickly. Go even further and further again. This is a really good example. What you're actually doing here is you're basically showing an area for each line. You've got time series starting at different points. They're finishing at different points. You need to support positive and negative scale, different units, a different sense of time.
Now, you've got a hell of a lot of configuration options. It's not just us that hit these problems. Here's a really simple example from FiveThirtyEight, The New York Times. Now you see those lines are continuing off the bottom, but they're not actually showing anything on that axis. You look at the units on the scale on the right, they're not showing the number at the top.
Gaps in between are now four percent. You compare it to that one. Suddenly, you've got no lines on this axis. You change the axis. You've now got two different levels of lines. It's all these configuration details that make it incredibly hard to actually create a single reusable charting library. You do this for any period of time and you think, "Sod this. We're going to start using low-level libraries to do this stuff," because trying to create one reusable component to do it just doesn't work.
But this library used to be Flash. Before this library was Flash, it was Java. Because of that, there's this incredible intellectual lineage in terms of the way it's constructed, because they had to solve these problems many, many times before. That means that it's a very small library. It's a very simple library, in terms of the actual total gamut of functionality. But the stuff it does, it does incredibly well.
Where are we? Let's see if we can get this up for a second. Can I get mouse?
Because it's so well written and so small, it means you can do really, really nice stuff like this. It just works beautifully quickly, except for the fact that I don't have an Internet connection at the moment. But the smoothness of that animation comes from the fact that the internals of the library are so well written.
Going back to presentation for a second. And then, for all the other charting, D3 is the library we use. D3 is wonderful because it's so simple, in terms of the fact that basically what you do is you just map data to DOM nodes and you use that data to effectively set the properties of that DOM node.
Imagine you've got 10 points of data. You've just got individual numbers. Then you've got 10 divs. You set the width of those divs against that number, and bingo, you've got a bar chart. But it actually provides this additional functionality, which means that you can pretty much create any kind of chart. And the far extremes of this stuff are just incredible now.
This is basically adaptive cutting of geo-projection. The ability to do that in SVG and basically reuse this code instantly is what makes D3 wonderful, because you could then turn this into a choropleth, so colorize each country versus a piece of data really, really quickly, and it works. It's just the things that are available there. Because incredibly smart people are working on this thing all the time, it is wonderful from our perspective.
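The "ten numbers, ten divs" idea can be sketched without D3 at all. The TypeScript below is only an imitation of the principle (one datum drives the properties of one element) and does not use D3's API; it builds HTML strings so it can run outside a browser. The function names are made up for the example, and it assumes non-negative values.

```typescript
// Scale each value to a pixel width, with the largest value filling maxWidthPx.
function barWidths(data: number[], maxWidthPx: number): number[] {
  const max = data.reduce((m, d) => Math.max(m, d), 0);
  // Avoid dividing by zero when every value is 0 or the list is empty.
  return data.map((d) => (max === 0 ? 0 : Math.round((d / max) * maxWidthPx)));
}

// One datum becomes one <div>; the datum sets that div's width.
function barChartHtml(data: number[], maxWidthPx: number): string {
  return barWidths(data, maxWidthPx)
    .map((width, i) => `<div style="width: ${width}px">${data[i]}</div>`)
    .join("\n");
}
```

In D3 itself the same mapping is expressed by binding the data array to a selection of elements and setting styles from each bound value.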
But the problem with D3 -- Well, it's not really a problem, but it's more of an attribute of the way it's created -- is that you look at D3 code. This is like putting together a basic bar chart in D3. You look at it for second, it starts looking like jQuery code. You've got lots of chaining involved. You're basically working elements. You apply things and you then get the element back so you can then continue down.
You're setting a bunch of styles at the top. You've got some event handlers here. That puts together your code. But much like jQuery, you start building larger applications. You run into a problem, because you need a structure. If you've been doing this for any period of time, you reach for your standard MVC frameworks.
You look at histories of things like Java Swing, WPF, any of the frameworks that were for building desktop applications. They use very different constructs to actually build applications. We've been working on something called Miso, which is a set of libraries designed to address some of this stuff. This is in conjunction with the Bill and Melinda Gates Foundation, confusingly enough. I'm being paid by Bill Gates to write open source. Not an expected development. We're working with Bocoup, who you might know because they wrote Grunt and a bunch of other open source libraries. Irene Ros is there, who actually worked on IBM Many Eyes.
To give you an idea of the kind of problems we're trying to solve...Imagine you're trying to create something like this. This is Github's internal charting of response times.
You've basically got an aggregate here and you've got the individual breakdowns. You've got the average of something. You might want to know the 90th percentile, the 10th percentile of that. You want to show all the individual data as well. That means a traditional model and collection approach doesn't work particularly well because you want to work on your data in totality.
You want to be able to run operations against an entire set of data, not just individual points. We wrote a library called Miso dataset. What this is designed to do is to work with any kind of tabular data. So you can pull in stuff from CSV files, TSV files, JSON files, REST APIs, anywhere you can get this stuff from. And then, operate on it as collections.
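A small sketch of what "operating on the data in totality" means in practice: treat a column as a whole and compute the mean and the 10th and 90th percentiles from it. This is plain TypeScript, not the Miso Dataset API; the `Row` type, the function names and the choice of linear interpolation between ranks are assumptions for the example.

```typescript
type Row = Record<string, number>;

// Pull one column out of a set of rows so it can be treated as a whole.
function column(rows: Row[], name: string): number[] {
  return rows.map((row) => row[name]);
}

function mean(values: number[]): number {
  if (values.length === 0) throw new Error("mean of an empty column");
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Percentile (0 to 100) using linear interpolation between the two closest ranks.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of an empty column");
  const sorted = values.slice().sort((a, b) => a - b);
  const position = (p / 100) * (sorted.length - 1);
  const lower = Math.floor(position);
  const upper = Math.ceil(position);
  return sorted[lower] + (sorted[upper] - sorted[lower]) * (position - lower);
}
```

With something like this, a response-time chart can draw the average line plus a 10th-to-90th percentile band from the same column of raw measurements.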
To give you a practical example...This is basically using dataset to pull down all the Github issues for a particular repo, really simple three lines of code. Then what you do is you can then take that and expand that out. This is a bit more of a complex example. We pull down those issues. What we can do is extract a view which will give you just the open issues.
You can then take that, pull out the assignee name, do a group-by operation on that. Before you know it, you can count how many issues are against each person. It ends up with a very small amount of code. The nice thing about this as well is you can keep pulling that data and it'll real-time-update all the way through.
You could basically build a chart of just the open issues, and then every time those open issues change, because you're pulling new data, that chart can be dynamically updated because there's eventing all the way through this system. It gives you that ability to do this.
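The pipeline described here (take the open issues, group by assignee, count, and recompute whenever fresh data arrives) can be sketched roughly as follows. This is not the Miso Dataset API; the `IssueSet` class, its method names and the `Issue` shape are invented for illustration, and the data is passed in by hand rather than fetched from GitHub.

```typescript
interface Issue {
  id: number;
  state: "open" | "closed";
  assignee: string | null;
}

// A tiny in-memory collection that tells listeners when its rows change.
class IssueSet {
  private rows: Issue[] = [];
  private listeners: Array<() => void> = [];

  onChange(listener: () => void): void {
    this.listeners.push(listener);
  }

  // Replace the rows (for example after polling again) and notify listeners.
  reset(rows: Issue[]): void {
    this.rows = rows.slice();
    for (const listener of this.listeners) listener();
  }

  // A "view" containing only the open issues.
  openIssues(): Issue[] {
    return this.rows.filter((issue) => issue.state === "open");
  }

  // Group the open issues by assignee and count each group.
  openCountByAssignee(): Record<string, number> {
    const counts: Record<string, number> = Object.create(null);
    for (const issue of this.openIssues()) {
      const name = issue.assignee === null ? "(unassigned)" : issue.assignee;
      counts[name] = (counts[name] || 0) + 1;
    }
    return counts;
  }
}
```

A chart would register an `onChange` listener and redraw from `openCountByAssignee()` each time new data is pulled.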
I'm going to very quickly do something. Sorry about this. I've realized what's actually happened here. Can I possibly go over to that window? I can't. Sorry about this. Shit. Trying to operate on a monitor that far away is not that easy.
The next big part is basically reusable charting. What we're trying to do is create a way that we can actually create pieces of charting we can reuse, primarily based on D3, without actually losing the ability to customize them on the fly, because that's the thing we need the most. Basically, it's creating a situation where we waste as little code as possible when we create these things but we still have the ability to customize them as we go along.
You end up either abusing those properties or writing something a little bit different. Basically, you're passing a style object, passing some different colors, setting the inner radius to zero, and, suddenly, you have the chart you want. But imagine you want to go a little bit further. You want some additional functionality.
This is a quick example, looking at -- I'm really sorry about this -- the ability to basically highlight sections of the chart and have them grow when you mouse over them. You've got two additional pieces of functionality you want there. One of those is the ability to have the segments grow when you mouse over them. The other is to have the segment highlight in color.
Each one of those is a separate thing. You might want one, you might want the other, or you might want both of them. What we allow you to do is basically define mix-ins, which, if you come from a Ruby or a Python background, you're familiar with the idea. It's basically a piece of optional functionality that you can then mix into your chart, and they become really easy to pass in like that.
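One way to picture the mix-in idea is as small functions that each attach one optional behaviour to a chart. The TypeScript below is a sketch under that assumption; `PieChart`, `growOnHover` and `highlightOnHover` are hypothetical names, not the Miso API, and "hovering" is simulated with a method call so the example runs without a browser.

```typescript
interface Segment {
  label: string;
  radius: number;
  color: string;
}

type HoverHandler = (segment: Segment) => void;
type Mixin = (chart: PieChart) => void;

class PieChart {
  segments: Segment[];
  private hoverHandlers: HoverHandler[] = [];

  constructor(labels: string[], mixins: Mixin[] = []) {
    this.segments = labels.map((label) => ({ label, radius: 100, color: "grey" }));
    // Each mix-in gets a chance to attach its own behaviour.
    for (const mixin of mixins) mixin(this);
  }

  onHover(handler: HoverHandler): void {
    this.hoverHandlers.push(handler);
  }

  // Simulate the mouse moving over the segment at `index`.
  hover(index: number): void {
    const segment = this.segments[index];
    if (segment === undefined) return; // ignore out-of-range indexes
    for (const handler of this.hoverHandlers) handler(segment);
  }
}

// Two independent, optional behaviours.
const growOnHover: Mixin = (chart) => chart.onHover((s) => { s.radius = 120; });
const highlightOnHover: Mixin = (chart) => chart.onHover((s) => { s.color = "orange"; });
```

A chart can then be built with one, the other, or both, for example `new PieChart(["a", "b"], [growOnHover, highlightOnHover])`.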
Then the third ability you may well want is the ability to completely customize a chart the same way you would extend a Backbone view or anything like that, and that looks a little bit like that. There's a few other issues specific to this problem area that we're dealing with as well.
Imagine you're looking at something like this, small multiples. You've got a whole ton of charts that are related to each other. Basically, if you advance a month in one of these charts, you're going to want all of them to advance. But then, if you want to then turn off the event bindings for one of those charts, if you're doing it with a traditional binding framework, you're going to have a problem, because they work on identification of a callback.
If you've got lots of identical objects, they've all got the same callback, so you can't identify the individual callbacks. It's implementing things like a token-based system. You do a subscription. You get a token back that's unique to that subscription. You could then unsubscribe based on that token.
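A token-based subscription scheme of the kind described can be sketched in a few lines of TypeScript. The `EventHub` class and its method names are assumptions for illustration rather than the actual Miso events API; the point is only that each `subscribe` call returns a unique token, so two subscriptions that share the same callback can still be removed individually.

```typescript
type Callback = (payload: string) => void;

interface Subscription {
  token: number;
  topic: string;
  callback: Callback;
}

class EventHub {
  private nextToken = 1;
  private subscriptions: Subscription[] = [];

  // Returns a token that identifies this subscription and nothing else.
  subscribe(topic: string, callback: Callback): number {
    const token = this.nextToken;
    this.nextToken += 1;
    this.subscriptions.push({ token, topic, callback });
    return token;
  }

  // Removes exactly one subscription; returns false if the token is unknown.
  unsubscribe(token: number): boolean {
    const before = this.subscriptions.length;
    this.subscriptions = this.subscriptions.filter((s) => s.token !== token);
    return this.subscriptions.length < before;
  }

  publish(topic: string, payload: string): void {
    // Copy first so a callback can subscribe or unsubscribe without disturbing this loop.
    for (const s of this.subscriptions.slice()) {
      if (s.topic === topic) s.callback(payload);
    }
  }
}
```

With small multiples, every chart can subscribe the same "advance a month" function, and one chart can later be detached by its token without touching the rest.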
But the third level of this, that's probably the most interesting in some ways, is a scene structure. If you're a software developer, you'll think of it as a finite-state machine, but it's one with high support for asynchronicity. If you come from more of an editorial background, you'll think of it as a way of mapping a storyboard quite neatly into code.
You imagine you start with a piece like this. And then, if something happens and you end up looking at that, it takes you a while to understand what on the earth has actually happened there. What you actually want is a nice transition between them, because that's what communicates most effectively what's actually happened to the user.
Animation is a really, really important part of that. So we're trying to support that kind of asynchronicity and basically make it very easy to define your scenes in such a way that you can have asynchronous transitions between them but not have to worry about each level of asynchronicity beyond the individual level.
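The scene idea, a finite-state machine whose transitions are allowed to take time, might be sketched like this in TypeScript. It is an illustration only, not the Miso storyboard API: `Scene`, `Storyboard` and `timedScene` are hypothetical names, and the animations are faked with timers. Each scene only describes its own enter and exit; the storyboard makes sure one transition finishes before the next begins.

```typescript
interface Scene {
  enter(): Promise<void>; // e.g. animate this scene's elements in
  exit(): Promise<void>;  // e.g. animate them out again
}

class Storyboard {
  private current: string | null = null;
  private busy: Promise<void> = Promise.resolve();

  constructor(private scenes: Record<string, Scene>) {}

  // Queue a move to another scene; resolves once that scene is fully entered.
  to(name: string): Promise<void> {
    const next = this.busy.then(async () => {
      const target = this.scenes[name];
      if (!target) throw new Error(`Unknown scene: ${name}`);
      if (this.current !== null) await this.scenes[this.current].exit();
      await target.enter();
      this.current = name;
    });
    // Keep the queue usable even if this transition fails.
    this.busy = next.catch(() => undefined);
    return next;
  }

  active(): string | null {
    return this.current;
  }
}

// Hypothetical scene whose animations are faked with timers and logged.
function timedScene(name: string, log: string[], ms: number): Scene {
  const wait = () => new Promise<void>((resolve) => { setTimeout(resolve, ms); });
  return {
    enter: () => wait().then(() => { log.push(`enter ${name}`); }),
    exit: () => wait().then(() => { log.push(`exit ${name}`); }),
  };
}
```

Calling `to("map")` and then `to("chart")` straight away runs the two transitions in order, so the second never starts while the first is still animating.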
That's actually the end of my slides. What's happened here is I've loaded up the wrong version of a presentation, because I redid it when I got on stage, which is why it's a little bit rough. Apologies for that. Any questions? [applause]
It's a good thing that all of that is on video, because we can play it back in a third of the time or something and we'll see what we've been talking about. We have 40 minutes of Q&A. [laughter]
Just saying. But I challenge anybody of you here to actually say the word "asynchronicity" as fast as he did. It's a bit like reading CSS back and trying to pronounce "specificity", which is not fun. But, talking about fun. We have a few good questions, which you can actually now take your time to answer, like five minutes for each or so.
The main question was, that people had -- well, actually, that Vitaly has, because he's a big fan, obviously, of it -- that there are still a lot of Flash visualizations online. Do you see a need where Flash would still be necessary?
Very little. Very, very little now. I think that's changed a lot in the last couple of years. The biggest problem now, I think the reason you're still seeing so much, particularly with news websites, is you've got graphics teams that have an authoring environment in Flash that just doesn't exist in HTML. You've got people that are very OK with that. They have no idea how to create it with HTML, and there's no GUI tools that really replicate the same environment. I think that's the biggest problem at this point.
I think Adobe are working on some of them.
I believe there's exporters from Flash to other standards which brings us to another question that people had. My God, people, stop tweeting. You inspired them. That's interesting.
What do you do about IE less than nine? Is a VML fallback a good idea?
Yeah. We've used Raphael a lot to do that. Basically, we'll do all the actual execution logic in D3, and then the actual rendering will fall back to Raphael, for example. There's actually a really neat library called R2D3, which is an attempt to actually do that abstraction purely inside D3. That works fairly well.
Also, if it's anything geo-based, there's a library called Leaflet, which is really good for mapping, which has VML support for any vector internally. There's a lot of different things you can do to address that. We, at this point, still have to support IE7, which is incredibly painful. It does mean that we lose a third of our development time on that.
Increasingly, it's becoming a business issue, because it's such a percentage of time. You think, "We could get an extra project done out of every three." Yeah but it's definitely an issue. Basically, Raphael for vector is the short answer.
Well, that's actually a much bigger problem, in a way, because I think the biggest thing with a lot of the interactive content is it just doesn't make sense on mobiles. Most of the formats, they're a silly thing to load on a mobile device, because they're just not going to be readable or people can't interact with them because it's just too small.
This is actually another problem with the responsive-design stuff, because it deals with refactoring an existing layout or design. But what you actually need with interactive content is a fundamentally different format. And so, with us, increasingly, you'll author things two or three times from scratch. It's conceptually the same piece of content, but, in practice, you redo the entire execution of the front end.
Then, we'll just do a detect and we'll actually pass you to the right thing. It's a much deeper problem, really, than it is with normal content.
Did you find that you reach your limits with Miso, at times? Do you find that your visualizations can become laggy if you do a lot of sorting and testing?
Oh, definitely. Definitely.
What can you do about it?
Very little beyond a certain point. I think it's about being realistic about the capability of the devices. We end up, particularly with anything for mobile, actually doing very specific detects for even different versions of iOS to try and get a sense of performance, because there will be things that'll run very nicely on an iPad 2 but not on an iPad 1.
There is that level of detail, particularly around performance. SVG performance will be fantastic in some browsers but terrible in others. It also changes version to version. You can get issues where particular versions...I think the current version of Chrome, 22, actually has really degraded performance of Modest Maps transitions, for some reason. But if you run it against beta, it's fine.
That can be the difference between a project working or not. Particularly when you've got editorial stakeholders to please and they're running certain versions on their machines. If it doesn't run nicely in that, then you've got really big problems.
It really does depend on the content. I mean, in some situations, I'll use a fallback to a table of data. It really does come down to the specific use case. But, I think, a lot of the time, it's actually asking whether it makes sense to do a mobile version or whether, actually, this is an experience that only makes sense to do on a tablet or a desktop service.
Do you see any reuse in print of what you're doing?
Increasingly, actually. The biggest thing tends to be projects with a lot of geographic data, because it's just so hard to do by hand. But the nice thing about SVG is, because it's all vector, you can do that very easily. So, yeah. It's something that I think, with the pressure on users, it's increasingly something they're keen to do.
What were the slides that you didn't show?
About half of them. [laughs]
Anything you want to cover that you haven't?
I had a terrifyingly confusing time because my slides were half-right but half-wrong and I couldn't work out what was going on for a while. That was interesting.
I was impressed that you tried to debug on a screen far away from you like that. [laughs] It's amazing. I train speakers at the moment. I do a lot of speaking myself. It's incredible how many times you're on two screens and you try to move the mouse to a screen on the left of you while your mouse is going to the right. You will fail, immensely. Actually, this was impressive.
Might ask you about that training at some point?
It's OK. Everybody's scared, and it's good.
I think the biggest things we didn't cover is probably going into a bit more around the reusability of the blog stuff. Hopefully, what we're planning on doing with that is, if you look at what's happening with Web Components, that's very much going in that direction. What we're trying to do is almost set up a bit of a straw man for the meantime, can we create a way of doing this before that standard exists.
There is a system called X-Tag, actually supported by and written by Mozilla that actually allows web components in IE9 and upwards.
All right, OK.
We're using a few bar charting things in there just to plug that a bit. Web components, and I'm sure Alex and other people will talk about it, is a very interesting part, because a lot of this interactivity that we have right now could be semantically defined in the document as well, and not written with JavaScript.
Yeah, ideally, but then it comes down to the issue of the data configuration required to get the desired effect.
That does tend to be the challenge.
There was a good question, how is it to work with journalists and designers? What are the tips that you'd found working because a lot of them just give you a PowerPoint or something. Say like, "OK, this is how it should work and now, make it animated, and make it pretty." As soon as you animate it, it's not pretty.
Yeah. [laughs] The biggest thing is faster iteration. Usually, we'll talk to journalists and they have a sense in their head of what they want. They're very poor about communicating that. Until you actually put something in front of them, almost like a straw man, and just kind of say, "Is this not what you want," you don't really get good feedback. As soon as you start doing that, you get a nice tighter iteration loop, and that's by far the best approach. In terms of working with designers, I think the closer the better, if you can sit next to each other, that's ideal. I think the reason our team works so well is you've got all three elements literally sitting next to each other. That means we can turn around things like that right into a piece in a week.
How does the data get in right now, it seems to be just data sets, but I've seen other talks and I'll try to remember who it was--was Dan Mills (?) actually at some conference. He talked about that he actually wrote a content management system for the data vis for designers and journalists so they can actually reuse them. Do you do something like that?
We use a variety of things to be honest with you. We built some tools that allow journalists to effectively create JSON. We use Google Spreadsheets quite a lot simply because it provides a really nice UI. We haven't actually used it in production, but we've experimented quite a lot with Crossfilter as well, which is by the same guy as D3, Mike Bostock. That's really good when you've got huge amounts of data. It's built for a very specific use case, but it's great for that. In terms of UIs, beyond that not really. At the top end, we've got people that know how to write SQL and use R. They can give us a CSV and that works fine, but below that it tends to be basically Google Spreadsheets, just because it's a familiar UI for people.
It comes with a form automatically.
Yeah. As well, the interesting bit.
Actually, the other thing to mention would be the PANDA Project, which is being run by the Chicago Tribune. That was a Knight-backed project to basically create an interface for journalists on top of big sets of tabular data. Make it really easy to search, so if like a name comes up you can quickly look up police records and things like that. We actually started working with that as a way of creating a back end for data vis as well. There's a project called Datawrapper, which is being run in part by Gregor Aisch. He's driven_by_data on Twitter, does really good visualization work, and is heavily involved in that. Basically, you upload a set of data, it actually uses Miso Dataset to do the upload, and then you can configure it and you get a basic chart out of the far side. That's a really good project.
Do you have a repository on Github where you actually put the data vis afterwards and what you've done or the tools that you're using? Because we heard about it, but it was quite fast.
You're doing a great job at The Guardian. Not many people know this. When you see a data vis like that on The Guardian, normally the data set is available on Google Docs as well for you to play with. Again, I hadn't seen any other newspaper that does that, so it would be interesting to see your tooling a bit more as well.
There's one project, which we're actually planning on doing that, which isn't public yet, which will be the first one where we release the raw data plus all the intermediary steps and scripts involved. The biggest issue, actually, with releasing all the tool set is we use Google Refine quite a lot, which is now becoming Open Refine, but that doesn't, by default, give you an audit trail.
What is Google Refine?
Google Refine is a platform for taking quite big chunks of data and then doing operations to do things like entity resolution. Imagine you get 10,000 contracts through, and you might find seven different spellings of PricewaterhouseCoopers. It has a bunch of algorithms built in that make it very easy to then identify all those and resolve that down. So that's a really common data problem that we face.
But that doesn't give you an audit trail. You can say, "Here's the file before and here's the file now," but we can't actually explain what happened in between. We're trying to find a way of actually being able to give people a full workflow of how we got from A to B with data, and that's a bit of a challenge.
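For a feel of how the entity-resolution step mentioned above can work, here is a minimal TypeScript sketch of key-collision clustering: reduce every name to a normalized "fingerprint" and group the names that end up with the same one. It is loosely modelled on the fingerprint style of clustering found in tools like Refine, but it is not that tool's code; the function names and the exact normalization rules are assumptions.

```typescript
// Normalize a name: lowercase, strip punctuation, then sort and de-duplicate the words.
function fingerprint(name: string): string {
  const tokens = name
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, "") // drop punctuation and other symbols
    .split(/\s+/)
    .filter((token) => token.length > 0);
  const unique = tokens.filter((token, i) => tokens.indexOf(token) === i);
  return unique.sort().join(" ");
}

// Group the original spellings by fingerprint.
// Keeping the originals in each cluster doubles as a record of what was merged.
function cluster(names: string[]): Record<string, string[]> {
  const clusters: Record<string, string[]> = Object.create(null);
  for (const name of names) {
    const key = fingerprint(name);
    if (!clusters[key]) clusters[key] = [];
    clusters[key].push(name);
  }
  return clusters;
}
```

This catches differences in case, punctuation and word order, but a variant such as "Price Waterhouse Coopers" gets a different fingerprint, which is part of why so much of the cleaning still ends up being done by hand.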
Talking of data, I used to work with the UK government, when they released their open data platform as well, which then got canned when the other party got voted in. A lot of this data is incredibly dirty. Is there actually an automated process to clean it, or is there some artificial-intelligence stuff going on, or do you just churn numbers by numbers?
The closest is Google Refine. But a lot of it is by hand. It can be incredibly time-consuming. We've had situations where we get a team of five or six interns to work on something for a couple of weeks to go through, because there's just no other way of doing it.
A lot of it is domain-specific as well. It's not just like anyone can pick up this data and understand what's going on. Unless you understand the context around the data -- this is often the thing that isn't released -- you can't really do anything with it anyway. Often, before we work with something, we have to sit down with the right government department and actually quiz them about what the hell the data is, and then find out that half of it isn't there or they've forgotten to record something or release something. It's a complete mess.
What was the most interesting thing that you found just by making a visualization before actually looking at the data?
Actually, the US contract stuff, looking at USAID contracts. The level of concentration of major agricultural firms was much higher than anticipated. There are two stages in our workflow where we use visualization. The first is actually that exploratory phase, and you just throw it up and you see what pops out.
We're actually doing some work with European patent data at the moment as well, and that's another thing where you occasionally just find a category that you just never would've expected that many patents to be in. Or they'll all happen at a certain time, and you can kind of chart the rise and fall of areas of industry, essentially, through that data. So that's been really interesting to pull apart.
On the other side of things, did you have a chart where you actually found some information, showed it, and it turned out to be completely wrong and it had to be changed quickly?
I don't think that's ever happened. We basically have quite a careful process before we put things out. We have had situations where we spent weeks working with a data set to find there's just nothing very interesting. That's more common. That's a really hard problem, because you can't tell before you've done the work.
Well, I don't think it was so much push-back, but I think it's just a sense in the community that it isn't traditionally going to be much of value. It's seen as flashy, shiny bits rather than a key part of the experience. I think it's more of a sense of where the value is seen in terms of animation.
Did you find accessibility issues with that?
Yeah. It's a nightmare. Creating an accessible version of a lot of this content is really, really hard. A lot of the times, I've got a couple of friends that are blind, and I sit down with them, and screen readers make a mess of this stuff. Often the best thing you can do is basically provide a link to the raw data, because there isn't really an easy way of providing even a summary sometimes.
Text is often a far better format than any kind of interactive or even raw data.
How about progressive enhancement? What I've done in the past is leave the data table in the page for a screen reader to actually index and just generate on top of it. Is that feasible, or is it just for very, very simple charts?
It's for simple charts. I think you get beyond a certain point of complexity, it doesn't really work. You think about something like the riot-rumors piece or the Twitter piece; the sheer amount of data around it there is going to be completely meaningless to display on its own. There's a point at which that technique kind of caps out.
When you work on data visualization and things like that, do you just go through the web and look at other newspapers and go, "Oh my God, I know exactly what they're trying to do here"? Do you also dive into the source code of other people's stuff to see what they've done?
Yeah. It's quite a close-knit community. We know the guys at "The New York Times" and the "Texas Tribune" and the "LA Tribune," and we'll catch up with them and we'll talk through techniques and how they're approaching certain problems. That's a very common thing. We'll pull apart each other's work and try and understand what they're doing. People riff off each other's work as well. That's really common.
We do a lot of stuff with animated balls, basically. Now, you see "The New York Times" doing all this stuff with animated balls. They did a really nice technique of judging public sentiment on two axes, and NPR did a really nice piece that's kind of taken that technique a bit further. So I think we're all trying to work out the best way of creating this stuff simultaneously, so we all work quite closely.
Question, and obviously, I think you can't answer it. Can we expect more open-source stuff from The Guardian?
Yeah. There's certainly been support for it from management. We're keen to push more and more of our stuff in public. I know the rest of the development team is as well. I think the biggest issue is just that we're trying to do so much simultaneously at the moment. It's getting it to that point. It's that thing where things are good enough for production but not good enough for GitHub. [laughs] It's finding the time to actually get things ready, so that it's not just like [makes gagging sound] throwing them out and just leaving it there.
So you're finding a lot that you work just getting it out of the door, rather than being able to plan through it and...
No, we work on incredibly short timelines. These projects will be quite big things, but we'll turn them around in two or three weeks. It tends to be very, very fast. Once they're done, they're done. If you work on one tool for a few years, you can iterate on that code base and tidy it up and all the rest of it.
But with this, you knock it out. If it's a day late, it's probably worthless. You've really got to hit that deadline, and that often means you just hack things until they work. So you see some horrible code behind these [laughs] sometimes, partly because you just have this pressure in terms of deadlines.
How about code reuse? Do you actually reuse a lot of stuff you've done before? What is your process? You say, "OK, here's the infographic that we have. We have a live Twitter sentiment analysis. We ran that through a database. We want a bubble interface." You've done it once before. Are you going to look at the old one, or what is your process to go from the spec to what you need to get on the screen? In very slow words, because we have 15 minutes to cover.
We do reuse some things. What we're trying to do with the Miso blog stuff has actually dramatically increased that reuse, because it is a problem, particularly around more basic components.
Actually, there's one more thing I might try and throw up, which might be...
Yeah, go for it. Show a movie. Do whatever.
See if this works better than last time.
Can't be worse.
Get a Mac. Oh, wait ...
Right, so I do that. Let's see if I can -- hey, there we go.
Language young friend, language.
Right, now, look, it's another deck. It's slightly better. [laughter]
It's got a lot more slides in it. It's like it was designed for a longer talk or something. The point I'm trying to get to... That's an animated pie chart, so this is what it should have looked like. That's what I was talking about in terms of reusable but optional pieces of functionality. It's got highlighting there and you've got mouseovers in there, both things you might want in one situation but not in another.
That came out of D3, isn't it? It's one of the components that's already in the box there?
Yeah, partly. This is a piece that isn't actually public yet. I think it's going out today, but this is the first thing we built entirely with reusable blocks. The map itself here is a reusable block, that slideshow is, and it's using Scene to actually do the overall structure. Basically, the walkthrough is a series of defined scenes that have entry and exit points.
They then handle the transitions. You've got a data set behind that which updates a global state, but you've got all these extra pieces that we can now reuse very, very quickly in other pieces. You'll see in a minute we've got a... There's a nice little piece here which basically visualizes shape files, so you can see the change in the layout of the city over time.
There we're using the same timeline component that's being used by the main piece; you'll see that in a minute. It's getting to that point where you get that higher degree of reuse. There are lots of frameworks for reusable components, but some of the problems are specific to visualization. This is another one, this nice little slider, a back-and-forth effect that we use lots of times.
It's trying to get these things so we can reuse them a lot faster, so we can do these things closer to deadlines. But you've got this underlying data set that pulls this whole thing together. It means the whole thing is a much more integrated package, basically.
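The scene structure described above, a series of named scenes with entry and exit points sharing one global state, can be sketched in a few lines. This is only an illustration of the idea, not the Guardian's code or the API of the Scene library mentioned in the talk; the names `Scene`, `Walkthrough`, `on_enter`, `on_exit` and `go_to` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Scene:
    """One step of a guided walkthrough (hypothetical structure)."""
    name: str
    on_enter: Callable[[dict], None]  # called when the scene becomes active
    on_exit: Callable[[dict], None]   # called just before the scene is left


class Walkthrough:
    """Moves between scenes and keeps one shared state dictionary."""

    def __init__(self, scenes: List[Scene], state: dict) -> None:
        self.scenes: Dict[str, Scene] = {scene.name: scene for scene in scenes}
        self.state = state
        self.current: Optional[Scene] = None

    def go_to(self, name: str) -> None:
        next_scene = self.scenes[name]  # raises KeyError for an unknown scene
        if self.current is not None:
            self.current.on_exit(self.state)   # exit point of the old scene
        next_scene.on_enter(self.state)        # entry point of the new scene
        self.current = next_scene
        self.state["scene"] = name             # record it in the shared state
```

Because each scene only knows its own entry and exit behaviour, a component such as a map or timeline can be reused in another piece by wiring it to different scenes.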
This is beautiful. [applause]
This was actually put together in a week and a half. [laughs]
That's the deadline system we run to, and that's why we need the code reuse, because there isn't time to write things from scratch a lot of the time. That's what we're trying to achieve.
So you code like you speak, really fast.
Yeah. It's not bad. This, half a year ago, probably would've been a 6,000 or 7,000-pound project for a graphic design studio to make a video like that.
Do you think that marketing materials could be created out of interactive graphing systems like that?
I think increasingly they are. The big case is... Should I sit down again? [laughs]
If you want, unless you want to show more?
I think the big use case here is annual reports. That used to be a big cash cow for graphic design agencies. Every year, they get this big budget to put together a big, shiny, nicely printed report. But I think the big thing now is, companies want custom analytics dashboards. They want nice little iPad apps with graphs on them for their C-level executives to understand what's going on.
I think that's a massive growth area. From talking to friends in agencies, they're doing more and more of that kind of work. I think it's those situations where you need things to look beautiful, be entirely customizable. But they don't want to be writing those charts from scratch every time. Hopefully, this stuff will be really useful to them.
Talking about interfaces and talking about iPads and things, do you think that in the future, or the near future, actually, there will be charts that make much more sense with touch than with anything else? Do you think that the interactivity of these systems could be an interesting point for data visualization in the future? I'm thinking Kinect, these kinds of things?
Yeah. Kinect's a harder one. [laughs] Just because it gets into that Minority Report territory of lots of swooshy things that look cool in movies, but are very hard to actually make useful in real life. But certainly, interactivity between multiple charts and our ability to play with things, I think, is huge. Play is a really big thing.
One problem, obviously, is in analytic systems where they'll be pulling data from across a big corporation. They'll have a couple of hundred properties across all this data, and they want their analysts to try and understand what's going on with it. They want to create interfaces that encourage their analysts to play with that data, to throw things together and see what happens. It's quite an interesting challenge to try and encourage people to do that.
It is almost like using Microsoft Surface or something, so they can just throw things in and see how they go and see if there's a relationship. Yeah, definitely. The touch stuff is really good fun.
Which brings me to UX or, again, usability/accessibility issue. How do you sometimes make people understand what is interactive and how to use it? There's a lot of different interaction patterns in there. Do you find that some of your visualizations need hinting, and what would you do with that?
Well, there's two parts. One thing you might've noticed with the video of that piece was that when you exited the guided mode, there was basically a box just saying, "You are now exiting the guided mode. Have a play. Try things out." I think, particularly when you're dealing with a general audience, you need those kinds of hints.
I think a lot of the Flash pieces we produce still have almost like a page that will come up and say, "You can move this piece around and you can do that." You still need that, to a degree.
The other part is walking people through data. This piece is almost a two-part presentation, because you have that area at the start where you've got lots of text and it tells you a story and it steps you through it, and then you have that free explore. That's an idea called the "martini glass," where you have that very narrow path for a while and then you let people explore.
But the other purpose of having that narrow part is that it trains people as to what they're looking at. So by the time they start playing with it, they have a concept of what they're actually looking at, and that's really important.
It's about finding ways of subtly hinting to people what's going on, because it is really hard. Visual [inaudible], as you see, is low, and there's a lot of really crap charting out there, because people still aren't very good at putting these things together and identifying what's gone wrong.
There was another one I was going to show that JP Morgan put together. Their analysts should know what they're doing, but they had visualized the change in bank market capitalization from just before the crash till after the crash, and they'd done it as circles. One circle for what it was before and one circle what it was after.
Rather than doing it by changing the area of the circle, they changed the diameter, and so catastrophically overstated the impact. It made it look like Citibank was now a 1,000th of its previous size. Things like that are still really, really common.
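The arithmetic behind that mistake is easy to check. The sketch below is an illustration, not taken from the chart discussed; the function names are hypothetical. To make a circle's area proportional to a value, the radius has to scale with the square root of the value. Scaling the diameter (or radius) linearly makes the area shrink with the square of the change.

```python
import math


def radius_for_area(value: float, max_value: float, max_radius: float) -> float:
    """Radius that makes the circle's AREA proportional to value."""
    if value < 0 or max_value <= 0:
        raise ValueError("value must be >= 0 and max_value must be > 0")
    return max_radius * math.sqrt(value / max_value)


def radius_linear(value: float, max_value: float, max_radius: float) -> float:
    """The mistaken version: radius (and diameter) proportional to value."""
    if value < 0 or max_value <= 0:
        raise ValueError("value must be >= 0 and max_value must be > 0")
    return max_radius * value / max_value


def circle_area(radius: float) -> float:
    return math.pi * radius ** 2


# A value that falls to a quarter of what it was:
before = circle_area(radius_for_area(100, 100, 40))
correct = circle_area(radius_for_area(25, 100, 40)) / before  # area ratio 0.25
wrong = circle_area(radius_linear(25, 100, 40)) / before      # area ratio 0.0625
```

With linear diameter scaling, a fall to a quarter is drawn as a circle with one sixteenth of the area, so the reader sees a far bigger collapse than the data shows.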
Where do you learn this? Are there any good books that people can look at? Are there any online tutorials that you would think people should go to?
Yeah. There were a few more slides I didn't get to show. I talked about Edward Tufte. His books are probably still the seminal reference, but they're very print focused. The interactivity is something he doesn't address particularly well. There's a new book called "The Functional Art," by Alberto Cairo. That's probably the best, most intellectually coherent book I've seen, in terms of analyzing how to produce interactive graphics well.
That's absolutely fantastic. Beyond that, I think, it's playing and just looking at what's out there. I think, because of our market, and because of the problems we face, newspapers are at the cutting edge for a lot of this. Between us, The New York Times, La Nacion in Argentina, these papers are probably doing more research in this area than almost anybody at the moment.
Simply because of the problems we face.
How about interactivity with video? We've got this system called "Popcorn." We were talking with John...
Popcorn's fantastic. Yeah.
Do you do any of that right now or is it just more of a Flash-based thing that people still do?
That's painful enough. Trying to do it with video is even worse. Actually, the video's really interesting, particularly in mobile. Because there's a lot of interactive formats that don't work well in mobile. But you can produce a simple video that works quite well. The next step on from that will be starting to add simple interactivity to those videos.
Obviously, with mobile, we can trust a lot more what browser support there's going to be. So I think things like Popcorn are really interesting in that area.
Cool. Are there any questions from the audience? Do we have a microphone to go around? Or just shout it out. [laughs] That worked well. This is English. Did you do any international work as well? The Guardian has a world section as well. Do you see those differences in visualization, in different countries, how they work?
Are American visualizations flashier than English ones?
I think you see... There's a difference, particularly in South America. If you look at some of the investigative work that's being done in South America, in terms of data, they take it very seriously. They're dealing with very serious issues, corruption and the like. But they do really cool work and they tend to do it in a much drier style.
I think because of the meaningfulness of the work. So it's much less playful. Other than that, not particularly.
I remember Google had this wonderful visualization API that has now been discontinued. Which things in there did you think were the most useful? When somebody gets a piece of data, what's the simplest way to show a relationship within the data?
It depends on the shape of the data. If it's a time series, you throw it into a line chart. If it's a set of variables, then a scatter plot, essentially. You pick each set of two variables, you throw it into a scatter plot, and you start seeing the clusters.
If you're working with data like that, R is an amazing tool, because you can very quickly knock out histograms, line charts, box plots and get a real sense of what's going on with the data. Having that as an open-source tool is wonderful, I think, and you can do almost anything out of the box with it now.
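The "pick each set of two variables" step can be sketched as follows. This is a minimal illustration in plain Python rather than R, with hypothetical helper names and made-up sample rows; it only prepares the point lists, and any plotting library could then draw one scatter plot per pair.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def variable_pairs(columns: Sequence[str]) -> List[Tuple[str, str]]:
    """Every unordered pair of variables, one scatter plot per pair."""
    return list(combinations(columns, 2))


def pair_points(rows: List[Dict[str, float]], x_name: str, y_name: str) -> List[Tuple[float, float]]:
    """The (x, y) points for one scatter plot."""
    return [(row[x_name], row[y_name]) for row in rows]


# Made-up sample rows, purely for illustration.
rows = [
    {"height": 1.0, "weight": 2.0, "age": 3.0},
    {"height": 4.0, "weight": 5.0, "age": 6.0},
]
plots = {
    pair: pair_points(rows, *pair)
    for pair in variable_pairs(["height", "weight", "age"])
}
```

For n variables this gives n(n-1)/2 plots, which is why the approach works well for exploration on a handful of variables and gets unwieldy beyond that.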
How much do you give back to the projects? You're using them commercially, in this case, is there anything that you've found in D3 that needed fixing, that you send back to them?
Yeah, we've pushed out various little bug fixes. Nothing big yet. We've actually got a couple of map projections we want to push back to D3. But increasingly, as we hit little issues, definitely. With D3, Mike Bostock is now at the New York Times, so you've got quite a close relationship now between media and the people writing these things, which means that feedback loop is a bit faster.
Now, in terms of fixed sizes and responsiveness, is there any way to get these things responsive or I would say zoom and getting the...
We can to a degree. This one actually does full screen in Firefox and Chrome, which is nice. That works, it's just going small beyond desktop is the really hard thing, but you can definitely scale them up. Things like, we've done mapping things where you just reorganize the interface bit and that just about works.
One question was how do they print and is it really sensible to print any of them?
Well, you can reuse some things. You look at like the circles in that map, that's all rendered in SVG so you basically just pull that out on top of a normal map and that can go straight into Illustrator and out to the printers. We do that every now and then but not that regularly because it tends to be a different version of the same content.
There'll just be nuances about it that mean it's better to do it in a different way for print than it is for desktop. They're fundamentally different media, I guess.
As I said before, infographics and visualizations are a big thing. There's actually spammers out there that send you a daily visualization that you could put on your blog for them. I'm like why would I ever do that? What is the biggest cliché in visualizations that really grinds your gears?
Bar charts that use things other than bars to show size, so you end up with these weird shapes that then don't do a very good job of explaining the different height of things -- that's particularly horrific. The other one is network diagrams with 10,000 points. You call them hairballs because you end up with these giant plots of just mess. [laughter]
They're still quite common in science, particularly when it has to do with lots of data. There are better visualization techniques. If you're dealing with a hairball like that, what you can actually do is use a matrix visualization. Rather than showing all the nodes and relationships between the nodes, you basically just do it as a big table.
You put all the nodes on both sides and if there's a relationship between two nodes, you basically color in that block. That scales out to any amount of data. It's nowhere near as sexy. It doesn't require 3D to render, but it's very, very functional.
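The matrix technique described above can be sketched like this. It is an illustration only: the function names are hypothetical, the links are assumed to be undirected, and the text rendering stands in for coloured blocks.

```python
from typing import List, Sequence, Tuple


def adjacency_matrix(nodes: Sequence[str], edges: Sequence[Tuple[str, str]]) -> List[List[int]]:
    """Square table with every node on both sides; 1 marks a relationship."""
    index = {node: i for i, node in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0] * size for _ in range(size)]
    for a, b in edges:
        i, j = index[a], index[b]
        matrix[i][j] = 1
        matrix[j][i] = 1  # assumption: relationships are undirected
    return matrix


def render(nodes: Sequence[str], matrix: List[List[int]]) -> str:
    """Text stand-in for the chart: '#' is a coloured block, '.' is empty."""
    lines = []
    for node, row in zip(nodes, matrix):
        lines.append(node + " " + "".join("#" if cell else "." for cell in row))
    return "\n".join(lines)


nodes = ["a", "b", "c"]
matrix = adjacency_matrix(nodes, [("a", "b"), ("b", "c")])
print(render(nodes, matrix))
```

Each node always occupies the same row and column, so adding more links only fills in more cells and never produces crossing lines.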
Talk about 3D. Do you think it's more confusing or actually telling a story better than a 2D chart would?
It's very rare it's more effective. 3D pie charts and things like that are just horrible, because they just make it much harder to understand what's there. If you look at mathematical charting, they tend to use different techniques to avoid having a third dimension.
You get contour charts, for example, purely to avoid having that 3D effect, because it is just such a waste in terms of actually trying to communicate something. Every now and then, there'll be something but it's very, very rare.
Are there any tricks that, from time to time, you have to use in browsers and you think they shouldn't be necessary? I'm thinking for example, when plotting on Canvas, you have to do the half-pixel up rendering to the next half-pixel rather than just a pixel. Are there any things like that as well that are just annoying, that could be better?
I don't think it'd be an easy thing to fix. We use Canvas in situations where there's just too much to render with SVG, but the biggest problem you hit then is that you have to do hitboxing by hand.
Then you've got to weigh the overhead of the hitboxing against the overhead of the SVG nodes and work out which is least painful. That's a big problem. It's probably the biggest with Canvas. If there was some way of having hitboxing done automatically rather than having to do it by hand, that'd be wonderful.
We could not ask the audience what hitboxing is, but I'll ask you myself.
If you click on a canvas and you've got a circle there, to work out whether someone's clicked on that circle you've got to know exactly where the circle is on that canvas, and the pixel reference of the click, and then work out what you've actually clicked on.
Oh, yeah, because there's no event model, it's just a bitmap.
Yeah, exactly. You end up doing things like having a second canvas, which you then colorize based on what the object is and then you just look up the color reference.
Against a look-up table. Techniques like that, but it's quite painful and time consuming to do. [laughs]
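Both approaches mentioned above, working out the hit from the shape's geometry and colouring a second hidden canvas to look the object up by colour, can be sketched without a browser. This is a simplified illustration, not the Guardian's code: the "canvas" is a plain grid of RGB tuples, coordinates are assumed to be integers, and every function name is hypothetical.

```python
BACKGROUND = (0, 0, 0)  # colour reserved for "nothing here"


def hit_circle(px, py, circle):
    """Geometric test: is the point inside (or on the edge of) the circle?"""
    dx, dy = px - circle["x"], py - circle["y"]
    return dx * dx + dy * dy <= circle["r"] ** 2


def topmost_hit(px, py, circles):
    """Check shapes in reverse draw order so the one drawn last wins."""
    for circle in reversed(circles):
        if hit_circle(px, py, circle):
            return circle["id"]
    return None


def index_to_color(index):
    """Unique colour per object; index 0 maps to (0, 0, 1), never background."""
    n = index + 1
    return ((n >> 16) & 255, (n >> 8) & 255, n & 255)


def build_pick_buffer(width, height, circles):
    """Paint each object onto a hidden grid in its own flat colour."""
    buffer = [[BACKGROUND] * width for _ in range(height)]
    lookup = {}
    for index, circle in enumerate(circles):
        color = index_to_color(index)
        lookup[color] = circle["id"]
        for y in range(height):
            for x in range(width):
                if hit_circle(x, y, circle):
                    buffer[y][x] = color  # later shapes overwrite earlier ones
    return buffer, lookup


def pick(buffer, lookup, x, y):
    """On click: read one pixel, then look its colour up in the table."""
    return lookup.get(buffer[y][x])


circles = [{"id": "a", "x": 5, "y": 5, "r": 3}, {"id": "b", "x": 7, "y": 5, "r": 2}]
buffer, lookup = build_pick_buffer(12, 12, circles)
```

The geometric test costs more per click as the number of shapes grows, while the colour lookup is a single pixel read but needs the hidden buffer redrawn whenever the scene changes. On a real canvas, anti-aliasing can blend colours at shape edges, so a pick buffer is usually drawn with that in mind.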
These will be interesting blog posts to do, to explain.
Yeah, I'd like to find time to write up more.
Well, you speak in 20 minutes what other people took an hour, so it should be...
Yeah. I think we've grilled you long enough. Thank you very much, Alex. [applause]