An Introduction to Open Data by Sally Jenkinson

Transcript

[APPLAUSE] Hello, so as Bruce has said, my name is Sally Jenkinson and I'm definitely not a synergy leverager whatever the hell that is.

Um, if you could all avoid that, that'd be great.

I'm also sjenkinson on Twitter if you want to get in touch.

And I do, basically, work as a freelance consultant and solutions architect through my company.

And we help people like this from big to small with project discovery and research, and solutions architecture, and doing transformation around all the systems and processes.

And we basically try to help people understand what they need, the choices they have, and how to, technically, make that happen.

And that's the kind of stuff that I typically talk about.

But one of the great things about my job is that it is really broad and it constantly challenges my perspectives on things that I take for granted.

And last year, I worked on a project that was really different for me and that kind of influenced this talk and it's this that I want to talk to you about today.

So, I want to talk to you--ooo sorry-- yeah, I want to talk to you about data and specifically I want to talk to you, as Bruce has said, about this idea of open data.

And everybody has gone absolutely nuts for data over the last 10 years and I'm sure the big kind of buzz words like this are no surprise to you.

And last year, IBM actually published a stat of 90% of the world's total data has been created within the last two years, and it's something that's really become so ingrained as part of our projects.

And if you dig on places like Dribbble, for instance, for terms like dashboard, everybody's at it, right? So just a quick show of hands to kind of validate this a little bit.

Who here collects data, whether that's in your projects or perhaps you've got, you know, a wearable and you collect personal data? Yeah so most of you, I'd say.

Who here kind of ingests it as part of the project? So you might use something like an address-based service to do things like address completion or verification, or to help with standardization, or you might do some geocoding or something like that? OK, a few-- fewer.

Um, and who actually exposes it? So does anybody create an API, or put data out there at all in any way? Getting a bit less again.

And does anybody create content around data? So does it inform the content that you put on your sites, perhaps? Even less again.

So there's interesting different ways that we can actually engage with data.

And one of the things I want to do today is to hopefully kind of expose you to a few more ways that you might be able to use it in your work.

So personally, I really love data.

I'm a bit of a nerd, and it's not always directly part of my job, as I've said.

But I really do find it fascinating, and it touches so many different aspects of my life.

Now I obviously use Google Analytics-- as I'm sure many of you do-- on my own site.

Professionally, I kind of use it to identify any issues people might be having and stuff like that.

But personally, I like to just look for really weird things in it.

So up here that massive spike was when people found my cookies page.

You can have a look at that yourself if you want to.

If I wanted to I could see which of my tweets were most popular.

So lots of services are building in this kind of ability for us to analyze how people are reacting to us-- how we're engaging with them.

Now my Last.fm profile is quite an interesting one because what this is meant to do is it's meant to show all the music that I've been listening to over the last 10 years-- except what it actually does is to show me how I've been consuming music, whether it's been tracking or scrobbling it or not.

And as you can see there's some massive dips there which is when I've, sort-of, been doing different things in my life or using different services.

And I also make stupid stuff with other people's data.

So this is using the Marvel API and D3 to help me to find out which are the most popular characters by the number of times they've appeared in individual comics, or series, or stories, or events, and things like that.

And I make stupid stuff with my data too.

So this is my-- utterly pointless, soon to be retired because I can't handle the pressure-- tea tracker page which details all of my tea drinking habits since November 2013 and in case you're wondering, today I have had cups of jasmine tea with the last one being at 9:40, which you could check out live.

If you so desired.

But I'm sure, like me, when you make stuff and you're not working with your own data, you probably found yourself being deliberately limited by some of the data providers out there or maybe being bound by terms that don't quite meet your needs of what you're trying to do with it.

Many of the database services that we actually use have these very strict terms and conditions around usage, or they can be prohibitive in other ways.

So they can end up locking off data from us, deliberately, or restricting access in certain aspects.

The Twitter API is, obviously, really notorious for this especially with regards to kind of taking away the access that it may have initially given out.

But if we look back at the Marvel API this, for instance, says, you can only use the data for non-commercial applications that don't portray Marvel in a negative manner.

You can't change, edit or augment the content, and any time Marvel's data appears on screen it must be accompanied by Marvel's copyright notice and a link back to their own site.

And they also limit the delivery mechanism.

So, it's free at the moment but they reserve the right to charge in the future.

And it's rate limited in terms of the terms and conditions to 1,000 API calls per day, but I actually logged in last night and the documentation there says 3,000, so you get this kind of conflicting message.

There are also rules around caching and content retention in that basically you can't.

Well, you can to an extent but not long term.

And also if they terminate the API then they could come and ask you to delete all of your content.

So it may completely ruin whatever product you built on it.

And this is my favorite-- so you agree to maintain your apps and your systems in accordance with industry standard quality levels because, as far as I'm aware, we're still all yet to agree on actually what a web app is and whether they exist at all.

So I'm not sure whether we've got kind of industry standard quality levels to a definition just yet but good for them to put that in there.

So anyway, we can't always get our hands on the data that we want but, of course, getting your hands on it is only one part of the puzzle.

So sometimes we're restricted by whatever, whoever has put it out there-- in terms of what we can then do with it as well.

So actually the Marvel API is an example of data sharing which is the release of data for restricted purposes to restricted people or organizations.

And it's doing something with data that really matters.

So through using open data, the goal is that your consumption, and your analysis, and your presentation of data shouldn't be controlled by whoever's published it.

And this leaves you free to use it in ways that they may not have originally anticipated and it also allows people to subsequently build on your work as well.

But OK, what is it that actually makes data open? We've heard this term a few times now.

So that we've got a shared understanding of it, we're going to use a definition.

So this is on opendefinition.org and it's one of the most popular definitions.

It's not the only one.

And as you can see there's a really detailed explanation, should you wish to read it, but it can basically be summed up as, open data and content can be freely used, modified, and shared by anyone for any purpose.

So this isn't a new concept.

It's just one that kind of goes against some of the more traditional attitudes of how we approach data on the web.

And it's also something which is being standardized a lot more as time's going on.

And in fact, there was recently a great diagram from the people at The Open Data Institute in the UK of what they call the data spectrum.

So this is basically everything from open data on that side, which is just what we've talked about-- which has an open license and allows people to use it freely-- through that spectrum of it being shared and it being out there but maybe has a license that limits use, through to things that have named access, so only certain people can access it at all, or that it's internal so things like employment contracts and policies.

And if you use open data, or if you put data out there openly then you start to give people freedom.

So they can work with this data without restrictions from copyrights, or patents, or other mechanisms of control-- so by opening up this data you could potentially allow people to maybe republish the content or data on their website, or to derive new content or data from yours.

Maybe to make money from selling products that use your content or data or to republish the content or data whilst charging a fee for access.

And these last two are quite important to understand because the data itself should meet all of these criteria about being open so that the data itself can be freely used, and modified, and shared by anybody for any purpose.

But you might potentially choose a delivery mechanism with a commercial aspect like an API that requires registration or has different, sort of, pay for levels of access.

So open data does not necessarily mean unlimited free access via APIs for everybody.

So this is something that's both incredibly simple but actually quite confusing at the same time.

And one of the reasons that I wanted to give this talk was because there's quite often a lot of different interpretations or misinterpretations about openness, especially in different contexts.

And I think that this is an area that, it's kind of like, frequently misquoted and misunderstood by certain people.

And there was a great article recently which discussed naming conventions-- again from The ODI who you'll see crop up a lot in this talk.

And as here, in this quote, open data isn't something entirely different, in terms of its formats, it's-- it's not magical in its raw form, it's just more about the mentalities that lie around it.

So you're probably starting to get the picture, that actually a lot of this comes down to licensing and you'd be right.

So these are some of the licenses for open content which you might be really familiar with, and in terms of databases it's again, really, really similar.

There are other licenses that enable reuse, which you might encounter, particularly around public sector information, so it's not just Creative Commons.

The open government license is an attribution license, for instance, that covers both content and database rights and it's mainly used for information made available by central government-- so the UK has one, Canada has one, I don't know if the Netherlands has one but you'd probably have something similar.

And the OS open license is extremely similar except it just stipulates that attribution is back to Ordnance Survey instead.

Licensing is really boring so I'm not going to spend a whole talk on that.

And--but let's look at actually why this is all exciting because there's some amazing data out there.

And when you talk about open data, you tend to find that a lot of the same sets tend to crop up again and again.

So you get things cited like DBpedia which is great because it essentially makes the content of Wikipedia available in RDF format and also incorporates these links to other data sets on the web.

And then we've got something which isn't super exciting, but if you want to overthrow the conservative government in the UK like Bruce you might want to go in here, which is data.gov.uk.

This is a collection of data sets available from all the UK Central Government Departments and a number of other public sector bodies and local authorities.

But then there's also data out there about pretty much anything you could possibly be interested in.

So if you like music-- a lot of music yesterday-- there's MusicBrainz which is the open music encyclopedia and you've got the Million Song Dataset.

If you like geography, you've got some amazing earthquake data and this is some of the data around earthquakes this September just gone.

And if you're a real nerd-- I don't even understand what half of this means but-- you've got current and voltage measurement sampled at 30 kilohertz from 11 different appliance types, present in 55 households in Pittsburgh-- if you want to get really specific.

But personally, I lived in Indonesia for a long time so I have a, like-- I have an interest in the country.

I like to keep up with what's going on there.

But even then, I'm not entirely sure that I'd want data on all the billboards in Jakarta including their orientation, but that leads to people making infographics like this.

So there's something out there for everybody but for me it's probably more likely to be a data set of 10,000 cat images and their attributes-- or there's 20,000 dogs, if you prefer.

Or I'm sure you can do some really interesting social analysis using the 1.7 billion JSON objects representing public Reddit comments, which I'm sure you could get some really strange and interesting stuff out of.

But whatever you want, the chances are that it is out there in some form.

So when I started putting this talk together originally, I started collecting together a few of these data sets that I thought might be quite interesting for people.

But then, as is always the case, somebody sent me this link, which is just a repo full of some really, really interesting things from the more bizarre ones through to some quite big portals and standard sources of data that you typically find.

So, not everything in here is open in the true definition but have a look at the licenses and have a look at the stuff there because there's some really great things.

So, in terms of kind of using all of this-- consumption is pretty straightforward and this is the bit where I kind of stand up here feeling a little bit like a fraud because as we've seen from the earlier quote, the data itself isn't special.

In terms of the technicalities, it's kind of like most data that you're going to be using-- or it is.

The key points, if you are going to be using open data in your projects, are that it has an appropriate license, that it's of a good quality and level of detail for your needs-- and I'll come back on that latter stuff in a little bit.

But once you've found what you need, you're commonly going to be working with data that we all know and love, so formats like CSV, JSON, RDF, XML-- different levels of love for each of those, but pretty much what we're already doing when we're working with it.

So there's nothing hugely special here and you might be downloading it periodically or you might have access to an API or something else.
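As a rough illustration of how ordinary this consumption step is, here is a minimal Python sketch that reads the same records from CSV and from JSON using only the standard library. The station names, field names and readings are invented for the example; a real open data set would be downloaded or fetched from the publisher's API, and you would check its license first.

```python
import csv
import io
import json

# Hypothetical sample data, invented for illustration only.
CSV_TEXT = "station,pm10\nCity Centre,18\nRing Road,42\n"
JSON_TEXT = '[{"station": "City Centre", "pm10": 18}, {"station": "Ring Road", "pm10": 42}]'


def rows_from_csv(text):
    """Parse CSV text into a list of dicts, converting the reading to an int."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"station": row["station"], "pm10": int(row["pm10"])} for row in reader]


def rows_from_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


if __name__ == "__main__":
    # Both formats end up as the same plain Python structure.
    print(rows_from_csv(CSV_TEXT) == rows_from_json(JSON_TEXT))
```

Once the records are plain dictionaries, the rest of the project does not need to care which format the publisher chose.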

Now one thing to watch out for is that some people actually release data openly as they claim, but it is actually stored as PDFs and this is another example of this kind of misunderstanding of the different kind of ways that you can go about things.

So as we all know, PDFs aren't really structured very well or machine readable in, you know, for parsing in the same way that, say, JSON is-- this wouldn't be considered desirable to people so watch out for that when people do claim to have it out there openly.

And then, in terms of, once you've got it, once you've grabbed it, once you're doing something with it, you obviously want to present it back and there are loads and loads of different ways to do this.

So I'm not going to go into too much but I'm sure we've all heard about libraries like D3, there's Processing, uh, the Jam Sessions, a load of ones that were name checked.

But of course, this is going to tie in hugely to what you're doing, so what I'd encourage you to do is to very much think about whether you need to just slap a dashboard on there for the sake of it and what end you're, sort of, you know, you're trying to get to.

So think about the purpose again.

It may be that, actually, your interaction with open data is much more functional and it's much more happening in the background.

So think about whether we do need to, sort of, have this visual aspect or whether the data is hidden away.

But of course what we want to try to do here, is to try and make it simple for people to understand the point that we're making.

And of course, if we're making a point on the web, we want to follow all the good stuff that we've heard about yesterday, especially.

So we want to be using standards-based web technologies, we want to make it performant, which can be tricky sometimes when we are working with large data sets, and we want to follow progressive enhancement.

So all the good stuff that we've, helpfully already had covered.

But these are still just websites and data sets for the most part.

So they're kind of contained.

But I'd like to challenge you a little bit today.

So in order to do that, let's look at some slightly different examples.

So I want to start doing that by telling a story about this guy and this fire hydrant.

So this is Ben Wellington-- afraid I don't know the fire hydrant's name but-- Ben lives in New York and this is a place where since 2012 they have had a comprehensive, city wide open data policy.

And as you can imagine, all of that data being put out there is great but it takes analysis to really maximize it.

So New York is also a place where if your car is parked within 15 feet of a fire hydrant it can be ticketed and towed away.

And now thanks to the availability of the open data around fire hydrants in New York, Ben ran a little experiment to see how much people were actually being fined for parking near hydrants like you do-- you know, a standard Friday night.

But he actually found something pretty interesting.

So he mapped the top 250 grossing hydrants in New York, and our friend the hydrant on the previous slide was found to be grossing $33,000 US a year in fines.

Now here's one of these ones down there-- so the big two blobs.

And actually on the next block as well, the second most profitable hydrant was generating $24,000 US a year.

So that's a huge amount for two hydrants in close proximity.
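To give a feel for the kind of analysis involved, here is a small Python sketch that totals fines per hydrant and ranks them. It is not Ben's actual method or data: the hydrant identifiers, the flat fine amount and the ticket records are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical ticket records: (hydrant identifier, fine in US dollars).
TICKETS = [
    ("hydrant-a", 115),
    ("hydrant-b", 115),
    ("hydrant-a", 115),
    ("hydrant-c", 115),
    ("hydrant-a", 115),
    ("hydrant-b", 115),
]


def top_grossing(tickets, limit):
    """Sum the fines for each hydrant and return the highest totals first."""
    totals = defaultdict(int)
    for hydrant_id, fine in tickets:
        totals[hydrant_id] += fine
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


if __name__ == "__main__":
    for hydrant_id, total in top_grossing(TICKETS, 2):
        print(hydrant_id, total)
```

The grouping itself is trivial; the interesting part is that the license lets anybody run it and publish what they find.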

So this was a bit of a puzzle as to why and Ben did a bit more investigation and found that these two parking spaces were actually extremely confusing.

Basically, it felt like a trap.

So there is this wide curb extension between the street and fire hydrant, as you can see-- which I think may actually have been a cycle lane.

So there's a bit of confusion whether it was actually part of the road, part of the pathway, or a cycle lane and it's-- if that's different.

And additionally, the most confusing thing was, as you can see where these cars are, there are actually painted parking spots right where you would be fined if you were parked.

So Ben kind of put this data out there.

The story was picked up by a lot of different people and the Department of Transport eventually came back and commented and they said, well DOT has not received any complaints about this location, we will review the roadway markings and make appropriate alterations.

And if you look at the fire hydrant now, on street view, you get this slightly weird patching, but you can see that it looks like this.

So they actually painted over a visual kind of restriction to people so that they know they can't park there.

And now of course, this is one of the big challenges against open data, that people kind of, you know, on the-- maybe-- business end of the spectrum-- on the less good side-- they always raise this question of, how happy people are to be losing out on all their lovely revenue, in terms of people like this coming along and going well actually.

But, of course, in terms of the public good, open data, and Ben did a great thing here.

But it's not actually just about companies losing money because the economic impact of opening up data has been the subject of a lot of recent studies, showing benefits in terms of both generating value and cost savings.

So you've got things like the 2013 McKinsey and Company report which estimated that open data could generate $3 trillion in additional value per year across seven different sectors.

Now, obviously, the story in New York came out of the fact that data was mandatory to release, but people also want to be seen as being transparent.

And there's a great example of this in a book called "Thinking In Systems" which has a little bit about how in 1986 there was new federal legislation which required US companies to report all chemical emissions from each of their plants.

And this became publicly accessible through people applying for it through freedom of information requests.

And actually, what happened was that many of the emissions being generated weren't at illegal levels or breaking any rules but you started to get wonderful-- I suppose, the BuzzFeeds of their day-- reporters were getting the information and starting to put out lists like, the top 10 polluters.

And so, actually, what happened was that due to things like that, within two years nationwide emissions had decreased by 40% because transparency isn't just about giving people access it's actually about giving them the freedom to analyze your data and to share it and to reuse it.

And as we see with Ben's hydrants, often to really understand problems, data needs to be visualized, perhaps, or combined with other data.

And this requires that the material is able to be freely used however you want to.

Now also with the polluting data, freedom of information requests are fantastic but it would save everybody an awful lot of time if it was actually open by default.

So I want to share one more example of how we can do great things when data is more open, when we've got more freedom.

So, I'm a big advocate of thinking about the experience that users have when you consider any kind of technical or build decision and one of the benefits that open data can bring is to be able to combine these data sets in ways that answer human questions rather than being purely data led and just putting it out there.

This is what Mapumental have done.

So they've created a tool that helps people to find a viable part of the country to be able to live in based on the amount of time that it takes them to get to work or to do life based things, rather than thinking in terms of miles which is what we typically do.

So, they created a framework to do this.

But then once you've got that, you can apply that to much bigger problems than, you know, say where people should live in order to be able to get 15 minutes more sleep in the morning.

So you can maybe create a tool for planners to think about public services as this quote shows really well.

Because in Wales, as it's really used to illustrate, the amount of distance, again, isn't representative of the time that it takes to get to some of these key places.

And another one, another sort of question that you might have that isn't easily answerable from raw data is around emergency services.

So you want to be able to ask human led questions like, how quickly could four fire engines get to a specific postcode at 2:00 AM or, could we get a helicopter somewhere, you know, within a certain amount of time.

And so tools like this can only come about when there is openness and lack of restrictions on the data's usage.

The more that we place restrictions, the fewer of these benefits we're actually likely to see.

So we can help to improve efficiency and effectiveness and we can really, really measure the impact as well.

And data's this really key resource in the modern age and the more that we can release it or have the ability to consume it without limitations then the more likely it will be that we can go out there and build, really, some innovative products that answer these big questions.

But, while most of my focus today is obviously on the web, you don't just have to work with open data in digital ways.

So people are also using these open principles in conjunction with the physical world and to create, actually, pieces of art.

So by opening up your data to any usage whatsoever, you can get these amazing interpretations through completely different formats that were never originally intended.

So most limiting digital terms don't actually think about this.

So if we think back to the Marvel API, how can I meet these criteria if I want to use their data to do something physical?

So I'm sure that I'm probably going to be breaking their caching limitations instantly and I'm not sure really how I'm going to do a hyperlink back to their homepage, if I'm making something physical but.

As a halfway house, because I know that we all like to cling onto the internet, you've got schemes like opensensors.io which aim to take this enthusiasm around the internet of things and to in-- encourage the publication of open data within your physical projects through the provision of a platform that facilitates this.

And they give examples of things like environmental sensing projects from communities who might want to monitor their air quality or water quality.

But then you can put that data out there and also, maybe, combine it with data from other communities on a real time basis or on a historical basis as well.

But if you want to take it further, and completely step away from the internet, there's a chap called Doug McCune.

He does a load of artistic projects based on data and I'm going to show you some in a sec.

But his particular slant, is that he loves to take the most horrible, nasty data he can find and to really make something beautiful out of it.

And I think that this is particularly interesting to me because having come from a development background, Doug really nicely illustrates this shift in perspective because he started off really similarly.

So he was just using data in his projects but then he found his relationship changing and there's a great talk that he gives called, Desperately Trying to Remove the Air Quotes Around the Word Artist, which documents his shift in identity from a developer through to a fully fledged artist.

So here's one of his pieces.

It's called Drunk Traffic Map of Portland DUIs.

And it's a print-- the number of driving under the influence arrests on each street in Portland over the course of 10 years.

And this is a Haiti earthquake tree trunk map which is a representation of the 2010 earthquake that destroyed much of Haiti.

And this is laser cut out of different kinds of bamboo and it shows the shake intensity data that radiates out of the epicenter near Port-au-Prince.

And this one is a 2014 South Napa Earthquake 3D print.

The source data here comes from the earthquake data source I showed you earlier.

This, sort of, the contours of the 3D print represent the peak ground velocity which is a measure of how intense the shaking was which is really closely linked to damage, as you'd expect.

And this one's my favorite, I think.

It's the Bay Area Homicide Constellation Map and it's a map of murders throughout the Bay Area, what he's done here is he's taken the data for murders in 2013 which are, sort of, connected in terms of geography and he's drawn the lines between them and then just to give it a bit more context he's put the data for 2014 on just as individual points.

So lovely, lovely work-- horrible, horrible data.

And this is some work from Stefanie.

So, Stefanie works with a range of open data.

She's done some great talks recently, which you should check out, but through her project Air Transformed she focuses on the question, what if we could really see and feel the burden that air pollution places on our bodies.

And so she's created, in her words, a series of wearable data objects which are pretty much kind of like these necklaces.

And she's used air quality data from Sheffield in the UK which as it says in the quote is a-- it's a really sort of notorious city in terms of the history there, around air pollution.

And she's taken that data and she's applied a couple of different categorisations.

So you've got the physical kind of shape and you've got a color scheme as well.

So that ranges from green being quite good through to, sort of, reds and orange.

And so these necklaces end up looking like this where you've basically got a good week here, which is very nice, friendly colors, friendly, sort of, shapes, nice and rounded and soft.

But it's really apparent, the message that she's trying to convey, in terms of if you look at Bonfire night you get something that is very physically jarring and I'm sure that I'd be stabbing myself in the neck every time I tried to wear this.

But it's really obvious, both in a kind of physical sense and a visual sense the impact that bad air quality data is having here.

And this is what things like open data can let you do, so Stefanie, Doug, and Ben, in their own way, they're using open data as a way to get their personal message across.

So you might be somebody who wants to actually make a statement, to get your message across like that, but actually it could be that you just want to solve your own problems.

So there's a Danish lady called Tina Muller, and I really like that she saw the problem that people with bladder weakness don't always feel confident in actually kind of going out and exploring the city, which for somebody who's come to Amsterdam and likes to wander around, I really take that for granted.

But what she's done is to use open data to solve that very niche problem of allowing people to have confidence and be able to get out and about.

So if there is data out there without restrictions, you can actually use it to help make your lives better or to make a big impact.

Up to this point, the examples I've shown you have pretty much been quite self-contained and had a single focus, but there's a great TED Talk out there by Tim Berners-Lee, which you can find on the data.gov.uk site amongst other places, but he basically talks about the growth of the internet and how originally he came along, he asked us to put all of our documents on the web, and now we have, and now we can't stop.

We're addicted to it, and it's just more and more cats every day, and it's great, it's fantastic, but what he also wants us to do is to share the data that underpins those documents but to maintain this original powerful concept of the humble hyperlink, so the infinite navigable journeys that we can go on that make the web so special.

Because it's not just about these individual silos of data, it's about actually what you can do with them and, more crucially, the relationships you can forge between them.

So by having this freedom to reuse and combine data, we gain some great insight into solving problems.

And this is something that we've actually done a lot through the ages, and it's not exclusive to the web at all, but the web can really help facilitate it and to make it more accessible to people.

So an example that I love, not from the time of the internet, is the relationship between drinking water pollution and cholera in London in the 19th century, and this was discovered by Dr. John Snow, who basically took the data around deaths from cholera-- we're back to the really horrible data-- but he combined it with the location of water wells, specifically ones around Broad Street in London, which are up there.

He did that, he worked out this relationship, because previously they had thought that this was an airborne thing maybe, and the thing that I like about this story is that, interestingly, there was one significant anomaly, so none of the monks in the adjacent monastery contracted cholera at all.

Actually, again, when you dig further into this and you find out the human stories and the relationship with the data, this wasn't actually an anomaly, it was further evidence, because the monks only drank beer, which is how they protected themselves, and they brewed this themselves.

And also residents in, or sort of near the brewery, they also weren't affected because of the result of fermentation of the contaminated water.

So the beer was safer to drink than the dirty water from the street pump.

So this eventually led to the building of London's sewage system, and it hugely improved the health of the general population.

So we see the same benefits now.

The more we can standardize and the more we're able to use data in combination, the more easily patterns will be identified and new knowledge can be gained.

So this principle of linking up data is a really powerful one, and it's gained a lot of prominence since it kind of started to get talked about in 2009.

And when you start to view data as being linkable, you build up this picture of the amazing resource that we have out there.

You may see some names in here from the previous sites that I showed you, so things like DBpedia.

But, of course, as we can see from this diagram, where there's great potential to combine data, there's also huge potential for things to get very, very messy, and I really don't envy the people that put this together.

But, as Lisa was telling us yesterday, this is where standards come in.

So the concept of linked data itself is a standard, so it's defined as a set of best practices for publishing and connecting structured data on the web.

And it comes with guidelines to how best approach it so key technologies-- obviously HTTP for requesting or receiving data, URIs for being generic identifiers for entities or concepts, and probably most relevant, RDF which, if you don't know, is a generic graph-based data model with which to structure and link data that describes different concepts.

So the basic principles are reasonably simple.

So it's to use the RDF data model to publish your structured data on the web and then to use RDF links to interlink data from different data sources.

So very simply, you could have one data set, which is described in a really structured way, which has a semantic relationship to another and that, in turn, may have relationships to other data sets and so on.
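To make that a bit more concrete, here is a minimal sketch in plain Python rather than a real RDF library. Each fact is a subject, predicate, object triple, and every identifier is a URI. The `example.org` URIs and both tiny data sets are invented for illustration; the `sameAs` predicate and the DBpedia resource URIs are real identifiers, but nothing here is fetched over HTTP.

```python
# Hypothetical triples: every example.org URI below is invented for illustration.
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
LOCATED_IN = "http://example.org/vocab/locatedIn"
COUNTRY = "http://example.org/vocab/country"

# Data set A: one publisher's data about a water pump, linked out to DBpedia.
pumps = [
    ("http://example.org/pumps/broad-street", LOCATED_IN,
     "http://example.org/places/london"),
    ("http://example.org/places/london", SAME_AS,
     "http://dbpedia.org/resource/London"),
]

# Data set B: somebody else's data, keyed on the DBpedia URI.
places = [
    ("http://dbpedia.org/resource/London", COUNTRY,
     "http://dbpedia.org/resource/United_Kingdom"),
]


def objects(triples, subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def follow_links(triples, start):
    """Return start plus every URI reachable from it via sameAs links.

    Only follows links in the direction they are stated, which is a
    simplification of how sameAs is normally interpreted.
    """
    seen, queue = {start}, [start]
    while queue:
        current = queue.pop()
        for target in objects(triples, current, SAME_AS):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Combining the two sets lets a question cross from one publisher to the other.
combined = pumps + places
aliases = follow_links(combined, "http://example.org/places/london")
countries = [c for uri in aliases for c in objects(combined, uri, COUNTRY)]
print(countries)
```

The point of the sketch is only that a shared identifier is what lets the second data set answer a question about the first.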

Now, one of the things we need to be careful about when we start linking up data is this potential for misinterpretation.

When we think about the concept of linking, some scientists have actually argued against this, because their data may be collected under very specific conditions, and you don't want to start combining that with data that wasn't meant to be combined with it in the first place.

We have all, no doubt I'm sure, seen some wonderful examples of just how wrong data can go when it's taken out of its proper context.

So you start seeing things like this, and one of my favorite sites-- cheap joke, sorry-- yeah, one of my favorite sites is something called Spurious Correlations, which teaches us that correlation and causation may not be exactly the same thing after all.

Who knew? So Nicolas Cage is a very, very dangerous man for a few reasons.

Just one more.

Ate a lot of cheese last night, could have died, didn't know.

Anyway, interoperability can be really important, but this is also where licensing gets really tricky as well.

Like I said, licensing, boring.

Not going to cover it too much, but one of the things I really do want to stress is you just need to think about this a little bit when you do start using data with different licenses.

So you may have used some data in your projects which requires attribution.

You might have done that because you're a good person, so you've attributed it correctly, but you might then want to build on top of it.

Then you might want to release your new project as being something in public domain, completely open and free.

But that, obviously, is going to contradict the terms of the attribution that the other license might have placed on you.

So if you think about how this can start to work when you've got lots of different data sets, all with different requirements, it can get really messy sometimes.
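One rough way to picture the problem is to treat each source's license as a set of obligations that your project inherits. The sketch below does that in Python. The data set names and the obligation labels are hypothetical, and real licenses carry far more nuance than a set of flags, so treat this as a thinking aid and not as legal guidance.

```python
# Simplified, hypothetical model: data set names and obligation labels are invented.
SOURCES = {
    "bus-stops": {"attribution"},                     # attribution-style license
    "air-quality": set(),                             # public domain, no obligations
    "street-names": {"attribution", "share-alike"},   # attribution plus share-alike
}


def obligations_for(source_names):
    """Union of the obligations inherited from every source that was used."""
    combined = set()
    for name in source_names:
        combined |= SOURCES[name]
    return combined


def conflicts(source_names, release_terms):
    """Obligations that the planned release would fail to honour."""
    return obligations_for(source_names) - release_terms


# Releasing as public domain means honouring no obligations at all.
public_domain = set()
problem = conflicts(["bus-stops", "air-quality"], public_domain)
print(problem)  # the attribution requirement is left unmet
```

With three or four sources the union grows quickly, which is where the messiness comes from.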

So there's a really useful tool available down here, which can help to guide you in these matters, and I'll put these online later if you want to grab any links.

But we've looked at people using open data, we've looked at how we can link it up, but this publishing aspect is really crucial too.

So additionally, I'd like to ask all of you here today if there's data that you could be publishing or whether it's something that you could perhaps think about with the projects that you work on, whether you could ask the question of other people as well.

So for me, one of the best things about the idea of open data is really the fact that people can take it and use it in ways that you've never dreamed of.

And this quote is one that's mentioned a lot, but I always think of it in conjunction with something called the Relevance Paradox.

This is basically where people are unaware of certain information that could help them, because they don't necessarily know that it exists.

It's one of the great arguments for people putting data out there openly.

So as we speak about the [INAUDIBLE], data itself isn't necessarily that useful.

So unlocking that value and the relevance is the really tricky part, and you might not be aware of the relevance that your data even has, but others may be.

The other good thing is, you might not have the resources or the skills able to unlock that yourself, but by putting it out there, you look really great and altruistic and actually everybody else does the work for you, so something to consider.

But to finish it off, I want to talk to you about what sharing your data openly might entail.

So we all remember [INAUDIBLE] data, really useful stuff.

Remember also the pointless data sets other people put out there, so the billboards in Jakarta.

Maybe once I stopped collecting this, maybe I should release it openly, because it might have some use to somebody else.

Who knows? So if I was going to do that, the first step is obviously to identify what should and shouldn't be published.

And this typically involves an assessment of the data you have, the quality of it, and also to plan the processes that are going to be involved in the data publication.

So many organizations, when they come to this point, when they're looking at what they have, they start with the data that they might be releasing already, but it may not necessarily be open.

So that might either be because of the restrictions that they've previously placed on it, or maybe because of the format that it's been in.

So things like PDFs are a great place to start, because it's reasonably simple to kind of take that, dig the data out, apply a license and put it out there.

But you do need to be careful if you do this because, as we've seen with the model API, sometimes you get conflicting terms-and-conditions kind of limitations, because typically-- I'm sure everybody has seen the terms and conditions that basically say, you can't do this, you can't steal this, you can't do anything with it, and if people want to put open data out there, they always forget about the terms and conditions on the bottom, so watch out for that.

The things that I would need to think about would be putting some clear licensing and usage information so that people know how they can use my data.

And I want to make sure the structure was good and the quality was good, so data should be cleaned and formatted in a logical way and presented ideally as machine readable.

You'll need to think about the support.

So data that's simply put out there which isn't maintained will likely not be very desirable.

And it's the gift that keeps giving, because last night I actually went to fix a bug that I've noticed in my MarvaLytics page, because I thought, oh, two people are bound to look at that and tell me that it's broken.

This is just something that I made about a year and a half ago when I was bored at my lunch breaks, but actually it was quite useful because I went there and realized that the API had an issue, and so I did the thing where you go into the community page, and you start to see reams and reams of people complaining about stuff, and now it's really sad.

It looks like they're kind of just letting it slip, and that's a really, really important point.

So if you are putting information out there, people are going to be building products on that, so you need to think about how you're going to support it and what happens if it does fall over, and is it going to be out there forever or are you going to communicate that to people? But anyway, on the structure and quality front, data may also not be accurate, it might not be complete.

So my data, I know, has some issues, namely because I either forget to track things, because I always meant to build in offline capabilities and just never got around to it, so I might have some timings being completely out, and I didn't factor in things like time zones.

So this wasn't something that I was thinking about when I first started collecting data, which is another important point.
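The time zone problem is one that is cheap to avoid if it's handled at collection time. As a minimal sketch, assuming each record stores an ISO 8601 timestamp that includes its UTC offset, the Python below normalizes everything to UTC before publishing and refuses timestamps with no offset. The field names and the two sample records are invented for illustration.

```python
from datetime import datetime, timezone


def to_utc(local_iso):
    """Parse an ISO 8601 timestamp that carries a UTC offset and convert it to UTC."""
    moment = datetime.fromisoformat(local_iso)
    if moment.tzinfo is None:
        # Without an offset there is no way to know which instant was meant.
        raise ValueError(f"no UTC offset recorded: {local_iso!r}")
    return moment.astimezone(timezone.utc).isoformat()


# Invented records, logged in two different local time zones.
records = [
    {"event": "a", "logged_at": "2015-10-09T10:30:00+02:00"},
    {"event": "b", "logged_at": "2015-10-09T09:15:00+01:00"},
]

# Publish a copy of each record with the timestamp expressed in UTC.
published = [{**r, "logged_at": to_utc(r["logged_at"])} for r in records]
for row in published:
    print(row)
```

Rejecting offset-less timestamps is a design choice here; guessing a zone after the fact is exactly how timings end up being completely out.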

But there's some [INAUDIBLE] of you as well that if data is kind of out there in the open and especially if it's able to be fed into by people, then reliability is going to be at stake.

But you need to balance that against the fact that if it's a closed silo entirely, then that opportunity isn't there.

And actually by opening it up, you might be able to grab data that you would never have possibly gotten before.

So once the data has been identified, some extraction and compilation might be needed, and this particularly happens if personal details are held.

And the data might actually need to be cleaned up as well.

And so this point about personal data is a really, really important one, because much of the data gathered online or held by companies is actually around individuals, and this especially has become the case in the last 5, 10 years.

When you start to dig into data about individuals, you get into anonymization issues, and anonymization or true anonymization is notoriously really difficult to do safely.

And as such, this usually means that open data that does relate to individuals is typically kind of aggregated statistics, so you might have like the results of the census rather than Bruce's favorite animal and how many cups of tea he's drunk today.

But the exception here is when data is made available by the relevant individuals themselves or where there's explicit consent of the affected individuals, or where there's actually a mandatory reason.
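As a small illustration of publishing aggregates rather than individual rows, the Python sketch below counts records per category and folds any group below a minimum size into an "other" bucket. The threshold of 5 and the sample rows are assumptions made up for the example, and small-count suppression on its own does not guarantee anonymity.

```python
from collections import Counter


def aggregate(rows, key, min_count=5):
    """Count rows per value of `key`, folding small groups into 'other'.

    min_count is an assumed threshold, not a recognised standard. The
    'other' bucket can itself end up small, so this is not a guarantee.
    """
    counts = Counter(row[key] for row in rows)
    published = {}
    for value, count in counts.items():
        label = value if count >= min_count else "other"
        published[label] = published.get(label, 0) + count
    return published


# Invented individual-level rows; the names are never part of the output.
rows = (
    [{"name": f"person-{i}", "favourite_animal": "cat"} for i in range(7)]
    + [{"name": f"person-{i}", "favourite_animal": "dog"} for i in range(7, 13)]
    + [{"name": "person-13", "favourite_animal": "axolotl"}]
)

summary = aggregate(rows, "favourite_animal")
print(summary)
```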

An interesting example here is in Scandinavia.

Everybody's tax return data is openly available, and this includes everybody from the government to celebrities to your neighbor next door, and if you think about it, this kind of openness can actually do some real good.

So it can help to alleviate pay gaps between the genders, or it lets you know that your colleague is on a similar wage to you and that nobody's being screwed over, or it just gives you enough data to kind of know your worth if you're doing research generally into what might be available in different careers.

But I know that in the UK, this is something that we would not be comfortable with at all.

It's also something that Scandinavia dialed back a bit, so they've done more to protect the individual's rights.

So you need to be very careful with this balance.

When you've actually sort of done your extraction, maybe done some kind of compilation aggregation of it, you might need to do some cleaning as well.

There are lots of tools that are available to help you clean up data.

So OpenRefine is one that's really widely used.

It can clean, it can transform your data, and you can also use it to extend it with web services should you want to do that.

But as a tool, it's really, really useful and I'll actually recommend it if you're working with data in any way, not just thinking about releasing it openly.

And you've got other guidance as well, so there's a resource called Clean Sheet, which is useful for giving you guidance around how you might want to put your data out there in terms of spreadsheets.
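To give a flavour of what cleaning means in practice, here is a minimal Python sketch of three common steps: trimming stray whitespace, normalizing case, and dropping duplicates. This is not OpenRefine's API, just plain Python standing in for the idea, and the sample rows are invented.

```python
def clean(rows):
    """Tidy each cell and drop duplicate rows, keeping first-seen order.

    Each cell has its whitespace collapsed and is title-cased. Note that
    str.title() is crude (it mishandles apostrophes, for instance), so a
    real clean-up would need rules suited to the data.
    """
    seen = set()
    cleaned = []
    for row in rows:
        tidy = tuple(" ".join(cell.split()).title() for cell in row)
        if tidy not in seen:
            seen.add(tidy)
            cleaned.append(tidy)
    return cleaned


# Invented rows: the first two describe the same place in different spellings.
raw = [
    ("  london ", "Broad  Street"),
    ("London", "broad street"),
    ("LONDON", "Baker Street"),
]

tidy_rows = clean(raw)
print(tidy_rows)
```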

Now finally, you want to share it, so you want to put it out there, and this can take many forms.

So you've got putting the data out there, and then you've actually got helping people to find it, because that's a really important thing, right? So in terms of actually putting it out there, there are a couple of different schemes to give you some guidance again.

The first one I want to talk about is 5-Star Data, again, from Sir Tim Berners-Lee.

Honestly, the guy just doesn't stop doing stuff.

And these are some levels, so this is the 5-Star guidance.

So this goes from basically having your data available on the web in whatever format but putting it out there under an open license, through to starting to make it better.

So make it structured, making sure it's not proprietary, and then starting to use URIs in order to actually give it a place to live and to allow people to link through to your data sets as well.
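The levels build on each other, which is easy to see if you write them down as a checklist. The Python sketch below is one way to do that. The field names are my own labels for the five levels and the two sample data sets are hypothetical, so read it as a summary of the scheme and not as an official scoring tool.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    # Field names are informal labels for the five levels, not official terms.
    on_web_open_license: bool   # 1 star: on the web, any format, open license
    machine_readable: bool      # 2 stars: structured data rather than a scan
    open_format: bool           # 3 stars: non-proprietary format such as CSV
    uses_uris: bool             # 4 stars: URIs identify things so others can point at them
    links_to_other_data: bool   # 5 stars: links out to other people's data


def stars(dataset):
    """Count consecutive levels met from the first; a gap stops the count."""
    levels = [
        dataset.on_web_open_license,
        dataset.machine_readable,
        dataset.open_format,
        dataset.uses_uris,
        dataset.links_to_other_data,
    ]
    score = 0
    for met in levels:
        if not met:
            break
        score += 1
    return score


# Hypothetical examples: a PDF under an open license, and a CSV with URIs.
pdf_report = Dataset(True, False, False, False, False)
csv_with_uris = Dataset(True, True, True, True, False)
print(stars(pdf_report), stars(csv_with_uris))
```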

So you might actually want to host this directly, as we've talked about, you might want to create an API, you might want to put it on a torrent site, you might want to house it on one of the portal sites that were in the repo that I showed you.

But one of the things that's really important is that if you are taking data, if you're building on it and you want to re-release it, think about the structure early on.

So don't make people have to scrape your data out of your beautiful visualization that you've built.

Give them a way to access it directly.
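Direct access can be as simple as a downloadable file next to the visualization. As a minimal sketch, the Python below writes the rows behind a chart to CSV and produces a small metadata record stating the license and source alongside it. The metadata keys, the `example.org` URL, and the sample rows are all assumptions for illustration; the license string is just an identifier passed in by the caller.

```python
import csv
import io
import json


def export(rows, fieldnames, license_id, source_url):
    """Return (csv_text, metadata_json) for the rows behind a visualization."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    # Hypothetical metadata layout: these keys are not from any standard.
    metadata = json.dumps(
        {
            "license": license_id,
            "source": source_url,
            "columns": fieldnames,
            "row_count": len(rows),
        },
        indent=2,
    )
    return buffer.getvalue(), metadata


# Invented rows standing in for whatever the chart was drawn from.
chart_rows = [
    {"category": "a", "count": 3},
    {"category": "b", "count": 5},
]

csv_text, metadata = export(
    chart_rows, ["category", "count"], "CC-BY-4.0", "http://example.org/my-chart"
)
print(csv_text)
print(metadata)
```

In-memory strings are used here so the sketch runs anywhere; writing them to files or serving them from an API is the same idea.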

So aiming to give a less technical and more kind of well-rounded view of what a good data set is, you've got the open data certificates which are pioneered by, again, the Open Data Institute, and these also provide information around rights and licensing, around documentation, and guarantees about availability.

So this is what we talked about in terms of the support and the ongoing kind of information.

What this is, they basically act like a reference sheet, and it contains information that may be of use to re-users of your data.

So what we want to do here is we want to show that it's easy to find and use and share data.

So in summary, we now live in a society where data is so entrenched in everything that we do, and the information we can get from that is absolutely key.

So data now underpins much of how we actually live our lives as well as our projects.

So we collect it, we use it, and sometimes we might share it.

But data, as with documents, is a really fundamental element of the web itself.

So rather than data being something which is purely hidden below applications, we can really bring it to the forefront and make it more powerful and use it to gain benefits either indirectly or directly.

We can do this by following some steps like these.

So using data, publishing data, thinking about how our data fits into the bigger picture, whether it can be linked with other sets to enhance its context. By removing these unnecessary restrictions and promoting freedom and creativity of usage, and by embracing standards and open web technologies, with better, more freely available data, society can hopefully start making these better decisions, improving costs, being more creative, and sparking new innovations that would not otherwise have been possible, so some of the things that we've seen today.

By consuming or publishing open data, you too can help promote this.

It'll really help to make the data grow as well.

So thank you very much.

[APPLAUSE] That was epic, Sally.

Thank you very much.

We're not going to do Q&A now, because many of the questions are going to be equally applicable to the following talk, and maybe we'll combine it then.

There were so many different conversations going on during that talk, Sally.

Something I learned is that you should drink beer, because that prevents cholera, and that because cheese eating seems directly correlated with getting tangled in bed sheets, now I know why in the Netherlands I only ever get a quilt rather than sheets, so that's a useful thing to be aware of.

Fronteers 2015
