honestlyreal

Fare dealing

Remind me again: what’s the purpose of opening up all this public data?

Ah yes, that’s it. To create value. And you can’t get a much stronger example of real value in the real world than showing people how to save money when buying train tickets.

Fare pricing is a fairly hit-and-miss business, as you’ve probably noticed. There’s no straightforward relationship between distance and price. Far from it.

The many permutations of route, operator and ticket type throw up some strange results. We hear of first class tickets being cheaper than standard, returns cheaper than singles, and you can definitely get a lower overall price by buying your journey in parts, provided that the train stops at the place where the tickets join.

The rules here are a bit weird: although station staff have an obligation to quote the cheapest overall price for a particular route, they aren’t allowed to advertise “split-fare” deals, even where they know they exist. Huh?

Why this distinctly paternalistic approach? Well, say the operators: if a connection runs late, your second ticket might not be eligible, and there might be little details of the terms and conditions of component tickets that trip you up, and, and, and…well, it’s all just too complicated for you. Better you get a coherent through-price (and we pocket the higher fare, hem hem).

There’s no denying it is complicated. Precisely how to find the “split-fare” deal you need is a tiresome, labour-intensive process of examining every combination of route, terms and price, and stitching some sense out of it all. And, indeed, it means taking on a bit of risk if some of those connections don’t run to time.

You might be lucky, and have an assistant who will hack through fares tables and separate websites to do that for you. But you’d really be wasting their time (and your money).

Because that sort of task is exactly what technology is good at.

Taking vast arrays of semi-structured data and finding coherent answers. Quickly. And if there’s some risk involved, making that clear. We’re grown-ups. We can cope.
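
As a flavour of how mechanical the core of that task is, here’s a minimal sketch in Python. The stations, prices and single-split restriction are all invented for illustration; a real service would crunch the full fares feed, the ticket terms, and multi-way splits:

```python
# A minimal sketch of a split-fare search, with invented stations and prices.
# It assumes we already hold a fare lookup (origin, destination) -> price and
# the list of stations a through train calls at.

fares = {
    ("Exeter", "London"): 85.00,
    ("Exeter", "Taunton"): 12.50,
    ("Taunton", "London"): 45.00,
}

calling_points = ["Exeter", "Taunton", "Reading", "London"]

def cheapest_single_split(origin, dest):
    """Return (price, split_station) for the best through or one-split fare."""
    best = (fares.get((origin, dest), float("inf")), None)
    for mid in calling_points:
        if mid in (origin, dest):
            continue
        first = fares.get((origin, mid))
        second = fares.get((mid, dest))
        # A split is only valid because the train actually calls at `mid`.
        if first is not None and second is not None and first + second < best[0]:
            best = (first + second, mid)
    return best

print(cheapest_single_split("Exeter", "London"))  # (57.5, 'Taunton') vs the 85.00 through fare
```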

There’s no doubt at all that the raw materials–the fares for individual journey segments–are public information. Nobody would ever want, or try, to hide a fare for a specific route.

So when my esteemed colleague Jonathan Raper (doyen of opening up travel-related information and making it useful, through his work at Placr and elsewhere) put his mind to the question of how new services could crunch the underlying data to drive out better deals for passengers, I don’t doubt that some operators started to get very nervous indeed.

Jonathan got wind–after the November 2011 meeting of the Transport Sector Transparency Board–that a most intriguing piece of advice had been given by the Association of Train Operating Companies (ATOC) to the Department for Transport on the “impact of fare-splitting on rail ticket revenues”.

Well, you’d sort of expect an association which represents the interests of train operators to have a view on something that might be highly disruptive to their business models, wouldn’t you?

So what was that advice? He put in a Freedom of Information request to find out.

And has just had it refused, on grounds of commercial confidentiality.

This is pretty shocking–and will certainly be challenged, with good reason.

Perhaps more than most, I have some sympathy with issues of commercial reality in relation to operational data. We set up forms of “competition” between providers for contracts, and in order to make that real, it’s inevitable that some details–perhaps relating to detailed breakdowns of internal costs, or technical logistics data–might make a difference to subsequent market interest (and pricing strategy) were they all to be laid out on the table. I really do understand that.

But a fare is a fare. It’s a very public fact. It’s not hidden in any way. So what could ATOC have said to DfT that is so sensitive?

The excuse given by DfT that this advice itself is the sort of commercial detail that would prejudice future openness is, frankly, nonsense.

I look forward to the unmasking of this advice. And in due course to the freeing-up of detailed fares data.

And then to people like Jonathan and Money Saving Expert creating smart new business models that allow us to use information like it’s supposed to be used: to empower service users, to increase choice, and to deliver real, pound-notes value into the hands of real people.

That’s why we’re doing all this open data stuff, remember?

Just because you can…

An interesting piece appeared on the Guardian data blog on Friday. It describes a wealth of new data being released relating to court and conviction information.

The database shows sentencing in 322 magistrates and crown courts in England and Wales. Defendants’ names are excluded but details such as age, ethnicity, type of offence and sentence are not. Any computer user can analyse aspects such as how many white people were sent to jail for driving offences.

All good stuff. There’s definitely value to be gained from this type of analysis. It’s being released as a database (hopefully with a commitment to regular ongoing publication), and it brings consistency to often haphazard arrangements for making data available. These are positive moves, and should be welcomed.

But…

Transparency campaigner William Perrin, who advises the Ministry of Justice on opening up its data, says the release is a big step: “Publishing the details of each sentence handed down in each court is a great leap forward for transparency in the UK, for which MoJ should be warmly praised. Courts have to be accountable to the local populations they serve.” But he, like some campaigners, believes the MoJ should go further, releasing the names of defendants. “The data published is anonymised, flying in the face of hundreds of years of tradition of open courts and public justice.

“The MoJ need to have an open and public debate about the conflict between the central role in our society of open public courts where you can hear the name and details of offenders read out in public and crude misapplication of data protection.”

My concern lies with the consequences of releasing the names of individuals, as proposed here, in a completely accessible and reusable way.

William draws a parallel between the act of reading out names in public court and publishing them on the Internet. (Disclosure: William and I both sit on the Transparency Sector Panel in MoJ.)

Were it a simple parallel, with the same consequences, I’d be pretty comfortable with the principle of release, too. But I see one very big difference: raw content on the Internet is (almost always) indexed by search engines. And search engines have very, very long memories. The (only) two things that the Internet has fundamentally changed are the ease with which information can be found, and the duration and extent over which it persists–as I’ve banged on about on this blog before.

So, this proposal (if taken at face value) would lead to a couple of consequences which might not be wholly desirable: firstly, a name would quite feasibly, if entered into a search engine, throw up information about an offence and the consequent sentencing for an indefinite time. What implications does that have for rehabilitation of offenders? If your conviction has been spent, and your potential employer does a quick check and finds that the only thing you’ve ever been noted for on the Internet is… Well, would that feel just to you?

Ah, I hear you say–but look at court reporting now: those journalists that do manage to get intelligible information out of a clerk so they can write their pieces accurately end up with their content being indexed (paywalls permitting), and the Google ghosts will be there to do their haunting anyway. Yes. They will. But this is an issue of scale and ease, not principle. Journalists today, even those with perfect information, exercise some choice over what they choose to print. Maybe this is just because of space constraints, maybe there are other factors at play. But the “release everything for reuse” stance would dramatically increase this scale of publication.

You may say that this is a good thing: along similar lines as “nothing to hide, nothing to fear”, this extra hangover from a criminal’s downfall may be a very positive thing for society. Another deterrent to criminality, maybe? I don’t know about that, but I do know that we then face a reappraisal about what we mean by rehabilitation as a direct consequence of data release.

And, as William says, that needs proper public debate.

But it’s not just a matter of scale. We find, when public data is released en masse, that new business opportunities spring up. Imagine the entrepreneur who gathers all data on convictions and charges for their own employee check service. They might adhere to principles of time limitation on their data. They might not. They might mash-up this data set with other information. They might not. They might put profit before principle.

We attempt to control such reuse of information with regulation, but on the Internet, it gets very much harder to make this stick in practice. Again, we risk changing the landscape of what it means to be convicted, by releasing data like this.

I’m fascinated by how even something like the current Data Protection Act relates to the indexing of personal information within search engines. Surely, almost by definition, the end purpose of such indexing cannot be known, and therefore Principle 2 (Personal data shall be obtained only for one or more specified and lawful purposes, and shall not be further processed in any manner incompatible with that purpose or those purposes–source: ICO) must surely be creaking already?

So, I’m not so keen on making it indexable. Can this be avoided? Is there a middle ground which acknowledges the shambles that is the current practice in courts–with some prepared to supply information in machine-readable format, others insisting on hand-written notes being passed, and some seemingly actively obstructive in providing information?

I think there might be. There are some “government” datasets which, although they could be released for reuse, aren’t. For fairly good reasons. The database of car registrations, for example. I suspect we’d consider it a bad thing if a road rage incident could be easily followed up with some bricks through windows, on the basis of typing in the offending registration plate when you got home.

Similarly, we have a curious set of “frictions” in place to allow us to have an electoral roll which is at the same time both “publicly viewable” (provided you go to a library) and searchable online only if you pay up a good chunk of cash. A big hmmm from me to that latter part, by the way, but you can read much more on electoral roll issues here.

And the way that this data is structured is also important: so that we can’t, for example, easily go online, type in an address down the road, get a full list of occupants’ names and pop round there with all sorts of social engineering stories designed to make trouble/extract money/dig for further info/groom/be very creepy. Again, I’d suggest we do this for good reasons, and we know how to build machinery to keep this equilibrium in our society.

We may solve the problem through choosing carefully the format for release, the means by which it’s referenced, and even to whom it’s released. Yes, I know, those wretched privileged accessors again (just like the Police, DVLA, local authorities, credit agencies etc etc etc.) Always a subject to warm the temperature in open data discussions!
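
One concrete example of controlling “the means by which it’s referenced”: data can be published openly yet carry a standard signal asking search engines not to index it. A minimal sketch, assuming Python’s built-in http.server and the widely recognised X-Robots-Tag header (the record served here is invented):

```python
# A minimal sketch: serve public data while asking crawlers not to index it.
# Uses only the standard library; the record returned is invented.
from http.server import BaseHTTPRequestHandler, HTTPServer

class NoIndexHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        # Well-behaved search engines honour this and keep the page out of results.
        self.send_header("X-Robots-Tag", "noindex, nofollow")
        self.end_headers()
        self.wfile.write(b'{"court": "Example Magistrates Court", "outcome": "fine"}')

if __name__ == "__main__":
    HTTPServer(("", 8000), NoIndexHandler).serve_forever()
```

It’s friction rather than protection, of course: anyone can still fetch and republish the data, which is rather the point.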

But I’m not arguing for wilful obfuscation of this data, merely putting forward some of the alternative perspectives to “everything, raw, now”. We do need this public debate, and we need to be reasonably confident that we’re getting a net societal benefit from whatever action we take.

Let’s tread carefully here–just because you can, doesn’t always mean you should.

[I’d be commenting on the Guardian article if I could, but it doesn’t seem to have comments open, so I’ve written this in response.]

Neither one thing nor the other

In which I look more closely at one particular, well-known data set: what makes it what it is, and what we might draw from the way it’s managed to help us with some other challenging questions about privacy and transparency.

Surely data is open, or it isn’t?

(I’m using “open” here as shorthand for the ability to be reached and reused, not with any particular commercial or licensing gloss. It’s a loaded term. But let’s not snag on it at the beginning, hey?)

Data is either out there, on the internet, without encryption or paywall, or it isn’t. And if it is, then that’s that. Anyone can reach it, rearrange it or republish it, restrained or hampered only by such man-made contrivances as copyright and data protection laws.

Maybe. Maybe not.

I’ve been involved in some interesting discussions recently about the tricky issues surrounding the publication of personal data. By that, I mean data which identifies individuals. To be specific: some of the information in the criminal justice sector about court hearings, convictions and the like.

You’ll have seen much in the press, especially following the riots, about a renewed political and societal interest in this type of publication.

Without making this post all about the detailed nuances of those questions, this broader issue about the implications of “open” publication seems to me to need a bit more exploration before we can sensibly make judgements about such cases.

And to do that I took a close look at one very well-known data set: the electoral register.

What is it? Well, it’s a register of those who’ve expressed their entitlement, being over 18 (or about to be) and otherwise eligible, to vote in local and national elections, through returning a form sent to them by their council each year. If you’re reading this, you’re probably on it. I am.

It’s therefore not: a complete list of people in the UK (or even of those entitled to vote); a citizenship register; a census; a single, master database of everyone; accurate; or a distillation of lots of big government systems holding personal information.

What’s it for? An interesting question. I suppose its primary purpose is to support the validation of those entitled to vote, at and around election time. But you’ll know, if you have voted, that it’s more of an afterthought to the actual process; most people show up with polling cards in hand, and anyway, there’d be no possibility of any real form of authentication, as the register doesn’t contain signatures, photos, privileged information or any other usable method of assurance. It’s not even concealed from view. (More on that here.)

But it does some other things, doesn’t it? It provides a means for political candidates to be able to make contact for canvassing purposes with their electorate. And I suppose, for that reason, it has this interesting status as a “public document”. Which we’ll come back to in a moment.

And to complete the picture, a subset of it (the “edited register”) is also sold to commercial organisations for marketing purposes, enabling them, amongst other things, to compile pretty comprehensive databases of people.

…and as a byproduct of that it also forms an important part of credit-checking processes–with said commercial organisations able to offer services, at a price, to anyone who wants to run a check that at least someone claiming to have X name has at some point claimed to live at Y address. (Remember, it’s all pretty weak information really, self-asserted with no comprehensive checking process.) You can opt out of the edited register if you choose, but you’re included by default.

[Update 2 Oct: Matthew, below, comments that I’m not quite right here–the full register is also available to be used for credit checking]

There’s probably more, but let’s get stuck into some of this.

First off, I will happily add that the whole business of why it needs to be public at all seems highly questionable. Nor do I remember the public debate in which we all decided it was a great idea to try to make a few quid off the back of this potentially highly sensitive data. Do you? How do you feel about that?

And the idea that the process of democracy would be terminally hampered were candidates, agents and parties not able to make checklists of who’d been canvassed? Really? Couldn’t they perhaps just knock on doors anyway? As a potential representative would I only be willing to learn from encountering those who had a vote? I suggest not.

So, moving on past those knotty questions about “why do we have it, and why do we sell it?”, we have in practice established some conventions about managing it as “a public document”.

Can I, as a member of the public, request a copy be sent to me? Certainly not. Ok, perhaps I can download it then? Nope. Search it online? Hell no.

I can go and see it in my local library.

So I did.

I heartily recommend you do the same. It is a real eye-opener in terms of the idea of data being “semi-public”.

I trotted up to the (soon-to-be-closed [boo hiss]) information desk at the library under Westminster City Hall.

–Can I see the electoral register please?

–Sure. We only have the edited version here: if you want the whole thing, you have to go through there and ask for Electoral Services.

(He pointed at a forbidding and not-at-all-public-looking door).

–You’re ok, I’ll just have a look at this one

And out from the back window-ledge comes a battered green lever-arch file, containing bundles of papers.

–You know how to use this? he says

I shake my head. It seems the top bundle of papers is a street index. The personal information (names grouped by cohabitation, basically) is listed by street, then house name/number within street. Not by names.

So, you can’t, easily, find someone you’re stalking. (Did I say that? I mean, “whose democratic participative standing you have a legitimate interest in establishing.”)

But you can if you’re patient. Or if their name, like that of one Mr Portillo, leaps off the page at you. I intentionally chose the register of the area immediately around the Houses of Parliament, for just this reason. Curiously, I couldn’t actually find the HoP itself listed, but Buckingham Palace does have over 50 registered voters (none of whom are called Windsor.)

But back to the process: as I picked up the box to head towards an empty desk a finger came down on the lid: –you have to read it here, he says.

I look at the lid. Wow.

I ask the question about photocopying anyway, just to judge the reaction. Kitten-killer, his eyes say.

But I take it a few paces away anyway and have a closer look.

Fascinating. I see a bunch of well-known people from industry and politics, their home addresses, and who else lives with them.

I’m sure I’d go grey in chokey if I actually published unredacted screenshots in this post, but I’m pretty sure this one will be ok; if nothing else I think its historical interest justifies it… (RIP, Brian.)

Now, in all the fuss we make about child benefit claimant data being mislaid via CD, and in all the howling we make about anonymisation of health records and other sensitive data, and through all the fog that surrounds the commercialisation of public information, the Public Data Corporation and the rest, isn’t this the sort of information that we would normally expect to be the subject of an enormous public debate about even its very existence? And I’m walking off the street and making notes of it, and, and…

And I can see what’s happening here.

Yes, it’s “public”. Sort of. But so much friction has been thrown in the way of the process–from the shirty look as I have the temerity to request it, to the deliberate choice over structure that minimises me being able to quickly find my target–that I would strongly argue it to be “semi-public” rather than public.

There are some important lessons here perhaps when considering the mode, and the consequence, of publishing data online. Clearly, structure is highly relevant. If I am able to sort, and index it, that instantly creates a whole universe of permanent, additional consequences. Not all of which may be that desirable. “A perpetual, searchable, SEO-friendly database of all those ever summoned to court, convicted or not, you say? Certainly sir…coming right up.”
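
To make the point about structure concrete: once the same register is machine-readable, its deliberate street ordering can be inverted into exactly the name lookup the paper file frustrates, in a handful of lines. A minimal sketch, with invented records:

```python
# Invert street-ordered register entries into a name -> addresses index.
# The records are invented; the point is how little work the inversion takes.
from collections import defaultdict

register = [
    {"street": "Example Street", "number": "1", "names": ["A. Voter", "B. Voter"]},
    {"street": "Sample Road", "number": "42", "names": ["C. Resident"]},
]

by_name = defaultdict(list)
for entry in register:
    address = f'{entry["number"]} {entry["street"]}'
    for name in entry["names"]:
        by_name[name].append(address)

print(by_name["A. Voter"])  # ['1 Example Street'] -- the lookup the lever-arch file resists
```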

If I’m able to relate information–by association with others–I can also help the cause of those wishing to track someone or something down. Look at Facebook. It does a great job of finding people you search for, even those with very common names amongst its hundreds of millions of accounts, by this type of associative referencing. Powerful stuff.

And let’s not forget that ALL this information is pretty easily available online anyway. You just have to pay for it. The best-known provider that I’ve looked at, 192.com, has an interesting model. You’ll be giving them at least a tenner, and more like £30, to buy some credits to search their databases. And they have the ominous rider that their really sexy information (the historic registers) is only available at an entry-level price of £150 a year. For that reason, I haven’t actually given them a penny as yet. But it’s no obstacle to the serious stalker. I mean, researcher.

I’m sure there are all sorts of impediments, from download limits to penalties for misuse, that attempt to put further spokes in the wheel of it becoming a common commodity. But how long, really, before the whole register is available as a torrent on the Pirate Bay? Maybe it is already?

And we’re not bothered about this? It’s amazing, isn’t it? Yes, this whole industry is built on data that we’re required to submit to public authorities–and if we don’t, we’re disenfranchised.

This is a scandal, and one that urgently needs review.

But do take away the point that there is such a concept as “semi-public” – at least for now. It’s the ability to process, to restructure, to index, that makes online data different from those box files in the library.

The friction we throw into the system, whether it’s (intentionally?) releasing information via pdf, or slipping a local journalist a hand-written note of the names of those in court, is perhaps more than just dumb intransigence in the face of “information that wants to be free”. And it can serve some potentially legitimate social purposes.

Think how you’d feel if those frictions weren’t there around the electoral roll? Even the money that 192.com require for you to buy back the data you gave up in the first place?

Happy that every comment you made online under your own name, every mention in the press, could be traced back to your real address along with the names of your (18+) family? I think perhaps not.

So, a very big public debate is required on the consequences of any personal data being put online. But remember, stealthily or not, we’ve had experience of these issues for years. We just need to look on the library window-ledges to find it.

Inconvenience

I’ve written before about something that would really set a rocket under the opening up of data: the vigorous pursuit of the useful stuff.

When we’ve been given access to transport data, wonderful things have happened. When we get real-time feeds, useful services follow hot on their heels. Let’s make those infrastructural building blocks of services available for free, unfettered use: the maps, the postcodes, the electoral roll, your personal health records.

(Ok, I didn’t mean the latter two. Or did I? It gets complicated. Still writing that post…)

Here’s a vision:

Roll forward to a time when the first priority of any service owner within the public sector is not “how shall I display the accounting information about the costs of this service” (or indeed “how shall I obfuscate the accounting information..?”).

No. Instead, it is: WHERE is the service? WHEN is the service? WHAT is the service? HOW DO I USE the service? (And maybe even: WHAT DO PEOPLE THINK about the service?)

Those basic, factual jigsaw pieces that allow any service to be found, understood, described and interacted with. From a map of where things can be found, to always-up-to-date information about their condition, and a nice set of APIs with which others can build ways in.

The genius of this type of thinking being that many of the operational headaches of current service delivery simply fall away. They are no longer a concern for the service owner. “Our content management system can’t show the information quite like that.” “We haven’t got the staff to go building a mapping interface.” “We’re not quite sure how we’d slot all that into our website’s information architecture.”

Pouf. No more. Gone. The primary concern becomes: is the data that describes this service accurate (or accurate enough–with some canny thinking about how it might then be written to and corrected), and available (using a broad definition of availability which considers things like interoperability standards).

Well, Paul. Nice. But what a load of flowery language, you theoretical arm-waver. Can’t you give a more practical example?

Well, reader. Yes I can.

Loos.

That’s right. Public conveniences. A universal need. A universal presence. But where are they? When are they open? And what about their special features? Disabled access? Disabled parking? Baby-changing?

There’s actually a bit more to think about (once you start to think hard) than just location and description. But not a whole lot more. The wonderful Gail Knight has been banging this drum for a while, and has made some good progress, especially on things like the specification for data you’d need to have to make a useful loo finder service.

Why’s this really interesting? Really, really interesting? Because having got a good idea of the usefulness of the data [tick] and a description of what good data looks like [tick] we then find all the other little gems that stand between A Great Idea, and a Service That Ordinary People Can Easily Use.

Who collects the data? Where does it get put? Who updates it? Who’s responsible if it’s wrong? How do people know they can trust it? Can people make money from it? (I could go on…)

Bear in mind that any additional burden of work on a local authority (who have some duties around the provision of public loos) probably isn’t going to fly too high in the current climate of cuts. Bear in mind also that anyone else who does a whole load of work like this is probably going to want something in return. Bear in mind also that “having a sensible standard” and “having a standard that everyone agrees is sensible” are two different things. Oh, and I need hardly add that much of this data will not currently be held in nice, accessible, extractable formats. If, indeed, it exists at all.

Two characters usually step forward at this point.

The first is the Big Stick Wielder (“well, they should just make councils publish this stuff. Send them a strong letter from the PM saying that this is now mandatory. That’s the standard. Get on with it. It’s only dumping a file from a database to somewhere on the Internet, innit?”) BSW may get a bit vague after this about precisely where on the Internet, and may, after a bit of mumbling, start talking about a national database, or “a portal”, or how Atos could probably knock one up for under a million… (and it’s usually at this point that some clever flipchart jockey will say “Why just loos? Let’s make a generic, EVERYTHING-finder! Let’s stretch out that scope until we’ve got something really unwieldy and massive on our hands”.) We know how this song goes, don’t we?

The second is the Cuddly Crowd-Sourcer (“forget all that heavy top-down stuff, man. We have the tools. We have some data to start from. Let’s crack on and start building! Use a wiki. Get people involved. Make it all open and free.”) CCS’s turn to go a bit vague happens when pushed on things like: will this project ever move beyond a proof-of-concept? how do we get critical mass? does it need any marketing? can people charge for apps that reuse the data and add value to it? how do we choose the right tools?

Both have some good points, of course. And some shakier ones. That’s why this is a debate. If it were clear-cut, we’d have sorted it by now, and all be looking at apps that find useful stuff for us. And it isn’t just a matter of WDTJ (Why don’t they just…?).

My suggestion? CCS is nearer the mark. Create a data collection tool which can take in and build on what already exists. Use Open Street Map as the destination for gathered data. Do get on with it.

Matthew Somerville’s excellent work to get an accurate data set of postbox locations and the Blue Plaque finder are obvious examples to draw inspiration from. Once in OSM, data can be got out again should the need arise. There will be a few wrinkles around the edges as app developers seek to make a return on what they build using the data. There may well be a case for publicly-funded development on top of the open data. But get the data there first. Make it a priority.
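
For a sense of what putting the data into OSM buys, here’s a minimal sketch that pulls back toilets already mapped near a point. It assumes the public Overpass API endpoint and the common OSM tags (amenity=toilets, wheelchair, changing_table, opening_hours); the coordinates are just an example:

```python
# Query OpenStreetMap (via the public Overpass API) for mapped public toilets
# within 1km of a point. Endpoint and tag names follow common OSM practice;
# the coordinates are illustrative.
import json
import urllib.parse
import urllib.request

query = """
[out:json];
node["amenity"="toilets"](around:1000,51.501,-0.125);
out;
"""

url = "https://overpass-api.de/api/interpreter"
payload = urllib.parse.urlencode({"data": query}).encode()
with urllib.request.urlopen(url, data=payload) as resp:
    elements = json.load(resp)["elements"]

for node in elements:
    tags = node.get("tags", {})
    print(node["lat"], node["lon"],
          tags.get("wheelchair", "?"),
          tags.get("changing_table", "?"),
          tags.get("opening_hours", "?"))
```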

Because if, after years of trying to make real-world, practical, open, useful services based on data, we continue as we are, with a pitiful selection of half-baked novelties and demonstrators of “what useful might look like, at some point in the indeterminate future”, then we’re badly letting ourselves down.

Basically, what I’m saying is: if we can’t get this right for something as well-defined and basic as loos, a lot of what we dream of in our hack-days and on our blogs about the potential of data will just go down the pan.

————

UPDATE:

OK, so it seems it already exists. Or at least a London version of it anyway. Don’t you love it when that happens? Would be good to see how it progresses, and what its business model looks like. I like the way that data descriptions have been used e.g. “Pseudo-public” for that class of loos which aren’t formally public conveniences, but can easily be accessed and used – e.g. those in libraries, and cooperative shops. The crowd-update function looks good too.

In a way, this also shows up another headache that arises when spontaneous services start to appear: there is only one set of loos in the real-world. But each representation of them in an app or online service must go through the same process of ensuring accuracy and extent of coverage. Distributed information is always tricky to manage. Should we hope that several competing services make it into production, with the market determining which succeeds? Will that be the one with the best data? Or is there scope for an underpinning data service that feeds them all? (But then we court the central, mega-project problems again…)

Answers on a postcard, please.

Data.gov.uk one year on

A year, almost to the day, from the launch of data.gov.uk it seems clearer that it was really trying to fire at three targets simultaneously: transparency, usefulness and good old commercial value. Three targets that have some overlap, but also some inherent tensions. How well has it done?

On transparency, we heard much along the lines of “sunlight being the best disinfectant” and that the very act of openly publishing information, particularly on accounting and spending, would do much to reduce wrong-doing and rebuild trust. It might not matter so much if the information wasn’t actually read that regularly or in detail; what mattered most was that it was published. We were told that tools would emerge to make general understanding easier, that amateur auditors would audit from their armchairs and indeed there has been some progress in this area. But there hasn’t been a dramatic unveiling of hitherto concealed horrors, just some visualisations and a tendency to focus on quirky details that make interesting stories—with no substantive follow-up.

On the subject of usefulness, things have gone less well. We haven’t seen much in the way of new apps and services driven by data.gov.uk data which actually deliver value to people in their day-to-day lives. Political pressure has been focused on driving out more of the spending data, perhaps at the expense of data that may be practically useful. We can speculate about the political factors at work here: gleeful exposure of the excesses of the last government and the current tensions between central and local government on spending priorities both spring to mind. But it does mean that the genuinely “useful”—the data that describes things in real people’s lives: maps, postcodes, contact information, opening hours, forthcoming events—and the real-time stuff, such as live running transport information, are falling behind. And that’s where the really useful apps and services are going to come from. Certainly, recent moves such as the release of Ordnance Survey maps under reusable licence are steps in the right direction, but much more political will is needed here to level things up.

And on the last target—the billions of commercial value that were touted as being locked up in government data—things don’t seem to be going too well at all. Some of this value was no doubt to be derived from the opening up of key enabling datasets—such as maps and postcodes—allowing new business opportunities to really take off. But some of it would have to come from inherent value in the data itself, or released from the combining of datasets to produce new products: taking data and finding new markets for it. Quite where this is currently headed remains shrouded in vagueness, but a new Public Data Corporation is now proposed, which lists among its objectives the management of the conflict between revenues from the sale of data and the benefits of making it freely available. This doesn’t actually seem that unreasonable. If one considers data as a national asset, why would it not be sensible to secure appropriate commercial value from it as with any other asset? But the proposal has triggered questions and some criticism from open data campaigners that this wasn’t how it was supposed to be. The extent to which commitments to release data free of charge were actually made or implied is now coming under scrutiny.

So where do we go from here? In the light of what we’ve learned over the last year, I’d prescribe the following: a rebalancing of the data held within data.gov.uk in favour of the genuinely useful; swift clarification of what is to be made available free of charge and what is not; a more mature approach to engaging developers and entrepreneurs if we’re really to see apps and services flourish (it’s going to take more than just a few “hack days”); and some exploration of how to demonstrate the value returned from what government spends. This last point should be of concern: at the launch last November of central government spending data, I reminded Francis Maude and the Transparency Board of Wilde’s description of those who knew the price of everything and the value of nothing…

The trust paradox

Although we think that “being open” will increase trust and transparency, the reverse is more likely.

I came to this paradoxical conclusion after reading an interesting piece on perverse economics [link; but summarised here to save you jumping around]: why the decreasing cost of something over time doesn’t mean that overall expenditure on it is reduced; instead usage goes up by a relatively larger rate—therefore so does overall expenditure.
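
A worked example, with made-up numbers, shows how the arithmetic runs:

```python
# Made-up numbers illustrating the effect: unit cost halves, usage more than
# doubles, and total expenditure ends up higher, not lower.
old_price, old_usage = 100.0, 10
new_price, new_usage = 50.0, 25

print(old_price * old_usage)  # 1000.0 -> spend before the price fall
print(new_price * new_usage)  # 1250.0 -> spend after: cheaper per unit, dearer overall
```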

It was first formally proposed by William Stanley Jevons in relation to coal consumption in the 19th century, and has been applied to lots of other resources including, in that linked piece, the cost of computers. Now I’m thinking about it in relation to the issues of trust in our public services and government.

We express a wish for our politicians to be more open—to share more about the detail of their lives, and not just at the lobbyist-lunching, shady-room-negotiating level. About them as people. We have social media and other channels now that make it faster and easier to do so. The boundary between their (and our) public and private lives gets fuzzy. We love this, when we see it serving our interests.

We have more direct access to our representatives. We can exchange a few words with a government minister via Facebook updates, or hear an opinion from the front bench even before the House does. We love that we can do this with our celebrities too, and we perhaps blur the categories at times. It’s all “public interest”, and the more open the better, hey?

And then things go wrong. With wholly predictable regularity. A public figure says something they shouldn’t. Perhaps something careless, a bit dumb, or misinformed, or, indeed, showing up actual malpractice in either a professional or personal capacity. The resources of an MP working a 100-hour week, commuting 200 miles, and managing a family and a private life are suddenly matched against sharp-eyed and keen-witted bloggers sitting at home with hours to spend forensically dissecting every statement, every inconsistency. And with no incentive to preserve any of those category boundaries, especially between professional and personal capacity. MPs are there to be kicked, particularly if they’re not of your favourite political colour.

You probably know the sort of thing I mean. The MP may not be whiter than white. But it was always our delusion that they ever would be. They are human. And they’ll get filleted in what amounts to asymmetric warfare. Openness goes up. Honesty and dishonesty are revealed. We amplify the dishonesty and ignore the rest. And trust goes down.

There are similar arguments at play with openness in relation to published data. Throwing everything over the wall creates the appearance of transparency. Surely it must increase our trust? But like a good astrologer we’ll expertly search for the material that confirms our thesis, and glide swiftly past the rest. And I’m not necessarily talking here about material that is genuinely in the public interest: the big fraud, the unambiguous cover-up—I’m talking about the trivial, the amusing, the petty contradictions that arise when serving many complex interests at the same time. The sieve that’s required to separate the two is a rare thing indeed.

Openness goes up. Trust goes down.

There are two ways this effect could be countered: by withdrawing openness (either outright or by stealth) or by drawing on the trusty old “sunlight=disinfectant” argument—that nobody will do anything stupid or wrong any more as they know they’ll be spotted. Good luck if you think the latter is more likely.

The speed camera and the Public Data Corporation

Think of a speed camera.

Think of the proposal for the Public Data Corporation.

One of them has attracted controversy. This seems to be based on instinct or ideology, without much groundwork being put in on the complex models and circumstances that surround it, and what it might mean as part of a bigger picture.

Its supporters see it as a way of bringing some order to a complex system; of ensuring that things actually do move more quickly by introducing an element of regulation. That it will actually bring some accountability and ensure things don’t run recklessly out of control.

Its detractors see it as a cynical front for raising cash for the government.

Oh, and the other one is a speed camera… :-)

What I’m saying, of course, is: we don’t really have much evidence as yet – perhaps it would be good to tease some out before taking a strong position either way?

A bit more about train information

If you were reading my outpourings a year ago you may remember a distinct preoccupation with train operating information. In the great range of public-facing datasets out there, the ones that offer the very highest utility, in my opinion, are those about real-time and real-world things: a picture of what’s happening right now and in the near future.

Transport information, weather, location, revised opening hours, where things are etc. etc. Sure, there may be treasures to dig out from the big dumps of auditable history in other datasets, but when it comes to actually building things people will find useful, there are some targets which are clearly more promising than others. (It’s probably no coincidence that data about timetables, postcodes, maps, operating information and the like are those which are also the most commercially tangled. Value breeds impediments, it would seem.)

I wrote about the problems of there being different versions of the truth about train operation. I wrote it at a time when ice and snow were crippling normal running. So, unsurprisingly, I’m back to revisit what’s happened since.

My idea a year ago – born of frustration with inaccurate data systems (one could get a different answer from the web, the train station office, the train platform sign, apps and feedback from other travellers via Twitter, for example) – was to rethink the way that trains are tracked and described in times of extreme disruption. I’m talking here about normal running disrupted to the point that existing timetables have become meaningless (and have been abandoned), where all trains are out of their normal positions, and the only meaningful data points that might relate to a particular physical train are its current physical location and its proposed calling points.

The notion of “Where’s my train?” was that if these basic data points were captured at the level of the train and made available as a feed, then in the event of utter chaos you would still be able to see the whereabouts of the next train going where you needed to go (even if you couldn’t do much about making it move with any predictability, or at all). Very much about the “where”, rather than the “when”.

This was a departure from information systems which relied on a forecast of train running (that abandoned timetable) or on a train having passed a particular point (for the monitoring of live running information). If trains had GPS tracking (which I heard they did) and the driver knew where the train would call (I was told this was generally the case) then a quantum of data existed which could drive such a feed.
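
By way of illustration only (this is not an existing feed format, and every field name below is invented), the per-train record needn’t be much more than this:

```python
# A hypothetical "Where's my train?" record: just position plus intended
# calling points, published as one entry in a feed. Field names are invented.
import json
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class TrainStatus:
    train_id: str                  # however the operator identifies the unit
    lat: float                     # last known GPS position
    lon: float
    observed_at: str               # ISO 8601 timestamp of that position
    calling_points: List[str] = field(default_factory=list)  # as keyed in by the driver

status = TrainStatus(
    train_id="2K47",
    lat=51.465,
    lon=-0.114,
    observed_at="2010-12-20T18:42:00Z",
    calling_points=["Herne Hill", "West Dulwich", "Sydenham Hill", "Penge East"],
)

print(json.dumps(asdict(status), indent=2))
```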

It didn’t get very far. I talked through the principles with operational staff in two train operating companies. One blockage was that such extreme disruptions were so rare that the usual contingency-planning problem arose: just not a common enough occurrence to warrant the development of a specific response. In addition, that certainty of where the train was going didn’t seem so certain after all. Drivers would punch in their intended stops, but right up to and even beyond the point of departure this could change. Better information, intended to give comfort to the stranded, might be replaced with false hope and ultimately do more harm than good. And the nagging thought I’d had originally remained: how much use was it really to know there was a train four miles up the track which was going to your stop, if it was quite possible that the points in between you and it had no chance of being unfrozen?

So, no more of that for the moment. There’ve been other developments. The live running information now seems to be much more accurate. The release of that and other information, such as timetables, is now a political football – should it remain a commercial asset of the train operators to be resold, and controlled, under licence, or are there greater benefits in releasing it to all who may make use of it?

I’m fairly sure that the data will be freely available eventually. There is some sterling work going on from innovators who are banging on the door of the information holders to make better use of it (all Malcolm’s recent posts make fascinating reading). To my mind there is a big difference between the commercial position of providing a physical service, and the commercialisation of the information that describes and records the performance of that service. But we won’t have reached the end of this story until that information is dependable. And at the moment, it’s not. You can still see differences between web information, feed information via an app, and trackside information. All were different on my line a day or so ago.

Yet the teasing paradox is that there is only one ‘truth’ at a particular point in time about a train’s running, even if it may then vary over time. And sometimes, as I experienced last Tuesday evening in South London, that truth is “we don’t know where this train is going”. In extreme circumstances, when lines are blocked and points jammed, I might have known where my train was (I was sitting in it) but I had as much idea as any of the crew (i.e. none) where it was going.

Distributing and presenting this information is far from being a trivial task. I don’t know the details of the architecture behind train information systems. But I can postulate that there are many different models by which highly volatile information, in thousands of different places, could be brought together, indexed, shared, distributed and so on. And they’re all pretty complicated. Before you blithely say “well, they just need to update a single database”, think what that might mean.

The track record (sorry) of mega-aggregation isn’t great. Without doubt it’s been attempted before in this area. Perhaps linked data mark-ups of distributed information sources hold the answer? I’d be interested in any thoughts on this.

But I’m clear about this:

the sorry position in late 2010 that it is still a matter of detective work and guesswork to find out where a large, highly-tracked piece of public-service equipment is, when it’s coming, and where it’s going, cannot be allowed to continue.

I’m fairly forgiving of physical failures in cases of extreme disruption – there is much real-world complexity that lies behind simple-sounding services. But on information failures? We can, and must, do better.

There’s data, and there’s data

I’m enjoying the latest flowerings of open data, and the recent quality posts from Ingrid Koehler and Steph Gray on what it all might mean. As well as quality action from Rewired State and others to actually demonstrate it in practice. (ooh, I just spotted that a reel of my photos is running on the Rewired State home page – thanks guys)

We’re getting a better understanding of what data actually is now that we’re seeing more of the things that were previously tucked away.

I’ll add my own observations: it helps me, at least, when thinking about complicated things to break them down a bit. My suggestion is to think in terms of four broad types:

1. Historical data

What’s happened in the past: how organisations and people have performed – what’s been said in meetings – what’s been spent – where the pollution has been – how children performed in tests…

2. Planning data

What’s projected to happen, or will shape what will happen: this and next year’s budget – legislation in progress – consultations – proposed housing developments – manifestos…

3. Infrastructural data

The building blocks of useful services. Boring stuff, doesn’t change that often, but when it does, it needs to be swiftly and accurately updated: postcodes – boundaries – base maps – contact directories – opening hours – organisation structures – “find my nearest…”

4. Operational data

The real-time stuff; what’s happening NOW: where’s my train/bus? – crime in progress – emergency information – school closures – traffic reports – happening in your area today…

These are not unrelated: what’s happened in the past will often guide what’s planned for the future. Today’s operational information becomes tomorrow’s history. And so on. There’s plenty of overlap. They’re intended as concepts, not hard definitions. The types can also be combined in every way conceivable: that’s part of the point of releasing the data in the first place.

I’m deliberately drawing no great distinction here between ‘information’ and ‘data’: the latter is a structured, interpretable incarnation of the former. That’s another set of issues in itself. I’ve also skipped over questions of interpretation and spin – this is a blog post, not a chapter of my book ;) And I’ve omitted “personal data” as a type – this is woven through all areas and carries with it its own baggage. I’m thinking more about the basics of function and purpose. Which lead on to usefulness. Which, as I’ve said before, is the test that all this is taking us in the right direction.

“Useful to whom” does of course vary by type: 1 and 2 are great for those holding public service to account (press, public, whoever). 2 is for those who will make change happen. 3 will benefit ordinary people in day-to-day life (and I’m careful here not to imply that these ordinary people ever have to see ‘data’ or an ‘e-service’ themselves: their local paper, toddler group, or community centre noticeboard are all valid intermediaries here). 4 will do things for the e-enabled – the mobile generation, the data natives – as well as for places that can serve an offline public (screens in train stations, visuals at bus-stops).

As a practical suggestion, I would love to see some of the current initiatives to build repositories and access to data recognising these distinctions exist. A little more signposting about the type of data that’s being released may help to highlight which types are being overlooked. For as we know, opening up the narrative helps to drive the change itself.
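
As a sketch of what that signposting might look like in practice (the catalogue entries below are invented), it needn’t be more than one extra metadata field that can then be tallied:

```python
# Tag each catalogue entry with one of the four types, then tally them to see
# which types are being overlooked. Entries are invented for illustration.
from collections import Counter

catalogue = [
    {"title": "Departmental spend over £25k", "data_type": "historical"},
    {"title": "Proposed housing developments", "data_type": "planning"},
    {"title": "Postcode boundaries", "data_type": "infrastructural"},
    {"title": "Live bus departures", "data_type": "operational"},
    {"title": "School test results 2009", "data_type": "historical"},
]

print(Counter(entry["data_type"] for entry in catalogue))
# Counter({'historical': 2, 'planning': 1, 'infrastructural': 1, 'operational': 1})
```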

And how are we doing against these four types?

Pretty good on historical (it’s quite easy to dump old files online); weak on the future planning stuff (trickier, because if there’s no means of action accompanying the data, will publishing do anything other than frustrate?); getting there on infrastructural (though licensing, linking and standards offer the greatest challenges); struggling on operational (contractual, accuracy, standards).

That’s a one-line summary. What do you think? Where should we be putting more effort?