Greenland Street

I’ve just done a bad thing.

As a photographer, I’ve got a great deal of respect for the work of others. When copyright is held by someone else, and I don’t have a licence to take it or tinker with it, I don’t.

Well, I just did. (And remorse is fairly low on my list of feelings, to be honest.)

I picked up a nice brief today for a shoot in a week or two. As I often do these days, I put the address into Google Street View, just to get a general sense of where the building lies, and what sort of street vistas might be possible.

And there’s this guy. Right in front of the camera.

Shielding his face from the camera, with not one, but both hands.

I kept coming back to the image. Finding it a powerful visual metaphor for the evasion of surveillance; of a small, bowed figure at the front of the frame, seeking not to be identified.

Did he know about the face-blurring they use? Did he trust it? Did he care?

(Yes, I think he cared.)

So I did the bad thing, and scraped the image, un-watermarked it (in a symbolic echo of de/anonymisation?), gave it a little help with colour and tone and composed it as an image that told a story. As I like to do.

You can see it in its full glory by clicking on the preview below. You can download it and use it for stuff if you so choose.

(If I get into trouble, I’ll let you know. If you do, let me know.)

P.S. Thanks to Michael Smethurst for setting the image in the context of this fabulous story from Cory Doctorow, which then made me think more about its symbolism.

P.P.S. You can see the original image here (until it’s replaced by a fresh camera shot, of course).

P.P.P.S. Yes, I am fully aware that I’m quite happy to use Google Street View to help me in my work but also have little frissons about some of its other “features”. But thank you for thinking it.

Just because you can…

An interesting piece appeared on the Guardian data blog on Friday. It describes a wealth of new data being released relating to court and conviction information.

The database shows sentencing in 322 magistrates and crown courts in England and Wales. Defendants’ names are excluded but details such as age, ethnicity, type of offence and sentence are not. Any computer user can analyse aspects such as how many white people were sent to jail for driving offences.

All good stuff. There’s definitely value to be gained from this type of analysis. It’s being released as a database (hopefully with a commitment to regular ongoing publication), and it brings consistency to often haphazard arrangements for making data available. These are positive moves, and should be welcomed.


Transparency campaigner William Perrin, who advises the Ministry of Justice on opening up its data, says the release is a big step: “Publishing the details of each sentence handed down in each court is a great leap forward for transparency in the UK, for which MoJ should be warmly praised. Courts have to be accountable to the local populations they serve.” But he, like some campaigners, believes the MoJ should go further, releasing the names of defendants. “The data published is anonymised, flying in the face of hundreds of years of tradition of open courts and public justice.

“The MoJ need to have an open and public debate about the conflict between the central role in our society of open public courts where you can hear the name and details of offenders read out in public and crude misapplication of data protection.”

My concern lies with the consequences of releasing the names of individuals, as proposed here, in a completely accessible and reusable way.

William draws a parallel between the act of reading out names in public court and publishing them on the Internet. (Disclosure: William and I both sit on the Transparency Sector Panel in MoJ.)

Were it a simple parallel, with the same consequences, I’d be pretty comfortable with the principle of release, too. But I see one very big difference: raw content on the Internet is (almost always) indexed by search engines. And search engines have very, very long memories. The (only) two things that the Internet has fundamentally changed are the ease with which information can be found, and the duration and extent over which it persists–as I’ve banged on about on this blog before.

So, this proposal (if taken at face value) would lead to a couple of consequences which might not be wholly desirable: firstly, a name would quite feasibly, if entered into a search engine, throw up information about an offence and the consequent sentencing for an indefinite time. What implications does that have for rehabilitation of offenders? If your conviction has been spent, and your potential employer does a quick check and finds that the only thing you’ve ever been noted for on the Internet is… Well, would that feel just to you?

Ah, I hear you say–but look at court reporting now: those journalists that do manage to get intelligible information out of a clerk so they can write their pieces accurately end up with their content being indexed (paywalls permitting), and the Google ghosts will be there to do their haunting anyway. Yes. They will. But this is an issue of scale and ease, not principle. Journalists today, even those with perfect information, exercise some choice over what they choose to print. Maybe this is just because of space constraints, maybe there are other factors at play. But the “release everything for reuse” stance would dramatically increase this scale of publication.

You may say that this is a good thing: along similar lines to “nothing to hide, nothing to fear”, this extra hangover from a criminal’s downfall may be a very positive thing for society. Another deterrent to criminality, maybe? I don’t know about that, but I do know that we then face a reappraisal of what we mean by rehabilitation as a direct consequence of data release.

And, as William says, that needs proper public debate.

But it’s not just a matter of scale. We find, when public data is released en masse, that new business opportunities spring up. Imagine the entrepreneur who gathers all data on convictions and charges for their own employee check service. They might adhere to principles of time limitation on their data. They might not. They might mash-up this data set with other information. They might not. They might put profit before principle.

We attempt to control such reuse of information with regulation, but on the Internet, it gets very much harder to make this stick in practice. Again, we risk changing the landscape of what it means to be convicted, by releasing data like this.

I’m fascinated by how even something like the current Data Protection Act relates to the indexing of personal information within search engines. Surely, almost by definition, the end purpose of such indexing cannot be known, and therefore Principle 2 (Personal data shall be obtained only for one or more specified and lawful purposes, and shall not be further processed in any manner incompatible with that purpose or those purposes–source: ICO) must surely be creaking already?

So, I’m not so keen on making it indexable. Can this be avoided? Is there a middle ground which acknowledges the shambles that is the current practice in courts–with some prepared to supply information in machine-readable format, others insisting on hand-written notes being passed, and some seemingly actively obstructive in providing information?

I think there might be. There are some “government” datasets which, although they could be released for reuse, aren’t. For fairly good reasons. The database of car registrations, for example. I suspect we’d consider it a bad thing if a road rage incident could be easily followed up with some bricks through windows on the basis of typing in the offending registration plate when you got home.

Similarly, we have a curious set of “frictions” in place to allow us to have an electoral roll which is at the same time both “publicly viewable” (provided you go to a library) and searchable online only if you pay up a good chunk of cash. A big hmmm from me to that latter part, by the way, but you can read much more on electoral roll issues here.

And the way that this data is structured is also important: so that we can’t, for example, easily go online, type in an address down the road, get a full list of occupants’ names and pop round there with all sorts of social engineering stories designed to make trouble/extract money/dig for further info/groom/be very creepy. Again, I’d suggest we do this for good reasons, and we know how to build machinery to keep this equilibrium in our society.

We may solve the problem through choosing carefully the format for release, the means by which it’s referenced, and even to whom it’s released. Yes, I know, those wretched privileged accessors again (just like the Police, DVLA, local authorities, credit agencies etc etc etc.) Always a subject to warm the temperature in open data discussions!

But I’m not arguing for wilful obfuscation of this data, merely putting forward some of the alternative perspectives to “everything, raw, now”. We do need this public debate, and we need to be reasonably confident that we’re getting a net societal benefit from whatever action we take.

Let’s tread carefully here–just because you can, doesn’t always mean you should.

[I’d be commenting on the Guardian article if I could, but it doesn’t seem to have comments open, so I’ve written this in response.]

Neither one thing nor the other

In which I look more closely at one particular, well-known data set: what makes it what it is, and what we might draw from the way it’s managed to help us with some other challenging questions about privacy and transparency.

Surely data is open, or it isn’t?

(I’m using “open” here as shorthand for the ability to be reached and reused, not with any particular commercial or licensing gloss. It’s a loaded term. But let’s not snag on it at the beginning, hey?)

Data is either out there, on the internet, without encryption or paywall, or it isn’t. And if it is, then that’s that. Anyone can reach it, rearrange it or republish it, restrained or hampered only by such man-made contrivances as copyright and data protection laws.

Maybe. Maybe not.

I’ve been involved in some interesting discussions recently about the tricky issues surrounding the publication of personal data. By that, I mean data which identifies individuals. To be specific: some of the information in the criminal justice sector about court hearings, convictions and the like.

You’ll have seen much in the press, especially following the riots, about a renewed political and societal interest in this type of publication.

Without making this post all about the detailed nuances of those questions, this broader issue about the implications of “open” publication seems to me to need a bit more exploration before we can sensibly make judgements about such cases.

And to do that I took a close look at one very well-known data set: the electoral register.

What is it? Well, it’s a register of those who’ve expressed their entitlement, being over 18 (or about to be) and otherwise eligible, to vote in local and national elections, through returning a form sent to them by their council each year. If you’re reading this, you’re probably on it. I am.

It’s therefore not: a complete list of people in the UK (or even of those entitled to vote); a citizenship register; a census; a single, master database of everyone; accurate; or a distillation of lots of big government systems holding personal information.

What’s it for? An interesting question. I suppose its primary existence is to support the validation of those entitled to vote, at and around election time. But you’ll know, if you have voted, that it’s more of an afterthought to the actual process; most people show up with polling cards in hand, and anyway, there’d be no possibility of any real form of authentication, as the register doesn’t contain signatures, photos, privileged information or any other usable method of assurance. It’s not even concealed from view. (More on that here.)

But it does some other things, doesn’t it? It provides a means for political candidates to be able to make contact for canvassing purposes with their electorate. And I suppose, for that reason, it has this interesting status as a “public document”. Which we’ll come back to in a moment.

And to complete the picture, a subset of it (the “edited register”) is also sold to commercial organisations for marketing purposes, enabling them, amongst other things, to compile pretty comprehensive databases of people.

…and as a byproduct of that it also forms an important part of credit-checking processes–with said commercial organisations able to offer services, at a price, to anyone who wants to run a check that at least someone claiming to have X name has at some point claimed to live at Y address. (Remember, it’s all pretty weak information really, self-asserted with no comprehensive checking process.) You can opt out of the edited register if you choose, but you’re included by default.

[Update 2 Oct: Matthew, below, comments that I’m not quite right here–the full register is also available to be used for credit checking]

There’s probably more, but let’s get stuck into some of this.

First off, I will happily add that the whole business of why it needs to be public at all seems highly questionable. And I don’t remember the public debate where we all thought that it was a great idea to try and make a few quid off the back of this potentially highly-sensitive data? Do you? How do you feel about that?

And the idea that the process of democracy would be terminally hampered were candidates, agents and parties not able to make checklists of who’d been canvassed? Really? Couldn’t they perhaps just knock on doors anyway? As a potential representative would I only be willing to learn from encountering those who had a vote? I suggest not.

So, moving on past those knotty questions about “why do we have it, and why do we sell it?”, we have in practice established some conventions about managing it as “a public document”.

Can I, as a member of the public, request a copy be sent to me? Certainly not. Ok, perhaps I can download it then? Nope. Search it online? Hell no.

I can go and see it in my local library.

So I did.

I heartily recommend you do the same. It is a real eye-opener in terms of the idea of data being “semi-public”.

I trotted up to the (soon-to-be-closed [boo hiss]) information desk at the library under Westminster City Hall.

–Can I see the electoral register please?

–Sure. We only have the edited version here: if you want the whole thing, you have to go through there and ask for Electoral Services.

(He pointed at a forbidding and not-at-all-public-looking door).

–You’re ok, I’ll just have a look at this one

And out from the back window-ledge comes a battered green lever-arch file, containing bundles of papers.

–You know how to use this? he says

I shake my head. It seems the top bundle of papers is a street index. The personal information (names grouped by cohabitation, basically) is listed by street, then house name/number within street. Not by names.

So, you can’t, easily, find someone you’re stalking. (Did I say that? I mean, “whose democratic participative standing you have a legitimate interest in establishing.”)

But you can if you’re patient. Or if their name, like that of one Mr Portillo, leaps off the page at you. I intentionally chose the register of the area immediately around the Houses of Parliament, for just this reason. Curiously, I couldn’t actually find the HoP itself listed, but Buckingham Palace does have over 50 registered voters (none of whom are called Windsor.)

But back to the process: as I picked up the box to head towards an empty desk a finger came down on the lid: –you have to read it here, he says.

I look at the lid. Wow.

I ask the question about photocopying anyway, just to judge the reaction. Kitten-killer, his eyes say.

But I take it a few paces away anyway and have a closer look.

Fascinating. I see a bunch of well-known people from industry and politics, their home addresses, and who else lives with them.

I’m sure I’d go grey in chokey if I actually published unredacted screen shots in this post, but I’m pretty sure this one will be ok; if nothing else I think its historical interest justifies it… (RIP, Brian.)

Now, in all the fuss we make about child benefit claimant data being mislaid via CD, and in all the howling we make about anonymisation of health records and other sensitive data, and through all the fog that surrounds the commercialisation of public information and the Public Data Corporation etc., isn’t this the sort of information that we would normally expect to be the subject of an enormous public debate about even its very existence? And I’m walking off the street and making notes of it, and, and…

And I can see what’s happening here.

Yes, it’s “public”. Sort of. But so much friction has been thrown in the way of the process–from the shirty look as I have the temerity to request it, to the deliberate choice over structure that minimises me being able to quickly find my target–that I would strongly argue it to be “semi-public” rather than public.

There are some important lessons here perhaps when considering the mode, and the consequence, of publishing data online. Clearly, structure is highly relevant. If I am able to sort, and index it, that instantly creates a whole universe of permanent, additional consequences. Not all of which may be that desirable. “A perpetual, searchable, SEO-friendly database of all those ever summoned to court, convicted or not, you say? Certainly sir…coming right up.”
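To make that point about structure concrete, here is a minimal sketch in Python. The names and addresses are invented placeholders, and nothing here reflects how any real register is actually stored. It sets the street-ordered listing of the library file, where finding a person means reading entry after entry, against the one-off re-indexing step that turns the very same data into an instant lookup by name.

```python
from collections import defaultdict

# Hypothetical records, ordered by street and house number as in the paper file.
REGISTER = [
    ("Abbey Road", "1", "Alex Example"),
    ("Abbey Road", "1", "Sam Example"),
    ("Abbey Road", "2", "Jo Sample"),
    ("Bridge Street", "10", "Pat Placeholder"),
]


def find_by_scanning(register, name):
    """The library experience: read every entry until the name turns up."""
    checked = 0
    for street, number, occupant in register:
        checked += 1
        if occupant == name:
            return (street, number), checked
    return None, checked


def build_name_index(register):
    """The online experience: restructure once, then look anyone up directly."""
    index = defaultdict(list)
    for street, number, occupant in register:
        index[occupant].append((street, number))
    return dict(index)


address, entries_read = find_by_scanning(REGISTER, "Pat Placeholder")
name_index = build_name_index(REGISTER)
print(address, entries_read)
print(name_index["Pat Placeholder"])
```

With four entries the difference is trivial. With a whole borough in a lever-arch file, the scan is an afternoon's work and the index is a single keystroke, which is roughly the friction the library relies on.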

If I’m able to relate information–by association with others–I can also help the cause of those wishing to track someone or something down. Look at Facebook. It does a great job of finding people you search for, even those with very common names amongst its hundreds of millions of accounts, by this type of associative referencing. Powerful stuff.

And let’s not forget that ALL this information is pretty easily available online anyway. You just have to pay for it. The best-known provider that I’ve looked at, 192.com, has an interesting model. You’ll be giving them at least a tenner, and more like £30 to buy some credits to search their databases. And they have the ominous rider that their really sexy information–the historic registers–is only available at an entry-level price of £150 a year. For that reason, I haven’t actually given them a penny as yet. But it’s no obstacle to the serious stalker. I mean, researcher.

I’m sure there are all sorts of impediments, from download limits to penalties for misuse, that attempt to put further spokes in the wheel of it becoming a common commodity. But how long, really, before the whole register is available as a torrent on the Pirate Bay? Maybe it is already?

And we’re not bothered about this? It’s amazing, isn’t it? Yes, this whole industry is built on data that we’re required to submit to public authorities–and if we don’t, we’re disenfranchised.

This is a scandal, and one that urgently needs review.

But do take away the point that there is such a concept as “semi-public” – at least for now. It’s the ability to process, to restructure, to index, that makes online data different from those box files in the library.

The friction we throw into the system, whether it’s (intentionally?) releasing information via pdf, or slipping a local journalist a hand-written note of the names of those in court, is perhaps more than just dumb intransigence in the face of “information that wants to be free”. And it can serve some potentially legitimate social purposes.

Think how you’d feel if those frictions weren’t there around the electoral roll? Even the money that 192.com require for you to buy back the data you gave up in the first place?

Happy that every comment you made online under your own name, every mention in the press, could be traced back to your real address along with the names of your (18+) family? I think perhaps not.

So, a very big public debate is required on the consequences of any personal data being put online. But remember, stealthily or not, we’ve had experience of these issues for years. We just need to look on the library window-ledges to find it.

Getting personal

For a long time, I’ve shied away from writing here about personal data. Or even thinking that deeply about it. The nature of identity, yes. The usefulness of data, yes. Personal data, no. Why?

Not because it isn’t fascinating, or important. Mainly because it’s so…damn…nebulous. And difficult. Time to get over that, I think. Very significant things are happening in this area, and we all need to raise our game in how we understand and engage with the concepts involved.

As I’ve surmised before, the only things that are really different in the Internet age are the ease with which information can be found, and the ease with which it can be stored.

Two things, really. That’s all.

The first embraces everything around indexing, cross-referencing, labelling, structure and searching. The latter takes us into the territory of copying (and of course copyright), archiving, and the general issue of persistence.

And when we look at personal data in that context, there is an immediacy–and potential toxicity–in what emerges.

We saw early rumblings of this long before the Internet, of course, when computers were first used for the mass processing of information about people. Things could be done with databases that simply weren’t possible with big paper ledgers.

We created Data Protection legislation which attempted to put reins on the ability to make free use of some types of information. Gathering stuff about people, from the basic facts of who and where, to how to contact them, who they were connected to, and what their tastes and preferences were. Pure gold, used in the right (or wrong) ways.

Data Protection set out some pretty sound, but general, principles. The overarching one being that the purpose to which data could be put should always be made clear to whoever provides it, at the time of providing. Lots of other stuff about processing, storage, where and how long, and so forth–but that issue of consent always seemed the most important, to me.

And we scratched about a bit to actually try and define what we meant by “personal data”. Some things were easy. Names. Addresses and phone numbers. They’re just obvious.

But what about our tastes? Our buying history? The movements of our mobile phone from cell to cell? A journey we took? As one takes informational side-steps away from the individual, the obviousness diminishes, but if you can make meaningful connections back to the person…

…and remember the first thing that the Internet really changes?

Being able to make those tenuous links between blocks of information into something really substantive.

And the second thing? That information and those links are now permanent. You can’t delete them, once they’re there.

All those things that databases couldn’t previously do, because they all conformed to different standards, and weren’t connected together? They can now. Things can be done via the Internet that simply weren’t possible with just the databases.

Bit by bit, it’s been possible to build up the most humongous repositories about people. Maybe entirely within the law, maybe in other ways as well. Maybe with explicit and informed consent all the way down the line. And maybe not.

Who’s to know? We find strange things going on with data that we provide in order to use one or other service–or even to exercise our democratic rights. Didn’t it ever strike you as slightly weird that the electoral roll could be sold on for commercial purposes? (Much more on the electoral roll in another post coming soon. Update: now here)

We have big companies that have built successful businesses just like this: perhaps using aggregated personal information for credit referencing, perhaps to sell to marketeers to give them a better understanding of demographics.

The genie is very much out of the bottle. Your rights to see the information that a particular company holds on you may exist, but you have to have a fair idea of which company to ask in the first place. Can you ever see the full picture of what others know about you?

Of course not.

And it’s unreasonable to suggest that we’ll ever be able to do that. Instances of data multiply more rapidly than does our capability to track them. (There must be a Law of Internet Entropy out there that says something like that. If not, I just invented one.)

(As an aside, a dear friend once uttered the memorable line “somewhere out there, there’s a database with your dick size on it”. That was in 1989.)

So what can we do?

Realistically, all that’s available to us are firebreaks and friction.

We can’t get that genie back in the bottle, but we can slow it down a bit, and find ways to mitigate the impacts.

Do we need an updated definition of personal data? It’s MUCH harder than it seems at first glance to create one. The best I can find at the moment in terms of an “official” position is here.

And it’s clumsier than you think. Essentially, it’s a list of ever-widening filters that assess whether a particular piece of information can be connected to a specific individual. Culminating in the rather wonderful catch-all of the final category:

8. Does the data impact or have the potential to impact on an individual, whether in a personal, family, business or professional capacity?

Yes: The data is ‘personal data’ for the purposes of the DPA.
No: The data is unlikely to be ‘personal data’.

Even though the data is not usually processed by the data controller to provide information about an individual, if there is a reasonable chance that the data will be processed for that purpose, the data will be personal data.

That’s pretty general, no? In fact, going by that, an awful lot of things are now personal data. I really like the emphasis it puts on the outcome of the data use, not attempting to over-define things like form and structure.

I’d go as far as to say we should probably throw away that big long document, and just run with this definition:

Personal data is information that affects you when it’s used. Either directly, or through being linked to other information using technologies that exist now, or may exist in the future.

Broad enough? ;)

(So my beloved photos: they’re personal data. I take them with a camera that has a unique number, held in metadata in the picture file. That provides a way to link all the pictures it takes together, and then, through the various accounts I put them in online, back to me. Think how many other trails you leave…)
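A minimal sketch of that linking step, in Python. The file names, serial numbers and account names are all made up, and the metadata is represented as plain dictionaries rather than read out of real picture files, so treat it as an illustration of the idea and not of any particular tool. It simply groups accounts by the camera serial found in the pictures they have published.

```python
from collections import defaultdict

# Hypothetical metadata, as if already extracted from picture files found online.
PHOTOS = [
    {"file": "street.jpg", "camera_serial": "SN-0001", "account": "photo_site/alice"},
    {"file": "portrait.jpg", "camera_serial": "SN-0001", "account": "forum/anon42"},
    {"file": "holiday.jpg", "camera_serial": "SN-0002", "account": "photo_site/bob"},
]


def link_accounts_by_camera(photos):
    """Group the accounts that have published pictures from the same camera."""
    linked = defaultdict(set)
    for photo in photos:
        linked[photo["camera_serial"]].add(photo["account"])
    return {serial: sorted(accounts) for serial, accounts in linked.items()}


links = link_accounts_by_camera(PHOTOS)
print(links["SN-0001"])
```

In this invented data, one shared serial number is enough to tie a pseudonymous forum account to a named photo-sharing account, without either account ever mentioning the other.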

But again, all we really have are firebreaks, and friction. There’s a sort of reverse entropy at work. Unlike almost every other instance of entropy–where things get more chaotic over time (china plates get broken, they never put themselves back together again)–personal information is, relentlessly, only going to get more linked. More aggregated. More pervasive. More permanent.

(So, maybe I just invented The Law of Reverse Internet Entropy as well? Not bad going for one post…)

And if someone tells you that big blocks of personal data can be “anonymised”, be very sceptical indeed. (You can read some wise thoughts on the issues involved here and elsewhere on that blog.)

We can undertake some pretty noble fire-breaking: like ensuring the state doesn’t become the source of a global universal identifier for you. And we will certainly see more developments around multiple personas: compartments of your life associated with particular tasks, contexts, or connections. I think we’ll have to. (The concept of federated identity helps here, but that’s too much to go into for this post. Read more thoughts from the team working up these concepts for government.)

And we’ll adjust. Society has seen some pretty dramatic upheavals. Often associated with a new technology, or philosophy. If we adjust our societal norms faster than the upheaval, we don’t notice. If we’re slower to change, it’s painful. For a bit.

But we get through. We adapt. And we change. Always.

Indeterminately public

I did a thing that might have been very wrong yesterday. But I’m not sure.

So this is part confessional, part taking advantage of it as a vehicle for discussion. (And a fair bit of hand-wringing into the bargain.)

I recorded someone’s conversation without their knowledge or consent and I put it on the Internet, amplified via Twitter and Facebook.

I’ve had some great discussions in the past about where the boundary of public and private really lies: debating the nuances of shouting “FIRE!” in a crowded theatre versus an empty street versus a whisper in a pub. Thoughts around the “Twitter joke trial” about the point at which something could meaningfully be said to be “broadcast”, or to have a particular intent, or audience. The Press Complaints Commission ruling this week on the (non)privacy of tweets.

It’s a bloody minefield, that’s for sure. And much of the mine detection equipment hasn’t been built yet, and what has, hasn’t been well tested.

I am pretty scrupulous about respecting privacy, where I understand privacy to exist. As I understand it. Which isn’t all that straightforward either.

So a person going about their job, talking to a colleague about what they think of their company’s internal training (and more) shouldn’t expect to have their conversation recorded and “broadcast” on the Internet, should they? Under any circumstances?

Questions, questions, questions.

What does broadcast mean here, really? Does identifiability of the speaker make a difference? What about the nature of his job? Where he is as he’s speaking? How long he’s been going on like this? Whether his outpourings have clearly (to my ear) left the realm of colleague-to-colleague and taken on the guise of a rather bizarre form of public performance art? What about him swearing, loudly, wearing a company uniform and in the earshot of children and those visibly (to my eye) not appreciating it? Rubbishing his employer. Crude sexism. Do any of these make a difference?

And on my side of the fence: what’s my intent? To amuse people? To hold him to account or even get him fired? And what if that’s the outcome and it wasn’t my intention? Do I become liable, legally or morally?

The Internet has changed everything. The two dominant characteristics, as I see them–ease of access to information and permanence of record–are visibly in play here. I don’t know where the recording might end up. I do know it will be somewhere, for ever. With a very low threshold of effort required to find it.

Of such vapours are the clouds that we call public and private space formed. And I thought long and hard about them. And I weighed the answers I came up with as levelly as I could. And I published.

Let me know what you think. Right, or wrong? Or “complicated”?

Would make a good role play, mind you.

(I don’t do role play.)