Copyleft and data: databases as poor subject

tl;dr: Open licensing works when you strike a healthy balance between obligations and reuse. Data, and how it is used, is different from software in ways that change that balance, making reasonable compromises in software (like attribution) suddenly become insanely difficult barriers.

In my last post, I wrote about how database law is a poor platform to build a global public copyleft license on top of. Of course, whether you can have copyleft in data only matters if copyleft in data is a good idea. When we compare software (where copyleft has worked reasonably well) to databases, we’ll see that databases are different in ways that make even “minor” obligations like attribution much more onerous.

Card Puncher from the 1920 US Census.
Card Puncher from the 1920 US Census.

How works are combined

In software copyleft, the most common scenarios to evaluate are merging two large programs, or copying one small file into a much larger program. In this scenario, understanding how licenses work together is fairly straightforward: you have two licenses. If they can work together, great; if they can’t, then you don’t go forward, or, if it matters enough, you change the license on your own work to make it work.

In contrast, data is often combined in three ways that are significantly different than software:

  • Scale: Instead of a handful of projects, data is often combined from hundreds of sources, so doing a license conflicts analysis if any of those sources have conflicting obligations (like copyleft) is impractical. Peter Desmet did a great job of analyzing this in the context of an international bio-science dataset, which has 11,000+ data sources.
  • Boundaries: There are some cases where hundreds of pieces of software are combined (like operating systems and modern web services) but they have “natural” places to draw a boundary around the scope of the copyleft. Examples of this include the kernel-userspace boundary (useful when dealing with the GPL and Linux kernel), APIs (useful when dealing with the LGPL), or software-as-a-service (where no software is “distributed” in the classic sense at all). As a result, no one has to do much analysis of how those pieces fit together. In contrast, no natural “lines” have emerged around databases, so either you have copyleft that eats the entire combined dataset, or you have no copyleft. ODbL attempts to manage this with the concept of “independent” databases and produced works, but after this recent case I’m not sure even those tenuous attempts hold as a legal matter anymore.
  • Authorship: When you combine a handful of pieces of software, most of the time you also control the licensing of at least one of those pieces of software, and you can adjust the licensing of that piece as needed. (Widely-used exceptions to this rule, like OpenSSL, tend to be rare.) In other words, if you’re writing a Linux kernel driver, or a WordPress theme, you can choose the license to make sure it complies. Not necessarily the case in data combinations: if you’re making use of large public data sets, you’re often combining many other data sources where you aren’t the author. So if some of them have conflicting license obligations, you’re stuck.

How attribution is managed

Attribution in large software projects is painful enough that lawyers have written a lot on it, and open-source operating systems vendors have built somewhat elaborate systems to manage it. This isn’t just a problem for copyleft: it is also a problem for the supposedly easy case of attribution-only licenses.

Now, again, instead of dozens of authors, often employed by the same copyright-owner, imagine hundreds or thousands. And imagine that instead of combining these pieces in basically the same way each time you build the software, imagine that every time you have a different query, you have to provide different attribution data (because the relevant slices of data may have different sources or authors). That’s data!

The least-bad “solution” here is to (1) tag every field (not just data source) with licensing information, and (2) have data-reading software create new, accurate attribution information every time a new view into the data is created. (I actually know of at least one company that does this internally!) This is not impossible, but it is a big burden on data software developers, who must now include a lawyer in their product design team. Most of them will just go ahead and violate the licenses instead, pass the burden on to their users to figure out what the heck is going on, or both.

Who creates data

Most software is either under a very standard and well-understood open source license, or is produced by a single entity (or often even a single person!) that retains copyright and can adjust that license based on their needs. So if you find a piece of software that you’d like to use, you can either (1) just read their standard FOSS license, or (2) call them up and ask them to change it. (They might not change it, but at least they can if they want to.) This helps make copyleft problems manageable: if you find a true incompatibility, you can often ask the source of the problem to fix it, or fix it yourself (by changing the license on your software).

Data sources typically can’t solve problems by relicensing, because many of the most important data sources are not authored by a single company or single author. In particular:

  • Governments: Lots of data is produced by governments, where licensing changes can literally require an act of the legislature. So if you do anything that goes against their license, or two different governments release data under conflicting licenses, you can’t just call up their lawyers and ask for a change.
  • Community collaborations: The biggest open software relicensing that’s ever been done (Mozilla) required getting permission from a few thousand people. Successful online collaboration projects can have 1-2 orders of magnitude more contributors than that, making relicensing is hard. Wikidata solved this the right way: by going with CC0.

What is the bottom line?

Copyleft (and, to a lesser extent, attribution licenses) works when the obligations placed on a user are in balance with the benefits those users receive. If they aren’t in balance, the materials don’t get used. Ultimately, if the data does not get used, our egos feel good (we released this!) but no one benefits, and regardless of the license, no one gets attributed and no new material is released. Unfortunately, even minor requirements like attribution can throw the balance out of whack. So if we genuinely want to benefit the world with our data, we probably need to let it go.

So what to do?

So if data is legally hard to build a license for, and the nature of data makes copyleft (or even attribution!) hard, what to do? I’ll go into that in my next post.