DH 2016 Abstracts

When DH Meets Law: Problems, Solutions, Perspectives

Digital data is fuel for data-intensive science. Access to, re-use of and sharing of this data, however, while required by academic ethos and good practice, are often highly restricted by legal frameworks. Two areas of law in particular affect research data: intellectual property (copyright and database rights) and personal data protection. These issues are particularly relevant in the Digital Humanities, which study various aspects of human activities in general, and their creative and social aspects in particular. In fact, most research data in the digital humanities fall within the scope of either intellectual property or data protection law, which means that they cannot be freely accessed, re-used and shared without the permission of the rights holder or the data subject’s consent.

Moreover, research funding agencies increasingly require that the results (and underlying data) of the research projects they fund be made available in Open Access. Open sharing of research data and outcomes is often perceived as an ethical obligation in contemporary science, but it cannot be done in a satisfactory way without addressing legal concerns (such as appropriate licensing and rights clearance).

Legal issues are increasingly being taken into account in the preparation phases of many research projects. Scientists who do not consider legal issues in their research activities may be exposed to legal risks. Existing statutory exceptions for research rarely provide enough relief (even though lobbying efforts are being made to extend their scope). In short, modern science in general, and Digital Humanities in particular, are more concerned with legal issues than ever before.

The purpose of the multiple paper session we propose is to emphasize the problem and to discuss various technological and organizational solutions, as well as the future legal challenges that DH researchers will have to face.

Three papers by authors with both legal training and hands-on experience with D-SSH research data management will be presented. The first one compares organizational and technical solutions adopted in the field of DH and Social Sciences. The second discusses research data licensing, and presents existing tools that help researchers in the process. The third paper examines legal and ethical aspects of stylometry and authorship attribution research.

1. “One Does Not Simply Share Data”. Organisational and Technical Remedies to Legal Constraints in Research Data Sharing -- building bridges between Digital Humanities and the Social Sciences.

by Paweł Kamocki (IDS Mannheim/Paris Descartes/WWU Münster), Katharina Kinder-Kurlanda (GESIS) and Marc Kupietz (IDS Mannheim)

Within the Social Sciences there exists a long tradition of data sharing, which is facilitated by infrastructure institutions such as the GESIS Data Archive, which has been providing survey data to researchers since the 1960s. More recently, various technical solutions aiming to grant secure and user-friendly access to data requiring special protection have been emerging in the field.

The DH also have a well-established tradition of research data sharing. In linguistics, for example, digital text collections have likewise been published since the 1960s (e.g. the Brown Corpus or the Mannheimer Korpus). The first software solutions to share the data and to make it accessible to other researchers emerged in the late 1980s and started to boom with the appearance of the WWW in the early 1990s.

1.1. Legal barriers to data sharing

Legal issues have long been identified as barriers to research data sharing. They can be divided into two categories: those related to intellectual property rights and those related to privacy laws.

Intellectual property rights (such as copyright and the database right) grant the rights holders certain exclusive rights (monopolies), i.e., rights to exclude others from the use of their property. For example, in order to copy and distribute a copyright-protected work or a database, one normally has to obtain permission from the rights holder, usually in an agreement known as a license. This strongly affects DH, a group of disciplines fueled by digital data derived from human creative activities, which normally qualify for copyright protection.

It is essential to understand that intellectual property is similar to “traditional” (i.e. corporeal) property. Therefore, it can be said that most research data in DH in fact belong to a third party (author or publisher). The right to property is a fundamental freedom, which overrides freedom of research. As a consequence, statutory research exceptions are rarely enough to allow use of copyright-protected data in research projects.

Researchers in DH therefore need to obtain licenses for the use of data, which is not an easy task. The negotiations may be time-consuming, and the result is not always satisfactory. In practice, licenses signed with e.g. publishers are often very restrictive and non-transferable.

Another legal framework that affects researchers in areas such as medicine or the social sciences, but also in the DH, is personal data protection. In principle, personal data (i.e. any information related to an identifiable person) can only be processed if the data subject has validly consented to the processing. While it is true that anonymised data can be freely processed, anonymisation may strip a dataset of most (if not all) of its informational and scientific value. The obligation to obtain consent is particularly burdensome when it comes to older data that has been collected without consent, or that has been collected for a different purpose (re-purposing normally necessitates new consent). Also, in practice, consent rarely covers transfer and sharing of data with other researchers.

1.2. Social Science approach

In the Social Sciences, quantitative survey data is particularly interesting for sharing. The highly controlled and well-documented ways of gathering data in large-scale survey programmes make the data highly reusable in a methodologically sound way. The GESIS Data Archive for the Social Sciences in Germany provides survey data for secondary use and thus allows researchers to share collected data in a user-friendly, searchable and standardised manner. Due to data protection legislation, the participants in surveys whose data is provided by the archive must not be re-identifiable, or only with a disproportionate amount of time, expense and labour. Anonymization challenges usually occur once detailed geographic and demographic information has been collected.

To improve data sharing, several solutions have been found for the Social Sciences. For example, most data at the GESIS archive is anonymised and thus can be provided for download via the online data catalogue. Some datasets containing more detailed information are provided through secure data access solutions. A combination of contractual, organizational and technical safeguards is employed to ensure that individuals’ rights to anonymity are protected. For example, for particularly disclosive data, researchers must visit a safe room where a completely encapsulated virtual research environment is provided via a thin client. They cannot download any data or access the internet. They are also not allowed to bring mobile phones or other electronic devices. Any analysis output they produce is intellectually assessed for its level of disclosiveness and only handed to researchers once the output criteria are fulfilled. Only users whose signed usage agreements (detailing the research topic and the methods applied) have been approved can use the safe room.

Secure remote solutions either provide researchers with a secure connection to an encapsulated work environment as described above or allow the submission of code and syntax to be run on the data by the data provider. All remote solutions need to be secured from threats posed by using the internet.
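The second remote model (code submitted by the researcher and run by the data provider) can be illustrated with a minimal sketch. It assumes a provider-side runner that releases only frequency tables and suppresses small cells. The threshold of 5, the function names and the toy records are hypothetical and do not describe GESIS rules or software; in the safe-room procedure described above, output is assessed by a person rather than by a script.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical threshold: cells describing fewer respondents are suppressed.
MIN_CELL_SIZE = 5

Record = Dict[str, str]
Analysis = Callable[[Iterable[Record]], Counter]


def run_remote_job(analysis: Analysis, records: List[Record],
                   min_cell_size: int = MIN_CELL_SIZE) -> Dict[str, Optional[int]]:
    """Run submitted analysis code next to the data and screen its output.

    Only a frequency table leaves the provider; any cell below the
    threshold is replaced by None so that rare cases are not disclosed.
    """
    table = analysis(records)
    return {cell: (count if count >= min_cell_size else None)
            for cell, count in table.items()}


def count_by_region(records: Iterable[Record]) -> Counter:
    """Example of researcher-submitted code: respondents per region."""
    return Counter(record["region"] for record in records)


# Toy data standing in for a protected survey file.
survey = [{"region": "north"}] * 7 + [{"region": "south"}] * 2
released = run_remote_job(count_by_region, survey)
```

In this toy run the count for "north" is released, while the count for "south" is withheld because it falls below the assumed threshold.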

1.3. DH approach

To cope with legal challenges, to make research data as openly accessible as possible, and to enable traceability and replicability without interfering with legitimate interests of rights holders, the disciplines that deal with language as their primary research data, particularly linguistics, have developed several strategies.

As already discussed above, usually the only way to acquire text for research purposes is to obtain licenses from copyright holders. Copyright holders can sometimes be convinced to provide scientific licenses for free or for comparatively low fees, as long as these licenses do not interfere with the company’s business model. Every group's interests can be satisfied provided that an institute is willing to act as an intermediary between both parties: it concludes partially transferable license agreements with rights holders and license agreements with end-users, and it provides software that enforces the license restrictions while offering all the retrieval and analysis functions that researchers need. Indeed, this model has worked successfully at the Institute of German Language (IDS) since the beginning of the 1990s, and also for most other providers of national and reference corpora.

The problem with the intermediary model alone is that it requires the intermediary to provide all functions that any researcher might require. While, for example, the requirements of more traditional linguists could be satisfied, this was not possible for researchers in the field of text mining and computational linguistics, where the methods of analysis themselves are at the center of research and therefore subject to rapid change.

To meet the needs of such user groups, another idea, recently described for linguistics but traditionally used in data-intensive disciplines such as climate research, has to be put into action. Extending Gray’s (2003) famous claim “put the computation near the data” to situations where the data cannot be moved due to license restrictions, data providers also provide mechanisms that allow end-users to run their analysis software on the data located at the provider, making sure that the software does not violate any license restrictions.

Another remedy that is often applied to corpora based solely on texts from the WWW is not to share the research data itself, as this could be a copyright violation, but rather to share the software that retrieves the texts from the Web. Depending on the application scenario and the local legislation, this technique can enable end-users to benefit from statutory exceptions. An unwanted side-effect of this approach is that the corpus data retrieved by different runs of the retrieval software cannot be guaranteed to be identical. However, there is some consensus in the research community that expectations regarding the replicability and persistency of research data necessarily have to be lowered to a realistic standard that takes legal restrictions into account. This aspect has also recently entered the best-practice guidelines of the German Research Foundation.
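The idea of sharing retrieval software instead of texts can be sketched as follows. The sketch assumes that the corpus is distributed as a "recipe" of URLs with checksums, so that an end-user can at least detect when a re-downloaded text differs from the one originally collected. The recipe format, the function names and the example.org URLs are hypothetical, and the downloader is replaced by a dictionary lookup so that the example runs offline.

```python
import hashlib
from typing import Callable, Dict, List, NamedTuple


class RecipeEntry(NamedTuple):
    url: str     # where the text was found when the corpus was built
    sha256: str  # fingerprint of the text as originally retrieved


def fingerprint(text: str) -> str:
    """Return the SHA-256 hex digest of a text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def rebuild_corpus(recipe: List[RecipeEntry],
                   fetch: Callable[[str], str]) -> Dict[str, dict]:
    """Re-download each text and report whether it still matches the recipe."""
    report = {}
    for entry in recipe:
        text = fetch(entry.url)
        report[entry.url] = {
            "text": text,
            "unchanged": fingerprint(text) == entry.sha256,
        }
    return report


# Stand-in for a real downloader, so the sketch runs without network access.
fake_web = {
    "http://example.org/a": "Ein Beispieltext.",
    "http://example.org/b": "Dieser Text wurde inzwischen bearbeitet.",
}
recipe = [
    RecipeEntry("http://example.org/a", fingerprint("Ein Beispieltext.")),
    RecipeEntry("http://example.org/b", fingerprint("Dieser Text.")),
]
report = rebuild_corpus(recipe, fake_web.__getitem__)
```

Here the first text is reported as unchanged and the second as changed, which makes the loss of identity between runs visible even though it cannot be prevented.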

1.4. Conclusions

D-SSH are dealing with data surrounded by legal issues. Traditionally, Social Science researchers process more privacy-sensitive information, whereas DH researchers work with data protected by intellectual property rights, usually belonging to third parties; the division, however, is not clear-cut. Some disciplines in DH (such as linguistics) also have to deal with privacy-sensitive material, and the Social Sciences are concerned, if not by copyright, then by other branches of intellectual property (such as database right). Increasingly, commercial data owners such as social media companies are becoming important.

Institutes in both disciplines have developed idiosyncratic ways of coping with legal restrictions which provide satisfactory results. This shows us that some legal issues may be resolved by appropriate organisational, technical and infrastructural solutions; the comparison between DH and Social Sciences, however, demonstrates that both disciplines still have room for improvement and that there is a lot that they can learn from each other.

2. « Trust me. I’m a License Selector ». Licensing for Digital Humanities

by Paweł Kamocki (IDS Mannheim / Université Paris Descartes / WWU Münster) and Pavel Stranak (UFAL Prague)

Lack of legal interoperability (i.e. a situation in which a dataset cannot be used due to incompatible licensing restrictions on its various parts) has been identified as one of the major obstacles for data access, sharing and re-use. This is particularly relevant in the field of Digital Humanities, where data are often protected by copyright (because they are created by human authors). While it is true that some data (e.g. those obtained from press editors) are only available to researchers under very restrictive license agreements, quite often legal interoperability problems can be solved by proper licensing of research outcomes. Indeed, despite the fact that openness and reproducibility of results have long been identified as cornerstones of the scientific community, in practice many digital datasets and tools are being shared under licenses that are unnecessarily restrictive or not fit for the purpose, or even without any license at all. This is probably due to the fact that the task of choosing an appropriate license may seem difficult for an average researcher with limited access to legal advice. As a response to that problem, attempts have been made to build tools (referred to as License Choosers, License Selectors or even License Wizards) that guide users through the jungle of available public licenses and allow them to choose the one that is most suitable for their needs.

Before these License Selectors can be presented and assessed, it is essential to define the notion of a public license. A public license is a license that grants certain rights not to an individual user, but to the general public (every potential user). Public licenses for software have been known since the 1980s (when software licenses such as BSDL, MIT or GNU GPL emerged). However, public licenses for other categories of works (including datasets) only appeared in the 21st century, mostly due to the creation of the Creative Commons foundation. The latest version of the CC license suite (including six licenses, a waiver and a public domain mark), CC 4.0, is well adapted for datasets, as it covers not only copyright but also the sui generis database right; older versions, however, are still in use. While choosing a license, one has to keep in mind that the licenses which are appropriate for software are not appropriate for data and vice versa. Moreover, not all public licenses are ‘open’, i.e. not all of them meet the requirements for the Open Access/Open Data/Open Source label. In our paper, we would like to briefly demonstrate three online tools made specifically for licensing of research material.

The Licentia tool (http://licentia.inria.fr/visualize), developed in 2014 by Cardellino for INRIA (French Institute for Research in Computer Science and Automation), is in fact a conglomerate of three tools: a License Search Engine (which identifies licenses that meet a set of requirements defined by the user), a License Compatibility Checker (which assesses whether two licenses are compatible, i.e. whether material licensed under those two licenses can be ‘mixed’) and a License Visualiser (an interesting extra feature which produces graph-based visualisations of licenses expressed in ODRL - Open Digital Rights Language Deontology).
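What a compatibility check of this kind has to decide can be illustrated with a deliberately simplified sketch. It is not Licentia's method: the condition sets below are hand-written for five Creative Commons licenses, the function name is hypothetical, and the rule covers only the no-derivatives and share-alike conditions, so it overgeneralizes in the same way as any such tool.

```python
from typing import Dict, FrozenSet

# Simplified, hand-written condition sets for five Creative Commons licenses.
# BY = attribution, SA = share-alike, NC = non-commercial, ND = no derivatives.
LICENSES: Dict[str, FrozenSet[str]] = {
    "CC BY": frozenset({"BY"}),
    "CC BY-SA": frozenset({"BY", "SA"}),
    "CC BY-NC": frozenset({"BY", "NC"}),
    "CC BY-NC-SA": frozenset({"BY", "NC", "SA"}),
    "CC BY-ND": frozenset({"BY", "ND"}),
}


def can_be_mixed(first: str, second: str) -> bool:
    """Say whether material under two licenses can be combined into one work."""
    a, b = LICENSES[first], LICENSES[second]
    # Mixing creates a derivative work, which ND licenses do not allow.
    if "ND" in a or "ND" in b:
        return False
    # Two share-alike licenses each demand their own terms for the result.
    if "SA" in a and "SA" in b:
        return a == b
    # One share-alike license dictates the license of the result, so the
    # other license may not impose a condition that the result would lack.
    if "SA" in a:
        return b <= a
    if "SA" in b:
        return a <= b
    return True
```

Under these assumptions, `can_be_mixed("CC BY", "CC BY-SA")` holds, whereas `can_be_mixed("CC BY-SA", "CC BY-NC")` does not, because a result licensed under CC BY-SA would drop the non-commercial condition.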

The ELRA (European Language Resources Association) License Wizard (http://wizard.elda.org), released in April 2015, allows users to define a set of features and browse corresponding licenses. For now, the tool only includes CC, META-SHARE and ELRA licenses, so it is particularly useful for language resources.

Finally, the Public License Selector (http://ufal.github.io/public-license-selector/), developed by Kamocki, Stranak and Sedlak in 2014 as a cooperation between two CLARIN centres (IDS Mannheim and Charles University in Prague), uses an algorithm (a series of yes/no questions) to assist the user in the licensing process. It allows users to choose licenses for both data and software, and features a built-in License Interoperability Tool. Licenses that meet the ‘open’ requirement are clearly marked. Moreover, unlike the two other tools, it is made available under Open Software/Open Data conditions.
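A selection procedure built from yes/no questions can be sketched as a small decision tree. The questions, their order and the function name below are hypothetical and are not taken from the Public License Selector itself; the sketch only covers the Creative Commons licenses and the CC0 waiver for data.

```python
from typing import Callable

# Hypothetical questions; a real tool would word and order them differently.
Q_KEEP_RIGHTS = "Do you want to keep any rights in the dataset?"
Q_COMMERCIAL = "May others use the dataset commercially?"
Q_DERIVATIVES = "May others share modified versions?"
Q_SHARE_ALIKE = "Must modified versions carry the same license?"


def select_data_license(answer: Callable[[str], bool]) -> str:
    """Walk through the questions and return the name of a CC license."""
    if not answer(Q_KEEP_RIGHTS):
        return "CC0"  # waiver: no conditions are imposed at all
    parts = ["BY"]  # every remaining CC license requires attribution
    if not answer(Q_COMMERCIAL):
        parts.append("NC")
    if not answer(Q_DERIVATIVES):
        parts.append("ND")
    elif answer(Q_SHARE_ALIKE):
        # Share-alike is only asked about when derivatives are allowed.
        parts.append("SA")
    return "CC " + "-".join(parts)


# Example session with pre-recorded answers instead of interactive input.
answers = {
    Q_KEEP_RIGHTS: True,
    Q_COMMERCIAL: False,
    Q_DERIVATIVES: True,
    Q_SHARE_ALIKE: True,
}
choice = select_data_license(answers.__getitem__)
```

Passing the answers as a function keeps the decision logic separate from the user interface, so the same tree could be driven by a web form or by a command-line prompt.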

All of these tools have both advantages and disadvantages; their biggest disadvantage is that they use (to varying degrees) a very specific language, which in fact requires basic knowledge of Intellectual Property Law from the user. They also necessarily involve a certain degree of over- or undergeneralization, especially when it comes to assessing license interoperability. Nevertheless, they remain very useful for the research community and may indeed help facilitate re-use and sharing of tools and data in Digital Humanities.

3. Legal and Ethical Aspects of Authorship Attribution Using Stylometry - EU and US Perspectives

by Erik Ketzan (IDS Mannheim) and Paweł Kamocki (IDS Mannheim / Université Paris Descartes / WWU Münster)

3.1. Introduction

Authors have written anonymously since the invention of writing, and the growing digital humanities field known variously as stylometry / computational stylistics / authorship attribution often aims to discover the identity (or rule out the identity) of anonymous authors.

Depending on whether such authors are living, whether the works in question are protected by copyright, and what the aims of the digital humanities research are, vastly different legal frameworks govern such research in the European Union and United States. The strong data protection laws of the EU seem to prohibit certain types of authorship attribution research, while researchers in the US face far fewer restrictions regarding data protection regulations. In addition to data protection, the acts of copying and analyzing texts for the purposes of stylometry raise copyright concerns. In the US, these acts seem to be largely allowed by the fair use doctrine. In the EU, these fall into more questionable legal territory, although new laws regarding text and data mining offer improved guidance to researchers.

As laws concerning research in the digital age are being revisited in both the US and EU, it is important to see where stylometry falls under current legal frameworks, and how, and whether, researchers should advocate for changes to law. Finally, we argue that a parallel debate regarding the ethics of stylometric research should begin. As stylometric research and technology continue to improve, with promises of improved reliability of authorship attribution, researchers should begin to debate which questions they should ethically tackle, not only which questions they can.

3.2. Stylometry of anonymous authors under EU law

In the EU, where the memory of totalitarian governments is still present, Member States value privacy very highly. This is reflected in legislation, where the right to be and remain anonymous is not only protected by rules on the processing of personal data, but sometimes also to an extent guaranteed by copyright laws.

The Data Protection Directive is the primary source of laws governing the processing of personal data, and guides Member States in protecting "the fundamental rights and freedoms of natural persons, and in particular their right to privacy with respect to the processing of personal data." The Directive defines personal data as, "any information relating to an identified or identifiable natural person ('data subject'); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity." Processing of personal data is defined as, "any operation or set of operations which is performed upon personal data, whether or not by automatic means, such as collection, recording, organization, storage, adaptation or alteration, retrieval, consultation, use," etc.

In general, processing of personal data can only be done if the data subject (i.e. the person that the data refer to) has unambiguously given his consent. Moreover, the Data Protection Working Party (a body composed of representatives of National Data Protection Authorities from each Member State and whose purpose is to give expert advice on the interpretation of the Data Protection Directive) clearly stated that information that does not relate to an identified person, but is collected for the purpose of identification, shall also be regarded as personal data.

European researchers engaging in stylometric research for the purpose of identifying a living author therefore engage in the processing of personal data, and are subject to the rules and restrictions of the Data Protection Directive and related Member State data protection laws. Alternatively, consent could be obtained to process the personal data, but this leads to the absurd suggestion that researchers obtain permission from an anonymous author so that they can guess at his/her identity.

While the current framework allows for alternative grounds for lawfulness of processing (other than consent), such as the pursuit of legitimate interests, these provisions remain vague and do not guarantee the necessary legal security for researchers. Exceptions from the rules set up by the Directive exist, but only cover very special cases such as freedom of journalistic and artistic expression, public security or (to a limited extent) historical, statistical and scientific research.

An inevitable conclusion, however, is that Personal Data Protection law in the EU protects anonymous authors from being identified against their will, at least when they are still alive.

Anonymity of authors is also addressed by many national laws on copyright. Although anonymous works benefit from a significantly shorter term of protection (70 years after publication, and not 70 years after the death of the author), the anonymity of the author is nevertheless protected; his rights can be exercised by a proxy (usually an agent or publisher). Moreover, in some jurisdictions (e.g., in France), inaccurate attribution of authorship can be regarded as violation of moral rights (i.e., a form of copyright infringement) of both the real author and the falsely attributed one.

3.3. Stylometry of anonymous authors under US law

The legal framework of the United States governing stylometry of anonymous authors is vastly different from that of the EU. The United States has no single general data protection law. The First Amendment of the United States Constitution guarantees the right to free speech, and a broad right to privacy has been inferred from the Constitution by the United States Supreme Court. A number of state constitutions, such as California's, explicitly mention privacy as well.

Courts in the US have recognized certain rights to anonymity, most notably in McIntyre v. Ohio Elections Commission, 514 U.S. 334 (1995), where the Supreme Court held that the freedom to publish anonymously is protected by the First Amendment, and extends beyond the literary realm to the advocacy of political causes. Whether such a right extends to a researcher attempting to remove that anonymity is an open question.

Regarding copyright, there are strong arguments that the acts of copying and data mining text for research purposes are covered by the fair use doctrine, especially after the landmark Google Books case, which held that the scanning of books and making snippets available in search engines is a fair use. As a typical stylometric analysis involves the copying of texts and analysis on a single computer, without distribution of snippets (in other words, infringing less upon exclusive rights than the facts in Google Books), the acts seem to be covered by the fair use framework.

3.4. De Facto

Regardless of the letter of the law, the fact remains that many writers write anonymously and academics are increasingly asked to identify them.

In courts of law, researchers with expertise in linguistics, computer science, and stylometry have acted as expert witnesses for decades now in criminal and civil disputes.

Outside of courts, academics have published or given pronouncements to journalists in most newsworthy instances involving high-profile anonymously written works, including Primary Colors (a 1996 novel satirizing the Clinton Presidential campaign), The Cuckoo's Calling (a 2013 novel revealed to be the work of J.K. Rowling), the Wanda Tinasky letters (dozens of eccentric and creative letters mailed to local California newspapers from 1984-88, which academics proved were not the work of Thomas Pynchon), and many more. In all of these instances, journalists and academics have not discussed the moral or legal right to make such analysis; they have simply done it.

3.5. Ethics and stylometry

The purpose of the proposed paper is, through an analysis of different legal frameworks, to highlight the different norms and assumptions that surround the "un-masking" of anonymous authors. The radical difference in EU and US legal approaches proves that opinions can differ, and that serious debate needs to begin among researchers. As the technology and approaches to stylometry yield increasingly accurate results, it is time for the digital humanities community to begin to discuss ethical standards.