Meet our new Text Enrichment Research Fellow, Wim Peters

By Wim Peters

I joined LACR in December 2017 as Text Enrichment Research Fellow. In this role I provide computational support for the transcription activities.

What is my background? Coming from a linguistic background (Classics, psycholinguistics), I entered the world of computational lexicography after my linguistics studies in the Netherlands. From then on my main activity was building multilingual lexical knowledge bases, such as computational lexicons, machine translation dictionaries, term banks and thesauri.

From 1996 I worked as a Senior Research Associate/Research Fellow at the Natural Language Processing (NLP) Group in the Department of Computer Science at the University of Sheffield, where I received my PhD in the areas of computational linguistics and AI. The group works with GATE (General Architecture for Text Engineering), a framework for language engineering applications that supports efficient and robust text processing. In Sheffield I specialised further in knowledge acquisition from text and the modelling of that knowledge into formal representations such as RDF and OWL. I participated in many projects in application fields such as fisheries, digital archiving and law.

NLP for legal applications is a growing area in which I have been engaged. Because legal texts mostly consist of unstructured text, NLP allows the automatic filtering of legal text fragments, the extraction of conceptual information and automated support for text interpretation through close reading. By using a combination of NLP and Semantic Web technologies such as XML and ontologies, novel methods can be developed to analyse the law, attempt conceptual modelling of legal domains and support automated reasoning. For instance, concerning case-based reasoning, Adam Wyner and I applied natural language information extraction techniques to a sample body of cases, in order to automatically identify and annotate the relevant facts (or ‘case factors’) that shape legal judgments. Annotated case factors can then be extracted for further processing and interpretation.1

Another example of my activity in conceptual extraction and modelling is the creation of a semi-automatic methodology and application for identifying the Hohfeldian relation ‘Duty’ in legal text.2 Using the GATE tool for the automated extraction of Duty instances and their associated roles such as DutyBearer, the method provides an incremental knowledge base intended to support scholars in their interpretation.3
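To give a feel for what extracting a Duty instance and its DutyBearer role can mean, here is a minimal Python sketch. It is not the GATE-based method described above: the regular expression, the function name `find_duties`, the `modal` and `action` fields and the sample sentences are all assumptions made up for illustration, and a real system would need far richer linguistic analysis.

```python
import re

# Hypothetical surface pattern: "<bearer> shall/must <action>."
# This is an illustrative assumption, not the rules used in the cited work.
DUTY = re.compile(
    r"\b(?P<bearer>(?:The|Each|Every|A|An)\s+[\w\s]+?)\s+"
    r"(?P<modal>shall|must)\s+"
    r"(?P<action>[^.]+)\."
)

def find_duties(text):
    """Return one dict per sentence that matches the duty pattern."""
    return [
        {
            "DutyBearer": m["bearer"].strip(),
            "modal": m["modal"],
            "action": m["action"].strip(),
        }
        for m in DUTY.finditer(text)
    ]

# Invented sample sentences, for illustration only.
sample = (
    "The employer shall provide safe equipment. "
    "The tenant may sublet the flat. "
    "Each member must pay the annual fee."
)

duties = find_duties(sample)
for duty in duties:
    print(duty["DutyBearer"], "->", duty["action"])
```

The sentence with "may" is skipped because it expresses a permission rather than a duty. Each match could then be stored as a candidate annotation for a scholar to confirm or reject.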

I also work on creating or transforming text representation structures. In a recent project with the Law School at the University of Birmingham my main task was the reformatting of legal judgments from national and EU legislation in 23 languages, for storage and querying purposes, into both the Open Corpus Workbench format and an inline, TEI-compliant XML format.

Finally back to the present. My main interest is in using language technology to serve Digital Humanities (DH) scholarly research. Interpreting text involves the methodological application of NLP techniques and the formal modelling of the knowledge extracted within a collaborative setting involving expert scholars and language technicians.

Language technology should assist interpretative scholarly processes. Computational involvement in DH needs to ensure that humanities researchers – a considerable part of whom still remain to be convinced of the advantages of this digital revolution for their research – will embrace language technology. This will further researchers’ aims in textual interpretation, for instance in the selection of relevant text fragments, and in the creation of an integrated knowledge structure that makes semantic content explicit and uniformly accessible.

The collaborative automatic and manual knowledge acquisition workflow is illustrated in the figure below.


Within the DH space of LACR, I appreciate the philological building blocks that are being laid. The XML structure allows further exploration of the data through querying with XQuery. XML-based analysis tools (e.g. AntConc4 and GATE) can be used for analysis and for the future addition of knowledge about the content of the registers, for instance the formulaic nature of the legal language used, based on n-grams, and the semantic impact of some regularly used patterns of words. For example, the Latin phrase ‘electi fuerunt’ (‘they were elected’) collocates in the text with entities such as persons, dates and offices, which fit into a conceptual framework about ‘election’.
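The idea of spotting formulaic language through n-grams can be sketched in a few lines of Python. This is a minimal illustration, not a description of the tools named above: the function name `ngram_counts` and the Latin sample sentences are assumptions invented for the example and are not taken from the registers.

```python
import re
from collections import Counter

def ngram_counts(text, n):
    """Count word n-grams in `text` (lowercased, punctuation ignored)."""
    tokens = re.findall(r"\w+", text.lower())
    # Slide a window of n tokens over the token list.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

# Invented sample text, for illustration only.
sample = (
    "Eodem die electi fuerunt ballivi. "
    "Eodem die electi fuerunt seriandi. "
    "Quo die electus fuit prepositus."
)

bigrams = ngram_counts(sample, 2)

# Word pairs that recur are candidates for formulaic phrases.
formulaic = [(gram, count) for gram, count in bigrams.most_common() if count > 1]
print(formulaic)
```

On a real corpus the same counts, taken over longer n-grams, would point to recurring formulae whose neighbouring words (persons, dates, offices) can then be inspected.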

Looking into the future, standard representations such as TEI-XML ensure that information can be added flexibly and incrementally as metadata for the purpose of scholarly corpus enrichment. Knowledge acquisition through named entity recognition, term extraction and textual pattern analysis will help build an incremental picture of the domain. This knowledge can then be formalised through knowledge representation languages such as RDF and OWL. That will serve to provide an ontological backbone to the extracted knowledge, and enable connections to Linked Data across the Web.
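As a rough sketch of what formalising an extracted fact in RDF could look like, the Python below writes a few statements about an election in the N-Triples syntax. Everything under `http://example.org/lacr/` is a hypothetical namespace, and the `ntriple` helper, the `Election` class and the `office` property are assumptions for illustration, not the project's actual model. The lines are formatted by hand to keep the example self-contained; in practice an RDF library would normally do this.

```python
def ntriple(subject, predicate, obj, literal=False):
    """Format one RDF statement as an N-Triples line."""
    if literal:
        # Escape backslashes and double quotes inside string literals.
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        obj_term = f'"{escaped}"'
    else:
        obj_term = f"<{obj}>"
    return f"<{subject}> <{predicate}> {obj_term} ."

BASE = "http://example.org/lacr/"  # hypothetical namespace, for illustration
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

triples = [
    ntriple(BASE + "election/1", RDF_TYPE, BASE + "Election"),
    ntriple(BASE + "election/1", BASE + "office", BASE + "office/bailie"),
    ntriple(BASE + "office/bailie", RDFS_LABEL, "bailie", literal=True),
]
print("\n".join(triples))
```

Because every entity is identified by a URI, statements like these can later be linked to matching entities in other datasets.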

  1. Wyner, A., and Peters, W. (2010), Lexical semantics and expert legal knowledge towards the identification of legal case factors, JURIX 2010. 
  2. For a description of Hohfeld’s legal relations see e.g. 
  3. Peters, W. and Wyner, A. (2015), Extracting Hohfeldian Relations from Text, JURIX 2015. 

Digital Humanities – What’s the fuss about?

by Anna D. Havinga

“Digital Humanities” (DH) has become a vogue word in academia in the last few decades. DH centres have been set up, DH workshops and summer schools are held regularly all over the world, and the number of DH projects is increasing rapidly. But what is all the fuss about?


What is Digital Humanities?

There are numerous articles that discuss what DH is and is not. It is generally agreed that just posting texts or pictures on the internet or using digital tools for research does not qualify as DH.1 There are, however, few works that give a concise definition of DH. Kirschenbaum quotes a definition from Wikipedia, which he describes as a working definition that “serves as well as any”.2 In my view, the definition for DH on Wikipedia3 has even improved since 2013, when Kirschenbaum’s article was published. I believe it now captures the essence of DH more accurately:

[…] [A] distinctive feature of DH is its cultivation of a two-way relationship between the humanities and the digital: the field both employs technology in the pursuit of humanities research and subjects technology to humanistic questioning and interrogation, often simultaneously. Historically, the digital humanities developed out of humanities computing, and has become associated with other fields, such as humanistic computing, social computing, and media studies. In concrete terms, the digital humanities embraces a variety of topics, from curating online collections of primary sources (primarily textual) to the data mining of large cultural data sets to the development of maker labs. Digital humanities incorporates both digitized (remediated) and born-digital materials [i.e. materials that originate in digital form, ADH] and combines the methodologies from traditional humanities disciplines (such as history, philosophy, linguistics, literature, art, archaeology, music, and cultural studies) and social sciences, with tools provided by computing (such as Hypertext, Hypermedia, data visualisation, information retrieval, data mining, statistics, text mining, digital mapping), and digital publishing.

Our Law in Aberdeen Council Registers project can serve as a prime example of a DH project: we create digital transcriptions of the Aberdeen Burgh Records (1397–1511) with the help of computing tools. This means that we type the original handwritten text into a software programme in a format that can be understood by computers. More specifically, we use the oXygen XML editor with the add-on HisTEI to create transcriptions that are compliant with the Text Encoding Initiative (TEI) guidelines (version P5).4 In this way, we produce a machine-readable and machine-searchable text.5 But what benefits does this have? Why do we go through all this effort when the pictures of the Aberdeen Burgh Records are already available online?6


What are the benefits of a digital, transcribed version of a text?

Apart from the obvious benefit of a digital, transcribed version of text being much easier to read than the original handwriting, it allows for information to be added to the text. With the help of so-called ‘tags’, a text can be enriched with all kinds of structural annotations and metadata. Tagging here means adding XML annotations to the text. For example, the textual passages in the Aberdeen Burgh Registers, which are mainly written in Latin or Middle Scots, can be marked up as such, using the ‘xml:lang’ attribute. A researcher who is interested in the use of Middle Scots in these registers could then search for and find all Middle Scots sections in the corpus very easily with the help of a text analysis tool such as AntConc or SketchEngine, without having to plough through the sections written in Latin. More generally, enriching the text with tags means that a researcher does not have to read through all of the over 5,000 pages of the Aberdeen Council Registers that we will transcribe in order to find what s/he is looking for. A machine-readable and machine-searchable text not only saves time when researching a particular topic but is also generally more flexible than a printed version of the text, as further tags can be added and unwanted tags can be hidden. Furthermore, a digital text allows us to ask different questions of a text corpus. It is those possible questions plus a variety of other issues that have to be considered before embarking on a DH project.
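To make the language example concrete, here is a minimal Python sketch of how language markup lets a program pull out only the Middle Scots passages. The XML fragment is a simplified, invented stand-in for a transcription, the language codes "la" and "sco" are assumed values that may differ from those the project uses, and `passages_in` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented, simplified fragment for illustration; not a real register entry.
sample = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <div xml:lang="la"><p>Eodem die electi fuerunt ballivi.</p></div>
  <div xml:lang="sco"><p>The said day the balyeis war chosin.</p></div>
  <div xml:lang="la"><p>Quo die electus fuit prepositus.</p></div>
</body>"""

def passages_in(root, lang):
    """Return the text of every element whose own xml:lang equals `lang`."""
    return [
        " ".join("".join(element.itertext()).split())  # normalise whitespace
        for element in root.iter()
        if element.get(XML_LANG) == lang
    ]

root = ET.fromstring(sample)
scots = passages_in(root, "sco")
print(scots)
```

The sketch only checks the attribute on each element itself; it does not follow the XML rule that a language value is inherited by child elements, which a fuller implementation would need to handle.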


Transcription of volume 7 of the Aberdeen Council Registers (p. 60), annotated with XML tags


What has to be considered when setting up a DH project?

There are several major questions that have to be considered before starting a DH project of the sort we are carrying out: What is it that you want to get from the material you work on? Who else will be using it? In what way will it be used? Which research questions could be asked? Information on the possible users of the born-digital material is essential in order to decide which information should be marked up in the corpus of text. This is, of course, also a matter of time (and money), since adding information to the original text in the form of tags takes time. The balance between time and enrichment has to be determined for each individual DH project. In our project we decided to go through different stages of annotation – starting with basic annotations (e.g. expansions, languages) and adding further tags later (e.g. names, places etc.). Also, users will be able to add further annotations that may be specific to their research projects. Beyond these considerations, choices about software and hardware, tools, platforms, web development, infrastructure, server environment, interface design etc. have to be made before embarking on the DH project. Anything that is not determined at the beginning of the project may require considerable effort at a later stage.

It is certainly worth going through all this effort. To us it is clear why DH has become such a big thing. It eases research, extends the toolkits of traditional scholarship, and opens up material to a wider audience of users.7 With tags we can enrich the content of texts with additional information, which can then change the nature of humanities inquiry. DH projects are by nature about networking and collaboration between different disciplines, which is certainly the way forward in the humanities.



  1. Anne Burdick et al. 2012. Digital_Humanities. Cambridge, MA: MIT Press, p. 122. 
  2. Matthew G. Kirschenbaum. 2013. ‘What Is Digital Humanities and What’s It Doing in English Departments?’ In: Melissa Terras, Julianne Nyhan, Edward Vanhoutte (eds), Defining Digital Humanities. A Reader. Farnham: Ashgate, 195-204, p. 197. 
  3. [accessed 19.07.2016] 
  4. [accessed 25.07.2016] 
  5. In further blog posts, we will explain in more detail how we do this. 
  6. [accessed 25.07.2016] 
  7. Anne Burdick et al. 2012, p. 8.