By Wim Peters
I joined LACR in December 2017 as Text Enrichment Research Fellow. In this role I provide computational support for the project's transcription activities.
What is my background? Trained as a linguist (Classics, psycholinguistics) in the Netherlands, I entered the world of computational lexicography after completing my studies. From then on my main activity was building multilingual lexical knowledge bases such as computational lexicons, machine translation dictionaries, term banks and thesauri.
From 1996 I worked as a Senior Research Associate/Research Fellow in the Natural Language Processing (NLP) Group in the Department of Computer Science at the University of Sheffield, where I received my PhD in computational linguistics and AI. The group works with GATE (General Architecture for Text Engineering), a framework for language engineering applications that supports efficient and robust text processing (http://www.gate.ac.uk). In Sheffield I specialised further in knowledge acquisition from text and the modelling of that knowledge in formal representations such as RDF and OWL. I participated in many projects across application fields such as fisheries, digital archiving and law.
NLP for legal applications is a growing area in which I have been engaged. Since legal documents mostly consist of unstructured text, NLP enables the automatic filtering of legal text fragments, the extraction of conceptual information, and automated support for text interpretation through close reading. By combining NLP with Semantic Web technologies such as XML and ontologies, novel methods can be developed to analyse the law, attempt conceptual modelling of legal domains and support automated reasoning. For instance, in the context of case-based reasoning, Adam Wyner and I applied natural language information extraction techniques to a sample body of cases in order to automatically identify and annotate the relevant facts (or ‘case factors’) that shape legal judgments. Annotated case factors can then be extracted for further processing and interpretation.1
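To make the idea concrete: the actual extraction was built with GATE's information extraction components, but the underlying principle of trigger-pattern annotation can be sketched in a few lines of Python. The factor names and patterns below are invented purely for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical factor inventory: in the real work the factors came
# from expert legal analysis, and matching used GATE grammars.
FACTOR_PATTERNS = {
    "disclosure-in-negotiations": r"disclos\w*.{0,40}negotiat\w*",
    "agreement-not-to-disclose": r"agree\w* not to disclose",
}

@dataclass
class FactorAnnotation:
    factor: str
    start: int
    end: int
    text: str

def annotate_factors(case_text: str) -> list[FactorAnnotation]:
    """Annotate text spans whose wording signals a known case factor."""
    annotations = []
    for factor, pattern in FACTOR_PATTERNS.items():
        for m in re.finditer(pattern, case_text, flags=re.IGNORECASE):
            annotations.append(
                FactorAnnotation(factor, m.start(), m.end(), m.group()))
    return annotations

sample = ("The plaintiff disclosed the product design "
          "during licensing negotiations.")
for ann in annotate_factors(sample):
    print(ann.factor, "->", ann.text)
```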
Another example of my activity in conceptual extraction and modelling is the creation of a semi-automatic methodology and application for identifying the Hohfeldian relation ‘Duty’ in legal text.2 Using GATE for the automated extraction of Duty instances and their associated roles, such as DutyBearer, the method builds an incremental knowledge base intended to support scholars in their interpretation.3
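The published method relies on GATE grammars; as a purely illustrative sketch, assuming a single deontic ‘shall’ construction and an invented DutyAction label alongside DutyBearer, the pattern-to-role mapping looks roughly like this:

```python
import re

# One toy surface pattern for deontic 'shall' sentences; the real
# grammar covered many more constructions and associated roles.
DUTY_PATTERN = re.compile(
    r"(?P<bearer>[A-Z][\w ]+?) shall (?P<action>\w[\w ,]*)")

def extract_duties(text: str):
    """Yield Duty instances with their DutyBearer and action."""
    for m in DUTY_PATTERN.finditer(text):
        yield {
            "relation": "Duty",
            "DutyBearer": m.group("bearer").strip(),
            "DutyAction": m.group("action").strip(),
        }

clause = ("The data controller shall notify the supervisory "
          "authority without undue delay.")
for duty in extract_duties(clause):
    print(duty)
```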
I also work on creating and transforming text representation structures. In a recent project with the Law School at the University of Birmingham, my main task was reformatting legal judgments from national and EU legislation in 23 languages into both the Open Corpus Workbench format (http://cwb.sourceforge.net/index.php) and TEI-compliant inline XML, for storage and querying purposes.
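For readers unfamiliar with it, the Corpus Workbench expects so-called ‘vertical’ input: one token per line, tab-separated positional attributes, and XML-style structural tags. A minimal sketch, with invented tokens and lemmas:

```python
# Minimal illustration of CWB 'vertical' format: one token per line,
# tab-separated positional attributes (here word and lemma), and
# structural attributes as XML-style tags. Values are toy examples.
tokens = [("The", "the"), ("court", "court"), ("dismissed", "dismiss"),
          ("the", "the"), ("appeal", "appeal"), (".", ".")]

lines = ["<s>"]
lines += [f"{word}\t{lemma}" for word, lemma in tokens]
lines.append("</s>")
print("\n".join(lines))
```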
Finally, back to the present. My main interest is in using language technology to serve Digital Humanities (DH) scholarly research. Interpreting text involves the methodological application of NLP techniques and the formal modelling of the extracted knowledge, within a collaborative setting that brings together expert scholars and language technicians.
Language technology should assist interpretative scholarly processes. Computational involvement in DH needs to ensure that humanities researchers – many of whom remain to be convinced of the advantages of this digital revolution for their research – will embrace language technology. This will further researchers’ aims in textual interpretation, for instance in the selection of relevant text fragments and in the creation of an integrated knowledge structure that makes semantic content explicit and uniformly accessible.
This collaborative workflow of automatic and manual knowledge acquisition is illustrated in the figure below.
Within the DH space of LACR, I appreciate the philological building blocks that are being laid. The XML structure allows further exploration of the data through querying with XQuery. XML-aware analysis tools (e.g. AntConc4 and GATE) can be used to analyse the registers and add knowledge about their content: for instance, the formulaic nature of the legal language, identified through n-gram analysis, and the semantic impact of some regularly used word patterns. For example, the Latin phrase ‘electi fuerunt’ (‘they were elected’) collocates in the text with entities such as persons, dates and offices, which fit into a conceptual framework about ‘election’.
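The corpus analysis itself is done with tools such as AntConc and GATE, but the underlying idea of n-gram counting and collocation windows is simple enough to sketch. The Latin line below is invented in the style of the registers, not quoted from them.

```python
from collections import Counter

# Invented register-style entry: 'to the office of chamberlain were
# elected John Smyth and William Broun at the feast of St Michael'.
text = ("ad officium camerarii electi fuerunt Johannes Smyth et "
        "Willelmus Broun in festo Sancti Michaelis")
tokens = text.split()

def ngrams(seq, n):
    """All contiguous n-grams of a token sequence."""
    return zip(*(seq[i:] for i in range(n)))

bigrams = Counter(" ".join(g) for g in ngrams(tokens, 2))
print(bigrams.most_common(3))

# Window-based collocates of the formula 'electi fuerunt'.
for i, tok in enumerate(tokens[:-1]):
    if tok == "electi" and tokens[i + 1] == "fuerunt":
        print("left context: ", tokens[max(0, i - 3):i])
        print("right context:", tokens[i + 2:i + 6])
```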
Looking to the future, standard representations such as TEI-XML ensure that information can be added flexibly and incrementally as metadata for the purpose of scholarly corpus enrichment. Knowledge acquisition through named entity recognition, term extraction and textual pattern analysis will help build an incremental picture of the domain. This knowledge can then be formalised in knowledge representation languages such as RDF and OWL, providing an ontological backbone for the extracted knowledge and enabling connections to Linked Data across the Web (http://linkeddata.org/).
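As a taste of what that formalisation might look like, here is a minimal RDF sketch using the Python rdflib library; the namespace, property names and election data are all hypothetical, and a real model would reuse or extend published ontologies.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

# Hypothetical vocabulary for illustration only.
LACR = Namespace("http://example.org/lacr/")

g = Graph()
g.bind("lacr", LACR)

# An invented election event with its office, person and date.
election = LACR["election/1486-camerarius"]
g.add((election, RDF.type, LACR.Election))
g.add((election, LACR.office, Literal("camerarius")))
g.add((election, LACR.electedPerson, Literal("Johannes Smyth")))
g.add((election, LACR.date, Literal("1486-09-29", datatype=XSD.date)))

print(g.serialize(format="turtle"))
```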
- Wyner, A. and Peters, W. (2010), ‘Lexical semantics and expert legal knowledge towards the identification of legal case factors’, JURIX 2010. ↩
- For a description of Hohfeld’s legal relations see e.g. http://www.kentlaw.edu/perritt/blog/2007/12/hohfeldian-primer.html. ↩
- Peters, W. and Wyner, A. (2015), ‘Extracting Hohfeldian Relations from Text’, JURIX 2015. ↩
- http://www.laurenceanthony.net/software/antconc/ ↩