Meet our new Text Enrichment Research Fellow, Wim Peters

By Wim Peters

I joined LACR in December 2017 as Text Enrichment Research Fellow. In this role I provide computational support for the transcription activities.

What is my background? Coming from a linguistic background (Classics, psycholinguistics) I entered the world of computational lexicography after my linguistics study in the Netherlands. From then on my main activity was the building of multilingual lexical knowledge bases, such as computational lexicons, machine translation dictionaries, term banks and thesauri.

From 1996 I worked as a Senior Research Associate/Research Fellow at the Natural Language Processing (NLP) Group in the Department of Computer Science at the University of Sheffield, where I received my PhD in the areas of computational linguistics and AI. The group works with GATE (General Architecture for Text Engineering), which is a framework for language engineering applications and supports efficient and robust text processing (http://www.gate.ac.uk). In Sheffield I specialised further in knowledge acquisition from text and the modelling of that knowledge into formal representations such as RDF and OWL. I participated in many projects and application fields such as fisheries, digital archiving and law.

NLP for legal applications is a growing area in which I have been engaged. Given the fact that legal texts mostly consist of unstructured text, NLP allows the automatic filtering of legal text fragments, the extraction of conceptual information and automated support for text interpretation through close reading. By using a combination of NLP and Semantic Web technologies such as XML and ontologies, novel methods can be developed to analyse the law, attempt conceptual modelling of legal domains and support automated reasoning. For instance, concerning case based reasoning, Adam Wyner and I applied natural language information extraction techniques to a sample body of cases, in order to automatically identify and annotate the relevant facts (or ‘case factors’) that shape legal judgments. Annotated case factors can then be extracted for further processing and interpretation.1

Another example of my activity in conceptual extraction and modelling is the creation of a semi-automatic methodology and application for identifying the Hohfeldian relation ‘Duty’ in legal text.2 Using the GATE tool for the automated extraction of Duty instances and their associated roles such as DutyBearer, the method provides an incremental knowledge base intended to support scholars in their interpretation.3

I also work on creating or transforming text representation structures. In a recent project with the Law School at the University of Birmingham my main task was to reformat legal judgments from national and EU legislation in 23 languages, for storage and querying purposes, into both the Open Corpus Workbench format (http://cwb.sourceforge.net/index.php) and an inline, TEI-compliant XML format.

Finally back to the present. My main interest is in using language technology to serve Digital Humanities (DH) scholarly research. Interpreting text involves the methodological application of NLP techniques and the formal modelling of the knowledge extracted within a collaborative setting involving expert scholars and language technicians.

Language technology should assist interpretative scholarly processes. Computational involvement in DH needs to ensure that humanities researchers – a considerable part of whom still remain to be convinced of the advantages of this digital revolution for their research – will embrace language technology. This will further researchers’ aims in textual interpretation, for instance in the selection of relevant text fragments, and in the creation of an integrated knowledge structure that makes semantic content explicit, and uniformly accessible.

The collaborative automatic and manual knowledge acquisition workflow is illustrated in the figure below.

[Figure: the collaborative automatic and manual knowledge acquisition workflow]

Within the DH space of LACR, I appreciate the philological building blocks that are being laid. The XML structure allows further exploration of the data through querying with XQuery. XML-based analysis tools (e.g. AntConc4 and GATE) can be used for analysis and for the future addition of knowledge about the content of the registers, for instance the formulaic nature of the legal language as revealed by n-grams, and the semantic impact of some regularly used patterns of words. For example, the Latin phrase ‘electi fuerunt’ (‘they were elected’) collocates in the text with entities such as persons, dates and offices, which fit into a conceptual framework about ‘election’.
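As a rough illustration of the n-gram idea, the sketch below counts the most frequent word pairs in a short string. It is a minimal sketch only: the sample text is made up of placeholders around the phrase ‘electi fuerunt’, not register content, and the function names `ngrams` and `top_ngrams` are hypothetical helpers rather than part of AntConc or GATE.

```python
from collections import Counter
import re


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(text, n=2, k=5):
    """Lower-case the text, split it into word tokens and count its n-grams."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(ngrams(tokens, n)).most_common(k)


# Placeholder sample (an assumption for illustration, not register text).
sample = (
    "electi fuerunt PERSON_A OFFICE_X "
    "electi fuerunt PERSON_B OFFICE_Y DATE_1 "
    "electi fuerunt PERSON_C"
)

# The most frequent bigram and its count.
print(top_ngrams(sample, n=2, k=1))
```

A recurring n-gram of this kind is a candidate formula. The words found next to it (here the placeholders for persons, offices and dates) are the collocates that a scholar would then inspect.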

Looking into the future, standard representations such as TEI-XML ensure that information can be added flexibly and incrementally as metadata for the purpose of scholarly corpus enrichment. Knowledge acquisition through named entity recognition, term extraction and textual pattern analysis will help build an incremental picture of the domain. This knowledge can then be formalised through knowledge representation languages such as RDF and OWL. That will serve to provide an ontological backbone to the extracted knowledge, and enable connections to Linked Data across the Web (http://linkeddata.org/).


  1. Wyner, A., and Peters, W. (2010), Lexical semantics and expert legal knowledge towards the identification of legal case factors, JURIX 2010. 
  2. For a description of Hohfeld’s legal relations see e.g. http://www.kentlaw.edu/perritt/blog/2007/12/hohfeldian-primer.html. 
  3. Peters, W. and Wyner, A. (2015), Extracting Hohfeldian Relations from Text, JURIX 2015. 
  4. http://www.laurenceanthony.net/software/antconc/ 

The Aberdeen Registers: A Student Perspective

By Finn O’Neill, final year LLB student at the University of Aberdeen

Over the past academic year, I have had the good fortune to volunteer with the project ‘Law in the Aberdeen Council Registers: Concepts, Practices, Geographies, 1398-1511’ (LACR), under the supervision of Dr Claire Hawes. As a Law student at Aberdeen with a keen interest in both private law and legal history, the chance to utilise some of the earliest court records in Scotland was an opportunity not to be missed. The LACR project is transcribing the medieval registers in full, and has created a prototype web tool which makes them searchable. This is an excellent tool for the curious, as the scope of material contained in the records is vast and the time period is reasonably extensive.

My engagement with the registers started because the project asked me to help to test the web tool, by formulating queries based on my own research. As much of a degree in Scots law involves tracing the origins of legal principles through history in an attempt to understand their modern developments, I was more than happy to help. I used the web tool to search for things I had been studying relating to succession, leases and Scottish legal history. For example, a search for the word “tak”, the Middle Scots word for lease, brings up over 300 results which include cases of disputes over leases and show the vibrant development of a fundamental part of Scots law.

One of my research points had been to try to find the earliest usage of the Leases Act of 1449 in Aberdeen. The scope of the act would technically have excluded burghs but at some point this must have fallen away as the act is applied across Scotland to this day. However, the registers are neither as specific nor as detailed as modern case reports and as such I did not have sufficient time to find this. Even so, the process of searching through these extraordinary records gave me a significantly better understanding of the topic, because I could see how these legal mechanisms were being used in practice in Aberdeen between the fourteenth and sixteenth centuries. Using the registers allowed me to access materials that few others have used, and find a perspective which had not otherwise been explored. For example I had a better understanding of the customary usage of leases in Scotland outwith the scope of the Leases Act 1449.

Another area of research that I engaged with using the records was the use of brieves in Aberdeen. Brieves are early court writs, forms of action which provided a mechanism for dispute resolution in Scotland. This is a particular passion of mine and was highly relevant to my coursework in Scottish Legal History and European Legal History, as both courses have considered the use of brieves in Scotland. As part of the project I had the wonderful opportunity to visit the Aberdeen archives in the Town House. This was a fantastic experience as we were able to see many charters relating to the burgh of Aberdeen first hand. My own favourite part of the trip was the opportunity to see a brieve of right which was sewn onto the 1317 Court Roll, and in doing so experience a piece of legal history that I had been reading about for more than half of my university career.

It should be noted that my use of the web tool was greatly enhanced by the fantastic team working on the project. Although I relished the challenge to read Middle Scots, having expert knowledge at hand made the whole process of searching and using the web tool much easier. In turn, I was able to provide the team with details of some of the legal processes that we encountered.

I would encourage anyone interested in the history of Scotland to make the LACR project a top priority in their research. The scope of the registers covers many fields of interest and I can say with confidence that you will be pleasantly surprised by what you find.

My thanks go to the LACR team for the opportunity to work with these most important and truly wonderful records.

 

Transcription volumes 1-7 completed!

by Edda Frankot

An important project milestone was reached last month when the last words of volumes 1-7 were transcribed on the afternoon of 18 January. The transcription of the first seven volumes, up to the year 1501, is now complete. Over the past eighteen months or so, the project’s two research assistants, Claire Hawes and William Hepburn, with a small amount of assistance from yours truly, have transcribed 4027 pages – no mean feat!

This does not mean, of course, that the project as a whole is now finished. The checking of the transcription and annotations is still in full flow. Once that is completed a final phase of getting the corpus ready to go online will commence. In the meantime, thanks to generous additional support from Aberdeen City Council to enhance the project, Claire and William have begun the transcription of volume 8. This volume will at least partly be transcribed traditionally, but there are also ongoing investigations into the possibility of having this book machine-transcribed for us by a project called READ. Watch this space for updates on that! Overall our final corpus will in part contain a level of annotation enhanced beyond our original specification.

Now that the transcription of volumes 1-7 is complete, it has been possible to do a word count. This count confirms our suspicions that volume 6 includes a relatively large amount of material, but also brings up some other fascinating facts. The total count as it stands now (this number will most likely change slightly during the final stages of the checking process) is 1,391,217 words. To put this in perspective: Shakespeare’s complete works total 884,421 words. A significant chunk of our nearly 1.4 million-word corpus (so far) is taken up by volume 6: 539,254 words (39%). By contrast, volume 7, which has 137 pages more than volume 6, contains ‘only’ 332,392 words (24%). On average, then, there are about 547 words on every page of volume 6, but only 296 on those of volume 7. The average across all volumes is about 345 words per page. The scribe of a large part of volume 6 used more of each page (he left only one of the margins blank, rather than both), placed his text lines closer together and appears to have written in a smaller hand. The volume with the lowest number of words per page is volume 2, at only 189. This results from the many blank spaces left between court entries, and from blank pages.
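The arithmetic behind these figures can be reproduced with a few lines of Python. This is a sketch under one assumption: that the 4027 transcribed pages mentioned earlier and the provisional word total cover the same material. The names `share` and `words_per_page` are hypothetical helpers.

```python
# Figures quoted in the post (provisional until checking is complete).
TOTAL_WORDS = 1_391_217
TOTAL_PAGES = 4_027  # assumption: same set of pages as the word total
VOLUME_WORDS = {6: 539_254, 7: 332_392}


def share(part, whole):
    """Percentage of `whole` made up by `part`."""
    return 100 * part / whole


def words_per_page(words, pages):
    """Mean number of words per page."""
    return words / pages


for volume, words in VOLUME_WORDS.items():
    print(f"volume {volume}: {share(words, TOTAL_WORDS):.0f}% of the corpus")

print(f"overall: {words_per_page(TOTAL_WORDS, TOTAL_PAGES):.0f} words per page")
```

The page counts of the individual volumes are not given in the post, so the per-volume densities are not recomputed here.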

Above: An illustration of different page word densities and lay-outs: ACR, 6, p. 752 (left) and ACR, 7, p. 508 (right).

LACR Corpus Word Count

It has also been possible to differentiate between words in Latin and in Scots (and those from entries in ‘multiple languages’, that is to say entries with a lot of switches between Latin and Scots, which typically occur in lists of names). Overall 58% of the corpus is in Latin, 41.1% is in Scots and 0.9% in multiple languages. Two entries are in Dutch. In volumes 1 and 2 (1398-1414) only slightly more than 1% of the words are in Scots. In volume 4 (1433-1447) this rises to nearly 9%. By volume 6 (1468-1486) the division between the two languages is almost exactly 50-50, whereas in volume 7 (1487-1501) more than 68% is in Scots. Much more detailed research into this phenomenon is of course being undertaken by our former text enrichment research fellow, Anna Havinga. Anna not only distinguishes between words and entries in Scots and Latin, but also analyses the development of the language shift by year. But even the very coarse overview given here already throws up some fascinating first indications which future research will hopefully be able to elaborate upon.

Digital Humanities – What’s the fuss about?

by Anna D. Havinga

“Digital Humanities” (DH) has become a vogue word in academia in the last few decades. DH centres have been set up, DH workshops and summer schools are held regularly all over the world, and the number of DH projects is increasing rapidly. But what is all the fuss about?

 

What is Digital Humanities?

There are numerous articles that discuss what DH is and is not. It is generally agreed that just posting texts or pictures on the internet or using digital tools for research does not qualify as DH.1 There are, however, few works that give a concise definition of DH. Kirschenbaum quotes a definition from Wikipedia, which he describes as a working definition that “serves as well as any”.2 In my view, the definition for DH on Wikipedia3 has even improved since 2013, when Kirschenbaum’s article was published. I believe it now captures the essence of DH more accurately:

[…] [A] distinctive feature of DH is its cultivation of a two-way relationship between the humanities and the digital: the field both employs technology in the pursuit of humanities research and subjects technology to humanistic questioning and interrogation, often simultaneously. Historically, the digital humanities developed out of humanities computing, and has become associated with other fields, such as humanistic computing, social computing, and media studies. In concrete terms, the digital humanities embraces a variety of topics, from curating online collections of primary sources (primarily textual) to the data mining of large cultural data sets to the development of maker labs. Digital humanities incorporates both digitized (remediated) and born-digital materials [i.e. materials that originate in digital form, ADH] and combines the methodologies from traditional humanities disciplines (such as history, philosophy, linguistics, literature, art, archaeology, music, and cultural studies) and social sciences, with tools provided by computing (such as Hypertext, Hypermedia, data visualisation, information retrieval, data mining, statistics, text mining, digital mapping), and digital publishing. (https://en.wikipedia.org/wiki/Digital_humanities)

Our Law in Aberdeen Council Registers project can serve as a prime example of a DH project: we create digital transcriptions of the Aberdeen Burgh Records (1398–1511) with the help of computing tools. This means that we type the original handwritten text into a software programme in a format that can be understood by computers. More specifically, we use the oXygen XML editor with the add-on HisTEI to create transcriptions that are compliant with the Text Encoding Initiative (TEI) guidelines (version P5).4 In this way, we produce a machine-readable and machine-searchable text.5 But what benefits does this have? Why do we go through all this effort when the pictures of the Aberdeen Burgh Records are already available online?6

 

What are the benefits of a digital, transcribed version of a text?

Apart from the obvious benefit of a digital, transcribed version of text being much easier to read than the original handwriting, it allows for information to be added to the text. With the help of so-called ‘tags’, a text can be enriched with all kinds of structural annotations and metadata. Tagging here means adding XML annotations to the text. For example, the textual passages in the Aberdeen Burgh Registers, which are mainly written in Latin or Middle Scots, can be marked up as such, using the ‘xml:lang’ attribute. A researcher who is interested in the use of Middle Scots in these registers could then search for and find all Middle Scots sections in the corpus very easily with the help of a text analysis tool such as AntConc or SketchEngine without having to plough through the sections written in Latin. More generally, enriching the text with tags means that a researcher does not have to read through all of the over 5,000 pages of the Aberdeen Council Registers that we will transcribe in order to find what s/he is looking for. A machine-readable and machine-searchable text not only saves time when researching a particular topic but is also generally more flexible than a printed version of text, as further tags can be added and unwanted tags can be hidden. Furthermore, a digital text allows us to ask different questions of a text corpus. It is those possible questions plus a variety of other issues that have to be considered before embarking on a DH project.
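To give a flavour of what such a language-based search might look like, here is a minimal Python sketch using only the standard library. The fragment is hypothetical and heavily simplified: the element layout, the placeholder passages and the language codes ‘la’ (Latin) and ‘sco’ (Scots) are assumptions for illustration, not the project’s actual encoding.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical fragment; structure and language codes are assumptions.
SAMPLE = f"""
<TEI xmlns="{TEI_NS}">
  <text><body>
    <div xml:lang="la"><p>placeholder for a passage in Latin</p></div>
    <div xml:lang="sco"><p>placeholder for a passage in Scots</p></div>
  </body></text>
</TEI>
"""


def passages_in(root, lang):
    """Return the whitespace-normalised text of every element tagged with `lang`."""
    return [
        " ".join("".join(el.itertext()).split())
        for el in root.iter()
        if el.get(XML_LANG) == lang
    ]


root = ET.fromstring(SAMPLE)
print(passages_in(root, "sco"))
```

A dedicated tool such as AntConc or SketchEngine offers far more than this, but the principle is the same: the search is driven by the annotation rather than by reading every page.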


Transcription of volume 7 of the Aberdeen Council Registers (p. 60), annotated with XML tags

 

What has to be considered when setting up a DH project?

There are several major questions that have to be considered before starting a DH project of the sort we are carrying out: What is it that you want to get from the material you work on? Who else will be using it? In what way will it be used? Which research questions could be asked? Information on the possible users of the born-digital material is essential in order to decide which information should be marked up in the corpus of text. This is, of course, also a matter of time (and money), since adding information to the original text in the form of tags takes time. The balance between time and enrichment has to be determined for each individual DH project. In our project we decided to go through different stages of annotation – starting with basic annotations (e.g. expansions, languages) and adding further tags later (e.g. names, places etc.). Also, users will be able to add further annotations that may be specific to their research projects. Beyond these considerations, choices about software and hardware, tools, platforms, web development, infrastructure, server environment, interface design etc. have to be made before embarking on the DH project. Anything that is not determined at the beginning of the project may lead to considerable effort at a later stage.

It is certainly worth going through all this effort. To us it is clear why DH has become such a big thing. It eases research, extends the toolkits of traditional scholarship, and opens up material to a wider audience of users.7 With tags we can enrich the content of texts by adding information, which can then change the nature of humanities inquiry. DH projects are by nature about networking and collaboration between different disciplines, which is certainly the way forward in the humanities.

 

 


  1. Anne Burdick et al. 2012. Digital_Humanities. Cambridge, MA: MIT Press, p. 122. 
  2. Matthew G. Kirschenbaum. 2013. ‘What Is Digital Humanities and What’s It Doing in English Departments?’ In: Melissa Terras, Julianne Nyhan, Edward Vanhoutte (eds), Defining Digital Humanities. A Reader. Farnham: Ashgate, 195-204, p. 197. 
  3. https://en.wikipedia.org/wiki/Digital_humanities [accessed 19.07.2016] 
  4. http://www.tei-c.org/Guidelines/P5/ [accessed 25.07.2016] 
  5. In further blog posts, we will explain in more detail how we do this. 
  6. http://www.scotlandsplaces.gov.uk/digital-volumes/burgh-records/aberdeen-burgh-registers/ [accessed 25.07.2016] 
  7. Anne Burdick et al. 2012, p. 8. 

From script to text

by Jackson Armstrong

One of the most cryptic and alluring aspects of the pages of the Aberdeen council registers is the handwriting which appears in them. To most people this script is not remotely decipherable.

Patterns of handwriting change over time. The study of these changes is known as palaeography. An excellent public resource may be found at the Scottish handwriting website.

Even in 1591 the town clerk of Aberdeen reported his bafflement at the handwriting of the fourteenth century. That year the clerk, Master Thomas Mollisone, who was preparing an inventory of extant registers and bailie court books, found no volumes from earlier than 1380. However, he noted that ‘Befoir this, scrowis [scrolls] on parchment’ written in Latin ‘and for ilk year ane skrow’, survived. In his assessment they were ‘evil to be red, be resoun of the antiquitie of the wreit and the forme of the letter or character … which is not now usit’ and that ‘skairslie gif ony man can reid the samyn’.1


The only extant burgh court roll, from 1317, kept at the Aberdeen City and Aberdeenshire Archives.

We think of the contents of the eight register volumes from 1398–1511 as a corpus of text. But what is ‘the text’? The task of our project is to take the handwritten script in the registers and render it as machine-readable text. I think it is useful to pause and consider the difference between what we think of as ‘the text’ and the handwritten script. The scripts in these registers use a set of character symbols with standard abbreviations and also special letter forms. It is helpful to think of this as a form of shorthand writing, or even encryption, which it is our job to decipher. A later and more extreme version of such abbreviation, or shorthand, is that which was devised by Thomas Shelton, and used by Samuel Pepys in writing his well-known diary. The ‘text’ of Pepys’s diary entries (what might be described as their meaningful content) is not necessarily the same as the writing on the page. The editors of Pepys’s diaries had the difficult task of extracting the ‘text’ from the diarist’s shorthand (for an example, see this image of a page of his diary). Similarly, a difference can be noted between the text of our material, and the handwriting which various scribes used to symbolise that meaningful content. It is our interpretation of the handwritten script which produces ‘the text’.

This brings us to the nature of the transcription we produce by rendering the script into text. A diplomatic transcription aims to reproduce everything as it is, for instance giving wᵗ for wᵗ. By contrast, a semi-diplomatic transcription includes the full expansion, in this case expanding wᵗ to ‘with’. It may even be possible to represent the set of symbols used for the original script with high fidelity, producing what is in effect a facsimile. For instance, a form of typeface called ‘record type’ was invented in the late eighteenth century to reproduce medieval abbreviations. That would be a tremendously cumbersome process and it would not help in moving from script to text. In addition, record type and full diplomatic transcription were invented before the benefit of modern photography. Digital images of the original pages now provide a perfect facsimile, and as a result a diplomatic transcription is no longer necessary.

Our task is not to create a facsimile of handwriting, but to represent the text as consistently and accurately as we can. To this end we aim to produce a text which may be displayed either as a semi-diplomatic transcription, or a semi-normalised transcription. The latter allows for fuller intervention by the transcriber, regularising and smoothing out features like variant letter forms, punctuation, capitalisation, and so on. In the former case, the expansion of abbreviations is assisted by the fact that these were standardised to a large degree. Reference works are available to assist transcribers with the identification and expansion of abbreviated forms.

ACR 4, p. 7, entry 2


semi-diplomatic transcription: Eodem die Johannes mercer’ adiudicatur in amerciamento curie pro iniusta de perturbacione ade de benyn vicini sui. Et dictus adam in amerciamento pro perturbacione predicti Johannis mercer’ et dictus Johannes mercer’ dedit Johannem vokate patrem plegium legalem quod dictus adam erit indempnis de ipso et perturacione sua aliter ipse per viam iuris Et modo consimili dictus adam dedit Ricardum de Ruthirfurd plegium legalem quod Johannes mercer’ erit indempnis et cetera.

semi-normalised transcription: Eodem die Johannes Mercer’ adiudicatur in amerciamento curie pro iniusta de perturbacione Ade de Benyn vicini sui. Et dictus Adam in amerciamento pro perturbacione predicti Johannis Mercer’ et dictus Johannes Mercer’ dedit Johannem Vokate patrem plegium legalem quod dictus Adam erit indempnis de ipso et perturbacione sua aliter ipse per viam juris. Et modo consimili dictus Adam dedit Ricardum de Ruthirfurd plegium legalem quod Johannes Mercer’ erit indempnis et cetera.

In our project we are not doing all this for paper and ink, but electronically, and through Text Encoding Initiative (TEI) annotation. A useful essay on this process is by M J Driscoll, on ‘Electronic Textual Editing: Levels of transcription’.2 Thus the text we produce is not just typed out as ‘flat’ strings of characters and words (as it would be in an edition printed on paper), but it is encoded following TEI standards. This allows the text to be augmented with annotations concerning the structure of the text, features of the transcription and textual meaning. These annotations, for example, enable us to indicate where we have made an expansion by supplying in full the information represented by the abbreviation in the original script.


TEI-annotated transcription of ACR 4, p. 7, entry 2.
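One way to see how an encoded expansion lets the same source be displayed at different levels of transcription is the sketch below. It assumes, purely for illustration, that an abbreviation is recorded with the TEI P5 elements `<choice>`, `<abbr>` and `<expan>`; whether the project encodes expansions in exactly this way is not stated here. The sample sentence uses placeholder words, the TEI namespace is omitted for brevity, and `render` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the abbreviation and its expansion sit side by side.
SAMPLE = (
    "<p>alpha <choice><abbr>wt</abbr><expan>with</expan></choice> beta</p>"
)


def render(element, prefer):
    """Flatten `element` to text, keeping only the `prefer` child of each <choice>."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == "choice":
            chosen = child.find(prefer)
            if chosen is not None:
                parts.append(render(chosen, prefer))
        else:
            parts.append(render(child, prefer))
        # Text following the child element belongs to the parent's content.
        parts.append(child.tail or "")
    return "".join(parts)


p = ET.fromstring(SAMPLE)
print(render(p, "abbr"))   # abbreviation kept as written
print(render(p, "expan"))  # abbreviation expanded
```

Because both readings are stored in the encoding, the choice between them becomes a display decision and nothing is lost at the point of transcription.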

However, a pertinent question is whether a silent expansion of a standard abbreviation may still be a consistent and accurate representation of meaningful content when moving from original script to electronic text. Indeed, the choices made by the transcription team as to how to interpret particular characters in the script rely on a process of judgement, partly based on interpretation of context. Cumulatively, those judgements will result in the transcribed text. In all this they follow a process which enables cross-checking to ensure a high degree of inter-transcriber agreement and consistency, both on the transcribed corpus of text, and on the annotations made to augment that corpus.

The process of moving from original script to electronic text is fundamental to our work. It presents its own challenges and choices which take a range of skills to address, and to ensure the final product is robust and reliable.


  1. John Stuart, ed., The Miscellany of the Spalding Club (1841-52), v, p. 9. The use of u/v has been changed to aid readability. 
  2. http://www.tei-c.org/About/Archive_new/ETE/Preview/driscoll.xml