Teaching a machine how to read: Text2Tech at DHOxSS

Thomas Batelaan was awarded a bursary to attend the Digital Humanities Oxford Summer School in 2024. To join the mailing list and learn about the next summer school, sign up here. Read about Thomas's experience at the summer school here:

I was very glad to hear that I had been lucky enough to receive a scholarship to attend the Digital Humanities at Oxford Summer School, after applying through my studentship at the Open-Oxford-Cambridge Doctoral Training Programme with the AHRC.

My doctoral project is about the impact of technology on the reception of the music of J.S. Bach. My approach, then, is a kind of reverse of Digital Humanities: instead of studying the cultural through a digital lens, I study the digital through a cultural lens. Perhaps because of that strange relationship, I’ve always wanted to explore digital humanities and its methodologies.

I didn’t have much practical experience with programming beforehand, so I chose a course geared towards getting to grips with the programming language Python and basic methodologies for analysing texts: Text2Tech, taught by the marvellous Kaspar von Beelen, Mariona Coll Ardanuy and Federico Nanni. Especially impressive was the way they immediately made the case for learning to ‘speak’ Python, in an age in which generative models seem to be able to replace humans in writing code.

Python as a craft
The instruction was quite hands-on, approaching Python very much as a craft. Through interactive ‘notebooks’ of code in Google Colab, it was easy to see how small syntactic changes made code run or break. Later on, it was wonderful to learn how to write compact, logical code that accomplished simple tasks, such as saving pieces of information in various types of databases.

Digging deeper, the course dealt with the central question: how do we approach a text digitally? For Python, the instructors explained, a text is nothing more than a string of characters; a computer does not have a concept of meaning. The task of the digital humanist is to square the humanist concept of meaning with this fundamental computational principle, and that is the work of natural language processing (NLP). In a sense, you are teaching the machine how to read.

This is done by first converting a string of text into linguistic units, a step known as preprocessing. For example, one might reduce the various inflected forms of a verb to a single dictionary form before storing this in a database. Only then will the computer see that ‘work’, ‘worked’, and ‘working’ are related. Through the use of so-called word embeddings, a model like Word2Vec can also recognize that words like ‘coffee’, ‘cappuccino’, and ‘espresso’ are used in similar semantic and syntactic contexts.

Using such a model, we can start to notice how a text uses language to construct reality, using certain words in certain contexts. I found it interesting how these patterns would reveal biases and historical trends in the corpus. Towards the end of the course, the instructors showed us how to apply the latest machine learning techniques to NLP.
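
The steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the course's own material: the example sentences, the `LEMMAS` lookup table, and the helper functions `context_vector` and `cosine` are all hypothetical. The lookup table stands in for a real lemmatiser, and simple co-occurrence counts stand in for a trained model like Word2Vec, which learns dense vectors instead.

```python
import math
import re
from collections import Counter

# To Python, a text is just a sequence of characters.
text = "She works in Oxford. He worked in Leipzig. They are working on Bach."

# Preprocessing, step 1: split the string into lowercase word tokens.
tokens = re.findall(r"[a-z]+", text.lower())

# Preprocessing, step 2: map inflected forms to a dictionary form.
# This tiny table is a hypothetical stand-in for a real lemmatiser.
LEMMAS = {"works": "work", "worked": "work", "working": "work", "are": "be"}
lemmas = [LEMMAS.get(token, token) for token in tokens]

# The three verb forms now collapse into a single entry.
counts = Counter(lemmas)
print(counts["work"])  # 3

# A toy version of the idea behind word embeddings: describe each word
# by the words that appear around it (assumed example sentences).
sentences = [
    "i drink coffee every morning",
    "i drink espresso every morning",
    "i read bach scores every evening",
]


def context_vector(word, sentences, window=2):
    """Count the words found within `window` positions of `word`."""
    contexts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i, current in enumerate(words):
            if current == word:
                before = words[max(0, i - window):i]
                after = words[i + 1:i + 1 + window]
                contexts.update(before + after)
    return contexts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


coffee = context_vector("coffee", sentences)
espresso = context_vector("espresso", sentences)
bach = context_vector("bach", sentences)

# Words used in similar contexts receive a higher similarity score.
print(cosine(coffee, espresso) > cosine(coffee, bach))
```

In this toy corpus ‘coffee’ and ‘espresso’ share all of their neighbouring words, so they score as more similar to each other than either does to ‘bach’. A real model applies the same intuition to a far larger corpus.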

Oxford
Quite apart from the high-intensity course, I loved getting to know my coursemates. I made friends quickly and walked around the beautiful city with them, exploring the impressive college buildings and libraries and having a pint at a local pub. I also enjoyed seeing my peers’ research at the poster exhibition in the stunning Weston Library.

While I’d ventured into this Summer School as an interesting reverse of my own approach, I became conscious that in future work, a digital approach might prove very useful. My interests lie in the reception of music, which entails going through many documents of criticism and distilling certain discursive trends. I realized that a tool that could help me notice trends I had not thought of might prove a useful addition to my scholarly arsenal. I’d be very interested in joining the Summer School again on a more advanced course in the future.