Let's Talk about Palm Leaves - From Minimal Data to Text Understanding
A tutorial by Magnus Bender, Marcel Gehrke, and Tanya Braun at KI 2023
In recent years, large language models have greatly improved the state of the art in text understanding. However, large language models are often computationally expensive and work best in areas with huge amounts of training data. Unfortunately, there are areas where little data is available. In digital humanities, for example, researchers investigate poems written on palm leaves in old Tamil. They only have a few hundred or maybe a thousand poems (documents). In such a setting, using a general pre-trained large language model (no such model exists for old Tamil) and further training it by subsampling from the corpus reaches its limits, given the little data available. Nonetheless, support with text understanding or information retrieval is also of great value for these researchers.
Therefore, in this tutorial, we give an overview of how different tasks can be performed with only minimal data available. We will use examples from the field of digital humanities to illustrate particular challenges. Among these examples, we will look at the above-mentioned poems on palm leaves, which include in-line annotations that are not easy to distinguish from the actual poem if one does not know the poem. Another example are critical editions, where scholars combine many poems, transcriptions, translations, their annotations or comments, and a dictionary. When such editions are merged, the challenge lies in identifying the parts of one edition that are extensions to or revisions of another critical edition. During our journey, we touch upon long-standing concepts such as topic modelling and hidden Markov models and show how they still help in text understanding with minimal data. Further, we show how these approaches perform compared with large language models in areas with minimal data.
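To make the in-line annotation example concrete: a two-state hidden Markov model can tag each token of a transcribed poem as either original poem text or an in-line annotation, decoded with the Viterbi algorithm. The following is a minimal sketch in plain Python; the states, vocabulary, and all probabilities are invented for illustration and are not taken from the actual Tamil corpus.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for obs (log-space DP)."""
    floor = 1e-9  # probability floor for out-of-vocabulary tokens
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], floor))
          for s in states}]
    paths = {s: [s] for s in states}
    for t in range(1, len(obs)):
        V.append({})
        new_paths = {}
        for s in states:
            # Best predecessor state for s at position t.
            score, prev = max(
                (V[t - 1][p] + math.log(trans_p[p][s])
                 + math.log(emit_p[s].get(obs[t], floor)), p)
                for p in states)
            V[t][s] = score
            new_paths[s] = paths[prev] + [s]
        paths = new_paths
    best = max(states, key=lambda s: V[-1][s])
    return paths[best]

# Invented toy model: "poem" = original verse, "note" = in-line annotation.
states = ["poem", "note"]
start_p = {"poem": 0.8, "note": 0.2}
trans_p = {"poem": {"poem": 0.9, "note": 0.1},
           "note": {"poem": 0.4, "note": 0.6}}
emit_p = {"poem": {"rain": 0.3, "lotus": 0.3, "heart": 0.3,
                   "gloss": 0.05, "meaning": 0.05},
          "note": {"gloss": 0.4, "meaning": 0.4, "rain": 0.1,
                   "lotus": 0.05, "heart": 0.05}}

tokens = ["rain", "lotus", "gloss", "meaning", "heart"]
print(viterbi(tokens, states, start_p, trans_p, emit_p))
# → ['poem', 'poem', 'note', 'note', 'poem']
```

Because the model has only a handful of parameters, it can be estimated from a few hundred documents, which is exactly the minimal-data regime discussed above.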
Target Audience, Prerequisite Knowledge, and Learning Goals
The tutorial will be mostly self-contained. While we assume familiarity with concepts such as topic modelling, we will revisit all necessary definitions. The tutorial is therefore potentially interesting for all researchers interested in text understanding, which includes AI researchers but also researchers from other fields such as digital humanities.
In reference to the call, we talk about AI and digital humanities as well as machine learning and related methods, which are main topics of the KI 2023 call for papers. The goal of this tutorial is two-fold:
- to provide an overview of recent developments in text understanding with minimal data, with a focus on digital humanities as an application area, and
- to discuss new directions for investigation.
Further, this tutorial nicely complements the Workshop on Humanities-Centred AI, which took place at the last two iterations of the KI conference.
Agenda (including presentation material when ready)
- Introduction to Semantic Systems
- Supervised Learning
- Identifying different types of documents
- Inline annotations
- Transition to Unsupervised Learning
- Adding relations to a model
- Unsupervised model learning
- Conditional Knowledge Compilation
Presentation material to follow.

Literature
- Magnus Bender, Tanya Braun, Marcel Gehrke, Felix Kuhr, Ralf Möller, and Simon Schiff. Identifying and Translating Subjective Content Descriptions among Texts. International Journal of Semantic Computing, 15(4):461–485, 2021.
- Magnus Bender, Felix Kuhr, Tanya Braun, and Ralf Möller. Estimating Context-Specific Subjective Content Descriptions using BERT. In 16th IEEE International Conference on Semantic Computing, (ICSC 2022), Virtual, January 26-28, pages 171–172. IEEE, 2022.
- Magnus Bender, Felix Kuhr, and Tanya Braun. To extend or not to extend? Enriching a Corpus with Complementary and Related Documents. International Journal of Semantic Computing, 16(4):521–545, 2022.
- Magnus Bender, Tanya Braun, Ralf Möller, and Marcel Gehrke. Unsupervised Estimation of Subjective Content Descriptions. In 17th IEEE International Conference on Semantic Computing, (ICSC 2023), February 1-3. IEEE, 2023.
- David M. Blei and Michael I. Jordan. Modelling Annotated Data. In 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 127–134. ACM, 2003.
- David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
- Felix Kuhr, Tanya Braun, Magnus Bender, and Ralf Möller. To Extend or not to Extend? Context-specific Corpus Enrichment. In Proceedings of AI 2019: Advances in Artificial Intelligence, volume 11919 of Lecture Notes in Computer Science, pages 357–368. Springer, 2019.