The goal of this CLARIN-D curation project is to use the ongoing compilation of the ICE-Scotland corpus to test and improve existing CLARIN-D resources, to document the process of enriching and integrating speech corpora in a best-practice guideline, and to make richly annotated speech data available. The corpus resource to be integrated is part of the International Corpus of English (ICE). Work on ICE began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Each ICE corpus consists of one million words of spoken and written English. For most participating countries, the ICE project is stimulating the first systematic investigation of the national variety. To ensure compatibility among the component corpora, each team follows a common corpus design as well as a common scheme for grammatical annotation.
Among other ICE components, the ICE-Scotland corpus is currently being compiled at the University of Münster. Building on this work in progress, the aims of this curation project are:
- to create a best-practice guideline for the creation and CLARIN-D integration of phonetically rich corpora
- to test, use, evaluate and improve WebMAUS as an existing CLARIN resource
- to make (parts of) this corpus available with the help of the CLARIN infrastructure
The existing speech data of this corpus is to be transcribed on the phoneme level.
Another aim of the project is to document this transcription work, along with the CLARIN-D integration process, in a fine-grained way. The annotation work will be aided by the CLARIN-D resource WebMAUS, which can dramatically reduce the workload of manual labelling by force-aligning transcriptions, and which has become a de facto standard in the German scientific community working with speech data. However, WebMAUS is known to have issues that result in errors, probably systematic ones, so the automatically aligned data will have to be reviewed manually. An additional goal of the proposed curation project is the evaluation of WebMAUS through a systematic comparison of manually and automatically annotated data.
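One way such a comparison of manual and automatic annotations could be set up is sketched below. This is a minimal illustration, not the project's actual evaluation procedure. It assumes that both annotation tiers have already been read into lists of (label, start, end) tuples with times in seconds, that both tiers contain the same segment sequence, and that a 20 ms tolerance is used as the agreement threshold. The function names, the tuple layout, and the tolerance value are all assumptions made for this example.

```python
from statistics import mean, median


def boundary_deviations(manual, automatic):
    """Return signed start-boundary deviations (automatic minus manual) in seconds.

    Both arguments are lists of (label, start, end) tuples. The tuple layout is
    an assumption for this sketch; segments are paired by position.
    """
    if len(manual) != len(automatic):
        raise ValueError("tiers must contain the same number of segments")
    deviations = []
    for (m_label, m_start, _), (a_label, a_start, _) in zip(manual, automatic):
        if m_label != a_label:
            raise ValueError(f"label mismatch: {m_label!r} vs {a_label!r}")
        deviations.append(a_start - m_start)
    return deviations


def summarize(deviations, tolerance=0.020):
    """Summarize deviations; the 20 ms default tolerance is an assumed threshold."""
    absolute = [abs(d) for d in deviations]
    within = sum(1 for d in absolute if d <= tolerance)
    return {
        # A mean signed deviation far from zero hints at a systematic shift.
        "mean_signed": mean(deviations),
        "median_abs": median(absolute),
        "share_within_tolerance": within / len(deviations),
    }


if __name__ == "__main__":
    # Invented toy data, not taken from the corpus.
    manual_tier = [("s", 0.00, 0.10), ("k", 0.10, 0.25), ("O", 0.25, 0.40), ("t", 0.40, 0.50)]
    automatic_tier = [("s", 0.00, 0.11), ("k", 0.11, 0.30), ("O", 0.30, 0.41), ("t", 0.41, 0.50)]
    print(summarize(boundary_deviations(manual_tier, automatic_tier)))
```

Reporting the signed mean alongside the absolute median is a deliberate choice in this sketch: the former can expose a consistent early or late placement of boundaries, while the latter describes the typical size of the disagreement regardless of direction.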
We therefore hope to propose a best-practice guideline for phone-level transcriptions of corpora, together with fine-grained knowledge about the capabilities and limitations of WebMAUS, possibly also inspiring its future improvement. In addition, Scottish English will be made available as a new language variety for WebMAUS. Finally, parts of the corpus, consisting of speech data, orthographic transcriptions and phone-level annotations, including the uncorrected WebMAUS tier, will be made available online in the CLARIN-D network. Publication will be limited to a part of the corpus because of 1) its large size and 2) the fact that the necessary transcriptions and annotations cannot be expected to be completed in the course of the project. Work on the corpus annotations has already begun and will continue after the project ends. The data completed within the timeframe of the curation project will be made available within CLARIN-D.