---
title: "News from the International Comparable Corpus"
subtitle: "First launch of ICC written"
date: "`r Sys.Date()`"
author:
  - name: Marc Kupietz
    affil: 1
  - name: Adrien Barbaresi
    affil: 2
  - name: Anna Čermáková
    affil: 3
  - name: Małgorzata Czachor
    affil: 4
  - name: Nils Diewald
    affil: 1
  - name: Jarle Ebeling
    affil: 5
  - name: Rafał L. Górski
    affil: 4
  - name: John Kirk
    affil: 6
  - name: Michal Křen
    affil: 3
  - name: Harald Lüngen
    affil: 1
  - name: Eliza Margaretha
    affil: 1
  - name: Signe Oksefjell Ebeling
    affil: 5
  - name: Mícheál Ó Meachair
    affil: 7
  - name: Ines Pisetta
    affil: 1
  - name: Elaine Uí Dhonnchadha
    affil: 8
  - name: Friedemann Vogel
    affil: 9
  - name: Rebecca Wilm
    affil: 1
  - name: Jiajin Xu
    affil: 10
  - name: Rameela Yaddehige
    affil: 1
affiliation:
  - num: 1
    address: IDS Mannheim
  - num: 2
    address: BBAW Berlin
  - num: 3
    address: Charles University
  - num: 4
    address: Polish Academy of Sciences
  - num: 5
    address: University of Oslo
  - num: 6
    address: University of Vienna
  - num: 7
    address: Dublin City University
  - num: 8
    address: Trinity College Dublin
  - num: 9
    address: University of Siegen
  - num: 10
    address: Beijing Foreign Studies University
logoleft_name: "../Figures/ICC_COL.svg"
author_textsize: "32pt"
contact:
  email: icc@ids-mannheim.de
  website: https://www.ids-mannheim.de/digspra/kl
qrlink: >
  `r posterdown::qrlink("https://korap.ids-mannheim.de/instance/icc", "icc-logo-whitebg.svg")`
output:
  posterdown::posterdown_ids:
    self_contained: false
    keep_md: true
bibliography: ../tex/references.bib
csl: "https://raw.githubusercontent.com/ICLC-10/Zotero/master/styles/ICLC-10.csl"
---
```{r setup, include=FALSE, echo=FALSE, warning=FALSE}
knitr::opts_chunk$set(dev = 'svg', echo = FALSE, warning = FALSE)
source("common.R")
```
# ICC aims & characteristics
* make available comparable corpora of many languages for contrastive linguistic research [@kirk_ice_2017]
* mostly based on existing corpora
* ICC has a pre-defined balanced composition
    * based on that of the ICE [@greenbaum_comparing_1996]
# Current launch of ICC written
* written parts for Chinese, Czech, English, German, Irish (partly), Norwegian publicly available
* partially including UDPipe 2.0 annotations [@straka_udpipe_2018]
* usable via Corpus Workbench or KorAP [@diewald_korap_2016]
```{r korap-query, fig.cap="KorAP UI for ICC-GER and ICC-NOR, showing annotation queries and layers, as well as a virtual corpus definition, based on ICC genre and publication date metadata."}
knitr::include_graphics("korap_query_ger-nor.svg")
```
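The virtual corpus definitions mentioned in the caption are plain query strings over metadata fields. A minimal sketch of how such a string could be composed in R, assuming the `iccGenre` and `pubDate` fields used elsewhere in this document can be combined with `&`; the helper name `icc_vc()` and the genre value are hypothetical:

```r
# Hypothetical helper (not part of RKorAPClient): build a virtual corpus
# definition string from an ICC genre and a publication year.
icc_vc <- function(genre, year) {
  paste0("iccGenre=", genre, " & pubDate in ", year)
}

# "Example_Genre" is a placeholder, not an actual ICC genre label.
icc_vc("Example_Genre", 2005)
# [1] "iccGenre=Example_Genre & pubDate in 2005"
```

A string like this could then be passed as the `vc` argument of the RKorAPClient functions used for the figures in this document.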
## Composition of the ICC parts
```{r composition-by-genre, fig.cap="Actual composition of selected ICC parts with respect to ICC domain. (For the other ICC parts, the ICC genre metadatum was not yet accessible via the API at the editorial deadline.)", message = FALSE, fig.width=14, fig.height=10, out.width = "100%"}
# Token counts per ICC part and genre; `icc`, `genre` and `icc_con()` are expected to come from common.R
icc_genre <- icc %>%
  expand_grid(genre) %>%
  mutate(vc = paste0("iccGenre=", genre)) %>%
  rowwise() %>%
  mutate(tokens = corpusStats(icc_con(lang, token), vc = vc)@tokens)

# Stacked bars per language, labelled with the token count of each non-empty genre
icc_genre %>%
  ggplot(aes(x = lang, fill = genre, y = tokens)) +
  geom_col() +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size = 24) +
  theme(
    axis.title.x = element_text(size = rel(1.5), face = "bold"),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
    axis.text = element_text(size = rel(0.70)),
    legend.title = element_text(size = rel(0.85), face = "bold"),
    legend.text = element_text(size = rel(1))
  ) +
  scale_fill_ids() +
  geom_text(
    aes(label = if_else(tokens > 0, as.character(tokens), ""), y = tokens),
    position = position_stack(reverse = FALSE, vjust = 0.5),
    color = "black", size = 6.2, family = "Fira Sans Condensed"
  )
```
```{r composition-by-pubdate, fig.cap="Composition of the selected ICC parts with respect to year of publication.", message=F, warning=F, fig.width=14, fig.height=3, out.width = "80%"}
# Token counts per ICC part and publication year
year <- 1986:2023
icc_year <- icc %>%
  expand_grid(year) %>%
  mutate(vc = paste0("pubDate in ", year)) %>%
  rowwise() %>%
  mutate(tokens = corpusStats(icc_con(lang, token), vc = vc)@tokens)

# Loess-smoothed area per language over the year of publication
icc_year %>%
  ggplot(aes(x = year, fill = lang, color = lang, y = tokens)) +
  # geom_smooth(se=F, span=0.25) +
  xlim(1990, 2023) +
  ylim(0, NA) +
  stat_smooth(geom = "area", method = "loess", span = 1/4, alpha = 0.1) +
  # geom_area(alpha=0.1, position = "identity") +
  scale_fill_ids() +
  scale_colour_ids() +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size = 24) +
  theme(
    axis.title.x = element_text(size = rel(1.5), face = "bold"),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
    axis.text = element_text(size = rel(1)),
    legend.title = element_text(size = rel(1), face = "bold"),
    legend.text = element_text(size = rel(1))
  )
```
# Pilot study
* identification of light verb constructions (LVC) with *take* in English, and corresponding lemmas in German and Norwegian
* in order to explore the limitations imposed by the small corpus sizes
* using RKorAPClient [@kupietz_rkorapclient_2020] to access the corpora and get reproducible results for the analyses
```{r take-icc-code, results='hide', echo=TRUE}
library(RKorAPClient)
new("KorAPConnection",
    KorAPUrl = "https://korap.ids-mannheim.de/instance/icc/eng",
    accessToken = Sys.getenv("KORAP_ICC_TOKEN_eng")) %>%
  collocationAnalysis(
    "focus({[ud/l=take]} [ud/p=NOUN])",
    leftContextSize = 0,
    rightContextSize = 1,
    minOccur = 2,
    addExamples = TRUE
  )
```
```{r take-icc, fig.cap="R code and results of a co-occurrence analysis of *take* + NOUN in ICC-ENG, using the RKorAPClient package for R."}
take_ca_icc <-
  collocationAnalysis(
    icc_con("eng"),
    "focus({[ud/l=take]} [ud/p=NOUN])",
    leftContextSize = 0,
    rightContextSize = 1,
    minOccur = 2,
    addExamples = TRUE
  )
take_ca_icc %>% show_table()
```
## Results
* in ICC-ENG, the query for *take* + NOUN (as direct right neighbour) yields 10 different pairs with a minimum frequency of 2 (see Figure \@ref(fig:take-icc))
* on the English Wikipedia [2015 snapshot, see @MargarethaLuengen2014], the query yields 139 pairs (logDice threshold: 2.0), 44 of which are false positives
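To make the logDice threshold and the false-positive count easier to interpret, the following minimal sketch computes logDice in its standard form (Rychlý 2008) and the precision implied by the Wikipedia figures reported above. The function name `log_dice()` and the example frequencies are illustrative assumptions, and the variant actually used for the analyses may differ, for example by normalising for window size.

```r
# Standard logDice score: 14 + log2(2 * f_xy / (f_x + f_y))
# f_xy: co-occurrence frequency; f_x, f_y: frequencies of node and collocate
log_dice <- function(f_xy, f_x, f_y) {
  14 + log2(2 * f_xy / (f_x + f_y))
}

# Illustrative frequencies only (not taken from ICC or Wikipedia)
log_dice(f_xy = 2, f_x = 1000, f_y = 3000)

# Precision implied by the Wikipedia result: 139 pairs, 44 false positives
pairs <- 139
false_positives <- 44
precision <- (pairs - false_positives) / pairs
round(precision, 3)
# [1] 0.683
```

About two thirds of the candidate pairs from the larger corpus are therefore true positives under this reading of the reported counts.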
# Summary & Outlook
* we have made corpora of 4+ languages available for contrastive research
* however, even for fairly frequent phenomena, the results on the small corpora should be treated with caution
* typically, they need to be verified on larger monolingual corpora
* this applies in particular to recall
* nevertheless ICC can serve as a useful basis for contrastive studies
* with a uniform UI and API that facilitate querying and analysis
* in addition, ICC also serves as a crystallisation point
* for more ICC corpora and spoken parts to come
* for larger corpora and complementary approaches, such as EuReCo
# References