---
title: "News from the International Comparable Corpus"
subtitle: "First launch of ICC written"
date: "`r Sys.Date()`"
author:
- name: Marc Kupietz
affil: 1
- name: Adrien Barbaresi
affil: 2
- name: Anna Čermáková
affil: 3
- name: Małgorzata Czachor
affil: 4
- name: Nils Diewald
affil: 1
- name: Jarle Ebeling
affil: 5
- name: Rafał L. Górski
affil: 4
- name: John Kirk
affil: 6
- name: Michal Křen
affil: 3
- name: Harald Lüngen
affil: 1
- name: Eliza Margaretha
affil: 1
- name: Signe Oksefjell Ebeling
affil: 5
- name: Mícheál Ó Meachair
affil: 7
- name: Ines Pisetta
affil: 1
- name: Elaine Uí Dhonnchadha
affil: 8
- name: Friedemann Vogel
affil: 9
- name: Rebecca Wilm
affil: 1
- name: Jiajin Xu
affil: 10
- name: Rameela Yaddehige
affil: 1
affiliation:
- num: 1
address: IDS Mannheim
- num: 2
address: BBAW Berlin
- num: 3
address: Charles University
- num: 4
address: Polish Academy of Sciences
- num: 5
address: University of Oslo
- num: 6
address: University of Vienna
- num: 7
address: Dublin City University
- num: 8
address: Trinity College Dublin
- num: 9
address: University of Siegen
- num: 10
address: Beijing Foreign Studies University
logoleft_name: "../Figures/ICC_COL.svg"
author_textsize: "32pt"
contact:
email: icc@ids-mannheim.de
website: https://www.ids-mannheim.de/digspra/kl
qrlink: >
`r posterdown::qrlink("https://korap.ids-mannheim.de/instance/icc")`
output:
posterdown::posterdown_ids:
self_contained: false
keep_md: true
bibliography: ../tex/references.bib
csl: "https://raw.githubusercontent.com/ICLC-10/Zotero/master/styles/ICLC-10.csl"
---
```{r setup, include=FALSE, echo=FALSE, warning=FALSE}
knitr::opts_chunk$set(dev = 'svg', echo = FALSE, warning = FALSE)
source("common.R")
```
# ICC aims & characteristics
* make available comparable corpora of many languages for contrastive linguistic research [@cermakova_international_2021]
* mostly based on existing corpora
* ICC has a pre-defined balanced composition
* modelled on that of the International Corpus of English (ICE) [@greenbaum_comparing_1996]
# Current launch of ICC written
* written parts for Chinese, Czech, English, German, Irish (partly), Norwegian publicly available
* partially including UDPipe 2.0 annotations [@straka_udpipe_2018]
* via Corpus Workbench or KorAP [@diewald_korap_2016]
![](korap_query.png)
## Composition of the ICC parts
### By ICC genre
```{r composition_by_genre, message = FALSE, fig.width=14, fig.height=10, out.width = "100%"}
icc_genre <- icc %>%
expand_grid(genre) %>%
mutate(vc = paste0("iccGenre=", genre)) %>%
rowwise() %>%
mutate(tokens= corpusStats(icc_con(lang, token), vc = vc)@tokens)
icc_genre %>% ggplot(aes(x=lang, fill=genre, y=tokens)) +
geom_col() + scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
theme_ids(base_size = 24) +
theme(
axis.title.x = element_text(size = rel(1.5), face = "bold"),
axis.title.y = element_text(size = rel(1.5), face = "bold"),
axis.text = element_text(size = rel(0.70)),
legend.title = element_text(size = rel(0.85), face = "bold"),
legend.text = element_text(size = rel(1))) +
scale_fill_ids() +
geom_text(aes(label=if_else(tokens > 0, as.character(tokens), ""), y=tokens), position= position_stack(reverse = F, vjust = 0.5), color="black", size=6.2, family="Fira Sans Condensed")
```
### By date of publication
```{r composition-by-pubdate, message=F, warning=F, fig.width=14, fig.height=3, out.width = "80%"}
year <- c(1986:2023)
icc_year <- icc %>%
expand_grid(year) %>%
mutate(vc = paste0("pubDate in ", year)) %>%
rowwise() %>%
mutate(tokens= corpusStats(icc_con(lang, token), vc = vc)@tokens)
icc_year %>% ggplot(aes(x=year, fill=lang, color=lang, y=tokens)) +
# geom_smooth(se=F, span=0.25) +
xlim(1990, 2023) +
ylim(0, NA) +
stat_smooth(
geom = 'area', method = 'loess', span = 1/4,
alpha = 0.1) +
# geom_area(alpha=0.1, position = "identity") +
scale_fill_ids() + scale_colour_ids() +
scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
theme_ids(base_size=24) +
theme(
axis.title.x = element_text(size = rel(1.5), face = "bold"),
axis.title.y = element_text(size = rel(1.5), face = "bold"),
axis.text = element_text(size = rel(1)),
legend.title = element_text(size = rel(1), face = "bold"),
legend.text = element_text(size = rel(1)))
```
# Pilot study
* identification of light verb constructions (LVCs) with *take*
* aim: to investigate the limitations imposed by the very small corpus sizes
* using RKorAPClient [@kupietz_rkorapclient_2020] to access the corpora and obtain reproducible collocation analysis results
```{r take-icc, echo=TRUE, fig.cap="Collocation analysis of *take* using the RKorAPClient package for R"}
take_ca_icc <-
collocationAnalysis(
icc_con("eng"),
"focus({[ud/l=take]} [ud/p=NOUN])",
leftContextSize = 0,
rightContextSize = 1,
minOccur = 2,
addExamples = TRUE
)
take_ca_icc %>% show_table()
```
## Results
* for English, the query for *take* + NOUN (as direct right neighbour) yields 10 different pairs with a minimum frequency of 2 (see Figure \@ref(fig:take-icc))
* based on English Wikipedia (2015), the query yields 139 pairs (logDice threshold: 2.0) with about 20 false positives
* for ICC German with DeReKo as background corpus, the ratio of true positive LVCs is 10/80
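
The logDice threshold of 2.0 used for the Wikipedia comparison can be read as follows. This is a sketch that assumes the score follows the commonly used logDice definition; the poster itself does not state the formula. Here $f_{xy}$ is the co-occurrence frequency of node and collocate, and $f_x$ and $f_y$ are their individual frequencies:

$$
\mathrm{logDice} = 14 + \log_2 \frac{2 f_{xy}}{f_x + f_y}
$$

Under this assumption, a threshold of 2.0 keeps exactly those pairs with $\frac{2 f_{xy}}{f_x + f_y} \geq 2^{-12}$. As a purely hypothetical illustration (the numbers are not taken from ICC), $f_{xy} = 2$ and $f_x = f_y = 1024$ give $14 + \log_2 \frac{4}{2048} = 14 - 9 = 5$. The score has a maximum of 14 and does not depend on the total corpus size, which is one reason it is convenient for comparing a very small corpus with a large one.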
# Summary & Outlook
* we have made corpora of 4+ languages available for contrastive research
* however, even with quite frequent phenomena, the results on the small corpora are to be used with caution
* typically they need to be verified on larger monolingual corpora
* the uniform access is in any case helpful for contrastive studies
* ICC also serves as a crystallization point for larger corpora and complementary approaches such as EuReCo
# References