---
title: "News from the International Comparable Corpus"
subtitle: "First launch of ICC written"
date: "`r Sys.Date()`"
author:
    - name: Marc Kupietz
      affil: 1
    - name: Adrien Barbaresi
      affil: 2
    - name: Anna Čermáková
      affil: 3
    - name: Małgorzata Czachor
      affil: 4
    - name: Nils Diewald
      affil: 1
    - name: Jarle Ebeling
      affil: 5
    - name: Rafał L. Górski
      affil: 4
    - name: John Kirk
      affil: 6
    - name: Michal Křen
      affil: 3
    - name: Harald Lüngen
      affil: 1
    - name: Eliza Margaretha
      affil: 1
    - name: Signe Oksefjell Ebeling
      affil: 5
    - name: Mícheál Ó Meachair
      affil: 7
    - name: Ines Pisetta
      affil: 1
    - name: Elaine Uí Dhonnchadha
      affil: 8
    - name: Friedemann Vogel
      affil: 9
    - name: Rebecca Wilm
      affil: 1
    - name: Jiajin Xu
      affil: 10
    - name: Rameela Yaddehige
      affil: 1
affiliation:
  - num: 1
    address: IDS Mannheim
  - num: 2
    address: BBAW Berlin
  - num: 3
    address: Charles University
  - num: 4
    address: Polish Academy of Sciences
  - num: 5
    address: University of Oslo
  - num: 6
    address: University of Vienna
  - num: 7
    address: Dublin City University
  - num: 8
    address: Trinity College Dublin
  - num: 9
    address: University of Siegen
  - num: 10
    address: Beijing Foreign Studies University


logoleft_name: "../Figures/ICC_COL.svg"
author_textsize: "32pt"

contact:
  qrlink: >
    `r posterdown::qrlink("https://korap.ids-mannheim.de/instance/icc")`

output:
  posterdown::posterdown_ids:
        self_contained: false
        keep_md: true

bibliography: ../tex/references.bib
csl: "https://raw.githubusercontent.com/ICLC-10/Zotero/master/styles/ICLC-10.csl"
---

```{r setup, include=FALSE, echo=FALSE, warning=FALSE}
knitr::opts_chunk$set(dev = 'svg', echo = FALSE, warning = FALSE)
source("common.R")
```
# ICC aims & characteristics
* make comparable corpora of many languages available for contrastive linguistic research [@cermakova_international_2021]
* mostly based on existing corpora
* ICC has a pre-defined “balanced” composition
  * based on that of the International Corpus of English (ICE) [@greenbaum_comparing_1996]

# Current launch of ICC written

* written parts for Chinese, Czech, English, German, Irish (partly) and Norwegian are publicly available
  * partially including UDPipe 2.0 annotations [@straka_udpipe_2018]
  * via Corpus Workbench or KorAP [@diewald_korap_2016]
  
![](korap_query.png) 

## Composition of the ICC parts 
### By ICC genre

```{r composition_by_genre, message = FALSE, fig.width=14, fig.height=10, out.width = "100%"}
icc_genre <- icc %>%
  expand_grid(genre) %>%
  mutate(vc = paste0("iccGenre=", genre)) %>%
  rowwise() %>%
  mutate(tokens= corpusStats(icc_con(lang, token), vc = vc)@tokens)

icc_genre %>% ggplot(aes(x=lang, fill=genre, y=tokens)) +
  geom_col() + scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size = 24) +
  theme(
    axis.title.x = element_text(size = rel(1.5), face = "bold"),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
     axis.text = element_text(size = rel(0.70)),
    legend.title = element_text(size = rel(0.85), face = "bold"),
    legend.text = element_text(size = rel(1))) +
  scale_fill_ids() +
  geom_text(aes(label = if_else(tokens > 0, as.character(tokens), ""), y = tokens),
            position = position_stack(reverse = FALSE, vjust = 0.5),
            color = "black", size = 6.2, family = "Fira Sans Condensed")

```

### By date of publication


```{r composition-by-pubdate, message=F, warning=F, fig.width=14, fig.height=3, out.width = "80%"}
year <- c(1986:2023)

icc_year <- icc %>%
  expand_grid(year) %>%
  mutate(vc = paste0("pubDate in ", year)) %>%
  rowwise() %>%
  mutate(tokens= corpusStats(icc_con(lang, token), vc = vc)@tokens)

icc_year %>% ggplot(aes(x=year, fill=lang, color=lang, y=tokens)) +
  # geom_smooth(se=F, span=0.25) +
  xlim(1990, 2023) +
  ylim(0, NA) +
  stat_smooth(
        geom = 'area', method = 'loess', span = 1/4,
        alpha = 0.1) +
  # geom_area(alpha=0.1,  position = "identity") +
  scale_fill_ids() + scale_colour_ids() + 
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size=24) + 
    theme(
    axis.title.x = element_text(size = rel(1.5), face = "bold"),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
     axis.text = element_text(size = rel(1)),
    legend.title = element_text(size = rel(1), face = "bold"),
    legend.text = element_text(size = rel(1))) 
```


# Pilot study

* identification of light verb constructions (LVCs) with *take*
* aim: to investigate the limitations imposed by the very small corpus sizes
* using RKorAPClient [@kupietz_rkorapclient_2020] to access the corpora and obtain reproducible collocation analysis results


```{r take-icc, echo=TRUE, fig.cap="Collocation analysis of *take* using the RKorAPClient package for R"}
take_ca_icc <-
  collocationAnalysis(
    icc_con("eng"),
    "focus({[ud/l=take]} [ud/p=NOUN])",
    leftContextSize = 0,
    rightContextSize = 1,
    minOccur = 2,
    addExamples = TRUE
  )

take_ca_icc %>% show_table()
```

## Results

* for English, the query for *take* + NOUN (as direct right neighbour) yields 10 different pairs with a minimum frequency of 2 (see Figure \@ref(fig:take-icc))
  * based on English Wikipedia (2015), the query yields 139 pairs (logDice threshold: 2.0) with about 20 false positives
* for ICC German with DeReKo as background corpus, the ratio of true positive LVCs is 10/80
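
The logDice threshold mentioned above can be illustrated with a minimal sketch, assuming the usual definition of the score, 14 + log2(2 f_xy / (f_x + f_y)). The helper `log_dice()` and all frequencies below are hypothetical and serve only as an illustration; they are neither ICC or Wikipedia results nor part of RKorAPClient.

```{r logdice-sketch, echo=TRUE}
# Hypothetical helper (not part of RKorAPClient): logDice association score,
# assuming the usual definition 14 + log2(2 * f_xy / (f_x + f_y))
log_dice <- function(f_xy, f_x, f_y) {
  stopifnot(all(f_xy >= 0), all(f_x > 0), all(f_y > 0))
  14 + log2(2 * f_xy / (f_x + f_y))
}

# Invented frequencies, for illustration only
candidates <- data.frame(
  collocate = c("place", "care", "table"),
  f_xy = c(120, 45, 2),        # co-occurrences of node and collocate
  f_x  = c(5000, 5000, 5000),  # frequency of the node (e.g. *take*)
  f_y  = c(3000, 800, 40000)   # frequency of the collocate
)
candidates$logDice <- with(candidates, log_dice(f_xy, f_x, f_y))

# Keep only pairs at or above the threshold
subset(candidates, logDice >= 2.0)
```

Because logDice depends only on the three frequencies and not on corpus size, the same threshold can in principle be applied to a small ICC part and to a larger reference corpus, although low counts in small corpora make individual scores less stable.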

# Summary & Outlook

* we have made corpora of 4+ languages available for contrastive research
* however, even for quite frequent phenomena, results obtained on the small corpora should be used with caution
  * typically they need to be verified on larger monolingual corpora
* the uniform access is in any case helpful for contrastive studies
* ICC also serves as a crystallization point for larger corpora and complementary approaches such as EuReCo

# References


