 ---
 title: "News from the International Comparable Corpus"
 subtitle: "First launch of ICC written"
 date: "`r Sys.Date()`"
 author:
     - name: Marc Kupietz
       affil: 1
     - name: Adrien Barbaresi
       affil: 2
     - name: Anna Čermáková
       affil: 3
     - name: Małgorzata Czachor
       affil: 4
     - name: Nils Diewald
       affil: 1
     - name: Jarle Ebeling
       affil: 5
     - name: Rafał L. Górski
       affil: 4
     - name: John Kirk
       affil: 6
     - name: Michal Křen
       affil: 3
     - name: Harald Lüngen
       affil: 1
     - name: Eliza Margaretha
       affil: 1
     - name: Signe Oksefjell Ebeling
       affil: 5
     - name: Mícheál Ó Meachair
       affil: 7
     - name: Ines Pisetta
       affil: 1
     - name: Elaine Uí Dhonnchadha
       affil: 8
     - name: Friedemann Vogel
       affil: 9
     - name: Rebecca Wilm
       affil: 1
     - name: Jiajin Xu
       affil: 10
     - name: Rameela Yaddehige
       affil: 1
 affiliation:
   - num: 1
     address: IDS Mannheim
   - num: 2
     address: BBAW Berlin
   - num: 3
     address: Charles University
   - num: 4
     address: Polish Academy of Sciences
   - num: 5
     address: University of Oslo
   - num: 6
     address: University of Vienna
   - num: 7
     address: Dublin City University
   - num: 8
     address: Trinity College Dublin
   - num: 9
     address: University of Siegen
   - num: 10
     address: Beijing Foreign Studies University


 logoleft_name: "../Figures/ICC_COL.svg"
 author_textsize: "32pt"

 contact:
   email: icc@ids-mannheim.de
   website: https://www.ids-mannheim.de/digspra/kl
   qrlink: >
     `r posterdown::qrlink("https://korap.ids-mannheim.de/instance/icc", "icc-logo-whitebg.svg")`

 output:
   posterdown::posterdown_ids:
         self_contained: false
         keep_md: true

 lang: en
 bibliography: ../tex/references.bib
 csl: "https://raw.githubusercontent.com/ICLC-10/Zotero/master/styles/ICLC-10.csl"
 ---

 ```{r setup, include=FALSE, echo=FALSE, warning=FALSE}
 knitr::opts_chunk$set(dev = 'svg', echo = FALSE, message = FALSE, warning = FALSE)
 source("common.R")
 ```
 # ICC aims & characteristics

 * make available comparable corpora of many languages for contrastive linguistic research [@kirk_ice_2017]
 * mostly based on existing corpora
 * small corpora with 1M words each (400K written)
 * pre-defined “balanced” composition
   * inspired by that of the International Corpus of English (ICE) [@greenbaum_comparing_1996]

 # Current launch of ICC written

 * written parts for Chinese, Czech, English (mostly), German, Irish (partly), Norwegian publicly available
   * partially including UDPipe 2.0 annotations [@straka_udpipe_2018]
   * usable via CWB or KorAP [@diewald_korap_2016] ➝ QR Code on the left

 ```{r korap-query, fig.cap="KorAP UI for ICC-GER and ICC-NOR, showing annotation queries and layers, as well as a virtual corpus definition, based on ICC genre and publication date metadata."}
 knitr::include_graphics("korap_query_ger-nor.svg")
 ```

 ## Composition of the ICC parts

 ```{r composition-by-genre, fig.cap="Actual composition of selected ICC parts with respect to ICC domain. (For the other ICC parts, the ICC genre metadatum was not yet accessible via the API at the editorial deadline.)", message = FALSE, fig.width=14, fig.height=10, out.width = "100%"}
 icc_genre <- icc %>%
   expand_grid(genre) %>%
   mutate(vc = paste0("iccGenre=", genre)) %>%
   rowwise() %>%
   mutate(tokens= corpusStats(icc_con(lang, token), vc = vc)@tokens)

 icc_genre %>% ggplot(aes(x=lang, fill=genre, y=tokens)) +
   geom_col() + scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
   theme_ids(base_size = 24) +
   theme(
     axis.title.x = element_blank(),
     axis.title.y = element_text(size = rel(1.5), face = "bold"),
      axis.text = element_text(size = rel(0.70)),
     legend.title = element_text(size = rel(0.85), face = "bold"),
     legend.text = element_text(size = rel(1))) +
   scale_fill_ids() +
   geom_text(
     aes(label = if_else(tokens > 0, as.character(tokens), ""), y = tokens),
     position = position_stack(reverse = FALSE, vjust = 0.5),
     color = "black", size = 6.2, family = "Fira Sans Condensed")

 ```


 ```{r composition-by-pubdate, fig.cap="Composition of the selected ICC parts with respect to year of publication.", message=FALSE, warning=FALSE, fig.width=14, fig.height=3, out.width = "80%"}
 year <- c(1986:2023)

 icc_year <- icc %>%
   expand_grid(year) %>%
   mutate(vc = paste0("pubDate in ", year)) %>%
   rowwise() %>%
   mutate(tokens= corpusStats(icc_con(lang, token), vc = vc)@tokens)

 icc_year %>% ggplot(aes(x=year, fill=lang, color=lang, y=tokens)) +
   # geom_smooth(se=F, span=0.25) +
   xlim(1990, 2023) +
   stat_smooth(
         geom = 'area', method = 'loess', span = 1/4,
         alpha = 0.1) +
   # geom_area(alpha=0.1,  position = "identity") +
   scale_fill_ids() + scale_colour_ids() +
   # the lower y limit is set here, because a separate ylim() call would be replaced by this scale
   scale_y_continuous(limits = c(0, NA), labels = label_number(scale_cut = cut_short_scale())) +
   theme_ids(base_size=24) +
     theme(
     axis.title.x = element_text(size = rel(1.5), face = "bold"),
     axis.title.y = element_text(size = rel(1.5), face = "bold"),
      axis.text = element_text(size = rel(1)),
     legend.title = element_text(size = rel(1), face = "bold"),
     legend.text = element_text(size = rel(1)))
 ```


 # Pilot study

 * identification of light verb constructions (LVCs) with *take* in English, and with the corresponding lemmas in German and Norwegian
   * to explore the limitations imposed by the small corpus sizes
   * using RKorAPClient [@kupietz_rkorapclient_2020] to access the corpora and get reproducible results for the analyses


 ```{r take-icc-code, results='hide', echo=TRUE}
 library(RKorAPClient)
 new("KorAPConnection",
     KorAPUrl = "https://korap.ids-mannheim.de/instance/icc/eng",
     accessToken = Sys.getenv("KORAP_ICC_TOKEN_eng")) %>%
 collocationAnalysis(
     "focus({[ud/l=take]} [ud/p=NOUN])",
     leftContextSize = 0,
     rightContextSize = 1,
     minOccur = 2,
     addExamples = TRUE)
 ```

 ```{r take-icc, fig.cap="R code for, and results of, a co-occurrence analysis of *take* + NOUN in ICC-ENG, using the RKorAPClient package."}
 take_ca_icc <-
   collocationAnalysis(
     icc_con("eng"),
     "focus({[ud/l=take]} [ud/p=NOUN])",
     leftContextSize = 0,
     rightContextSize = 1,
     minOccur = 2,
     addExamples = TRUE
   )

 take_ca_icc %>% show_table()
 ```
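
The comparison with Wikipedia in the results applies a logDice threshold. As a reminder of what that score measures, the following is a minimal sketch of the usual logDice definition. The function name and argument names are illustrative and are not the RKorAPClient API, and the frequencies in the usage lines are invented toy values.

```r
# Illustrative helper (hypothetical name, not part of RKorAPClient):
#   logDice = 14 + log2(2 * f_xy / (f_x + f_y))
#   f_xy: co-occurrence frequency of node and collocate
#   f_x : frequency of the node
#   f_y : frequency of the collocate
log_dice_sketch <- function(f_xy, f_x, f_y) {
  stopifnot(all(f_xy >= 0), all(f_x + f_y > 0))
  14 + log2(2 * f_xy / (f_x + f_y))
}

# Toy frequencies, invented for illustration only
perfect <- log_dice_sketch(f_xy = 10, f_x = 10, f_y = 10)  # 14 + log2(1)
weaker  <- log_dice_sketch(f_xy = 1,  f_x = 2,  f_y = 2)   # 14 + log2(0.5)
print(c(perfect = perfect, weaker = weaker))
```

Because the score depends only on these three frequencies and not on the total corpus size, a fixed threshold such as 2.0 can be applied to corpora of very different sizes.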

 ## Results

 * for English, the query for *take* + NOUN (as direct right neighbour) yields 10 different pairs with a minimum frequency of 2 (see Figure \@ref(fig:take-icc))
   * based on English Wikipedia [2015 snapshot, see @MargarethaLuengen2014], the same query yields 139 pairs (logDice threshold: 2.0), 44 of which are false positives
   * with 139 − 44 = 95 true positives in Wikipedia, the ratio of true *take*-LVCs discovered in ICC vs. Wikipedia is 10:95
 * for ICC German with DeReKo as background corpus, the ratio of discovered true LVCs with ›nehmen‹ (= take) is 10:89
 * in both cases, the ICC part reveals little more than 10% of the LVCs found in the larger corpus
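
The ratios above can be checked with a few lines of arithmetic. The sketch below uses only the counts reported in the list; the variable names are illustrative and do not appear in the poster's own code.

```r
# Counts as reported in the results list
wiki_pairs     <- 139  # take + NOUN pairs found in English Wikipedia
wiki_false_pos <- 44   # of these, pairs that are not LVCs
icc_eng_true   <- 10   # true take-LVCs found in ICC-ENG
icc_ger_true   <- 10   # true nehmen-LVCs found in ICC-GER
dereko_true    <- 89   # true nehmen-LVCs found in DeReKo

# True positives in Wikipedia
wiki_true <- wiki_pairs - wiki_false_pos

# Share of the larger corpus' LVCs that the ICC part also reveals
share_eng <- icc_eng_true / wiki_true
share_ger <- icc_ger_true / dereko_true

print(round(100 * c(eng = share_eng, ger = share_ger), 1))
```

Both shares come out at roughly a tenth, which is the basis for the "little more than 10%" statement.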

 # Summary & Outlook

 * we have made comparable corpora of 4+ languages available, readily usable for contrastive research
 * however, even for fairly frequent phenomena, the results on the small corpora should be treated with caution
   * typically, they need to be verified on larger monolingual corpora
   * this applies in particular to recall
 * nevertheless, ICC can serve as a useful basis for contrastive studies
   * with a uniform UI and API that facilitate query and analysis
 * in addition, ICC serves as a crystallisation point
   * for more ICC corpora and spoken parts to come
   * for larger corpora and complementary approaches, such as EuReCo

 # References