---
title: "News from the International Comparable Corpus"
subtitle: "First launch of ICC written"
date: "`r Sys.Date()`"
author:
  - name: Marc Kupietz
    affil: 1
  - name: Adrien Barbaresi
    affil: 2
  - name: Anna Čermáková
    affil: 3
  - name: Małgorzata Czachor
    affil: 4
  - name: Nils Diewald
    affil: 1
  - name: Jarle Ebeling
    affil: 5
  - name: Rafał L. Górski
    affil: 4
  - name: John Kirk
    affil: 6
  - name: Michal Křen
    affil: 3
  - name: Harald Lüngen
    affil: 1
  - name: Eliza Margaretha
    affil: 1
  - name: Signe Oksefjell Ebeling
    affil: 5
  - name: Mícheál Ó Meachair
    affil: 7
  - name: Ines Pisetta
    affil: 1
  - name: Elaine Uí Dhonnchadha
    affil: 8
  - name: Friedemann Vogel
    affil: 9
  - name: Rebecca Wilm
    affil: 1
  - name: Jiajin Xu
    affil: 10
  - name: Rameela Yaddehige
    affil: 1
affiliation:
  - num: 1
    address: IDS Mannheim
  - num: 2
    address: BBAW Berlin
  - num: 3
    address: Charles University
  - num: 4
    address: Polish Academy of Sciences
  - num: 5
    address: University of Oslo
  - num: 6
    address: University of Vienna
  - num: 7
    address: Dublin City University
  - num: 8
    address: Trinity College Dublin
  - num: 9
    address: University of Siegen
  - num: 10
    address: Beijing Foreign Studies University

logoleft_name: "../Figures/ICC_COL.svg"
author_textsize: "32pt"

contact:
  email: icc@ids-mannheim.de
  website: https://www.ids-mannheim.de/digspra/kl
  qrlink: >
    `r posterdown::qrlink("https://korap.ids-mannheim.de/instance/icc", "icc-logo-whitebg.svg")`

output:
  posterdown::posterdown_ids:
    self_contained: false
    keep_md: true

bibliography: ../tex/references.bib
csl: "https://raw.githubusercontent.com/ICLC-10/Zotero/master/styles/ICLC-10.csl"
---

```{r setup, include=FALSE, echo=FALSE, warning=FALSE}
knitr::opts_chunk$set(dev = 'svg', echo = FALSE, message = FALSE, warning = FALSE)
source("common.R")
```
# ICC aims & characteristics

* make available comparable corpora of many languages for contrastive linguistic research [@kirk_ice_2017]
* mostly based on existing corpora
* small corpora with 1M words each (400K written)
* pre-defined balanced composition
  * inspired by that of the ICE [@greenbaum_comparing_1996]

# Current launch of ICC written

* written parts for Chinese, Czech, English (mostly), German, Irish (partly), Norwegian publicly available
  * partially including UDPipe 2.0 annotations [@straka_udpipe_2018]
  * usable via Corpus Workbench or KorAP [@diewald_korap_2016]

```{r korap-query, fig.cap="KorAP UI for ICC-GER and ICC-NOR, showing annotation queries and layers, as well as a virtual corpus definition, based on ICC genre and publication date metadata."}
knitr::include_graphics("korap_query_ger-nor.svg")
```
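
The virtual corpus definitions mentioned in the caption above are plain query strings over metadata fields. A minimal sketch of how such strings might be assembled in R, assuming the field names `iccGenre` and `pubDate` that appear elsewhere in this document; the genre labels and the helper `vc_for()` are hypothetical placeholders, and joining conditions with `&` is an assumption about the collection query syntax:

```r
# Hypothetical genre labels, for illustration only; the real ICC
# genre inventory is not listed in this document.
genres <- c("Genre_A", "Genre_B")
years  <- 2000:2002

# Hypothetical helper: one virtual corpus definition per genre/year pair.
# Assumption: two metadata conditions can be conjoined with "&".
vc_for <- function(genre, year) {
  paste0("iccGenre=", genre, " & pubDate in ", year)
}

# All genre/year combinations (base R only, no network access needed).
grid <- expand.grid(genre = genres, year = years, stringsAsFactors = FALSE)
grid$vc <- vc_for(grid$genre, grid$year)

print(grid$vc)
```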

## Composition of the ICC parts

```{r composition-by-genre, fig.cap="Actual composition of selected ICC parts with respect to ICC domain. (For the other ICC parts, the ICC genre metadatum was not yet accessible via the API at the editorial deadline.)", message = FALSE, fig.width=14, fig.height=10, out.width = "100%"}
# Token counts per language and genre, one virtual corpus per combination.
# icc, genre and icc_con() are not defined here; presumably they come from common.R.
icc_genre <- icc %>%
  expand_grid(genre) %>%
  mutate(vc = paste0("iccGenre=", genre)) %>%
  rowwise() %>%
  mutate(tokens = corpusStats(icc_con(lang, token), vc = vc)@tokens)

# Stacked bar chart with the token count printed inside each non-empty segment
icc_genre %>% ggplot(aes(x = lang, fill = genre, y = tokens)) +
  geom_col() +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size = 24) +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
    axis.text = element_text(size = rel(0.70)),
    legend.title = element_text(size = rel(0.85), face = "bold"),
    legend.text = element_text(size = rel(1))) +
  scale_fill_ids() +
  geom_text(
    aes(label = if_else(tokens > 0, as.character(tokens), ""), y = tokens),
    position = position_stack(reverse = FALSE, vjust = 0.5),
    color = "black", size = 6.2, family = "Fira Sans Condensed")

```

```{r composition-by-pubdate, fig.cap="Composition of the selected ICC parts with respect to year of publication.", message=FALSE, warning=FALSE, fig.width=14, fig.height=3, out.width = "80%"}
year <- 1986:2023

# Token counts per language and publication year, one virtual corpus per year
icc_year <- icc %>%
  expand_grid(year) %>%
  mutate(vc = paste0("pubDate in ", year)) %>%
  rowwise() %>%
  mutate(tokens = corpusStats(icc_con(lang, token), vc = vc)@tokens)

# Smoothed area per language over publication year
icc_year %>% ggplot(aes(x = year, fill = lang, color = lang, y = tokens)) +
  # geom_smooth(se=F, span=0.25) +
  xlim(1990, 2023) +
  ylim(0, NA) +
  stat_smooth(
    geom = 'area', method = 'loess', span = 1/4,
    alpha = 0.1) +
  # geom_area(alpha=0.1, position = "identity") +
  scale_fill_ids() + scale_colour_ids() +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size = 24) +
  theme(
    axis.title.x = element_text(size = rel(1.5), face = "bold"),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
    axis.text = element_text(size = rel(1)),
    legend.title = element_text(size = rel(1), face = "bold"),
    legend.text = element_text(size = rel(1)))
```

# Pilot study

* identification of light verb constructions (LVC) with *take* in English, and corresponding lemmas in German and Norwegian
  * in order to explore the limitations imposed by the small corpus sizes
  * using RKorAPClient [@kupietz_rkorapclient_2020] to access the corpora and get reproducible results for the analyses

```{r take-icc-code, results='hide', echo=TRUE}
library(RKorAPClient)
new("KorAPConnection",
    KorAPUrl = "https://korap.ids-mannheim.de/instance/icc/eng",
    accessToken = Sys.getenv("KORAP_ICC_TOKEN_eng")) %>%
  collocationAnalysis(
    "focus({[ud/l=take]} [ud/p=NOUN])",
    leftContextSize = 0,
    rightContextSize = 1,
    minOccur = 2,
    addExamples = TRUE)
```

```{r take-icc, fig.cap="R code for, and results of, a co-occurrence analysis of *take* + NOUN in ICC-ENG, using the RKorAPClient package."}
take_ca_icc <-
  collocationAnalysis(
    icc_con("eng"),
    "focus({[ud/l=take]} [ud/p=NOUN])",
    leftContextSize = 0,
    rightContextSize = 1,
    minOccur = 2,
    addExamples = TRUE
  )

take_ca_icc %>% show_table()
```

## Results

* for English, the query for *take* + NOUN (as direct right neighbour) yields 10 different pairs with a minimum frequency of 2 (see Figure \@ref(fig:take-icc))
  * based on English Wikipedia [2015 snapshot, see @MargarethaLuengen2014], the query yields 139 pairs (logDice threshold: 2.0), 44 of which are false positives
  * the ratio of true-positive *take*-LVCs discovered in ICC to those discovered in Wikipedia is 10:95
* for ICC German with DeReKo as background corpus, the ratio of discovered true LVCs with *nehmen* (= *take*) is 10:89
* in both cases, not much more than 10% of LVCs could be discovered
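
The figures above can be checked with a little arithmetic. A minimal sketch that uses only the counts reported in the list; the variable names are illustrative, and treating the LVCs found in the large reference corpus as the full set is an assumption made for this estimate:

```r
# Counts taken from the results listed above
wiki_pairs           <- 139  # take + NOUN pairs found in English Wikipedia
wiki_false_positives <- 44   # pairs judged not to be LVCs
icc_eng_lvcs         <- 10   # take-LVCs found in ICC English
icc_ger_lvcs         <- 10   # nehmen-LVCs found in ICC German
dereko_lvcs          <- 89   # nehmen-LVCs found with DeReKo as background

# True positives in Wikipedia: 139 - 44 = 95
wiki_lvcs <- wiki_pairs - wiki_false_positives

# Share of the reference LVCs that the small corpora recover
recall_eng <- icc_eng_lvcs / wiki_lvcs    # 10 / 95
recall_ger <- icc_ger_lvcs / dereko_lvcs  # 10 / 89

# Percentages, rounded to one decimal place
print(round(100 * c(eng = recall_eng, ger = recall_ger), 1))
```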

# Summary & Outlook

* we have made comparable corpora of 4+ languages available, readily usable for contrastive research
* however, even for fairly frequent phenomena, the results on the small corpora should be treated with caution
  * typically, they need to be verified on larger monolingual corpora
  * this applies especially to recall
* nevertheless, ICC can serve as a useful basis for contrastive studies
  * with a uniform UI and API that facilitate query and analysis
* in addition, ICC serves as a crystallisation point
  * for more ICC corpora and spoken parts to come
  * for larger corpora and complementary approaches, such as EuReCo

# References