---
title: "News from the International Comparable Corpus"
subtitle: "First launch of ICC written"
date: "`r Sys.Date()`"
author:
  - name: Marc Kupietz
    affil: 1
  - name: Adrien Barbaresi
    affil: 2
  - name: Anna Čermáková
    affil: 3
  - name: Małgorzata Czachor
    affil: 4
  - name: Nils Diewald
    affil: 1
  - name: Jarle Ebeling
    affil: 5
  - name: Rafał L. Górski
    affil: 4
  - name: John Kirk
    affil: 6
  - name: Michal Křen
    affil: 3
  - name: Harald Lüngen
    affil: 1
  - name: Eliza Margaretha
    affil: 1
  - name: Signe Oksefjell Ebeling
    affil: 5
  - name: Mícheál Ó Meachair
    affil: 7
  - name: Ines Pisetta
    affil: 1
  - name: Elaine Uí Dhonnchadha
    affil: 8
  - name: Friedemann Vogel
    affil: 9
  - name: Rebecca Wilm
    affil: 1
  - name: Jiajin Xu
    affil: 10
  - name: Rameela Yaddehige
    affil: 1
affiliation:
  - num: 1
    address: IDS Mannheim
  - num: 2
    address: BBAW Berlin
  - num: 3
    address: Charles University
  - num: 4
    address: Polish Academy of Sciences
  - num: 5
    address: University of Oslo
  - num: 6
    address: University of Vienna
  - num: 7
    address: Dublin City University
  - num: 8
    address: Trinity College Dublin
  - num: 9
    address: University of Siegen
  - num: 10
    address: Beijing Foreign Studies University

logoleft_name: "../Figures/ICC_COL.svg"
author_textsize: "32pt"

contact:
  email: icc@ids-mannheim.de
  website: https://www.ids-mannheim.de/digspra/kl
  qrlink: >
    `r posterdown::qrlink("https://korap.ids-mannheim.de/instance/icc", "icc-logo-whitebg.svg")`

output:
  posterdown::posterdown_ids:
    self_contained: false
    keep_md: true

lang: en
bibliography: ../tex/references.bib
csl: "https://raw.githubusercontent.com/ICLC-10/Zotero/master/styles/ICLC-10.csl"
---

```{r setup, include=FALSE, echo=FALSE, warning=FALSE}
knitr::opts_chunk$set(dev = 'svg', echo = FALSE, message = FALSE, warning = FALSE)
source("common.R")
```
# ICC aims & characteristics

* make available comparable corpora of many languages for contrastive linguistic research [@kirk_ice_2017]
* mostly based on existing corpora
* small corpora with 1M words each (400K written)
* pre-defined balanced composition
    * inspired by that of ICE [@greenbaum_comparing_1996]

# Current launch of ICC written

* written parts for Chinese, Czech, English (mostly), German, Irish (partly), and Norwegian are publicly available
    * partially including UDPipe 2.0 annotations [@straka_udpipe_2018]
    * usable via CWB or KorAP [@diewald_korap_2016] (see QR code on the left)

```{r korap-query, fig.cap="KorAP UI for ICC-GER and ICC-NOR, showing annotation queries and layers, as well as a virtual corpus definition, based on ICC genre and publication date metadata."}
knitr::include_graphics("korap_query_ger-nor.svg")
```

## Composition of the ICC parts

```{r composition-by-genre, fig.cap="Actual composition of selected ICC parts with respect to ICC domain. (For the other ICC parts, the ICC genre metadatum was not yet accessible via the API at the editorial deadline.)", message = FALSE, fig.width=14, fig.height=10, out.width = "100%"}
# Token counts per language and genre: one virtual corpus (vc) per combination.
# icc, genre, token and icc_con() are not defined here (presumably in common.R).
icc_genre <- icc %>%
  expand_grid(genre) %>%
  mutate(vc = paste0("iccGenre=", genre)) %>%
  rowwise() %>%
  mutate(tokens = corpusStats(icc_con(lang, token), vc = vc)@tokens)

# Stacked bars per language, labelled with the token count of each genre
icc_genre %>% ggplot(aes(x = lang, fill = genre, y = tokens)) +
  geom_col() +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size = 24) +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
    axis.text = element_text(size = rel(0.70)),
    legend.title = element_text(size = rel(0.85), face = "bold"),
    legend.text = element_text(size = rel(1))) +
  scale_fill_ids() +
  geom_text(
    aes(label = if_else(tokens > 0, as.character(tokens), ""), y = tokens),
    position = position_stack(reverse = FALSE, vjust = 0.5),
    color = "black", size = 6.2, family = "Fira Sans Condensed")
```

```{r composition-by-pubdate, fig.cap="Composition of the selected ICC parts with respect to year of publication.", message=FALSE, warning=FALSE, fig.width=14, fig.height=3, out.width = "80%"}
year <- 1986:2023

# Token counts per language and publication year
icc_year <- icc %>%
  expand_grid(year) %>%
  mutate(vc = paste0("pubDate in ", year)) %>%
  rowwise() %>%
  mutate(tokens = corpusStats(icc_con(lang, token), vc = vc)@tokens)

# Smoothed (loess) area per language over the publication years
icc_year %>% ggplot(aes(x = year, fill = lang, color = lang, y = tokens)) +
  # geom_smooth(se=F, span=0.25) +
  xlim(1990, 2023) +
  ylim(0, NA) +
  stat_smooth(
    geom = 'area', method = 'loess', span = 1/4,
    alpha = 0.1) +
  # geom_area(alpha=0.1, position = "identity") +
  scale_fill_ids() + scale_colour_ids() +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size = 24) +
  theme(
    axis.title.x = element_text(size = rel(1.5), face = "bold"),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
    axis.text = element_text(size = rel(1)),
    legend.title = element_text(size = rel(1), face = "bold"),
    legend.text = element_text(size = rel(1)))
```
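
The two figures above each restrict the corpus by a single metadata field. As a minimal sketch of how such virtual corpus definitions are assembled as plain strings, the following base R snippet builds one definition per genre and year. The genre labels are hypothetical placeholders, and joining the two conditions with `&` is an assumption about the KorAP collection query syntax rather than something taken from the poster's own code.

```r
# Hypothetical genre labels, for illustration only (not actual ICC genres)
genres <- c("genre_a", "genre_b")
years <- 2000:2001

# All genre/year combinations; the first column varies fastest
grid <- expand.grid(genre = genres, year = years, stringsAsFactors = FALSE)

# One virtual corpus definition per combination.
# Assumption: the two conditions can be combined with "&".
grid$vc <- paste0("iccGenre=", grid$genre, " & pubDate in ", grid$year)

print(grid$vc)
```

Each resulting string could then be passed as the `vc` argument in the same way as the single-field definitions in the chunks above.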

# Pilot study

* identification of light verb constructions (LVC) with *take* in English, and corresponding lemmas in German and Norwegian
    * in order to explore the limitations imposed by the small corpus sizes
    * using RKorAPClient [@kupietz_rkorapclient_2020] to access the corpora and get reproducible results for the analyses

```{r take-icc-code, results='hide', echo=TRUE}
library(RKorAPClient)
new("KorAPConnection",
  KorAPUrl = "https://korap.ids-mannheim.de/instance/icc/eng",
  accessToken = Sys.getenv("KORAP_ICC_TOKEN_eng")) %>%
collocationAnalysis(
  "focus({[ud/l=take]} [ud/p=NOUN])",
  leftContextSize = 0,
  rightContextSize = 1,
  minOccur = 2,
  addExamples = TRUE)
```

```{r take-icc, fig.cap="R code for, and results of, a co-occurrence analysis of *take* + NOUN in ICC-ENG, using the RKorAPClient package."}
take_ca_icc <-
  collocationAnalysis(
    icc_con("eng"),
    "focus({[ud/l=take]} [ud/p=NOUN])",
    leftContextSize = 0,
    rightContextSize = 1,
    minOccur = 2,
    addExamples = TRUE
  )

take_ca_icc %>% show_table()
```

## Results

* for English, the query for *take* + NOUN (as direct right neighbour) yields 10 different pairs with a minimum frequency of 2 (see Figure \@ref(fig:take-icc))
    * based on English Wikipedia [2015 snapshot, see @MargarethaLuengen2014], the query yields 139 pairs (logDice threshold: 2.0), 44 of which are false positives, leaving 95 true positives
    * the ratio of true-positive *take*-LVCs discovered in ICC to those discovered in Wikipedia is thus 10:95
* for ICC German with DeReKo as background corpus, the corresponding ratio of discovered true LVCs with *nehmen* (= *take*) is 10:89
* in both cases, not much more than 10% of the LVCs found in the larger corpus could be discovered in ICC
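
The percentages behind these ratios can be recomputed from the reported counts. The sketch below uses only the numbers given above. The variable names are illustrative, and treating the LVCs found in the larger corpus as the reference set is an assumption made for this calculation.

```r
# Counts reported in the results above
wiki_pairs <- 139           # take + NOUN pairs found in English Wikipedia
wiki_false_positives <- 44  # Wikipedia pairs that are not LVCs
icc_eng_lvcs <- 10          # take-LVCs found in ICC English
dereko_lvcs <- 89           # nehmen-LVCs found with DeReKo
icc_ger_lvcs <- 10          # nehmen-LVCs found in ICC German

# True positives in Wikipedia: 139 - 44 = 95
wiki_lvcs <- wiki_pairs - wiki_false_positives

# Share of the larger corpus' LVCs that the ICC part also yields
# (assumption: the larger corpus serves as the reference set)
share_eng <- icc_eng_lvcs / wiki_lvcs
share_ger <- icc_ger_lvcs / dereko_lvcs

print(round(100 * c(English = share_eng, German = share_ger), 1))
```

This gives roughly 10.5% for English and 11.2% for German, in line with the "not much more than 10%" stated above.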

# Summary & Outlook

* we have made comparable corpora of 4+ languages available, readily usable for contrastive research
* however, even for fairly frequent phenomena, results obtained on the small corpora should be treated with caution
    * typically, they need to be verified on larger monolingual corpora
    * this applies to recall in particular
* nevertheless, ICC can serve as a useful basis for contrastive studies
    * with a uniform UI and API that facilitate query and analysis
* in addition, ICC serves as a crystallisation point
    * for more ICC corpora and spoken parts to come
    * for larger corpora and complementary approaches, such as EuReCo

# References