---
title: "News from the International Comparable Corpus"
subtitle: "First launch of ICC written"
date: "`r Sys.Date()`"
author:
  - name: Marc Kupietz
    affil: 1
  - name: Adrien Barbaresi
    affil: 2
  - name: Anna Čermáková
    affil: 3
  - name: Małgorzata Czachor
    affil: 4
  - name: Nils Diewald
    affil: 1
  - name: Jarle Ebeling
    affil: 5
  - name: Rafał L. Górski
    affil: 4
  - name: John Kirk
    affil: 6
  - name: Michal Křen
    affil: 3
  - name: Harald Lüngen
    affil: 1
  - name: Eliza Margaretha
    affil: 1
  - name: Signe Oksefjell Ebeling
    affil: 5
  - name: Mícheál Ó Meachair
    affil: 7
  - name: Ines Pisetta
    affil: 1
  - name: Elaine Uí Dhonnchadha
    affil: 8
  - name: Friedemann Vogel
    affil: 9
  - name: Rebecca Wilm
    affil: 1
  - name: Jiajin Xu
    affil: 10
  - name: Rameela Yaddehige
    affil: 1
affiliation:
  - num: 1
    address: IDS Mannheim
  - num: 2
    address: BBAW Berlin
  - num: 3
    address: Charles University
  - num: 4
    address: Polish Academy of Sciences
  - num: 5
    address: University of Oslo
  - num: 6
    address: University of Vienna
  - num: 7
    address: Dublin City University
  - num: 8
    address: Trinity College Dublin
  - num: 9
    address: University of Siegen
  - num: 10
    address: Beijing Foreign Studies University


logoleft_name: "../Figures/ICC_COL.svg"
author_textsize: "32pt"

contact:
  email: icc@ids-mannheim.de
  website: https://www.ids-mannheim.de/digspra/kl
  qrlink: >
    `r posterdown::qrlink("https://korap.ids-mannheim.de/instance/icc", "icc-logo-whitebg.svg")`

output:
  posterdown::posterdown_ids:
    self_contained: false
    keep_md: true

bibliography: ../tex/references.bib
csl: "https://raw.githubusercontent.com/ICLC-10/Zotero/master/styles/ICLC-10.csl"
---

```{r setup, include=FALSE, echo=FALSE, warning=FALSE}
# The knitr chunk option is `warning` (singular); `warnings` is silently ignored
knitr::opts_chunk$set(dev = 'svg', echo = FALSE, warning = FALSE)
# Shared setup; objects used below such as icc and icc_con() are not defined in this file
source("common.R")
```
# ICC aims & characteristics

* make available comparable corpora of many languages for contrastive linguistic research [@kirk_ice_2017]
* mostly based on existing corpora
* ICC has a pre-defined balanced composition
  * based on that of the International Corpus of English (ICE) [@greenbaum_comparing_1996]

# Current launch of ICC written

* written parts for Chinese, Czech, English, German, Irish (partly), and Norwegian are publicly available
  * partially including UDPipe 2.0 annotations [@straka_udpipe_2018]
  * usable via Corpus Workbench or KorAP [@diewald_korap_2016]

```{r korap-query, fig.cap="KorAP UI for ICC-GER and ICC-NOR, showing annotation queries and layers, as well as a virtual corpus definition, based on ICC genre and publication date metadata."}
knitr::include_graphics("korap_query_ger-nor.svg")
```

## Composition of the ICC parts

```{r composition-by-genre, fig.cap="Actual composition of selected ICC parts with respect to ICC domain. (For the other ICC parts, the ICC genre metadatum was not yet accessible via the API at the editorial deadline.)", message = FALSE, fig.width=14, fig.height=10, out.width = "100%"}
# Query the token count per language and genre, using one virtual corpus (vc) per combination
icc_genre <- icc %>%
  expand_grid(genre) %>%
  mutate(vc = paste0("iccGenre=", genre)) %>%
  rowwise() %>%
  mutate(tokens = corpusStats(icc_con(lang, token), vc = vc)@tokens)

# Stacked bars: tokens per language, split by genre and labelled with the counts
icc_genre %>% ggplot(aes(x = lang, fill = genre, y = tokens)) +
  geom_col() +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size = 24) +
  theme(
    axis.title.x = element_text(size = rel(1.5), face = "bold"),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
    axis.text = element_text(size = rel(0.70)),
    legend.title = element_text(size = rel(0.85), face = "bold"),
    legend.text = element_text(size = rel(1))) +
  scale_fill_ids() +
  geom_text(
    aes(label = if_else(tokens > 0, as.character(tokens), ""), y = tokens),
    position = position_stack(reverse = FALSE, vjust = 0.5),
    color = "black", size = 6.2, family = "Fira Sans Condensed")
```

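The way the chunk above derives one virtual corpus definition per language and genre can be tried out without a KorAP connection. The following is only a minimal sketch in base R: the contents of `icc` and `genre` are hypothetical stand-ins (the real objects are not defined in this file), and `merge(..., by = NULL)` is used as a base-R analogue of `expand_grid()`.

```r
# Hypothetical stand-ins for objects that the poster obtains from elsewhere
icc <- data.frame(lang = c("eng", "ger"))   # assumed shape: one row per ICC part
genre <- c("GenreA", "GenreB")              # placeholder values, not actual ICC genres

# Cross join: every language is combined with every genre
grid <- merge(icc, data.frame(genre = genre), by = NULL)

# One virtual corpus definition per row, in the same form as in the chunk above
grid$vc <- paste0("iccGenre=", grid$genre)

print(grid)
```

In the actual chunk, each such `vc` string is passed to `corpusStats()` to obtain the token count for that language and genre.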

```{r composition-by-pubdate, fig.cap="Composition of the selected ICC parts with respect to year of publication.", message=FALSE, warning=FALSE, fig.width=14, fig.height=3, out.width = "80%"}
year <- 1986:2023

# Query the token count per language and publication year
icc_year <- icc %>%
  expand_grid(year) %>%
  mutate(vc = paste0("pubDate in ", year)) %>%
  rowwise() %>%
  mutate(tokens = corpusStats(icc_con(lang, token), vc = vc)@tokens)

# LOESS-smoothed area per language over the publication years
icc_year %>% ggplot(aes(x = year, fill = lang, color = lang, y = tokens)) +
  # geom_smooth(se=F, span=0.25) +
  xlim(1990, 2023) +
  ylim(0, NA) +
  stat_smooth(
    geom = 'area', method = 'loess', span = 1/4,
    alpha = 0.1) +
  # geom_area(alpha=0.1, position = "identity") +
  scale_fill_ids() + scale_colour_ids() +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size = 24) +
  theme(
    axis.title.x = element_text(size = rel(1.5), face = "bold"),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
    axis.text = element_text(size = rel(1)),
    legend.title = element_text(size = rel(1), face = "bold"),
    legend.text = element_text(size = rel(1)))
```


# Pilot study

* identification of light verb constructions (LVCs) with *take* in English, and corresponding lemmas in German and Norwegian
  * in order to explore the limitations imposed by the small corpus sizes
  * using RKorAPClient [@kupietz_rkorapclient_2020] to access the corpora and get reproducible results for the analyses


```{r take-icc-code, results='hide', echo=TRUE}
library(RKorAPClient)
new("KorAPConnection",
    KorAPUrl = "https://korap.ids-mannheim.de/instance/icc/eng",
    accessToken = Sys.getenv("KORAP_ICC_TOKEN_eng")) %>%
  collocationAnalysis(
    "focus({[ud/l=take]} [ud/p=NOUN])",
    leftContextSize = 0,
    rightContextSize = 1,
    minOccur = 2,
    addExamples = TRUE
  )
```

```{r take-icc, fig.cap="R code and results of a co-occurrence analysis of *take* + NOUN in ICC-ENG, using the RKorAPClient package for R."}
take_ca_icc <-
  collocationAnalysis(
    icc_con("eng"),
    "focus({[ud/l=take]} [ud/p=NOUN])",
    leftContextSize = 0,
    rightContextSize = 1,
    minOccur = 2,
    addExamples = TRUE
  )

take_ca_icc %>% show_table()
```

## Results

* for English, the query for *take* + NOUN (as direct right neighbour) yields 10 different pairs with a minimum frequency of 2 (see Figure \@ref(fig:take-icc))
  * based on English Wikipedia [2015 snapshot, see @MargarethaLuengen2014], the query yields 139 pairs (log-dice threshold: 2.0), 44 of which are false positives
  * the ratio of true-positive *take*-LVCs discovered in ICC versus Wikipedia is thus 10:95
* for ICC German with DeReKo as background corpus, the ratio of discovered true LVCs with *nehmen* (= *take*) is 10:89
* in both cases, little more than 10% of the LVCs attested in the larger corpus could be discovered in ICC

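The percentages behind these ratios can be recomputed from the reported figures. The sketch below uses only the numbers given above; the variable names are illustrative, and treating the larger corpus as the reference for recall is an assumption of this sketch.

```r
# Figures as reported in the Results list above
wiki_pairs     <- 139  # take + NOUN pairs found in English Wikipedia
wiki_false_pos <- 44   # of which false positives
icc_true_pos   <- 10   # true LVC pairs found in ICC (same count in both reported ratios)
dereko_true    <- 89   # true nehmen-LVCs found with DeReKo as background corpus

wiki_true_pos <- wiki_pairs - wiki_false_pos  # 95 true positives in Wikipedia

# Share of the LVCs attested in the larger corpus that ICC also reveals
recall_eng <- icc_true_pos / wiki_true_pos
recall_ger <- icc_true_pos / dereko_true

round(100 * c(eng = recall_eng, ger = recall_ger), 1)
# eng: 10.5, ger: 11.2 (percent)
```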
# Summary & Outlook

* we have made corpora of 4+ languages available for contrastive research
* however, even for fairly frequent phenomena, the results on the small corpora should be treated with caution
  * typically, they need to be verified on larger monolingual corpora
  * this applies in particular to recall
* nevertheless, ICC can serve as a useful basis for contrastive studies
  * with a uniform UI and API that facilitate querying and analysis
* in addition, ICC also serves as a crystallisation point
  * for more ICC corpora and spoken parts to come
  * for larger corpora and complementary approaches, such as EuReCo

# References
