---
title: "News from the International Comparable Corpus"
subtitle: "First launch of ICC written"
date: "`r Sys.Date()`"
author:
  - name: Marc Kupietz
    affil: 1
  - name: Adrien Barbaresi
    affil: 2
  - name: Anna Čermáková
    affil: 3
  - name: Małgorzata Czachor
    affil: 4
  - name: Nils Diewald
    affil: 1
  - name: Jarle Ebeling
    affil: 5
  - name: Rafał L. Górski
    affil: 4
  - name: John Kirk
    affil: 6
  - name: Michal Křen
    affil: 3
  - name: Harald Lüngen
    affil: 1
  - name: Eliza Margaretha
    affil: 1
  - name: Signe Oksefjell Ebeling
    affil: 5
  - name: Mícheál Ó Meachair
    affil: 7
  - name: Ines Pisetta
    affil: 1
  - name: Elaine Uí Dhonnchadha
    affil: 8
  - name: Friedemann Vogel
    affil: 9
  - name: Rebecca Wilm
    affil: 1
  - name: Jiajin Xu
    affil: 10
  - name: Rameela Yaddehige
    affil: 1
affiliation:
  - num: 1
    address: IDS Mannheim
  - num: 2
    address: BBAW Berlin
  - num: 3
    address: Charles University
  - num: 4
    address: Polish Academy of Sciences
  - num: 5
    address: University of Oslo
  - num: 6
    address: University of Vienna
  - num: 7
    address: Dublin City University
  - num: 8
    address: Trinity College Dublin
  - num: 9
    address: University of Siegen
  - num: 10
    address: Beijing Foreign Studies University


logoleft_name: "../Figures/ICC_COL.svg"
author_textsize: "32pt"

contact:
  email: icc@ids-mannheim.de
  website: https://www.ids-mannheim.de/digspra/kl
  qrlink: >
    `r posterdown::qrlink("https://korap.ids-mannheim.de/instance/icc", "icc-logo-whitebg.svg")`

output:
  posterdown::posterdown_ids:
    self_contained: false
    keep_md: true

bibliography: ../tex/references.bib
csl: "https://raw.githubusercontent.com/ICLC-10/Zotero/master/styles/ICLC-10.csl"
---

```{r setup, include=FALSE, echo=FALSE, warning=FALSE}
knitr::opts_chunk$set(dev = 'svg', echo = FALSE, message = FALSE, warning = FALSE)
source("common.R")
```
# ICC aims & characteristics

* make available comparable corpora of many languages for contrastive linguistic research [@kirk_ice_2017]
* mostly based on existing corpora
* small corpora of 1M words each (400K written)
* pre-defined balanced composition
  * inspired by that of the ICE [@greenbaum_comparing_1996]

# Current launch of ICC written

* written parts for Chinese, Czech, English (mostly), German, Irish (partly), Norwegian publicly available
  * partially including UDPipe 2.0 annotations [@straka_udpipe_2018]
  * usable via Corpus Workbench or KorAP [@diewald_korap_2016]

```{r korap-query, fig.cap="KorAP UI for ICC-GER and ICC-NOR, showing annotation queries and layers, as well as a virtual corpus definition, based on ICC genre and publication date metadata."}
knitr::include_graphics("korap_query_ger-nor.svg")
```

## Composition of the ICC parts

```{r composition-by-genre, fig.cap="Actual composition of selected ICC parts with respect to ICC domain. (For the other ICC parts, the ICC genre metadatum was not yet accessible via the API at the editorial deadline.)", message = FALSE, fig.width=14, fig.height=10, out.width = "100%"}
# Token counts per language and genre, queried via virtual corpus definitions
icc_genre <- icc %>%
  expand_grid(genre) %>%
  mutate(vc = paste0("iccGenre=", genre)) %>%
  rowwise() %>%
  mutate(tokens = corpusStats(icc_con(lang, token), vc = vc)@tokens)

# Stacked bars per language, labelled with the token count of each genre
icc_genre %>% ggplot(aes(x = lang, fill = genre, y = tokens)) +
  geom_col() + scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size = 24) +
  theme(
    axis.title.x = element_text(size = rel(1.5), face = "bold"),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
    axis.text = element_text(size = rel(0.70)),
    legend.title = element_text(size = rel(0.85), face = "bold"),
    legend.text = element_text(size = rel(1))) +
  scale_fill_ids() +
  geom_text(aes(label = if_else(tokens > 0, as.character(tokens), ""), y = tokens), position = position_stack(reverse = FALSE, vjust = 0.5), color = "black", size = 6.2, family = "Fira Sans Condensed")

```


```{r composition-by-pubdate, fig.cap="Composition of the selected ICC parts with respect to year of publication.", message=F, warning=F, fig.width=14, fig.height=3, out.width = "80%"}
year <- c(1986:2023)

# Token counts per language and publication year
icc_year <- icc %>%
  expand_grid(year) %>%
  mutate(vc = paste0("pubDate in ", year)) %>%
  rowwise() %>%
  mutate(tokens = corpusStats(icc_con(lang, token), vc = vc)@tokens)

icc_year %>% ggplot(aes(x = year, fill = lang, color = lang, y = tokens)) +
  # geom_smooth(se=F, span=0.25) +
  xlim(1990, 2023) +
  ylim(0, NA) +
  stat_smooth(
    geom = 'area', method = 'loess', span = 1/4,
    alpha = 0.1) +
  # geom_area(alpha=0.1, position = "identity") +
  scale_fill_ids() + scale_colour_ids() +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size = 24) +
  theme(
    axis.title.x = element_text(size = rel(1.5), face = "bold"),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
    axis.text = element_text(size = rel(1)),
    legend.title = element_text(size = rel(1), face = "bold"),
    legend.text = element_text(size = rel(1)))
```


# Pilot study

* identification of light verb constructions (LVCs) with *take* in English, and with the corresponding lemmas in German and Norwegian
  * in order to explore the limitations imposed by the small corpus sizes
  * using RKorAPClient [@kupietz_rkorapclient_2020] to access the corpora and get reproducible results for the analyses


```{r take-icc-code, results='hide', echo=TRUE}
library(RKorAPClient)
new("KorAPConnection",
    KorAPUrl = "https://korap.ids-mannheim.de/instance/icc/eng",
    accessToken = Sys.getenv("KORAP_ICC_TOKEN_eng")) %>%
  collocationAnalysis(
    "focus({[ud/l=take]} [ud/p=NOUN])",
    leftContextSize = 0,
    rightContextSize = 1,
    minOccur = 2,
    addExamples = TRUE
  )
```

```{r take-icc, fig.cap="R code for, and results of, a co-occurrence analysis of *take* + NOUN in ICC-ENG, using the RKorAPClient package for R."}
take_ca_icc <-
  collocationAnalysis(
    icc_con("eng"),
    "focus({[ud/l=take]} [ud/p=NOUN])",
    leftContextSize = 0,
    rightContextSize = 1,
    minOccur = 2,
    addExamples = TRUE
  )

take_ca_icc %>% show_table()
```

## Results

* for English, the query for *take* + NOUN (as direct right neighbour) yields 10 different pairs with a minimum frequency of 2 (see Figure \@ref(fig:take-icc))
  * based on English Wikipedia [2015 snapshot, see @MargarethaLuengen2014], the query yields 139 pairs (log-dice threshold: 2.0), 44 of them false positives, i.e. 95 true positives
  * the ratio of true *take*-LVCs discovered in ICC to those discovered in Wikipedia is thus 10:95
* for ICC German with DeReKo as background corpus, the ratio of discovered true LVCs with *nehmen* (= *take*) is 10:89
* in both cases, not much more than 10% of the LVCs found in the larger corpus could be discovered in ICC

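The percentages behind the last bullet can be checked with a few lines of R. This is only a minimal sketch: the variable names are illustrative and not part of the poster's own code, the counts are copied from the bullets above, and it assumes that all 10 pairs found in each ICC part are true LVCs, as the reported ratios imply. It is shown as a plain, non-evaluated code block.

```r
# Counts as reported in the Results list (illustrative variable names)
icc_eng_lvcs   <- 10    # take + NOUN pairs found in ICC-ENG
wiki_pairs     <- 139   # pairs found in English Wikipedia
wiki_false_pos <- 44    # false positives among the Wikipedia pairs
icc_ger_lvcs   <- 10    # nehmen-LVCs found in ICC German
dereko_lvcs    <- 89    # true nehmen-LVCs found with DeReKo

# True positives in Wikipedia: 139 - 44 = 95
wiki_true_pos <- wiki_pairs - wiki_false_pos

# Share of the larger corpus' LVCs that the small ICC part recovers
share_eng <- icc_eng_lvcs / wiki_true_pos
share_ger <- icc_ger_lvcs / dereko_lvcs

# Percentages, rounded to one decimal place
print(round(100 * c(eng = share_eng, ger = share_ger), 1))
# eng: 10.5, ger: 11.2
```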

# Summary & Outlook

* we have made comparable corpora of 4+ languages available, readily usable for contrastive research
* however, even for fairly frequent phenomena, the results on the small corpora should be treated with caution
  * typically, they need to be verified on larger monolingual corpora
  * this applies especially to recall
* nevertheless, ICC can serve as a useful basis for contrastive studies
  * with a uniform UI and API that facilitate query and analysis
* in addition, ICC serves as a crystallisation point
  * for more ICC corpora and spoken parts to come
  * for larger corpora and complementary approaches, such as EuReCo

# References
