---
title: "News from the International Comparable Corpus"
subtitle: "First launch of ICC written"
date: "`r Sys.Date()`"
author:
  - name: Marc Kupietz
    affil: 1
  - name: Adrien Barbaresi
    affil: 2
  - name: Anna Čermáková
    affil: 3
  - name: Małgorzata Czachor
    affil: 4
  - name: Nils Diewald
    affil: 1
  - name: Jarle Ebeling
    affil: 5
  - name: Rafał L. Górski
    affil: 4
  - name: John Kirk
    affil: 6
  - name: Michal Křen
    affil: 3
  - name: Harald Lüngen
    affil: 1
  - name: Eliza Margaretha
    affil: 1
  - name: Signe Oksefjell Ebeling
    affil: 5
  - name: Mícheál Ó Meachair
    affil: 7
  - name: Ines Pisetta
    affil: 1
  - name: Elaine Uí Dhonnchadha
    affil: 8
  - name: Friedemann Vogel
    affil: 9
  - name: Rebecca Wilm
    affil: 1
  - name: Jiajin Xu
    affil: 10
  - name: Rameela Yaddehige
    affil: 1
affiliation:
  - num: 1
    address: IDS Mannheim
  - num: 2
    address: BBAW Berlin
  - num: 3
    address: Charles University
  - num: 4
    address: Polish Academy of Sciences
  - num: 5
    address: University of Oslo
  - num: 6
    address: University of Vienna
  - num: 7
    address: Dublin City University
  - num: 8
    address: Trinity College Dublin
  - num: 9
    address: University of Siegen
  - num: 10
    address: Beijing Foreign Studies University


logoleft_name: "../Figures/ICC_COL.svg"
author_textsize: "32pt"

contact:
  email: icc@ids-mannheim.de
  website: https://www.ids-mannheim.de/digspra/kl
  qrlink: >
    `r posterdown::qrlink("https://korap.ids-mannheim.de/instance/icc", "icc-logo-whitebg.svg")`

output:
  posterdown::posterdown_ids:
    self_contained: false
    keep_md: true

bibliography: ../tex/references.bib
csl: "https://raw.githubusercontent.com/ICLC-10/Zotero/master/styles/ICLC-10.csl"
---

```{r setup, include=FALSE, echo=FALSE, warning=FALSE}
# The knitr chunk option is `warning` (singular); `warnings` is not a recognized option
knitr::opts_chunk$set(dev = 'svg', echo = FALSE, warning = FALSE)
source("common.R")
```

# ICC aims & characteristics

* make available comparable corpora of many languages for contrastive linguistic research [@kirk_ice_2017]
* mostly based on existing corpora
* ICC has a pre-defined balanced composition
  * based on that of the ICE [@greenbaum_comparing_1996]

# Current launch of ICC written

* written parts for Chinese, Czech, English, German, Irish (partly), and Norwegian are publicly available
  * partially including UDPipe 2.0 annotations [@straka_udpipe_2018]
  * usable via Corpus Workbench or KorAP [@diewald_korap_2016]

```{r korap-query, fig.cap="KorAP UI for ICC-GER and ICC-NOR, showing annotation queries and layers, as well as a virtual corpus definition, based on ICC genre and publication date metadata."}
knitr::include_graphics("korap_query_ger-nor.svg")
```

## Composition of the ICC parts

```{r composition-by-genre, fig.cap="Actual composition of selected ICC parts with respect to ICC domain. (For the other ICC parts, the ICC genre metadatum was not yet accessible via the API at the editorial deadline.)", message = FALSE, fig.width=14, fig.height=10, out.width = "100%"}
# One row per language/genre pair; tokens are counted in a virtual corpus
# restricted to that genre
icc_genre <- icc %>%
  expand_grid(genre) %>%
  mutate(vc = paste0("iccGenre=", genre)) %>%
  rowwise() %>%
  mutate(tokens = corpusStats(icc_con(lang, token), vc = vc)@tokens)

# Stacked bars per language, labelled with the token count of each genre
icc_genre %>% ggplot(aes(x = lang, fill = genre, y = tokens)) +
  geom_col() +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size = 24) +
  theme(
    axis.title.x = element_text(size = rel(1.5), face = "bold"),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
    axis.text = element_text(size = rel(0.70)),
    legend.title = element_text(size = rel(0.85), face = "bold"),
    legend.text = element_text(size = rel(1))) +
  scale_fill_ids() +
  geom_text(
    aes(label = if_else(tokens > 0, as.character(tokens), ""), y = tokens),
    position = position_stack(reverse = FALSE, vjust = 0.5),
    color = "black", size = 6.2, family = "Fira Sans Condensed")
```
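
The virtual corpus definitions used for the composition figure are plain strings built from a grid of languages and genres. The following is a minimal sketch of that step only, without any KorAP access. It uses base R `expand.grid()` in place of `expand_grid()` so that it runs without extra packages, and the language codes and genre labels are placeholders chosen for illustration, not the actual ICC metadata values.

```r
# Placeholder inputs (assumptions, not the real ICC values from common.R)
demo_langs  <- c("eng", "ger", "nor")
demo_genres <- c("GenreA", "GenreB")

# One row per language/genre combination
vc_grid <- expand.grid(
  lang = demo_langs,
  genre = demo_genres,
  stringsAsFactors = FALSE
)

# Virtual corpus definition string for each row, in the same form as above
vc_grid$vc <- paste0("iccGenre=", vc_grid$genre)

print(vc_grid)
```

Each `vc` string could then be handed to a statistics request for the matching language, as the chunk above does with `corpusStats()`.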


```{r composition-by-pubdate, fig.cap="Composition of the selected ICC parts with respect to year of publication.", message=F, warning=F, fig.width=14, fig.height=3, out.width = "80%"}
year <- 1986:2023

# One row per language/year pair; tokens are counted in a virtual corpus
# restricted to that publication year
icc_year <- icc %>%
  expand_grid(year) %>%
  mutate(vc = paste0("pubDate in ", year)) %>%
  rowwise() %>%
  mutate(tokens = corpusStats(icc_con(lang, token), vc = vc)@tokens)

# Loess-smoothed area per language over publication years
icc_year %>% ggplot(aes(x = year, fill = lang, color = lang, y = tokens)) +
  # geom_smooth(se=F, span=0.25) +
  xlim(1990, 2023) +
  ylim(0, NA) +
  stat_smooth(
    geom = 'area', method = 'loess', span = 1/4,
    alpha = 0.1) +
  # geom_area(alpha=0.1, position = "identity") +
  scale_fill_ids() + scale_colour_ids() +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size = 24) +
  theme(
    axis.title.x = element_text(size = rel(1.5), face = "bold"),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
    axis.text = element_text(size = rel(1)),
    legend.title = element_text(size = rel(1), face = "bold"),
    legend.text = element_text(size = rel(1)))
```


# Pilot study

* identification of light verb constructions (LVCs) with *take*
* in order to investigate the limitations imposed by the very small corpus sizes
* using RKorAPClient [@kupietz_rkorapclient_2020] to access corpora and get reproducible results of the collocation analysis


```{r take-icc, echo=TRUE, fig.cap="Collocation analysis of *take* using the RKorAPClient package for R"}
take_ca_icc <-
  collocationAnalysis(
    icc_con("eng"),
    "focus({[ud/l=take]} [ud/p=NOUN])",
    leftContextSize = 0,
    rightContextSize = 1,
    minOccur = 2,
    addExamples = TRUE
  )

take_ca_icc %>% show_table()
```
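
To make the selection criterion concrete, here is a toy sketch in base R of a simplified reading of the query above: keep every token whose lemma is *take* and whose direct right neighbour is tagged `NOUN`, then keep the noun lemmas that occur at least twice (mirroring `minOccur = 2`). The token table is invented for illustration, and the sketch does not reproduce KorAP's query evaluation or any association scoring.

```r
# Invented example tokens (lemma plus universal POS tag)
toy_tokens <- data.frame(
  lemma = c("she", "take", "care", "and", "take", "part", "then",
            "take", "care", "take", "a", "walk"),
  upos  = c("PRON", "VERB", "NOUN", "CCONJ", "VERB", "NOUN", "ADV",
            "VERB", "NOUN", "VERB", "DET", "NOUN"),
  stringsAsFactors = FALSE
)

n <- nrow(toy_tokens)

# Position i qualifies if token i is "take" and token i + 1 is a NOUN
is_take      <- toy_tokens$lemma[-n] == "take"
next_is_noun <- toy_tokens$upos[-1] == "NOUN"
noun_collocates <- toy_tokens$lemma[-1][is_take & next_is_noun]

# Frequency of each take + NOUN pair, then the minimum-occurrence filter
pair_counts <- table(noun_collocates)
frequent_pairs <- names(pair_counts)[as.vector(pair_counts) >= 2]

print(pair_counts)
print(frequent_pairs)
```

In this toy data *take a walk* is not counted because the noun is not the direct right neighbour, which is the kind of case a context size of 1 leaves out.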

## Results

* for English the query for *take* + NOUN (as direct right neighbour) yields 10 different pairs with a minimum frequency of 2 (see Figure \@ref(fig:take-icc))
  * based on English Wikipedia (2015) the query yields 139 pairs (log-dice-threshold: 2.0) with about 20 false positives
* for ICC German with DeReKo as background corpus, the ratio of true positive LVCs is 10/80
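
The figures above can be turned into rough precision values with a few lines of R. The sketch also includes the commonly used definition of the logDice association score, 14 + log2(2 f_xy / (f_x + f_y)), on the assumption that the log-dice threshold mentioned above refers to this measure. The function names are made up for this example, and "about 20 false positives" is treated as exactly 20.

```r
# logDice in its commonly used form; f_xy is the co-occurrence frequency,
# f_x and f_y are the frequencies of node and collocate (assumed definition)
log_dice <- function(f_xy, f_x, f_y) {
  14 + log2(2 * f_xy / (f_x + f_y))
}

# Share of true positives among all candidate pairs
precision <- function(true_positives, candidates) {
  true_positives / candidates
}

# English Wikipedia (2015): 139 pairs, about 20 of them false positives
precision_wikipedia <- precision(139 - 20, 139)

# ICC German with DeReKo as background corpus: 10 true positives out of 80
precision_icc_ger <- precision(10, 80)

print(round(precision_wikipedia, 3))
print(precision_icc_ger)

# Two items that only ever occur together reach the maximum score of 14
print(log_dice(5, 5, 5))
```

Read this way, roughly 86 % of the Wikipedia candidates are true positives compared with 12.5 % for the German ICC part.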

# Summary & Outlook

* we have made corpora of 4+ languages available for contrastive research
* however, even with quite frequent phenomena, the results on the small corpora are to be used with caution
  * typically they need to be verified on larger monolingual corpora
* the uniform access is in any case helpful for contrastive studies
* ICC also serves as a crystallization point for larger corpora and complementary approaches such as EuReCo

# References

