---
title: "News from the International Comparable Corpus"
subtitle: "First launch of ICC written"
date: "`r Sys.Date()`"
author:
  - name: Marc Kupietz
    affil: 1
  - name: Adrien Barbaresi
    affil: 2
  - name: Anna Čermáková
    affil: 3
  - name: Małgorzata Czachor
    affil: 4
  - name: Nils Diewald
    affil: 1
  - name: Jarle Ebeling
    affil: 5
  - name: Rafał L. Górski
    affil: 4
  - name: John Kirk
    affil: 6
  - name: Michal Křen
    affil: 3
  - name: Harald Lüngen
    affil: 1
  - name: Eliza Margaretha
    affil: 1
  - name: Signe Oksefjell Ebeling
    affil: 5
  - name: Mícheál Ó Meachair
    affil: 7
  - name: Ines Pisetta
    affil: 1
  - name: Elaine Uí Dhonnchadha
    affil: 8
  - name: Friedemann Vogel
    affil: 9
  - name: Rebecca Wilm
    affil: 1
  - name: Jiajin Xu
    affil: 10
  - name: Rameela Yaddehige
    affil: 1
affiliation:
  - num: 1
    address: IDS Mannheim
  - num: 2
    address: BBAW Berlin
  - num: 3
    address: Charles University
  - num: 4
    address: Polish Academy of Sciences
  - num: 5
    address: University of Oslo
  - num: 6
    address: University of Vienna
  - num: 7
    address: Dublin City University
  - num: 8
    address: Trinity College Dublin
  - num: 9
    address: University of Siegen
  - num: 10
    address: Beijing Foreign Studies University


logoleft_name: "../Figures/ICC_COL.svg"
author_textsize: "32pt"

contact:
  email: icc@ids-mannheim.de
  website: https://www.ids-mannheim.de/digspra/kl
  qrlink: >
    `r posterdown::qrlink("https://korap.ids-mannheim.de/instance/icc", "icc-logo-whitebg.svg")`

output:
  posterdown::posterdown_ids:
    self_contained: false
    keep_md: true

bibliography: ../tex/references.bib
csl: "https://raw.githubusercontent.com/ICLC-10/Zotero/master/styles/ICLC-10.csl"
---

```{r setup, include=FALSE, echo=FALSE, warning=FALSE}
# Render figures as SVG; by default hide code and suppress warnings in all chunks
knitr::opts_chunk$set(dev = 'svg', echo = FALSE, warning = FALSE)
# Shared helpers and data used by the chunks below
source("common.R")
```
# ICC aims & characteristics

* make available comparable corpora of many languages for contrastive linguistic research [@kirk_ice_2017]
* mostly based on existing corpora
* ICC has a pre-defined balanced composition
    * based on that of the International Corpus of English (ICE) [@greenbaum_comparing_1996]

# Current launch of ICC written

* written parts for Chinese, Czech, English, German, Irish (partly) and Norwegian are publicly available
    * partially including UDPipe 2.0 annotations [@straka_udpipe_2018]
    * via Corpus Workbench or KorAP [@diewald_korap_2016]

![](korap_query.png)

## Composition of the ICC parts

```{r composition-by-genre, fig.cap="Actual composition of selected ICC parts with respect to ICC genre. (For the other ICC parts, the ICC genre metadatum was not yet accessible via the API at the editorial deadline.)", message = FALSE, fig.width=14, fig.height=10, out.width = "100%"}
# Token counts per language and genre: one virtual-corpus query per combination
icc_genre <- icc %>%
  expand_grid(genre) %>%
  mutate(vc = paste0("iccGenre=", genre)) %>%
  rowwise() %>%
  mutate(tokens = corpusStats(icc_con(lang, token), vc = vc)@tokens)

# Stacked bars per language; each non-empty segment is labelled with its token count
icc_genre %>% ggplot(aes(x = lang, fill = genre, y = tokens)) +
  geom_col() +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size = 24) +
  theme(
    axis.title.x = element_text(size = rel(1.5), face = "bold"),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
    axis.text = element_text(size = rel(0.70)),
    legend.title = element_text(size = rel(0.85), face = "bold"),
    legend.text = element_text(size = rel(1))) +
  scale_fill_ids() +
  geom_text(
    aes(label = if_else(tokens > 0, as.character(tokens), ""), y = tokens),
    position = position_stack(reverse = FALSE, vjust = 0.5),
    color = "black", size = 6.2, family = "Fira Sans Condensed")

```

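The chunk above issues one request per combination of language and genre. The following is a minimal, offline sketch of how such a grid of virtual-corpus definitions is put together, written in base R so that it runs without the helpers from `common.R`. The language codes and genre labels are hypothetical stand-ins, not the actual ICC metadata values.

```r
# Toy stand-ins (assumptions): the real `icc` and `genre` objects come from common.R
langs  <- c("eng", "ger")
genres <- c("Blog", "Learned_Science")  # hypothetical genre labels

# One row per (language, genre) combination, analogous to tidyr::expand_grid()
grid <- expand.grid(lang = langs, genre = genres, stringsAsFactors = FALSE)

# Virtual-corpus definition that would accompany each statistics request
grid$vc <- paste0("iccGenre=", grid$genre)

print(grid)
```

Each `vc` string restricts the token count to the texts of one genre, which is why the number of requests grows with the product of languages and genres.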
```{r composition-by-pubdate, fig.cap="Composition of the selected ICC parts with respect to year of publication.", message=FALSE, warning=FALSE, fig.width=14, fig.height=3, out.width = "80%"}
year <- 1986:2023

# Token counts per language and publication year: one virtual-corpus query per combination
icc_year <- icc %>%
  expand_grid(year) %>%
  mutate(vc = paste0("pubDate in ", year)) %>%
  rowwise() %>%
  mutate(tokens = corpusStats(icc_con(lang, token), vc = vc)@tokens)

# Loess-smoothed area per language over the years of publication
icc_year %>% ggplot(aes(x = year, fill = lang, color = lang, y = tokens)) +
  # geom_smooth(se=F, span=0.25) +
  xlim(1990, 2023) +
  ylim(0, NA) +
  stat_smooth(
    geom = 'area', method = 'loess', span = 1/4,
    alpha = 0.1) +
  # geom_area(alpha=0.1, position = "identity") +
  scale_fill_ids() + scale_colour_ids() +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size = 24) +
  theme(
    axis.title.x = element_text(size = rel(1.5), face = "bold"),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
    axis.text = element_text(size = rel(1)),
    legend.title = element_text(size = rel(1), face = "bold"),
    legend.text = element_text(size = rel(1)))
```

# Pilot study

* identification of light verb constructions (LVCs) with *take*
* in order to investigate the limitations imposed by the very small corpus sizes
* using RKorAPClient [@kupietz_rkorapclient_2020] to access corpora and get reproducible results of the collocation analysis

```{r take-icc, echo=TRUE, fig.cap="Collocation analysis of *take* using the RKorAPClient package for R"}
take_ca_icc <-
  collocationAnalysis(
    icc_con("eng"),
    "focus({[ud/l=take]} [ud/p=NOUN])",  # lemma "take" directly followed by a noun
    leftContextSize = 0,
    rightContextSize = 1,
    minOccur = 2,
    addExamples = TRUE
  )

take_ca_icc %>% show_table()
```

## Results

* for English, the query for *take* + NOUN (as direct right neighbour) yields 10 different pairs with a minimum frequency of 2 (see Figure \@ref(fig:take-icc))
    * based on the English Wikipedia (2015), the query yields 139 pairs (logDice threshold: 2.0) with about 20 false positives
* for ICC German with DeReKo as background corpus, the ratio of true positive LVCs is 10/80

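To make the numbers above more tangible, here is a small sketch of the logDice association score in its commonly used definition (Rychlý 2008), together with the share of true positives implied by the reported counts. The frequencies in `candidates` are invented for illustration and are not taken from the ICC or Wikipedia. Whether the threshold on the poster was applied exactly like this is an assumption, as is the reading of 10/80 as 10 true positives among 80 candidates.

```r
# logDice in its commonly used definition: 14 + log2(2 * f_xy / (f_x + f_y))
log_dice <- function(f_xy, f_x, f_y) {
  14 + log2(2 * f_xy / (f_x + f_y))
}

# Hypothetical frequencies, for illustration only
candidates <- data.frame(
  collocate = c("place", "care", "table"),
  f_xy = c(40, 25, 1),         # co-occurrences of node and collocate
  f_x  = c(5000, 5000, 5000),  # frequency of the node (e.g. "take")
  f_y  = c(900, 300, 80000)    # frequency of the collocate
)
candidates$logDice <- log_dice(candidates$f_xy, candidates$f_x, candidates$f_y)

# Keep only pairs at or above the threshold of 2.0
kept <- candidates[candidates$logDice >= 2.0, ]
print(kept)

# Share of true positives among the candidate pairs
precision <- function(true_pos, total) true_pos / total
precision(10, 80)         # ICC German, reading 10/80 as 10 of 80: 0.125
precision(139 - 20, 139)  # English Wikipedia, about 20 false positives: roughly 0.86
```

Because logDice depends only on relative frequencies, scores are comparable across corpora of different sizes, but on very small corpora the underlying counts are too low for the score to be stable.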
# Summary & Outlook

* we have made corpora of 4+ languages available for contrastive research
* however, even for quite frequent phenomena, results from the small corpora are to be used with caution
    * typically they need to be verified on larger monolingual corpora
* the uniform access is in any case helpful for contrastive studies
* ICC also serves as a crystallization point for larger corpora and complementary approaches such as EuReCo

# References
