---
title: "News from the International Comparable Corpus"
subtitle: "First launch of ICC written"
date: "`r Sys.Date()`"
author:
  - name: Marc Kupietz
    affil: 1
  - name: Adrien Barbaresi
    affil: 2
  - name: Anna Čermáková
    affil: 3
  - name: Małgorzata Czachor
    affil: 4
  - name: Nils Diewald
    affil: 1
  - name: Jarle Ebeling
    affil: 5
  - name: Rafał L. Górski
    affil: 4
  - name: John Kirk
    affil: 6
  - name: Michal Křen
    affil: 3
  - name: Harald Lüngen
    affil: 1
  - name: Eliza Margaretha
    affil: 1
  - name: Signe Oksefjell Ebeling
    affil: 5
  - name: Mícheál Ó Meachair
    affil: 7
  - name: Ines Pisetta
    affil: 1
  - name: Elaine Uí Dhonnchadha
    affil: 8
  - name: Friedemann Vogel
    affil: 9
  - name: Rebecca Wilm
    affil: 1
  - name: Jiajin Xu
    affil: 10
  - name: Rameela Yaddehige
    affil: 1
affiliation:
  - num: 1
    address: IDS Mannheim
  - num: 2
    address: BBAW Berlin
  - num: 3
    address: Charles University
  - num: 4
    address: Polish Academy of Sciences
  - num: 5
    address: University of Oslo
  - num: 6
    address: University of Vienna
  - num: 7
    address: Dublin City University
  - num: 8
    address: Trinity College Dublin
  - num: 9
    address: University of Siegen
  - num: 10
    address: Beijing Foreign Studies University

logoleft_name: "../Figures/ICC_COL.svg"
author_textsize: "32pt"

contact:
  email: icc@ids-mannheim.de
  website: https://www.ids-mannheim.de/digspra/kl
  qrlink: >
    `r posterdown::qrlink("https://korap.ids-mannheim.de/instance/icc", "icc-logo-whitebg.svg")`

output:
  posterdown::posterdown_ids:
    self_contained: false
    keep_md: true

bibliography: ../tex/references.bib
csl: "https://raw.githubusercontent.com/ICLC-10/Zotero/master/styles/ICLC-10.csl"
---

```{r setup, include=FALSE, echo=FALSE, warning=FALSE}
# knitr's chunk option is `warning` (singular); `warnings` would be silently ignored
knitr::opts_chunk$set(dev = 'svg', echo = FALSE, warning = FALSE)
# common.R is expected to load the packages and define the helpers used below
# (icc, genre, icc_con(), show_table())
source("common.R")
```

# ICC aims & characteristics

* make comparable corpora of many languages available for contrastive linguistic research [@kirk_ice_2017]
* mostly based on existing corpora
* ICC has a pre-defined, balanced composition
  * based on that of the International Corpus of English (ICE) [@greenbaum_comparing_1996]

# Current launch of ICC written

* written parts for Chinese, Czech, English, German, Irish (partly) and Norwegian are publicly available
  * partially including UDPipe 2.0 annotations [@straka_udpipe_2018]
  * via Corpus Workbench or KorAP [@diewald_korap_2016]

![](korap_query.png)

## Composition of the ICC parts
### By ICC genre

```{r composition_by_genre, message = FALSE, fig.width=14, fig.height=10, out.width = "100%"}
# One row per language/genre pair, each with its own virtual corpus (vc) definition
icc_genre <- icc %>%
  expand_grid(genre) %>%
  mutate(vc = paste0("iccGenre=", genre)) %>%
  rowwise() %>%
  mutate(tokens = corpusStats(icc_con(lang, token), vc = vc)@tokens)

# Stacked bars: token counts per language, split by genre
icc_genre %>% ggplot(aes(x = lang, fill = genre, y = tokens)) +
  geom_col() +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size = 24) +
  theme(
    axis.title.x = element_text(size = rel(1.5), face = "bold"),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
    axis.text = element_text(size = rel(0.70)),
    legend.title = element_text(size = rel(0.85), face = "bold"),
    legend.text = element_text(size = rel(1))) +
  scale_fill_ids() +
  # Label every non-empty segment with its token count
  geom_text(
    aes(label = if_else(tokens > 0, as.character(tokens), ""), y = tokens),
    position = position_stack(reverse = FALSE, vjust = 0.5),
    color = "black", size = 6.2, family = "Fira Sans Condensed")
```
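
The chunk above builds one virtual corpus definition per language/genre pair before asking KorAP for token counts. The following non-evaluated sketch shows only that grid-building step, without any server access. The `icc` tibble and the `genre` labels here are hypothetical stand-ins for the objects that `common.R` presumably defines; the real language codes and genre names may differ.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-ins for the objects defined in common.R
icc <- tibble(lang = c("eng", "ger"))
genre <- c("Blog", "Fiction")  # illustrative labels, not the real ICC genre list

# Cross every language with every genre and derive the vc string per row
vc_grid <- icc %>%
  expand_grid(genre) %>%
  mutate(vc = paste0("iccGenre=", genre))

print(vc_grid)
```

Each row of `vc_grid` then corresponds to one `corpusStats()` request in the real chunk, which is why the number of requests grows with languages times genres.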

### By date of publication

```{r composition-by-pubdate, message=FALSE, warning=FALSE, fig.width=14, fig.height=3, out.width = "80%"}
year <- c(1986:2023)

# One virtual corpus per language and publication year
icc_year <- icc %>%
  expand_grid(year) %>%
  mutate(vc = paste0("pubDate in ", year)) %>%
  rowwise() %>%
  mutate(tokens = corpusStats(icc_con(lang, token), vc = vc)@tokens)

# Loess-smoothed token counts per year, one semi-transparent area per language
icc_year %>% ggplot(aes(x = year, fill = lang, color = lang, y = tokens)) +
  # geom_smooth(se=F, span=0.25) +
  xlim(1990, 2023) +
  ylim(0, NA) +
  stat_smooth(
    geom = 'area', method = 'loess', span = 1/4,
    alpha = 0.1) +
  # geom_area(alpha=0.1, position = "identity") +
  scale_fill_ids() + scale_colour_ids() +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size = 24) +
  theme(
    axis.title.x = element_text(size = rel(1.5), face = "bold"),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
    axis.text = element_text(size = rel(1)),
    legend.title = element_text(size = rel(1), face = "bold"),
    legend.text = element_text(size = rel(1)))
```

# Pilot study

* identification of light verb constructions (LVCs) with *take*
* aim: to investigate the limitations imposed by the very small corpus sizes
* using RKorAPClient [@kupietz_rkorapclient_2020] to access the corpora and obtain reproducible collocation analysis results

```{r take-icc, echo=TRUE, fig.cap="Collocation analysis of *take* using the RKorAPClient package for R"}
take_ca_icc <-
  collocationAnalysis(
    icc_con("eng"),
    "focus({[ud/l=take]} [ud/p=NOUN])",
    leftContextSize = 0,
    rightContextSize = 1,
    minOccur = 2,
    addExamples = TRUE
  )

take_ca_icc %>% show_table()
```

## Results

* for English, the query for *take* + NOUN (as direct right neighbour) yields 10 different pairs with a minimum frequency of 2 (see Figure \@ref(fig:take-icc))
  * based on the English Wikipedia (2015), the query yields 139 pairs (logDice threshold: 2.0), about 20 of them false positives
* for ICC German with DeReKo as background corpus, the ratio of true positive LVCs is 10/80
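
To make the figures above easier to compare, the following non-evaluated sketch spells out the two quantities involved. It assumes the standard definition of logDice, 14 + log2(2 f_xy / (f_x + f_y)), and reads the reported counts as precision, i.e. the share of true positives among the extracted candidate pairs. The function names and the frequencies passed to `log_dice()` are hypothetical and serve only as illustration.

```r
# logDice association score, assuming the standard definition:
# 14 + log2(2 * f_xy / (f_x + f_y))
log_dice <- function(f_xy, f_x, f_y) {
  14 + log2(2 * f_xy / (f_x + f_y))
}

# Share of true positives among the extracted candidate pairs
precision <- function(true_pos, candidates) {
  true_pos / candidates
}

# Hypothetical frequencies, only to show the scale of the score
log_dice(f_xy = 50, f_x = 1000, f_y = 600)

# Counts as reported in the results above
precision(139 - 20, 139)  # English Wikipedia: about 0.86
precision(10, 80)         # ICC German with DeReKo: 0.125
```

Under this reading, the large Wikipedia-based candidate list is mostly correct, whereas only one in eight candidates from the small German ICC part is a true LVC.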

# Summary & Outlook

* we have made corpora of 4+ languages available for contrastive research
* however, even for quite frequent phenomena, results from the small corpora should be used with caution
  * typically they need to be verified on larger monolingual corpora
* the uniform access is in any case helpful for contrastive studies
* ICC also serves as a crystallization point for larger corpora and complementary approaches such as EuReCo

# References