---
title: "News from the International Comparable Corpus"
subtitle: "First launch of ICC written"
date: "`r Sys.Date()`"
author:
  - name: Marc Kupietz
    affil: 1
  - name: Adrien Barbaresi
    affil: 2
  - name: Anna Čermáková
    affil: 3
  - name: Małgorzata Czachor
    affil: 4
  - name: Nils Diewald
    affil: 1
  - name: Jarle Ebeling
    affil: 5
  - name: Rafał L. Górski
    affil: 4
  - name: John Kirk
    affil: 6
  - name: Michal Křen
    affil: 3
  - name: Harald Lüngen
    affil: 1
  - name: Eliza Margaretha
    affil: 1
  - name: Signe Oksefjell Ebeling
    affil: 5
  - name: Mícheál Ó Meachair
    affil: 7
  - name: Ines Pisetta
    affil: 1
  - name: Elaine Uí Dhonnchadha
    affil: 8
  - name: Friedemann Vogel
    affil: 9
  - name: Rebecca Wilm
    affil: 1
  - name: Jiajin Xu
    affil: 10
  - name: Rameela Yaddehige
    affil: 1
affiliation:
  - num: 1
    address: IDS Mannheim
  - num: 2
    address: BBAW Berlin
  - num: 3
    address: Charles University
  - num: 4
    address: Polish Academy of Sciences
  - num: 5
    address: University of Oslo
  - num: 6
    address: University of Vienna
  - num: 7
    address: Dublin City University
  - num: 8
    address: Trinity College Dublin
  - num: 9
    address: University of Siegen
  - num: 10
    address: Beijing Foreign Studies University

logoleft_name: "../Figures/ICC_COL.svg"
author_textsize: "32pt"

contact:
  email: icc@ids-mannheim.de
  website: https://www.ids-mannheim.de/digspra/kl
  qrlink: >
    `r posterdown::qrlink("https://korap.ids-mannheim.de/instance/icc", "icc-logo-whitebg.svg")`

output:
  posterdown::posterdown_ids:
    self_contained: false
    keep_md: true

lang: en
bibliography: ../tex/references.bib
csl: "https://raw.githubusercontent.com/ICLC-10/Zotero/master/styles/ICLC-10.csl"
---

```{r setup, include=FALSE, echo=FALSE, warning=FALSE}
# knitr's chunk option is `warning` (singular)
knitr::opts_chunk$set(dev = 'svg', echo = FALSE, message = FALSE, warning = FALSE)
source("common.R")
```

# ICC aims & characteristics

* make available comparable corpora of many languages for contrastive linguistic research [@kirk_ice_2017]
* mostly based on existing corpora
* small corpora with 1M words each (400K written)
* pre-defined “balanced” composition
    * inspired by that of the International Corpus of English (ICE) [@greenbaum_comparing_1996]

# Current launch of ICC written

* written parts for Chinese, Czech, English (mostly), German, Irish (partly) and Norwegian are publicly available
* partly annotated with UDPipe 2.0 [@straka_udpipe_2018]
* usable via CWB or KorAP [@diewald_korap_2016] ➝ QR Code on the left

```{r korap-query, fig.cap="KorAP UI for ICC-GER and ICC-NOR, showing annotation queries and layers, as well as a virtual corpus definition, based on ICC genre and publication date metadata."}
knitr::include_graphics("korap_query_ger-nor.svg")
```

## Composition of the ICC parts

```{r composition-by-genre, fig.cap="Actual composition of selected ICC parts with respect to ICC genre. (For the other ICC parts, the ICC genre metadatum was not yet accessible via the API at the editorial deadline.)", message = FALSE, fig.width=14, fig.height=10, out.width = "100%"}
# one row per language/genre pair; token counts are fetched per virtual corpus
icc_genre <- icc %>%
  expand_grid(genre) %>%
  mutate(vc = paste0("iccGenre=", genre)) %>%
  rowwise() %>%
  mutate(tokens = corpusStats(icc_con(lang, token), vc = vc)@tokens)

# stacked bars per language, each segment labelled with its token count
icc_genre %>% ggplot(aes(x = lang, fill = genre, y = tokens)) +
  geom_col() +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size = 24) +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
    axis.text = element_text(size = rel(0.70)),
    legend.title = element_text(size = rel(0.85), face = "bold"),
    legend.text = element_text(size = rel(1))) +
  scale_fill_ids() +
  geom_text(
    aes(label = if_else(tokens > 0, as.character(tokens), ""), y = tokens),
    position = position_stack(reverse = FALSE, vjust = 0.5),
    color = "black", size = 6.2, family = "Fira Sans Condensed")
```
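The chunk above builds one virtual corpus (VC) definition per language and genre and then asks KorAP for its token count. A minimal offline sketch of just the string-building step might look as follows. The language codes and genre labels are placeholders, not the actual values defined in `common.R`, and no KorAP request is made.

```r
library(dplyr)
library(tidyr)

# Placeholder inputs: the real `icc` table and `genre` vector come from common.R
icc_demo <- tibble(lang = c("eng", "ger"))
genre_demo <- c("genre_a", "genre_b")

# One row per language/genre pair, each with its virtual corpus definition
vc_demo <- icc_demo %>%
  expand_grid(genre = genre_demo) %>%
  mutate(vc = paste0("iccGenre=", genre))

print(vc_demo)
```

In the poster's code, each `vc` string is then passed to `corpusStats()`, and the `@tokens` slot of the result supplies the bar heights.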

```{r composition-by-pubdate, fig.cap="Composition of the selected ICC parts with respect to year of publication.", message=FALSE, warning=FALSE, fig.width=14, fig.height=3, out.width = "80%"}
year <- 1986:2023

# one row per language/year pair; token counts are fetched per virtual corpus
icc_year <- icc %>%
  expand_grid(year) %>%
  mutate(vc = paste0("pubDate in ", year)) %>%
  rowwise() %>%
  mutate(tokens = corpusStats(icc_con(lang, token), vc = vc)@tokens)

icc_year %>% ggplot(aes(x = year, fill = lang, color = lang, y = tokens)) +
  # geom_smooth(se=F, span=0.25) +
  xlim(1990, 2023) +
  # note: this y scale is replaced by scale_y_continuous() further down
  ylim(0, NA) +
  stat_smooth(
    geom = 'area', method = 'loess', span = 1/4,
    alpha = 0.1) +
  # geom_area(alpha=0.1, position = "identity") +
  scale_fill_ids() + scale_colour_ids() +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size = 24) +
  theme(
    axis.title.x = element_text(size = rel(1.5), face = "bold"),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
    axis.text = element_text(size = rel(1)),
    legend.title = element_text(size = rel(1), face = "bold"),
    legend.text = element_text(size = rel(1)))
```

# Pilot study

* identification of light verb constructions (LVCs) with *take* in English, and with the corresponding lemmas in German and Norwegian
* to explore the limitations imposed by the small corpus sizes
* using RKorAPClient [@kupietz_rkorapclient_2020] to access the corpora and get reproducible results for the analyses

```{r take-icc-code, results='hide', echo=TRUE}
library(RKorAPClient)
new("KorAPConnection",
    KorAPUrl = "https://korap.ids-mannheim.de/instance/icc/eng",
    accessToken = Sys.getenv("KORAP_ICC_TOKEN_eng")) %>%
  collocationAnalysis(
    "focus({[ud/l=take]} [ud/p=NOUN])",
    leftContextSize = 0,
    rightContextSize = 1,
    minOccur = 2,
    addExamples = TRUE)
```

```{r take-icc, fig.cap="R code for, and results of, a co-occurrence analysis of *take* + NOUN in ICC-ENG, using the RKorAPClient package."}
take_ca_icc <-
  collocationAnalysis(
    icc_con("eng"),
    "focus({[ud/l=take]} [ud/p=NOUN])",
    leftContextSize = 0,
    rightContextSize = 1,
    minOccur = 2,
    addExamples = TRUE
  )

take_ca_icc %>% show_table()
```

## Results

* for English, the query for *take* + NOUN (as direct right neighbour) yields 10 different pairs with a minimum frequency of 2 (see Figure \@ref(fig:take-icc))
* based on English Wikipedia [2015 snapshot, see @MargarethaLuengen2014], the query yields 139 pairs (logDice threshold: 2.0), 44 of them false positives, i.e. 95 true positives
* the ratio of true-positive *take* LVCs discovered in ICC to those discovered in Wikipedia is thus 10:95
* for ICC German with DeReKo as background corpus, the ratio of discovered true LVCs with ›nehmen‹ (=take) is 10:89
* in both cases, ICC yields little more than 10% of the LVCs found in the larger corpus
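The percentages behind the last bullet follow directly from the counts reported above. As a quick check in R (the variable names are illustrative, and the German figures assume that the 10:89 ratio compares ICC German with DeReKo in the same way as the English comparison):

```r
# Counts reported in the bullets above
icc_eng_tp <- 10    # take + NOUN pairs discovered in ICC-ENG
wiki_pairs <- 139   # candidate pairs discovered in English Wikipedia
wiki_fp    <- 44    # false positives among the Wikipedia pairs
icc_ger_tp <- 10    # nehmen-LVCs discovered in ICC German (assumed reading of 10:89)
dereko_tp  <- 89    # nehmen-LVCs discovered in DeReKo (assumed reading of 10:89)

wiki_tp   <- wiki_pairs - wiki_fp      # true positives in Wikipedia
share_eng <- icc_eng_tp / wiki_tp      # share of Wikipedia LVCs also found in ICC-ENG
share_ger <- icc_ger_tp / dereko_tp    # share of DeReKo LVCs also found in ICC German

cat(sprintf("Wikipedia true positives: %d\n", wiki_tp))
cat(sprintf("ICC-ENG share: %.1f%%, ICC-GER share: %.1f%%\n",
            100 * share_eng, 100 * share_ger))
```

Both shares come out slightly above 10%, which is what the summary bullet states.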

# Summary & Outlook

* we have made comparable corpora of 4+ languages available, readily usable for contrastive research
* however, even for fairly frequent phenomena, the results on the small corpora should be treated with caution
    * typically, they need to be verified on larger monolingual corpora
    * this applies especially to recall
* nevertheless, ICC can serve as a useful basis for contrastive studies
    * with a uniform UI and API that facilitate query and analysis
* in addition, ICC also serves as a crystallisation point
    * for more ICC corpora and spoken parts to come
    * for larger corpora and complementary approaches, such as EuReCo

# References