---
title: "News from the International Comparable Corpus"
subtitle: "First launch of ICC written"
date: "`r Sys.Date()`"
author:
  - name: Marc Kupietz
    affil: 1
  - name: Adrien Barbaresi
    affil: 2
  - name: Anna Čermáková
    affil: 3
  - name: Małgorzata Czachor
    affil: 4
  - name: Nils Diewald
    affil: 1
  - name: Jarle Ebeling
    affil: 5
  - name: Rafał L. Górski
    affil: 4
  - name: John Kirk
    affil: 6
  - name: Michal Křen
    affil: 3
  - name: Harald Lüngen
    affil: 1
  - name: Eliza Margaretha
    affil: 1
  - name: Signe Oksefjell Ebeling
    affil: 5
  - name: Mícheál Ó Meachair
    affil: 7
  - name: Ines Pisetta
    affil: 1
  - name: Elaine Uí Dhonnchadha
    affil: 8
  - name: Friedemann Vogel
    affil: 9
  - name: Rebecca Wilm
    affil: 1
  - name: Jiajin Xu
    affil: 10
  - name: Rameela Yaddehige
    affil: 1
affiliation:
  - num: 1
    address: IDS Mannheim
  - num: 2
    address: BBAW Berlin
  - num: 3
    address: Charles University
  - num: 4
    address: Polish Academy of Sciences
  - num: 5
    address: University of Oslo
  - num: 6
    address: University of Vienna
  - num: 7
    address: Dublin City University
  - num: 8
    address: Trinity College Dublin
  - num: 9
    address: University of Siegen
  - num: 10
    address: Beijing Foreign Studies University


logoleft_name: "../Figures/ICC_COL.svg"
author_textsize: "32pt"

contact:
  email: icc@ids-mannheim.de
  website: https://www.ids-mannheim.de/digspra/kl
  qrlink: >
    `r posterdown::qrlink("https://korap.ids-mannheim.de/instance/icc", "icc-logo-whitebg.svg")`

output:
  posterdown::posterdown_ids:
    self_contained: false
    keep_md: true

bibliography: ../tex/references.bib
csl: "https://raw.githubusercontent.com/ICLC-10/Zotero/master/styles/ICLC-10.csl"
---

```{r setup, include=FALSE, echo=FALSE, warning=FALSE}
# Render figures as SVG; hide code and warnings by default
knitr::opts_chunk$set(dev = 'svg', echo = FALSE, warning = FALSE)
# Shared helpers and data used by the chunks of this poster
source("common.R")
```
# ICC aims & characteristics

* make comparable corpora of many languages available for contrastive linguistic research [@kirk_ice_2017]
* mostly based on existing corpora
* ICC has a pre-defined “balanced” composition
    * based on that of the International Corpus of English (ICE) [@greenbaum_comparing_1996]

# Current launch of ICC written

* written parts for Chinese, Czech, English, German, Irish (partly), and Norwegian are publicly available
    * partially including UDPipe 2.0 annotations [@straka_udpipe_2018]

* usable via Corpus Workbench or KorAP [@diewald_korap_2016]

```{r korap-query, fig.cap="KorAP UI for ICC-GER and ICC-NOR, showing annotation queries and layers, as well as a virtual corpus definition, based on ICC genre and publication date metadata."}
knitr::include_graphics("korap_query_ger-nor.svg")
```

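The virtual corpus definitions mentioned in the caption are plain query strings. The following is a minimal sketch of how such strings could be composed in R from the `iccGenre` and `pubDate` fields used elsewhere in this poster's code. The helper name `vc_for` and the genre value `"Fiction"` are hypothetical, and joining the two conditions with `&` is an assumption about KorAP's corpus query syntax.

```r
# Hypothetical helper: build one virtual corpus definition per year,
# restricted to a single genre. The field names are taken from this poster's
# code; the genre value passed in below is only a placeholder.
vc_for <- function(genre, year) {
  paste0("iccGenre=", genre, " & pubDate in ", year)
}

# paste0() is vectorised, so a range of years yields one string per year
vcs <- vc_for("Fiction", 2000:2002)
print(vcs)
```

Each resulting string could then be passed as the `vc` argument of `corpusStats()`, in the same way as the single-condition definitions in this poster's own chunks.
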
## Composition of the ICC parts

```{r composition-by-genre, fig.cap="Actual composition of selected ICC parts with respect to ICC domain. (For the other ICC parts, the ICC genre metadatum was not yet accessible via the API at the editorial deadline.)", message = FALSE, fig.width=14, fig.height=10, out.width = "100%"}
# One row per language/genre pair; token counts are fetched for the
# virtual corpus restricted to that genre
icc_genre <- icc %>%
  expand_grid(genre) %>%
  mutate(vc = paste0("iccGenre=", genre)) %>%
  rowwise() %>%
  mutate(tokens = corpusStats(icc_con(lang, token), vc = vc)@tokens)

# Stacked bars per language, each segment labelled with its token count
icc_genre %>% ggplot(aes(x = lang, fill = genre, y = tokens)) +
  geom_col() +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size = 24) +
  theme(
    axis.title.x = element_text(size = rel(1.5), face = "bold"),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
    axis.text = element_text(size = rel(0.70)),
    legend.title = element_text(size = rel(0.85), face = "bold"),
    legend.text = element_text(size = rel(1))) +
  scale_fill_ids() +
  geom_text(
    aes(label = if_else(tokens > 0, as.character(tokens), ""), y = tokens),
    position = position_stack(reverse = FALSE, vjust = 0.5),
    color = "black", size = 6.2, family = "Fira Sans Condensed")

```


```{r composition-by-pubdate, fig.cap="Composition of the selected ICC parts with respect to year of publication.", message=FALSE, warning=FALSE, fig.width=14, fig.height=3, out.width = "80%"}
year <- c(1986:2023)

# One row per language/year pair with the token count for that publication year
icc_year <- icc %>%
  expand_grid(year) %>%
  mutate(vc = paste0("pubDate in ", year)) %>%
  rowwise() %>%
  mutate(tokens = corpusStats(icc_con(lang, token), vc = vc)@tokens)

icc_year %>% ggplot(aes(x = year, fill = lang, color = lang, y = tokens)) +
  # geom_smooth(se=F, span=0.25) +
  xlim(1990, 2023) +
  # this y scale is replaced by the scale_y_continuous() call further down
  ylim(0, NA) +
  stat_smooth(
    geom = 'area', method = 'loess', span = 1/4,
    alpha = 0.1) +
  # geom_area(alpha=0.1, position = "identity") +
  scale_fill_ids() + scale_colour_ids() +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size = 24) +
  theme(
    axis.title.x = element_text(size = rel(1.5), face = "bold"),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
    axis.text = element_text(size = rel(1)),
    legend.title = element_text(size = rel(1), face = "bold"),
    legend.text = element_text(size = rel(1)))
```


# Pilot study

* identification of light verb constructions (LVCs) with *take*
    * in order to investigate the limitations imposed by the very small corpus sizes
    * using RKorAPClient [@kupietz_rkorapclient_2020] to access the corpora and obtain reproducible collocation analysis results


```{r take-icc, echo=TRUE, fig.cap="Collocation analysis of *take* using the RKorAPClient package for R"}
take_ca_icc <-
  collocationAnalysis(
    icc_con("eng"),
    "focus({[ud/l=take]} [ud/p=NOUN])",
    leftContextSize = 0,
    rightContextSize = 1,
    minOccur = 2,
    addExamples = TRUE
  )

take_ca_icc %>% show_table()
```

## Results

* for English, the query for *take* + NOUN (as direct right neighbour) yields 10 different pairs with a minimum frequency of 2 (see Figure \@ref(fig:take-icc))
* based on English Wikipedia (2015), the query yields 139 pairs (log-dice threshold: 2.0), about 20 of them false positives
* for ICC German with DeReKo as background corpus, the ratio of true positive LVCs is 10/80

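To make the log-dice threshold above more concrete, here is a minimal sketch of the association score, assuming the widely used definition logDice = 14 + log2(2·f(x,y) / (f(x) + f(y))). The function name `log_dice` and all frequencies are hypothetical and not taken from the ICC or Wikipedia data; the implementation in RKorAPClient may differ in details such as window-size scaling.

```r
# Assumed definition: logDice = 14 + log2(2 * f_xy / (f_x + f_y))
# f_xy: co-occurrence frequency; f_x, f_y: frequencies of node and collocate
log_dice <- function(f_xy, f_x, f_y) {
  14 + log2(2 * f_xy / (f_x + f_y))
}

# Hypothetical frequencies for a node such as *take* and one noun collocate
f_node <- 1000
f_collocate <- 200
f_pair <- 2

score <- log_dice(f_pair, f_node, f_collocate)

# A pair is kept if its score reaches the threshold used above (2.0)
passes <- score >= 2.0
print(c(score = score, passes = passes))
```

Under this definition the score is bounded above by 14, which is reached only when two words occur exclusively together.
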
# Summary & Outlook

* we have made corpora of 4+ languages available for contrastive research
* however, even for quite frequent phenomena, results from the small corpora should be used with caution
    * typically, they need to be verified on larger monolingual corpora
* in any case, the uniform access is helpful for contrastive studies
* ICC also serves as a crystallization point for larger corpora and complementary approaches such as EuReCo

# References
