---
title: "News from the International Comparable Corpus"
subtitle: "First launch of ICC written"
date: "`r Sys.Date()`"
author:
  - name: Marc Kupietz
    affil: 1
  - name: Adrien Barbaresi
    affil: 2
  - name: Anna Čermáková
    affil: 3
  - name: Małgorzata Czachor
    affil: 4
  - name: Nils Diewald
    affil: 1
  - name: Jarle Ebeling
    affil: 5
  - name: Rafał L. Górski
    affil: 4
  - name: John Kirk
    affil: 6
  - name: Michal Křen
    affil: 3
  - name: Harald Lüngen
    affil: 1
  - name: Eliza Margaretha
    affil: 1
  - name: Signe Oksefjell Ebeling
    affil: 5
  - name: Mícheál Ó Meachair
    affil: 7
  - name: Ines Pisetta
    affil: 1
  - name: Elaine Uí Dhonnchadha
    affil: 8
  - name: Friedemann Vogel
    affil: 9
  - name: Rebecca Wilm
    affil: 1
  - name: Jiajin Xu
    affil: 10
  - name: Rameela Yaddehige
    affil: 1
affiliation:
  - num: 1
    address: IDS Mannheim
  - num: 2
    address: BBAW Berlin
  - num: 3
    address: Charles University
  - num: 4
    address: Polish Academy of Sciences
  - num: 5
    address: University of Oslo
  - num: 6
    address: University of Vienna
  - num: 7
    address: Dublin City University
  - num: 8
    address: Trinity College Dublin
  - num: 9
    address: University of Siegen
  - num: 10
    address: Beijing Foreign Studies University


logoleft_name: "../Figures/ICC_COL.svg"
author_textsize: "32pt"

contact:
  email: icc@ids-mannheim.de
  website: https://www.ids-mannheim.de/digspra/kl
  qrlink: >
    `r posterdown::qrlink("https://korap.ids-mannheim.de/instance/icc")`

output:
  posterdown::posterdown_ids:
    self_contained: false
    keep_md: true

bibliography: ../tex/references.bib
csl: "https://raw.githubusercontent.com/ICLC-10/Zotero/master/styles/ICLC-10.csl"
---

```{r setup, include=FALSE, echo=FALSE, warning=FALSE}
# Global chunk defaults: SVG graphics, no code echo, no warnings
# (the knitr option is `warning`, singular)
knitr::opts_chunk$set(dev = 'svg', echo = FALSE, warning = FALSE)
# Shared setup for the objects and helpers used in the chunks below
source("common.R")
```
# ICC aims & characteristics
* make comparable corpora of many languages available for contrastive linguistic research [@cermakova_international_2021]
* mostly based on existing corpora
* ICC has a pre-defined “balanced” composition
  * based on that of the ICE [@greenbaum_comparing_1996]

# Current launch of ICC written

* written parts for Chinese, Czech, English, German, Irish (partly) and Norwegian are publicly available
  * partially including UDPipe 2.0 annotations [@straka_udpipe_2018]
  * via Corpus Workbench or KorAP [@diewald_korap_2016]

![](korap_query.png)

## Composition of the ICC parts
### By ICC genre

```{r composition_by_genre, message = FALSE, fig.width=14, fig.height=10, out.width = "100%"}
# Token count per language and ICC genre: one virtual corpus (vc) query per genre
icc_genre <- icc %>%
  expand_grid(genre) %>%
  mutate(vc = paste0("iccGenre=", genre)) %>%
  rowwise() %>%
  mutate(tokens = corpusStats(icc_con(lang, token), vc = vc)@tokens)

# Stacked bars per language; each non-empty segment is labelled with its token count
icc_genre %>% ggplot(aes(x = lang, fill = genre, y = tokens)) +
  geom_col() + scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size = 24) +
  theme(
    axis.title.x = element_text(size = rel(1.5), face = "bold"),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
    axis.text = element_text(size = rel(0.70)),
    legend.title = element_text(size = rel(0.85), face = "bold"),
    legend.text = element_text(size = rel(1))) +
  scale_fill_ids() +
  geom_text(
    aes(label = if_else(tokens > 0, as.character(tokens), ""), y = tokens),
    position = position_stack(reverse = FALSE, vjust = 0.5),
    color = "black", size = 6.2, family = "Fira Sans Condensed")
```

### By date of publication

```{r composition-by-pubdate, message=F, warning=F, fig.width=14, fig.height=3, out.width = "80%"}
year <- c(1986:2023)

# Token count per language and publication year: one virtual corpus query per year
icc_year <- icc %>%
  expand_grid(year) %>%
  mutate(vc = paste0("pubDate in ", year)) %>%
  rowwise() %>%
  mutate(tokens = corpusStats(icc_con(lang, token), vc = vc)@tokens)

icc_year %>% ggplot(aes(x=year, fill=lang, color=lang, y=tokens)) +
  # geom_smooth(se=F, span=0.25) +
  # years before 1990 are queried above but dropped from the plot by xlim()
  xlim(1990, 2023) +
  ylim(0, NA) +
  stat_smooth(
    geom = 'area', method = 'loess', span = 1/4,
    alpha = 0.1) +
  # geom_area(alpha=0.1, position = "identity") +
  scale_fill_ids() + scale_colour_ids() +
  # note: this y scale replaces the one set by ylim() above
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  theme_ids(base_size=24) +
  theme(
    axis.title.x = element_text(size = rel(1.5), face = "bold"),
    axis.title.y = element_text(size = rel(1.5), face = "bold"),
    axis.text = element_text(size = rel(1)),
    legend.title = element_text(size = rel(1), face = "bold"),
    legend.text = element_text(size = rel(1)))
```
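
The two composition chunks above share one pattern: the corpus table is crossed with a vector of values, and each combination is turned into a virtual-corpus query string that is passed to `corpusStats()`. The following sketch reproduces only the string-building step, offline. It rests on assumptions: the language codes and genre labels are hypothetical placeholders (the real `icc` and `genre` objects are not defined in this file), and base R's `expand.grid()` stands in for `expand_grid()`, so the row order may differ from the original.

```r
# Hypothetical stand-ins for the objects used in the chunks above
langs  <- c("eng", "ger")
genres <- c("genreA", "genreB")   # placeholder labels, not real ICC genres

# One row per language/genre combination (base-R analogue of expand_grid)
grid <- expand.grid(lang = langs, genre = genres, stringsAsFactors = FALSE)

# Virtual-corpus query string, built as in the composition_by_genre chunk
grid$vc <- paste0("iccGenre=", grid$genre)

# The same idea for publication years, as in the composition-by-pubdate chunk
year_vc <- paste0("pubDate in ", 1990:1992)

print(grid)
print(year_vc)
```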


# Pilot study

* Identification of light verb constructions (LVCs) with *take*
  * in order to investigate the limitations imposed by the very small corpus sizes
  * using RKorAPClient [@kupietz_rkorapclient_2020] to access corpora and get reproducible results of the collocation analysis

```{r take-icc, echo=TRUE, fig.cap="Collocation analysis of *take* using the RKorAPClient package for R"}
take_ca_icc <-
  collocationAnalysis(
    icc_con("eng"),
    "focus({[ud/l=take]} [ud/p=NOUN])",
    leftContextSize = 0,
    rightContextSize = 1,
    minOccur = 2,
    addExamples = TRUE
  )

take_ca_icc %>% show_table()
```

## Results

* for English, the query for *take* + NOUN (as direct right neighbour) yields 10 different pairs with a minimum frequency of 2 (see Figure \@ref(fig:take-icc))
* based on the English Wikipedia (2015), the query yields 139 pairs (logDice threshold: 2.0), about 20 of which are false positives
* for ICC German with DeReKo as background corpus, the ratio of true-positive LVCs is 10/80
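
The logDice threshold and the true-positive ratios reported above can be made concrete in a few lines of R. This is a minimal sketch, assuming the commonly used definition logDice = 14 + log2(2 · f_xy / (f_x + f_y)); whether the analysis uses exactly this variant is not stated here. The helper names `log_dice()` and `precision()` and the frequency counts are hypothetical, while the ratios 10/80 and roughly 119/139 follow from the figures in the list.

```r
# logDice association score (assumed standard definition):
# f_xy = co-occurrence frequency, f_x and f_y = frequencies of the two words
log_dice <- function(f_xy, f_x, f_y) {
  14 + log2(2 * f_xy / (f_x + f_y))
}

# Share of true positives among all extracted pairs
precision <- function(true_pos, total) {
  true_pos / total
}

# Hypothetical counts, chosen only to illustrate the scale of the score
log_dice(f_xy = 50, f_x = 1000, f_y = 600)   # 10, well above a threshold of 2.0

# Figures reported in the results
precision(10, 80)          # ICC German: 0.125
precision(139 - 20, 139)   # English Wikipedia: about 0.86
```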

# Summary & Outlook

* we have made corpora of 4+ languages available for contrastive research
* however, even for quite frequent phenomena, results from the small corpora should be used with caution
  * typically they need to be verified on larger monolingual corpora
* uniform access is in any case helpful for contrastive studies
* ICC also serves as a crystallization point for larger corpora and complementary approaches such as EuReCo

# References