| Marc Kupietz | afce9c1 | 2023-06-13 09:18:53 +0200 | [diff] [blame] | 1 | --- | 
|  | 2 | title: "News from the International Comparable Corpus" | 
|  | 3 | subtitle: "First launch of ICC written" | 
|  | 4 | date: "`r Sys.Date()`" | 
|  | 5 | author: | 
|  | 6 | - name: Marc Kupietz | 
|  | 7 | affil: 1 | 
|  | 8 | - name: Adrien Barbaresi | 
|  | 9 | affil: 2 | 
| Marc Kupietz | bcde0b6 | 2023-06-14 14:22:35 +0200 | [diff] [blame] | 10 | - name: Anna Čermáková | 
| Marc Kupietz | afce9c1 | 2023-06-13 09:18:53 +0200 | [diff] [blame] | 11 | affil: 3 | 
|  | 12 | - name: Małgorzata Czachor | 
|  | 13 | affil: 4 | 
|  | 14 | - name: Nils Diewald | 
|  | 15 | affil: 1 | 
|  | 16 | - name: Jarle Ebeling | 
|  | 17 | affil: 5 | 
|  | 18 | - name: Rafał L. Górski | 
|  | 19 | affil: 4 | 
|  | 20 | - name: John Kirk | 
|  | 21 | affil: 6 | 
|  | 22 | - name: Michal Křen | 
|  | 23 | affil: 3 | 
|  | 24 | - name: Harald Lüngen | 
|  | 25 | affil: 1 | 
|  | 26 | - name: Eliza Margaretha | 
|  | 27 | affil: 1 | 
|  | 28 | - name: Signe Oksefjell Ebeling | 
|  | 29 | affil: 5 | 
|  | 30 | - name: Mícheál Ó Meachair | 
|  | 31 | affil: 7 | 
|  | 32 | - name: Ines Pisetta | 
|  | 33 | affil: 1 | 
|  | 34 | - name: Elaine Uí Dhonnchadha | 
|  | 35 | affil: 8 | 
|  | 36 | - name: Friedemann Vogel | 
|  | 37 | affil: 9 | 
|  | 38 | - name: Rebecca Wilm | 
|  | 39 | affil: 1 | 
|  | 40 | - name: Jiajin Xu | 
|  | 41 | affil: 10 | 
|  | 42 | - name: Rameela Yaddehige | 
|  | 43 | affil: 1 | 
|  | 44 | affiliation: | 
|  | 45 | - num: 1 | 
|  | 46 | address: IDS Mannheim | 
|  | 47 | - num: 2 | 
|  | 48 | address: BBAW Berlin | 
|  | 49 | - num: 3 | 
|  | 50 | address: Charles University | 
|  | 51 | - num: 4 | 
|  | 52 | address: Polish Academy of Sciences | 
|  | 53 | - num: 5 | 
|  | 54 | address: University of Oslo | 
|  | 55 | - num: 6 | 
|  | 56 | address: University of Vienna | 
|  | 57 | - num: 7 | 
|  | 58 | address: Dublin City University | 
|  | 59 | - num: 8 | 
|  | 60 | address: Trinity College Dublin | 
|  | 61 | - num: 9 | 
|  | 62 | address: University of Siegen | 
|  | 63 | - num: 10 | 
|  | 64 | address: Beijing Foreign Studies University | 
|  | 65 |  | 
|  | 66 |  | 
|  | 67 | logoleft_name: "../Figures/ICC_COL.svg" | 
|  | 68 | author_textsize: "32pt" | 
|  | 69 |  | 
| Marc Kupietz | fbd648c | 2023-06-24 12:31:45 +0200 | [diff] [blame] | 70 | contact: | 
| Marc Kupietz | f0f5882 | 2023-06-26 20:32:03 +0200 | [diff] [blame] | 71 | qrlink: > | 
|  | 72 | `r posterdown::qrlink("https://korap.ids-mannheim.de/instance/icc")` | 
| Marc Kupietz | fbd648c | 2023-06-24 12:31:45 +0200 | [diff] [blame] | 73 |  | 
| Marc Kupietz | afce9c1 | 2023-06-13 09:18:53 +0200 | [diff] [blame] | 74 | output: | 
| Marc Kupietz | fbd648c | 2023-06-24 12:31:45 +0200 | [diff] [blame] | 75 | posterdown::posterdown_ids: | 
|  | 76 | self_contained: false | 
|  | 77 | keep_md: true | 
| Marc Kupietz | bcde0b6 | 2023-06-14 14:22:35 +0200 | [diff] [blame] | 78 |  | 
|  | 79 | bibliography: ../tex/references.bib | 
| Marc Kupietz | df8083d | 2023-06-26 20:31:42 +0200 | [diff] [blame] | 80 | csl: "https://raw.githubusercontent.com/ICLC-10/Zotero/master/styles/ICLC-10.csl" | 
| Marc Kupietz | afce9c1 | 2023-06-13 09:18:53 +0200 | [diff] [blame] | 81 | --- | 
|  | 82 |  | 
|  | 83 | ```{r setup, include=FALSE, echo=FALSE, warning=FALSE} | 
| Marc Kupietz | 48d2b52 | 2023-06-14 12:31:06 +0200 | [diff] [blame] | 84 | knitr::opts_chunk$set(dev = 'svg', echo = FALSE, warning = FALSE) | 
| Marc Kupietz | afce9c1 | 2023-06-13 09:18:53 +0200 | [diff] [blame] | 85 | source("common.R") | 
|  | 86 | ``` | 
|  | 87 | # ICC aims & characteristics | 
| Marc Kupietz | 6a4d3a7 | 2023-06-26 20:32:39 +0200 | [diff] [blame] | 88 | * make comparable corpora of many languages available for contrastive linguistic research [@cermakova_international_2021] | 
| Marc Kupietz | afce9c1 | 2023-06-13 09:18:53 +0200 | [diff] [blame] | 89 | * mostly based on existing corpora | 
| Marc Kupietz | 6a4d3a7 | 2023-06-26 20:32:39 +0200 | [diff] [blame] | 90 | * ICC has a pre-defined “balanced” composition | 
|  | 91 | based on that of the International Corpus of English (ICE) [@greenbaum_comparing_1996] | 
| Marc Kupietz | afce9c1 | 2023-06-13 09:18:53 +0200 | [diff] [blame] | 92 |  | 
| Marc Kupietz | 6354d20 | 2023-06-26 20:34:05 +0200 | [diff] [blame] | 93 | # Current launch of ICC written | 
| Marc Kupietz | afce9c1 | 2023-06-13 09:18:53 +0200 | [diff] [blame] | 94 |  | 
| Marc Kupietz | 6354d20 | 2023-06-26 20:34:05 +0200 | [diff] [blame] | 95 | * written parts for Chinese, Czech, English, German, Irish (partly), and Norwegian are publicly available | 
|  | 96 | partly annotated with UDPipe 2.0 [@straka_udpipe_2018] | 
|  | 97 | accessible via Corpus Workbench or KorAP [@diewald_korap_2016] | 
|  | 98 |  | 
|  | 99 |  | 
|  | 100 |  | 
|  | 101 | ## Composition of the ICC parts | 
| Marc Kupietz | afce9c1 | 2023-06-13 09:18:53 +0200 | [diff] [blame] | 102 | ### By ICC genre | 
|  | 103 |  | 
|  | 104 | ```{r composition_by_genre, message = FALSE, fig.width=14, fig.height=10, out.width = "100%"} | 
|  | 105 | icc_genre <- icc %>% | 
|  | 106 | expand_grid(genre) %>% | 
|  | 107 | mutate(vc = paste0("iccGenre=", genre)) %>% | 
|  | 108 | rowwise() %>% | 
|  | 109 | mutate(tokens= corpusStats(icc_con(lang, token), vc = vc)@tokens) | 
|  | 110 |  | 
|  | 111 | icc_genre %>% ggplot(aes(x=lang, fill=genre, y=tokens)) + | 
|  | 112 | geom_col() + scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) + | 
|  | 113 | theme_ids(base_size = 24) + | 
|  | 114 | theme( | 
|  | 115 | axis.title.x = element_text(size = rel(1.5), face = "bold"), | 
|  | 116 | axis.title.y = element_text(size = rel(1.5), face = "bold"), | 
|  | 117 | axis.text = element_text(size = rel(0.70)), | 
|  | 118 | legend.title = element_text(size = rel(0.85), face = "bold"), | 
|  | 119 | legend.text = element_text(size = rel(1))) + | 
|  | 120 | scale_fill_ids() + | 
|  | 121 | geom_text(aes(label = if_else(tokens > 0, as.character(tokens), ""), y = tokens), position = position_stack(reverse = FALSE, vjust = 0.5), color = "black", size = 6.2, family = "Fira Sans Condensed") | 
|  | 122 |  | 
|  | 123 | ``` | 
|  | 124 |  | 
|  | 125 | ### By date of publication | 
|  | 126 |  | 
|  | 127 |  | 
|  | 128 | ```{r composition_by_pubdate, message=FALSE, warning=FALSE, fig.width=14, fig.height=7, out.width = "100%"} | 
|  | 129 | year <- c(1986:2023) | 
|  | 130 |  | 
|  | 131 | icc_year <- icc %>% | 
|  | 132 | expand_grid(year) %>% | 
|  | 133 | mutate(vc = paste0("pubDate in ", year)) %>% | 
|  | 134 | rowwise() %>% | 
|  | 135 | mutate(tokens= corpusStats(icc_con(lang, token), vc = vc)@tokens) | 
|  | 136 |  | 
|  | 137 | icc_year %>% ggplot(aes(x=year, fill=lang, color=lang, y=tokens)) + | 
|  | 138 | # geom_smooth(se=F, span=0.25) + | 
|  | 139 | xlim(1990, 2023) + | 
|  | 140 | ylim(0, NA) + | 
|  | 141 | stat_smooth( | 
|  | 142 | geom = 'area', method = 'loess', span = 1/4, | 
|  | 143 | alpha = 0.1) + | 
|  | 144 | # geom_area(alpha=0.1,  position = "identity") + | 
|  | 145 | scale_fill_ids() + scale_colour_ids() + | 
|  | 146 | scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) + | 
|  | 147 | theme_ids(base_size=24) + | 
|  | 148 | theme( | 
|  | 149 | axis.title.x = element_text(size = rel(1.5), face = "bold"), | 
|  | 150 | axis.title.y = element_text(size = rel(1.5), face = "bold"), | 
|  | 151 | axis.text = element_text(size = rel(1)), | 
|  | 152 | legend.title = element_text(size = rel(1), face = "bold"), | 
|  | 153 | legend.text = element_text(size = rel(1))) | 
|  | 154 | ``` | 
|  | 155 |  | 
| Marc Kupietz | afce9c1 | 2023-06-13 09:18:53 +0200 | [diff] [blame] | 156 |  | 
| Marc Kupietz | 4e3ab83 | 2023-06-26 20:33:18 +0200 | [diff] [blame] | 157 | # Pilot study | 
| Marc Kupietz | afce9c1 | 2023-06-13 09:18:53 +0200 | [diff] [blame] | 158 |  | 
| Marc Kupietz | 4e3ab83 | 2023-06-26 20:33:18 +0200 | [diff] [blame] | 159 | * Identification of Light Verb Constructions with *take* | 
|  | 160 | * in order to investigate the limitations imposed by the very small corpus sizes | 
|  | 161 | using RKorAPClient [@kupietz_rkorapclient_2020] to access the corpora and obtain reproducible collocation analysis results | 
| Marc Kupietz | afce9c1 | 2023-06-13 09:18:53 +0200 | [diff] [blame] | 162 |  | 
| Marc Kupietz | afce9c1 | 2023-06-13 09:18:53 +0200 | [diff] [blame] | 163 |  | 
|  | 164 | # Identification of Light Verb Constructions with *take* | 
|  | 165 |  | 
|  | 166 |  | 
|  | 167 | ## English: *take* | 
|  | 168 |  | 
|  | 169 | ```{r take_icc, echo=TRUE, message=FALSE} | 
|  | 170 | take_ca_icc <- | 
|  | 171 | collocationAnalysis( | 
|  | 172 | icc_con("eng"), | 
|  | 173 | "focus({[ud/l=take]} [ud/p=NOUN])", | 
|  | 174 | leftContextSize = 0, | 
|  | 175 | rightContextSize = 1, | 
|  | 176 | minOccur = 2, | 
|  | 177 | addExamples = TRUE | 
|  | 178 | ) | 
|  | 179 |  | 
|  | 180 | take_ca_icc %>% show_table() | 
|  | 181 | ``` | 
|  | 182 |  | 
| Marc Kupietz | 9af399d | 2023-06-26 20:34:36 +0200 | [diff] [blame^] | 183 | ## Results | 
|  | 184 |  | 
|  | 185 | for English, the query for *take* + NOUN (as direct right neighbour) yields 10 distinct pairs with a minimum frequency of 2 (see Figure \@ref(fig:take-icc)) | 
|  | 186 | based on English Wikipedia (2015), the same query yields 139 pairs (logDice threshold: 2.0), about 20 of which are false positives | 
|  | 187 | for ICC German with DeReKo as background corpus, the proportion of true-positive LVCs is 10/80 | 
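
For orientation, the association score behind the threshold mentioned above can be written out as follows. This is a sketch of the commonly used logDice definition, not something the poster itself spells out; the symbols $f_{xy}$, $f_x$ and $f_y$ are notation introduced here for illustration.

$$
\mathrm{logDice}(x, y) = 14 + \log_2 \frac{2\, f_{xy}}{f_x + f_y}
$$

Here $f_{xy}$ is the co-occurrence frequency of the node $x$ (e.g. *take*) and the collocate $y$, while $f_x$ and $f_y$ are their individual frequencies. Assuming this definition, a threshold of 2.0 keeps only pairs with $2 f_{xy} / (f_x + f_y) \geq 2^{-12}$.

Reading the reported counts as true positives over candidate pairs (an assumption about how the figures are meant), the approximate precision values are:

$$
P_{\text{Wikipedia}} \approx \frac{139 - 20}{139} = \frac{119}{139} \approx 0.86,
\qquad
P_{\text{ICC German}} = \frac{10}{80} = 0.125
$$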
|  | 188 |  | 
| Marc Kupietz | bcde0b6 | 2023-06-14 14:22:35 +0200 | [diff] [blame] | 189 | # References | 
|  | 190 |  | 
| Marc Kupietz | afce9c1 | 2023-06-13 09:18:53 +0200 | [diff] [blame] | 191 |  |