Blame - R/poster.Rmd - ICC/2023-07-20-ICC-ICLC10

Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	1	---
				2	title: "News from the International Comparable Corpus"
				3	subtitle: "First launch of ICC written"
				4	date: "`r Sys.Date()`"
				5	author:
				6	- name: Marc Kupietz
				7	affil: 1
				8	- name: Adrien Barbaresi
				9	affil: 2
Marc Kupietz	bcde0b6	2023-06-14 14:22:35 +0200	[diff] [blame]	10	- name: Anna Čermáková
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	11	affil: 3
				12	- name: Małgorzata Czachor
				13	affil: 4
				14	- name: Nils Diewald
				15	affil: 1
				16	- name: Jarle Ebeling
				17	affil: 5
				18	- name: Rafał L. Górski
				19	affil: 4
				20	- name: John Kirk
				21	affil: 6
				22	- name: Michal Křen
				23	affil: 3
				24	- name: Harald Lüngen
				25	affil: 1
				26	- name: Eliza Margaretha
				27	affil: 1
				28	- name: Signe Oksefjell Ebeling
				29	affil: 5
				30	- name: Mícheál Ó Meachair
				31	affil: 7
				32	- name: Ines Pisetta
				33	affil: 1
				34	- name: Elaine Uí Dhonnchadha
				35	affil: 8
				36	- name: Friedemann Vogel
				37	affil: 9
				38	- name: Rebecca Wilm
				39	affil: 1
				40	- name: Jiajin Xu
				41	affil: 10
				42	- name: Rameela Yaddehige
				43	affil: 1
				44	affiliation:
				45	- num: 1
				46	address: IDS Mannheim
				47	- num: 2
				48	address: BBAW Berlin
				49	- num: 3
				50	address: Charles University
				51	- num: 4
				52	address: Polish Academy of Sciences
				53	- num: 5
				54	address: University of Oslo
				55	- num: 6
				56	address: University of Vienna
				57	- num: 7
				58	address: Dublin City University
				59	- num: 8
				60	address: Trinity College Dublin
				61	- num: 9
				62	address: University of Siegen
				63	- num: 10
				64	address: Beijing Foreign Studies University
				65
				66
				67	logoleft_name: "../Figures/ICC_COL.svg"
				68	author_textsize: "32pt"
				69
Marc Kupietz	fbd648c	2023-06-24 12:31:45 +0200	[diff] [blame]	70	contact:
Marc Kupietz	c5f7a92	2023-06-26 21:16:25 +0200	[diff] [blame]	71	email: icc@ids-mannheim.de
				72	website: https://www.ids-mannheim.de/digspra/kl
Marc Kupietz	f0f5882	2023-06-26 20:32:03 +0200	[diff] [blame]	73	qrlink: >
Marc Kupietz	e3bba7b	2023-06-26 21:17:11 +0200	[diff] [blame]	74	`r posterdown::qrlink("https://korap.ids-mannheim.de/instance/icc", "icc-logo-whitebg.svg")`
Marc Kupietz	fbd648c	2023-06-24 12:31:45 +0200	[diff] [blame]	75
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	76	output:
Marc Kupietz	fbd648c	2023-06-24 12:31:45 +0200	[diff] [blame]	77	posterdown::posterdown_ids:
				78	self_contained: false
				79	keep_md: true
Marc Kupietz	bcde0b6	2023-06-14 14:22:35 +0200	[diff] [blame]	80
				81	bibliography: ../tex/references.bib
Marc Kupietz	df8083d	2023-06-26 20:31:42 +0200	[diff] [blame]	82	csl: "https://raw.githubusercontent.com/ICLC-10/Zotero/master/styles/ICLC-10.csl"
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	83	---
				84
				85	```{r setup, include=FALSE, echo=FALSE, warning=FALSE}
Marc Kupietz	2b51d50	2023-06-28 18:25:13 +0200	[diff] [blame^]	86	knitr::opts_chunk$set(dev = 'svg', echo = FALSE, message = FALSE, warning = FALSE)
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	87	source("common.R")
				88	```
Marc Kupietz	2b51d50	2023-06-28 18:25:13 +0200	[diff] [blame^]	89	# ICC aims & characteristics
Marc Kupietz	8f6c71b	2023-06-28 18:13:55 +0200	[diff] [blame]	90
				91	* make available comparable corpora of many languages for contrastive linguistic research [@kirk_ice_2017]
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	92	* mostly based on existing corpora
Marc Kupietz	2b51d50	2023-06-28 18:25:13 +0200	[diff] [blame^]	93	* small corpora with 1M words (400K written)
				94	* pre-defined “balanced” composition
				95	* inspired by that of the ICE [@greenbaum_comparing_1996]
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	96
Marc Kupietz	6354d20	2023-06-26 20:34:05 +0200	[diff] [blame]	97	# Current launch of ICC written
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	98
Marc Kupietz	2b51d50	2023-06-28 18:25:13 +0200	[diff] [blame^]	99	* written parts for Chinese, Czech, English (mostly), German, Irish (partly), Norwegian publicly available
Marc Kupietz	6354d20	2023-06-26 20:34:05 +0200	[diff] [blame]	100	* partially including UDPipe 2.0 annotations [@straka_udpipe_2018]
Marc Kupietz	49a7c18	2023-06-28 18:15:46 +0200	[diff] [blame]	101	* usable via Corpus Workbench or KorAP [@diewald_korap_2016]
				102
				103	```{r korap-query, fig.cap="KorAP UI for ICC-GER and ICC-NOR, showing annotation queries and layers, as well as a virtual corpus definition, based on ICC genre and publication date metadata."}
				104	knitr::include_graphics("korap_query_ger-nor.svg")
				105	```
Marc Kupietz	6354d20	2023-06-26 20:34:05 +0200	[diff] [blame]	106
				107	## Composition of the ICC parts
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	108
Marc Kupietz	1ed69ff	2023-06-28 18:14:34 +0200	[diff] [blame]	109	```{r composition-by-genre, fig.cap="Actual composition of selected ICC parts with respect to ICC domain. (For the other ICC parts, the ICC genre metadatum was not yet accessible via the API at the editorial deadline.)", message = FALSE, fig.width=14, fig.height=10, out.width = "100%"}
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	110	icc_genre <- icc %>%
				111	expand_grid(genre) %>%
				112	mutate(vc = paste0("iccGenre=", genre)) %>%
				113	rowwise() %>%
				114	mutate(tokens= corpusStats(icc_con(lang, token), vc = vc)@tokens)
				115
				116	icc_genre %>% ggplot(aes(x=lang, fill=genre, y=tokens)) +
				117	geom_col() + scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
				118	theme_ids(base_size = 24) +
				119	theme(
				120	axis.title.x = element_text(size = rel(1.5), face = "bold"),
				121	axis.title.y = element_text(size = rel(1.5), face = "bold"),
				122	axis.text = element_text(size = rel(0.70)),
				123	legend.title = element_text(size = rel(0.85), face = "bold"),
				124	legend.text = element_text(size = rel(1))) +
				125	scale_fill_ids() +
				126	geom_text(aes(label=if_else(tokens > 0, as.character(tokens), ""), y=tokens), position= position_stack(reverse = F, vjust = 0.5), color="black", size=6.2, family="Fira Sans Condensed")
				127
				128	```
				129
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	130
Marc Kupietz	1ed69ff	2023-06-28 18:14:34 +0200	[diff] [blame]	131	```{r composition-by-pubdate, fig.cap="Composition of the selected ICC parts with respect to year of publication.", message=F, warning=F, fig.width=14, fig.height=3, out.width = "80%"}
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	132	year <- c(1986:2023)
				133
				134	icc_year <- icc %>%
				135	expand_grid(year) %>%
				136	mutate(vc = paste0("pubDate in ", year)) %>%
				137	rowwise() %>%
				138	mutate(tokens= corpusStats(icc_con(lang, token), vc = vc)@tokens)
				139
				140	icc_year %>% ggplot(aes(x=year, fill=lang, color=lang, y=tokens)) +
				141	# geom_smooth(se=F, span=0.25) +
				142	xlim(1990, 2023) +
				143	ylim(0, NA) +
				144	stat_smooth(
				145	geom = 'area', method = 'loess', span = 1/4,
				146	alpha = 0.1) +
				147	# geom_area(alpha=0.1, position = "identity") +
				148	scale_fill_ids() + scale_colour_ids() +
				149	scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
				150	theme_ids(base_size=24) +
				151	theme(
				152	axis.title.x = element_text(size = rel(1.5), face = "bold"),
				153	axis.title.y = element_text(size = rel(1.5), face = "bold"),
				154	axis.text = element_text(size = rel(1)),
				155	legend.title = element_text(size = rel(1), face = "bold"),
				156	legend.text = element_text(size = rel(1)))
				157	```
				158
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	159
Marc Kupietz	4e3ab83	2023-06-26 20:33:18 +0200	[diff] [blame]	160	# Pilot study
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	161
Marc Kupietz	58d1bc1	2023-06-28 18:18:19 +0200	[diff] [blame]	162	* identification of light verb constructions (LVC) with take in English, and corresponding lemmas in German and Norwegian
				163	* in order to explore the limitations imposed by the small corpus sizes
				164	* using RKorAPClient [@kupietz_rkorapclient_2020] to access the corpora and get reproducible results for the analyses
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	165
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	166
Marc Kupietz	9fe544b	2023-06-28 18:17:31 +0200	[diff] [blame]	167	```{r take-icc-code, results='hide', echo=TRUE}
				168	library(RKorAPClient)
				169	new("KorAPConnection",
				170	KorAPUrl = "https://korap.ids-mannheim.de/instance/icc/eng",
				171	accessToken = Sys.getenv("KORAP_ICC_TOKEN_eng")) %>%
				172	collocationAnalysis(
				173	"focus({[ud/l=take]} [ud/p=NOUN])",
				174	leftContextSize = 0,
				175	rightContextSize = 1,
				176	minOccur = 2,
				177	addExamples = TRUE
				178	)
				179	```
				180
Marc Kupietz	2b51d50	2023-06-28 18:25:13 +0200	[diff] [blame^]	181	```{r take-icc, fig.cap="R code for, and results of, a co-occurrence analysis of take + NOUN in ICC-ENG, using the RKorAPClient package for R."}
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	182	take_ca_icc <-
				183	collocationAnalysis(
				184	icc_con("eng"),
				185	"focus({[ud/l=take]} [ud/p=NOUN])",
				186	leftContextSize = 0,
				187	rightContextSize = 1,
				188	minOccur = 2,
				189	addExamples = TRUE
				190	)
				191
				192	take_ca_icc %>% show_table()
				193	```
				194
Marc Kupietz	9af399d	2023-06-26 20:34:36 +0200	[diff] [blame]	195	## Results
				196
				197	* for English the query for take + NOUN (as direct right neighbour) yields 10 different pairs with a minimum frequency of 2 (see Figure \@ref(fig:take-icc))
Marc Kupietz	48a4134	2023-06-28 18:16:54 +0200	[diff] [blame]	198	* based on English Wikipedia [2015 snapshot, see @MargarethaLuengen2014] the query yields 139 pairs (logDice threshold: 2.0) with 44 false positives
Marc Kupietz	3a458ce	2023-06-28 18:19:07 +0200	[diff] [blame]	199	* the true positive ratio of discovered take-LVCs between ICC and Wikipedia is 10:95
				200	* for ICC German with DeReKo as background corpus, the ratio of discovered true LVCs with ›nehmen‹ (=take) is 10:89
				201	* in both cases, not much more than 10% of LVCs could be discovered
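
The percentages in the last bullet follow directly from the counts reported above. The sketch below only redoes that arithmetic. The variable names are illustrative and not part of the poster code, and reading the true positives found in the larger background corpus as the reference set for recall is an assumption of this sketch.

```r
# Counts as reported in the Results bullets (no new data).
icc_eng_found        <- 10    # take + NOUN pairs found in ICC-ENG
wiki_pairs           <- 139   # pairs found in English Wikipedia
wiki_false_positives <- 44    # of which false positives

# True positives in the background corpus: 139 - 44 = 95
wiki_true_positives <- wiki_pairs - wiki_false_positives

# Assumption: the larger corpus approximates the full set of LVCs,
# so these ratios act as rough recall estimates for the ICC parts.
recall_eng <- icc_eng_found / wiki_true_positives   # 10:95
recall_ger <- 10 / 89                               # 10:89 (ICC-GER vs. DeReKo)

round(100 * c(eng = recall_eng, ger = recall_ger), 1)
#>  eng  ger
#> 10.5 11.2
```

Both values sit just above 10%, which matches the wording "not much more than 10%".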
Marc Kupietz	9af399d	2023-06-26 20:34:36 +0200	[diff] [blame]	202
Marc Kupietz	32b70ae	2023-06-26 20:34:58 +0200	[diff] [blame]	203	# Summary & Outlook
				204
Marc Kupietz	2b51d50	2023-06-28 18:25:13 +0200	[diff] [blame^]	205	* we have made comparable corpora of 4+ languages available, readily usable for contrastive research
Marc Kupietz	fb570ea	2023-06-28 18:17:59 +0200	[diff] [blame]	206	* however, even for fairly frequent phenomena, the results on the small corpora should be treated with caution
				207	* typically, they need to be verified on larger monolingual corpora
				208	* this also and especially concerns recall
				209	* nevertheless ICC can serve as a useful basis for contrastive studies
Marc Kupietz	2b51d50	2023-06-28 18:25:13 +0200	[diff] [blame^]	210	* with a uniform UI and API that facilitate query and analysis
Marc Kupietz	fb570ea	2023-06-28 18:17:59 +0200	[diff] [blame]	211	* in addition, ICC also serves as a crystallisation point
				212	* for more ICC corpora and spoken parts to come
				213	* for larger corpora and complementary approaches, such as EuReCo
Marc Kupietz	32b70ae	2023-06-26 20:34:58 +0200	[diff] [blame]	214
Marc Kupietz	bcde0b6	2023-06-14 14:22:35 +0200	[diff] [blame]	215	# References
				216
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	217