Blame - R/poster.Rmd - ICC/2023-07-20-ICC-ICLC10

Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	1	---
				2	title: "News from the International Comparable Corpus"
				3	subtitle: "First launch of ICC written"
				4	date: "`r Sys.Date()`"
				5	author:
				6	- name: Marc Kupietz
				7	affil: 1
				8	- name: Adrien Barbaresi
				9	affil: 2
Marc Kupietz	bcde0b6	2023-06-14 14:22:35 +0200	[diff] [blame]	10	- name: Anna Čermáková
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	11	affil: 3
				12	- name: Małgorzata Czachor
				13	affil: 4
				14	- name: Nils Diewald
				15	affil: 1
				16	- name: Jarle Ebeling
				17	affil: 5
				18	- name: Rafał L. Górski
				19	affil: 4
				20	- name: John Kirk
				21	affil: 6
				22	- name: Michal Křen
				23	affil: 3
				24	- name: Harald Lüngen
				25	affil: 1
				26	- name: Eliza Margaretha
				27	affil: 1
				28	- name: Signe Oksefjell Ebeling
				29	affil: 5
				30	- name: Mícheál Ó Meachair
				31	affil: 7
				32	- name: Ines Pisetta
				33	affil: 1
				34	- name: Elaine Uí Dhonnchadha
				35	affil: 8
				36	- name: Friedemann Vogel
				37	affil: 9
				38	- name: Rebecca Wilm
				39	affil: 1
				40	- name: Jiajin Xu
				41	affil: 10
				42	- name: Rameela Yaddehige
				43	affil: 1
				44	affiliation:
				45	- num: 1
				46	address: IDS Mannheim
				47	- num: 2
				48	address: BBAW Berlin
				49	- num: 3
				50	address: Charles University
				51	- num: 4
				52	address: Polish Academy of Sciences
				53	- num: 5
				54	address: University of Oslo
				55	- num: 6
				56	address: University of Vienna
				57	- num: 7
				58	address: Dublin City University
				59	- num: 8
				60	address: Trinity College Dublin
				61	- num: 9
				62	address: University of Siegen
				63	- num: 10
				64	address: Beijing Foreign Studies University
				65
				66
				67	logoleft_name: "../Figures/ICC_COL.svg"
				68	author_textsize: "32pt"
				69
Marc Kupietz	fbd648c	2023-06-24 12:31:45 +0200	[diff] [blame]	70	contact:
Marc Kupietz	c5f7a92	2023-06-26 21:16:25 +0200	[diff] [blame]	71	email: icc@ids-mannheim.de
				72	website: https://www.ids-mannheim.de/digspra/kl
Marc Kupietz	f0f5882	2023-06-26 20:32:03 +0200	[diff] [blame]	73	qrlink: >
Marc Kupietz	e3bba7b	2023-06-26 21:17:11 +0200	[diff] [blame]	74	`r posterdown::qrlink("https://korap.ids-mannheim.de/instance/icc", "icc-logo-whitebg.svg")`
Marc Kupietz	fbd648c	2023-06-24 12:31:45 +0200	[diff] [blame]	75
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	76	output:
Marc Kupietz	fbd648c	2023-06-24 12:31:45 +0200	[diff] [blame]	77	posterdown::posterdown_ids:
				78	self_contained: false
				79	keep_md: true
Marc Kupietz	bcde0b6	2023-06-14 14:22:35 +0200	[diff] [blame]	80
				81	bibliography: ../tex/references.bib
Marc Kupietz	df8083d	2023-06-26 20:31:42 +0200	[diff] [blame]	82	csl: "https://raw.githubusercontent.com/ICLC-10/Zotero/master/styles/ICLC-10.csl"
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	83	---
				84
				85	```{r setup, include=FALSE, echo=FALSE, warning=FALSE}
Marc Kupietz	48d2b52	2023-06-14 12:31:06 +0200	[diff] [blame]	86	knitr::opts_chunk$set(dev = 'svg', echo = FALSE, warning = FALSE)
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	87	source("common.R")
				88	```
				89	# ICC aims & characteristics
Marc Kupietz	8f6c71b	2023-06-28 18:13:55 +0200	[diff] [blame^]	90
				91	* make available comparable corpora of many languages for contrastive linguistic research [@kirk_ice_2017]
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	92	* mostly based on existing corpora
Marc Kupietz	6a4d3a7	2023-06-26 20:32:39 +0200	[diff] [blame]	93	* ICC has a pre-defined “balanced” composition
				94	* modelled on the composition of the International Corpus of English (ICE) [@greenbaum_comparing_1996]
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	95
Marc Kupietz	6354d20	2023-06-26 20:34:05 +0200	[diff] [blame]	96	# Current launch of ICC written
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	97
Marc Kupietz	6354d20	2023-06-26 20:34:05 +0200	[diff] [blame]	98	* written parts for Chinese, Czech, English, German, Irish (partly) and Norwegian are publicly available
				99	* partially including UDPipe 2.0 annotations [@straka_udpipe_2018]
				100	* via Corpus Workbench or KorAP [@diewald_korap_2016]
				101
				102	![](korap_query.png)
				103
				104	## Composition of the ICC parts
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	105	### By ICC genre
				106
				107	```{r composition_by_genre, message = FALSE, fig.width=14, fig.height=10, out.width = "100%"}
				108	icc_genre <- icc %>%
				109	expand_grid(genre) %>%
				110	mutate(vc = paste0("iccGenre=", genre)) %>%
				111	rowwise() %>%
				112	mutate(tokens= corpusStats(icc_con(lang, token), vc = vc)@tokens)
				113
				114	icc_genre %>% ggplot(aes(x=lang, fill=genre, y=tokens)) +
				115	geom_col() + scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
				116	theme_ids(base_size = 24) +
				117	theme(
				118	axis.title.x = element_text(size = rel(1.5), face = "bold"),
				119	axis.title.y = element_text(size = rel(1.5), face = "bold"),
				120	axis.text = element_text(size = rel(0.70)),
				121	legend.title = element_text(size = rel(0.85), face = "bold"),
				122	legend.text = element_text(size = rel(1))) +
				123	scale_fill_ids() +
				124	geom_text(aes(label=if_else(tokens > 0, as.character(tokens), ""), y=tokens), position= position_stack(reverse = F, vjust = 0.5), color="black", size=6.2, family="Fira Sans Condensed")
				125
				126	```
				127
				128	### By date of publication
				129
				130
Marc Kupietz	f7b93ed	2023-06-26 20:35:33 +0200	[diff] [blame]	131	```{r composition-by-pubdate, message=F, warning=F, fig.width=14, fig.height=3, out.width = "80%"}
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	132	year <- c(1986:2023)
				133
				134	icc_year <- icc %>%
				135	expand_grid(year) %>%
				136	mutate(vc = paste0("pubDate in ", year)) %>%
				137	rowwise() %>%
				138	mutate(tokens= corpusStats(icc_con(lang, token), vc = vc)@tokens)
				139
				140	icc_year %>% ggplot(aes(x=year, fill=lang, color=lang, y=tokens)) +
				141	# geom_smooth(se=F, span=0.25) +
				142	xlim(1990, 2023) +
				143	ylim(0, NA) +
				144	stat_smooth(
				145	geom = 'area', method = 'loess', span = 1/4,
				146	alpha = 0.1) +
				147	# geom_area(alpha=0.1, position = "identity") +
				148	scale_fill_ids() + scale_colour_ids() +
				149	scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
				150	theme_ids(base_size=24) +
				151	theme(
				152	axis.title.x = element_text(size = rel(1.5), face = "bold"),
				153	axis.title.y = element_text(size = rel(1.5), face = "bold"),
				154	axis.text = element_text(size = rel(1)),
				155	legend.title = element_text(size = rel(1), face = "bold"),
				156	legend.text = element_text(size = rel(1)))
				157	```
				158
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	159
Marc Kupietz	4e3ab83	2023-06-26 20:33:18 +0200	[diff] [blame]	160	# Pilot study
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	161
Marc Kupietz	4e3ab83	2023-06-26 20:33:18 +0200	[diff] [blame]	162	* Identification of Light Verb Constructions with take
				163	* in order to investigate the limitations imposed by the very small corpus sizes
				164	* using RKorapClient [@kupietz_rkorapclient_2020] to access corpora and get reproducible results of the collocation analysis
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	165
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	166
Marc Kupietz	4e6311e	2023-06-26 20:37:25 +0200	[diff] [blame]	167	```{r take-icc, echo=TRUE, fig.cap="Collocation analysis of take using the RKorAPClient package for R"}
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	168	take_ca_icc <-
				169	collocationAnalysis(
				170	icc_con("eng"),
				171	"focus({[ud/l=take]} [ud/p=NOUN])",
				172	leftContextSize = 0,
				173	rightContextSize = 1,
				174	minOccur = 2,
				175	addExamples = TRUE
				176	)
				177
				178	take_ca_icc %>% show_table()
				179	```
				180
Marc Kupietz	9af399d	2023-06-26 20:34:36 +0200	[diff] [blame]	181	## Results
				182
				183	* for English the query for take + NOUN (as direct right neighbour) yields 10 different pairs with a minimum frequency of 2 (see Figure \@ref(fig:take-icc))
				184	* based on English Wikipedia (2015), the query yields 139 pairs (logDice threshold: 2.0) with about 20 false positives
				185	* for ICC German with DeReKo as background corpus, the ratio of true positive LVCs is 10/80
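
The logDice threshold mentioned above can be illustrated with a minimal sketch. It assumes the common definition of logDice, 14 + log2(2 * f_xy / (f_x + f_y)), where f_xy is the co-occurrence frequency and f_x, f_y are the frequencies of node and collocate. The helper `log_dice()` and all frequencies below are hypothetical and only serve as illustration; they are neither ICC results nor part of the RKorAPClient API.

```r
# Minimal sketch of logDice filtering; all names and numbers are illustrative.
log_dice <- function(f_xy, f_x, f_y) {
  # A pair cannot co-occur more often than either of its parts occurs.
  stopifnot(all(f_xy > 0), all(f_x >= f_xy), all(f_y >= f_xy))
  14 + log2(2 * f_xy / (f_x + f_y))
}

# Hypothetical frequencies for the node "take" and three noun collocates
candidates <- data.frame(
  collocate = c("place", "care", "table"),
  f_xy = c(40, 12, 2),           # co-occurrences of take + collocate
  f_x  = c(10000, 10000, 10000), # frequency of the node
  f_y  = c(6000, 1500, 90000),   # frequency of the collocate
  stringsAsFactors = FALSE
)

candidates$logDice <- log_dice(candidates$f_xy, candidates$f_x, candidates$f_y)

# Keep only the pairs that reach the threshold of 2.0
kept <- candidates[candidates$logDice >= 2.0, ]
print(kept)
```

Because logDice depends only on the three frequencies and not on the total corpus size, the same threshold can be applied to corpora of very different sizes, although the minimum-frequency filter still hits small corpora much harder.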
				186
Marc Kupietz	32b70ae	2023-06-26 20:34:58 +0200	[diff] [blame]	187	# Summary & Outlook
				188
				189	* we have made corpora of 4+ languages available for contrastive research
				190	* however, even with quite frequent phenomena, the results on the small corpora are to be used with caution
				191	* typically they need to be verified on larger monolingual corpora
				192	* the uniform access is in any case helpful for contrastive studies
				193	* ICC also serves as a crystallization point for larger corpora and complementary approaches such as EuReCo
				194
Marc Kupietz	bcde0b6	2023-06-14 14:22:35 +0200	[diff] [blame]	195	# References
				196
Marc Kupietz	afce9c1	2023-06-13 09:18:53 +0200	[diff] [blame]	197