International Comparable Corpus
Access ICC alpha
The International Comparable Corpus (ICC) is a collaborative project in the field of contrastive corpus-based linguistics. The ultimate goal of the project is to facilitate contrastive studies between English and other languages using highly comparable datasets of spoken, written and electronic registers. What we are introducing is not a parallel translation corpus (where source-language texts are aligned with their translations) but a set of comparable corpora in different languages. The languages currently involved are Czech, Finnish, French, German, Irish, Italian, Norwegian, Polish, Slovak, Swedish, and Chinese. If you are interested in adding another language, please do get in touch.
For more information about ICC, please visit the ICC website at the Institute of the Czech National Corpus: https://korpus.cz/icc
Using the Query dropdown menu above, you can already access preliminary versions of the written parts of the ICC via KorAP. Please note that the data is still changing constantly and is full of bugs and shortcomings. A first beta version was published at the International Contrastive Linguistics Conference 2023 (ICLC-10) (ICC-ICLC-10 poster).
- 2023-08-28: ICC Irish updated with more metadata
- 2023-07-30: ICC Irish (partial) added
- 2023-07-25: additional annotation layer for ICC German (GSD model for udpipe2)
- 2023-06-22: ICC English extended by previously broken texts
- 2023-04-24: first version of ICC English online
- 2023-04-21: first version of ICC Norwegian online
- 2023-04-21: ICC Czech and ICC Chinese integrated
- 2022-12-24: first version of ICC German online
- some languages are still missing
- for the existing corpora, texts from the original TEI data are missing to varying degrees
- some metadata from the original TEI data is missing, in particular the ICC genres
- text ids are inconsistent and will change
- the universal dependency annotations (thanks to udpipe2!) are unchecked
- authorization of client applications does not work yet; this means that all quantitative analyses will have to wait
- comparative studies are not yet optimally supported
- you need to log in for every corpus separately (but you can use the same login/registration)
- you don't know which ICC corpus you query
- everything is subject to change without notice
- ...