Convert between KorAP XML format and CoNLL-U format.

Clone this repo:
  1. 3559fa4 Fix -o option test by Marc Kupietz · 6 months ago master
  2. 67d8c43 2korapxml: Print input and output filenames in warnings by Marc Kupietz · 6 months ago
  3. 187abd7 2korapxml: Add -o option by Marc Kupietz · 6 months ago
  4. cc39147 2korapxml: Trim docid by Marc Kupietz · 6 months ago
  5. 249fc83 Prepare new release by Akron · 7 months ago v0.6.3

KorAP-XML-CoNLL-U

Tool package to convert between KorAP XML format and CoNLL-U format, as well as other simple formats,including token boundary information.

Description

The state of the package is very preliminary. Currently, two scripts are provided:

  • korapxml2conllu converts KorAP XML zip "base" and "morpho" (with POS and lemma annotations) files to corresponding CoNLL-U (or word2vec input) files with foundry information, text ids and token offsets in comments
  • conllu2korapxml converts CoNLL-U files that follow KorAP-specific comment conventions and contain morphosyntactic and/or dependency annotations to corresponding KorAP-XML zip files

Installation

Using cpanm

cpanm https://github.com/KorAP/KorAP-XML-CoNLL-U.git

Local

perl Makefile.PL
make
make test TEST_VERBOSE=1
make install

Command Line Invocation

korapxml2conllu


$ korapxml2conllu wpd17.tree_tagger.zip | head -42 # foundry = tree_tagger # filename = WPD17/A00/00001/tree_tagger/morpho.xml # text_id = WPD17_A00.00001 # start_offsets = 0 0 5 13 19 23 33 37 43 52 61 63 67 73 85 87 91 97 101 113 123 130 136 142 146 150 155 158 169 178 184 190 # end_offsets = 191 4 12 18 22 32 36 42 51 61 62 66 72 85 86 90 96 100 112 122 129 135 141 145 149 154 157 168 177 183 190 191 1 Alan Alan NE NE _ _ _ _ 1.000000 2 Smithee -- NE NE _ _ _ _ 1.000000 3 steht stehen VVFIN VVFIN _ _ _ _ 1.000000 4 als als KOKOM KOKOM _ _ _ _ 0.995658 5 Pseudonym Pseudonym NN NN _ _ _ _ 1.000000 6 für für APPR APPR _ _ _ _ 1.000000 7 einen eine ART ART _ _ _ _ 0.998238 8 fiktiven fiktiv ADJA ADJA _ _ _ _ 1.000000 9 Regisseur Regisseur NN NN _ _ _ _ 1.000000 10 , , $, $, _ _ _ _ 1.000000 11 der die ART ART _ _ _ _ 0.954604 12 Filme Film NN NN _ _ _ _ 1.000000 13 verantwortet verantworten VVPP VVPP _ _ _ _ 0.753983 14 , , $, $, _ _ _ _ 1.000000 15 bei bei APPR APPR _ _ _ _ 0.999325 16 denen die PDS PDS _ _ _ _ 0.906725 17 der die ART ART _ _ _ _ 0.998927 18 eigentliche eigentlich ADJA ADJA _ _ _ _ 1.000000 19 Regisseur Regisseur NN NN _ _ _ _ 1.000000 20 seinen sein PPOSAT PPOSAT _ _ _ _ 1.000000 21 Namen Name NN NN _ _ _ _ 1.000000 22 nicht nicht PTKNEG PTKNEG _ _ _ _ 1.000000 23 mit mit APPR APPR _ _ _ _ 0.999012 24 dem die ART ART _ _ _ _ 0.999949 25 Werk Werk NN NN _ _ _ _ 1.000000 26 in in APPR APPR _ _ _ _ 1.000000 27 Verbindung Verbindung NN NN _ _ _ _ 1.000000 28 gebracht bringen VVPP VVPP _ _ _ _ 0.999331 29 haben haben VAINF VAINF _ _ _ _ 0.999987 30 möchte mögen VMFIN VMFIN _ _ _ _ 1.000000 31 . . $. $. _ _ _ _ 1.000000 # start_offsets = 192 192 196 201 205 210 216 219 223 227 237 243 246 254 255 258 260 264 271 283 292 294 302 306 309 316 319 # end_offsets = 320 195 200 204 209 215 218 222 226 236 242 245 253 255 258 259 263 270 282 292 293 301 305 308 315 319 320 1 Von von APPR APPR _ _ _ _ 0.999214 2 1968 1968 CARD CARD _ _ _ _ 1.000000 3 bis bis APPR APPR _ _ _ _ 0.861721 $ ./script/korapxml2conllu t/data/goe.zip | head -20 # foundry = base # filename = GOE/AGA/00000/base/tokens.xml # text_id = GOE_AGA.00000 # start_offsets = 0 0 9 12 # end_offsets = 22 8 11 22 1 Campagne _ _ _ _ _ _ _ _ 2 in _ _ _ _ _ _ _ _ 3 Frankreich _ _ _ _ _ _ _ _ # start_offsets = 23 23 # end_offsets = 27 27 1 1792 _ _ _ _ _ _ _ _ # start_offsets = 28 28 33 37 40 44 53 # end_offsets = 54 32 36 39 43 53 54 1 auch _ _ _ _ _ _ _ _ 2 ich _ _ _ _ _ _ _ _ 3 in _ _ _ _ _ _ _ _ 4 der _ _ _ _ _ _ _ _ 5 Champagne _ _ _ _ _ _ _ _

Example producing language model training input from KorAP-XML

./script/korapxml2conllu --word2vec t/data/wdf19.zip
Arts visuels Pourquoi toujours vouloir séparer BD et Manga ?
Ffx 18:20 fév 25 , 2003 ( CET ) soit on ne sépara pas , soit alors on distingue aussi , le comics , le manwa , le manga ..
la bd belge et touts les auteurs européens ..
on commence aussi a parlé de la bd africaine et donc ...
wikipedia ce prete parfaitement à ce genre de decryptage .
…

Example producing language model training input with preceding metadata columns

./script/korapxml2conllu -m '<textSigle>([^<]+)' -m '<creatDate>([^<]+)' --word2vec t/data/wdf19.zip 
WDF19/A0000.10894	2014.08.28	Arts visuels Pourquoi toujours vouloir séparer BD et Manga ?
WDF19/A0000.10894	2014.08.28	Ffx 18:20 fév 25 , 2003 ( CET ) soit on ne sépara pas , soit alors on distingue aussi , le comics , le manwa , le manga ..
WDF19/A0000.10894	2014.08.28	la bd belge et touts les auteurs européens ..
WDF19/A0000.10894	2014.08.28	on commence aussi a parlé de la bd africaine et donc ...
WDF19/A0000.10894	2014.08.28	wikipedia ce prete parfaitement à ce genre de decryptage .

conllu2korapxml

./script/conllu2korapxml < t/data/goe.morpho.conllu > goe.morpho.zip

Development and License

Author:

Copyright (c) 2024, Leibniz Institute for the German Language, Mannheim, Germany

This package is developed as part of the KorAP Corpus Analysis Platform at the Leibniz Institute for German Language (IDS).

It is published under the BSD 2-clause "Simplified" license.

Contributions

Contributions are very welcome!

Your contributions should ideally be committed via our Gerrit server to facilitate reviewing ( see Gerrit Code Review - A Quick Introduction if you are not familiar with Gerrit). However, we are also happy to accept comments and pull requests via GitHub.