Bump version to 2.0-alpha-04

Change-Id: I79459dd789c445e01e869b2afc911ac997a4adc8
1 file changed
tree: bb2b6274cfd4da7429add353fd0f2d2dfa9d2d4d
  1. .idea/
  2. app/
  3. gradle/
  4. .gitattributes
  5. .gitignore
  6. .gitlab-ci.yml
  7. build.gradle
  8. gradlew
  9. gradlew.bat
  10. LICENSE
  11. Readme.md
  12. settings.gradle
Readme.md

korapxml2conllu

Tool package to convert from KorAP XML format to CoNLL-U format, as well as other simple formats, including token boundary information.

Up to 200 times faster and more accurate drop-in replacement for the korapxml2conllu part of KorAP-XML-CoNLL-U.

Build

./gradlew build

Run

$ java  -jar ./app/build/libs/korapxml2conllu.jar app/src/test/resources/wdf19.zip | head -10

# foundry = base
# filename = WDF19/A0000/13072/base/tokens.xml
# text_id = WDF19_A0000.13072
# start_offsets = 0 0 14 17 25 30 35 42 44 52 60 73
# end_offsets = 74 12 16 24 29 34 41 43 51 59 72 74
1	Australasien	_	_	_	_	_	_	_	_
2	on	_	_	_	_	_	_	_	_
3	devrait	_	_	_	_	_	_	_	_
4	peut	_	_	_	_	_	_	_	_
5	être	_	_	_	_	_	_	_	_

Example producing language model training input from KorAP-XML

$ java  -jar ./app/build/libs/korapxml2conllu.jar --word2vec t/data/wdf19.zip

Arts visuels Pourquoi toujours vouloir séparer BD et Manga ?
Ffx 18:20 fév 25 , 2003 ( CET ) soit on ne sépara pas , soit alors on distingue aussi , le comics , le manwa , le manga ..
la bd belge et touts les auteurs européens ..
on commence aussi a parlé de la bd africaine et donc ...
wikipedia ce prete parfaitement à ce genre de decryptage .

Example producing language model training input with preceding metadata columns

java  -jar ./app/build/libs/korapxml2conllu.jar  -m '<textSigle>([^<]+)' -m '<creatDate>([^<]+)' --word2vec t/data/wdf19.zip
WDF19/A0000.10894	2014.08.28	Arts visuels Pourquoi toujours vouloir séparer BD et Manga ?
WDF19/A0000.10894	2014.08.28	Ffx 18:20 fév 25 , 2003 ( CET ) soit on ne sépara pas , soit alors on distingue aussi , le comics , le manwa , le manga ..
WDF19/A0000.10894	2014.08.28	la bd belge et touts les auteurs européens ..
WDF19/A0000.10894	2014.08.28	on commence aussi a parlé de la bd africaine et donc ...
WDF19/A0000.10894	2014.08.28	wikipedia ce prete parfaitement à ce genre de decryptage .

Example for POS annotating the data on the fly, using 10 threads

java  -jar app/build/libs/korapxml2conllu.jar -T 10 -A "docker run --rm -i korap/conllu2treetagger -l french" app/src/test/resources/wdf19.zip | conllu2korapxml wdf19.tree_tagger.zip

Tag with integrated MarMoT POS tagger

$ java -jar ./app/build/libs/korapxml2conllu.jar -t marmot:models/de.marmot app/src/test/resources/goe.zip

# foundry = base
# filename = GOE/AGA/00000/base/tokens.xml
# text_id = GOE_AGA.00000
# start_offsets = 0 0 9 12
# end_offsets = 22 8 11 22
1       Campagne        _       _       NN      case=nom|number=sg|gender=fem   _       _       _       _
2       in      _       _       APPR    _       _       _       _       _
3       Frankreich      _       _       NE      case=dat|number=sg|gender=neut  _       _       _       _

Development and License

Author:

Copyright (c) 2024, Leibniz Institute for the German Language, Mannheim, Germany

This package is developed as part of the KorAP Corpus Analysis Platform at the Leibniz Institute for German Language (IDS).

It is published under the GNU General Public License, Version 3, 29 June 2007.

Contributions

Contributions are very welcome!

Your contributions should ideally be committed via our Gerrit server to facilitate reviewing ( see Gerrit Code Review - A Quick Introduction if you are not familiar with Gerrit). However, we are also happy to accept comments and pull requests via GitHub.