# korapxmltool

Tool package to convert and annotate KorAP-XML ZIP files.

Up to 200 times faster and more accurate drop-in replacement for the korapxml2conllu part of [KorAP-XML-CoNLL-U](https://github.com/KorAP/KorAP-XML-CoNLL-U).

For some conversion tasks, however, you currently need the conllu2korapxml part of [KorAP-XML-CoNLL-U](https://github.com/KorAP/KorAP-XML-CoNLL-U).

## Download

You can download the latest jar build from the build artifacts [here]([here](https://gitlab.ids-mannheim.de/KorAP/korapxml2conllu/-/jobs/artifacts/master/raw/build/libs/korapxmltool.jar?job=build))

## Build it yourself

```shell script
./gradlew shadowJar
```

## Conversion to [CoNLL-U format](https://universaldependencies.org/format.html)

```shell script
$ java  -jar ./app/build/libs/korapxmltool.jar app/src/test/resources/wdf19.zip | head -10

# foundry = base
# filename = WDF19/A0000/13072/base/tokens.xml
# text_id = WDF19_A0000.13072
# start_offsets = 0 0 14 17 25 30 35 42 44 52 60 73
# end_offsets = 74 12 16 24 29 34 41 43 51 59 72 74
1	Australasien	_	_	_	_	_	_	_	_
2	on	_	_	_	_	_	_	_	_
3	devrait	_	_	_	_	_	_	_	_
4	peut	_	_	_	_	_	_	_	_
5	être	_	_	_	_	_	_	_	_

```

## Conversion to language model training data input format from KorAP-XML

```shell script
$ java  -jar ./app/build/libs/korapxmltool.jar --word2vec t/data/wdf19.zip

Arts visuels Pourquoi toujours vouloir séparer BD et Manga ?
Ffx 18:20 fév 25 , 2003 ( CET ) soit on ne sépara pas , soit alors on distingue aussi , le comics , le manwa , le manga ..
la bd belge et touts les auteurs européens ..
on commence aussi a parlé de la bd africaine et donc ...
wikipedia ce prete parfaitement à ce genre de decryptage .
…
```

### Example producing language model training input with preceding metadata columns

```shell script
java  -jar ./app/build/libs/korapxmltool.jar  -m '<textSigle>([^<]+)' -m '<creatDate>([^<]+)' --word2vec t/data/wdf19.zip
```
```
WDF19/A0000.10894	2014.08.28	Arts visuels Pourquoi toujours vouloir séparer BD et Manga ?
WDF19/A0000.10894	2014.08.28	Ffx 18:20 fév 25 , 2003 ( CET ) soit on ne sépara pas , soit alors on distingue aussi , le comics , le manwa , le manga ..
WDF19/A0000.10894	2014.08.28	la bd belge et touts les auteurs européens ..
WDF19/A0000.10894	2014.08.28	on commence aussi a parlé de la bd africaine et donc ...
WDF19/A0000.10894	2014.08.28	wikipedia ce prete parfaitement à ce genre de decryptage .
```

## Annotation

### Tagging with integrated MarMoT POS tagger directly to a new KorAP-XML ZIP file

You need to download the pre-trained MarMoT models from the [here](http://cistern.cis.lmu.de/marmot/models/CURRENT/).

```shell script
$ java -jar ./app/build/libs/korapxmltool.jar -f zip -t marmot:models/de.marmot app/src/test/resources/goe.zip
```

### Tagging with integrated OpenNLP POS tagger directly to a new KorAP-XML ZIP file

You need to download the pre-trained OpenNLP models from [here](https://opennlp.apache.org/models.html#part_of_speech_tagging) or older models from  [here](http://opennlp.sourceforge.net/models-1.5/).
```shell script
java -jar ./app/build/libs/korapxmltool.jar -f zip -t opennlp:/usr/local/kl/korap/Ingestion/lib/models/opennlp/de-pos-maxent.bin /tmp/zca24.zip
```

### Tag and lemmatize with TreeTagger

This requires the [TreeTagger Docker Image with CoNLL-U Support](https://gitlab.ids-mannheim.de/KorAP/CoNLL-U-Treetagger). 
Language models are downloaded automatically.

```shell script
java  -jar app/build/libs/korapxmltool.jar app/src/test/resources/wdf19.zip | docker run --rm -i korap/conllu2treetagger -l french | conllu2korapxml
```

### Tag and lemmatize with spaCy

This requires the [spaCy Docker Image with CoNLL-U Support](https://gitlab.ids-mannheim.de/KorAP/sota-pos-lemmatizers) and is only available for German.

```shell script
java  -jar app/build/libs/korapxmltool.jar app/src/test/resources/goe.zip | docker run --rm -i korap/conllu2spacy | conllu2korapxml > goe.spacy.zip
```

## Parsing

### Using the integrated Maltparser

You need to download the pre-trained MaltParser models from the [here](http://www.maltparser.org/mco/mco.html).
Note that parsers take POS tagged input.

```shell script
java -jar ./app/build/libs/korapxmltool.jar -f zip -T2 -P malt:libs/german.mco goe.tree_tagger.zip
```

## Development and License

**Author**:

* [Marc Kupietz](https://www.ids-mannheim.de/digspra/personal/kupietz.html)

Copyright (c) 2024-2025, [Leibniz Institute for the German Language](http://www.ids-mannheim.de/), Mannheim, Germany

This package is developed as part of the [KorAP](http://korap.ids-mannheim.de/)
Corpus Analysis Platform at the Leibniz Institute for German Language
([IDS](http://www.ids-mannheim.de/)).

It is published under the GNU General Public License, Version 3, 29 June 2007.

## Contributions

Contributions are very welcome!

Your contributions should ideally be committed via our [Gerrit server](https://korap.ids-mannheim.de/gerrit/)
to facilitate reviewing (
see [Gerrit Code Review - A Quick Introduction](https://korap.ids-mannheim.de/gerrit/Documentation/intro-quick.html)
if you are not familiar with Gerrit). However, we are also happy to accept comments and pull requests
via GitHub.
