
EPub to KorAP (via TEI I5) conversion

Testing

Run TEI I5 conversion tests on local test data

make -j $(nproc) test

Build test index

make -j $(nproc) test index

Run local KorAP with test index

INDEX=./target/dnb.index docker compose -p korap4dnb --profile=lite -f korap4dnb-compose.yml up -d

xdg-open http://localhost:4000/?q=Test
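The containers may need a moment before the web frontend answers, so opening the browser right after up -d can show an error page. A minimal sketch of a polling helper, assuming curl is installed; the function name wait_for_korap, the default of 30 attempts and the one-second pause are illustrative choices and not part of this repository:

```shell
# Hypothetical helper (not part of the repository): poll a URL until it
# answers with a successful HTTP status or the attempts run out.
# Usage: wait_for_korap URL [MAX_TRIES]
wait_for_korap() {
    url=$1
    tries=${2:-30}
    while [ "$tries" -gt 0 ]; do
        # -f: fail on HTTP errors, -s: silent, -o /dev/null: discard body
        if curl -fs -o /dev/null "$url"; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    echo "no answer from $url" >&2
    return 1
}
```

It could be combined with the command above as: wait_for_korap http://localhost:4000/ && xdg-open 'http://localhost:4000/?q=Test'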

Stop local KorAP

docker compose -p korap4dnb down

Generate annotations

Install the prerequisite korap/conllu2treetagger and korap/conllu2spacy Docker images if they are not present:

docker image inspect korap/conllu2treetagger:latest || curl -Ls 'https://gitlab.ids-mannheim.de/KorAP/CoNLL-U-Treetagger/-/jobs/artifacts/master/raw/conllu2treetagger.xz?job=build-docker-image' | docker load

docker image inspect korap/conllu2spacy:latest || curl -Ls https://corpora.ids-mannheim.de/tools/conllu2spacy.tar.xz | docker load
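The two commands above follow the same pattern: an image is loaded only if docker image inspect does not find it locally. A small sketch that wraps this pattern and suppresses the metadata that inspect would otherwise print; the function name ensure_image is hypothetical, and docker and curl are assumed to be on the PATH:

```shell
# Hypothetical helper: load a Docker image from a URL unless it is already
# present locally.
# Usage: ensure_image IMAGE URL
ensure_image() {
    # inspect prints the image metadata on success; it is discarded here
    docker image inspect "$1" >/dev/null 2>&1 ||
        curl -Ls "$2" | docker load
}
```

For example: ensure_image korap/conllu2spacy:latest https://corpora.ids-mannheim.de/tools/conllu2spacy.tar.xz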

Make annotations for dnb20:

make -j $(nproc) target/dnb20.marmot-malt.zip target/dnb20.spacy.zip target/dnb20.tree_tagger.zip

Production

Build a new KorAP index

make -j $(( $(nproc) / 2 )) index

By default, all directories in ./DeLiKo@DNB are used as source directories. Note that, due to a bug in the Makefile, the EPUB files must be nested exactly two levels deep. To check which files will be converted, run ls DeLiKo@DNB/*/*.epub.
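Because EPUB files at any other depth are presumably not picked up, it can help to list them before starting a long build. A sketch using find, assuming a find that supports -mindepth and -maxdepth (GNU and BSD find do); the function name misplaced_epubs is made up for this example:

```shell
# Hypothetical helper: print EPUB files below ROOT whose nesting depth is
# not exactly 2, i.e. files that lie outside the ROOT/*/*.epub pattern.
# Usage: misplaced_epubs [ROOT]
misplaced_epubs() {
    root=${1:-./DeLiKo@DNB}
    # depth 1: EPUBs lying directly in ROOT
    find "$root" -mindepth 1 -maxdepth 1 -name '*.epub'
    # depth 3 and deeper: EPUBs nested too far down
    find "$root" -mindepth 3 -name '*.epub'
}
```

If misplaced_epubs prints nothing, every EPUB below the root sits at the expected depth.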

The new index will be built as target/dnb.index.
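On a machine with a single core, $(( $(nproc) / 2 )) evaluates to 0, and GNU make rejects -j 0 because it requires a positive integer. A sketch that clamps the job count to at least 1; the function name half_cores is hypothetical:

```shell
# Hypothetical helper: half the available cores, but never less than 1.
# An optional argument overrides the core count (handy for testing).
half_cores() {
    cores=${1:-$(nproc)}
    n=$((cores / 2))
    [ "$n" -ge 1 ] || n=1
    echo "$n"
}
```

The build command would then read: make -j "$(half_cores)" index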

Build a new KorAP index with the prize winners only

make clean && time make -j $(( $(nproc) / 2 )) index SRC_DIR=./Buchpreis

The index will be in target/dnb.index.

Run KorAP

Start the Docker containers:

INDEX=./target/dnb.index docker compose -p korap4dnb --profile=lite -f korap4dnb-compose.yml up -d

Stop KorAP

docker compose -p korap4dnb down

Restart KorAP

docker compose -p korap4dnb --profile=lite restart

News

  • 2024-05-26

    • extended genre classification based on metadata keywords
    • Saxon XSLT processor and license updated from 9 to 12.4
  • 2024-05-08

    • added idno elements with all ids given by dnb SRU api
    • fixed bug with ambiguous (dnb-id/isbn) ids
    • basic genre classification based on metadata keywords
  • 2024-04-19

    • SRC_DIR now defaults to the production sample!
    • ISBN number recognition should be fixed now
    • ignore faulty xhtml input files and conversion errors – just issue a warning
    • docker compose now uses http default port 80 externally
  • 2024-04-15

    • added pass2 and pass3 to xslt conversion to …
      • fix div, p, hi, ref … nestings
      • remove empty elements
      • join subsequent hi elements
    • improved korapxml2krill performance by using all cores (-1 does not work here)
    • sanitized the Makefile and dropped YY variable, use YEARS instead
  • 2024-04-10

    • multiple authors (and non-authors) are now correctly handled
    • some more .(x)html files are now dropped (toc, cover, etc.)
    • PRELIMINARY support for splitting everything into annual volumes
      • use make YY=22 to select 2022
      • does not yet work for the index!
  • 2024-03-24

    • slow udpipe2 dropped
    • added marmot POS and morpho-syntactic annotations
    • added malt dependency annotations
  • 2024-03-18

    • added make deploy to install new index and restart local KorAP@DNB instance (also available as ci target)
    • added show-server-logs and show-server-status make targets to monitor the local KorAP@DNB instance
  • 2024-03-17

    • added make all to build all targets, including the index
  • 2024-03-16

    • CI/CD pipeline added
    • first working pipeline for EPub ⮕ TEI I5 ⮕ KorAP-XML ⮕ (UDPipe+TreeTagger+Spacy) ⮕ Krill ⮕ KorAP-JSON
  • 2024-03-15: DNB test data added

  • 2024-03-08: example EPub and I5 added from DeReKo KJL corpus: Christiane F. ; Kai Hermann ; Horst Rieck: Wir Kinder vom Bahnhof Zoo in the folder test/resources/ – do not distribute (copyrighted data)