commit | 6c46d70f6ce240bdf1c0938dfeae68b6cb0d39a0 | [log] [tgz] |
---|---|---|
author | Marc Kupietz <kupietz@ids-mannheim.de> | Thu Sep 08 13:29:28 2022 +0200 |
committer | Marc Kupietz <kupietz@ids-mannheim.de> | Fri Sep 09 18:25:33 2022 +0200 |
tree | 629bda9185e35e8fe0cf1605fefd78a8eb22d6e2 | |
parent | 0d23fb5174a1830750b41da9c25e74b2cc5b6480 [diff] |
Automatically convert python lists to string vectors Currently that's all we need. Change-Id: I4e6dd61da8c374116c88a0c18a5277d55d9f3f45
Python client wrapper package to access the web service API of the KorAP Corpus Analysis Platform developed at IDS Mannheim. Currently, this is not a native Python package. Internally, it uses KorAP's client package for R via rpy2. The latter also automatically translates between R data frames (or tibbles) and pandas DataFrames.
or, alternatively, on some recent Linux distributions:
    #### Debian / Ubuntu
    sudo apt-get install -y r-base r-base-dev r-cran-tidyverse r-cran-r.utils \
      r-cran-pixmap r-cran-webshot r-cran-ade4 r-cran-segmented r-cran-purrr \
      r-cran-dygraphs r-cran-cvst r-cran-quantmod r-cran-graphlayouts \
      r-cran-rappdirs r-cran-ggdendro r-cran-seqinr r-cran-heatmaply \
      r-cran-igraph r-cran-plotly libcurl4-gnutls-dev libssl-dev \
      libfontconfig1-dev libsecret-1-dev libxml2-dev libsodium-dev \
      python3-pip python3-rpy2 python3-pandas

    #### Fedora / CentOS / RHEL
    sudo yum install -y R R-devel libcurl-devel openssl-devel libxml2-devel libsodium-devel python3-pandas
Start R and run:
install.packages('RKorAPClient', repos='https://cloud.r-project.org/')
or install RKorAPClient from the package installation menu entry.
On Linux and macOS:
python3 -m pip install KorAPClient
On Windows:
py -m pip install KorAPClient
The core classes and methods to access the KorAP API are documented in the KorAPClient API documentation. For additional, mostly static helper functions, please refer to the Reference Manual of RKorAPClient for now. For translating R syntax to Python and vice versa, refer to the rpy2 Documentation.
Please note that some arguments in the original RKorAPClient functions use characters that are not allowed in Python keyword argument names. For these cases, you can however use Python's `**kwargs` syntax. For example, to get the result of `corpusStats` as a `pandas.DataFrame`, and print the size of the whole corpus in tokens, you can write:
    from KorAPClient import KorAPConnection

    kcon = KorAPConnection(verbose=True)
    print(kcon.corpusStats(**{"as.df": True})['tokens'][0])
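Why dictionary unpacking is needed can be seen without a KorAP connection. The following is a minimal sketch that uses a hypothetical stand-in function, `fake_corpus_stats` (an assumption for illustration, not part of KorAPClient), to show that a dotted name such as `as.df` is rejected when written as a keyword argument but passes through `**` unpacking:

```python
def fake_corpus_stats(**kwargs):
    """Hypothetical stand-in for a wrapped R function; it only returns its arguments."""
    return kwargs


# Writing fake_corpus_stats(as.df=True) directly does not even parse,
# because "as.df" is not a valid Python identifier.
try:
    compile("fake_corpus_stats(as.df=True)", "<example>", "eval")
    direct_call_is_valid = True
except SyntaxError:
    direct_call_is_valid = False

# Unpacking a dict passes the same name through as an ordinary string key.
passed = fake_corpus_stats(**{"as.df": True})

print(direct_call_is_valid)
print(passed)
```

The same pattern should apply to any other R argument name containing a dot.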
    from KorAPClient import KorAPClient, KorAPConnection
    import plotly.express as px

    QUERY = "Hello World"
    YEARS = range(2010, 2019)
    COUNTRIES = ["DE", "CH"]

    kcon = KorAPConnection(verbose=True)

    # One virtual corpus definition per (country, year) combination
    vcs = [f"textType=/Zeit.*/ & pubPlaceKey={c} & pubDate in {y}" for c in COUNTRIES for y in YEARS]
    df = KorAPClient.ipm(kcon.frequencyQuery(QUERY, vcs))

    # Label the rows in the same order in which the virtual corpora were built
    df['Year'] = [y for c in COUNTRIES for y in YEARS]
    df['Country'] = [c for c in COUNTRIES for y in YEARS]

    # Convert the confidence interval bounds to error bar lengths relative to ipm
    df['error_y'] = df["conf.high"] - df["ipm"]
    df['error_y_minus'] = df["ipm"] - df["conf.low"]

    fig = px.line(df, title=QUERY, x="Year", y="ipm", color="Country",
                  error_y="error_y", error_y_minus="error_y_minus")
    fig.show()
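The example above labels the result rows with `Year` and `Country` lists that are built separately from the virtual corpus definitions. The labels are only correct if all three comprehensions iterate in the same order, and if the result rows come back in the order of `vcs` (an assumption the example relies on). A small sketch that checks the first condition without contacting the API:

```python
YEARS = range(2010, 2019)
COUNTRIES = ["DE", "CH"]

# Same comprehensions as in the example above
vcs = [f"textType=/Zeit.*/ & pubPlaceKey={c} & pubDate in {y}"
       for c in COUNTRIES for y in YEARS]
years = [y for c in COUNTRIES for y in YEARS]
countries = [c for c in COUNTRIES for y in YEARS]

# Every label pair must match the virtual corpus definition at the same index.
for vc, c, y in zip(vcs, countries, years):
    assert f"pubPlaceKey={c} " in vc and vc.endswith(f"pubDate in {y}")

print(len(vcs))  # 2 countries x 9 years = 18 definitions
print(vcs[0])
print(vcs[9])
```

Because the country loop is the outer one, all years for `DE` come first, followed by all years for `CH`.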
collocationAnalysis method

    from KorAPClient import KorAPConnection

    kcon = KorAPConnection(verbose=True)
    results = kcon.collocationAnalysis("focus(in [tt/p=NN] {[tt/l=setzen]})",
                                       leftContextSize=1,
                                       rightContextSize=0,
                                       exactFrequencies=False,
                                       searchHitsSampleLimit=1000,
                                       topCollocatesLimit=20)

    # Turn each collocate into a Markdown link to the corresponding web UI query
    results['collocate'] = "[" + results['collocate'] + "](" + results['webUIRequestUrl'] + ")"

    # Print the ten top rows of selected association scores as a Markdown table
    print(results[['collocate', 'logDice', 'pmi', 'll']].head(10).round(2).to_markdown(floatfmt=".2f"))
| | collocate | logDice | pmi | ll |
|---|---|---|---|---|
| 1 | Szene | 10.37 | 11.54 | 824928.58 |
| 2 | Gang | 9.65 | 10.99 | 366993.93 |
| 3 | Verbindung | 9.20 | 10.34 | 347644.75 |
| 4 | Kenntnis | 9.15 | 10.67 | 206902.89 |
| 5 | Bewegung | 8.80 | 9.91 | 264577.07 |
| 6 | Brand | 8.76 | 9.97 | 210654.43 |
| 7 | Anführungszeichen | 8.06 | 12.52 | 54148.31 |
| 8 | Kraft | 7.94 | 8.91 | 189399.70 |
| 9 | Beziehung | 6.92 | 8.29 | 37723.54 |
| 10 | Relation | 6.64 | 10.24 | 17105.84 |
The Python KorAP client can also be called from the command line and shell scripts:
    $ korapclient -h
    usage: python -m KorAPClient [-h] [-v] [-l QUERY_LANGUAGE] [-u API_URL]
                                 [-c VC [VC ...]] [-q QUERY [QUERY ...]]

    Send a query to the KorAP API and print results as tsv.

    optional arguments:
      -h, --help            show this help message and exit
      -v, --verbose
      -l QUERY_LANGUAGE, --query-language QUERY_LANGUAGE
      -u API_URL, --api-url API_URL
                            Specify this to access a corpus other that DeReKo.
      -c VC [VC ...], --vc VC [VC ...]
                            virtual corpus definition[s]
      -q QUERY [QUERY ...], --query QUERY [QUERY ...]
                            If not specified only the size of the virtual corpus will be queried.

    example:
      python -m KorAPClient -v --query "Hello World" "Hallo Welt" --vc "pubDate in 2017" "pubDate in 2018" "pubDate in 2019"
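The `[VC ...]` and `[QUERY ...]` notation in the usage text means that each of these options accepts one or more values. The following sketch illustrates how such options split the example command line into lists. It is a hypothetical reconstruction with `argparse`, written only from the help text above: it is not the package's own source, and the option types and defaults are assumptions.

```python
import argparse


def build_parser():
    """Illustrative parser mirroring the documented options (not the package's code)."""
    parser = argparse.ArgumentParser(
        prog="python -m KorAPClient",
        description="Send a query to the KorAP API and print results as tsv.")
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument("-l", "--query-language")
    parser.add_argument("-u", "--api-url")
    # nargs="+" collects one or more values into a list (defaults are assumed)
    parser.add_argument("-c", "--vc", nargs="+", default=[])
    parser.add_argument("-q", "--query", nargs="+", default=[])
    return parser


# The arguments of the documented example, as the shell would pass them
args = build_parser().parse_args([
    "-v",
    "--query", "Hello World", "Hallo Welt",
    "--vc", "pubDate in 2017", "pubDate in 2018", "pubDate in 2019",
])

print(args.query)
print(args.vc)
```

Each quoted string on the command line becomes one list element, so the example passes two queries and three virtual corpus definitions.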
By using the KorAPClient, you agree to the respective terms of use of the accessed KorAP API services, which are printed when a connection is opened.
Author: Marc Kupietz
Copyright (c) 2021, Leibniz Institute for the German Language, Mannheim, Germany
This package is developed as part of the KorAP Corpus Analysis Platform at the Leibniz Institute for the German Language (IDS).
It is published under the BSD-2 License.
To cite this work, …
please refer to: Kupietz et al. (2020), below.
Contributions are very welcome!
Your contributions should ideally be committed via our Gerrit server to facilitate reviewing (see Gerrit Code Review - A Quick Introduction if you are not familiar with Gerrit). However, we are also happy to accept comments and pull requests via GitHub.
Please note that unless you explicitly state otherwise any contribution intentionally submitted for inclusion into this software shall – as this software itself – be under the BSD-2 License.
Kupietz, Marc / Margaretha, Eliza / Diewald, Nils / Lüngen, Harald / Fankhauser, Peter (2019): What’s New in EuReCo? Interoperability, Comparable Corpora, Licensing. In: Bański, Piotr/Barbaresi, Adrien/Biber, Hanno/Breiteneder, Evelyn/Clematide, Simon/Kupietz, Marc/Lüngen, Harald/Iliadi, Caroline (eds.): Proceedings of the International Corpus Linguistics Conference 2019 Workshop "Challenges in the Management of Large Corpora (CMLC-7)", 22nd of July Mannheim: Leibniz-Institut für Deutsche Sprache, 33-39.
Kupietz, Marc / Diewald, Nils / Margaretha, Eliza (2020): RKorAPClient: An R package for accessing the German Reference Corpus DeReKo via KorAP. In: Calzolari, Nicoletta, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis (eds.): Proceedings of The 12th Language Resources and Evaluation Conference (LREC 2020). Marseille: European Language Resources Association (ELRA), 7017-7023.