
KorAP: CQP

The following documentation introduces all features provided by our version of the CQP Query Language, along with some KorAP-specific extensions. This tutorial is based on the IMS Open Corpus Workbench (CWB) CQP Query Language Tutorial, version 3.4.24 (May 2020), and on the KorAP Poliqarp+ tutorial.

Simple Segments

The atomic elements of CQP queries are segments. Most of the time, segments represent words and can be queried by enclosing them in single or double quotes:

'Baum'

or

"Baum"

A word segment is always interpreted as a regular expression. For example, the following query matches both "baum" and "Tannenbaum":

"(Tannen)?baum"

Sequences of simple segments are expressed using a space delimiter:

"der" "Baum"
'der' 'Baum'

Originally, CQP was developed as a corpus query processor tool, and every CQP command had to be terminated by a semicolon. The CQPweb server at Lancaster treats the semicolon as optional, and we implemented it in the same way.

Simple segments always refer to the surface form of a word. To search for surface forms without case sensitivity, you can use the %c flag:

"laufen"%c

The query above will find all occurrences of the term irrespective of the capitalization of letters.
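As a rough illustration of these semantics, the following Python sketch treats a simple segment as a regular expression that has to match a whole token, and models %c as case-insensitive matching. The helper name matches_segment and the use of Python's re module are assumptions made for illustration only; KorAP's own regular expression engine may differ in detail.

```python
import re

def matches_segment(pattern: str, token: str, ignore_case: bool = False) -> bool:
    """Return True if the whole token matches the segment's regular expression."""
    flags = re.IGNORECASE if ignore_case else 0
    # fullmatch anchors the pattern at both ends of the token
    return re.fullmatch(pattern, token, flags) is not None

tokens = ["Laufen", "laufen", "LAUFEN", "verlaufen", "Tannenbaum", "baum"]

# "laufen"%c: any capitalization, but only the complete word form
print([t for t in tokens if matches_segment("laufen", t, ignore_case=True)])

# "(Tannen)?baum": the optional group lets both word forms match
print([t for t in tokens if matches_segment("(Tannen)?baum", t)])
```

Under this model, "verlaufen" is not found by "laufen"%c, because the pattern has to cover the entire token.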

Matching that ignores diacritics (the %d flag in CWB's CQP) is not supported yet.

Regular Expressions

Special regular expression characters such as ., ?, *, +, |, (, ), [, ], {, }, ^, $ have to be "escaped" with a backslash (\) to be matched literally.
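When queries are generated by a program, the escaping can be automated. The following Python sketch prefixes each of the characters listed above with a backslash. The helper names escape_cqp_regex and literal_segment are hypothetical, and escaping the backslash itself is an assumption that goes beyond the list given here.

```python
# Characters listed in the tutorial, plus the backslash itself
# (assumption: a literal backslash needs escaping as well).
CQP_SPECIAL = set(".?*+|()[]{}^$") | {"\\"}

def escape_cqp_regex(text: str) -> str:
    """Prefix every special regular expression character with a backslash."""
    return "".join("\\" + ch if ch in CQP_SPECIAL else ch for ch in text)

def literal_segment(text: str) -> str:
    """Build a double-quoted simple segment that matches `text` literally.

    Quote characters inside `text` are not handled by this sketch.
    """
    return '"' + escape_cqp_regex(text) + '"'

print(literal_segment("z.B."))   # "z\.B\."
print(literal_segment("(sic)"))
```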

Beware: Queries with prepended .* expressions can become extremely slow!

In Poliqarp+, only double quotes are used for regular expressions, while single quotes mark verbatim strings. In CQP, you can use the %l flag to match a string verbatim.

To match a word form containing single or double quotes, use one of the following methods:

Complex Segments

Complex segments are written in square brackets and specify which annotation of a token is queried, using key/value pairs separated by an equals sign.

The KorAP implementation of CQP provides three special segment keys: orth for surface forms, base for lemmata, and pos for part-of-speech. The following complex query finds all occurrences of the given surface form:

[orth="Baum"]

The query is thus equivalent to:

"Baum"

Complex segments expect simple expressions as values, meaning that the following expression is valid as well:

[orth="l(au|ie)fen"%c]

Another special key is base, referring to the lemma annotation of the default foundry. The following query finds all occurrences of segments annotated with the specified lemma by the default foundry:

[base="Baum"]

The third special key is pos, referring to the part-of-speech annotation of the default foundry. The following query finds all attributive adjectives:

[pos="ADJA"]

Complex segments requesting further token annotations can have keys following the foundry/layer notation. For example, to find all plural words or all present-tense forms annotated by a supporting foundry, you can use the following queries:

[marmot/m="number":"pl"]
[marmot/m='tense':'pres']

In case an annotation contains special non-alphabetic and non-numeric characters, the annotation part can be followed by %l to ensure a verbatim interpretation:

[orth='https://de.wikipedia.org'%l]

Negation

Terms in complex expressions can be negated by placing an exclamation mark either before the equals sign or before the whole expression.

[pos!="ADJA"] "Haare"
[!pos="ADJA"] "Haare"

Beware: Negated complex segments can't be searched as a single statement. However, they work in case they are part of a sequence.

Empty Segments

A special segment is the empty segment, which matches every word in the index.

[]

Empty segments are useful to express distances of words by using repetitions.

Beware: Empty segments can't be searched as a single statement. However, they work in case they are part of a sequence.

Span Segments

Not all segments are bound to words. Some are bound to concepts spanning multiple words, for example noun phrases, sentences, or paragraphs. Span segments are structural elements, and their syntax depends on the context. When used in complex segments, they are written in angle brackets:

<corenlp/c=NP>

Some spans, such as s, are special keywords that can be used without angle brackets as operands of specific functional operators like within, region, lbound, rbound, or MU (meet).

Paradigmatic Operators

A complex segment can specify multiple properties that a token must have. For example, to search for all words with a certain surface form of a particular lemma (no matter if capitalized or not), you can search for:

[orth="laufe"%c & base="Lauf"]

The ampersand combines multiple properties with a logical AND. Terms of the complex segment can be negated as introduced before. The following queries are equivalent:

[orth="laufe"%c & base!="Lauf"]
[orth="laufe"%c & !base="Lauf"]

Alternatives can be expressed by using the pipe symbol:

[base="laufen" | base="gehen"]

All these subexpressions can be grouped using round brackets to form complex boolean expressions:

[(base="laufen" | base="gehen") & tt/pos="VVFIN"]

Round brackets can also enclose single terms to improve query readability, although they are not necessary there:

[(base="laufen" | base="gehen") & (tt/pos="VVFIN")]

The negation operator can be placed in front of expressions grouped by round brackets. Keep De Morgan's laws in mind when designing queries. The following query

[(!(base="laufen" | base="gehen")) & (tt/pos="VVFIN")]

is logically equivalent to:

[!(base="laufen") & !(base="gehen") & (tt/pos="VVFIN")]

which can be written more simply as:

[!base="laufen" & !base="gehen" & tt/pos="VVFIN"]

or as:

[base!="laufen" & base!="gehen" & tt/pos="VVFIN"]
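The equivalence can be checked mechanically. The following Python sketch models each term as a boolean and compares the grouped negation with its rewritten form over every combination of truth values. The function names are made up for this illustration, and the third function shows a common mistake: keeping the | operator when moving the negation inside the brackets.

```python
from itertools import product

def negated_group(is_laufen: bool, is_gehen: bool, is_vvfin: bool) -> bool:
    # [(!(base="laufen" | base="gehen")) & (tt/pos="VVFIN")]
    return (not (is_laufen or is_gehen)) and is_vvfin

def negated_terms(is_laufen: bool, is_gehen: bool, is_vvfin: bool) -> bool:
    # [!base="laufen" & !base="gehen" & tt/pos="VVFIN"]
    return (not is_laufen) and (not is_gehen) and is_vvfin

def wrong_rewrite(is_laufen: bool, is_gehen: bool, is_vvfin: bool) -> bool:
    # [(!base="laufen" | !base="gehen") & tt/pos="VVFIN"]  (NOT equivalent)
    return ((not is_laufen) or (not is_gehen)) and is_vvfin

rows = list(product([False, True], repeat=3))
print(all(negated_group(*r) == negated_terms(*r) for r in rows))  # True
print(all(negated_group(*r) == wrong_rewrite(*r) for r in rows))  # False
```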

Syntagmatic Operators

Sequences

Sequences can be used to search for segments in order. For this, simple expressions are separated by whitespace.

"der" "alte" "Mann"

You can also use complex segments in sequences:

[orth="der"][orth="alte"][orth="Mann"]

Now you may see the benefit of the empty segment to search for words you don't know:

[orth="der"][][orth="Mann"]
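To make the role of the empty segment concrete, here is a minimal Python sketch that matches a sequence query position by position against a list of tokens. It is only a toy model: the function find_sequences is hypothetical, None stands in for the empty segment [], and the real search runs on an index rather than on a token list.

```python
import re
from typing import List, Optional

# A query is a list of per-token patterns; None stands for the empty segment [].
Query = List[Optional[str]]

def find_sequences(tokens: List[str], query: Query) -> List[List[str]]:
    """Return every run of consecutive tokens that matches the query."""
    hits = []
    for start in range(len(tokens) - len(query) + 1):
        window = tokens[start:start + len(query)]
        # Each position must either be an empty segment or match its pattern
        if all(p is None or re.fullmatch(p, t) for p, t in zip(query, window)):
            hits.append(window)
    return hits

tokens = "der alte Mann und der junge Mann und der Mann".split()

# [orth="der"][][orth="Mann"]
print(find_sequences(tokens, ["der", None, "Mann"]))
```

In this model the final "der Mann" is not returned, because the empty segment requires exactly one token between the two words.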

Position

You are also able to mix segments and spans in sequences. In CQP, spans are marked by XML-like structural elements signalling the beginning and/or the end of a region and they can be used to look for segments in a specific position in a bigger structure like a noun phrase or a sentence.

To search for a word at the beginning of a sentence (or a syntactic group), the following queries are equivalent.

To search for a word at the end of a sentence (or a syntactic group), place the closing structural element directly after the segment, as in "uns"</s>.

Beware that when searching for longer sequences, sentence boundaries may be crossed.

In the following example, sequences where "für" occurs in a previous sentence may also be matched, because of the long sequence of empty tokens in the query (minimum 20, maximum 25).

"für" []{20,25} "uns"</s>
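The effect can be reproduced with a small Python model. The sketch below flattens a few sentences into one token stream and looks for a first word, a gap of empty tokens, and a last word at the end of a sentence. The function find_with_gap and the sample sentences are invented for illustration, and a much smaller gap of {2,3} is used to keep the example readable.

```python
from typing import List, Tuple

def find_with_gap(sentences: List[List[str]], first: str, last: str,
                  min_gap: int, max_gap: int) -> List[Tuple[int, int]]:
    """Model of: first []{min_gap,max_gap} last</s>

    Returns (sentence of `first`, sentence of `last`) for every match.
    """
    # Flatten the corpus, remembering the sentence each token belongs to
    flat = [(tok, s_idx) for s_idx, sent in enumerate(sentences) for tok in sent]
    hits = []
    for end, (tok, s_idx) in enumerate(flat):
        # </s> only matches directly after the last token of a sentence
        is_sentence_end = end == len(flat) - 1 or flat[end + 1][1] != s_idx
        if tok != last or not is_sentence_end:
            continue
        for gap in range(min_gap, max_gap + 1):
            start = end - gap - 1
            # The gap itself is not checked for sentence boundaries
            if start >= 0 and flat[start][0] == first:
                hits.append((flat[start][1], s_idx))
    return hits

sentences = [
    ["Das", "ist", "gut", "für", "alle"],
    ["Er", "hilft", "uns"],
    ["Sie", "kocht", "für", "dich", "und", "uns"],
]

# "für" []{2,3} "uns"</s>
print(find_with_gap(sentences, "für", "uns", 2, 3))  # [(0, 1), (2, 2)]
```

The first hit starts in sentence 0 and ends in sentence 1, which is the kind of boundary-crossing match described above; the second hit stays within a single sentence.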