KorAP: CQP
The following documentation introduces all features provided by our version of the CQP Query Language and some KorAP-specific extensions. This tutorial is based on the IMS Open Corpus Workbench (CWB) CQP Query Language Tutorial, version 3.4.24 (May 2020) and on the KorAP Poliqarp+ tutorial.
Simple Segments
The atomic elements of CQP queries are segments. Most of the time, segments represent words and can be queried by encapsulating them in single or double quotes:
'Baum'
or
"Baum"
A word segment is always interpreted as a regular expression. For example, the following query matches both "Tannenbaum" and "baum":
"(Tannen)?baum"
Sequences of simple segments are expressed using a space delimiter:
"der" "Baum"
'der' 'Baum'
Originally, CQP was developed as a corpus query processor tool and any CQP command had to be followed by a semicolon. The CQPweb server at Lancaster treats the semicolon as optional, and we implemented it in the same way.
Simple segments always refer to the surface form of a word. To search for surface forms without case sensitivity, you can use the %c flag:
"laufen"%c
The query above will find all occurrences of the term irrespective of the capitalization of letters.
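The two behaviours described so far, regular-expression matching of word segments and case-insensitive matching with %c, can be mimicked outside KorAP with Python's re module. This is only an analogy, not the KorAP implementation: the token list and the helper match_tokens are made up for illustration, and the sketch assumes that a segment has to match the whole token.

```python
import re

# Hypothetical tokens; in KorAP they come from the corpus index.
tokens = ["Tannenbaum", "baum", "Apfelbaum", "laufen", "Laufen", "LAUFEN"]


def match_tokens(pattern, tokens, ignore_case=False):
    """Return the tokens that the pattern matches as a whole."""
    flags = re.IGNORECASE if ignore_case else 0
    compiled = re.compile(pattern, flags)
    # fullmatch models the assumption that a segment covers the entire token.
    return [t for t in tokens if compiled.fullmatch(t)]


print(match_tokens(r"(Tannen)?baum", tokens))             # ['Tannenbaum', 'baum']
print(match_tokens(r"laufen", tokens))                    # ['laufen']
print(match_tokens(r"laufen", tokens, ignore_case=True))  # ['laufen', 'Laufen', 'LAUFEN']
```

The last call corresponds roughly to "laufen"%c, the second one to the same query without the flag.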
Ignoring diacritics is not supported yet.
Regular Expressions
Special regular expression characters like ., ?, *, +, |, (, ), [, ], {, }, ^, $ have to be "escaped" with a backslash (\):
- "?"; fails, while "\?"; returns ?.
- "." returns any character, while "\$\." returns $.
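When query strings are built by a program, the escaping rule can be applied automatically. The following is a minimal sketch; the helper name escape_cqp_regex is hypothetical, and treating the backslash itself as a character to escape is an assumption that goes beyond the list above.

```python
# Characters listed above as special, plus the backslash itself (assumption).
SPECIAL_CHARS = set(".?*+|()[]{}^$\\")


def escape_cqp_regex(text: str) -> str:
    """Prefix every special regular expression character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


print(escape_cqp_regex("?"))     # \?
print(escape_cqp_regex("$."))    # \$\.
print(escape_cqp_regex("Baum"))  # Baum
```

The escaped text still has to be wrapped in quotes before it can be used as a segment.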
Beware: Queries with prepended .* expressions can become extremely slow!
In Poliqarp+ only double quotes are used for regular expressions, while single quotes are used to mark verbatim strings. In CQP, you can use the %l flag to match the string in a verbatim manner.
To match a word form containing single or double quotes, use one of the following methods:
- if the string you need to match contains either single or double quotes, use the other quote character to encapsulate the string: "It's"
- double every occurrence of the quote character inside the string: 'It''s'
- escape the quote character with a backslash (\): 'It\'s'
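The first two methods can be combined in a small helper that picks a suitable quoting style. This is a sketch under assumptions: quote_word is a hypothetical name, doubling is assumed to work for double quotes in the same way as shown for single quotes, and regular expression characters inside the word are not escaped here.

```python
def quote_word(word: str) -> str:
    """Wrap a word form in quotes so that embedded quotes stay intact."""
    if '"' not in word:
        # No double quote inside: double quotes are safe delimiters.
        return '"' + word + '"'
    if "'" not in word:
        # Only double quotes inside: switch to single quotes.
        return "'" + word + "'"
    # Both kinds occur: double every embedded double quote (assumption).
    return '"' + word.replace('"', '""') + '"'


print(quote_word("It's"))      # "It's"
print(quote_word('"Baum"'))    # '"Baum"'
```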
Complex Segments
Complex segments are expressed in square brackets and contain additional information on the resource of the term under scrutiny by providing key/value pairs, separated by an equals sign.
The KorAP implementation of CQP provides three special segment keys: orth for surface forms, base for lemmata, and pos for part-of-speech. The following complex query finds all surface forms of the defined word:
[orth="Baum"]
The query is thus equivalent to:
"Baum"
Complex segments expect simple expressions as values, meaning that the following expression is valid as well:
[orth="l(au|ie)fen"%c]
Another special key is base, referring to the lemma annotation of the default foundry. The following query finds all occurrences of segments annotated as a specified lemma by the default foundry:
[base="Baum"]
The third special key is pos, referring to the part-of-speech annotation of the default foundry. The following query finds all attributive adjectives:
[pos="ADJA"]
Complex segments requesting further token annotations can have keys following the foundry/layer notation. For example, to find all occurrences of plural words or of present-tense forms in a supporting foundry, you can search using the following queries:
[marmot/m="number":"pl"]
[marmot/m='tense':'pres']
In case an annotation contains special non-alphabetic and non-numeric characters, the annotation part can be followed by %l to ensure a verbatim interpretation:
[orth='https://de.wikipedia.org'%l]
Negation
Negation of terms in complex expressions can be expressed by placing an exclamation mark either before the equals sign or before the whole expression.
[pos!="ADJA"] "Haare"
[!pos="ADJA"] "Haare"
Beware: Negated complex segments can't be searched as a single statement. However, they work in case they are part of a sequence.
Empty Segments
A special segment is the empty segment, which matches every word in the index.
[]
Empty segments are useful to express distances of words by using repetitions.
Beware: Empty segments can't be searched as a single statement. However, they work in case they are part of a sequence.
Span Segments
Not all segments are bound to words; some are bound to concepts spanning multiple words, for example noun phrases, sentences, or paragraphs. Span segments are structural elements, and their syntax depends on the context. When used in complex segments, they need to be searched by using angle brackets:
<corenlp/c=NP>
Some spans such as s are special keywords that can be used without angle brackets, as operands of specific functional operators like within, region, lbound, rbound or MU(meet).
Paradigmatic Operators
A complex segment can combine multiple properties that a token must have. For example, to search for all words with a certain surface form of a particular lemma (no matter if capitalized or not), you can search for:
[orth="laufe"%c & base="Lauf"]
The ampersand combines multiple properties with a logical AND. Terms of the complex segment can be negated as introduced before. The following queries are equivalent:
[orth="laufe"%c & base!="Lauf"]
[orth="laufe"%c & !base="Lauf"]
Alternatives can be expressed by using the pipe symbol:
[base="laufen" | base="gehen"]
All these sub expressions can be grouped using round brackets to form complex boolean expressions:
[(base="laufen" | base="gehen") & tt/pos="VVFIN"]
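To make the evaluation order of such a grouped expression concrete, the same condition can be written as a predicate over annotated tokens. The sketch below is an illustration only: the token dictionaries and their annotation values are invented, and plain string equality stands in for the regular expression matching that KorAP performs.

```python
# Hypothetical annotated tokens; keys mirror the query keys used above.
tokens = [
    {"orth": "läuft", "base": "laufen", "tt/pos": "VVFIN"},
    {"orth": "gehen", "base": "gehen", "tt/pos": "VVINF"},
    {"orth": "ging", "base": "gehen", "tt/pos": "VVFIN"},
    {"orth": "Baum", "base": "Baum", "tt/pos": "NN"},
]


def matches(token: dict) -> bool:
    """(base="laufen" | base="gehen") & tt/pos="VVFIN" as a Python predicate."""
    lemma_ok = token["base"] == "laufen" or token["base"] == "gehen"
    return lemma_ok and token["tt/pos"] == "VVFIN"


hits = [t["orth"] for t in tokens if matches(t)]
print(hits)  # ['läuft', 'ging']
```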
Round brackets can also be used to encapsulate simple segments, to increase query readability, although they are not necessary:
[(base="laufen" | base="gehen") & (tt/pos="VVFIN")]
The negation operator can be used outside expressions grouped by round brackets. Be aware of De Morgan's laws when you design your queries: the following query
[(!(base="laufen" | base="gehen")) & (tt/pos="VVFIN")]
is logically equivalent to:
[!(base="laufen") & !(base="gehen") & (tt/pos="VVFIN")]
which can be written more simply as:
[!base="laufen" & !base="gehen" & tt/pos="VVFIN"]
or as:
[base!="laufen" & base!="gehen" & tt/pos="VVFIN"]
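The equivalence follows from De Morgan's laws and can be checked mechanically with a truth table. In this sketch the three Boolean parameters stand for the conditions base="laufen", base="gehen" and tt/pos="VVFIN"; the function names are made up for illustration.

```python
from itertools import product


def grouped(is_laufen: bool, is_gehen: bool, is_vvfin: bool) -> bool:
    """Negation applied to the whole bracketed alternative."""
    return (not (is_laufen or is_gehen)) and is_vvfin


def distributed(is_laufen: bool, is_gehen: bool, is_vvfin: bool) -> bool:
    """Negation pushed onto each term, joined with AND."""
    return (not is_laufen) and (not is_gehen) and is_vvfin


# Compare both forms on all eight combinations of truth values.
equivalent = all(
    grouped(*values) == distributed(*values)
    for values in product([False, True], repeat=3)
)
print(equivalent)  # True
```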
Syntagmatic Operators
Sequences
Sequences can be used to search for segments in order. For this, simple expressions are separated by whitespace.
"der" "alte" "Mann"
However, you can obviously search using complex segments as well:
[orth="der"][orth="alte"][orth="Mann"]
Now you may see the benefit of the empty segment to search for words you don't know:
[orth="der"][][orth="Mann"]
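The role of the empty segment as a wildcard for exactly one token can be sketched with a simple matcher over a token list. The helper find_sequence and the sample tokens are hypothetical, None stands for the empty segment, and plain string equality replaces regular expression matching to keep the example short.

```python
def find_sequence(tokens, pattern):
    """Return the start positions at which the pattern matches.

    Each pattern element is either a surface form or None,
    where None plays the role of the empty segment [].
    """
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(p is None or p == t for p, t in zip(pattern, window)):
            hits.append(start)
    return hits


tokens = ["der", "alte", "Mann", "und", "der", "junge", "Mann"]
print(find_sequence(tokens, ["der", None, "Mann"]))    # [0, 4]
print(find_sequence(tokens, ["der", "alte", "Mann"]))  # [0]
```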
Position
You are also able to mix segments and spans in sequences. In CQP, spans are marked by XML-like structural elements signalling the beginning and/or the end of a region and they can be used to look for segments in a specific position in a bigger structure like a noun phrase or a sentence.
To search for a word at the beginning of a sentence (or a syntactic group), the following queries are equivalent.
- The queries both match the word "Der" when positioned as a first word in a sentence:
<base/s=s>[orth="Der"]
<s>[orth="Der"]
- The queries both match the word "Der" when positioned after the end of a sentence:
</base/s=s>[orth="Der"]
</s>[orth="Der"]
- Match the word "Mann" when positioned as the last word in a sentence:
[orth="Mann"]</base/s=s>
[orth="Mann"]</s>
- Match the word "Mann" when positioned directly before the beginning of a sentence:
[orth="Mann"]<base/s=s>
[orth="Mann"]<s>
Beware that when searching for longer sequences, sentence boundaries may be crossed.
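The effect can be reproduced on a toy corpus. The sketch below looks for a first word, a gap of arbitrary tokens, and a sentence-final last word, and reports whether the match crosses a sentence boundary. Everything in it is an assumption made for illustration: the helper find_gap_matches, the two sample sentences, and a gap of 1 to 3 tokens chosen to keep the data small.

```python
def find_gap_matches(sentences, first, last, min_gap, max_gap):
    """Find `first`, then min_gap..max_gap tokens, then a sentence-final `last`.

    Returns (start, end, crosses_boundary) tuples over the flattened corpus.
    """
    # Flatten the corpus but remember the sentence index of every token.
    flat = [(tok, s_idx) for s_idx, sent in enumerate(sentences) for tok in sent]
    results = []
    for end, (tok, s_end) in enumerate(flat):
        sentence_final = end == len(flat) - 1 or flat[end + 1][1] != s_end
        if tok != last or not sentence_final:
            continue
        for gap in range(min_gap, max_gap + 1):
            start = end - gap - 1
            if start >= 0 and flat[start][0] == first:
                results.append((start, end, flat[start][1] != s_end))
    return results


sentences = [
    ["Das", "ist", "für", "dich"],
    ["Komm", "zu", "uns"],
]
print(find_gap_matches(sentences, "für", "uns", 1, 3))  # [(2, 6, True)]
```

The single hit starts in the first sentence and ends in the second one, which is the kind of match the warning refers to.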
In the following example, sequences where "für" occurs in a previous sentence may also be matched, because of the long sequence of empty tokens in the query (minimum 20, maximum 25).
"für" []{20,25} "uns"</s>