KorAP: Poliqarp+

The following documentation introduces all features provided by our version of the Poliqarp Query Language, along with some KorAP-specific extensions.

Simple Segments

The atomic elements of Poliqarp queries are segments. Most of the time segments represent words and can simply be queried:

Baum

Sequences of simple segments are expressed using a space delimiter:

der Baum

Simple segments always refer to the surface form of a word. To search for surface forms case-insensitively, append the /i flag:

laufen/i

The query above will find all occurrences of the term irrespective of the capitalization of letters.

Regular Expressions

Segments can also be queried using regular expressions by surrounding the segment with double quotes.

"l(au|ie)fen"

Regular expression segments always match the whole segment, meaning the above query only finds words that start with the first letter of the regular expression and end with its last letter (here: laufen and liefen). To match substrings instead, you can use the /x flag.

"l(au|ie)fen"/x

The /x flag searches for all segments that contain a sequence of characters matched by the regular expression. That means the above query is equivalent to:

".*?l(au|ie)fen.*?"

The /x flag can also be used in conjunction with strict expressions to search for substrings:

trenn/xi

The above query will find all occurrences of segments including the defined substring regardless of upper and lower case.
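The difference between whole-segment matching and substring matching with /x can be sketched in Python. This is only an analogy: the word list is made up, and Python's re module is not the regular expression dialect KorAP uses.

```python
import re

# Illustrative word list; not taken from any corpus.
words = ["laufen", "liefen", "verlaufen", "laufend", "kaufen"]
pattern = re.compile(r"l(au|ie)fen")

# Without /x: the expression has to cover the whole segment.
whole = [w for w in words if pattern.fullmatch(w)]

# With /x: the expression may match anywhere inside the segment.
inside = [w for w in words if pattern.search(w)]

# The explicit rewrite with .*? on both sides behaves like /x.
padded = re.compile(r".*?l(au|ie)fen.*?")
rewritten = [w for w in words if padded.fullmatch(w)]

print(whole)                # ['laufen', 'liefen']
print(inside)               # ['laufen', 'liefen', 'verlaufen', 'laufend']
print(rewritten == inside)  # True
```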

Beware: Queries with prepended .* expressions can become extremely slow!

In the original Poliqarp specification, regular expressions can be marked both by double quotes and single quotes. In Poliqarp+ only double quotes are used for regular expressions, while single quotes are used to mark verbatim strings.

You can also apply the /i flag to regular expressions to search case-insensitively.

"l(au|ie)fen"/xi

Reserved terms

The following terms are reserved words in Poliqarp+ and can therefore not be used in short notation of simple segments. Use the notation for complex segments to query them (e.g. [orth='contains']):

Complex Segments

Complex segments are expressed in square brackets and contain additional information on the resource of the term under scrutiny by providing key/value pairs, separated by an equals sign.

The KorAP implementation of Poliqarp provides three special segment keys: orth for surface forms, base for lemmata, and pos for part-of-speech tags. The following complex query finds all occurrences of the given surface form.

[orth=Baum]

The query is thus equivalent to:

Baum

Complex segments expect simple expressions as values, meaning that the following expression is valid as well:

[orth="l(au|ie)fen"/xi]

Another special key is base, referring to the lemma annotation of the default foundry. The following query finds all occurrences of segments annotated with the specified lemma by the default foundry.

[base=Baum]

The third special key is pos, referring to the part-of-speech annotation of the default foundry. The following query finds all attributive adjectives:

[pos=ADJA]

Complex segments requesting further token annotations can have keys following the foundry/layer notation. For example, to find all occurrences of plural words in a supporting foundry, you can use the following query:

[marmot/m=number:pl]

In case an annotation contains special non-alphabetic and non-numeric characters, the annotation part can be surrounded by single quotes to ensure a verbatim interpretation:

[orth='http://www.ids-mannheim.de/cosmas2/projekt/']

Negation

Terms in complex expressions can be negated by placing an exclamation mark either in front of the equals sign or in front of the whole expression.

[pos!=ADJA]
[!pos=ADJA]

Beware: Negated complex segments can't be searched as a single statement. However, they work when they are part of a sequence.

Empty Segments

A special segment is the empty segment, which matches every word in the index.

[]

Empty segments are useful to express distances of words by using repetitions.

Beware: Empty segments can't be searched as a single statement. However, they work when they are part of a sequence.

Span Segments

Not all segments are bound to words. Some are bound to concepts spanning multiple words, for example noun phrases, sentences, or paragraphs. Span segments can be searched for using angle brackets instead of square brackets.

<corenlp/c=NP>

Otherwise they can be treated in exactly the same way as simple or complex segments.

Paradigmatic Operators

A complex segment can specify multiple properties that a token must have. For example, to search for all words with a certain surface form (regardless of capitalization) that belong to a particular lemma, you can search for:

[orth=laufe/i & base=Lauf]

The ampersand combines multiple properties with a logical AND. Terms of the complex segment can be negated as introduced before.

[orth=laufe/i & base!=Lauf]

The following query is therefore equivalent:

[orth=laufe/i & !base=Lauf]

Alternatives can be expressed by using the pipe symbol:

[base=laufen | base=gehen]

All these subexpressions can be grouped using round brackets to form complex boolean expressions:

[(base=laufen | base=gehen) & tt/pos=VVFIN]
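The boolean reading of such a segment can be sketched as a predicate over a single token. The token dictionaries and the helper name matches_segment below are hypothetical and only mirror the keys used in the query above; they are not KorAP's data model.

```python
# Hypothetical tokens; the keys mirror the query keys used above.
tokens = [
    {"orth": "läuft", "base": "laufen", "tt/pos": "VVFIN"},
    {"orth": "gehen", "base": "gehen", "tt/pos": "VVINF"},
    {"orth": "ging", "base": "gehen", "tt/pos": "VVFIN"},
    {"orth": "Baum", "base": "Baum", "tt/pos": "NN"},
]

def matches_segment(token):
    # [(base=laufen | base=gehen) & tt/pos=VVFIN]
    is_motion_verb = token["base"] == "laufen" or token["base"] == "gehen"
    return is_motion_verb and token["tt/pos"] == "VVFIN"

hits = [t["orth"] for t in tokens if matches_segment(t)]
print(hits)  # ['läuft', 'ging']
```

All conditions are evaluated against the same token, which is what distinguishes paradigmatic operators from the sequence-level operators.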

Syntagmatic Operators

Sequences

Sequences can be used to search for segments in order. For this, simple expressions are separated by whitespace.

der alte Mann

You can use complex segments in sequences as well:

[orth=der][orth=alte][orth=Mann]

Now you may see the benefit of the empty segment to search for words you don't know:

[orth=der][][orth=Mann]

You are also able to mix segments and spans in sequences, for example to search for a word at the beginning of a sentence (which can be interpreted as the first word after the end of a sentence).

<base/s=s>[orth=Der]

Groups

...

Alternation

Alternations allow for searching alternative segments or sequences of segments, similar to the paradigmatic operator. You have already seen that you can search for a sequence with an alternative adjective in between by typing in:

der [orth=alte | orth=junge] Mann

However, this formulation has problems in case you want to search for alternations of sequences rather than terms. In this case you can use syntagmatic alternations and groups:

(dem jungen | der alte) Mann

The pipe symbol works the same way as with the paradigmatic alternation, but supports sequences of different length as operands. The above query with an alternative adjective in a sequence can therefore be reformulated as:

der (junge | alte) Mann

Repetition

Repetitions in Poliqarp are realized as in regular expressions, by giving quantifiers in curly brackets.

To search for a sequence of three occurrences of a defined string, you can formulate your query in any of the following ways. They all return the same results:

der der der
der{3}
[orth=der]{3}

Note that the repetition operator refers to the given pattern, not to the text it matched (it does not behave like a backreference in regular expressions). So the following query will give you a sequence of three words that each match the expression, but the words don't have to be identical.

"la.*?"/i{3}

The same is true for annotations. The following query will find a sequence of 3 to 4 adjectives in a defined context. The adjectives do not have to be identical though.

[base=ein][tt/p=ADJA]{3,4}[corenlp/p=NN]
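The distinction between repeating a pattern and repeating a match can be sketched in Python. The token list and both helper functions are hypothetical and only illustrate the idea; the first function corresponds to the behaviour described above, the second to a stricter reading that Poliqarp repetition does not implement.

```python
import re

# Illustrative token sequence.
tokens = ["Lara", "lacht", "laut", "und", "la", "la", "la"]
pattern = re.compile(r"la.*?", re.IGNORECASE)

def repeated_pattern(tokens, pattern, n):
    """Start positions of n consecutive tokens that each match the pattern."""
    return [
        i for i in range(len(tokens) - n + 1)
        if all(pattern.fullmatch(t) for t in tokens[i:i + n])
    ]

def repeated_match(tokens, pattern, n):
    """Stricter variant: the n tokens must additionally be identical."""
    return [
        i for i in repeated_pattern(tokens, pattern, n)
        if len(set(tokens[i:i + n])) == 1
    ]

print(repeated_pattern(tokens, pattern, 3))  # [0, 4]
print(repeated_match(tokens, pattern, 3))    # [4]
```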

In addition to numbered quantities, it is also possible to pass repetition information as Kleene operators ?, *, and +.

To search for a sequence with an optional segment, you can search for:

[base=die][tt/pos=ADJA]?[base=Baum]

This query is equivalent to the following numbered quantification:

[base=die][tt/pos=ADJA]{,1}[base=Baum]

To search for the same sequence but with any number of adjectives in between, you can use the Kleene Star:

[base=die][tt/pos=ADJA]*[base=Baum]

And to search for this sequence but with at least one adjective in between, you can use the Kleene Plus (all of the following queries are equivalent):

[base=die][tt/pos=ADJA]+[base=Baum]
[base=die][tt/pos=ADJA]{1,}[base=Baum]
[base=die][tt/pos=ADJA][tt/pos=ADJA]*[base=Baum]
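The equivalences between Kleene operators and numbered quantifiers stated above hold in the same way for ordinary regular expressions, which makes them easy to check in Python. In this sketch every token is reduced to a single letter; that encoding is an illustrative assumption and says nothing about how KorAP evaluates queries.

```python
import re

# Each token is reduced to one letter: D = die, A = adjective, B = Baum.
# Whitespace separates four independent example sequences.
text = "DB DAB DAAB DXB"

def spans(regex):
    """Return the matched letter sequences for a regular expression."""
    return [m.group(0) for m in re.finditer(regex, text)]

optional = spans(r"DA?B")
optional_numbered = spans(r"DA{,1}B")

one_or_more = spans(r"DA+B")
one_or_more_numbered = spans(r"DA{1,}B")
one_or_more_unrolled = spans(r"DAA*B")

print(optional == optional_numbered)                                # True
print(one_or_more == one_or_more_numbered == one_or_more_unrolled)  # True
print(spans(r"DA*B"))                                               # ['DB', 'DAB', 'DAAB']
```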

Repetition operators like {,n}, ?, and * make segments or groups of segments optional. If such a query is used on its own rather than as part of a sequence (and the query contains no mandatory segments), the system will warn you that your query won't be treated as optional.

Keep in mind that optionality can be inherited: for example, an entire query becomes optional as soon as one segment of an alternation is optional.

Repetition can also be used to express distances between segments by using empty segments.

[base=die][][base=Baum]
[base=die][]{2}[base=Baum]
[base=die][]{2,}[base=Baum]
[base=die][]{,3}[base=Baum]
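Read as distance constraints, the four queries above ask for exactly one, exactly two, at least two, and at most three tokens between the two lemmata. A minimal sketch of that reading, with a hypothetical helper distance_matches and a made-up lemma sequence:

```python
def distance_matches(lemmas, first, last, min_gap, max_gap=None):
    """Return (i, j) index pairs with min_gap..max_gap tokens in between."""
    hits = []
    for i, lemma in enumerate(lemmas):
        if lemma != first:
            continue
        for j in range(i + 1 + min_gap, len(lemmas)):
            gap = j - i - 1  # number of tokens between position i and j
            if max_gap is not None and gap > max_gap:
                break
            if lemmas[j] == last:
                hits.append((i, j))
    return hits

# Made-up lemma sequence with two tokens between "die" and "Baum".
lemmas = ["die", "alt", "groß", "Baum"]

print(distance_matches(lemmas, "die", "Baum", 1, 1))     # []      like []
print(distance_matches(lemmas, "die", "Baum", 2, 2))     # [(0, 3)] like []{2}
print(distance_matches(lemmas, "die", "Baum", 2))        # [(0, 3)] like []{2,}
print(distance_matches(lemmas, "die", "Baum", 0, 3))     # [(0, 3)] like []{,3}
```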

Of course, Kleene operators can be used with empty segments as well.

[base=die][]?[base=Baum]
[base=die][]*[base=Baum]
[base=die][]+[base=Baum]

Position

Sequences as shown above can all be nested in further complex queries and treated as subqueries (see class operators on how to later access these subqueries directly).

Positional operators compare two matches of subqueries and match if a certain condition regarding their positions holds.

The contains() operation matches when a second subquery matches inside the span of a first subquery.

contains(<base/s=s>, [tt/p=KOUS])

The startsWith() operation matches when a second subquery matches at the beginning of the span of a first subquery.

startsWith(<base/s=s>, [tt/p=KOUS])

The endsWith() operation matches when a second subquery matches at the end of the span of a first subquery.

endsWith(<base/s=s>, [opennlp/p=NN])

The matches() operation matches when a second subquery has exactly the same span as a first subquery.

matches(<base/s=s>,[tt/p=CARD][tt/p="N.*"])

The overlaps() operation matches when the span of a second subquery overlaps with the span of a first subquery.

overlaps([][tt/p=ADJA],{1:[tt/p=ADJA]}[])
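The five conditions can be sketched as predicates over token spans. The Span type and the half-open (start, end) convention are assumptions made for this illustration, and the overlaps() definition in particular (any shared token counts) is a guess at the intended meaning rather than KorAP's documented behaviour.

```python
from typing import NamedTuple

class Span(NamedTuple):
    start: int  # index of the first token
    end: int    # index one past the last token

def contains(a, b):
    return a.start <= b.start and b.end <= a.end

def starts_with(a, b):
    return a.start == b.start and b.end <= a.end

def ends_with(a, b):
    return a.start <= b.start and b.end == a.end

def matches(a, b):
    return a == b

def overlaps(a, b):
    # Assumption: any shared token counts as an overlap.
    return a.start < b.end and b.start < a.end

sentence = Span(0, 8)     # a sentence of eight tokens
conjunction = Span(0, 1)  # its first token
noun = Span(7, 8)         # its last token

print(starts_with(sentence, conjunction))  # True
print(ends_with(sentence, noun))           # True
print(overlaps(Span(0, 2), Span(1, 3)))    # True
print(overlaps(Span(0, 2), Span(2, 4)))    # False
```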

Positional operators are still experimental and may change in certain aspects in the future (although the behaviour defined is intended to be stable). There is also known incorrect behaviour which will be corrected in future versions.

Operands of positional operators currently have to be mandatory. Optional operands will be rewritten so that they occur at least once.

This behaviour may change in the future.

Class Operators

Classes are used to group submatches by surrounding them with curly brackets and a class number: {1:...}. Classes can be used to refer to submatches in a query, similar to captures in regular expressions. In Poliqarp+ classes have multiple purposes, with highlighting being the most intuitive one:

der {1:{2:[]} Mann}

In KorAP, class numbers can range from 1 to 128. If the class number is omitted, the class defaults to class number 1: {...} is equal to {1:...}.

Match Modification

Based on classes, matches may be modified. The focus() operator restricts the span of a match to the boundary of a certain class.

focus(der {Baum})

The query above will search for a sequence but the match will be limited to the second segment. You can think of the first segment in this query as a positive look-behind zero-length assertion in regular expressions.
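The look-behind analogy can be made concrete with Python's re module. This is only an analogy on a plain string: the example text is made up, and KorAP operates on token positions rather than character offsets.

```python
import re

text = "der Baum und ein Baum"

# Plain sequence: the match covers both words.
full = re.search(r"der Baum", text)

# Capture group: comparable to marking the second word as a class.
grouped = re.search(r"der (Baum)", text)

# Look-behind: "der " must precede the match but is not part of it.
focused = re.search(r"(?<=der )Baum", text)

print(full.span())      # (0, 8)
print(grouped.span(1))  # (4, 8)
print(focused.span())   # (4, 8)
```

Only the first "Baum" is found by the look-behind, since the second one is preceded by "ein".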

focus() is even more useful if you are searching for matches without knowing the surface form. For example, to find all terms between defined words you can search:

focus(der {[]} Mann)

Or you may want to search for all words immediately following a known sequence:

focus(der alte und {[]})

focus() is especially useful if you are searching for matches in certain areas, for example in quotes using positional operators. While not being interested in the whole quote as a match, you can focus on what's really relevant to you.

focus(contains(er []{,10} sagte, {Baum}))

If the class number is omitted, the focus operator defaults to class number 1: focus(...) is equal to focus(1: ...).

As numbers in curly brackets can be ambiguous in certain circumstances, for example []{3} can be read as either "any word repeated three times" or "any word followed by the number 3 highlighted as class number 1", numbers should always be expressed as [orth=3] for the latter case.