% layout 'main', title => 'KorAP: CQP';
%= page_title
<p>The following documentation introduces all features provided by our
version of the CQP Query Language and some KorAP-specific extensions.
This tutorial is based on the IMS Open Corpus Workbench (CWB)
<%= ext_link_to 'CQP Query Language Tutorial, version 3.4.24 (May 2020)',"https://cwb.sourceforge.io/files/CQP_Manual.pdf" %>
and on
<%= embedded_link_to 'doc', 'the KorAP Poliqarp+ tutorial', 'ql', 'poliqarp-plus' %>.</p>
<section id="segments">
<h3>Simple Segments</h3>
<p>The atomic elements of CQP queries are segments. Most of the time,
segments represent words and can be queried by enclosing them in
single or double quotes:</p>
%= doc_query cqp => loc('Q_cqp_simplesquote', "** 'Tree'")
<p>or</p>
%= doc_query cqp => loc('Q_cqp_simpledquote', '** "Tree"')
<p>A word segment is always interpreted as a <%= embedded_link_to 'doc', 'regular expression', 'ql', 'regexp' %>, as in the following query:</p>
%= doc_query cqp => loc('Q_cqp_re', '** "r(u|a)n"'), cutoff => 1
%# <p>can return both "Tannenbaum" and "baum".</p>
<p>Sequences of simple segments are expressed using a space delimiter:</p>
%= doc_query cqp => loc('Q_cqp_simpleseq1', '** "the" "Tree"')
%= doc_query cqp => loc('Q_cqp_simpleseq2', "** 'the' 'Tree'")
%# ------------------- Current state (ND)
<p>Originally, CQP was developed as a corpus query processor tool, and
every CQP command had to be followed by a semicolon. <%= ext_link_to 'The CQPweb server', "https://cqpweb.lancs.ac.uk/" %> at
Lancaster treats the semicolon as optional, and we implemented it in
the same way.</p>
<p>Simple segments always refer to the surface form of a word. To search
for surface forms without case sensitivity, you can use the <code>%c</code>
flag:</p>
%= doc_query cqp => loc('Q_cqp_simplescflag', '"laufen"%c'), cutoff => 0
<p>The query above will find all occurrences of the term irrespective of
the capitalization of letters.</p>
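<p>As a rough illustration of what the flag does, the following Python
sketch filters a few hypothetical tokens with and without case
sensitivity. It is not KorAP's implementation: the token list and the
<code>matches</code> helper are made up for this example, Python's regular
expression dialect differs from KorAP's, and the sketch assumes that a
segment has to match the whole token.</p>

```python
import re

def matches(pattern, token, ignore_case=False):
    """Return True if the pattern matches the whole token."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.fullmatch(pattern, token, flags) is not None

# Hypothetical tokens, for illustration only
tokens = ["laufen", "Laufen", "LAUFEN", "verlaufen"]

# Case-sensitive: only the exact surface form is kept
print([t for t in tokens if matches("laufen", t)])

# Case-insensitive, comparable to "laufen"%c
print([t for t in tokens if matches("laufen", t, ignore_case=True)])
```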
<p>Ignoring diacritics is not supported yet.</p>
<!-- EM
<p>To ignore diacritics, you can use the <code>%d</code> flag:</p>
%= doc_query cqp => loc('Q_cqp_simplesidia2', '"Fraulein"%d'), cutoff => 0
<p>The query above will find all occurrences of the term irrespective of
the use of diacritics (i.e., <code>Fräulein</code> and <code>Fraulein</code>).</p>
<p>Flags can be combined to ignore both case sensitivity and diacritics:</p>
%= doc_query cqp => loc('Q_cqp_simplesegidia2', '"Fraulein"%cd'), cutoff => 0
<p>The query above will find all occurrences of the term irrespective of
the use of diacritics or of capitalization: <code>fraulein</code>, <code>Fraulein</code>,
<code>fräulein</code>, etc.</p>
-->
<h4 id="regexp">Regular Expressions</h4>
<p>Special regular expression characters such as <code>.</code>, <code>?</code>,
<code>*</code>, <code>+</code>, <code>|</code>, <code>(</code>, <code>)</code>,
<code>[</code>, <code>]</code>, <code>{</code>, <code>}</code>, <code>^</code>,
<code>$</code> have to be "escaped" with a backslash (<code>\</code>) to be matched literally:</p>
<ul>
<li><code>"?";</code> fails while <code>"\?";</code> returns <code>?</code>.</li>
<li><code>"."</code> matches any single character, while <code>"\$\."</code>
returns <code>$.</code></li>
</ul>
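<p>If query strings are generated by a program, the escaping can be
automated. The following Python sketch is only an illustration: the helper
<code>escape_for_segment</code> is hypothetical and not part of KorAP, and
escaping the backslash itself is an assumption that goes beyond the
characters listed above.</p>

```python
# Characters listed above as special, plus the backslash itself (an assumption)
SPECIAL = set(".?*+|()[]{}^$") | {"\\"}

def escape_for_segment(text):
    """Prefix every special character in text with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in text)

# The two characters of "$." are both special and get escaped
print(escape_for_segment("$."))

# A plain word is returned unchanged
print(escape_for_segment("Baum"))
```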
<blockquote class="warning">
<p>Beware: Queries with prepended <code>.*</code> expressions can
become extremely slow!</p>
<p>In Poliqarp+, only double quotes are used for regular expressions,
while single quotes are used to mark verbatim strings. In CQP, you
can use the <code>%l</code> flag to match the string verbatim.</p>
</blockquote>
<p>To match a word form containing single or double quotes, use one of
the following methods:</p>
<ul>
<li>if the string you need to match contains either single or
double quotes, use the other quote character to enclose the
string: </li>
%= doc_query cqp => loc('Q_cqp_regexqu1', '"It\'s"'), cutoff => 0
<!-- EM
%= doc_query cqp => loc('Q_cqp_xxxx', '\'12"-screen\''), cutoff => 0
-->
<li>escape the quotes by doubling every occurrence of the quote
character inside the string: </li>
%= doc_query cqp => loc('Q_cqp_regexequ1', '\'It\'\'s\''), cutoff => 0
<!-- %= doc_query cqp => loc('Q_cqp_regexequ2', '"12""-screen"'), cutoff => 0 -->
<li>escape the quotes by using a backslash (<code>\</code>): </li>
%= doc_query cqp => loc('Q_cqp_regexequ3', "'It\\'s'"), cutoff => 0
<!-- %= doc_query cqp => loc('Q_cqp_regexequ4', '"12\\"-screen"'), cutoff => 0 -->
</ul>
</section>
<section id="complex">
<h3>Complex Segments</h3>
<p>Complex segments are expressed in square brackets and contain
additional information on the resource of the term under scrutiny by
providing key/value pairs, separated by an equals sign.</p>
<p>The KorAP implementation of CQP provides three special segment keys:
<code>orth</code> for surface forms, <code>base</code> for lemmata,
and <code>pos</code> for part-of-speech. The following complex query
finds all occurrences of the given surface form:</p>
%= doc_query cqp => loc('Q_cqp_compsl1', '[orth="Baum"]'), cutoff => 0
<p>The query is thus equivalent to:</p>
%= doc_query cqp => loc('Q_cqp_compsl2', '"Baum"'), cutoff => 0
<p>Complex segments expect simple expressions as values, meaning that
the following expression is valid as well:</p>
%= doc_query cqp => loc('Q_cqp_compsse', '[orth="l(au|ie)fen"%c]'), cutoff => 1
<p>Another special key is <code>base</code>, referring to the lemma
annotation of the <%= embedded_link_to 'doc', 'default foundry', 'data', 'annotation'%>. The following query finds all occurrences of segments
annotated with the specified lemma by the default foundry:</p>
%= doc_query cqp => loc('Q_cqp_compsbase', '[base="Baum"]'), cutoff => 1
<p>The third special key is <code>pos</code>, referring to the
part-of-speech annotation of the <%= embedded_link_to 'doc', 'default foundry', 'data', 'annotation'%>. The following query finds all attributive adjectives:</p>
%= doc_query cqp => loc('Q_cqp_compspos', '[pos="ADJA"]'), cutoff => 1
<p>Complex segments requesting further token annotations can have keys
following the <code>foundry/layer</code> notation. For example, to
find all occurrences of plural words or present-tense forms in a
supporting foundry, you can search using the following queries:</p>
%= doc_query cqp => loc('Q_cqp_compstoken1', '[marmot/m="number":"pl"]'), cutoff => 1
%= doc_query cqp => loc('Q_cqp_compstoken2', "[marmot/m='tense':'pres']"), cutoff => 1
<p>In case an annotation contains special non-alphabetic and non-numeric
characters, the annotation part can be followed by <code>%l</code> to
ensure a verbatim interpretation:</p>
%= doc_query cqp => loc('Q_cqp_compstokenverb', "[orth='https://de.wikipedia.org'%l]"), cutoff => 1
<h4>Negation</h4>
<p>Negation of terms in complex expressions can be expressed by
placing an exclamation mark before the equals sign or before the whole
expression.</p>
%= doc_query cqp => loc('Q_cqp_neg1', '[pos!="ADJA"] "Haare"'), cutoff => 1
%= doc_query cqp => loc('Q_cqp_neg2', '[!pos="ADJA"] "Haare"'), cutoff => 1
<blockquote class="warning">
<p>Beware: Negated complex segments can't be searched as a single
statement. However, they work in case they are part of a <%= embedded_link_to 'doc', 'sequence', 'ql', 'poliqarp-plus#syntagmatic-operators-sequence'%>.</p>
</blockquote>
<h4 id="empty-segments">Empty Segments</h4>
<p>A special segment is the empty segment, which matches every word in
the index.</p>
%= doc_query cqp => loc('Q_cqp_empseq', '[]'), cutoff => 1
<p>Empty segments are useful to express distances between words by using
<%= embedded_link_to 'doc', 'repetitions', 'ql', 'poliqarp-plus#syntagmatic-operators-repetitions'%>.</p>
<blockquote class="warning">
<p>Beware: Empty segments can't be searched as a single statement.
However, they work in case they are part of a <%= embedded_link_to 'doc', 'sequence', 'ql', 'poliqarp-plus#syntagmatic-operators-sequence'%>.</p>
</blockquote>
</section>
<section id="spans">
<h3>Span Segments</h3>
<p>Not all segments are bound to words. Some are bound to concepts
spanning multiple words, for example noun phrases, sentences, or
paragraphs. Span segments are structural elements and have a
specific syntax in different contexts. When used in complex segments,
they need to be searched by using angle brackets:
%= doc_query cqp => loc('Q_cqp_spansegm', '<corenlp/c=NP>'), cutoff => 1
Some spans such as <code>s</code> are special keywords that can be
used without angle brackets, as operands of specific functional
operators like <code>within</code>, <code>region</code>, <code>lbound</code>,
<code>rbound</code> or <code>MU(meet)</code>.
<!-- EM
but when used with specific functional
operators like <code>within</code>, <code>region</code>, <code>lbound</code>,
<code>rbound</code> or <code>MU(meet)</code>, the angular brackets
are not mandatory.
-->
</p>
</section>
<section id="paradigmatic-operators">
<h3>Paradigmatic Operators</h3>
<p>A complex segment can specify multiple properties that a token must
have. For example, to search for all words with a certain surface form
of a particular lemma (no matter if capitalized or not), you can search
for:</p>
%= doc_query cqp => loc('Q_cqp_parseg', '[orth="laufe"%c & base="Lauf"]'), cutoff => 1
<p>The ampersand combines multiple properties with a logical AND. Terms
of the complex segment can be negated as introduced before. The
following queries are equivalent:</p>
%= doc_query cqp => loc('Q_cqp_parsegamp1', '[orth="laufe"%c & base!="Lauf"]'), cutoff => 1
%= doc_query cqp => loc('Q_cqp_parsegamp2', '[orth="laufe"%c & !base="Lauf"]'), cutoff => 1
<p>Alternatives can be expressed by using the pipe symbol:</p>
%= doc_query cqp => loc('Q_cqp_parsegalt', '[base="laufen" | base="gehen"]'), cutoff => 1
<p>All these subexpressions can be grouped using round brackets to form
complex boolean expressions:</p>
%= doc_query cqp => loc('Q_cqp_parsegcb', '[(base="laufen" | base="gehen") & tt/pos="VVFIN"]'), cutoff => 1
<p>Round brackets can also be used to enclose single terms to
increase query readability, although they are not necessary:</p>
%= doc_query cqp => loc('Q_cqp_parsegrb', '[(base="laufen" | base="gehen") & (tt/pos="VVFIN")]'), cutoff => 1
<p>The negation operator can also be placed in front of expressions grouped
by round brackets. Keep <%= ext_link_to "De Morgan's Laws", "https://en.wikipedia.org/wiki/De_Morgan%27s_laws" %> in mind when you design your queries. The following query</p>
%= doc_query cqp => loc('Q_cqp_parsegneg1', '[(!(base="laufen" | base="gehen")) & (tt/pos="VVFIN")]'), cutoff => 1
<p>is logically equivalent to:</p>
%= doc_query cqp => loc('Q_cqp_parsegneg2', '[!(base="laufen") & !(base="gehen") & (tt/pos="VVFIN")]'), cutoff => 1
<p>which can be written more simply as:</p>
%= doc_query cqp => loc('Q_cqp_parsegneg3', '[!base="laufen" & !base="gehen" & tt/pos="VVFIN"]'), cutoff => 1
<p>or as:</p>
%= doc_query cqp => loc('Q_cqp_parsegneg4', '[base!="laufen" & base!="gehen" & tt/pos="VVFIN"]'), cutoff => 1
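<p>The equivalence of the grouped and the expanded forms can be checked
mechanically. The following Python sketch is only an illustration of De
Morgan's Laws, not of how KorAP evaluates queries: it models the three
conditions as booleans (the function and parameter names are made up for
this example) and compares both forms for every combination of truth
values.</p>

```python
from itertools import product

def grouped(is_laufen, is_gehen, is_vvfin):
    # Models [(!(base="laufen" | base="gehen")) & (tt/pos="VVFIN")]
    return (not (is_laufen or is_gehen)) and is_vvfin

def expanded(is_laufen, is_gehen, is_vvfin):
    # Models [base!="laufen" & base!="gehen" & tt/pos="VVFIN"]
    return (not is_laufen) and (not is_gehen) and is_vvfin

# Compare both forms for all eight combinations of truth values
all_equal = all(
    grouped(*values) == expanded(*values)
    for values in product([False, True], repeat=3)
)
print(all_equal)
```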
</section>
<section id="syntagmatic-operators">
<h3>Syntagmatic Operators</h3>
<h4 id="syntagmatic-operators-sequence">Sequences</h4>
<p>Sequences can be used to search for segments in order. For this,
simple expressions are separated by whitespace.</p>
%= doc_query cqp => loc('Q_cqp_syntop1', '"der" "alte" "Mann"'), cutoff => 1
<p>However, you can obviously search using complex segments as well:</p>
%= doc_query cqp => loc('Q_cqp_syntop2', '[orth="der"][orth="alte"][orth="Mann"]'), cutoff => 1
<p>Now you may see the benefit of the empty segment to search for words
you don't know:</p>
%= doc_query cqp => loc('Q_cqp_syntop3', '[orth="der"][][orth="Mann"]'), cutoff => 1
<h4>Position</h4>
<p>You are also able to mix segments and spans in sequences. In CQP,
spans are marked by XML-like structural elements signalling the
beginning and/or the end of a region and they can be used to look for
segments in a specific position in a bigger structure like a noun
phrase or a sentence.</p>
<p>To search for a word at the beginning of a sentence (or a syntactic
group), you can use the following equivalent queries:</p>
<ul>
<li>
Both queries match the word "Der" when positioned as the first word in a sentence:
%= doc_query cqp => loc('Q_cqp_posfirst1', '<base/s=s>[orth="Der"]'), cutoff => 1
%= doc_query cqp => loc('Q_cqp_posfirst2','<s>[orth="Der"]'), cutoff => 1
</li>
<li>The queries both match the word "Der" when positioned after the end of a sentence:
%= doc_query cqp => loc('Q_cqp_posaend1','</base/s=s>[orth="Der"]'), cutoff => 1
%= doc_query cqp => loc('Q_cqp_posaend2','</s>[orth="Der"]'), cutoff => 1
</li>
</ul>
<p>To search for a word at the end of a sentence (or a syntactic group),
you can use:</p>
<ul>
<li>Match the word "Mann"
when positioned as the last word in a sentence: </li>
%= doc_query cqp => loc('Q_cqp_posend1','[orth="Mann"]</base/s=s>'), cutoff => 1
%= doc_query cqp => loc('Q_cqp_posend2','[orth="Mann"]</s>'), cutoff => 1
<li>Match the
word "Mann" when positioned before the beginning of a sentence, as the
last word of the previous sentence: </li>
%= doc_query cqp => loc('Q_cqp_posbbeg1','[orth="Mann"]<base/s=s>'), cutoff => 1
%= doc_query cqp => loc('Q_cqp_posbbeg2','[orth="Mann"]<s>'), cutoff => 1
</ul>
<blockquote class="warning">
<p>Beware that when searching for longer sequences, sentence boundaries may be crossed. </p>
</blockquote>
<p> In the following example, sequences where "für" occurs in a previous
sentence may also be matched, because of the long sequence of empty
tokens in the query (minimum 20, maximum 25).
</p>
%= doc_query cqp => loc('Q_cqp_posbbeg3', '"für" []{20,25} "uns"</s>'), cutoff => 1
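<p>Why such a query can cross a sentence boundary is easier to see on a
small example. The following Python sketch is a simplified model and not
KorAP's search algorithm: the function <code>find_matches</code>, the
two-sentence text and the much shorter gap of 2 to 4 tokens are all
assumptions made for illustration. It looks for a first word, a gap of
arbitrary tokens and a sentence-final last word, and reports whether the
match starts in a different sentence than the one it ends in.</p>

```python
def find_matches(sentences, first, last, min_gap, max_gap):
    """Find 'first []{min_gap,max_gap} last' where last ends a sentence.

    Returns (start, end, crosses_boundary) tuples over the flattened tokens.
    """
    tokens = []       # flattened token stream
    sentence_of = []  # sentence index of every token
    final = set()     # positions of sentence-final tokens
    for index, sentence in enumerate(sentences):
        tokens.extend(sentence)
        sentence_of.extend([index] * len(sentence))
        final.add(len(tokens) - 1)

    matches = []
    for start, token in enumerate(tokens):
        if token != first:
            continue
        for gap in range(min_gap, max_gap + 1):
            end = start + gap + 1
            # The gap itself is unconstrained, so it may span a boundary
            if end in final and tokens[end] == last:
                crosses = sentence_of[start] != sentence_of[end]
                matches.append((start, end, crosses))
    return matches

# Hypothetical two-sentence text, with a much shorter gap than in the query above
text = [["Das", "ist", "für", "dich"], ["Er", "kommt", "zu", "uns"]]
print(find_matches(text, "für", "uns", 2, 4))
```

<p>In this sketch the only match starts in the first sentence and ends in
the second one, because nothing restricts the tokens in the gap to a single
sentence.</p>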
</section>