% layout 'main', title => 'KorAP: CQP';

%= page_title

<p>The following documentation introduces all features provided by our
  version of the CQP Query Language and some KorAP-specific extensions.
  This tutorial is based on the IMS Open Corpus Workbench (CWB)
  <%= ext_link_to 'CQP Query Language Tutorial, version 3.4.24 (May 2020)',"https://cwb.sourceforge.io/files/CQP_Manual.pdf" %>
  and on
  <%= embedded_link_to 'doc', 'the KorAP Poliqarp+ tutorial', 'ql', 'poliqarp-plus' %>.</p>

<section id="segments">
  <h3>Simple Segments</h3>
  <p>The atomic elements of CQP queries are segments. Most of the time,
    segments represent words and can be queried by enclosing them in
    single or double quotes:</p>

  %= doc_query cqp => loc('Q_cqp_simplesquote', "** 'Tree'")

  <p>or</p>

  %= doc_query cqp => loc('Q_cqp_simpledquote', '** "Tree"')

  <p>A word segment is always interpreted as a <%= embedded_link_to 'doc', 'regular expression', 'ql', 'regexp' %>. For example, the following query matches both <code>run</code> and <code>ran</code>:</p>

  %= doc_query cqp => loc('Q_cqp_re', '** "r(u|a)n"'), cutoff => 1

  %# <p>can return both "Tannenbaum" and "baum".</p>

  <p>Sequences of simple segments are expressed using a space delimiter:</p>

  %= doc_query cqp => loc('Q_cqp_simpleseq1', '** "the" "Tree"')

  %= doc_query cqp => loc('Q_cqp_simpleseq2', "** 'the' 'Tree'")

  %# ------------------- Current state (ND)

  <p>Originally, CQP was developed as a corpus query processor tool, and
    any CQP command had to be followed by a semicolon. <%= ext_link_to 'The CQPweb server', "https://cqpweb.lancs.ac.uk/" %> at
    Lancaster treats the semicolon as optional, and we implemented it in
    the same way.</p>
  <p>Simple segments always refer to the surface form of a word. To search
    for surface forms without case sensitivity, you can use the <code>%c</code>
    flag:</p>

  %= doc_query cqp => loc('Q_cqp_simplescflag', '"laufen"%c'), cutoff => 0

  <p>The query above will find all occurrences of the term irrespective of
    the capitalization of letters.</p>
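
  <p>The effect of the <code>%c</code> flag can be imitated outside of KorAP
    with a short Python sketch. It is only an illustration: the helper
    <code>matches_token</code> is hypothetical, and it assumes that a pattern
    has to cover the whole token, which may differ in detail from what the
    KorAP index does.</p>

```python
import re


def matches_token(pattern: str, token: str, ignore_case: bool = False) -> bool:
    """Return True if the pattern covers the whole token.

    Assumption: whole-token matching; ignore_case plays the role of
    the case-insensitivity flag of the query language.
    """
    flags = re.IGNORECASE if ignore_case else 0
    return re.fullmatch(pattern, token, flags) is not None


tokens = ["laufen", "Laufen", "LAUFEN", "gelaufen"]

# Without the flag only the exact spelling is found.
print([t for t in tokens if matches_token("laufen", t)])
# prints ['laufen']

# With the flag all capitalization variants are found.
print([t for t in tokens if matches_token("laufen", t, ignore_case=True)])
# prints ['laufen', 'Laufen', 'LAUFEN']
```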

  <p>Ignoring diacritics is not supported yet.</p>

  <!-- EM
  <p>To ignore diacritics, you can use the <code>%d</code> flag:</p>

  %= doc_query cqp => loc('Q_cqp_simplesidia2', '"Fraulein"%d'), cutoff => 0

  <p>The query above will find all occurrences of the term irrespective of
    the use of diacritics (i.e., <code>Fräulein</code> and <code>Fraulein</code>).</p>

  <p>Flags can be combined to ignore both case sensitivity and diacritics:</p>

  %= doc_query cqp => loc('Q_cqp_simplesegidia2', '"Fraulein"%cd'), cutoff => 0

  <p>The query above will find all occurrences of the term irrespective of
    the use of diacritics or of capitalization: <code>fraulein</code>, <code>Fraulein</code>,
    <code>fräulein</code>, etc.</p>
  -->

  <h4 id="regexp">Regular Expressions</h4>
  <p>Special regular expression characters like <code>.</code>, <code>?</code>,
    <code>*</code>, <code>+</code>, <code>|</code>, <code>(</code>, <code>)</code>,
    <code>[</code>, <code>]</code>, <code>{</code>, <code>}</code>, <code>^</code>,
    <code>$</code> have to be escaped with a backslash (<code>\</code>):</p>
  <ul>
    <li><code>"?";</code> fails, while <code>"\?";</code> returns <code>?</code>.</li>
    <li><code>"."</code> matches any single character, while <code>"\$\."</code>
      returns <code>$.</code></li>
  </ul>
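
  <p>The same escaping rules can be tried out with Python's <code>re</code>
    module, whose syntax is close to, but not identical with, the regular
    expressions accepted by KorAP. The following sketch is therefore only an
    approximation; the helper <code>is_valid</code> is hypothetical.</p>

```python
import re


def is_valid(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# A bare quantifier has nothing to repeat, so the pattern is rejected.
print(is_valid("?"))      # prints False
# The escaped question mark is an ordinary character.
print(is_valid(r"\?"))    # prints True

# An unescaped dot stands for any single character ...
print(re.fullmatch(".", "a") is not None)          # prints True
# ... while the escaped form matches the literal string only.
print(re.fullmatch(r"\$\.", "$.") is not None)     # prints True
print(re.fullmatch(r"\$\.", "$a") is not None)     # prints False

# re.escape adds the backslashes automatically.
print(re.escape("$."))    # prints \$\.
```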
  <blockquote class="warning">
    <p>Beware: Queries with leading <code>.*</code> expressions can
      become extremely slow!</p>
    <p>In Poliqarp+, only double quotes are used for regular expressions,
      while single quotes are used to mark verbatim strings. In CQP, you
      can use the <code>%l</code> flag to match the string verbatim.</p>
  </blockquote>
  <p>To match a word form containing single or double quotes, use one of
    the following methods:</p>
  <ul>
    <li>if the string you need to match contains either single or
      double quotes, use the other quote character to enclose the
      string:

      %= doc_query cqp => loc('Q_cqp_regexqu1', '"It\'s"'), cutoff => 0

      <!-- EM
      %= doc_query cqp => loc('Q_cqp_xxxx', '\'12"-screen\''), cutoff => 0
      -->
    </li>

    <li>escape the quotes by doubling every occurrence of the quote
      character inside the string:

      %= doc_query cqp => loc('Q_cqp_regexequ1', '\'It\'\'s\''), cutoff => 0

      <!-- %= doc_query cqp => loc('Q_cqp_regexequ2', '"12""-screen"'), cutoff => 0 -->
    </li>

    <li>escape the quotes with a backslash (<code>\</code>):

      %= doc_query cqp => loc('Q_cqp_regexequ3', "'It\\'s'"), cutoff => 0

      <!-- %= doc_query cqp => loc('Q_cqp_regexequ4', '"12\\"-screen"'), cutoff => 0 -->
    </li>
  </ul>
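
  <p>When queries are generated by a program, the first two methods can be
    automated. The following Python sketch shows one possible way to do
    this; the function names <code>quote_by_doubling</code> and
    <code>quote_with_other</code> are hypothetical and not part of KorAP.
    Note that the sketch only handles quote characters and leaves regular
    expression characters untouched.</p>

```python
def quote_by_doubling(text: str, quote: str = "'") -> str:
    """Wrap text in the quote character and double every inner occurrence."""
    if quote not in ("'", '"'):
        raise ValueError("quote must be a single or a double quote")
    return quote + text.replace(quote, quote * 2) + quote


def quote_with_other(text: str) -> str:
    """Prefer the quote character that does not occur in the text."""
    if "'" not in text:
        return "'" + text + "'"
    if '"' not in text:
        return '"' + text + '"'
    # Both kinds of quotes occur, so fall back to doubling.
    return quote_by_doubling(text)


print(quote_with_other("It's"))         # prints "It's"
print(quote_with_other('12"-screen'))   # prints '12"-screen'
print(quote_by_doubling("It's"))        # prints 'It''s'
```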
</section>
<section id="complex">
  <h3>Complex Segments</h3>
  <p>Complex segments are expressed in square brackets and contain
    additional information on the resource of the term under scrutiny by
    providing key/value pairs, separated by an equals sign.</p>
  <p>The KorAP implementation of CQP provides three special segment keys:
    <code>orth</code> for surface forms, <code>base</code> for lemmata,
    and <code>pos</code> for part-of-speech. The following complex query
    finds all occurrences of the given surface form:</p>

  %= doc_query cqp => loc('Q_cqp_compsl1', '[orth="Baum"]'), cutoff => 0

  <p>The query is thus equivalent to:</p>

  %= doc_query cqp => loc('Q_cqp_compsl2', '"Baum"'), cutoff => 0

  <p>Complex segments expect simple expressions as values, meaning that
    the following expression is valid as well:</p>

  %= doc_query cqp => loc('Q_cqp_compsse', '[orth="l(au|ie)fen"%c]'), cutoff => 1

  <p>Another special key is <code>base</code>, referring to the lemma
    annotation of the <%= embedded_link_to 'doc', 'default foundry', 'data', 'annotation'%>. The following query finds all occurrences of segments
    annotated with the specified lemma by the default foundry:</p>

  %= doc_query cqp => loc('Q_cqp_compsbase', '[base="Baum"]'), cutoff => 1

  <p>The third special key is <code>pos</code>, referring to the
    part-of-speech annotation of the <%= embedded_link_to 'doc', 'default foundry', 'data', 'annotation'%>. The following query finds all attributive adjectives:</p>

  %= doc_query cqp => loc('Q_cqp_compspos', '[pos="ADJA"]'), cutoff => 1

  <p>Complex segments requesting further token annotations can have keys
    following the <code>foundry/layer</code> notation. For example, to
    find all occurrences of plural words in a supporting foundry, you can
    search using the following queries:</p>

  %= doc_query cqp => loc('Q_cqp_compstoken1', '[marmot/m="number":"pl"]'), cutoff => 1

  %= doc_query cqp => loc('Q_cqp_compstoken2', "[marmot/m='tense':'pres']"), cutoff => 1

  <p>In case an annotation contains special non-alphabetic and non-numeric
    characters, the annotation part can be followed by <code>%l</code> to
    ensure a verbatim interpretation:</p>

  %= doc_query cqp => loc('Q_cqp_compstokenverb', "[orth='https://de.wikipedia.org'%l]"), cutoff => 1

  <h4>Negation</h4>
  <p>Terms in complex expressions can be negated by placing an
    exclamation mark either before the equals sign or before the whole
    expression.</p>

  %= doc_query cqp => loc('Q_cqp_neg1', '[pos!="ADJA"] "Haare"'), cutoff => 1

  %= doc_query cqp => loc('Q_cqp_neg2', '[!pos="ADJA"] "Haare"'), cutoff => 1

  <blockquote class="warning">
    <p>Beware: Negated complex segments can't be searched as a single
      statement. However, they work in case they are part of a <%= embedded_link_to 'doc', 'sequence', 'ql', 'poliqarp-plus#syntagmatic-operators-sequence'%>.</p>
  </blockquote>
  <h4 id="empty-segments">Empty Segments</h4>
  <p>A special segment is the empty segment, which matches every word in
    the index.</p>

  %= doc_query cqp => loc('Q_cqp_empseq', '[]'), cutoff => 1

  <p>Empty segments are useful to express distances between words by using
    <%= embedded_link_to 'doc', 'repetitions', 'ql', 'poliqarp-plus#syntagmatic-operators-repetitions'%>.</p>
  <blockquote class="warning">
    <p>Beware: Empty segments can't be searched as a single statement.
      However, they work in case they are part of a <%= embedded_link_to 'doc', 'sequence', 'ql', 'poliqarp-plus#syntagmatic-operators-sequence'%>.</p>
  </blockquote>
</section>
<section id="spans">
  <h3>Span Segments</h3>
  <p>Not all segments are bound to words; some are bound to concepts
    spanning multiple words, for example noun phrases, sentences, or
    paragraphs. Span segments are structural elements, and their
    syntax depends on the context. When used in complex segments,
    they need to be searched by using angle brackets:</p>

  %= doc_query cqp => loc('Q_cqp_spansegm', '<corenlp/c=NP>'), cutoff => 1

  <p>Some spans such as <code>s</code> are special keywords that can be
    used without angle brackets, as operands of specific functional
    operators like <code>within</code>, <code>region</code>, <code>lbound</code>,
    <code>rbound</code> or <code>MU(meet)</code>.</p>

  <!-- EM
  but when used with specific functional
  operators like <code>within</code>, <code>region</code>, <code>lbound</code>,
  <code>rbound</code> or <code>MU(meet)</code>, the angular brackets
  are not mandatory.
  -->
</section>
<section id="paradigmatic-operators">
  <h3>Paradigmatic Operators</h3>
  <p>A complex segment can specify multiple properties that a token must
    have. For example, to search for all words with a certain surface form
    of a particular lemma (no matter if capitalized or not), you can search
    for:</p>

  %= doc_query cqp => loc('Q_cqp_parseg', '[orth="laufe"%c & base="Lauf"]'), cutoff => 1

  <p>The ampersand combines multiple properties with a logical AND. Terms
    of the complex segment can be negated as introduced before. The
    following queries are equivalent:</p>

  %= doc_query cqp => loc('Q_cqp_parsegamp1', '[orth="laufe"%c & base!="Lauf"]'), cutoff => 1

  %= doc_query cqp => loc('Q_cqp_parsegamp2', '[orth="laufe"%c & !base="Lauf"]'), cutoff => 1

  <p>Alternatives can be expressed by using the pipe symbol:</p>

  %= doc_query cqp => loc('Q_cqp_parsegalt', '[base="laufen" | base="gehen"]'), cutoff => 1

  <p>All these subexpressions can be grouped using round brackets to form
    complex Boolean expressions:</p>

  %= doc_query cqp => loc('Q_cqp_parsegcb', '[(base="laufen" | base="gehen") & tt/pos="VVFIN"]'), cutoff => 1

  <p>Round brackets can also be used to enclose a single term to
    increase query readability, although they are not necessary:</p>

  %= doc_query cqp => loc('Q_cqp_parsegrb', '[(base="laufen" | base="gehen") & (tt/pos="VVFIN")]'), cutoff => 1

  <p>The negation operator can be used outside expressions grouped by round
    brackets. Be aware of <%= ext_link_to "De Morgan's Laws", "https://en.wikipedia.org/wiki/De_Morgan%27s_laws" %> when you design your queries. The following query</p>

  %= doc_query cqp => loc('Q_cqp_parsegneg1', '[(!(base="laufen" | base="gehen")) & (tt/pos="VVFIN")]'), cutoff => 1

  <p>is logically equivalent to:</p>

  %= doc_query cqp => loc('Q_cqp_parsegneg2', '[!(base="laufen") & !(base="gehen") & (tt/pos="VVFIN")]'), cutoff => 1

  <p>which can be written more simply as:</p>

  %= doc_query cqp => loc('Q_cqp_parsegneg3', '[!base="laufen" & !base="gehen" & tt/pos="VVFIN"]'), cutoff => 1

  <p>or as:</p>

  %= doc_query cqp => loc('Q_cqp_parsegneg4', '[base!="laufen" & base!="gehen" & tt/pos="VVFIN"]'), cutoff => 1
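
  <p>The equivalence of the negated forms can be checked independently of
    any corpus with a small truth table. The following Python sketch is
    only an illustration of the Boolean logic; the function names are
    hypothetical, and the three arguments stand for the conditions
    <code>base="laufen"</code>, <code>base="gehen"</code> and
    <code>tt/pos="VVFIN"</code>.</p>

```python
from itertools import product


def grouped_negation(is_laufen: bool, is_gehen: bool, is_vvfin: bool) -> bool:
    """Negation applied to the whole bracketed alternative."""
    return (not (is_laufen or is_gehen)) and is_vvfin


def distributed_negation(is_laufen: bool, is_gehen: bool, is_vvfin: bool) -> bool:
    """Negation applied to each term, with the alternative turned into AND."""
    return (not is_laufen) and (not is_gehen) and is_vvfin


# Compare both forms for all eight combinations of truth values.
all_equal = all(
    grouped_negation(*values) == distributed_negation(*values)
    for values in product([False, True], repeat=3)
)
print(all_equal)  # prints True
```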

</section>
<section id="syntagmatic-operators">
  <h3>Syntagmatic Operators</h3>
  <h4 id="syntagmatic-operators-sequence">Sequences</h4>
  <p>Sequences can be used to search for segments in order. For this,
    simple expressions are separated by whitespace.</p>

  %= doc_query cqp => loc('Q_cqp_syntop1', '"der" "alte" "Mann"'), cutoff => 1

  <p>You can search using complex segments as well:</p>

  %= doc_query cqp => loc('Q_cqp_syntop2', '[orth="der"][orth="alte"][orth="Mann"]'), cutoff => 1

  <p>Now you may see the benefit of the empty segment to search for words
    you don't know:</p>

  %= doc_query cqp => loc('Q_cqp_syntop3', '[orth="der"][][orth="Mann"]'), cutoff => 1
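
  <p>How a sequence with an empty segment is matched can be pictured with a
    simplified Python sketch. It treats a text as a plain list of tokens and
    compares surface forms literally instead of as regular expressions; the
    names <code>word</code>, <code>any_token</code> and
    <code>find_sequences</code> are hypothetical and do not reflect how
    KorAP searches its index.</p>

```python
from typing import Callable, List

Predicate = Callable[[str], bool]


def word(form: str) -> Predicate:
    """Segment that accepts exactly one surface form."""
    return lambda token: token == form


def any_token(token: str) -> bool:
    """Empty segment: accepts every token."""
    return True


def find_sequences(tokens: List[str], pattern: List[Predicate]) -> List[List[str]]:
    """Return every window of adjacent tokens accepted by the pattern."""
    hits = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(test(tok) for test, tok in zip(pattern, window)):
            hits.append(window)
    return hits


tokens = "der alte Mann und der junge Mann".split()
print(find_sequences(tokens, [word("der"), any_token, word("Mann")]))
# prints [['der', 'alte', 'Mann'], ['der', 'junge', 'Mann']]
```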

  <h4>Position</h4>
  <p>You can also mix segments and spans in sequences. In CQP,
    spans are marked by XML-like structural elements signalling the
    beginning and/or the end of a region, and they can be used to look for
    segments in a specific position in a bigger structure like a noun
    phrase or a sentence.</p>
  <p>To search for a word at the beginning of a sentence (or a syntactic
    group), you can use the following queries. The two queries in each
    pair are equivalent:</p>
  <ul>
    <li>
      Both queries match the word "Der" when positioned as the first word in a sentence:
      %= doc_query cqp => loc('Q_cqp_posfirst1', '<base/s=s>[orth="Der"]'), cutoff => 1
      %= doc_query cqp => loc('Q_cqp_posfirst2','<s>[orth="Der"]'), cutoff => 1
    </li>
    <li>Both queries match the word "Der" when positioned after the end of a sentence:
      %= doc_query cqp => loc('Q_cqp_posaend1','</base/s=s>[orth="Der"]'), cutoff => 1
      %= doc_query cqp => loc('Q_cqp_posaend2','</s>[orth="Der"]'), cutoff => 1
    </li>
  </ul>
  <p>To search for a word at the end of a sentence (or a syntactic group),
    you can use:</p>
  <ul>
    <li>Match the word "Mann" when positioned as the last word in a sentence:

      %= doc_query cqp => loc('Q_cqp_posend1','[orth="Mann"]</base/s=s>'), cutoff => 1
      %= doc_query cqp => loc('Q_cqp_posend2','[orth="Mann"]</s>'), cutoff => 1
    </li>

    <li>Match the word "Mann" when positioned before the beginning of a
      sentence, as the last word of the previous sentence:

      %= doc_query cqp => loc('Q_cqp_posbbeg1','[orth="Mann"]<base/s=s>'), cutoff => 1
      %= doc_query cqp => loc('Q_cqp_posbbeg2','[orth="Mann"]<s>'), cutoff => 1
    </li>
  </ul>
  <blockquote class="warning">
    <p>Beware that when searching for longer sequences, sentence boundaries may be crossed.</p>
  </blockquote>
  <p>In the following example, sequences where "für" occurs in a previous
    sentence may also be matched, because of the long sequence of empty
    tokens in the query (minimum 20, maximum 25).</p>

  %= doc_query cqp => loc('Q_cqp_posbbeg3', '"für" []{20,25} "uns"</s>'), cutoff => 1
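
  <p>Why such a query can cross a sentence boundary becomes visible in a
    simplified Python sketch. It stores a sentence number with every token
    and reports, for each hit, whether the first and the last word belong
    to different sentences. The function <code>find_spans</code> and the toy
    data are hypothetical, and the gap is shortened to keep the example
    small.</p>

```python
from typing import List, Tuple


def find_spans(sentences: List[List[str]], first: str, last: str,
               min_gap: int, max_gap: int) -> List[Tuple[int, int, bool]]:
    """Find `first`, a gap of arbitrary tokens, and `last` at a sentence end.

    Returns (start, end, crosses_boundary) for every hit.
    """
    # Flatten the text but remember the sentence of every token.
    tokens = [(tok, sent_id)
              for sent_id, sentence in enumerate(sentences)
              for tok in sentence]
    hits = []
    for i, (tok, sent_i) in enumerate(tokens):
        if tok != first:
            continue
        for gap in range(min_gap, max_gap + 1):
            j = i + gap + 1
            if j >= len(tokens):
                break
            tok_j, sent_j = tokens[j]
            # The last word has to be followed by a sentence end.
            at_sentence_end = j + 1 == len(tokens) or tokens[j + 1][1] != sent_j
            if tok_j == last and at_sentence_end:
                hits.append((i, j, sent_i != sent_j))
    return hits


text = [["für", "alle"], ["das", "ist", "gut", "für", "uns"]]
print(find_spans(text, "für", "uns", 0, 5))
# prints [(0, 6, True), (5, 6, False)]
```

  <p>The first hit starts in the first sentence and ends in the second one,
    which is the kind of match the warning above refers to.</p>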

</section>