Akron | 51da002 | 2023-10-10 14:17:21 +0200 | 1 | % layout 'main', title => 'KorAP: CQP'; |
| 2 | |
| 3 | %= page_title |
| 4 | |
| 5 | <p>The following documentation introduces all features provided by our |
| 6 | version of the CQP Query Language and some KorAP-specific extensions. |
| 7 | This tutorial is based on the IMS Open Corpus Workbench (CWB) |
| 8 | |
| 9 | <%= ext_link_to 'CQP Query Language Tutorial, version 3.4.24 (May 2020)',"https://cwb.sourceforge.io/files/CQP_Manual.pdf" %> |
| 10 | and on |
| 11 | <%= embedded_link_to 'doc', 'the KorAP Poliqarp+ tutorial', 'ql', 'poliqarp-plus' %>.</p> |
| 12 | |
| 13 | <section id="segments"> |
| 14 | <h3>Simple Segments</h3> |
| 15 | <p>The atomic elements of CQP queries are segments. Most of the time, |
| 16 | segments represent words and can be queried by enclosing them in |
| 17 | single or double quotes:</p> |
| 18 | |
| 19 | |
| 20 | %= doc_query cqp => loc('Q_cqp_simplesquote', "** 'Tree'") |
| 21 | |
| 22 | <p>or</p> |
| 23 | |
| 24 | %= doc_query cqp => loc('Q_cqp_simpledquote', '** "Tree"') |
| 25 | |
| 26 | <p>A word segment is always interpreted as a <%= embedded_link_to 'doc', 'regular expression', 'ql', 'regexp' %>. For example, the following query matches both <code>run</code> and <code>ran</code>:</p> |
| 27 | |
| 28 | %= doc_query cqp => loc('Q_cqp_re', '** "r(u|a)n"'), cutoff => 1 |
| 29 | |
| 30 | %# <p>can return both "Tannenbaum" and "baum".</p> |
| 31 | |
| 32 | <p>Sequences of simple segments are expressed using a space delimiter:</p> |
| 33 | |
| 34 | %= doc_query cqp => loc('Q_cqp_simpleseq1', '** "the" "Tree"') |
| 35 | |
| 36 | %= doc_query cqp => loc('Q_cqp_simpleseq2', "** 'the' 'Tree'") |
| 37 | |
| 38 | %# ------------------- Current state (ND) |
| 39 | |
| 40 | <p>Originally, CQP was developed as a corpus query processor tool, and |
| 41 | every CQP command had to be terminated by a semicolon. <%= ext_link_to 'The CQPweb server', "https://cqpweb.lancs.ac.uk/" %> at |
| 42 | Lancaster treats the semicolon as optional, and we implemented it in |
| 43 | the same way.</p> |
| 44 | <p>Simple segments always refer to the surface form of a word. To search |
| 45 | for surface forms without case sensitivity, you can use the <code>%c</code> |
| 46 | flag:</p> |
| 47 | |
| 48 | |
| 49 | %= doc_query cqp => loc('Q_cqp_simplescflag', '"laufen"%c'), cutoff => 0 |
| 50 | |
| 51 | |
| 52 | |
| 53 | <p>The query above will find all occurrences of the term irrespective of |
| 54 | the capitalization of letters.</p> |
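<p>As a rough illustration of the two behaviours described so far, the following Python sketch models a simple segment as a regular expression that must match the whole token, with an optional case-insensitive mode standing in for <code>%c</code>. The function name <code>segment_matches</code> is hypothetical, and Python's <code>re</code> syntax is only an approximation of the regular expression dialect used by the search backend.</p>

```python
import re


def segment_matches(pattern: str, token: str, ignore_case: bool = False) -> bool:
    """Toy model of a simple segment (hypothetical helper, not KorAP code).

    Assumption: the pattern has to match the whole token, and
    ignore_case=True plays the role of the %c flag.
    """
    flags = re.IGNORECASE if ignore_case else 0
    return re.fullmatch(pattern, token, flags) is not None


# Case-sensitive by default, case-insensitive with the flag.
print(segment_matches("laufen", "Laufen"))
print(segment_matches("laufen", "Laufen", ignore_case=True))

# The alternation matches "ran", but not a longer token such as "running".
print(segment_matches("r(u|a)n", "ran"))
print(segment_matches("r(u|a)n", "running"))
```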
| 55 | |
| 56 | <p>Ignoring diacritics is not supported yet.</p> |
| 57 | |
| 58 | <!-- EM |
| 59 | <p>To ignore diacritics, you can use the <code>%d</code> flag:</p> |
| 60 | |
| 61 | |
| 62 | %= doc_query cqp => loc('Q_cqp_simplesidia2', '"Fraulein"%d'), cutoff => 0 |
| 63 | |
| 64 | |
| 65 | |
| 66 | <p>The query above will find all occurrences of the term irrespective of |
| 67 | the use of diacritics (i.e., <code>Fräulein</code> and <code>Fraulein</code>).</p> |
| 68 | |
| 69 | <p>Flags can be combined to ignore both case sensitivity and diacritics:</p> |
| 70 | |
| 71 | |
| 72 | %= doc_query cqp => loc('Q_cqp_simplesegidia2', '"Fraulein"%cd'), cutoff => 0 |
| 73 | |
| 74 | <p>The query above will find all occurrences of the term irrespective of |
| 75 | the use of diacritics or of capitalization: <code>fraulein</code>, <code>Fraulein</code>, |
| 76 | <code>fräulein</code>, etc.</p> |
| 77 | --> |
| 78 | |
| 79 | <h4 id="regexp">Regular Expressions</h4> |
| 80 | <p>Special regular expression characters like <code>.</code>, <code>?</code>, |
| 81 | <code>*</code>, <code>+</code>, <code>|</code>, <code>(</code>, <code>)</code>, |
| 82 | <code>[</code>, <code>]</code>, <code>{</code>, <code>}</code>, <code>^</code>, |
| 83 | <code>$</code> have to be "escaped" with a backslash (<code>\</code>) to be matched literally:</p> |
| 84 | <ul> |
| 85 | <li><code>"?";</code> fails while <code>"\?";</code> returns <code>?</code>.</li> |
| 86 | <li><code>"."</code> returns any character, while <code>"\$\."</code> |
| 87 | returns <code>$.</code></li> |
| 88 | </ul> |
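<p>When queries are generated by a script, the escaping rule above can be applied mechanically. The following Python sketch is one possible way to do this; the helper name <code>escape_segment</code> is hypothetical, and the set of characters is taken from the list above (the backslash itself is not in that list and is left untouched here, which is an assumption).</p>

```python
# Hypothetical helper, not part of KorAP: backslash-escape the special
# regular expression characters listed in the text so that a word form
# is matched literally.
SPECIAL_CHARS = set(".?*+|()[]{}^$")


def escape_segment(word: str) -> str:
    """Return the word with every listed special character escaped."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in word)


# Build quoted segments for a few literal word forms.
for word in ["?", "$.", "Baum"]:
    print('"' + escape_segment(word) + '"')
```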
| 89 | <blockquote class="warning"> |
| 90 | <p>Beware: Queries with prepended <code>.*</code> expressions can |
| 91 | become extremely slow!</p> |
| 92 | <p>In Poliqarp+, only double quotes are used for regular expressions, |
| 93 | while single quotes are used to mark verbatim strings. In CQP, you |
| 94 | can use the <code>%l</code> flag to match the string verbatim.</p> |
| 95 | </blockquote> |
| 96 | <p>To match a word form containing single or double quotes, use one of |
| 97 | the following methods:</p> |
| 98 | <ul> |
| 99 | <li>If the string you need to match contains either single or |
| 100 | double quotes, use the other quote character to enclose the |
| 101 | string:</li> |
| 102 | |
| 103 | %= doc_query cqp => loc('Q_cqp_regexqu1', '"It\'s"'), cutoff => 0 |
| 104 | |
| 105 | <!-- EM |
| 106 | %= doc_query cqp => loc('Q_cqp_xxxx', '\'12"-screen\''), cutoff => 0 |
| 107 | --> |
| 108 | |
| 109 | <li>Escape the quotes by doubling every occurrence of the quote |
| 110 | character inside the string:</li> |
| 111 | |
| 112 | %= doc_query cqp => loc('Q_cqp_regexequ1', '\'It\'\'s\''), cutoff => 0 |
| 113 | |
| 114 | <!-- %= doc_query cqp => loc('Q_cqp_regexequ2', '"12""-screen"'), cutoff => 0 --> |
| 115 | |
| 116 | <li>Escape the quotes with a backslash (<code>\</code>):</li> |
| 117 | |
| 118 | %= doc_query cqp => loc('Q_cqp_regexequ3', "'It\\'s'"), cutoff => 0 |
| 119 | |
| 120 | <!-- %= doc_query cqp => loc('Q_cqp_regexequ4', '"12\\"-screen"'), cutoff => 0 --> |
| 121 | </ul> |
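<p>The first two quoting methods can be combined in a small helper when query strings are built programmatically. The Python sketch below is only an illustration: <code>quote_segment</code> is a hypothetical name, and the function deals with quote characters only, so any special regular expression characters in the word still need to be escaped separately.</p>

```python
def quote_segment(word: str, quote: str = '"') -> str:
    """Wrap a word form in quotes (hypothetical helper, not KorAP code).

    If the word contains the preferred quote character but not the other
    one, switch to the other quote character; otherwise double every
    occurrence of the quote character inside the string.
    """
    if quote not in ("'", '"'):
        raise ValueError("quote must be a single or a double quote")
    other = "'" if quote == '"' else '"'
    if quote in word and other not in word:
        return other + word + other
    return quote + word.replace(quote, quote * 2) + quote


print(quote_segment("It's"))               # double quotes, nothing to escape
print(quote_segment("It's", quote="'"))    # switches to double quotes
print(quote_segment('12"-screen'))         # switches to single quotes
print(quote_segment("""a'b"c"""))          # both kinds present: doubling
```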
| 122 | </section> |
| 123 | <section id="complex"> |
| 124 | <h3>Complex Segments</h3> |
| 125 | <p>Complex segments are expressed in square brackets and contain |
| 126 | additional information on the resource of the term under scrutiny by |
| 127 | providing key/value pairs, separated by an equals sign.</p> |
| 128 | <p>The KorAP implementation of CQP provides three special segment keys: |
| 129 | <code>orth</code> for surface forms, <code>base</code> for lemmata, |
| 130 | and <code>pos</code> for part-of-speech. The following complex query |
| 131 | finds all occurrences of the given surface form:</p> |
| 132 | |
| 133 | %= doc_query cqp => loc('Q_cqp_compsl1', '[orth="Baum"]'), cutoff => 0 |
| 134 | |
| 135 | |
| 136 | <p>The query is thus equivalent to:</p> |
| 137 | |
| 138 | %= doc_query cqp => loc('Q_cqp_compsl2', '"Baum"'), cutoff => 0 |
| 139 | |
| 140 | |
| 141 | <p>Complex segments expect simple expressions as values, meaning that |
| 142 | the following expression is valid as well:</p> |
| 143 | |
| 144 | %= doc_query cqp => loc('Q_cqp_compsse', '[orth="l(au|ie)fen"%c]'), cutoff => 1 |
| 145 | |
| 146 | |
| 147 | <p>Another special key is <code>base</code>, referring to the lemma |
| 148 | annotation of the <%= embedded_link_to 'doc', 'default foundry', 'data', 'annotation'%>. The following query finds all occurrences of segments |
| 149 | annotated as a specified lemma by the default foundry:</p> |
| 150 | |
| 151 | %= doc_query cqp => loc('Q_cqp_compsbase', '[base="Baum"]'), cutoff => 1 |
| 152 | |
| 153 | |
| 154 | <p>The third special key is <code>pos</code>, referring to the |
| 155 | part-of-speech annotation of the <%= embedded_link_to 'doc', 'default foundry', 'data', 'annotation'%>. The following query finds all attributive adjectives:</p> |
| 156 | |
| 157 | %= doc_query cqp => loc('Q_cqp_compspos', '[pos="ADJA"]'), cutoff => 1 |
| 158 | |
| 159 | |
| 160 | <p>Complex segments requesting further token annotations can have keys |
| 161 | following the <code>foundry/layer</code> notation. For example, to |
| 162 | find all plural words or all present-tense forms annotated by a |
| 163 | supporting foundry, you can use the following queries:</p> |
| 164 | |
| 165 | %= doc_query cqp => loc('Q_cqp_compstoken1', '[marmot/m="number":"pl"]'), cutoff => 1 |
| 166 | |
| 167 | |
| 168 | %= doc_query cqp => loc('Q_cqp_compstoken2', "[marmot/m='tense':'pres']"), cutoff => 1 |
| 169 | |
| 170 | |
| 171 | <p>In case an annotation contains special non-alphabetic and non-numeric |
| 172 | characters, the annotation part can be followed by <code>%l</code> to |
| 173 | ensure a verbatim interpretation:</p> |
| 174 | |
| 175 | %= doc_query cqp => loc('Q_cqp_compstokenverb', "[orth='https://de.wikipedia.org'%l]"), cutoff => 1 |
| 176 | |
| 177 | |
| 178 | <h4>Negation</h4> |
| 179 | <p>Terms in complex expressions can be negated by placing an |
| 180 | exclamation mark either before the equals sign or before the whole |
| 181 | expression.</p> |
| 182 | |
| 183 | %= doc_query cqp => loc('Q_cqp_neg1', '[pos!="ADJA"] "Haare"'), cutoff => 1 |
| 184 | |
| 185 | |
| 186 | |
| 187 | %= doc_query cqp => loc('Q_cqp_neg2', '[!pos="ADJA"] "Haare"'), cutoff => 1 |
| 188 | |
| 189 | |
| 190 | <blockquote class="warning"> |
| 191 | <p>Beware: Negated complex segments can't be searched as a single |
| 192 | statement. However, they work in case they are part of a <%= embedded_link_to 'doc', 'sequence', 'ql', 'poliqarp-plus#syntagmatic-operators-sequence'%>.</p> |
| 193 | </blockquote> |
| 194 | <h4 id="empty-segments">Empty Segments</h4> |
| 195 | <p>A special segment is the empty segment, which matches every word in |
| 196 | the index.</p> |
| 197 | |
| 198 | %= doc_query cqp => loc('Q_cqp_empseq', '[]'), cutoff => 1 |
| 199 | |
| 200 | |
| 201 | <p>Empty segments are useful to express distances of words by using |
| 202 | <%= embedded_link_to 'doc', 'repetitions', 'ql', 'poliqarp-plus#syntagmatic-operators-repetitions'%>.</p> |
| 203 | <blockquote class="warning"> |
| 204 | <p>Beware: Empty segments can't be searched as a single statement. |
| 205 | However, they work in case they are part of a <%= embedded_link_to 'doc', 'sequence', 'ql', 'poliqarp-plus#syntagmatic-operators-sequence'%>.</p> |
| 206 | </blockquote> |
| 207 | </section> |
| 208 | <section id="spans"> |
| 209 | <h3>Span Segments</h3> |
| 210 | <p>Not all segments are bound to words; some are bound to concepts |
| 211 | spanning multiple words, for example noun phrases, sentences, or |
| 212 | paragraphs. Span segments are structural elements, and their syntax |
| 213 | depends on the context. When used in complex segments, |
| 214 | they need to be searched by using angle brackets: |
| 215 | |
| 216 | %= doc_query cqp => loc('Q_cqp_spansegm', '<corenlp/c=NP>'), cutoff => 1 |
| 217 | |
| 218 | Some spans such as <code>s</code> are special keywords that can be |
| 219 | used without angle brackets, as operands of specific functional |
| 220 | operators like <code>within</code>, <code>region</code>, <code>lbound</code>, |
| 221 | <code>rbound</code> or <code>MU(meet)</code>. |
| 222 | |
| 223 | <!-- EM |
| 224 | but when used with specific functional |
| 225 | operators like <code>within</code>, <code>region</code>, <code>lbound</code>, |
| 226 | <code>rbound</code> or <code>MU(meet)</code>, the angular brackets |
| 227 | are not mandatory. |
| 228 | --> |
| 229 | </p> |
| 230 | </section> |
| 231 | <section id="paradigmatic-operators"> |
| 232 | <h3>Paradigmatic Operators</h3> |
| 233 | <p>A complex segment can specify multiple properties that a token must have. For |
| 234 | example, to search for all words with a certain surface form of a |
| 235 | particular lemma (regardless of capitalization), you can search |
| 236 | for:</p> |
| 237 | |
| 238 | %= doc_query cqp => loc('Q_cqp_parseg', '[orth="laufe"%c & base="Lauf"]'), cutoff => 1 |
| 239 | |
| 240 | |
| 241 | <p>The ampersand combines multiple properties with a logical AND. Terms |
| 242 | of the complex segment can be negated as introduced before. The |
| 243 | following queries are equivalent:</p> |
| 244 | |
| 245 | %= doc_query cqp => loc('Q_cqp_parsegamp1', '[orth="laufe"%c & base!="Lauf"]'), cutoff => 1 |
| 246 | |
| 247 | |
| 248 | |
| 249 | %= doc_query cqp => loc('Q_cqp_parsegamp2', '[orth="laufe"%c & !base="Lauf"]'), cutoff => 1 |
| 250 | |
| 251 | |
| 252 | <p>Alternatives can be expressed by using the pipe symbol:</p> |
| 253 | |
| 254 | %= doc_query cqp => loc('Q_cqp_parsegalt', '[base="laufen" | base="gehen"]'), cutoff => 1 |
| 255 | |
| 256 | |
| 257 | <p>All these subexpressions can be grouped using round brackets to form |
| 258 | complex boolean expressions:</p> |
| 259 | |
| 260 | %= doc_query cqp => loc('Q_cqp_parsegcb', '[(base="laufen" | base="gehen") & tt/pos="VVFIN"]'), cutoff => 1 |
| 261 | |
| 262 | |
| 263 | <p>Round brackets can also enclose a single term to improve |
| 264 | query readability, although they are not required:</p> |
| 265 | |
| 266 | %= doc_query cqp => loc('Q_cqp_parsegrb', '[(base="laufen" | base="gehen") & (tt/pos="VVFIN")]'), cutoff => 1 |
| 267 | |
| 268 | |
| 269 | The negation operator can also be placed in front of expressions grouped by round |
| 270 | brackets. Keep <%= ext_link_to "De Morgan's laws", "https://en.wikipedia.org/wiki/De_Morgan%27s_laws" %> |
| 271 | in mind when you design your queries: the following query |
| 272 | |
| 273 | %= doc_query cqp => loc('Q_cqp_parsegneg1', '[(!(base="laufen" | base="gehen")) & (tt/pos="VVFIN")]'), cutoff => 1 |
| 274 | |
| 275 | |
| 276 | <a>is logically equivalent to:</a> |
| 277 | |
| 278 | %= doc_query cqp => loc('Q_cqp_parsegneg2', '[!(base="laufen") & !(base="gehen") & (tt/pos="VVFIN")]'), cutoff => 1 |
| 279 | |
| 280 | |
| 281 | <a>which can be written more simply as:</a> |
| 282 | |
| 283 | %= doc_query cqp => loc('Q_cqp_parsegneg3', '[!base="laufen" & !base="gehen" & tt/pos="VVFIN"]'), cutoff => 1 |
| 284 | |
| 285 | |
| 286 | <a>or as:</a> |
| 287 | |
| 288 | %= doc_query cqp => loc('Q_cqp_parsegneg4', '[base!="laufen" & base!="gehen" & tt/pos="VVFIN"]'), cutoff => 1 |
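<p>The equivalence of the grouped and the distributed form can be checked with a small truth table. The Python sketch below is only a model of the boolean logic, not of the query engine: the three flags stand for "the lemma is laufen", "the lemma is gehen" and "the tag is VVFIN", and the function names are hypothetical.</p>

```python
from itertools import product


def grouped(is_laufen: bool, is_gehen: bool, is_vvfin: bool) -> bool:
    # Models [(!(base="laufen" | base="gehen")) & (tt/pos="VVFIN")]
    return (not (is_laufen or is_gehen)) and is_vvfin


def distributed(is_laufen: bool, is_gehen: bool, is_vvfin: bool) -> bool:
    # Models [!base="laufen" & !base="gehen" & tt/pos="VVFIN"]
    return (not is_laufen) and (not is_gehen) and is_vvfin


# Compare both forms for every combination of the three flags.
all_cases = list(product([False, True], repeat=3))
equivalent = all(grouped(*case) == distributed(*case) for case in all_cases)
print(equivalent)
```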
| 289 | |
| 290 | |
| 291 | </section> |
| 292 | <section id="syntagmatic-operators"> |
| 293 | <h3>Syntagmatic Operators</h3> |
| 294 | <h4 id="syntagmatic-operators-sequence">Sequences</h4> |
| 295 | <p>Sequences can be used to search for segments in order. For this, |
| 296 | simple expressions are separated by whitespace.</p> |
| 297 | |
| 298 | %= doc_query cqp => loc('Q_cqp_syntop1', '"der" "alte" "Mann"'), cutoff => 1 |
| 299 | |
| 300 | |
| 301 | <p>You can also build sequences from complex segments:</p> |
| 302 | |
| 303 | %= doc_query cqp => loc('Q_cqp_syntop2', '[orth="der"][orth="alte"][orth="Mann"]'), cutoff => 1 |
| 304 | |
| 305 | |
| 306 | <p>Now you may see the benefit of the empty segment to search for words |
| 307 | you don't know:</p> |
| 308 | |
| 309 | %= doc_query cqp => loc('Q_cqp_syntop3', '[orth="der"][][orth="Mann"]'), cutoff => 1 |
| 310 | |
| 311 | |
| 312 | <h4>Position</h4> |
| 313 | <p>You are also able to mix segments and spans in sequences. In CQP, |
| 314 | spans are marked by XML-like structural elements signalling the |
| 315 | beginning and/or the end of a region and they can be used to look for |
| 316 | segments in a specific position in a bigger structure like a noun |
| 317 | phrase or a sentence.</p> |
| 318 | <p>To search for a word at the beginning of a sentence (or a syntactic |
| 319 | group), use one of the following pairs of equivalent queries. |
| 320 | <ul> |
| 321 | <li> |
| 322 | Both queries match the word "Der" when it is the first word of a sentence: |
| 323 | %= doc_query cqp => loc('Q_cqp_posfirst1', '<base/s=s>[orth="Der"]'), cutoff => 1 |
| 324 | %= doc_query cqp => loc('Q_cqp_posfirst2','<s>[orth="Der"]'), cutoff => 1 |
| 325 | </li> |
| 326 | <li>Both queries match the word "Der" when it directly follows the end of a sentence: |
| 327 | %= doc_query cqp => loc('Q_cqp_posaend1','</base/s=s>[orth="Der"]'), cutoff => 1 |
| 328 | %= doc_query cqp => loc('Q_cqp_posaend2','</s>[orth="Der"]'), cutoff => 1 |
| 329 | </li> |
| 330 | </ul> |
| 331 | To search for a word at the end of a sentence (or a syntactic group), |
| 332 | you can use:<br> |
| 333 | <ul> |
| 334 | <li>Match the word "Mann" |
| 335 | when it is the last word of a sentence: </li> |
| 336 | |
| 337 | %= doc_query cqp => loc('Q_cqp_posend1','[orth="Mann"]</base/s=s>'), cutoff => 1 |
| 338 | %= doc_query cqp => loc('Q_cqp_posend2','[orth="Mann"]</s>'), cutoff => 1 |
| 339 | |
| 340 | <li>Match the |
| 341 | word "Mann" when it directly precedes the beginning of a sentence, as the |
| 342 | last word of the previous sentence: </li> |
| 343 | |
| 344 | %= doc_query cqp => loc('Q_cqp_posbbeg1','[orth="Mann"]<base/s=s>'), cutoff => 1 |
| 345 | %= doc_query cqp => loc('Q_cqp_posbbeg2','[orth="Mann"]<s>'), cutoff => 1 |
| 346 | |
| 347 | </ul> |
| 348 | <blockquote class="warning"> |
| 349 | <p>Beware that when searching for longer sequences, sentence boundaries may be crossed. </p> |
| 350 | </blockquote> |
| 351 | <p> In the following example, sequences where "für" occurs in a previous |
| 352 | sentence may also be matched, because of the long sequence of empty |
| 353 | tokens in the query (minimum 20, maximum 25). |
| 354 | </p> |
| 355 | |
| 356 | %= doc_query cqp => loc('Q_cqp_posbbeg3', '"für" []{20,25} "uns"</s>'), cutoff => 1 |
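<p>The boundary-crossing effect can be reproduced with a toy matcher. The Python sketch below is not how KorAP evaluates queries; it works on a plain token list with a hand-made set of sentence-final positions, uses a much smaller gap than the query above, and all names in it are hypothetical.</p>

```python
def find_matches(tokens, sentence_ends, first, last, min_gap, max_gap):
    """Toy model of: first []{min_gap,max_gap} last </s>

    sentence_ends holds the indices of sentence-final tokens.
    Returns (start, end) index pairs of all matches.
    """
    matches = []
    for start, token in enumerate(tokens):
        if token != first:
            continue
        for gap in range(min_gap, max_gap + 1):
            end = start + gap + 1
            if end < len(tokens) and tokens[end] == last and end in sentence_ends:
                matches.append((start, end))
    return matches


def crosses_boundary(match, sentence_ends):
    """True if a sentence ends inside the match, before its last token."""
    start, end = match
    return any(start <= position < end for position in sentence_ends)


# Two sentences: "Das ist für dich ." and "Es gehört uns".
tokens = ["Das", "ist", "für", "dich", ".", "Es", "gehört", "uns"]
sentence_ends = {4, 7}

matches = find_matches(tokens, sentence_ends, "für", "uns", 2, 5)
for match in matches:
    print(match, crosses_boundary(match, sentence_ends))
```

<p>In this model, the only match starts at "für" in the first sentence and ends at "uns" in the second one, which is the situation the warning above describes.</p>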
| 357 | |
| 358 | </section> |