% content main => begin

<h2>KorAP-Tutorial: Poliqarp+</h2>

<p><%= korap_tut_link_to 'Back to Index', '/tutorial' %></p>

<p>The following tutorial introduces all features provided by our version of the Poliqarp Query Language and some KorAP-specific extensions.</p>

<section id="tut-segments">
<h3>Simple Segments</h3>

<p>The atomic elements of Poliqarp queries are segments. Most of the time segments represent words and can be simply queried:</p>
%# footnote: In the Polish National Corpus, Poliqarp can join multiple segments when identifying a single word.

%= korap_tut_query poliqarp => 'Baum'

<p>Sequences of simple segments are expressed using a space delimiter:</p>

%= korap_tut_query poliqarp => 'der Baum'

<p>Simple segments always refer to the surface form of a word. To search for surface forms without case sensitivity, you can use the <code>/i</code> flag.</p>

%= korap_tut_query poliqarp => 'laufen/i'

<p>The query above will find all occurrences of <code>laufen</code> irrespective of the capitalization of letters, so <code>wir laufen</code> will be found as well as <code>das Laufen</code> and even <code>&quot;GEH LAUFEN!&quot;</code>.</p>
</section>

<section id="tut-regexp">
<h3>Regular Expressions</h3>

<p>Segments can also be queried using <%= korap_tut_link_to 'regular expressions', '/tutorial/regular-expressions' %> by surrounding the segment with double quotes.</p>

%= korap_tut_query poliqarp => '"l(au|ie)fen"'

<p>Regular expression segments will always match the whole segment, meaning the above query will find words starting with <code>l</code> and ending with <code>n</code>. To match only a part of a segment, you can use the <code>/x</code> flag.</p>

%= korap_tut_query poliqarp => '"l(au|ie)fen"/x', cutoff => 1

<p>The <code>/x</code> flag will search for all segments that contain a sequence of characters the regular expression matches. That means the above query is equivalent to:</p>

%= korap_tut_query poliqarp => '".*?l(au|ie)fen.*?"', cutoff => 1
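
<p>The difference between whole-segment matching and the <code>/x</code> flag can be sketched outside of KorAP. The following Python snippet is only an illustration: it assumes Python's <code>re</code> module as a stand-in for the regular expression engine used by the backend, and all function names are hypothetical.</p>

```python
import re

# The pattern from the queries above.
PATTERN = r"l(au|ie)fen"


def matches_whole_segment(segment: str) -> bool:
    """Default behaviour: the pattern has to cover the whole segment."""
    return re.fullmatch(PATTERN, segment) is not None


def matches_with_x_flag(segment: str) -> bool:
    """Assumed /x behaviour: the pattern may match anywhere in the segment."""
    return re.search(PATTERN, segment) is not None


def matches_expanded(segment: str) -> bool:
    """The expanded form '.*?l(au|ie)fen.*?' applied to the whole segment."""
    return re.fullmatch(r".*?" + PATTERN + r".*?", segment) is not None


for word in ["laufen", "liefen", "verlaufen", "Baum"]:
    print(word, matches_whole_segment(word), matches_with_x_flag(word), matches_expanded(word))
```

<p>In this sketch the <code>/x</code> variant and the expanded pattern accept the same words, while whole-segment matching rejects a word like <code>verlaufen</code>.</p>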

<p>The <code>/x</code> flag can also be used in conjunction with strict expressions to search for substrings:</p>

%= korap_tut_query poliqarp => 'trenn/xi', cutoff => 1

<p>The above query will find all occurrences of segments including the string <code>trenn</code> case-insensitively, like &quot;Trennung&quot;, &quot;unzertrennlich&quot;, or &quot;Wettrennen&quot;.</p>

<blockquote class="warning">
 <p>Beware: These kinds of queries (with prepended <code>.*</code> expressions) are extremely slow!</p>
</blockquote>

<p>You can again apply the <code>/i</code> flag to search case-insensitively.</p>

%= korap_tut_query poliqarp => '"l(au|ie)fen"/xi', cutoff => 1

</section>

<section id="tut-complex">
<h3>Complex Segments</h3>

<p>Complex segments are expressed in square brackets and contain additional information on the resource of the term under scrutiny by providing key/value pairs, separated by an equals sign.</p>

<p>The KorAP implementation of Poliqarp provides three special segment keys: <code>orth</code> for surface forms, <code>base</code> for lemmata, and <code>pos</code> for part-of-speech. The following complex query finds all occurrences of the surface form <code>Baum</code>.</p>

%# There are more special keys in Poliqarp, but KorAP doesn't provide them.

%= korap_tut_query poliqarp => '[orth=Baum]'

<p>The query is thus equivalent to:</p>

%= korap_tut_query poliqarp => 'Baum'

<p>Complex segments expect simple expressions as values, meaning that the following expression is valid as well:</p>

%= korap_tut_query poliqarp => '[orth="l(au|ie)fen"/xi]', cutoff => 1

<p>Another special key is <code>base</code>, referring to the lemma annotation of the <%= korap_tut_link_to 'default foundry', '/tutorial/foundries' %>.
The following query finds all occurrences of segments annotated as the lemma <code>Baum</code> by the default foundry.</p>

%= korap_tut_query poliqarp => '[base=Baum]'

<p>The third special key is <code>pos</code>, referring to the part-of-speech annotation of the <%= korap_tut_link_to 'default foundry', '/tutorial/foundries' %>.
The following query finds all attributive adjectives:</p>

%= korap_tut_query poliqarp => '[pos=ADJA]'

<p>Complex segments requesting further token annotations can have keys following the <code>foundry/layer</code> notation.
For example, to find all occurrences of plural words in the mate foundry, you can search using the following query:</p>

%= korap_tut_query poliqarp => '[mate/m=number:pl]'

<h4>Negation</h4>
<p>Negation of terms in complex expressions can be expressed by placing an exclamation mark either in front of the equals sign or in front of the whole expression.</p>

%= korap_tut_query poliqarp => '[pos!=ADJA]'
%= korap_tut_query poliqarp => '[!pos=ADJA]'

<blockquote class="warning">
 <p>Beware: Negated complex segments can't be searched on their own in the Lucene index.
 However, they work in case they are part of a <a href="#tut-syntagmatic-operators-sequence">sequence</a>.</p>
</blockquote>

<h4>Empty Segments</h4>

<p>A special segment is the empty segment, which matches every word in the index.</p>

%= korap_tut_query poliqarp => '[]'

<p>Empty segments are useful to express distances of words by using <a href="#tut-syntagmatic-operators-repetitions">repetitions</a>.</p>

<blockquote class="warning">
 <p>Beware: Empty segments can't be searched on their own in the Lucene index.
 However, they work in case they are part of a <a href="#tut-syntagmatic-operators-sequence">sequence</a>.</p>
</blockquote>

</section>

<section id="tut-spans">
<h3>Span Segments</h3>

<p>Not all segments are bound to words; some are bound to concepts spanning multiple words, for example noun phrases, sentences, or paragraphs.
Span segments can be searched for using angle brackets instead of square brackets.</p>

%= korap_tut_query poliqarp => '<xip/c=INFC>'

<p>Otherwise they can be treated in exactly the same way as simple or complex segments.</p>
</section>


<section id="tut-paradigmatic-operators">
<h3>Paradigmatic Operators</h3>

<p>A complex segment can have multiple properties a token has to fulfill.
For example, to search for all words with the surface form <code>laufe</code> (no matter if capitalized or not) that have the lemma <code>lauf</code> (and not, for example, <code>laufen</code>, which would indicate a verb or a gerund), you can search for:</p>

%= korap_tut_query poliqarp => '[orth=laufe/i & base=lauf]'

<p>The ampersand combines multiple properties with a logical AND.
Terms of the complex segment can be negated as introduced before.</p>

%= korap_tut_query poliqarp => '[orth=laufe/i & base!=lauf]'

<p>The following query is therefore equivalent:</p>

%= korap_tut_query poliqarp => '[orth=laufe/i & !base=lauf]'

<p>Alternatives can be expressed by using the pipe symbol:</p>

%= korap_tut_query poliqarp => '[base=laufen | base=gehen]'

<p>All these subexpressions can be grouped using round brackets to form
complex boolean expressions:</p>

%= korap_tut_query poliqarp => '[(base=laufen | base=gehen) & tt/pos=VVFIN]'
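
<p>The boolean logic of these complex segments can be pictured as a test applied to every single token. The following Python sketch is only an analogy: it assumes that a token is a plain dictionary from annotation keys to values, which is not how the index actually stores annotations, and both function names are hypothetical.</p>

```python
# Assumption: a token is modelled as a mapping from annotation key to value.
Token = dict[str, str]


def is_finite_motion_verb(token: Token) -> bool:
    """Mirror of [(base=laufen | base=gehen) & tt/pos=VVFIN] for one token."""
    lemma_matches = token.get("base") == "laufen" or token.get("base") == "gehen"
    return lemma_matches and token.get("tt/pos") == "VVFIN"


def is_laufe_but_not_lauf(token: Token) -> bool:
    """Mirror of [orth=laufe/i & base!=lauf] for one token.

    Assumption: a token without a lemma annotation counts as "not lauf".
    """
    return token.get("orth", "").lower() == "laufe" and token.get("base") != "lauf"


tokens = [
    {"orth": "geht", "base": "gehen", "tt/pos": "VVFIN"},
    {"orth": "laufen", "base": "laufen", "tt/pos": "VVINF"},
    {"orth": "Laufe", "base": "laufen", "tt/pos": "VVFIN"},
]
for token in tokens:
    print(token["orth"], is_finite_motion_verb(token), is_laufe_but_not_lauf(token))
```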
</section>


<section id="tut-syntagmatic-operators">
<h3>Syntagmatic Operators</h3>

<h4 id="tut-syntagmatic-operators-sequence">Sequences</h4>

<p>Sequences can be used to search for segments in order.
For example, to search for the word &quot;alte&quot; preceded by &quot;der&quot; and followed by &quot;Mann&quot;, you can simply search for the sequence of simple expressions separated by whitespace.</p>

%= korap_tut_query poliqarp => 'der alte Mann'

<p>However, you can search using complex segments as well:</p>

%= korap_tut_query poliqarp => '[orth=der][orth=alte][orth=Mann]'

<p>Now you may see the benefit of the empty segment to search for words you don't know:</p>

%= korap_tut_query poliqarp => '[orth=der][][orth=Mann]'


<h4>Groups</h4>

<h4>Alternation</h4>

<p>Alternations allow for searching alternative segments or sequences of segments,
similar to the paradigmatic operator.
Using the paradigmatic alternation introduced above, you can search for both
<code>der alte Mann</code> and <code>der junge Mann</code> by typing in:</p>

%= korap_tut_query poliqarp => 'der [orth=alte | orth=junge] Mann'

<p>However, this formulation has problems in case you want to search for alternations of sequences rather than terms. If you want to search for both sequences of <code>dem jungen Mann</code> and <code>der alte Mann</code> you can use syntagmatic alternations and groups:</p>

%= korap_tut_query poliqarp => '(dem jungen | der alte) Mann'

<p>The pipe symbol works the same way as with the paradigmatic alternation, but supports sequences of different length as operands. The above query for <code>der alte Mann</code> and <code>der junge Mann</code> can therefore be reformulated as:</p>

%= korap_tut_query poliqarp => 'der (junge | alte) Mann'

<h4 id="tut-syntagmatic-operators-repetitions">Repetition</h4>

<p>Repetitions in Poliqarp are realized as in <%= korap_tut_link_to 'regular expressions', '/tutorial/regular-expressions' %>, by giving quantifiers in curly brackets.</p>
<p>To search for a sequence of three occurrences of <code>der</code>, you can formulate your query in any of the following ways, all of which return the same results:</p>

%= korap_tut_query poliqarp => 'der der der'
%= korap_tut_query poliqarp => 'der{3}'
%= korap_tut_query poliqarp => '[orth=der]{3}'

<p>Note that, unlike a back-reference in regular expressions, the repetition operation refers to the pattern given and not to the text that was matched. So the following query will match a sequence of three words all starting with <code>la</code>, but the words don't have to be identical.</p>

%= korap_tut_query poliqarp => '"la.*?"/i{3}'
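
<p>To illustrate that the quantifier repeats the pattern rather than the matched word, here is a minimal Python sketch. It assumes a pre-tokenized list of words and uses Python's <code>re</code> module in place of the KorAP backend; <code>find_repetitions</code> is a hypothetical helper and not part of KorAP.</p>

```python
import re


def find_repetitions(tokens: list[str], pattern: str, count: int) -> list[list[str]]:
    """Return every window of `count` consecutive tokens that each match `pattern`.

    Each token is tested against the pattern on its own, so the tokens in a
    window may differ from one another.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for start in range(len(tokens) - count + 1):
        window = tokens[start:start + count]
        if all(regex.fullmatch(token) for token in window):
            hits.append(window)
    return hits


# Hypothetical token stream, analogous to the query "la.*?"/i{3}.
tokens = ["Die", "lange", "Lampe", "lacht", "laut"]
for window in find_repetitions(tokens, r"la.*?", 3):
    print(" ".join(window))
```

<p>Both windows found in this example consist of three different words, which is what distinguishes this kind of repetition from a back-reference.</p>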

<p>The same is true for annotations. The following query will find a sequence of 3 to 4 adjectives as annotated by the TreeTagger foundry that is preceded by the lemma <code>ein</code> as annotated by the default foundry and followed by a noun as annotated by the XIP foundry. The adjectives do not have to be identical though.</p>

%= korap_tut_query poliqarp => '[base=ein][tt/p=ADJA]{3,4}[xip/p=NOUN]'

<p>In addition to numbered quantities, it is also possible to pass repetition information as Kleene operators <code>?</code>, <code>*</code>, and <code>+</code>.</p>

<p>To search for a sequence of the lemma <code>der</code> followed by the lemma <code>Baum</code> as annotated by the default foundry, but allowing an optional adjective as annotated by the TreeTagger foundry in between, you can search for:</p>

%= korap_tut_query poliqarp => '[base=der][tt/pos=ADJA]?[base=Baum]'

<p>This query is equivalent to the numbered quantification of:</p>

%= korap_tut_query poliqarp => '[base=der][tt/pos=ADJA]{,1}[base=Baum]'

<p>To search for the same sequence but with any number of adjectives as annotated by the TreeTagger foundry in between, you can use the Kleene Star:</p>

%= korap_tut_query poliqarp => '[base=der][tt/pos=ADJA]*[base=Baum]'

<p>And to search for this sequence but with at least one adjective in between, you can use the Kleene Plus (all queries are equivalent):</p>

%= korap_tut_query poliqarp => '[base=der][tt/pos=ADJA]+[base=Baum]'
%= korap_tut_query poliqarp => '[base=der][tt/pos=ADJA]{1,}[base=Baum]'
%= korap_tut_query poliqarp => '[base=der][tt/pos=ADJA][tt/pos=ADJA]*[base=Baum]'

<blockquote class="warning">
 <p>Repetition operators like <code>{,4}</code>, <code>?</code>, and <code>*</code> make segments or groups of segments optional. In case these queries are used on their own (and there are no mandatory segments in the query), you will be warned by the system that your query won't be treated as optional. Keep in mind that optionality may be <i>inherited</i>: for example, when you search for <code>(junge|alte)?|tote</code>, one segment of the alternation is optional, which makes the whole query optional as well.</p>
</blockquote>

%= korap_tut_query poliqarp => '[base=der][][base=Baum]'
%= korap_tut_query poliqarp => '[base=der][]{2}[base=Baum]'
%= korap_tut_query poliqarp => '[base=der][]{2,}[base=Baum]'
%= korap_tut_query poliqarp => '[base=der][]{,3}[base=Baum]'

%#= korap_tut_query poliqarp => '[base=der][tt/pos=ADJA]*[base=Baum]'
%#= korap_tut_query poliqarp => '[base=der][tt/pos=ADJA]+[base=Baum]'

<h4>Position</h4>

%#= korap_tut_query poliqarp => 'matches(<s>,[])'
%# matches(<s>,[cnx/p=INTERJ]{2})
<p>contains()</p>
<p>startsWith()</p>
<p>endsWith()</p>
<p>overlaps()</p>

<blockquote class="warning">
 <p>Optional operands in position operators, like in <code>within(&lt;s&gt;,[orth=Baum]*)</code>, are treated as mandatory at the moment and will be reformulated to occur at least once.</p>
 <p>This behaviour may change in the future.</p>
</blockquote>

<blockquote>
 <p>The KorAP implementation of Poliqarp also supports the postfix <code>within</code> operator, which works similarly to the <code>contains()</code> operator, but is not nestable.</p>
</blockquote>

</section>
<section id="tut-class-operators">

<h3>Class Operators</h3>

<p>Classes group submatches by surrounding them with curly brackets and a class number <code>{1:...}</code>.
Classes can be used to refer to submatches in a query, similar to captures in regular expressions.
In Poliqarp+ classes have multiple purposes, with highlighting being the most intuitive one:</p>

%= korap_tut_query poliqarp => 'der {1:{2:[]} Mann}'

%#= korap_tut_query poliqarp => 'der {1:{2:[]{1,4}} {3:Baum}} {4:[]}'

<p>In KorAP classes can be defined from 1 to 128. In case a class number is omitted, the class defaults to the class number 1: <code>{...}</code> is equal to <code>{1:...}</code>.</p>

<h4>Match Modification</h4>

<p>Based on classes, matches may be modified. The <code>focus()</code> operator restricts the span of a match to the boundary of a certain class.</p>

%= korap_tut_query poliqarp => 'focus(der {Baum})'

<p>The query above will search for the sequence <code>der Baum</code> but the match will be limited to <code>Baum</code>.
You can think of <code>der</code> in this query as a positive look-behind zero-length assertion in regular expressions.</p>
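
<p>The analogy to regular expressions can be made concrete with a small Python sketch. It works on a plain character string with Python's <code>re</code> module, whereas KorAP operates on tokens, so it only illustrates the idea; the sample sentence is made up.</p>

```python
import re

# Hypothetical sample text; KorAP works on tokens, this sketch on characters.
text = "Hier steht der Baum neben dem Baum"

# Analogue of `der {Baum}`: the whole sequence matches, group 1 marks the class.
whole = re.search(r"der (Baum)", text)

# Analogue of `focus(der {Baum})`: only the span of class 1 is kept.
focused_span = whole.span(1)

# The look-behind formulation mentioned above yields the same span.
lookbehind_span = re.search(r"(?<=der )Baum", text).span()

print(whole.group(0), focused_span, lookbehind_span)
```

<p>In both formulations the preceding <code>der</code> is required for a match but is not part of the reported span.</p>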

<p>The <code>focus()</code> operator is even more useful if you are searching for matches without knowing the surface form. For example, to find all terms between the words &quot;der&quot; and &quot;Mann&quot; you can search:</p>

%= korap_tut_query poliqarp => 'focus(der {[]} Mann)'

<p>This will limit the match to all interesting terms in between &quot;der&quot; and &quot;Mann&quot;. Or you may want to search for all words following the sequence &quot;der alte und neue&quot;:</p>

%= korap_tut_query poliqarp => 'focus(der alte und neue {[]})'

<!--
<p><code>focus()</code> is especially useful if you are searching for matches in certain areas, for example in quotes using positional operators.
While not being interested in the whole quote as a match, you can focus on what's really relevant to you.</p>

%= korap_tut_query poliqarp => 'focus(1:contains(er []{,10} sagte, 1{Baum}))'
-->

<p>In case a class number is omitted, the focus operator defaults to the class number 1: <code>focus(...)</code> is equal to <code>focus(1: ...)</code>.</p>

<blockquote class="warning">
 <p>Numbers in curly brackets can be ambiguous in certain circumstances: for example, <code>[]{3}</code> can be read as either &quot;any word repeated three times&quot; or &quot;any word followed by the number 3 highlighted as class number 1&quot;. Numbers should therefore always be expressed as <code>[orth=3]</code> for the latter case.</p>
</blockquote>


</section>

% end