% content main => begin

<h2>KorAP-Tutorial: Poliqarp+</h2>

<p><%= korap_tut_link_to 'Back to Index', '/tutorial' %></p>

<p>The following tutorial introduces all features provided by our version of the Poliqarp Query Language and some KorAP-specific extensions.</p>

<section id="tut-segments">
<h3>Simple Segments</h3>

<p>The atomic elements of Poliqarp queries are segments. Most of the time, segments represent words and can be queried directly:</p>
%# footnote: In the Polish National Corpus, Poliqarp can join multiple segments when identifying a single word.

%= korap_tut_query poliqarp => 'Baum'

<p>Sequences of simple segments are expressed using a space delimiter:</p>

%= korap_tut_query poliqarp => 'der Baum'

<p>Simple segments always refer to the surface form of a word. To search for surface forms case-insensitively, you can use the <code>/i</code> flag.</p>

%= korap_tut_query poliqarp => 'laufen/i'

<p>The query above will find all occurrences of <code>laufen</code> irrespective of the capitalization of letters, so <code>wir laufen</code> will be found as well as <code>das Laufen</code> and even <code>"GEH LAUFEN!"</code>.</p>
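
<p>As a rough illustration of what the <code>/i</code> flag does, the following Python sketch compares surface forms with and without case folding. The function <code>matches_surface</code> and the token list are hypothetical and are not part of KorAP; the actual backend may normalise text differently.</p>

```python
def matches_surface(surface: str, query: str, ignore_case: bool = False) -> bool:
    """Return True if a token's surface form equals the queried form."""
    if ignore_case:
        # Assumption: case folding approximates the /i flag; the real
        # backend may normalise text differently.
        return surface.casefold() == query.casefold()
    return surface == query


tokens = ["laufen", "Laufen", "LAUFEN", "Lauf"]

# Without the flag only the exact surface form is found.
strict_hits = [t for t in tokens if matches_surface(t, "laufen")]
# With the flag all capitalisation variants are found.
relaxed_hits = [t for t in tokens if matches_surface(t, "laufen", ignore_case=True)]

print(strict_hits)
print(relaxed_hits)
```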
</section>

<section id="tut-regexp">
<h3>Regular Expressions</h3>

<p>Segments can also be queried using <%= korap_tut_link_to 'regular expressions', '/tutorial/regular-expressions' %> by surrounding the segment with double quotes.</p>

%= korap_tut_query poliqarp => '"l(au|ie)fen"'

<p>Regular expression segments always match the whole segment, meaning the above query will only find words starting with <code>l</code> and ending with <code>n</code>. To match a part of a segment instead, you can use the <code>/x</code> flag.</p>

%= korap_tut_query poliqarp => '"l(au|ie)fen"/x', cutoff => 1

<p>The <code>/x</code> flag searches for all segments that contain a sequence of characters the regular expression matches. That means the above query is equivalent to:</p>

%= korap_tut_query poliqarp => '".*?l(au|ie)fen.*?"', cutoff => 1
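
<p>The difference between whole-segment matching and the <code>/x</code> flag can be sketched with Python's <code>re</code> module, where <code>re.fullmatch</code> plays the role of the default behaviour and <code>re.search</code> that of <code>/x</code>. This is only an analogy: Python's regular expression dialect is not necessarily the one used by the KorAP backend, and the word list is made up.</p>

```python
import re

PATTERN = r"l(au|ie)fen"


def matches_whole_segment(segment: str) -> bool:
    """Default behaviour: the pattern must cover the entire segment."""
    return re.fullmatch(PATTERN, segment) is not None


def matches_within_segment(segment: str) -> bool:
    """Analogue of /x: the pattern may match anywhere in the segment."""
    return re.search(PATTERN, segment) is not None


def matches_padded(segment: str) -> bool:
    """Whole-segment match of the pattern padded with .*? on both sides."""
    return re.fullmatch(r".*?" + PATTERN + r".*?", segment) is not None


words = ["laufen", "liefen", "verlaufen", "laufend", "lesen"]

whole = [w for w in words if matches_whole_segment(w)]
within = [w for w in words if matches_within_segment(w)]
padded = [w for w in words if matches_padded(w)]

print(whole)
print(within)
print(padded)
```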

<p>The <code>/x</code> flag can also be used in conjunction with strict expressions to search for substrings:</p>

%= korap_tut_query poliqarp => 'trenn/xi', cutoff => 1

<p>The above query will find all occurrences of segments including the string <code>trenn</code> case-insensitively, like "Trennung", "unzertrennlich", or "Wettrennen".</p>

<blockquote class="warning">
<p>Beware: These kinds of queries (with prepended <code>.*</code> expressions) are extremely slow!</p>
</blockquote>

<p>You can again apply the <code>/i</code> flag to search case-insensitively.</p>

%= korap_tut_query poliqarp => '"l(au|ie)fen"/xi', cutoff => 1

</section>

<section id="tut-complex">
<h3>Complex Segments</h3>

<p>Complex segments are expressed in square brackets and contain additional information on the resource of the term under scrutiny by providing key/value pairs, separated by a <code>=</code> symbol.</p>

<p>The KorAP implementation of Poliqarp provides three special segment keys: <code>orth</code> for surface forms, <code>base</code> for lemmata, and <code>pos</code> for part-of-speech. The following complex query finds all surface forms of <code>Baum</code>.</p>

%# There are more special keys in Poliqarp, but KorAP doesn't provide them.

%= korap_tut_query poliqarp => '[orth=Baum]'

<p>The query is thus equivalent to:</p>

%= korap_tut_query poliqarp => 'Baum'

<p>Complex segments expect simple expressions as values, meaning that the following expression is valid as well:</p>

%= korap_tut_query poliqarp => '[orth="l(au|ie)fen"/xi]', cutoff => 1

<p>Another special key is <code>base</code>, referring to the lemma annotation of the <%= korap_tut_link_to 'default foundry', '/tutorial/foundries' %>. The following query finds all occurrences of segments annotated as the lemma <code>Baum</code> by the default foundry.</p>

%= korap_tut_query poliqarp => '[base=baum]'

<p>The third special key is <code>pos</code>, referring to the part-of-speech annotation of the <%= korap_tut_link_to 'default foundry', '/tutorial/foundries' %>. The following query finds all attributive adjectives:</p>

%= korap_tut_query poliqarp => '[pos=ADJA]'

<p>Complex segments requesting further token annotations can have keys following the <code>foundry/layer</code> notation. For example, to find all occurrences of plural words as annotated by the mate foundry, you can search using the following query:</p>

%= korap_tut_query poliqarp => '[mate/m=number:pl]'
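
<p>To make the <code>foundry/layer</code> notation concrete, the following Python sketch splits a key into its two parts. The helper <code>split_key</code> and the type <code>SegmentKey</code> are hypothetical and only illustrate the notation; they do not reflect how KorAP parses queries, and a key without a foundry is simply reported with no foundry here.</p>

```python
from typing import NamedTuple, Optional


class SegmentKey(NamedTuple):
    foundry: Optional[str]  # None when the key names no foundry
    layer: str


def split_key(key: str) -> SegmentKey:
    """Split a key such as 'mate/m' into foundry and layer."""
    foundry, separator, layer = key.partition("/")
    if not separator:
        # Special keys like 'orth', 'base' and 'pos' carry no foundry prefix.
        return SegmentKey(None, key)
    return SegmentKey(foundry, layer)


print(split_key("mate/m"))
print(split_key("base"))
```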

<blockquote class="warning">
<p><strong>The following queries in the tutorial are not yet tested and may not work.</strong></p>
</blockquote>

</section>

<section id="tut-spans">
<h3>Span Segments</h3>

%= korap_tut_query poliqarp => '<s>'

</section>

<section id="tut-paradigmatic-operators">
<h3>Paradigmatic Operators</h3>

<p>A complex segment can have multiple properties a token has to fulfill. For example, to search for all words with the surface form <code>laufe</code> (no matter if capitalized or not) that have the lemma <code>lauf</code> (and not, for example, <code>laufen</code>, which would indicate a verb or a gerund), you can search for:</p>

%= korap_tut_query poliqarp => '[orth=laufe/i & base=lauf]'

<p>The ampersand combines multiple properties with a logical AND. A property can be negated with <code>!=</code>, as in the following query, which finds the surface form <code>laufe</code> where the lemma is not <code>lauf</code>:</p>

%= korap_tut_query poliqarp => '[orth=laufe/i & base!=lauf]'

<blockquote class="warning">
<p>There is a bug in the Lucene backend regarding the negation of matches.</p>
</blockquote>

<p>A negation can equivalently be written with a leading exclamation mark in front of the key, as in the following query:</p>

%= korap_tut_query poliqarp => '[orth=bäume & !base=bäumen]'

<p>Further examples: the pipe symbol combines properties with a logical OR, parentheses group properties, and an empty segment matches any token.</p>

%= korap_tut_query poliqarp => '[base=laufen | base=gehen]'

%= korap_tut_query poliqarp => '[(base=laufen | base=gehen) & tt/pos=VVFIN]'

%= korap_tut_query poliqarp => '[]'
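
<p>The boolean semantics of these operators can be sketched in Python by treating a token as a dictionary of annotations. This is a simplified model under stated assumptions: each key has exactly one value, a missing key never matches, and the helper <code>has</code> as well as the two sample tokens are invented for illustration.</p>

```python
def has(token: dict, key: str, value: str, ignore_case: bool = False) -> bool:
    """Return True if the token carries the given key/value pair."""
    actual = token.get(key)
    if actual is None:
        return False
    if ignore_case:
        return actual.casefold() == value.casefold()
    return actual == value


def q_and(token: dict) -> bool:
    # [orth=laufe/i & base=lauf]
    return has(token, "orth", "laufe", ignore_case=True) and has(token, "base", "lauf")


def q_not(token: dict) -> bool:
    # [orth=laufe/i & base!=lauf]
    return has(token, "orth", "laufe", ignore_case=True) and not has(token, "base", "lauf")


def q_or(token: dict) -> bool:
    # [base=laufen | base=gehen]
    return has(token, "base", "laufen") or has(token, "base", "gehen")


def q_group(token: dict) -> bool:
    # [(base=laufen | base=gehen) & tt/pos=VVFIN]
    return q_or(token) and has(token, "tt/pos", "VVFIN")


noun = {"orth": "Laufe", "base": "lauf", "tt/pos": "NN"}
verb = {"orth": "laufe", "base": "laufen", "tt/pos": "VVFIN"}

print(q_and(noun), q_and(verb))
print(q_not(noun), q_not(verb))
print(q_group(noun), q_group(verb))
```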

</section>

<section id="tut-syntagmatic-operators">
<h3>Syntagmatic Operators</h3>

<h4>Sequences</h4>

<h4>Groups</h4>

<h4>Alternation</h4>

<p>Alternations allow for searching alternative segments or sequences of segments, similar to the paradigmatic operator. You have already seen that you can search for both sequences of <code>der alte Mann</code> and <code>der junge Mann</code> by typing in:</p>

%= korap_tut_query poliqarp => 'der [orth=alte | orth=junge] Mann'

<p>However, this formulation fails when you want to search for alternations of sequences rather than single terms. If you want to search for both sequences of <code>dem jungen Mann</code> and <code>der alte Mann</code>, you can use syntagmatic alternations and groups:</p>

%= korap_tut_query poliqarp => '(dem jungen | der alte) Mann'

<p>The pipe symbol works the same way as with the paradigmatic alternation, but supports sequences of different length as operands. The above query for <code>der alte Mann</code> and <code>der junge Mann</code> can therefore be reformulated as:</p>

%= korap_tut_query poliqarp => 'der (junge | alte) Mann'
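
<p>One way to picture syntagmatic alternation is to enumerate the token sequences a query stands for. The following Python sketch does this for the two queries above; the function <code>expand</code> and its nested-list query representation are assumptions made for this illustration only and say nothing about how KorAP evaluates alternations internally.</p>

```python
from itertools import product


def expand(slots):
    """Enumerate the sequences described by a list of slots.

    Each slot is a list of alternatives, and each alternative is a list
    of tokens, so alternatives may differ in length.
    """
    sequences = []
    for choice in product(*slots):
        tokens = [token for alternative in choice for token in alternative]
        sequences.append(" ".join(tokens))
    return sequences


# (dem jungen | der alte) Mann
first = expand([[["dem", "jungen"], ["der", "alte"]], [["Mann"]]])
# der (junge | alte) Mann
second = expand([[["der"]], [["junge"], ["alte"]], [["Mann"]]])

print(first)
print(second)
```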

<h4>Repetition</h4>

<p>Repetitions in Poliqarp are realized as in <%= korap_tut_link_to 'regular expressions', '/tutorial/regular-expressions' %>, by giving quantifiers in curly brackets.</p>
<p>To search for a sequence of three occurrences of <code>der</code>, you can formulate your query in any of the following ways; they all yield the same results:</p>

%= korap_tut_query poliqarp => 'der der der'
%= korap_tut_query poliqarp => 'der{3}'
%= korap_tut_query poliqarp => '[orth=der]{3}'

<p>In contrast to regular expressions, the repetition operation refers not to the match but to the given pattern. The following query, for example, will match a sequence of three words all starting with <code>la</code>, but the words don't have to be identical.</p>

%= korap_tut_query poliqarp => '"la.*?"/i{3}'
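
<p>The distinction between repeating a pattern and repeating a match can be sketched in Python. The function <code>find_runs</code> is a hypothetical helper that applies the same pattern to each of several consecutive tokens, while the backreference expression at the end demands three identical words. The sample tokens are invented, and Python's <code>re</code> module only stands in for the backend's regular expression engine.</p>

```python
import re


def find_runs(tokens, pattern, count):
    """Return every window of `count` consecutive tokens that each match `pattern`."""
    regex = re.compile(pattern, re.IGNORECASE)
    runs = []
    for start in range(len(tokens) - count + 1):
        window = tokens[start:start + count]
        if all(regex.fullmatch(token) for token in window):
            runs.append(window)
    return runs


tokens = ["Lange", "lag", "Laura", "im", "Gras"]

# Pattern repetition: three different words starting with "la" qualify.
runs = find_runs(tokens, r"la.*?", 3)
print(runs)

# Match repetition via a backreference: the three words must be the same.
same_word = re.compile(r"(la\w*) \1 \1", re.IGNORECASE)
print(same_word.fullmatch("Lange lag Laura") is None)
print(same_word.fullmatch("la la la") is not None)
```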

<p>The same is true for annotations. The following query will find a sequence of three to four adjectives as annotated by the TreeTagger foundry that is preceded by the lemma <code>ein</code> as annotated by the default foundry and followed by a noun as annotated by the XIP foundry. The adjectives do not have to be identical though.</p>

%= korap_tut_query poliqarp => '[base=ein][tt/p=ADJA]{3,4}[xip/p=NOUN]'

<p>In addition to numbered quantities, it is also possible to pass repetition information as Kleene operators <code>?</code>, <code>*</code>, and <code>+</code>.</p>

<p>To search for a sequence of the lemma <code>der</code> followed by the lemma <code>baum</code> as annotated by the default foundry, but allowing an optional adjective as annotated by the TreeTagger foundry in between, you can search for:</p>

%= korap_tut_query poliqarp => '[base=der][tt/pos=ADJA]?[base=baum]'

<p>This query is equivalent to the numbered quantification of:</p>

%= korap_tut_query poliqarp => '[base=der][tt/pos=ADJA]{,1}[base=baum]'

<p>To search for the same sequence but with any number of adjectives as annotated by the TreeTagger foundry in between, you can use the Kleene Star:</p>

%= korap_tut_query poliqarp => '[base=der][tt/pos=ADJA]*[base=baum]'

<p>And to search for this sequence but with at least one adjective in between, you can use the Kleene Plus (all of the following queries are equivalent):</p>

%= korap_tut_query poliqarp => '[base=der][tt/pos=ADJA]+[base=baum]'
%= korap_tut_query poliqarp => '[base=der][tt/pos=ADJA]{1,}[base=baum]'
%= korap_tut_query poliqarp => '[base=der][tt/pos=ADJA][tt/pos=ADJA]*[base=baum]'
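
<p>The equivalences between Kleene operators and numbered quantifiers can be summarised by mapping each quantifier to a pair of minimum and maximum repetitions. The Python function <code>bounds</code> below is a hypothetical helper written for this tutorial, not part of KorAP; it uses <code>None</code> for an unbounded maximum.</p>

```python
import re
from typing import Optional, Tuple


def bounds(quantifier: str) -> Tuple[int, Optional[int]]:
    """Translate a quantifier into (minimum, maximum); None means unbounded."""
    kleene = {"?": (0, 1), "*": (0, None), "+": (1, None)}
    if quantifier in kleene:
        return kleene[quantifier]

    match = re.fullmatch(r"\{(\d*)(,?)(\d*)\}", quantifier)
    if match is None or not (match.group(1) or match.group(3)):
        raise ValueError(f"not a quantifier: {quantifier!r}")

    low, comma, high = match.groups()
    if not comma:
        # {3} means exactly three repetitions.
        return (int(low), int(low))
    # A missing lower bound means 0, a missing upper bound means unbounded.
    return (int(low) if low else 0, int(high) if high else None)


for q in ["?", "{,1}", "*", "+", "{1,}", "{3}", "{3,4}"]:
    print(q, bounds(q))
```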

<blockquote class="warning">
<p>Repetition operators like <code>{,4}</code>, <code>?</code>, and <code>*</code> make segments or groups of segments optional. If such a query is used on its own (so that the query contains no mandatory segment), the system will warn you that your query won't be treated as optional. Keep in mind that optionality may be <i>inherited</i>: when you search for <code>(junge|alte)?|tote</code>, for example, one operand of the alternation is optional, which makes the whole query optional as well.</p>
</blockquote>
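
<p>How optionality is inherited can be sketched as a recursive check on a small query tree. The tuple-based tree format and the function <code>is_optional</code> are assumptions made for this illustration; the check KorAP actually performs may differ.</p>

```python
def is_optional(node) -> bool:
    """Return True if the query node can match an empty sequence.

    Nodes are tuples:
      ("seg", text)             a single segment
      ("rep", child, low, high) a repetition; high may be None
      ("alt", [children])       an alternation
      ("seq", [children])       a sequence
    """
    kind = node[0]
    if kind == "seg":
        return False
    if kind == "rep":
        # A minimum of zero repetitions makes the operand optional.
        return node[2] == 0 or is_optional(node[1])
    if kind == "alt":
        # One optional operand makes the whole alternation optional.
        return any(is_optional(child) for child in node[1])
    if kind == "seq":
        # A sequence is optional only if every element is optional.
        return all(is_optional(child) for child in node[1])
    raise ValueError(f"unknown node type: {kind!r}")


# (junge|alte)?|tote
inherited = ("alt", [
    ("rep", ("alt", [("seg", "junge"), ("seg", "alte")]), 0, 1),
    ("seg", "tote"),
])

# [base=der][tt/pos=ADJA]?[base=baum]
anchored = ("seq", [
    ("seg", "base=der"),
    ("rep", ("seg", "tt/pos=ADJA"), 0, 1),
    ("seg", "base=baum"),
])

print(is_optional(inherited))
print(is_optional(anchored))
```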

%#= korap_tut_query poliqarp => '[base=der][][base=Baum]'
%#= korap_tut_query poliqarp => '[base=der][]{2}[base=Baum]'
%#= korap_tut_query poliqarp => '[base=der][]{2,}[base=Baum]'
%#= korap_tut_query poliqarp => '[base=der][]{,3}[base=Baum]'

%#= korap_tut_query poliqarp => '[base=der][tt/pos=ADJA]*[base=Baum]'
%#= korap_tut_query poliqarp => '[base=der][tt/pos=ADJA]+[base=Baum]'

<h4>Position</h4>

%#= korap_tut_query poliqarp => 'matches(<s>,[])'
%# matches(<s>,[cnx/p=INTERJ]{2})
<p>contains()</p>
<p>startsWith()</p>
<p>endsWith()</p>
<p>overlaps()</p>

<blockquote class="warning">
<p>Optional operands in position operators, like in <code>within(&lt;s&gt;,[orth=Baum]*)</code>, are currently treated as mandatory and will be reformulated to occur at least once.</p>
<p>This behaviour may change in the future.</p>
</blockquote>

<blockquote>
<p>The KorAP implementation of Poliqarp also supports the postfix <code>within</code> operator, which works similarly to the <code>contains()</code> operator, but is not nestable.</p>
</blockquote>

<h4>Class Operators</h4>

<p>{}</p>
<p>focus()</p>
<p>...</p>

</section>

% end