% content main => begin

<h2>KorAP-Tutorial: Poliqarp+</h2>

<p><%= korap_tut_link_to 'Back to Index', '/tutorial' %></p>

<p>The following tutorial introduces all features provided by our version of the Poliqarp Query Language and some KorAP-specific extensions.</p>

<section id="tut-segments">
<h3>Simple Segments</h3>

<p>The atomic elements of Poliqarp queries are segments. In most cases a segment represents a word and can be queried simply by typing it:</p>
%# footnote: In the Polish National Corpus, Poliqarp can join multiple segments when identifying a single word.

%= korap_tut_query poliqarp => 'Baum'

<p>Sequences of simple segments are expressed using a space delimiter:</p>

%= korap_tut_query poliqarp => 'der Baum'

<p>Simple segments always refer to the surface form of a word. To match surface forms regardless of case, append the <code>/i</code> flag:</p>

%= korap_tut_query poliqarp => 'laufen/i'

<p>The query above finds all occurrences of <code>laufen</code> irrespective of capitalization, so <code>wir laufen</code> will be found as well as <code>das Laufen</code> and even <code>"GEH LAUFEN!"</code>.</p>
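
<p>The effect of the <code>/i</code> flag can be illustrated outside of KorAP with a minimal Python sketch. The function <code>matches_surface</code> and the token list are hypothetical, and using <code>casefold()</code> is an assumption about how case might be normalized, not a description of the KorAP backend.</p>

```python
def matches_surface(query: str, token: str, ignore_case: bool = False) -> bool:
    """Compare a simple segment with the surface form of a single token."""
    if ignore_case:
        # Assumption: case is normalized on both sides before comparing.
        return query.casefold() == token.casefold()
    return query == token


# Hand-made token list covering the three examples from the text.
tokens = ["wir", "laufen", "das", "Laufen", "GEH", "LAUFEN", "!"]

hits_plain = [t for t in tokens if matches_surface("laufen", t)]
hits_i = [t for t in tokens if matches_surface("laufen", t, ignore_case=True)]

print(hits_plain)  # ['laufen']
print(hits_i)      # ['laufen', 'Laufen', 'LAUFEN']
```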
</section>

<section id="tut-regexp">
<h3>Regular Expressions</h3>

<p>Segments can also be queried using <%= korap_tut_link_to 'regular expressions', '/tutorial/regular-expressions' %> by surrounding the segment with double quotes.</p>

%= korap_tut_query poliqarp => '"l(au|ie)fen"'

<p>Regular expression segments always match the whole segment, meaning the above query only finds segments that start with <code>l</code> and end with <code>n</code>. To match substrings of a segment instead, you can use the <code>/x</code> flag.</p>

%= korap_tut_query poliqarp => '"l(au|ie)fen"/x', cutoff => 1

<p>The <code>/x</code> flag searches for all segments that contain a sequence of characters matched by the regular expression. That means the above query is equivalent to:</p>

%= korap_tut_query poliqarp => '".*?l(au|ie)fen.*?"', cutoff => 1

<blockquote class="exception">
<p>Due to a minor serialization bug, non-greedy quantifiers are currently not accepted, so this query may fail.</p>
</blockquote>

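<p>The difference between whole-segment matching and <code>/x</code> matching corresponds roughly to the difference between <code>re.fullmatch</code> and <code>re.search</code> in Python. The following is a minimal sketch under that assumption; <code>segment_matches</code> is a hypothetical helper, and Python's regular expression dialect is not necessarily the one used by KorAP. Note that the <code>x</code> here stands for the Poliqarp flag and is unrelated to Python's <code>re.VERBOSE</code>.</p>

```python
import re


def segment_matches(pattern: str, token: str, flags: str = "") -> bool:
    """Emulate a regular expression segment with optional 'x' and 'i' flags."""
    re_flags = re.IGNORECASE if "i" in flags else 0
    if "x" in flags:
        # /x: the pattern may match anywhere inside the segment.
        return re.search(pattern, token, re_flags) is not None
    # Default: the pattern has to cover the whole segment.
    return re.fullmatch(pattern, token, re_flags) is not None


print(segment_matches("l(au|ie)fen", "laufen"))           # True
print(segment_matches("l(au|ie)fen", "verlaufen"))        # False
print(segment_matches("l(au|ie)fen", "verlaufen", "x"))   # True
print(segment_matches("l(au|ie)fen", "Verlaufen", "xi"))  # True

# The explicit wildcard form behaves like the /x form.
print(segment_matches(".*?l(au|ie)fen.*?", "verlaufen"))  # True
```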
<p>The <code>/x</code> flag can also be used in conjunction with strict expressions to search for substrings:</p>

%= korap_tut_query poliqarp => 'trenn/xi', cutoff => 1

<p>The above query finds all segments that contain the string <code>trenn</code>, ignoring case, such as "Trennung", "unzertrennlich", or "Wettrennen".</p>

<blockquote class="warning">
<p>Beware: These kinds of queries (with prepended <code>.*</code> expressions) are extremely slow!</p>
</blockquote>

<p>You can again apply the <code>/i</code> flag to search case-insensitively.</p>

%= korap_tut_query poliqarp => '"l(au|ie)fen"/xi', cutoff => 1

</section>

<section id="tut-complex">
<h3>Complex Segments</h3>

<p>Complex segments are expressed in square brackets and specify which annotation of a term is queried by providing key/value pairs, separated by a <code>=</code> symbol.</p>

<p>The KorAP implementation of Poliqarp provides three special segment keys: <code>orth</code> for surface forms, <code>base</code> for lemmata, and <code>pos</code> for part-of-speech. The following complex query finds all occurrences of the surface form <code>Baum</code>.</p>

%# There are more special keys in Poliqarp, but KorAP doesn't provide them.

%= korap_tut_query poliqarp => '[orth=Baum]'

<p>The query is thus equivalent to:</p>

%= korap_tut_query poliqarp => 'Baum'

<p>Complex segments expect simple expressions as values, meaning that the following expression is valid as well:</p>

%= korap_tut_query poliqarp => '[orth="l(au|ie)fen"/xi]', cutoff => 1

<p>Another special key is <code>base</code>, referring to the lemma annotation of the <%= korap_tut_link_to 'default foundry', '/tutorial/foundries' %>. The following query finds all occurrences of segments annotated as the lemma <code>Baum</code> by the default foundry.</p>

%= korap_tut_query poliqarp => '[base=baum]'

<p>The third special key is <code>pos</code>, referring to the part-of-speech annotation of the <%= korap_tut_link_to 'default foundry', '/tutorial/foundries' %>. The following query finds all attributive adjectives:</p>

%= korap_tut_query poliqarp => '[pos=ADJA]'

<p>Complex segments requesting further token annotations can have keys following the <code>foundry/layer</code> notation. For example, to find all occurrences of plural words in the mate foundry, you can search using the following query:</p>

%= korap_tut_query poliqarp => '[mate/m=number:pl]'

<blockquote class="warning">
<p>There is currently a bug in the serialization of this query.</p>
<p><strong>The following queries in the tutorial are not yet tested and may not work.</strong></p>
</blockquote>

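<p>One way to picture complex segments is to treat every token as a bundle of annotations indexed by key. The following Python sketch illustrates that idea only: the token data is hand-made for the phrase "Die alten Bäume", the flat key layout is an assumption, and <code>find</code> is a hypothetical helper rather than part of KorAP.</p>

```python
from typing import Dict, List, Set

# Each token maps an annotation key to the set of values annotated for it.
Token = Dict[str, Set[str]]

# Hand-made, illustrative annotations; not taken from a real corpus.
tokens: List[Token] = [
    {"orth": {"Die"}, "base": {"die"}, "pos": {"ART"}, "mate/m": {"number:pl"}},
    {"orth": {"alten"}, "base": {"alt"}, "pos": {"ADJA"}, "mate/m": {"number:pl"}},
    {"orth": {"Bäume"}, "base": {"Baum"}, "pos": {"NN"}, "mate/m": {"number:pl"}},
]


def find(key: str, value: str) -> List[str]:
    """Return the surface forms of all tokens annotated with key=value."""
    hits = []
    for token in tokens:
        if value in token.get(key, set()):
            hits.append(next(iter(token["orth"])))
    return hits


print(find("orth", "Bäume"))         # ['Bäume']
print(find("base", "Baum"))          # ['Bäume']
print(find("pos", "ADJA"))           # ['alten']
print(find("mate/m", "number:pl"))   # ['Die', 'alten', 'Bäume']
```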
</section>

<section id="tut-spans">
<h3>Span Segments</h3>

%= korap_tut_query poliqarp => '<s>'

</section>

<section id="tut-paradigmatic-operators">
<h3>Paradigmatic Operators</h3>
%= korap_tut_query poliqarp => '[orth=laufe/i & base=lauf]'

%= korap_tut_query poliqarp => '[orth=laufe/i & base!=lauf]'

<blockquote class="warning">
<p>There is a bug in the Lucene backend regarding the negation of matches.</p>
</blockquote>

<p>The negation can equivalently be written in front of the key:</p>

%= korap_tut_query poliqarp => '[orth=bäume & !base=bäumen]'

<p>Some more examples:</p>

%= korap_tut_query poliqarp => '[base=laufen | base=gehen]'

%= korap_tut_query poliqarp => '[(base=laufen | base=gehen) & tt/pos=VVFIN]'

%= korap_tut_query poliqarp => '[]'

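<p>The operators <code>&amp;</code>, <code>|</code> and <code>!</code> inside a complex segment can be read as Boolean combinations of conditions on a single token. The following Python sketch models that reading. All helper names (<code>term</code>, <code>both</code>, <code>either</code>, <code>negate</code>, <code>any_token</code>) and the token data are hypothetical, and the treatment of missing annotations under negation is an assumption.</p>

```python
from typing import Callable, Dict, List

Token = Dict[str, str]
Pred = Callable[[Token], bool]


def term(key: str, value: str, ignore_case: bool = False) -> Pred:
    """key=value, optionally with the /i flag."""
    def check(token: Token) -> bool:
        actual = token.get(key)
        if actual is None:
            return False
        if ignore_case:
            return actual.casefold() == value.casefold()
        return actual == value
    return check


def both(a: Pred, b: Pred) -> Pred:      # a & b
    return lambda token: a(token) and b(token)


def either(a: Pred, b: Pred) -> Pred:    # a | b
    return lambda token: a(token) or b(token)


def negate(a: Pred) -> Pred:             # !a (also covers key!=value)
    return lambda token: not a(token)


def any_token(token: Token) -> bool:     # the empty segment []
    return True


# Hand-made, illustrative tokens.
tokens: List[Token] = [
    {"orth": "läuft", "base": "laufen", "tt/pos": "VVFIN"},
    {"orth": "gehen", "base": "gehen", "tt/pos": "VVINF"},
    {"orth": "laufe", "base": "laufen", "tt/pos": "VVFIN"},
    {"orth": "Laufe", "base": "Lauf", "tt/pos": "NN"},
]


def select(pred: Pred) -> List[str]:
    return [t["orth"] for t in tokens if pred(t)]


walk = either(term("base", "laufen"), term("base", "gehen"))
print(select(walk))                                   # ['läuft', 'gehen', 'laufe']
print(select(both(walk, term("tt/pos", "VVFIN"))))    # ['läuft', 'laufe']
# A variant of the negated query: [orth=laufe/i & base!=Lauf]
print(select(both(term("orth", "laufe", True), negate(term("base", "Lauf")))))  # ['laufe']
print(len(select(any_token)))                         # 4
```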
</section>

<section id="tut-syntagmatic-operators">
<h3>Syntagmatic Operators</h3>

<h4>Sequences</h4>

<h4>Repetition</h4>
%= korap_tut_query poliqarp => '[base=der][][base=Baum]'

%= korap_tut_query poliqarp => '[base=der][]{2}[base=Baum]'
%= korap_tut_query poliqarp => '[base=der][]{2,3}[base=Baum]'
%= korap_tut_query poliqarp => '[base=der][]{2,}[base=Baum]'
%= korap_tut_query poliqarp => '[base=der][]{,3}[base=Baum]'

%= korap_tut_query poliqarp => '[base=der][tt/pos=ADJA]?[base=Baum]'
%= korap_tut_query poliqarp => '[base=der][tt/pos=ADJA]*[base=Baum]'
%= korap_tut_query poliqarp => '[base=der][tt/pos=ADJA]+[base=Baum]'

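<p>Assuming the repetition operators carry their usual regular expression meaning (<code>?</code> for zero or one, <code>*</code> for zero or more, <code>+</code> for one or more, <code>{n,m}</code> for a range), a query can be pictured as a list of steps, each with a minimum and maximum repeat count. The following Python sketch is a small backtracking matcher built on that assumption. The names <code>match_at</code> and <code>find_starts</code> and the annotated phrase are hypothetical and do not reflect how KorAP evaluates queries internally.</p>

```python
from typing import Callable, Dict, List, Optional, Tuple

Token = Dict[str, str]
Pred = Callable[[Token], bool]
# (condition, minimum repeats, maximum repeats); None means unbounded.
Step = Tuple[Pred, int, Optional[int]]


def match_at(tokens: List[Token], steps: List[Step], pos: int) -> bool:
    """True if all steps can consume tokens starting at position pos."""
    if not steps:
        return True
    pred, lo, hi = steps[0]
    count = 0
    # Consume the mandatory repetitions first.
    while count < lo:
        if pos + count >= len(tokens) or not pred(tokens[pos + count]):
            return False
        count += 1
    # Then try the remaining steps after every allowed number of repeats.
    while True:
        if match_at(tokens, steps[1:], pos + count):
            return True
        if hi is not None and count >= hi:
            return False
        if pos + count >= len(tokens) or not pred(tokens[pos + count]):
            return False
        count += 1


def find_starts(tokens: List[Token], steps: List[Step]) -> List[int]:
    return [i for i in range(len(tokens)) if match_at(tokens, steps, i)]


def base(value: str) -> Pred:
    return lambda t: t["base"] == value


def pos_tag(value: str) -> Pred:
    return lambda t: t["tt/pos"] == value


def any_token(t: Token) -> bool:
    return True


# Hand-made annotations for "der große alte Baum".
phrase: List[Token] = [
    {"base": "der", "tt/pos": "ART"},
    {"base": "groß", "tt/pos": "ADJA"},
    {"base": "alt", "tt/pos": "ADJA"},
    {"base": "Baum", "tt/pos": "NN"},
]

der, baum = (base("der"), 1, 1), (base("Baum"), 1, 1)

print(find_starts(phrase, [der, (any_token, 1, 1), baum]))           # []   [][]
print(find_starts(phrase, [der, (any_token, 2, 2), baum]))           # [0]  []{2}
print(find_starts(phrase, [der, (any_token, 0, 3), baum]))           # [0]  []{,3}
print(find_starts(phrase, [der, (pos_tag("ADJA"), 0, 1), baum]))     # []   ?
print(find_starts(phrase, [der, (pos_tag("ADJA"), 1, None), baum]))  # [0]  +
```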
<h4>Alternation</h4>

<h4>Position Operators</h4>

<p>contains()</p>
<p>startsWith()</p>
<p>endsWith()</p>
<p>overlaps()</p>

<blockquote>
<p>The KorAP implementation of Poliqarp also supports the postfix <code>within</code> operator, which works similarly to <code>contains()</code> but cannot be nested.</p>
</blockquote>

<h4>Class Operators</h4>

<p>{}</p>
<p>focus()</p>
<p>...</p>

</section>

% end