% layout 'main', title => 'KorAP: Poliqarp+';

<h2>Poliqarp+</h2>

<p>The following tutorial introduces all features provided by our version of the Poliqarp Query Language and some KorAP-specific extensions.</p>

<section id="segments">
<h3>Simple Segments</h3>

<p>The atomic elements of Poliqarp queries are segments. Most of the time, segments represent words and can simply be queried:</p>
%# footnote: In the Polish National Corpus, Poliqarp can join multiple segments when identifying a single word.

%= doc_query poliqarp => 'Baum'

<p>Sequences of simple segments are expressed using a space delimiter:</p>

%= doc_query poliqarp => 'der Baum'

<p>Simple segments always refer to the surface form of a word. To search for surface forms without case sensitivity, you can use the <code>/i</code> flag.</p>

%= doc_query poliqarp => 'laufen/i'

<p>The query above will find all occurrences of <code>laufen</code> irrespective of the capitalization of letters, so <code>wir laufen</code> will be found as well as <code>das Laufen</code> and even <code>"GEH LAUFEN!"</code>.</p>

<h4 id="regexp">Regular Expressions</h4>

<p>Segments can also be queried using <%= doc_link_to 'regular expressions', 'ql', 'regexp' %> by surrounding the segment with double quotes.</p>

%= doc_query poliqarp => '"l(au|ie)fen"'

<p>Regular expression segments always match the whole segment, meaning the above query will only find words starting with <code>l</code> and ending with <code>n</code>. To match substrings of a segment instead, you can use the <code>/x</code> flag.</p>

%= doc_query poliqarp => '"l(au|ie)fen"/x', cutoff => 1

<p>The <code>/x</code> flag will search for all segments that contain a sequence of characters the regular expression matches. That means the above query is equivalent to:</p>

%= doc_query poliqarp => '".*?l(au|ie)fen.*?"', cutoff => 1
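
<p>The difference between whole-segment matching and the <code>/x</code> flag can be sketched outside of KorAP with Python's <code>re</code> module. This is only an analogy: the word list is made up, and KorAP's regular expression dialect may differ from Python's in details.</p>

```python
import re

PATTERN = r"l(au|ie)fen"
# Hypothetical segments, chosen only for illustration.
words = ["laufen", "liefen", "verlaufen", "Laufen", "laufend"]

# Default behaviour: the expression has to cover the whole segment.
whole = [w for w in words if re.fullmatch(PATTERN, w)]

# /x behaviour: the expression may match anywhere inside the segment.
substring = [w for w in words if re.search(PATTERN, w)]

# The explicit rewrite with surrounding wildcards gives the same result.
wrapped = [w for w in words if re.fullmatch(r".*?" + PATTERN + r".*?", w)]

print(whole)      # ['laufen', 'liefen']
print(substring)  # ['laufen', 'liefen', 'verlaufen', 'laufend']
print(wrapped == substring)  # True
```

<p>Note that <code>Laufen</code> is found by neither variant, because without the <code>/i</code> flag the match is case sensitive.</p>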

<p>The <code>/x</code> flag can also be used in conjunction with strict expressions to search for substrings:</p>

%= doc_query poliqarp => 'trenn/xi', cutoff => 1

<p>The above query will find all occurrences of segments including the string <code>trenn</code> case-insensitively, like "Trennung", "unzertrennlich", or "Wettrennen".</p>
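
<p>As a rough illustration of what <code>trenn/xi</code> asks for, the following Python sketch performs a case-insensitive substring test. The helper name <code>matches_xi</code> and the word list are assumptions made for this example and are not part of KorAP.</p>

```python
def matches_xi(term: str, segment: str) -> bool:
    """Case-insensitive substring test, mimicking a plain term with /xi."""
    return term.casefold() in segment.casefold()


# Hypothetical segments; "Rennen" does not contain "trenn".
segments = ["Trennung", "unzertrennlich", "Wettrennen", "Rennen"]
hits = [s for s in segments if matches_xi("trenn", s)]
print(hits)  # ['Trennung', 'unzertrennlich', 'Wettrennen']
```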

<blockquote class="warning">
<p>Beware: These kinds of queries (with prepended <code>.*</code> expressions) are extremely slow!</p>
</blockquote>

<p>You can again apply the <code>/i</code> flag to search case-insensitively.</p>

%= doc_query poliqarp => '"l(au|ie)fen"/xi', cutoff => 1
</section>

<section id="complex">
<h3>Complex Segments</h3>

<p>Complex segments are expressed in square brackets and contain additional information on the resource of the term under scrutiny by providing key/value pairs, separated by an equals sign.</p>

<p>The KorAP implementation of Poliqarp provides three special segment keys: <code>orth</code> for surface forms, <code>base</code> for lemmata, and <code>pos</code> for part-of-speech. The following complex query finds all surface forms of <code>Baum</code>.</p>
%# There are more special keys in Poliqarp, but KorAP doesn't provide them.

%= doc_query poliqarp => '[orth=Baum]'

<p>The query is thus equivalent to:</p>

%= doc_query poliqarp => 'Baum'

<p>Complex segments expect simple expressions as values, meaning that the following expression is valid as well:</p>

%= doc_query poliqarp => '[orth="l(au|ie)fen"/xi]', cutoff => 1

<p>Another special key is <code>base</code>, referring to the lemma annotation of the <%= doc_link_to 'default foundry', 'data', 'annotation' %>.
The following query finds all occurrences of segments annotated as the lemma <code>Baum</code> by the default foundry.</p>

%= doc_query poliqarp => '[base=Baum]'

<p>The third special key is <code>pos</code>, referring to the part-of-speech annotation of the <%= doc_link_to 'default foundry', 'data', 'annotation' %>.
The following query finds all attributive adjectives:</p>

%= doc_query poliqarp => '[pos=ADJA]'

<p>Complex segments requesting further token annotations can have keys following the <code>foundry/layer</code> notation.
For example, to find all occurrences of plural words as annotated by the <code>mate</code> foundry, you can search using the following query:</p>

%= doc_query poliqarp => '[mate/m=number:pl]'

<h4>Negation</h4>
<p>Negation of terms in complex expressions can be expressed by prefixing either the equals sign or the whole expression with an exclamation mark.</p>

%= doc_query poliqarp => '[pos!=ADJA]'
%= doc_query poliqarp => '[!pos=ADJA]'

<blockquote class="warning">
<p>Beware: Negated complex segments can't be searched solely in the Lucene index.
However, they work when they are part of a <%= doc_link_to 'sequence', 'ql', 'poliqarp-plus#syntagmatic-operators-sequence' %>.</p>
</blockquote>

<h4 id="empty-segments">Empty Segments</h4>

<p>A special segment is the empty segment, which matches every word in the index.</p>

%= doc_query poliqarp => '[]'

<p>Empty segments are useful to express distances between words by using <%= doc_link_to 'repetitions', 'ql', 'poliqarp-plus#syntagmatic-operators-repetitions' %>.</p>

<blockquote class="warning">
<p>Beware: Empty segments can't be searched solely in the Lucene index.
However, they work when they are part of a <%= doc_link_to 'sequence', 'ql', 'poliqarp-plus#syntagmatic-operators-sequence' %>.</p>
</blockquote>
</section>

<section id="spans">
<h3>Span Segments</h3>

<p>Not all segments are bound to words. Some are bound to concepts spanning multiple words, for example noun phrases, sentences, or paragraphs.
Span segments can be searched for using angle brackets instead of square brackets.</p>

%= doc_query poliqarp => '<xip/c=INFC>'

<p>Otherwise they can be treated in exactly the same way as simple or complex segments.</p>
</section>

<section id="paradigmatic-operators">
<h3>Paradigmatic Operators</h3>

<p>A complex segment can have multiple properties a token has to fulfill. For example, to search for all words with the surface form <code>laufe</code> (no matter if capitalized or not) that have the lemma <code>Lauf</code> (and not, for example, <code>laufen</code>, which would indicate a verb or a gerund), you can search for:</p>

%= doc_query poliqarp => '[orth=laufe/i & base=Lauf]'

<p>The ampersand combines multiple properties with a logical AND.
Terms of the complex segment can be negated as introduced before.</p>

%= doc_query poliqarp => '[orth=laufe/i & base!=Lauf]'

<p>The following query is therefore equivalent:</p>

%= doc_query poliqarp => '[orth=laufe/i & !base=Lauf]'

<p>Alternatives can be expressed by using the pipe symbol:</p>

%= doc_query poliqarp => '[base=laufen | base=gehen]'

<p>All these subexpressions can be grouped using round brackets to form complex boolean expressions:</p>

%= doc_query poliqarp => '[(base=laufen | base=gehen) & tt/pos=VVFIN]'
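
<p>The boolean structure of such a complex segment can be read as a predicate over a single token. The following Python sketch assumes, purely for illustration, that a token is a dictionary from annotation key to value. The sample tokens are invented, and this is not how KorAP stores annotations.</p>

```python
# A token is modelled as a dict from annotation key to value (an assumption).
tokens = [
    {"orth": "läuft", "base": "laufen", "tt/pos": "VVFIN"},
    {"orth": "gehen", "base": "gehen", "tt/pos": "VVINF"},
    {"orth": "ging", "base": "gehen", "tt/pos": "VVFIN"},
    {"orth": "Lauf", "base": "Lauf", "tt/pos": "NN"},
]


def segment_matches(token: dict) -> bool:
    # [(base=laufen | base=gehen) & tt/pos=VVFIN]
    return (token["base"] == "laufen" or token["base"] == "gehen") and token[
        "tt/pos"
    ] == "VVFIN"


hits = [t["orth"] for t in tokens if segment_matches(t)]
print(hits)  # ['läuft', 'ging']
```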
</section>

<section id="syntagmatic-operators">
<h3>Syntagmatic Operators</h3>

<h4 id="syntagmatic-operators-sequence">Sequences</h4>

<p>Sequences can be used to search for segments in order. For example, to search for the word "alte" preceded by "der" and followed by "Mann", you can simply search for the sequence of simple expressions separated by whitespace.</p>

%= doc_query poliqarp => 'der alte Mann'

<p>However, you can search using complex segments as well:</p>

%= doc_query poliqarp => '[orth=der][orth=alte][orth=Mann]'

<p>Now you may see the benefit of the empty segment to search for words you don't know:</p>

%= doc_query poliqarp => '[orth=der][][orth=Mann]'
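
<p>A sequence with an empty segment can be pictured as a sliding window over the token stream in which one position accepts any token. The Python sketch below illustrates this idea on an invented sentence. The helper names are hypothetical, and the actual index lookup in KorAP works differently.</p>

```python
def find_sequence(tokens, pattern):
    """Return start offsets where every predicate in pattern matches in order."""
    n = len(pattern)
    return [
        i
        for i in range(len(tokens) - n + 1)
        if all(pred(tok) for pred, tok in zip(pattern, tokens[i : i + n]))
    ]


def orth(value):
    # Predicate for a segment with a fixed surface form.
    return lambda tok: tok == value


def any_token(tok):
    # Plays the role of the empty segment [].
    return True


text = "der alte Mann und der junge Mann sahen der Frau nach".split()
starts = find_sequence(text, [orth("der"), any_token, orth("Mann")])
print(starts)  # [0, 4]
```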

<p>You are also able to mix segments and spans in sequences, for example to search for the word "Der" at the beginning of a sentence (which can be interpreted as the first word after the end of a sentence).</p>

%= doc_query poliqarp => '<base/s=s>[orth=Der]'

<h4>Groups</h4>

...

<h4>Alternation</h4>

<p>Alternations allow for searching alternative segments or sequences of segments, similar to the paradigmatic operator. You have already seen that you can search for both sequences of <code>der alte Mann</code> and <code>der junge Mann</code> by typing in:</p>

%= doc_query poliqarp => 'der [orth=alte | orth=junge] Mann'

<p>However, this formulation has problems in case you want to search for alternations of sequences rather than terms. If you want to search for both sequences of <code>dem jungen Mann</code> and <code>der alte Mann</code>, you can use syntagmatic alternations and groups:</p>

%= doc_query poliqarp => '(dem jungen | der alte) Mann'

<p>The pipe symbol works the same way as with the paradigmatic alternation, but supports sequences of different length as operands. The above query for <code>der alte Mann</code> and <code>der junge Mann</code> can therefore be reformulated as:</p>

%= doc_query poliqarp => 'der (junge | alte) Mann'
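
<p>One way to think about a syntagmatic alternation is that each alternative is a complete token sequence that is tried in front of the shared remainder. The following Python sketch expands <code>(dem jungen | der alte) Mann</code> in that way. The function <code>find_any</code> and the sample sentence are assumptions made for this example.</p>

```python
def find_any(tokens, alternatives, tail):
    """Find (start, end) spans where one alternative is directly followed by tail."""
    spans = []
    for alt in alternatives:
        seq = alt + tail
        for i in range(len(tokens) - len(seq) + 1):
            if tokens[i : i + len(seq)] == seq:
                spans.append((i, i + len(seq)))
    return sorted(spans)


text = "sie gab dem jungen Mann das Buch und der alte Mann lachte".split()

# (dem jungen | der alte) Mann
spans = find_any(text, [["dem", "jungen"], ["der", "alte"]], ["Mann"])
print(spans)  # [(2, 5), (8, 11)]
```

<p>Because every alternative is a list of tokens, operands of different length need no special treatment in this sketch.</p>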

<h4 id="syntagmatic-operators-repetitions">Repetition</h4>

<p>Repetitions in Poliqarp are realized as in <%= doc_link_to 'regular expressions', 'ql', 'regexp' %>, by giving quantifiers in curly brackets.</p>
<p>To search for a sequence of three occurrences of <code>der</code>, you can formulate your query in any of the following ways, all of which return the same results:</p>

%= doc_query poliqarp => 'der der der'
%= doc_query poliqarp => 'der{3}'
%= doc_query poliqarp => '[orth=der]{3}'

<p>In contrast to regular expressions, the repetition operation doesn't refer to the match but to the pattern given. A repeated regular expression therefore finds a sequence of words that each match the pattern, but the words don't have to be identical. The following query, for example, will match a sequence of three words all starting with <code>la</code>.</p>

%= doc_query poliqarp => '"la.*?"/i{3}'
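
<p>That the repetition applies to the pattern rather than to the matched word can be illustrated with a small Python sketch. The token list is invented, and Python's <code>re</code> module only stands in for KorAP's own regular expression support.</p>

```python
import re

# Hypothetical token stream.
tokens = "Lange lag Laura lachend im Laub".split()
pattern = re.compile(r"la.*?", re.IGNORECASE)

# "la.*?"/i{3}: three consecutive tokens that each match the pattern.
starts = [
    i
    for i in range(len(tokens) - 2)
    if all(pattern.fullmatch(t) for t in tokens[i : i + 3])
]
print(starts)  # [0, 1]
print(tokens[0:3])  # ['Lange', 'lag', 'Laura']
```

<p>The three tokens in a hit are all different words. Each of them only has to satisfy the pattern on its own.</p>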

<p>The same is true for annotations. The following query will find a sequence of three to four adjectives as annotated by the TreeTagger foundry that is preceded by the lemma <code>ein</code> as annotated by the default foundry and followed by a noun as annotated by the XIP foundry. The adjectives do not have to be identical though.</p>

%= doc_query poliqarp => '[base=ein][tt/p=ADJA]{3,4}[xip/p=NOUN]'

<p>In addition to numbered quantities, it is also possible to pass repetition information as Kleene operators <code>?</code>, <code>+</code>, and <code>*</code>.</p>

<p>To search for a sequence of the lemma <code>die</code> followed by the lemma <code>Baum</code> as annotated by the default foundry, but allowing an optional adjective as annotated by the TreeTagger foundry in between, you can search for:</p>

%= doc_query poliqarp => '[base=die][tt/pos=ADJA]?[base=Baum]'

<p>This query is equivalent to the numbered quantification of:</p>

%= doc_query poliqarp => '[base=die][tt/pos=ADJA]{,1}[base=Baum]'

<p>To search for the same sequences but with an unlimited number of adjectives as annotated by the TreeTagger foundry in between, you can use the Kleene Star:</p>

%= doc_query poliqarp => '[base=die][tt/pos=ADJA]*[base=Baum]'

<p>And to search for this sequence but with at least one adjective in between, you can use the Kleene Plus (all of the following queries are equivalent):</p>

%= doc_query poliqarp => '[base=die][tt/pos=ADJA]+[base=Baum]', cutoff => 1
%= doc_query poliqarp => '[base=die][tt/pos=ADJA]{1,}[base=Baum]', cutoff => 1
%= doc_query poliqarp => '[base=die][tt/pos=ADJA][tt/pos=ADJA]*[base=Baum]', cutoff => 1
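
<p>The equivalences between Kleene operators and numbered quantifiers can be checked with an ordinary regular expression if every token is first reduced to a single symbol. The encoding below (<code>D</code> for the lemma <code>die</code>, <code>A</code> for an attributive adjective, <code>B</code> for the lemma <code>Baum</code>, <code>x</code> for anything else) and the sample sentence are assumptions made for this sketch.</p>

```python
import re


def encode(tokens):
    """Map each (lemma, pos) token to one symbol: D, A, B, or x for others."""
    out = []
    for lemma, pos in tokens:
        if lemma == "die":
            out.append("D")
        elif pos == "ADJA":
            out.append("A")
        elif lemma == "Baum":
            out.append("B")
        else:
            out.append("x")
    return "".join(out)


# Hypothetical annotated sentence as (lemma, part-of-speech) pairs.
sentence = [
    ("die", "ART"), ("alt", "ADJA"), ("groß", "ADJA"), ("Baum", "NN"),
    ("und", "KON"), ("die", "ART"), ("Baum", "NN"),
]
symbols = encode(sentence)  # "DAABxDB"


def spans(regex):
    # Token spans of all non-overlapping matches in the symbol string.
    return [m.span() for m in re.finditer(regex, symbols)]


print(spans(r"DA?B"))  # [(5, 7)]
print(spans(r"DA*B"))  # [(0, 4), (5, 7)]
print(spans(r"DA+B"))  # [(0, 4)]
print(spans(r"DA+B") == spans(r"DA{1,}B") == spans(r"DAA*B"))  # True
```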

<blockquote class="warning">
<p>Repetition operators like <code>{,4}</code>, <code>?</code>, and <code>*</code> make segments or groups of segments optional. If such a query is used on its own and not as part of a sequence (so that there are no mandatory segments in the query), the system will warn you that your query won't be treated as optional.</p>
<p>Keep in mind that optionality may be <i>inherited</i>: for example, when you search for <code>(junge|alte)?|tote</code>, one segment of the alternation is optional, which makes the whole query optional as well.</p>
</blockquote>

<p>Repetition can also be used to express distances between segments by using <%= doc_link_to 'empty segments', 'ql', 'poliqarp-plus#empty-segments' %>.</p>

%= doc_query poliqarp => '[base=die][][base=Baum]'
%= doc_query poliqarp => '[base=die][]{2}[base=Baum]', cutoff => 1
%= doc_query poliqarp => '[base=die][]{2,}[base=Baum]', cutoff => 1
%= doc_query poliqarp => '[base=die][]{,3}[base=Baum]', cutoff => 1
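
<p>Repeated empty segments amount to a constraint on the number of tokens between two anchors. The following Python sketch makes that reading explicit on a list of invented lemmas. The function <code>within_gap</code> is hypothetical, and the sketch ignores how KorAP reports overlapping matches.</p>

```python
def within_gap(tokens, left, right, min_gap, max_gap):
    """Return (i, j) pairs where right follows left with min_gap..max_gap tokens between."""
    pairs = []
    for i, tok in enumerate(tokens):
        if tok != left:
            continue
        for gap in range(min_gap, max_gap + 1):
            j = i + gap + 1
            if j < len(tokens) and tokens[j] == right:
                pairs.append((i, j))
    return pairs


# Hypothetical lemma stream.
lemmas = "die alt Baum und die sehr alt und groß Baum".split()

print(within_gap(lemmas, "die", "Baum", 1, 1))  # [][base=...]: [(0, 2)]
print(within_gap(lemmas, "die", "Baum", 0, 3))  # []{,3}: [(0, 2)]
print(within_gap(lemmas, "die", "Baum", 2, len(lemmas)))  # []{2,}: [(0, 9), (4, 9)]
```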

<p>Of course, Kleene operators can be used with empty segments as well.</p>

%= doc_query poliqarp => '[base=die][]?[base=Baum]'
%= doc_query poliqarp => '[base=die][]*[base=Baum]', cutoff => 1
%= doc_query poliqarp => '[base=die][]+[base=Baum]', cutoff => 1

<h4>Position</h4>

<p>Sequences as shown above can all be nested in further complex queries and treated as subqueries (see <%= doc_link_to 'class operators', 'ql', 'poliqarp-plus#class-operators' %> on how to later access these subqueries directly).</p>
<p>Positional operators compare the matches of two subqueries and match if a certain condition regarding their positions holds.</p>
<p>The <code>contains()</code> operation matches when a second subquery matches inside the span of a first subquery.</p>

%= doc_query poliqarp => 'contains(<base/s=s>, [tt/p=KOUS])', cutoff => 1

<p>The <code>startsWith()</code> operation matches when a second subquery matches at the beginning of the span of a first subquery.</p>

%= doc_query poliqarp => 'startsWith(<base/s=s>, [tt/p=KOUS])', cutoff => 1

<p>The <code>endsWith()</code> operation matches when a second subquery matches at the end of the span of a first subquery.</p>

%= doc_query poliqarp => 'endsWith(<base/s=s>, [opennlp/p=NN])', cutoff => 1

<p>The <code>matches()</code> operation matches when a second subquery has exactly the same span as a first subquery.</p>

%= doc_query poliqarp => 'matches(<base/s=s>,[tt/p=CARD][tt/p="N.*"])', cutoff => 1

<p>The <code>overlaps()</code> operation matches when a second subquery has an overlapping span with the first subquery.</p>

%= doc_query poliqarp => 'overlaps([][tt/p=ADJA],{1:[tt/p=ADJA]}[])', cutoff => 1
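
<p>The positional operators can be read as relations between two token spans. The Python sketch below spells out one plausible reading using half-open <code>(start, end)</code> intervals. The <code>Span</code> type, the function names, and in particular the exact definition of <code>overlaps</code> are assumptions of this example and may differ from KorAP's behaviour.</p>

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # index of the first token
    end: int    # index after the last token


def contains(a: Span, b: Span) -> bool:
    return a.start <= b.start and b.end <= a.end


def starts_with(a: Span, b: Span) -> bool:
    return a.start == b.start and b.end <= a.end


def ends_with(a: Span, b: Span) -> bool:
    return a.end == b.end and a.start <= b.start


def matches(a: Span, b: Span) -> bool:
    return a == b


def overlaps(a: Span, b: Span) -> bool:
    # Assumption: sharing at least one token counts as an overlap.
    return a.start < b.end and b.start < a.end


sentence = Span(0, 6)     # a sentence of six tokens
conjunction = Span(0, 1)  # its first token
noun = Span(5, 6)         # its last token

print(contains(sentence, noun))            # True
print(starts_with(sentence, conjunction))  # True
print(ends_with(sentence, conjunction))    # False
print(overlaps(Span(0, 2), Span(1, 3)))    # True
```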

<blockquote class="warning">
<p>Positional operators are still experimental and may change in certain aspects in the future (although the behaviour defined is intended to be stable). There is also known incorrect behaviour which will be corrected in future versions.</p>
<p>Optional operands in position operators, like in <code>contains(&lt;s&gt;,[orth=Baum]*)</code>, have to be mandatory at the moment and will be reformulated to occur at least once.</p>
<p>This behaviour may change in the future.</p>
</blockquote>

<!--
<blockquote>
<p>The KorAP implementation of Poliqarp also supports the postfix <code>within</code> operator, that works similar to the <code>contains()</code> operator, but is not nestable.</p>
</blockquote>
-->

</section>

<section id="class-operators">
<h3>Class Operators</h3>

<p>Classes are used to group submatches by surrounding them with curly brackets and prefixing a class number: <code>{1:...}</code>. Classes can be used to refer to submatches in a query, similar to captures in regular expressions. In Poliqarp+ classes have multiple purposes, with highlighting being the most intuitive one:</p>

%= doc_query poliqarp => 'der {1:{2:[]} Mann}'

%#= doc_query poliqarp => 'der {1:{2:[]{1,4}} {3:Baum}} {4:[]}'

<p>In KorAP, class numbers can range from 1 to 128. If the class number is omitted, the class defaults to class number 1: <code>{...}</code> is equal to <code>{1:...}</code>.</p>
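
<p>The analogy between classes and regular expression captures can be made concrete in Python. The sketch below rewrites <code>der {1:{2:[]} Mann}</code> as a regular expression over plain text. This is an approximation for illustration only, since KorAP classes mark token spans rather than character ranges.</p>

```python
import re

text = "der alte Mann"

# der {1:{2:[]} Mann} written as a regular expression over the plain text:
# group 1 covers "<word> Mann", group 2 covers the unknown word alone.
m = re.fullmatch(r"der ((\S+) Mann)", text)

print(m.group(1))  # alte Mann
print(m.group(2))  # alte
```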

<h4>Match Modification</h4>

<p>Based on classes, matches may be modified. The <code>focus()</code> operator restricts the span of a match to the boundary of a certain class.</p>

%= doc_query poliqarp => 'focus(der {Baum})'

<p>The query above will search for the sequence <code>der Baum</code> but the match will be limited to <code>Baum</code>. You can think of <code>der</code> in this query as a positive look-behind zero-length assertion in regular expressions.</p>
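
<p>The look-behind comparison can be tried out directly in Python. In this sketch the text is invented, and the regular expression works on characters rather than tokens, so it only approximates what <code>focus(der {Baum})</code> does.</p>

```python
import re

text = "der Baum und ein Baum"

# focus(der {Baum}): require "der" before the match, but report only "Baum".
hits = [m.span() for m in re.finditer(r"(?<=der )Baum", text)]
print(hits)  # [(4, 8)]
```

<p>Only the first <code>Baum</code> is reported, because the second one is preceded by <code>ein</code>.</p>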

<p>But focus is far more useful if you are searching for matches without knowing the surface form. For example, to find all terms between the words "der" and "Mann" you can search:</p>

%= doc_query poliqarp => 'focus(der {[]} Mann)'

<p>This will limit the match to all interesting terms in between "der" and "Mann". Or you may want to search for all words immediately following the sequence "der alte und":</p>

%= doc_query poliqarp => 'focus(der alte und {[]})'

<!--
<p><code>focus()</code> is especially useful if you are searching for matches in certain areas, for example in quotes using positional operators.
While not being interested in the whole quote as a match, you can focus on what's really relevant to you.</p>

%= doc_query poliqarp => 'focus(1:contains(er []{,10} sagte, 1{Baum}))'
-->

<p>If the class number is omitted, the focus operator defaults to class number 1: <code>focus(...)</code> is equal to <code>focus(1: ...)</code>.</p>

<blockquote class="warning">
<p>As numbers in curly brackets can be ambiguous in certain circumstances, for example <code>[]{3}</code> can be read as either "any word repeated three times" or "any word followed by the number 3 highlighted as class number 1", numbers should always be expressed as <code>[orth=3]</code> for the latter case.</p>
</blockquote>
</section>