Blame - templates/de/doc/ql/poliqarp-plus.html.ep - KorAP/Kalamar


Akron	5474018	2017-06-17 14:17:23 +0200	[diff] [blame^]	1	% layout 'main', title => 'KorAP: Poliqarp+';
				2
				3	<h2>Poliqarp+</h2>
				4
				5	<p>The following tutorial introduces all features provided by our version of the Poliqarp Query Language, along with some KorAP-specific extensions.</p>
				6	%# The following tutorial introduces all features provided by our version of the Poliqarp Query Language and some KorAP specific extensions.
				7
				8	<section id="segments">
				9	<h3>Simple Segments</h3>
				10
				11	<p>The atomic elements of Poliqarp are segments. In most cases a segment represents a word and can be queried directly:</p>
				12	%# Footnote: In the Polish National Corpus, Poliqarp can join several segments when a single word is recognized.
				13
				14	%= doc_query poliqarp => 'Baum'
				15
				16	<p>Sequences of simple segments are separated by whitespace:</p>
				17
				18	%= doc_query poliqarp => 'der Baum'
				19
				20	<p>Simple segments always refer to the surface form of a word. To search for a surface form regardless of case, append <code>/i</code>.</p>
				21
				22	%= doc_query poliqarp => 'laufen/i'
				23
				24	<p>The query above finds all occurrences of <code>laufen</code> regardless of capitalization, so <code>wir laufen</code> is found as well as <code>das Laufen</code> and even <code>"GEH LAUFEN!"</code>.</p>
				25
				26	<h4 id="regexp">Regular Expressions</h4>
				27
				28	<p>Segments can also be queried using <%= doc_link_to 'regular expressions', 'ql', 'regexp' %> by surrounding the segment with double quotes.</p>
				29
				30	%= doc_query poliqarp => '"l(au|ie)fen"'
				31
				32	<p>Regular expressions always match the whole segment, meaning the query above only finds words that start with <code>l</code> and end with <code>n</code>. To match subexpressions, use the <code>/x</code> flag.</p>
				33
				34	%= doc_query poliqarp => '"l(au|ie)fen"/x', cutoff => 1
				35
				36	<p>The <code>/x</code> flag searches for all segments that contain a sequence of characters matching the regular expression. That means the query above is equivalent to the following:</p>
				37
				38	%= doc_query poliqarp => '".*?l(au|ie)fen.*?"', cutoff => 1
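<p>The difference between whole-segment matching and substring matching can be tried out outside of KorAP. The following is a minimal Python sketch, not KorAP's implementation: it assumes that the default behaviour corresponds to <code>re.fullmatch</code> and the <code>/x</code> flag to <code>re.search</code>. The helper names and the word list are made up for illustration, and Python's regular expression dialect differs in details from the one used by the index.</p>

```python
import re

PATTERN = r"l(au|ie)fen"


def matches_whole_segment(segment: str) -> bool:
    """Default behaviour: the expression has to cover the entire segment."""
    return re.fullmatch(PATTERN, segment) is not None


def matches_substring(segment: str) -> bool:
    """Assumed /x behaviour: the expression may match anywhere in the segment."""
    return re.search(PATTERN, segment) is not None


# Hypothetical segments, chosen only to show the difference.
segments = ["laufen", "liefen", "verlaufen", "Laufen", "gehen"]

whole = [s for s in segments if matches_whole_segment(s)]
partial = [s for s in segments if matches_substring(s)]

# Wrapping the expression in ".*?" on both sides and matching the whole
# segment selects the same segments as the substring search.
wrapped = [s for s in segments if re.fullmatch(r".*?" + PATTERN + r".*?", s)]
```

<p>In this sketch <code>verlaufen</code> is only selected by the substring variant, while <code>Laufen</code> is never selected because no case-insensitive flag is set.</p>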
				39
				40	<p>The <code>/x</code> flag can also be used with exact expressions to search for substrings:</p>
				41
				42	%= doc_query poliqarp => 'trenn/xi', cutoff => 1
				43
				44	<p>The query above finds all segments containing the character sequence <code>trenn</code> regardless of case, such as "Trennung", "unzertrennlich", or "Wettrennen".</p>
				45
				46	<blockquote class="warning">
				47	<p>Beware: Queries of this kind (with a leading <code>.*</code> expression) are extremely slow!</p>
				48	</blockquote>
				49
				50	<p>You can again apply the <code>/i</code> flag to search case-insensitively.</p>
				51
				52	%= doc_query poliqarp => '"l(au|ie)fen"/xi', cutoff => 1
				53
				54	</section>
				55
				56	<section id="complex">
				57	<h3>Complex Segments</h3>
				58
				59	<p>Complex segments are expressed in square brackets and carry additional information on the resource of the term under scrutiny by providing key/value pairs, separated by an equals sign.</p>
				60
				61	<p>The KorAP implementation of Poliqarp provides three special segment keys: <code>orth</code> for surface forms, <code>base</code> for lemmata, and <code>pos</code> for part-of-speech annotations. The following complex query finds all occurrences of the surface form <code>Baum</code>.</p>
				62
				63	%# There are more special keys in Poliqarp, but KorAP doesn't provide them.
				64
				65	%= doc_query poliqarp => '[orth=Baum]'
				66
				67	<p>The query is thus equivalent to:</p>
				68
				69	%= doc_query poliqarp => 'Baum'
				70
				71	<p>Complex segments expect simple expressions as values, which means the following expression is valid as well:</p>
				72
				73	%= doc_query poliqarp => '[orth="l(au|ie)fen"/xi]', cutoff => 1
				74
				75	<p>Another special key is <code>base</code>, which refers to the lemma annotation of the <%= doc_link_to 'default foundry', 'data', 'annotation'%>.
				76	The following query finds all occurrences of segments annotated with the lemma <code>Baum</code> by the default foundry.</p>
				77
				78	%= doc_query poliqarp => '[base=Baum]'
				79
				80	<p>The third special key is <code>pos</code>, which refers to the part-of-speech annotation of the <%= doc_link_to 'default foundry', 'data', 'annotation'%>.
				81	The following query finds all attributive adjectives:</p>
				82
				83	%= doc_query poliqarp => '[pos=ADJA]'
				84
				85	<p>Complex segments that request further token annotations can have keys following the <code>foundry/layer</code> notation.
				86	For example, to find all occurrences of plural words as annotated by the <code>mate</code> foundry, you can search with the following query:</p>
				87
				88	%= doc_query poliqarp => '[mate/m=number:pl]'
				89
				90	<h4>Negation</h4>
				91	<p>Terms in complex expressions can be negated by placing an exclamation mark either before the equals sign or before the whole term.</p>
				92
				93	%= doc_query poliqarp => '[pos!=ADJA]'
				94	%= doc_query poliqarp => '[!pos=ADJA]'
				95
				96	<blockquote class="warning">
				97	<p>Beware: Negated complex segments can't be searched on their own in the Lucene index.
				98	However, they work if they are part of a <%= doc_link_to 'sequence', 'ql', 'poliqarp-plus#syntagmatic-operators-sequence'%>.</p>
				99	</blockquote>
				100
				101	<h4 id="empty-segments">Empty Segments</h4>
				102
				103	<p>A special segment is the empty segment, which matches every word in the index.</p>
				104
				105	%= doc_query poliqarp => '[]'
				106
				107	<p>Empty segments are useful for expressing distances between words by using <%= doc_link_to 'repetitions', 'ql', 'poliqarp-plus#syntagmatic-operators-repetitions' %>.</p>
				108
				109	<blockquote class="warning">
				110	<p>Beware: Empty segments can't be searched on their own in the Lucene index.
				111	However, they work if they are part of a <%= doc_link_to 'sequence', 'ql', 'poliqarp-plus#syntagmatic-operators-sequence' %>.</p>
				112	</blockquote>
				113	</section>
				114
				115	%# TODO:
				116
				117	<section id="spans">
				118	<h3>Span Segments</h3>
				119
				120	<p>Not all segments are bound to words - some are bound to concepts spanning multiple words, for example noun phrases, sentences, or paragraphs.
				121	Span segments can be searched for using angle brackets instead of square brackets.</p>
				122
				123	%= doc_query poliqarp => '<xip/c=INFC>'
				124
				125	<p>Otherwise they can be treated in exactly the same way as simple or complex segments.</p>
				126	</section>
				127
				128	<section id="paradigmatic-operators">
				129	<h3>Paradigmatic Operators</h3>
				130
				131	<p>A complex segment can specify multiple properties that a token has to fulfill. For example, to search for all words with the surface form <code>laufe</code> (regardless of capitalization) that have the lemma <code>Lauf</code> (and not, for example, <code>laufen</code>, which would indicate a verb or a gerund), you can search for:</p>
				132
				133	%= doc_query poliqarp => '[orth=laufe/i & base=Lauf]'
				134
				135	<p>The ampersand combines multiple properties with a logical AND.
				136	Terms of the complex segment can be negated as introduced before.</p>
				137
				138	%= doc_query poliqarp => '[orth=laufe/i & base!=Lauf]'
				139
				140	<p>The following query is therefore equivalent:</p>
				141
				142	%= doc_query poliqarp => '[orth=laufe/i & !base=Lauf]'
				143
				144	<p>Alternatives can be expressed by using the pipe symbol:</p>
				145
				146	%= doc_query poliqarp => '[base=laufen | base=gehen]'
				147
				148	<p>All these subexpressions can be grouped using round brackets to form complex boolean expressions:</p>
				149
				150	%= doc_query poliqarp => '[(base=laufen | base=gehen) & tt/pos=VVFIN]'
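<p>The way such boolean expressions select tokens can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how KorAP evaluates queries: tokens are modelled as plain dictionaries from annotation keys to values, and the helper names <code>term</code>, <code>negate</code>, <code>all_of</code>, and <code>any_of</code> as well as the sample tokens are hypothetical.</p>

```python
from typing import Callable, Dict

Token = Dict[str, str]
Predicate = Callable[[Token], bool]


def term(key: str, value: str) -> Predicate:
    """A single key=value term of a complex segment."""
    return lambda token: token.get(key) == value


def negate(predicate: Predicate) -> Predicate:
    """Negation of a term, as in key!=value or !key=value."""
    return lambda token: not predicate(token)


def all_of(*predicates: Predicate) -> Predicate:
    """Logical AND, written with an ampersand in the query."""
    return lambda token: all(p(token) for p in predicates)


def any_of(*predicates: Predicate) -> Predicate:
    """Logical OR, written with a pipe symbol in the query."""
    return lambda token: any(p(token) for p in predicates)


# Corresponds to: [(base=laufen | base=gehen) & tt/pos=VVFIN]
query = all_of(
    any_of(term("base", "laufen"), term("base", "gehen")),
    term("tt/pos", "VVFIN"),
)

# Hypothetical annotated tokens.
tokens = [
    {"orth": "läuft", "base": "laufen", "tt/pos": "VVFIN"},
    {"orth": "gehen", "base": "gehen", "tt/pos": "VVINF"},
    {"orth": "ging", "base": "gehen", "tt/pos": "VVFIN"},
    {"orth": "Baum", "base": "Baum", "tt/pos": "NN"},
]

hits = [t["orth"] for t in tokens if query(t)]
not_lauf = [t["orth"] for t in tokens if negate(term("base", "laufen"))(t)]
```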
				151	</section>
				152
				153	<section id="syntagmatic-operators">
				154	<h3>Syntagmatic Operators</h3>
				155
				156	<h4 id="syntagmatic-operators-sequence">Sequences</h4>
				157
				158	<p>Sequences can be used to search for segments in order. For example, to search for the word "alte" preceded by "der" and followed by "Mann", you can simply search for the sequence of simple expressions separated by whitespace.</p>
				159
				160	%= doc_query poliqarp => 'der alte Mann'
				161
				162	<p>However, you can obviously search using complex segments as well:</p>
				163
				164	%= doc_query poliqarp => '[orth=der][orth=alte][orth=Mann]'
				165
				166	<p>Now you may see the benefit of the empty segment to search for words you don't know:</p>
				167
				168	%= doc_query poliqarp => '[orth=der][][orth=Mann]'
				169
				170	<p>You are also able to mix segments and spans in sequences, for example to search for the word "Der" at the beginning of a sentence (which can be interpreted as the first word after the end of a sentence).</p>
				171
				172	%= doc_query poliqarp => '<base/s=s>[orth=Der]'
				173
				174	<h4>Groups</h4>
				175
				176	...
				177
				178	<h4>Alternation</h4>
				179
				180	<p>Alternations allow for searching alternative segments or sequences of segments, similar to the paradigmatic operator. You have already seen that you can search for both <code>der alte Mann</code> and <code>der junge Mann</code> by typing in:</p>
				181
				182	%= doc_query poliqarp => 'der [orth=alte | orth=junge] Mann'
				183
				184	<p>However, this formulation fails when you want to search for alternations of sequences rather than terms. To search for both <code>dem jungen Mann</code> and <code>der alte Mann</code>, you can use syntagmatic alternations and groups:</p>
				185
				186	%= doc_query poliqarp => '(dem jungen | der alte) Mann'
				187
				188	<p>The pipe symbol works the same way as with the paradigmatic alternation, but supports sequences of different length as operands. The above query for <code>der alte Mann</code> and <code>der junge Mann</code> can therefore be reformulated as:</p>
				189
				190	%= doc_query poliqarp => 'der (junge | alte) Mann'
				191
				192	<h4 id="syntagmatic-operators-repetitions">Repetition</h4>
				193
				194	<p>Repetitions in Poliqarp are realized as in <%= doc_link_to 'regular expressions', 'ql', 'regexp' %>, by giving quantifiers in curly brackets.</p>
				195	<p>To search for a sequence of three occurrences of <code>der</code>, you can formulate your query in any of the following ways - they will have the same results:</p>
				196
				197	%= doc_query poliqarp => 'der der der'
				198	%= doc_query poliqarp => 'der{3}'
				199	%= doc_query poliqarp => '[orth=der]{3}'
				200
				201	<p>In contrast to regular expressions, the repetition operator refers to the given pattern rather than to the matched text. A repeated pattern therefore finds a sequence of segments that each match the pattern, but the segments don't have to be identical. The following query, for example, will match a sequence of three words all starting with <code>la</code> (regardless of case).</p>
				202
				203	%= doc_query poliqarp => '"la.*?"/i{3}'
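<p>The following Python sketch illustrates this reading of repetition. It is a simplified model under stated assumptions, not KorAP's matching code: a text is a plain list of tokens, the hypothetical helper <code>repeated_pattern_matches</code> checks every window of consecutive tokens against the same pattern, and a regular expression backreference is shown only for contrast.</p>

```python
import re
from typing import List


def repeated_pattern_matches(tokens: List[str], pattern: str, times: int) -> List[List[str]]:
    """Return every window of `times` consecutive tokens that each match `pattern` in full."""
    regex = re.compile(pattern, re.IGNORECASE)
    windows = []
    for start in range(len(tokens) - times + 1):
        window = tokens[start:start + times]
        if all(regex.fullmatch(token) for token in window):
            windows.append(window)
    return windows


# Hypothetical token sequence.
tokens = ["Lara", "lacht", "laut", "und", "lange"]

# Pattern repetition: three different words, each starting with "la".
found = repeated_pattern_matches(tokens, r"la.*?", 3)

# Contrast: a backreference repeats the matched text, so it would only
# accept the same word three times in a row.
same_word = re.search(r"\b(la\w*) \1 \1\b", " ".join(tokens), re.IGNORECASE)
```

<p>Here the pattern repetition finds the window <code>Lara lacht laut</code>, whereas the backreference variant finds nothing because no word is repeated.</p>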
				204
				205	<p>The same is true for annotations. The following query will find a sequence of 3 to 4 adjectives as annotated by the TreeTagger foundry, that is preceded by the lemma <code>ein</code> as annotated by the default foundry and followed by a noun as annotated by the XIP foundry. The adjectives do not have to be identical though.</p>
				206
				207	%= doc_query poliqarp => '[base=ein][tt/p=ADJA]{3,4}[xip/p=NOUN]'
				208
				209	<p>In addition to numbered quantities, it is also possible to pass repetition information as Kleene operators <code>?</code>, <code>*</code>, and <code>+</code>.</p>
				210
				211	<p>To search for a sequence of the lemma <code>die</code> followed by the lemma <code>Baum</code> as annotated by the default foundry, but allowing an optional adjective as annotated by the TreeTagger foundry in between, you can search for:</p>
				212
				213	%= doc_query poliqarp => '[base=die][tt/pos=ADJA]?[base=Baum]'
				214
				215	<p>This query is identical to the numbered quantification of:</p>
				216
				217	%= doc_query poliqarp => '[base=die][tt/pos=ADJA]{,1}[base=Baum]'
				218
				219	<p>To search for the same sequences but with unlimited adjectives as annotated by the TreeTagger foundry in between, you can use the Kleene Star:</p>
				220
				221	%= doc_query poliqarp => '[base=die][tt/pos=ADJA]*[base=Baum]'
				222
				223	<p>And to search for this sequence but with at least one adjective in between, you can use the Kleene Plus (all of the following queries are equivalent):</p>
				224
				225	%= doc_query poliqarp => '[base=die][tt/pos=ADJA]+[base=Baum]', cutoff => 1
				226	%= doc_query poliqarp => '[base=die][tt/pos=ADJA]{1,}[base=Baum]', cutoff => 1
				227	%= doc_query poliqarp => '[base=die][tt/pos=ADJA][tt/pos=ADJA]*[base=Baum]', cutoff => 1
				228
				229	<blockquote class="warning">
				230	<p>Repetition operators like <code>{,4}</code>, <code>?</code>, and <code>*</code> make segments or groups of segments optional. If such a query is used on its own and not as part of a sequence (so that there are no mandatory segments in the query), you will be warned by the system that your query won't be treated as optional.</p>
				231	<p>Keep in mind that optionality may be <i>inherited</i>: for example, when you search for <code>(junge|alte)?|tote</code>, one operand of the alternation is optional, which makes the whole query optional as well.</p>
				232	</blockquote>
				233
				234	<p>Repetition can also be used to express distances between segments by using <%= doc_link_to 'empty segments', 'ql', 'poliqarp-plus#empty-segments' %>.</p>
				235
				236	%= doc_query poliqarp => '[base=die][][base=Baum]'
				237	%= doc_query poliqarp => '[base=die][]{2}[base=Baum]', cutoff => 1
				238	%= doc_query poliqarp => '[base=die][]{2,}[base=Baum]', cutoff => 1
				239	%= doc_query poliqarp => '[base=die][]{,3}[base=Baum]', cutoff => 1
				240
				241	<p>Of course, Kleene operators can be used with empty segments as well.</p>
				242
				243	%= doc_query poliqarp => '[base=die][]?[base=Baum]'
				244	%= doc_query poliqarp => '[base=die][]*[base=Baum]', cutoff => 1
				245	%= doc_query poliqarp => '[base=die][]+[base=Baum]', cutoff => 1
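<p>How repeated empty segments express a distance can be sketched as follows. The Python code is a minimal model under stated assumptions, not KorAP's implementation: the text is a list of lemmata, the hypothetical function <code>within_distance</code> counts the tokens between two lemmata, and <code>max_gap=None</code> stands for an open upper bound.</p>

```python
from typing import List, Optional, Tuple


def within_distance(
    lemmas: List[str],
    first: str,
    last: str,
    min_gap: int = 0,
    max_gap: Optional[int] = None,
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) with lemmas[i] == first, lemmas[j] == last
    and between min_gap and max_gap arbitrary tokens in between."""
    pairs = []
    for i, lemma in enumerate(lemmas):
        if lemma != first:
            continue
        for j in range(i + 1 + min_gap, len(lemmas)):
            gap = j - i - 1  # number of tokens covered by the empty segments
            if max_gap is not None and gap > max_gap:
                break
            if lemmas[j] == last:
                pairs.append((i, j))
    return pairs


# Hypothetical lemma sequence.
lemmas = ["die", "alt", "schön", "Baum", "und", "die", "Baum"]

exactly_one = within_distance(lemmas, "die", "Baum", 1, 1)   # []
exactly_two = within_distance(lemmas, "die", "Baum", 2, 2)   # []{2}
two_or_more = within_distance(lemmas, "die", "Baum", 2)      # []{2,}
up_to_three = within_distance(lemmas, "die", "Baum", 0, 3)   # []{,3}
optional = within_distance(lemmas, "die", "Baum", 0, 1)      # []?
any_number = within_distance(lemmas, "die", "Baum", 0)       # []*
at_least_one = within_distance(lemmas, "die", "Baum", 1)     # []+
```

<p>Because an empty segment accepts any token, the gap may itself contain further occurrences of <code>die</code> or <code>Baum</code>, which is why the open-ended variants in this sketch also pair the first <code>die</code> with the last <code>Baum</code>.</p>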
				246
				247	<h4>Position</h4>
				248
				249	<p>Sequences as shown above can all be nested in further complex queries and treated as subqueries (see <%= doc_link_to 'class operators', 'ql', 'poliqarp-plus#class-operators' %> on how to later access these subqueries directly).</p>
				250	<p>Positional operators compare the matches of two subqueries and match if a certain condition regarding their positions holds.</p>
				251	<p>The <code>contains()</code> operation will match when a second subquery matches inside the span of a first subquery.</p>
				252
				253	%= doc_query poliqarp => 'contains(<base/s=s>, [tt/p=KOUS])', cutoff => 1
				254
				255	<p>The <code>startsWith()</code> operation will match when a second subquery matches at the beginning of the span of a first subquery.</p>
				256
				257	%= doc_query poliqarp => 'startsWith(<base/s=s>, [tt/p=KOUS])', cutoff => 1
				258
				259	<p>The <code>endsWith()</code> operation will match when a second subquery matches at the end of the span of a first subquery.</p>
				260
				261	%= doc_query poliqarp => 'endsWith(<base/s=s>, [opennlp/p=NN])', cutoff => 1
				262
				263	<p>The <code>matches()</code> operation will match when a second subquery has exactly the same span as a first subquery.</p>
				264
				265	%= doc_query poliqarp => 'matches(<base/s=s>,[tt/p=CARD][tt/p="N.*"])', cutoff => 1
				266
				267	<p>The <code>overlaps()</code> operation will match when a second subquery has an overlapping span with the first subquery.</p>
				268
				269	%= doc_query poliqarp => 'overlaps([][tt/p=ADJA],{1:[tt/p=ADJA]}[])', cutoff => 1
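<p>The positional conditions can be pictured as comparisons of token offsets. The following Python sketch is one possible reading, not KorAP's definition: spans are assumed to be half-open ranges of token positions, the type <code>Span</code> and the function names are hypothetical, and whether <code>overlaps()</code> in KorAP also accepts containment is not stated here, so the sketch simply treats any shared position as an overlap.</p>

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # position of the first token
    end: int    # position after the last token (exclusive)


def contains(outer: Span, inner: Span) -> bool:
    """inner lies completely inside outer."""
    return outer.start <= inner.start and inner.end <= outer.end


def starts_with(outer: Span, inner: Span) -> bool:
    """inner lies inside outer and begins at the same position."""
    return contains(outer, inner) and outer.start == inner.start


def ends_with(outer: Span, inner: Span) -> bool:
    """inner lies inside outer and ends at the same position."""
    return contains(outer, inner) and outer.end == inner.end


def matches(first: Span, second: Span) -> bool:
    """Both spans cover exactly the same positions."""
    return first == second


def overlaps(first: Span, second: Span) -> bool:
    """Both spans share at least one position (assumption, see above)."""
    return first.start < second.end and second.start < first.end


# Hypothetical spans: a sentence of eight tokens and some sub-matches.
sentence = Span(0, 8)
conjunction = Span(0, 1)
noun = Span(7, 8)
crossing = Span(6, 10)
```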
				270
				271	<blockquote class="warning">
				272	<p>Positional operators are still experimental and may change in certain aspects in the future (although the behaviour defined is intended to be stable). There is also known incorrect behaviour which will be corrected in future versions.</p>
				273	<p>Optional operands in positional operators, as in <code>contains(&lt;s&gt;,[orth=Baum]*)</code>, are currently treated as mandatory and are rewritten to occur at least once.</p>
				274	<p>This behaviour may change in the future.</p>
				275	</blockquote>
				276
				277	<!--
				278	<blockquote>
				279	<p>The KorAP implementation of Poliqarp also supports the postfix <code>within</code> operator, that works similar to the <code>contains()</code> operator, but is not nestable.</p>
				280	</blockquote>
				281	-->
				282
				283	</section>
				284
				285	<section id="class-operators">
				286	<h3>Class Operators</h3>
				287
				288	<p>Classes are used to group sub matches by surrounding curly brackets and a class number <code>{1:...}</code>. Classes can be used to refer to sub matches in a query, similar to captures in regular expressions. In Poliqarp+ classes have multiple purposes, with highlighting being the most intuitive one:</p>
				289
				290	%= doc_query poliqarp => 'der {1:{2:[]} Mann}'
				291
				292	%#= doc_query poliqarp => 'der {1:{2:[]{1,4}} {3:Baum}} {4:[]}'
				293
				294	<p>In KorAP, classes can be numbered from 1 to 128. If the class number is omitted, the class defaults to class number 1: <code>{...}</code> is equal to <code>{1:...}</code>.</p>
				295
				296	<h4>Match Modification</h4>
				297
				298	<p>Based on classes, matches may be modified. The <code>focus()</code> operator restricts the span of a match to the boundary of a certain class.</p>
				299
				300	%= doc_query poliqarp => 'focus(der {Baum})'
				301
				302	<p>The query above will search for the sequence <code>der Baum</code> but the match will be limited to <code>Baum</code>. You can think of <code>der</code> in this query as a positive look-behind zero-length assertion in regular expressions.</p>
				303
				304	<p>But focus is way more useful if you are searching for matches without knowing the surface form. For example, to find all terms between the words "der" and "Mann" you can search:</p>
				305
				306	%= doc_query poliqarp => 'focus(der {[]} Mann)'
				307
				308	<p>This will limit the match to all interesting terms in between "der" and "Mann". Or you may want to search for all words following the sequence "der alte und" immediately:</p>
				309
				310	%= doc_query poliqarp => 'focus(der alte und {[]})'
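<p>The effect of <code>focus()</code> can be sketched with a small Python function. This is an illustrative model under stated assumptions, not KorAP's implementation: the text is a list of tokens, and the hypothetical helper <code>focus_between</code> searches for the sequence <code>left [] right</code> but only returns the token covered by the class.</p>

```python
from typing import List


def focus_between(tokens: List[str], left: str, right: str) -> List[str]:
    """Find every sequence `left [] right` and keep only the middle token,
    mirroring a query of the form focus(left {[]} right)."""
    return [
        tokens[i + 1]
        for i in range(len(tokens) - 2)
        if tokens[i] == left and tokens[i + 2] == right
    ]


# Hypothetical token sequence.
tokens = "der alte Mann und der junge Mann sahen der Frau nach".split()

focused = focus_between(tokens, "der", "Mann")
```

<p>The surrounding words still have to be present for a hit, but they are not part of the returned match, which is the sense in which they act like zero-length assertions.</p>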
				311
				312	<!--
				313	<p><code>focus()</code> is especially useful if you are searching for matches in certain areas, for example in quotes using positional operators.
				314	While not being interested in the whole quote as a match, you can focus on what's really relevant to you.</p>
				315
				316	%= doc_query poliqarp => 'focus(1:contains(er []{,10} sagte, 1{Baum}))'
				317	-->
				318
				319	<p>If the class number is omitted, the focus operator defaults to class number 1: <code>focus(...)</code> is equal to <code>focus(1: ...)</code>.</p>
				320
				321	<blockquote class="warning">
				322	<p>As numbers in curly brackets can be ambiguous in certain circumstances, for example <code>[]{3}</code> can be read as either "any word repeated three times" or "any word followed by the number 3 highlighted as class number 1", numbers should always be expressed as <code>[orth=3]</code> for the latter case.</p>
				323	</blockquote>
				324	</section>