Blame - templates/doc/ql/cqp.html.ep - KorAP/Kalamar

blob: 9c1ff59510c5fe87092dbb56d1d20bd883d0bfe0

Akron	51da002	2023-10-10 14:17:21 +0200	[diff] [blame]	1	% layout 'main', title => 'KorAP: CQP';
				2
				3	%= page_title
				4
				5	<p>The following documentation introduces all features provided by our
				6	version of the CQP Query Language and some KorAP-specific extensions.
				7	This tutorial is based on the IMS Open Corpus Workbench (CWB)
				8
				9	<%= ext_link_to 'CQP Query Language Tutorial, version 3.4.24 (May 2020)',"https://cwb.sourceforge.io/files/CQP_Manual.pdf" %>
				10	and on
				11	<%= embedded_link_to 'doc', 'the KorAP Poliqarp+ tutorial', 'ql', 'poliqarp-plus' %>.</p>
				12
				13	<section id="segments">
				14	<h3>Simple Segments</h3>
				15	<p>The atomic elements of CQP queries are segments. Most of the time,
				16	segments represent words and can be queried by encapsulating them in
				17	single or double quotes:</p>
				18
				19
				20	%= doc_query cqp => loc('Q_cqp_simplesquote', "** 'Tree'")
				21
				22	<p>or</p>
				23
				24	%= doc_query cqp => loc('Q_cqp_simpledquote', '** "Tree"')
				25
				26	<p>A word segment is always interpreted as a <%= embedded_link_to 'doc', 'regular expression', 'ql', 'regexp' %>, e.g., a query like</p>
				27
				28	%= doc_query cqp => loc('Q_cqp_re', '** "r(u|a)n"'), cutoff => 1
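<p>As a rough illustration outside of KorAP, the following Python sketch shows which tokens an expression like <code>r(u|a)n</code> selects. It assumes that, as in CQP, the expression has to cover the whole token. The function name and the token list are made up for this example, and Python's regular expression dialect is not identical to the one KorAP uses.</p>

```python
import re

# Assumption: the expression must match the complete token,
# so re.fullmatch is used instead of re.search.
PATTERN = re.compile(r"r(u|a)n")


def matching_tokens(tokens):
    """Return the tokens that the pattern matches completely."""
    return [token for token in tokens if PATTERN.fullmatch(token)]


# Hypothetical token list, only for demonstration.
print(matching_tokens(["run", "ran", "rain", "runs", "Ran"]))
```

<p>Under these assumptions only <code>run</code> and <code>ran</code> are selected; <code>Ran</code> is skipped because the comparison is case sensitive.</p>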
				29
				30	%# <p>can return both "Tannenbaum" and "baum".</p>
				31
				32	<p>Sequences of simple segments are expressed using a space delimiter:</p>
				33
				34	%= doc_query cqp => loc('Q_cqp_simpleseq1', '** "the" "Tree"')
				35
				36	%= doc_query cqp => loc('Q_cqp_simpleseq2', "** 'the' 'Tree'")
				37
				38	%# ------------------- Current state (ND)
				39
				40	<p>Originally, CQP was developed as a corpus query processor tool and
				41	any CQP command had to be followed by a semicolon. <%= ext_link_to 'The CQPweb server', "https://cqpweb.lancs.ac.uk/" %> at
				42	Lancaster treats the semicolon as optional, and we implemented it in
				43	the same way.</p>
				44	<p>Simple segments always refer to the surface form of a word. To search
				45	for surface forms without case sensitivity, you can use the <code>%c</code>
				46	flag:</p>
				47
				48
				49	%= doc_query cqp => loc('Q_cqp_simplescflag', '"laufen"%c'), cutoff => 0
				50
				51
				52
				53	<p>The query above will find all occurrences of the term irrespective of
				54	the capitalization of letters.</p>
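<p>The effect of the <code>%c</code> flag can be approximated in Python with <code>re.IGNORECASE</code>. This is only a sketch of the idea, not of KorAP's implementation; the helper name is hypothetical and whole-token matching is assumed.</p>

```python
import re


def matches_ignoring_case(pattern, token):
    """Check whether pattern covers the whole token, ignoring letter case."""
    # Assumption: %c behaves like a case-insensitive whole-token match.
    return re.fullmatch(pattern, token, flags=re.IGNORECASE) is not None


for candidate in ["laufen", "Laufen", "LAUFEN", "laufe"]:
    print(candidate, matches_ignoring_case("laufen", candidate))
```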
				55
				56	<p>Diacritics are not supported yet.</p>
				57
				58	<!-- EM
				59	<p>To ignore diacritics, you can use the <code>%d</code> flag:</p>
				60
				61
				62	%= doc_query cqp => loc('Q_cqp_simplesidia2', '"Fraulein"%d'), cutoff => 0
				63
				64
				65
				66	<p>The query above will find all occurrences of the term irrespective of
				67	the use of diacritics (i.e., <code>Fräulein</code> and <code>Fraulein</code>).</p>
				68
				69	<p>Flags can be combined to ignore both case sensitivity and diacritics:</p>
				70
				71
				72	%= doc_query cqp => loc('Q_cqp_simplesegidia2', '"Fraulein"%cd'), cutoff => 0
				73
				74	<p>The query above will find all occurrences of the term irrespective of
				75	the use of diacritics or of capitalization: <code>fraulein</code>, <code>Fraulein</code>,
				76	<code>fräulein</code>, etc.</p>
				77	-->
				78
				79	<h4 id="regexp">Regular Expressions</h4>
				80	<p>Special regular expression characters such as <code>.</code>, <code>?</code>,
				81	<code>*</code>, <code>+</code>, <code>|</code>, <code>(</code>, <code>)</code>,
				82	<code>[</code>, <code>]</code>, <code>{</code>, <code>}</code>, <code>^</code>,
				83	<code>$</code> have to be escaped with a backslash (<code>\</code>) to be matched literally:</p>
				84	<ul>
				85	<li><code>"?";</code> fails while <code>"\?";</code> returns <code>?</code>.</li>
				86	<li><code>"."</code> returns any character, while <code>"\$\."</code>
				87	returns <code>$.</code></li>
				88	</ul>
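<p>The escaping rule can be tried out with Python's <code>re.escape</code>, which puts a backslash in front of regular expression metacharacters. This is a sketch under the assumption that the backslash convention carries over to CQP queries. Python also escapes some characters that are not listed above, so the output should be checked before it is pasted into a query. The helper name is hypothetical.</p>

```python
import re


def literal_pattern(text):
    """Return a pattern that matches text literally."""
    return re.escape(text)


# An unescaped dot matches any single character ...
print(re.fullmatch(".", "x") is not None)
# ... while the escaped pattern only matches the literal string.
print(literal_pattern("$."))
print(re.fullmatch(literal_pattern("$."), "$.") is not None)
```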
				89	<blockquote class="warning">
				90	<p>Beware: Queries that start with a <code>.*</code> expression can
				91	become extremely slow!</p>
				92	<p>In Poliqarp+, only double quotes are used for regular expressions,
				93	while single quotes are used to mark verbatim strings. In CQP, you
				94	can use the <code>%l</code> flag to match a string verbatim.</p>
				95	</blockquote>
				96	<p>To match a word form containing single or double quotes, use one of
				97	the following methods:</p>
				98	<ul>
				99	<li>if the string you need to match contains either single or
				100	double quotes, use the other quote character to encapsulate the
				101	string: </li>
				102
				103	%= doc_query cqp => loc('Q_cqp_regexqu1', '"It\'s"'), cutoff => 0
				104
				105	<!-- EM
				106	%= doc_query cqp => loc('Q_cqp_xxxx', '\'12"-screen\''), cutoff => 0
				107	-->
				108
				109	<li>escape the quotes by doubling every occurrence of the quote
				110	character inside the string: </li>
				111
				112	%= doc_query cqp => loc('Q_cqp_regexequ1', '\'It\'\'s\''), cutoff => 0
				113
				114	<!-- %= doc_query cqp => loc('Q_cqp_regexequ2', '"12""-screen"'), cutoff => 0 -->
				115
				116	<li>escape the quotes with a backslash (<code>\</code>): </li>
				117
				118	%= doc_query cqp => loc('Q_cqp_regexequ3', "'It\\'s'"), cutoff => 0
				119
				120	<!-- %= doc_query cqp => loc('Q_cqp_regexequ4', '"12\\"-screen"'), cutoff => 0 -->
				121	</ul>
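<p>When queries are generated by a program, the three quoting methods above could be sketched as follows. All function names are hypothetical, and the helpers only deal with quote characters; regular expression metacharacters in the text would still need to be escaped separately.</p>

```python
def quote_with_other(text):
    """Enclose text in the quote character that does not occur in it."""
    if "'" not in text:
        return "'" + text + "'"
    if '"' not in text:
        return '"' + text + '"'
    raise ValueError("text contains both quote characters")


def quote_by_doubling(text, quote="'"):
    """Enclose text in quote and double every inner occurrence of quote."""
    return quote + text.replace(quote, quote * 2) + quote


def quote_by_backslash(text, quote="'"):
    """Enclose text in quote and put a backslash before inner occurrences."""
    return quote + text.replace(quote, "\\" + quote) + quote


print(quote_with_other("It's"))
print(quote_by_doubling("It's"))
print(quote_by_backslash("It's"))
```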
				122	</section>
				123	<section id="complex">
				124	<h3>Complex Segments</h3>
				125	<p>Complex segments are expressed in square brackets and contain
				126	additional information on the resource of the term under scrutiny by
				127	providing key/value pairs, separated by an equals sign.</p>
				128	<p>The KorAP implementation of CQP provides three special segment keys:
				129	<code>orth</code> for surface forms, <code>base</code> for lemmata,
				130	and <code>pos</code> for part of speech. The following complex query
				131	finds all occurrences of the given surface form:</p>
				132
				133	%= doc_query cqp => loc('Q_cqp_compsl1', '[orth="Baum"]'), cutoff => 0
				134
				135
				136	<p>The query is thus equivalent to:</p>
				137
				138	%= doc_query cqp => loc('Q_cqp_compsl2', '"Baum"'), cutoff => 0
				139
				140
				141	<p>Complex segments expect simple expressions as values, meaning that
				142	the following expression is valid as well:</p>
				143
				144	%= doc_query cqp => loc('Q_cqp_compsse', '[orth="l(au|ie)fen"%c]'), cutoff => 1
				145
				146
				147	<p>Another special key is <code>base</code>, referring to the lemma
				148	annotation of the <%= embedded_link_to 'doc', 'default foundry', 'data', 'annotation'%>. The following query finds all occurrences of segments
				149	annotated as a specified lemma by the default foundry:</p>
				150
				151	%= doc_query cqp => loc('Q_cqp_compsbase', '[base="Baum"]'), cutoff => 1
				152
				153
				154	<p>The third special key is <code>pos</code>, referring to the
				155	part-of-speech annotation of the <%= embedded_link_to 'doc', 'default foundry', 'data', 'annotation'%>. The following query finds all attributive adjectives:</p>
				156
				157	%= doc_query cqp => loc('Q_cqp_compspos', '[pos="ADJA"]'), cutoff => 1
				158
				159
				160	<p>Complex segments requesting further token annotations can have keys
				161	following the <code>foundry/layer</code> notation. For example, to
				162	find all occurrences of plural words in a supporting foundry, you can
				163	search using the following queries:</p>
				164
				165	%= doc_query cqp => loc('Q_cqp_compstoken1', '[marmot/m="number":"pl"]'), cutoff => 1
				166
				167
				168	%= doc_query cqp => loc('Q_cqp_compstoken2', "[marmot/m='tense':'pres']"), cutoff => 1
				169
				170
				171	<p>In case an annotation contains special non-alphabetic and non-numeric
				172	characters, the annotation part can be followed by <code>%l</code> to
				173	ensure a verbatim interpretation:</p>
				174
				175	%= doc_query cqp => loc('Q_cqp_compstokenverb', "[orth='https://de.wikipedia.org'%l]"), cutoff => 1
				176
				177
				178	<h4>Negation</h4>
				179	<p>Terms in complex expressions can be negated by placing an
				180	exclamation mark either directly before the equals sign or before
				181	the whole term.</p>
				182
				183	%= doc_query cqp => loc('Q_cqp_neg1', '[pos!="ADJA"] "Haare"'), cutoff => 1
				184
				185
				186
				187	%= doc_query cqp => loc('Q_cqp_neg2', '[!pos="ADJA"] "Haare"'), cutoff => 1
				188
				189
				190	<blockquote class="warning">
				191	<p>Beware: Negated complex segments can't be searched as a single
				192	statement. However, they work in case they are part of a <%= embedded_link_to 'doc', 'sequence', 'ql', 'poliqarp-plus#syntagmatic-operators-sequence'%>.</p>
				193	</blockquote>
				194	<h4 id="empty-segments">Empty Segments</h4>
				195	<p>A special segment is the empty segment, which matches every word in
				196	the index.</p>
				197
				198	%= doc_query cqp => loc('Q_cqp_empseq', '[]'), cutoff => 1
				199
				200
				201	<p>Empty segments are useful to express distances of words by using
				202	<%= embedded_link_to 'doc', 'repetitions', 'ql', 'poliqarp-plus#syntagmatic-operators-repetitions'%>.</p>
				203	<blockquote class="warning">
				204	<p>Beware: Empty segments can't be searched as a single statement.
				205	However, they work in case they are part of a <%= embedded_link_to 'doc', 'sequence', 'ql', 'poliqarp-plus#syntagmatic-operators-sequence'%>.</p>
				206	</blockquote>
				207	</section>
				208	<section id="spans">
				209	<h3>Span Segments</h3>
				210	<p>Not all segments are bound to words; some are bound to concepts
				211	spanning multiple words, for example noun phrases, sentences, or
				212	paragraphs. Span segments are structural elements, and their
				213	syntax depends on the context. When used in complex segments,
				214	they have to be enclosed in angle brackets:
				215
				216	%= doc_query cqp => loc('Q_cqp_spansegm', '<corenlp/c=NP>'), cutoff => 1
				217
				218	Some spans such as <code>s</code> are special keywords that can be
				219	used without angle brackets, as operands of specific functional
				220	operators like <code>within</code>, <code>region</code>, <code>lbound</code>,
				221	<code>rbound</code> or <code>MU(meet)</code>.
				222
				223	<!-- EM
				224	but when used with specific functional
				225	operators like <code>within</code>, <code>region</code>, <code>lbound</code>,
				226	<code>rbound</code> or <code>MU(meet)</code>, the angular brackets
				227	are not mandatory.
				228	-->
				229	</p>
				230	</section>
				231	<section id="paradigmatic-operators">
				232	<h3>Paradigmatic Operators</h3>
				233	<p>A complex segment can state several properties that a token must have. For
				234	example, to search for all words with a certain surface form of a
				235	particular lemma (no matter if capitalized or not), you can search
				236	for:</p>
				237
				238	%= doc_query cqp => loc('Q_cqp_parseg', '[orth="laufe"%c & base="Lauf"]'), cutoff => 1
				239
				240
				241	<p>The ampersand combines multiple properties with a logical AND. Terms
				242	of the complex segment can be negated as introduced before. The
				243	following queries are equivalent:</p>
				244
				245	%= doc_query cqp => loc('Q_cqp_parsegamp1', '[orth="laufe"%c & base!="Lauf"]'), cutoff => 1
				246
				247
				248
				249	%= doc_query cqp => loc('Q_cqp_parsegamp2', '[orth="laufe"%c & !base="Lauf"]'), cutoff => 1
				250
				251
				252	<p>Alternatives can be expressed by using the pipe symbol:</p>
				253
				254	%= doc_query cqp => loc('Q_cqp_parsegalt', '[base="laufen" | base="gehen"]'), cutoff => 1
				255
				256
				257	<p>All these subexpressions can be grouped using round brackets to form
				258	complex boolean expressions:</p>
				259
				260	%= doc_query cqp => loc('Q_cqp_parsegcb', '[(base="laufen" | base="gehen") & tt/pos="VVFIN"]'), cutoff => 1
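<p>How such a boolean expression is evaluated against a single token can be modelled in a few lines of Python. This is a simplified sketch and not KorAP's implementation: tokens are represented as plain dictionaries with one value per key, values are assumed to be regular expressions covering the whole annotation, and all names and sample tokens are hypothetical.</p>

```python
import re


def attr(key, pattern, ignore_case=False):
    """Build a test for one key/value term such as base="laufen"."""
    compiled = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    # Assumption: the value has to cover the whole annotation string.
    return lambda token: compiled.fullmatch(token.get(key, "")) is not None


def all_of(*terms):
    """Logical AND, written as & in the query."""
    return lambda token: all(term(token) for term in terms)


def any_of(*terms):
    """Logical OR, written as | in the query."""
    return lambda token: any(term(token) for term in terms)


# Model of [(base="laufen" | base="gehen") & tt/pos="VVFIN"]
query = all_of(
    any_of(attr("base", "laufen"), attr("base", "gehen")),
    attr("tt/pos", "VVFIN"),
)

# Hypothetical tokens; real annotations depend on the corpus and its foundries.
tokens = [
    {"orth": "läuft", "base": "laufen", "tt/pos": "VVFIN"},
    {"orth": "gehen", "base": "gehen", "tt/pos": "VVINF"},
    {"orth": "sagt", "base": "sagen", "tt/pos": "VVFIN"},
]
print([token["orth"] for token in tokens if query(token)])
```

<p>Only the first token satisfies both the alternative on <code>base</code> and the condition on <code>tt/pos</code>.</p>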
				261
				262
				263	Round brackets can also be used to encapsulate simple segments, to
				264	increase query readability, although they are not necessary:
				265
				266	%= doc_query cqp => loc('Q_cqp_parsegrb', '[(base="laufen" | base="gehen") & (tt/pos="VVFIN")]'), cutoff => 1
				267
				268
				269	The negation operator can also be placed in front of expressions grouped by round
				270	brackets. Be aware of the <%= ext_link_to "De
				271	Morgan's Laws", "https://en.wikipedia.org/wiki/De_Morgan%27s_laws" %> when you design your queries: the following query
				272
				273	%= doc_query cqp => loc('Q_cqp_parsegneg1', '[(!(base="laufen" | base="gehen")) & (tt/pos="VVFIN")]'), cutoff => 1
				274
				275
				276	<a>is logically equivalent to:</a>
				277
				278	%= doc_query cqp => loc('Q_cqp_parsegneg2', '[!(base="laufen") & !(base="gehen") & (tt/pos="VVFIN")]'), cutoff => 1
				279
				280
				281	<a>which can be written more simply as:</a>
				282
				283	%= doc_query cqp => loc('Q_cqp_parsegneg3', '[!base="laufen" & !base="gehen" & tt/pos="VVFIN"]'), cutoff => 1
				284
				285
				286	<a>or as:</a>
				287
				288	%= doc_query cqp => loc('Q_cqp_parsegneg4', '[base!="laufen" & base!="gehen" & tt/pos="VVFIN"]'), cutoff => 1
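<p>The equivalence of these queries can be checked by brute force over all truth values. The following Python sketch treats each term as a boolean and compares the negated group with its rewritten forms. The function names are hypothetical, and the last function shows a common incorrect rewrite for contrast.</p>

```python
from itertools import product


def negated_group(is_laufen, is_gehen, is_vvfin):
    # Model of [(!(base="laufen" | base="gehen")) & (tt/pos="VVFIN")]
    return (not (is_laufen or is_gehen)) and is_vvfin


def separate_negations(is_laufen, is_gehen, is_vvfin):
    # Model of [!base="laufen" & !base="gehen" & tt/pos="VVFIN"]
    return (not is_laufen) and (not is_gehen) and is_vvfin


def wrong_rewrite(is_laufen, is_gehen, is_vvfin):
    # Incorrect: the alternative was kept instead of being turned into AND.
    return ((not is_laufen) or (not is_gehen)) and is_vvfin


def same_results(first, second):
    """Compare two expressions for every combination of truth values."""
    return all(
        first(*values) == second(*values)
        for values in product([False, True], repeat=3)
    )


print(same_results(negated_group, separate_negations))
print(same_results(negated_group, wrong_rewrite))
```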
				289
				290
				291	</section>
				292	<section id="syntagmatic-operators">
				293	<h3>Syntagmatic Operators</h3>
				294	<h4 id="syntagmatic-operators-sequence">Sequences</h4>
				295	<p>Sequences can be used to search for segments in order. For this,
				296	simple expressions are separated by whitespace.</p>
				297
				298	%= doc_query cqp => loc('Q_cqp_syntop1', '"der" "alte" "Mann"'), cutoff => 1
				299
				300
				301	<p>You can, of course, use complex segments in a sequence as well:</p>
				302
				303	%= doc_query cqp => loc('Q_cqp_syntop2', '[orth="der"][orth="alte"][orth="Mann"]'), cutoff => 1
				304
				305
				306	<p>Now you may see the benefit of the empty segment to search for words
				307	you don't know:</p>
				308
				309	%= doc_query cqp => loc('Q_cqp_syntop3', '[orth="der"][][orth="Mann"]'), cutoff => 1
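<p>The role of the empty segment in a sequence can be illustrated with a small Python sketch. It is a simplification: segments are compared by literal equality instead of regular expressions, <code>None</code> stands for the empty segment <code>[]</code>, and the function name and sample sentence are hypothetical.</p>

```python
def find_sequence(tokens, segments):
    """Return the start positions where consecutive tokens fit the segments.

    A segment of None plays the role of the empty segment [] and accepts
    any token at that position.
    """
    hits = []
    for start in range(len(tokens) - len(segments) + 1):
        window = tokens[start:start + len(segments)]
        if all(seg is None or seg == tok for seg, tok in zip(segments, window)):
            hits.append(start)
    return hits


tokens = "der alte Mann und der junge Mann".split()
# Model of [orth="der"][orth="alte"][orth="Mann"]
print(find_sequence(tokens, ["der", "alte", "Mann"]))
# Model of [orth="der"][][orth="Mann"]
print(find_sequence(tokens, ["der", None, "Mann"]))
```

<p>In this model the fully specified sequence is found once, while the version with the empty segment also finds "der junge Mann".</p>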
				310
				311
				312	<h4>Position</h4>
				313	<p>You are also able to mix segments and spans in sequences. In CQP,
				314	spans are marked by XML-like structural elements signalling the
				315	beginning and/or the end of a region and they can be used to look for
				316	segments in a specific position in a bigger structure like a noun
				317	phrase or a sentence.</p>
				318	<p>To search for a word at the beginning of a sentence (or a syntactic
				319	group), you can use the following queries. The two queries in each pair are equivalent.</p>
				320	<ul>
				321	<li>
				322	Both queries match the word "Der" when it is the first word of a sentence:
				323	%= doc_query cqp => loc('Q_cqp_posfirst1', '<base/s=s>[orth="Der"]'), cutoff => 1
				324	%= doc_query cqp => loc('Q_cqp_posfirst2','<s>[orth="Der"]'), cutoff => 1
				325	</li>
				326	<li>Both queries match the word "Der" when it directly follows the end of a sentence:
				327	%= doc_query cqp => loc('Q_cqp_posaend1','</base/s=s>[orth="Der"]'), cutoff => 1
				328	%= doc_query cqp => loc('Q_cqp_posaend2','</s>[orth="Der"]'), cutoff => 1
				329	</li>
				330	</ul>
				331	To search for a word at the end of a sentence (or a syntactic group),
				332	you can use:<br>
				333	<ul>
				334	<li>Match the word "Mann"
				335	when it is the last word of a sentence: </li>
				336
				337	%= doc_query cqp => loc('Q_cqp_posend1','[orth="Mann"]</base/s=s>'), cutoff => 1
				338	%= doc_query cqp => loc('Q_cqp_posend2','[orth="Mann"]</s>'), cutoff => 1
				339
				340	<li>Match the
				341	word "Mann" when it directly precedes the beginning of a sentence, that is, as the
				342	last word of the previous sentence: </li>
				343
				344	%= doc_query cqp => loc('Q_cqp_posbbeg1','[orth="Mann"]<base/s=s>'), cutoff => 1
				345	%= doc_query cqp => loc('Q_cqp_posbbeg2','[orth="Mann"]<s>'), cutoff => 1
				346
				347	</ul>
				348	<blockquote class="warning">
				349	<p>Beware that when searching for longer sequences, sentence boundaries may be crossed. </p>
				350	</blockquote>
				351	<p> In the following example, sequences where "für" occurs in a previous
				352	sentence may also be matched, because of the long sequence of empty
				353	tokens in the query (minimum 20, maximum 25).
				354	</p>
				355
				356	%= doc_query cqp => loc('Q_cqp_posbbeg3', '"für" []{20,25} "uns"</s>'), cutoff => 1
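<p>Why such a query can cross a sentence boundary is easier to see in a small model. The Python sketch below searches a flattened token stream for a first word, a gap of arbitrary tokens, and a sentence-final last word, and reports whether the match starts in an earlier sentence. It is a simplified illustration and not KorAP's matching algorithm; the function name, the sample sentences, and the short gap of 2 to 3 tokens (instead of 20 to 25) are assumptions made for readability.</p>

```python
def find_gap_matches(sentences, first, last, min_gap, max_gap):
    """Model of: "first" []{min_gap,max_gap} "last"</s>

    Returns (start, end, crosses_boundary) tuples of token positions.
    """
    tokens = []
    sentence_of = []  # sentence number for every token position
    for number, sentence in enumerate(sentences):
        tokens.extend(sentence)
        sentence_of.extend([number] * len(sentence))

    matches = []
    for end, token in enumerate(tokens):
        is_sentence_final = (
            end == len(tokens) - 1 or sentence_of[end + 1] != sentence_of[end]
        )
        if token != last or not is_sentence_final:
            continue
        # The empty tokens in the gap are not checked against any boundary.
        for gap in range(min_gap, max_gap + 1):
            start = end - gap - 1
            if start >= 0 and tokens[start] == first:
                crosses = sentence_of[start] != sentence_of[end]
                matches.append((start, end, crosses))
    return matches


# Hypothetical sentences without punctuation tokens.
sentences = [["Danke", "für", "alles"], ["Das", "gehört", "uns"]]
print(find_gap_matches(sentences, "für", "uns", 2, 3))
```

<p>In this example the only match starts at "für" in the first sentence and ends at "uns" in the second one, so the third value of the tuple is <code>True</code>.</p>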
				357
				358	</section>