Blame - templates/doc/data/annotation.html.ep - KorAP/Kalamar

Nils Diewald	a31a515	2015-04-17 21:05:23 +0000	[diff] [blame]	1	% layout 'main', title => 'KorAP: Annotations';
				2
				3	<h2>Annotations</h2>
				4
				5	<p>KorAP provides access to multiple levels of annotations originating from multiple resources, so-called <em>foundries</em>.</p>
				6
				7	<section id="base">
				8	<h3>Base Foundry</h3>
Akron	4856781	2017-09-01 16:49:04 +0200	[diff] [blame]	9	<p>The base foundry is available for all corpora and acts as a common ground for document structure annotation in the layer <code>s</code>. It supports three types of spans: <code>&lt;base/s=s&gt;</code> for sentences, <code>&lt;base/s=p&gt;</code> for paragraphs, and <code>&lt;base/s=t&gt;</code> for the text span.</p>
				10	%= doc_query poliqarp => '<base/s=s>', cutoff => 1
Nils Diewald	a31a515	2015-04-17 21:05:23 +0000	[diff] [blame]	11	</section>
				12
				13
				14	<section id="cnx">
				15	<h3>Connexor (<code>cnx</code>)</h3>
				16	<p>Connexor annotations provide the following layers for the <code>cnx</code> prefix:</p>
				17	<dl>
				18	<dt><abbr data-type="token" title="Lemma">l</abbr></dt>
				19	<dd>All lemmas are written in lower case. Compounds are split, e.g. the token "Leitfähigkeit" is matched by the lemmas "leit" and "fähigkeit", but not by the lemma "leitfähigkeit".</dd>
				20	<dt><abbr data-type="token" title="Part-of-Speech">p</abbr></dt>
				21	<dd>Part-of-speech information is written in capital letters and is based on STTS</dd>
				22	<dt><abbr data-type="token" title="Syntactical information">syn</abbr></dt>
				23	<dd>Includes token-based information like <code>@PREMOD</code>, <code>@NH</code>, <code>@MAIN</code> ...</dd>
				24	<dt><abbr data-type="token" title="Morphosyntactical information">m</abbr></dt>
				25	<dd>Includes information about tense (<code>PRES</code> ...), mood (<code>IND</code>), number (<code>PL</code> ...) etc.</dd>
				26	<dt><abbr data-type="span" title="Phrases">c</abbr></dt>
				27	<dd>Only nominal phrases are available; they are written in lower case (<code>np</code>).</dd>
				28	</dl>
Nils Diewald	61e6ff5	2015-05-07 17:26:50 +0000	[diff] [blame]	29	%= doc_query poliqarp => '[cnx/p=CC]', cutoff => 1
Nils Diewald	a31a515	2015-04-17 21:05:23 +0000	[diff] [blame]	30	</section>
				31
				32
				33	<section id="corenlp">
				34	<h3>CoreNLP (<code>corenlp</code>)</h3>
				35	<dl>
Akron	4856781	2017-09-01 16:49:04 +0200	[diff] [blame]	36	<dt><abbr data-type="token" title="Part-of-Speech">p</abbr></dt>
				37	<dd>Part-of-speech information is written in capital letters and is based on STTS</dd>
				38	<dt><abbr data-type="token" title="Constituency">c</abbr></dt>
				39	<dd>Constituency information follows the annotations of the <a href="http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/negra-corpus.html">negr@ corpus</a>.</dd>
				40	<dt><abbr data-type="token" title="Named Entity">ne</abbr></dt>
Nils Diewald	a31a515	2015-04-17 21:05:23 +0000	[diff] [blame]	41	<dd>Contains named entities like <code>I-PER</code>, <code>I-ORG</code> etc.</dd>
Akron	4856781	2017-09-01 16:49:04 +0200	[diff] [blame]	42	<dt><abbr data-type="token" title="Named Entity">ne_hgc_175m_600</abbr></dt>
				43	<dd>See above</dd>
Nils Diewald	a31a515	2015-04-17 21:05:23 +0000	[diff] [blame]	44	<dt><abbr data-type="token" title="Named Entity">ne_dewac_175m_600</abbr></dt>
				45	<dd>See above</dd>
				46	</dl>
Nils Diewald	61e6ff5	2015-05-07 17:26:50 +0000	[diff] [blame]	47	%= doc_query poliqarp => '[corenlp/ne_dewac_175m_600=I-ORG]', cutoff => 1
Nils Diewald	a31a515	2015-04-17 21:05:23 +0000	[diff] [blame]	48	</section>
				49
				50
				51	<section id="tt">
				52	<h3>TreeTagger (<code>tt</code>)</h3>
				53	<dl>
				54	<dt><abbr data-type="token" title="Lemma">l</abbr></dt>
				55	<dd>All non-noun lemmas are written in lower case; noun lemmas are capitalized. Compounds stay intact (e.g. <code>Normalbedingung</code>).</dd>
				56	<dt><abbr data-type="token" title="Part-of-Speech">p</abbr></dt>
				57	<dd>All part-of-speech information is written in capital letters and is based on STTS</dd>
				58	</dl>
Nils Diewald	61e6ff5	2015-05-07 17:26:50 +0000	[diff] [blame]	59	%= doc_query poliqarp => '[tt/p=ADV]', cutoff => 1
Nils Diewald	a31a515	2015-04-17 21:05:23 +0000	[diff] [blame]	60	</section>
				61
				62
Akron	ebc8d93	2015-05-28 18:19:35 +0200	[diff] [blame]	63	<section id="mate">
Nils Diewald	a31a515	2015-04-17 21:05:23 +0000	[diff] [blame]	64	<h3>Mate (<code>mate</code>)</h3>
				65	<dl>
				66	<dt><abbr data-type="token" title="Lemma">l</abbr></dt>
				67	<dd>All lemmas are written in lower case. Compounds stay intact (e.g. <code>buchstabenbezeichnung</code>).</dd>
				68	<dt><abbr data-type="token" title="Part-of-Speech">p</abbr></dt>
				69	<dd>All part-of-speech information is written in capital letters and is based on STTS</dd>
				70	<dt><abbr data-type="token" title="Morphosyntactical information">m</abbr></dt>
				71	<dd>Includes information about tense (<code>tense:pres</code> ...), mood (<code>mood:ind</code>), number (<code>number:pl</code> ...), gender (<code>gender:masc</code> ...) etc.</dd>
				72	</dl>
Nils Diewald	61e6ff5	2015-05-07 17:26:50 +0000	[diff] [blame]	73	%= doc_query poliqarp => '[mate/m=gender:fem]', cutoff => 1
Nils Diewald	a31a515	2015-04-17 21:05:23 +0000	[diff] [blame]	74	</section>
				75
				76
				77	<section id="opennlp">
				78	<h3>OpenNLP (<code>opennlp</code>)</h3>
				79	<dl>
				80	<dt><abbr data-type="token" title="Part-of-Speech">p</abbr></dt>
				81	<dd>All part-of-speech information is written in capital letters and is based on STTS</dd>
				82	</dl>
Nils Diewald	61e6ff5	2015-05-07 17:26:50 +0000	[diff] [blame]	83	%= doc_query poliqarp => '[opennlp/p=PDAT]', cutoff => 1
Nils Diewald	a31a515	2015-04-17 21:05:23 +0000	[diff] [blame]	84	</section>
				85
Akron	4856781	2017-09-01 16:49:04 +0200	[diff] [blame]	86	<!--
Nils Diewald	a31a515	2015-04-17 21:05:23 +0000	[diff] [blame]	87	<section id="xip">
				88	<h3>Xerox Incremental Parser (<code>xip</code>)</h3>
				89	<dl>
				90	<dt><abbr data-type="token" title="Lemma">l</abbr></dt>
				91	<dd>All non-noun lemmas are written in lower case; noun lemmas are capitalized. Compounds are split, e.g. the token <code>Leitfähigkeit</code> is matched by the lemmas <code>leiten</code> and <code>Fähigkeit</code>, and also by a merged and pretty useless <code>leitenfähigkeit</code> (this is going to change).</dd>
				92	<dt><abbr data-type="token" title="Part-of-Speech">p</abbr></dt>
				93	<dd>All part-of-speech information is written in capital letters and is based on STTS</dd>
				94	<dt><abbr data-type="span" title="Phrases">c</abbr></dt>
				95	<dd>Some phrases to create sentences, all upper case (<code>NP</code>, <code>NPA</code>, <code>NOUN</code>, <code>VERB</code>, <code>PREP</code>, <code>AP</code> ...)</dd>
				96	</dl>
Nils Diewald	61e6ff5	2015-05-07 17:26:50 +0000	[diff] [blame]	97	%= doc_query poliqarp => '[xip/p=ADJ]', cutoff => 1
				98	%= doc_query poliqarp => '<xip/c=VERB>', cutoff => 1
Nils Diewald	a31a515	2015-04-17 21:05:23 +0000	[diff] [blame]	99	</section>
Akron	4856781	2017-09-01 16:49:04 +0200	[diff] [blame]	100	-->
Nils Diewald	a31a515	2015-04-17 21:05:23 +0000	[diff] [blame]	101
				102	<section id="default-foundries">
				103	<h3>Default Foundries</h3>
Akron	4856781	2017-09-01 16:49:04 +0200	[diff] [blame]	104	<p>For queries that address a layer without specifying a foundry, KorAP falls back to default foundries<!--, that can be overwritten by user configurations-->. Default foundries are defined for the following layers:</p>
Nils Diewald	a31a515	2015-04-17 21:05:23 +0000	[diff] [blame]	105
				106	<ul>
				107	<li><strong>orth</strong>: <code>opennlp</code></li>
				108	<li><strong>lemma</strong>: <code>tt</code></li>
				109	<li><strong>pos</strong>: <code>tt</code></li>
				110	</ul>
				111
				112	<blockquote>
				113	<p>In the Lucene backend, the <code>orth</code> layer can only be bound to one specific foundry, as only one tokenization is supported.</p>
				114	</blockquote>
				115	</section>