% layout 'main', title => 'KorAP: Annotations';

<h2 id="tutorial-top">Annotations</h2>

<p>KorAP provides access to multiple levels of annotations originating from multiple resources, so-called <em>foundries</em>.</p>

<section id="base">
  <h3>Base Foundry</h3>
  <p>The base foundry is available for all corpora and acts as a common ground for document structure annotation in the layer <code>s</code>. It supports three types of spans: <code>&lt;base/s=s&gt;</code> for sentences, <code>&lt;base/s=p&gt;</code> for paragraphs, and <code>&lt;base/s=t&gt;</code> for the text span.</p>
  %= doc_query poliqarp => '<base/s=s>', cutoff => 1
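  <p>The other two span types can be queried in the same way. As a sketch, reusing the paragraph span named above, the following query matches paragraphs instead of sentences:</p>
  %= doc_query poliqarp => '<base/s=p>', cutoff => 1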
</section>


<section id="cnx">
  <h3>Connexor (<code>cnx</code>)</h3>
  <p>Connexor annotations provide the following layers under the <code>cnx</code> prefix:</p>
  <dl>
    <dt><abbr data-type="token" title="Lemma">l</abbr></dt>
    <dd>All lemmas are written in lower case. Compounds are split, e.g. the token &quot;Leitfähigkeit&quot; is matched by the lemmas &quot;leit&quot; and &quot;fähigkeit&quot;, but not by the lemma &quot;leitfähigkeit&quot;.</dd>
    <dt><abbr data-type="token" title="Part-of-Speech">p</abbr></dt>
    <dd>Part-of-speech information is written in capital letters and is based on STTS.</dd>
    <dt><abbr data-type="token" title="Syntactical information">syn</abbr></dt>
    <dd>Includes token-based information like <code>@PREMOD</code>, <code>@NH</code>, <code>@MAIN</code> ...</dd>
    <dt><abbr data-type="token" title="Morphosyntactical information">m</abbr></dt>
    <dd>Includes information about tense (<code>PRES</code> ...), mood (<code>IND</code>), number (<code>PL</code> ...) etc.</dd>
    <dt><abbr data-type="span" title="Phrases">c</abbr></dt>
    <dd>Only nominal phrases are available, and all nominal phrases are written in lower case (<code>np</code>).</dd>
  </dl>
  %= doc_query poliqarp => '[cnx/p=CC]', cutoff => 1
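  <p>The lemma and phrase layers are addressed in the same way. The following queries are illustrative sketches built only from the values described above (the split lemma &quot;fähigkeit&quot; and the phrase label <code>np</code>); whether they return matches depends on the corpus at hand:</p>
  %= doc_query poliqarp => '[cnx/l=fähigkeit]', cutoff => 1
  %= doc_query poliqarp => '<cnx/c=np>', cutoff => 1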
</section>


<section id="corenlp">
  <h3>CoreNLP (<code>corenlp</code>)</h3>
  <dl>
    <dt><abbr data-type="token" title="Part-of-Speech">p</abbr></dt>
    <dd>Part-of-speech information is written in capital letters and is based on STTS.</dd>
    <dt><abbr data-type="token" title="Constituency">c</abbr></dt>
    <dd>Constituency information follows the annotations of the <a href="http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/negra-corpus.html">negr@ corpus</a>.</dd>
    <dt><abbr data-type="token" title="Named Entity">ne</abbr></dt>
    <dd>Contains named entities like <code>I-PER</code>, <code>I-ORG</code> etc.</dd>
    <dt><abbr data-type="token" title="Named Entity">ne_hgc_175m_600</abbr></dt>
    <dd>See <code>ne</code> above.</dd>
    <dt><abbr data-type="token" title="Named Entity">ne_dewac_175m_600</abbr></dt>
    <dd>See <code>ne</code> above.</dd>
  </dl>
  %= doc_query poliqarp => '[corenlp/ne_dewac_175m_600=I-ORG]', cutoff => 1
</section>


<section id="tt">
  <h3>TreeTagger (<code>tt</code>)</h3>
  <dl>
    <dt><abbr data-type="token" title="Lemma">l</abbr></dt>
    <dd>All non-noun lemmas are written in lower case, while noun lemmas are capitalized. Compounds stay intact (e.g. <code>Normalbedingung</code>).</dd>
    <dt><abbr data-type="token" title="Part-of-Speech">p</abbr></dt>
    <dd>All part-of-speech information is written in capital letters and is based on STTS.</dd>
  </dl>
  %= doc_query poliqarp => '[tt/p=ADV]', cutoff => 1
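  <p>Because noun lemmas keep their capitalization in this foundry, a lemma query should use the same spelling. A sketch using the example lemma from above (whether it returns matches depends on the corpus):</p>
  %= doc_query poliqarp => '[tt/l=Normalbedingung]', cutoff => 1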
</section>


<section id="mate">
  <h3>Mate (<code>mate</code>)</h3>
  <dl>
    <dt><abbr data-type="token" title="Lemma">l</abbr></dt>
    <dd>All lemmas are written in lower case. Compounds stay intact (e.g. <code>buchstabenbezeichnung</code>).</dd>
    <dt><abbr data-type="token" title="Part-of-Speech">p</abbr></dt>
    <dd>All part-of-speech information is written in capital letters and is based on STTS.</dd>
    <dt><abbr data-type="token" title="Morphosyntactical information">m</abbr></dt>
    <dd>Includes information about tense (<code>tense:pres</code> ...), mood (<code>mood:ind</code>), number (<code>number:pl</code> ...), gender (<code>gender:masc</code> ...) etc.</dd>
  </dl>
  %= doc_query poliqarp => '[mate/m=gender:fem]', cutoff => 1
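  <p>In contrast to TreeTagger, noun lemmas are lower-cased here as well. A sketch of a lemma query using the example lemma from above (again, matches depend on the corpus):</p>
  %= doc_query poliqarp => '[mate/l=buchstabenbezeichnung]', cutoff => 1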
</section>


<section id="opennlp">
  <h3>OpenNLP (<code>opennlp</code>)</h3>
  <dl>
    <dt><abbr data-type="token" title="Part-of-Speech">p</abbr></dt>
    <dd>All part-of-speech information is written in capital letters and is based on STTS.</dd>
  </dl>
  %= doc_query poliqarp => '[opennlp/p=PDAT]', cutoff => 1
</section>

<!--
<section id="xip">
  <h3>Xerox Incremental Parser (<code>xip</code>)</h3>
  <dl>
    <dt><abbr data-type="token" title="Lemma">l</abbr></dt>
    <dd>All non-noun lemmas are written in lower case, while noun lemmas are capitalized. Compounds are split, e.g. the token <code>Leitfähigkeit</code> is matched by the lemmas <code>leiten</code> and <code>Fähigkeit</code>, and also by a merged and pretty useless <code>leitenfähigkeit</code> (this is going to change).</dd>
    <dt><abbr data-type="token" title="Part-of-Speech">p</abbr></dt>
    <dd>All part-of-speech information is written in capital letters and is based on STTS.</dd>
    <dt><abbr data-type="span" title="Phrases">c</abbr></dt>
    <dd>Some phrases to create sentences, all upper case (<code>NP</code>, <code>NPA</code>, <code>NOUN</code>, <code>VERB</code>, <code>PREP</code>, <code>AP</code> ...)</dd>
  </dl>
  %= doc_query poliqarp => '[xip/p=ADJ]', cutoff => 1
  %= doc_query poliqarp => '<xip/c=VERB>', cutoff => 1
</section>
-->

<section id="default-foundries">
  <h3>Default Foundries</h3>
  <p>For queries that address a layer without naming a foundry, KorAP falls back to default foundries<!--, that can be overwritten by user configurations-->. The default foundries apply to the following layers:</p>

  <ul>
    <li><strong>orth</strong>: <code>opennlp</code></li>
    <li><strong>lemma</strong>: <code>tt</code></li>
    <li><strong>pos</strong>: <code>tt</code></li>
  </ul>
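  <p>As an illustration, assuming the defaults listed above and that the layer name can be used directly as a key in a Poliqarp token expression, a query such as <code>[pos=ADV]</code> would be resolved like the explicit <code>[tt/p=ADV]</code>:</p>
  %= doc_query poliqarp => '[pos=ADV]', cutoff => 1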

  <blockquote>
    <p>In the Lucene backend, the <code>orth</code> layer can be bound to only one foundry, as only a single tokenization is supported.</p>
  </blockquote>
</section>