% content main => begin

<h2>KorAP-Tutorial: Foundries and Layers</h2>

<p><%= korap_tut_link_to 'Back to Index', '/tutorial' %></p>

<p>KorAP provides access to multiple levels of annotation originating from multiple resources, so-called <i>foundries</i>.</p>

<section name="cheatsheet">
  <ul>
    <li><strong>base</strong>
      <ul>
        <li>Supports two types of spans: <strong>&lt;s&gt;</strong> for sentences and <strong>&lt;p&gt;</strong> for paragraphs. This will likely change in the next index version. These spans carry no prefix information!</li>
      </ul>
    </li>
    <li><strong>cnx</strong>
      <ul>
        <li><strong>l</strong> (Token:Lemma): All lemmas are written in lower case. Compounds are split, e.g. the token &quot;Leitfähigkeit&quot; is matched by the lemmas &quot;leit&quot; and &quot;fähigkeit&quot;, but not by the lemma &quot;leitfähigkeit&quot;.</li>
        <li><strong>p</strong> (Token:Part of Speech): All part-of-speech tags are written in capital letters and are based on STTS.</li>
        <li><strong>syn</strong> (Token:Syntactical information): Includes token-based information like @PREMOD, @NH, @MAIN ...</li>
        <li><strong>m</strong> (Token:Morphosyntactical information): Includes information about tense (&quot;PRES&quot; ...), mood (&quot;IND&quot;), number (&quot;PL&quot; ...) etc.</li>
        <li><strong>c</strong> (Span:Phrases): Only nominal phrases are available, and all nominal phrases are written in lower case (&quot;np&quot;).</li>
      </ul>
    </li>
    <li><strong>corenlp</strong>
      <ul>
        <li><strong>ne_hgc_175m_600</strong> (Token:Named Entity): Contains named entities like &quot;I-PER&quot;, &quot;I-ORG&quot; etc.</li>
        <li><strong>ne_dewac_175_175m_600</strong> (Token:Named Entity): see above</li>
      </ul>
    </li>
    <li><strong>tt</strong>
      <ul>
        <li><strong>l</strong> (Token:Lemma): All non-noun lemmas are written in lower case, nouns are written in upper case. Compounds stay intact (e.g. &quot;Normalbedingung&quot;).</li>
        <li><strong>p</strong> (Token:Part of Speech): All part-of-speech tags are written in capital letters and are based on STTS.</li>
      </ul>
    </li>
    <li><strong>mate</strong>
      <ul>
        <li><strong>l</strong> (Token:Lemma): All lemmas are written in lower case. Compounds stay intact (e.g. &quot;buchstabenbezeichnung&quot;).</li>
        <li><strong>p</strong> (Token:Part of Speech): All part-of-speech tags are written in capital letters and are based on STTS.</li>
        <li><strong>m</strong> (Token:Morphosyntactical information): Includes information about tense (&quot;tense:pres&quot; ...), mood (&quot;mood:ind&quot;), number (&quot;number:pl&quot; ...), gender (&quot;gender:masc&quot;) etc.</li>
      </ul>
    </li>
    <li><strong>opennlp</strong>
      <ul>
        <li><strong>p</strong> (Token:Part of Speech): All part-of-speech tags are written in capital letters and are based on STTS.</li>
      </ul>
    </li>
    <li><strong>xip</strong>
      <ul>
        <li><strong>l</strong> (Token:Lemma): All non-noun lemmas are written in lower case, nouns are written in upper case. Compounds are split, e.g. the token &quot;Leitfähigkeit&quot; is matched by the lemmas &quot;leiten&quot; and &quot;Fähigkeit&quot;, as well as by a merged and hardly useful lemma &quot;leitenfähigkeit&quot; (this is going to change).</li>
        <li><strong>p</strong> (Token:Part of Speech): All part-of-speech tags are written in capital letters and are based on STTS.</li>
        <li><strong>c</strong> (Span:Phrases): Some phrases used to build sentences, all written in upper case (&quot;NP&quot;, &quot;NPA&quot;, &quot;NOUN&quot;, &quot;VERB&quot;, &quot;PREP&quot;, &quot;AP&quot; ...).</li>
      </ul>
    </li>
  </ul>
</section>
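
<p>The lemma casing rules in the cheat sheet can be summed up as a small lookup. The following Python sketch is only an illustration of those rules: <code>LEMMA_CASING</code> and <code>lemma_form</code> are hypothetical names and not part of KorAP, and &quot;upper case&quot; for nouns is assumed to mean a capitalized first letter, as in the examples &quot;Normalbedingung&quot; and &quot;Fähigkeit&quot;. Compound splitting is not modelled.</p>

```python
# Casing conventions copied from the cheat sheet; the table and the
# function are hypothetical helpers, not KorAP code.
LEMMA_CASING = {
    "cnx": "lower",               # all lemmas in lower case
    "mate": "lower",              # all lemmas in lower case
    "tt": "nouns_capitalized",    # nouns capitalized, everything else lower case
    "xip": "nouns_capitalized",   # nouns capitalized, everything else lower case
}


def lemma_form(foundry, lemma, is_noun):
    """Return the casing a lemma is expected to have in the given foundry."""
    rule = LEMMA_CASING.get(foundry)
    if rule is None:
        # base, corenlp and opennlp list no lemma layer in the cheat sheet
        raise ValueError(f"no lemma layer listed for foundry {foundry!r}")
    lowered = lemma.lower()
    if rule == "nouns_capitalized" and is_noun:
        # Assumption: "upper case" means a capitalized first letter
        return lowered[:1].upper() + lowered[1:]
    return lowered
```

<p>Under these assumptions, <code>lemma_form("mate", "Buchstabenbezeichnung", True)</code> yields the lower-case form used by mate, while the same call with <code>"tt"</code> keeps the initial capital.</p>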

<h3>Default Foundries</h3>

<p>For queries on specific layers without a given foundry, KorAP provides default foundries that can be overridden by user configuration. The default foundries apply to the following layers:</p>

<ul>
  <li><strong>orth</strong>: opennlp</li>
  <li><strong>lemma</strong>: opennlp</li>
  <li><strong>pos</strong>: mate</li>
</ul>

<blockquote>
  <p>In the Lucene backend, the <strong>orth</strong> layer can be bound to a specific foundry, as only one tokenization is supported.</p>
</blockquote>
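
<p>The fallback to default foundries can be pictured roughly as follows. This is a minimal Python sketch, not KorAP's actual implementation: <code>resolve_foundry</code> and the <code>user_defaults</code> mapping are hypothetical, and the order of precedence (explicit foundry, then user configuration, then built-in default) is an assumption derived from the description above.</p>

```python
# Built-in defaults as listed above; everything else in this block is a
# hypothetical illustration, not KorAP's API.
DEFAULT_FOUNDRIES = {"orth": "opennlp", "lemma": "opennlp", "pos": "mate"}


def resolve_foundry(layer, foundry=None, user_defaults=None):
    """Pick the foundry for a layer: explicit, then user default, then built-in."""
    if foundry is not None:
        return foundry                      # the query named a foundry itself
    if user_defaults and layer in user_defaults:
        return user_defaults[layer]         # user configuration overrides the default
    if layer in DEFAULT_FOUNDRIES:
        return DEFAULT_FOUNDRIES[layer]     # fall back to the built-in default
    raise KeyError(f"no default foundry for layer {layer!r}")
```

<p>For example, <code>resolve_foundry("pos")</code> falls back to mate, whereas <code>resolve_foundry("pos", user_defaults={"pos": "tt"})</code> picks the user's choice.</p>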

% end