% layout 'main', title => 'KorAP: Annotations';

<h2 id="tutorial-top">Annotations</h2>

<p>KorAP provides access to multiple levels of annotations originating from multiple resources, so-called <em>foundries</em>.</p>

<section id="base">
  <h3>Base Foundry</h3>
  <p>The base foundry is available for all corpora and acts as a common ground for document structure annotation in the layer <code>s</code>.</p>
  <dl>
    <dt><abbr data-type="token" title="Structure">s</abbr></dt>
    <dd>Document structure supporting the spans: <code>&lt;base/s=s&gt;</code> for sentences, <code>&lt;base/s=p&gt;</code> for paragraphs, and <code>&lt;base/s=t&gt;</code> for the text span.</dd>
  </dl>

  %= doc_query poliqarp => '<base/s=s>', cutoff => 1
</section>


<!--
<section id="cnx">
  <h3>Connexor (<code>cnx</code>)</h3>
  <p>Connexor annotations provide the following layers for the <code>cnx</code> prefix:</p>
  <dl>
    <dt><abbr data-type="token" title="Lemma">l</abbr></dt>
    <dd>All lemmas are written in lower case. Compounds are split, e.g. the token "Leitfähigkeit" is matched by the lemmas "leit" and "fähigkeit", not by the lemma "leitfähigkeit"</dd>
    <dt><abbr data-type="token" title="Part-of-Speech">p</abbr></dt>
    <dd>Part-of-speech information is written in capital letters and is based on STTS</dd>
    <dt><abbr data-type="token" title="Syntactical information">syn</abbr></dt>
    <dd>Includes token based information like <code>@PREMOD</code>, <code>@NH</code>, <code>@MAIN</code> ...</dd>
    <dt><abbr data-type="token" title="Morphosyntactical information">m</abbr></dt>
    <dd>Includes information about tense (<code>PRES</code> ...), mode (<code>IND</code>), number (<code>PL</code> ...) etc.</dd>
    <dt><abbr data-type="span" title="Phrases">c</abbr></dt>
    <dd>Only nominal phrases are available and all nominal phrases are written in lower case (<code>np</code>)</dd>
  </dl>
  %= doc_query poliqarp => '[cnx/p=CC]', cutoff => 1
</section>
-->

<section id="dereko">
  <h3>DeReKo (<code>dereko</code>)</h3>
  <p>DeReKo annotations provide the following layer for the <code>dereko</code> prefix:</p>
  <dl>
    <dt><abbr data-type="token" title="Structure">s</abbr></dt>
    <dd>Document structure as encoded in the <%= doc_ext_link_to 'I5 text document', 'http://www1.ids-mannheim.de/kl/projekte/korpora/textmodell.html' %>.</dd>
  </dl>
  %= doc_query poliqarp => 'startsWith(<dereko/s=s>, Fragestunde)', cutoff => 1
</section>


<section id="corenlp">
  <h3>CoreNLP (<code>corenlp</code>)</h3>
  <p>CoreNLP annotations provide the following layers for the <code>corenlp</code> prefix:</p>
  <dl>
    <dt><abbr data-type="token" title="Part-of-Speech">p</abbr></dt>
    <dd>Part-of-speech information is written in capital letters and is based on STTS</dd>
    <dt><abbr data-type="token" title="Constituency">c</abbr></dt>
    <dd>Constituency information follows the annotations of the <%= doc_ext_link_to 'negr@ corpus', 'http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/negra-corpus.html' %>.</dd>
    <dt><abbr data-type="token" title="Named Entity">ne</abbr></dt>
    <dd>Contains named entities like <code>I-PER</code>, <code>I-ORG</code> etc.</dd>
    <dt><abbr data-type="token" title="Named Entity">ne_hgc_175m_600</abbr></dt>
    <dd>See <code>ne</code> above</dd>
    <dt><abbr data-type="token" title="Named Entity">ne_dewac_175m_600</abbr></dt>
    <dd>See <code>ne</code> above</dd>
  </dl>
  %= doc_query poliqarp => '[corenlp/ne_dewac_175m_600=I-ORG]', cutoff => 1
</section>


<section id="tt">
  <h3>TreeTagger (<code>tt</code>)</h3>
  <p>TreeTagger annotations provide the following layers for the <code>tt</code> prefix:</p>
  <dl>
    <dt><abbr data-type="token" title="Lemma">l</abbr></dt>
    <dd>All non-noun lemmas are written in lower case, nouns are capitalized. Compounds stay intact (e.g. <code>Normalbedingung</code>)</dd>
    <dt><abbr data-type="token" title="Part-of-Speech">p</abbr></dt>
    <dd>All part-of-speech information is written in capital letters and is based on STTS</dd>
  </dl>
  %= doc_query poliqarp => '[tt/p=ADV]', cutoff => 1
</section>


<!--
<section id="mate">
  <h3>Mate (<code>mate</code>)</h3>
  <dl>
    <dt><abbr data-type="token" title="Lemma">l</abbr></dt>
    <dd>All lemmas are written in lower case. Compounds stay intact (e.g. <code>buchstabenbezeichnung</code>)</dd>
    <dt><abbr data-type="token" title="Part-of-Speech">p</abbr></dt>
    <dd>All part-of-speech information is written in capital letters and is based on STTS</dd>
    <dt><abbr data-type="token" title="Morphosyntactical information">m</abbr></dt>
    <dd>Includes information about tense (<code>tense:pres</code> ...), mode (<code>mood:ind</code>), number (<code>number:pl</code> ...), gender (<code>gender:masc</code> ...) etc.</dd>
  </dl>
  %= doc_query poliqarp => '[mate/m=gender:fem]', cutoff => 1
</section>
-->

<section id="malt">
  <h3>Malt (<code>malt</code>)</h3>
  <p>Malt annotations provide the following layer for the <code>malt</code> prefix:</p>
  <dl>
    <dt><abbr data-type="token" title="Dependency">d</abbr></dt>
    <dd>Dependency information</dd>
  </dl>
  %= doc_query annis => 'tt/p="PPOSAT" ->malt/d[func="DET"] node', cutoff => 1
</section>

<section id="opennlp">
  <h3>OpenNLP (<code>opennlp</code>)</h3>
  <p>OpenNLP annotations provide the following layer for the <code>opennlp</code> prefix:</p>
  <dl>
    <dt><abbr data-type="token" title="Part-of-Speech">p</abbr></dt>
    <dd>All part-of-speech information is written in capital letters and is based on STTS</dd>
  </dl>
  %= doc_query poliqarp => '[opennlp/p=PDAT]', cutoff => 1
</section>


<section id="marmot">
  <h3>Marmot (<code>marmot</code>)</h3>
  <p>Marmot annotations provide the following layers for the <code>marmot</code> prefix:</p>
  <dl>
    <dt><abbr data-type="token" title="Part-of-Speech">p</abbr></dt>
    <dd>Part-of-speech information is written in capital letters and is based on STTS</dd>
    <dt><abbr data-type="token" title="Morphosyntactical information">m</abbr></dt>
    <dd>Includes information about case (<code>acc</code> ...), degree (<code>pos</code>), gender (<code>fem</code> ...) etc.</dd>
  </dl>
  %= doc_query poliqarp => '[marmot/m=degree:sup & marmot/p=ADJA]', cutoff => 1
</section>


<!--
<section id="xip">
  <h3>Xerox Incremental Parser (<code>xip</code>)</h3>
  <dl>
    <dt><abbr data-type="token" title="Lemma">l</abbr></dt>
    <dd>All non-noun lemmas are written in lower case, nouns are capitalized. Compounds are split, e.g. the token <code>Leitfähigkeit</code> is matched by the lemmas <code>leiten</code> and <code>Fähigkeit</code>, and by a merged and pretty useless <code>leitenfähigkeit</code> (This is going to change)</dd>
    <dt><abbr data-type="token" title="Part-of-Speech">p</abbr></dt>
    <dd>All part-of-speech information is written in capital letters and is based on STTS</dd>
    <dt><abbr data-type="span" title="Phrases">c</abbr></dt>
    <dd>Some phrases to create sentences, all upper case (<code>NP</code>, <code>NPA</code>, <code>NOUN</code>, <code>VERB</code>, <code>PREP</code>, <code>AP</code> ...)</dd>
  </dl>
  %= doc_query poliqarp => '[xip/p=ADJ]', cutoff => 1
  %= doc_query poliqarp => '<xip/c=VERB>', cutoff => 1
</section>
-->

<section id="default-foundries">
  <h3>Default Foundries</h3>
  <p>For queries on specific layers that do not name a foundry, KorAP falls back to default foundries<!--, that can be overwritten by user configurations-->. Default foundries are defined for the following layers:</p>

  <ul>
    <li><strong>orth</strong>: <code>opennlp</code></li>
    <li><strong>lemma</strong>: <code>tt</code></li>
    <li><strong>pos</strong>: <code>tt</code></li>
  </ul>

  <blockquote>
    <p>In the Lucene backend, the <code>orth</code> layer can only be bound to a specific foundry, as only one tokenization is supported.</p>
  </blockquote>
</section>
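
<p>The default-foundry fallback can be pictured as a simple lookup. The following Python sketch is illustrative only: <code>DEFAULT_FOUNDRIES</code> and <code>resolve_foundry</code> are hypothetical names that are not part of KorAP, and the way KorAP resolves defaults internally may differ.</p>

```python
from typing import Tuple

# Default foundries for the three layers listed in the section above.
# Layers that are not listed have no default in this sketch.
DEFAULT_FOUNDRIES = {"orth": "opennlp", "lemma": "tt", "pos": "tt"}


def resolve_foundry(key: str) -> Tuple[str, str]:
    """Return (foundry, layer) for a key such as 'marmot/m' or 'pos'.

    Hypothetical helper for illustration; not a KorAP API.
    """
    foundry, sep, layer = key.partition("/")
    if sep:
        # The foundry was given explicitly, e.g. 'marmot/m'.
        return foundry, layer
    # No foundry given: the whole key is the layer name.
    layer = key
    if layer not in DEFAULT_FOUNDRIES:
        raise KeyError(f"no default foundry for layer {layer!r}")
    return DEFAULT_FOUNDRIES[layer], layer
```

<p>Under these assumptions, <code>resolve_foundry("pos")</code> selects the <code>tt</code> foundry, while an explicit key such as <code>marmot/m</code> is passed through unchanged.</p>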