Added documentation for supported I5 metadata fields
Change-Id: I9af7e848533216386c8de9e5873db6b28ad2159d
diff --git a/Changes b/Changes
index 906a0d0..5f5b710 100644
--- a/Changes
+++ b/Changes
@@ -1,4 +1,4 @@
-0.39 2020-01-16
+0.39 2020-02-11
- Added Talismane support.
- Added "distributor" field to I5 metadata.
- Added DGD link field to I5 metadata.
@@ -6,6 +6,9 @@
- Added support for DGD pseudo-sentences
based on anchor milestones.
- Added brief explanation of the format.
+ - Fixed parsing of editionStmt.
+ - Added documentation for supported I5 metadata
+ fields.
0.38 2019-05-22
- Stop file processing when base tokenization
diff --git a/Readme.pod b/Readme.pod
index eac3a7e..6627baa 100644
--- a/Readme.pod
+++ b/Readme.pod
@@ -463,8 +463,11 @@
(so-called "foundries"; Bański et al. 2013).
Metadata is available in the TEI-P5 variant I5
-(Lüngen and Sperberg-McQueen 2012), while annotations correspond to
-a variant of the TEI-P5 feature structures (TEI Consortium; Lee et al. 2004).
+(Lüngen and Sperberg-McQueen 2012). See the documentation in
+L<KorAP::XML::Meta::I5> for translatable fields.
+
+Annotations correspond to a variant of the TEI-P5 feature structures
+(TEI Consortium; Lee et al. 2004).
Multiple KorAP-XML documents are organized on three levels following
the "IDS Textmodell" (Lüngen and Sperberg-McQueen 2012):
diff --git a/lib/KorAP/XML/Krill.pm b/lib/KorAP/XML/Krill.pm
index d35396e..ec30176 100644
--- a/lib/KorAP/XML/Krill.pm
+++ b/lib/KorAP/XML/Krill.pm
@@ -414,15 +414,15 @@
=head1 COPYRIGHT AND LICENSE
-Copyright (C) 2015-2018, L<IDS Mannheim|http://www.ids-mannheim.de/>
-Author: L<Nils Diewald|http://nils-diewald.de/>
+Copyright (C) 2015-2020, L<IDS Mannheim|https://www.ids-mannheim.de/>
+Author: L<Nils Diewald|https://nils-diewald.de/>
KorAP::XML::Krill is developed as part of the
L<KorAP|http://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
-L<Institute for the German Language (IDS)|http://ids-mannheim.de/>,
+L<Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
member of the
-L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/en/about-us/leibniz-competition/projekte-2011/2011-funding-line-2/>
+L<Leibniz-Gemeinschaft|https://www.leibniz-gemeinschaft.de/en/>
and supported by the L<KobRA|http://www.kobra.tu-dortmund.de> project,
funded by the
L<Federal Ministry of Education and Research (BMBF)|http://www.bmbf.de/en/>.
diff --git a/lib/KorAP/XML/Meta/I5.pm b/lib/KorAP/XML/Meta/I5.pm
index 48565ae..2df85ad 100644
--- a/lib/KorAP/XML/Meta/I5.pm
+++ b/lib/KorAP/XML/Meta/I5.pm
@@ -408,3 +408,152 @@
1;
+
+__END__
+
+=pod
+
+=encoding utf8
+
+=head1 NAME
+
+KorAP::XML::Meta::I5 - Parses I5 metadata of a KorAP-XML document
+
+=head1 DESCRIPTION
+
+Parses I5 metadata of a KorAP-XML document.
+
+Following the data model, all three levels of metadata are parsed, although not
+all levels contain the same information. Metadata defined on the text level
+overrides metadata on the document level, which in turn overrides metadata on
+the corpus level.
+
+=head2 Metadata categories
+
+Krill currently supports indexing the following types of metadata.
+They differ mainly in how they can be used to construct a virtual corpus.
+
+=over 2
+
+=item B<String>
+
+A simple string representation of a metadata field. Useful for fixed values,
+such as I<corpusSigle> or I<language>.
+
+=item B<Text>
+
+A string representation that will be indexed as text, so full-text search
+(like phrase search) is supported. Useful for values where partial matches are
+helpful, like I<title> or I<author>.
+
+=item B<Keywords>
+
+Multiple string representations. Identical to string, but supports multiple
+values in the same field. Useful for multiple given values such as I<textClass>.
+
+=item B<Attachement>
+
+Values that can't be used for the construction of virtual corpora, but are stored
+per document and can be retrieved. Useful for static data to be retrieved such as
+I<reference> or I<externalLink>.
+
+=item B<Date>
+
+A representation of a date that can later be used for date range queries to construct
+virtual corpora. Useful for all date-related information, such as I<pubDate> or I<createDate>.
+
+=back
+
+=head2 Metadata fields
+
+Currently L<KorAP::XML::Meta::I5> recognizes and transfers the following fields. Each is
+given as a SCSS selector rule (plus C<@> for attribute values), followed by the field name
+and the metadata category.
+Where a field name is listed more than once, the order may indicate which value is overwritten.
+
+=over 2
+
+=item B<On all levels>
+
+ (analytic, monogr) editor[role=translator] translator ATTACHEMENT
+ pubPlace@key pubPlaceKey STRING
+ pubPlace pubPlace STRING
+ imprint publisher publisher ATTACHEMENT
+ textDesc textType textType STRING
+ textDesc textDomain textDomain STRING
+ textDesc textTypeArt textTypeArt STRING
+ textDesc textTypeRef textTypeRef STRING
+ pubDate[type=year]
+ & pubDate[type=month]
+ & pubDate[type=day] pubDate DATE
+ creatDate creationDate DATE
+ textClass catRef@target textClass KEYWORDS
+ textClass h.keywords > keyTerm keywords KEYWORDS
+ biblFull editionStmt biblEditionStatement ATTACHEMENT
+ fileDesc editionStmt fileEditionStatement ATTACHEMENT
+ fileDesc publicationStmt > availability availability STRING
+ fileDesc publicationStmt > distributor distributor ATTACHEMENT
+ profileDesc > langUsage > language[id]@id language STRING
+
+=item B<On text level>
+
+ textSigle textSigle STRING
+ fileDesc > titleStmt > t.title title TEXT
+ (analytic, monogr) h.title[type=main] title TEXT
+ (analytic, monogr) h.title[type=sub] subTitle TEXT
+ (analytic, monogr) h.author author TEXT
+ (analytic, monogr) editor[role!=translator] editor ATTACHEMENT
+ sourceDesc reference[type=complete] reference ATTACHEMENT
+ textDesc > column textColumn STRING
+ biblStruct biblScope[type=pp] srcPages ATTACHEMENT
+
+=item B<On document level>
+
+ dokumentSigle docSigle STRING
+ fileDesc > titleStmt > d.title docTitle TEXT
+ (analytic, monogr) h.title[type=main] docTitle TEXT
+ (analytic, monogr) h.title[type=sub] docSubTitle TEXT
+ (analytic, monogr) h.author docAuthor TEXT
+ (analytic, monogr) editor[role!=translator] docEditor ATTACHEMENT
+
+=item B<On corpus level>
+
+ korpusSigle corpusSigle STRING
+ fileDesc > titleStmt > c.title corpusTitle TEXT
+ (analytic, monogr) h.title[type=main] corpusTitle TEXT
+ (analytic, monogr) h.title[type=sub] corpusSubTitle TEXT
+ (analytic, monogr) h.author corpusAuthor TEXT
+ (analytic, monogr) editor[role!=translator] corpusEditor ATTACHEMENT
+
+=back
+
+Some fields are specially formatted, like C<srcPages> or dates.
+In case of Wikipedia texts, C<sourceDesc reference[type=complete]> will be
+turned into an C<externalLink>. In case of DGD/AGD documents, an external link
+to the DGD will be created as C<externalLink>.
+
+
+=head1 AVAILABILITY
+
+ https://github.com/KorAP/KorAP-XML-Krill
+
+
+=head1 COPYRIGHT AND LICENSE
+
+Copyright (C) 2015-2020, L<IDS Mannheim|https://www.ids-mannheim.de/>
+Author: L<Nils Diewald|https://nils-diewald.de/>
+
+KorAP::XML::Krill is developed as part of the
+L<KorAP|https://korap.ids-mannheim.de/>
+Corpus Analysis Platform at the
+L<Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
+member of the
+L<Leibniz-Gemeinschaft|https://www.leibniz-gemeinschaft.de/en/>
+and supported by the L<KobRA|http://www.kobra.tu-dortmund.de> project,
+funded by the
+L<Federal Ministry of Education and Research (BMBF)|http://www.bmbf.de/en/>.
+
+KorAP::XML::Krill is free software published under the
+L<BSD-2 License|https://raw.githubusercontent.com/KorAP/KorAP-XML-Krill/master/LICENSE>.
+
+=cut
diff --git a/script/korapxml2krill b/script/korapxml2krill
index 99d0300..fd95337 100644
--- a/script/korapxml2krill
+++ b/script/korapxml2krill
@@ -1573,8 +1573,11 @@
(so-called "foundries"; Bański et al. 2013).
Metadata is available in the TEI-P5 variant I5
-(Lüngen and Sperberg-McQueen 2012), while annotations correspond to
-a variant of the TEI-P5 feature structures (TEI Consortium; Lee et al. 2004).
+(Lüngen and Sperberg-McQueen 2012). See the documentation in
+L<KorAP::XML::Meta::I5> for translatable fields.
+
+Annotations correspond to a variant of the TEI-P5 feature structures
+(TEI Consortium; Lee et al. 2004).
Multiple KorAP-XML documents are organized on three levels following
the "IDS Textmodell" (Lüngen and Sperberg-McQueen 2012):