Add more information to Readme including License
Change-Id: I62395f3490d39a87105943fcc82e96d988205dd5
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..b462bd9
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,24 @@
+Copyright (c) 2022, IDS Mannheim
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+1. Redistributions of source code must retain the above copyright notice,
+ this list of conditions and the following disclaimer.
+
+2. Redistributions in binary form must reproduce the above copyright notice,
+ this list of conditions and the following disclaimer in the documentation
+ and/or other materials provided with the distribution.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
+LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
+GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
+DAMAGE.
\ No newline at end of file
diff --git a/Readme.md b/Readme.md
index dd220ce..21683e4 100644
--- a/Readme.md
+++ b/Readme.md
@@ -1,6 +1,7 @@
# KorAP-Docker
-KorAP consists of several components,
+The [KorAP Corpus Analysis Platform](http://korap.ids-mannheim.de/)
+consists of several independent components,
but they can easily be installed together using
[Docker](https://www.docker.com/).
This repository contains a recipe to install all
@@ -9,11 +10,12 @@
In addition, all relevant tools are installed and
made available that are necessary for data conversion
-and indexing of corpora in the widely used TEI-P5 (I5)
-format for KorAP.
+and indexing of corpora in the widely used TEI-P5
+([I5](https://www.ids-mannheim.de/en/digspra/corpus-linguistics/projects/corpus-development/ids-text-model/)) format for KorAP.
For different options of the tools we refer to the
respective repositories.
+
## Requirements
Install [docker](https://www.docker.com/) and
@@ -22,7 +24,7 @@
## Starting
-To download, intialize and run KorAP pointing to a certain directory index
+To download, initialize and run KorAP pointing to an existing index
(in this example `index` in the local directory), run
```shell
@@ -35,14 +37,17 @@
## Corpus Conversion
-Depending on the corpus data to be indexed, it must first be converted.
-In the case of a conversion from TEI P5 (I5) format,
+In order to create an index based on existing
+corpus data, some conversion steps are usually
+necessary.
+In the case of a conversion from TEI P5
+([I5](https://www.ids-mannheim.de/en/digspra/corpus-linguistics/projects/corpus-development/ids-text-model/)) format,
the tools required for this have already been installed
with the command above.
-In the following we take the
+In the following we take the open part of the
[Dortmunder Chatkorpus 2.2](https://www.uni-due.de/germanistik/chatkorpus/)
-as an example to build an index.
+(Beißwenger & Storrer 2008) as an example to build an index.
The file is located at `example/dck-part1.i5.xml`.
@@ -56,11 +61,19 @@
--input /data/dck-part1.i5.xml > dck.zip
```
-... will convert the i5 file into a KorAP-XML file using
+... will convert the i5 file into a
+[KorAP-XML](https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml)
+file using
[tei2korapxml](https://github.com/KorAP/KorAP-XML-TEI).
+This format is designed to add further arbitrary annotations
+to the primary data. In this example, however, we will stick
+with the inline annotations that the example corpus already
+contains and will make them available later under the label `cmc`.
+
To convert the KorAP-XML archive in a second step
-into individual Krill JSON files, the following command ...
+into individual [Krill](https://github.com/KorAP/Krill)-compatible
+JSON files, the following command ...
```shell
$ mkdir json
@@ -81,11 +94,15 @@
Depending on how the source data is designed,
different parameters must be specified for the conversion.
+Here, the inline token annotation is used as the basis for
+word tokenization, and the included document structure is
+used for default annotation of sentence and paragraph boundaries.
+
## Index Creation
[Krill](https://github.com/KorAP/Krill)'s indexer tool can now
-be used to index the json files:
+be used to index the JSON files:
```shell
$ mkdir index
@@ -96,3 +113,27 @@
After that, the index can be loaded with the aforementioned
call and is searchable via the browser.
+
+## Development and License
+
+**Authors**: [Nils Diewald](https://www.nils-diewald.de/), Harald Lüngen, Marc Kupietz
+
+Copyright (c) 2022, [IDS Mannheim](https://www.ids-mannheim.de/), Germany
+
+KorAP-Docker is published under the BSD 2-Clause License.
+
+The example corpus corresponds to the *release part* of the
+[Dortmunder Chatkorpus 2.2](https://www.uni-due.de/germanistik/chatkorpus/)
+as prepared by
+[DeReKo](https://www.ids-mannheim.de/digspra/kl/projekte/korpora/).
+The corpus is released under the [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) License.
+Legal restrictions may arise from data protection legislation.
+
+
+## Bibliography
+
+Beißwenger, Michael / Storrer, Angelika (2008):
+Corpora of Computer-Mediated Communication.
+In: Anke Lüdeling & Merja Kytö (Eds.): *Corpus Linguistics. An International Handbook.*
+Volume 1. Berlin. New York (Handbooks of Linguistics and Communication Science 29.1),
+pp. 292–308.