About Distributed Query Processing

Our setup is a collection of index structures that are physically distributed across multiple machines (worker nodes). Their purpose is to allow fast querying of different segmentations (tokenization, sentence boundaries, etc.) and annotations (e.g. part-of-speech tags, dependencies, syntactic constituents) on arbitrary document collections (corpora). Conversely, this implies that the union of all distributed indexes makes up the complete corpus collection.
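The idea that each worker node indexes only its share of the corpus, and that the full answer is the union of the partial answers, can be illustrated with a minimal sketch. It is not the project's implementation: the in-memory dictionaries standing in for worker indexes, the names `WORKER_INDEXES`, `query_worker` and `query_all`, and the use of threads in place of remote machines are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for worker nodes: each dictionary plays the role of
# one node's index over its own disjoint share of the corpus
# (document id -> list of tokens).
WORKER_INDEXES = [
    {"doc1": ["the", "cat", "sleeps"], "doc2": ["a", "dog", "barks"]},
    {"doc3": ["the", "dog", "sleeps"]},
]


def query_worker(index, token):
    """Return (doc_id, position) hits for `token` in a single worker's index."""
    return [
        (doc_id, pos)
        for doc_id, tokens in index.items()
        for pos, tok in enumerate(tokens)
        if tok == token
    ]


def query_all(token):
    """Send the query to every worker and merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(
            pool.map(lambda index: query_worker(index, token), WORKER_INDEXES)
        )
    # The union of the per-worker hits is the answer for the whole corpus.
    return sorted(hit for partial in partials for hit in partial)
```

Because the shares are disjoint in this sketch, merging is a plain concatenation followed by a sort; a real deployment would additionally have to deal with network failures and result ranking, which are left out here.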

Surveying the Hadoop Framework on its Practical Applicability in a Corpus Analysis Platform

1 Introduction

This document gives an introductory description of Hadoop from the perspective of the KorAP project (Korpusanalyseplattform der nächsten Generation, “Next Generation Corpus Analysis Platform”). KorAP aims to process corpora at petabyte scale. The current DeReKo corpus comprises around 5 billion tokens, but it is growing, and additional corpora will be included, so the KorAP platform is expected to process up to 50 billion tokens during its life cycle.