Surveying the Hadoop Framework on its Practical Applicability in a Corpus Analysis Platform

1 Introduction

This document gives an introductory description of Hadoop from the perspective of the KorAP project (Korpusanalyseplattform der nächsten Generation, “Next Generation Corpus Analysis Platform”). KorAP aims to process corpora at petabyte scale. The current DeReKo corpus comprises around 5 billion tokens, but it is growing, and additional corpora will be included, so the KorAP platform is expected to process up to 50 billion tokens during its life cycle.