Every day, thousands of digital documents are generated with useful
information for companies, public organizations, and citizens. Given the
impossibility of processing them manually, the automatic processing of these
documents is becoming increasingly necessary in certain sectors. However, this
task remains challenging, since in most cases text-only parsing is not
enough to fully understand information presented through different
components of varying significance. In this regard, Document Layout Analysis
(DLA), which aims to detect and classify the basic components of a document,
has been a research field of interest for many years. In this work, we use a
procedure to semi-automatically annotate digital documents with different
layout labels, including 4 basic layout blocks and 4 text categories. We apply
this procedure to collect a novel database for DLA in the public affairs
domain, using a set of 24 data sources from the Spanish Administration. The
database comprises 37.9K documents with more than 441K document pages, and more
than 8M labels associated with 8 layout block units. The results of our
experiments validate the proposed text labeling procedure, with an accuracy of
up to 99%.