A New Challenge in the Data Processing of Non-Standard Texts Containing Accents / Diacritics: A Case Study

Dobre, Ciprian

doi:10.1109/ICCP48234.2019.8959636

Date Issued

2019

Author(s)

Gavrilă, Veronica

Băjenaru, Lidia

Dobre, Ciprian

DOI

10.1109/ICCP48234.2019.8959636

Abstract

The INTELLIT project develops a virtual online museum of Romanian literature. The data sources provided by the Romanian Academy, such as the General Dictionary of Romanian Literature, the Timeline of the Romanian Literary Life, and the canonical works of Romanian writers, are digitized and indexed using smart text analytics. One of the challenges in this process is dealing with diacritics and textual accents. Here, we present an in-depth analysis of possible solutions and describe our implementation for detecting and processing the various Unicode forms in which such text appears. We present the solution we identified as an accessible way to remove specific Unicode code points, which greatly improves our search and filtering capabilities while still preserving the original source text (at the database level).
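The approach summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes Python's standard `unicodedata` module, and the function names `strip_diacritics` and `build_record` and the field names `original` and `search_key` are hypothetical. The idea is to decompose each character canonically, drop the combining marks, and store the result as a separate search key next to the untouched original.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining marks (accents, diacritics) removed."""
    # NFD splits precomposed characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only code points whose canonical combining class is 0 (non-marks).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever remains into the canonical composed form.
    return unicodedata.normalize("NFC", stripped)


def build_record(original: str) -> dict:
    """Hypothetical record layout: original text plus a diacritic-free search key."""
    return {
        "original": original,  # preserved exactly as supplied
        "search_key": strip_diacritics(original).casefold(),
    }


record = build_record("Băjenaru")
print(record["original"], record["search_key"])
```

Because the search key is derived and stored alongside the source string, a query typed without diacritics can still match, while the displayed text keeps its original spelling. This sketch handles both the comma-below and the cedilla variants of the Romanian letters ș and ț, since each decomposes into a base letter plus a combining mark.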

Subjects

data processing

unicode

text encoding

canonical normalizati...