A New Challenge in the Data Processing of Non-Standard Texts Containing Accents / Diacritics: A Case Study

Date Issued
2019
Author(s)
Gavrilă, Veronica
Băjenaru, Lidia
Dobre, Ciprian
DOI
10.1109/ICCP48234.2019.8959636
Abstract
The INTELLIT project develops a virtual online museum of Romanian literature. The data sources provided by the Romanian Academy, such as the General Dictionary of Romanian Literature, the Timeline of the Romanian Literary Life, and the canonical works of Romanian writers, are digitized and indexed using smart text analytics. One of the challenges in this process is dealing with diacritics and textual accents. Here, we present an in-depth analysis of possible solutions and describe our implementation for detecting and processing various forms of Unicode text. The solution we identified is an accessible way to remove specific Unicode code points, which greatly improves our search and filtering capabilities while still preserving the original source (at the database level).
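The general idea in the abstract, removing selected code points for search while keeping the original text, can be illustrated with a minimal Python sketch. This is not the authors' implementation, which this page does not show. It assumes canonical decomposition (NFD) followed by dropping combining marks, and the names `strip_diacritics` and `make_record` are hypothetical helpers introduced only for illustration.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining marks removed (hypothetical helper)."""
    # NFD splits each precomposed letter into a base letter plus combining marks,
    # e.g. "ă" becomes "a" followed by U+0306 (combining breve).
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only code points whose canonical combining class is zero,
    # which drops the accents and leaves the base letters.
    stripped = "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)
    # Recompose whatever remains so the result is in a stable form.
    return unicodedata.normalize("NFC", stripped)


def make_record(original: str) -> dict:
    """Pair the untouched original with a diacritic-free search key (assumed layout)."""
    return {
        "original": original,  # stored as-is, never modified
        "search_key": strip_diacritics(original).casefold(),
    }


if __name__ == "__main__":
    print(strip_diacritics("Băjenaru"))  # → Bajenaru
    print(make_record("Gavrilă"))
```

Because both the comma-below and the cedilla variants of Romanian "ș" and "ț" decompose to a base letter plus a combining mark, this approach maps either spelling to the same search key, while the `original` field keeps the source text unchanged.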
Subjects

data processing

unicode

text encoding

canonical normalizati...
