Methodology for creating datasets of parallel sentences in low-resource languages by using AI

Balzhan Abduali, Marek Milosz, Ualsher Tukeyev, Aidana Karibayeva

Abstract

This study addresses data scarcity for low-resource languages, focusing on a methodology for creating parallel corpora in which both languages of the pair are low-resource. The lack of large-scale, high-quality bilingual datasets significantly hinders the development of neural machine translation systems for such languages. The study proposes and validates a methodology for creating such datasets. First, an AI system for generating the parallel corpus is selected according to three criteria: accessibility (free access), translation quality, and efficiency, with the evaluation carried out on a test dataset of 1000 sentences. A substantial Kyrgyz-Kazakh parallel corpus was then created using the selected AI system. Manual error analysis revealed that approximately 0.5% of the translations contained inaccuracies, indicating the need for further post-editing and model refinement. This study contributes to the development of resources for low-resource language pairs and provides practical guidance on creating parallel corpora effectively with modern AI systems.
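The selection step described above can be illustrated with a minimal sketch. The abstract does not say how the three criteria are scored or combined, so everything here is an assumption: the `Candidate` fields, the rule (keep only freely accessible systems, then prefer higher quality, then shorter run time), and the placeholder system names and values, which are not results from the study.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str           # hypothetical system label
    free_access: bool   # accessibility criterion
    quality: float      # assumed quality score on the test set, higher is better
    seconds: float      # assumed time to translate the test set, lower is better


def select_system(candidates):
    """Pick a system: free access is required, then quality, then speed."""
    eligible = [c for c in candidates if c.free_access]
    if not eligible:
        raise ValueError("no freely accessible candidate")
    # Sort key: higher quality first; ties broken by shorter run time.
    return max(eligible, key=lambda c: (c.quality, -c.seconds))


# Placeholder values for illustration only, not measurements from the study.
candidates = [
    Candidate("system_a", True, 0.80, 1200.0),
    Candidate("system_b", False, 0.90, 900.0),
    Candidate("system_c", True, 0.80, 700.0),
]
best = select_system(candidates)
print(best.name)
```

A weighted score over quality and time would be an equally plausible way to combine the criteria; the lexicographic rule is used here only because it needs no weights.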

Authors

Balzhan Abduali
balzhanabdualy@gmail.com (Primary Contact)
Marek Milosz
Ualsher Tukeyev
Aidana Karibayeva
Abduali, B., Milosz, M., Tukeyev, U., & Karibayeva, A. (2025). Methodology for creating datasets of parallel sentences in low-resource languages by using AI. International Journal of Innovative Research and Scientific Studies, 8(9), 13–23. https://doi.org/10.53894/ijirss.v8i9.10605

Article Details