Methodology for creating datasets of parallel sentences in low-resource languages by using AI

Balzhan Abduali; Marek Milosz; Ualsher Tukeyev; Aidana Karibayeva

doi:10.53894/ijirss.v8i9.10605

Engineering

Issue
Vol. 8 No. 9 (2025)

Keywords:

AI systems, Kazakh-Kyrgyz language pair, Low-resource languages, Methodology for creating datasets, Parallel sentences.

Abstract

This study addresses data scarcity for low-resource languages, focusing on a methodology for creating parallel corpora between two low-resource languages. The lack of large-scale, high-quality bilingual datasets significantly hinders the development of neural machine translation systems for such languages. The study proposes and validates a methodology for creating such datasets. The methodology first selects an AI system to generate the parallel corpus, using a test dataset of 1000 sentences and three criteria: accessibility (free access), translation quality, and efficiency. A substantial Kyrgyz-Kazakh parallel corpus was then created using the selected AI system. However, manual error analysis revealed that approximately 0.5% of the translations contained inaccuracies, indicating the need for further post-editing and model refinement. The study contributes resources for low-resource language pairs and provides practical guidance on creating parallel corpora with modern AI systems.
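The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the `Candidate` fields, the `select_system` function, and the tie-breaking order (quality first, then speed) are all assumptions made for illustration, and the abstract does not say how the three criteria are weighted or which quality metric is used.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Candidate:
    """Hypothetical record for one AI system evaluated on the test set."""
    name: str
    free_access: bool        # accessibility criterion
    quality: float           # assumed quality score on the test set, higher is better
    seconds_per_1000: float  # assumed efficiency measure, lower is better


def select_system(candidates: Iterable[Candidate],
                  min_quality: float = 0.0) -> Optional[Candidate]:
    """Pick one system: keep freely accessible candidates that meet a
    quality floor, then prefer higher quality and, on ties, faster systems.
    The ordering of criteria is an assumption, not taken from the paper."""
    eligible = [c for c in candidates
                if c.free_access and c.quality >= min_quality]
    if not eligible:
        return None
    return max(eligible, key=lambda c: (c.quality, -c.seconds_per_1000))
```

Any scores passed to `select_system` would come from evaluating each candidate on the same test sentences; the sketch only shows how the three criteria could be combined once those measurements exist.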

Authors

Balzhan Abduali

Faculty of Information Technology, Al-Farabi Kazakh National University, Almaty 050040, Kazakhstan.

balzhanabdualy@gmail.com (Primary Contact)

Marek Milosz

Faculty of Electrical Engineering and Computer Science, Lublin University of Technology, Lublin, Poland.

https://orcid.org/0000-0002-5898-815X

Ualsher Tukeyev

Faculty of Information Technology, Al-Farabi Kazakh National University, Almaty 050040, Kazakhstan.

https://orcid.org/0000-0001-9878-981X

Aidana Karibayeva

Faculty of Information Technology, Al-Farabi Kazakh National University, Almaty 050040, Kazakhstan.

https://orcid.org/0000-0002-2023-1573

Abduali, B., Milosz, M., Tukeyev, U., & Karibayeva, A. (2025). Methodology for creating datasets of parallel sentences in low-resource languages by using AI. International Journal of Innovative Research and Scientific Studies, 8(9), 13–23. https://doi.org/10.53894/ijirss.v8i9.10605

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.