Dataset Development and Fine-Tuning of Language Models for Low-Resource Bengali Programming Assistance

Avijit Roy, CUNY John Jay College

ORCID: 0009-0007-8036-0952

ACCESS Allocation Request CIS260616

Abstract: This project aims to develop and evaluate a low-resource artificial intelligence (AI) debugging assistant that supports Bengali-speaking learners in introductory programming environments. Most existing AI coding assistants are optimized for English and require substantial computational resources, which limits their usefulness in regions where computing infrastructure is constrained. The project addresses both problems by fine-tuning open-source language models to generate beginner-friendly debugging explanations in Bengali and by evaluating their reliability on modest hardware. ACCESS GPU resources will be used to train and refine staged versions of the model with parameter-efficient fine-tuning methods such as LoRA or QLoRA. Training workflows will combine iterative dataset refinement with repeated experimental runs across multiple dataset scales and hyperparameter configurations, comparing training stages on explanation clarity, correctness, robustness, and resource efficiency. GPU-enabled cloud environments such as Jetstream2 will allow these experiments to be reproduced across configurations. Software expected to be used includes PyTorch, Transformers, PEFT, TRL, and Unsloth for training and inference, along with Python-based evaluation tools and lightweight web-based interfaces (e.g., Gradio) for testing model usability. Model outputs will be evaluated with structured scoring rubrics that measure correctness, clarity, and pedagogical usefulness. This project builds on ongoing research into equitable AI access and support for low-resource languages.
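As an illustration of the planned workflow, the following minimal sketch enumerates experimental runs across dataset scales and hyperparameter settings, then averages rubric scores for a set of rated model outputs. It is a sketch under stated assumptions, not the project's actual pipeline: the dataset sizes, LoRA ranks, learning rates, the 1 to 5 rating scale, and the names RunConfig, build_run_grid, and rubric_summary are all hypothetical. Only the three rubric criteria come from the abstract.

```python
from dataclasses import dataclass
from itertools import product
from statistics import mean

# Hypothetical placeholder values; the abstract does not state the real settings.
DATASET_SIZES = [500, 2000]      # number of training examples per run
LORA_RANKS = [8, 16]             # adapter rank for LoRA/QLoRA
LEARNING_RATES = [1e-4, 2e-4]

# Rubric criteria named in the abstract.
RUBRIC_CRITERIA = ("correctness", "clarity", "pedagogical_usefulness")


@dataclass(frozen=True)
class RunConfig:
    """One experimental run: a dataset scale plus a hyperparameter setting."""
    dataset_size: int
    lora_rank: int
    learning_rate: float


def build_run_grid():
    """Return every combination of dataset size, LoRA rank, and learning rate."""
    return [
        RunConfig(size, rank, lr)
        for size, rank, lr in product(DATASET_SIZES, LORA_RANKS, LEARNING_RATES)
    ]


def rubric_summary(ratings):
    """Average each criterion over a list of per-response ratings.

    Each rating is a dict mapping every criterion to a score
    (assumed here to be on a 1 to 5 scale).
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        missing = set(RUBRIC_CRITERIA) - set(rating)
        if missing:
            raise ValueError(f"rating is missing criteria: {sorted(missing)}")
    summary = {c: mean(r[c] for r in ratings) for c in RUBRIC_CRITERIA}
    # Unweighted mean of the per-criterion averages.
    summary["overall"] = mean(summary[c] for c in RUBRIC_CRITERIA)
    return summary


if __name__ == "__main__":
    for config in build_run_grid():
        print(config)
    example_ratings = [
        {"correctness": 4, "clarity": 5, "pedagogical_usefulness": 3},
        {"correctness": 2, "clarity": 3, "pedagogical_usefulness": 3},
    ]
    print(rubric_summary(example_ratings))
```

Keeping each run as an immutable record makes it straightforward to log a configuration next to its rubric summary, so that training stages can be compared later. The unweighted overall score is only one possible choice; a weighted scheme may suit the rubric better.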
A related case study has been accepted for poster presentation at the 69th Annual Conference of the International Linguistic Association (ILA), scheduled for April 30 – May 2, 2026, at John Jay College, City University of New York, in New York City. In addition, the primary research model associated with this effort, which examines infrastructure and accessibility challenges affecting underrepresented languages, has been submitted for peer review to ACM COMPASS. The long-term objective of this work is to develop deployable models that run reliably on commodity hardware and support practical use in educational and self-learning environments.

Allocations:

2026 Indiana Jetstream2 GPU 120,000.0 SUs
2026 Indiana Jetstream2 Large Memory 50,000.0 SUs
The estimated value of these awarded resources is $31,305.00. The allocation of these resources represents a considerable investment by the NSF in advanced computing infrastructure for the U.S. The dollar value of the allocation is estimated from the NSF awards supporting the allocated resources.
There are no other allocations for this project.

Other Titles:

There are no prior titles for this project.