Full Fine-Tuning Qwen3.5-0.8B for Hindi → Gujarati Translation

Large language models show strong cross-lingual ability, but performance on low-resource pairs stays suboptimal without targeted fine-tuning. Indian-language translation is especially hard because of morphological diversity, script differences, and a shortage of high-quality parallel corpora.

This project instruction-tunes the Qwen3.5-0.8B causal language model for Hindi → Gujarati translation on the AI4Bharat IN22-Conv conversational dataset, using PyTorch, Hugging Face Transformers and Accelerate for efficient single-GPU fine-tuning.

Repository: github.com/Vatsa10/Hindi2Guj-Qwen3.5

Model architecture

The base is Qwen3.5-0.8B, a compact decoder-only transformer:

Transformer decoder stack, ~0.8B parameters
Causal self-attention
Rotary positional embeddings
Tokenizer vocabulary that includes the chat markers <|im_start|> and <|im_end|>

The model is loaded via AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-0.8B"). Because Qwen is a causal LM rather than an encoder-decoder model, translation is framed as an instruction-following generation task instead of a traditional seq2seq pipeline. This follows the emerging paradigm in which LLMs translate through prompt conditioning rather than through explicit encoder-decoder models such as T5 or NLLB.

Dataset

Training uses AI4Bharat IN22-Conv, a multilingual conversational corpus with aligned utterances across 22 Indian languages. Two fields are extracted:

hin_Deva → Hindi source
guj_Gujr → Gujarati target

Pipeline: load via Hugging Face Datasets, keep only rows that contain both Hindi and Gujarati text, shuffle with a fixed seed, then create a train/validation split.
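As a rough illustration of the filter, shuffle and split steps, here is a standard-library sketch. It stands in for the Hugging Face Datasets calls the project uses; the function name prepare_pairs, the seed 42, the 10% validation fraction and all sample rows except the first are assumptions for illustration, not values taken from the repository.

```python
import random

def prepare_pairs(rows, seed=42, val_fraction=0.1):
    """Filter, shuffle and split rows into (train, validation) lists of (hindi, gujarati) pairs."""
    # Keep only rows where both fields are present and non-empty.
    pairs = [
        (row["hin_Deva"].strip(), row["guj_Gujr"].strip())
        for row in rows
        if (row.get("hin_Deva") or "").strip() and (row.get("guj_Gujr") or "").strip()
    ]
    # A dedicated Random instance keeps the shuffle reproducible.
    random.Random(seed).shuffle(pairs)
    # Hold out at least one pair for validation.
    n_val = max(1, int(len(pairs) * val_fraction))
    return pairs[n_val:], pairs[:n_val]

# Illustrative rows; field names follow the dataset columns named above.
rows = [
    {"hin_Deva": "आप कहाँ जा रहे हैं?", "guj_Gujr": "તમે ક્યાં જઈ રહ્યા છો?"},
    {"hin_Deva": "नमस्ते", "guj_Gujr": ""},    # dropped: empty target
    {"hin_Deva": None, "guj_Gujr": "આભાર"},    # dropped: missing source
    {"hin_Deva": "धन्यवाद", "guj_Gujr": "આભાર"},
]
train, val = prepare_pairs(rows)
print(len(train), len(val))
```

Fixing the seed means the same rows land in the validation split on every run, which keeps validation losses comparable across experiments.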

~1,500 aligned sentence pairs
Conversational-style sentences
Devanagari → Gujarati script mapping

Because the dataset is small, multiple epochs are used to allow enough gradient updates.

Instruction formatting

Rather than training on raw sentence pairs, each row is converted into a chat-style instruction prompt:

<|im_start|>user
Translate the following Hindi sentence to Gujarati.
Hindi: {src}
Gujarati:<|im_end|>
<|im_start|>assistant
{tgt}<|im_end|>

The format mirrors instruction tuning, so the model learns translation as an instructed task. It also stays compatible with chat-tuned LLMs, makes inference more consistent because the same template is reused at generation time, and aligns with Qwen's conversational token structure.
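The template can be produced with two small helpers. This is a minimal sketch: the names build_prompt and build_example are hypothetical and may not match the repository's code, but the string layout follows the template shown above.

```python
INSTRUCTION = "Translate the following Hindi sentence to Gujarati."

def build_prompt(src):
    """Return the user turn plus the opening of the assistant turn."""
    return (
        "<|im_start|>user\n"
        f"{INSTRUCTION}\n"
        f"Hindi: {src}\n"
        "Gujarati:<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

def build_example(src, tgt):
    """Return the full training text: prompt followed by the target and its end marker."""
    return build_prompt(src) + f"{tgt}<|im_end|>"

print(build_example("आप कहाँ जा रहे हैं?", "તમે ક્યાં જઈ રહ્યા છો?"))
```

Keeping the prompt and the full example as separate functions makes it easy to measure the prompt length on its own, which is what the loss mask needs.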

Loss masking strategy

Only the target Gujarati tokens contribute to the loss. Prompt tokens are masked with the standard cross-entropy ignore index:

labels[:prompt_len] = -100

This stops the model from learning to reproduce the prompt and focuses it entirely on generating the correct Gujarati. Padding tokens are excluded from the loss too.
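To make the masking concrete, here is a small sketch on plain Python lists. In the real pipeline these would be tensors; the token IDs, pad_id=0 and the short max_len are made up for illustration, and build_labels is a hypothetical helper name.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def build_labels(prompt_ids, target_ids, max_len, pad_id):
    """Return (input_ids, labels), both padded or truncated to max_len."""
    input_ids = (prompt_ids + target_ids)[:max_len]
    labels = list(input_ids)
    # Mask the prompt so that only target tokens are scored.
    prompt_len = min(len(prompt_ids), max_len)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    # Pad to a fixed length; padded positions are masked as well.
    n_pad = max_len - len(input_ids)
    input_ids = input_ids + [pad_id] * n_pad
    labels = labels + [IGNORE_INDEX] * n_pad
    return input_ids, labels

input_ids, labels = build_labels([11, 12, 13], [21, 22], max_len=8, pad_id=0)
print(input_ids)  # [11, 12, 13, 21, 22, 0, 0, 0]
print(labels)     # [-100, -100, -100, 21, 22, -100, -100, -100]
```

Only positions 3 and 4 carry real labels, so the loss is averaged over the two target tokens alone.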

Training configuration

This is full-parameter fine-tuning — every weight is updated.

Model: Qwen3.5-0.8B · Max sequence length: 256
Batch size: 2 · Gradient accumulation: 4 → effective batch size of 8 per optimizer step
Learning rate: 2e-5 · Scheduler: cosine decay with warmup (ratio 0.1)
Epochs: 3+ · Optimizer: AdamW
Mixed precision: bfloat16 — cuts VRAM while staying numerically stable on modern GPUs
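The batch arithmetic and the learning-rate curve implied by these settings can be sketched in a few lines. This is an approximation of a linear-warmup, cosine-decay schedule written from scratch, not the scheduler object used in the project; lr_at is a hypothetical helper, and the training-set size of 1,350 is an assumption (roughly 90% of the ~1,500 pairs).

```python
import math

def lr_at(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at an optimizer step: linear warmup, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

batch_size, grad_accum, epochs = 2, 4, 3
train_pairs = 1350  # assumption: size of the training split
effective_batch = batch_size * grad_accum  # 2 * 4 = 8 examples per optimizer step
steps_per_epoch = math.ceil(train_pairs / effective_batch)
total_steps = steps_per_epoch * epochs

# Sample the schedule at the start, end of warmup, midpoint and end of training.
for step in (0, total_steps // 10, total_steps // 2, total_steps):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

With so few optimizer steps in total, the 10% warmup covers only a few dozen updates, which is worth keeping in mind when changing the batch size or epoch count.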

Distributed training infrastructure

Hugging Face Accelerate is the orchestration layer, handling device placement, gradient accumulation, mixed precision, multi-GPU compatibility and optional DeepSpeed integration:

accelerator = Accelerator(gradient_accumulation_steps=4)
model, dataloaders, optimizer, scheduler = accelerator.prepare(...)

The same script then runs unchanged on a single GPU, multi-GPU, or a distributed cluster — no manual CUDA handling.

Training loop & validation

Each step follows the standard training pattern: forward pass, cross-entropy loss, backpropagation, gradient clipping, optimizer step, scheduler update. The step is wrapped in accelerator.accumulate(model), so gradients accumulate across micro-batches and the effective batch grows without extra GPU memory. After each epoch the model is evaluated on the validation split, and the loss is averaged across processes via accelerator.reduce(...) so that distributed runs report the correct aggregate.
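Why accumulation behaves like a larger batch can be checked on a toy problem. The sketch below uses a one-parameter model in plain Python rather than the actual training loop; the data and the helper grad_mse are invented for illustration. Each micro-batch gradient is divided by the number of accumulation steps, which mirrors the loss scaling Accelerate applies when gradient_accumulation_steps is set.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.5
data = [(float(i), 3.0 * i) for i in range(1, 9)]  # 8 toy (x, y) pairs

# Gradient from a single pass over the full batch of 8.
full_grad = grad_mse(w, data)

# Four micro-batches of 2, each scaled by 1/4 before being summed.
accum_steps, micro = 4, 2
accumulated = 0.0
for k in range(accum_steps):
    micro_batch = data[k * micro:(k + 1) * micro]
    accumulated += grad_mse(w, micro_batch) / accum_steps

print(full_grad, accumulated)
```

The two values match because every micro-batch has the same size. With token-averaged losses and targets of varying length, the match is only approximate, since each micro-batch averages over a different number of unmasked tokens.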

Inference

Inference reuses the same prompt template, with beam search for quality (max_new_tokens=200, num_beams=4):

Input:  आप कहाँ जा रहे हैं?
Output: તમે ક્યાં જઈ રહ્યા છો?
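A generation helper along these lines would reproduce that setup. It is a sketch under assumptions: translate and build_prompt are hypothetical names, and model and tokenizer are expected to be a fine-tuned checkpoint and its tokenizer already loaded through Transformers. Only the generation settings (max_new_tokens=200, num_beams=4) come from the text above.

```python
def build_prompt(src):
    """Same template as in training, stopping where the assistant's reply begins."""
    return (
        "<|im_start|>user\n"
        "Translate the following Hindi sentence to Gujarati.\n"
        f"Hindi: {src}\n"
        "Gujarati:<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

def translate(model, tokenizer, src, max_new_tokens=200, num_beams=4):
    """Generate a Gujarati translation with beam search and return only the new text."""
    inputs = tokenizer(build_prompt(src), return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, num_beams=num_beams
    )
    # Drop the prompt tokens so that only the generated continuation is decoded.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

Calling translate(model, tokenizer, "आप कहाँ जा रहे हैं?") would then return the decoded Gujarati string. Slicing off the prompt before decoding avoids echoing the instruction back in the output.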

Key engineering observations

Instruction formatting significantly stabilizes translation output.
Gradient accumulation enables training on limited VRAM.
Cosine LR scheduling improves convergence.
Loss masking is critical to prevent prompt memorization.

Even with a small dataset, the model learns meaningful Hindi → Gujarati mappings.

Future work

Parameter-efficient fine-tuning (LoRA / QLoRA)
Larger Hindi-Gujarati parallel datasets
BLEU / chrF automatic evaluation
Reinforcement learning (RL) for translation refinement
Deployment as a translation microservice
Multilingual fine-tuning across several Indic pairs for cross-lingual transfer

Conclusion

Combining Qwen3.5-0.8B, AI4Bharat's multilingual data, and Hugging Face training infrastructure makes it possible to build targeted Indian-language translation systems on modest hardware, a practical step toward regional-language accessibility in AI. Code: github.com/Vatsa10/Hindi2Guj-Qwen3.5.