Full Fine-Tuning Qwen3.5-0.8B for Hindi → Gujarati Translation
A full-parameter fine-tune of Qwen3.5-0.8B for Hindi→Gujarati translation on AI4Bharat IN22-Conv — instruction formatting, loss masking, bf16, gradient accumulation, and Hugging Face Accelerate.
Large language models show strong cross-lingual ability, but performance on low-resource pairs stays suboptimal without targeted fine-tuning. Indian-language translation is especially hard because of morphological diversity, script differences, and limited high-quality parallel corpora.
This project instruction-tunes the Qwen3.5-0.8B causal language model for Hindi → Gujarati translation on the AI4Bharat IN22-Conv conversational dataset, using PyTorch, Hugging Face Transformers and Accelerate for efficient single-GPU fine-tuning.
Repository: github.com/Vatsa10/Hindi2Guj-Qwen3.5
Model architecture
The base is Qwen3.5-0.8B, a compact decoder-only transformer:
- Transformer decoder stack, ~0.8B parameters
- Causal self-attention
- Rotary positional embeddings
- Tokenizer vocabulary that includes the chat special tokens (<|im_start|>, <|im_end|>)
The model is loaded via AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-0.8B"). Because Qwen is a causal LM rather than an encoder-decoder model, translation is framed as an instruction-following generation task instead of a traditional seq2seq pipeline. This follows the increasingly common approach in which LLMs translate through prompt conditioning, in contrast to dedicated encoder-decoder models such as T5 or NLLB.
Dataset
Training uses AI4Bharat IN22-Conv, a multilingual conversational corpus with aligned utterances across 22 Indian languages. Two fields are extracted:
- hin_Deva → Hindi source
- guj_Gujr → Gujarati target
Pipeline: load via Hugging Face Datasets, keep only rows that contain both Hindi and Gujarati text, shuffle with a fixed seed, then create a train/validation split.
- ~1,500 aligned sentence pairs
- Conversational-style sentences
- Devanagari → Gujarati script mapping
Because the dataset is small, multiple epochs are used to allow enough gradient updates.
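The filter, shuffle, and split steps can be sketched in plain Python. This is a minimal illustration and not the repository's code: the project loads the data through Hugging Face Datasets, whereas the sketch works on an in-memory list of dicts. The function name prepare_pairs, the seed 42, and the 10% validation fraction are assumptions; only the field names hin_Deva and guj_Gujr come from the dataset.

```python
import random


def prepare_pairs(rows, seed=42, val_fraction=0.1):
    """Filter, shuffle, and split rows into (train, val) lists of (src, tgt) pairs.

    seed and val_fraction are illustrative defaults, not values from the project.
    """
    # Keep only rows where both the Hindi and Gujarati fields are non-empty.
    pairs = [
        (row["hin_Deva"].strip(), row["guj_Gujr"].strip())
        for row in rows
        if (row.get("hin_Deva") or "").strip() and (row.get("guj_Gujr") or "").strip()
    ]
    # A dedicated Random instance makes the shuffle reproducible.
    random.Random(seed).shuffle(pairs)
    n_val = max(1, int(len(pairs) * val_fraction))
    return pairs[n_val:], pairs[:n_val]


rows = [
    {"hin_Deva": "आप कहाँ जा रहे हैं?", "guj_Gujr": "તમે ક્યાં જઈ રહ્યા છો?"},
    {"hin_Deva": "नमस्ते", "guj_Gujr": ""},  # dropped: Gujarati side is empty
    {"hin_Deva": "धन्यवाद", "guj_Gujr": "આભાર"},
]
train, val = prepare_pairs(rows)
```

Because the seed is fixed, repeated calls return the same split, which keeps validation loss comparable across runs.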
Instruction formatting
Rather than training on raw sentence pairs, each row is converted into a chat-style instruction prompt:
<|im_start|>user
Translate the following Hindi sentence to Gujarati.
Hindi: {src}
Gujarati:<|im_end|>
<|im_start|>assistant
{tgt}<|im_end|>
This mirrors instruction tuning, so the model learns translation as a task instruction. It stays compatible with chat-tuned LLMs, improves inference consistency, and aligns with Qwen's conversational token structure.
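As a rough sketch, the template can be applied with an ordinary format string. The helper name build_example and the exact newline placement between template lines are assumptions; the wording and special tokens are taken from the template shown here. Returning the prompt separately from the full text makes it easy to measure the prompt length later.

```python
PROMPT_TEMPLATE = (
    "<|im_start|>user\n"
    "Translate the following Hindi sentence to Gujarati.\n"
    "Hindi: {src}\n"
    "Gujarati:<|im_end|>\n"
    "<|im_start|>assistant\n"
)


def build_example(src, tgt):
    """Return (prompt, full_text); full_text is the prompt plus the target turn."""
    prompt = PROMPT_TEMPLATE.format(src=src)
    return prompt, prompt + tgt + "<|im_end|>"


prompt, full_text = build_example("आप कहाँ जा रहे हैं?", "તમે ક્યાં જઈ રહ્યા છો?")
```

At training time the full text is tokenized, while the prompt alone determines how many leading tokens belong to the instruction.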
Loss masking strategy
Only the target Gujarati tokens contribute to the loss. Prompt tokens are masked with the standard cross-entropy ignore index:
labels[:prompt_len] = -100
This stops the model from learning to reproduce the prompt and focuses it entirely on generating the correct Gujarati. Padding tokens are excluded from the loss too.
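The masking rule can be illustrated on plain Python lists. The training script presumably operates on tensors, so treat this as a sketch of the logic only. The helper name build_labels and the token ids are hypothetical, and using the attention mask to find padding is an assumption about how padding is detected.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_labels(input_ids, attention_mask, prompt_len):
    """Copy input_ids into labels, masking prompt and padding positions."""
    labels = list(input_ids)
    for i in range(len(labels)):
        # Mask the instruction prefix and every padded position.
        if i < prompt_len or attention_mask[i] == 0:
            labels[i] = IGNORE_INDEX
    return labels


# Hypothetical ids: 3 prompt tokens, 3 target tokens, 2 padding tokens.
input_ids = [11, 12, 13, 21, 22, 23, 0, 0]
attention_mask = [1, 1, 1, 1, 1, 1, 0, 0]
labels = build_labels(input_ids, attention_mask, prompt_len=3)
print(labels)  # [-100, -100, -100, 21, 22, 23, -100, -100]
```

Masking padding through the attention mask rather than by token id avoids accidentally masking a real end-of-turn token when the pad token and the end token share an id.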
Training configuration
This is full-parameter fine-tuning — every weight is updated.
- Model: Qwen3.5-0.8B · Max sequence length: 256
- Batch size: 2 · Gradient accumulation: 4 → effective batch size of 8 per optimizer step
- Learning rate: 2e-5 · Scheduler: cosine decay with warmup (ratio 0.1)
- Epochs: 3+ · Optimizer: AdamW
- Mixed precision: bfloat16 — cuts VRAM while staying numerically stable on modern GPUs
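To see what these settings imply, the sketch below works out the number of optimizer updates and the learning rate at a given update. It is an approximation under stated assumptions: the training-set size of 1,350 is hypothetical (a 90/10 split of roughly 1,500 pairs), a partial accumulation window at the end of an epoch is assumed to still trigger an update, and the warmup length is assumed to be rounded down. The curve has the usual shape of a linear warmup followed by cosine decay to zero.

```python
import math


def optimizer_steps(num_examples, batch_size=2, grad_accum=4, epochs=3):
    """Total optimizer updates over training on a single device."""
    micro_batches = math.ceil(num_examples / batch_size)
    steps_per_epoch = math.ceil(micro_batches / grad_accum)
    return steps_per_epoch * epochs


def lr_at(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate after `step` updates: linear warmup, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# 1,350 training pairs is a hypothetical figure, not a number from the project.
total = optimizer_steps(1350)  # 675 micro-batches -> 169 updates/epoch -> 507
peak_step = int(total * 0.1)   # the warmup ends here, at the full learning rate
```

With only a few hundred updates in total, the warmup covers about 50 of them, which is one reason a small dataset benefits from several epochs.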
Distributed training infrastructure
Hugging Face Accelerate is the orchestration layer, handling device placement, gradient accumulation, mixed precision, multi-GPU compatibility and optional DeepSpeed integration:
accelerator = Accelerator(gradient_accumulation_steps=4)
model, dataloaders, optimizer, scheduler = accelerator.prepare(...)
The same script then runs unchanged on a single GPU, multi-GPU, or a distributed cluster — no manual CUDA handling.
Training loop & validation
Each step follows the standard training pattern: forward pass, cross-entropy loss, backpropagation, gradient clipping, optimizer step, scheduler update. Gradient accumulation is handled by wrapping each step in accelerator.accumulate(model), which enables a larger effective batch without additional GPU memory. After each epoch the model is evaluated on the validation split, and the loss is averaged across processes via accelerator.reduce(...) so that distributed runs report a correctly aggregated value.
Inference
Inference reuses the same prompt template, with beam search for quality (max_new_tokens=200, num_beams=4):
Input: आप कहाँ जा रहे हैं?
Output: તમે ક્યાં જઈ રહ્યા છો?
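A generation helper along these lines would produce such output. It is a sketch under assumptions: the function names build_prompt and translate are hypothetical, and model and tokenizer are assumed to be the fine-tuned checkpoint loaded with AutoModelForCausalLM and AutoTokenizer. The generation settings are the ones stated in this section.

```python
def build_prompt(src):
    """Training-time template up to the start of the assistant turn."""
    return (
        "<|im_start|>user\n"
        "Translate the following Hindi sentence to Gujarati.\n"
        f"Hindi: {src}\n"
        "Gujarati:<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


def translate(model, tokenizer, src, max_new_tokens=200, num_beams=4):
    """Generate a Gujarati translation with beam search."""
    inputs = tokenizer(build_prompt(src), return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, num_beams=num_beams
    )
    # Drop the prompt tokens so only the newly generated text is decoded.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

Ending the prompt at the assistant tag matters: the model was trained to emit the Gujarati text immediately after it, so any mismatch with the training template tends to degrade output.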
Key engineering observations
- Instruction formatting significantly stabilizes translation output.
- Gradient accumulation enables training on limited VRAM.
- Cosine LR scheduling improves convergence.
- Loss masking is critical to prevent prompt memorization.
Even with a small dataset, the model learns meaningful Hindi → Gujarati mappings.
Future work
- Parameter-efficient fine-tuning (LoRA / QLoRA)
- Larger Hindi-Gujarati parallel datasets
- BLEU / chrF automatic evaluation
- RL for translation refinement
- Deployment as a translation microservice
- Multilingual fine-tuning across several Indic pairs for cross-lingual transfer
Conclusion
By combining Qwen3.5-0.8B, AI4Bharat's multilingual data, and Hugging Face training infrastructure, it is possible to build targeted Indian-language translation systems on modest hardware, a real step toward regional-language accessibility in AI. Code: github.com/Vatsa10/Hindi2Guj-Qwen3.5.