WhisTLE Demystified: Text-Only Domain Adaptation for Speech Recognition
This article explores WhisTLE, a method for adapting pretrained speech recognition transformers using only text data. This approach eliminates the need for heavy text-to-speech (TTS) pipelines, offering a more practical solution for domain adaptation.
Key Advantages and Improvements
- Significant improvement over TTS-based adaptation (11.61% relative improvement in WER).
- Reproducible training protocol and code scaffold provided.
- Clearer metric definitions and reporting standards.
WhisTLE leverages a phoneme-based text representation for text-only domain adaptation, achieving a substantial improvement over TTS-based methods. The detailed training protocol, including epoch counts and batch sizes, ensures reproducibility and allows results to be verified.
Training Protocol
| Phase | Mode | Epochs | Batch Size |
|---|---|---|---|
| Main trunk training | trunk | 100 | 240 |
| Adaptation branch | adapt | 50 | 40 |
Reproducibility is paramount. A fixed random seed (e.g., 42) is used throughout the process, and environment details (Python, PyTorch, and CUDA versions) are documented so results can be verified consistently. The hardware setup typically uses 4 GPUs (e.g., NVIDIA V100 or A100) with synchronized gradient updates. A modular code scaffold with a clear folder structure (data, models, scripts, logs, reproducibility checklist) is also provided to ease implementation and extension.
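The protocol table and the fixed-seed setup can be captured in a small configuration sketch. This is illustrative only (the names `PhaseConfig`, `PROTOCOL`, and `set_seed` are not from the original work); in a real run you would also seed NumPy, PyTorch, and CUDA as the comment notes.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class PhaseConfig:
    """One row of the training protocol table above."""
    mode: str
    epochs: int
    batch_size: int

# Training protocol from the table; keys are the "Mode" column.
PROTOCOL = {
    "trunk": PhaseConfig(mode="trunk", epochs=100, batch_size=240),
    "adapt": PhaseConfig(mode="adapt", epochs=50, batch_size=40),
}

SEED = 42  # fixed seed used throughout, per the reproducibility notes

def set_seed(seed: int = SEED) -> None:
    """Seed Python's RNG; a full setup would also seed NumPy and torch,
    e.g. numpy.random.seed(seed) and torch.manual_seed(seed)."""
    random.seed(seed)
```

Freezing the dataclass prevents accidental mutation of the protocol mid-run, which keeps logged configs trustworthy.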
Dataset and Preprocessing
The research uses the LibriSpeech dataset in combination with domain-specific data (e.g., TED-LIUM 3, Switchboard). Preprocessing includes text normalization, tokenization with SentencePiece, grapheme-to-phoneme (g2p) conversion, phoneme embedding, and feature extraction (80-dim log-Mel spectrograms).
| Dataset | Split | Approx. Hours | Notes |
|---|---|---|---|
| LibriSpeech | train-clean-100 | 100 h | Small training subset |
| LibriSpeech | train-clean-360 | 360 h | Mid-size training subset |
| LibriSpeech | train-other-500 | 500 h | Largest LibriSpeech training subset |
| LibriSpeech | dev-clean | 5.4 h | Validation subset |
| LibriSpeech | dev-other | 5.3 h | Validation subset |
| LibriSpeech | test-clean | 5.4 h | Test subset |
| LibriSpeech | test-other | 5.1 h | Test subset |
| Domain data | train-domain | 150 h | Domain-adaptation sources |
| Domain data | dev-domain | 10 h | Hold-out domain validation |
| Domain data | test-domain | 15 h | Domain-specific evaluation |
Data augmentation techniques like SpecAugment are employed to enhance model robustness. A rigorous data cleaning process ensures high-quality data for training.
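SpecAugment-style masking on the 80-dim log-Mel features can be sketched as below. This is a minimal NumPy illustration, not the paper's implementation: the mask counts and widths are assumed hyperparameters, and masks here are at least one bin wide for simplicity.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_width=15,
                 num_time_masks=2, time_width=35, rng=None):
    """Zero out random frequency bands and time spans of a log-Mel
    spectrogram. spec has shape (num_mel_bins, num_frames)."""
    rng = rng or np.random.default_rng(0)
    out = spec.copy()
    n_mels, n_frames = out.shape
    for _ in range(num_freq_masks):
        w = int(rng.integers(1, freq_width + 1))      # band width in bins
        f0 = int(rng.integers(0, max(1, n_mels - w)))  # band start
        out[f0:f0 + w, :] = 0.0
    for _ in range(num_time_masks):
        w = int(rng.integers(1, time_width + 1))       # span width in frames
        t0 = int(rng.integers(0, max(1, n_frames - w)))
        out[:, t0:t0 + w] = 0.0
    return out
```

Masking is applied on a copy so the cached features stay intact across epochs, with fresh masks drawn each time.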
Model Architecture and Integration
WhisTLE integrates a phoneme-based text encoder in parallel with the acoustic encoder. The text encoder's output is fused into the decoder via cross-attention, so the decoder can draw on both acoustic and linguistic information. A text-embedding dimension between 512 and 768 is recommended. The loss function combines a standard sequence loss (cross-entropy or CTC) with a domain-adaptation regularization term, weighted by λ_domain, to encourage domain alignment.
Results and Metrics
The primary evaluation metric is Word Error Rate (WER), with Character Error Rate (CER) as a secondary metric. Relative Improvement (RI) is calculated as: RI = ((WER_baseline – WER_text_only) / WER_baseline) × 100%. The results show a significant relative improvement (RI) of 11.61% when using phoneme representation with the extra text encoder compared to TTS adaptation. Confidence intervals should be reported for WER and RI.
| Variant | WER (abs, %) | WER 95% CI | RI (%) | RI 95% CI |
|---|---|---|---|---|
| Baseline | 12.0 | 11.1 – 12.9 | — | — |
| Text-only | 9.5 | 9.0 – 10.0 | 20.8 | 15.0 – 26.0 |
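The metric definitions above are straightforward to compute; here is a minimal pure-Python sketch of word-level WER (via Levenshtein distance) and the RI formula as stated in the text.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / len(ref)

def relative_improvement(wer_baseline: float, wer_text_only: float) -> float:
    """RI = ((WER_baseline - WER_text_only) / WER_baseline) * 100%."""
    return (wer_baseline - wer_text_only) / wer_baseline * 100.0
```

Applied to the table's baseline (12.0%) and text-only (9.5%) WER, `relative_improvement` reproduces the 20.8% RI shown there.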
Reproducibility and Deployment
The article emphasizes reproducibility by providing detailed instructions, including Docker Compose and Conda environment files for easy setup and replication of experiments. Version control and logging are stressed to ensure transparency and repeatability.
Conclusion
WhisTLE offers a promising approach to text-only domain adaptation for speech recognition. The reproducible research methodology and significant performance improvement over traditional methods make it a valuable contribution to the field.