WhisTLE Demystified: Text-Only Domain Adaptation for Speech Recognition

This article explores WhisTLE, a method for adapting pretrained speech recognition transformers using only text data. This approach eliminates the need for heavy text-to-speech (TTS) pipelines, offering a more practical solution for domain adaptation.

Key Advantages and Improvements

  • Significant gains over TTS-based adaptation (11.61% relative WER improvement).
  • Reproducible training protocol and code scaffold provided.
  • Clearer metric definitions and reporting standards.

WhisTLE leverages phoneme representations for text-only domain adaptation, achieving a substantial improvement over TTS-based methods. The detailed training protocol, including epochs and batch sizes, makes the results reproducible and verifiable.

Training Protocol

| Phase | Mode | Epochs | Batch Size |
|---|---|---|---|
| Main trunk training | trunk | 100 | 240 |
| Adaptation branch | adapt | 50 | 40 |

Reproducibility is paramount. A fixed random seed (e.g., 42) is used throughout the process. Environment details (Python, PyTorch, CUDA versions etc.) are meticulously documented for consistent and verifiable results. The hardware setup typically uses 4 GPUs (e.g., NVIDIA V100 or A100) with synchronized gradient updates. A modular code scaffold with clear folder structure (data, models, scripts, logs, reproducibility checklist) is also provided to facilitate implementation and extendability.
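The seeding step described above can be sketched as follows (the function name and the torch-specific determinism flags are illustrative choices, not taken from the paper; the torch lines are guarded so the sketch also runs without PyTorch installed):

```python
import os
import random

import numpy as np


def set_seed(seed: int = 42) -> None:
    """Seed every RNG a training run typically touches."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch

        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Trade kernel-selection speed for deterministic cuDNN behavior.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass
```

Calling `set_seed(42)` at the top of every entry-point script, before any data loading or model construction, is the usual convention.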

Dataset and Preprocessing

The research utilizes LibriSpeech dataset in combination with domain-specific data (e.g. TED-LIUM 3, Switchboard). The preprocessing steps include text normalization, tokenization using SentencePiece, grapheme-to-phoneme (g2p) conversion, phoneme embedding, and feature extraction (80-dim log-Mel spectrograms).
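The text-normalization step can be sketched with a minimal example (the exact normalization rules, SentencePiece model, and g2p backend are not specified by the source, so this is an assumed, typical ASR-style pass):

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Typical ASR text normalization: NFKC-normalize, lowercase,
    drop everything except letters, apostrophes, and spaces, then
    collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^a-z' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```

The normalized text would then be fed to the SentencePiece tokenizer and the g2p converter in sequence.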

| Dataset | Split | Approx. Hours | Notes |
|---|---|---|---|
| LibriSpeech | train-clean-100 | 100 h | Small training subset |
| LibriSpeech | train-clean-360 | 360 h | Mid-size training subset |
| LibriSpeech | train-other-500 | 500 h | Largest LibriSpeech training subset |
| LibriSpeech | dev-clean | 5.4 h | Validation subset |
| LibriSpeech | dev-other | 5.3 h | Validation subset |
| LibriSpeech | test-clean | 5.4 h | Test subset |
| LibriSpeech | test-other | 5.1 h | Test subset |
| Domain data | train-domain | 150 h | Domain-adaptation sources |
| Domain data | dev-domain | 10 h | Held-out domain validation |
| Domain data | test-domain | 15 h | Domain-specific evaluation |

Data augmentation techniques like SpecAugment are employed to enhance model robustness. A rigorous data cleaning process ensures high-quality data for training.
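A minimal numpy sketch of the frequency- and time-masking variant of SpecAugment follows (the mask counts and maximum widths are illustrative defaults, not the settings used in the paper):

```python
import numpy as np


def spec_augment(spec, num_freq_masks=2, freq_width=10,
                 num_time_masks=2, time_width=20, rng=None):
    """Zero out random frequency bands and time spans of a
    (num_mels, num_frames) log-Mel spectrogram."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    n_mels, n_frames = out.shape
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, freq_width + 1))
        f0 = int(rng.integers(0, max(1, n_mels - w)))
        out[f0:f0 + w, :] = 0.0   # frequency mask
    for _ in range(num_time_masks):
        w = int(rng.integers(0, time_width + 1))
        t0 = int(rng.integers(0, max(1, n_frames - w)))
        out[:, t0:t0 + w] = 0.0   # time mask
    return out
```

In practice this is applied on the fly inside the data loader, so each epoch sees differently masked views of the same utterance.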

Model Architecture and Integration

WhisTLE integrates a phoneme-based text encoder in parallel with the acoustic encoder. The text encoder’s output is fused into the decoder via cross-attention, leveraging both acoustic and linguistic information to improve accuracy. The embedding dimension is recommended to be between 512 and 768. The loss function combines a standard sequence loss (cross-entropy or CTC) with a domain-adaptation regularization term, weighted by λ_domain, to encourage domain alignment.
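The combined objective can be illustrated with a small numpy sketch (the shapes and the L2 alignment term are assumptions for illustration; plain cross-entropy stands in for the sequence loss here):

```python
import numpy as np


def combined_loss(logits, targets, text_emb, audio_emb, lambda_domain=0.1):
    """Sequence cross-entropy plus a domain-alignment penalty.

    logits: (T, V) decoder outputs; targets: (T,) token ids.
    text_emb / audio_emb: (N, D) encoder embeddings; the regularizer
    (an assumed L2 term) pulls their mean vectors together.
    """
    # Numerically stable log-softmax cross-entropy.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    ce = -log_probs[np.arange(len(targets)), targets].mean()
    # Domain-alignment regularizer, scaled by lambda_domain.
    align = np.mean((text_emb.mean(axis=0) - audio_emb.mean(axis=0)) ** 2)
    return ce + lambda_domain * align
```

With λ_domain = 0 this reduces to the standard sequence loss, which gives a convenient ablation baseline.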

Results and Metrics

The primary evaluation metric is Word Error Rate (WER), with Character Error Rate (CER) as a secondary metric. Relative Improvement (RI) is calculated as RI = ((WER_baseline − WER_adapted) / WER_baseline) × 100%. Using phoneme representations with the extra text encoder yields a significant RI of 11.61% over TTS adaptation. Confidence intervals should be reported for both WER and RI.

| Variant | WER (abs, %) | WER 95% CI | RI (%) | RI 95% CI |
|---|---|---|---|---|
| Baseline | 12.0 | 11.1 – 12.9 | — | — |
| Text-only | 9.5 | 9.0 – 10.0 | 20.8 | 15.0 – 26.0 |
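The metrics above can be computed with a short sketch; plugging the table's baseline (12.0%) and text-only (9.5%) WER into the RI formula reproduces the 20.8% figure (the function names are illustrative):

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    r, h = ref.split(), hyp.split()
    # One-row dynamic-programming Levenshtein distance over words.
    d = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            cur = d[j]
            d[j] = min(cur + 1,                        # deletion
                       d[j - 1] + 1,                   # insertion
                       prev + (r[i - 1] != h[j - 1]))  # substitution
            prev = cur
    return d[-1] / max(1, len(r))


def relative_improvement(wer_baseline: float, wer_adapted: float) -> float:
    """RI = ((WER_baseline - WER_adapted) / WER_baseline) * 100."""
    return (wer_baseline - wer_adapted) / wer_baseline * 100.0


print(round(relative_improvement(12.0, 9.5), 1))  # -> 20.8
```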

Reproducibility and Deployment

The article emphasizes reproducibility by providing detailed instructions, including Docker Compose and Conda environment files for easy setup and replication of experiments. Version control and logging are stressed to ensure transparency and repeatability.

Conclusion

WhisTLE offers a promising approach to text-only domain adaptation for speech recognition. The reproducible research methodology and significant performance improvement over traditional methods make it a valuable contribution to the field.
