
Real-Time Open-Vocabulary Detection in Medical Imaging: The MedROV Approach Across Diverse Modalities

The field of medical imaging analysis is advancing rapidly, driven by the need for more accurate, efficient, and versatile detection systems. Traditional methods often rely on closed-set vocabularies, limiting their ability to identify novel or rare pathologies. This article introduces MedROV, a novel approach to real-time open-vocabulary detection that aims to overcome these limitations by enabling models to detect a wide range of clinically relevant findings across diverse imaging modalities without pre-defined labels. We delve into the details of its methodology, benchmarking strategies, and empirical validation.

MedROV Methodology in Detail: Architecture, Training Schedule, and Hyperparameters

The MedROV methodology is built upon a robust architecture and a carefully designed training pipeline. Key components include:

  • Architecture: Utilizes a ViT-B/16 backbone with 12 transformer blocks and 16×16 patches, augmented by a four-level Feature Pyramid Network (FPN) for multi-scale feature extraction.
  • Open-Vocabulary Detector Head: Features 1,024 learnable class tokens. A CLIP-style text encoder provides dynamic embeddings, which are then aligned with region features via cross-attention.
  • Pseudo-labeling Framework: Employs a teacher-student setup where unlabeled images are assigned pseudo-labels. A confidence ramp from 0.60 to 0.90 over 20,000 steps and consensus across augmentations are required for label validity.
  • Noise Handling and Robust Optimization: Incorporates a noise-aware loss function that discounts low-confidence labels. It also uses an EMA teacher momentum of 0.99 and region-wise score re-calibration with modality-specific patterns.
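The confidence ramp and EMA teacher update above can be sketched in a few lines. The source states only the ramp endpoints (0.60 to 0.90), its duration (20,000 steps), and the momentum (0.99); the linear ramp shape and the plain-dict parameter representation here are illustrative assumptions.

```python
def confidence_threshold(step: int, start: float = 0.60, end: float = 0.90,
                         ramp_steps: int = 20_000) -> float:
    """Pseudo-label confidence threshold, ramped from `start` to `end`
    over `ramp_steps` steps (linear shape assumed), then held constant."""
    frac = min(step / ramp_steps, 1.0)
    return start + frac * (end - start)

def ema_update(teacher: dict, student: dict, momentum: float = 0.99) -> dict:
    """EMA teacher update: teacher <- m * teacher + (1 - m) * student.
    Parameters are modeled as plain float dicts for illustration."""
    return {k: momentum * teacher[k] + (1.0 - momentum) * student[k]
            for k in teacher}
```

At step 10,000 the threshold sits midway through the ramp (0.75), so early training accepts more pseudo-labels while later training keeps only high-confidence ones.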

The training schedule involves pretraining on 1.2 million image-text pairs, followed by three stages: Stage 1 (warmup) for 3 epochs, Stage 2 (detector training) for 42 epochs, and Stage 3 (fine-tuning) for 12 epochs, totaling approximately 57 epochs. Optimization uses AdamW with a base learning rate of 1e-4, cosine decay with a 1,000-step linear warmup, weight decay of 0.01, gradient clipping at 1.0, and a balanced objective combining IoU loss and focal classification loss. Training runs on 8–16 GPUs with a per-GPU batch size of 4–8.
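The stated schedule (base learning rate 1e-4, 1,000-step linear warmup, cosine decay) can be sketched as a step-indexed function; `total_steps` is whatever the three stages sum to in a given run.

```python
import math

def lr_at_step(step: int, total_steps: int, base_lr: float = 1e-4,
               warmup_steps: int = 1_000) -> float:
    """Cosine learning-rate decay with a linear warmup, matching the
    schedule described above."""
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr over the first warmup_steps.
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

The rate peaks at exactly `base_lr` when warmup ends and decays smoothly to zero at the final step.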

The open-vocabulary strategy leverages 1,024 dynamic tokens, with prompts generated from modular anatomy/pathology vocabularies. Token embeddings are updated jointly with the detector. MedROV supports diverse modalities including X-ray, CT, MRI, and Ultrasound, utilizing modality adapters with lightweight normalization and small projection heads. Modality-specific intensity normalization (e.g., HU for CT) is also applied.
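Prompt generation from modular anatomy/pathology vocabularies can be sketched as a template cross-product. The tiny vocabularies and template strings below are illustrative placeholders, not MedROV's actual lexicons.

```python
from itertools import product

# Hypothetical mini-vocabularies; the real lexicons cover 1,800-2,000 findings.
ANATOMY = ["lung", "liver", "kidney"]
PATHOLOGY = ["nodule", "lesion", "calcification"]
MODALITY_TEMPLATES = {
    "xray": "an X-ray showing a {pathology} in the {anatomy}",
    "ct": "a CT scan showing a {pathology} in the {anatomy}",
}

def build_prompts(modality: str) -> list:
    """Compose modality-adjusted prompts from the modular vocabularies;
    each prompt would then be embedded by the text encoder."""
    template = MODALITY_TEMPLATES[modality]
    return [template.format(anatomy=a, pathology=p)
            for a, p in product(ANATOMY, PATHOLOGY)]
```

Because the vocabularies are modular, adding one anatomy term multiplies out into a prompt per pathology without retraining the detector head.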

In terms of inference efficiency, MedROV achieves near real-time performance, operating at 12–18 ms per image on an A100 GPU at 640×640 resolution, with memory requirements of approximately 6–8 GB for the detector and text store. p-norm smoothing is used for stable batch outputs.
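Per-image latency figures like the 12–18 ms quoted above are typically reported as a median over repeated runs after warm-up; a minimal timing harness (framework-agnostic, with `infer` standing in for the model's forward pass) might look like this.

```python
import statistics
import time

def measure_latency_ms(infer, n_warmup: int = 10, n_runs: int = 100) -> float:
    """Median wall-clock latency of `infer` in milliseconds.
    Warm-up runs are excluded so one-time setup costs do not skew the number."""
    for _ in range(n_warmup):
        infer()
    times = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        infer()
        times.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(times)
```

For GPU inference you would additionally synchronize the device before reading the clock, since kernel launches are asynchronous.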

Regarding data accessibility, the code and models are available at https://github.com/MedROV/medrov. Datasets can be found at https://medrov.org/datasets. Model cards and licenses are provided, with weights available on HuggingFace at https://huggingface.co/MedROV/medrov-base. Open-vocabulary approaches are data-driven, significantly reducing labeling costs and annotation bottlenecks compared to closed-vocabulary methods.

Planned ablations will investigate pseudo-label thresholds, vocabulary size, text encoder type, and modality adapters. Error analysis will focus on small lesions, cross-modality confusion, and anatomical variants. Ethical and safety guardrails are in place, with outputs designed to assist radiologists, not replace their expertise. Bias and fairness checks are conducted across modalities and populations, and results include statistical significance and confidence intervals.

Benchmarking Protocols Across Modalities: Datasets, Splits, Metrics, and Reproducibility

Dataset Composition and Modality Distribution

Understanding the data is crucial for evaluating any model. The MedROV dataset comprises:

| Modality   | Images |
|------------|--------|
| X-ray      | 60,000 |
| CT         | 40,000 |
| MRI        | 30,000 |
| Ultrasound | 20,000 |

The data is split as follows:

| Split      | Percent | Images  |
|------------|---------|---------|
| Train      | 70%     | 105,000 |
| Validation | 15%     | 22,500  |
| Test       | 15%     | 22,500  |

Annotations include approximately 2.1 million bounding boxes (average ~14 per image), targeting 1,800–2,000 clinically relevant findings across modalities. Due to the high cost of manual labeling, which often restricts datasets to a small predefined set of categories, MedROV’s open-vocabulary approach aims for broader concept coverage.

Evaluation Metrics and Protocols for Open vs Closed Vocabulary

Meaningful evaluation requires a clear and honest representation of model performance across various conditions. For open-vocabulary and closed-vocabulary detection, we recommend reporting the following:

  • Primary Metrics:
    • mAP at IoU thresholds 0.5 and 0.75: Measures average precision at different overlap levels.
    • mAP across IoU 0.5:0.95: A single score reflecting detection and localization accuracy over a spectrum of strictness.
    • Recall@K (K = 1, 5, 10) for top detections: Reports the fraction of true findings captured by the top-K predictions.
    • FROC-style sensitivity vs. false positives per image (FPPI): Plots sensitivity against FPPI for a clinical perspective on practicality.
  • Per-Modality Reporting: Quantify cross-domain generalization by reporting core metrics separately for X-ray, CT, MRI, and Ultrasound.
  • Open-Vocabulary Evaluation:
    • Zero-shot and few-shot performance: Evaluate on unseen or barely seen token sets.
    • Token-level recall and precision: Measure accuracy at the token level.
    • Dynamic vocabulary of 1,024 tokens: Use a fixed-size, re-weightable vocabulary for comparability.
    • Modality-adjusted prompts: Tailor prompts to imaging modalities.
  • Statistical Significance and Ablations:
    • 95% confidence intervals via bootstrap across images: Convey robustness of metrics.
    • Ablations with p-values: Report p-values for observed gains in variant comparisons.
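The overlap and recall metrics above reduce to a few standard computations; the sketch below shows box IoU (the matching criterion behind mAP@0.5/0.75 and the 0.5:0.95 sweep) and Recall@K, using (x1, y1, x2, y2) boxes.

```python
def iou(box_a, box_b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0

def recall_at_k(scored_boxes, gt_boxes, k: int, thr: float = 0.5) -> float:
    """Fraction of ground-truth boxes matched (IoU >= thr) by one of the
    top-k highest-scoring predictions, i.e. Recall@K."""
    top = [b for _, b in sorted(scored_boxes, key=lambda sb: -sb[0])[:k]]
    hit = sum(any(iou(g, p) >= thr for p in top) for g in gt_boxes)
    return hit / len(gt_boxes) if gt_boxes else 0.0
```

mAP then averages precision over the precision-recall curve built from these matches, per class and per IoU threshold.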

Transparent reporting of these metrics across modalities, with open-vocabulary tests and statistical significance, enables robust community judgment of model utility.
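The recommended bootstrap confidence intervals resample images with replacement; a percentile-bootstrap sketch over a list of per-image metric values:

```python
import random
import statistics

def bootstrap_ci(values, n_boot: int = 2000, alpha: float = 0.05,
                 seed: int = 0):
    """Percentile bootstrap (1 - alpha) CI for the mean of a per-image
    metric, resampling images with replacement."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

With alpha = 0.05 this yields the 95% interval; resampling at the image level (rather than the box level) respects the correlation between boxes in the same image.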

Open-Vocabulary vs Closed-Vocabulary Benchmarking: Experimental Design

Fair comparison necessitates identical data splits and a shared evaluation protocol. This ensures that differences in performance are attributable to the vocabulary flexibility, not procedural variations.

What we test

  • a) Open vocabulary increases detection of unseen findings.
  • b) Cross-modality alignment improves consistency across modalities.
  • c) Open-vocabulary models maintain comparable latency to closed baselines.

Experimental design highlights

  • Data splits: Use identical training, validation, and test splits for both open-vocabulary and closed-vocabulary runs.
  • Vocabulary definitions: Open vocabulary considers a broad set of words, while the baseline uses a predefined, limited label set.
  • Evaluation protocol: Employ the same metrics, decision thresholds, and evaluation procedures.
  • Unseen findings: Specifically measure performance on categories not part of the closed vocabulary.
  • Cross-modality alignment: Assess signal alignment across modalities.
  • Latency: Report inference time and resource usage on the same hardware.
  • Statistical reporting: Include confidence intervals or significance tests.

Design contrast at a glance

| Aspect | Open-vocabulary (MedROV) | Closed-vocabulary baseline |
|---|---|---|
| Vocabulary | No fixed label list; learns from a broad lexicon | Predefined, limited label set |
| Data splits | Identical to baseline | Identical to open-vocabulary |
| Evaluation protocol | Same metrics and thresholds as baseline | Same metrics and thresholds as open-vocabulary |
| Unseen findings | Measured explicitly | Not included in base vocabulary |
| Cross-modality alignment | Measured and compared | Measured with the same procedure |
| Latency | Comparable to baseline on equal hardware | Comparable to open-vocabulary on equal hardware |
| Labeling costs | Less dependent on predefined labels; potentially higher lexical coverage | Lower lexical coverage; easier labeling |

The rationale for open-vocabulary approaches is their data-driven nature: they can cover many concepts beyond a predefined vocabulary, which matters because labeling costs keep manually annotated category sets small.

Open-Vocabulary Performance and Reproducibility: Empirical Evidence, Code, and Access

| Aspect | MedROV (open-vocabulary) | Baseline (closed-vocabulary) |
|---|---|---|
| Modality coverage | All four modalities (X-ray, CT, MRI, Ultrasound) | Limited to X-ray and CT |
| Vocabulary size | 1,024 dynamic tokens | 200 fixed classes |
| Data requirements | Unlabeled data plus a small labeled set | Primarily labeled data |
| Inference latency | 12–18 ms per image | 10–15 ms per image (same hardware) |
| Evaluation metrics | Open-vocabulary Recall@K and mAP@IoU 0.5/0.75 | Closed-vocabulary mAP and Recall@K on the same test set |
| Accessibility | Code, datasets, and pretrained models linked; documentation includes experiment expositions, hyperparameter tables, and reproducibility scripts | Same documentation and artifact availability |

Ablation Studies, Error Analysis, and Practical Considerations

Ablation Study Plan: Pseudo-Labeling Thresholds, Noise Handling, and Vocabulary Size

Investigating the impact of design choices is key. This plan isolates four factors:

| Ablation | Focus | Key choices / parameters | Expected trends | Notes |
|---|---|---|---|---|
| A | Pseudo-label threshold | Thresholds: 0.55, 0.65, 0.75, 0.85, 0.90; evaluated on mAP@IoU = 0.5 across modalities | Higher thresholds tend to boost precision but reduce recall; a sweet spot around 0.65–0.75 is expected to balance the two with stable cross-modal performance | Calibrate for cross-modal consistency; monitor impact on false positives and label-noise propagation |
| B | Vocabulary size | Tokens: 256, 1,024, 4,096 | 1,024 tokens likely balance coverage and stability; 4,096 may yield diminishing returns at higher compute cost | Track model size, inference time, and open-vocabulary recall as the token count grows |
| C | Text encoder variant | CLIP-style ViT-based encoder vs. BERT-like encoder vs. no text encoder | CLIP-style encoders are expected to enable stronger cross-modal alignment and higher open-vocabulary recall; BERT-like encoders may help language grounding with different trade-offs; no text encoder serves as the baseline | Assess how textual grounding interacts with visual/multi-modal signals across domains |
| D | Modality adapters | With vs. without modality-specific adapters | Adapters should improve cross-domain consistency, especially for pairs like MRI and ultrasound, by tailoring feature spaces per modality | Evaluate whether adapters stabilize performance under domain shift and improve rare-modality recall |

Ablation A — Pseudo-label threshold: Controls pseudo-label quality and quantity. A threshold near 0.65–0.75 is expected to offer the best trade-off. Plotting precision-recall vs. threshold is advised.
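The precision-recall trade-off swept in Ablation A can be sketched with a toy scoring function; here `preds` are hypothetical (confidence, is_correct) pairs for candidate pseudo-labels and `total_positives` is the number of true findings, a stand-in for the real evaluation.

```python
def precision_recall_at_threshold(preds, total_positives: int, thr: float):
    """Precision and recall of pseudo-labels kept at confidence >= thr.
    Raising thr discards noisy labels (precision up) but also correct
    low-confidence ones (recall down)."""
    kept = [correct for conf, correct in preds if conf >= thr]
    tp = sum(kept)  # booleans sum as 0/1
    precision = tp / len(kept) if kept else 0.0
    recall = tp / total_positives if total_positives else 0.0
    return precision, recall
```

Sweeping `thr` over the ablation grid (0.55 to 0.90) and plotting the resulting pairs gives the precision-recall curve the text recommends.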

Ablation B — Vocabulary size: Affects model expressiveness and grounding. 1,024 tokens is considered a sweet spot, balancing coverage and stability. Monitor convergence speed and memory usage.

Ablation C — Text encoder variant: Shapes semantic signal input. CLIP-style encoders are predicted to yield stronger cross-modal alignment. Assess overall mAP and cross-modal recall.

Ablation D — Modality adapters: Aims to align heterogeneous data. Adapters should improve performance stability, especially for underrepresented domains. Analyze per-modality performance curves.

Error Analysis and Generalization Across Modalities

Detecting lesions across modalities presents challenges. Common failure modes include:

  • Very small lesions (< 5 mm), including those occluded or hidden by overlying structures.
  • Lesions with atypical shapes or rare presentations.
  • Cross-modality confusions (e.g., calcifications misread as lesions).

The analysis plan involves:

| Analysis component | What it measures | Why it matters |
|---|---|---|
| Per-modality error breakdown | Lesion size, location, and appearance by modality | Reveals modality-specific blind spots and data gaps |
| Top-10 confusion matrix | Most frequent misclassifications across modalities | Targets actionable improvements in annotations and model design |
| Qualitative examples with visualizations | Representative failure cases with overlays and heatmaps | Builds intuition and guides targeted fixes |

Expected insights: While open-vocabulary approaches should help detect rare findings, careful calibration is needed to balance generalization with precise, modality-aware thresholds, especially in noisy modalities like ultrasound. Mitigation strategies include targeted data augmentation, modality-specific adapters, and uncertainty-aware scoring.

Reproducibility Checklist: Code, Data, and Training Logs

Ensuring reproducibility is vital for scientific progress. The checklist covers four areas:

  • Code availability: Public repository with versioned releases, environment locking, and organized scripts for training, evaluation, and ablations.
  • Data governance: Clearly stated dataset access terms, licensing, de-identification notes, and curated data splits with metadata.
  • Experiment tracking: Deterministic seeds logged, consistent hyperparameters, and results reported as means with 95% confidence intervals across multiple runs.
  • Hardware and runtime: Documented hardware specs (GPU, memory, CPU, OS), typical wall-clock time per epoch, and instructions for reproducing latency measurements.
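The deterministic-seed item in the checklist reduces to a small helper; this sketch seeds Python's standard RNG and hash randomization, and in a full setup you would also seed NumPy and the deep-learning framework (e.g. torch) and enable its deterministic kernels.

```python
import os
import random

def set_deterministic_seed(seed: int = 42) -> None:
    """Fix the seeds this process controls directly.
    Framework-specific seeding (numpy, torch, etc.) belongs here too
    when those libraries are in use."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
```

Logging the seed value alongside each run's hyperparameters is what makes the reported mean-and-CI numbers reproducible.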

Thorough documentation in these areas makes work auditable, reusable, and accelerates scientific advancement.

Conclusion

MedROV represents a significant advancement in real-time open-vocabulary detection for medical imaging. By employing a sophisticated architecture, a robust pseudo-labeling framework, and carefully designed benchmarking protocols, it demonstrates strong performance across diverse modalities. The detailed methodology, ablation studies, and commitment to reproducibility pave the way for broader adoption and further research. This approach holds the potential to significantly reduce annotation costs and accelerate the development of more versatile and powerful AI tools for medical image analysis, ultimately aiding clinicians in diagnosis and patient care.
