Medic AI
Multimodal Medical Intelligence System with Explainable Diagnosis
Status: Active Development
Timeline: 4 months
Role: Full Stack ML Engineer
Problem Statement
Automated diagnosis has traditionally relied on a single modality (images or text), but clinical practice requires integrating several:
- Image-only diagnosis misses contextual patient information and symptom descriptions
- Text-only assessment lacks visual confirmation from medical imaging
- Black-box predictions are unacceptable in clinical settings where explainability is critical
- Confidence calibration is essential to flag uncertain predictions for clinician review
Target Users: Clinical decision support systems, telemedicine platforms, and medical AI research teams.
Research Motivation
Why Multimodal Fusion?
Medical diagnosis is inherently multimodal. Combining visual information (pathology, anatomy) with textual information (patient history, symptoms) mirrors clinical workflow and improves diagnostic accuracy.
Why Explainability?
Clinicians must understand a model's reasoning to make informed decisions. SHAP provides feature-level attributions showing which image regions and text snippets influenced each prediction.
System Architecture
Model Architecture
- Vision Encoder: ResNet-50 backbone pretrained on ImageNet; outputs a (2048,) feature vector
- Text Encoder: BERT fine-tuned for the medical domain; processes symptom descriptions and outputs (768,) embeddings
- Fusion Layer: Multi-head attention across modalities with cross-modal interaction
- Classification Head: 2-layer MLP outputting a diagnosis and a confidence score
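A minimal PyTorch sketch of how the fusion layer and classification head might fit together, assuming the encoders have already produced a 2048-d image vector and a 768-d text vector. The class name `CrossModalFusion`, the shared width of 256, the 4 attention heads, and the 5 output classes are illustrative assumptions, not the project's actual configuration.

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Illustrative fusion head; every size below is an assumption."""

    def __init__(self, img_dim=2048, txt_dim=768, d_model=256, n_heads=4, n_classes=5):
        super().__init__()
        # Project both modalities into a shared space so attention can mix them.
        self.img_proj = nn.Linear(img_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # 2-layer MLP classification head.
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, n_classes),
        )

    def forward(self, img_feat, txt_feat):
        # Treat each modality as one token: (batch, 2, d_model).
        tokens = torch.stack([self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=1)
        # Attention over the two tokens lets each modality attend to the other.
        fused, attn_weights = self.attn(tokens, tokens, tokens)
        logits = self.head(fused.mean(dim=1))
        # Raw softmax confidence; it is uncalibrated until a calibration step is applied.
        confidence, diagnosis = logits.softmax(dim=-1).max(dim=-1)
        return logits, diagnosis, confidence, attn_weights
```

Calling `CrossModalFusion()(torch.randn(8, 2048), torch.randn(8, 768))` returns logits of shape (8, 5) and head-averaged attention weights of shape (8, 2, 2), which give a rough view of how much each modality attends to the other.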
End-to-End Pipeline
- Input Processing: Image normalization + text tokenization
- Encoder Stage: Parallel encoding of image and text modalities
- Fusion Stage: Multi-modal fusion with attention weights
- Prediction Stage: Disease classification with confidence calibration
- Explanation Stage: SHAP values for feature-level interpretability
Technical Implementation
Backend Stack
- Framework: PyTorch
- Vision Backbone: ResNet-50
- Text Encoder: BERT (medical domain)
- Fusion: Multi-head cross-modal attention
- Interpretability: SHAP (GradientExplainer or DeepExplainer for the PyTorch model; TreeExplainer applies only to tree-based models)
- Serving: FastAPI with TorchServe
Frontend Stack
- Framework: React TypeScript
- Image Upload: Drag-and-drop interface
- Visualization: Plotly for SHAP explanations
- UI Library: TailwindCSS
- Real-time Updates: WebSocket streaming
- Deployment: Docker + AWS
Technical Challenges & Solutions
Challenge 1: Modality Imbalance
Problem: Image data often dominates multimodal fusion, reducing text influence.
Solution: Implemented modality-specific normalization with learnable fusion weights that automatically balance contributions during training.
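One possible reading of "modality-specific normalization with learnable fusion weights" is sketched below: each modality gets its own LayerNorm, and a softmax over two learnable logits sets the mixing ratio. The class name `BalancedFusion` and the embedding width of 256 are hypothetical, and the inputs are assumed to be already projected to a common width.

```python
import torch
import torch.nn as nn


class BalancedFusion(nn.Module):
    """Hypothetical gate: per-modality LayerNorm plus learnable mixing weights."""

    def __init__(self, d_model=256):
        super().__init__()
        # Separate normalization keeps one modality's scale from swamping the other.
        self.img_norm = nn.LayerNorm(d_model)
        self.txt_norm = nn.LayerNorm(d_model)
        # One logit per modality; zeros give an equal 0.5 / 0.5 split at initialization.
        self.mix_logits = nn.Parameter(torch.zeros(2))

    def forward(self, img_emb, txt_emb):
        # Softmax keeps the weights positive and summing to 1 throughout training.
        weights = self.mix_logits.softmax(dim=0)
        fused = weights[0] * self.img_norm(img_emb) + weights[1] * self.txt_norm(txt_emb)
        return fused, weights
```

Because `mix_logits` is an ordinary parameter, the optimizer adjusts the balance alongside the rest of the network, and the returned `weights` can be logged to check whether text is being ignored.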
Challenge 2: Confidence Calibration
Problem: Neural networks often produce overconfident predictions unsuitable for clinical deployment.
Solution: Applied temperature scaling and Platt scaling on the validation set, so that predicted confidence tracks observed accuracy more closely.
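Temperature scaling divides the logits by a single scalar T chosen to minimize negative log-likelihood on held-out data. The sketch below uses a simple grid search in plain Python to keep it dependency-free; the function names and the grid range are assumptions, and a real pipeline would more likely optimize T with a gradient-based optimizer. Platt scaling is not shown.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_norm for z in scaled]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-log_softmax(row, temperature)[y] for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on a grid that minimizes validation NLL."""
    grid = grid or [0.05 * k for k in range(1, 201)]  # roughly 0.05 to 10.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


# Toy validation set: always confident (margin 4), but right only 3 times out of 4.
val_logits = [[4.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]
T = fit_temperature(val_logits, val_labels)
```

On this toy set the fitted T comes out above 1, which softens the probabilities of an overconfident model without changing which class is predicted.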
Challenge 3: Explainability Overhead
Problem: SHAP computation is expensive; real-time explanations for clinical use were slow.
Solution: Cached SHAP values for common diseases and implemented approximate SHAP for real-time inference, bringing response latency under 100 ms.
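The two ideas can be illustrated with a sampling-based Shapley estimate wrapped in a memoizing cache. This is a generic sketch, not the project's implementation or the `shap` library's API: `toy_model` is a hypothetical stand-in for the network, and `sampled_shapley` and `cached_explanation` are illustrative names.

```python
import random
from functools import lru_cache


def sampled_shapley(predict, x, baseline, n_permutations=200, seed=0):
    """Monte Carlo Shapley estimate: average marginal contributions over random feature orders."""
    rng = random.Random(seed)
    n = len(x)
    phi = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        current = list(baseline)
        prev = predict(current)
        for i in order:
            current[i] = x[i]  # switch feature i from its baseline to its real value
            new = predict(current)
            phi[i] += new - prev
            prev = new
    return [p / n_permutations for p in phi]


def toy_model(features):
    """Stand-in for the real network: a fixed linear score (hypothetical)."""
    weights = (0.5, -1.0, 2.0)
    return sum(w * f for w, f in zip(weights, features))


@lru_cache(maxsize=1024)
def cached_explanation(x, baseline=(0.0, 0.0, 0.0)):
    """Memoize explanations so repeated inputs skip the sampling loop (x must be a tuple)."""
    return tuple(sampled_shapley(toy_model, x, baseline))
```

Fewer permutations trade attribution accuracy for latency, and the attributions always sum to the difference between the prediction at `x` and at the baseline. A production cache would need a key derived from the actual image and text inputs rather than a small tuple.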
Methodology
Evaluation Framework
The system was evaluated on four axes:
- Accuracy Metrics: Sensitivity, specificity, F1-score per disease class
- Calibration: ECE (Expected Calibration Error), Brier score
- Explainability: Explanation fidelity via perturbation analysis
- Multimodal Impact: Ablation study (image-only, text-only, multimodal)
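The two calibration metrics listed above can be computed as follows. This is a dependency-free sketch using equal-width confidence bins for ECE and the binary form of the Brier score; the function names and the default of 10 bins are assumptions rather than the project's evaluation code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece


def brier_score(probabilities, outcomes):
    """Binary Brier score: mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)
```

Here `confidences` holds the probability assigned to each predicted class and `correct` holds 1 where that prediction matched the label, 0 otherwise. Lower is better for both metrics.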
Dataset & Benchmarking
Trained on a medical imaging dataset with paired clinical notes (5,000+ samples). Validated on a held-out test set and compared against unimodal baselines.
Results & Impact
- Accuracy Improvement: +12% (multimodal vs. image-only)
- Calibration (ECE): 3.2% (well-calibrated predictions)
- Inference Speed: 85 ms (with explanations)
Key Findings
- Multimodal fusion consistently outperforms unimodal baselines across all disease classes
- Text modality provides critical context, especially for diagnostic edge cases (8% error reduction)
- Explainability via SHAP improves clinician confidence and model adoption
- Confidence calibration reduces false-positive predictions in high-stakes scenarios
Future Work
- Federated learning for privacy-preserving multi-hospital model training
- Real-time uncertainty quantification via Bayesian deep learning
- Integration with clinical workflows through FHIR-compliant EHR systems
- Extended multimodality support (time-series vital signs, genomic data)
- Domain adaptation to new diseases without retraining
- Fairness and bias auditing across patient demographics