🏥

Medic AI

Multimodal Medical Intelligence System with Explainable Diagnosis

Status: Active Development

Timeline: 4 months

Role: Full-Stack ML Engineer

Problem Statement

Automated diagnosis systems have traditionally relied on a single modality (images or text), but clinical diagnosis requires integrating both:

  • Image-only diagnosis misses contextual patient information and symptom descriptions
  • Text-only assessment lacks visual confirmation from medical imaging
  • Black-box predictions are unacceptable in clinical settings where explainability is critical
  • Confidence calibration is essential to flag uncertain predictions for clinician review

Target Users: Clinical decision support systems, telemedicine platforms, and medical AI research teams.

Research Motivation

Why Multimodal Fusion?

Medical diagnosis is inherently multimodal. Combining visual information (pathology, anatomy) with textual information (patient history, symptoms) mirrors clinical workflow and improves diagnostic accuracy.

Why Explainability?

Clinicians must understand model reasoning to make informed decisions. SHAP provides feature-level explanations that show which image regions and text snippets influenced each prediction.

System Architecture

Model Architecture

Vision Encoder: ResNet-50 backbone pretrained on ImageNet; outputs a (2048,) feature vector

Text Encoder: BERT fine-tuned on medical text; processes symptom descriptions and outputs (768,) embeddings

Fusion Layer: Multi-head attention across modalities with cross-modal interaction

Classification Head: 2-layer MLP outputting diagnosis + confidence score
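The fusion layer and classification head described above can be sketched in PyTorch as follows. This is a minimal illustration, not the project's actual code: it assumes the encoders have already produced the (2048,) image and (768,) text features, and the class name `FusionClassifier`, the shared width `d_model=256`, the head count, and `n_classes=5` are all hypothetical choices.

```python
import torch
import torch.nn as nn


class FusionClassifier(nn.Module):
    """Cross-modal attention fusion over precomputed image and text features."""

    def __init__(self, img_dim=2048, txt_dim=768, d_model=256, n_heads=4, n_classes=5):
        super().__init__()
        # Project both modalities into a shared space so attention can mix them.
        self.img_proj = nn.Linear(img_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # 2-layer MLP head producing class logits.
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, n_classes),
        )

    def forward(self, img_feat, txt_feat):
        # Treat each modality as one token: shape (batch, 2, d_model).
        tokens = torch.stack(
            [self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=1
        )
        # Each token attends to both tokens, so image and text interact.
        fused, attn_weights = self.attn(tokens, tokens, tokens)
        logits = self.head(fused.mean(dim=1))
        # Uncalibrated confidence: the top softmax probability.
        confidence, diagnosis = logits.softmax(dim=-1).max(dim=-1)
        return logits, diagnosis, confidence, attn_weights
```

Calling `FusionClassifier()(torch.randn(3, 2048), torch.randn(3, 768))` returns logits of shape (3, 5) and attention weights of shape (3, 2, 2), averaged over heads. The confidence returned here is the raw softmax maximum and would still need calibration before clinical use.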

End-to-End Pipeline

Input Processing: Image normalization + text tokenization

Encoder Stage: Parallel encoding of image and text modalities

Fusion Stage: Multi-modal fusion with attention weights

Prediction Stage: Disease classification with confidence calibration

Explanation Stage: SHAP values for feature-level interpretability

Technical Implementation

Backend Stack

  • Framework: PyTorch
  • Vision Backbone: ResNet-50
  • Text Encoder: BERT (medical domain)
  • Fusion: Multi-head cross-modal attention
  • Interpretability: SHAP (a deep-model explainer such as GradientExplainer, since TreeExplainer supports only tree-based models)
  • Serving: FastAPI with TorchServe

Frontend Stack

  • Framework: React with TypeScript
  • Image Upload: Drag-and-drop interface
  • Visualization: Plotly for SHAP explanations
  • UI Library: TailwindCSS
  • Real-time Updates: WebSocket streaming
  • Deployment: Docker + AWS

Technical Challenges & Solutions

Challenge 1: Modality Imbalance

Problem: Image data often dominates multimodal fusion, reducing text influence.

Solution: Implemented modality-specific normalization with learnable fusion weights that automatically balance contributions during training.
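One way to realize modality-specific normalization with learnable fusion weights is sketched below. It is an assumption about the design, not the project's code: each modality gets its own `LayerNorm`, and a softmax over two learnable logits produces weights that sum to 1. The name `BalancedFusion` is hypothetical, and both embeddings are assumed to be already projected to the same width.

```python
import torch
import torch.nn as nn


class BalancedFusion(nn.Module):
    """Weighted sum of per-modality normalized embeddings."""

    def __init__(self, d_model=256):
        super().__init__()
        # Separate norms put both modalities on a comparable scale,
        # so neither dominates purely because of larger activations.
        self.img_norm = nn.LayerNorm(d_model)
        self.txt_norm = nn.LayerNorm(d_model)
        # One learnable logit per modality, initialized to equal weighting.
        self.modality_logits = nn.Parameter(torch.zeros(2))

    def forward(self, img_emb, txt_emb):
        weights = self.modality_logits.softmax(dim=0)
        fused = weights[0] * self.img_norm(img_emb) + weights[1] * self.txt_norm(txt_emb)
        return fused, weights
```

Because the weights are ordinary parameters, the optimizer adjusts them during training. Logging `weights` over epochs gives a direct view of how much each modality contributes.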

Challenge 2: Confidence Calibration

Problem: Neural networks often produce overconfident predictions unsuitable for clinical deployment.

Solution: Applied temperature scaling and Platt scaling, both fitted on the validation set, so that predicted confidence tracks observed accuracy more closely.
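Temperature scaling divides the logits by a single scalar T chosen to minimize negative log-likelihood on the validation set. The sketch below uses only the standard library and covers temperature scaling alone, not Platt scaling. The function names, the search bounds, and the golden-section search are illustrative assumptions; the search is valid because the loss is convex in 1/T.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood, computed via a stable log-sum-exp."""
    total = 0.0
    for row, y in zip(logit_rows, labels):
        scaled = [z / temperature for z in row]
        m = max(scaled)
        log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
        total += log_norm - scaled[y]
    return total / len(labels)


def fit_temperature(logit_rows, labels, lo=0.05, hi=20.0, iters=60):
    """Golden-section search over beta = 1/T within [1/hi, 1/lo]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = 1.0 / hi, 1.0 / lo

    def loss(beta):
        return nll(logit_rows, labels, 1.0 / beta)

    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if loss(c) < loss(d):
            b = d
        else:
            a = c
    return 2.0 / (a + b)  # T = 1 / midpoint of the final beta interval
```

As a check on the idea: if a model outputs logits `[4.0, 0.0]` for four validation samples but is right on only three, the fitted temperature rescales the top probability from about 0.98 to 0.75, matching the observed accuracy.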

Challenge 3: Explainability Overhead

Problem: SHAP computation is expensive; real-time explanations for clinical use were slow.

Solution: Cached SHAP values for common diseases and used approximate SHAP for real-time inference, bringing response latency under 100 ms.
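The caching-plus-approximation idea can be illustrated with a sampling-based Shapley estimate wrapped in an in-memory cache. This is a stand-in technique, not the SHAP library and not necessarily the approximation used in the project. `toy_predict` is a hypothetical linear scorer in place of the real model, and the all-zeros baseline is likewise an assumption.

```python
import random
from functools import lru_cache


def sampled_shapley(predict, x, baseline, n_permutations=200, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled feature orderings."""
    rng = random.Random(seed)
    n = len(x)
    phi = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        current = list(baseline)
        prev = predict(current)
        for i in order:
            current[i] = x[i]  # switch feature i from baseline to its real value
            value = predict(current)
            phi[i] += value - prev
            prev = value
    return [p / n_permutations for p in phi]


def toy_predict(features):
    # Hypothetical stand-in for the model: a fixed linear score.
    return 2.0 * features[0] - 1.0 * features[1] + 0.5 * features[2]


@lru_cache(maxsize=1024)
def cached_explanation(x):
    # x must be hashable (a tuple) so repeated inputs are served from cache.
    return tuple(sampled_shapley(toy_predict, x, baseline=(0.0,) * len(x)))
```

The number of model calls scales with `n_permutations` times the feature count, so lowering `n_permutations` trades accuracy for latency. The cache only helps when identical inputs recur, which is why keying on a coarser identifier (for example a disease class) may be more practical than keying on raw features.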

Methodology

Evaluation Framework

Evaluated along four axes:

  • Accuracy Metrics: Sensitivity, specificity, F1-score per disease class
  • Calibration: ECE (Expected Calibration Error), Brier score
  • Explainability: Explanation fidelity via perturbation analysis
  • Multimodal Impact: Ablation study (image-only, text-only, multimodal)
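The calibration and accuracy metrics listed above have short standard definitions, sketched here with only the standard library. The function names and the 10-bin default are illustrative assumptions. The Brier score and sensitivity/specificity are shown in their binary form, which would be applied one-vs-rest per disease class.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, hit))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(h for _, h in members) / len(members)
            ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(outcomes)


def sensitivity_specificity(y_true, y_pred):
    """Binary labels; assumes at least one positive and one negative in y_true."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)
```

For example, four predictions made at 0.9 confidence with three correct give an ECE of about 0.15, since the single occupied bin has accuracy 0.75 against mean confidence 0.9.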

Dataset & Benchmarking

Trained on a medical imaging dataset with paired clinical notes (5,000+ samples). Validated on a held-out test set and compared against unimodal baselines.

Results & Impact

Accuracy Improvement: +12% (multimodal vs. image-only)

Calibration (ECE): 3.2% (well-calibrated predictions)

Inference Speed: 85 ms (with explanations)

Key Findings

  • Multimodal fusion consistently outperforms unimodal baselines across all disease classes
  • Text modality provides critical context, especially for diagnostic edge cases (8% error reduction)
  • Explainability via SHAP improves clinician confidence and model adoption
  • Confidence calibration reduces false-positive predictions in high-stakes scenarios

Future Work

Federated learning for privacy-preserving multi-hospital model training

Real-time uncertainty quantification via Bayesian deep learning

Integration with clinical workflows through FHIR-compliant EHR systems

Extended multimodality support (time-series vital signs, genomic data)

Domain adaptation to new diseases without retraining

Fairness and bias auditing across patient demographics

© 2026 Shridipa Dhar