Nature Scientific Data · CC-BY 4.0

Cataract-LMM: A Large-Scale Multi-Source Benchmark for Surgical AI Foundation Models

Mohammad Javad Ahmadi1, Iman Gandomi1, Parisa Abdi2, Seyed-Farzad Mohammadi2, Amirhossein Taslimi1, Mehdi Khodaparast2, Hassan Hashemi3, Mahdi Tavakoli4, Hamid D. Taghirad1

1 ARAS, K.N. Toosi University of Technology · 2 Farabi Eye Hospital, TUMS · 3 Noor Eye Hospital, TUMS · 4 University of Alberta
At a glance: 3,000 procedures · 6,094 segmented frames · 4 annotation tasks · 2 clinical centers · built-in domain-shift baseline

Bridging the Reality Gap

While computer-assisted surgery systems require massive annotated datasets to generalize, prior benchmarks are heavily constrained by single-center sourcing and limited procedural variance.

Cataract-LMM establishes the largest and most structurally diverse foundation to date: 3,000 phacoemulsification videos acquired prospectively from expert and novice surgeons, deeply annotated to link spatial kinematics with objective clinical skill.

By integrating multi-layer annotations—temporal workflow, pixel-wise semantic parsing, continuous tracking geometries, and expert-adjudicated performance scores—we furnish the data necessary to train universal, multi-task models.

Hardware Heterogeneity

Acquisitions from two distinct centers introduce a realistic domain shift for ML models (a preprocessing sketch follows the list):

  • S1
    Farabi Eye Hospital (n=2,930)
    Haag-Streit HS Hi-R NEO 900 · 720×480 @ 30fps
  • S2
    Noor Eye Hospital (n=70)
    ZEISS ARTEVO 800 · 1920×1080 @ 60fps
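
As a sketch of how the two sources might be harmonized before joint training (file paths, codec, and target settings below are assumptions, not the dataset's official pipeline), S2 footage can be resampled to S1's resolution and frame rate with OpenCV:

```python
import cv2

def harmonize(src_path: str, dst_path: str,
              out_size=(720, 480), out_fps=30.0):
    """Resample a video to a common resolution and frame rate.

    Hypothetical preprocessing: S2 clips (1920x1080 @ 60 fps) are
    downsampled to match S1 (720x480 @ 30 fps) so both centers can
    share one training pipeline.
    """
    cap = cv2.VideoCapture(src_path)
    in_fps = cap.get(cv2.CAP_PROP_FPS) or out_fps
    step = max(1, round(in_fps / out_fps))   # 60 -> 30 fps: keep every 2nd frame
    writer = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             out_fps, out_size)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            writer.write(cv2.resize(frame, out_size))
        idx += 1
    cap.release()
    writer.release()
```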

4 Annotation Tasks & Raw Data

Independently accessible yet intricately linked annotation modules.

Task 1
150 Videos

Phase Recognition

Frame-wise temporal boundaries for 13 surgical phases, capturing the stochastic workflow variations that long-range temporal models such as ASFormer are designed to handle.
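
As a minimal sketch of consuming these labels (the annotation file layout is an assumption; only the 13-phase ID convention is taken from the dataset description), frame-wise labels can be collapsed into phase segments:

```python
from itertools import groupby

def frames_to_segments(labels):
    """Collapse per-frame phase IDs (0..12) into (phase, start, end) segments."""
    segments, start = [], 0
    for phase, run in groupby(labels):
        length = sum(1 for _ in run)
        segments.append((phase, start, start + length - 1))
        start += length
    return segments

# e.g. frames_to_segments([0, 0, 1, 1, 1, 2]) -> [(0, 0, 1), (1, 2, 4), (2, 5, 5)]
```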

Task 2
6,094 Frames

Instance Segmentation

Pixel-level polygons for 12 classes (10 instruments and 2 anatomical structures), provided in both YOLO and COCO formats.
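
Because COCO is a standard format, the polygons can be decoded with pycocotools; the JSON filename below is a placeholder, not the release's actual file name:

```python
from pycocotools.coco import COCO

coco = COCO("cataract_lmm_segmentation.json")   # placeholder filename

img_id = coco.getImgIds()[0]
for ann in coco.loadAnns(coco.getAnnIds(imgIds=[img_id])):
    mask = coco.annToMask(ann)   # binary HxW mask decoded from the polygon
    name = coco.loadCats([ann["category_id"]])[0]["name"]
    print(name, mask.sum(), "foreground pixels")
```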

Task 3
170 Clips

Object Tracking

Spatiotemporal IDs and functional tip keypoints across 469,118 densely annotated frames, mapping instrument kinematics during capsulorhexis.
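
A minimal sketch of one kinematic quantity these keypoints support, total tip travel distance; the array layout and extraction step are assumptions, not the release schema:

```python
import numpy as np

def tip_path_length(tips: np.ndarray) -> float:
    """Total instrument-tip travel distance in pixels.

    `tips`: (N, 2) array of per-frame (x, y) tip keypoints.
    """
    return float(np.linalg.norm(np.diff(tips, axis=0), axis=1).sum())

# e.g. tip_path_length(np.array([[0, 0], [3, 4], [3, 10]])) -> 11.0
```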

Task 4
170 Ratings

Skill Assessment

Synchronized with Task 3. A 6-indicator rubric (GRASIS/ICO-OSCAR) rigorously adjudicated by expert surgeons.

Data Pool
SSL & VLP

Raw Unannotated Videos

The full 3,000-procedure corpus serves as a pretraining pool for self-supervised and zero-shot feature extraction.
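
A minimal PyTorch sketch of drawing unlabeled frames from the raw pool for self-supervised pretraining; the file list and per-video sampling density are assumptions, and a DINO-style pipeline would add its own augmentations on top:

```python
import random

import cv2
import torch
from torch.utils.data import Dataset

class RawFrameDataset(Dataset):
    """Single random frames from unannotated surgery videos (SSL sketch)."""

    def __init__(self, video_paths, frames_per_video=16):
        self.video_paths = video_paths        # hypothetical file list
        self.frames_per_video = frames_per_video

    def __len__(self):
        return len(self.video_paths) * self.frames_per_video

    def __getitem__(self, idx):
        cap = cv2.VideoCapture(self.video_paths[idx // self.frames_per_video])
        n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.set(cv2.CAP_PROP_POS_FRAMES, random.randrange(max(n, 1)))
        ok, frame = cap.read()
        cap.release()
        if not ok:
            raise RuntimeError("frame decode failed")
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        return torch.from_numpy(frame).permute(2, 0, 1).float() / 255.0
```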

Baseline Diagnostics

Technical validations computed across diverse deep learning architectures.

Phase Recognition (Video-Level)

Evaluating long-range temporal context modeling on in-domain and out-of-domain data. DINO features prove superior for clinical domain generalization.

Temporal Model | Backbone | In-Domain (F1) | Out-of-Domain (F1)
ASFormer | DINO | 79.98% | 67.87%
ASFormer | ResNet50 | 74.61% | 60.93%
MS-TCN (TeCNO) | DINO | 78.53% | 61.60%
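
A sketch of the evaluation behind the two F1 columns, assuming frame-wise macro-averaged F1 computed separately on the S1 (in-domain) and S2 (out-of-domain) test splits; the paper's exact averaging convention is not restated here:

```python
from sklearn.metrics import f1_score

def phase_macro_f1(y_true, y_pred):
    """Frame-wise macro F1 over the 13 phase IDs (0..12).

    Run once on the S1 test split and once on the S2 split to obtain
    the in-domain and out-of-domain columns (assumed convention).
    """
    return f1_score(y_true, y_pred, average="macro")

print(phase_macro_f1([0, 0, 1, 2, 2], [0, 1, 1, 2, 2]))  # toy labels
```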

Instance Segmentation

Benchmarking supervised architectures against zero-shot foundation models. Fine-tuned, domain-specific models significantly outperform general-purpose foundation variants.

Model | Architecture Type | mAP (Overall) | Tissue mAP
YOLOv11-L | Supervised | 73.9 | 83.4
Mask R-CNN | Supervised | 53.7 | 92.9
SAM-ViT-H | Zero-Shot | 56.0 | 63.1
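
Mask mAP of this kind is conventionally computed with the pycocotools evaluator; both filenames below are placeholders:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

gt = COCO("cataract_lmm_segmentation.json")   # placeholder ground-truth file
dt = gt.loadRes("predictions.json")           # detections in COCO results format

ev = COCOeval(gt, dt, iouType="segm")         # mask-level mAP, as in the table
ev.evaluate()
ev.accumulate()
ev.summarize()                                # prints AP@[.50:.95] first
```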

Spatiotemporal Tracking

Capsulorhexis instrument tracking evaluation. Siamese-network trackers eliminate the high failure rates of traditional OpenCV trackers.

Tracker | Failure Rate | Precision | IoU
SiamBAN | 0.00% | 61.35% | 74.65%
GradNet | 0.00% | 58.20% | 31.60%
KCF (Traditional) | 81.30% | 6.40% | 7.10%
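
A sketch of the per-clip summary metrics, assuming a frame counts as a failure when its IoU drops to zero; the benchmark's exact failure and precision definitions are not restated here:

```python
import numpy as np

def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def tracking_summary(gt_boxes, pred_boxes, fail_thresh=0.0):
    """Mean IoU and failure rate over one clip (assumed failure criterion)."""
    ious = np.array([box_iou(g, p) for g, p in zip(gt_boxes, pred_boxes)])
    return ious.mean(), (ious <= fail_thresh).mean()
```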

Objective Skill Assessment

Binary classification (expert vs. novice) using the continuous GRASIS/ICO-OSCAR scores. Transformer models effectively capture spatio-temporal indicators of motion economy.

Model | Precision | Recall | Macro F1
TimeSformer | 86.00% | 82.00% | 83.90%
R3D-18 | 82.35% | 84.85% | 83.58%
CNN-LSTM | 70.97% | 66.67% | 68.75%

Stat fact: Instrument travel distance drops by ~5,800 px per rubric rating point.
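
The stat fact corresponds to the slope of a linear fit of per-clip travel distance against rubric score; a toy sketch with invented numbers chosen only to illustrate the computation:

```python
import numpy as np

# Hypothetical per-clip values, NOT dataset measurements: rubric scores
# and total tip travel distances (px) for five rated clips.
ratings = np.array([2, 3, 3, 4, 5], dtype=float)
travel_px = np.array([41000, 35500, 34800, 29300, 23400], dtype=float)

slope, intercept = np.polyfit(ratings, travel_px, deg=1)
print(f"travel distance changes by {slope:.0f} px per rating point")
# -> roughly -5,900 px/point for these toy numbers
```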