Mohammad Javad Ahmadi1, Iman Gandomi1, Parisa Abdi2, Seyed-Farzad Mohammadi2,
Amirhossein Taslimi1, Mehdi Khodaparast2, Hassan Hashemi3, Mahdi Tavakoli4, Hamid D. Taghirad1
While computer-assisted surgery systems require massive annotated datasets to generalize, prior benchmarks are heavily constrained by single-center sourcing and limited procedural variance.
Cataract-LMM establishes the largest, most structurally diverse foundation to date: 3,000 phacoemulsification videos acquired prospectively from expert and novice surgeons, deeply annotated to link spatial kinematics with objective clinical skill.
By integrating multi-layer annotations—temporal workflow, pixel-wise semantic parsing, continuous tracking geometries, and expert-adjudicated performance scores—we furnish the data necessary to train universal, multi-task models.
Acquisition from two distinct centers deliberately introduces domain shift for ML models, enabling out-of-domain evaluation. The dataset comprises independently accessible, intricately linked annotation modules:
- Phase recognition (Task 1): frame-wise temporal boundaries for 13 surgical phases, capturing stochastic workflow variations natively handled by long-range temporal models such as ASFormer.
- Instance segmentation (Task 2): pixel-level polygons for 12 classes (10 instruments and 2 anatomical structures), provided in both YOLO and COCO formats; a loading sketch follows this list.
- Instrument tracking (Task 3): spatiotemporal instance IDs and functional tip keypoints across 469,118 densely annotated frames, mapping kinematic dynamics during capsulorhexis.
- Skill assessment (Task 4): synchronized with Task 3; a 6-indicator rubric (GRASIS/ICO-OSCAR) rigorously adjudicated by expert surgeons.
- Pretraining corpus: the full 3,000-procedure collection serves as a knowledge base for self-supervised and zero-shot feature extraction.
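As a point of reference, here is a minimal sketch of reading the COCO-format segmentation module with pycocotools; the annotation file path is a hypothetical placeholder, not the dataset's documented layout.

```python
# Minimal sketch: loading the COCO-format segmentation annotations.
# The annotation path below is an assumption for illustration; consult the
# dataset documentation for the actual directory layout.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_train.json")  # hypothetical path

# 12 classes: 10 instruments + 2 anatomical structures
cat_ids = coco.getCatIds()
print([c["name"] for c in coco.loadCats(cat_ids)])

# Gather all polygon annotations for one frame and rasterize them
img_id = coco.getImgIds()[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
for ann in anns:
    mask = coco.annToMask(ann)  # binary HxW mask from the polygon
    print(ann["category_id"], int(mask.sum()), "foreground pixels")
```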
Technical validation was performed across diverse deep learning architectures.
Phase recognition: evaluating long-range temporal context processing on in-domain and out-of-domain data. DINO features prove superior for clinical domain generalization; a feature-extraction sketch follows the table.
| Temporal Model | Backbone | In-Domain (F1) | Out-of-Domain (F1) |
|---|---|---|---|
| ASFormer | DINO | 79.98% | 67.87% |
| ASFormer | ResNet50 | 74.61% | 60.93% |
| MS-TCN (TeCNO) | DINO | 78.53% | 61.60% |
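To illustrate the frame-feature pipeline behind these results, here is a minimal sketch of extracting per-frame DINO (ViT-S/16) embeddings via torch.hub for a downstream temporal model such as ASFormer. The preprocessing uses standard ImageNet statistics, which is an assumption rather than the paper's exact recipe.

```python
# Sketch: per-frame DINO features as input to a temporal model (e.g. ASFormer).
import torch
import torchvision.transforms as T

# Official DINO ViT-S/16 weights from the public torch.hub entry point
model = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
model.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_features(frames):
    """frames: list of PIL images -> (T, 384) CLS embeddings for ViT-S/16."""
    batch = torch.stack([preprocess(f) for f in frames])
    return model(batch)
```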
Instance segmentation: benchmarking supervised architectures against zero-shot foundation models. Fine-tuned, domain-specific models significantly outperform general-purpose foundation variants; a training and inference sketch follows the table.
| Model Architecture | Type | Overall mAP (%) | Tissue mAP (%) |
|---|---|---|---|
| YOLOv11-L | Supervised | 73.9 | 83.4 |
| Mask R-CNN | Supervised | 53.7 | 92.9 |
| SAM-ViT-H | Zero-Shot | 56.0 | 63.1 |
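A minimal sketch of the supervised baseline workflow with the Ultralytics API follows, assuming a hypothetical `cataract_lmm_seg.yaml` data config; the hyperparameters are illustrative, not the paper's training recipe.

```python
from ultralytics import YOLO

# YOLOv11-L segmentation weights (Ultralytics names them "yolo11l-seg.pt")
model = YOLO("yolo11l-seg.pt")

# Fine-tune on the 12-class instrument/tissue task (YAML path is hypothetical)
model.train(data="cataract_lmm_seg.yaml", epochs=100, imgsz=640)

# Run inference on a surgical frame and inspect the predicted instance masks
results = model("frame_000123.png")  # hypothetical frame path
for r in results:
    if r.masks is not None:
        print(r.masks.data.shape)  # (num_instances, H, W) binary masks
```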
Instrument tracking: capsulorhexis instrument tracking evaluation. Deep Siamese trackers eliminate the high failure rates seen in traditional OpenCV approaches; a metric-computation sketch follows the table.
| Tracker Protocol | Failure Rate | Precision | IoU |
|---|---|---|---|
| SiamBAN | 0.00% | 61.35% | 74.65% |
| GradNet | 0.00% | 58.20% | 31.60% |
| KCF (Traditional) | 81.30% | 6.40% | 7.10% |
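For concreteness, here is a sketch of how the table's metrics can be computed from per-frame predicted and ground-truth boxes. The failure criterion (zero overlap, i.e., a lost target) and the 20-pixel precision threshold follow common single-object-tracking practice and are assumptions, not necessarily the paper's exact protocol.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def center_dist(a, b):
    """Euclidean distance between box centers."""
    ca = ((a[0] + a[2]) / 2, (a[1] + a[3]) / 2)
    cb = ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)
    return ((ca[0] - cb[0]) ** 2 + (ca[1] - cb[1]) ** 2) ** 0.5

def tracking_metrics(pred, gt, fail_iou=0.0, prec_px=20):
    ious = np.array([box_iou(p, g) for p, g in zip(pred, gt)])
    dists = np.array([center_dist(p, g) for p, g in zip(pred, gt)])
    return {
        "mean_iou": ious.mean(),                    # IoU column
        "precision": (dists <= prec_px).mean(),     # OTB-style precision
        "failure_rate": (ious <= fail_iou).mean(),  # fraction of lost frames
    }
```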
Skill assessment: binary classification (expert vs. novice) grounded in the continuous GRASIS/ICO-OSCAR scores. Video transformers effectively capture the spatio-temporal indicators of motion economy; a baseline sketch follows the table.
| Model | Precision | Recall | Macro F1 |
|---|---|---|---|
| TimeSformer | 86.00% | 82.00% | 83.90% |
| R3D-18 | 82.35% | 84.85% | 83.58% |
| CNN-LSTM | 70.97% | 66.67% | 68.75% |
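As one concrete baseline, here is a sketch of adapting torchvision's Kinetics-pretrained R3D-18 to the binary task; the clip shape and label convention are assumptions for illustration, and the paper's training recipe may differ.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights

# Kinetics-400 pretrained backbone with a fresh two-class head
model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
model.fc = nn.Linear(model.fc.in_features, 2)  # expert vs. novice

# Dummy clips: (batch, channels, frames, height, width); shape is illustrative
clips = torch.randn(4, 3, 16, 112, 112)
logits = model(clips)
pred = logits.argmax(dim=1)  # 0 = novice, 1 = expert (label convention assumed)
```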