Mohammad Javad Ahmadi1, Iman Gandomi1, Parisa Abdi2, Seyed-Farzad Mohammadi2,
Amirhossein Taslimi1, Mehdi Khodaparast2, Hassan Hashemi3, Mahdi Tavakoli4, Hamid D. Taghirad1
While computer-assisted surgery systems require massive annotated datasets to generalize, prior benchmarks are heavily constrained by single-center sourcing and limited procedural variance.
Cataract-LMM establishes the largest, most structurally diverse foundation to date: 3,000 phacoemulsification videos acquired prospectively from expert and novice surgeons, deeply annotated to link spatial kinematics with objective clinical skill.
By integrating multi-layer annotations—temporal workflow, pixel-wise semantic parsing, continuous tracking geometries, and expert-adjudicated performance scores—we furnish the data necessary to train universal, multi-task models.
Acquisition from two distinct centers deliberately introduces domain shift for ML models, enabling out-of-domain evaluation. The dataset comprises independently accessible, intricately linked annotation modules:
- Phase recognition (Task 1): frame-wise temporal boundaries for 13 surgical phases, capturing stochastic workflow variations natively handled by long-range temporal models such as ASFormer.
- Instance segmentation (Task 2): pixel-level polygons for 12 classes (10 instruments and 2 anatomical structures), provided in both YOLO and COCO formats; a loading sketch follows this list.
- Instrument tracking (Task 3): spatiotemporal instance IDs and functional tip keypoints across 469,118 densely annotated frames, mapping kinematic dynamics during capsulorhexis.
- Skill assessment (Task 4): synchronized with Task 3; a 6-indicator rubric (GRASIS/ICO-OSCAR) rigorously adjudicated by expert surgeons.
- Pretraining corpus: the full 3,000-procedure collection serves as a knowledge base for self-supervised and zero-shot feature extraction.
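As a point of reference, here is a minimal sketch of reading the COCO-format segmentation module with pycocotools; the annotation file path is a hypothetical placeholder, not the dataset's documented layout.

```python
# Minimal sketch: loading the COCO-format segmentation annotations.
# The annotation path below is an assumption for illustration; consult the
# dataset documentation for the actual directory layout.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_train.json")  # hypothetical path

# 12 classes: 10 instruments + 2 anatomical structures
cat_ids = coco.getCatIds()
print([c["name"] for c in coco.loadCats(cat_ids)])

# Gather all polygon annotations for one frame and rasterize them
img_id = coco.getImgIds()[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
for ann in anns:
    mask = coco.annToMask(ann)  # binary HxW mask from the polygon
    print(ann["category_id"], int(mask.sum()), "foreground pixels")
```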
Technical validation was performed across diverse deep learning architectures.
Phase recognition: evaluating long-range temporal context processing on in-domain and out-of-domain data. DINO features prove superior for clinical domain generalization; a feature-extraction sketch follows the table.
| Temporal Model | Backbone | In-Domain (F1) | Out-of-Domain (F1) |
|---|---|---|---|
| ASFormer | DINO | 79.98% | 67.87% |
| ASFormer | ResNet50 | 74.61% | 60.93% |
| MS-TCN (TeCNO) | DINO | 78.53% | 61.60% |
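To illustrate the frame-feature pipeline behind these results, here is a minimal sketch of extracting per-frame DINO (ViT-S/16) embeddings via torch.hub for a downstream temporal model such as ASFormer. The preprocessing uses standard ImageNet statistics, which is an assumption rather than the paper's exact recipe.

```python
# Sketch: per-frame DINO features as input to a temporal model (e.g. ASFormer).
import torch
import torchvision.transforms as T

# Official DINO ViT-S/16 weights from the public torch.hub entry point
model = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
model.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_features(frames):
    """frames: list of PIL images -> (T, 384) CLS embeddings for ViT-S/16."""
    batch = torch.stack([preprocess(f) for f in frames])
    return model(batch)
```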
Instance segmentation: benchmarking supervised architectures against zero-shot foundation models. Fine-tuned, domain-specific models significantly outperform general-purpose foundation variants; a training and inference sketch follows the table.
| Model Architecture | Type | Overall mAP (%) | Tissue mAP (%) |
|---|---|---|---|
| YOLOv11-L | Supervised | 73.9 | 83.4 |
| Mask R-CNN | Supervised | 53.7 | 92.9 |
| SAM-ViT-H | Zero-Shot | 56.0 | 63.1 |
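A minimal sketch of the supervised baseline workflow with the Ultralytics API follows, assuming a hypothetical `cataract_lmm_seg.yaml` data config; the hyperparameters are illustrative, not the paper's training recipe.

```python
from ultralytics import YOLO

# YOLOv11-L segmentation weights (Ultralytics names them "yolo11l-seg.pt")
model = YOLO("yolo11l-seg.pt")

# Fine-tune on the 12-class instrument/tissue task (YAML path is hypothetical)
model.train(data="cataract_lmm_seg.yaml", epochs=100, imgsz=640)

# Run inference on a surgical frame and inspect the predicted instance masks
results = model("frame_000123.png")  # hypothetical frame path
for r in results:
    if r.masks is not None:
        print(r.masks.data.shape)  # (num_instances, H, W) binary masks
```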
Instrument tracking: capsulorhexis instrument tracking evaluation. Deep Siamese trackers eliminate the high failure rates seen in traditional OpenCV approaches; a metric-computation sketch follows the table.
| Tracker Protocol | Failure Rate | Precision | IoU |
|---|---|---|---|
| SiamBAN | 0.00% | 61.35% | 74.65% |
| GradNet | 0.00% | 58.20% | 31.60% |
| KCF (Traditional) | 81.30% | 6.40% | 7.10% |
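For concreteness, here is a sketch of how the table's metrics can be computed from per-frame predicted and ground-truth boxes. The failure criterion (zero overlap, i.e., a lost target) and the 20-pixel precision threshold follow common single-object-tracking practice and are assumptions, not necessarily the paper's exact protocol.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def center_dist(a, b):
    """Euclidean distance between box centers."""
    ca = ((a[0] + a[2]) / 2, (a[1] + a[3]) / 2)
    cb = ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)
    return ((ca[0] - cb[0]) ** 2 + (ca[1] - cb[1]) ** 2) ** 0.5

def tracking_metrics(pred, gt, fail_iou=0.0, prec_px=20):
    ious = np.array([box_iou(p, g) for p, g in zip(pred, gt)])
    dists = np.array([center_dist(p, g) for p, g in zip(pred, gt)])
    return {
        "mean_iou": ious.mean(),                    # IoU column
        "precision": (dists <= prec_px).mean(),     # OTB-style precision
        "failure_rate": (ious <= fail_iou).mean(),  # fraction of lost frames
    }
```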
Skill assessment: binary classification (expert vs. novice) grounded in the continuous GRASIS/ICO-OSCAR scores. Video transformers effectively capture the spatio-temporal indicators of motion economy; a baseline sketch follows the table.
| Model | Precision | Recall | Macro F1 |
|---|---|---|---|
| TimeSformer | 86.00% | 82.00% | 83.90% |
| R3D-18 | 82.35% | 84.85% | 83.58% |
| CNN-LSTM | 70.97% | 66.67% | 68.75% |
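As one concrete baseline, here is a sketch of adapting torchvision's Kinetics-pretrained R3D-18 to the binary task; the clip shape and label convention are assumptions for illustration, and the paper's training recipe may differ.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights

# Kinetics-400 pretrained backbone with a fresh two-class head
model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
model.fc = nn.Linear(model.fc.in_features, 2)  # expert vs. novice

# Dummy clips: (batch, channels, frames, height, width); shape is illustrative
clips = torch.randn(4, 3, 16, 112, 112)
logits = model(clips)
pred = logits.argmax(dim=1)  # 0 = novice, 1 = expert (label convention assumed)
```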