LLM Safety and Benchmarking

Overview

Multimodal Large Language Models (LLMs) and Agentic AI are emerging as powerful computational tools capable of transforming clinical workflows and patient care. Our laboratory’s primary goal is to explore and validate the safest, most effective methodologies for integrating these AI-driven solutions into modern medical practice. A key element of our work involves Retrieval-Augmented Generation (RAG), a technique designed to ground LLM outputs in reliable and up-to-date medical literature. By ensuring that each response is rooted in authoritative sources, we minimize the risk of misinformation, bias, and hallucinations, thereby fostering greater confidence among clinicians and patients alike.

Focus

To realize the full potential of Agentic AI, RAG, model fine-tuning, embeddings, and knowledge graphs in diverse clinical settings, our team actively develops and tests a range of advanced techniques. These efforts revolve around optimizing model performance while upholding the highest standards of transparency and ethical responsibility. We subject our systems to extensive validation protocols across medical specialties, including surgery, oncology, and internal medicine, ensuring that our frameworks are versatile and generalizable to distinct domains of patient care. This comprehensive approach helps us pinpoint system vulnerabilities (technical, ethical, or otherwise) and refine our protocols to safeguard the welfare of both practitioners and the public.

Testing

Security is at the forefront of our research agenda. Through structured red-teaming exercises, we simulate adversarial conditions that reveal potential threats, such as data breaches or malicious inputs intended to exploit model weaknesses. These scenarios allow us to anticipate real-world challenges, harden our systems against potential attacks, and reinforce protective measures before the systems are deployed in clinical environments. Our work also incorporates strong principles of algorithmic fairness and responsible data stewardship, reducing the likelihood of biases that could undermine patient trust or compromise clinical decision-making.

Looking beyond immediate performance metrics, we uphold a shared commitment to collaboration and transparency. By releasing detailed benchmarking results, we encourage the broader healthcare and AI communities to adopt best practices, refine existing techniques, and ensure consistent safety standards in LLM deployment. In doing so, we help create an ecosystem where physicians can depend on these tools as reliable complements to their expertise, ultimately enhancing diagnostic accuracy, streamlining workflows, and improving patient outcomes. Our vision is a future in which AI technologies integrate seamlessly into clinical practice, setting new standards for quality, safety, and innovation in global healthcare research.

Representative Studies:

Pressman SM, Borna S, Gomez-Cabello CA, Haider SA, Haider C, Forte AJ. AI and Ethics: A Systematic Review of the Ethical Considerations of Large Language Model Use in Surgery Research. Healthcare. 2024;12(8):825. doi:10.3390/healthcare12080825

Yu Y, Gomez-Cabello CA, Makarova S, Parte Y, Borna S, Haider SA, Genovese A, Prabha S, Forte AJ. Using Large Language Models to Retrieve Critical Data from Clinical Processes and Business Rules. Bioengineering (Basel). 2024 Dec 28;12(1):17. doi: 10.3390/bioengineering12010017.

Trabilsy M, Prabha S, Gomez-Cabello CA, Haider SA, Genovese A, Borna S, Wood N, Gopala N, Tao C, Forte AJ. The PIEE Cycle: A Structured Framework for Red Teaming Large Language Models in Clinical Decision-Making. Bioengineering. 2025; 12(7):706. https://doi.org/10.3390/bioengineering12070706

Haider SA, Prabha S, Gomez-Cabello CA, Borna S, Genovese A, Trabilsy M, Collaco BG, Wood NG, Bagaria S, Tao C, Forte AJ. Synthetic Patient–Physician Conversations Simulated by Large Language Models: A Multi-Dimensional Evaluation. Sensors. 2025; 25(14):4305. https://doi.org/10.3390/s25144305

Yu Y, Gomez-Cabello CA, Haider SA, Genovese A, Prabha S, Trabilsy M, Collaco BG, Wood NG, Bagaria S, Tao C, Forte AJ. Enhancing Clinician Trust in AI Diagnostics: A Dynamic Framework for Confidence Calibration and Transparency. Diagnostics (Basel). 2025 Aug 30;15(17):2204. doi: 10.3390/diagnostics15172204.