Abstract
This study presents a locally deployed large language model (LLM) pipeline for automated feedback in pediatric simulation-based medical education. The system generates structured feedback reports based on the Liverpool Undergraduate Communication Assessment Scale (LUCAS). Four communication simulations were processed, and the generated reports were compared with human feedback. The model's total scores (mean = 13.9 ± 1.2) were comparable to human ratings (mean = 12.8 ± 2.9). Scenario-level analyses showed variation in some cases but overall reliable performance across simulation runs. The pipeline's scores varied less (SD = 0.7-1.7) than those of human examiners (SD = 1.3-2.9), indicating greater internal consistency. These findings will inform a forthcoming prospective evaluation.