Establishment and performance validation of a diagnostic model for endometrial cancer based on machine learning algorithms
Abstract

Objective: To construct a diagnostic model for endometrial cancer (EC) using machine learning algorithms and to evaluate and validate its performance.

Methods: Case-control study. A total of 465 patients with EC and 182 contemporaneous non-EC patients treated from 2022 to 2023 in the Department of Oncology, Henan Provincial People's Hospital, and the Department of Gynecology, First Affiliated Hospital of Zhengzhou University, were collected retrospectively. Using a random number table, they were divided into a training set (363 cases: 272 EC patients and 91 non-EC patients) and an internal validation set (284 cases: 193 EC patients and 91 non-EC patients). A further 188 EC patients and 93 non-EC patients seen in the same departments from January 2024 to February 2025 formed the external validation set. Clinical characteristics (age, body mass index, etc.), imaging characteristics (abnormal endometrial echo), and laboratory indicators (tumor markers, etc.) were collected as predictive variables. The least absolute shrinkage and selection operator (LASSO) was used to screen features for the EC diagnostic model. Three machine learning algorithms, logistic regression (LR), random forest (RF), and support vector machine (SVM), were used to build EC diagnostic models. The optimal algorithm was selected with reference to specificity or sensitivity, and internal and external validation were performed separately. Receiver operating characteristic (ROC) curves and calibration curves were plotted to evaluate the diagnostic performance of the three models.

Results: The mean age was (41.4±13.2) years in the training set, (41.1±13.4) years in the internal validation set, and (41.0±13.1) years in the external validation set. In the training, internal validation, and external validation sets, the levels of transcription factor-like protein 5 (TCFL5), carcinoembryonic antigen (CEA), carbohydrate antigen 153 (CA153), carbohydrate antigen 72-4 (CA72-4), neutrophil-to-lymphocyte ratio (NLR), the adipokine apelin, and human epididymis protein 4 (HE4), as well as the proportions of patients with diabetes and with abnormal endometrial echo, were all higher in the EC group than in the non-EC group (all P<0.001). Shapley additive explanations (SHAP) analysis showed that the three most important features in the LR model were, in order, age, abnormal endometrial echo, and TCFL5. In internal validation, the areas under the curve (AUC) (95% CI) of the LR, RF, and SVM models were 0.890 (0.849-0.923), 0.885 (0.843-0.919), and 0.845 (0.799-0.884), respectively, and the Brier score of the LR model was lower than those of the RF and SVM models (both P<0.05). In external validation, the AUC (95% CI) of the LR model was 0.932 (0.894-0.961), higher than that of the RF model [0.918 (0.877-0.949)] and the SVM model [0.887 (0.841-0.924)] (both P<0.05).

Conclusions: The EC diagnostic model built on multidimensional variables including age, diabetes, abnormal endometrial echo, TCFL5, CEA, CA153, CA72-4, NLR, apelin, and HE4 shows good diagnostic performance, with the LR algorithm performing best. It can provide a scientific reference for the early clinical identification and diagnosis of EC.
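The two performance measures reported in the abstract, AUC and the Brier score, can be illustrated with a minimal sketch. It is not the authors' code, and the abstract does not say which software was used. The function names `roc_auc` and `brier_score` and the eight-case toy data set are hypothetical, chosen only for illustration. AUC is computed here as the fraction of (EC, non-EC) pairs in which the EC case receives the higher predicted probability, and the Brier score as the mean squared difference between predicted probability and observed outcome (lower is better).

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def brier_score(labels: Sequence[int], probabilities: Sequence[float]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(labels) != len(probabilities):
        raise ValueError("labels and probabilities must have the same length")
    if not labels:
        raise ValueError("at least one case is required")
    return sum((p - y) ** 2 for y, p in zip(labels, probabilities)) / len(labels)


# Hypothetical toy data (NOT from the study): 1 = EC, 0 = non-EC,
# paired with a model's predicted probability of EC.
toy_labels = [1, 1, 1, 0, 0, 1, 0, 0]
toy_probs = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.1]

toy_auc = roc_auc(toy_labels, toy_probs)
toy_brier = brier_score(toy_labels, toy_probs)

print(f"AUC = {toy_auc:.4f}")            # 15 of 16 pairs ranked correctly: 0.9375
print(f"Brier score = {toy_brier:.3f}")  # 0.125
```

The pairwise loop is quadratic in the number of cases, which is acceptable for cohorts of a few hundred patients. The confidence intervals and P values quoted in the abstract require additional procedures that this sketch does not cover.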