Chinese Journal of Epidemiology

Progress in method development and application of distributed learning for estimation of epidemiological effect

Abstract

Objective: To systematically summarize methodological progress in, and current applications of, distributed learning for the estimation of epidemiological effects, and to provide a methodological reference for multi-center studies.

Methods: We systematically searched English-language literature published up to December 31, 2023, using two groups of keywords: "health/medical big data" and "distributed learning/federated learning". After multiple rounds of expert consultation, we defined the inclusion and exclusion criteria and a data extraction framework. The extracted data covered basic information, study design and methodology, and application and evaluation. Two researchers screened the papers and extracted information in parallel and independently. EndNote 20 was used for literature management and EpiData for data management.

Results: The search retrieved 3 444 papers, of which 29 were included in the final analysis after screening. Of these, 25 (86.2%) were published in 2019 or later, and most came from the United States (21/29, 72.4%). For the estimation of epidemiological effects, 22 distributed learning methods have been developed, covering logistic regression (8), Cox regression (8), Poisson regression (2), and the generalized linear mixed model (GLMM) (4), along with three distributed analysis platforms (VLP, Vantage6, AusCAT). The 29 papers described 45 application cases, of which 20 (44.4%) primarily aimed to build prediction models and 25 (55.6%) focused on association analysis. Notably, except for the GLMM methods, existing distributed learning methods obtained effect estimates with small bias within 1-3 rounds of communication, and their bias was smaller than that of meta-analysis, particularly in the presence of data heterogeneity and rare outcomes. However, few studies have considered the influence of heterogeneity in data structure or sparse-data bias, which warrants further investigation.

Conclusion: Distributed learning performs well in the estimation of epidemiological effects but remains at an early stage of development. Future research should address key problems such as the handling of data heterogeneity and low communication efficiency.
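To make the idea of estimating a regression effect without pooling patient-level data concrete, the following is a minimal sketch of a distributed logistic regression in which each site shares only the gradient and Hessian of its local log-likelihood, and a coordinator performs one Newton update per communication round. It is an illustration under stated assumptions, not a reproduction of any method covered by the review: the simulated data, the site sizes, the zero starting value, and the function names (`simulate_site`, `local_summaries`, `distributed_logistic`) are all hypothetical.

```python
import numpy as np


def simulate_site(n, beta, rng):
    """Simulate one site's data (hypothetical): intercept, two covariates, binary outcome."""
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    prob = 1.0 / (1.0 + np.exp(-X @ beta))
    y = rng.binomial(1, prob)
    return X, y


def local_summaries(X, y, beta):
    """Run inside a site: gradient and Hessian of the local log-likelihood at beta.

    Only these aggregate quantities leave the site, never individual rows.
    """
    prob = 1.0 / (1.0 + np.exp(-X @ beta))
    grad = X.T @ (y - prob)
    weights = prob * (1.0 - prob)
    hess = -(X * weights[:, None]).T @ X
    return grad, hess


def distributed_logistic(sites, rounds=3):
    """Coordinator: one Newton update per communication round, starting from zero."""
    beta = np.zeros(sites[0][0].shape[1])
    for _ in range(rounds):
        summaries = [local_summaries(X, y, beta) for X, y in sites]
        grad = sum(g for g, _ in summaries)
        hess = sum(h for _, h in summaries)
        beta = beta - np.linalg.solve(hess, grad)
    return beta


rng = np.random.default_rng(0)
true_beta = np.array([-1.0, 0.5, -0.3])  # assumed log odds ratios for the simulation
sites = [simulate_site(n, true_beta, rng) for n in (400, 900, 1700)]

beta_3_rounds = distributed_logistic(sites, rounds=3)
beta_converged = distributed_logistic(sites, rounds=8)

# Reference: the same Newton iteration on the pooled data (a single "site").
X_all = np.vstack([X for X, _ in sites])
y_all = np.concatenate([y for _, y in sites])
beta_pooled = distributed_logistic([(X_all, y_all)], rounds=25)

print("3 rounds :", beta_3_rounds)
print("converged:", beta_converged)
print("pooled   :", beta_pooled)
```

Because the log-likelihood is a sum over individuals, the site-level gradients and Hessians add up to the pooled ones, so this iteration converges to the pooled-data estimate. The number of rounds is a free parameter here; the sketch makes no claim about how the reviewed methods reduce communication.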