Preliminary study on screening characteristic genes of lung adenocarcinoma based on machine learning
Abstract
Objective: To identify key genes of lung adenocarcinoma in public databases using bioinformatics and machine learning algorithms, to explore biomarkers for the diagnosis and treatment of lung adenocarcinoma, and to identify potential targets for anti-cancer therapy.
Methods: Lung adenocarcinoma gene expression profiles (GSE116959, GSE118370, GSE19188) were obtained from the Gene Expression Omnibus (GEO) database. GSE116959 and GSE118370 were merged after batch-effect removal with ComBat in R to form the training set, and differentially expressed genes (DEGs) were identified. The least absolute shrinkage and selection operator (LASSO) and support vector machine recursive feature elimination (SVM-RFE) algorithms were each applied to the training set, and the genes selected by both algorithms were taken as lung adenocarcinoma feature genes. The selected feature genes were validated in an independent validation set using box plots and receiver operating characteristic (ROC) curves, and immune cell infiltration analysis was performed on them.
Results: LASSO selected 8 feature genes and SVM-RFE selected 6; 4 genes were selected by both algorithms (BTNL9, LOC400568, RTKN2 and SGCG). Their predictive abilities for the disease were 0.945 (95% confidence interval [CI]: 0.885-0.994), 0.892 (95% CI: 0.819-0.952), 0.919 (95% CI: 0.854-0.971) and 0.939 (95% CI: 0.888-0.981), respectively, indicating good predictive ability for all four genes. Immune cell infiltration analysis showed that the feature genes were correlated with immune cells, which further supports the accuracy of feature gene selection by machine learning.
Conclusion: The feature genes selected by machine learning may provide a basis for studying the mechanisms of lung adenocarcinoma development and for identifying tumor markers, and may facilitate targeted therapy for lung adenocarcinoma.
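The two numerical steps named in the abstract, intersecting the LASSO and SVM-RFE selections and scoring a single gene by ROC analysis with a 95% CI, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the study used R, whereas this sketch uses only the Python standard library. The gene names `GENE_A` to `GENE_D`, the synthetic expression values, and the choice of a percentile bootstrap for the interval are all assumptions made for the example; the abstract does not state how its intervals were computed.

```python
import random


def intersect_features(lasso_genes, svm_rfe_genes):
    """Return the genes selected by both methods, sorted by name."""
    return sorted(set(lasso_genes) & set(svm_rfe_genes))


def auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a
    randomly chosen case (label 1) scores higher than a randomly chosen
    control (label 0); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (an assumed method)."""
    rng = random.Random(seed)
    idx = range(len(scores))
    stats = []
    while len(stats) < n_boot:
        sample = [rng.choice(idx) for _ in idx]
        ys = [labels[i] for i in sample]
        # Skip resamples that contain only one class.
        if 0 < sum(ys) < len(ys):
            stats.append(auc([scores[i] for i in sample], ys))
    stats.sort()
    low = stats[int((alpha / 2) * n_boot)]
    high = stats[int((1 - alpha / 2) * n_boot) - 1]
    return low, high


# Hypothetical gene lists; the abstract reports only the overlap.
shared = intersect_features(["GENE_A", "GENE_B", "GENE_C"],
                            ["GENE_B", "GENE_C", "GENE_D"])

# Synthetic expression values for one gene: 30 tumors, 30 controls.
rng = random.Random(42)
labels = [1] * 30 + [0] * 30
scores = ([rng.gauss(1.5, 1.0) for _ in range(30)]
          + [rng.gauss(0.0, 1.0) for _ in range(30)])
point = auc(scores, labels)
low, high = bootstrap_ci(scores, labels)
print(shared, round(point, 3), (round(low, 3), round(high, 3)))
```

For a gene whose expression is lower in tumors than in controls, the scores would be negated before calling `auc` so that a larger value still indicates disease.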