| Purpose: To construct and validate a risk prediction model for precancerous diseases of the upper gastrointestinal tract, and to evaluate its usefulness for pre-screening risk identification and as an aid to screening.
Methods: The study population was drawn from 12,216 residents of the Taihang Mountain region who participated in the 2024 upper gastrointestinal cancer screening program. The presence of precancerous disease of the upper gastrointestinal tract was the dependent variable, and multidimensional predictors covering demographic characteristics, dietary habits, and disease history were integrated. Several machine learning algorithms were used to select variables and build predictive models. Discrimination was assessed with the area under the receiver operating characteristic curve (AUC), calibration with calibration curves, and clinical utility with decision curve analysis. The SHapley Additive exPlanations (SHAP) method was used to rank feature importance and interpret the final model.
Results: In the multi-model comparison, the XGBoost model had the best overall performance, with an AUC of 0.89 on the test set, significantly better than the traditional statistical methods and the other tree-based models. The key predictors were abnormal gastrointestinal symptoms, history of ulcers or perforations, family history, weight changes, comorbidities, and dietary behaviors (overnight eating, rapid eating, and eating hot food). SHAP analysis showed that abnormal symptoms and a history of ulcers contributed the most to the model's predictions. All variables are easy to collect, which favors use of the model in grassroots screening settings.
Conclusion: XGBoost performed best among the machine learning algorithms compared and can effectively identify individuals with precancerous diseases of the upper gastrointestinal tract. It is suitable as a tool for pre-screening and risk stratification in high-risk areas. Its strong performance and interpretability offer a feasible way to optimize the allocation of endoscopic resources and improve the efficiency of early intervention. External validation in multicenter, prospective cohort studies is still needed. |
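
The two evaluation quantities named in the Methods, the AUC and the net benefit plotted in a decision curve, can be illustrated with a minimal sketch. This is not the study's code: the function names (`roc_auc`, `net_benefit`, `net_benefit_treat_all`) and the four-person toy data are assumptions made purely for illustration, and the study's actual predictions are not reproduced here. The sketch uses the standard definitions: AUC as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen non-case, and net benefit at threshold probability p_t as TP/n - (FP/n) * p_t / (1 - p_t).

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (case, non-case) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def net_benefit(y_true, y_score, threshold):
    """Net benefit of referring everyone whose predicted risk is >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy in a decision curve: refer every participant."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical toy data: 1 = precancerous disease present, 0 = absent.
y_true = [0, 0, 1, 1]
y_score = [0.10, 0.40, 0.35, 0.80]  # illustrative predicted risks

auc = roc_auc(y_true, y_score)
nb_model = net_benefit(y_true, y_score, threshold=0.5)
nb_all = net_benefit_treat_all(y_true, threshold=0.5)
print(auc, nb_model, nb_all)
```

A decision curve repeats the net benefit calculation over a range of thresholds and compares the model with the "refer everyone" and "refer no one" (net benefit 0) strategies; the model is clinically useful at thresholds where its curve lies above both.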