
Recursive Random Forests Enable Better Predictive Performance and Model Interpretation than Variable Selection by LASSO

Document type: Foreign-language journal article

Authors: Zhu, Xiang-Wei 1; Xin, Yan-Jun 1; Ge, Hui-Lin 3

Author affiliations: 1. Qingdao Agr Univ, Coll Resource & Environm, Qingdao 266109, Peoples R China

2. Qingdao Agr Univ, Qingdao Engn Res Ctr Rural Environm, Qingdao 266109, Peoples R China

3. Chinese Acad Trop Agr Sci, Hainan Prov Key Lab Qual & Safety Trop Fruits & V, Anal & Testing Ctr, Haikou 571101, Hainan, Peoples R China

Journal: JOURNAL OF CHEMICAL INFORMATION AND MODELING (impact factor: 4.956; five-year impact factor: 5.39)

ISSN:

Year, volume, issue:

Pages:

Indexed in: SCI

Abstract: Variable selection is of crucial significance in QSAR modeling since it increases the model's predictive ability and reduces noise. The selection of the right variables is far more complicated than the development of predictive models. In this study, eight continuous and categorical data sets were employed to explore the applicability of two distinct variable selection methods: random forests (RF) and least absolute shrinkage and selection operator (LASSO). Variable selection was performed (1) by using recursive random forests to rule out a quarter of the least important descriptors at each iteration and (2) by using LASSO modeling with 10-fold inner cross-validation to tune its penalty λ for each data set. Along with regular statistical parameters of model performance, we proposed the pairwise correlation rate, average pairwise Pearson's correlation coefficient, and Tanimoto coefficient to evaluate the optimal variables selected by RF and LASSO in an extensive way. Results showed that variable selection could allow a tremendous reduction of noisy descriptors (at most 96% with the RF method in this study) and apparently enhance the model's predictive performance as well. Furthermore, random forests showed the property of gathering important predictors without restricting their pairwise correlation, which is contrary to LASSO. The mutual exclusion of highly correlated variables in LASSO modeling tends to skip important variables that are highly related to response endpoints and thus undermines the model's predictive performance. The optimal variables selected by RF share low similarity with those selected by LASSO (e.g., the Tanimoto coefficients were smaller than 0.20 in seven out of eight data sets). We found that the differences between RF and LASSO predictive performances mainly resulted from the variables selected by the different strategies rather than from the learning algorithms. Our study showed that the right selection of variables is more important than the learning algorithm for modeling. We hope that a standard procedure could be developed based on these proposed statistical metrics to select the truly important variables for model interpretation, as well as for further use to facilitate drug discovery and environmental toxicity assessment.
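The elimination schedule and two of the comparison metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and makes several assumptions: the Tanimoto coefficient is taken as the set form |A ∩ B| / |A ∪ B| over selected variable names, the average pairwise correlation is taken as the mean absolute Pearson coefficient over all descriptor pairs, and `importance_fn` is a hypothetical callback standing in for whatever random-forest importance measure is used. All function names are illustrative.

```python
from itertools import combinations
from math import sqrt


def tanimoto(selected_a, selected_b):
    """Set-based Tanimoto similarity of two selected-variable lists (assumed form)."""
    a, b = set(selected_a), set(selected_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / sqrt(var_x * var_y)


def average_pairwise_correlation(columns):
    """Mean |r| over all pairs of descriptor columns (dict: name -> values).

    Using the absolute value is an assumption, not taken from the abstract.
    """
    pairs = list(combinations(columns, 2))
    return sum(abs(pearson(columns[p], columns[q])) for p, q in pairs) / len(pairs)


def recursive_elimination(names, importance_fn, min_vars=1):
    """Drop the least important quarter of the variables at each iteration.

    importance_fn(names) -> {name: score} is a hypothetical callback; in
    practice it might refit a random forest on the remaining descriptors.
    Returns every variable subset visited, largest first.
    """
    current = list(names)
    history = [current]
    while len(current) > min_vars:
        scores = importance_fn(current)
        ranked = sorted(current, key=lambda v: scores[v], reverse=True)
        n_drop = max(1, len(ranked) // 4)  # a quarter, but at least one
        current = ranked[:max(min_vars, len(ranked) - n_drop)]
        history.append(current)
    return history


if __name__ == "__main__":
    names = [f"v{i}" for i in range(8)]
    fixed_scores = {name: i for i, name in enumerate(names)}  # toy importances
    subsets = recursive_elimination(names, lambda kept: fixed_scores)
    print([len(s) for s in subsets])
    print(tanimoto(["a", "b", "c"], ["b", "c", "d"]))
```

One would then presumably pick, from the subsets visited, the one with the best validated model performance, and compare it with the LASSO-selected set using `tanimoto`; that selection step is left out here because the abstract does not specify it.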
