Recursive Random Forests Enable Better Predictive Performance and Model Interpretation than Variable Selection by LASSO
Document type: Foreign-language journal article
Authors: Zhu, Xiang-Wei 1 ; Xin, Yan-Jun 1 ; Ge, Hui-Lin 3
Author affiliations: 1.Qingdao Agr Univ, Coll Resource & Environm, Qingdao 266109, Peoples R China
2.Qingdao Agr Univ, Qingdao Engn Res Ctr Rural Environm, Qingdao 266109, Peoples R China
3.Chinese Acad Trop Agr Sci, Hainan Prov Key Lab Qual & Safety Trop Fruits & V, Anal & Testing Ctr, Haikou 571101, Hainan, Peoples R China
Journal: JOURNAL OF CHEMICAL INFORMATION AND MODELING (Impact factor: 4.956; 5-year impact factor: 5.39)
ISSN:
Year, volume, issue:
Pages:
Indexed in: SCI
Abstract: Variable selection is of crucial significance in QSAR modeling since it increases the model's predictive ability and reduces noise. The selection of the right variables is far more complicated than the development of predictive models. In this study, eight continuous and categorical data sets were employed to explore the applicability of two distinct variable selection methods: random forests (RF) and the least absolute shrinkage and selection operator (LASSO). Variable selection was performed (1) by using recursive random forests to rule out a quarter of the least important descriptors at each iteration and (2) by using LASSO modeling with 10-fold inner cross-validation to tune its penalty parameter λ for each data set. Along with regular statistical parameters of model performance, we proposed the pairwise correlation rate, the average pairwise Pearson's correlation coefficient, and the Tanimoto coefficient to evaluate the optimal variables selected by RF and LASSO in an extensive way. Results showed that variable selection could allow a tremendous reduction of noisy descriptors (at most 96% with the RF method in this study) and clearly enhance the model's predictive performance as well. Furthermore, random forests showed the property of gathering important predictors without restricting their pairwise correlation, which is contrary to LASSO. The mutual exclusion of highly correlated variables in LASSO modeling tends to skip important variables that are highly related to response endpoints and thus undermines the model's predictive performance. The optimal variables selected by RF share low similarity with those selected by LASSO (e.g., the Tanimoto coefficients were smaller than 0.20 in seven out of eight data sets). We found that the differences between RF and LASSO predictive performances mainly resulted from the variables selected by the different strategies rather than from the learning algorithms.
Our study showed that the right selection of variables is more important than the learning algorithm for modeling. We hope that a standard procedure could be developed based on these proposed statistical metrics to select the truly important variables for model interpretation, as well as for further use to facilitate drug discovery and environmental toxicity assessment.
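Two of the metrics named in the abstract can be illustrated with a short Python sketch. This is not the authors' code: it assumes the Tanimoto coefficient is computed on the two sets of selected variable names (shared variables divided by the size of their union) and that the average pairwise Pearson's correlation coefficient is the plain mean of r over all unordered pairs of selected descriptors. The record does not define the pairwise correlation rate, so it is left out. The descriptor names and function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def tanimoto(selected_a, selected_b):
    """Tanimoto similarity between two collections of selected variable names."""
    a, b = set(selected_a), set(selected_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty selections are treated as dissimilar
    return len(a & b) / len(union)


def pearson(x, y):
    """Pearson's r for two equal-length numeric sequences (non-constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / sqrt(var_x * var_y)


def average_pairwise_pearson(columns):
    """Mean Pearson's r over all unordered pairs of descriptor columns.

    `columns` maps a descriptor name to its list of values; at least two
    descriptors are required. Taking the signed mean (not the absolute
    value) is an assumption of this sketch.
    """
    pairs = list(combinations(columns.values(), 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Hypothetical selections from the two methods.
rf_selected = ["logP", "MW", "TPSA", "HBD"]
lasso_selected = ["logP", "nRotB"]
similarity = tanimoto(rf_selected, lasso_selected)  # 1 shared of 5 distinct
```

In this toy case one of five distinct variables is shared, so the similarity is 0.2, which sits at the 0.20 threshold quoted in the abstract.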
- Related literature
Other papers by the authors
-
Predicting synergistic toxicity of heavy metals and ionic liquids on photobacterium Q67
Authors: Ge, Hui-Lin; Liu, Shu-Shen; Ge, Hui-Lin; Su, Bing-Xia; Qin, Li-Tang
Keywords: Synergism; Concentration addition; Independent action; ICIM model; Uniform design
-
Mixture cytotoxicity assessment of ionic liquids and heavy metals in MCF-7 cells using mixtox
Authors: Zhu, Xiang-Wei; Cao, Yu-Bin; Ge, Hui-Lin
Keywords: Mixture toxicity; Concentration addition; Independent action; Uniform design; MCF-7
-
Two-Stage Prediction of the Effects of Imidazolium and Pyridinium Ionic Liquid Mixtures on Luciferase
Authors: Ge, Hui-Lin; Su, Bing-Xia; Ge, Hui-Lin; Liu, Shu-Shen; Zhu, Xiang-Wei
Keywords: ionic liquids; luciferase; molecular level; joint toxicity; concentration addition; independent action; two-stage prediction; NOEC; mixture risk assessment