PlantLncBoost: key features for plant lncRNA identification and significant improvement in accuracy and generalization

Document type: Foreign-language journal article

Authors: Tian, Xue-Chan 1; Nie, Shuai 2; Domingues, Douglas 4; Rossi Paschoal, Alexandre 5; Jiang, Li-Bo 1; Mao, Jian-Feng 2

Author affiliations: 1.Shandong Univ Technol, Sch Life Sci & Med, Zibo 255000, Shandong, Peoples R China

2.Beijing Forestry Univ, State Key Lab Tree Genet & Breeding, Natl Engn Res Ctr Tree Breeding & Ecol Restorat, Natl Engn Lab Tree Breeding,Key Lab Genet & Breedi, Beijing 100083, Peoples R China

3.Guangdong Acad Agr Sci, Guangdong Key Lab Rice Sci & Technol, Key Lab Genet & Breeding High Qual Rice Southern C, Guangdong Rice Engn Lab,Rice Res Inst,Minist Agr &, Guangzhou 510640, Peoples R China

4.Univ Sao Paulo, Luiz de Queiroz Coll Agr, Dept Genet, BR-13418900 Piracicaba, SP, Brazil

5.Univ Tecnol Fed Parana, Dept Comp Sci, Bioinformat & Pattern Recognit Grp BIOINFO CP, UTFPR, Campus Cornelio Procopio, BR-86300000 Cornelio Procopio, Brazil

6.Rosalind Franklin Inst, Didcot OX110QX, England

7.Umea Univ, Umea Plant Sci Ctr UPSC, Dept Plant Physiol, S-90187 Umea, Sweden

Keywords: feature selection; Fourier transform; gradient boosting algorithms; long noncoding RNAs (lncRNAs); model selection; ORF coverage

Journal: NEW PHYTOLOGIST (Impact factor: 8.1; Five-year impact factor: 10.3)

ISSN: 0028-646X

Year, volume, issue: 2025, vol. 247, no. 3

Pages:

Indexed in: SCI

Abstract: Long noncoding RNAs (lncRNAs) are critical regulators of numerous biological processes in plants. Nevertheless, their identification is challenging due to the low sequence conservation across various species. Existing computational methods for lncRNA identification often face difficulties in generalizing across diverse plant species, highlighting the need for more robust and versatile identification models. Here, we present PlantLncBoost, a novel computational tool designed to improve the generalization in plant lncRNA identification. By integrating advanced gradient boosting algorithms with comprehensive feature selection, our approach achieves both high accuracy and generalizability. We conducted an extensive analysis of 1662 features and identified three key features - ORF coverage, complex Fourier average, and atomic Fourier amplitude - that effectively distinguish lncRNAs from mRNAs. We assessed the performance of PlantLncBoost using comprehensive datasets from 20 plant species. The model exhibited exceptional performance, with an accuracy of 96.63%, a sensitivity of 98.42%, and a specificity of 94.93%, significantly outperforming existing tools. Further analysis revealed that the features we selected effectively capture the differences between lncRNAs and mRNAs across a variety of plant species. PlantLncBoost represents a significant advancement in plant lncRNA identification. It is freely accessible on GitHub () and has been integrated into a comprehensive analysis pipeline, Plant-LncRNA-pipeline v.2 ().
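
Illustrative note (not part of the published abstract): the abstract names ORF coverage as a key feature but this record does not define it. The following is a minimal sketch, assuming ORF coverage means the length of the longest open reading frame divided by the transcript length. The function names `longest_orf_length` and `orf_coverage`, the restriction to the three forward frames, the ATG start codon, and the inclusion of the stop codon in the ORF length are all assumptions made for illustration and may differ from the PlantLncBoost implementation.

```python
# Hypothetical helper names; the definition below is an assumption,
# not taken from the PlantLncBoost source code.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code stop codons


def longest_orf_length(seq: str) -> int:
    """Return the length (nt) of the longest ATG...stop ORF found in
    the three forward reading frames; the stop codon is counted."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA letters
    best = 0
    for frame in range(3):
        start = None  # index of the first ATG of the currently open ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None  # close the ORF and look for the next one
    return best


def orf_coverage(seq: str) -> float:
    """Fraction of the transcript covered by its longest ORF (0.0 to 1.0)."""
    if not seq:
        return 0.0
    return longest_orf_length(seq) / len(seq)


if __name__ == "__main__":
    # A toy transcript: 2 nt leader, a 9 nt ORF (ATG AAA TAG), 2 nt trailer.
    print(orf_coverage("CCATGAAATAGCC"))
```

Under this assumed definition, a protein-coding mRNA would tend to score high because most of its length is one long ORF, while an lncRNA would tend to score low, which is consistent with the abstract's claim that the feature separates the two classes.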
