A Feature Engineering Method for Whole-Genome DNA Sequence with Nucleotide Resolution

Document type: Foreign-language journal article

First author: Wang, Ting

Authors: Wang, Ting; Cui, Yunpeng; Sun, Tan; Li, Huan; Hou, Ying; Wang, Mo; Chen, Li; Wu, Jinming; Wang, Chao

Author affiliations:

Keywords: feature construction; genetic selection; Omics analysis; large language model; agronomic trait prediction

Journal: INTERNATIONAL JOURNAL OF MOLECULAR SCIENCES (Impact factor: 4.9; five-year impact factor: 5.7)

ISSN: 1661-6596

Year, volume, issue: 2025, vol. 26, no. 5

Pages:

Indexed in: SCI

Abstract: Feature engineering for whole-genome DNA sequences plays a critical role in predicting plant phenotypic traits. However, due to limitations in the models' analytical capabilities and computational resources, the existing methods are predominantly confined to SNP-based approaches, which typically extract genetic variation sites for dimensionality reduction before feature extraction. These methods not only suffer from incomplete locus coverage and insufficient genetic information but also overlook the relationships between nucleotides, thereby restricting the accuracy of phenotypic trait prediction. Inspired by the parallels between gene sequences and natural language, the emergence of large language models (LLMs) offers novel approaches for addressing the challenge of constructing genome-wide feature representations with nucleotide granularity. This study proposes FE-WDNA, a whole-genome DNA sequence feature engineering method that fine-tunes HyenaDNA on whole-genome data from 1000 soybean samples. The method captures the contextual and long-range dependencies among nucleotide sites to derive comprehensive genome-wide feature vectors. We further evaluated the application of FE-WDNA in agronomic trait prediction, examining factors such as the context window length of the DNA input, feature vector dimensions, and trait prediction methods, achieving significant improvements compared to the existing SNP-based approaches. FE-WDNA provides a means of high-quality DNA sequence feature engineering at nucleotide resolution, which can be transferred to other plants and directly applied to various computational breeding tasks.

Classification number:
