Improving In-Silico Bacterial Toxin Prediction via Semi-Supervised Dataset Curation

Oral Session B-3: Biomedical Applications

Publication

Korean Institute of Next Generation Computing (한국차세대컴퓨팅학회) Conference
Volume/Issue (Year)

ICNGC 2025: The 11th International Conference on Next Generation Computing 2025 (2025.12)
Pages

pp.334-335
Authors

Sung-Yoon Ahn, Sewon Kim, Hye Won Jeong, Sang-Woong Lee, Iel Soo Bang
Language

English (ENG)
URL

https://www.earticle.net/Article/A478528

Abstract (English): In this study, we propose a semi-supervised dataset curation framework that uses both high-confidence labeled protein sequence data and automatically annotated, weakly labeled protein sequence data to refine dataset quality before model training. The approach centers on a pre-trained ProtBERT model that iteratively assigns pseudo-labels to uncertain samples, followed by model retraining, with the goal of improving robustness and generalization. We anticipate that a dataset curated in this way will significantly improve toxin-classification performance, measured by accuracy, F1-score, and MCC, compared with models trained solely on manually labeled or automatically annotated data.
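The iterative pseudo-labeling loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a toy nearest-centroid classifier over amino-acid composition stands in for the pre-trained ProtBERT model, the weakly labeled data is treated as an unlabeled pool, and the names `composition`, `fit_centroids`, `predict`, `self_train`, the `sharpness` parameter, and the 0.9 confidence threshold are all assumptions made for illustration.

```python
from collections import Counter
from math import exp

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Amino-acid frequency vector (toy stand-in for ProtBERT features)."""
    counts = Counter(seq)
    n = max(len(seq), 1)
    return [counts.get(a, 0) / n for a in AMINO_ACIDS]


def fit_centroids(labeled):
    """'Training' step: mean feature vector per class (1 = toxin, 0 = non-toxin)."""
    sums, sizes = {}, Counter()
    for seq, label in labeled:
        acc = sums.setdefault(label, [0.0] * len(AMINO_ACIDS))
        for i, v in enumerate(composition(seq)):
            acc[i] += v
        sizes[label] += 1
    return {lab: [v / sizes[lab] for v in acc] for lab, acc in sums.items()}


def predict(centroids, seq, sharpness=10.0):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    vec = composition(seq)
    scores = {}
    for label, centre in centroids.items():
        dist = sum((a - b) ** 2 for a, b in zip(vec, centre)) ** 0.5
        scores[label] = exp(-sharpness * dist)
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())


def self_train(labeled, unlabeled, threshold=0.9, max_rounds=5):
    """Iteratively pseudo-label confident samples and retrain on the grown set."""
    curated, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model = fit_centroids(curated)          # retrain on the current curated set
        accepted, remaining = [], []
        for seq in pool:
            label, conf = predict(model, seq)
            if conf >= threshold:
                accepted.append((seq, label))   # confident: keep the pseudo-label
            else:
                remaining.append(seq)           # uncertain: revisit next round
        if not accepted:                        # nothing new learned, stop early
            break
        curated.extend(accepted)
        pool = remaining
    return curated, pool


# Hypothetical toy sequences, not real toxin data.
labeled = [("KKKKKKKK", 1), ("DDDDDDDD", 0)]
unlabeled = ["KKKKKKKD", "DDDDDDDK", "KKKKDDDD"]
curated, still_uncertain = self_train(labeled, unlabeled)
```

In this toy run the two sequences that closely resemble a labeled example are added to the curated set with pseudo-labels, while the evenly mixed sequence stays in the uncertain pool. In a real pipeline the classifier, the confidence estimate, and the stopping rule would all need to be chosen and validated against held-out labeled data.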

Sung-Yoon Ahn [ School of Computing, Gachon University, Seongnam-si, Republic of Korea ]
Sewon Kim [ School of Computing, Gachon University, Seongnam-si, Republic of Korea ]
Hye Won Jeong [ Department of Microbiology and Immunology, Chosun University School of Dentistry, Gwangju, Korea ]
Sang-Woong Lee [ School of Computing, Gachon University, Seongnam-si, Republic of Korea ]
Iel Soo Bang [ Department of Microbiology and Immunology, Chosun University School of Dentistry, Gwangju, Korea ]

Data provided by: Naver Academic (네이버학술정보)