In this study, we propose a semi-supervised dataset curation framework that combines high-confidence labeled protein sequence data with automatically generated, weakly labeled protein sequence data to refine dataset quality before model training. The approach uses a pre-trained ProtBERT model to iteratively assign pseudo-labels to uncertain samples and then retrain the model, with the goal of improving robustness and generalization. We anticipate that a dataset curated in this way will significantly improve toxin-classification performance, measured by accuracy, F1-score, and Matthews correlation coefficient (MCC), compared with models trained solely on manually labeled or automatically annotated data.
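The iterative pseudo-labeling loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a toy nearest-centroid classifier over amino-acid composition stands in for the pre-trained ProtBERT model so the example runs with the Python standard library alone. The names `featurize`, `CentroidClassifier`, and `self_train`, the confidence `threshold`, the `temperature`, and the stopping rule (`max_rounds`, or no newly accepted samples) are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def featurize(seq):
    """Amino-acid composition vector (hypothetical stand-in for ProtBERT embeddings)."""
    counts = Counter(seq)
    n = max(len(seq), 1)
    return [counts.get(a, 0) / n for a in AMINO_ACIDS]


class CentroidClassifier:
    """Toy classifier: one centroid per class, softmax over negative distances."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_proba(self, x, temperature=0.05):
        # Smaller distance to a centroid gives a larger weight for that class.
        weights = {
            c: math.exp(-math.dist(x, mu) / temperature)
            for c, mu in self.centroids.items()
        }
        total = sum(weights.values())
        return {c: w / total for c, w in weights.items()}


def self_train(labeled, unlabeled, threshold=0.9, max_rounds=5):
    """Iteratively pseudo-label confident samples and retrain.

    labeled:   list of (sequence, label) pairs (high-confidence data)
    unlabeled: list of sequences (uncertain or weakly labeled data)
    Returns the final model, the curated labeled set, and the leftover pool.
    """
    curated = list(labeled)
    pool = list(unlabeled)
    model = CentroidClassifier()
    for _ in range(max_rounds):
        model.fit([featurize(s) for s, _ in curated], [y for _, y in curated])
        accepted, remaining = [], []
        for seq in pool:
            probs = model.predict_proba(featurize(seq))
            label, conf = max(probs.items(), key=lambda kv: kv[1])
            if conf >= threshold:
                accepted.append((seq, label))  # confident: keep the pseudo-label
            else:
                remaining.append(seq)  # still uncertain: retry next round
        if not accepted:
            break  # assumed stopping rule: nothing new to add
        curated.extend(accepted)
        pool = remaining
    # Final refit so the returned model reflects the full curated set.
    model.fit([featurize(s) for s, _ in curated], [y for _, y in curated])
    return model, curated, pool
```

In a realistic setting, `CentroidClassifier` would be replaced by a fine-tuned ProtBERT classifier, and the threshold would be tuned on held-out labeled data. A threshold set too low lets pseudo-label noise accumulate across rounds.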
Table of Contents
Abstract
I. INTRODUCTION
II. RELATED WORKS
  A. Semi-Supervised Learning
  B. Protein Sequence Models
III. METHODOLOGY
ACKNOWLEDGMENT
REFERENCES
Authors
Sung-Yoon Ahn [ School of Computing, Gachon University, Seongnam-Si, Republic of Korea ]
Sewon Kim [ School of Computing, Gachon University, Seongnam-Si, Republic of Korea ]
Hye Won Jeong [ Department of Microbiology and Immunology, Chosun University School of Dentistry, Gwangju, Korea ]
Sang-Woong Lee [ School of Computing, Gachon University, Seongnam-Si, Republic of Korea ]
Iel Soo Bang [ Department of Microbiology and Immunology, Chosun University School of Dentistry, Gwangju, Korea ]