Although small language models are competitive without large-scale infrastructure, their performance is highly contingent on prompt design. This study analyzes the sensitivity of BitNet b1.58-2B-4T to label exposure and few-shot exemplar composition on a 36-class medical query classification task. We generated 504 items, consisting of 6 direct and 8 indirect questions for each disease; after removing cross-exemplar leakage, the final evaluation set contained 494 items. With no parameter updates, 0/1/2/5/10-shot prompting was evaluated using accuracy. Under the no-label-exposure setting, accuracy increased as more exemplars were provided. However, these gains were accompanied by growing prediction concentration on exemplar labels. In contrast, with label exposure, zero-shot prompting achieved the highest accuracy, while the inclusion of exemplars reduced accuracy and amplified label bias. These results show that the structure of the prompt can shift few-shot effects from beneficial to detrimental. This highlights the importance of controlled prompt design and domain-adaptive training to ensure trustworthy performance.
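The two quantities the abstract relies on, accuracy and the concentration of predictions on exemplar labels, can be illustrated with a minimal sketch. The function names, the definition of concentration as "share of predictions that fall on a label shown in the exemplars", and the toy labels are all assumptions made for illustration; they are not taken from the study's code or data.

```python
def accuracy(gold, pred):
    """Fraction of items whose predicted label equals the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def exemplar_label_share(pred, exemplar_labels):
    """Fraction of predictions that land on a label shown in the exemplars.

    Assumed definition of 'prediction concentration'; the study may use another.
    """
    shown = set(exemplar_labels)
    return sum(p in shown for p in pred) / len(pred)


# Dataset size stated in the abstract: 36 classes x (6 direct + 8 indirect).
NUM_CLASSES, DIRECT, INDIRECT = 36, 6, 8
total_items = NUM_CLASSES * (DIRECT + INDIRECT)  # 504 before leakage removal

# Toy labels invented for illustration only (not results from the study).
gold = ["flu", "asthma", "gout", "flu"]
pred = ["flu", "flu", "gout", "flu"]
exemplars = ["flu"]  # hypothetical label of a 1-shot exemplar

print(total_items)
print(accuracy(gold, pred))
print(exemplar_label_share(pred, exemplars))
```

Reporting the two numbers side by side is what makes the abstract's observation visible: accuracy can rise with more exemplars while the share of predictions on exemplar labels also grows.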
Table of Contents
Abstract
I. INTRODUCTION
II. METHODOLOGY
III. EXPERIMENTS AND RESULTS
  A. Experiment Settings
  B. Experiment Results
IV. DISCUSSION AND CONCLUSION
ACKNOWLEDGMENT
REFERENCES
Authors
Sihyung Kim [ Department of Computer Engineering, The Catholic University of Korea, Bucheon, South Korea ]
Jaehyun Cha [ Department of Computer Engineering, The Catholic University of Korea, Bucheon, South Korea ]
Siyoung Kim [ Department of Computer Engineering, The Catholic University of Korea, Bucheon, South Korea ]
Yoojoong Kim [ School of Computer Science and Information Engineering, The Catholic University of Korea, Bucheon, South Korea ]
Corresponding Author