대형 언어 모델에서의 공격 방지를 위한 필터링 기법에 관한 연구

A Study on Filtering Techniques for Preventing Attacks in Large Language Models

Abstract: Recently, there have been cases in which hateful, violent, or offensive expressions are entered as prompts into a large language model (LLM) in order to distort or misuse its responses. To guard against such abusive use, this paper proposes a dual filtering technique. In the first stage, a classifier decides whether a sentence is hateful. In the second stage, vector search over existing knowledge is used to further assess the meaning of the sentence and to filter out sentences that the first stage misclassified. The aim is to strengthen the safety of language models through more refined filtering, so that model responses do not contain offensive content. For the first stage, we combined a text embedding model with a text classifier and ran a hateful-sentence classification experiment, which achieved an F-score of over 0.98. For the second stage, we used a text embedding model to convert questions and answers into vectors and ran an experiment on retrieving the answer to a given question, which achieved a Precision@1 of 0.87.
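The two stages described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract names neither the embedding model nor the classifier, so a normalised bag-of-words vector stands in for the text embedding model and a nearest-centroid rule stands in for the text classifier. The rule used in stage 2 (reject a prompt whose best knowledge-base match falls below a similarity threshold) is also an assumption, as are all names (`embed`, `dual_filter`, `TRAIN`, `KB`), the toy data, and the `threshold` value.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for a text embedding model: L2-normalised bag of words."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}

def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())

def train_centroids(examples):
    """Stage 1 training (assumed classifier): one normalised centroid per label."""
    sums = {}
    for text, label in examples:
        acc = sums.setdefault(label, Counter())
        for tok, w in embed(text).items():
            acc[tok] += w
    centroids = {}
    for label, acc in sums.items():
        norm = math.sqrt(sum(w * w for w in acc.values()))
        centroids[label] = {tok: w / norm for tok, w in acc.items()}
    return centroids

def classify(text, centroids):
    """Stage 1: return the label whose centroid is most similar to the text."""
    vec = embed(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

def search(query, kb):
    """Stage 2: return (score, answer) for the most similar stored question."""
    vec = embed(query)
    scored = [(cosine(vec, embed(question)), answer) for question, answer in kb]
    return max(scored, key=lambda pair: pair[0])

def dual_filter(prompt, centroids, kb, threshold=0.6):
    """Return an answer, or None if either stage rejects the prompt."""
    if classify(prompt, centroids) == "hateful":
        return None  # rejected by the stage-1 classifier
    score, answer = search(prompt, kb)
    if score < threshold:
        return None  # assumed stage-2 rule: no close match in the knowledge base
    return answer

def precision_at_1(labelled_queries, kb):
    """Fraction of queries whose top-ranked answer is the expected one."""
    hits = sum(search(q, kb)[1] == expected for q, expected in labelled_queries)
    return hits / len(labelled_queries)

# Hypothetical toy data, for illustration only.
TRAIN = [
    ("i hate you and everyone like you", "hateful"),
    ("you are worthless and stupid", "hateful"),
    ("how do i reset my password", "normal"),
    ("what are the opening hours of the library", "normal"),
]
KB = [
    ("how do i reset my password", "Use the 'Forgot password' link on the login page."),
    ("what are the opening hours of the library", "The library is open from 9:00 to 18:00."),
]

centroids = train_centroids(TRAIN)
print(dual_filter("you are stupid", centroids, KB))
print(dual_filter("how do i reset my password", centroids, KB))
print(dual_filter("what is the weather in paris", centroids, KB))
```

In this sketch the first prompt is stopped by stage 1, the second passes both stages, and the third passes stage 1 but is stopped by stage 2 because nothing in the toy knowledge base is close enough to it. A real system would replace `embed` with a trained embedding model and tune `threshold` on held-out data.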

Source: Naver Academic (네이버학술정보)