How Does AI Evaluate Korean-English Translations? On the Correlation Between ChatGPT and Human Evaluation and the Characteristics of ChatGPT's Evaluation
The Potential of ChatGPT as a Translation Evaluator: Characteristics and Comparisons to Human Evaluation
In this study, we carried out a series of experiments to explore how ChatGPT (version 4o) evaluated Korean-English translations. Using two datasets of human translations (n=57) and two datasets of post-edited translations (n=56), all drawn from Lee and Lee (2021), we adopted two evaluation approaches with strict prompt control. In Experiment A, ChatGPT rated the four datasets freely on a five-point scale without specific criteria. In Experiment B, which was conducted concurrently with Experiment A, ChatGPT rated the same datasets using a prescribed, criterion-referenced five-point scale. To assess intra-rater reliability, we repeated both experiments one month later. This study yielded both quantitative and qualitative findings, including the following: (1) ChatGPT’s average scores differed significantly from those of human raters; (2) correlations between human and ChatGPT scores ranged from ‘moderate’ to ‘strong’; (3) the use of the prescribed rating scale improved ChatGPT’s reliability as a rater; (4) ChatGPT exhibited very low intra-rater reliability; and (5) ChatGPT’s self-justifications for its ratings varied in quality, often failing to identify obvious errors.
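Table of Contents
1. Introduction
2. Review of Previous Studies
2.1. ChatGPT and Foreign-Language Writing/Translation
2.2. Studies Using ChatGPT as an Evaluation Tool
3. Research Methods
4. Analysis Results
4.1. Quantitative Analysis
4.1.1. Comparison of Means
4.1.2. Inter-Rater Correlation
4.1.3. Intra-Rater Reliability
4.2. Qualitative Analysis
4.2.1. Characteristics Related to the Evaluation Rationale
4.2.2. Differences in Evaluation by Rating Scale
4.2.3. Examples of ChatGPT's Evaluations
5. Conclusion
References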
Keywords
translation evaluation, post-editing (MTPE), translation quality, translation education, ChatGPT as a rater