Convergence of Internet, Broadcasting and Communication

Proposal of a Korean 3D Lip-Sync Model Structure through Extension of an Existing Korean Speech Synthesis Model

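  • Publisher
    The International Association for Artificial Intelligence (formerly 한국인터넷방송통신학회, the Institute of Internet, Broadcasting and Communication)
  • Journal
    International Journal of Internet, Broadcasting and Communication
  • Volume and issue
    Vol.17 No.3 (2025.08)
  • Pages
    pp.136-143
  • Author
    Ki-Hong Kim
  • Language
    English (ENG)
  • URL
    https://www.earticle.net/Article/A472238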


Full-Text Information

Abstract

English
This study proposes a novel model structure that implements 3D lip-sync by extending the Korean speech synthesis model Korean-FastSpeech2. Existing Korean lip-sync technologies have struggled to accurately render the three-dimensional mouth shapes of the Korean pronunciations "아" (/a/), "오" (/o/), and "우" (/u/), particularly lip rounding (the ARKit mouthPucker blendshape). To address this, we add a Lip Predictor to the Encoder-Variance Adaptor-Decoder architecture, enabling the model to learn from ARKit data. The Lip Predictor, built on a four-layer Transformer decoder with eight-head multi-head attention, processes phoneme features and temporal information. Because the Lip Predictor and the speech output Decoder share the Variance Adaptor's output, synchronization between speech and lip movements follows naturally from the architecture; this is the core contribution of the study. The proposed model facilitates specialized learning for the "아", "오", and "우" pronunciations and is expected to offer superior precision, synchronization accuracy, and scalability compared to existing 3D lip-sync algorithms such as Audio2Face, VOCA, and FaceFormer. This work highlights the potential for advancing lip-sync technology for minority languages.
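
The synchronization argument in the abstract can be illustrated with a small sketch. It assumes, as in FastSpeech2, that the Variance Adaptor expands phoneme-level features to frame level using per-phoneme durations; the paper's actual implementation is not shown on this page. All function names, feature values, and durations below are hypothetical, and the two "decoders" are toy stand-ins rather than Transformer layers. Only the blendshape names `mouthPucker` and `jawOpen` come from ARKit.

```python
from typing import Dict, List, Sequence

Frame = List[float]


def length_regulate(phoneme_feats: Sequence[Sequence[float]],
                    durations: Sequence[int]) -> List[Frame]:
    """Expand phoneme-level features to frame level by repeating each
    phoneme's feature vector for its duration (counted in frames)."""
    if len(phoneme_feats) != len(durations):
        raise ValueError("one duration per phoneme is required")
    frames: List[Frame] = []
    for feat, dur in zip(phoneme_feats, durations):
        frames.extend(list(feat) for _ in range(dur))
    return frames


def toy_mel_decoder(frames: Sequence[Frame], n_mels: int = 4) -> List[List[float]]:
    """Toy stand-in for the speech Decoder: one mel frame per input frame."""
    return [[sum(f) / len(f)] * n_mels for f in frames]


def toy_lip_predictor(frames: Sequence[Frame]) -> List[Dict[str, float]]:
    """Toy stand-in for the Lip Predictor: one blendshape frame per input
    frame, with values clamped to the 0..1 range used by ARKit."""
    def clamp(x: float) -> float:
        return max(0.0, min(1.0, x))
    return [{"mouthPucker": clamp(f[0]), "jawOpen": clamp(f[1])} for f in frames]


# Hypothetical 2-dimensional features for three phonemes, plus durations in frames.
phoneme_feats = [[0.1, 0.9], [0.8, 0.3], [0.9, 0.1]]
durations = [3, 2, 4]

# Both branches consume the same frame-level sequence.
shared = length_regulate(phoneme_feats, durations)
mel = toy_mel_decoder(shared)
lips = toy_lip_predictor(shared)

print(len(shared), len(mel), len(lips))  # prints: 9 9 9
```

Because both branches read the same expanded sequence, the speech and lip outputs have the same number of frames by construction, so no separate alignment step is needed in this sketch.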

Table of Contents

Abstract
1. Introduction
2. Related research
2.1 Korean-FastSpeech2
2.2 ARKit Facial Animation
2.3 VOCA: Voice Operated Character Animation
2.4 Existing 3D Lip-Sync Algorithms
3. Research Methods
3.1 Data Preparation
3.2 Data Preprocessing
3.3 Model Architecture
3.4 Training Procedure
3.5 Proposed Evaluation Framework
4. Expected Model Superiority
4.1 Precise Implementation of "아", "오", "우" Pronunciations
4.2 Speech-Lip Synchronization Precision
4.3 Korean Phoneme-Specific Learning Capability
4.4 Scalability and Flexibility
4.5 Real-Time Application Potential
4.6 Theoretical Model Validation Framework
5. Discussion and Future Work
5.1 Speaker-Dependent Data Considerations
5.2 Dataset Scalability Strategy
5.3 Emotional Expression Integration
6. Conclusion
Acknowledgement
Reference

Keywords

Korean lip-sync, 3D lip movement, Korean-FastSpeech2, Lip Predictor, ARKit

Author

  • Ki-Hong Kim [ Professor, Department of Visual Animation, Dongseo University, Korea ] Corresponding Author


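Journal Information

Publisher

  • Publisher name
    The International Association for Artificial Intelligence (formerly 한국인터넷방송통신학회, the Institute of Internet, Broadcasting and Communication)
  • Founded
    2000
  • Field
    Engineering > Electronics / Information and Communication Engineering
  • About
    The association aims to contribute to the advancement of scholarship and technology, both in Korea and internationally, in internet broadcasting, internet TV, broadcasting and communication networks, and related fields, and to contribute to the knowledge and information society.

Journal

  • Journal title
    International Journal of Internet, Broadcasting and Communication
  • Frequency
    Quarterly
  • pISSN
    2288-4920
  • eISSN
    2288-4939
  • Coverage
    2009-2025
  • Decimal classification
    KDC 326, DDC 380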
