Encoding and language detection of text document using Deep learning algorithm

Paper

딥러닝 알고리즘을 이용한 문서의 인코딩 및 언어 판별
Encoding and language detection of text document using Deep learning algorithm

Journal

The Journal of Korean Institute of Next Generation Computing (한국차세대컴퓨팅학회 논문지), KCI-indexed
Volume and issue (publication date)

Vol.13 No.5 (2017.10)
Pages

pp.124-130
Authors

Seonbeom Kim, Junwoo Bae, Heejin Park
Language

Korean (KOR)
URL

https://www.earticle.net/Article/A313205

Full-text information

Abstract

Character encoding is the method used to represent characters and symbols on a computer, and software tools exist for detecting which encoding a document uses. The widely used encoding detector "uchardet" identifies the encoding of unmodified, normal text documents with 91.39% accuracy, but its language detection accuracy is only 32.09%. When a document has been encrypted with a substitution cipher, its accuracy drops to 3.55% for encoding detection and 0.06% for language detection. This paper therefore proposes a method for detecting the encoding and language of text documents using LSTM (Long Short-Term Memory), a deep learning algorithm, and the method outperforms "uchardet". For normal documents, the proposed method achieves 99.89% accuracy in encoding detection and 99.92% in language detection. For documents encrypted with a substitution cipher, it achieves 99.26% in encoding detection and 99.77% in language detection.
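
The abstract does not specify how the substitution cipher is constructed or how documents are turned into LSTM input, so the following Python sketch is only an illustration under stated assumptions: the cipher is taken to be a fixed permutation of the 256 byte values, and the model input is taken to be a fixed-length sequence of byte indices. The helper names, the seed, the sequence length, and the padding index are all hypothetical and do not come from the paper.

```python
import random
from typing import List


def make_substitution_table(seed: int) -> bytes:
    """Build a byte-level substitution key: a permutation of 0..255.

    Assumption: the cipher acts on raw bytes; the paper may define it differently.
    """
    values = list(range(256))
    random.Random(seed).shuffle(values)
    return bytes(values)


def invert_table(table: bytes) -> bytes:
    """Return the table that undoes `table`."""
    inverse = bytearray(256)
    for plain_byte, cipher_byte in enumerate(table):
        inverse[cipher_byte] = plain_byte
    return bytes(inverse)


def to_sequence(data: bytes, length: int, pad: int = 256) -> List[int]:
    """Truncate or pad byte values to a fixed length.

    The padding index 256 lies outside the byte range, so it cannot be
    confused with real data (an illustrative choice, not the paper's).
    """
    seq = list(data[:length])
    return seq + [pad] * (length - len(seq))


# Encode a short Korean sample in EUC-KR, then substitute every byte.
plain = "한국어 문서".encode("euc-kr")
table = make_substitution_table(seed=0)
cipher = plain.translate(table)

# The substitution is reversible with the inverse table.
restored = cipher.translate(invert_table(table))

# Fixed-length integer sequence of the kind a sequence model could consume.
sequence = to_sequence(cipher, length=32)
```

Because every byte value is remapped, fixed byte-range signatures such as the lead-byte ranges of EUC-KR no longer hold in the ciphertext, which is a plausible reason for a detector that relies on such signatures to fail on substituted documents. The substitution is a one-to-one mapping, so the pattern of repeated bytes is preserved, and a model trained on sequences like `sequence` could in principle still learn from that structure.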

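Table of contents

Abstract (Korean)
Abstract (English)
1. Introduction
2. Collecting Experimental Data
3. Encoding and Language Detection for Normal Documents
  3.1 Preprocessing LSTM Input Data
  3.2 Experimental Results
4. Encoding and Language Detection for Documents Encrypted by Substitution
  4.1 Document Modification by Substitution
  4.2 Preprocessing LSTM Input Data
  4.3 Experimental Results
5. Conclusion
References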

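Authors

Seonbeom Kim (김선범) [ Department of Computer Software, Hanyang University ]
Junwoo Bae (배준우) [ Department of Electronics, Communications and Computer Engineering, Hanyang University ]
Heejin Park (박희진) [ Department of Computer Software, Hanyang University ] Corresponding author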

Journal information

Journal

The Journal of Korean Institute of Next Generation Computing (한국차세대컴퓨팅학회 논문지)
Frequency
Bimonthly
pISSN
1975-681X
Coverage
2005~2026
Indexing status
KCI-indexed
Decimal classification
KDC 566 DDC 004
