Mijit Ablimit, Tatsuya Kawahara, Akbar Pattar, Askar Hamdulla
Language
English (ENG)
URL
https://www.earticle.net/Article/A271466
Article Information
Abstract
English
Uyghur is an agglutinative language in which words are derived from stems (or roots) by concatenating suffixes. This property produces a large number of morpheme combinations and greatly increases the word vocabulary size, causing out-of-vocabulary (OOV) and data sparseness problems for statistical models. Words are therefore split into sub-word units for use in text and speech processing applications. Proper sub-word units provide not only high coverage and a smaller lexicon, but also the semantic and syntactic information needed by downstream applications. This paper discusses a general-purpose morphological analyzer that can split a text into a sequence of morphemes or syllables. Uyghur morpheme segmentation is a basic part of the comprehensive effort to compile a Uyghur language corpus. Because there are no delimiters for sub-word units, a supervised method that combines rules with a statistical learning algorithm is applied for morpheme segmentation. Phonetic units such as syllables and phonemes can be extracted with high accuracy by purely rule-based methods. Linguistic morphemes are the most common and suitable sub-word units for various applications, because they provide linguistic information, high coverage, and a small lexicon, and can easily be restored to words. Because Uyghur is written as it is pronounced, phonetic alterations in speech are expressed directly in text, which gives a particular morpheme many surface forms. A general-purpose morphological analyzer must be able to analyze and export both standard and surface forms. The morpho-phonetic alterations, such as phonetic harmony, weakening, and morphological changes, are therefore summarized and learned from a training corpus. A statistical-model-based morpheme segmentation tool is trained on a corpus of aligned word-morpheme sequences and applied to predict possible morpheme sequences.
On an open test set with word coverage of 86.8% and morpheme coverage of 98.4%, the morpheme segmentation accuracy is 97.6%. The tool can output both standard forms and surface forms without loss of segmentation accuracy. Furthermore, the statistical properties of the basic lexical units (word, morpheme, and syllable) are compared as part of the comprehensive Uyghur language corpus compilation effort.
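The gap between word coverage and morpheme coverage reported in the abstract can be illustrated with a minimal sketch. It assumes coverage is measured per token, as the fraction of test-set units that also occur in the training data; the abstract does not state the exact definition. The toy corpus, the placeholder unit names (`stem1`, `sufA`, and so on), and the helper names are hypothetical and are not taken from the paper.

```python
def coverage(train_units, test_units):
    """Fraction of test tokens whose unit also occurs in the training data."""
    vocab = set(train_units)
    if not test_units:
        return 0.0
    covered = sum(1 for unit in test_units if unit in vocab)
    return covered / len(test_units)


def words(corpus):
    """Rebuild each word by concatenating its morphemes."""
    return ["".join(morphs) for morphs in corpus]


def morphemes(corpus):
    """Flatten the corpus into a single morpheme sequence."""
    return [m for morphs in corpus for m in morphs]


# Hypothetical toy data: each word is given as an assumed (stem, suffix) split.
train = [("stem1", "sufA"), ("stem2", "sufB"), ("stem1", "sufB")]
test = [("stem1", "sufA"), ("stem2", "sufA"), ("stem3", "sufB")]

word_cov = coverage(words(train), words(test))
morph_cov = coverage(morphemes(train), morphemes(test))

print(f"word coverage:     {word_cov:.3f}")
print(f"morpheme coverage: {morph_cov:.3f}")
```

In this toy case only one of the three test words was seen in training, while five of the six test morphemes were. This is the same effect that motivates sub-word units: unseen words are often new combinations of morphemes that have already been seen.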
Table of Contents
Abstract
1. Uyghur Language and Morphological Structure
2. Inducing Morphological Units
2.1. Phonetic Rules in Uyghur Language
2.2. Rule Based Segmentation
2.3. Morpheme Segmentation Based on a Statistical Model
3. Statistical Properties of Various Units
4. Conclusions
Acknowledgements
References
Keywords
Uyghur, morpheme, morphology, phonetics, vowel weakening
Authors
Mijit Ablimit [ Institute of Information Science and Engineering, Xinjiang University, China ]
Tatsuya Kawahara [ School of Informatics, Kyoto University, Kyoto, Japan ]
Akbar Pattar [ Institute of Information Science and Engineering, Xinjiang University, China ]
Askar Hamdulla [ School of Software, Xinjiang University, Urumqi, China ]
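Publisher
Science & Engineering Research Support Center, Republic of Korea (IJFGCN)
Year founded
2006
Field
Engineering > Computer Science
Introduction
1. Surveys and research on security engineering
2. Research and presentation of applied technologies in security engineering
3. Holding academic conferences and exhibitions on security engineering
4. Mutual cooperation and information exchange on security engineering technology
5. Standardization projects and establishment of specifications for security engineering
6. Promotion of industry-academia-research cooperation in security engineering
7. International academic exchange and technical cooperation
8. Publication of journals on security engineering
9. Other activities necessary to achieve the society's objectives
Publication
Publication title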
International Journal of Future Generation Communication and Networking
Frequency
Bimonthly
pISSN
2233-7857
Coverage period
2008~2016
Decimal classification
KDC 505, DDC 605
Volume and issue: International Journal of Future Generation Communication and Networking, Vol. 9, No. 2