Stem-Affix based Uyghur Morphological Analyzer

  • Publisher
    Science & Engineering Research Support Center, Republic of Korea (IJFGCN)
  • Publication
    International Journal of Future Generation Communication and Networking
  • Issue
    Vol.9 No.2 (2016.02)
  • Pages
    pp.59-72
  • Authors
    Mijit Ablimit, Tatsuya Kawahara, Akbar Pattar, Askar Hamdulla
  • Language
    English (ENG)
  • URL
    https://www.earticle.net/Article/A271466

Article Information

Abstract

English
Uyghur is an agglutinative language in which words are derived from stems (or roots) by concatenating suffixes. This property produces a large number of morpheme combinations and greatly increases the word vocabulary size, causing out-of-vocabulary (OOV) and data sparseness problems for statistical models. Words are therefore split into sub-word units for use in text and speech processing applications. Proper sub-word units not only provide high coverage and a smaller lexicon, but also carry the semantic and syntactic information needed by downstream applications. This paper discusses a general-purpose morphological analyzer that can split a text into sequences of morphemes or syllables. Uyghur morpheme segmentation is a basic part of the comprehensive effort to compile a Uyghur language corpus. As there are no delimiters for sub-word units, a supervised method that combines rules with a statistical learning algorithm is applied for morpheme segmentation. Phonetic units such as syllables and phonemes can be extracted with high accuracy by purely rule-based methods. Linguistic morphemes are the most common and suitable sub-word units for various applications, because they provide linguistic information, high coverage, and a small lexicon, and can easily be restored to words. As Uyghur is written as it is pronounced, phonetic alterations of speech are expressed directly in text, which gives a single morpheme many surface forms. A general-purpose morphological analyzer must be able to analyze and output both standard and surface forms. The morpho-phonetic alterations, such as phonetic harmony, weakening, and morphological changes, are therefore summarized and learned from a training corpus. A statistical-model-based morpheme segmentation tool is trained on a corpus of aligned word-morpheme sequences and applied to predict possible morpheme sequences.
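The relation between surface and standard forms can be illustrated with a small sketch. This is not the paper's statistical segmenter: it is a plain dictionary lookup over a hypothetical toy lexicon in Latin transliteration (`STEMS`, `SUFFIXES`), and it assumes a single simplified weakening rule in which a surface "i" may stand for a standard "a" or "e". All function names are illustrative.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical toy lexicon in Latin transliteration (not taken from the paper).
STEMS: Set[str] = {"bala", "kitab"}
SUFFIXES: Set[str] = {"lar", "im"}

Morpheme = Tuple[str, str]  # (surface form, standard form)


def restore(surface: str, lexicon: Set[str]) -> Optional[str]:
    """Map a surface form to a standard form in `lexicon`, or return None.

    Simplifying assumption: vowel weakening turned one 'a' or 'e' into 'i',
    so we try undoing that on the last 'i' of the surface form.
    """
    if surface in lexicon:
        return surface
    pos = surface.rfind("i")
    if pos == -1:
        return None
    for vowel in ("a", "e"):
        candidate = surface[:pos] + vowel + surface[pos + 1:]
        if candidate in lexicon:
            return candidate
    return None


def segment_suffixes(rest: str) -> Optional[List[Morpheme]]:
    """Split `rest` into known suffixes, longest match first, with backtracking."""
    if not rest:
        return []
    for end in range(len(rest), 0, -1):
        standard = restore(rest[:end], SUFFIXES)
        if standard is None:
            continue
        tail = segment_suffixes(rest[end:])
        if tail is not None:
            return [(rest[:end], standard)] + tail
    return None


def segment(word: str) -> Optional[List[Morpheme]]:
    """Split `word` into one stem followed by zero or more suffixes."""
    for end in range(len(word), 0, -1):
        standard = restore(word[:end], STEMS)
        if standard is None:
            continue
        tail = segment_suffixes(word[end:])
        if tail is not None:
            return [(word[:end], standard)] + tail
    return None


print(segment("balilar"))     # [('bali', 'bala'), ('lar', 'lar')]
print(segment("kitablirim"))  # [('kitab', 'kitab'), ('lir', 'lar'), ('im', 'im')]
```

Each morpheme is returned as a (surface, standard) pair, so the same segmentation can be exported in either form. A real analyzer would need a much richer rule set and a way to rank competing segmentations, which is the role the abstract assigns to the statistical model.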
On an open test set with word coverage of 86.8% and morpheme coverage of 98.4%, the morpheme segmentation accuracy is 97.6%. The morpheme segmentation tool can output both standard forms and surface forms without loss of segmentation accuracy. Furthermore, the statistical properties of the basic lexical units (word, morpheme, and syllable) are compared as part of the comprehensive effort to compile a Uyghur language corpus.
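As a rough illustration of why morpheme coverage can exceed word coverage, the sketch below computes coverage as the fraction of test tokens found in a training vocabulary. The abstract does not state whether coverage is measured over tokens or types, so the token-level definition is an assumption, and the toy data and the `coverage` helper are hypothetical.

```python
from typing import Iterable, Set


def coverage(test_tokens: Iterable[str], vocabulary: Set[str]) -> float:
    """Fraction of test tokens that appear in `vocabulary` (token-level, assumed)."""
    tokens = list(test_tokens)
    if not tokens:
        return 0.0
    covered = sum(1 for token in tokens if token in vocabulary)
    return covered / len(tokens)


# Hypothetical toy data in Latin transliteration (not from the paper).
train_words = {"bala", "kitablar"}
train_morphemes = {"bala", "kitab", "lar"}

test_words = ["balilar", "kitablar"]
test_morphemes = ["bala", "lar", "kitab", "lar"]  # standard forms of the same text

word_cov = coverage(test_words, train_words)
morph_cov = coverage(test_morphemes, train_morphemes)
print(f"word coverage: {word_cov:.1%}, morpheme coverage: {morph_cov:.1%}")
```

The unseen word "balilar" counts as out-of-vocabulary at the word level, yet both of its morphemes are known, so the morpheme-level figure is higher.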

Table of Contents

Abstract
 1. Uyghur Language and Morphological Structure
 2. Inducing Morphological Units
  2.1. Phonetic Rules in Uyghur Language
  2.2. Rule Based Segmentation
  2.3. Morpheme Segmentation Based on a Statistical Model
 3. Statistical Properties of Various Units
 4. Conclusions
 Acknowledgements
 References

Keywords

Uyghur, morpheme, morphology, phonetics, vowel weakening

Authors

  • Mijit Ablimit [ Institute of Information Science and Engineering, Xinjiang University, China ]
  • Tatsuya Kawahara [ School of Informatics, Kyoto University, Kyoto, Japan ]
  • Akbar Pattar [ Institute of Information Science and Engineering, Xinjiang University, China ]
  • Askar Hamdulla [ School of Software, Xinjiang University, Urumqi, China ]

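Publication Information

Publisher

  • Publisher name
    Science & Engineering Research Support Center, Republic of Korea (IJFGCN)
  • Year established
    2006
  • Field
    Engineering > Computer Science
  • About
    1. Surveys and research on security engineering 2. Research and presentation of applied technologies for security engineering 3. Hosting academic conferences and exhibitions on security engineering 4. Mutual cooperation and information exchange on security engineering technology 5. Standardization projects and establishment of specifications for security engineering 6. Promotion of industry-academia-research cooperation in security engineering 7. International academic exchange and technical cooperation 8. Publication of journals on security engineering 9. Other projects necessary to achieve the organization's purposes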

Publication

  • Publication title
    International Journal of Future Generation Communication and Networking
  • Frequency
    Bimonthly
  • pISSN
    2233-7857
  • Coverage period
    2008~2016
  • Decimal classification
    KDC 505 DDC 605