Measuring the similarity between nominal variables is an important problem in data mining. It is the basis for measuring the similarity of data objects that contain nominal variables. There are two kinds of traditional methods for this task: the first simply distinguishes variables as same or not same, while the second measures similarity based on co-occurrence with variables of other attributes. Though they perform well in some conditions, they are still not accurate enough. This paper proposes an algorithm to measure the similarity between nominal variables of the same attribute, based on the fact that the similarity between nominal variables depends on the relationship between the subsets that hold them in the same dataset. The algorithm uses the difference between distributions, quantified by f-divergence, to form the feature vectors of nominal variables. A theoretical analysis helps to choose the best metric from the four most commonly used forms of f-divergence. The time complexity of the method is linear in the size of the dataset, which makes it suitable for processing large-scale data. Experiments that use the derived similarity metrics with K-modes on extensive UCI datasets demonstrate the effectiveness of the proposed method.
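The abstract does not spell out the exact construction, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes that each value of an attribute is characterised by the distribution of the other attributes within the subset of rows holding that value, and that two values are compared by the Hellinger distance (one common form of f-divergence) averaged over those other attributes. The function names, the averaging step, and the toy data are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from math import sqrt


def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as dicts."""
    keys = set(p) | set(q)
    total = sum((sqrt(p.get(k, 0.0)) - sqrt(q.get(k, 0.0))) ** 2 for k in keys)
    return sqrt(total / 2.0)


def conditional_distribution(rows, attr, value, other):
    """Distribution of `other` within the subset of rows where rows[attr] == value."""
    counts = Counter(r[other] for r in rows if r[attr] == value)
    n = sum(counts.values())
    return {k: c / n for k, c in counts.items()}


def value_distance(rows, attr, a, b):
    """Assumed aggregation: mean Hellinger distance over all other attributes."""
    others = [k for k in rows[0] if k != attr]
    dists = [
        hellinger(
            conditional_distribution(rows, attr, a, o),
            conditional_distribution(rows, attr, b, o),
        )
        for o in others
    ]
    return sum(dists) / len(dists)


# Hypothetical toy dataset: each row is a dict of nominal attributes.
rows = [
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "round"},
    {"color": "crimson", "shape": "round"},
    {"color": "crimson", "shape": "round"},
    {"color": "blue", "shape": "square"},
    {"color": "blue", "shape": "square"},
]

d_close = value_distance(rows, "color", "red", "crimson")
d_far = value_distance(rows, "color", "red", "blue")
```

In this toy data, "red" and "crimson" occur with identical shape distributions and so come out closer than "red" and "blue". Each conditional distribution is built in a single pass over the rows, which is in line with the linear-time behaviour claimed in the abstract, though the sketch recomputes distributions per pair rather than caching them.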
Table of Contents
Abstract
1. Introduction
2. Proposed Algorithm
2.1 Definition of Similarity
2.2 Hellinger Distance
2.3 Distance in Unsupervised Learning
3. Theoretical Analysis
3.1 Why Hellinger Distance
3.2 Complexity of the Algorithm
4. Experiments
4.1 Intrinsic Method
4.2 The Extrinsic Method
5. Conclusion
References
Keywords
Similarity, Nominal variables, f-divergence, K-modes
Authors
Zhao Liang [ Institute of Graduate, Liaoning Technical University, Fuxin, Liaoning, 123000, P.R. China ]
Liu Jianhui [ School of Electronic and Information Engineering, Liaoning Technical University, Huludao, Liaoning, 125000, P.R. China ]
Science & Engineering Research Support Center, Republic of Korea (IJDTA)
Founded
2006
Field
Engineering > Computer Science
Publication
Publication title
International Journal of Database Theory and Application
Publication frequency
Bimonthly
pISSN
2005-4270
Coverage period
2008~2016
Decimal classification
KDC 505, DDC 605
International Journal of Database Theory and Application Vol.9 No.3