A suffix tree is an efficient data structure because it exposes the detailed internal structure of given sequences in linear time. However, implementing a suffix tree for a large number of sequences is difficult because of memory constraints. Comparing multi-megabase genomic sequence sets with suffix trees therefore requires redesigning the suffix tree construction algorithms. We introduce a new method for constructing a suffix tree for a large number of sequences on secondary storage. Our algorithm divides three files, in a designated sequence, into parts and stores references to the locations of edges in hash tables. In our experiments, we used about 1,300,000 EST sequences (roughly 300 MB) to generate a suffix tree on disk.
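The idea of keeping edges on disk and locating them through a hash table can be illustrated with a minimal sketch. This is not the authors' algorithm: the tree is built naively in memory (quadratic time) and only then written out, and the record layout `EDGE`, the functions `build_edges`, `write_tree`, and `contains`, and the `(parent node, first character)` hash key are all assumptions made for illustration.

```python
import struct
import tempfile

# Hypothetical fixed-size edge record: (label_start, label_end, child_node).
EDGE = struct.Struct("<III")


def build_edges(text):
    """Naively build suffix-tree edges as {(parent, first_char): [start, end, child]}.

    Assumes text ends with a unique terminator such as "$".
    """
    edges = {}
    next_node = 1  # node 0 is the root
    for i in range(len(text)):
        node, pos = 0, i
        while pos < len(text):
            key = (node, text[pos])
            if key not in edges:
                edges[key] = [pos, len(text), next_node]  # new leaf edge
                next_node += 1
                break
            start, end, child = edges[key]
            k = 0
            while (start + k < end and pos + k < len(text)
                   and text[start + k] == text[pos + k]):
                k += 1
            if start + k == end:  # whole edge label matched: descend
                node, pos = child, pos + k
                continue
            # Mismatch inside the edge: split it at offset k.
            mid = next_node
            next_node += 1
            edges[key] = [start, start + k, mid]
            edges[(mid, text[start + k])] = [start + k, end, child]
            node, pos = mid, pos + k
    return edges


def write_tree(edges, fh):
    """Write each edge record to fh; return a hash table of key -> byte offset."""
    index = {}
    for key, (start, end, child) in edges.items():
        index[key] = fh.tell()
        fh.write(EDGE.pack(start, end, child))
    fh.flush()
    return index


def contains(pattern, text, index, fh):
    """Check whether pattern occurs in text by following edges read from disk."""
    node, pos = 0, 0
    while pos < len(pattern):
        offset = index.get((node, pattern[pos]))
        if offset is None:
            return False
        fh.seek(offset)
        start, end, child = EDGE.unpack(fh.read(EDGE.size))
        label = text[start:end]
        n = min(len(label), len(pattern) - pos)
        if label[:n] != pattern[pos:pos + n]:
            return False
        node, pos = child, pos + n
    return True


text = "GATTACA$"
with tempfile.TemporaryFile() as fh:
    index = write_tree(build_edges(text), fh)
    print(contains("TTAC", text, index, fh))
    print(contains("TAG", text, index, fh))
```

In this sketch only the hash table of offsets has to stay in memory during a query; each edge is fetched from the file with one seek and one fixed-size read. A disk-based construction for data of the size described above would also have to build the tree without holding all edges in memory, which this example does not attempt.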
Table of Contents
Abstract
1. Introduction
2. Proposed Method
2.1 Data structure
2.2 Storing Edges
2.3 Node Numbering Process
2.4 Storing a Hash Table
3. Experimentation and Analysis
4. Discussion and Conclusion
References
Keywords
suffix tree, large data sets, sequence analysis, genomic sequences
Authors
Hae-won Choi [ Department of Computer Engineering, Kyungwoon University, Korea ]
Myung-Chun Ryoo [ Department of Computer Engineering, Kyungwoon University, Korea ]
Joon-Ho Park [ Department of Computer Engineering, Kyungwoon University, Korea ]
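The Korea EA Society (한국EA학회) aims to apply and spread enterprise-wide architecture concepts and principles among private companies and government agencies in Korea, to improve the competitiveness and productivity of companies and government agencies through research in EA and related fields, the training of specialists, and policy recommendations, and to promote the advancement of Korea's knowledge-based industries.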