Page Header Logo
TEI of Athens eJournals

Efficient indexing methods in the data mining context

Nikolaos Kouiroukidis, Georgios Evangelidis

Abstract


The proliferation of applications manipulating enormous sizes of multidimensional vector data, that require indexing in order to support operations like kNN searching in an efficient manner, has revived the interest in multidimensional indexing. It is well established that as the dimensionality of data increases the performance of queries such as range queries and nearest neighbor (1NN or kNN) queries decreases leading to a problem described as “the curse of dimensionality”. In this paper we point out problems that arise when indexing multidimensional data in fixed dimensions, such as 8 or 16, because, usually, when dealing with data of higher dimensionality it is common to first apply dimensionality reduction techniques such as principal component analysis. Although there is a plethora of research papers proposing multidimensional indexing structures, most of them report empirical results with relatively small and low dimensionality datasets. We attempt a fair comparison of many state of the art indexing structures designed exclusively to index multi-dimensional points like the Hybrid tree, the iDistance and the P+tree. We include in our comparison the R*tree, a state of the art index designed both for multidimensional points and regions. It is an improvement of the well-known R-tree, and also has been revised and improved further recently. We compare the behavior of the indexes on kNN queries (for k=1 and k=10) with varying dataset sizes and dimensionality. Our goal is to determine the structure(s) that could be used efficiently in the area of data mining. We obtained the source code of the implementations of the related structures from the authors and the R-tree portal.

Keywords


Multidimensional Indexing, Data Mining

References


Beckmann, N., Kriegel, H.-P., Schneider, R., and Seeger, B. (1990). The R*-tree: An efficient and robust access method for points and rectangles. In Proc. SIGMOD..

Berchtold, S., Bohm, C., and Kriegel, H.-P. (1998). The pyramid- technique: Towards breaking the curse of dimensionality. In Proc. SIGMOD.

Chakrabarti, K. and Mehrotra, S. (1999). The hybrid tree: An index structure for high dimensional feature spaces. In Proc. International Conference on Data Engineering, pp. 322–331.

Lomet, D.B. and Salzberg, B. (1990). The hB-tree: a multiattribute indexing method with good guaranteed performance. ACM TODS, 15(4), pp. 625-658.

Ooi, B.C., Tan, K.L., Yu, C., and Bressan, S. (2000). Indexing the edge: a simple and yet efficient approach to high-dimensional indexing. In Proc. ACM PODS, pp. 166–174.

Zhang, R., Ooi, B.C., and Tan, K-L. (2004). Making the Pyramid Technique Robust to Query Types and Workloads. In Proc. ICDE, pp. 313-324.

Yu, C., Ooi, B.C., Tan, K.L., and Jagadish, H. (2001). Indexing the distance: an efficient method to knn processing. In Proc. International Conference on Very Large Data Bases, pp. 421–430.

Comer, D. (1979). The Ubiquitous B-Tree. ACM Computing Surveys, 11(2), pp. 121–137.

Sakurai, Y., Yoshikawa, M., Uemura, S., and Kojima, H. (2000). The A-tree: An Index Structure for High-Dimensional Spaces Using Relative Approximation. In Proc. International Conference on Very Large Data Bases, pp. 516-526.

Katayama, N. and Satoh, S. (1998). SR-tree: An index structure for nearest neighbor searching of high-dimensional point data. Systems and Computers in Japan, 29(6), pp. 59-73.

Nievergelt, J., Hinterberger, and H., Sevcik, K.C. (1984). The Grid File: An Adaptable, Symmetric Multikey File Structure. ACM Trans. Database Syst., 9(1), pp. 38-71.


Full Text: PDF

DOI: 10.18780/jiim.v1i1.3049

Refbacks

  • There are currently no refbacks.

The application for presenting electronic journals TEI developed within subproject 2 "electronic publishing service" the Act "Development Services Digital Library of TEI" and financed by the operational program "Digital Convergence", NSRF 2007-2013.