A COMPARATIVE STUDY OF HIERARCHICAL AND PARTITIONAL ALGORITHMS FOR CLUSTERING INDONESIAN-LANGUAGE TEXT DOCUMENTS

Authors

  • Amir Hamzah
  • Adhi Susanto
  • F Soesianto
  • Jazi Eko Istiyanto

DOI:

https://doi.org/10.34151/technoscientia.v0i0.1972

Keywords:

document clustering, hierarchical, partitional, clustering performance

Abstract

Text document clustering has been studied intensively because of its important role in text mining and information retrieval. In the vector space model, two main clustering approaches are commonly recognized: hierarchical algorithms and partitional algorithms. A hierarchical algorithm produces a deterministic result, known as a dendrogram, but its weakness is high time and memory complexity. A partitional algorithm, on the other hand, has linear time and memory complexity, although its clustering result depends on the initial clusters.
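The dependence of a partitional result on its initial clusters can be illustrated with a minimal sketch. The code below is not the authors' implementation: it is a plain K-Means on four hypothetical 2-D points (standing in for document vectors), and the function names `kmeans` and `sse` as well as the two seed sets are assumptions chosen for illustration.

```python
def kmeans(points, centroids, max_iter=100):
    """Plain K-Means with squared Euclidean distance (illustrative sketch)."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else c
            for members, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return clusters, centroids


def sse(clusters, centroids):
    """Sum of squared distances of points to their cluster centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, c))
        for members, c in zip(clusters, centroids)
        for p in members
    )


# Hypothetical data: corners of a wide rectangle.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Two different (assumed) initial centroid choices.
clusters_a, cents_a = kmeans(points, [(2.0, 0.0), (2.0, 1.0)])
clusters_b, cents_b = kmeans(points, [(0.0, 0.5), (4.0, 0.5)])

print(clusters_a, sse(clusters_a, cents_a))
print(clusters_b, sse(clusters_b, cents_b))
```

With the first seed set the algorithm converges to a split along the short side of the rectangle, a poorer local optimum; with the second it finds the tighter left/right split. This is why partitional methods are often run several times with different seeds.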

The aim of this research was to compare experimentally the performance of several hierarchical and partitional algorithms applied to text documents written in Bahasa Indonesia. Five similarity techniques, i.e. UPGMA, CSI, IST, SL and CL, were chosen from the hierarchical family, whereas K-Means, Bisecting K-Means and Buckshot were chosen from the partitional family. The test collections consisted of 200 to 800 Indonesian news texts that had been categorized manually, and clustering performance was measured using the F-measure. This value is derived from Recall and Precision and can be used to measure how correctly an algorithm classifies the collections. The results showed that Bisecting K-Means, a variant of the partitional approach, performed comparably with the two best hierarchical techniques, i.e. UPGMA and CL, but with much lower time complexity.
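As a rough illustration of how an F-measure can be derived from Recall and Precision for a clustering, the sketch below assumes the class-weighted formulation that is common in document clustering evaluation: for each manual category, take the best F value over all clusters, then average weighted by category size. The abstract does not state the exact formula used, so this formulation, the function name `clustering_f_measure`, and the sample labels are assumptions.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_ids):
    """Class-weighted F-measure of a clustering against manual categories.

    Assumed formulation (not confirmed by the abstract):
      recall(i, j)    = n_ij / n_i
      precision(i, j) = n_ij / n_j
      F(i, j)         = 2 * P * R / (P + R)
      F               = sum_i (n_i / n) * max_j F(i, j)
    """
    n = len(true_labels)
    class_sizes = Counter(true_labels)                # n_i
    cluster_sizes = Counter(cluster_ids)              # n_j
    overlap = Counter(zip(true_labels, cluster_ids))  # n_ij

    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = overlap[(cls, clu)]
            if n_ij == 0:
                continue  # no shared documents, F(i, j) is 0
            recall = n_ij / n_i
            precision = n_ij / n_j
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total


# Hypothetical example: six news items, three manual categories, three clusters.
labels = ["sport", "sport", "sport", "politics", "politics", "economy"]
clusters = [0, 0, 1, 1, 1, 2]
score = clustering_f_measure(labels, clusters)
print(round(score, 4))
```

A clustering that reproduces the manual categories exactly scores 1.0 under this formulation; misplaced documents lower the best-match F value of the affected categories.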


Published

22-02-2019

How to Cite

Hamzah, A., Susanto, A., Soesianto, F., & Istiyanto, J. E. (2019). STUDI KOMPARASI ALGORITMA HIERARCHICAL DAN PARTITIONAL UNTUK CLUSTERING DOKUMEN TEKS BERBAHASA INDONESIA. JURNAL TEKNOLOGI TECHNOSCIENTIA, 11–21. https://doi.org/10.34151/technoscientia.v0i0.1972