THE EFFECT OF ADDING PHRASES TO WORD FEATURES FOR TEXT DOCUMENT CLUSTERING
Keywords: Word-based clustering, Phrase-based clustering, Clustering performance
Text document clustering has been studied intensively because of its important role in text mining and information retrieval. Word-based clustering with the vector space model routinely suffers from high dimensionality, caused by the large number of distinct words. Although extracting words in the preprocessing phase is simple, a collection can be viewed not only as a set of words but also as a set that partly consists of multi-word phrases. Splitting a phrase into its component words can destroy the meaning of the phrase, so to preserve the context of the words a phrase must be kept as a phrase. The assumption is that adding phrases to words as clustering features will improve performance. This paper compares word-based and phrase-based clustering. Two clustering models were chosen: hierarchical and partitional. Four similarity techniques (Group Average, Complete Link, Single Link, and Cluster Center) were tried for hierarchical clustering, and K-Means, Bisecting K-Means, and Buckshot for partitional clustering. A document collection of 200-800 manually categorized news texts was used to test these algorithms, with the F-measure as the criterion of clustering performance. This value is derived from recall and precision and measures how well the algorithms correctly classify the collection. The results show that adding phrases, or simply word pairs, slightly improves clustering performance, although the improvement is not statistically significant.
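The two ideas in the abstract, adding word pairs to the word features and scoring a clustering with an F-measure built from recall and precision, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, adjacent word pairs stand in for phrases, and the F-measure uses a common class-size-weighted best-match definition that the abstract does not spell out.

```python
from collections import Counter


def word_and_pair_features(tokens):
    """Count single words plus adjacent word pairs.

    Assumption: adjacent pairs are a simple stand-in for phrases.
    """
    pairs = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return Counter(tokens) + Counter(pairs)


def clustering_f_measure(classes, clusters):
    """Class-size-weighted best-match F-measure.

    Assumption: for each manual class, take the best F over all clusters,
    then average those values weighted by class size.
    """
    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    overlap = Counter(zip(classes, clusters))

    total = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = overlap[(c, k)]
            if n_ck == 0:
                continue  # no shared documents, so F is 0 for this pair
            recall = n_ck / n_c
            precision = n_ck / n_k
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_c / n) * best
    return total


if __name__ == "__main__":
    # Hypothetical tokens from one news text.
    print(word_and_pair_features(["bank", "indonesia", "naik"]))
    # Hypothetical manual categories versus cluster assignments.
    print(clustering_f_measure(["a", "a", "b", "b"], [0, 0, 1, 1]))
```

Under this definition a clustering that reproduces the manual categories exactly scores 1, and placing every document in a single cluster lowers the score because precision drops.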