Research Article

A New Method to Measure Clustering Performance and its Evaluation for Text Clustering

Year 2021, Issue: 27, 53 - 65, 30.11.2021
https://doi.org/10.31590/ejosat.932938

Abstract

This study proposes an alternative method for measuring clustering performance. To test the consistency of the proposed method, two data sets of Wikipedia abstracts were clustered with the k-Means, k-Medoids, and CLARANS methods, and performance was measured with both the proposed method and existing methods. The first data set, which contains only English abstracts, was tested by dividing it into different numbers of clusters. Because nothing was known in advance about the content of these abstracts, the internal methods Silhouette, Calinski-Harabasz, and Davies-Bouldin were used to evaluate how accurately they were clustered. The second data set, which contains Wikipedia abstracts in 6 languages, was divided into 6 clusters in order to group the abstracts by language. Because the language of each abstract in this data set is known beforehand, clustering success could be measured with both internal and external methods. Data compression algorithms are known to compress a file of similar texts better than a file of dissimilar texts, so compression ratio is proposed as an alternative evaluation metric. The proposed Compression Ratio Index (CRI), which can be calculated much faster than internal methods such as the Silhouette, Calinski-Harabasz, and Davies-Bouldin indexes, was tested with 4 compression algorithms and yielded the same results as the 9 external methods used on the second data set.
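The compression idea behind the CRI can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact CRI formula or name the 4 compression algorithms, so the definition used here (total compressed size of the per-cluster text blobs divided by total uncompressed size) and the three standard-library compressors are assumptions made for illustration. The names `COMPRESSORS` and `compression_ratio` are hypothetical.

```python
import bz2
import lzma
import zlib

# Stand-in compressors from the Python standard library (an assumption;
# the abstract does not name the four algorithms that were tested).
COMPRESSORS = {
    "deflate": zlib.compress,
    "bzip2": bz2.compress,
    "lzma": lzma.compress,
}


def compression_ratio(clusters, compress=zlib.compress):
    """Return total compressed size / total raw size over all clusters.

    Each cluster is a list of strings. The texts of one cluster are joined
    into a single blob and compressed together, so a cluster of mutually
    similar texts should shrink more. Under this assumed definition a LOWER
    value suggests a better clustering; the paper's CRI may be defined
    differently.
    """
    raw_bytes = 0
    packed_bytes = 0
    for texts in clusters:
        if not texts:
            continue  # an empty cluster contributes nothing
        blob = "\n".join(texts).encode("utf-8")
        raw_bytes += len(blob)
        packed_bytes += len(compress(blob))
    if raw_bytes == 0:
        raise ValueError("clusters contain no text")
    return packed_bytes / raw_bytes


if __name__ == "__main__":
    # Toy data: two candidate clusterings of the same four short texts.
    en = ["the cat sat on the mat", "the dog sat on the log"]
    tr = ["kedi minderin üstünde oturdu", "köpek kütüğün üstünde oturdu"]
    by_language = [en, tr]
    mixed = [[en[0], tr[0]], [en[1], tr[1]]]
    for name, fn in COMPRESSORS.items():
        print(name,
              compression_ratio(by_language, fn),
              compression_ratio(mixed, fn))
```

Both candidate clusterings contain the same texts, so their raw sizes are equal and only the compressed sizes differ. On inputs as short as the toy data, fixed compressor overhead can dominate, so the comparison is only meaningful on realistically sized clusters.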

References

  • Abdalgader, K. (2017). Clustering Short Text using a Centroid-Based Lexical Clustering Algorithm. IAENG International Journal of Computer Science, 44(4).
  • Alakuijala, J., Szabadka, Z. (2016). Brotli Compressed Data Format. Internet Engineering Task Force (IETF), RFC 7932, ISSN: 2070-1721
  • Bolshakova, N., & Azuaje, F. (2003). Cluster validation techniques for genome expression data. Signal processing, 83(4), 825-833.
  • Burrows, M., Wheeler, D. J. (1994). A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation.
  • Caliński, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics-theory and Methods, 3(1), 1-27.
  • Cleary, J., & Witten, I. (1984). Data compression using adaptive coding and partial string matching. IEEE transactions on Communications, 32(4), 396-402.
  • Davies, D. L., & Bouldin, D. W. (1979). A cluster separation measure. IEEE transactions on pattern analysis and machine intelligence, (2), 224-227.
  • Deutsch, P. (1996). DEFLATE Compressed Data Format Specification. version 1.3, RFC 1951 doi:10.17487/RFC1951.
  • Dinçer, Ş. E. (2006). Veri madenciliğinde K-means algoritması ve tıp alanında uygulanması (Master's thesis, Kocaeli Universitesi, Fen Bilimleri Enstitusu)
  • Erdinç, U., Erdoğan, C., & Saygılı, A. (2016). Hiyerarşik Kümeleme Modeli Kullanan Web Tabanlı Bir Ödev Değerlendirme Sistemi. Ejovoc (Electronic Journal of Vocational Colleges), 6(3), 87-98.
  • Ghufron, G., Surarso, B., & Gernowo, R. (2020). The Implementations of K-medoids Clustering for Higher Education Accreditation by Evaluation of Davies Bouldin Index Clustering. Jurnal Ilmiah KURSOR, 10(3).
  • Hacıoğlu H., K. (2016). Kümeleme Analizinde Kullanılan Bazı Benzerlik İndekslerinin Karşılaştırılması. Yüksek Lisans Tezi, Gazi Üniversitesi, Fen Bilimleri Enstitüsü, 98.
  • Brümmer, M. (2015). The DBpedia abstract corpus. http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/
  • Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of classification, 2(1), 193-218.
  • Jaccard, P. (1901). Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles. 37: p. 241-272.
  • Ketchen, D. J., & Shook, C. L. (1996). The application of cluster analysis in strategic management research: an analysis and critique. Strategic management journal, 17(6), 441-458.
  • Kresse, W. and Danko, D.M. (2012). Springer Handbook of Geographic Information. Springer-Verlag, Berlin.
  • Leavline, E. J., & Singh, D. A. A. G. (2013). Hardware implementation of LZMA data compression algorithm. International Journal of Applied Information Systems (IJAIS), 5(4), 51-56.
  • Mesut, A. (2006). Veri Sıkıştırmada Yeni Yöntemler. Trakya Üniversitesi, Fen Bilimleri Enstitüsü, Doktora Tezi.
  • Ni, X., Quan, X., Lu, Z., Wenyin, L., & Hua, B. (2011). Short text clustering by finding core terms. Knowledge and information systems, 27(3), 345-365.
  • Petrovic, S. (2006, October). A Comparison Between the Silhouette Index and the Davies-Bouldin Index in Labelling IDS Clusters. In Proceedings of the 11th Nordic Workshop of Secure IT Systems (Vol. 2006, pp. 53-64).
  • Psalmerosi, F. H. (2019). Applying Text Mining and Machine Learning to Build Methods for Automated Grading (Master's thesis, University of Twente).
  • Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association, 66(336), 846-850.
  • Rangrej, A., Kulkarni, S., & Tendulkar, A. V. (2011, March). Comparative study of clustering techniques for short text documents. In Proceedings of the 20th international conference companion on World wide web (pp. 111-112). ACM.
  • Rosenberg, A., & Hirschberg, J. (2007, June). V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL) (pp. 410-420).
  • Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20, 53-65.
  • Santos, J. M., & Embrechts, M. (2009, September). On the use of the adjusted rand index as a metric for evaluating supervised classification. In International conference on artificial neural networks (pp. 175-184). Springer, Berlin, Heidelberg.
  • Selvi, C. K., Kogilavani, S. V., & Jayaprakash, S. M. D. (2018). Short Text Segmentation for Improved Query Processing. IJRASET. ISSN: 2321-9653; IC Value: 45.98;2719-2724.
  • Shkarin, D. (2002, April). PPM: One step to practicality. In Proceedings DCC 2002. Data Compression Conference (pp. 202-211). IEEE.
  • Shrestha, P., Jacquin, C., & Daille, B. (2012, March). Clustering short text and its evaluation. In International Conference on Intelligent Text Processing and Computational Linguistics (pp. 169-180). Springer, Berlin, Heidelberg.
  • Silahtaroğlu, G. (2016). Veri madenciliği: Kavram ve algoritmaları. Papatya.
  • Starczewski, A., & Krzyżak, A. (2015, June). Performance evaluation of the Silhouette index. In International Conference on Artificial Intelligence and Soft Computing (pp. 49-58). Springer, Cham.
  • Şenol, A., & Karacan, H. (2018). Akan Veri Kümeleme Teknikleri Üzerine Bir Derleme. Avrupa Bilim ve Teknoloji Dergisi, (13), 17-30.
  • Tengilimoğlu E., Öztürk, Y., (2019). Metin madenciliği yöntemleri ile online yorumların kümelenmesi: Bakü otelleri örneği. 5. International Congress of Social Science, Skopje/Macedonia, 595-608.
  • Thinsungnoen, T., Kaoungku, N., Durongdumronchai, P., Kerdprasop, K., & Kerdprasop, N. (2015). The clustering validity with Silhouette and sum of squared errors. learning, 3(7).

Kümeleme Performansını Ölçmek için Yeni Bir Yöntem ve Metin Kümeleme için Değerlendirmesi

Abstract

This study proposes an alternative method that can be used to measure clustering performance. To test the consistency of the proposed method, clusterings were produced with the k-Means, k-Medoids, and CLARANS methods on two data sets of Wikipedia article abstracts, and performance measurements were calculated with both the proposed method and existing methods. The first data set, which contains only English abstracts, was tested by dividing it into different numbers of clusters. Because nothing was known in advance about the content of the abstracts, the internal Silhouette, Calinski-Harabasz, and Davies-Bouldin indexes were used to evaluate how accurately they were clustered. The second data set, which contains Wikipedia abstracts in 6 languages, was divided into 6 clusters so that the abstracts would be grouped by language. Because the language of each text in this data set is known beforehand, clustering success could be measured with both internal and external methods. Since data compression algorithms are known to compress a file of similar texts better than a file of dissimilar texts, compression ratio is proposed as an alternative evaluation criterion. The proposed Compression Ratio Index (Sıkıştırma Oranı İndeksi, SOİ), which can be calculated much faster than internal methods such as the Silhouette, Calinski-Harabasz, and Davies-Bouldin indexes, was tested with 4 compression algorithms and gave the same results as the 9 external methods used on the second data set.

There are 33 citations in total.

Details

Primary Language: Turkish
Subjects: Engineering
Journal Section: Articles
Authors

Murat Aslanyürek 0000-0002-3296-4395

Altan Mesut 0000-0002-1477-3093

Early Pub Date: July 29, 2021
Publication Date: November 30, 2021
Published in Issue: Year 2021, Issue: 27

Cite

APA: Aslanyürek, M., & Mesut, A. (2021). Kümeleme Performansını Ölçmek için Yeni Bir Yöntem ve Metin Kümeleme için Değerlendirmesi. Avrupa Bilim ve Teknoloji Dergisi, (27), 53-65. https://doi.org/10.31590/ejosat.932938