Abstract:
This paper introduces three new text classification
methods: Clustering-Based Generalised Instances Set (CB-GIS),
Multilevel Clustering-Based Generalised Instances Set
(MLC-GIS) and Multilevel Clustering-Based k Nearest Neighbours
(MLC-kNN). These new methods aim to combine the
strengths and overcome the drawbacks of three similarity-based
text classification methods, namely kNN, centroid-based
and GIS. The new methods utilise a clustering technique called
spherical K-means to represent each class by a representative set
of generalised instances to be used later in the classification. The
CB-GIS method applies a flat clustering method while MLC-GIS
and MLC-kNN apply multilevel clustering. Extensive experiments
have been conducted to evaluate the new methods and
compare them with kNN, centroid-based and GIS classifiers on
the Reuters-21578(10) benchmark dataset. The evaluation has
been performed in terms of the classification performance and the
classification efficiency. The experimental results show that the
top-performing classification method is the MLC-kNN classifier,
followed by the MLC-GIS and CB-GIS classifiers. According to
the best micro-averaged F1 scores, the new methods (CB-GIS,
MLC-GIS, MLC-kNN) achieve improvements of 4.48%, 4.65% and
4.76% over kNN, 1.84%, 1.92% and 2.12% over the centroid-based
classifier, and 5.26%, 5.34% and 5.45% over GIS, respectively. With
respect to the best macro-averaged F1 scores, the new methods
(CB-GIS, MLC-GIS, MLC-kNN) achieve improvements of 10.29%,
10.19% and 10.45% over kNN, 0.1%, 0.03% and
0.29% over the centroid-based classifier, and 3.75%, 3.68% and 3.94% over
GIS, respectively.
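
The idea of representing each class by a set of generalised instances obtained with spherical K-means can be illustrated with a minimal sketch. This is not the authors' implementation: it covers only a flat, CB-GIS-like scheme as described above, and the function names, the per-class cluster count k, the random initialisation and the fixed iteration count are all assumptions made for illustration. The multilevel variants are not shown.

```python
import numpy as np

def _unit(X):
    """Scale each row to unit L2 norm (all-zero rows are left as zeros)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Cluster rows of X by cosine similarity; return k unit-length centroids."""
    X = _unit(X)
    rng = np.random.default_rng(seed)
    # Assumed initialisation: k distinct documents picked at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # On unit vectors the dot product equals the cosine similarity.
        labels = np.argmax(X @ centroids.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.sum(axis=0)
        centroids = _unit(centroids)
    return centroids

def fit_generalised_instances(X, y, k):
    """Represent each class by up to k centroids (its generalised instances)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    instances, instance_labels = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        cents = spherical_kmeans(Xc, min(k, len(Xc)))
        instances.append(cents)
        instance_labels.extend([c] * len(cents))
    return np.vstack(instances), np.array(instance_labels)

def predict(X, instances, instance_labels):
    """Assign each row of X the class of its most similar generalised instance."""
    return instance_labels[np.argmax(_unit(X) @ instances.T, axis=1)]
```

In this sketch a document is labelled by its single most similar generalised instance. With k = 1 it reduces to a plain centroid-based classifier, and as k approaches the class size it behaves like 1-nearest-neighbour over the original documents.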