Once Upon a Time in the Lab
Our ability to process textual data has improved rapidly with the introduction of language models such as word2vec and GloVe, which enable symbolic text to be transformed into a more flexible numeric representation. Yet, the nature of some problems is that there will never be enough domain or context-specific textual data to train such models. One of our projects at DSTIL involves analytics on the task descriptions created during software projects. Our datasets are small and contain very specialised language – exactly the opposite of what is required to apply contemporary textual analysis approaches. However, there is an algorithm for which these constraints are ideal: k-medoids. By clustering elements based on only their quantified distance to each other, this algorithm can show us the main concepts in a project and how much project effort has been directed to each one. This post describes all aspects of the algorithm – how it works, how to analyse the results, and its characteristics – to prepare you for encounters with similar problems and datasets.
A Fistful of Medoids
k-medoids is a partitioning algorithm that divides the dataset into non-overlapping clusters . Each cluster is represented by a medoid, an exemplar of the cluster’s elements. Elements are assigned to the cluster with the medoid to which they are most similar, where similarity is quantified by a metric function. The metric enables the algorithm to optimise clusters after the assignment step, ensuring that the most representative element of each cluster becomes its medoid. One interpretation of the algorithm  is as follows:
- Select k number of elements to serve as medoids.
- Assign all other elements to the cluster with the most similar medoid.
- Reassign each cluster’s medoid the element that has the lowest sum difference (calculated using the metric function) to its cluster neighbours.
- Repeat Steps 2 and 3 until the sum difference of medoids to their cluster elements, summed across all clusters, does not decrease.
For the code-inclined, a Haskell implementation and a guide to implementing the algorithm yourself can be found here. To demonstrate the power of this approach, let’s apply it to the first lines of our commit messages of one of our larger DSTIL projects. This dataset has all of the problems that we have discussed: it is small (851 messages containing a total of 4035 non-stopword tokens); contains technical, domain and team-specific language, and is inherently noisy as it is natural language. The k-medoids algorithm gives the breakdown of clusters by size shown in Figure 1, and tells us that the following are the five main concepts in that project:
- Medoid: “Fix docker instructions to improve clarity” (137 elements)
- Keywords: improve, models, fix, worker
- Medoid: “Add joined-w2v-adapter” (129 elements)
- Keywords: add, script, class, models
- Medoid: “Update README” (75 elements)
- Keywords: update, readme, add, comments
- Medoid: “Correct exception name” (60 elements)
- Keywords: correct, name, function, error
- Medoid: “Remove docker-compose-staging.yml” (58 elements)
- Keywords: remove, unnecessary, function, functionality
The Wicketkeeper project involved worker machines executing deep learning models in a Docker-based setup, and development involved much prototyping and experimentation with various implementations. This is reflected in the clusters identified by the k-medoids algorithm.
Fig. 1: Size of clusters generated by the k-medoids clustering algorithm applied to the commit messages of DSTIL’s Wicketkeeper project.
For A Few Medoids More
Effective use of the k-medoids algorithm means selecting the appropriate knumber of clusters. k must be high enough to ensure that all of the valuable ‘groups’ of elements hidden in the dataset can be identified, but not so high that there are few elements per cluster thus losing the power of aggregation and abstraction that clustering provides. Ideally, clusters should be cohesive, containing elements that are similar to each other, and distinct, meaning different from every other cluster. One way of measuring this is the silhouette method, which indicates how appropriately an element was clustered. The silhouette value s of element i is calculated as follows:
s(i) = (b(i) – a(i)) / max(a(i), b(i))
where a(i) is the mean distance of i to its cluster neighbours and b(i) is the mean distance of i to the elements of the next most similar cluster, also calculated by mean distance. If s(i) is positive, i is in the most appropriate cluster. If s(i) is negative, i should be in the next most similar cluster. If s(i) is zero then i is equally appropriate to its current and next most similar clusters. By summing the silhouette of all elements we can gain insight into the overall quality of a cluster, and use this to search for the optimal value of k. Figure 2 presents the mean quality of configurations for various k on our dataset: the peak at k=15indicates that that is the optimal number of clusters, not k=25 as assumed in the initial application.
Fig. 2: Configuration silhouette values for various k when clustering the commit messages of DSTIL’s Wicketkeeper project using the k-medoids algorithm.
The Good, The Bad and the Medoids
For our example dataset the k-medoids algorithm is ideal, but whether it is appropriate for your use case may depend on a number of factors. The algorithm has the following advantages, disadvantages and quirks:
- k-medoids is less sensitive to noise and outliers because it minimises the absolute distance between elements, rather than the squared distance (used in e.g. k-means). This approach limits the extent to which outliers can influence the ‘centre’ of clusters.
- The algorithm is relatively simple and intuitive, and the only source of randomness is the selection of initial medoids. This makes it relatively simple to implement, maintain and control compared to many other machine learning algorithms.
- k-medoids is a greedy algorithm that assumes that the best medoids will be found by optimising the medoids of each cluster. It does not follow that the best medoids overall will be found. Multiple clustering attempts may be necessary to find adequate cluster configurations.
- The algorithm, with complexity O(k(n-k)2), is much more computationally expensive than many other clustering algorithms. If the number of clusters (k) and elements (n) are high, using the algorithm may not be viable (although memoising the metric function can also be extremely effective).
- Medoids themselves may not be very informative, as in our example application. Some ‘summarisation’ technique may be necessary to gain insight into what the elements of a cluster collectively represent.
If the disadvantages of k-medoids are too severe for your use case, consider the similar k-medians  and medoidshift  algorithms.
The Colossus of Clustering
Like our project management example, in which a dataset of specialised language grows slowly over time, some datasets will forever remain small and continue to be difficult to transform into more flexible representations. When faced with such problems, consider the k-medoids algorithm. As this post has demonstrated by the commit message clustering example, the algorithm is able to provide a surprising degree of insight into challenging datasets. k-medoids has been our first step to taking the wild out of the wild west of project management data.
Further reading: Data Management for Data Scientists
This is a featured blog article from the Deakin Software and Technology Innovation Lab DSTIL blog.
- Kaufman, L., Rousseeuw, P.J., 1987. Clustering by means of medoids, in Statistical Data Analysis Based on the L1–Norm and Related Methods, ed. Y. Dodge, North-Holland, pp. 405-416.
- Park, H.S., Jun, C.H., 2009. A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2), pp. 3336–3341.
- Bradley, P.S, Mangasarian, O.L., Street, W.N., 1997. Clustering via concave minimization, Advances in Neural Information Processing Systems, 10, pp. 368-374.
- Sheikh, Y.A., Khan, E.A., Kanade, T., 2007. Mode-seeking by Medoidshifts, Computer Vision, 11, pp. 1-8.