Introduction
Vegetation classification is the process of delimiting types of vegetation on the basis of their relative homogeneity and distinctness from other types (van der Maarel & Franklin 2013). Classification facilitates not only the description of vegetation, but also the study of its relationships with the environment and attendant interacting, co-dependent organisms. Vegetation classification is thus a first step to the classification of ecosystems (sensu Tansley 1935), and vegetation typologies have come to underpin a wide variety of conservation and natural resource management applications, including the selection of protected areas, ecosystem risk assessment and market-based mechanisms such as biodiversity offsets (Bland et al. 2019). Despite a relatively short history, the science has spawned a wide range of schools (Whittaker 1978; van der Maarel & Franklin 2013). Increasingly, however, vegetation classification centres on the clustering of quantitative plot samples (De Cáceres et al. 2015, 2018). When recorded with systematic procedures, plot samples have the advantage of allowing observations from different sources to be consolidated over time, while computer-generated clustering solutions confer a degree of objectivity in the elucidation of patterns.
The utility of clustering in the development of vegetation classifications is beyond question, although it is complicated by three inter-related problems. First, excepting simulated datasets, there is no agreed external point of reference against which clustering solutions can be compared. Instead, solutions based on field data must be evaluated on internal criteria (Aho et al. 2008), either geometric (eg cluster homogeneity) or non-geometric (eg species/cluster fidelity). Since these criteria vary in the way they weight particular characteristics of the solution, the best clustering solution may depend on its application. Second, the hyper-spatial structure of vegetation data is generally unknown. The choice of both clustering algorithm and evaluation metrics therefore requires a user-supplied model, which usually (but not invariably; Aho et al. 2008) means clusters are assumed to be spheroidal, if only because the majority of operators default to a few well-tested algorithms (Kent 2011). This is problematic, because algorithms which seek to optimise central tendency can generate sub-optimal solutions when applied to data with irregular structure, and internal metrics which assume a spheroidal model may not be appropriate measures of cluster quality. Third, biases in both the geographic and environmental distribution of samples mean that cluster metrics are often optimised for data which sample the range of floristic variation either unevenly or incompletely. That is, biases may induce irregularities in data structure even if assemblages in the field form a continuum. It is not surprising, then, that clustering solutions are notoriously idiosyncratic and highly sensitive to data structure, transformations, and choices of algorithm and resemblance measure (Tichy et al. 2014).
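As a concrete illustration of a geometric internal criterion, the sketch below computes the mean silhouette width of a clustering solution from Bray–Curtis dissimilarities. The plot-by-species matrix and cluster labels are simulated purely for illustration, and silhouette width is offered only as one example of this class of metric, not as the criterion used in this study.

```python
# Minimal sketch of a geometric internal criterion: mean silhouette width
# computed on Bray-Curtis dissimilarities. The matrix `X` (plots x species)
# and the vector `labels` are hypothetical inputs, not data from this study.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.poisson(lam=2.0, size=(60, 25)).astype(float)   # simulated abundances
labels = rng.integers(0, 3, size=60)                     # simulated cluster labels

D = squareform(pdist(X, metric="braycurtis"))            # pairwise dissimilarities
score = silhouette_score(D, labels, metric="precomputed")
print(f"mean silhouette width: {score:.3f}")
```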
The potential limitations of applying a spheroidal model to data of irregular structure are illustrated in Figure 1. The data are points on a Cartesian plane, normally and randomly distributed around each of six pre-defined centroids. The k-means algorithm fails to retrieve the underlying data structure: in i) it incorrectly splits cluster C while merging clusters D and F; and in ii) it incorrectly splits clusters C and F, which are partially merged with clusters A and D, respectively. The resulting solutions appear to be what Barton et al. (2019) termed ‘unnatural’, although they conceded the vagueness (sensu Regan et al. 2002) of the assignation, relying as it does on an appeal to the human eye. Less subjectively, the solution is ‘incorrect’, for example in Figure 1(ii) in assigning samples that are co-located in space in the region of centroid C to different groups, while drawing in remotely-located samples from the region of centroid A. The implication is that there is a high likelihood of alternative solutions arising as further data are added, or if the clustering algorithm is changed or supplied with different parameters.
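The behaviour illustrated in Figure 1 can be reproduced in outline with the sketch below, which draws points around six centroids with deliberately uneven spread and sample size and then partitions them with k-means. The centroid positions, sample sizes and spreads are illustrative assumptions, not the values used to construct Figure 1.

```python
# Sketch in the spirit of Figure 1: points drawn around six centroids with
# unequal spread and density, then partitioned with k-means (k = 6). The
# coordinates and sample sizes are illustrative assumptions only.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
centroids = np.array([[0, 0], [8, 0], [4, 6], [0, 10], [8, 10], [14, 5]])
sizes     = [200, 50, 300, 80, 80, 40]       # uneven sampling density
spreads   = [0.8, 0.8, 2.5, 1.0, 1.0, 0.8]   # one deliberately diffuse cluster

points = np.vstack([
    rng.normal(loc=c, scale=s, size=(n, 2))
    for c, n, s in zip(centroids, sizes, spreads)
])
true_labels = np.repeat(np.arange(6), sizes)

km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(points)
# Even with k set to the true number of groups, uneven density typically
# leads k-means to split the diffuse cluster and/or merge sparse neighbours.
```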
The problem illustrated in Figure 1 arises primarily from the insensitivity of the algorithm to variations in the density of points; however, a failure to recover ‘natural’ or ‘correct’ clusters of irregular shape has similarly been documented for a wide range of algorithms operating on assumptions of central tendency (Karypis 1999; Zhao & Karypis 2005; Han et al. 2012; Barton et al. 2019). The core principle underpinning algorithms which seek to retrieve clusters of irregular shape and/or density is sample inter-connectivity. That is, cluster membership depends on interconnections among samples (based on pairwise similarity) rather than shared proximity to an artificial centroid or medoid. Schmidtlein et al. (2010), for example, noted that two vegetation samples with no species in common could nevertheless share cluster membership provided they were connected by a chain of close neighbours. This implies that clusters generated by an algorithm sensitive to irregular data structure are likely to be more heterogeneous than those derived with reference to a spheroidal model, particularly at thematic scales where discontinuities and variation in sample density exist.
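The chaining behaviour described by Schmidtlein et al. (2010) can be illustrated with a minimal sketch: samples are linked to their k nearest neighbours and clusters are read off as the connected components of the resulting graph, so two samples with no species in common may still share a cluster via intermediate neighbours. This is a toy demonstration of the inter-connectivity principle, not an implementation of any of the algorithms cited above.

```python
# Toy illustration of inter-connectivity: link each sample to its k nearest
# neighbours (by Bray-Curtis dissimilarity) and treat connected components of
# the graph as clusters. The plot-by-species counts are simulated.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(1)
X = rng.poisson(1.0, size=(40, 15)).astype(float)    # hypothetical plot x species counts

D = squareform(pdist(X, metric="braycurtis"))        # pairwise dissimilarities
knn = kneighbors_graph(D, n_neighbors=5, metric="precomputed", mode="connectivity")
knn = knn.maximum(knn.T)                              # symmetrise the neighbour graph

n_clusters, labels = connected_components(knn, directed=False)
print(n_clusters, labels)
```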
Potential irregularities in data structure are rarely accounted for in vegetation classification. Schmidtlein et al. (2010) documented a promising approach; however, our investigations of their ISOMAP algorithm suggested that its “brute force” approach is too computationally demanding for a dataset of many thousands of samples (Schmidtlein et al. (2010) investigated datasets of up to 305 samples and warned users that the algorithm is slow). Chameleon (Karypis et al. 1999; see Methods for a detailed description) is one of several alternative algorithms designed to recover clusters of variable shape and may, therefore, reproduce landscape-scale relationships more faithfully than traditional clustering techniques (Han et al. 2012). Chameleon assesses both the interconnectivity and the closeness of objects when determining merging decisions, an approach which results in fewer “wrong” decisions than algorithms that consider only one or the other (Karypis et al. 1999). Focussing on interconnectivity allows the algorithm to adapt automatically to the characteristics of the clusters (density and hyperspatial distribution), rather than relying on a static model (eg discrete spherical clusters). Therefore, provided they are strongly interconnected, samples spanning a compositional continuum can be retrieved as a single cluster even if the distribution of samples along the continuum is uneven, because Chameleon is relatively insensitive to variations in hyperspatial density (Han et al. 2012).
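For orientation, the sketch below shows a simplified form of Chameleon's merging score as described by Karypis et al. (1999), which combines relative interconnectivity (RI) and relative closeness (RC), commonly as RI × RC^α. The internal edge-cut statistics are assumed here to be precomputed from a min-cut bisection of each sub-cluster, and the numbers in the example are invented; this is not a full implementation of the algorithm (see Methods).

```python
# Simplified sketch of Chameleon's merging score (Karypis et al. 1999).
# The internal edge-cut terms (from a min-cut bisection of each sub-cluster)
# are assumed to be precomputed; the example values are made up.

def merge_score(cut_ij, cut_i, cut_j,       # summed edge weights: between i and j, and within the bisections of i and j
                sbar_ij, sbar_i, sbar_j,    # corresponding mean edge weights
                n_i, n_j, alpha=2.0):
    # Relative interconnectivity: weight of connecting edges, normalised by
    # the average internal (bisection) edge-cut weight of the two sub-clusters.
    ri = cut_ij / ((cut_i + cut_j) / 2.0)
    # Relative closeness: mean weight of connecting edges, normalised by a
    # size-weighted average of the mean internal bisection edge weights.
    rc = sbar_ij / ((n_i / (n_i + n_j)) * sbar_i + (n_j / (n_i + n_j)) * sbar_j)
    return ri * rc ** alpha

# Example with invented graph statistics for two candidate sub-clusters:
print(merge_score(cut_ij=12.0, cut_i=20.0, cut_j=18.0,
                  sbar_ij=0.55, sbar_i=0.60, sbar_j=0.58,
                  n_i=40, n_j=35))
```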
We suggest that a failure to take account of the underlying structure of vegetation data is likely to be one factor contributing to idiosyncrasies among clustering solutions; however, the effect is likely to depend on the expression and nature of discontinuities in the data structure. We postulate that accounting for data structure is more likely to be important at broad thematic scales (as represented by the points in Figure 1 collectively), because discontinuities are likely to arise naturally (eg between regions which share few species), through variable data coverage (De Cáceres et al. 2018; Gellie et al. 2018), or because environmental gradients are discontinuous in geographic space (Austin 2013). Conversely, there may be no disadvantage in assuming a spheroidal model where clustering essentially amounts to partitioning a continuum (ie partitioning the individual clusters in Figure 1). In this paper, we investigate two hypotheses: i) that an algorithm sensitive to hyperspatial irregularities in the density and arrangement of samples will produce clusters which are likely to be more ‘correct’ (in the sense that samples are co-located with their close neighbours), but at the cost of poorer internal metrics relative to algorithms that seek to optimise around central tendency; and ii) that differences between the respective algorithms will decline with decreasing thematic scale of cluster solutions. To test these hypotheses, we used a large regional dataset of 7541 plot samples to compare the performance of traditional clustering algorithms (k-means, hierarchical agglomerative and divisive) with the Chameleon algorithm, using both internal metrics (homogeneity, indicator species) and the concept of ‘correctness’, which we apply as the misclassification rate: the proportion of samples which do not cluster with their nearest neighbour.
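For clarity, the misclassification rate used here as our measure of ‘correctness’ can be computed as in the sketch below, where D is a square matrix of pairwise dissimilarities and labels is a vector of cluster assignments; the inputs in the usage example are random placeholders, not data from this study.

```python
# Sketch of the misclassification rate: the proportion of samples whose
# nearest neighbour (by the chosen dissimilarity) falls in a different cluster.
import numpy as np

def misclassification_rate(D, labels):
    D = np.asarray(D, dtype=float).copy()
    labels = np.asarray(labels)
    np.fill_diagonal(D, np.inf)            # exclude self-matches
    nearest = D.argmin(axis=1)             # index of each sample's nearest neighbour
    return np.mean(labels != labels[nearest])

# Toy usage with random dissimilarities and labels:
rng = np.random.default_rng(0)
A = rng.random((10, 10))
D = (A + A.T) / 2                          # symmetric placeholder dissimilarity matrix
print(misclassification_rate(D, rng.integers(0, 3, size=10)))
```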