Introduction
Vegetation classification is the process of delimiting types of
vegetation on the basis of their relative homogeneity and distinctness
from other types (van der Maarel & Franklin 2013). Classification
facilitates not only the description of vegetation, but also the study
of its relationships with the environment and attendant interacting,
co-dependent organisms. Vegetation classification is thus a first step
to the classification of ecosystems (sensu Tansley 1935), and
vegetation typologies have come to underpin a wide variety of
conservation and natural resource management applications including the
selection of protected areas, ecosystem risk assessment and market-based
mechanisms such as biodiversity offsets (Bland et al. 2019).
Despite a relatively short history, the science has spawned a wide range
of schools (Whittaker 1978, van der Maarel & Franklin 2013).
Increasingly, however, vegetation classification centres on the
clustering of quantitative plot samples (De Cáceres et al. 2015;
2018). When recorded with systematic procedures, plot samples have the
advantage of allowing observations from different sources to be
consolidated over time, while computer-generated clustering solutions
confer a degree of objectivity in the elucidation of patterns.
The utility of clustering in the development of vegetation
classifications is beyond question, although it is complicated by three
inter-related problems. First, excepting simulated datasets, there is no
agreed external point of reference with which clustering solutions can
be compared. Instead, solutions based on field data must be evaluated on
internal criteria (Aho et al. 2008), either geometric (e.g. cluster
homogeneity) or non-geometric (e.g. species/cluster fidelity). Since these
vary in the way they weight particular characteristics of the solution,
the best clustering solution may depend on its application. Second, the
hyperspatial structure of vegetation data is generally unknown. The
choice of both clustering algorithm and evaluation metrics therefore
requires a user-supplied model which usually (but not invariably; Aho et al. 2008) means clusters are assumed to be spheroidal, if
only because the majority of operators default to a few well-tested
algorithms (Kent 2011). This is problematic, because algorithms which
seek to optimise central tendency can generate sub-optimal solutions
when applied to data with irregular structure, and internal metrics
which assume a spheroidal model may not be appropriate measures of
cluster quality. Third, biases in both the geographic and
environmental distribution of samples mean that cluster metrics are
often optimised for data which sample the range of floristic variation
either unevenly or incompletely. That is, biases may induce
irregularities in data structure even if assemblages in the field form a
continuum. It is not surprising, then, that clustering solutions are
notoriously idiosyncratic and highly sensitive to data structure,
transformations, and choices of algorithm and resemblance measures (Tichy et al. 2014).
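To make the distinction between the two classes of internal criteria concrete, the sketch below (Python) computes a geometric criterion, mean within-cluster Bray–Curtis dissimilarity, and a non-geometric one, a simple phi-coefficient fidelity of each species to each cluster; the toy plot-by-species matrix and the arbitrary three-cluster labelling are invented for illustration and are not data from this study.

import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(60, 25)).astype(float)   # 60 plots x 25 species (toy abundances)
labels = rng.integers(0, 3, size=60)                 # an arbitrary three-cluster solution

D = squareform(pdist(X, metric="braycurtis"))        # pairwise compositional dissimilarity

# Geometric criterion: mean dissimilarity among the members of each cluster.
for k in np.unique(labels):
    idx = np.where(labels == k)[0]
    within = D[np.ix_(idx, idx)][np.triu_indices(len(idx), 1)]
    print(f"cluster {k}: mean within-cluster dissimilarity = {within.mean():.3f}")

# Non-geometric criterion: phi coefficient between species presence and
# cluster membership (a simple fidelity measure; higher = more diagnostic).
presence = (X > 0).astype(float)
for k in np.unique(labels):
    member = (labels == k).astype(float)
    phi = np.array([np.corrcoef(presence[:, j], member)[0, 1]
                    for j in range(presence.shape[1])])
    print(f"cluster {k}: most faithful species index = {int(np.nanargmax(phi))}, "
          f"phi = {np.nanmax(phi):.2f}")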
The potential limitations of applying a spheroidal model to data of
irregular structure are illustrated in Figure 1. The data are points on
a Cartesian plane, normally and randomly distributed around each of six
pre-defined centroids. The k-means algorithm fails to retrieve the
underlying data structure: in (i) it incorrectly splits cluster C while
merging clusters D and F, and in (ii) it incorrectly splits clusters C and
F, partially merging them with clusters A and D, respectively. The
resulting solutions appear to be what Barton et al. (2019) termed
‘unnatural’, although they conceded the vagueness (sensu Regan et al. 2002) of
the assignation, relying as it does on an appeal to the human eye. Less
subjectively, the solution is ‘incorrect’: in Figure 1(ii), for example,
samples co-located in the region of centroid C are assigned to different
groups, while remotely located samples from the region of centroid A are
drawn in. The implication is that there is a high likelihood of
alternative solutions arising as further data are added, or if the
clustering algorithm is changed or supplied with different
parameters.
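The behaviour shown in Figure 1 can be reproduced with standard tools. The sketch below (Python with scikit-learn) is not the simulation used to produce Figure 1, and all parameter values are illustrative: it draws points around six centroids with uneven sizes and spreads, applies k-means, and cross-tabulates the retrieved clusters against the generating ones to expose splits and mergers.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Six generating centroids; one cluster is large and diffuse, five are small and tight.
centers = [(0, 0), (6, 0), (0, 6), (6, 6), (12, 0), (12, 6)]
sizes = [400, 30, 30, 30, 30, 30]
stds = [2.5, 0.4, 0.4, 0.4, 0.4, 0.4]
X, y_true = make_blobs(n_samples=sizes, centers=centers,
                       cluster_std=stds, random_state=1)

y_pred = KMeans(n_clusters=6, n_init=10, random_state=1).fit_predict(X)

# Agreement with the generating structure; values near 1 indicate full recovery.
print("Adjusted Rand Index:", round(adjusted_rand_score(y_true, y_pred), 2))

# Cross-tabulate generating versus retrieved clusters to see splits and mergers.
for k in range(6):
    counts = np.bincount(y_pred[y_true == k], minlength=6)
    print(f"generating cluster {k}: retrieved as {counts}")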
The problem illustrated in Figure 1 arises primarily from the
insensitivity of the algorithm to variations in the density of points.
However, a failure to recover ‘natural’ or ‘correct’ clusters of
irregular shape has similarly been documented in a wide range of
algorithms operating on assumptions of central tendency (Karypis et al.
1999; Zhao & Karypis 2005; Han et al. 2012; Barton et al. 2019). The core
principle underpinning algorithms which seek to retrieve
clusters of irregular shape and/or density is sample inter-connectivity.
That is, cluster membership depends on interconnections among samples
(based on pairwise similarity) rather than shared proximity to an
artificial centroid or medoid. Schmidtlein et al. (2010), for example,
noted that two vegetation samples with no species in common could
nevertheless share cluster membership provided they were connected in a
chain of close neighbours. This implies clusters generated by an
algorithm sensitive to irregular data structure are likely to be more
heterogeneous than those derived with reference to a spheroidal model,
particularly at thematic scales where discontinuities and variation in
sample density exist.
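The chaining principle can be illustrated with a toy example. In the sketch below (Python), plots A and C share no species, yet both are linked to plot B and so all three fall in one cluster; the three plots, the Jaccard threshold and the use of connected components are our illustrative choices, not Schmidtlein et al.'s method.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Plots A and C share no species; both overlap with plot B.
X = np.array([[1, 1, 0, 0],    # plot A: species 1 and 2
              [0, 1, 1, 0],    # plot B: species 2 and 3
              [0, 0, 1, 1]],   # plot C: species 3 and 4
             dtype=bool)

D = squareform(pdist(X, metric="jaccard"))   # 1.0 means no shared species

# Connect plots whose dissimilarity falls below a (hypothetical) threshold,
# then treat connected components of the resulting graph as clusters.
adjacency = csr_matrix((D < 0.9) & (D > 0))
n_clusters, labels = connected_components(adjacency, directed=False)

print(D.round(2))   # D[0, 2] == 1.0: A and C have no species in common
print(labels)       # yet all three plots receive the same cluster label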
Potential irregularities in data structure are rarely accounted for in
vegetation classification. Schmidtlein et al. (2010) documented a
promising approach; however, our investigations of their ISOMAP algorithm
suggested that its “brute force” approach is too computationally
demanding for a dataset of many thousands of samples (Schmidtlein et al.
(2010) analysed datasets of up to 305 samples and warned users that the
algorithm is slow, and not to complain). Chameleon (Karypis et al. 1999;
see Methods for a detailed
description) is one of several alternative algorithms designed to
recover clusters of variable shape which may, therefore, reproduce
landscape scale relationships more faithfully than traditional
clustering techniques (Han et al. 2012). Chameleon assesses both the
interconnectivity and the closeness of objects when making merging
decisions, an approach which results in fewer “wrong” decisions than
algorithms that consider only one or the other (Karypis et al. 1999).
Focussing on interconnectivity allows the
algorithm to adapt automatically to the characteristics of the clusters
(density and hyperspatial distribution), rather than relying on a static
model (e.g. discrete spherical clusters). Therefore, provided they are
strongly interconnected, samples spanning a compositional continuum can
be retrieved as a single cluster even if the distribution of samples
along the continuum is uneven, because Chameleon is relatively
insensitive to variations in hyperspatial density (Han et al.
2012).
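As an indication of how the two properties are combined, the simplified sketch below (Python) scores a candidate merger by relative interconnectivity multiplied by relative closeness computed on a k-nearest-neighbour similarity graph. It is illustrative only: Chameleon proper derives internal connectivity from min-cut graph bisection (via hMETIS), which is approximated here by each cluster's internal edge weights, and the function name and default exponent are ours.

import numpy as np

def merge_score(W, a, b, alpha=2.0):
    """Score the candidate merger of clusters a and b.

    W     : symmetric similarity matrix of a sparse k-nearest-neighbour graph,
            with zeros where no edge exists.
    a, b  : arrays of sample indices belonging to each cluster.
    alpha : exponent giving extra weight to closeness, as in Chameleon.
    """
    cross = W[np.ix_(a, b)].ravel()                           # edges linking a and b
    within_a = W[np.ix_(a, a)][np.triu_indices(len(a), 1)]    # internal edges of a
    within_b = W[np.ix_(b, b)][np.triu_indices(len(b), 1)]    # internal edges of b

    def mean_weight(edges):
        edges = edges[edges > 0]
        return edges.mean() if edges.size else 0.0

    # Relative interconnectivity: cross-edge weight relative to internal weight.
    ri = cross.sum() / (0.5 * (within_a.sum() + within_b.sum()) + 1e-12)

    # Relative closeness: mean cross-edge weight relative to the size-weighted
    # mean internal edge weight of the two clusters.
    n_a, n_b = len(a), len(b)
    internal = (n_a * mean_weight(within_a) + n_b * mean_weight(within_b)) / (n_a + n_b)
    rc = mean_weight(cross) / (internal + 1e-12)

    return ri * rc ** alpha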
We suggest that a failure to take account of the underlying structure of
vegetation data is likely to be one factor contributing to
idiosyncrasies among clustering solutions; however, the effect is likely
to depend on the expression and nature of discontinuities in the
data structure. We postulate that accounting for data structure is more
likely to be important at broad thematic scales (as represented by the
points in Figure 1 collectively) because discontinuities are likely to
arise naturally (e.g. between regions which share few species), through
variable data coverage (De Cáceres et al. 2018; Gellie et al. 2018), or
because environmental gradients are discontinuous
in geographic space (Austin 2013). Conversely, there may be no
disadvantage in assuming a spheroidal model where clustering essentially
amounts to partitioning a continuum (i.e. partitioning the individual
clusters in Figure 1). In this paper, we investigate two hypotheses: i)
that an algorithm sensitive to hyperspatial irregularities in the
density and arrangement of samples will produce clusters which are
likely to be more ‘correct’ (in the sense that samples are co-located
with their close neighbours), but at the cost of poorer internal metrics
relative to algorithms that seek to optimise around central tendency;
and ii) that differences between the respective algorithms will decline
with decreasing thematic scale of cluster solutions. To test these
hypotheses, we used a large regional data set of 7541 plot samples to
compare the performance of traditional clustering algorithms (k-means,
hierarchical agglomerative and divisive) with the Chameleon algorithm
using both internal metrics (homogeneity, indicator species) and the
concept of ‘correctness’, which we quantify as the misclassification rate:
the proportion of samples which do not cluster with their nearest
neighbour.
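For clarity, the misclassification rate can be computed as in the following sketch (Python; the function name is ours, and the pairwise dissimilarity matrix D and the cluster label vector are assumed inputs).

import numpy as np

def misclassification_rate(D, labels):
    """Proportion of samples not clustered with their nearest neighbour."""
    D = np.asarray(D, dtype=float).copy()
    labels = np.asarray(labels)
    np.fill_diagonal(D, np.inf)          # exclude self-matches
    nearest = D.argmin(axis=1)           # each sample's nearest neighbour
    return float(np.mean(labels != labels[nearest]))

Because the measure depends only on each sample's single nearest neighbour, it makes no assumption about the shape or density of clusters.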