Comparing clustering solutions with a reference classification
We assessed the extent to which clustering solutions (15 classes)
produced by each algorithm retrieved species-sets characterising the
units of an established subcontinental-scale vegetation classification
that covers 800,000 km² in southeastern Australia
(Keith 2004), including the study area (c. 11% of total area). The
reference classification was developed top-down, based on an
extensive review of vegetation studies, field reconnaissance and
qualitative synthesis of vegetation maps available at the time (Keith
2004). Its highest level of classification (vegetation formation) is
based on structural/physiognomic features. Formations are subdivided
into vegetation classes based on geographically distinct expressions of
structural and compositional features. Fifteen of the 99 vegetation classes
recognised in the reference classification are mapped within the study
area and are described with lists of indicative species (Keith 2004).
For each clustering solution, we identified the species diagnostic of
each cluster as those whose frequency of occurrence was significantly
higher within the cluster's samples than across the dataset as a whole
(cumulative hypergeometric probability > 0.999). We
compared these with the species identified as diagnostic of the
reference classes, compiling a confusion matrix with the units of the
respective classifications as rows and columns and cell values
calculated as the proportion of each reference class's indicative
species that were identified as diagnostic of each cluster.
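
The two steps above (diagnostic-species selection and the confusion matrix) can be illustrated with a minimal Python sketch. It is not the code used in the study. It assumes that the "cumulative hypergeometric probability" is P(X < k), where k is the number of cluster samples containing the species and X is the count expected if the cluster's samples were drawn at random from the whole dataset; other readings of the criterion are possible. All function and variable names are hypothetical, samples are represented as sets of species names, and every reference class is assumed to have at least one indicative species.

```python
from math import comb


def hypergeom_cdf_below(k, N, K, n):
    """P(X < k) when n samples are drawn without replacement from N samples,
    K of which contain the species (assumed reading of the criterion)."""
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k))
    return favourable / comb(N, n)


def diagnostic_species(samples, labels, threshold=0.999):
    """Return {cluster: set of species} whose within-cluster frequency is
    higher than expected from the whole dataset at the given threshold."""
    N = len(samples)
    # K for each species: number of samples in the whole dataset containing it
    occurrences = {}
    for sample in samples:
        for sp in sample:
            occurrences[sp] = occurrences.get(sp, 0) + 1

    result = {}
    for cluster in sorted(set(labels)):
        members = [s for s, lab in zip(samples, labels) if lab == cluster]
        n = len(members)
        result[cluster] = set()
        for sp, K in occurrences.items():
            k = sum(sp in s for s in members)  # occurrences within the cluster
            if k > 0 and hypergeom_cdf_below(k, N, K, n) > threshold:
                result[cluster].add(sp)
    return result


def confusion_matrix(reference, diagnostics):
    """Rows are reference classes, columns are clusters; each cell is the
    proportion of the reference class's indicative species that are
    diagnostic of the cluster."""
    return {
        ref: {cl: len(ind & diag) / len(ind) for cl, diag in diagnostics.items()}
        for ref, ind in reference.items()
    }
```

Calling `diagnostic_species(samples, labels)` and passing its result, together with a dictionary of indicative species per reference class, to `confusion_matrix` yields a nested dictionary indexed first by reference class and then by cluster.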