Distribution fitting
Methods of fitting data to abundance models are contentious, with many
protocols having been advocated (Matthews & Whittaker, 2014). As
mentioned, all of the models considered here seek to explain SADs sensu
stricto, which are vectors that record the number of species each
sharing a given count of individuals (Fisher et al., 1943).
It is important to stress that SADs are not equivalent to rank
abundance distributions (RADs). RADs are useful for depicting counts
(e.g., Motomura, 1932; MacArthur, 1957) and are commonly used even by
some contemporary workers to fit distribution models (e.g., Nekola et
al., 2008; Ulrich et al., 2010, 2015). There are at least four major
reasons not to fit data to RADs: (1) key theoretical models directly
predict SAD shapes, not RADs; (2) most models that do directly predict
RADs, such as the geometric series (Motomura, 1932), are no longer
considered to be viable descriptors of real-world ecological communities
(Alroy, 2015; Baldridge et al., 2016); (3) maximum likelihood methods
have been developed to fit models to SADs (e.g., Connolly et al., 2005;
Connolly & Thibaut, 2012) and are generally advocated over the many
alternatives (Gray et al., 2006; Matthews & Whittaker, 2014; Antão et
al., 2021), but RADs are generally fit using frequentist methods; and
(4) it is difficult to model the error in species ranks because any
variation in the count of a species could also change its rank, so the
x- and y-axes in an RAD are not statistically independent.
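The distinction between the two representations can be made concrete with a minimal Python sketch. The function names and the example counts are hypothetical and serve only as an illustration; they are not taken from the analysis itself.

```python
from collections import Counter

def sad_from_counts(counts):
    """SAD sensu stricto: map each abundance n to the number of species
    that have exactly n individuals."""
    return dict(sorted(Counter(counts).items()))

def rad_from_counts(counts):
    """RAD: the per-species counts sorted from most to least abundant,
    so that position 0 holds the rank-1 species."""
    return sorted(counts, reverse=True)

# Hypothetical per-species counts for a small sample of seven species.
counts = [1, 1, 1, 2, 2, 5, 9]
sad = sad_from_counts(counts)   # three singletons, two doubletons, ...
rad = rad_from_counts(counts)   # abundance as a function of rank
```

In the RAD, changing one species' count can reorder the whole vector, which is the dependence between axes described in point (4). The SAD simply moves one species from one abundance class to another.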
A third approach is also worth mentioning: to bin the counts into
classes on a log scale, equivalent to a histogram where the boxes show
the number of species in classes 1, 2, 3 – 4, 5 – 8, 9 – 16, etc.
(Preston, 1948). This strategy is still used (e.g., Matthews et al.,
2014), but it has rightly been rejected because it loses too much
information and can introduce artefacts (Gray et al., 2006; Nekola et
al., 2008).
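For reference, the binning into the classes listed above (1, 2, 3 – 4, 5 – 8, 9 – 16, and so on) can be sketched as follows. The helper names are hypothetical, and the class boundaries follow the description in this section rather than any particular published implementation.

```python
from collections import Counter

def log2_class(n):
    """Index of the log2 class containing count n:
    0 -> {1}, 1 -> {2}, 2 -> {3, 4}, 3 -> {5..8}, 4 -> {9..16}, ..."""
    if n < 1:
        raise ValueError("counts must be positive integers")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1,
    # computed with integers to avoid floating-point rounding.
    return (n - 1).bit_length()

def bin_counts(counts):
    """Number of species falling in each log2 class, lowest class first."""
    tally = Counter(log2_class(n) for n in counts)
    return [tally.get(k, 0) for k in range(max(tally) + 1)]

# Hypothetical per-species counts.
counts = [1, 1, 1, 2, 2, 5, 9]
binned = bin_counts(counts)
```

The loss of information is visible directly: counts of 5 and 8 fall in the same class, so the binned vector cannot distinguish them.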
Here I use a fast and reliable maximum likelihood computation for
fitting. It is the most obvious approach: define the likelihood by
multiplying the probabilities of the individual counts based on the SAD.
Specifically, if there are $s_1$ singletons and $s_2$ doubletons out of
$S$ species and if the hypothesised PMF is $p_1, p_2, p_3, \ldots$, then
the joint likelihood is $p_1^{s_1} \times p_2^{s_2} \times \cdots$.
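A minimal sketch of this calculation is given below. It works on the log scale, where the product becomes a sum of $s_n \ln p_n$ terms, which avoids numerical underflow. Fisher's log series is used purely as an illustrative one-parameter PMF, and the golden-section search is an assumed optimiser, not necessarily the one used for the analyses reported here. All function names are hypothetical.

```python
import math

def logseries_logpmf(n, x):
    """Log of the log-series PMF p_n = x**n / (n * -ln(1 - x)), n >= 1."""
    return n * math.log(x) - math.log(n) - math.log(-math.log1p(-x))

def log_likelihood(sad, x):
    """Sum of s_n * ln(p_n): the log of the product p_1^s_1 * p_2^s_2 * ..."""
    return sum(s_n * logseries_logpmf(n, x) for n, s_n in sad.items())

def fit_one_parameter(sad, lo=1e-9, hi=1 - 1e-9, tol=1e-8):
    """Maximise the log-likelihood over x in (0, 1) by golden-section
    search, which assumes the function has a single maximum."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = log_likelihood(sad, c), log_likelihood(sad, d)
    while b - a > tol:
        if fc > fd:
            # The maximum lies in [a, d]; reuse c as the new upper probe.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = log_likelihood(sad, c)
        else:
            # The maximum lies in [c, b]; reuse d as the new lower probe.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = log_likelihood(sad, d)
    return (a + b) / 2

# Hypothetical SAD: {abundance n: number of species s_n}.
sad = {1: 3, 2: 2, 5: 1, 9: 1}
x_hat = fit_one_parameter(sad)
```

For the log series, the fitted parameter should reproduce the observed mean count per species, which offers a simple check on the optimiser.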
This calculation works as well in practice as any other I have
investigated, surpassing rivals in a suite of tests that I do not have
space to detail. It has a very interesting property: exactly the same
solution is always found by fitting a given set of counts to a
multinomial model. The reason is that the combinatorial terms
distinguishing multinomial distributions from simple products of
probabilities are constant across all possible parameter values, so they
wash out of any optimisation.
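The equivalence can be written out in one step. Here $\theta$ is an assumed symbol for the single free parameter, and $s_n$ and $p_n$ are the species tallies and PMF values defined above.

```latex
\ln L_{\mathrm{multinomial}}(\theta)
  = \ln\!\left(\frac{S!}{\prod_{n} s_n!}\right)
  + \sum_{n} s_n \ln p_n(\theta)
```

The first term depends only on the data, so its derivative with respect to $\theta$ is zero. Both likelihoods are therefore maximised at the same value of $\theta$.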
All models considered here use just one free parameter. However, many of
the remaining models in the literature assume two parameters (such as
the PLN). For comparison across models in general, the corrected Akaike
information criterion (Hurvich & Tsai, 1993) is recommended. It has
been used previously in this context (Antão et al., 2021).
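As a sketch, the standard small-sample correction can be computed as below. Treating the number of species $S$ as the sample size is an assumption made here for illustration, and the function name is hypothetical.

```python
def aicc(log_lik, k, n):
    """Corrected Akaike information criterion.

    log_lik : maximised log-likelihood of the fitted model
    k       : number of free parameters (1 for the models considered here,
              2 for two-parameter models such as the PLN)
    n       : sample size (assumed here to be the number of species S)
    """
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    aic = 2 * k - 2 * log_lik
    return aic + (2 * k * (k + 1)) / (n - k - 1)

# Hypothetical comparison: with equal log-likelihoods, the model with
# more parameters receives the larger (worse) score.
one_param = aicc(-10.0, k=1, n=7)
two_param = aicc(-10.0, k=2, n=7)
```

Lower values indicate better support, so a two-parameter model must improve the log-likelihood enough to offset its larger penalty.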
In practice, analysing a large data set requires examining a limited set
of classes. Here, the computational limit is treated as
$2^{14}$ = 16,384. Imposing this cutoff makes hardly any
difference because just 45 out of 82,870 species counts in the database
(0.05%) exceed it.