INTRODUCTION
Species inventories are the universal currency of community ecology: counting the individuals of each species in a place is a routine and fundamental practice. These counts are the basis of key community assembly theories (Fisher et al., 1943; MacArthur, 1957; Bulmer, 1974; Caswell, 1976; Hubbell, 2001). It is thought that count distributions usually tail off with an array of rare species (McGill et al., 2007). When this is true, estimating species richness is difficult and dangerous because inventories are likely to be quite incomplete (Colwell & Coddington, 1994). Biodiversity is of deep concern throughout science and society, making it imperative to solve this problem. I focus here on the richness estimation strategy of fitting mathematical distributions to inventories, where each distribution implies a fixed number of missing species. In the course of doing so, I show not only that this idea is feasible but also that fundamental processes of community assembly can be distinguished using basic inventory data.
Population dynamical models going back to Kendall (1948) have been used before to predict shapes of species abundance distributions (SADs), but the SADs have generally involved multi-parameter equations (Volkov et al., 2005; Jabot & Chave, 2011). The three models considered here all require a single parameter. I put aside other single-parameter models such as the broken stick (MacArthur, 1957), the geometric series as applied to rank-abundance distributions (Motomura, 1932), the logistic-J (Dewdney, 2000), and the Zipf (see Newman, 2005) because they have received little support in comprehensive assessments of distributions (Alroy, 2015; Baldridge et al., 2016) and have not been considered in many studies that have treated two or three distributions at a time (Hughes, 1986; Dewdney, 2000; Connolly et al., 2005; Ulrich et al., 2010; Antão et al., 2021). I do not consider the gambin model (Ugland et al., 2007; Matthews et al., 2014) because it appears to describe only distributions of counts binned into octaves on a log scale (Preston, 1948). Gray et al. (2006) are among several to have pointed out problems with this binning approach. Like others including Antão et al. (2021), I am therefore concerned with models such as the log series (Fisher et al., 1943) that predict counts of singletons, doubletons, and so on – i.e., SADs in a restricted sense.
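To make concrete what it means for a model to predict counts of singletons, doubletons, and so on, the following is a minimal sketch using the standard textbook form of the log series, in which the expected number of species represented by n individuals is αxⁿ/n. The symbols alpha and x, the function names, and the example values are illustrative assumptions and are not taken from this paper's methods.

```python
import math


def log_series_expected_counts(alpha, x, max_n):
    """Expected number of species with exactly n individuals, for n = 1..max_n.

    Uses the standard log series form alpha * x**n / n, where alpha > 0
    and 0 < x < 1 (assumed parameterisation, not this paper's notation).
    """
    return [alpha * x**n / n for n in range(1, max_n + 1)]


def log_series_totals(alpha, x):
    """Expected total species richness S and total individuals N."""
    total_species = -alpha * math.log(1.0 - x)
    total_individuals = alpha * x / (1.0 - x)
    return total_species, total_individuals


# Illustrative values only: alpha = 10, x = 0.9.
counts = log_series_expected_counts(10.0, 0.9, 3)
singletons, doubletons = counts[0], counts[1]
richness, individuals = log_series_totals(10.0, 0.9)
```

With these illustrative values the sketch yields 9 expected singletons and about 4.05 expected doubletons, so the predicted distribution is dominated by rare species, as the log series always implies.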
I also do not consider two-parameter distributions such as the classic Poisson log normal (PLN: Bulmer, 1974) and the truncated negative binomial (Connolly et al., 2009; Connolly & Thibaut, 2012). These models have gained much traction: for example, the PLN has been argued to fit extensive datasets of trees, birds, fishes, and benthic organisms (Antão et al., 2021), not to mention all GBIF occurrence records in the world combined (Callaghan et al., 2023). Meanwhile, the negative binomial has been fit to a vast dataset for Amazonian trees (ter Steege et al., 2020). There are two major reasons not to consider these models for the moment. First, they overfit the data, reducing chances of predicting related patterns. Second, one-parameter distributions are often so good that they cannot be rejected when compared against a saturated model. The latter is a highly resolved function that closely mirrors the raw counts instead of following a proper parameterised distribution. This paper shows how to construct a saturated model and how to assess its fit to the data. The upshot is that the three substantive models under consideration perform so well that there is little left to explain.