How do you estimate the biodiversity of a whole community, or the distribution of abundances and ranges of its species, from presence/absence data in scattered samples?
It all starts with the collector's dilemma: if you double the number of samples, you will not get double the number of species, since you will find many of the same common species, and only a few new rare ones.
This non-additivity has prompted many ecologists to study the Species-Area Relationship. A common theoretical approach has been to connect this spatial pattern to the overall distribution of how common or rare a species can be. At least since Fisher's celebrated log-series , ecologists have been trying to, first, infer the shape of the Species Abundance Distribution, and then, use it to predict how many species should be found in a given area or a given number of samples. This has found many applications, from microbial communities to tropical forests, from estimating the number of yet-unknown species to predicting how much biodiversity may be lost if a fraction of the habitat is removed.
In this elegant work, Tovo et al.  propose a method that starts only from presence/absence data over a number of samples, and provides the community's diversity, as well as its abundance and range size distributions. This method is simple, analytically explicit, and accurate: the authors test it on the classic Pasoh and Barro Colorado Island tropical forest datasets, and on simulated data. They make a very laudable effort in both explaining its theoretical underpinnings, and proposing a straightforward step-by-step guide to applying it to data.
The core of Tovo et al's method is a simple property: the scale invariance of the Negative Binomial (NB) distribution. Subsampling from a NB gives another NB, where a single parameter has changed. Therefore, if the Species Abundance Distribution is close enough to some NB (which is flexible enough to accommodate all the data here), we can estimate how this parameter changes when going from (1) a single sample to (2) all the available samples, and from there, extrapolate to (3) the entire community.
This principle was first applied by the authors in a previous study  that required abundance data in the samples, rather than just presence/absence. Given that binary occurrence data is far more available in a variety of empirical settings, this extension is worthwhile (including its new predictions on range size distributions), and it deserves to be widely known and tested.
1) To explain the novelty of the authors' contribution, it is useful to look at competing techniques.
Some ""parametric"" approaches try to infer the whole-community Species Abundance Distribution (SAD) by guessing its functional form (Gaussian, power-law, log-series...) and fitting its parameters from sampled data. The issue is that this distribution shape may not remain in the same family as we increase the sampling effort or area, so the regression problem may not be well-defined. This is where the Negative Binomial's scale invariance is useful.
Other ""non-parametric"" approaches have renounced guessing the whole SAD: they simply try to approximate of its tail of rare species, by looking at how many species are found in only one (or a few) samples. From this, they derive an estimate of biodiversity that is agnostic to the rest of the SAD. Tovo et al.  show the issue with these approaches: they extrapolate from the properties of individual samples to the whole community, but do not properly account for the bias introduced by the amount of sampling (the intermediate scale (2) in the summary above).
2) The main condition for all such approaches to work is well-mixedness: each sample should be sufficiently like a lot drawn from the same skewed lottery. As long as that condition applies, finding the best approach is a theoretical matter of probabilities and combinatorics that may, in time, be given a definite answer.
The authors also show that ""well-mixed"" is not as restrictive as it sounds: the method works both on real data (which is never perfectly mixed) and on simulations where species are even more spatially clustered than the empirical data. In addition, the Negative Binomial's scale invariance entails that, if it works well enough at some spatial scale, it will also work at all higher scales (until one reaches the edges of the sufficiently-well-mixed community)
3) One may ask: why the Negative Binomial as a Species Abundance Distribution?
If one wishes for some dynamical explanation, the Negative Binomial can be derived from neutral birth and death process with immigration, as shown by the authors in . But to be applied to data, it should only be able to approximate the empirical distribution well enough (at all relevant scales). Depending on one's taste, this type of probabilistic approaches can be interpreted as:
- purely phenomenological, describing only the observational process of sampling from an existing state of affairs, not the ecological processes that gave rise to that state.
- a null model, from which everything in practice is expected to deviate to some extent.
- or a way to capture the statistical forces that tend to induce stable relationships between different patterns (as long as no ecological process opposes them strongly enough).
 Fisher, R. A., Corbet, A. S., & Williams, C. B. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. The Journal of Animal Ecology, 42-58. doi: 10.2307/1411
 Tovo, A., Formentin, M., Suweis, S., Stivanello, S., Azaele, S., & Maritan, A. (2019). Inferring macro-ecological patterns from local species' occurrences. bioRxiv, 387456, ver. 2 peer-reviewed and recommended by PCI Ecol. doi: 10.1101/387456
 Tovo, A., Suweis, S., Formentin, M., Favretti, M., Volkov, I., Banavar, J. R., Azaele, S., & Maritan, A. (2017). Upscaling species richness and abundances in tropical forests. Science Advances, 3(10), e1701438. doi: 10.1126/sciadv.1701438
DOI or URL of the preprint: 10.1101/387456
Version of the preprint: 2
Thank you for your letter and resubmission, with my sincere apologies for the delays.
Serious concerns have indeed been addressed very satisfactorily, and I would now like to suggest very minor revisions following the comments of one of the reviewers and my own attached. As soon as those are done, I will post my recommendation for the updated preprint.
Besides my annotations of the pdf for typos (attached below), I have a comment on the reviewer's concern about eq 10: it took me a bit to convince myself that this expression is correct and indeed hypergeometric, even if it only takes the canonical form when you replace (M v) by (M M-v) (clearly equal by symmetry) at the top left.
As the interpretation in terms of a hypergeometric distribution is not very intuitive (why would you have n+M-1 balls of which M are successful, and try to get v successes by drawing M-1 balls?), I would recommend instead a small term-by-term explanation: you have (M v) possibilities to choose the v filled bins, (n-1 v-1) possibilities to distribute n balls among v bins so that no bin is empty, and (n+M-1 M-1) ways to distribute n balls in M bins allowing empty bins, then referencing a classic book on combinatorics for these results (e.g. W. Feller's book and its "bars and stars")
Sincerely, Matthieu BarbierDownload recommender's annotations
DOI or URL of the preprint: 10.1101/387456
Version of the preprint: 1
I agree with both reviewers that the work in manuscript is truly worthy of recommendation, but would deserve a clearer presentation to really get the attention it deserves.
In addition to the two reviews, a third colleague (anonymously) noted that this work is very relevant and interesting, but that there should be no shortcuts taken in the writing. Some basic assumptions are not explained (e.g. why take the negative binomial? we can come up with a justification, but it should be explicit already). Most importantly, the model description is very sloppy with incomplete definitions and inconsistent use of variables and parameters. The introduction could be more explicit about the concepts it refers to.
Maybe part of this is tied to the possible redundancy with a prior work, as noted in the review of K Cazelles, in which case I concur that the authors should make a clear-cut choice, making this manuscript either an explicit follow-up (so that the reader knows where to go for details), or something much more pedagogical that can encourage other scientists to use this method.
I think it is good practice for authors to try to read the paper through the eyes of someone who never encountered any of these methods or questions, and try to judge whether, when a concept appears, there is sufficient information to either understand it on the spot or know how to learn more about it (if the concept is relevant to the results, the answer to that should be "later in the text", rather than "somewhere in another reference", unless it is very clear from the start that this manuscript is a follow-up to one specific other work).
-"these latter methods make no assumption on the RSA distribution and they thus perform no fit of empirical patterns, rather they only take into account rare species, which are intuitively assumed to carry all the needed information on the undetected species in a sample" After reading this sentence, I have honestly no idea how nonparametric models work, and what it means for them to "perform no fit of empirical patterns". Can you try being a bit more descriptive? (without necessarily being much longer, just less oblique) (NB: After reading further down, I now understand what is happening here, but I don't think I could have understood it from this sentence; maybe give a short account of how parametric and nonparametric methods work in general, whether the former traditionally use part of the test data to do the fitting or whether they use different patterns for fit and prediction as you do, etc.)
Is Table 2 simply missing its caption, or was it planned not to have one? If you add one with the definition of all the terms that are not defined in there (C, etc.), it would help quite a bit. The caption may also be where you put some explanations of the nonparametric methods, if you think they would clutter the introduction.
"derived as the steady-state RSA of a birth and death stochastic dynamics, accounting for effective immigration and/or intraspecific interactions" It would be worthwhile having a short (even 1 sentence) explanation of how this works. In particular, it seems important for the reader to understand how much of what you say is based on purely statistical effects versus how much relies on specific biological assumptions.
"We denote as P (n|1) the relative species abundance, – i.e. the probability that a species has exactly n individuals" I'm confused as to how this is a relative species abundance
"with parameters (r, ξ) (r is known as the clustering coefficient)" Please give any intuition of what parameters r and xi mean (where they come from qualitatively in the derivation in your previous work, and how they shape the distribution)
"In order to reduce the number of parameters to fit from three to two (see (7))" I don't see what the third parameter is in (8), it seems you have only (r, ξp∗ )
"we rescale the setting, by assuming that the global scale p = 1 is actually the one where we have data, i.e. the sampling scale p∗ = na/A."
I think this sentence is very confusing since it says that << "p=1" == "p∗ = na/A" >> which makes little mathematical sense. After some parsing I understood something like: "we now want to make predictions of what happens when subsampling the data, instead of using the data as a subsample of the unknown true distribution. For this specific use, we define the total area A' as A'=na given we have n measurements, each of area a". However, this is not the same "a" as used in page 4 (where "a" was, in a sense, the total sampled area A') so this is confusing. Also, when has n become a number of measurements? Are you saying that each area a must now contain exactly one individual? If so, you should really note that beforehand.
Please add at least this level of detail and clarity if that was indeed your meaning. The notion that we are using the same formulas in different ways to subsample for fitting and to extrapolate from the fit is very interesting, but it is tricky and should be made crystal clear.
"we compute the empirical average of the species observed in all subsets of k cells." of the number of species
"and compute the C × S presence/absence matrix," C switches between being a set and being the cardinal of that set; maybe redefine C here (and potentially change notation for the set)
"In the case of the NB forest, the two methods performed very well for both the random and the clumped distribution with an average prediction error below 1% in absolute value (see Table1). In the Thomas distributed forests, the error increased,"
Do you mean the LN forests in that last sentence? The Thomas distributed forests = the clumped distribution mentioned in the previous sentence
p2: at a that spatial scale
p3: however our method can applied
p5: helps us linking => link
p7: ωsi = 1 if species i => species s