August 29, 2021

The Pareto distribution and Price's law

As detailed in the previous post, the fraction \(f\) of top authors that publish a fraction \(v\) of all publications is independent of the total number of authors \(N_0\). Of course, this result is incompatible with Price's law (which states that, for \(v=0.5\), \(f = 1/\sqrt{N_0}\)). This issue has been discussed by Price and co-workers [1], but I will take a slightly different approach here.

I had assumed in my derivation that the domain of the distribution was unbounded above (\(H = \infty\)) and that the exponent \(\alpha\) was greater than 1. One can relax these assumptions and check their effect on \(f\) by:

  1. imposing a finite upper bound \(H\), and
  2. setting \(\alpha = 1\). Note that 2. also requires 1.

Role of the upper bound

In the finite \(H\) case, one must use the full expressions (containing \(H\) and \(L\)) for the various quantities. In this section, we will continue to assume that \(\alpha > 1\). Since \(L\) acts everywhere as a scale factor for \(x\) (and for \(H\)), I will set it to 1 in the following. It is also reasonable to assume that the least productive authors have one publication (why truncate at a higher value?!). Consequently, all results will also depend on \(H\), but presumably not explicitly on \(N_0\), which is a prefactor for the PDF and should cancel out of all expectation calculations. It is, however, quite likely that \(H\) itself will depend on \(N_0\), since more authors will lead to a higher maximum publication number!

In my opinion, the most reasonable assumption is that there is only one author with \(H\) publications, so that \(N_0 p(H) = 1 \Rightarrow H \simeq (N_0 \alpha)^{\frac{1}{\alpha + 1}}\), neglecting the normalization prefactor of \(p(x)\).
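Explicitly (dropping the normalization factor \(1/(1-H^{-\alpha})\), which is close to 1 for large \(H\)):

\[N_0 \, p(H) \simeq N_0 \, \alpha \, H^{-(\alpha+1)} = 1 \Rightarrow H^{\alpha+1} = N_0 \, \alpha \Rightarrow H = (N_0 \, \alpha)^{\frac{1}{\alpha + 1}}.\]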

The threshold number \(x_f\) is easy to obtain directly from \(S(x)\):

\[x_f = \left [ f + (1-f) H^{-\alpha}\right ]^{-1/\alpha}\]

From its definition, the fraction \(v\) is given by: \(v = \dfrac{\alpha}{\mu} \dfrac{1}{1-H^{-\alpha}} \dfrac{1}{\alpha - 1} \left ( x_f^{1-\alpha} - H^{1-\alpha} \right )\). Note that we need here the complete expression for the mean [2]:

\[\mu = \dfrac{\alpha}{\alpha - 1} L \dfrac{1-H^{1-\alpha}}{1-H^{-\alpha}}\]

Plugging \(x_f\) and \(\mu\) into the definition of \(v\) and setting \(v = 1/2\) gives \(x_f^{1-\alpha} = \frac{1}{2}\left( 1 + H^{1-\alpha} \right)\); inserting this threshold back into \(f = S(x_f)\) yields:

\begin{equation} f = f_{\infty} \dfrac{\left ( 1 + H^{1-\alpha}\right )^{\frac{\alpha}{\alpha - 1}} - 2^{\frac{\alpha}{\alpha - 1}}H^{-\alpha}}{1-H^{-\alpha}}, \quad \text{with } f_{\infty} = \left( \dfrac{1}{2} \right )^{\frac{\alpha}{\alpha - 1}},\end{equation}

and we assume that the upper bound is given by:

\begin{equation} H = (N_0 \alpha)^{\frac{1}{\alpha + 1}}. \end{equation}
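As a numerical illustration, here is a minimal Python sketch (mine, not part of the original derivation; the function name is arbitrary) that evaluates equations (1) and (2) for given \(N_0\) and \(\alpha\):

    def f_prolific(N0, alpha):
        """Fraction f of the top authors that publish half of all papers,
        from equations (1)-(2), with L = 1 and H = (N0*alpha)^(1/(alpha+1)).
        Only valid for alpha > 1."""
        H = (N0 * alpha) ** (1.0 / (alpha + 1.0))
        f_inf = 0.5 ** (alpha / (alpha - 1.0))        # unbounded (H -> infinity) limit
        numerator = ((1.0 + H ** (1.0 - alpha)) ** (alpha / (alpha - 1.0))
                     - 2.0 ** (alpha / (alpha - 1.0)) * H ** (-alpha))
        return f_inf * numerator / (1.0 - H ** (-alpha))

    # Example: the "20/80" exponent for a community of one million authors
    print(f_prolific(1e6, 1.16))   # ~0.059, still far from the H -> infinity limit of ~0.0066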

Exponent \(\alpha = 1\)

Let us rewrite the PDF, CDF and survival function in this particular case:

\[p(x) = \dfrac{1}{1 - H^{-1}} \dfrac{1}{x^2}; \, F(x) = \dfrac{1- x^{-1}}{1 - H^{-1}} ; \, S(x) = 1 - F(x) = \dfrac{x^{-1}- H^{-1}}{1 - H^{-1}}\]

\[x_f = S^{-1}(f) = \dfrac{1}{f + (1-f) H^{-1}}\]

\[v = \dfrac{1}{2} = 1 - \dfrac{\ln(x_f)}{\ln(H)} \Rightarrow x_f = \sqrt{H} \quad \text{and, since } H = \sqrt{N_0}, \, x_f = N_0^{1/4}\]

Putting it all together yields \(f = \dfrac{N_0^{1/4} - 1}{N_0^{1/2} - 1}\) and, in the large-\(N_0\) limit, \(f \sim N_0^{-1/4}\), so the number of "prolific" authors is \(N_p = f N_0 \sim N_0^{3/4}\), a result also obtained by Price et al. [1] using the discrete distribution. They also showed that other power laws (from \(N_0^{1/2}\) to \(N_0^{1}\)) can be obtained, depending on the exact dependence of \(H\) on \(N_0\).
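As a quick numerical check (a small sketch of mine, with an arbitrary helper name), one can compare the exact expression with its asymptote:

    def f_alpha1(N0):
        """Exact fraction f for alpha = 1, with L = 1 and H = sqrt(N0)."""
        return (N0 ** 0.25 - 1.0) / (N0 ** 0.5 - 1.0)

    for N0 in (1e2, 1e4, 1e6):
        print(N0, f_alpha1(N0), N0 ** -0.25)   # exact value vs. the asymptote N0^(-1/4)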

Fraction \(f\) of the most prolific authors that contribute \(v = 1/2\) of the total output, as a function of the total number of authors \(N_0\), for various exponents \(\alpha\). The unbounded limit \(f (H \rightarrow \infty)\), calculated in the previous post, is also shown for \(\alpha > 1\); with my choice for the relation between \(N_0\) and \(H\), this also corresponds to \(N_0 \rightarrow \infty\). The particular value \(\alpha = 1.16\) yields the 20/80 rule, but also the 0.6/50 rule, shown as a solid black line. Note that the curve for \(\alpha = 1\) is computed using a different formula than the others and does not reach a plateau: its asymptotic regime \(f \sim N_0^{-1/4}\) is shown as a dotted line.
The graph above summarizes all these results: for \(\alpha = 1\), \(f\) reaches the asymptotic regime \(f \sim N_0^{-1/4}\) very quickly (\(N_0 \simeq 100\)). For \(\alpha > 1\), \(f\) leaves this asymptote and saturates at its unbounded limit \(f (H \rightarrow \infty)\), calculated in the previous post. This regime change is very slow for \(\alpha < 2\): the plateau is only reached for \(N_0 > 10^6\).
In conclusion, an attenuated version of Price's law is indeed obtained for \(\alpha = 1\) (where it holds for any \(N_0\)), but also for reasonably low \(\alpha > 1\), in particular for \(\alpha = 1.16\) (of 20/80 fame), where it applies for any practical number of authors! As soon as \(\alpha\) exceeds about 1.5, the decay is shallow and saturates quickly, so \(f\) is relatively flat.


[1] Allison, P. D. et al., "Lotka's Law: A Problem in Its Interpretation and Application", Social Studies of Science 6, 269-276 (1976).

August 28, 2021

The Pareto distribution and the 20/80 rule

In the previous post, I mentioned Pareto's 20/80 rule. Here, I will discuss the Pareto distribution, focusing on how (and under what conditions) it gives rise to this result. I had some trouble understanding the derivation as presented in various sources, so I will go through it in detail.

The functional form of the Pareto distribution is a power law over an interval \((L,H)\) such that \(0<L<H\leq \infty\). I will use the notation of the Wikipedia page unless stated otherwise. Its probability density function (PDF) \(p(x)\) and cumulative distribution function (CDF) \(F(x)\) are (\(\alpha\) is real and strictly positive):

\[p(x) = \dfrac{\alpha}{1-(L/H)^{\alpha}} \dfrac{1}{x} \left ( \dfrac{L}{x} \right ) ^{\alpha}\quad ; \quad F(x) =  \dfrac{1-(L/x)^{\alpha}}{1-(L/H)^{\alpha}}\]

One often uses the complementary CDF (or survival function) defined as:

\[S(x) = 1 - F(x) = \dfrac{1}{1-(L/H)^{\alpha}}\left [ \left ( \dfrac{L}{x} \right )^{\alpha} - \left ( \dfrac{L}{H} \right )^{\alpha}\right ]\]

Note that the survival function is very similar to the PDF multiplied by \(x\): \(S(x) \simeq \dfrac{x}{\alpha} p(x)\), the difference being due only to the final truncation term. However, this is only true for power laws, as one can easily check by writing \(p(x) = F'(x)\) and solving the resulting ODE. We should therefore carefully distinguish \(x p(x)\) (which is, for instance, the integrand to use for computing the mean of the distribution) and \(S(x)\) which "has already been integrated", so to speak.
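To make the notation concrete, here is a minimal Python sketch (mine, with arbitrary illustrative parameters) of the three functions; the last lines check that \(S(x)\) and \(\dfrac{x}{\alpha} p(x)\) indeed differ only by the constant truncation term.

    L, H, alpha = 1.0, 1000.0, 1.5       # arbitrary illustrative values

    def pdf(x):                          # p(x)
        return alpha / (1.0 - (L / H) ** alpha) * (L / x) ** alpha / x

    def cdf(x):                          # F(x)
        return (1.0 - (L / x) ** alpha) / (1.0 - (L / H) ** alpha)

    def survival(x):                     # S(x) = 1 - F(x)
        return ((L / x) ** alpha - (L / H) ** alpha) / (1.0 - (L / H) ** alpha)

    truncation = (L / H) ** alpha / (1.0 - (L / H) ** alpha)
    x = 10.0
    print(abs(survival(x) - (x * pdf(x) / alpha - truncation)) < 1e-12)   # True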

Let us use this continuous model to describe the distribution of publications (neglecting for now its intrinsically discrete character). Here, \(x\) stands for the number of publications by one author, bounded by \(L\) and \(H\). The number of authors with \(x\) publications is then approximately \(N_0 \, p(x)\), where \(N_0\) is the total number of authors.

  • The first question is: who are the top \(f\) most prolific authors (in Pareto's case, \(f = 0.2 = 20\)%)? More precisely, what is the threshold number of publications \(x_f\) separating them from the less prolific ones?
This is quite easy: if we go through the list of authors (ordered by increasing \(x\)), when we reach \(x_f\) we will have counted the lower fraction, so \(\int_{L}^{x_f} p(x) \text{d}x = F(x_f) = 1-f\). Thus, the survival function satisfies \(S(x_f) = \int_{x_f}^{H} p(x) \text{d}x = f\), and we can simply invert this relation to get \(x_f = S^{-1}(f)\).
  •  The second question is: how many publications did these top \(f\) authors contribute?
We need to count the authors again, but with an additional factor of \(x\), since there are \(N_0 \, p(x)\) authors with exactly \(x\) publications, for a total contribution of \(x \, N_0 \, p(x)\). The fraction \(v\) of publications contributed by the top \(f\) authors is then:
\[v = \dfrac{\int_{x_f}^{H} x \, N_0 \, p(x) \text{d}x}{\int_{L}^{H} x \, N_0 \, p(x) \text{d}x} = \dfrac{\int_{x_f}^{H} x \, p(x) \text{d}x}{ \mu}\]
where \(\mu\) is the mean of the distribution and \(N_0 \mu\) is the total number of publications.

In the simple case \(H = \infty\) (which requires \(\alpha > 1\) for the mean to be finite), one has:
\[p(x) = \dfrac{\alpha}{x} \left ( \dfrac{L}{x} \right )^{\alpha}, \quad \text{with} \quad \mu = \dfrac {\alpha}{\alpha-1} L\]
\[f  = S(x_f) =\left ( \dfrac{L}{x_f} \right )^{\alpha} \Rightarrow x_f = L f^{-1/\alpha}\]
 
Plugging the above into the equation for \(v\) yields:
\[v = \dfrac{1}{\mu} \int_{x_f}^{\infty} x \, p(x) \text{d}x = \dfrac{\alpha}{\mu} \int_{x_f}^{\infty}  \left ( \dfrac{L}{x} \right ) ^{\alpha} \text{d}x = \left ( \dfrac{L}{x_f} \right ) ^{\alpha-1} =f^{\frac{\alpha-1}{\alpha}} \Rightarrow f = v^{\frac{\alpha}{\alpha-1}}\] 
Pareto's rule (\(f=0.2\) and \(v=0.8\)) requires \(\alpha \simeq 1.161\): a power law with this exponent will obey the rule, irrespective of the values of \(L\) and \(N_0\). Despite the neat coincidence in the established statement of the principle, there is absolutely no need for \(f+v=1\)! For instance, the same \(\alpha\) implies that, for \(v=0.5\), \(f \simeq 0.0067\), a result I have already used in the previous post.
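As a sanity check, here is a small sketch (mine) that inverts \(f = v^{\frac{\alpha}{\alpha-1}}\) to recover the exponent from the 20/80 pair, then reuses it for \(v = 1/2\):

    import math

    # f = v^(alpha/(alpha-1))  =>  alpha/(alpha-1) = ln(f)/ln(v)  =>  alpha = r/(r-1)
    r = math.log(0.2) / math.log(0.8)
    alpha = r / (r - 1.0)
    print(alpha)                              # ~1.161

    # fraction f of authors producing half of all publications (v = 0.5)
    print(0.5 ** (alpha / (alpha - 1.0)))     # ~0.0067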

August 27, 2021

Price's law is not intensive

Price's law was proposed in the context of scientific publishing: 

The square root of the total number of authors contributes half of the total number of publications.

It is a more extreme version of Pareto's 20/80 rule, which would state that 20% of the authors contribute 80% of the total number of publications (see the next post for the relation between the two). Unlike Pareto's rule, however, Price's law is not stable under aggregation. This is a trivial observation, but I have not yet seen it in the literature, just as I have not seen much empirical evidence for Price's law.

Let us denote by \(N\) the total number of authors, by \(N_p\) the number of "productive" authors (the top authors that provide half of all publications), and by \(p = N_p/N\) the "productivity". As the ratio of two extensive quantities, \(p\) should be independent of the system size \(N\): consider ten systems (e.g. the research communities in different countries, different subjects, etc.), each of size \(N\), with the same publication distribution and hence the same \(N_p\). Half of the total number of publications is published by \(10 \, N_p\) contributors, so the overall productivity is \(p' = \dfrac{10 N_p}{10 N} = p\). According to Price's law, however, it should be \(p' = p/\sqrt{10}\)! The situation is similar to having ten identical vessels, all under the same pressure \(p\). If we connect them all together, the pressure does not change, although both the volume and the energy increase by a factor of ten.

Price's law does have a "convenient" feature: simply by selecting the representative size \(N\) one can obtain any productivity, since \(p = 1/\sqrt{N}\). For instance, the same Pareto distribution that yields the 20/80 rule predicts that 0.7% of causes yield half of the effects. This result is reproduced by Price's law with \(N \simeq 23000\).
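For the record, a quick check of that last figure (a sketch of mine, assuming the rounded exponent \(\alpha = 1.16\)):

    alpha = 1.16
    f = 0.5 ** (alpha / (alpha - 1.0))   # ~0.0066: fraction of causes yielding half of the effects
    print(1.0 / f ** 2)                  # ~23000: the N for which Price's law gives the same fraction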

Outside of bibliometrics, Price's law has been invoked in economics, for instance by Jordan Peterson in (at least) one of his videos. What I find amusing is that it seems to contradict the principle of economies of scale: if there is a connection between the productivity \(p\) and the economic efficiency (which is more likely the higher the personnel costs are), then an increase in the size of a company decreases its efficiency. For instance, a chain of ten supermarkets would be less efficient than ten independent units, which would in turn be less efficient than many small shops. Since the market is supposed to select for efficiency, we should witness fragmentation rather than consolidation.

References:

https://subversion.american.edu/aisaac/notes/pareto-distribution.pdf Clear derivation of the 20/80 principle from the general Pareto distribution.