August 29, 2021

The Pareto distribution and Price's law

As detailed in the previous post, the fraction $f$ of top authors that publish a fraction $v$ of all publications is independent of the total number of authors $N_0$. Of course, this result is incompatible with Price's law (which requires that, for $v=0.5$, $f=1/\sqrt{N_0}$). This issue has been discussed by Price and co-workers [1], but I will take here a slightly different approach.

I had assumed in my derivation that the domain of the distribution was unbounded above ($H=\infty$) and that the exponent $\alpha$ was higher than 1. One can relax these assumptions and check their effect on $f$ by:

  1. imposing a finite upper bound $H$ and
  2. setting $\alpha=1$. Note that 2. also requires 1. 

Role of the upper bound

In the finite $H$ case one must use the full expressions (containing $H$ and $L$) for the various quantities. In this section we will continue to assume that $\alpha>1$. Since $L$ acts everywhere as a scale factor for $x$ (and $H$), I will set it to 1 in the following. It is also reasonable to assume that the least productive authors have one publication (why truncate at a higher value?!) Consequently, all results will also depend on $H$, but presumably not explicitly on $N_0$, which is a prefactor for the PDF and should cancel out of all expectation calculations. It is, however, quite likely that $H$ itself will depend on $N_0$, since more authors will lead to a higher maximum publication number!

In my opinion, the most reasonable assumption is that there is only one author with $H$ publications, so that $N_0\,p(H)=1 \Rightarrow H=(N_0\,\alpha)^{\frac{1}{\alpha+1}}$, neglecting the normalization prefactor of $p(x)$.

The threshold number $x_f$ is easy to obtain directly from $S(x)$:

$$x_f=\left[f+(1-f)H^{-\alpha}\right]^{-1/\alpha}$$

From its definition, the fraction $v$ is given by: $$v=\frac{\alpha}{\mu}\,\frac{1}{1-H^{-\alpha}}\,\frac{1}{\alpha-1}\left(x_f^{1-\alpha}-H^{1-\alpha}\right).$$ Note that we need here the complete expression for the mean [2]:

$$\mu=\frac{\alpha}{\alpha-1}\,L\,\frac{1-H^{1-\alpha}}{1-H^{-\alpha}}$$

Plugging $x_f$ and $\mu$ into the definition of $v$ and setting $v=1/2$ yields:

$$f=f_\infty\,\frac{\left(1+H^{1-\alpha}\right)^{\frac{\alpha}{\alpha-1}}-2^{\frac{\alpha}{\alpha-1}}H^{-\alpha}}{1-H^{-\alpha}},\quad\text{with}\quad f_\infty=\left(\frac{1}{2}\right)^{\frac{\alpha}{\alpha-1}},$$

and we assume that the upper bound is given by:

$$H=(N_0\,\alpha)^{\frac{1}{\alpha+1}}.$$
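These finite-$H$ results are easy to check numerically. Below is a minimal Python sketch (function names are mine) that evaluates $f$ from the closed form above and verifies that the corresponding threshold $x_f$ indeed captures half of the total output:

```python
# Sketch: fraction f of top authors contributing half the output, for a
# Pareto law with L = 1 truncated at H = (N0*alpha)^(1/(alpha+1)).
# f_inf is the unbounded-H limit (1/2)^(alpha/(alpha-1)).

def top_fraction(N0, alpha):
    H = (N0 * alpha) ** (1 / (alpha + 1))
    e = alpha / (alpha - 1)
    f_inf = 0.5 ** e
    return f_inf * ((1 + H ** (1 - alpha)) ** e - 2 ** e * H ** -alpha) / (1 - H ** -alpha)

def v_of_f(N0, alpha, f):
    # output share of the top fraction f (finite H, L = 1): the simplified
    # form v = (x_f^(1-a) - H^(1-a)) / (1 - H^(1-a)) after inserting mu
    H = (N0 * alpha) ** (1 / (alpha + 1))
    x_f = (f + (1 - f) * H ** -alpha) ** (-1 / alpha)
    return (x_f ** (1 - alpha) - H ** (1 - alpha)) / (1 - H ** (1 - alpha))

f = top_fraction(1e6, 1.5)
assert abs(v_of_f(1e6, 1.5, f) - 0.5) < 1e-9  # the top fraction f yields v = 1/2
```

For large $N_0$, `top_fraction` approaches `f_inf`, as expected from the plot below.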

Exponent α=1

Let us rewrite the PDF, CDF and survival function in this particular case:

$$p(x)=\frac{1}{1-H^{-1}}\,\frac{1}{x^2};\qquad F(x)=\frac{1-x^{-1}}{1-H^{-1}};\qquad S(x)=1-F(x)=\frac{x^{-1}-H^{-1}}{1-H^{-1}}$$

$$x_f=S^{-1}(f)=\frac{1}{f+(1-f)H^{-1}}$$

$$v=\frac{1}{2}=1-\frac{\ln(x_f)}{\ln(H)}\Rightarrow x_f=\sqrt{H}\quad\text{and, since }H=\sqrt{N_0},\quad x_f=N_0^{1/4}$$

Putting it all together yields $f=\dfrac{N_0^{1/4}-1}{N_0^{1/2}-1}$ and, in the high $N_0$ limit, $f\simeq N_0^{-1/4}$, so the number of "prolific" authors is $N_p=fN_0=N_0^{3/4}$, a result also obtained by Price et al. [1] using the discrete distribution. They also showed that other power laws (from $N_0^{1/2}$ to $N_0^{1}$) can be obtained, depending on the exact dependence of $H$ on $N_0$.
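The $\alpha=1$ expression and its asymptote can be tabulated with a few lines of Python (a sketch, using my notation):

```python
# The alpha = 1 result, assuming H = sqrt(N0)
def f_alpha1(N0):
    # equivalently 1 / (N0**0.25 + 1), which makes the asymptotics obvious
    return (N0 ** 0.25 - 1) / (N0 ** 0.5 - 1)

for N0 in (1e2, 1e4, 1e8):
    print(N0, f_alpha1(N0), N0 ** -0.25)  # compare with the asymptote N0**(-1/4)
```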

Fraction $f$ of the most prolific authors that contribute $v=1/2$ of the total output, as a function of the total number of authors $N_0$, for various exponents $\alpha$. The unbounded limit $f_\infty=f(H\to\infty)$, calculated in the previous post, is also shown for $\alpha>1$. With my choice for the relation between $N_0$ and $H$, this also corresponds to $N_0\to\infty$. The particular value $\alpha=1.16$ yields the 20/80 rule, but also the 0.6/50 rule shown as solid black line. Note that the curve for $\alpha=1$ is computed using a different formula than the others and does not reach a plateau: its asymptotic regime $f\simeq N_0^{-1/4}$ is shown as dotted line.
The graph above summarizes all these results: for $\alpha=1$, $f$ reaches the asymptotic regime $f\simeq N_0^{-1/4}$ very quickly ($N_0\simeq 100$). For $\alpha>1$, $f$ leaves this asymptote and saturates at its unbounded limit $f(H\to\infty)$, calculated in the previous post. This regime change is very slow for $\alpha<2$: the plateau is only reached for $N_0>10^6$.
In conclusion, an attenuated version of Price's law is indeed obtained for $\alpha=1$ (where it holds for any $N_0$) but also for reasonably low $\alpha>1$, in particular for $\alpha=1.16$ (of 20/80 fame), where it applies for any practical number of authors! As soon as $\alpha$ exceeds about 1.5, the decay is shallow and saturates quickly, so $f$ is relatively flat.


[1] Allison, P. D. et al., "Lotka's Law: A Problem in Its Interpretation and Application", Social Studies of Science 6, 269–276 (1976).

August 28, 2021

The Pareto distribution and the 20/80 rule

I mentioned in the previous post Pareto's 20/80 rule. Here, I will discuss Pareto's distribution, focusing on how (and under what conditions) it gives rise to this result. I had some trouble understanding the derivation as presented in various sources, so I will go through it in detail.

The functional form of the Pareto distribution is a power law over an interval $(L,H)$ such that $0<L<H$. I will use the notations of the Wikipedia page unless stated otherwise. Its probability density function (PDF) $p(x)$ and cumulative distribution function (CDF) $F(x)$ are ($\alpha$ is real and strictly positive):

$$p(x)=\frac{\alpha}{1-(L/H)^{\alpha}}\,\frac{1}{x}\left(\frac{L}{x}\right)^{\alpha};\qquad F(x)=\frac{1-(L/x)^{\alpha}}{1-(L/H)^{\alpha}}$$

One often uses the complementary CDF (or survival function) defined as:

$$S(x)=1-F(x)=\frac{1}{1-(L/H)^{\alpha}}\left[\left(\frac{L}{x}\right)^{\alpha}-\left(\frac{L}{H}\right)^{\alpha}\right]$$

Note that the survival function is very similar to the PDF multiplied by $x$: $S(x)\simeq\frac{x}{\alpha}\,p(x)$, the difference being due only to the final truncation term. However, this is only true for power laws, as one can easily check by writing $p(x)=F'(x)$ and solving the resulting ODE. We should therefore carefully distinguish $x\,p(x)$ (which is, for instance, the integrand to use for computing the mean of the distribution) from $S(x)$, which "has already been integrated", so to speak.
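These expressions are easy to verify numerically; the Python sketch below (parameter values are arbitrary) checks that $S(x)$ equals $\frac{x}{\alpha}p(x)$ up to the constant truncation term:

```python
# Numerical check of the bounded-Pareto expressions (arbitrary parameters)
L, H, alpha = 1.0, 100.0, 1.5
C = 1 - (L / H) ** alpha  # normalization constant

def p(x): return alpha / C / x * (L / x) ** alpha        # PDF
def F(x): return (1 - (L / x) ** alpha) / C              # CDF
def S(x): return ((L / x) ** alpha - (L / H) ** alpha) / C  # survival function

x = 7.0
trunc = (L / H) ** alpha / C  # constant truncation term
# S(x) equals (x/alpha) p(x) minus the truncation term:
assert abs(S(x) - (x / alpha * p(x) - trunc)) < 1e-12
assert abs(F(x) + S(x) - 1) < 1e-12  # F and S are complementary
```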

Let us use this continuous model to describe the distribution of publications (neglecting for now its intrinsically discrete character). $x$ stands for the number of publications by one author, bounded by $L$ and $H$. The number of authors that published $x$ papers is given by $N_0\,p(x)$, where $N_0$ is the total number of authors. 

  • The first question is: who are the top fraction $f$ of the most prolific authors (in Pareto's case, $f=0.2=20\%$)? More precisely, what is the threshold number of publications $x_f$ separating them from the less prolific ones?
This is quite easy: if we go through the list of authors (ordered by increasing $x$), when we reach $x_f$ we will have counted the lower fraction, so $\int_L^{x_f}p(x)\,dx=F(x_f)=1-f$. Thus, the survival function is $S(x_f)=\int_{x_f}^{H}p(x)\,dx=f$ and we can simply invert this dependency to get $x_f=S^{-1}(f)$.
  •  The second question is: how many publications did these top $f$ authors contribute?
We need to count the authors again, but with an additional factor of $x$, since there are $N_0\,p(x)$ authors with exactly $x$ publications, for a total contribution of $x\,N_0\,p(x)$. The fraction $v$ of publications contributed by the top $f$ authors is then:
$$v=\frac{\int_{x_f}^{H}x\,N_0\,p(x)\,dx}{\int_{L}^{H}x\,N_0\,p(x)\,dx}=\frac{\int_{x_f}^{H}x\,p(x)\,dx}{\mu}$$
where $\mu$ is the mean of the distribution and $N_0\,\mu$ is the total number of publications.
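These two definitions can be sanity-checked by simulation: draw authors via the inverse CDF, then measure the head-count and output shares above $x_f$. A Python sketch (illustrative parameter values, my notation):

```python
# Monte Carlo sanity check of x_f = S^-1(f) and of the output share v
# (illustrative values; L = 1 so the simple x_f formula applies)
import random
random.seed(0)

L, H, alpha, f = 1.0, 50.0, 1.16, 0.2
C = 1 - (L / H) ** alpha

def sample():
    # inverse-CDF sampling: solve F(x) = u for x
    u = random.random()
    return L * (1 - u * C) ** (-1 / alpha)

xs = [sample() for _ in range(200_000)]
x_f = (f + (1 - f) * (L / H) ** alpha) ** (-1 / alpha)  # S^-1(f) for L = 1
top = [x for x in xs if x >= x_f]
print(len(top) / len(xs))   # head count: should be close to f = 0.2
print(sum(top) / sum(xs))   # output share v of the top authors
```

Note that with this finite $H$ the top 20% capture noticeably less than 80% of the output; the 20/80 result below holds in the $H\to\infty$ limit.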

In the simple case $H=\infty$ (which requires $\alpha>1$ for the mean to be finite), one has:
$$p(x)=\frac{\alpha}{x}\left(\frac{L}{x}\right)^{\alpha},\quad\text{with}\quad\mu=\frac{\alpha}{\alpha-1}\,L$$
$$f=S(x_f)=\left(\frac{L}{x_f}\right)^{\alpha}\Rightarrow x_f=L\,f^{-1/\alpha}$$
 
Plugging the above into the equation for v yields:
$$v=\frac{1}{\mu}\int_{x_f}^{\infty}x\,p(x)\,dx=\frac{\alpha}{\mu}\int_{x_f}^{\infty}\left(\frac{L}{x}\right)^{\alpha}dx=\left(\frac{L}{x_f}\right)^{\alpha-1}=f^{\frac{\alpha-1}{\alpha}}\Rightarrow f=v^{\frac{\alpha}{\alpha-1}}$$
Pareto's rule $f=0.2$ and $v=0.8$ requires $\alpha\simeq 1.161$: a power law with this exponent will obey the rule, irrespective of the values of $L$ and $N_0$. Despite the neat coincidence in the established statement of the principle, there is absolutely no need that $f+v=1$! For instance, the same $\alpha$ implies that, for $v=0.5$, $f\simeq 0.0067$, a result I have already used in the previous post.
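The 20/80 exponent and its prediction for $v=1/2$ follow in a couple of lines of Python (using the unbounded result above):

```python
# From f = v**(alpha/(alpha-1)):  alpha/(alpha-1) = ln f / ln v
import math

r = math.log(0.2) / math.log(0.8)  # the exponent alpha/(alpha-1), ~ 7.21
alpha = r / (r - 1)                # ~ 1.161
f_half = 0.5 ** r                  # top fraction needed for v = 1/2, ~ 0.0067
print(alpha, f_half)
```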

August 27, 2021

Price's law is not intensive

Price's law was proposed in the context of scientific publishing: 

The square root of the total number of authors contributes half of the total number of publications.

It is a more extreme version of Pareto's 20/80 rule, which would state that 20% of authors contribute 80% of the total number of publications (see next post for the relation between the two). Unlike Pareto's rule, however, Price's law is not stable under extension. This is a trivial observation, but I have not yet seen it in the literature, just like I have not seen much empirical evidence for Price's law.

Let us denote by $N$ the total number of authors and by $N_p$ the number of "productive" authors (the top authors that provide half of all publications), and define the productivity $p=N_p/N$. As the ratio of two extensive quantities, $p$ should be independent of the system size $N$: consider ten systems (e.g. the research communities in different countries, different subjects, etc.), each of size $N$, with the same publication distribution and hence the same $N_p$. Half of the total number of publications is published by $10N_p$ contributors, so the overall productivity is $p'=\frac{10N_p}{10N}=p$. According to Price's law, it should however be $p'=1/\sqrt{10N}=p/\sqrt{10}$! The situation is similar to having ten identical vessels, all under the same pressure $p$. If we connect them all together the pressure does not change, although both the volume and the energy increase by a factor of ten.

Price's law does have a "convenient" feature: simply by selecting the representative size $N$ one can obtain any productivity, since $p=1/\sqrt{N}$. For instance, the same Pareto distribution that yields the 20/80 rule predicts that 0.7% of causes yield half of the effects. This result is reproduced by Price's law with $N\simeq 23000$.
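The extensivity argument can be made concrete with a toy computation, assuming Price's law $N_p=\sqrt{N}$ for each community:

```python
# Toy illustration: merging ten identical communities leaves the actual
# productive fraction unchanged, while Price's law says it should drop.
N = 10_000
Np = N ** 0.5                  # 100 productive authors per community (Price)
p_one = Np / N                 # productivity of one community: 0.01
p_merged = 10 * Np / (10 * N)  # ten merged communities: still 0.01
p_price = 1 / (10 * N) ** 0.5  # what Price's law demands: ~ 0.00316 = p_one/sqrt(10)
print(p_one, p_merged, p_price)
```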

Outside of bibliometrics, Price's law has been invoked in economics, for instance by Jordan Peterson in (at least) one of his videos. What I find amusing is that it seems to contradict the principle of economies of scale: if there is a connection between the productivity $p$ and the economic efficiency (and this is all the more likely the higher the personnel costs), then an increase in the size of a company decreases its efficiency. For instance, a chain of ten supermarkets would be less efficient than ten independent units, which would in turn be less efficient than many small shops. Since the market is supposed to select for efficiency, we should witness fragmentation, rather than consolidation. 

References:

https://subversion.american.edu/aisaac/notes/pareto-distribution.pdf — clear derivation of the 20/80 principle from the general Pareto distribution.