Rank-size distribution

Rank-size distribution is the distribution of size by rank, in decreasing order of size. For example, if a data set consists of items of sizes 5, 100, 5, and 8, the rank-size distribution is 100, 8, 5, 5 (ranks 1 through 4). This is also known as the rank-frequency distribution, when the source data are from a frequency distribution. These are particularly of interest when the data vary significantly in scale, such as city size or word frequency. These distributions frequently follow a power law distribution, or less well-known ones such as a stretched exponential function or parabolic fractal distribution, at least approximately for certain ranges of ranks; see below. Rank-size distribution is the distribution of size by rank, in decreasing order of size. For example, if a data set consists of items of sizes 5, 100, 5, and 8, the rank-size distribution is 100, 8, 5, 5 (ranks 1 through 4). This is also known as the rank-frequency distribution, when the source data are from a frequency distribution. These are particularly of interest when the data vary significantly in scale, such as city size or word frequency. These distributions frequently follow a power law distribution, or less well-known ones such as a stretched exponential function or parabolic fractal distribution, at least approximately for certain ranges of ranks; see below. A rank-size distribution is not a probability distribution or cumulative distribution function. Rather, it is a discrete form of a quantile function (inverse cumulative distribution) in reverse order, giving the size of the element at a given rank. In the case of city populations, the resulting distribution in a country, a region, or the world will be characterized by its largest city, with other cities decreasing in size respective to it, initially at a rapid rate and then more slowly. This results in a few large cities and a much larger number of cities orders of magnitude smaller. For example, a rank 3 city would have one-third the population of a country's largest city, a rank 4 city would have one-fourth the population of the largest city, and so on. When any log-linear factor is ranked, the ranks follow the Lucas numbers, which consist of the sequentially additive numbers 1, 3, 4, 7, 11, 18, 29, 47, 76, 123, 199, etc. Like the more famous Fibonacci sequence, each number is approximately 1.618 (the Golden ratio) times the preceding number. For example, the third term in the sequence above, 4, is approximately 1.6183, or 4.236; the fourth term, 7, is approximately 1.6184, or 6.854; the eighth term, 47, is approximately 1.6188, or 46.979. With higher values, the figures converge. An equiangular spiral is sometimes used to visualize such sequences. A rank-size (or rank-frequency) distribution is often segmented into ranges. This is frequently done somewhat arbitrarily or due to external factors, particularly for market segmentation, but can also be due to distinct behavior as rank varies. Most simply and commonly, a distribution may be split in two, termed the head and tail. If a distribution is broken into three pieces, the third (middle) piece has several terms, generically middle, also belly, torso, and body. These frequently have some adjectives added, most significantly long tail, also fat belly, chunky middle, etc. In more traditional terms, these may be called top-tier, mid-tier, and bottom-tier. The relative sizes and weights of these segments (how many ranks in each segment, and what proportion of the total population is in a given segment) qualitatively characterizes a distribution, analogously to the skewness or kurtosis of a probability distribution. Namely: is it dominated by a few top members (head-heavy, like profits in the recorded music industry), or is it dominated by many small members (tail-heavy, like internet search queries), or distributed in some other way? Practically, this determines strategy: where should attention be focused? These distinctions may be made for various reasons. For example, they may arise from differing properties of the population, as in the 90–9–1 principle, which posits that in an internet community, 90% of the participants of a community only view content, 9% of the participants edit content, and 1% of the participants actively create new content. As another example, in marketing one may pragmatically consider the head as all members that receive personalized attention, such as personal phone calls; while the tail is everything else, which does not receive personalized attention, for example receiving form letters; and the line is simply as far as resources allow, or where it makes business sense to stop. Purely quantitatively, a conventional way of splitting a distribution into head and tail is to consider the head to be the first p portion of ranks, which account for 1 − p {displaystyle 1-p} of the overall population, as in the 80:20 Pareto principle, where the top 20% (head) comprises 80% of the overall population. The exact cutoff depends on the distribution – each distribution has a single such cutoff point—and for power laws can be computed from the Pareto index.

Parent Topic

Child Topic

No Parent Topic