The qsort optimization that wasn’t

In 1993, Jon Bentley and Doug McIlroy published a paper Engineering a Sort Function (also here) (Software–Practice and Experience, vol. 23 no. 11, Nov. 1993, pp 1249-1265). Since then, most C stdlib qsort functions have been based on the code and/or the ideas in that paper.

The qsort in the BSD variants and descendants have followed the code in that paper. But someone in 1994 attempted to “improve” on Bentley and McIlroy, with potentially disastrous results.

The NetBSD code prior to 1994, in NetBSD qsort 1.1 (copied from 386BSD), contained a qsort that attempts to follow the quicksort discussion in Donald Knuth’s The Art of Computer Programming, volume 3, Sorting and Searching (1973, first edition). I believe the author would have been better off following the advice in the later paper by Robert Sedgewick, Implementing Quicksort Programs (Communications of the ACM, vol 21 no 10, Oct. 1978 pp 847–857). Sedgewick was Knuth’s PhD student and had done his thesis on quicksort.

The comments in the 386BSD code imply to me that some of Knuth was misunderstood. For example: “Knuth, Vol. 3, page 116, Algorithm Q, step b, argues that a single pass of straight insertion sort after partitioning is complete is better than sorting each small partition as it is created.”

Oddly, Knuth never suggests a single pass of insertion sort instead of sorting each small partition; that was suggested later by Sedgewick. Knuth step b says “Subfiles of M or fewer elements are sorted by straight insertion…” and details doing that in step Q8, page 117.

A later comment says “Insertion sort has the same worst case as most simple sorts (O N^2). It gets used here because it is (O N) in the case of sorted data.” It is true that it is O(N2) worst case and O(N) if the data is sorted, but that is not why it is used with quicksort. It is used because it is faster for small subsections than quicksort due to its lower “bookkeeping” overhead, per Knuth on page 116.

The 386BSD routine departed from Knuth in an important respect: its partitioning code “jumped over” equal elements instead of stopping and swapping them. Bentley and McIlroy noted: “Shopping around for a better qsort, we found that a qsort written at Berkeley in 1983 would consume quadratic time on arrays that contain a few elements repeated many times–in particular arrays of random zeros and ones.” And later: “The Berkeley qsort takes quadratic time on random zeros and ones precisely because it scans over equal elements…”

The author also did something else not found in Knuth, however. There is this comment:

	 * Quicksort behaves badly in the presence of data which is already
	 * sorted (see Knuth, Vol. 3, page 119) going from O N lg N to O N^2.
	 * To avoid this worst case behavior, if a re-partitioning occurs
	 * without swapping any elements, it is not further partitioned and
	 * is insert sorted.  This wins big with almost sorted data sets and
	 * only loses if the data set is very strangely partitioned.  A fix
	 * for those data sets would be to return prematurely if the insertion
	 * sort routine is forced to make an excessive number of swaps, and
	 * continue the partitioning.

(Knuth does not say this on page 119; he does say on page 122 “If the original file is already in order … this situation … makes quicksort anything but quick; its running time becomes proportional to N2 instead of N log N.” This refers to a version that chooses the first element as the pivot. But Knuth describes ways of fixing the situation, including choosing the pivot at random or choosing the median of first, middle, and last elements. And the 386BSD routine does use the median-of-3 method.)

So the code includes a flag didswap which is set if the partitioning pass does any swapping, and then uses an insertion sort on the partition if didswap is false. This “optimization” loses very big if the data is “very strangely partitioned”. This is probably the cause of this issue called out in Bentley and McIlroy: “…several other problems with the Berkeley code. (For instance, a mod-4 ‘sawtooth’ with its front half reversed ran a factor of 8 slower than expected.)” Also, the comment blithely suggests “A fix for those data sets would be to return prematurely if the insertion sort routine is forced to make an excessive number of swaps” without indicating how that might be done without too much code or computation.

The Bentley-McIlroy paper reported that “In any event, the new code represents enough of an improvement to have been adopted in our own lab and by Berkeley.” But someone at Berkeley failed to read and understand the paper. The first FreeBSD qsort (1994-05-27) based on the Bentley-McIlroy paper, (also in NetBSD revision 1.1.1.2 (1994-06-16) and OpenBSD revision 1.1 (1995-10-18)) was copied from 4.4BSD-Lite. It has a variable swap_cnt that serves the same purpose as didswap in the earlier qsort. It’s misnamed, because it’s a 0/1 flag, not a counter. Then there is:

	if (swap_cnt == 0) {  /* Switch to insertion sort */
		for (pm = a + es; pm < (char *) a + n * es; pm += es)
			for (pl = pm; pl > (char *) a && cmp(pl - es, pl) > 0;
			     pl -= es)
				swap(pl, pl - es);
		return;
	}

This causes exactly the problem with the sawtooth pattern mentioned above.

This problem was noticed a long time ago. See, e.g., this thread by postgresql developers. Apparently the “very strangely partitioned” data happened in the real world:

qsort, once again
From:Tom Lane
Date:16 March 2006, 18:37:58

I was just looking at the behavior of src/port/qsort.c on the test case
that Jerry Sievers was complaining about in pgsql-admin this morning.
I found out what the real weak spot is: it's got nothing directly to do
with good or bad pivots, it's this code right here:
   if (swap_cnt == 0)   {                            /* Switch to insertion sort */
   [... code shown above -- rdg ...]
   }
In other words, if qsort hits a subfile for which the chosen pivot is a
perfect pivot (no swaps are necessary), it switches to insertion sort.
Which is O(N^2).  In Jerry's test case this happens on a subfile of
736357 elements, and you can say goodnight to that process ....

This swap_cnt code was in place in NetBSD until qsort revision 1.20, 2009-06-01, about 15 years later. (Commit message: “qsort: remove the “switch to insertion sort” optimization because it causes catastrophic performance for certain inputs.”)

It remained in OpenBSD until qsort revision 1.12, 2014-06-12, about 20 years later. (Commit message: “Disable the “switch to insertion sort” optimization to avoid quadratic behavior for certain inputs. From NetBSD.”

As of this writing, it remains in FreeBSD and in several other libc distributions, e.g. Newlib, picolibc, bionic, reactos.

The postgresql team apparently decided to use their own qsort exclusively. It includes a different optimization: at the beginning of each partitioning operation, it checks to see if the data is already in order, and skips sorting that subfile if it is already sorted. (Some members of the group are skeptical of this check.)

It also occurred to me that perhaps a Shell sort in place of the if (swap_cnt == 0) insertion sort would improve performance.

I took several versions that include the swap_cnt code and removed that code, then tried adding a pre-sorted test, a Shell sort in place of the if (swap_cnt == 0) insertion sort, and both the pre-sorted test and the Shell sort. These versions are named nnn_nopt, nnn_pre, nnn_shell, and nnn_pre_shell for the no-swap-count-optimization, pre-sort-check, Shell sort, and pre-sort-plus-Shellsort versions, where nnn is the name of the original version (freebsd, reactos, bionic). And I added bentley_mcilroy_pre, bentley_mcilroy_shell, and bentley_mcilroy_pre_shell versions as well.

I added a pre-check for sorted data to qs22i to get qs22j, and used that in the following tests.

Here are some results (10000 elements, 5 reps):

   Tot.rank     Swaps     Compares      Time   Ratio Implementation
 1.  377     94128580    459721400   6.806 s   1.000 qs22j
 2.  705    107019100    541185040   8.011 s   1.177 qs22i
 3.  817    111294680    508438960   7.933 s   1.165 reactos_pre_shell
 4.  918    111244760    508450060   7.935 s   1.166 bentley_mcilroy_pre_shell
 5.  955            0    508438960   8.064 s   1.185 freebsd_pre_shell
 6. 1003            0    535753740   8.258 s   1.213 freebsd_pre
 7. 1010    109557610    515009880   7.992 s   1.174 bentley_mcilroy_pre
 8. 1053            0    508438960   8.206 s   1.206 bionic_pre_shell
 9. 1124    108492120    535753740   8.295 s   1.219 reactos_pre
10. 1305    159348860    539955580  10.402 s   1.528 bentley_mcilroy_shell
11. 1326            0    535753740   8.377 s   1.231 bionic_pre
12. 1380    159381080    539941060  10.103 s   1.484 reactos_shell
13. 1487    118093240    585420940   8.908 s   1.309 bentley_mcilroy2
14. 1536            0    539941060  10.134 s   1.489 freebsd_shell
15. 1545            0    585420940  11.149 s   1.638 bentmcil
16. 1630            0    539941060  10.196 s   1.498 bionic_shell
17. 1680    155190940    590927860  10.575 s   1.554 reactos_nopt
18. 1727    158006340    585420940  10.597 s   1.557 bentley_mcilroy
19. 1754            0    590927860  10.648 s   1.564 freebsd_nopt
20. 1867            0    590927860  11.162 s   1.640 bionic_nopt

And here are results from the “izabera” (Isabella Bosia) tests:

   Tot.rank     Swaps     Compares      Time   Ratio Implementation
 1.  299     56211700    402119340   5.971 s   1.000 qs22j
 2.  441     56212300    467930740   6.952 s   1.164 qs22i
 3.  591            0    492160640   7.356 s   1.232 freebsd_pre
 4.  594     72794660    480684760   7.396 s   1.239 reactos_pre_shell
 5.  654            0    480684760   7.455 s   1.249 freebsd_pre_shell
 6.  685     72798140    480673320   7.437 s   1.246 bentley_mcilroy_pre_shell
 7.  698     66451700    492160640   7.412 s   1.241 reactos_pre
 8.  706     67278940    465405160   6.992 s   1.171 bentley_mcilroy_pre
 9.  729            0    480684760   7.541 s   1.263 bionic_pre_shell
10.  771     84726960    461110580   7.793 s   1.305 bentley_mcilroy_shell
11.  783            0    492160640   7.502 s   1.256 bionic_pre
12.  786     84720640    461115400   7.700 s   1.290 reactos_shell
13.  840            0    461115400   7.791 s   1.305 freebsd_shell
14.  870     76082840    506126400   8.012 s   1.342 reactos_nopt
15.  900            0    503202250   8.304 s   1.391 bentmcil
16.  906            0    506126400   8.145 s   1.364 bionic_nopt
17.  925            0    506126400   8.078 s   1.353 freebsd_nopt
18.  934            0    461115400   7.878 s   1.319 bionic_shell
19.  942     66704230    503202250   7.788 s   1.304 bentley_mcilroy2
20. 1065     78515800    503202250   8.271 s   1.385 bentley_mcilroy

Does the swap_cnt code improve the performance if we avoid the cases where it goes quadratic? Here are results including the original versions of bionic, freebsd, newlib, nlopt, picolibc, and reactos:

   Tot.rank     Swaps     Compares      Time   Ratio Implementation
 1.  345    812425500   3995362140  62.996 s   1.000 qs22j
 2.  579    934827860   4255667160  70.232 s   1.115 reactos_pre_shell
 3.  675    934744100   4255449220  70.785 s   1.124 bentley_mcilroy_pre_shell
 4.  707            0   4255667160  72.269 s   1.147 bionic_pre_shell
 5.  750    942426500   5280619020  80.768 s   1.282 qs22i
 6.  836            0   4462256520  85.932 s   1.364 picolibc
 7.  874            0   4255667160  73.264 s   1.163 freebsd_pre_shell
 8.  875    921501620   4550322480  73.574 s   1.168 reactos_pre
 9.  895            0   4462256520  86.235 s   1.369 newlib
10.  923            0   4550322480  75.254 s   1.195 freebsd_pre
11.  952            0   4550322480  75.365 s   1.196 bionic_pre
12.  957    925527200   4324550240  71.875 s   1.141 bentley_mcilroy_pre
13. 1018   1671482800   4462256520  86.726 s   1.377 reactos
14. 1097            0   4462256520  86.923 s   1.380 netbsd_1_4
15. 1268   1024111730   5490627380  85.876 s   1.363 bentley_mcilroy2
16. 1281            0   4462256520  90.780 s   1.441 bionic
17. 1328   1354974780   5048315340  95.209 s   1.511 reactos_shell
18. 1380   1354903560   5047593740  94.987 s   1.508 bentley_mcilroy_shell
19. 1413            0   4462256520  91.197 s   1.448 freebsd
20. 1431            0   4462256520  95.693 s   1.519 nlopt
21. 1446            0   5490627380 104.098 s   1.652 bentmcil
22. 1477   1334927640   5490627380  98.820 s   1.569 bentley_mcilroy
23. 1507            0   5532021360  99.464 s   1.579 freebsd_nopt
24. 1511   1318873580   5532021360  99.074 s   1.573 reactos_nopt
25. 1530            0   5048315340  97.048 s   1.541 freebsd_shell
26. 1573            0   5532021360 100.056 s   1.588 bionic_nopt
27. 1612            0   5048315340  97.321 s   1.545 bionic_shell

In most cases, adding only the Shell sort resulted in somewhat slower overall performance, but adding the pre-sort check and the Shell sort give better performance than the swap_cnt versions.

But including the cases that cause quadratic performance give clearly bad results for the swap_cnt sorts (ordered here by time rather than finishing rank):

   Tot.rank     Swaps     Compares      Time   Ratio Implementation
 1.  578     41547940    200606648   3.256 s   1.000 qs22j
 2.  929     49860980    219465656   3.699 s   1.136 reactos_pre_shell
 3.  973     49843156    219434364   3.717 s   1.141 bentley_mcilroy_pre_shell
 5. 1192            0    219465656   3.801 s   1.167 bionic_pre_shell
 8. 1346     47973632    222390056   3.874 s   1.190 bentley_mcilroy_pre
 6. 1308     47641164    232805696   3.885 s   1.193 reactos_pre
 4. 1098     47050340    246166168   3.914 s   1.202 qs22i
 9. 1370            0    232805696   4.024 s   1.236 freebsd_pre
 7. 1331            0    219465656   4.035 s   1.239 freebsd_pre_shell
10. 1440            0    232805696   4.092 s   1.257 bionic_pre
14. 1738     51969562    263591438   4.345 s   1.334 bentley_mcilroy2
15. 1766     70217484    240903484   4.722 s   1.450 reactos_shell
17. 1847     70204772    240834664   4.805 s   1.475 bentley_mcilroy_shell
22. 2153            0    240903484   4.877 s   1.498 bionic_shell
21. 2128            0    240903484   4.940 s   1.517 freebsd_shell
20. 2079     67699272    266621708   5.046 s   1.550 reactos_nopt
19. 2021     68827308    263591438   5.065 s   1.555 bentley_mcilroy
25. 2257            0    266621708   5.099 s   1.566 bionic_nopt
23. 2181            0    266621708   5.099 s   1.566 freebsd_nopt
18. 2002            0    263591438   5.383 s   1.653 bentmcil
16. 1835            0   1624564520  31.208 s   9.583 netbsd_1_4
12. 1650            0   1624564520  31.601 s   9.704 picolibc
11. 1639            0   1624564520  32.435 s   9.960 newlib
13. 1724   1495494748   1624564520  32.919 s  10.108 reactos
24. 2194            0   1624564520  37.001 s  11.362 bionic
27. 2311            0   1624564520  37.474 s  11.507 freebsd
26. 2270            0   1624564520  38.115 s  11.703 nlopt

Not shown here are the Bosia tests, but they show similar results.

I conclude that if you want to use a qsort derived from Bentley and McIlroy, you should at least replace the swap_cnt == 0 insertion sort with a Shell sort, and probably add the check for already-sorted data.

Another change made by BSD from the Bentley-McIlroy paper was to replace the second of the two recursive calls with goto, with the comment “Iterate rather than recurse to save stack space”. This will save little or no space unless the partitioning is significantly uneven. But it also does not prevent a possible O(N2) stack depth usage. (Also, the goto goes to a point just before SWAPINIT(a, es);, so the SWAPINIT() macro gets called redundantly many times.) Only in 2017 did the major BSD-derived versions change to always recursing on the smaller partition, which eliminates the quadratic stack space problem. (Bentley and McIlroy explicitly decided not to deal with the possibility.)

One other change made by FreeBSD seems odd to me: Just recently (January 2022) they made a change “prevent undefined behavior,” with the following comments:

Mark Milliard has detected a case of undefined behavior with the LLVM
UBSAN. The mandoc program called qsort with a==NULL and n==0, which is
allowed by the POSIX standard. The qsort() in FreeBSD did not attempt
to perform any accesses using the passed pointer for n==0, but it did
add an offset to the pointer value, which is undefined behavior in
case of a NULL pointer. This operation has no adverse effects on any
achitecture supported by FreeBSD, but could be caught in more strict
environments.

After some discussion in the freebsd-current mail list, it was
concluded that the case of a==NULL and n!=0 should still be caught by
UBSAN (or cause a program abort due to an illegal access) in order to
not hide errors in programs incorrectly invoking qsort().

Only the the case of a==NULL and n==0 should be fixed to not perform
the undefined operation on a NULL pointer.

I cannot see how this “is allowed by the POSIX standard,” which says “The qsort() function shall sort an array of nel objects, the initial element of which is pointed to by base.” Nowhere does it say the base pointer can be NULL; it must point to “an array of nel objects.” This is clearly a bug in mandoc. There are many ways to call qsort with invalid arguments that can cause various sorts of trouble.

On the matter of avoiding undefined behavior, many of the BSD variants still use the Bentley-McIlroy “trick” of ((char *)a - (char *)0) to get a usable “address” from the base pointer. This is undefined behavior; both pointers in a pointer subtraction must point within the same array (or one past the end). I think these should be updated to use ((uintptr_t)(void *)a) if uintptr_t is available (it’s an optional <stdint.h> type in C99 and C11).

(See also: part 1, part 2, part 4, part 5)