
math.gamma result slightly different on aarch64-apple-darwin #132763

Closed

youknowone opened this issue Apr 21, 2025 · 33 comments
Labels
type-bug An unexpected behavior, bug, or error

Comments

@youknowone
Contributor

youknowone commented Apr 21, 2025

Bug report

Bug description:

The difference was observed between official distributions.
Because the result changes depending on fp-contract, it may not strictly be a bug in the source code, but it could be a bug in the build & distribution.

On x86_64 linux:

Python 3.13.3 (main, Apr 21 2025, 11:31:07) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import math; math.gamma(-3.8510064710745118)
0.36222232384328096

On aarch64 macOS:

Python 3.13.2 (main, Feb  4 2025, 14:51:09) [Clang 16.0.0 (clang-1600.0.26.6)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import math; math.gamma(-3.8510064710745118)
0.36222232384328107

The tgamma.c file was split out from mathmodule.c to make testing easier.
new tgamma.c
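
(The file itself is attached only as a link; a plausible minimal sketch of such a harness, with the output format inferred from the logs below and the py_tgamma copy from mathmodule.c omitted, would be:)

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    double input = -3.8510064710745118;
    double result = tgamma(input);          /* libm's tgamma; the real file also calls py_tgamma */
    uint64_t bits;

    memcpy(&bits, &result, sizeof bits);    /* reinterpret the double as its IEEE-754 bit pattern */
    printf("The result of tgamma(%.17g) is: %f\n", input, result);
    printf("Bit representation of result: %llu\n", (unsigned long long)bits);
    return 0;
}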

The results below come from tgamma.c, but they are the same as in the CPython distribution.
Test results:

tgamma.c in x86_64 linux:

$ clang -lm tgamma.c -ffp-contract=off && ./a.out
The result of tgamma(-3.8510064710745118) is: 0.362222
Bit representation of result: 4600196837208649421
$ clang -lm tgamma.c -ffp-contract=on && ./a.out 
The result of tgamma(-3.8510064710745118) is: 0.362222
Bit representation of result: 4600196837208649421

tgamma.c in aarch64 macos:

$  clang -lm tgamma.c -ffp-contract=off && ./a.out
The result of tgamma(-3.8510064710745118) is: 0.362222
Bit representation of result: 4600196837208649421
$  clang -lm tgamma.c -ffp-contract=on && ./a.out
The result of tgamma(-3.8510064710745118) is: 0.362222
Bit representation of result: 4600196837208649423

The original test I found this mismatch:
On aarch64-apple-darwin,

x is input.
Left is CPython, right is pymath

thread 'gamma::tests::test_tgamma' panicked at src/gamma.rs:303:17:
assertion `left == right` failed: x = -3.8510064710745118, py_gamma = 0.36222232384328107, rs_gamma = 0.36222232384328096
  left: 4600196837208649423
 right: 4600196837208649421

thread 'gamma::tests::test_tgamma' panicked at src/gamma.rs:286:5:
Test failed: assertion `left == right` failed: x = -3.8510064710745118, py_gamma = 0.36222232384328107, rs_gamma = 0.36222232384328096
  left: 4600196837208649423
 right: 4600196837208649421
minimal failing input: x = -3.8510064710745118
	successes: 0
	local rejects: 0
	global rejects: 0

On x86_64-unknown-linux-gnu,

x = -3.8510064710745118
py_gamma = 0.36222232384328096
rs_gamma = 0.36222232384328096
py_gamma_repr = 4600196837208649421
rs_gamma_repr = 4600196837208649421

Turning off fp-contract makes the aarch64 macOS build match the x86 Linux/Windows builds.

CPython versions tested on:

3.13

Operating systems tested on:

macOS

@youknowone youknowone added the type-bug An unexpected behavior, bug, or error label Apr 21, 2025
@corona10
Member

cc @rhettinger @ned-deily

@tim-one
Member

tim-one commented Apr 21, 2025

It's a 2 ULP difference:

>>> 0.36222232384328096 .hex()
'0x1.72ea68ab26ecdp-2'
>>> 0.36222232384328107 .hex()
'0x1.72ea68ab26ecfp-2'
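
(As an aside, one quick way to get that count directly is to compare the two bit patterns as integers; a small C sketch, assuming both values are finite, non-zero, and of the same sign:)

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* ULP distance between two finite doubles of the same sign, via their bit patterns */
static int64_t ulp_distance(double a, double b)
{
    int64_t ia, ib;
    memcpy(&ia, &a, sizeof ia);
    memcpy(&ib, &b, sizeof ib);
    return ia - ib;
}

int main(void)
{
    /* the two values reported above differ by 2 ULP */
    printf("%lld\n", (long long)ulp_distance(0.36222232384328107, 0.36222232384328096));
    return 0;
}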

I don't consider this to be "a bug". CPython makes no promises about the accuracy of "advanced" math functions, and is typically at the mercy of the platform's C compiler and libm. They can't be expected to deliver identical results across platforms, compilers, or even across releases (of the compiler or of Python) on a single platform.

In this case, it looks like the MacOS C compiler is exploiting the HW's fused multiply-add (FMA) support. There's nothing in our source code that asks for that, or that forbids it. It's up to the compiler, and that's fine. FMA is generally a good thing, running faster and with smaller error. Of course it can have tiny effects on the low-order bits of the final result (or, for algorithms that rely on it, which Python doesn't use, huge effects).

@skirpichev
Member

FMA is generally a good thing, running faster and with smaller error.

BTW, in the OP's case the aarch64 result seems slightly better (roughly 1 ULP off).

The difference was observed between official distributions.

It depends on how you do this. If you compare a computed value with some float by ==, that's rather a bug in your tests.

@skirpichev skirpichev added the pending The issue will be closed if no feedback is provided label Apr 21, 2025
@youknowone
Contributor Author

youknowone commented Apr 21, 2025

I don't consider this to be "a bug".

I agree this is not necessarily a bug. Checking whether this matches CPython's policy will be good enough. But the reason is not that it is a 2 ULP difference; it is that there is no promise about advanced math functions, right?

Finding a bigger difference is possible.

$ clang -lm tgamma.c && ./a.out
The result of tgamma(-3.8510064710745118) is: 3.809507
Bit representation of result: 4615760666384231000
$ clang -lm tgamma.c && ./a.out
The result of tgamma(3.6215752811868267) is: 3.809507
Bit representation of result: 4615760666384231005

If you compare a computed value with some float by ==

I am not comparing the float values; I am comparing the bit representations of the float values. Probably I didn't describe the issue in enough detail. I am not insisting that the result of math.gamma is incorrect, only that the bit representation of the gamma function differs on aarch64-apple-darwin, unlike x86 Linux or Windows, among the major platforms I have.

For more context, I was going to write a port of those functions in Rust. I know floating-point results can vary by platform, libm, or a few other reasons.
The Rust implementation, using the same libc functions and carefully ported code, produced the same bit representation on all of the platforms. The bit representation was also the same for CPython on x86-64 Linux and Windows. Surprisingly, only CPython on aarch64-apple-darwin had a bit representation different from the other 3 Rust + 2 CPython results. I still think this is not strictly a bug, so I'd like to check whether this is intended or not. I thought there was a chance a build flag had been missed for aarch64.

@skirpichev
Member

Finding a bigger difference is possible

Sorry, it seems you used different arguments in your latest examples: -3.8510064710745118 vs 3.6215752811868267. It's hard to say in which example the difference is bigger.

I am comparing the bit representations of the float values.

It's the same story. You should rather use something like math.isclose().

@youknowone
Contributor Author

youknowone commented Apr 21, 2025

Sorry, the log label is a hardcoded one. (The source code is linked above.)
3.6215752811868267 is the correct input in that comment. You can compare their bit representations to see the difference.

@skirpichev I am not getting why you are emphasizing math.isclose() as the same story.
I know comparing bits is not the best practice for float operations. I described the purpose of the issue, and I am not insisting that small differences in values are real bugs. If it was not clear enough, please tell me which part was misleading and I will try to fix it.

@corona10
Member

corona10 commented Apr 21, 2025

Or leave a note at docs.python.org that it could show different bit-level results depending on compiler options?
We cannot control all distros that use custom compiler options.

@youknowone
Contributor Author

@corona10 Different bit representations for floating-point operations across platforms are expected behavior in computer science, so I feel like such a note would be redundant commentary on general computer-science knowledge.

If CPython wants to try best to keep consistent cross-platform floating-point behavior, modifying the build configuration with appropriate compiler flags would be the solution.

However, if the current default compiler options align with the project's policy of minimal compiler customization, then I believe we can simply close this issue as working as designed.

@skirpichev
Member

3.6215752811868267 is the correct input in that comment. You can compare their bit representations to see the difference.

Thanks. I guess the second case is your result on aarch64 (on an x86_64 Linux box I got your first result). In this case, it is worse by 1 ULP (3 ULP vs 2 ULP) than the x86_64 result:

In [17]: struct.unpack('<d', struct.pack('<q', 4615760666384231000))  # 1st example
Out[17]: (3.8095071915716225,)

In [18]: struct.unpack('<d', struct.pack('<q', 4615760666384231005))  # 2nd
Out[18]: (3.8095071915716248,)

In [19]: f"{gamma(3.6215752811868267):.17f}"  # mpmath, computed with mp.prec=113
Out[19]: '3.80950719157162333'

In [20]: 3.80950719157162333-3.8095071915716225
Out[20]: 8.30000000000000000061569678209333068e-16

In [21]: 3.80950719157162333-3.8095071915716248
Out[21]: -1.47000000000000000001158839675115145e-15

In [22]: math.ulp(3.80950719157162333)
Out[22]: 4.440892098500626e-16

In [23]: _20/_22
Out[23]: 1.86899384535875584013864258992043688

In [24]: _21/_22
Out[24]: -3.31014572611731456002609474964515357

In [25]: math.gamma(3.6215752811868267)  # my x86_64 box (c.f. In[17])
Out[25]: 3.8095071915716225

BTW, could you test libm's tgamma() on your systems?

why you are emphasizing math.isclose() as the same story.

It's a good idea to use something like this unless you are sure that your computations are identical (i.e. they differ only by transformations that hold for all floats; e.g. you can't use associativity to reorder operands). An optimization using the HW's FMA is an example of a transformation (permitted with -ffp-contract=on) that can break this.
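
(For illustration, the same kind of tolerance-based comparison is easy to write on the C side too; this sketch mirrors the documented check used by math.isclose and is not any CPython API:)

#include <math.h>
#include <stdbool.h>

/* abs(a-b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol), as math.isclose documents */
static bool is_close(double a, double b, double rel_tol, double abs_tol)
{
    return fabs(a - b) <= fmax(rel_tol * fmax(fabs(a), fabs(b)), abs_tol);
}

/* e.g. is_close(0.36222232384328107, 0.36222232384328096, 1e-09, 0.0) is true */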

modifying the build configuration with appropriate compiler flags would be the solution.

I doubt that forbidding floating-point contraction is a good idea. Besides the performance boost, you will probably get more accurate results on average.

@tim-one
Member

tim-one commented Apr 21, 2025

[@youknowone wrote]

Checking whether this matches CPython's policy will be good enough. But the reason is not that it is a 2 ULP difference; it is that there is no promise about advanced math functions, right?

Right! Although, to be fair, CPython makes no guarantees about the results of simple float arithmetic (+ - * /) either. Those results are also inherited from the platform C. It just so happens that almost all platforms today have HW that implements the IEEE-754 standard, which does define the precise results of simple arithmetic.

gamma is a little different in that CPython implements it itself. Our code doesn't rely on FMA, but it's up to the compiler whether to use it in lines like:

            num = num * x + lanczos_num_coeffs[i];
            den = den * x + lanczos_den_coeffs[i];

Results can differ.
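
(For reference, C99 also defines a standard pragma that forbids this kind of contraction per translation unit; compiler support varies, and GCC in particular has historically ignored it. A sketch, not something CPython's sources currently do:)

#pragma STDC FP_CONTRACT OFF    /* ask the compiler not to fuse a*x + b into fma(a, x, b) */

double lanczos_step(double acc, double x, double coeff)
{
    return acc * x + coeff;     /* with contraction off: two separately rounded operations */
}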

Why does CPython implement gamma itself? Because some platforms' implementations at the time were too sloppy to bear. From the comments:

/* Implementation of the real gamma function.  Kept here to work around
   issues (see e.g. gh-70309) with quality of libm's tgamma/lgamma implementations
   on various platforms (Windows, MacOS).  In extensive but non-exhaustive
   random tests, this function proved accurate to within <= 10 ulps across the
   entire float domain.  Note that accuracy may depend on the quality of the
   system math functions, the pow function in particular.  Special cases
   follow C99 annex F.  The parameters and method are tailored to platforms
   whose double format is the IEEE 754 binary64 format. ...
*/

Note that errors as large as 10 ULP were seen at the time it was checked in. It's possible that the platforms in question have improved their implementations since then. The kind of testing done in gh-70309 could be done again to see whether we still "need" to supply our own.

@youknowone youknowone changed the title math.gamma result slightly different on macos math.gamma result slightly different on aarch64-apple-darwin Apr 21, 2025
@youknowone
Contributor Author

youknowone commented Apr 21, 2025

Thank you all. I am understanding the situation better.

        num = num * x + lanczos_num_coeffs[i];
        den = den * x + lanczos_den_coeffs[i];

I confirmed this is the exact line where this happened. When I changed this operation to fma, tgamma always returned the "different" value seen on aarch64 macOS. lgamma seems to have more places like that.

I also confirmed this only happens on aarch64 macOS, not on x86_64 macOS. Probably the compiler default is more aggressive for aarch64.

Rather than changing compiler options, what do you think about making some operations in gamma either explicitly use fma or be written so they are hostile to automatic FMA contraction, whichever is more accurate with respect to the real function? Explicitly using fma would make FMA-supporting architectures always return the same value, and the other group would work as before. Making the code hostile to FMA would make everything agree the other way.
Of course, nothing needs to be done if it is not worth the effort.

@youknowone
Contributor Author

The second part of lgamma is here: https://github.com/python/cpython/blob/main/Modules/mathmodule.c#L516

r += (absx - 0.5) * (log(absx + lanczos_g - 0.5) - 1);

I confirmed that changing this to use fma makes lgamma return the value of the aarch64 macOS build.
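
(Presumably the experiment amounted to a rewrite along these lines; this is a sketch of the idea, not the exact patch:)

/* original form, which the compiler may or may not contract on its own */
r += (absx - 0.5) * (log(absx + lanczos_g - 0.5) - 1);

/* explicitly fused form, reproducing the contracted (aarch64 macOS) result */
r = fma(absx - 0.5, log(absx + lanczos_g - 0.5) - 1.0, r);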

@youknowone
Contributor Author

youknowone commented Apr 21, 2025

tgamma.c

libm tgamma vs py tgamma

double input = -3.8510064710745118;

on my aarch64 macos machine:

$ clang -lm tgamma.c && ./a.out
The result of libm tgamma() is: 0.362222323843281
Bit representation of result: 4600196837208649423
The result of py_tgamma() is: 0.362222323843281
Bit representation of result: 4600196837208649423

on my x86_64 Linux machine:

The result of libm tgamma() is: 0.362222323843281
Bit representation of result: 4600196837208649424
The result of py_tgamma() is: 0.362222323843281
Bit representation of result: 4600196837208649421

on my x86_64 macos machine:

clang -lm tgamma.c && ./a.out
The result of libm tgamma() is: 0.362222323843281
Bit representation of result: 4600196837208649423
The result of py_tgamma() is: 0.362222323843281
Bit representation of result: 4600196837208649421

@tim-one
Member

tim-one commented Apr 22, 2025

As I recall, CPython never uses the platform C's fma(), because it's not always implemented by HW on Python platforms. If the platform C has to implement it in software, then:

  • It's enormously slower than a HW implementation.
  • It's likely to get some overflow end cases wrong.

I don't believe anyone has the time and patience to give to resolving that, so we don't use fma() at all, and write algorithms that don't rely on it to get reasonably accurate results (it's certainly possible to write better & faster algorithms that do rely on it - which is why FMA exists - but people in the high-performance float world don't use HW that doesn't support FMA directly).

So I'm inclined to let this go.

r += (absx - 0.5) * (log(absx + lanczos_g - 0.5) - 1);

If r and the RHS (right-hand side) have opposite signs, that's where FMA can make a huge difference. If the high-order bits cancel out, FMA has access to the low-order bits of the product to shift into the result. In the absence of FMA, the lower-order product bits play no role: every high-order bit lost to cancellation is replaced by shifting a 0 low-order bit into the result.
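
(A contrived C illustration of that effect; compile with -ffp-contract=off so the non-fused version is not itself contracted:)

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* (1 + 2^-30) * (1 - 2^-30) = 1 - 2^-60, which rounds to exactly 1.0 in
       double, so the later addition of -1.0 has no residual left to keep. */
    double a = 1.0 + 0x1p-30, b = 1.0 - 0x1p-30, c = -1.0;
    double product = a * b;          /* rounded product: exactly 1.0 */

    printf("%a\n", product + c);     /* 0x0p+0: the low-order product bits are lost */
    printf("%a\n", fma(a, b, c));    /* -0x1p-60: FMA sees the exact product */
    return 0;
}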

If we absolutely have to "do something" about this, I'd favor a rule that compiler options to suppress FMA be used for compiling CPython. Nothing in our code base relies on it, and we have no internal algorithms that were designed with it in mind. So it's just a potential source of mysterious low-order-bit platform differences we really don't care about.

@skirpichev
Member

skirpichev commented Apr 22, 2025

It's possible that the platforms in question have improved their implementations since then. The kind of testing done in #7030 could be done again to see whether we still "need" to supply our own.

@tim-one, I'm not sure which PR you refer to, but I think the latest attempt to remove the custom code was ~2 years ago: #101678. Probably this wasn't even reported upstream (I've no idea how to do that) and/or they don't care (MacOS/Window$, eh).

libm tgamma vs py tgamma

@youknowone, you now printed only 15 digits instead of 17. That's why your results now show no difference on aarch64 wrt x86_64 (i.e. for the original issue).

we don't use fma() at all [...] Nothing in our code base relies on it

In fact, we do in math.dist(). Though, this can be turned off manually by the UNRELIABLE_FMA define. But Raymond will most likely object to enabling that fallback by default ;-)

(Ah, and the platform's libm fma() exposed as math.fma().)

So it's just a potential source of mysterious low-order-bit platform differences we really don't care about.

If it's not the only possible source of differences, suppressing it doesn't buy us much.

I.e. CPython's tgamma/lgamma code relies on the platform's libm. Unless the same libc (e.g. glibc) is used on all platforms, the OP should still expect that results will not be bit-for-bit identical.

It's relatively easy to find differences even on FOSS systems, especially for non-elementary functions, e.g. erfc(0.1):

$ uname -sm                                                                                                                  
OpenBSD amd64
$ cc a.c -lm && ./a.out 
0.88753708398171516
$ uname -sm
Linux x86_64
$ cc a.c -lm && ./a.out 
0.88753708398171505
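
(The a.c used above isn't shown in the thread; a plausible equivalent is just:)

#include <math.h>
#include <stdio.h>

int main(void)
{
    printf("%.17g\n", erfc(0.1));   /* 17 significant digits expose the last-bit difference */
    return 0;
}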

@youknowone
Contributor Author

youknowone commented Apr 22, 2025

@skirpichev I am sorry for the confusion. Since value correctness is out of my interest, I usually only checked the uint64-transformed binary representations. Please check the last digit of the integer values on the second lines. You will see a few bits of difference on x86. If you want to check float values, I will share them with enough digits.


Thanks to you all, I now feel the comments on this issue contain enough explanation of how things stand, and I am good with it. Please feel free to close the issue if the conclusion is to change nothing. Otherwise I will do my best to help with whatever I can.

@tim-one
Member

tim-one commented Apr 22, 2025

I'm not sure which PR you refer to,

Sorry, it was given in the C comment I pasted, but Github didn't make it clickable in that context. So I copy/pasted it again, but dropped a trailing digit by mistake. I edited the msg to repair that, and here it is again:

gh-70309

In fact, we do [use FMA] in math.dist()

Good catch! More reason to just leave this all alone.

Ah, and the platform's libm fma() exposed as math.fma()

I think that's irrelevant. We supply wrappers for all standard libm functions, and fma() is just another one of those. We're not using it internally, just supplying the platform's implementation to users. Changing that isn't on the table in any universe 😉.

So I would close this issue as "not planned". Heroic efforts to hide low-order bit differences are a rat hole CPython should stay out of. Unless we write our own libm from the ground up, we'll never succeed. And we don't have anywhere near the human resources needed to even contemplate that. And, if we did, users would complain that our libm sometimes gives different results than their platform C libm gives.

The best long-term approach is "benign neglect". Over time, libm authors move closer to correct rounding in ever more cases. For that reason, it was highly arguable whether Python should have taken over the gamma functions to begin with. Consider that if we had not, this issue would never have been opened (if we used the platform implementations, the options used to compile CPython would be irrelevant).

@skirpichev
Member

since value correctness is out of my interest, I usually only checked the uint64-transformed binary representations.

@youknowone, I see. My bad. BTW, next time you could use the %a-formatted representation (without a precision, this shows the exact value in a much more readable form). Or even %.17g; that should be enough for IEEE doubles (aka CPython's floats on most platforms).
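
(For example, with one of the values from this issue; this snippet just illustrates the two formats:)

#include <stdio.h>

int main(void)
{
    double x = 0.36222232384328096;

    printf("%a\n", x);      /* exact hex form: 0x1.72ea68ab26ecdp-2 */
    printf("%.17g\n", x);   /* 17 significant digits round-trip an IEEE double */
    return 0;
}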

It looks like, on aarch64 and for libm's tgamma(), results are within 1 ULP (and the same for libm's tgamma() and CPython's tgamma(); maybe glibc uses fma() explicitly?), while CPython's version is ~3 ULP off. Though, one point is not enough to conclude that the CPython implementation is worse.

#70309

Ah, and that was an attempt ~5 years before mine. As you can see, there is no progress at all.

So I would close this issue as "not planned". Heroic efforts to hide low-order bit differences are a rat hole CPython should stay out of.

I'm closing this on that ground.

Unless we write our own libm from the ground up, we'll never succeed.

I hoped we were slowly moving away from this state (previously CPython had many more workarounds for buggy libm's).

@skirpichev skirpichev closed this as not planned Won't fix, can't repro, duplicate, stale Apr 22, 2025
@skirpichev skirpichev removed the pending The issue will be closed if no feedback is provided label Apr 22, 2025
@tim-one
Member

tim-one commented Apr 22, 2025

Though, one point is not enough to conclude that the CPython implementation is worse.

In the absence of rigorous error analysis, the best that can be done is to compare an implementation to correctly rounded results across at least billions of points. That's very expensive, in effect requiring the emulation of arbitrary-precision float math. That was done at the time, and the comments noted that errors as large as 10 ULP were seen. Not disastrous by any means, but certainly room for improvement. By now, I'm sure some platforms' implementations are better than ours.

I hoped we were slowly moving away from this state (previously CPython had many more workarounds for buggy libm's).

Not just libm! All sorts of C library functions have been problems. Cute: the only reason I started writing sorts for Python was that platform C qsort() implementations were at the root of many segfaults (e.g., many weren't thread-safe or even reentrant, and some even overflowed the C stack due to internal recursion) and quadratic-time inputs. Progress in C libraries is slow, but it is ongoing.

@youknowone
Contributor Author

youknowone commented Apr 23, 2025

It looks like, on aarch64 and for libm's tgamma(), results are within 1 ULP (and the same for libm's tgamma() and CPython's tgamma(); maybe glibc uses fma() explicitly?), while CPython's version is ~3 ULP off. Though, one point is not enough to conclude that the CPython implementation is worse.

I am not opposed to the conclusion itself, but I am opposed to the decision-making flow. While I agree that a single data point is insufficient to conclude that the CPython implementation is inferior, the value is not a worst case; it was just randomly picked. I simply chose the first failing value to illustrate the difference, without focusing on which implementation is more correct (as I have repeatedly emphasized difference, not correctness).

From your response, I understand that the correctness of values appears to influence your decision. If correctness is indeed a factor in your decision-making, it would be problematic to conclude that one implementation is not worse than others based on comparing just a single value.

If you are open to adopting FMA for its potential accuracy benefits, I believe we should conduct a more comprehensive statistical analysis of a properly sampled set of double values, or even the entire double type domain, before determining which approach is more correct. I would be happy to provide the necessary computing resources and machines to support such an analysis.
If correctness is a critical factor, I suggest the order of decision-making has to be reversed: the assessment of correctness should come first, before any decision to reject something based on correctness.

I previously indicated I am fine with closing this issue because I understood it not as a matter of correctness, but of something else, such as lower maintenance cost, betting on future library improvements, or something else I don't know well that isn't about correctness. However, I am uncomfortable with reaching conclusions based on correctness claims without properly measuring and establishing that correctness.

@skirpichev
Member

the value is not a worst case; it was just randomly picked.

For another value CPython's result without fma() seems slightly better:
#132763 (comment)

I believe we should conduct a more comprehensive statistical analysis of a properly sampled set of double values

You can use mpmath to get precise results and compare them with CPython's tgamma() with or without fma(). The libm answer is also interesting (though using libm's tgamma() in CPython does not seem to be an option).

However, I am uncomfortable with reaching conclusions based on correctness claims

It's not just correctness. As Tim noted, fma() may be unavailable in HW, and then this will slow down the gamma functions.

@skirpichev skirpichev self-assigned this Apr 23, 2025
@tim-one
Member

tim-one commented Apr 23, 2025

FMA makes little difference unless a numeric algorithm suffers massive cancellation without it. The algorithms Python uses do not--by design--suffer massive cancellation. FMA wouldn't help them except by accident depending on specific inputs. May help, may hurt, depending. And we don't care about low-order bit differences.

Python could theoretically use entirely different algorithms that exploit FMA, but there's essentially no chance anyone will do that. The code we have was written by a world-class numerical analyst (who is, alas, no longer active in the project), and as has been said several times now, his extensive testing found maximum error of 10 ULP. "Good enough". There's nothing more to be done here with reasonable effort.

Testing numerical functions for accuracy is not easy. Doubles are a 64-bit type, and there are quintillions of them. Exhaustive testing will never be done. What can be done is to have an expert numerical analyst run some (mere!) billions of randomized tests, and "white box" tests based on deep understanding of the algorithm (to know in advance where it's likely to do worst).

That was already done for the code we have. I doubt anyone would volunteer the effort to do it again. I know I won't 😉. For a start, I have scant understanding of the algorithms we're using, and "easy" randomized testing is insufficient. Good testing on a space so large requires bona fide expertise in the algorithm being tested.

@youknowone
Contributor Author

youknowone commented Apr 23, 2025

I did, and do (again), totally agree with the conclusion. I just pointed out that unmeasured correctness doesn't back the conclusion. (Whether another value is better or worse, another single value still shouldn't back the conclusion.)

Now I have learned another point: the algorithm is already carefully designed not to rely on fma. I didn't know that. That helps a lot in understanding the conservative view of those algorithms. I also realized I should turn off fma for my ported code too. (Well tested beats general expectation.)

Thank you so much, and sorry for bothering you about it again.

@skirpichev
Member

Here are quick tests on the CPython data points for tgamma(). The second and fourth columns measure the difference wrt the value computed by mpmath; in the third column I use the expected value from CPython's tests (btw, they look less precise).

Without fma():

                        x    mpmath diff (ULP)  expected diff (ULP)      libm diff (ULP)
----------------------------------------------------------------------------------------
                        1                  0.0                  0.0                  0.0
                        2                  0.0                  0.0                  0.0
                        3                  0.0                  0.0                  0.0
[...]
      -63.349078729022985                  1.0                  1.0                  1.0
      -127.45117632943295                  0.0                  0.0                  0.0
----------------------------------------------------------------------------------------
                 max diff                  4.0                  8.0                  2.0
                mean diff                  0.5                  0.8                  0.3

With fma() (see patch), this affects the second column:

                        x    mpmath diff (ULP)  expected diff (ULP)      libm diff (ULP)
----------------------------------------------------------------------------------------
                        1                  0.0                  0.0                  0.0
                        2                  0.0                  0.0                  0.0
                        3                  0.0                  0.0                  0.0
[...]
      -63.349078729022985                  1.0                  1.0                  1.0
      -127.45117632943295                  0.0                  0.0                  0.0
----------------------------------------------------------------------------------------
                 max diff                  3.0                  8.0                  2.0
                mean diff                  0.4                  0.7                  0.3

As you can see, on my system libm's code seems better on the given set. The effect of fma() seems less impressive (max 4 ULP vs 3 ULP).

Data:

$ cat Lib/test/mathdata/math_testcases.txt|grep ^gam|tee gamma.txt

Script:

def parse_mtestfile(fname):
    with open(fname, encoding="utf-8") as fp:
        for line in fp:
            # strip comments, and skip blank lines
            if '--' in line:
                line = line[:line.index('--')]
            if not line.strip():
                continue

            lhs, rhs = line.split('->')
            id, fn, arg = lhs.split()
            rhs_pieces = rhs.split()
            exp = rhs_pieces[0]
            flags = rhs_pieces[1:]

            yield (id, fn, float(arg), float(exp), flags)

import math
import ctypes
import mpmath
import statistics

mpmath_diff = []
libm_diff = []
expected_diff = []

libm = ctypes.CDLL('libm.so.6')
libm.tgamma.argtypes = [ctypes.c_double]
libm.tgamma.restype = ctypes.c_double

print(f'{"x":>25} {"mpmath diff (ULP)":>20} {"expected diff (ULP)":>20} '
      f'{"libm diff (ULP)":>20}')
print('-'*(25+1+20+1+20+1+20))

for id, fn, arg, expected, flags in parse_mtestfile('gamma.txt'):
    if flags or not math.isfinite(expected):
        continue
    mp_arg = mpmath.mpf(arg)
    with mpmath.mp.workprec(100):
        mp_expected = mpmath.mp.gamma(mp_arg)
    mp_expected = +mp_expected
    with mpmath.mp.workprec(1000):
        mp_expected2 = mpmath.mp.gamma(mp_arg)
    mp_expected2 = +mp_expected2
    assert mp_expected == mp_expected2

    ulp_mp = math.ulp(mp_expected)
    ulp_expected = math.ulp(expected)

    mpmath_diff.append(float(abs(mp_expected - math.gamma(arg))/ulp_mp))
    expected_diff.append(float(abs(expected - math.gamma(arg))/ulp_expected))
    libm_diff.append(float(abs(mp_expected - libm.tgamma(arg))/ulp_mp))

    print(f'{arg:>25.17g} {mpmath_diff[-1]:>20.1f} '
          f'{expected_diff[-1]:>20.1f} {libm_diff[-1]:>20.1f}')

print('-'*(25+1+20+1+20+1+20))
print(f'{"max diff":>25} {max(mpmath_diff):>20.1f} {max(expected_diff):>20.1f} '
      f'{max(libm_diff):>20.1f}')
print(f'{"mean diff":>25} {statistics.mean(mpmath_diff):>20.1f} '
      f'{statistics.mean(expected_diff):>20.1f} {statistics.mean(libm_diff):>20.1f}')

Patch:

diff --git a/Modules/mathmodule.c b/Modules/mathmodule.c
index 11d9b7418a..58363f1ac4 100644
--- a/Modules/mathmodule.c
+++ b/Modules/mathmodule.c
@@ -363,8 +363,8 @@ lanczos_sum(double x)
        this resulted in lower accuracy. */
     if (x < 5.0) {
         for (i = LANCZOS_N; --i >= 0; ) {
-            num = num * x + lanczos_num_coeffs[i];
-            den = den * x + lanczos_den_coeffs[i];
+            num = fma(num, x, lanczos_num_coeffs[i]);
+            den = fma(den, x, lanczos_den_coeffs[i]);
         }
     }
     else {

@tim-one
Member

tim-one commented Apr 23, 2025

Did I mention "rat hole"? 😉

A "proper" way to test is not to take some library's results as the expected results. The expected results have infinite precision - the "true answers". Then various library implementation results are measured against those.

Of course we don't have infinite precision, not even with mpmath. So we approximate. In outline, testing function f() on argument arg:

  1. Compute mpmath's f(arg) with extra precision; Keep the extra precision in this result. Call this expected.
  2. Compute the library's f(arg). Call the result got.
  3. An excellent approximation to the true ULP error is then:
with mpmath.extraprec(same extra precision used to compute `expected`):
    diff = (got - expected) / math.ulp(got)
diff = float(diff) # round back to native float

The magnitude of diff can absolutely be < 1, and that's important. An algorithm with max error < 0.51 ULP is generally of much higher quality than one with max error < 0.9 ULP (in general, 0.5 ULP is the best that can possibly be done in worst cases). Keeping the sign can also be important, to analyze whether the algorithm is biased in the "too high" or "too low" direction - but that's another rat hole, and of less importance than the distribution of error magnitudes.

If the expected value is only kept with native precision, the computed ULP differences are always exact integers. That vastly reduces the value of testing.

Example:

>>> import math, mpmath
>>> arg = math.pi / 2
>>> got = math.tan(arg)
>>> got
1.633123935319537e+16
>>> with mpmath.extraprec(100):
...     expected = mpmath.tan(arg)
...     diff = (got - expected) / math.ulp(got)
...
>>> diff
mpf('0.12201613147923554')
>>> float(diff)
0.12201613147923554

So Python's result is about 0.12 ULP larger than the infinitely precise result. That's the best possible result in native precision (because it's < 0.5 ULP, it's the correctly rounded result).

Note that, as @youknowone discovered, FMA can also make a difference in our:

r += (absx - 0.5) * (log(absx + lanczos_g - 0.5) - 1);

@tim-one
Member

tim-one commented Apr 24, 2025

I'll include the guts of what I consider to be "good" randomized testing. It's crude but effective. A pair of adjacent output lines:

ulp1 count1
ulp2 count2

says that there were count1 cases seen with a ULP error in the half-open interval [ulp1, ulp2). "Correctly rounded" cases are limited to the counts on the -0.5 and 0.0 lines. All others are not correctly rounded results.

For Python's gamma, this run (the results depend on the random number generator's initial state) showed that "correctly rounded" is not the most likely outcome, but it's close. Errors seen were in the range -7.5 ULP to 7.5 ULP. About 70% were within 1 ULP (sum of the -1.0, -0.5, 0.0, and 0.5 bin counts), which is in "excellent" territory for portable and reasonably fast transcendental functions even much simpler to implement than gamma.

If there are several libraries you want to compare, I urge not trying to do them all at once. Do something like this for each one on its own, and save the bin counts (e.g., pickle). Of course then you also want to force the random number generator to start with the same seed.

Boosting ULP_SCALE will give a finer-grained breakdown, and then it can also be interesting to plot the bin counts. Lack of symmetry around the center is a possible red flag. Superimposing plots from two libraries of different quality usually makes it obvious, by eyeball, which is better.

Details

import math, mpmath
from collections import defaultdict
import random
from random import random
from math import ulp, floor

ULP_SCALE = 2.0
bins = defaultdict(int)

ref = mpmath.gamma
lib = math.gamma
##    def lib(arg, base=ref):
##        with mpmath.workprec(53):
##            return base(arg)

count = 1
while True:
    arg = random() * 170.0
    with mpmath.extraprec(30):
        expected = ref(arg)
        got = lib(arg)
        diff = (got - expected) / ulp(got)
    diff = floor(float(diff) * ULP_SCALE)
    bins[diff] += 1
    if not count & 0xfffff:
        print(format(count, '_'))
        for k, v in sorted(bins.items()):
            print(k / ULP_SCALE, v)
    count += 1

Output from the first block of one run:

-7.5 1
-7.0 1
-6.5 1
-6.0 16
-5.5 52
-5.0 120
-4.5 363
-4.0 1067
-3.5 3000
-3.0 7944
-2.5 19793
-2.0 45734
-1.5 91979
-1.0 154540
-0.5 202348
0.0 202701
0.5 153829
1.0 91247
1.5 44370
2.0 18585
2.5 7060
3.0 2581
3.5 867
4.0 265
4.5 78
5.0 25
5.5 6
6.0 2
7.0 1

@tim-one
Member

tim-one commented Apr 24, 2025

Interesting: uncomment the redefinition of lib() in my code, and change the working precision argument from 53 to 43. I believe mpmath magically adds 10 to the working precision for the duration of computing most fancy functions. That is, I'm trying to trick it into using just 53 bits while evaluating gamma.

The results are horrid, hundreds of ULPs wrong. Which is fine for mpmath: it uses algorithms that aren't super-careful, relying on the systematic use of extra precision in simpler code to absorb most rounding errors.

If you leave the 53 as 53, the results are stellar. From one run:

1_048_576
-1.0 46
-0.5 524415
0.0 524115

So out of the 2**20 inputs, all results were correctly rounded except for 46, and those were within 1 ULP. However, there is a slight bias in that all of those were "too small". That's often a symptom of summing a power series with positive terms and cutting it off "too early". But hundreds of ULPs of error is consistent with the fact that 2**10 is about 1000, and the algorithm probably knows it doesn't really care about the last 10 bits (since it added 10 bits to absorb predictable errors).

@skirpichev
Member

A "proper" way to test is not to take some library's results as the expected results.

No, it was taken not for the results, but for a set of data points from the gamma() domain. Probably, it's representative enough to take into account specific behavior of the given function (big/small arguments, nearby poles, etc). (Something you miss with just scaled random() ;-))

Expected results (from the data set) are checked too, but just for completeness.

with mpmath.extraprec(same extra precision used to compute expected):
diff = (got - expected) / math.ulp(got)

Good point (both for sign and keeping extra precision).

BTW, why do you take the ulp of got, not expected?

r += (absx - 0.5) * (log(absx + lanczos_g - 0.5) - 1);

Yes, but I did no tests for lgamma().

If there are several libraries you want to compare, I urge not trying to do them all at once.

Yes, this is not a good idea in general. But my point was to argue that using libm's version seems to be a better option than trying to "improve" CPython's algorithm with explicit fma() calls. (Though it's not a real option as long as there are CPython platforms where libm's version is off by billions of ULPs.)

BTW, here are results without/with fma() for a slightly adapted version of your script (include negative inputs and use autoprec):

>>> # with seed=1 & count=0xfffff
>>> [...]
>>> for k in sorted(set(bins_ref)|set(bins_patch)|set(bins_libm)):
...     v1 = bins_ref[k]  
...     v2 = bins_patch[k]
...     v3 = bins_libm[k]
...     print(f"{k/ULP_SCALE:+3} {v1:7} {v1/count:6.2%} "
...           f"{v2:7} {v2/count:6.2%} {v3:7} {v3/count:6.2%}")
...           
-7.0       6  0.00%       5  0.00%       0  0.00%
-6.5      13  0.00%      12  0.00%       0  0.00%
-6.0      38  0.00%      32  0.00%       1  0.00%
-5.5     114  0.01%     105  0.01%       6  0.00%
-5.0     293  0.03%     259  0.02%      36  0.00%
-4.5     768  0.07%     718  0.07%     145  0.01%
-4.0    1971  0.19%    1871  0.18%     453  0.04%
-3.5    4812  0.46%    4619  0.44%    1769  0.17%
-3.0   11521  1.10%   11240  1.07%    5911  0.56%
-2.5   25373  2.42%   25050  2.39%   18148  1.73%
-2.0   51834  4.94%   51586  4.92%   47917  4.57%
-1.5   95846  9.14%   95735  9.13%  101661  9.70%
-1.0  147871 14.10%  148090 14.12%  166747 15.90%
-0.5  186481 17.78%  187028 17.84%  211659 20.19%
+0.0  186207 17.76%  186613 17.80%  203320 19.39%
+0.5  147213 14.04%  147560 14.07%  148221 14.14%
+1.0   94756  9.04%   94953  9.06%   84277  8.04%
+1.5   50741  4.84%   50668  4.83%   37419  3.57%
+2.0   24463  2.33%   24378  2.32%   14023  1.34%
+2.5   10770  1.03%   10705  1.02%    4745  0.45%
+3.0    4556  0.43%    4511  0.43%    1497  0.14%
+3.5    1853  0.18%    1801  0.17%     465  0.04%
+4.0     698  0.07%     683  0.07%     120  0.01%
+4.5     257  0.02%     242  0.02%      29  0.00%
+5.0      83  0.01%      78  0.01%       6  0.00%
+5.5      25  0.00%      23  0.00%       0  0.00%
+6.0      10  0.00%       8  0.00%       0  0.00%
+6.5       1  0.00%       1  0.00%       0  0.00%
+7.0       1  0.00%       1  0.00%       0  0.00%

Again, almost no effect of fma(), but libm's version looks noticeably better.

I believe mpmath magically adds 10 to the working precision for the duration of computing most fancy functions.

No, the working precision is increased in a much trickier way (see mpf_gamma; it's called with the current context precision & rounding).

# a.py
import math, mpmath
from collections import defaultdict
from random import random, seed
from math import ulp, floor
from sys import argv
from pickle import dump

ULP_SCALE = 2.0
bins = defaultdict(int)

ref = mpmath.gamma
lib = math.gamma

seed(argv[1])
count = 1
while count & 0xfffff:
    arg = 2*(random()-0.5) * 170.0
    if arg <= 0 and arg.is_integer():
        continue  # pole
    with mpmath.extraprec(100):
        expected = mpmath.autoprec(ref)(arg)
        got = lib(arg)
        diff = (got - expected) / ulp(got)
    diff = floor(float(diff) * ULP_SCALE)
    bins[diff] += 1
    count += 1

with open(argv[2], 'wb') as f:
    dump(bins, f)
# b.py
import math, mpmath, ctypes
from collections import defaultdict
from random import random, seed
from math import ulp, floor
from sys import argv
from pickle import dump

ULP_SCALE = 2.0
bins = defaultdict(int)

ref = mpmath.gamma
libm = ctypes.CDLL('libm.so.6')
libm.tgamma.argtypes = [ctypes.c_double]
libm.tgamma.restype = ctypes.c_double
lib = libm.tgamma

seed(argv[1])
count = 1
while count & 0xfffff:
    arg = 2*(random()-0.5) * 170.0
    if arg <= 0 and arg.is_integer():
        continue  # pole
    with mpmath.extraprec(100):
        expected = mpmath.autoprec(ref)(arg)
        got = lib(arg)
        diff = (got - expected) / ulp(got)
    diff = floor(float(diff) * ULP_SCALE)
    bins[diff] += 1
    count += 1

with open(argv[2], 'wb') as f:
    dump(bins, f)

@tim-one
Member

tim-one commented Apr 24, 2025

Probably, it's representative enough to take into account specific behavior of the given function (big/small arguments, nearby poles, etc). (Something you miss with just scaled random() ;-))

Right, I was talking about "good randomized testing". You're also moving toward "white box" testing, which is also essential, but beyond what I was talking about.

BTW, why do you take the ulp of got, not expected?

The true expected result has conceptually infinite precision. The concept of "ULP" makes no sense for it. If you round it back to native precision, then ULP is defined for it, but as already noted the computed ULP difference will always be an exact integer.

The only thing the user sees is got, and ULP is clearly defined for it. If this framework reports, e.g., a 1.37 ULP error, then the infinitely precise result is very close to (with infinite precision) got - 1.37 * ulp(got). It's answering "how can I change got to get the infinitely precise result?" using got as the only input.

and use autoprec)

Cool! I never used that. I'll start to now 🥰. Thanks!

Again, almost no effect of fma(), but libm's version looks noticeably better.

FMA is systematically better in every respect except for max error, but very slightly so - not enough so to be worth any effort to pursue.

And, yes, that platform's gamma is clearly better, and quite significantly so on the "max error" measure. But still lots of room for improvement.

@skirpichev
Member

If you round it back to native precision, then ULP is defined for it, but as already noted the computed ULP difference will always be an exact integer.

No, I'm not talking about rounding back the expected value, but about using ulp(expected) vs ulp(got). Though, now I see that's not important, unless got is complete garbage.

I'll start to now

autoprec is just another heuristic. So, probably, you will be disappointed.

But still lots of room for improvement.

Yes, this seems to be the worst 1-arg libm function on my system. Along with erfc and... cbrt.

@skirpichev skirpichev removed their assignment Apr 26, 2025
@tim-one
Member

tim-one commented Apr 26, 2025

No, I'm not talking about rounding back the expected value, but about using ulp(expected) vs ulp(got).

Sorry, but then I have no idea what you have in mind. mpmath doesn't offer a ulp function. Python's math.ulp(some_mpf_value) ends up rounding the mpf value back to native precision first. So it loses all information about the correct bits beyond native precision. It's of very little real use. ulp(got) is exact and loses no relevant information.

Yes, this seems to be the worst 1-arg libm function on my system. Along with erfc and... cbrt.

Ya, gamma is notoriously hard to compute. It grows faster than exp, and its derivatives are also messy. Approximations based on polynomials struggle as a result. I'm surprised, though, to hear it has a sloppy cube root! That one shouldn't be particularly challenging.

@tim-one
Member

tim-one commented Apr 26, 2025

Ah. I expect you have in mind keeping expected in extended precision, but are fine with math.ulp(expected) rounding it back to native precision for the purpose of defining ULP.

The problem with that is that the user has no idea what expected is. So if float(expected) is in a different binade than got, they have no idea how large the ULP in "+1.37 ULP" is. But they (can) know its exact value if the unit is ulp(got).

@tim-one
Member

tim-one commented Apr 26, 2025

Let me illustrate with an example using 2-digit decimal floats:

got = 0.99
expected = 0.995 (the infinitely precise result)

The absolute difference is 0.005. That's 1/2 ULP of got. But math.ulp(0.995) would first round expected back to 2 digits, same as math.ulp(1.0). A ULP of that is 0.1, and then 0.005 is 1/20 ULP. It's nearly senseless to report that, but it's clear what 1/2 ULP means for that example.

EDIT: Or leave rounding out of it: if the infinitely precise result is 1.0, then the difference of 0.01 is 1 ULP of got but only 0.1 ULP of expected.
