JIT tracing - thundering herd causes lockup after opcache reset #11609
Comments
@maaarghk thanks for the report. This should already be partially fixed by 5b033b0. @iluuu1994, your fix may be significantly improved.
@maaarghk disabling opcache is too conservative a solution; I recommend disabling JIT instead.
@dstogov Ah neat, I didn't notice it. Is the appropriate thing to add a bool to
I think - yes, but let @iluuu1994 fix this. (I'm on vacation for the next two weeks)
@dstogov I'll have a look. Enjoy your vacation!
Use atomic to prevent resetting handlers from multiple processes or threads. Fixes phpGH-11609
@maaarghk I just want to confirm whether we are seeing a similar issue in production: do you also have php-fpm logging that it's starting children that exit almost immediately, seemingly in a loop? Otherwise I'll have to make a separate bug report for this, but it started happening since we switched over to php 8.2.
believe it or not, I did the above debugging on a server where the hosting co doesn't give me an easy way to access basic things like the fpm error log - they had to generate the core dumps and strace for me - so I can't say conclusively. But it doesn't sound like it. The child processes here are not broken as such, just locked up doing this memory dance. Afaict they did not die afterwards, just went on to process the request normally. I also don't think they used enough memory to be killed by OOM. But - there was a supervisor process which would detect FPM not responding to requests and try to restart it - so maybe that's causing your logs to look like that? Maybe post an excerpt.
Hmm, that doesn't really sound like the same issue then. I'll create a separate issue.
@maaarghk Any luck checking if the newest version of PHP already mitigates this to some degree?
Hey @iluuu1994 - because I don't have root on the web server, I rely on a support team to intervene if I upgrade to the latest version and find the issue still causes downtime. Combined with the fact that I only see this issue consistently on a production site, I'm not really in a position to experiment with it when I know that having JIT turned off keeps the site stable. Sorry I can't be of more use.
@iluuu1994 We decided to just YOLO it and re-enable JIT on production after upgrading to 8.2.8, and it does appear to have mitigated the issue. I can't see an unusual number of failed requests since yesterday. It's hard to say conclusively, since it was exacerbated by traffic (we were only really in a place bad enough to notice if every single fpm worker got caught in this loop whilst at max_children), but unless other reports come in from busier sites, it seems like your improvement can safely be targeted at 8.3. I'll follow up in a week or so.
@maaarghk Great, thank you!
Avoid stopping counters repeatedly from different threads/processes. Fixes phpGH-11609
Avoid stopping counters repeatedly from different threads/processes. Fixes phpGH-11609 Closes phpGH-11806
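(For readers skimming: the commit messages above boil down to "only stop the handlers once". A minimal sketch of that idea, using C11 atomics rather than PHP's own zend_atomic wrappers - the flag name and the `walk_script_hash_and_restore_vm_handlers()` helper are illustrative assumptions, not the actual patch:)

```c
#include <stdatomic.h>
#include <stdbool.h>

/* illustrative stand-in for the nested loops discussed in the description below */
static void walk_script_hash_and_restore_vm_handlers(void);

/* assumed guard flag; in practice it has to live somewhere every worker can
 * see and be cleared again when opcache is reset */
static atomic_bool counter_handlers_stopped;

static void stop_counter_handlers_once(void)
{
	/* only the first caller that flips the flag pays for the full walk;
	 * every other worker racing in here returns immediately instead of
	 * queueing on the SHM fcntl lock thousands of times */
	if (atomic_exchange(&counter_handlers_stopped, true)) {
		return;
	}
	walk_script_hash_and_restore_vm_handlers();
}
```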
Description
we've spotted some bad behaviour around the locking of the shared opcache memory on an 8.2 site (php-fpm) and unfortunately we can't reproduce it locally, only on production, so it's going to be difficult to get much more info than this - we've just left opcache disabled for now to avoid downtime in prod. i realise that there's already been a big ~~flamewar~~ chat about whether or not the jit code is actually going to see any fixes and maybe we should just disable it, but i don't see any reason not to report a known issue in the meantime (it may be that a PR just needs reversed out).

after calling opcache_reset after a deploy, we occasionally end up with a situation where all php-fpm worker processes end up in a loop of locking and unlocking the SHM. the strace output looks like this:
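(the original capture isn't reproduced here; as a rough illustration of the pattern - back-to-back fcntl() lock/unlock pairs on the SHM lock file, with the occasional long wait - the fd, offsets and timings below are made up:)

```
fcntl(4, F_SETLKW, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=1}) = 0 <0.000021>
fcntl(4, F_SETLKW, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=1}) = 0 <0.000012>
fcntl(4, F_SETLKW, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=1}) = 0 <0.612408>
fcntl(4, F_SETLKW, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=1}) = 0 <0.000014>
```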
after this occurred I asked our hosting company to SIGABRT all the php-fpm worker processes, to get coredumps. these aren't debug builds but we do have symbols and i've been able to draw some conclusions.
99% of the processes have a backtrace like this:
with one like this:
I've taken a look in the core code and this does seem slightly wild:
within zend_jit_trace_hot_root:
zend_jit_stop_counter_handlers is a for loop in a for loop:
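(paraphrasing the shape here rather than pasting the exact source - it's roughly this, reached from zend_jit_trace_hot_root once ZEND_JIT_TRACE_NUM >= JIT_G(max_root_traces) is hit:)

```c
/* rough paraphrase, not the literal php-src code */
static void zend_jit_stop_counter_handlers(void)
{
	uint32_t i;

	for (i = 0; i < ZCSG(hash).max_num_entries; i++) {      /* outer loop: every hash slot */
		zend_accel_hash_entry *cache_entry;

		for (cache_entry = ZCSG(hash).hash_table[i];
		     cache_entry != NULL;
		     cache_entry = cache_entry->next) {             /* inner loop: the chain of entries */
			zend_persistent_script *script;

			if (cache_entry->indirect) {
				continue;                                   /* alias entries point at another entry's data */
			}
			script = (zend_persistent_script *)cache_entry->data;
			zend_jit_stop_persistent_script(script);        /* see below */
		}
	}
}
```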
i'm certainly no expert (i.e. i've got no idea whatsoever) regarding the contents of ZCSG(hash), but it appears that if a loop is being done over max_num_entries AND cache_entry->next is being used to traverse a list, then there's some kind of worse-than-linear complexity thing going on here, where a cache_entry could be visited multiple times if it is pointed to by the next of some previous cache_entry?

anyway, the main issue: within this loop, zend_jit_stop_persistent_script is called -
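(again a paraphrase of the structure rather than the literal source:)

```c
static void zend_jit_stop_persistent_script(zend_persistent_script *script)
{
	zend_class_entry *ce;
	zend_op_array *op_array;

	/* the script's main op array ... */
	zend_jit_stop_persistent_op_array(&script->script.main_op_array);

	/* ... every top-level function ... */
	ZEND_HASH_FOREACH_PTR(&script->script.function_table, op_array) {
		zend_jit_stop_persistent_op_array(op_array);
	} ZEND_HASH_FOREACH_END();

	/* ... and every user-defined method of every class in the file */
	ZEND_HASH_FOREACH_PTR(&script->script.class_table, ce) {
		ZEND_HASH_FOREACH_PTR(&ce->function_table, op_array) {
			if (op_array->type == ZEND_USER_FUNCTION) {
				zend_jit_stop_persistent_op_array(op_array);
			}
		} ZEND_HASH_FOREACH_END();
	} ZEND_HASH_FOREACH_END();
}
```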
so here we have the main op array being stopped, followed by the function table array being stopped, followed by the class methods table being stopped(?). zend_jit_stop_persistent_op_array more or less wraps zend_jit_stop_hot_trace_counters, which locks the SHM, updates the function pointer back from jit to vm, and then unlocks the SHM.
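(a minimal sketch of that per-op-array step, assuming a `saved_vm_handlers_for()` helper that doesn't exist under that name in php-src - the real code keeps the original handlers in the op array's JIT trace extension:)

```c
static void zend_jit_stop_hot_trace_counters(zend_op_array *op_array)
{
	/* `saved` stands in for the JIT trace extension that remembers the
	 * original VM handlers; the real data structure differs */
	const void *const *saved = saved_vm_handlers_for(op_array);  /* hypothetical helper */
	uint32_t i;

	zend_shared_alloc_lock();       /* one fcntl(F_SETLKW) on the SHM lock file */
	for (i = 0; i < op_array->last; i++) {
		/* point each opline back at the plain VM handler instead of the
		 * JIT counting handler */
		op_array->opcodes[i].handler = saved[i];
	}
	zend_shared_alloc_unlock();     /* and another fcntl() to release - per op array */
}
```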
on our box with 100 or so worker processes each having entered an up-to-quadruple-nested foreach loop, with the absolute innermost loop of each calling unlock / lock, potentially over a file which has many methods and functions, the end result is absolute chaos. the loop tends quickly towards taking minutes to complete as the number of setlkws queued up reaches thousands. in the short strace snippet posted above you can see a 0.6s wait, for example. imagine that, per function?

notably, we seemed to trigger this whenever we flushed opcache. so possibly we're seeing some kind of thundering herd thing here. i see the ZEND_JIT_TRACE_NUM >= JIT_G(max_root_traces) condition being checked; maybe there is a situation where several threads at the same time add a script to the jit table and then they all try to reverse out their own work as they fill it up? I am really not familiar with jit even as a concept so I'm just guessing based on the fact we were at max worker processes and under load whenever we saw this issue.

this nested foreach thing was introduced here: #9343 and an alternative solution was proposed here: #9964
@dstogov - the first was accepted but the second not, with the second having the comment that it might cause "too many locks"? maybe worth reconsidering both of the above together with this issue report to find some better solution, assuming that improvements and bugfixes are currently permitted on the jit code.
ps: I don't believe this is a duplicate of any other issue that I can find; unlike in many other issues we do not have segfaults, just persistent lock waits.
PHP Version
8.2.5 - 7 / amd64 / linux / fpm sapi