
JIT tracing - thundering herd causes lockup after opcache reset #11609

Closed

maaarghk opened this issue Jul 7, 2023 · 12 comments

Comments

@maaarghk
Contributor

maaarghk commented Jul 7, 2023

Description

We've spotted some bad behaviour around locking of the shared opcache memory on an 8.2 site (php-fpm). Unfortunately we can't reproduce it locally, only in production, so it's going to be difficult to get much more info than this - we've just left opcache disabled for now to avoid downtime in prod. I realise there has already been a big flamewar about whether the JIT code is actually going to see any fixes, and maybe we should just disable it, but I don't see any reason not to report a known issue in the meantime (it may be that a PR just needs to be reverted).

After calling opcache_reset() following a deploy, we occasionally end up in a situation where all php-fpm worker processes get stuck in a loop of locking and unlocking the SHM. The strace output looks like this:

20:53:12 fcntl(3</tmp/.ZendSem.Le8Hvi (deleted)>, F_SETLKW, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=1}) = 0 <0.008294>
20:53:12 fcntl(3</tmp/.ZendSem.Le8Hvi (deleted)>, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=1}) = 0 <0.000064>
20:53:12 fcntl(3</tmp/.ZendSem.Le8Hvi (deleted)>, F_SETLKW, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=1}) = 0 <0.681358>
20:53:13 fcntl(3</tmp/.ZendSem.Le8Hvi (deleted)>, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=1}) = 0 <0.000054>
20:53:13 fcntl(3</tmp/.ZendSem.Le8Hvi (deleted)>, F_SETLKW, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=1}) = 0 <0.050460>
20:53:13 fcntl(3</tmp/.ZendSem.Le8Hvi (deleted)>, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=1}) = 0 <0.000421>
20:53:13 fcntl(3</tmp/.ZendSem.Le8Hvi (deleted)>, F_SETLKW, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=1}) = 0 <0.012670>
20:53:13 fcntl(3</tmp/.ZendSem.Le8Hvi (deleted)>, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=1}) = 0 <0.000067>
20:53:13 fcntl(3</tmp/.ZendSem.Le8Hvi (deleted)>, F_SETLKW, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=1}) = 0 <0.037775>
20:53:13 fcntl(3</tmp/.ZendSem.Le8Hvi (deleted)>, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=1}) = 0 <0.000056>
20:53:13 fcntl(3</tmp/.ZendSem.Le8Hvi (deleted)>, F_SETLKW, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=1}) = 0 <0.004498>
20:53:13 fcntl(3</tmp/.ZendSem.Le8Hvi (deleted)>, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=1}) = 0 <0.000085>
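
For anyone not familiar with these locks: the pattern above is just a blocking byte-range write lock followed by an immediate unlock on opcache's (unlinked) /tmp/.ZendSem.* file. Here's a minimal standalone reproduction of that fcntl dance, purely to illustrate the syscalls - the temp file name is made up and this is obviously not opcache code:

/* standalone illustration of the lock/unlock pattern in the strace above;
 * not opcache code - just the same fcntl calls on a throwaway temp file */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	char path[] = "/tmp/.ZendSem.demoXXXXXX";
	int fd = mkstemp(path);
	if (fd < 0) { perror("mkstemp"); return 1; }
	unlink(path); /* opcache unlinks its lock file too, hence "(deleted)" in strace */

	struct flock lk = {
		.l_type = F_WRLCK, .l_whence = SEEK_SET, .l_start = 0, .l_len = 1
	};

	/* F_SETLKW: block until we hold the write lock on byte 0 */
	if (fcntl(fd, F_SETLKW, &lk) < 0) { perror("lock"); return 1; }

	/* ...critical section: this is where opcache would touch the SHM... */

	lk.l_type = F_UNLCK;
	/* F_SETLK: release immediately */
	if (fcntl(fd, F_SETLK, &lk) < 0) { perror("unlock"); return 1; }

	close(fd);
	return 0;
}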

After this occurred, I asked our hosting company to SIGABRT all the php-fpm worker processes to get coredumps. These aren't debug builds, but we do have symbols and I've been able to draw some conclusions.

99% of the processes have a backtrace like this:

(gdb) bt
#0  0x00007f0f7a90a974 in fcntl64 () from libc.so.6
#1  0x00007f0f79451129 in zend_shared_alloc_lock () from opcache.so
#2  0x00007f0f79458a98 in zend_jit_stop_persistent_op_array () from opcache.so
#3  0x00007f0f794df510 in zend_jit_trace_hot_root.part () from opcache.so
#4  0x000055fb59200647 in ?? () < note - seems to be inside jit memory map
#5  0x00007f0f6f117b30 in ?? ()
#6  0x000055fb4d269898 in ?? () < note - seems to be inside opcache memory map

and one process with a backtrace like this:

#0  0x00007f0f7a90f510 in ?? () from libc.so.6
#1  0x00007f0f7a90a944 in fcntl64 () from libc.so.6
#2  0x00007f0f7945119e in zend_shared_alloc_unlock () from opcache.so
#3  0x00007f0f794df5f0 in zend_jit_trace_hot_root.part () from opcache.so
#4  0x000055fb59200647 in ?? ()
#5  0x0000000000000001 in ?? ()
#6  0x000055fb53bed058 in ?? ()

I've taken a look in the core code and this does seem slightly wild:

within zend_jit_trace_hot_root:

	if (ZEND_JIT_TRACE_NUM >= JIT_G(max_root_traces)) {
		stop = ZEND_JIT_TRACE_STOP_TOO_MANY_TRACES;
		zend_jit_stop_counter_handlers();
		goto abort;
	}

zend_jit_stop_counter_handlers is a for loop in a for loop:

	for (uint32_t i = 0; i < ZCSG(hash).max_num_entries; i++) {
		zend_accel_hash_entry *cache_entry;
		for (cache_entry = ZCSG(hash).hash_table[i]; cache_entry; cache_entry = cache_entry->next) {
			zend_persistent_script *script;
			if (cache_entry->indirect) continue;
			script = (zend_persistent_script *)cache_entry->data;
			zend_jit_stop_persistent_script(script);
		}
	}

I'm certainly no expert (i.e. I've got no idea whatsoever) regarding the contents of ZCSG(hash), but if the outer loop runs over max_num_entries AND cache_entry->next is used to traverse a chain, it looks like there could be some redundant work here - could a cache_entry be visited more than once if it's pointed to by the next of some earlier cache_entry?

Anyway, the main issue: within this loop, zend_jit_stop_persistent_script is called -

static void zend_jit_stop_persistent_script(zend_persistent_script *script)
{
	zend_class_entry *ce;
	zend_op_array *op_array;

	zend_jit_stop_persistent_op_array(&script->script.main_op_array);

	ZEND_HASH_FOREACH_PTR(&script->script.function_table, op_array) {
		zend_jit_stop_persistent_op_array(op_array);
	} ZEND_HASH_FOREACH_END();

	ZEND_HASH_FOREACH_PTR(&script->script.class_table, ce) {
		ZEND_HASH_FOREACH_PTR(&ce->function_table, op_array) {
			if (op_array->type == ZEND_USER_FUNCTION) {
				zend_jit_stop_persistent_op_array(op_array);
			}
		} ZEND_HASH_FOREACH_END();
	} ZEND_HASH_FOREACH_END();
}

So here we have the main op array being stopped, followed by every entry in the script's function table, followed by every user-defined method in the class table. zend_jit_stop_persistent_op_array more or less wraps zend_jit_stop_hot_trace_counters, which locks the SHM, updates the function pointers back from JIT to VM, and then unlocks the SHM.

On our box, with 100 or so worker processes each entering this up-to-quadruply-nested foreach, and the innermost level of each doing a lock/unlock per op_array - potentially over files with many methods and functions - the end result is absolute chaos. The loop quickly tends towards taking minutes to complete as the number of queued F_SETLKW calls reaches the thousands. In the short strace snippet above you can see a 0.6s wait for a single lock, for example. Imagine that per function.
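
To put rough numbers on it (the figures below are invented for illustration, not measured - I don't know exactly how many op_arrays our cache holds):

/* back-of-envelope for the stampede after one opcache_reset();
 * the inputs are made-up round numbers, not measured values */
#include <stdio.h>

int main(void)
{
	long workers   = 100;   /* roughly pm.max_children in our pool */
	long op_arrays = 2000;  /* main op_arrays + functions + methods in the cache (a guess) */
	long fcntls    = 2;     /* one F_SETLKW + one F_SETLK per op_array */

	/* every worker that hits the max_root_traces branch repeats the full walk */
	printf("%ld fcntl round trips contending on one byte lock\n",
	       workers * op_arrays * fcntls);
	return 0;
}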

Notably, we seemed to trigger this whenever we flushed opcache, so possibly we're seeing some kind of thundering herd. I see the ZEND_JIT_TRACE_NUM >= JIT_G(max_root_traces) condition being checked; maybe there's a situation where several processes add a script to the JIT table at the same time and then all of them try to reverse out their own work as they fill it up? I'm really not familiar with JIT even as a concept, so I'm just guessing based on the fact that we were at max worker processes and under load whenever we saw this issue.

this nested foreach thing was introduced here: #9343 and an alternative solution was proposed here: #9964

@dstogov - the first was accepted but the second wasn't, the second having attracted the comment that it might cause "too many locks". It may be worth reconsidering both of the above together with this report to find a better solution, assuming that improvements and bugfixes are currently permitted on the JIT code.

PS: I don't believe this is a duplicate of any other issue I can find; unlike many other reports, we see no segfaults, just persistent lock waits.

PHP Version

8.2.5 - 8.2.7 / amd64 / Linux / FPM SAPI

@dstogov
Member

dstogov commented Jul 7, 2023

@maaarghk thanks for the report.

This should already be partially fixed by 5b033b0.

@iluuu1994, your fix could be significantly improved.

zend_jit_stop_persistent_script() needs to be called only once (by one process). It should set a flag in SHM, and the other processes should check this flag before and after taking the lock to prevent redundant work.

@dstogov
Member

dstogov commented Jul 7, 2023

@maaarghk disabling opcache is too conservative a solution; I recommend disabling just the JIT with opcache.jit=0.

@maaarghk
Contributor Author

maaarghk commented Jul 7, 2023

@dstogov Ah neat, I hadn't noticed that. Is the appropriate fix to add a bool to zend_accel_shared_globals?
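
i.e. roughly this shape? A toy sketch of the check-before-and-after-the-lock idea, with invented names - the bool stands in for a flag in zend_accel_shared_globals and the pthread mutex stands in for zend_shared_alloc_lock()/unlock(); it's not real php-src code:

/* toy model of "set a flag in SHM, check it before and after the lock" */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t shm_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in for the SHM lock */
static bool jit_counters_stopped = false;                    /* stand-in for a bool in ZCSG */

static void stop_counter_handlers_once(void)
{
	if (jit_counters_stopped) {        /* cheap check before taking the lock */
		return;
	}
	pthread_mutex_lock(&shm_lock);
	if (!jit_counters_stopped) {       /* re-check now that we hold the lock */
		/* ...walk ZCSG(hash) once and restore the VM handlers... */
		jit_counters_stopped = true;
	}
	pthread_mutex_unlock(&shm_lock);
}

int main(void)
{
	/* in php-fpm this would race across ~100 workers; here it's a no-op demo */
	stop_counter_handlers_once();
	return 0;
}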

@dstogov
Member

dstogov commented Jul 7, 2023

I think so, yes, but let @iluuu1994 fix this. (I'm on vacation for the next two weeks.)

@iluuu1994
Member

@dstogov I'll have a look. Enjoy your vacation!

iluuu1994 added a commit to iluuu1994/php-src that referenced this issue Jul 7, 2023
Use atomic to prevent resetting the handlers from multiple processes or threads.

Fixes phpGH-11609
iluuu1994 added a commit to iluuu1994/php-src that referenced this issue Jul 7, 2023
Use atomic to prevent resetting handlers from multiple processes or threads.

Fixes phpGH-11609
@blackwolf12333

@maaarghk I just want to check whether we're seeing a similar issue in production: do you also see php-fpm logging that it's starting children which exit almost immediately, seemingly in a loop?

Otherwise I'll file a separate bug report, but it started happening when we switched over to PHP 8.2.

@maaarghk
Contributor Author

Believe it or not, I did the above debugging on a server where the hosting co doesn't give me easy access to basic things like the fpm error log - they had to generate the core dumps and strace for me - so I can't say conclusively. But it doesn't sound like it. The child processes here aren't broken as such, just locked up doing this memory dance. As far as I can tell they didn't die afterwards, they just went on to process the request normally, and I don't think they used enough memory to be killed by the OOM killer. There was, however, a supervisor process that would detect FPM not responding to requests and try to restart it - maybe that's what makes your logs look like that? Maybe post an excerpt.

@blackwolf12333

Hmm that doesn't really sound like the same issue then. I'll create a separate issue.

@iluuu1994
Member

@maaarghk Any luck checking if the newest version of PHP already mitigates this to some degree?

@maaarghk
Contributor Author

Hey @iluuu1994 - because I don't have root on the web server, I rely on a support team to intervene if I upgrade to the latest version and find that the issue still causes downtime. Combined with the fact that I only see this issue consistently on a production site, I'm not really in a position to experiment when I know that having the JIT turned off keeps the site stable. Sorry I can't be of more use.
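
(For what it's worth, "JIT turned off" here means opcache itself stayed enabled and we only disabled the JIT in the pool config, along the lines suggested above - something like:)

opcache.enable=1
opcache.jit=0
; a zero JIT buffer also keeps the JIT from ever starting
opcache.jit_buffer_size=0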

@maaarghk
Contributor Author

@iluuu1994 We decided to just YOLO it and re-enable the JIT in production after upgrading to 8.2.8, and it does appear to have mitigated the issue - I can't see an unusual number of failed requests since yesterday. It's hard to say conclusively, since the problem was exacerbated by traffic (it was only really bad enough to notice when every single fpm worker got caught in this loop while at max_children), but unless other reports come in from busier sites, it seems like your improvement can safely be targeted at 8.3. I'll follow up in a week or so.

@iluuu1994
Member

@maaarghk Great, thank you!

iluuu1994 added a commit to iluuu1994/php-src that referenced this issue Jul 27, 2023
Avoid stopping counters repeatedly from different threads/processes.

Fixes phpGH-11609
jorgsowa pushed a commit to jorgsowa/php-src that referenced this issue Aug 16, 2023
Avoid stopping counters repeatedly from different threads/processes.

Fixes phpGH-11609
Closes phpGH-11806