Commit 2f58ac4

[libc][x86] copy one cache line at a time to prevent the use of rep;movsb (#113161)
When using `-mprefer-vector-width=128` with `-march=sandybridge`, copying three cache lines in one go (192B) gets converted into `rep;movsb`, which translates into a 60% hit in performance. Consecutive calls to `__builtin_memcpy_inline` (the implementation behind `builtin::Memcpy::block_offset`) are not coalesced by the compiler, so calling it three times in a row generates the desired assembly. The result differs only in the interleaving of the loads and stores and does not affect performance. This is needed to reland #108939.
1 parent 9ae41c2 commit 2f58ac4

File tree

1 file changed: +9 −8 lines changed
libc/src/string/memory_utils/x86_64/inline_memcpy.h

```diff
@@ -98,19 +98,19 @@ inline_memcpy_x86_sse2_ge64_sw_prefetching(Ptr __restrict dst,
     while (offset + K_TWO_CACHELINES + 32 <= count) {
       inline_memcpy_prefetch(dst, src, offset + K_ONE_CACHELINE);
       inline_memcpy_prefetch(dst, src, offset + K_TWO_CACHELINES);
-      builtin::Memcpy<K_TWO_CACHELINES>::block_offset(dst, src, offset);
-      offset += K_TWO_CACHELINES;
+      // Copy one cache line at a time to prevent the use of `rep;movsb`.
+      for (size_t i = 0; i < 2; ++i, offset += K_ONE_CACHELINE)
+        builtin::Memcpy<K_ONE_CACHELINE>::block_offset(dst, src, offset);
     }
   } else {
     // Three cache lines at a time.
     while (offset + K_THREE_CACHELINES + 32 <= count) {
       inline_memcpy_prefetch(dst, src, offset + K_ONE_CACHELINE);
       inline_memcpy_prefetch(dst, src, offset + K_TWO_CACHELINES);
       inline_memcpy_prefetch(dst, src, offset + K_THREE_CACHELINES);
-      // It is likely that this copy will be turned into a 'rep;movsb' on
-      // non-AVX machines.
-      builtin::Memcpy<K_THREE_CACHELINES>::block_offset(dst, src, offset);
-      offset += K_THREE_CACHELINES;
+      // Copy one cache line at a time to prevent the use of `rep;movsb`.
+      for (size_t i = 0; i < 3; ++i, offset += K_ONE_CACHELINE)
+        builtin::Memcpy<K_ONE_CACHELINE>::block_offset(dst, src, offset);
     }
   }
   // We don't use 'loop_and_tail_offset' because it assumes at least one
@@ -148,8 +148,9 @@ inline_memcpy_x86_avx_ge64_sw_prefetching(Ptr __restrict dst,
     inline_memcpy_prefetch(dst, src, offset + K_ONE_CACHELINE);
     inline_memcpy_prefetch(dst, src, offset + K_TWO_CACHELINES);
     inline_memcpy_prefetch(dst, src, offset + K_THREE_CACHELINES);
-    builtin::Memcpy<K_THREE_CACHELINES>::block_offset(dst, src, offset);
-    offset += K_THREE_CACHELINES;
+    // Copy one cache line at a time to prevent the use of `rep;movsb`.
+    for (size_t i = 0; i < 3; ++i, offset += K_ONE_CACHELINE)
+      builtin::Memcpy<K_ONE_CACHELINE>::block_offset(dst, src, offset);
   }
   // We don't use 'loop_and_tail_offset' because it assumes at least one
   // iteration of the loop.
```
