Commit d39268a

hansendc authored and suryasaimadhu committed
x86/mm/tlb: Revert retpoline avoidance approach
0day reported a regression on a microbenchmark which is intended to
stress the TLB flushing path:

  https://lore.kernel.org/all/20220317090415.GE735@xsang-OptiPlex-9020/

It pointed at a commit from Nadav which intended to remove retpoline
overhead in the TLB flushing path by taking the 'cond'-ition in
on_each_cpu_cond_mask(), pre-calculating it, and incorporating it into
'cpumask'. That allowed the code to use a bunch of earlier direct
calls instead of later indirect calls that need a retpoline.

But, in practice, threads can go idle (and into lazy TLB mode where
they don't need to flush their TLB) between the early and late calls.
It works in this direction and not in the other because TLB-flushing
threads tend to hold mmap_lock for write. Contention on that lock
causes threads to _go_ idle right in this early/late window.

There was not any performance data in the original commit specific to
the retpoline overhead. I did a few tests on a system with retpolines:

  https://lore.kernel.org/all/dd8be93c-ded6-b962-50d4-96b1c3afb2b7@intel.com/

which showed a possible small win. But, that small win pales in
comparison with the bigger loss induced on non-retpoline systems.

Revert the patch that removed the retpolines. This was not a clean
revert, but it was self-contained enough not to be too painful.

Fixes: 6035152 ("x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Nadav Amit <namit@vmware.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/164874672286.389.7021457716635788197.tip-bot2@tip-bot2
1 parent 3123109 commit d39268a

File tree

1 file changed: +5 −32 lines changed

arch/x86/mm/tlb.c

Lines changed: 5 additions & 32 deletions
@@ -855,13 +855,11 @@ static void flush_tlb_func(void *info)
 			nr_invalidate);
 }
 
-static bool tlb_is_not_lazy(int cpu)
+static bool tlb_is_not_lazy(int cpu, void *data)
 {
 	return !per_cpu(cpu_tlbstate_shared.is_lazy, cpu);
 }
 
-static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
-
 DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
 EXPORT_PER_CPU_SYMBOL(cpu_tlbstate_shared);
 
@@ -890,36 +888,11 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
 	 * up on the new contents of what used to be page tables, while
 	 * doing a speculative memory access.
 	 */
-	if (info->freed_tables) {
+	if (info->freed_tables)
 		on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
-	} else {
-		/*
-		 * Although we could have used on_each_cpu_cond_mask(),
-		 * open-coding it has performance advantages, as it eliminates
-		 * the need for indirect calls or retpolines. In addition, it
-		 * allows to use a designated cpumask for evaluating the
-		 * condition, instead of allocating one.
-		 *
-		 * This code works under the assumption that there are no nested
-		 * TLB flushes, an assumption that is already made in
-		 * flush_tlb_mm_range().
-		 *
-		 * cond_cpumask is logically a stack-local variable, but it is
-		 * more efficient to have it off the stack and not to allocate
-		 * it on demand. Preemption is disabled and this code is
-		 * non-reentrant.
-		 */
-		struct cpumask *cond_cpumask = this_cpu_ptr(&flush_tlb_mask);
-		int cpu;
-
-		cpumask_clear(cond_cpumask);
-
-		for_each_cpu(cpu, cpumask) {
-			if (tlb_is_not_lazy(cpu))
-				__cpumask_set_cpu(cpu, cond_cpumask);
-		}
-		on_each_cpu_mask(cond_cpumask, flush_tlb_func, (void *)info, true);
-	}
+	else
+		on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func,
+				      (void *)info, 1, cpumask);
 }
 
 void flush_tlb_multi(const struct cpumask *cpumask,
