Commit b27abac
hansendc authored and torvalds committed
mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes
Patch series "Introduce multi-preference mempolicy", v7.

This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy. This mempolicy mode can be used with either the set_mempolicy(2) or mbind(2) interfaces. Like the MPOL_PREFERRED interface, it allows an application to set a preference for nodes which will fulfil memory allocation requests. Unlike the MPOL_PREFERRED mode, it takes a set of nodes. Like the MPOL_BIND interface, it works over a set of nodes. Unlike MPOL_BIND, it will not cause a SIGSEGV or invoke the OOM killer if those preferred nodes are not available.

Along with these patches are patches for libnuma, numactl, numademo, and memhog. They still need some polish, but can be found here: https://gitlab.com/bwidawsk/numactl/-/tree/prefer-many

It allows new usage: `numactl -P 0,3,4`

The goal of the new mode is to enable some use-cases when using tiered memory usage models which I've lovingly named.

1a. The Hare - The interconnect is fast enough to meet bandwidth and latency requirements, allowing preference to be given to all nodes with "fast" memory.

1b. The Indiscriminate Hare - An application knows it wants fast memory (or perhaps slow memory), but doesn't care which node it runs on. The application can prefer a set of nodes and then xpu bind to the local node (cpu, accelerator, etc). This reverses how nodes are chosen today, where the kernel attempts to use memory local to the CPU whenever possible. This will attempt to use the local accelerator to the memory.

2. The Tortoise - The administrator (or the application itself) is aware it only needs slow memory, and so can prefer that.

Much of this is almost achievable with the bind interface, but the bind interface suffers from an inability to fall back to another set of nodes if binding fails to all nodes in the nodemask.

Like MPOL_BIND, a nodemask is given. Inherently this removes ordering from the preference.

> /* Set first two nodes as preferred in an 8 node system. */
> const unsigned long nodes = 0x3;
> set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8);

> /* Mimic interleave policy, but have fallback. */
> const unsigned long nodes = 0xaa;
> set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8);

Some internal discussion took place around the interface. There are two alternatives which we have discussed, plus one I stuck in:

1. Ordered list of nodes. Currently it's believed that the added complexity is not needed for expected usecases.

2. A flag for bind to allow falling back to other nodes. This confuses the notion of binding and is less flexible than the current solution.

3. Create flags or new modes that help with some ordering. This offers both a friendlier API as well as a solution for more customized usage. It's unknown if it's worth the complexity to support this. Here is sample code for how this might work:

> // Prefer specific nodes for something wacky
> set_mempolicy(MPOL_PREFERRED_MANY, 0x17c, 1024);
>
> // Default
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
> // which is the same as
> set_mempolicy(MPOL_DEFAULT, NULL, 0);
>
> // The Hare
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
>
> // The Tortoise
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
>
> // Prefer the fast memory of the first two sockets
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);

This patch (of 5):

The NUMA APIs currently allow passing in a "preferred node" as a single bit set in a nodemask. If more than one bit is set, bits after the first are ignored.

This single node is generally OK for location-based NUMA where memory being allocated will eventually be operated on by a single CPU. However, in systems with multiple memory types, folks want to target a *type* of memory instead of a location. For instance, someone might want some high-bandwidth memory but not care about the CPU next to which it is allocated. Or, they want a cheap, high capacity allocation and want to target all NUMA nodes which have persistent memory in volatile mode. In both of these cases, the application wants to target a *set* of nodes, but does not want strict MPOL_BIND behavior as that could lead to OOM killer or SIGSEGV.

So add MPOL_PREFERRED_MANY policy to support the multiple preferred nodes requirement. This is not a pie-in-the-sky dream for an API. This was a response to a specific ask of more than one group at Intel. Specifically:

1. There are existing libraries that target memory types such as https://github.com/memkind/memkind. These are known to suffer from SIGSEGVs when memory is low on targeted memory "kinds" that span more than one node. The MCDRAM on a Xeon Phi in "Cluster on Die" mode is an example of this.

2. Volatile-use persistent memory users want to have a memory policy which is targeted at either "cheap and slow" (PMEM) or "expensive and fast" (DRAM). However, they do not want to experience allocation failures when the targeted type is unavailable.

3. Allocate-then-run. Generally, we let the process scheduler decide on which physical CPU to run a task. That location provides a default allocation policy, and memory availability is not generally considered when placing tasks. For situations where memory is valuable and constrained, some users want to allocate memory first, *then* allocate close compute resources to the allocation. This is the reverse of the normal (CPU) model. Accelerators such as GPUs that operate on core-mm-managed memory are interested in this model.

A check is added in sanitize_mpol_flags() to not permit the 'prefer_many' policy to be used for now. It will be removed in a later patch once all implementations for 'prefer_many' are ready, as suggested by Michal Hocko.

[mhocko@kernel.org: suggest to refine policy_node/policy_nodemask handling]
Link: https://lkml.kernel.org/r/1627970362-61305-1-git-send-email-feng.tang@intel.com
Link: https://lore.kernel.org/r/20200630212517.308045-4-ben.widawsky@intel.com
Link: https://lkml.kernel.org/r/1627970362-61305-2-git-send-email-feng.tang@intel.com
Co-developed-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
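
The nodemask values quoted in the message (0x3 for the first two nodes, 0xaa for the odd nodes of an 8 node system) are plain bitmaps with one bit per node id. As a minimal sketch of how such a mask could be built in userspace, assuming all node ids fit in a single unsigned long; `nodemask_from_ids` is a hypothetical helper for illustration and is not part of libnuma or the kernel:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Number of node bits that fit in the single word used by this sketch. */
#define MASK_BITS (sizeof(unsigned long) * CHAR_BIT)

/*
 * Build a one-word nodemask with one bit set per listed node id.
 * Hypothetical helper; real code with many nodes needs an array of words.
 */
unsigned long nodemask_from_ids(const unsigned int *ids, size_t count)
{
	unsigned long mask = 0;

	for (size_t i = 0; i < count; i++) {
		assert(ids[i] < MASK_BITS);	/* node id must fit in one word */
		mask |= 1UL << ids[i];
	}
	return mask;
}
```

The resulting word would be passed by address as the nodemask argument of set_mempolicy(2), as in the quoted examples. With this patch alone the kernel still rejects the new mode with EINVAL, because of the temporary check in sanitize_mpol_flags().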
1 parent 062db29 commit b27abac

2 files changed, 60 additions and 14 deletions


include/uapi/linux/mempolicy.h

Lines changed: 1 addition & 0 deletions
@@ -22,6 +22,7 @@ enum {
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
+	MPOL_PREFERRED_MANY,
 	MPOL_MAX,	/* always last member of enum */
 };
 

mm/mempolicy.c

Lines changed: 59 additions & 14 deletions
@@ -31,6 +31,9 @@
  *                but useful to set in a VMA when you have a non default
  *                process policy.
  *
+ * preferred many Try a set of nodes first before normal fallback. This is
+ *                similar to preferred without the special case.
+ *
  * default        Allocate on the local node first, or when on a VMA
  *                use the process policy. This is what Linux always did
  *                in a NUMA aware kernel and still does by, ahem, default.
@@ -207,6 +210,14 @@ static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
 	return 0;
 }
 
+static int mpol_new_preferred_many(struct mempolicy *pol, const nodemask_t *nodes)
+{
+	if (nodes_empty(*nodes))
+		return -EINVAL;
+	pol->nodes = *nodes;
+	return 0;
+}
+
 static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
 {
 	if (nodes_empty(*nodes))
@@ -408,6 +419,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 	[MPOL_LOCAL] = {
 		.rebind = mpol_rebind_default,
 	},
+	[MPOL_PREFERRED_MANY] = {
+		.create = mpol_new_preferred_many,
+		.rebind = mpol_rebind_preferred,
+	},
 };
 
 static int migrate_page_add(struct page *page, struct list_head *pagelist,
@@ -900,6 +915,7 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
 		*nodes = p->nodes;
 		break;
 	case MPOL_LOCAL:
@@ -1446,7 +1462,13 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
 {
 	*flags = *mode & MPOL_MODE_FLAGS;
 	*mode &= ~MPOL_MODE_FLAGS;
-	if ((unsigned int)(*mode) >= MPOL_MAX)
+
+	/*
+	 * The check should be 'mode >= MPOL_MAX', but as 'prefer_many'
+	 * is not fully implemented, don't permit it to be used for now,
+	 * and the logic will be restored in following patch
+	 */
+	if ((unsigned int)(*mode) >= MPOL_PREFERRED_MANY)
 		return -EINVAL;
 	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
 		return -EINVAL;
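
To make the effect of the temporary check concrete, here is a small userspace model of the function in this hunk. It is a sketch, not the kernel code: the mode enum follows the order in include/uapi/linux/mempolicy.h, the two flag values are assumptions stated in the comments, and any other mode flags are deliberately left out.

```c
#include <assert.h>
#include <errno.h>

/* Mode numbers follow the enum order in include/uapi/linux/mempolicy.h. */
enum {
	MPOL_DEFAULT,
	MPOL_PREFERRED,
	MPOL_BIND,
	MPOL_INTERLEAVE,
	MPOL_LOCAL,
	MPOL_PREFERRED_MANY,
	MPOL_MAX,	/* always last member of enum */
};

/* Assumed flag values; simplified to the two flags this hunk checks. */
#define MPOL_F_STATIC_NODES	(1 << 15)
#define MPOL_F_RELATIVE_NODES	(1 << 14)
#define MPOL_MODE_FLAGS		(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES)

/* Userspace model: split the flags from the mode, then validate both. */
int sanitize_mpol_flags_model(int *mode, unsigned short *flags)
{
	assert(mode && flags);

	*flags = *mode & MPOL_MODE_FLAGS;
	*mode &= ~MPOL_MODE_FLAGS;

	/* Temporarily stricter than 'mode >= MPOL_MAX': prefer-many is refused. */
	if ((unsigned int)(*mode) >= MPOL_PREFERRED_MANY)
		return -EINVAL;
	/* Static and relative node interpretation are mutually exclusive. */
	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
		return -EINVAL;
	return 0;
}
```

In this model a caller passing MPOL_PREFERRED_MANY gets -EINVAL, which matches the stated intent that the mode stays unusable until the rest of the series lands.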
@@ -1875,16 +1897,27 @@ static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
  */
 nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 {
+	int mode = policy->mode;
+
 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
-	if (unlikely(policy->mode == MPOL_BIND) &&
-			apply_policy_zone(policy, gfp_zone(gfp)) &&
-			cpuset_nodemask_valid_mems_allowed(&policy->nodes))
+	if (unlikely(mode == MPOL_BIND) &&
+		apply_policy_zone(policy, gfp_zone(gfp)) &&
+		cpuset_nodemask_valid_mems_allowed(&policy->nodes))
+		return &policy->nodes;
+
+	if (mode == MPOL_PREFERRED_MANY)
 		return &policy->nodes;
 
 	return NULL;
 }
 
-/* Return the node id preferred by the given mempolicy, or the given id */
+/*
+ * Return the preferred node id for 'prefer' mempolicy, and return
+ * the given id for all other policies.
+ *
+ * policy_node() is always coupled with policy_nodemask(), which
+ * secures the nodemask limit for 'bind' and 'prefer-many' policy.
+ */
 static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
 {
 	if (policy->mode == MPOL_PREFERRED) {
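
The commit message describes the intended difference between 'bind' and 'prefer-many': both restrict the first attempt to the policy nodemask, but only 'prefer-many' may fall back to other nodes. The toy model below illustrates that difference under loud assumptions: `pick_node` is a hypothetical function, nodes are tried in ascending id order rather than by zonelist distance, and the fallback path itself is added by later patches in the series, not by this one.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Toy model: choose a node for one page given per-node free page counts.
 * 'allowed' plays the role of the nodemask returned by policy_nodemask().
 * 'fallback' is 1 for prefer-many semantics and 0 for strict bind semantics.
 * Returns the chosen node id, or -1 if the allocation would fail.
 */
int pick_node(unsigned long allowed, const unsigned int *free_pages,
	      size_t nr_nodes, int fallback)
{
	assert(nr_nodes <= sizeof(unsigned long) * CHAR_BIT);

	/* First pass: only nodes inside the policy nodemask. */
	for (size_t nid = 0; nid < nr_nodes; nid++)
		if ((allowed & (1UL << nid)) && free_pages[nid] > 0)
			return (int)nid;

	if (!fallback)
		return -1;	/* bind: no allowed node has memory */

	/* Second pass: any node that still has free memory. */
	for (size_t nid = 0; nid < nr_nodes; nid++)
		if (free_pages[nid] > 0)
			return (int)nid;

	return -1;
}
```

With nodes 0 and 1 preferred and both exhausted, the model returns node 2 when fallback is enabled and -1 when it is not, which is the behavioural gap the new policy is meant to close.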
@@ -1936,7 +1969,9 @@ unsigned int mempolicy_slab_node(void)
 	case MPOL_INTERLEAVE:
 		return interleave_nodes(policy);
 
-	case MPOL_BIND: {
+	case MPOL_BIND:
+	case MPOL_PREFERRED_MANY:
+	{
 		struct zoneref *z;
 
 		/*
@@ -2008,29 +2043,31 @@ static inline unsigned interleave_nid(struct mempolicy *pol,
  * @addr: address in @vma for shared policy lookup and interleave policy
  * @gfp_flags: for requested zone
  * @mpol: pointer to mempolicy pointer for reference counted mempolicy
- * @nodemask: pointer to nodemask pointer for MPOL_BIND nodemask
+ * @nodemask: pointer to nodemask pointer for 'bind' and 'prefer-many' policy
  *
  * Returns a nid suitable for a huge page allocation and a pointer
  * to the struct mempolicy for conditional unref after allocation.
- * If the effective policy is 'BIND, returns a pointer to the mempolicy's
- * @nodemask for filtering the zonelist.
+ * If the effective policy is 'bind' or 'prefer-many', returns a pointer
+ * to the mempolicy's @nodemask for filtering the zonelist.
  *
  * Must be protected by read_mems_allowed_begin()
  */
 int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask)
 {
 	int nid;
+	int mode;
 
 	*mpol = get_vma_policy(vma, addr);
-	*nodemask = NULL;	/* assume !MPOL_BIND */
+	*nodemask = NULL;
+	mode = (*mpol)->mode;
 
-	if (unlikely((*mpol)->mode == MPOL_INTERLEAVE)) {
+	if (unlikely(mode == MPOL_INTERLEAVE)) {
 		nid = interleave_nid(*mpol, vma, addr,
 					huge_page_shift(hstate_vma(vma)));
 	} else {
 		nid = policy_node(gfp_flags, *mpol, numa_node_id());
-		if ((*mpol)->mode == MPOL_BIND)
+		if (mode == MPOL_BIND || mode == MPOL_PREFERRED_MANY)
 			*nodemask = &(*mpol)->nodes;
 	}
 	return nid;
@@ -2063,6 +2100,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 	mempolicy = current->mempolicy;
 	switch (mempolicy->mode) {
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 		*mask = mempolicy->nodes;
@@ -2173,7 +2211,7 @@ struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		 * node and don't fall back to other nodes, as the cost of
 		 * remote accesses would likely offset THP benefits.
 		 *
-		 * If the policy is interleave, or does not allow the current
+		 * If the policy is interleave or does not allow the current
 		 * node in its nodemask, we allocate the standard way.
 		 */
 		if (pol->mode == MPOL_PREFERRED)
@@ -2311,6 +2349,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
 		return !!nodes_equal(a->nodes, b->nodes);
 	case MPOL_LOCAL:
 		return true;
@@ -2451,6 +2490,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		break;
 
 	case MPOL_PREFERRED:
+		if (node_isset(curnid, pol->nodes))
+			goto out;
 		polnid = first_node(pol->nodes);
 		break;
 
@@ -2465,9 +2506,10 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 				break;
 			goto out;
 		}
+		fallthrough;
 
+	case MPOL_PREFERRED_MANY:
 		/*
-		 * allows binding to multiple nodes.
 		 * use current page if in policy nodemask,
 		 * else select nearest allowed node, if any.
 		 * If no allowed nodes, use current [!misplaced].
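
The comment in this hunk states the placement rule shared by 'bind' and 'prefer-many': keep the page if its node is in the policy nodemask, otherwise choose the nearest allowed node, and treat an empty mask as "not misplaced". A simplified standalone sketch of that rule follows; `misplaced_target` and the explicit `distance` array (distance of each node from the local CPU) are assumptions for illustration, since the kernel derives the nearest node from its zonelists.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NO_NODE (-1)	/* stand-in for "page is not misplaced" */

/*
 * Return the node the page should move to, or NO_NODE if it may stay.
 * 'distance[nid]' is the assumed distance from the local CPU to node nid.
 */
int misplaced_target(int curnid, unsigned long allowed,
		     const int *distance, size_t nr_nodes)
{
	int best = NO_NODE;

	assert(nr_nodes <= sizeof(unsigned long) * CHAR_BIT);
	assert(curnid >= 0 && (size_t)curnid < nr_nodes);

	/* Use the current page if its node is in the policy nodemask. */
	if (allowed & (1UL << curnid))
		return NO_NODE;

	/* Else select the nearest allowed node, if any. */
	for (size_t nid = 0; nid < nr_nodes; nid++) {
		if (!(allowed & (1UL << nid)))
			continue;
		if (best == NO_NODE || distance[nid] < distance[best])
			best = (int)nid;
	}

	/* With no allowed nodes, best is still NO_NODE: keep the page. */
	return best;
}
```

The sketch stops at choosing a target node; the kernel function goes on to compare that node with the page's current node before reporting the page as misplaced.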
@@ -2829,6 +2871,7 @@ static const char * const policy_modes[] =
 	[MPOL_BIND] = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
 	[MPOL_LOCAL] = "local",
+	[MPOL_PREFERRED_MANY] = "prefer (many)",
 };
 
 
@@ -2907,6 +2950,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 		if (!nodelist)
 			err = 0;
 		goto out;
+	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 		/*
 		 * Insist on a nodelist
@@ -2993,6 +3037,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 	case MPOL_LOCAL:
 		break;
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 		nodes = pol->nodes;
