
LKML: hawkes@sgi ...: [PATCH] -mm tree: broken "dynamic sched domains" and "migration cost"

<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>LKML: hawkes&#64;sgi ...: [PATCH] -mm tree: broken "dynamic sched domains" and "migration cost"</title><link href="/css/message.css" rel="stylesheet" type="text/css" /><link href="/css/wrap.css" rel="alternate stylesheet" type="text/css" title="wrap" /><link href="/css/nowrap.css" rel="stylesheet" type="text/css" title="nowrap" /><link href="/favicon.ico" rel="shortcut icon" /><script src="/js/simple-calendar.js" type="text/javascript"></script><script src="/js/styleswitcher.js" type="text/javascript"></script><link rel="alternate" type="application/rss+xml" title="lkml.org : last 100 messages" href="/rss.php" /><link rel="alternate" type="application/rss+xml" title="lkml.org : last messages by &#9;hawkes&#64;sgi ..." 
href="/groupie.php?aid=4069" /><!--Matomo--><script> var _paq = window._paq = window._paq || []; /* tracker methods like "setCustomDimension" should be called before "trackPageView" */ _paq.push(["setDoNotTrack", true]); _paq.push(["disableCookies"]); _paq.push(['trackPageView']); _paq.push(['enableLinkTracking']); (function() { var u="//m.lkml.org/"; _paq.push(['setTrackerUrl', u+'matomo.php']); _paq.push(['setSiteId', '1']); var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0]; g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s); })(); </script><!--End Matomo Code--></head><body onload="es.jasper.simpleCalendar.init();" itemscope="itemscope" itemtype="http://schema.org/BlogPosting"><table border="0" cellpadding="0" cellspacing="0"><tr><td width="180" align="center"><a href="/"><img style="border:0;width:135px;height:32px" src="/images/toprowlk.gif" alt="lkml.org" /></a></td><td width="32">&#160;</td><td class="nb"><div><a class="nb" href="/lkml"> [lkml]</a> &#160; <a class="nb" href="/lkml/2005"> [2005]</a> &#160; <a class="nb" href="/lkml/2005/12"> [Dec]</a> &#160; <a class="nb" href="/lkml/2005/12/9"> [9]</a> &#160; <a class="nb" href="/lkml/last100"> [last100]</a> &#160; <a href="/rss.php"><img src="/images/rss-or.gif" border="0" alt="RSS Feed" /></a></div><div>Views: <a href="#" class="nowrap" onclick="setActiveStyleSheet('wrap');return false;">[wrap]</a><a href="#" class="wrap" onclick="setActiveStyleSheet('nowrap');return false;">[no wrap]</a> &#160; <a class="nb" href="/lkml/mheaders/2005/12/9/202" onclick="this.href='/lkml/headers'+'/2005/12/9/202';">[headers]</a>&#160; <a href="/lkml/bounce/2005/12/9/202">[forward]</a>&#160; </div></td><td width="32">&#160;</td></tr><tr><td valign="top"><div class="es-jasper-simpleCalendar" baseurl="/lkml/"></div><div class="threadlist">Messages in this thread</div><ul class="threadlist"><li class="root"><a href="/lkml/2005/12/9/202">First message in thread</a></li><li class="origin"><a href="/lkml/2005/12/9/255"> 
hawkes&#64;sgi ...</a><ul><li><a href="/lkml/2005/12/9/255">Paul Jackson</a></li><li><a href="/lkml/2005/12/10/58">Ingo Molnar</a></li><li><a href="/lkml/2005/12/10/60">Ingo Molnar</a><ul><li><a href="/lkml/2005/12/12/192">"John Hawkes"</a><ul><li><a href="/lkml/2005/12/13/26">Ingo Molnar</a></li></ul></li></ul></li></ul></li></ul><div class="threadlist">Patch in this message</div><ul class="threadlist"><li><a href="/lkml/diff/2005/12/9/202/1">Get diff 1</a></li></ul></td><td width="32" rowspan="2" class="c" valign="top"><img src="/images/icornerl.gif" width="32" height="32" alt="/" /></td><td class="c" rowspan="2" valign="top" style="padding-top: 1em"><table><tr><td><table><tr><td class="lp">Date</td><td class="rp" itemprop="datePublished">Fri, 9 Dec 2005 12:54:54 -0800 (PST)</td></tr><tr><td class="lp">From</td><td class="rp" itemprop="author"> hawkes&#64;sgi ...</td></tr><tr><td class="lp">Subject</td><td class="rp" itemprop="name">[PATCH] -mm tree: broken "dynamic sched domains" and "migration cost"</td></tr></table></td><td></td></tr></table><pre itemprop="articleBody">As recently as 2.6.15-rc5-mm1 the combination of "migration cost matrix"<br />calculations and "dynamic sched domains" is broken. This patch fixes<br />the basic bug, even though I still have larger problems with the<br />"migration cost matrix" concept as implemented.<br /><br />The essence of the bug is that the code and data in the CPU Scheduler<br />that perform the cost calculation are declared as "__initdata" or<br />"__init" or "__devinit", even though a runtime declaration of a<br />cpu-exclusive cpuset will invoke build_sched_domains() to rebuild<br />the sched domains, and that calls the now-released<br />calibrate_migration_costs(), et al. 
The attached patch changes the<br />declarations of the code and data to make them persistent.<br /><br />To review, to create a cpu-exclusive cpuset, do something like this:<br /> cd /dev/cpuset<br /> mkdir cpu0<br /> echo 0 &gt; cpu0/cpus<br /> echo 0 &gt; cpu0/mems<br /> echo 1 &gt; cpu0/cpu_exclusive<br /><br />Various other problems with "migration cost matrix" are not addressed<br />by this patch. They are:<br /><br />(1) calibrate_migration_costs() printk's the matrix at boot time, and<br /> for large NR_CPUS values this can be voluminous. We might consider<br /> changing the naked printk()'s into printk(KERN_DEBUG ...) to hide<br /> them from everyone but a sysadmin or developer who has a need to<br /> view the values.<br /><br />(2) calibrate_migration_costs() printk's the matrix for every call to<br /> build_sched_domains(), which is called twice for each declaration<br /> of a cpu-exclusive cpuset. The syslog output is voluminous for<br /> large NR_CPUS values. It is unclear to me how interesting this<br /> output is after the initial display at boot time. Various Job<br /> Manager software will actively create and destroy cpu-exclusive<br /> cpusets, and that will flood the syslog with matrix output.<br /><br />(3) For a cpu-exclusive cpuset cpu, the domain_distance() calculation <br /> produces a legal value, but not a useful value. That is to say,<br /> if cpu0 is isolated into a cpu-exclusive cpuset, then the semantic<br /> meaning of "domain distance" between cpu0 and another CPU outside of<br /> this cpu-exclusive cpuset is essentially undefined. No load-balancing<br /> will occur in or out of this cpu-exclusive cpuset. 
There is no<br /> need to call measure_migration_cost() for any migrations in or out of<br /> a cpu-exclusive cpuset.<br /><br /> The reason why this is significant is that measure_migration_cost()<br /> is expensive, and the creation of that cpu0 cpu-exclusive cpuset<br /> stalls the<br /> echo 1 &gt; cpu0/cpu_exclusive<br /> for about six seconds on a 64p ia64/sn2 platform, all to calculate<br /> a cost value that is never used.<br /><br />(4) The expense of the migration_cost() calculation can be sizeable.<br /> It can be reduced somewhat by changing the 5% increments into 10%<br /> or even higher. That reduces the boot-time "calibration delay"<br /> from 50 seconds down to 20 seconds (for the 64p ia64/sn2) for<br /> three levels of sched domains, and the declaration of a single<br /> cpu0 cpu-exclusive dynamic sched domain from 6 seconds down to 5<br /> (all to calculate that unnecessary new domain_distance).<br /><br />(5) Besides, the migration cost between any two CPUs is something<br /> that can be calculated once at boot time and remembered<br /> thereafter. I suspect the reason why the algorithm doesn't do<br /> this is that an exhaustive NxN calculation will be very slow for<br /> large NR_CPUS counts, which explains why the calculations are<br /> now done in the context of sched domains.<br /><br /> However, simply calculating the migration cost for each "domain<br /> distance" presupposes that the sched domain groupings accurately<br /> reflect true migration costs. For platforms like the ia64/sn2<br /> this is not true. For example, a 64p ia64/sn2 will have three<br /> sched domain levels: level zero is each 2-CPU pair within a<br /> node; level one is an SD_NODES_PER_DOMAIN-size arbitrary clump<br /> of 32 CPUs; level two is all 64 CPUs. Looking at that middle<br /> level, if I understand the algorithm, there is one and only one<br /> migration_cost[1] calculation made for an arbitrary choice of one<br /> CPU in a 32-CPU clump and another CPU in that clump. 
Reality is<br /> that there are various migration costs for different pairs of CPUs<br /> in that set of 32. This is yet another reason why I believe we<br /> don't need to do those 5% size increments to get really accurate<br /> timings.<br /><br />For now, here is the patch to avoid the basic bug:<br /><br /><br />Signed-off-by: John Hawkes &lt;hawkes&#64;sgi.com&gt;<br /><br />Index: linux/kernel/sched.c<br />===================================================================<br />--- linux.orig/kernel/sched.c 2005-12-09 09:26:04.000000000 -0800<br />+++ linux/kernel/sched.c 2005-12-09 10:53:28.000000000 -0800<br />&#64;&#64; -5295,7 +5295,7 &#64;&#64;<br /> <br /> #define MIGRATION_FACTOR_SCALE 128<br /> <br />-static __initdata unsigned int migration_factor = MIGRATION_FACTOR_SCALE;<br />+static unsigned int migration_factor = MIGRATION_FACTOR_SCALE;<br /> <br /> static int __init setup_migration_factor(char *str)<br /> {<br />&#64;&#64; -5310,7 +5310,7 &#64;&#64;<br /> * Estimated distance of two CPUs, measured via the number of domains<br /> * we have to pass for the two CPUs to be in the same span:<br /> */<br />-__devinit static unsigned long domain_distance(int cpu1, int cpu2)<br />+static unsigned long domain_distance(int cpu1, int cpu2)<br /> {<br /> unsigned long distance = 0;<br /> struct sched_domain *sd;<br />&#64;&#64; -5329,7 +5329,7 &#64;&#64;<br /> return distance;<br /> }<br /> <br />-static __initdata unsigned int migration_debug;<br />+static unsigned int migration_debug;<br /> <br /> static int __init setup_migration_debug(char *str)<br /> {<br />&#64;&#64; -5345,7 +5345,7 &#64;&#64;<br /> * bootup. Gets used in the domain-setup code (i.e. 
during SMP<br /> * bootup).<br /> */<br />-__initdata unsigned int max_cache_size;<br />+unsigned int max_cache_size;<br /> <br /> static int __init setup_max_cache_size(char *str)<br /> {<br />&#64;&#64; -5360,7 +5360,7 &#64;&#64;<br /> * is the operation that is timed, so we try to generate unpredictable<br /> * cachemisses that still end up filling the L2 cache:<br /> */<br />-__init static void touch_cache(void *__cache, unsigned long __size)<br />+static void touch_cache(void *__cache, unsigned long __size)<br /> {<br /> unsigned long size = __size/sizeof(long), chunk1 = size/3,<br /> chunk2 = 2*size/3;<br />&#64;&#64; -5382,8 +5382,8 &#64;&#64;<br /> /*<br /> * Measure the cache-cost of one task migration. Returns in units of nsec.<br /> */<br />-__init static unsigned long long measure_one(void *cache, unsigned long size,<br />- int source, int target)<br />+static unsigned long long measure_one(void *cache, unsigned long size,<br />+ int source, int target)<br /> {<br /> cpumask_t mask, saved_mask;<br /> unsigned long long t0, t1, t2, t3, cost;<br />&#64;&#64; -5449,7 +5449,7 &#64;&#64;<br /> * Architectures can prime the upper limit of the search range via<br /> * max_cache_size, otherwise the search range defaults to 20MB...64K.<br /> */<br />-__init static unsigned long long<br />+static unsigned long long<br /> measure_cost(int cpu1, int cpu2, void *cache, unsigned int size)<br /> {<br /> unsigned long long cost1, cost2;<br />&#64;&#64; -5501,7 +5501,7 &#64;&#64;<br /> return cost1 - cost2;<br /> }<br /> <br />-__devinit static unsigned long long measure_migration_cost(int cpu1, int cpu2)<br />+static unsigned long long measure_migration_cost(int cpu1, int cpu2)<br /> {<br /> unsigned long long max_cost = 0, fluct = 0, avg_fluct = 0;<br /> unsigned int max_size, size, size_found = 0;<br />&#64;&#64; -5606,7 +5606,7 &#64;&#64;<br /> return 2 * max_cost * migration_factor / MIGRATION_FACTOR_SCALE;<br /> }<br /> <br />-void __devinit 
calibrate_migration_costs(void)<br />+void calibrate_migration_costs(void)<br /> {<br /> int cpu1 = -1, cpu2 = -1, cpu, orig_cpu = raw_smp_processor_id();<br /> struct sched_domain *sd;<br />-<br />To unsubscribe from this list: send the line "unsubscribe linux-kernel" in<br />the body of a message to majordomo&#64;vger.kernel.org<br />More majordomo info at <a href="http://vger.kernel.org/majordomo-info.html">http://vger.kernel.org/majordomo-info.html</a><br />Please read the FAQ at <a href="http://www.tux.org/lkml/">http://www.tux.org/lkml/</a><br /></pre></td><td width="32" rowspan="2" class="c" valign="top"><img src="/images/icornerr.gif" width="32" height="32" alt="\" /></td></tr><tr><td align="right" valign="bottom"> &#160; </td></tr><tr><td align="right" valign="bottom">&#160;</td><td class="c" valign="bottom" style="padding-bottom: 0px"><img src="/images/bcornerl.gif" width="32" height="32" alt="\" /></td><td class="c">&#160;</td><td class="c" valign="bottom" style="padding-bottom: 0px"><img src="/images/bcornerr.gif" width="32" height="32" alt="/" /></td></tr><tr><td align="right" valign="top" colspan="2"> &#160; </td><td class="lm">Last update: 2005-12-09 21:58 &#160;&#160; [from the cache]<br />©2003-2020 <a href="http://blog.jasper.es/"><span itemprop="editor">Jasper Spaans</span></a>|hosted at <a href="https://www.digitalocean.com/?refcode=9a8e99d24cf9">Digital Ocean</a> and my Meterkast|<a href="http://blog.jasper.es/categories.html#lkml-ref">Read the blog</a></td><td>&#160;</td></tr></table><script language="javascript" src="/js/styleswitcher.js" type="text/javascript"></script></body></html>
