LKML: Thomas Gleixner: [patch V3a 17/18] posix-timers: Provide a mechanism to allocate a given timer ID
<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>LKML: Thomas Gleixner: [patch V3a 17/18] posix-timers: Provide a mechanism to allocate a given timer ID</title><link href="/css/message.css" rel="stylesheet" type="text/css" /><link href="/css/wrap.css" rel="alternate stylesheet" type="text/css" title="wrap" /><link href="/css/nowrap.css" rel="stylesheet" type="text/css" title="nowrap" /><link href="/favicon.ico" rel="shortcut icon" /><script src="/js/simple-calendar.js" type="text/javascript"></script><script src="/js/styleswitcher.js" type="text/javascript"></script><link rel="alternate" type="application/rss+xml" title="lkml.org : last 100 messages" href="/rss.php" /><link rel="alternate" type="application/rss+xml" title="lkml.org : last messages by Thomas Gleixner" href="/groupie.php?aid=" /><!--Matomo--><script> var _paq = window._paq = window._paq || []; /* tracker methods like "setCustomDimension" should be called before "trackPageView" */ _paq.push(["setDoNotTrack", true]); _paq.push(["disableCookies"]); _paq.push(['trackPageView']); _paq.push(['enableLinkTracking']); (function() { var u="//m.lkml.org/"; _paq.push(['setTrackerUrl', u+'matomo.php']); _paq.push(['setSiteId', '1']); var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0]; g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s); })(); </script><!--End Matomo Code--></head><body onload="es.jasper.simpleCalendar.init();" itemscope="itemscope" itemtype="http://schema.org/BlogPosting"><table border="0" cellpadding="0" cellspacing="0"><tr><td width="180" align="center"><a href="/"><img style="border:0;width:135px;height:32px" src="/images/toprowlk.gif" alt="lkml.org" /></a></td><td width="32">&nbsp;</td><td 
class="nb"><div><a class="nb" href="/lkml"> [lkml]</a> &nbsp; <a class="nb" href="/lkml/2025"> [2025]</a> &nbsp; <a class="nb" href="/lkml/2025/3"> [Mar]</a> &nbsp; <a class="nb" href="/lkml/2025/3/11"> [11]</a> &nbsp; <a class="nb" href="/lkml/last100"> [last100]</a> &nbsp; <a href="/rss.php"><img src="/images/rss-or.gif" border="0" alt="RSS Feed" /></a></div><div>Views: <a href="#" class="nowrap" onclick="setActiveStyleSheet('wrap');return false;">[wrap]</a><a href="#" class="wrap" onclick="setActiveStyleSheet('nowrap');return false;">[no wrap]</a> &nbsp; <a class="nb" href="/lkml/mheaders/2025/3/11/1618" onclick="this.href='/lkml/headers'+'/2025/3/11/1618';">[headers]</a>&nbsp; <a href="/lkml/bounce/2025/3/11/1618">[forward]</a>&nbsp; </div></td><td width="32">&nbsp;</td></tr><tr><td valign="top"><div class="es-jasper-simpleCalendar" baseurl="/lkml/"></div><div class="threadlist">Messages in this thread</div><ul class="threadlist"><li class="root"><a href="/lkml/2025/3/8/426">First message in thread</a></li><li><a href="/lkml/2025/3/11/1588">Frederic Weisbecker</a><ul><li><a href="/lkml/2025/3/11/1615">Thomas Gleixner</a><ul><li class="origin"><a href="/lkml/2025/3/11/1633">Thomas Gleixner</a><ul><li><a href="/lkml/2025/3/11/1633">Frederic Weisbecker</a><ul><li><a href="/lkml/2025/3/12/252">Cyrill Gorcunov</a></li></ul></li><li><a href="/lkml/2025/3/13/618">"tip-bot2 for Thomas Gleixner"</a></li></ul></li></ul></li><li><a href="/lkml/2025/3/12/649">Cyrill Gorcunov</a></li></ul></li></ul><div class="threadlist">Patch in this message</div><ul class="threadlist"><li><a href="/lkml/diff/2025/3/11/1618/1">Get diff 1</a></li></ul></td><td width="32" rowspan="2" class="c" valign="top"><img src="/images/icornerl.gif" width="32" height="32" alt="/" /></td><td class="c" rowspan="2" valign="top" style="padding-top: 1em"><table><tr><td><table><tr><td class="lp">From</td><td class="rp" itemprop="author">Thomas Gleixner <></td></tr><tr><td class="lp">Subject</td><td class="rp" itemprop="name">[patch V3a 17/18] 
posix-timers: Provide a mechanism to allocate a given timer ID</td></tr><tr><td class="lp">Date</td><td class="rp" itemprop="datePublished">Tue, 11 Mar 2025 23:07:44 +0100</td></tr></table></td><td></td></tr></table><pre itemprop="articleBody">Checkpoint/Restore in Userspace (CRIU) needs to reconstruct posix timers<br />with the same timer ID on restore. It uses sys_timer_create() and relies on<br />the monotonically increasing timer ID provided by this syscall. It creates and<br />deletes timers until the desired ID is reached. This can loop for a long<br />time when the checkpointed process had a very sparse timer ID range.<br /><br />It has been debated to implement a new syscall to allow the creation of<br />timers with a given timer ID, but that's tedious due to the 32/64bit compat<br />issues of sigevent_t and of dubious value.<br /><br />The restore mechanism of CRIU creates the timers in a state where all<br />threads of the restored process are held on a barrier and cannot issue<br />syscalls. That means the restorer task has exclusive control.<br /><br />This allows the issue to be addressed with a prctl() so that the restorer<br />thread can do:<br /><br />   if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_ON))<br />           goto linear_mode;<br />   create_timers_with_explicit_ids();<br />   prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_OFF);<br /> <br />This is backwards compatible because the prctl() fails on older kernels and<br />CRIU can fall back to the linear timer ID mechanism. 
CRIU versions which do<br />not know about the prctl() just work as before.<br /><br />Implement the prctl() and modify timer_create() so that it copies the<br />requested timer ID from userspace by utilizing the existing timer_t<br />pointer, which is used to copy out the allocated timer ID on success.<br /><br />If the prctl() is disabled, which it is by default, timer_create() works as<br />before and does not try to read from the userspace pointer.<br /><br />There is no problem when a broken or rogue user space application enables<br />the prctl(). If the user space pointer does not contain a valid ID, then<br />timer_create() fails. If the data is not initialized, but contains a<br />random valid ID, timer_create() will create that random timer ID or fail if<br />the ID is already given out. <br /> <br />As CRIU must use the raw syscall to avoid manipulating the internal state<br />of the restored process, this has no library dependencies and can be<br />adopted by CRIU right away.<br /><br />Recreating two timers with IDs 1000000 and 2000000 takes 1.5 seconds with<br />the create/delete method. 
With the prctl() it takes 3 microseconds.<br /><br />Signed-off-by: Thomas Gleixner <tglx@linutronix.de><br />---<br />V3a: Remove the locking in the prctl() and clear restore mode on exec()<br /> - Frederic<br />V2: Move the ID counter ahead to avoid collisions after switching back to<br /> normal mode.<br />---<br /> include/linux/posix-timers.h | 2 <br /> include/linux/sched/signal.h | 1 <br /> include/uapi/linux/prctl.h | 10 ++++<br /> kernel/sys.c | 5 ++<br /> kernel/time/posix-timers.c | 99 +++++++++++++++++++++++++++++++------------<br /> 5 files changed, 91 insertions(+), 26 deletions(-)<br /><br />--- a/include/linux/posix-timers.h<br />+++ b/include/linux/posix-timers.h<br />@@ -114,6 +114,7 @@ bool posixtimer_init_sigqueue(struct sig<br /> void posixtimer_send_sigqueue(struct k_itimer *tmr);<br /> bool posixtimer_deliver_signal(struct kernel_siginfo *info, struct sigqueue *timer_sigq);<br /> void posixtimer_free_timer(struct k_itimer *timer);<br />+long posixtimer_create_prctl(unsigned long ctrl);<br /> <br /> /* Init task static initializer */<br /> #define INIT_CPU_TIMERBASE(b) { \<br />@@ -140,6 +141,7 @@ static inline void posixtimer_rearm_itim<br /> static inline bool posixtimer_deliver_signal(struct kernel_siginfo *info,<br /> struct sigqueue *timer_sigq) { return false; }<br /> static inline void posixtimer_free_timer(struct k_itimer *timer) { }<br />+static inline long posixtimer_create_prctl(unsigned long ctrl) { return -EINVAL; }<br /> #endif<br /> <br /> #ifdef CONFIG_POSIX_CPU_TIMERS_TASK_WORK<br />--- a/include/linux/sched/signal.h<br />+++ b/include/linux/sched/signal.h<br />@@ -136,6 +136,7 @@ struct signal_struct {<br /> #ifdef CONFIG_POSIX_TIMERS<br /> <br /> /* POSIX.1b Interval Timers */<br />+ unsigned int timer_create_restore_ids:1;<br /> atomic_t next_posix_timer_id;<br /> struct hlist_head posix_timers;<br /> struct hlist_head ignored_posix_timers;<br />--- a/include/uapi/linux/prctl.h<br />+++ b/include/uapi/linux/prctl.h<br />@@ 
-353,4 +353,14 @@ struct prctl_mm_map {<br /> */<br /> #define PR_LOCK_SHADOW_STACK_STATUS 76<br /> <br />+/*<br />+ * Controls the mode of timer_create() for CRIU restore operations.<br />+ * Enabling this allows CRIU to restore timers with explicit IDs.<br />+ *<br />+ * Don't use for normal operations as the result might be undefined.<br />+ */<br />+#define PR_TIMER_CREATE_RESTORE_IDS 77<br />+# define PR_TIMER_CREATE_RESTORE_IDS_OFF 0<br />+# define PR_TIMER_CREATE_RESTORE_IDS_ON 1<br />+<br /> #endif /* _LINUX_PRCTL_H */<br />--- a/kernel/sys.c<br />+++ b/kernel/sys.c<br />@@ -2811,6 +2811,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsi<br /> return -EINVAL;<br /> error = arch_lock_shadow_stack_status(me, arg2);<br /> break;<br />+ case PR_TIMER_CREATE_RESTORE_IDS:<br />+ if (arg3 || arg4 || arg5)<br />+ return -EINVAL;<br />+ error = posixtimer_create_prctl(arg2);<br />+ break;<br /> default:<br /> trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);<br /> error = -EINVAL;<br />--- a/kernel/time/posix-timers.c<br />+++ b/kernel/time/posix-timers.c<br />@@ -19,6 +19,7 @@<br /> #include <linux/nospec.h><br /> #include <linux/posix-clock.h><br /> #include <linux/posix-timers.h><br />+#include <linux/prctl.h><br /> #include <linux/sched/task.h><br /> #include <linux/slab.h><br /> #include <linux/syscalls.h><br />@@ -57,6 +58,8 @@ static const struct k_clock * const posi<br /> static const struct k_clock *clockid_to_kclock(const clockid_t id);<br /> static const struct k_clock clock_realtime, clock_monotonic;<br /> <br />+#define TIMER_ANY_ID INT_MIN<br />+<br /> /* SIGEV_THREAD_ID cannot share a bit with the other SIGEV values. 
*/<br /> #if SIGEV_THREAD_ID != (SIGEV_THREAD_ID & \<br /> ~(SIGEV_SIGNAL | SIGEV_NONE | SIGEV_THREAD))<br />@@ -128,38 +131,60 @@ static bool posix_timer_hashed(struct ti<br /> return false;<br /> }<br /> <br />-static int posix_timer_add(struct k_itimer *timer)<br />+static bool posix_timer_add_at(struct k_itimer *timer, struct signal_struct *sig, unsigned int id)<br />+{<br />+ struct timer_hash_bucket *bucket = hash_bucket(sig, id);<br />+<br />+ scoped_guard (spinlock, &bucket->lock) {<br />+ /*<br />+ * Validate under the lock as this could have raced against<br />+ * another thread ending up with the same ID, which is<br />+ * highly unlikely, but possible.<br />+ */<br />+ if (!posix_timer_hashed(bucket, sig, id)) {<br />+ /*<br />+ * Set the timer ID and the signal pointer to make<br />+ * it identifiable in the hash table. The signal<br />+ * pointer has bit 0 set to indicate that it is not<br />+ * yet fully initialized. posix_timer_hashed()<br />+ * masks this bit out, but the syscall lookup fails<br />+ * to match due to it being set. This guarantees<br />+ * that there can't be duplicate timer IDs handed<br />+ * out.<br />+ */<br />+ timer->it_id = (timer_t)id;<br />+ timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);<br />+ hlist_add_head_rcu(&timer->t_hash, &bucket->head);<br />+ return true;<br />+ }<br />+ }<br />+ return false;<br />+}<br />+<br />+static int posix_timer_add(struct k_itimer *timer, int req_id)<br /> {<br /> struct signal_struct *sig = current->signal;<br /> <br />+ if (unlikely(req_id != TIMER_ANY_ID)) {<br />+ if (!posix_timer_add_at(timer, sig, req_id))<br />+ return -EBUSY;<br />+<br />+ /*<br />+ * Move the ID counter past the requested ID, so that after<br />+ * switching back to normal mode the IDs are outside of the<br />+ * exact allocated region. 
That avoids ID collisions on the<br />+ * next regular timer_create() invocations.<br />+ */<br />+ atomic_set(&sig->next_posix_timer_id, req_id + 1);<br />+ return req_id;<br />+ }<br />+<br /> for (unsigned int cnt = 0; cnt <= INT_MAX; cnt++) {<br /> /* Get the next timer ID and clamp it to positive space */<br /> unsigned int id = atomic_fetch_inc(&sig->next_posix_timer_id) & INT_MAX;<br />- struct timer_hash_bucket *bucket = hash_bucket(sig, id);<br /> <br />- scoped_guard (spinlock, &bucket->lock) {<br />- /*<br />- * Validate under the lock as this could have raced<br />- * against another thread ending up with the same<br />- * ID, which is highly unlikely, but possible.<br />- */<br />- if (!posix_timer_hashed(bucket, sig, id)) {<br />- /*<br />- * Set the timer ID and the signal pointer to make<br />- * it identifiable in the hash table. The signal<br />- * pointer has bit 0 set to indicate that it is not<br />- * yet fully initialized. posix_timer_hashed()<br />- * masks this bit out, but the syscall lookup fails<br />- * to match due to it being set. 
This guarantees<br />- * that there can't be duplicate timer IDs handed<br />- * out.<br />- */<br />- timer->it_id = (timer_t)id;<br />- timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);<br />- hlist_add_head_rcu(&timer->t_hash, &bucket->head);<br />- return id;<br />- }<br />- }<br />+ if (posix_timer_add_at(timer, sig, id))<br />+ return id;<br /> cond_resched();<br /> }<br /> /* POSIX return code when no timer ID could be allocated */<br />@@ -364,6 +389,15 @@ static enum hrtimer_restart posix_timer_<br /> return HRTIMER_NORESTART;<br /> }<br /> <br />+long posixtimer_create_prctl(unsigned long ctrl)<br />+{<br />+ if (ctrl > PR_TIMER_CREATE_RESTORE_IDS_ON)<br />+ return -EINVAL;<br />+<br />+ current->signal->timer_create_restore_ids = ctrl == PR_TIMER_CREATE_RESTORE_IDS_ON;<br />+ return 0;<br />+}<br />+<br /> static struct pid *good_sigevent(sigevent_t * event)<br /> {<br /> struct pid *pid = task_tgid(current);<br />@@ -435,6 +469,7 @@ static int do_timer_create(clockid_t whi<br /> timer_t __user *created_timer_id)<br /> {<br /> const struct k_clock *kc = clockid_to_kclock(which_clock);<br />+ timer_t req_id = TIMER_ANY_ID;<br /> struct k_itimer *new_timer;<br /> int error, new_timer_id;<br /> <br />@@ -449,11 +484,20 @@ static int do_timer_create(clockid_t whi<br /> <br /> spin_lock_init(&new_timer->it_lock);<br /> <br />+ /* Special case for CRIU to restore timers with a given timer ID. */<br />+ if (unlikely(current->signal->timer_create_restore_ids)) {<br />+ if (copy_from_user(&req_id, created_timer_id, sizeof(req_id)))<br />+ return -EFAULT;<br />+ /* Valid IDs are 0..INT_MAX */<br />+ if ((unsigned int)req_id > INT_MAX)<br />+ return -EINVAL;<br />+ }<br />+<br /> /*<br /> * Add the timer to the hash table. 
The timer is not yet valid<br /> * after insertion, but has a unique ID allocated.<br /> */<br />- new_timer_id = posix_timer_add(new_timer);<br />+ new_timer_id = posix_timer_add(new_timer, req_id);<br /> if (new_timer_id < 0) {<br /> posixtimer_free_timer(new_timer);<br /> return new_timer_id;<br />@@ -1041,6 +1085,9 @@ void exit_itimers(struct task_struct *ts<br /> struct hlist_node *next;<br /> struct k_itimer *timer;<br /> <br />+ /* Clear restore mode for exec() */<br />+ tsk->signal->timer_create_restore_ids = 0;<br />+<br /> if (hlist_empty(&tsk->signal->posix_timers))<br /> return;<br /> <br /></pre></td><td width="32" rowspan="2" class="c" valign="top"><img src="/images/icornerr.gif" width="32" height="32" alt="\" /></td></tr><tr><td align="right" valign="bottom"> &nbsp; </td></tr><tr><td align="right" valign="bottom">&nbsp;</td><td class="c" valign="bottom" style="padding-bottom: 0px"><img src="/images/bcornerl.gif" width="32" height="32" alt="\" /></td><td class="c">&nbsp;</td><td class="c" valign="bottom" style="padding-bottom: 0px"><img src="/images/bcornerr.gif" width="32" height="32" alt="/" /></td></tr><tr><td align="right" valign="top" colspan="2"> &nbsp; </td><td class="lm">Last update: 2025-03-11 23:08 &nbsp;&nbsp; [W:0.901 / U:0.130 seconds]<br />&copy;2003-2020 <a href="http://blog.jasper.es/"><span itemprop="editor">Jasper Spaans</span></a>|hosted at <a href="https://www.digitalocean.com/?refcode=9a8e99d24cf9">Digital Ocean</a> and my Meterkast|<a href="http://blog.jasper.es/categories.html#lkml-ref">Read the blog</a></td><td>&nbsp;</td></tr></table><script language="javascript" src="/js/styleswitcher.js" type="text/javascript"></script></body></html>