I found the bug in May, 2001. Linuxthreads is integrated in glibc-2 (generally in Linux-2.0 and above). This bug does _not_ affect most threads programs in Linux. However, if you are using both sem_wait() and pthread_cond_wait(), then you might want to look at this.
The bug was found when using my TOP-C/C++ for easy parallelization of new or previously existing sequential code. The bug can be fixed by writing your own routines for sem_wait(), sem_post(), etc. If you are one of the few who experience this bug, you are welcome to use .../src/sem-pthread.h in the TOP-C/C++ distribution, which directly implements semaphores on top of other pthread primitives.
I believe I have found a simple bug in the semaphore implementation of linuxthreads-2.0.6. Specifically, it is in semaphore.c:sem_wait(). I also checked linuxthreads-2.2.2, and the same bug appears to be present in oldsemaphore.c:__old_sem_wait(). The bug does not affect the new sem_wait of linuxthreads-2.2.2. Even in the old code, the bug appears _only_ in certain cases when the application mixes calls to semaphores with calls to mutexes and condition variables. If semaphore are used in isolation or if semaphores are not used, then the bug does not appear. I will describe the bug with reference to linuxthreads-2.0.6. If you just want the bug fix, you'll find my proposed bug fix at the end of this message. I've confirmed the bug by enabling the ASSERT in linuxthreads:queue.h, and re-compiling. The bug in my application is then always shadowed by a triggering of assert. I first discovered this bug using my TOP-C/C++ system for writing parallel applications for both distributed and shared memory (http://www.ccs.neu.edu/home/gene/topc.html). TOP-C/C++ seems to have good test coverage for finding some of these pthreads bugs. :-) - Gene Cooperman gene@ccs.neu.edu
The logic of semaphore.c indicates that sem->sem_status is used to EITHER hold the head of a queue of threads waiting on sem, or ELSE hold a semaphore count, stored as value ((long)count << 1) + 1. In the event that sem->sem_status holds the head of a queue of waiting threads, the forward links from a thread th is given as th->p_nextwaiting. The last thread in the waiting queue has: th->p_nextwaiting == (pthread_descr)0x1 It is the job of sem_wait() to set: th->p_nextwaiting = 0x1 before suspending the thread. It is the job of sem_restart_list() to set: th->p_nextwaiting = NULL before restarting the thread. The bug occurs in the case that the semaphore count is 0, which is indicated by: sem->sem_status == 1 and sem_post(sem) and sem_wait(sem) are called at approximately the same time. The following race condition can occur. The invocation of sem_wait(sem) causes the local variable oldstatus to receive the value 1 (corresponding to sem->sem_status). Assume that sem_post(sem) then executes in its entirety, causing sem->sem_status to be incremented to the value 3. Then sem_wait(sem) continues executing. The result is that sem_wait(sem) will follow the "else" branch and set: self->p_nextwaiting = (pthread_descr) oldstatus; where oldstatus is 1. The following call to sem_compare_and_swap() will then return 0, and one iterates again through the "do" loop. On the second iteration, oldstatus is set to 3 (the new value of sem->sem_status), the "if" branch is followed, newstatus is set to oldstatus - 2 (or to 1), and sem_compare_and_swap() returns 1. So, sem_wait(sem) exits with a return value of 1, as it should. But now self->p_nextwaiting has the value of 1 instead of NULL, while the thread is not suspended. If that thread later calls pthread_cond_wait(cond, mutex) and is suspended, the condition variable, cond, will have this thread, th, on its wait queue, and the value of th->p_nextwaiting will be 1. The code for enqueue(), dequeue(), etc. assumes that the value of th->p_nextwaiting is always either NULL or a valid thread descriptor, and so the code breaks upon encountering the value 1.
For a bug fix, I propose to restore the correct value of NULL
for self->p_nextwaiting before sem_wait(sem) returns.
I propose changing a code fragment in sem_wait()
FROM:
if (newstatus & 1)
/* We got the semaphore. */
return 0;
TO:
if (newstatus & 1) {
/* We got the semaphore. */
self->p_nextwaiting = NULL;
return 0;
}
From drepper@redhat.com Fri May 25 03:40:03 2001 X-Authentication-Warning: otr.mynet: drepper set sender to drepper@redhat.com us ing -f To: Gene CoopermanCc: bug-glibc@gnu.org Subject: Re: bug in linuxthreads: sem_wait/pthread_cond_wait interaction X-fingerprint: BE 3B 21 04 BC 77 AC F0 61 92 E4 CB AC DD B9 5A X-fingerprint: e6:49:07:36:9a:0d:b7:ba:b5:e9:06:f3:e7:e7:08:4a From: Ulrich Drepper Date: 25 May 2001 00:38:55 -0700 User-Agent: Gnus/5.0807 (Gnus v5.8.7) XEmacs/21.2 (Thelxepeia) MIME-Version: 1.0 Gene Cooperman writes: > I believe I have found a simple bug in the semaphore implementation > of linuxthreads-2.0.6. Specifically, it is in semaphore.c:sem_wait()_ > [...] I think it's correct. I've applied the patch now. Thanks, -- ---------------. ,-. 1325 Chesapeake Terrace Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com `------------------------