View Issue Details

IDProjectCategoryView StatusLast Update
0000142tcshGeneralpublic2020-02-19 14:06
Reporterbitstreamout Assigned Tochristos  
PrioritynormalSeveritymajorReproducibilitysometimes
Status assignedResolutionopen 
Platformx86_64OSlinux 
Product Version6.22.02 
Summary0000142: Dot locking sometimes does block after reboot
DescriptionDuring reboot it happens sometimes that one of my open tcsh sesssions does not get enough time to write out history and got killed by systemd before the dot lock file had been removed. It would be a a solution to combine the dot locking method with fcntl(F_SETLKW/F_UNLCK) as this does work over NFS at least for Linux. This also would allow to avoid polling due waiting on F_SETLKW ...
Steps To ReproduceOpen several tcsh sessions, do some work therein, and reboot
TagsNo tags attached.

Activities

christos

2020-02-16 16:15

manager   ~0003371

Perhaps it is better to give up with a message to the user instead of getting stuck... I don't want to add more complexity to the locking. The point of dotlock is to not use the fcntl() based in the first place which is even less reliable over NFS...

bitstreamout

2020-02-17 10:05

reporter   ~0003374

Hmmm ... on Linux based systems itz is possible to determine the process which holds the file descriptor. But this requires to parse /proc/ ... on the other hand most modern Linux systems have a tmpfs based /dev/shm worls writable and sticky directory which not only tmpfs but also a local file system. Besdide this : I had used some years a patch set from redhat with fcntl() based locking which never had shown problems over NFS ... only the porting to the next tcsh version had become complicated and fragile with increasing version number.

christos

2020-02-17 17:53

manager   ~0003375

Last edited: 2020-02-17 17:54

Yes, but remember tcsh runs on other OS's including windows... And this is a corner case that does not usually happen. Doesn't systemd have a way to increase the grace time it waits before killing the processes?

bitstreamout

2020-02-18 08:06

reporter  

tcsh-6.22.02-local-dotlock.dif (1,861 bytes)   
Avoid left over dot lock file after reboot

---
 dotlock.c |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

--- dotlock.c
+++ dotlock.c	2020-02-17 11:16:22.785018224 +0000
@@ -30,8 +30,38 @@
 #define O_SYNC	0
 #endif
 
+#if defined(__linux__)
+#  include <sys/statfs.h>
+#  include <unistd.h>
+#  ifndef  TMPFS_MAGIC
+#    define TMPFS_MAGIC 0x01021994
+#  endif
+#endif
+
 #include "dotlock.h"
 
+#if defined(__linux__)
+static char *sys_tmpdir;
+static int
+dosys_tmpdir ()
+{
+	static char *shm = "/dev/shm";
+	struct statfs fs;
+	static int doshm;
+
+	if (doshm)
+		return (sys_tmpdir != NULL);
+
+	doshm++;
+
+	if (statfs(shm, &fs) < 0 || fs.f_type != TMPFS_MAGIC || eaccess(shm, W_OK|X_OK))
+		return 0;
+
+	sys_tmpdir = shm;
+	return 1;
+}
+#endif
+
 static int create_exclusive(const char *);
 /*
  * Create a unique file. O_EXCL does not really work over NFS so we follow
@@ -140,7 +170,17 @@ dot_lock(const char *fname, int pollinte
 	(void)sigaddset(&nset, SIGTSTP);
 	(void)sigaddset(&nset, SIGCHLD);
 
+#if defined(__linux__)
+	const char *ptr;
+	if ((ptr = strrchr(fname, '/')) && dosys_tmpdir()) {
+		ptr++;
+		fname = ptr;
+		(void)snprintf(path, sizeof(path), "%s/%s.lock", sys_tmpdir, fname);
+	} else
+		(void)snprintf(path, sizeof(path), "%s.lock", fname);
+#else
 	(void)snprintf(path, sizeof(path), "%s.lock", fname);
+#endif
 
 	retval = -1;
 	for (;;) {
@@ -174,6 +214,16 @@ dot_unlock(const char *fname)
 {
 	char path[MAXPATHLEN];
 
+#if defined(__linux__)
+	const char *ptr;
+	if ((ptr = strrchr(fname, '/')) && dosys_tmpdir()) {
+		ptr++;
+		fname = ptr;
+		(void)snprintf(path, sizeof(path), "%s/%s.lock", sys_tmpdir, fname);
+	} else
+		(void)snprintf(path, sizeof(path), "%s.lock", fname);
+#else
 	(void)snprintf(path, sizeof(path), "%s.lock", fname);
+#endif
 	(void)unlink(path);
 }
tcsh-6.22.02-local-dotlock.dif (1,861 bytes)   

bitstreamout

2020-02-18 08:06

reporter   ~0003376

Not really as the most tcsh process do poll by sleeping within usleep() ... using F_SETLKW would avoid this but requires interrupt handling. Currently I use a workaround, which is use /dev/shm as this file system is always fresh because it is tmpfs based

christos

2020-02-18 20:20

manager   ~0003377

What happens when the home directory is mounted over NFS and shared between different hosts? How will the other systems see the tmpfs lock? Let's try to treat the cause (systemd killing tcsh too quickly) instead of the symptoms (tcsh leaving a stray lock file around because it got kill -9'ed).

bitstreamout

2020-02-19 07:54

reporter   ~0003379

Good point, personal I use different histories for different hosts. The fcntl() method had worked here over years without any problems, that means I had never seen any problems nor had got any bug reports. Maybe this is also Linux specific ... only rebooted (linux based) NFS servers may cause that a lock gets lost. For this a local tcsh has to write out its history at the same time as the NFS server is rebooted

bitstreamout

2020-02-19 10:05

reporter   ~0003380

in create_exclusive(): How about open the fname after link(path, fname) and then unlink(fname)? As long as the process hold the file descriptor the dot lock file i That means that if the process gets kill-9'ed the file will disapear with the last open file descriptor
tcsh-6.22.02-local-dotlock-2.dif (2,594 bytes)   
Avoid left over dot lock file after reboot

---
 dotlock.c |   20 +++++++++++---------
 dotlock.h |    2 +-
 sh.hist.c |   11 +++++++----
 3 files changed, 19 insertions(+), 14 deletions(-)

--- dotlock.c
+++ dotlock.c	2020-02-19 10:00:31.999490600 +0000
@@ -76,7 +76,7 @@ create_exclusive(const char *fname)
 	 * We try to create the unique filename.
 	 */
 	for (ntries = 0; ntries < 5; ntries++) {
-		fd = open(path, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL|O_SYNC, 0);
+		fd = open(path, O_WRONLY|O_CREAT|O_TRUNC|O_EXCL|O_SYNC, S_IWUSR);
 		if (fd != -1) {
 			(void)close(fd);
 			break;
@@ -110,8 +110,14 @@ create_exclusive(const char *fname)
 		errno = EEXIST;
 		return -1;
 	}
-	return 0;
 
+	fd = open(fname, O_WRONLY|O_EXCL|O_SYNC, 0);
+	if (fd < 0) {
+		errno = EEXIST;
+		return -1;
+	}
+	(void)unlink(fname);
+	return fd;
 bad:
 	serrno = errno;
 	(void)unlink(path);
@@ -146,9 +152,8 @@ dot_lock(const char *fname, int pollinte
 	for (;;) {
 		handle_pending_signals();
 		(void)sigprocmask(SIG_BLOCK, &nset, &oset);
-		if (create_exclusive(path) != -1) {
+		if ((retval = create_exclusive(path)) != -1) {
 			(void)sigprocmask(SIG_SETMASK, &oset, NULL);
-			retval = 0;
 			break;
 		}
 		else
@@ -170,10 +175,7 @@ dot_lock(const char *fname, int pollinte
 }
 
 void
-dot_unlock(const char *fname)
+dot_unlock(const int fd)
 {
-	char path[MAXPATHLEN];
-
-	(void)snprintf(path, sizeof(path), "%s.lock", fname);
-	(void)unlink(path);
+	close(fd);
 }
--- dotlock.h
+++ dotlock.h	2020-02-19 09:40:16.034422159 +0000
@@ -30,6 +30,6 @@
  * pollinterval -- Interval (miliseconds) to check for lock, -1 return
  */
 int dot_lock(const char *fname, int pollinterval);
-void dot_unlock(const char *fname);
+void dot_unlock(const int fd);
 
 #endif /* #ifndef _DOTLOCK_H_ */
--- sh.hist.c
+++ sh.hist.c	2020-02-19 09:48:57.640589111 +0000
@@ -1209,9 +1209,11 @@ fmthist(int fmt, ptr_t ptr)
 }
 
 static void
-dotlock_cleanup(void* lockpath)
+dotlock_cleanup(void* xfdp)
 {
-	dot_unlock((char*)lockpath);
+	int *fdp;
+	fdp = xfdp;
+	dot_unlock(*fdp);
 }
 
 /* Save history before exiting the shell. */
@@ -1284,11 +1286,12 @@ rechist(Char *fname, int ref)
 	    jmp_buf_t osetexit;
 	    if (lock) {
 #ifndef WINNT_NATIVE
+		int fdlk;
 		char *lockpath = strsave(short2str(fname));
 		cleanup_push(lockpath, xfree);
 		/* Poll in 100 miliseconds interval to obtain the lock. */
-		if ((dot_lock(lockpath, 100) == 0))
-		    cleanup_push(lockpath, dotlock_cleanup);
+		if ((fdlk = dot_lock(lockpath, 100)) != -1)
+		    cleanup_push(&fdlk, dotlock_cleanup);
 #endif
 	    }
 	    getexit(osetexit);

christos

2020-02-19 13:43

manager   ~0003381

Then locking does not work. Another process can do the same (on the same or a different machine). This is the whole idea behind dot locking: that you can only touch the file once you've established you are the owner (by creating the link and making sure that the link count is 2).

bitstreamout

2020-02-19 14:00

reporter   ~0003382

Hmm ... here it seems to work at least on the same machine

```
openat(AT_FDCWD, "/suse/werner/.tcsh/.noether.d574", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_SYNC, 0200) = 3
close(3) = 0
link("/suse/werner/.tcsh/.noether.d574", "/suse/werner/.tcsh/werner@noether.lock") = -1 EEXIST (File exists)
unlink("/suse/werner/.tcsh/.noether.d574") = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({tv_sec=0, tv_nsec=100000000}, NULL) = 0
rt_sigprocmask(SIG_BLOCK, [HUP INT QUIT TERM CHLD TSTP TTIN TTOU], [], 8) = 0
uname({sysname="Linux", nodename="noether", ...}) = 0
getpid() = 11676
openat(AT_FDCWD, "/suse/werner/.tcsh/.noether.2be3e", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_SYNC, 0200) = 3
close(3) = 0
link("/suse/werner/.tcsh/.noether.2be3e", "/suse/werner/.tcsh/werner@noether.lock") = 0
openat(AT_FDCWD, "/suse/werner/.tcsh/werner@noether.lock", O_WRONLY|O_EXCL|O_SYNC) = 3
unlink("/suse/werner/.tcsh/werner@noether.lock") = 0
stat("/suse/werner/.tcsh/.noether.2be3e", {st_mode=S_IFREG|0200, st_size=0, ...}) = 0
unlink("/suse/werner/.tcsh/.noether.2be3e") = 0
```

```
openat(AT_FDCWD, "/suse/werner/.tcsh/.noether.dc5c", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_SYNC, 0200) = 3
close(3) = 0
link("/suse/werner/.tcsh/.noether.dc5c", "/suse/werner/.tcsh/werner@noether.lock") = 0
openat(AT_FDCWD, "/suse/werner/.tcsh/werner@noether.lock", O_WRONLY|O_EXCL|O_SYNC) = 3
unlink("/suse/werner/.tcsh/werner@noether.lock") = 0
stat("/suse/werner/.tcsh/.noether.dc5c", {st_mode=S_IFREG|0200, st_size=0, ...}) = 0
unlink("/suse/werner/.tcsh/.noether.dc5c") = 0
```

christos

2020-02-19 14:06

manager   ~0003383

The point is that at the time you remove the lock there could be another process (on a different machine perhaps) who owns the lock and is currently writing the file. The lock is a sign that the file is busy. If you remove it indiscriminately there is no point in creating it in the first place.

Issue History

Date Modified Username Field Change
2020-02-14 14:32 bitstreamout New Issue
2020-02-16 16:13 christos Assigned To => christos
2020-02-16 16:13 christos Status new => assigned
2020-02-16 16:15 christos Status assigned => feedback
2020-02-16 16:15 christos Note Added: 0003371
2020-02-17 10:05 bitstreamout Note Added: 0003374
2020-02-17 10:05 bitstreamout Status feedback => assigned
2020-02-17 17:53 christos Note Added: 0003375
2020-02-17 17:54 christos Note Edited: 0003375
2020-02-18 08:06 bitstreamout File Added: tcsh-6.22.02-local-dotlock.dif
2020-02-18 08:06 bitstreamout Note Added: 0003376
2020-02-18 20:20 christos Note Added: 0003377
2020-02-19 07:54 bitstreamout Note Added: 0003379
2020-02-19 10:05 bitstreamout File Added: tcsh-6.22.02-local-dotlock-2.dif
2020-02-19 10:05 bitstreamout Note Added: 0003380
2020-02-19 13:43 christos Note Added: 0003381
2020-02-19 14:00 bitstreamout Note Added: 0003382
2020-02-19 14:06 christos Note Added: 0003383