I'm now running a simulation code called CMAQ on a remote cluster. I first ran a benchmark test in serial to see the performance of the software. However, the job always runs for dozens of hours and then crashes with the following "Stale file handle, errno=116" error message:

> An error has occurred processing your job, see below.
> Post job file processing error; job on host hs012/0
> Unknown resource type REJHOST=hs012.cluster
> MSG=invalid home directory '/home/shangxin' specified, errno=116 (Stale file handle)

What does this message mean specifically? This is very strange, because I never modify my home directory, and "/home/shangxin/" is surely my permanent directory, where the code is.

I first thought this error was a memory-overflow issue, i.e. that the job was consuming all of the RAM. However, when I logged into the compute node during a run and checked memory usage with the "free -m" and "htop" commands, I noticed that both RAM and swap occupation never exceeded 10%, a very low level, so memory usage is not the problem.

Also, regarding the standard output: because I used "tee" to record the run to a log file, that file can contain tens of thousands of lines and grow beyond 1 MB. To test whether this standard output was overwhelming the cluster's filesystem, I ran the same job again without the standard-output log file. The new job still failed with the same "Stale file handle, errno=116" error after dozens of hours, so the standard output is not the reason either.

I can be sure the code itself has no problem, because it finishes successfully on other clusters. I also tried running the job in parallel with multiple cores; it still failed with the same error after dozens of hours of running.

Has anyone ever run into this weird error? What should we do to fix this problem on the cluster? The administrator of this cluster is looking into the issue but also cannot find the specific reason for now. Any help is appreciated!

On academic clusters, home directories are frequently mounted via NFS on each node in the cluster to give you a uniform experience across all of the nodes. If this were not the case, each node would have its own version of your home directory, and you would have to take explicit action to copy relevant files between worker nodes and/or the login node. It sounds like the NFS mount of your home directory on the worker node failed while your job was running. This isn't a problem you can fix directly unless you have administrative privileges on the cluster.

If you need a work-around and cannot wait for the sysadmins to address the problem, try using a different network drive on the worker node (if one is available). On clusters I've worked on, there is often scratch space or other storage mounted via NFS directly under the root /. You might get lucky and find an NFS mount that is more reliable than your home directory.
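The ESTALE condition behind errno=116 can be checked directly from a shell on the suspect worker node: `stat()` on a path served by a stale NFS handle fails, while a healthy mount answers normally. A minimal sketch, assuming a Linux node with standard coreutils (nothing here is specific to this cluster):

```shell
#!/bin/sh
# Show which filesystem serves the home directory (type nfs/nfs4 and the
# serving host appear in the output on an NFS-mounted home).
df -hT "$HOME"

# stat() on a stale NFS handle fails with ESTALE (errno 116 on Linux),
# which is exactly the code in the scheduler's error message.
if stat "$HOME" >/dev/null 2>&1; then
    echo "home directory mount looks healthy"
else
    echo "cannot stat $HOME -- possibly a stale NFS file handle" >&2
fi
```

Running this inside the failing job (or from a shell on the node right after a crash) would confirm whether the home mount, rather than the CMAQ code, is what disappears mid-run.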
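The staging work-around — run entirely out of scratch space and touch the home directory only briefly at the start and end — can be sketched as below. The scratch location, input directory, and run-script name are assumptions for illustration, not details from the original post; substitute whatever your cluster actually provides:

```shell
#!/bin/sh
set -eu

# Hypothetical staging sketch: keep the long-running job off the flaky
# home mount. SCRATCH defaults to /tmp here only so the sketch runs;
# on a real cluster it would be the site's scratch filesystem.
SCRATCH="${SCRATCH:-/tmp}"
USER="${USER:-$(id -un)}"
WORKDIR="$SCRATCH/$USER/cmaq_run"
mkdir -p "$WORKDIR"

# 1. Stage inputs out of $HOME once, while the mount is still healthy.
#    ("cmaq_inputs" is a placeholder for the real input directory.)
# cp -r "$HOME/cmaq_inputs/." "$WORKDIR/"

# 2. Run entirely inside scratch, logging locally instead of tee-ing
#    the output back to a file under $HOME.
cd "$WORKDIR"
# ./run_cctm.sh > run.log 2>&1

# 3. Copy results back to $HOME only at the end, as one short burst of
#    home-directory I/O rather than dozens of hours of it.
# cp -r "$WORKDIR/output" "$HOME/cmaq_results"

echo "staged work directory: $WORKDIR"
```

The design point is simply to shrink the window during which the job depends on the NFS home mount: a mount that drops out for a few seconds over a multi-day run is fatal if every write goes through it, but harmless if the job only reads from it at startup and writes to it at teardown.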