Friday, December 2, 2016

Error polling HP CQ with status WORK REQUEST FLUSHED ERROR status on LSF Platform

I was encountering "Error polling HP CQ with status WORK REQUEST FLUSHED ERROR status" during OpenMPI runs, and it was occurring randomly.

I suspected a node issue, so I checked the LSF log /opt/lsf/log/sbatchd.log.comp001. It was definitely an authentication issue with AD (I'm using Centrify), because sbatchd could not resolve the job's user:

acctMapTo: No valid user name found for job 149044, userName(mr_x) failed:Success
runEexec: getOSUid_() failed. Bad user ID
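
If you want to check other nodes for the same symptom, a small helper along these lines may be handy. It is only a sketch: the function name find_uid_errors is made up for illustration, and the two patterns are simply the message prefixes from the log excerpt above.

```shell
# Print sbatchd log lines that point at a failed user lookup.
# find_uid_errors is a hypothetical helper name; the patterns are taken
# from the two log messages quoted in this post.
find_uid_errors() {
    grep -E 'acctMapTo: No valid user name found|getOSUid_\(\) failed' "$1"
}

# Example, using the log path from this post:
# find_uid_errors /opt/lsf/log/sbatchd.log.comp001
```

If the function prints nothing for a node, these particular messages are not in that node's log, and the cause is probably elsewhere.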


I closed the node so that it would not accept new jobs:
$ badmin hclose comp001

and then restarted the Centrify services. Alternatively, you can reboot the node if you want a clean start.
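
The whole sequence could be sketched roughly as below. Several things here are my assumptions and not from the steps above: that the Centrify agent runs as a service named centrifydc, that getent passwd is a good enough check that AD users resolve again, and that you reopen the node with badmin hopen afterwards. The function name recover_node and the RUN variable are made up for illustration. By default the function only prints the commands.

```shell
# Sketch of the recovery steps for one node, meant to be run on that node.
# Assumptions: the service name centrifydc, the getent check, and the final
# badmin hopen are not from the original steps; adjust them for your site.
recover_node() {
    host="$1"            # LSF host name, e.g. comp001
    user="$2"            # an AD account to test, e.g. mr_x
    run="${RUN-echo}"    # dry run by default; call with RUN= to execute

    $run badmin hclose "$host" &&          # stop new jobs landing on the node
    $run service centrifydc restart &&     # restart the Centrify agent (assumed name)
    $run getent passwd "$user" &&          # confirm the user resolves again
    $run badmin hopen "$host"              # accept jobs again
}
```

Running recover_node comp001 mr_x just lists the four commands; RUN= recover_node comp001 mr_x would actually execute them, stopping at the first one that fails so the node stays closed if the user still cannot be resolved.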

After that, the OpenMPI job could run again.
