05/15/2016 TAGS: cvmfs nodes SAM critical root full
The root partitions of compute-2-0 and compute-2-1 are full, which is preventing cvmfs from mounting, which in turn is causing the SAM tests to go critical. On compute-2-0, /scratch took up 13G of the 20G of available space. I have cleared the folder:
$ rm -rf /scratch/*
Similar situation on compute-2-1.
05/16/2016: SAM tests failing in OD

05/23/2016 TAGS: SAM critical compute-2-1
/scratch was still writing to /. I changed it to write to /var/cache/cvmfs as described above.

05/25/2016 TAGS: SAM critical compute-2-0
/scratch was still writing to /. I changed it to write to /var/cache/cvmfs as described above.

05/26/2016 TAGS: condor not running SAM 15 critical
The diagnostics page reports that condor is not running, but the page that monitors condor on all the nodes shows no issue. Running
$ condor_status
returns:
Error: communication error CEDAR:6001:Failed to connect to <163.118.42.1:9618>
The condor master had shut down because /var was running out of space, so /var must be cleaned up. /var/cache/yum takes up a lot of space; it can be cleaned up with:
$ yum clean expire-cache
NOTE: While 'yum clean all' is probably harmless, it has some potential to create problems, hence the use of the expire-cache argument.
Finally, turn condor back on by running:
$ condor_master
To check whether condor is actually running:
$ service condor status
The status returns:
condor_master dead but subsys locked
The problem was that there were several instances of condor running, but no run file. To fix it, kill all condor processes, then restart condor.
To kill:
$ pkill -9 condor
OR
$ kill -9 $(pgrep condor)
To restart:
$ condor_restart

06/01/2016 TAGS: SAM 13 14 15 critical condor not running
$ service condor status
returns
condor_master dead but subsys locked
I tried the solution from above:
(*) kill all condor processes
$ pkill -9 condor
(*) turn the master on
$ condor_master
NOTE: if this step is skipped, condor_restart won't work (it will not be able to find the master)
(*) restart condor
$ condor_restart
but it didn't work. After I tried the previous solution, 'ps aux | grep condor' showed several instances of 'condor_shadow' running. I ran 'pkill -9 condor' to kill them all, then 'condor_master' and 'condor_restart'. 'condor_status' now works just fine, and the nodes appear to be receiving jobs and working on them (their slots are being filled). 'service condor status', however, still returns the error from before (condor_master dead but subsys locked). There are still very many condor_shadow processes running (over 100). The condor_shadow processes might correspond to the jobs running on the nodes:
$ ps aux | grep condor_shadow | wc -l; condor_status | grep Busy | wc -l
returns two similar numbers.
cont. 06/04/2016
The SAM tests are now green; the condor tests seem to have corrected themselves. 'service condor status' is still not working, though. The 'condor_master dead but subsys locked' error is the result of a discrepancy between the .pid file in /var/run and a file in /var/lock/subsys. For example, if the (empty) lockfile /var/lock/subsys/crond exists, then the first line of /var/run/crond.pid is expected to contain the PID of the process. I tried deleting /var/lock/subsys/condor to see what would happen. I ran 'pkill -9 condor' to kill condor, then 'condor_master' and 'condor_restart' to restart condor. 'service condor status' then returned 'condor_master dead but pid file exists'. A quick check for this mismatch is sketched below.
cont.
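A quick way to check for the lockfile/PID mismatch described above. This is only a sketch; the PID-file location is an assumption (check /etc/init.d/condor for the path it actually uses):
$ ls -l /var/lock/subsys/condor (*) the empty lockfile the init script creates
$ cat /var/run/condor/condor_master.pid (*) assumed PID-file path; should contain the condor_master PID
$ ps -p $(cat /var/run/condor/condor_master.pid) (*) confirm that PID is really a running condor_master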
06/10/2016
$ service condor status
returns:
condor_master is stopped
I restarted condor again.
cont. 06/11/2016
Jobs are running on the nodes; condor_status shows jobs running and the SAM tests are green. 'condor_master is stopped' persists. I restarted condor using 'pkill, condor_master, condor_restart'. Condor is now running.
The proper way to restart condor is:
$ pkill -9 condor (*) to kill all condor processes
$ service condor stop (*) to officially turn off condor
$ service condor start (*) to officially turn condor back on
$ condor_restart (*) to restart condor itself

06/25/2016 TAGS: SAM 12 critical SE
SAM test 12 has been going critical on a regular basis for the past few months; it has easily been "fixed" by simply restarting the SE. It's time we fix the issue! The SAM error report says that the file failed to copy because of "connection reset by peer", meaning the connection was abruptly cut. What happens every 10 days or so that causes the connection to be cut? I am restarting the SE to fix the problem and see what happens.

06/26/2016 TAGS: NAS0 drive failed ECC-ERROR
Ankit alerted us to a failed drive in NAS-0. I will replace it the next time I am in Melbourne.
cont. 07/15/2016
I followed the instructions mentioned previously, and all seems to be well.

06/29/2016 TAGS: dCache dccp security update yum update not working
A security alert was released concerning dCache, so we have to update it. A yum update would do the trick, but it's not working!
PYCURL ERROR 22
A possible fix is to append "http_caching=packages" to /etc/yum.conf. That made something more happen, but it didn't quite fix the issue. The error is being caused by the 'scl' repository. I don't know what's wrong with that repo yet, but for now I've run 'yum --disablerepo=scl update' to update everything except scl. I will investigate scl later. The yum update broke GUMS; I'm working on bringing it back up. The antlr symlinks broke again; I followed the previously detailed instructions on how to fix them.

07/15/2016 TAGS: sam 12 critical bestman2
Everything done in the SE: SAM 12 has been critical for some time, and the usual restart of the SE won't fix the problem; something else is wrong. It went critical after the 'yum update' I did earlier. The SRM service referred to on the Metric Page is bestman2, so SAM 12 has to do with bestman2. The most recent bestman2 log repeatedly reports:
java.lang.reflect.InvocationTargetException
Maybe it's a certificate issue? I copied /etc/grid-security/hostcert.pem and /etc/grid-security/hostkey.pem into /etc/grid-security/bestman as bestmancert.pem and bestmankey.pem. I then changed their ownership to bestman:
$ chown bestman:bestman bestmancert.pem
$ chown bestman:bestman bestmankey.pem
hostcert.pem and hostkey.pem may be outdated.
cont. 08/29/2016
I am officially a GridAdmin! I can now accept my own requests for a new hostcert! The command to request a new OSG host cert on the SE is:
$ osg-cert-request -t uscms1-se.fltech-grid3.fit.edu -e [your email] -n [your name] -p [phone number] -v cms -m [comment] -o hostkey.pem
I approved my own request and am retrieving the hostcert:
$ osg-cert-retrieve [ID]
The new hostcert.pem and hostkey.pem are stored in ~/hostcertStuff.
cont. 09/01/2016
I have copied the new hostcert.pem and hostkey.pem from ~/hostcertStuff into /etc/grid-security. I then copied them into /etc/grid-security/bestman and renamed them bestmancert.pem and bestmankey.pem, changed the ownership of both files to bestman, and restarted the SE. (A condensed command sketch follows below.)
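For reference, a condensed sketch of the bestman certificate installation steps from the entry above (the chmod modes are an assumption about what bestman2 expects, not something verified here):
$ cp /etc/grid-security/hostcert.pem /etc/grid-security/bestman/bestmancert.pem
$ cp /etc/grid-security/hostkey.pem /etc/grid-security/bestman/bestmankey.pem
$ chown bestman:bestman /etc/grid-security/bestman/bestmancert.pem /etc/grid-security/bestman/bestmankey.pem
$ chmod 644 /etc/grid-security/bestman/bestmancert.pem (*) assumed: cert world-readable
$ chmod 400 /etc/grid-security/bestman/bestmankey.pem (*) assumed: key readable by bestman only
Then restart the SE (or just the bestman2 service) so it picks up the new credentials.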
Nothing seems to have happened, so I am going to perform a yum update. yum update complete! The yum update didn't do anything to help this SAM; a report on this yum update is written further below.
cont. 09/06/2016
I will restart the SE to see if I get fresh error messages.
cont. 09/13/2016
I stumbled upon some previous articles in the log (10/19/15) that deal with SAM 12. They mentioned that after an SE restart, some services did not start:
$ service gratia-xrootd-transfer start
$ service gratia-xrootd-storage start
$ service globus-gridftp-server start
The article also described how to see if file transfers are succeeding. I tried to follow the directions, but it uses Daniel's certificate and I don't know the GRID passphrase for it. I'm going to try to replace his certificate with my own.
cont. 09/19/2016
I replaced Daniel's certificates in /etc/grid-security on both the CE and SE with my own. I restarted the SE and tried again, but it still thought I was Daniel. I replaced the certificates in SE:/root/.globus and now it knows it's me. I ran the testing commands mentioned in the (10/19/15) article:
$ grid-proxy-init
$ touch /tmp/test
$ srm-copy file:////tmp/test srm://uscms1-se.fltech-grid3.fit.edu:8443/srm/v2/server?SFN=/mnt/nas1/store/temp/test_2
I received some errors, which means that file transfer does not work, which is why the SAM test is critical. The srm-copy command is complaining about an SRM_AUTHORIZATION_FAILURE. Perhaps my certificate is not properly authorized?
cont. 09/26/2016
Both the SE and the SAM status reports say that they failed due to an authorization failure. Since it did not work under both Daniel's certificate and mine, perhaps the SE is not configured properly?
cont. 10/01/2016
Ankit is here to save the day!
$ srm-ping [...]
bestman2 log
The CE is still using Daniel's old certificates; I have to replace them with my own. Check the website files! They are very helpful!
cont. 10/03/2016
I am finding out where the GUMS certificates are stored on the CE.
cont. 10/13/2016
/var/log/tomcat6/catalina.2016-10-13.log reports problems locating the antlr file.
cont. 10/27/2016
It says the globus_ftp_client returned an error. tomcat6 was stopped on the SE; I started it. I also started gratia-xrootd-transfer and gratia-xrootd-storage.
cont. 03/13/2017
The SAM metric states that there was an authorization failure because "the name of the remote entity is uscms1-se.fltech-grid3.fit.edu and the expected name is uscms1.fltech-grid3.fit.edu". There is a misconfiguration somewhere that is telling whatever SAM 12 uses to look for the CE when it's trying to connect to the SE. I grepped /etc on the SE for any files that contain "uscms1.fltech-grid3.fit.edu" (a sketch of the search is below). The following change was made, on the SE, to /etc/idmapd.conf:
- Domain = uscms1.fltech-grid3.fit.edu
+ Domain = uscms1-se.fltech-grid3.fit.edu
I then restarted the SE. The SAM ran again, and nothing changed. In the first line of /etc/grid-security/bestman/bestmancert.pem I changed 'uscms1-se' to 'uscms1', because the SAM test says it's looking for 'uscms1' but it's getting 'uscms1-se'. The old certificate also had 'uscms1' rather than 'uscms1-se'. I restarted it again.
cont. 03/22/2017
I reversed the changes made to /etc/grid-security/bestman/bestmancert.pem and /etc/idmapd.conf.
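For reference, a sketch of the kind of search used to hunt down the stray hostname (plain GNU grep; -r recursive, -I skip binary files, -l list only matching filenames):
$ grep -rIl "uscms1.fltech-grid3.fit.edu" /etc
$ grep ^Domain /etc/idmapd.conf (*) confirm which domain idmapd is currently set to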