05/15/2016 TAGS: cvmfs nodes SAM critical root full
The root partitions of compute-2-0 and compute-2-1 are full, which is preventing cvmfs from mounting, which in turn is causing the SAM tests to go critical. On compute-2-0, /scratch took up 13G of the 20G of available space. I have cleared the folder:
$ rm -rf /scratch/*
Similar situation on compute-2-1.
05/16/2016: SAM tests failing in OD

05/23/2016 TAGS: SAM critical compute-2-1
/scratch was still writing to /. I changed it to write to /var/cache/cvmfs as described above.

05/25/2016 TAGS: SAM critical compute-2-0
/scratch was still writing to /. I changed it to write to /var/cache/cvmfs as described above.

05/26/2016 TAGS: condor not running SAM 15 critical
The diagnostics page reports that condor is not running, but the page that monitors condor on all the nodes shows no issue. Running
$ condor_status
returns:
Error: communication error CEDAR:6001:Failed to connect to <163.118.42.1:9618>
The condor master had shut down because /var was running out of space, so /var must be cleaned up. /var/cache/yum takes up a lot of space; it can be cleaned up with:
$ yum clean expire-cache
NOTE: While 'yum clean all' is probably harmless, it has some potential to create problems, hence the use of the expire-cache argument.
Finally, turn condor back on by running:
$ condor_master
To check whether condor is actually running:
$ service condor status
The status returns:
condor_master dead but subsys locked
The problem was that there were several instances of condor running, but no run file. To fix it, kill all condor processes, then restart condor.
To kill:
$ pkill -9 condor
OR
$ kill -9 $(pgrep condor)
To restart:
$ condor_restart

06/01/2016 TAGS: SAM 13 14 15 critical condor not running
$ service condor status
returns
condor_master dead but subsys locked
I tried the solution from above:
(*) kill all condor processes
$ pkill -9 condor
(*) turn the master on
$ condor_master
NOTE: if this step is skipped, condor_restart won't work (it will not be able to find the master)
(*) restart condor
$ condor_restart
but it didn't work. After I tried the previous solution, 'ps aux | grep condor' showed several instances of 'condor_shadow' running. I ran 'pkill -9 condor' to kill them all, then 'condor_master' and 'condor_restart'. 'condor_status' now works just fine, and the nodes appear to be receiving jobs and working on them (their slots are being filled). 'service condor status', however, still returns the error from before (condor_master dead but subsys locked). There are still very many condor_shadow processes running (over 100). The condor_shadow processes might correspond to the jobs running on the nodes:
$ ps aux | grep condor_shadow | wc -l; condor_status | grep Busy | wc -l
returns two similar numbers.
cont. 06/04/2016
The SAM tests are now green; the condor tests seem to have corrected themselves. 'service condor status' is still not working, though. The 'condor_master dead but subsys locked' error is the result of a discrepancy between the .pid file in /var/run and a file in /var/lock/subsys. For example, if the (empty) lockfile /var/lock/subsys/crond exists, then the first line of /var/run/crond.pid is expected to contain the PID of the process. I tried deleting /var/lock/subsys/condor to see what would happen. I ran 'pkill -9 condor' to kill condor, then 'condor_master' and 'condor_restart' to restart condor. 'service condor status' then returned 'condor_master dead but pid file exists'. A quick check for this mismatch is sketched below.
cont.
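A quick way to check for the lockfile/PID mismatch described above. This is only a sketch; the PID-file location is an assumption (check /etc/init.d/condor for the path it actually uses):
$ ls -l /var/lock/subsys/condor (*) the empty lockfile the init script creates
$ cat /var/run/condor/condor_master.pid (*) assumed PID-file path; should contain the condor_master PID
$ ps -p $(cat /var/run/condor/condor_master.pid) (*) confirm that PID is really a running condor_master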
06/10/2016
$ service condor status
returns:
condor_master is stopped
I restarted condor again.
cont. 06/11/2016
Jobs are running on the nodes; condor_status shows jobs running and the SAM tests are green. 'condor_master is stopped' persists. I restarted condor using 'pkill, condor_master, condor_restart'. Condor is now running.
The proper way to restart condor is:
$ pkill -9 condor (*) to kill all condor processes
$ service condor stop (*) to officially turn off condor
$ service condor start (*) to officially turn condor back on
$ condor_restart (*) to restart condor itself

06/25/2016 TAGS: SAM 12 critical SE
SAM test 12 has been going critical on a regular basis for the past few months; it has easily been "fixed" by simply restarting the SE. It's time we fix the issue! The SAM error report says that the file failed to copy because of "connection reset by peer", meaning the connection was abruptly cut. What happens every 10 days or so that causes the connection to be cut? I am restarting the SE to fix the problem and see what happens.

06/26/2016 TAGS: NAS0 drive failed ECC-ERROR
Ankit alerted us to a failed drive in NAS-0. I will replace it the next time I am in Melbourne.
cont. 07/15/2016
I followed the instructions mentioned previously, and all seems to be well.

06/29/2016 TAGS: dCache dccp security update yum update not working
A security alert was released concerning dCache, so we have to update it. A yum update would do the trick, but it's not working!
PYCURL ERROR 22
A possible fix is to append "http_caching=packages" to /etc/yum.conf. That made something more happen, but it didn't quite fix the issue. The error is being caused by the 'scl' repository. I don't know what's wrong with that repo yet, but for now I've run 'yum --disablerepo=scl update' to update everything except scl. I will investigate scl later. The yum update broke GUMS; I'm working on bringing it back up. The antlr symlinks broke again; I followed the previously detailed instructions on how to fix them.

07/15/2016 TAGS: sam 12 critical bestman2
Everything done in the SE: SAM 12 has been critical for some time, and the usual restart of the SE won't fix the problem; something else is wrong. It went critical after the 'yum update' I did earlier. The SRM service referred to on the Metric Page is bestman2, so SAM 12 has to do with bestman2. The most recent bestman2 log repeatedly reports:
java.lang.reflect.InvocationTargetException
Maybe it's a certificate issue? I copied /etc/grid-security/hostcert.pem and /etc/grid-security/hostkey.pem into /etc/grid-security/bestman as bestmancert.pem and bestmankey.pem. I then changed their ownership to bestman:
$ chown bestman:bestman bestmancert.pem
$ chown bestman:bestman bestmankey.pem
hostcert.pem and hostkey.pem may be outdated.
cont. 08/29/2016
I am officially a GridAdmin! I can now accept my own requests for a new hostcert! The command to request a new OSG host cert on the SE is:
$ osg-cert-request -t uscms1-se.fltech-grid3.fit.edu -e [your email] -n [your name] -p [phone number] -v cms -m [comment] -o hostkey.pem
I approved my own request and am retrieving the hostcert:
$ osg-cert-retrieve [ID]
The new hostcert.pem and hostkey.pem are stored in ~/hostcertStuff.
cont. 09/01/2016
I have copied the new hostcert.pem and hostkey.pem from ~/hostcertStuff into /etc/grid-security. I then copied them into /etc/grid-security/bestman and renamed them bestmancert.pem and bestmankey.pem, changed the ownership of both files to bestman, and restarted the SE. (A condensed command sketch follows below.)
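For reference, a condensed sketch of the bestman certificate installation steps from the entry above (the chmod modes are an assumption about what bestman2 expects, not something verified here):
$ cp /etc/grid-security/hostcert.pem /etc/grid-security/bestman/bestmancert.pem
$ cp /etc/grid-security/hostkey.pem /etc/grid-security/bestman/bestmankey.pem
$ chown bestman:bestman /etc/grid-security/bestman/bestmancert.pem /etc/grid-security/bestman/bestmankey.pem
$ chmod 644 /etc/grid-security/bestman/bestmancert.pem (*) assumed: cert world-readable
$ chmod 400 /etc/grid-security/bestman/bestmankey.pem (*) assumed: key readable by bestman only
Then restart the SE (or just the bestman2 service) so it picks up the new credentials.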
Nothing seems to have happened, so I am going to perform a yum update. yum update complete! The yum update didn't do anything to help this SAM; a report on this yum update is written further below.
cont. 09/06/2016
I will restart the SE to see if I get fresh error messages.
cont. 09/13/2016
I stumbled upon some previous articles in the log (10/19/15) that deal with SAM 12. They mentioned that after an SE restart, some services did not start:
$ service gratia-xrootd-transfer start
$ service gratia-xrootd-storage start
$ service globus-gridftp-server start
The article also described how to see if file transfers are succeeding. I tried to follow the directions, but it uses Daniel's certificate and I don't know the GRID passphrase for it. I'm going to try to replace his certificate with my own.
cont. 09/19/2016
I replaced Daniel's certificates in /etc/grid-security on both the CE and SE with my own. I restarted the SE and tried again, but it still thought I was Daniel. I replaced the certificates in SE:/root/.globus and now it knows it's me. I ran the testing commands mentioned in the (10/19/15) article:
$ grid-proxy-init
$ touch /tmp/test
$ srm-copy file:////tmp/test srm://uscms1-se.fltech-grid3.fit.edu:8443/srm/v2/server?SFN=/mnt/nas1/store/temp/test_2
I received some errors, which means that file transfer does not work, which is why the SAM test is critical. The srm-copy command is complaining about an SRM_AUTHORIZATION_FAILURE. Perhaps my certificate is not properly authorized?
cont. 09/26/2016
Both the SE and the SAM status reports say that they failed due to an authorization failure. Since it did not work under both Daniel's certificate and mine, perhaps the SE is not configured properly?
cont. 10/01/2016
Ankit is here to save the day!
$ srm-ping [...]
bestman2 log
The CE is still using Daniel's old certificates; I have to replace them with my own. Check the website files! They are very helpful!
cont. 10/03/2016
I am finding out where the GUMS certificates are stored on the CE.
cont. 10/13/2016
/var/log/tomcat6/catalina.2016-10-13.log reports problems locating the antlr file.
cont. 10/27/2016
It says the globus_ftp_client returned an error. tomcat6 was stopped on the SE; I started it. I also started gratia-xrootd-transfer and gratia-xrootd-storage.
cont. 03/13/2017
The SAM metric states that there was an authorization failure because "the name of the remote entity is uscms1-se.fltech-grid3.fit.edu and the expected name is uscms1.fltech-grid3.fit.edu". There is a misconfiguration somewhere that is telling whatever SAM 12 uses to look for the CE when it's trying to connect to the SE. I grepped /etc on the SE for any files that contain "uscms1.fltech-grid3.fit.edu" (a sketch of the search is below). The following change was made, on the SE, to /etc/idmapd.conf:
- Domain = uscms1.fltech-grid3.fit.edu
+ Domain = uscms1-se.fltech-grid3.fit.edu
I then restarted the SE. The SAM ran again, and nothing changed. In the first line of /etc/grid-security/bestman/bestmancert.pem I changed 'uscms1-se' to 'uscms1', because the SAM test says it's looking for 'uscms1' but it's getting 'uscms1-se'. The old certificate also had 'uscms1' rather than 'uscms1-se'. I restarted it again.
cont. 03/22/2017
I reversed the changes made to /etc/grid-security/bestman/bestmancert.pem and /etc/idmapd.conf.
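For reference, a sketch of the kind of search used to hunt down the stray hostname (plain GNU grep; -r recursive, -I skip binary files, -l list only matching filenames):
$ grep -rIl "uscms1.fltech-grid3.fit.edu" /etc
$ grep ^Domain /etc/idmapd.conf (*) confirm which domain idmapd is currently set to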