05/11/2017 TAGS: condor idle ce / bloated
The diagnostics page says that condor is idle on the CE and '/' is bloated with 'core.*' files. Clearly some shenanigans occurred when I updated OSG. I fully restarted condor, but when I tried to run `condor_status`, it reported a communication error. After waiting a minute, it gave the regular list, but everything was listed as unclaimed. These "core" files seem to be generated whenever a job crashes. None of the configuration files in '/etc/condor' or '/etc/condor-ce' seem to have been modified by the update, although the directories have been touched. Perhaps files were deleted? Since OSG was updated, '/var/log/condor/MasterLog' reports that condor is unable to create a security session to the CE on port 9618 with TCP. Port 9618 only appears in the log when the connection fails; successful connections to it are never listed. That port number is listed in '/etc/condor-ce/config.d/50-osg-configure.conf' as the connection port for the 'JOB_ROUTER_SCHEDD2_POOL' variable. That file is said to be managed by 'osg-configure'.

05/15/2017 TAGS: NAS-1 almost full
NAS-1 is almost full, and Vallary needs to put stuff on it! I need to investigate these directories: g4hep, backup_g4hep, general_g4hep. 'g4hep/MTSAtFIT' is a primary offender (14TB of the directory's 15TB); there are some large files in there.
NOTE: `tree -ifhugD path/to/directory` is a very useful command for mapping the directory structure. I've made trees at '/mnt/nas1/g4hep/MTSAtFIT/tree.txt', '/mnt/nas1/backup_g4hep/tree.txt', and '/mnt/nas1/general_g4hep/treeTrim.txt'.

cont. 05/16/2017
Dr. Hohlmann has said I can safely delete anything with 'alignment' or 'empty' in its name. To see how much space will be freed from one of the three sections:
$ grep -iE 'alignment|empty' tree.txt | awk -F' ' '{print $3}' | grep G | sed 's/G//g' | paste -sd+ | bc

05/18/2017 TAGS: add group
I'm creating a new user group for Vallary and me: Analysis.

06/09/2017 TAGS: glideins down globus error
At the beginning of June, OSG said that our glideins were failing due to a globus error. When Daniel was helping me with Condor, we tried replacing my certificates with his in '~/.globus', which probably caused the errors. I have replaced his cert with my CERN cert. I've updated OSG.

cont. 06/12/2017
Elizabeth said to copy 'hostcert.pem' and 'hostkey.pem' from '/etc/grid-security' to '~/.globus'. I have done that, and I've restarted GUMS. She's been updated.

cont. 06/15/2017
I misunderstood Elizabeth; she was just making sure the hostcerts weren't expired or otherwise wonked.
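NOTE: a quick way to redo the copy and double-check that the host credentials in '~/.globus' aren't expired or otherwise wonked. The permission mode and the openssl check here are my usual habit, not something Elizabeth specified, so treat them as assumptions:
$ cp /etc/grid-security/hostcert.pem /etc/grid-security/hostkey.pem ~/.globus/
$ chmod 600 ~/.globus/hostkey.pem    # keep the private key locked down; 600 is an assumption
$ openssl x509 -noout -subject -dates -in ~/.globus/hostcert.pem    # prints notBefore/notAfter so I can see at a glance if it's expired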
06/13/2017 TAGS: batteries UPS check compute-1-7 not working
The warning light on the APC UPS had been quickly turning red after the routine tests, so I whipped out the batteries and took a multimeter to them. The batteries are rated at 12V, and the multimeter measured just over 13V for each of them, so I put them back. When I turned the cluster back on, though, compute-1-7 had trouble mounting the NFS filesystems. The little ethernet LEDs on the node are off, and the ethernet LED for the node on the router (port 15) is red. Once the node had booted up, the LEDs didn't change. It doesn't seem to have internet, either, which is to be expected.

cont. 06/14/2017
I found a manual for the 'HP ProCurve 2910al-24G' router. The blinking orange 'Fault' light means that "A fault has occurred on the switch, one of the switch ports, module in the rear of the switch, or the fan. The Status LED for the component with the fault will flash simultaneously." In this case, the LED for port 15 is blinking in synchronization with the 'Fault' LED. The 'Test' LED is also blinking along with the others, which means that one of the components has failed its self-test; so port 15 failed its test. The manual recommends power-cycling the router, so I'll do that tomorrow morning.

cont. 06/15/2017
I turned the cluster off and power-cycled the router (unplugged it), and it displayed no warning lights, so I turned the cluster back on, and all is well!

06/27/2017 TAGS: security update
Security update day! I'm yum-updating everything and rebooting the cluster. Everything booted up properly!

07/11/2017 TAGS: nas0 drive not-present
Drive 15 in NAS-0 suddenly became labeled as "NOT-PRESENT". I removed the drive and put it back in, and the drive is now rebuilding.

cont. 07/12/2017
Drive 15 has returned to the "NOT-PRESENT" state again, so I'm gonna try replacing the drive. I've replaced the drive, and it's rebuilding.

cont. 07/13/2017
The new drive has experienced a SMART failure, so I'm gonna replace it with the other spare drive. I've started the rebuild.

cont. 07/14/2017
The new drive rebuilt successfully, and everything is good.
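NOTE: a quick sanity check to run on a spare drive before (or after) swapping it in, assuming the disk is visible to the OS as a plain /dev/sdX device. The device name, and whether NAS-0's RAID controller needs a controller-specific '-d' option for smartctl, are assumptions I haven't verified:
$ smartctl -H /dev/sdX    # overall SMART health verdict (PASSED/FAILED)
$ smartctl -a /dev/sdX    # full SMART attributes, e.g. reallocated and pending sector counts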