05/11/2017 TAGS: condor idle ce / bloated
The diagnostics page says that condor is idle on the CE and '/' is bloated with 'core.*' files. Clearly some shenanigans occurred when I updated OSG. I fully restarted condor, but when I tried to run `condor_status`, it reported a communication error. After waiting a minute, it gave the regular list, but everything was listed as unclaimed. These "core" files seem to be generated whenever a job crashes. None of the configuration files in '/etc/condor' or '/etc/condor-ce' seem to have been modified by the update, although the directories have been touched. Perhaps files were deleted? Since OSG was updated, '/var/log/condor/MasterLog' reports that condor is unable to create a security session to the CE on port 9618 with TCP. Port 9618 only appears in the log when the connection fails; successful connections to it are never listed. That port number is listed in '/etc/condor-ce/config.d/50-osg-configure.conf' as the connection port for the 'JOB_ROUTER_SCHEDD2_POOL' variable. That file is said to be managed by 'osg-configure'.

05/15/2017 TAGS: NAS-1 almost full
NAS-1 is almost full, and Vallary needs to put stuff on it! I need to investigate these directories: g4hep, backup_g4hep, general_g4hep. 'g4hep/MTSAtFIT' is a primary offender (14TB of the directory's 15TB); there are some large files in there.
NOTE: `tree -ifhugD path/to/directory` is a very useful command for mapping the directory structure. I've made trees at '/mnt/nas1/g4hep/MTSAtFIT/tree.txt', '/mnt/nas1/backup_g4hep/tree.txt', and '/mnt/nas1/general_g4hep/treeTrim.txt'.

cont. 05/16/2017
Dr. Hohlmann has said I can safely delete anything with 'alignment' or 'empty' in its name. To see how much space will be freed from one of the three sections:
$ grep -iE 'alignment|empty' tree.txt | awk -F' ' '{print $3}' | grep G | sed 's/G//g' | paste -sd+ | bc

05/18/2017 TAGS: add group
I'm creating a new user group for Vallary and me: Analysis.

06/09/2017 TAGS: glideins down globus error
At the beginning of June, OSG said that our glideins were failing due to a globus error. When Daniel was helping me with Condor, we tried replacing my certificates with his in '~/.globus', which probably caused the errors. I have replaced his cert with my CERN cert. I've updated OSG.

cont. 06/12/2017
Elizabeth said to copy 'hostcert.pem' and 'hostkey.pem' from '/etc/grid-security' to '~/.globus'. I have done that, and I've restarted GUMS. She's been updated.

cont. 06/15/2017
I misunderstood Elizabeth; she was just making sure the hostcerts weren't expired or otherwise wonked.
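NOTE: a quick way to redo the copy and double-check that the host credentials in '~/.globus' aren't expired or otherwise wonked. The permission mode and the openssl check here are my usual habit, not something Elizabeth specified, so treat them as assumptions:
$ cp /etc/grid-security/hostcert.pem /etc/grid-security/hostkey.pem ~/.globus/
$ chmod 600 ~/.globus/hostkey.pem    # keep the private key locked down; 600 is an assumption
$ openssl x509 -noout -subject -dates -in ~/.globus/hostcert.pem    # prints notBefore/notAfter so I can see at a glance if it's expired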
06/13/2017 TAGS: batteries UPS check compute-1-7 not working
The warning light on the APC UPS had been quickly turning red after the routine tests, so I whipped out the batteries and took a multimeter to them. The batteries are rated at 12V, and the multimeter measured just over 13V for each of them, so I put them back. When I turned the cluster back on, though, compute-1-7 had trouble mounting the NFS filesystems. The little ethernet LEDs on the node are off, and the ethernet LED for the node on the router (port 15) is red. Once the node had booted up, the LEDs didn't change. It doesn't seem to have internet, either, which is to be expected.

cont. 06/14/2017
I found a manual for the 'HP ProCurve 2910al-24G' router. The blinking orange 'Fault' light means that "A fault has occurred on the switch, one of the switch ports, module in the rear of the switch, or the fan. The Status LED for the component with the fault will flash simultaneously." In this case, the LED for port 15 is blinking in synchronization with the 'Fault' LED. The 'Test' LED is also blinking along with the others, which means that one of the components has failed its self-test; so port 15 failed its test. The manual recommends power-cycling the router, so I'll do that tomorrow morning.

cont. 06/15/2017
I turned the cluster off and power-cycled the router (unplugged it), and it displayed no warning lights, so I turned the cluster back on, and all is well!

06/27/2017 TAGS: security update
Security update day! I'm yum-updating everything and rebooting the cluster. Everything booted up properly!

07/11/2017 TAGS: nas0 drive not-present
Drive 15 in NAS-0 suddenly became labeled as "NOT-PRESENT". I removed the drive and put it back in, and the drive is now rebuilding.

cont. 07/12/2017
Drive 15 has returned to the "NOT-PRESENT" state again, so I'm gonna try replacing the drive. I've replaced the drive, and it's rebuilding.

cont. 07/13/2017
The new drive has experienced a SMART failure, so I'm gonna replace it with the other spare drive. I've started the rebuild.

cont. 07/14/2017
The new drive rebuilt successfully, and everything is good.
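NOTE: a quick sanity check to run on a spare drive before (or after) swapping it in, assuming the disk is visible to the OS as a plain /dev/sdX device. The device name, and whether NAS-0's RAID controller needs a controller-specific '-d' option for smartctl, are assumptions I haven't verified:
$ smartctl -H /dev/sdX    # overall SMART health verdict (PASSED/FAILED)
$ smartctl -a /dev/sdX    # full SMART attributes, e.g. reallocated and pending sector counts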