01/04/2016 (Daniel) TAGS: NAS1 NAS2 PHEDEX SITECONF GIT COMMIT STORAGE.XML
The backup of NAS1's main data is complete. This is enough to run PhEDEx again. I changed the storage.xml file (on the SE, /home/phedex/SITECONF/T3_US_FIT/PhEDEx/storage.xml) to update the location from nas2 to nas1.
To commit the change, use:
$ git add storage.xml
$ git commit storage.xml -m "Some commit message"
While changing the git user name and email, I accidentally ran git commit without specifying storage.xml. This committed quite a few files, including DBParam. I removed DBParam by first backing it up, then running:
$ git rm -f DBParam
and then returning the copy to /home/phedex/SITECONF/T3_US_FIT/PhEDEx/
I then started the PhEDEx agents, which seemed successful.

01/05/2016 (Ryan) TAGS: sam 11 12 critical se
The test is complaining that the file
/mnt/nas1/store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0013/52D06725-4BAE-E111-A059-001D09F252DA.root
does not exist. Upon investigation, however, the file does exist on the CE. The metrics page states that a "cmsRun" command was run and could not find the file in question. Upon trying to run "cmsRun", however, both the CE and compute-2-3 reported that the command was not found. Test 12 has now gone critical as well.

01/08/2016 (Daniel) TAGS: DIAGNOSTICS NAS DF
The diagnostic page now shows both percent and size for the usage statistics.

01/13/2016 (Eric, Ryan) TAGS: nas1 rocks cluster
We are adding nas1 to the cluster with rocks.
1) Install rocks on the nas
I unmounted nas1 from both the CE and the SE using plain umount, and I unmounted nas2 from nas1 using the same method. We then plugged in an external disc drive with a rocks 6.1.1 Jumbo DVD disc inside and restarted nas1. nas1 not only did not boot straight into the disc, it failed to recognize the disc at all! Upon further investigation, it was discovered that the disc drive was at fault, not the disc. A bootable USB of rocks is required.

01/14/2016 (Ankit, Ryan) TAGS: nas2 nas1 sam critical redirection
While nas1 is down for its addition to the cluster, the sam tests that go to it must be redirected to nas2 to avoid the tests becoming critical.
1) delete the old siteconf directory from the CE
$ rm -rf ~/siteconf/
2) authorize the download:
$ kinit -A -f cernusername@CERN.CH
3) download the siteconf directory:
$ git clone https://:@git.cern.ch/kerberos/siteconf
4) make appropriate changes to ~/siteconf/T3_US_FIT/Phedex/storage.xml
   change the directory paths to the desired one
   NOTE: make use of the replace-string command in emacs
5) implement changes
$ git add storage.xml
$ git commit -m "NAS1 to NAS2"
$ git push origin master
6) repeat step 4 on the SE (the siteconf directory will be in caps)
   NOTE: user must be phedex for steps 6-8
7) restart the PhEDEx agents
   NOTE: user must be in the phedex home directory while executing the commands in steps 7 and 8
$ PHEDEX/Utilities/Master -config ~/SITECONF/T3_US_FIT/PhEDEx/Config.Debug start
8) after about 30 minutes, stop the agents
$ PHEDEX/Utilities/Master -config ~/SITECONF/T3_US_FIT/PhEDEx/Config.Debug stop

01/22/2016 (Ryan) TAGS: CE full
The CE is full of data! I am finding the largest files and investigating them (see the sketch below).
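A minimal sketch for locating the largest directories and files on the CE; the paths scanned here (/ and /home) and the 500M size cutoff are assumptions, and the flags may need adjusting for this system's coreutils version:

# largest first-level directories under /, staying on this filesystem
$ du -xk --max-depth=1 / 2>/dev/null | sort -rn | head -n 20

# largest individual files under /home, sizes in MB
$ find /home -xdev -type f -size +500M -exec du -m {} \; 2>/dev/null | sort -rn | head -n 20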
02/01/2016 (Ankit, Ryan) TAGS: sam test 12 critical
SAM test 12 has gone critical again! The quickest way to fix it is to restart the SE, but a more permanent solution is needed.

02/06/2016 (Ryan) TAGS: nodes local condor nas2
I turned off NAS-2 and turned on the other 10 nodes. Five of those nodes I am reserving for local users (Stefano) by setting START = LOCAL in /etc/condor/config.d/00personal_condor.config on compute-2-5 to compute-2-9. After changing the value on a node, run:
$ condor_reconfig
to apply the change, then run:
$ condor_config_val -v START
to verify that the change has been made.

02/09/2016 (Ryan) TAGS: nodes down du unmount mtab
After NAS-2 was removed, the "du" command hung on the nodes. This was because NAS-2 was never properly unmounted from the nodes. To fix the issue, the entry for NAS-2 must be manually removed from /etc/mtab on all of the nodes.

02/09/2016 (Ryan) TAGS: cleaning
I cleaned up some of the dust on the outside of the cluster with compressed air, swiffer wipes, and the static-free wipes on the desk.

02/12/2016 (Ryan) TAGS: condor jobs priority
A local user (Stefano) would like to pause all currently running jobs, use as many CPUs as possible to quickly run his jobs, then return the nodes to the cluster when his jobs are complete.
NOTES:
(*) condor_q is used to view all jobs
(*) condor_prio can be used to change job priority
(*) ganglia cli monitors several metrics for nodes (including CPU load)
(*) condor_suspend can be used to pause jobs on the CPU -- the job still occupies the slot and is still consuming RAM, but it is not consuming CPU cycles
(*) condor_status can be used to view the status of each CPU on each node
(*) $ condor_config_val -v START can be used to directly view the value of START
QUESTIONS:
(*) Can a new job run on a CPU where there is a paused job?
(*) Can jobs be paused (not killed, not waited on to completion)? [A] yes, with condor_suspend
(*) Is there a way to monitor how many CPUs are idle? [A] yes, with ganglia
SOLUTION:
(*) Stefano queues his jobs, then a script executes:
$ condor_suspend -constraint 'Owner =!= "SRSUser"'
When the jobs are done (can be detected with condor_status), run
$ condor_continue -all
to resume the jobs.
-- this assumes that a new job can run on a CPU where one is paused and that the paused job can be resumed once the new job is complete
[P] Even though a job is suspended, the CPU is still labeled as "Claimed", and no jobs will run on it. A sketch of such a wrapper script follows this entry.
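A minimal sketch of the wrapper described above, built only from the commands already listed (condor_suspend, condor_status, condor_continue); the owner name "SRSUser" comes from the entry, while the polling interval and the way "jobs are done" is detected are assumptions, not a tested implementation:

#!/bin/bash
# suspend everything not owned by the local user, wait for his jobs to drain,
# then resume the suspended jobs
LOCAL_OWNER="SRSUser"   # owner name taken from the entry above

# pause every job that does not belong to the local user
condor_suspend -constraint "Owner =!= \"$LOCAL_OWNER\""

# crude wait loop: poll condor_status until no claimed slot still belongs to the local user
while condor_status -claimed | grep -q "$LOCAL_OWNER"; do
    sleep 300
done

# resume everything that was suspended
condor_continue -all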
02/15/2016 (Ryan) TAGS: rsv warning certificates
The certificate RSV tests went into warning. The cause of the failures is imminent certificate expiration. To install the host certificate, follow the instructions on:
https://twiki.opensciencegrid.org/bin/view/Documentation/Release3/GetHostServiceCertificates#Request_a_Host_Certificate
Once you have approved the certificate and downloaded your copy, transfer it to the cluster. The certificates must be placed in these directories with their respective owners:
CE:
/etc/grid-security/hostkey(hostcert).pem - root
/etc/grid-security/rsv/rsvkey(rsvcert).pem - rsv
/etc/grid-security/http/httpkey(httpcert).pem - tomcat
SE:
/etc/grid-security/hostkey(hostcert).pem - root
/etc/grid-security/bestman/bestmankey(bestmancert).pem - bestman
NODES:
/etc/grid-security/hostkey(hostcert).pem - root
The certificates must also be given their proper names. The services (rsv, globus (service name: tomcat6), bestman2 (on the SE)) must also be restarted.
NOTE: GUMS is the main certificate software; certificate problems are often in GUMS. Change DC (DigiCert) in GUMS. Delete the RSV mapping and map to the DN of the rsv cert:
/etc/grid-security rsv/uscms1... uscms1...

02/15/2016 (Ryan) TAGS: sam 12 critical
SAM test 12 has gone critical (again), and I made it green by restarting the SE.

02/22/2016 (Ryan) TAGS: date configuration
DATE is installed and must be configured on the cluster. I have been provided with Michael Staib's powerpoint on DATE. The /date directory mentioned in the powerpoint is located at /mnt/nas1/test_install/opt/date
The scripts
/mnt/nas1/test_install/opt/date/runControl/do_start_dim.sh
/mnt/nas1/test_install/opt/date/setup.sh
must be modified to contain the correct paths (replace /date with /mnt/nas1/test_install/opt/date).
event.h is an important file used to compile many of the .c files in the /mnt/nas1/test_install/opt/date/db directory, and it is located at /mnt/nas1/test_install/opt/date/commonDefs/event.h
investigate: /mnt/nas1/test_install/opt/date/runControl/do_start_dim.sh

02/24/2016 (Ankit, Ryan) TAGS: iozone default values
Find default values for iozone.

02/29/2016 (Ankit, Ryan) TAGS: GUMS administrator adding .pem .p12
To add a new GUMS administrator:
1. Copy the new admin's OSG certificate (.p12) to the cluster.
2. Convert the .p12 file to a .pem file.
   (openssl pkcs12 -in path.p12 -out newfile.crt.pem -clcerts -nokeys)
3. Determine the DN of the new admin.
   (openssl x509 -in usercert.pem -subject -issuer -dates -noout)
   copy the subject= line
4. Run the add admin command.
   (gums-add-mysql-admin '')
5. Restart tomcat6 (the gums service).
   (service tomcat6 restart)

02/29/2016 (Ryan) TAGS: nas0 nas-0-0 degraded restart rebuild
nas0 had been in a degraded state for about a month. Physical drive 2 was "not-present" and drive 10 was experiencing a "SMART-failure". Upon a restart of the system today, however, drive 10 is now rebuilding, although drive 2 is still "not-present".
To check rebuild status:
$ tw_cli /c0/u0 show rebuildstatus
The rebuild is stuck at 93%. The current solution is to upgrade the firmware. Drive 8 is now listed as "ECC-ERROR" and drive 12 has a "SMART-FAILURE".
The backup of nas0 to nas1 has been started:
$ nohup rsync -av --append /mnt/nas0/home /mnt/nas1/nas0-bak-20160304 &
I found some instructions (originally for Debian, not yet optimized):
1. add: deb http://jonas.genannt.name/debian lenny restricted to /etc/
2. import key with: wget -O - http://jonas.genannt.name/debian/jonas_genannt.pub | apt-key add -
TO RESTART NAS-0:
1. Unmount nas0 from everything.
   $ umount -l /mnt/nas0
   $ umount -l /home
2. Restart nas0.
3. Remount nas0 on everything.
   $ mount /mnt/nas0
   $ service autofs restart

03/01/2016 (Ankit, Ryan) TAGS: glexec critical compute-1-1 SAM
The glexec SAM test is critical on compute-1-1. The fetch-crl cron job was not running. This was discovered by checking the /var/log/cron file and searching for "fetch-crl"; it was absent on compute-1-1 but present on compute-1-2.
To see fetch-crl status:
$ /etc/init.d/fetch-crl-cron status
To restart fetch-crl:
$ /etc/init.d/fetch-crl-cron restart
A quick way to check this across all of the nodes is sketched below.
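A minimal sketch, assuming passwordless ssh from the CE to the compute nodes and the node names used elsewhere in this log (compute-1-0 through compute-2-9):

# how many fetch-crl lines are in each node's cron log? (zero suggests the cron job is not running there)
$ for n in compute-1-{0..9} compute-2-{0..9}; do echo -n "$n: "; ssh $n 'grep -c fetch-crl /var/log/cron' 2>/dev/null; done

# check the fetch-crl-cron service itself on a suspect node
$ ssh compute-1-1 '/etc/init.d/fetch-crl-cron status'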
03/14/2016 (Ankit, Ryan) TAGS: drive swap nas0
Drive p10 has failed and must be replaced. First the drive must be removed from the RAID:
$ tw_cli maint remove c0 p
The drive can then be removed and replaced.
NOTE: a screwdriver is required to remove the drive housing
Once the new drive is in place, the RAID card must be rescanned:
$ tw_cli /c0 rescan
The new drive should start rebuilding. If it does not start automatically, the rebuild process can be manually started with:
$ tw_cli maint rebuild c0 u0 p
NOTE: c0: card number, u0: RAID number
Use
$ tw_cli /c0 show alarms
$ tw_cli /c0/u0 show rebuildstatus
to monitor progress.

03/17/2016 (Ryan) TAGS: compute-2-0 df -h hanging sam failure test nfs
$ df -h hangs when executed on compute-2-0. Ankit says SAM tests are also failing on it, and he mentioned that it could be an NFS issue.

03/17/2016 (Ryan) TAGS: nas0 nas-0 drive 12 SMART failure
On March 15, drive 12 exceeded the SMART threshold, so it must be replaced.

03/18/2016 (Ankit, Ryan) TAGS: compute file system issue sam tests fail df
Some SAM tests were failing on compute-1-0 and compute-2-0. The SAM test was reporting that it could not access some files, and df -h did not work on compute-2-0. For compute-2-0, NAS-2 was still listed in the mtab file, so we removed it. For compute-1-0, NAS-1 was not mounted.
NOTE: For filesystem issues, check /etc/fstab and /etc/mtab
df -h should be monitored to check for future issues.

03/21/2016 (Ryan) TAGS: sam test 1 critical 4 warning glexec compute-1-1
SAM TEST 1: The glexec sam test (SAM 1) has failed for compute-1-1. The fetch-crl is absent from the /var/log/cron file, as before, so I restarted the process using the command previously mentioned. When I tried to run
$ /usr/sbin/glexec
to get the payload uid (the command mentioned by the SAM test), it said:
[gLExec]: environment variable $GLEXEC_CLIENT_CERT is empty.
The SAM test appears to be accessing a non-existent file for $GLEXEC_CLIENT_CERT:
/var/lib/condor/execute/dir_16220/nagios/probes/org.cms.glexec/testjob/tests/payloadproxy
SAM TEST 4: SAM test 4 has gone into warning alongside the critical SAM test 1. It reports that "SIGTERM has been caught" on compute-1-1.
Both SAM tests appear to have been fixed by the simple restart.

03/21/2016 (Ryan) TAGS: partition compute-1-1 nodes
I changed the START value of compute-1-1 to PART to reserve it for partitioning testing.
/etc/condor/config.d/00personal_condor.config

03/24/2016 (Ryan) TAGS: glexec compute-1-1 osg jobs fail
glexec is not working on compute-1-1. When the diagnostic command
$ voms-proxy-init -voms cms:/cms
is run as user amohapatra, it does not connect. Because the first diagnostic command will not work, none of the others will work either. The errors say that there is an SSL handshake error between compute-1-1 and the two cms servers it tries to connect to; contacting them reportedly fails due to an SSL handshake error. It appears that the handshake fails due to outdated certificates: it says that the CRL has expired. The problem was that fetch-crl was not running. Check /var/log/cron for fetch-crl. Use
$ fetch-crl
to run fetch-crl. The automatic running of fetch-crl seems to not be working properly.

03/28/2016 (Ryan) TAGS: xrootd /etc/xrootd/xrootd-clustered.cfg TFC (Trivial File Catalog) ~/siteconf/T3_US_FIT/PhEDEx/storage.xml

03/30/2016 (Ryan) TAGS: repartition compute-1-1
I began the test repartitioning of compute-1-1. I followed the first page of instructions here to shrink the original /scratch partition:
http://www.htmlgraphic.com/how-to-resize-partition-without-data-loss/
I followed these instructions on how to make /var its own partition:
http://unix.stackexchange.com/questions/131311/moving-var-home-to-separate-partition
IMPORTANT: Make sure the UUIDs of the partitions are input correctly into /etc/fstab. The correct UUIDs can be found by running
$ blkid
or
$ ls -l /dev/disk/by-uuid
An illustrative fstab entry is shown below.
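A minimal sketch of what the /etc/fstab line for the new /var partition might look like; the UUID shown is a placeholder to be replaced with the value blkid reports, the device (/dev/sda3) and filesystem type (ext3) are taken from the surrounding entries, and the mount options are generic defaults:

# hypothetical entry for /var on /dev/sda3
UUID=1234abcd-0000-0000-0000-56789abcdef0  /var  ext3  defaults  1 2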
NOTE: Toward the end of the second instruction set, make sure the new /var partition is writeable!
(1) Unmount the /var partition:
$ umount /dev/sdaX
(2) Mount the partition with proper permissions:
$ mount /dev/sdaX /var -t ext3
In case the boot partition is also unwriteable:
$ mount -o remount,rw /
Test to make sure the /var partition is writeable by trying to touch a file in it.
NOTE: the nodes operate on run level 3

04/06/2016 (Ryan) TAGS: repartition nodes
I changed the START values of compute-1-2 and compute-1-3 to PART two days ago to let the jobs currently running on them die. When running:
$ resize2fs /dev/sda3 20000M
this error is printed:
resize2fs: New size smaller than minimum (37039928)
That error was not present on compute-1-1, but it is present on compute-1-2 and compute-1-3. I went forward with the resizing on compute-1-2 and I resized the partition to an abnormally high value (~130G). compute-1-2 is experiencing an issue with:
$ tune2fs -j /dev/sda3
It reports that it "Could not allocate block in ext2 filesystem while trying to create journal file". I deleted the /dev/sda3 partition with fdisk.
NOTE: before deleting a partition, remove it from /etc/fstab
I made a new partition that uses the area on disk also occupied by /scratch. I cleared /scratch. I then followed the set of directions for giving /var its own partition. I tried changing the ext3 tag for /dev/sda3 in /etc/fstab to ext2.
$ fsck -n /dev/sda3
did its normal 5-step check rather than report that /dev/sda3 is clean, like it normally does at this step. A new error was reported when
$ tune2fs -j /dev/sda3
was run. The new error:
tune2fs: No space left on device while trying to create journal file
The rather large minimum size of the partitions may be due to the large amount of data stored in the /scratch partition I'm shrinking. In order to prevent data loss, I cannot make the partition smaller than the amount of data stored in it.
I made a mistake while creating the new partition size for /dev/sda3 on compute-1-3: I told fdisk to make the partition much larger than was possible. I fixed it (forgot what I did). The /var partition of compute-1-3 was resized to 97G (04/22/2016).

04/13/2016 (Ryan, Ankit) TAGS: compute-1-1 repartitioning cvmfs
compute-1-1 made almost all of the SAM tests go critical. Most of the critical tests reported that there was no CMS software on the node. It has been taken off condor by changing its START value from TRUE; this stops the SAM tests from examining the broken node. The symlink between /cmssoft/cms and /cvmfs/cms.cern.ch has been broken because /cvmfs/cms.cern.ch is missing. The files can be obtained by using cvmfs to transfer the appropriate data using the URL given (cms.cern.ch). /cvmfs/cms.cern.ch magically reappeared where it should be (04/15/2016), so I turned condor back on to see what the SAM tests say. It was green again until 04/21/2016, when the tests failed again due to the same problem as before. I turned off its condor. / was very full because the /etc/cvmfs/default.local file was pointing the cache to /. It was fixed by pointing it to /var/cache/cvmfs.
cvmfs can be checked with
$ cvmfs_config probe
A few quick checks for this situation are sketched below.
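A minimal sketch of the checks used to diagnose this kind of failure; the parameter name CVMFS_CACHE_BASE is the usual cvmfs cache-location setting and is an assumption here, since the entry only says the cache was "pointed" somewhere else:

# is the CMS software area reachable, and does the symlink still point at cvmfs?
$ ls -l /cmssoft/cms
$ cvmfs_config probe cms.cern.ch

# where does the cvmfs cache live, and is / filling up because of it?
$ grep CVMFS_CACHE_BASE /etc/cvmfs/default.local
$ df -h /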
04/18/2016 (Ryan) TAGS: UPS test Tripplite
I tested the two Tripp-Lite UPSs, and no errors were reported.

04/19/2016 (Ryan) TAGS: hostname '=' wrong incorrect
The hostname for the CE, rather than the usual 'uscms1.fltech-grid3.fit.edu', is now just '='. When hostname is run, '=' is returned. In /etc/idmapd.conf the domain is correct, but there was whitespace on either side of the '=' after the Domain variable:
Domain = uscms1.fltech-grid3.fit.edu
I tried deleting the space to the left of the '=', and I restarted the service with:
$ service rpcidmapd restart
Nothing changed. SAM tests have started to fail. I changed the hostname with:
$ hostname uscms1.fltech-grid3.fit.edu
After logging back in, the prompt was fixed and $ hostname returned the proper hostname. I will wait to see what the SAM tests think.

REPARTITIONING OF OTHER NODES
compute-1-4
(*) begun 04/22/2016
(*) cleared /scratch
(*) followed instructions and completed partitioning
compute-1-5
(*) already complete (148G)
compute-1-6
(*) already complete (40G)
compute-1-7
(*) begun 04/25/2016
(*) complete
compute-1-8
(*) begun 04/25/2016
    NOTE: when clearing /scratch be sure to run
    $ rm -rfv /scratch/*
    rather than
    $ rm -rfv /scratch
    The first option actually deletes everything; the second option just removes the pointers. When the filesystem is recreated after the second option, the data will remain and cause problems later on.
(*) experiencing boot problems with new partition (rocks won't load properly on boot)
(*) Turns out everything was already done, I just didn't mount /var on /dev/sda3 on boot. The error was because /var was empty without the mount.
compute-1-9
(*) begun 04/29/2016
(*) cleared /scratch
(*) complete
compute-2-0
(*) begun 04/29/2016
(*) cleared /scratch
(*) complete
compute-2-1
(*) begun 05/02/2016
(*) cleared /scratch
(*) complete
compute-2-2
(*) begun 05/03/2016
(*) cleared /scratch
(*) complete
compute-2-3
(*) begun 05/03/2016
(*) NOTE: strange partition scheme with 100G /export partition on /dev/sda6
(*) backed up /export onto nas1 and deleted it on the node
(*) NOTE: the password is the new one
(*) backed up /var into /var.old
(*) deleted /dev/sda5 and /dev/sda6
(*) deleted strange "extended" system partition /dev/sda4
(*) created /dev/sda4, does not work
(*) we had to boot into Single User mode
    (-) enter GRUB menu on boot
    (-) select "Red Hat Enterprise Linux" with the version of the kernel you wish to boot
    (-) type 'a' to append to the line
    (-) go to end of line and type 'single' as a separate word (space before word)
    (-) press ENTER to exit edit mode
(*) /var must be populated on / to boot properly
    (-) we moved the contents of /var.old to /var
(*) everything in /var is now owned by root (it should NOT be)
    (-) refer to other nodes for proper permissions
compute-2-4
(*) already complete (100G)
compute-2-5
(*) begun 05/04/2016
(*) cleared /scratch
(*) complete
compute-2-6
(*) begun 05/04/2016
(*) cleared /scratch
(*) complete
compute-2-7
(*) begun 05/05/2016
(*) cleared /scratch
(*) complete
compute-2-8
(*) begun 05/06/2016
(*) cleared /scratch
(*) complete
compute-2-9
(*) begun 05/06/2016
(*) cleared /scratch
(*) complete

04/27/2016 TAGS: ganglia 1-min loads
The 1-min load on the ganglia load plot has shot up. On the Ganglia website, the page showed that the loads of compute-2-2 and compute-2-3 had shot up (this is reflected in the 1-min load). Both nodes had faulty /etc/mtab files. A symptom of this is that df -h doesn't work. Once the mtab files are fixed, restart Ganglia on the CE:
$ service gmond restart
$ service gmetad restart
A quick check for stale mtab entries is sketched below.
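A minimal sketch for spotting stale NAS entries on a node (the NAS names are the ones used throughout this log); an entry that appears in /etc/mtab but not in the kernel's mount table is a likely cause of df -h hanging:

# what mtab claims is mounted vs. what the kernel says is actually mounted
$ grep -i nas /etc/mtab
$ grep -i nas /proc/mounts

# if a NAS line appears only in /etc/mtab, back the file up and remove that line
$ cp /etc/mtab /etc/mtab.bak
$ vi /etc/mtab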
05/02/2016 TAGS: compute-1-3 /tmp cvmfs low space SAM 9 critical
SAM test 9 went critical on compute-1-3; the reason was that /tmp was full. cvmfs still had its cache in /, which was now full. To change the location of the cvmfs cache, edit /etc/cvmfs/default.local. It was changed from /scratch/cvmfs to /var/cache/cvmfs. The directory cvmfs must be manually created in /var/cache and its ownership must be changed from root:root to cvmfs:cvmfs:
$ chown -R cvmfs:cvmfs /var/cache/cvmfs/
The service must then be reloaded:
$ cvmfs_config reload
Clear the cache with:
$ cvmfs_config wipecache
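A minimal sketch of what the relevant line in /etc/cvmfs/default.local might look like after this change; CVMFS_CACHE_BASE is the usual parameter name for the cache location and is an assumption here, and any other lines already in the file would be left as they are:

# hypothetical /etc/cvmfs/default.local excerpt
CVMFS_CACHE_BASE=/var/cache/cvmfs

After editing, the sequence used in the entry above would be:
$ mkdir -p /var/cache/cvmfs
$ chown -R cvmfs:cvmfs /var/cache/cvmfs/
$ cvmfs_config reload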