08/22/2016 TAGS: condor running red compute 2-1
The "running" text for Condor on the diagnostics page was red. The problem was that condor had gone critical on compute-2-1. 'service condor status' returned "condor_master dead but subsys locked". To fix it, I restarted condor within the node:
$ pkill -9 condor
$ service condor restart
$ condor_restart

08/24/2016 TAGS: drive replace nas0
Drives 2 and 12 have failed in NAS-0, probably as a result of the drive replacement done about a month earlier. A 1TB drive was used in slot 2, and a 750GB drive was used in slot 12. Upon rescan, neither drive was detected. After about an hour, I came back to find that the rebuild process had been started. It was at 2%, and drive 10 had experienced an ECC-ERROR. RIP NAS-0.
cont. 08/25/2016
At around 2:00 this morning the rebuild resumed from its previously paused state. A few hours later, the rebuild finished and the two new drives appear to be working properly. The drive that failed is still broken, so I removed it from the RAID. The replacement drive, however, failed the Seagate diagnostic tests, so the drive slot (p10) will remain empty until either the replacement can be fixed or the new drives arrive.
cont. 08/26/2016
Eric was able to fix two of the previously broken drives, and I am installing one of them. Drive p10 is now rebuilding.
cont. 08/27/2016
Drive p10 rebuilt successfully.

08/25/2016 TAGS: NAS1 full
Yesterday, NAS-1 became 100% full!
$ nohup du -m /mnt/nas1 > ~/du_nas1_20160825.txt
was run to list all files and their sizes in NAS-1. I sent the list to Vallary for review.

08/26/2016 TAGS: website nas0 drive missing
Since the NAS-0 catastrophe, drive p2 has been missing from the diagnostics page. It appears to be fine when I investigate the NAS itself, however. It's probably because the drive in slot p2 is 1TB rather than the usual 750GB.

08/30/2016 TAGS: user revival bdorney temp password sent
Stefano requested that an old user's account have a password reset and that the temporary password be emailed to him.

08/31/2016 TAGS: nas1 cleaning delete files
Dr. Hohlmann has cleared the following directories for deletion:
/mnt/nas1/g4hep/MTSAtFIT/1cmPbBot
/mnt/nas1/g4hep/MTSAtFIT/Bot1cmLead
/mnt/nas1/g4hep/MTSAtFIT/Center1cmLead
/mnt/nas1/g4hep/MTSAtFIT/Turkey :(
/mnt/nas1/g4hep/MTSAtFIT/WPb
guragain - files, but not account
idiaz - files and account
Brian - files and account
Doug - files and account
There was an error removing the home directory of idiaz using:
$ userdel -r idiaz
in nas-0-0.

09/01/2016 TAGS: yum update SAM 6 14 critical
Shortly after I conducted a yum update, SAM tests 6 and 14 went critical! SAM 14 is a condor test and SAM 6 is the xrootd test.
14: 'condor_status' says jobs are still running.
cont. 09/06/2016
Test 14 went green again shortly after it went critical. It went critical again two other times afterward, however.
6: The error report says that copy_jobs is empty (whatever that means). I will try another yum update to see what happens. Only tomcat was updated, and nothing interesting appears to have happened. The Twiki page for SAM 6 reports that the test ensures that "the CMS software directory ($VO_CMS_SW_DIR for EGEE and $OSG_APP/cmssoft/cms for OSG) is defined, existing, and readable". $VO_CMS_SW_DIR looks fine, but $OSG_APP is already defined as '/cmssoft/cms'. This leads me to believe that the test is trying to access the nonexistent '/cmssoft/cms/cmssoft/cms' rather than the intended '/cmssoft/cms'. I changed $OSG_APP to null.
cont. 09/06/2016
The SAM test is still critical.
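For reference, the condition the Twiki describes can be approximated locally on a worker node. This is only a rough sketch of such a check (the exact paths the real SAM probe tests are an assumption on my part), not the actual test code:

#!/bin/bash
# Rough local approximation of what SAM 6 claims to check (assumed, not the
# actual probe): the CMS software directory must be defined, exist, and be
# readable.
for var in VO_CMS_SW_DIR OSG_APP; do
    dir="${!var}"
    if [ -z "$dir" ]; then
        echo "$var is not defined"
    elif [ ! -d "$dir" ] || [ ! -r "$dir" ]; then
        echo "$var=$dir is not an existing, readable directory"
    else
        echo "$var=$dir looks OK"
    fi
done
# The OSG flavor of the test reportedly looks under $OSG_APP/cmssoft/cms:
if [ -r "$OSG_APP/cmssoft/cms" ]; then
    echo "$OSG_APP/cmssoft/cms is readable"
else
    echo "$OSG_APP/cmssoft/cms is missing or unreadable"
fi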
cont. 09/13/2016
The test also checks that cmsset_default.sh exists and can be properly sourced. The script is located in /cvmfs/cms.cern.ch/, and 'source cmsset_default.sh' produces no errors. The test also checks that the directory containing the MC test code can be accessed; MC might stand for Monte Carlo. I'm trying to find the tested directory.
cont. 09/19/2016
On 09/16/2016, the SAM test suddenly changed to, and remains in, Warning. The error report says that a SIGTERM was caught.
cont. 09/22/2016
Later on 09/19/2016, the SAM test reverted back to its Critical state.

09/06/2016 TAGS: NAS0 NAS-0 drives failed 10 12
Drives p10 and p12 have failed again. I am backing up NAS-0 to NAS-1, then deleting the old backup of NAS-0. We only have one 750GB drive, so I will wait for the other to arrive before replacing the two dead ones.
$ nohup rsync -av --append /mnt/nas0/home /mnt/nas1/nas0-bak-20160906 &
cont. 09/07/2016
Only about 9GB were transferred to NAS-1. Even though NAS-1 has plenty of space, nohup.out is filled with "device full" errors. I have deleted the partial backup and am trying again. It failed again in the same way. I don't want to delete the old backup because NAS-0 might be broken, so I'm going to compress it instead. Done in /mnt/nas1/:
$ nohup tar -cjvf --append nas0-bak-20160304.tar.bz2 nas0-bak-20160304 &
cont. 09/12/2016
The new drive has arrived; I will replace the two broken drives. The brand new Western Digital drive was placed in slot p12, and the other drive was placed in p10. The rebuild has begun.
cont. 09/13/2016
The rebuild completed successfully. The compression did not work, and I deleted the file. I tried to rsync everything again, but it froze up (or so it seems). The file it was copying at the time was over 700GB, so it was probably just taking a while to copy that one file. I want to restart the process. According to 'ps aux', there were 3 rsync processes running (oops). I'm trying to kill them all.
NOTE: 'pgrep' can be used to get the PID of a specified process.
They are all dead. I am restarting the rsync, and I will let it sit for a while.
cont. 09/14/2016
The transfer failed again with only about 9GB being transferred, and the 'device full' errors persisted. The rsync always fails while it is attempting to copy a large (~700GB) file. To get more information, I am going to run the command again with the -P flag, which will report progress.
cont. 09/26/2016
I will transfer user directories one at a time to try to see where the problem lies. I am writing a command that will individually rsync each of the users' home directories. I created a text file that contains the names of the home directories, and I will be reading that file into a loop that individually rsyncs each user's home directory.
$ while read dir; do rsync -av /mnt/nas0/home/$dir /mnt/nas1/nas0-bak-20160926/$dir &> rsync.out; done < homeDirs.txt
The transfer failed again, but this time after 43G were transferred. NAS-1 also appears to be functionally full. When I try to touch files, it tells me there is no space remaining on the device. df -h, however, reveals that there are over 9T of space on NAS-1.
cont. 09/29/2016
I am performing an in-depth investigation into NAS-1. I am unmounting it from the CE and SE so that I can fsck it.
in CE and SE:
$ umount /mnt/nas1
$ fsck /mnt/nas1
I am getting error 2: fsck.nfs not found. There is no fsck.nfs in /sbin; there are fsck helpers for the other filesystems.
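A note on the fsck.nfs error, since it will come up again: /mnt/nas1 on the CE and SE is an NFS client mount, and there is no fsck for NFS. Any real filesystem check has to run on NAS-1 itself against the underlying block device. A quick way to confirm what kind of mount you are looking at (the export name shown below is just a guess, not copied from our config):

# Confirm the filesystem type of the mount before trying to fsck it.
df -hT /mnt/nas1       # Type column shows "nfs" for the client mount on the CE/SE
mount | grep nas1      # shows the export (e.g. nas-1:/nas1, name assumed) and options
# On NAS-1 itself, the real device and filesystem appear instead (xfs on /dev/sdc
# in our case), and that is where xfs_check/xfs_repair must be run.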
$ exportfs -v
only returns information for NAS-0. Maybe that's because it's part of the cluster while NAS-1 is network-accessible? I ssh'd into NAS-1 and am investigating the filesystem from there. NAS-1 was able to be completely filled about a month ago; only when we've tried to fill it back up has the problem arisen. There are more than enough inodes to go around, and no processes are keeping deleted files busy. What gives, man?
in NAS-1:
$ lsof | grep DEL
yielded some results, but none of the files were more than a few MB. There are many results for `lsof | grep DEL` in the CE, but none of them are from /mnt/nas1.
cont. 10/01/2016
Ankit is running some commands in NAS-1:
$ service nfs restart
He restarted NAS-1. I'm seeing many high-priority processes running on NAS-1. Ankit said he fixed the issue, so I'm trying the backup again. It's still busted. It appears that the files previously deleted from NAS-1 are still taking up space somehow; the "free" reported space is about the same amount of space freed by deleting the files. Ankit unmounted /nas1 from nas-0-1:
$ umount -l /nas1
then remounted it. It's working now when we rsync just Ankit's home directory.
To determine corrupted files:
(*) Check nohup.out periodically for when rsync stops
(*) kill the rsync processes
(*) delete the troublesome file/directory from nas0
(*) delete the current backup folder
(*) unmount nas1. in NAS-1: `umount -l /nas1`
(*) mount nas1. in NAS-1: `mount /dev/sdc -o inode64 /nas1`
(*) restart rsync
The problem is that rsync is trying to copy corrupt files. To test if there is space on the NAS:
$ head -c 1073741824 /dev/urandom > myfile
which writes 1GB of random data to myfile.

09/13/2016 TAGS: yum update
Update successfully completed.

09/27/2016 TAGS: UPS Tripplite red light
The 'balance' light on the bottom UPS was red. I installed the Tripplite software and tested the UPS. Everything looks to be fine.

10/01/2016 TAGS: condor down
Condor is down! `condor_status` returns "Failed to connect to <163.118.42.1:9618>"
cont. 10/03/2016
Condor had just turned off. To turn it back on:
$ condor_master
To verify that it's back up and running:
$ service condor status
`condor_status` now has regular output.

10/01/2016 TAGS: gums home page down
The /var/log/gums logs report that they are still using Daniel's old certificate. See ~/diagnostics/gumscheck.txt; the cron jobs for the diagnostics page are in /etc/cron.d. The issue is causing SAM 12 to fail. [continued on SAM 12 failed thread]

10/01/2016 TAGS: nas0 /mnt/mobile partition weird
tune2fs is not working on /mnt/mobile on NAS-0; it is reporting a superblock error. After some more testing, NAS-0 appears to be fine.

10/01/2016 TAGS: yum update antlr
Do not run just `yum update`. Run:
$ yum update; yum downgrade --disablerepo=Rocks\* antlr
to prevent antlr from updating. GUMS does not like the newer versions of antlr.

10/03/2016 TAGS: tomcat not running
The website reports that tomcat is not running. To check the status of tomcat:
$ service tomcat6 status
It reports the following error: "PID file exists, but process is not running". I just started tomcat with:
$ service tomcat6 start
and everything seems to be okay.

10/05/2016 TAGS: Hurricane! shutdown restart cluster
There is an approaching hurricane, so the cluster is being turned off and wrapped up.
1) stop services:
$ service condor stop
$ service autofs stop
2) shutdown nodes (see the sketch after this list):
(*) uncomment "shutdown now" in ~/osg-node.sh
$ ./osg-wn-setup.sh
3) unmount NASs from SE
4) shutdown SE
5) unmount storage partitions from NASs
6) unmount NASs from CE
7) shutdown NASs
8) shutdown CE
Good luck; don't die!
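The node-shutdown step works by pushing a command to every worker node. The actual osg-node.sh / osg-wn-setup.sh scripts are not reproduced in this log, so the following is only a hypothetical sketch of that kind of loop; the node list and ssh settings are assumptions:

#!/bin/bash
# Hypothetical sketch of an osg-wn-setup.sh-style loop -- NOT the actual script.
# Assumes worker nodes are named compute-<rack>-<slot> and that passwordless
# ssh from the CE is configured.
NODES="compute-1-1 compute-1-2 compute-2-1"   # placeholder node list

for node in $NODES; do
    echo "=== $node ==="
    # For the hurricane shutdown, the per-node command ends with "shutdown now";
    # normally that part stays commented out.
    ssh -o ConnectTimeout=5 "$node" "service condor stop; shutdown now" \
        || echo "could not reach $node"
done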
cont. 10/13/2016
The cluster is back online! Steps to revive the cluster:
1) turn on NASs and watch them boot
2) turn on CE and SE and watch them boot
3) turn on nodes
NOTE: When the UPSs are plugged in, green lights will appear. This does NOT mean they are on! The power button must be pressed. The 'balance' light indicates whether they are on.

10/13/2016 TAGS: mouse CE not working
The CE does not seem to be accepting any mouse input, even from a direct connection.

10/15/2016 TAGS: condor down not working
Condor was down again; I simply restarted it.
$ pkill -9 condor
$ service condor stop
$ service condor start
$ condor_restart

10/16/2016 TAGS: condor down again
I had to restart condor again; I will investigate this recurring issue.

10/17/2016 TAGS: SAM tests not appearing
Since I turned the cluster back on, most of the SAM tests have not reappeared. Only 5, 12, 13, 14, and 15 are visible. On the twiki page about SAM tests, under the section "How to resubmit the SAM tests", there is a link to a site that has all the SAM tests. It reports that the condor CE tests have been recently submitted and that the JobState tests are all OK (the JobSubmit tests are all in WARN). There are buttons to schedule immediate checks. I pressed them for the OK tests (the only ones for which a button was available), and nothing seems to have changed. On the SE page on the same site, the "age" of each test is from June 30, but the "checked" is 15 min. The "checked" just incremented to 16 min, so I take that to mean the tests were run 16 min ago. In that case, all of the SE tests have been recently run. Only one is OK, but they have been run. If the tests appear to have been run, why are they not on the main SAM test page? Ankit says that the SAM test jobs are probably not running. `condor_history` says that the last time a grid0002 (SAM user) job was run was 10/3. The Gratia accounting on the website says that no CMS jobs have run since the cluster was brought back online; SAM tests are operated by CMS. All the CMS stuff on the website is blank. CMS jobs are not running.
cont. 10/18/2016
Like the condor problem below, maybe some critical services aren't running. It complained that condor-cron wasn't running and that "sshftp access to globus-gridftp-server is disabled".
To test if the gridftp server is running:
$ telnet localhost 2811
If a 220 banner appears, it's running correctly. Next on the list was to test globus-url-copy, the troublesome command:
$ globus-url-copy -vb -dbg gsiftp://uscms1.fltech-grid3.fit.edu/dev/zero file:///dev/null
It returned a 530 error code ("login incorrect", "globus_gss_assist: error invoking callout") along with a bunch of other 530-related lines. The website said that 530 is due to certificate issues. I replaced the old certificates in ~/.globus with my own. `grid-proxy-init` on the CE now recognizes me. The `globus-url-copy` is still returning 530 errors. tomcat6 was not started on the SE. It started successfully, but it said that it could not find a name for the group id 501. I enabled sshftp for globus-gridftp-server:
$ globus-gridftp-server-enable-sshftp
`globus-url-copy` works! Now to test whether it works in the other direction:
$ globus-url-copy -vb -dbg file:///dev/zero gsiftp://uscms1.fltech-grid3.fit.edu/dev/null
and it does! I restarted tomcat6 on the CE and SE. Regular jobs are running again!
cont. 10/19/2016
The SAM tests have all reappeared!
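Since the root cause here turned out to be services that never came back after the restart, a quick status sweep after every full power-up would catch this sooner. A minimal sketch (the service list is just the ones mentioned in this thread, not an exhaustive or authoritative set):

#!/bin/bash
# Quick post-reboot sanity sweep of the services this thread ended up touching.
for svc in condor condor-cron tomcat6 globus-gridftp-server; do
    printf '%-26s ' "$svc:"
    service "$svc" status >/dev/null 2>&1 && echo running || echo "NOT running"
done
# gridftp itself can be checked the same way as above in the log:
#   telnet localhost 2811    # a healthy server answers with a 220 banner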
10/18/2016 TAGS: condor down again
Condor is now idle. It's idle because no new jobs are being received. The antlr symlinks were broken, so I fixed them. Some services also were not running on the SE:
$ service gratia-xrootd-transfer start
$ service gratia-xrootd-storage start
$ service globus-gridftp-server start
Jobs are running normally again! (refer to the 10/18/2016 section of the above article)

10/18/2016 TAGS: CE var full condor down
/var in the CE is 100% full! This is causing condor to fail. I deleted /var/log/maillog-20161013 (1.1G).
$ yum clean expire-cache
I turned condor back on.

10/19/2016 TAGS: all home directories mounted
Stefano emailed me saying SRSUser couldn't write any data. I logged into the CE and ran `df -h`. It says that all of the home directories are mounted. I `su`d into SRSUser and successfully ran `touch test` in SRSUser's home directory. I was unable to write data to NAS-1, however. NAS-1 thinks it is full, although there are 9.2T free. I unmounted NAS-0, which didn't do anything other than unmount NAS-0. I tried unmounting all of the home directories with `umount -l /home/*`. They were all unmounted, but they were immediately remounted. I restarted autofs with `service autofs restart`. It seems to have worked; the regular number of items are mounted. NAS-1 still thinks it's full, though.

10/19/2016 TAGS: NAS1 NAS-1 full space available
Stefano (or anyone else) is unable to write to NAS-1 because it is complaining that it is out of space. It appears that the files previously deleted from NAS-1 are still taking up space somehow; the "free" reported space is about the same amount of space freed by deleting the files. Holding open deleted files doesn't seem to be the problem:
$ lsof | grep DEL | awk '{for(i=1;i<=6;i++){printf "%s ", $i}; print $7/1048576 "MB" " "$8" "$9 }'
did not reveal any file over 50MB in size in either the CE or NAS-1.
cont. 10/20/2016
IN NAS-1: I showed the problem to Daniel Campos from Blueshark, and he did some things. He did some basic checks to make sure it wasn't just a simple problem I had overlooked, and he didn't find anything out of the ordinary. He tried to run `xfs_check` to see if anything was wrong with the filesystem itself, but NAS-1 ran out of RAM before the process was able to complete. He said reinstalling the filesystem would probably fix it. Because NAS-1 only has 12G of RAM, and the motherboard (SuperMicro X8DT6) can support up to 192G, I'm looking into getting NAS-1 some more RAM. If we do end up reinstalling the filesystem, we might also be able to install ZFS, where all that extra RAM would come in handy. I found a page that recommended the use of xfs_repair over xfs_check. I am trying to properly unmount NAS-1 so that I can run the new command. I mounted nas1 as read-only on NAS-1 and ran `xfs_repair -n /dev/sdc`; I am letting it run.
cont. 10/21/2016
Nothing special seems to have come up from the `xfs_repair -n /dev/sdc`. I am remounting NAS-1 for Vallary. I am having trouble remounting NAS-1. When I try to mount it from NAS-1, it says it's already mounted or busy. When I try to mount it from the CE, it mounts some 48G thing. I tried unmounting NAS-1 from all the nodes, thinking that maybe that's what was keeping it busy, but no change. I'm trying to remount NAS-1 on the nodes just to see if it will work. It does not. There is currently no way to mount NAS-1. /mnt/backup and /mnt/general are commented out in /etc/fstab. I will uncomment them and try to mount them, then nas1. /mnt/backup, /mnt/general, and /mnt/nas1 all have the same size and space taken up: 48G total and 6.4G used.
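One possible (unverified) explanation for the 48G "thing": if the NFS mount silently fails, df just reports the local filesystem that sits under the empty mountpoint directory. A couple of quick checks can tell the two cases apart; nothing below is specific to our hosts:

# Check what is actually mounted at the path before trusting the df numbers.
df -hT /mnt/nas1        # Type column: "nfs" if the export is really mounted,
                        # ext4/xfs (the local disk) if we're only seeing the
                        # bare mountpoint directory
grep nas1 /proc/mounts  # lists the mount only if it actually succeeded
mount | grep nas1       # same information, with the mount options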
I have unmounted /mnt/nas1, /mnt/backup, and /mnt/general from everything. I am going to try to mount /mnt/backup and /mnt/general on NAS-1 and play with that some. The /etc/fstab in NAS-1 is kinda strange. The three similar-ish lines for /mnt/nas1, /mnt/backup, and /mnt/general are all commented out, and a new, shortened line for /mnt/nas1 is present at the bottom. I will uncomment the three lines and comment out the strange line to see what happens. When I try to mount the three devices, it says they don't exist. I changed /etc/fstab back to what it was before, and now nas1 seems to mount just fine on NAS-1. Now that it's mounted again, let's continue solving the issue at hand! The `xfs_db` command appears to be very useful. I will try to use it once I have sufficiently researched it, because it is also quite dangerous. [cont. 11/15/2016]

10/24/2016 TAGS: ssh port changed root login disabled
Ankit changed the ssh port from the default 22 to a less-than-default value (the new value can be found in /etc/ssh/sshd_config). Root login has also been disabled. Sysadmins must now log in through their user accounts, then `su -` to root.

10/24/2016 TAGS: condor not running squid log
Condor stopped again! Probably because /var is almost filled up again (it's at 97%). All of the home directories are mounted again, too! Why and how? `service autofs restart` seemed to have fixed it last time, and it's fixed it this time. Strange. I want to increase the size of /var on the CE, but for now, I'm just gonna figure out how to change the write location of the squid log. I moved access.log into the home directory and restarted condor. The configuration for squid is in /etc/squid. The main configuration file (squid.conf) is not designed to be directly edited. There is a script (customize.sh) that is supposed to be edited to run custom awk commands (written in customhelps.awk) that edit the desired items in squid.conf.

10/26/2016 TAGS: var full again
I am trying to download an important security patch with yum update, but the /var directory is full. I removed a couple of old maillogs to make just enough space for the update. `du -sh /var/*` claims that there is no more than 4G of data in /var. It turns out a process was holding a deleted log file open. `lsof | grep deleted` revealed that a process was holding open a file in /log (which means /var/log). I killed the process, and the space was freed up. df and du now agree.

10/26/2016 TAGS: security update
A patch was released to fix a recently found security bug. I am going to try to `yum update` the nodes and the SE. Both have been fully updated, now for the restart! Restart complete!

10/27/2016 TAGS: RSV all green
The RSV tests are all green! Looks like the restart fixed them.

10/31/2016 TAGS: certificate expire soon
My Grid DigiCert was set to expire next month, so I renewed it. The cluster uses my CERN certification, so it was not affected.

10/31/2016 TAGS: squid critical
The squid SAM test and status on the CE dashboard were critical. `service --status-all` revealed that the cache_log for squid was still pointing to /var/log/squid, which conflicted with the new write location of the access_log. I changed the cache_log to /root/squidAccessLogDump using the same steps as before.
cont. 11/01/2016
`service --status-all` says "Frontier Squid" is not running. I started it with `service frontier-squid start`.
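For future reference, the log relocation goes through customize.sh rather than squid.conf directly. The exact lines used are not recorded here; the sketch below is only my reconstruction, and the option names and paths are assumptions:

# Hedged sketch -- not a transcript of the actual edit. frontier-squid's
# /etc/squid/customize.sh calls awk helpers defined in customhelps.awk, such as
# setoption(), to rewrite squid.conf. The idea is to add lines along the lines of
#   setoption("access_log", "/root/squidAccessLogDump/access.log")
#   setoption("cache_log",  "/root/squidAccessLogDump/cache.log")
# inside customize.sh, then regenerate the config and restart squid, e.g.:
service frontier-squid restart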
11/01/2016 TAGS: website condor button
Dr. Hohlmann thinks the "Condor" button on the diagnostics page is misleading. The status on the page refers to condor on the CE, while the link provided by the button leads to the condor status of the individual nodes. I am going to edit the text of the "Condor" button to "Condor-CE (click for node status)" in /var/www/html/diagnostics/index.php. Never mind, that looks gross AF; I'm gonna come up with a better solution. I changed the "Idle" status of condor to "Idle-CE" to better clarify that when the status is "Idle" it is referring specifically to the CE.

11/01/2016 TAGS: website ganglia broken
The Ganglia button on the website is broke AF; it says, "There was an error collecting ganglia data (127.0.0.1:8652): fsockopen error: Connection refused". I found a webpage that said the problem is caused by incorrect permissions on /var/lib/ganglia/rrds. The permissions should be set to nobody:root, but were set to root:root. `chown -R nobody:root /var/lib/ganglia/rrds` fixes the permissions. The ganglia service was also off. To turn it on: `/etc/init.d/gmetad restart`.

11/03/2016 TAGS: squid still broken SAM 4
Barry from OSG emailed me today saying that squid is still busted. After a quick gander at the SAM test page, SAM 4, the squid test, was indeed critical. Both Barry and the SAM test said that compute-1-8 was the problem. Barry said that the node was not in the squid ACL (Access Control List), and to check the configuration. The SAM metric error report said that port 3128 refused the squid request. `netstat -lptu` did not contain an entry for 3128, so maybe the port is closed. I am going to try to open it. Turns out the port is totally open according to `nmap -sT -O localhost`. Maybe it's a certificate problem, like Ankit suggested? I copied my brand new OSG certificate to the cluster.
cont. 11/07/2016
I made my .p12 certificate into a .pem file and copied it into /etc/grid-security, replacing the older usercert.pem and userkey.pem files in the CE.
SIDE: tomcat6 was not running on any of the nodes, so I started the service.
cont. 11/10/2016
Barry sent me some instructions on what to do. He says that I need to add the IPs of the nodes to squid.conf via customize.sh. I need to specify which IPs can talk to squid by adding something like this:
`setoption("acl NET_LOCAL src", "172.20.0.0/255.255.255.0 172.20.1.0/255.255.255.0 162.129.223.0/255.255.255.0")`
to customize.sh, with the IPs of our nodes in place of his examples. The "acl NET_LOCAL" line in squid.conf only has the IP 0.0.0.0/32, which I think might be incorrect. I got all of the IPs of the nodes by adding `ifconfig | grep -m1 inet` to osg-nodes.sh and running osg-wn-setup.sh. I am going to add the found range of IPs to the line in squid.conf:
`setoption("acl NET_LOCAL src", "0.0.0.0/32 10.1.255.235/254")`
I restarted squid. It gave me some error messages that said it wasn't happy with my changes; it didn't recognize the new IP. I changed the line and tried again:
`setoption("acl NET_LOCAL src", "0.0.0.0/32 10.1.255.235-254")`
It doesn't seem to like any extra stuff on the end of the IP, so I'm just gonna try the first node, 10.1.255.254. It's cool with that syntax. Barry has the full IP written after the "/" and "-" characters, rather than the shortcut method I was trying to use.
`setoption("acl NET_LOCAL src", "0.0.0.0/32 10.1.255.235-10.1.255.254")`
Upon restart, squid did not yell at me, so my syntax is correct, but did I use the correct IPs?
cont. 11/11/2016
SAM 4 is now Green! Squid is fixed!
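The .p12-to-.pem conversion mentioned on 11/07 isn't spelled out above. For future reference, the usual openssl recipe looks like this; the filenames are placeholders, not a transcript of the exact commands run:

# Typical .p12 -> .pem conversion for grid certificates (filenames are placeholders).
openssl pkcs12 -in mycert.p12 -clcerts -nokeys -out usercert.pem   # certificate only
openssl pkcs12 -in mycert.p12 -nocerts -out userkey.pem            # encrypted private key
chmod 644 usercert.pem
chmod 400 userkey.pem    # grid tools refuse keys that are group/world readable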
11/08/2016 TAGS: gums discrepancy diagnostics page
GUMS is working fine (according to the GUMS website), but it's displaying critical on the diagnostics page. `gums mapAccount 0002`, which writes to the gumscheck file, says that my certificate has expired, even though I replaced the usercert and userkey in /etc/grid-security. The hostcert may have expired. The "crl-expiry" RSV test is critical, not the hostcert test. What is the crl-certificate? A previous problem was that fetch-crl was not running on some of the nodes, and that was causing SAM 1 (glexec) to go critical. fetch-crl is running on all of the CE, SE, and nodes.

11/14/2016 TAGS: Bestman phase out zeroth order
Daniel sent me an email saying that Bestman is going to be phased out by next year. gridftp and HDFS (Hadoop Distributed File System) will have to work together without Bestman, which acted as some sort of middleman between the two. The email said that if we only have one gridftp door, the process will be simpler. Do we only have one gridftp door? The CE has ports for gsiftp and gsigatekeeper, and the SE has a port for gsiftp. Because Bestman is on the SE, I think this means we fall into the "only one gridftp door" category. Instructions for making the switch for sites with only one gridftp door are provided in the email, so I will try to follow them.
cont. 11/15/2016
I talked to Daniel about it, and we don't actually have HDFS on the SE. So the first step is installing HDFS, which will be interesting.

11/14/2016 TAGS: hypernews
I finally figured out how to make a HyperNews account!
$ ssh cernUsername@lxplus.cern.ch
THEN
$ ssh cernUsername@hypernews.cern.ch

11/15/2016 TAGS: var full CE
/var was 90% full. Squid had a bunch of data (2.5G) for some reason. It was strange data, though, not the logs and whatnot I'm used to: a bunch of directories and files with hex names (e.g. 00, 4E). I moved the only directory with data (00) to the squid dump in the home directory, and I deleted the contents of 00 in /var/log/squid.

11/15/2016 TAGS: NAS-1 xfs_db
I am going to unmount NAS-1 and play with xfs_db. I unmounted NAS-1 from NAS-1, the CE, and the SE. I encountered a "mount.nfs stale file handle" error on the nodes. To fix it, forcefully unmount nas1 from the nodes with `umount -f /mnt/nas1`, then mount it again as normal with `mount /mnt/nas1`. I tried to run `xfs_db /dev/sdc`, but was met with "xfs_db: /dev/sdc contains a mounted filesystem". Some instructions online said to:
(*) comment out the /nas1 entry in /etc/fstab
(*) restart NAS-1
(*) run xfs_db again
(*) uncomment /nas1 when ready to mount it
(*) mount it
The instructions are legit; I'm in. `blockfree` returns the following error: "block usage information not allocated". I'm investigating what that means. Maybe the filesystem needs to be expanded to accommodate all of the space freed up by the deletion? I'm looking into xfs_growfs. Never mind, it's for growing the filesystem onto new disks. I ran out of time today, so I mounted NAS-1 back onto everything.
cont. 11/27/2016
I read the man page for xfs_db, and it said that blockfree uses the data created by blockget. So maybe I have to run that command first? I followed the above instructions for unmounting NAS-1; I'm in xfs_db. I tried running `blockget`, but it only returned "killed". That's because it is the same thing as `xfs_check`, which ran out of RAM the last time we tried to run it. What if I tell blockget to only check a small range of blocks at a time? After over an hour of running `blockget -b10 -v` in xfs_db, my Terminal crashed with 32G of RAM usage (I only have 16G, no clue what's going on there). So I'm gonna try the command without verbose mode and see what happens. It just dies like normal. After trying a bunch of different numbers for the -b option, none worked. All (except for 1000000) printed something, but all of them failed. Looks like I'm gonna have to reinstall the filesystem. I'm trying to mount NAS-1 back onto the CE, but mount.nfs keeps timing out. It suddenly worked; maybe there was a warm-up time after mounting back onto NAS-1. NAS-1 is remounted, and condor is restarted.
cont. 11/28/2016
How do I find out which blocks are what? `xfs_info` returns both the block size and the total number of blocks. The utility "badblocks" is looking promising; I can specify a range of blocks for it to check, and it seems to be running correctly. I wrote a script (NAS-1: ~/nas1Bad.sh) that automatically checks all of the blocks of NAS-1 with badblocks in 100,000,000-block increments (a rough sketch of that kind of loop is below). I've started the script.
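The original ~/nas1Bad.sh isn't reproduced in this log; the following is only a hypothetical reconstruction of the loop described above. The device, block size, and total block count are placeholder assumptions (the real numbers come from `xfs_info`):

#!/bin/bash
# Hypothetical reconstruction of ~/nas1Bad.sh -- NOT the original script.
# Assumes the NAS-1 data device is /dev/sdc, a 4 KiB block size, and a
# placeholder total block count taken from `xfs_info`.
DEVICE=/dev/sdc
BLOCKSIZE=4096
TOTAL=13000000000        # placeholder: total number of blocks on the device
STEP=100000000           # check 100,000,000 blocks per badblocks run
OUT=~/nas1BadBlocks.out

start=0
while [ "$start" -lt "$TOTAL" ]; do
    end=$((start + STEP - 1))
    [ "$end" -ge "$TOTAL" ] && end=$((TOTAL - 1))
    # badblocks (read-only by default) takes the *last* block first, then the first
    badblocks -b "$BLOCKSIZE" -o "$OUT.part" "$DEVICE" "$end" "$start"
    cat "$OUT.part" >> "$OUT"   # -o overwrites per run, so append to the master list
    start=$((end + 1))
done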
cont. 11/30/2016
The script has completed, and nothing was written to the output file. So badblocks hasn't helped. I'm gonna have to reinstall the filesystem. First, I need to find somewhere to store 50T of data.

11/18/2016 TAGS: /var full squid access log incorrect writing location
The squid access log filled up /var again, even though I had fixed it earlier. When I added the IPs of the nodes to squid.conf, it changed the write location of access.log back to /var. I changed the write location back to the root home directory. I ran customize.sh with both the node changes and the access.log write location change.

11/20/2016 TAGS: /var full squid cache
The squid cache filled up /var again. This time it was folder 01. I will have to change the write location of the cache in squid.conf.
cont. 11/21/2016
I changed the cache_dir in squid.conf from "ufs /var/cache/squid 20000 16 256" to "ufs /root/squidAccessLogDump/cache 20000 16 256". Make a cache directory in squidAccessLogDump and make sure squid can write to it (`chown squid:squid ~/squidAccessLogDump/cache`). After I deleted the directory, `df -h` still showed a bunch of space being taken up on the filesystem. The squid processes were holding the deleted files open. I ran `lsof | grep deleted | grep squid` to find the troublesome processes, then killed them. This also kills squid, however; restart it with `service frontier-squid start`.
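This df-vs-du mismatch caused by deleted-but-still-open files has now come up more than once (10/26 and 11/21), so here is the generic recipe as a sketch rather than the exact commands used each time:

# Generic check for space held by deleted-but-still-open files (the cause of
# the recurring df vs. du disagreement). Process names and paths vary per incident.
lsof | grep deleted              # open-but-deleted files and the PIDs holding them
lsof | grep deleted | grep squid # or narrowed to one daemon
kill <PID>                       # placeholder PID from the lsof output; killing or
                                 # restarting the process releases the space
df -h /var                       # df and du should agree again afterwards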
11/27/2016 TAGS: rsv tests critical condor
Several RSV tests are failing because they are having trouble connecting to the "local queue manager" (condor).
cont. 12/5/2016
All RSV tests except for "org.osg.certificates.crl-expiry" have decided to turn green.

11/28/2016 TAGS: APC UPS battery replacement
The battery replacement light (battery with an 'X') on the APC UPS is red, which means a battery failed the most recent self-test.
cont. 11/29/2016
I'm gonna unplug everything from the APC UPS and examine the batteries. The APC is plugged into everything that's not the nodes, so I'm gonna do it Thursday after badblocks is done running.
cont. 12/01/2016
Today is battery day! I'm going to take the cluster offline and examine the batteries. Everything rebooted correctly. All of the batteries are producing about 13V, and they are each rated for 12V. After restarting the UPS, though, the replace-battery light turned off. The APC website mentioned that the light can sometimes be a false alarm. Next time the light comes on, simply restart the UPS first before pulling out and testing all of the batteries.