08/22/2016 TAGS: condor running red compute 2-1
The "running" text for Condor on the diagnostics page was red. The problem was that condor had gone critical on compute-2-1. 'service condor status' returned "condor_master dead but subsys locked". To fix it, I restarted condor within the node:
$ pkill -9 condor
$ service condor restart
$ condor_restart

08/24/2016 TAGS: drive replace nas0
Drives 2 and 12 have failed in NAS-0, probably as a result of the drive replacement done about a month earlier. A 1TB drive was used in slot 2, and a 750GB drive was used in slot 12. Upon rescan, neither drive was detected. After about an hour, I came back to find that the rebuild process had been started. It was at 2%, and drive 10 had experienced an ECC-ERROR. RIP NAS-0.
cont. 08/25/2016
At around 2:00 this morning the rebuild resumed from its previously paused state. A few hours later, the rebuild finished and the two new drives appear to be working properly. The drive that failed is still broken, so I removed it from the RAID. The replacement drive, however, failed the Seagate diagnostic tests, so the drive slot (p10) will remain empty until either the replacement can be fixed or the new drives arrive.
cont. 08/26/2016
Eric was able to fix two of the previously broken drives, and I am installing one of them. Drive p10 is now rebuilding.
cont. 08/27/2016
Drive p10 rebuilt successfully.

08/25/2016 TAGS: NAS1 full
Yesterday, NAS-1 became 100% full!
$ nohup du -m /mnt/nas1 > ~/du_nas1_20160825.txt
was run to list all files and their sizes in NAS-1. I sent the list to Vallary for review.

08/26/2016 TAGS: website nas0 drive missing
Since the NAS-0 catastrophe, drive p2 has been missing from the diagnostics page. It appears to be fine when I investigate the NAS itself, however. It's probably because the drive in slot p2 is 1TB rather than the usual 750GB.

08/30/2016 TAGS: user revival bdorney temp password sent
Stefano requested that an old user's account have a password reset and that the temporary password be emailed to him.

08/31/2016 TAGS: nas1 cleaning delete files
Dr. Hohlmann has cleared the following directories for deletion:
/mnt/nas1/g4hep/MTSAtFIT/1cmPbBot
/mnt/nas1/g4hep/MTSAtFIT/Bot1cmLead
/mnt/nas1/g4hep/MTSAtFIT/Center1cmLead
/mnt/nas1/g4hep/MTSAtFIT/Turkey :(
/mnt/nas1/g4hep/MTSAtFIT/WPb
guragain - files, but not account
idiaz - files and account
Brian - files and account
Doug - files and account
There was an error removing the home directory of idiaz using:
$ userdel -r idiaz
in nas-0-0.

09/01/2016 TAGS: yum update SAM 6 14 critical
Shortly after I conducted a yum update, SAM tests 6 and 14 went critical! SAM 14 is a condor test and SAM 6 is the xrootd test.
14: 'condor_status' says jobs are still running.
cont. 09/06/2016
Test 14 went green again shortly after it went critical. It went critical again two other times afterward, however.
6: The error report says that copy_jobs is empty (whatever that means). I will try another yum update to see what happens. Only tomcat was updated, and nothing interesting appears to have happened. The Twiki page for SAM 6 reports that the test ensures that "the CMS software directory ($VO_CMS_SW_DIR for EGEE and $OSG_APP/cmssoft/cms for OSG) is defined, existing, and readable". $VO_CMS_SW_DIR looks fine, but $OSG_APP is already defined as '/cmssoft/cms'. This leads me to believe that the test is trying to access the nonexistent '/cmssoft/cms/cmssoft/cms' rather than the intended '/cmssoft/cms'. I changed $OSG_APP to null.
cont. 09/06/2016
The SAM test is still critical.
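For reference, the condition the Twiki describes can be approximated locally on a worker node. This is only a rough sketch of such a check (the exact paths the real SAM probe tests are an assumption on my part), not the actual test code:

#!/bin/bash
# Rough local approximation of what SAM 6 claims to check (assumed, not the
# actual probe): the CMS software directory must be defined, exist, and be
# readable.
for var in VO_CMS_SW_DIR OSG_APP; do
    dir="${!var}"
    if [ -z "$dir" ]; then
        echo "$var is not defined"
    elif [ ! -d "$dir" ] || [ ! -r "$dir" ]; then
        echo "$var=$dir is not an existing, readable directory"
    else
        echo "$var=$dir looks OK"
    fi
done
# The OSG flavor of the test reportedly looks under $OSG_APP/cmssoft/cms:
if [ -r "$OSG_APP/cmssoft/cms" ]; then
    echo "$OSG_APP/cmssoft/cms is readable"
else
    echo "$OSG_APP/cmssoft/cms is missing or unreadable"
fi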
cont. 09/13/2016
The test also checks that cmsset_default.sh exists and can be properly sourced. The script is located in /cvmfs/cms.cern.ch/, and 'source cmsset_default.sh' produces no errors. The test also checks that the directory containing the MC test code can be accessed; MC might stand for Monte Carlo. I'm trying to find the tested directory.
cont. 09/19/2016
On 09/16/2016, the SAM test suddenly changed to, and remains in, Warning. The error report says that a SIGTERM was caught.
cont. 09/22/2016
Later on 09/19/2016, the SAM test reverted back to its Critical state.

09/06/2016 TAGS: NAS0 NAS-0 drives failed 10 12
Drives p10 and p12 have failed again. I am backing up NAS-0 to NAS-1, then deleting the old backup of NAS-0. We only have one 750GB drive, so I will wait for the other to arrive before replacing the two dead ones.
$ nohup rsync -av --append /mnt/nas0/home /mnt/nas1/nas0-bak-20160906 &
cont. 09/07/2016
Only about 9GB were transferred to NAS-1. Even though NAS-1 has plenty of space, nohup.out is filled with "device full" errors. I have deleted the partial backup and am trying again. It failed again in the same way. I don't want to delete the old backup because NAS-0 might be broken, so I'm going to compress it instead. Done in /mnt/nas1/:
$ nohup tar -cjvf --append nas0-bak-20160304.tar.bz2 nas0-bak-20160304 &
cont. 09/12/2016
The new drive has arrived; I will replace the two broken drives. The brand new Western Digital drive was placed in slot p12, and the other drive was placed in p10. The rebuild has begun.
cont. 09/13/2016
The rebuild completed successfully. The compression did not work, and I deleted the file. I tried to rsync everything again, but it froze up (or so it seems). The file it was copying at the time was over 700GB, so it was probably just taking a while to copy that one file. I want to restart the process. According to 'ps aux', there were 3 rsync processes running (oops). I'm trying to kill them all.
NOTE: 'pgrep' can be used to get the PID of a specified process.
They are all dead. I am restarting the rsync, and I will let it sit for a while.
cont. 09/14/2016
The transfer failed again with only about 9GB being transferred, and the 'device full' errors persisted. The rsync always fails while it is attempting to copy a large (~700GB) file. To get more information, I am going to run the command again with the -P flag, which will report progress.
cont. 09/26/2016
I will transfer user directories one at a time to try to see where the problem lies. I am writing a command that will individually rsync each of the users' home directories. I created a text file that contains the names of the home directories, and I will be reading that file into a loop that individually rsyncs each user's home directory.
$ while read dir; do rsync -av /mnt/nas0/home/$dir /mnt/nas1/nas0-bak-20160926/$dir &> rsync.out; done < homeDirs.txt
The transfer failed again, but this time after 43G were transferred. NAS-1 also appears to be functionally full. When I try to touch files, it tells me there is no space remaining on the device. df -h, however, reveals that there are over 9T of space on NAS-1.
cont. 09/29/2016
I am performing an in-depth investigation into NAS-1. I am unmounting it from the CE and SE so that I can fsck it.
in CE and SE:
$ umount /mnt/nas1
$ fsck /mnt/nas1
I am getting error 2: fsck.nfs not found. There is no fsck.nfs in /sbin; there are fsck helpers for the other filesystems.
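A note on the fsck.nfs error, since it will come up again: /mnt/nas1 on the CE and SE is an NFS client mount, and there is no fsck for NFS. Any real filesystem check has to run on NAS-1 itself against the underlying block device. A quick way to confirm what kind of mount you are looking at (the export name shown below is just a guess, not copied from our config):

# Confirm the filesystem type of the mount before trying to fsck it.
df -hT /mnt/nas1       # Type column shows "nfs" for the client mount on the CE/SE
mount | grep nas1      # shows the export (e.g. nas-1:/nas1, name assumed) and options
# On NAS-1 itself, the real device and filesystem appear instead (xfs on /dev/sdc
# in our case), and that is where xfs_check/xfs_repair must be run.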
$ exportfs -v
only returns information for NAS-0. Maybe that's because it's part of the cluster while NAS-1 is network-accessible? I ssh'd into NAS-1 and am investigating the filesystem from there. NAS-1 was able to be completely filled about a month ago; only when we've tried to fill it back up has the problem arisen. There are more than enough inodes to go around, and no processes are keeping deleted files busy. What gives, man?
in NAS-1:
$ lsof | grep DEL
yielded some results, but none of the files were more than a few MB. There are many results for `lsof | grep DEL` in the CE, but none of them are from /mnt/nas1.
cont. 10/01/2016
Ankit is running some commands in NAS-1:
$ service nfs restart
He restarted NAS-1. I'm seeing many high-priority processes running on NAS-1. Ankit said he fixed the issue, so I'm trying the backup again. It's still busted. It appears that the files previously deleted from NAS-1 are still taking up space somehow; the "free" reported space is about the same amount of space freed by deleting the files. Ankit unmounted /nas1 from nas-0-1:
$ umount -l /nas1
then remounted it. It's working now when we rsync just Ankit's home directory.
To determine corrupted files:
(*) Check nohup.out periodically for when rsync stops
(*) kill the rsync processes
(*) delete the troublesome file/directory from nas0
(*) delete the current backup folder
(*) unmount nas1. in NAS-1: `umount -l /nas1`
(*) mount nas1. in NAS-1: `mount /dev/sdc -o inode64 /nas1`
(*) restart rsync
The problem is that rsync is trying to copy corrupt files. To test if there is space on the NAS:
$ head -c 1073741824 /dev/urandom > myfile
which writes 1GB of random data to myfile.

09/13/2016 TAGS: yum update
Update successfully completed.

09/27/2016 TAGS: UPS Tripplite red light
The 'balance' light on the bottom UPS was red. I installed the Tripplite software and tested the UPS. Everything looks to be fine.

10/01/2016 TAGS: condor down
Condor is down! `condor_status` returns "Failed to connect to <163.118.42.1:9618>"
cont. 10/03/2016
Condor had just turned off. To turn it back on:
$ condor_master
To verify that it's back up and running:
$ service condor status
`condor_status` now has regular output.

10/01/2016 TAGS: gums home page down
The /var/log/gums logs report that they are still using Daniel's old certificate. See ~/diagnostics/gumscheck.txt; the cron jobs for the diagnostics page are in /etc/cron.d. The issue is causing SAM 12 to fail. [continued on SAM 12 failed thread]

10/01/2016 TAGS: nas0 /mnt/mobile partition weird
tune2fs is not working on /mnt/mobile on NAS-0; it is reporting a superblock error. After some more testing, NAS-0 appears to be fine.

10/01/2016 TAGS: yum update antlr
Do not run just `yum update`. Run:
$ yum update; yum downgrade --disablerepo=Rocks\* antlr
to prevent antlr from updating. GUMS does not like the newer versions of antlr.

10/03/2016 TAGS: tomcat not running
The website reports that tomcat is not running. To check the status of tomcat:
$ service tomcat6 status
It reports the following error: "PID file exists, but process is not running". I just started tomcat with:
$ service tomcat6 start
and everything seems to be okay.

10/05/2016 TAGS: Hurricane! shutdown restart cluster
There is an approaching hurricane, so the cluster is being turned off and wrapped up.
1) stop services:
$ service condor stop
$ service autofs stop
2) shutdown nodes (see the sketch after this list):
(*) uncomment "shutdown now" in ~/osg-node.sh
$ ./osg-wn-setup.sh
3) unmount NASs from SE
4) shutdown SE
5) unmount storage partitions from NASs
6) unmount NASs from CE
7) shutdown NASs
8) shutdown CE
Good luck; don't die!
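The node-shutdown step works by pushing a command to every worker node. The actual osg-node.sh / osg-wn-setup.sh scripts are not reproduced in this log, so the following is only a hypothetical sketch of that kind of loop; the node list and ssh settings are assumptions:

#!/bin/bash
# Hypothetical sketch of an osg-wn-setup.sh-style loop -- NOT the actual script.
# Assumes worker nodes are named compute-<rack>-<slot> and that passwordless
# ssh from the CE is configured.
NODES="compute-1-1 compute-1-2 compute-2-1"   # placeholder node list

for node in $NODES; do
    echo "=== $node ==="
    # For the hurricane shutdown, the per-node command ends with "shutdown now";
    # normally that part stays commented out.
    ssh -o ConnectTimeout=5 "$node" "service condor stop; shutdown now" \
        || echo "could not reach $node"
done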
cont. 10/13/2016
The cluster is back online! Steps to revive the cluster:
1) turn on NASs and watch them boot
2) turn on CE and SE and watch them boot
3) turn on nodes
NOTE: When the UPSs are plugged in, green lights will appear. This does NOT mean they are on! The power button must be pressed. The 'balance' light indicates whether they are on.

10/13/2016 TAGS: mouse CE not working
The CE does not seem to be accepting any mouse input, even from a direct connection.

10/15/2016 TAGS: condor down not working
Condor was down again; I simply restarted it.
$ pkill -9 condor
$ service condor stop
$ service condor start
$ condor_restart

10/16/2016 TAGS: condor down again
I had to restart condor again; I will investigate this recurring issue.

10/17/2016 TAGS: SAM tests not appearing
Since I turned the cluster back on, most of the SAM tests have not reappeared. Only 5, 12, 13, 14, and 15 are visible. On the twiki page about SAM tests, under the section "How to resubmit the SAM tests", there is a link to a site that has all the SAM tests. It reports that the condor CE tests have been recently submitted and that the JobState tests are all OK (the JobSubmit tests are all in WARN). There are buttons to schedule immediate checks. I pressed them for the OK tests (the only ones for which a button was available), and nothing seems to have changed. On the SE page on the same site, the "age" of each test is from June 30, but the "checked" is 15 min. The "checked" just incremented to 16 min, so I take that to mean the tests were run 16 min ago. In that case, all of the SE tests have been recently run. Only one is OK, but they have been run. If the tests appear to have been run, why are they not on the main SAM test page? Ankit says that the SAM test jobs are probably not running. `condor_history` says that the last time a grid0002 (SAM user) job was run was 10/3. The Gratia accounting on the website says that no CMS jobs have run since the cluster was brought back online; SAM tests are operated by CMS. All the CMS stuff on the website is blank. CMS jobs are not running.
cont. 10/18/2016
Like the condor problem below, maybe some critical services aren't running. It complained that condor-cron wasn't running and that "sshftp access to globus-gridftp-server is disabled".
To test if the gridftp server is running:
$ telnet localhost 2811
If a 220 banner appears, it's running correctly. Next on the list was to test globus-url-copy, the troublesome command:
$ globus-url-copy -vb -dbg gsiftp://uscms1.fltech-grid3.fit.edu/dev/zero file:///dev/null
It returned a 530 error code ("login incorrect", "globus_gss_assist: error invoking callout") along with a bunch of other 530-related lines. The website said that 530 is due to certificate issues. I replaced the old certificates in ~/.globus with my own. `grid-proxy-init` on the CE now recognizes me. The `globus-url-copy` is still returning 530 errors. tomcat6 was not started on the SE. It started successfully, but it said that it could not find a name for the group id 501. I enabled sshftp for globus-gridftp-server:
$ globus-gridftp-server-enable-sshftp
`globus-url-copy` works! Now to test whether it works in the other direction:
$ globus-url-copy -vb -dbg file:///dev/zero gsiftp://uscms1.fltech-grid3.fit.edu/dev/null
and it does! I restarted tomcat6 on the CE and SE. Regular jobs are running again!
cont. 10/19/2016
The SAM tests have all reappeared!
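Since the root cause here turned out to be services that never came back after the restart, a quick status sweep after every full power-up would catch this sooner. A minimal sketch (the service list is just the ones mentioned in this thread, not an exhaustive or authoritative set):

#!/bin/bash
# Quick post-reboot sanity sweep of the services this thread ended up touching.
for svc in condor condor-cron tomcat6 globus-gridftp-server; do
    printf '%-26s ' "$svc:"
    service "$svc" status >/dev/null 2>&1 && echo running || echo "NOT running"
done
# gridftp itself can be checked the same way as above in the log:
#   telnet localhost 2811    # a healthy server answers with a 220 banner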
10/18/2016 TAGS: condor down again
Condor is now idle. It's idle because no new jobs are being received. The antlr symlinks were broken, so I fixed them. Some services also were not running on the SE:
$ service gratia-xrootd-transfer start
$ service gratia-xrootd-storage start
$ service globus-gridftp-server start
Jobs are running normally again! (refer to the 10/18/2016 section of the above article)

10/18/2016 TAGS: CE var full condor down
/var in the CE is 100% full! This is causing condor to fail. I deleted /var/log/maillog-20161013 (1.1G).
$ yum clean expire-cache
I turned condor back on.

10/19/2016 TAGS: all home directories mounted
Stefano emailed me saying SRSUser couldn't write any data. I logged into the CE and ran `df -h`. It says that all of the home directories are mounted. I `su`d into SRSUser and successfully ran `touch test` in SRSUser's home directory. I was unable to write data to NAS-1, however. NAS-1 thinks it is full, although there are 9.2T free. I unmounted NAS-0, which didn't do anything other than unmount NAS-0. I tried unmounting all of the home directories with `umount -l /home/*`. They were all unmounted, but they were immediately remounted. I restarted autofs with `service autofs restart`. It seems to have worked; the regular number of items are mounted. NAS-1 still thinks it's full, though.

10/19/2016 TAGS: NAS1 NAS-1 full space available
Stefano (or anyone else) is unable to write to NAS-1 because it is complaining that it is out of space. It appears that the files previously deleted from NAS-1 are still taking up space somehow; the "free" reported space is about the same amount of space freed by deleting the files. Holding open deleted files doesn't seem to be the problem:
$ lsof | grep DEL | awk '{for(i=1;i<=6;i++){printf "%s ", $i}; print $7/1048576 "MB" " "$8" "$9 }'
did not reveal any file over 50MB in size in either the CE or NAS-1.
cont. 10/20/2016
IN NAS-1: I showed the problem to Daniel Campos from Blueshark, and he did some things. He did some basic checks to make sure it wasn't just a simple problem I had overlooked, and he didn't find anything out of the ordinary. He tried to run `xfs_check` to see if anything was wrong with the filesystem itself, but NAS-1 ran out of RAM before the process was able to complete. He said reinstalling the filesystem would probably fix it. Because NAS-1 only has 12G of RAM, and the motherboard (SuperMicro X8DT6) can support up to 192G, I'm looking into getting NAS-1 some more RAM. If we do end up reinstalling the filesystem, we might also be able to install ZFS, where all that extra RAM would come in handy. I found a page that recommended the use of xfs_repair over xfs_check. I am trying to properly unmount NAS-1 so that I can run the new command. I mounted nas1 as read-only on NAS-1 and ran `xfs_repair -n /dev/sdc`; I am letting it run.
cont. 10/21/2016
Nothing special seems to have come up from the `xfs_repair -n /dev/sdc`. I am remounting NAS-1 for Vallary. I am having trouble remounting NAS-1. When I try to mount it from NAS-1, it says it's already mounted or busy. When I try to mount it from the CE, it mounts some 48G thing. I tried unmounting NAS-1 from all the nodes, thinking that maybe that's what was keeping it busy, but no change. I'm trying to remount NAS-1 on the nodes just to see if it will work. It does not. There is currently no way to mount NAS-1. /mnt/backup and /mnt/general are commented out in /etc/fstab. I will uncomment them and try to mount them, then nas1. /mnt/backup, /mnt/general, and /mnt/nas1 all have the same size and space taken up: 48G total and 6.4G used.
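One possible (unverified) explanation for the 48G "thing": if the NFS mount silently fails, df just reports the local filesystem that sits under the empty mountpoint directory. A couple of quick checks can tell the two cases apart; nothing below is specific to our hosts:

# Check what is actually mounted at the path before trusting the df numbers.
df -hT /mnt/nas1        # Type column: "nfs" if the export is really mounted,
                        # ext4/xfs (the local disk) if we're only seeing the
                        # bare mountpoint directory
grep nas1 /proc/mounts  # lists the mount only if it actually succeeded
mount | grep nas1       # same information, with the mount options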
I have unmounted /mnt/nas1, /mnt/backup, and /mnt/general from everything. I am going to try to mount /mnt/backup and /mnt/general on NAS-1 and play with that some. The /etc/fstab in NAS-1 is kinda strange. The three similar-ish lines for /mnt/nas1, /mnt/backup, and /mnt/general are all commented out, and a new, shortened line for /mnt/nas1 is present at the bottom. I will uncomment the three lines and comment out the strange line to see what happens. When I try to mount the three devices, it says they don't exist. I changed /etc/fstab back to what it was before, and now nas1 seems to mount just fine on NAS-1. Now that it's mounted again, let's continue solving the issue at hand! The `xfs_db` command appears to be very useful. I will try to use it once I have sufficiently researched it, because it is also quite dangerous. [cont. 11/15/2016]

10/24/2016 TAGS: ssh port changed root login disabled
Ankit changed the ssh port from the default 22 to a less-than-default value (the new value can be found in /etc/ssh/sshd_config). Root login has also been disabled. Sysadmins must now log in through their user accounts, then `su -` to root.

10/24/2016 TAGS: condor not running squid log
Condor stopped again! Probably because /var is almost filled up again (it's at 97%). All of the home directories are mounted again, too! Why and how? `service autofs restart` seemed to have fixed it last time, and it's fixed it this time. Strange. I want to increase the size of /var on the CE, but for now, I'm just gonna figure out how to change the write location of the squid log. I moved access.log into the home directory and restarted condor. The configuration for squid is in /etc/squid. The main configuration file (squid.conf) is not designed to be directly edited. There is a script (customize.sh) that is supposed to be edited to run custom awk commands (written in customhelps.awk) that edit the desired items in squid.conf.

10/26/2016 TAGS: var full again
I am trying to download an important security patch with yum update, but the /var directory is full. I removed a couple of old maillogs to make just enough space for the update. `du -sh /var/*` claims that there is no more than 4G of data in /var. It turns out a process was holding a deleted log file open. `lsof | grep deleted` revealed that a process was holding open a file in /log (which means /var/log). I killed the process, and the space was freed up. df and du now agree.

10/26/2016 TAGS: security update
A patch was released to fix a recently found security bug. I am going to try to `yum update` the nodes and the SE. Both have been fully updated, now for the restart! Restart complete!

10/27/2016 TAGS: RSV all green
The RSV tests are all green! Looks like the restart fixed them.

10/31/2016 TAGS: certificate expire soon
My Grid DigiCert was set to expire next month, so I renewed it. The cluster uses my CERN certification, so it was not affected.

10/31/2016 TAGS: squid critical
The squid SAM test and status on the CE dashboard were critical. `service --status-all` revealed that the cache_log for squid was still pointing to /var/log/squid, which conflicted with the new write location of the access_log. I changed the cache_log to /root/squidAccessLogDump using the same steps as before.
cont. 11/01/2016
`service --status-all` says "Frontier Squid" is not running. I started it with `service frontier-squid start`.
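For future reference, the log relocation goes through customize.sh rather than squid.conf directly. The exact lines used are not recorded here; the sketch below is only my reconstruction, and the option names and paths are assumptions:

# Hedged sketch -- not a transcript of the actual edit. frontier-squid's
# /etc/squid/customize.sh calls awk helpers defined in customhelps.awk, such as
# setoption(), to rewrite squid.conf. The idea is to add lines along the lines of
#   setoption("access_log", "/root/squidAccessLogDump/access.log")
#   setoption("cache_log",  "/root/squidAccessLogDump/cache.log")
# inside customize.sh, then regenerate the config and restart squid, e.g.:
service frontier-squid restart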
11/01/2016 TAGS: website condor button
Dr. Hohlmann thinks the "Condor" button on the diagnostics page is misleading. The status on the page refers to condor on the CE, while the link provided by the button leads to the condor status of the individual nodes. I am going to edit the text of the "Condor" button to "Condor-CE (click for node status)" in /var/www/html/diagnostics/index.php. Never mind, that looks gross AF; I'm gonna come up with a better solution. I changed the "Idle" status of condor to "Idle-CE" to better clarify that when the status is "Idle" it is referring specifically to the CE.

11/01/2016 TAGS: website ganglia broken
The Ganglia button on the website is broke AF; it says, "There was an error collecting ganglia data (127.0.0.1:8652): fsockopen error: Connection refused". I found a webpage that said the problem is caused by incorrect permissions on /var/lib/ganglia/rrds. The permissions should be set to nobody:root, but were set to root:root. `chown -R nobody:root /var/lib/ganglia/rrds` fixes the permissions. The ganglia service was also off. To turn it on: `/etc/init.d/gmetad restart`.

11/03/2016 TAGS: squid still broken SAM 4
Barry from OSG emailed me today saying that squid is still busted. After a quick gander at the SAM test page, SAM 4, the squid test, was indeed critical. Both Barry and the SAM test said that compute-1-8 was the problem. Barry said that the node was not in the squid ACL (Access Control List), and to check the configuration. The SAM metric error report said that port 3128 refused the squid request. `netstat -lptu` did not contain an entry for 3128, so maybe the port is closed. I am going to try to open it. Turns out the port is totally open according to `nmap -sT -O localhost`. Maybe it's a certificate problem, like Ankit suggested? I copied my brand new OSG certificate to the cluster.
cont. 11/07/2016
I made my .p12 certificate into a .pem file and copied it into /etc/grid-security, replacing the older usercert.pem and userkey.pem files in the CE.
SIDE: tomcat6 was not running on any of the nodes, so I started the service.
cont. 11/10/2016
Barry sent me some instructions on what to do. He says that I need to add the IPs of the nodes to squid.conf via customize.sh. I need to specify which IPs can talk to squid by adding something like this:
`setoption("acl NET_LOCAL src", "172.20.0.0/255.255.255.0 172.20.1.0/255.255.255.0 162.129.223.0/255.255.255.0")`
to customize.sh, with the IPs of our nodes in place of his examples. The "acl NET_LOCAL" line in squid.conf only has the IP 0.0.0.0/32, which I think might be incorrect. I got all of the IPs of the nodes by adding `ifconfig | grep -m1 inet` to osg-nodes.sh and running osg-wn-setup.sh. I am going to add the found range of IPs to the line in squid.conf:
`setoption("acl NET_LOCAL src", "0.0.0.0/32 10.1.255.235/254")`
I restarted squid. It gave me some error messages that said it wasn't happy with my changes; it didn't recognize the new IP. I changed the line and tried again:
`setoption("acl NET_LOCAL src", "0.0.0.0/32 10.1.255.235-254")`
It doesn't seem to like any extra stuff on the end of the IP, so I'm just gonna try the first node, 10.1.255.254. It's cool with that syntax. Barry has the full IP written after the "/" and "-" characters, rather than the shortcut method I was trying to use.
`setoption("acl NET_LOCAL src", "0.0.0.0/32 10.1.255.235-10.1.255.254")`
Upon restart, squid did not yell at me, so my syntax is correct, but did I use the correct IPs?
cont. 11/11/2016
SAM 4 is now Green! Squid is fixed!
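The .p12-to-.pem conversion mentioned on 11/07 isn't spelled out above. For future reference, the usual openssl recipe looks like this; the filenames are placeholders, not a transcript of the exact commands run:

# Typical .p12 -> .pem conversion for grid certificates (filenames are placeholders).
openssl pkcs12 -in mycert.p12 -clcerts -nokeys -out usercert.pem   # certificate only
openssl pkcs12 -in mycert.p12 -nocerts -out userkey.pem            # encrypted private key
chmod 644 usercert.pem
chmod 400 userkey.pem    # grid tools refuse keys that are group/world readable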
11/08/2016 TAGS: gums discrepancy diagnostics page
GUMS is working fine (according to the GUMS website), but it's displaying critical on the diagnostics page. `gums mapAccount 0002`, which writes to the gumscheck file, says that my certificate has expired, even though I replaced the usercert and userkey in /etc/grid-security. The hostcert may have expired. The "crl-expiry" RSV test is critical, not the hostcert test. What is the crl-certificate? A previous problem was that fetch-crl was not running on some of the nodes, and that was causing SAM 1 (glexec) to go critical. fetch-crl is running on all of the CE, SE, and nodes.

11/14/2016 TAGS: Bestman phase out zeroth order
Daniel sent me an email saying that Bestman is going to be phased out by next year. gridftp and HDFS (Hadoop Distributed File System) will have to work together without Bestman, which acted as some sort of middleman between the two. The email said that if we only have one gridftp door, the process will be simpler. Do we only have one gridftp door? The CE has ports for gsiftp and gsigatekeeper, and the SE has a port for gsiftp. Because Bestman is on the SE, I think this means we fall into the "only one gridftp door" category. Instructions for making the switch for sites with only one gridftp door are provided in the email, so I will try to follow them.
cont. 11/15/2016
I talked to Daniel about it, and we don't actually have HDFS on the SE. So the first step is installing HDFS, which will be interesting.

11/14/2016 TAGS: hypernews
I finally figured out how to make a HyperNews account!
$ ssh cernUsername@lxplus.cern.ch
THEN
$ ssh cernUsername@hypernews.cern.ch

11/15/2016 TAGS: var full CE
/var was 90% full. Squid had a bunch of data (2.5G) for some reason. It was strange data, though, not the logs and whatnot I'm used to: a bunch of directories and files with hex names (e.g. 00, 4E). I moved the only directory with data (00) to the squid dump in the home directory, and I deleted the contents of 00 in /var/log/squid.

11/15/2016 TAGS: NAS-1 xfs_db
I am going to unmount NAS-1 and play with xfs_db. I unmounted NAS-1 from NAS-1, the CE, and the SE. I encountered a "mount.nfs stale file handle" error on the nodes. To fix it, forcefully unmount nas1 from the nodes with `umount -f /mnt/nas1`, then mount it again as normal with `mount /mnt/nas1`. I tried to run `xfs_db /dev/sdc`, but was met with "xfs_db: /dev/sdc contains a mounted filesystem". Some instructions online said to:
(*) comment out the /nas1 entry in /etc/fstab
(*) restart NAS-1
(*) run xfs_db again
(*) uncomment /nas1 when ready to mount it
(*) mount it
The instructions are legit; I'm in. `blockfree` returns the following error: "block usage information not allocated". I'm investigating what that means. Maybe the filesystem needs to be expanded to accommodate all of the space freed up by the deletion? I'm looking into xfs_growfs. Never mind, it's for growing the filesystem onto new disks. I ran out of time today, so I mounted NAS-1 back onto everything.
cont. 11/27/2016
I read the man page for xfs_db, and it said that blockfree uses the data created by blockget. So maybe I have to run that command first? I followed the above instructions for unmounting NAS-1; I'm in xfs_db. I tried running `blockget`, but it only returned "killed". That's because it is the same thing as `xfs_check`, which ran out of RAM the last time we tried to run it. What if I tell blockget to only check a small range of blocks at a time? After over an hour of running `blockget -b10 -v` in xfs_db, my Terminal crashed with 32G of RAM usage (I only have 16G, no clue what's going on there). So I'm gonna try the command without verbose mode and see what happens. It just dies like normal. After trying a bunch of different numbers for the -b option, none worked. All (except for 1000000) printed something, but all of them failed. Looks like I'm gonna have to reinstall the filesystem. I'm trying to mount NAS-1 back onto the CE, but mount.nfs keeps timing out. It suddenly worked; maybe there was a warm-up time after mounting back onto NAS-1. NAS-1 is remounted, and condor is restarted.
cont. 11/28/2016
How do I find out which blocks are what? `xfs_info` returns both the block size and the total number of blocks. The utility "badblocks" is looking promising; I can specify a range of blocks for it to check, and it seems to be running correctly. I wrote a script (NAS-1: ~/nas1Bad.sh) that automatically checks all of the blocks of NAS-1 with badblocks in 100,000,000-block increments (a rough sketch of that kind of loop is below). I've started the script.
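The original ~/nas1Bad.sh isn't reproduced in this log; the following is only a hypothetical reconstruction of the loop described above. The device, block size, and total block count are placeholder assumptions (the real numbers come from `xfs_info`):

#!/bin/bash
# Hypothetical reconstruction of ~/nas1Bad.sh -- NOT the original script.
# Assumes the NAS-1 data device is /dev/sdc, a 4 KiB block size, and a
# placeholder total block count taken from `xfs_info`.
DEVICE=/dev/sdc
BLOCKSIZE=4096
TOTAL=13000000000        # placeholder: total number of blocks on the device
STEP=100000000           # check 100,000,000 blocks per badblocks run
OUT=~/nas1BadBlocks.out

start=0
while [ "$start" -lt "$TOTAL" ]; do
    end=$((start + STEP - 1))
    [ "$end" -ge "$TOTAL" ] && end=$((TOTAL - 1))
    # badblocks (read-only by default) takes the *last* block first, then the first
    badblocks -b "$BLOCKSIZE" -o "$OUT.part" "$DEVICE" "$end" "$start"
    cat "$OUT.part" >> "$OUT"   # -o overwrites per run, so append to the master list
    start=$((end + 1))
done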
cont. 11/30/2016
The script has completed, and nothing was written to the output file. So badblocks hasn't helped. I'm gonna have to reinstall the filesystem. First, I need to find somewhere to store 50T of data.

11/18/2016 TAGS: /var full squid access log incorrect writing location
The squid access log filled up /var again, even though I had fixed it earlier. When I added the IPs of the nodes to squid.conf, it changed the write location of access.log back to /var. I changed the write location back to the root home directory. I ran customize.sh with both the node changes and the access.log write location change.

11/20/2016 TAGS: /var full squid cache
The squid cache filled up /var again. This time it was folder 01. I will have to change the write location of the cache in squid.conf.
cont. 11/21/2016
I changed the cache_dir in squid.conf from "ufs /var/cache/squid 20000 16 256" to "ufs /root/squidAccessLogDump/cache 20000 16 256". Make a cache directory in squidAccessLogDump and make sure squid can write to it (`chown squid:squid ~/squidAccessLogDump/cache`). After I deleted the directory, `df -h` still showed a bunch of space being taken up on the filesystem. The squid processes were holding the deleted files open. I ran `lsof | grep deleted | grep squid` to find the troublesome processes, then killed them. This also kills squid, however; restart it with `service frontier-squid start`.
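This df-vs-du mismatch caused by deleted-but-still-open files has now come up more than once (10/26 and 11/21), so here is the generic recipe as a sketch rather than the exact commands used each time:

# Generic check for space held by deleted-but-still-open files (the cause of
# the recurring df vs. du disagreement). Process names and paths vary per incident.
lsof | grep deleted              # open-but-deleted files and the PIDs holding them
lsof | grep deleted | grep squid # or narrowed to one daemon
kill <PID>                       # placeholder PID from the lsof output; killing or
                                 # restarting the process releases the space
df -h /var                       # df and du should agree again afterwards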
11/27/2016 TAGS: rsv tests critical condor
Several RSV tests are failing because they are having trouble connecting to the "local queue manager" (condor).
cont. 12/5/2016
All RSV tests except for "org.osg.certificates.crl-expiry" have decided to turn green.

11/28/2016 TAGS: APC UPS battery replacement
The battery replacement light (battery with an 'X') on the APC UPS is red, which means a battery failed the most recent self-test.
cont. 11/29/2016
I'm gonna unplug everything from the APC UPS and examine the batteries. The APC is plugged into everything that's not the nodes, so I'm gonna do it Thursday after badblocks is done running.
cont. 12/01/2016
Today is battery day! I'm going to take the cluster offline and examine the batteries. Everything rebooted correctly. All of the batteries are producing about 13V, and they are each rated for 12V. After restarting the UPS, though, the replace-battery light turned off. The APC website mentioned that the light can sometimes be a false alarm. Next time the light comes on, simply restart the UPS first before pulling out and testing all of the batteries.