01/09/2017 TAGS: APC UPS battery replacement light red
The battery replacement light on the APC UPS is red again. Since it's connected to everything except the nodes, I'm gonna wait before restarting it.
cont. 01/10/2017 I'm ready to turn everything off and restart the UPS. After the UPS restart, the red light turned off. Everything booted up properly.

01/09/2017 TAGS: NAS-1 NAS1 Curtis is helping
Curtis recommended that I try to mount the filesystem with the inode64 option enabled. On NAS-1, I ran:
$ mount /dev/sdc -o remount,rw,inode64 /nas1
`mount` reported that /nas1 was mounted with inode64, but I was still unable to write to it. Curtis says the server might be hitting the open file limits:
$ ulimit -Hn
1024
$ ulimit -Sn
1024
`ulimit` provides control over the resources allowed to the shell. -Hn shows the hard limit on open file descriptors and -Sn shows the soft limit. The hard limit cannot be raised by a regular user, while the soft limit can be raised up to the hard limit.
$ cat /proc/sys/fs/file-nr
1530 0 1021706
The current number of file handles open across all users is 1530, which exceeds the 1024 limit. I am going to increase the limit on open file descriptors available to root from 1024 to 4096 by editing /etc/security/limits.conf.
cont. 01/10/2017 The changes to /etc/security/limits.conf take effect when a new session is started. /proc/sys/fs/file-nr now shows a 0 where the 1530 used to be, but /nas1 is still "full".
cont. 01/11/2017 /proc/sys/fs/file-nr now shows 1020 and /nas1 is still unwritable. That didn't seem to work, so I'm gonna try Curtis' other test: boot NAS-1 into a CentOS 6 LiveCD and test /nas1 from there (NAS-1 is still on CentOS 5). /nas1 was still not writable from the LiveCD.
cont. 01/12/2017 Stefano suggested I check the size of all of the Trashes on /nas1.
cont. 01/12/2017 Daniel Campos is looking at NAS-1, and he's doing many things. We tried mounting /nas1 on the CentOS 7 LiveCD, and it worked! We could write to /nas1! Daniel says the filesystem probably ran into a bug and panicked, but that bug has been fixed in later versions. Because we're not ready for a system-wide update (which would break everything), we're gonna try to update just the part that we need to. Success! NAS-1 is fixed! When we deleted all of those files a while ago, it triggered a bug in the filesystem. In CentOS 7 that bug is fixed, so all is now well (mostly)!

01/12/2017 TAGS: update NAS-1
Daniel is gonna update NAS-1 to CentOS 7. Backup of old NAS-1 made with rsync:
$ rsync -aH ...
The update was successful.

01/12/2017 TAGS: SE not booting turning on
The SE is refusing to start properly. It boots to the CentOS 6.8 screen with the little loading lines at the bottom, but the white bar fills up and nothing happens afterward. The cluster seems to be working fine except for that, though.
cont. 01/13/2017 I checked the SE when I arrived and was greeted by the usual login screen. It appears to be working fine after all! Perhaps it was just taking extra time to turn on.

01/13/2017 TAGS: SE not ssh-able NAS-0 not mounted
I am unable to ssh into the SE, and NAS-0 is not mounted on it. I can ssh into the SE from my computer via the SE's IP address, but I can't ssh into it with the compute-0-0 designation used on the CE. The SE is unreachable on the local network. I've run out of time today, so I'm just gonna turn everything off for the power outage tomorrow and investigate further next week.
cont. 01/17/2017 Everything booted up properly.
The problem seems to be related to the SE's new, abnormally long boot time; it sits at the CentOS 6.8 loading bar for a long while. Pressing any key during the loading bar screen enables verbose mode. The screen was covered with CRL errors: the CRL for [...] was not retrieved, the 24h grace period had expired, and the CRL needed to be updated. The CRLs are tied to openssl, which could explain why ssh isn't working.
cont. 01/18/2017 `fetch-crl` is not finding the CRLs it needs; it's the command that's taking forever at boot. The SE was unable to resolve any mirrors for a yum update, so maybe it doesn't have internet access. Because it cannot resolve any mirrors, it is taking FOREVER to complete. I'm gonna let it do its thing and come back later.
cont. 01/20/2017 It has internet access because it pings 8.8.8.8 fine. Maybe the yum update is dependent upon some of the CRLs. I tried mounting NAS-0, but it doesn't work; the SE can't ping it either. The SE doesn't seem to be talking to the rest of the cluster at all. I can't ping the SE from the CE. It can probably only mount NAS-1 because NAS-1 isn't technically part of the cluster network; the SE is hardwired directly to NAS-1. Disrupting that direct connection does not seem to have affected anything. The SE can ping the nodes just fine. I'm gonna try investigating the ".info" files for all of the certificates in /etc/certificates. All of the .info files were put into /etc/certificates/infoList.txt. `fetch-crl` retrieves information based upon the .info and .crl_url files in /etc/certificates; the URLs from which the CRLs can be retrieved are listed in that trust anchor meta-data.
cont. 01/27/2017 Turns out there's nothing installed on NAS-1; emacs wasn't there. I ran a `yum install emacs` and it downloaded a whole bunch of stuff. Maybe everything's borked because NAS-1 doesn't have all of its software. Imma investigate the repositories from old NAS-1. The only discrepancy is the lack of the rpmforge repo on the current NAS-1, so I'm installing that. There are still some issues with `yum update` and `yum upgrade`; I'll investigate later.
cont. 02/01/2017 I discovered that a service called `NetworkManager` was turned off on the CE. I turned it on, and now the SE is ssh-able from the CE and NAS-0 can be mounted on the SE. I'm gonna restart the SE and see if anything's changed. All of these problems have been fixed! Make sure `NetworkManager` is turned on on the CE!

01/25/2017 TAGS: NAS-1 website RAID health check
After the NAS-1 update, the RAID health check on the website wasn't working; all of the website files were giving permission errors. The problem was that the CE's ssh key had to be put back onto NAS-1; the health scripts rely on being able to ssh into NAS-1 automatically.
$ cat ~/.ssh/id_rsa.pub | ssh user@123.45.56.78 "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
cont. 01/27/2017 Never mind, I didn't fix it. The red bars now say "bash:" instead of "permissions:". It's probably no big deal.
cont. 01/29/2017 The problem is that the script that checks NAS-1 is trying to use the `storcli64` command, which apparently doesn't exist. Maybe some software needs to be installed onto NAS-1 again so that the command can be used. Either that or the command is outdated.
cont. 01/30/2017 I'm trying to install `storcli64`, but every link I've found is dead.

01/25/2017 TAGS: NAS-1 attempted logins China
There have been repeated attempts to log into NAS-1 since it's been updated (about 100 attempts every few seconds).
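NOTE: a quick way to watch these attempts (assuming the stock CentOS sshd logging to /var/log/secure) and tally them per source IP:
$ grep 'Failed password' /var/log/secure | tail
$ grep 'Failed password' /var/log/secure | grep -o 'from [0-9.]*' | sort | uniq -c | sort -nr | head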
I looked up the IP (59.63.166.80), and it's from China. Well boys, I reckon this is it: cyber warfare, toe to toe with the Chinese. The IP report only gave me information about the hub through which IPs 59.62.0.0 - 59.63.255.255 are run.
cont. 01/27/2017 Daniel Campos recommended the software Fail2Ban. It bans IPs that repeatedly attempt to log in. I'm also looking up how to use the firewall.
cont. 02/03/2017 The attempts seem to have stopped. There have only been 2 failed login attempts in the past hour, from Bulgaria.
cont. 02/06/2017 JK, they just took a couple days off. I've installed fail2ban, and now I'm gonna learn how to use it. fail2ban has been installed and is in use. Get rekt! The default sshd filter seems to be doing the trick nicely.

02/01/2017 TAGS: SAM critical
Only 6 of the usual 15 SAM tests are visible, and almost all of those are red. It looks like the SAM website switched to a new kind of monitoring; anything before about two days ago simply isn't there. The CE monitoring has also switched flavors from CE to GLOBUS. The "cacert-expiry" RSV test has also gone into Warning. I'm checking the globus logs. The SAM site says the change (whatever it was) happened at about 11:00 (or 06:00) on 01/31/2017. The logs for that day, however, stop during the 03:00 hour; the later logs are missing (or were never written). The gram log for grid0004 has repeated "no job found" error messages from 01/30/2017 to 02/01/2017.
cont. 02/03/2017 Since I've fixed the SE, all of the SE SAM tests have gone green!
cont. 02/06/2017 Never mind, they went red again sometime yesterday.

02/03/2017 TAGS: trouble mounting NAS-1 from IP
Stefano is having trouble mounting NAS-1 from a specific IP, although he can mount it fine from a different one. I added the IP he wanted to /etc/exports on NAS-1 and applied the change with `exportfs -ra`.

02/03/2017 TAGS: OSG software missing
OSG sent me an email yesterday saying that some packages required by CMSSW are missing. I've installed the requested packages.
cont. 02/06/2017 Turns out I need to install the packages on more than just the CE, so I'm gonna install them on the nodes and the SE as well.

02/06/2017 TAGS: cluster shutdown script
I wrote a script that will properly shut down the entire cluster: ~/scripts/totalShutdown.sh. It can be run by simply typing `totalShutdown`.

02/06/2017 TAGS: APC UPS battery light red
The check battery light for the APC UPS was red again. I turned the cluster off and restarted the UPS. The red light did not turn back on, so I brought the cluster back online.

02/06/2017 TAGS: Daniel certificate expired
I was looking around the /var/mail files, and the most recent phedex mail was complaining that Daniel's certificate had expired. Daniel's certificate is still on the cluster somewhere! This could be the source of our problems!
cont. 02/10/2017 Today is certificate day: let's find 'em! The phedex mail said that phedex is still using Daniel's expired certificate. The phedex user on the SE doesn't have a home directory, and a cron job that checks the expiration date of the certificate looks at a file buried in that missing home directory. I tried running `voms-proxy-init -cert usercert.pem -key userkey.pem` to set the certificates as my own, but it failed because the CRLs are out of date. `fetch-crl` returned a bunch of CRL retrieval errors. On NAS-1, fetch-crl gets its information from /etc/certificates, but /etc/certificates doesn't exist on either the CE or the SE.
That's because they were reconfigured to be /etc/grid-security/certificates. I tried going to one of the URLs mentioned in a .crl_url file, and the URL is fine; the CRL downloaded. Maybe fetch-crl needs certificates to get the CRLs? I messed around with `certutil` a bit. `certutil -L` reports that the certificate/key database is in an old, unsupported format. It may be that certutil is pointing at the wrong directory, or that the faulty database is contributing to the fetch-crl issues. I came across a command to list all certificates (`certutil -d sql:$HOME/.pki/nssdb -L`), but ~/.pki/nssdb is empty! When I tried to run the command, I got the same error as before. That just means there aren't any NSS databases.
cont. 02/13/2017 `fetch-crl` on the CE also fails; it fails the slow way, the way the SE used to. I'm scouring the log files on the CE. Nothing glaring came up in /var/log/globus-gatekeeper.log. The wiki page for fetch-crl mentioned the installation and maintenance of CA certs; there is an application meant to maintain them, and there is a cron job for it. I ran
$ [ ! -f /var/lock/subsys/osg-info-services ] || /bin/sh -c 'perl -e "sleep rand 300" && http_proxy= /usr/sbin/osg-info-services'
on its own to see the output, and it gave me some juicy config files to look at: /etc/gip/gip.conf and /etc/osg/config.d/. It also talks a lot about condor and its users, so it probably has something to do with authenticating users to use condor. /etc/osg/config.d is a folder full of .ini files; I'm searching them for any information about certificates. The cacert-expiry RSV test is red, so I think I'm looking in the right direction. The RSV test says that the CA "UNLPGrid" is out of sync. I tried running `osg-ca-certs-updater` on its own to see what happens; it said its update succeeded. I ran the critical RSV test by hand, and it still failed. Maybe the missing CA is part of antlr (the thing we refuse to update because it breaks GUMS). Let's see what happens when we update it! Nothing changed, so I downgraded antlr back to where it normally is. I tried running `osg-ca-certs-updater` on the SE, and it said none of the hosts for the certificates could be resolved! I manually checked the URLs, and they worked; the SE does have internet access. The same errors appeared when I tried a yum update. I fixed it by adding "nameserver 8.8.8.8" to /etc/resolv.conf. fetch-crl on the SE now results in only 2 failed CRLs! fetch-crl on the CE results in the same 2 failed CRLs. The `voms-proxy-init -cert usercert.pem -key userkey.pem` from before now works and recognizes me. I found Daniel's old certificate! It's in phedex's home directory on the CE. The script in that directory, `phedex_proxy_update.sh`, is what's been running and returning the Daniel errors. I'm working on updating the certs in that directory.
cont. 02/15/2017 Time to update the certificates! There was some crazy encryption to protect the certificate password. To encrypt a text file:
Generate a public-private key pair.
$ openssl genrsa -out key.pem 1024
Extract the public key.
$ openssl rsa -in key.pem -pubout > key.pub
Encrypt the text file with the public key.
$ cat federer.txt | openssl rsautl -encrypt -pubin -inkey key.pub > federer.txt.enc
The files key.pub and federer.txt can then be deleted (key.pem is needed to decrypt). To decrypt:
$ openssl rsautl -inkey key.pem -decrypt -in federer.txt.enc
I'm replacing the old usercert.pem and userkey.pem with my own info. I didn't know the old password, so the GRID passphrase had to be changed.
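NOTE: before dropping in a replacement usercert.pem/userkey.pem, a standard openssl sanity check is to confirm the cert and key actually belong together by comparing their moduli (the second command prompts for the key's passphrase):
$ openssl x509 -noout -modulus -in usercert.pem | md5sum
$ openssl rsa -noout -modulus -in userkey.pem | md5sum
(*) the two hashes should be identical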
Place the new usercert.pem and userkey.pem into .globus in phedex's home directory. `grid-change-pass-phrase` allows the passphrase to be changed. The passphrase works! But it says "Error: cms: User unknown to this VO." I tried it again using my GRID certificate rather than my CERN certificate, but got the same result.
cont. 02/22/2017 The SAM tests said that condor jobs were idle. `condor_q` reveals 366 idle jobs. Ah-ha! Problemo foundo! Why are the jobs idle? I tried fully shutting down and restarting condor, but the condor_q list remains the same. I searched the output of both `condor_q -analyze -verbose` and `condor_q -better-analyze -verbose` for the IDs of the idle jobs:
$ condor_q | grep I | awk -F' ' '{print $1}' > idleNums.txt
$ while read num; do grep $num condor_qAnalyzeVerbose.out condor_qBetterAnalyzeVerbose.out; done < idleNums.txt

[...] @uscms1.fltech-grid3.fit.edu:/mnt/nas1 works just fine for me; I've asked her to try it. Turns out she just forgot to capitalize '-P'.

03/04/2017 TAGS: emergency documentation
We had to turn off the cluster a couple nights ago for building maintenance. I have written new Emergency Cluster Documentation that provides updated instructions for shutting down and powering up the cluster.

03/08/2017 TAGS: condor idle
Condor is idle; no jobs are running. I restarted condor, but I suspect it's related to the SAM tests.
cont. 03/13/2017 The condor ShadowLog says that condor_shadow keeps exiting with status code 115, which doesn't exist according to Wisconsin's list of shadow exit codes. It exits with code 115 after a job fails (the job terminates with code 1), and with code 100 when the job runs properly (this is normal use). The log stopped being written to when the jobs stopped. The SchedLog, on 03/09/17, shows negotiation for users grid0004, osg, and glow, but no matches are being made in the local pool (16 rejections for grid0004, 17 rejections for osg, 4 rejections for glow). It also keeps reporting that there are 0 active workers. The NegotiatorLog illustrates what's written in the SchedLog. The globus-gatekeeper.log repeatedly reports GSS failures. I found documentation regarding the errors found in globus-gatekeeper.log. The file /etc/grid-security/grid-mapfile contains a list of basically everyone everywhere.
cont. 03/15/2017 I found an OSG ticket where someone else had a similar problem.
$ globusrun -a -r uscms1.fltech-grid3.fit.edu
(*) run from the SE, this returns the same error message as SAM 12: the expected name is the CE, but the SE is found
$ openssl x509 -text -noout -in /etc/grid-security/hostcert.pem
(*) brings up information about the hostcert
$ openssl rsa -text -noout -in /etc/grid-security/hostkey.pem
(*) returns info about the hostkey
There is a discrepancy in the Subject line, revealed by the commands above, between the new and old hostcerts: the old hostcerts have "uscms1" in the subject, while the new one has "uscms1-se". This may be contributing to the SAM 12 problem; it looks like a similar discrepancy. The OSG ticket's problem appeared to have been centered entirely around the hostcert. I updated our hostcert Feb 20, after the SAM tests had initially disappeared and some time before jobs stopped, so I don't think it's our primary concern. I sent off a ticket to OSG. (https://ticket.opensciencegrid.org/33031)
cont. 03/20/2017 Elizabeth from OSG wanted an update, so I told her that I suspect a hidden expired certificate is at fault.
cont. 03/22/2017 Elizabeth said that `osg-system-profiler` will provide a long printout of everything OSG-related.
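NOTE: since the working theory is a hidden expired certificate, a rough sweep for expiry dates is possible with plain openssl; this is just a sketch that assumes the interesting .pem files live under /etc/grid-security and the users' .globus directories:
$ for f in /etc/grid-security/*.pem /home/*/.globus/*.pem; do echo "== $f"; openssl x509 -enddate -noout -in "$f" 2>/dev/null; done
(*) any notAfter date in the past is an expired cert; key files simply print nothing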
NOTE: the osg-wn-setup.sh script will not work if root is logged into via `su -`.
I had forgotten to update the hostcerts on most of the cluster, so I did that. The problem began before the hostcert was updated at all, though. Maybe some of the problems that came about later were a result of using different hostcerts? After all of the new hostcerts were in place, I tried to restart the rsv service, as per the instructions, but it complained that the service condor-cron was not running because it was giving a "condor_master dead but subsys locked" error. I restarted condor-cron the same way I would force restart condor, and it's running again. Maybe that was the whole problem all along?
The `osg-system-profiler` command produces "/root/osg-profile.txt". The profile includes some lines from "/var/log/osg/osg-configure.log", and on Feb 26, about the same time jobs stopped running, it complained that the 'worker_node_temp' variable in "/etc/osg/config.d/10-storage.ini" was not set. Several other variables in that file were also incorrect, so I made the following changes:
worker_node_temp = /scratch
se_available = TRUE
default_se = uscms1-se.fltech-grid3.fit.edu
app_dir = /cmssoft/cms
data_dir = /mnt/nas1/osgDataDir
"/etc/osg/config.d/30-gip.ini" also appears to be misconfigured; the entire SE section was commented out! Since I got a new hostcert with these issues (probably) in place, the new hostcert may also be misconfigured; I may have to get a new one once these configurations are ironed out. NAS-0 could not be mounted on the SE, but I fixed it by adding it to /etc/hosts (it was already listed in /etc/fstab). I have edited /etc/osg/config.d/30-gip.ini to include information about the SE; I'm not sure if I did it correctly. /etc/osg/config.d/40-siteinfo.ini hasn't been changed since 2014, so I'm updating some of the information, namely contact information. The only strange thing related to certificates reported by osg-profile.txt is that we don't have an /etc/grid-security/voms/vomscert.pem certificate. I don't think we've ever had one before, so I don't think it's a big deal.
To save the changes to the OSG files:
$ osg-configure -v
(*) verifies that the changes are properly written
$ osg-configure -c
(*) saves the changes
The changes were successfully made! I restarted the SE to apply the changes I made to it. The RSV tests are still dying, so I'm gonna get a new hostcert and see what that does.
cont. 03/23/2017 I had been accidentally requesting hostcerts specifically for the SE; separate hostcerts must be requested for both the SE and the CE. I have obtained and distributed the new hostcerts, and I've restarted all of the relevant services, but RSV still dies. I've updated OSG.
cont. 03/24/2017 Elizabeth thinks the DN of the certificates may have changed, and she has provided some instructions to re-add them. The instructions require me to do things in the GUMS page, but I'm not registered as a GUMS admin. I tried to run the command to make the change:
$ openssl x509 -subject -noout -in ~/.globus/usercert.pem
(*) returns the DN of the current admin (probably you)
$ gums-add-mysql-admin "[output from above command minus 'subject= ']"
but it required the mysql gums password, which I don't know. I'm changing the password.
To change the mysql gums password:
$ kill `cat /var/run/mysqld/mysqld.pid`
(*) kills the mysql server
$ echo "SET PASSWORD FOR 'gums'@'localhost' = PASSWORD('newPass');" > /var/lib/mysql/mysql-init
(*) creates a file with a line to be run when the server is restarted (for MySQL 5.7.5 or earlier)
(*) be sure to remove the file once the server is all good (no need to have passwords lying around in plain text)
$ mysqld_safe --init-file=/var/lib/mysql/mysql-init &
(*) starts the mysql server and executes the line written in mysql-init on startup (make sure this file is owned by mysql)
The `gums-add-mysql-admin` worked, and I'm now a GUMS admin! Or so I thought; the GUMS website still doesn't think I'm an admin.
To start the mysql server without password verification:
$ mysqld_safe --skip-grant-tables &
Elizabeth's instructions said to do the following in the GUMS homepage:
1) Add the RSV DN to "Manual User Group Members".
(*) was already done
2) In "User Groups", ensure rsv is present with settings: type=manual, name=rsv, permissions='read all'.
(*) the permissions were set to 'read self'; I've changed it
3) In "Group To Account Mappings", for rsv set: user_groups=rsv, account_mapper=rsv.
(*) it was all good
4) In "Host To Group Mappings", add rsv to the list.
(*) it was already there
I restarted the mysql server:
$ kill `cat /var/run/mysqld/mysqld.pid`
$ mysqld_safe &
Then restarted GUMS:
$ service tomcat6 restart
It complained about not being able to correctly read its configuration. The /root/gums.config file didn't have the permissions change I made to rsv, so I manually changed it in the file. I restarted both mysql and tomcat. The configurations are still messed up. The real gums configuration files are in "/etc/gums", but I still don't know what's up. I've updated OSG.
cont. 03/27/2017 A small number of jobs ran over the weekend! I don't know what allowed them to, though, because all of the tests still fail. The GUMS page now provides an error message rather than a simple "cannot read configuration". It's just a Java OutOfMemoryError, which was probably caused by being inactive (unable to read configuration) for too long. Yeah, the configuration still isn't being read properly, and it's also reporting "Database error: cannot open connection", which is no change. "/var/log/gums/gums-developer.root.log" reports that the certificate /tmp/x509up_u0 has expired, and it appears to be the old hostcert (last modified March 13).
cont. 03/29/2017 OSG says that the Globus GRAM gatekeeper is outdated, and to update the system to HTCondor-CE, which replaces GRAM. They have thoughtfully provided a link to instructions:
(*) Disabled the GRAM gateway (/etc/osg/config.d/10-gateway.ini)
(*) Disabled worker node proxy renewal (/etc/blah.config)
(-) blah_disable_wn_proxy_renewal=yes
(-) blah_delegate_renewed_proxies=no
(-) blah_disable_limited_proxy=yes
(*) NOTE: the blah_disable_limited_proxy option was not found, so I added the line and option
`osg-configure -c` is complaining that it can't copy /etc/osg/grid3-locations.txt to /cmssoft/cms/etc/grid3-locations.txt because it's trying to write to /cvmfs/cms.cern.ch/..., which is a read-only filesystem mounted from CERN. Other than that, everything went smoothly. OSG then said to do a `fetch-crl`, which came back with some errors: some CRL verifications and retrievals failed. I've updated OSG.
cont. 03/31/2017 Ankit is here to save the day! When I changed the password for gums, I didn't update the password in "/etc/gums/gums.config". That has now been done. GUMS works again!
And the RSV tests are running too! Some are still broken, but I'm getting actual error messages now, so it's something to work with. Condor is running again! Jobs are running! Ankit saved the day!!! The key signal was that the database connection was being refused. HTCondor should not be set to "true" in /etc/osg/config.d/10-gateway.ini; it has been changed to "false" and now things work again. HTCondor-CE does have to be properly installed, however, and that is a great undertaking that will get its own log entry.

03/13/2017 TAGS: RSV critical
Basically all of the RSV tests are red because they run as jobs, and jobs aren't working at the moment.

03/20/2017 (Hannah) TAGS: Diagnostics Website OSG Map
Fixed the diagnostics OSG map link by editing the old link in ~/diagnostics/index.php.

03/21/17 (Hannah) TAGS: diagnostics website menu
Copied HTML from the CE to the SE so the drop-down menu works on mobile: dashboardCE.php -> dashboardSE.php. Attempted to reinstall storcli64 on NAS-1; just have to configure the software.

03/23/2017 TAGS: NAS-1 unmounted
NAS-1 has decided to unmount itself from the CE, and I can't mount it back. I also can't mount it on the nodes or the SE. The storage partition of NAS-1 is mounted on NAS-1 itself just fine. I'm going to try restarting the cluster. That fixed it.

03/24/2017 TAGS: NAS-1 drive failure beeping
NAS-1 was beeping because drive 0:10 (as read on the casing; "28:10" as read in the software) has failed (the red LED was on). I've installed the MegaCli64 software from http://www.avagotech.com/support/download-search : extracted the .zip using `unzip`, then navigated to the Linux directory and followed the instructions there. The command installs to "/opt/MegaRAID/storcli/storcli64". I created an alias for the command called 'storcli'. `storcli /c0 show` shows the condition of the RAID, and drive 0:10 has indeed failed. To turn the alarm off:
$ storcli /c0 set alarm=off
I'm going to replace the drive when I come back this afternoon. To remove the drive with storcli:
$ storcli /c0/e28/s10 set offline
(*) the left-most column of `storcli /c0 show` lists the drive names in 'enclosureID:slotID' format
$ storcli /c0/e28/s10 set missing
$ storcli /c0/e28/s10 spindown
(*) spins down the drive and makes it safe for removal
The drive can now be safely removed. Once the new drive is in place, it should automatically start rebuilding. If the drive's status doesn't change to "Rbld", the rebuild can be started manually with `storcli /c0/e28/s10 start rebuild`. The rebuild status can be monitored with `storcli /c0/e28/s10 show rebuild`. The rebuild has begun.
cont. 03/27/2017 The rebuild succeeded! Drive 28:13 (front drive 13) has now failed. I've replaced it, and the drive is now rebuilding.

03/27/2017 Hannah fixing the dropdown menu, part 2 (continued from 03/21/2017)
Info on what was switched out in ~/diagnostics/dashboardSE.php and so on is available at ~/diagnostics/mobiledropdownfix.pdf. The dropdowns all work now.

04/03/2017 TAGS: HTCondor-CE install
It is time to install and configure HTCondor-CE! I enabled HTCondor and disabled GRAM in '/etc/osg/config.d/10-gateway.ini'. I changed 'ce-type' in '/etc/rsv/metrics/uscms1.fltech-grid3.fit.edu/allmetrics.conf' from "gram" to "htcondor-ce". I ran `gums-host-cron` to generate a new user-vo-map file.
NOTE: `globus-job-run` can be used to send jobs to the cluster via the grid; `globus-job-run uscms1.fltech-grid3.fit.edu:2119 /bin/hostname` can be used to test jobs.
Jobs run with the above configuration, but RSV doesn't work.
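NOTE: a quick sanity check that the CE service is actually up and that something is listening on its port (standard EL6 tools; 9619 is the HTCondor-CE port):
$ service condor-ce status
$ netstat -tlnp | grep 9619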
The RSV tests are running on port 9619, which is the port for HTCondor-CE, which it was told to use. Port 2119 is the globus gatekeeper port, so jobs are supposed to be sent there (probably). I restarted rsv with `service rsv restart`. I'm thinking port 9619 isn't open.
NOTE: the service name of HTCondor-CE is "condor-ce".
`condor_ce_run` is used to test HTCondor-CE. `condor_ce_run uscms1.fltech-grid3.fit.edu:9619 /bin/hostname` was what the HTCondor-CE troubleshooting page said to run, but it just says that the address of the schedd cannot be found. By changing the port number to 9618, however, an authentication error is thrown. `condor_ce_trace --debug uscms1.fltech-grid3.fit.edu` says that the CE's schedd could not be found.
cont. 04/05/2017 `condor_ce_run` cannot be run by root for security reasons, so I switched to my account. I gave myself the proper usercert.pem and userkey.pem so I could `grid-proxy-init`, and I ran `condor_ce_run`. It is hanging; I'm not sure if it's supposed to or not, so I'm gonna let it run for a while. `globus-job-run` does not work on my account. `condor_ce_trace` still reports that the schedd is not found. `condor_ce_trace` shows the 'Remote Mapping:' to be 'unauthenticated@unmapped' on both my account and root. A complete `condor_ce_trace`, an example of which is given on the HTCondor-CE troubleshooting page, looks an awful lot like the output of a SAM test. Huh! The website says that if 'Remote Mapping:' behaves in this way, authentication needs to be configured. It looks like I've already done everything, though! The "Authorization with GUMS" section of the HTCondor-CE install page just says to add:
authorization_method = xacml
gums_host = uscms1.fltech-grid3.fit.edu
to the /etc/osg/config.d/10-misc.ini file. Maybe the usercert I'm using also needs to be in GUMS? I tried restarting RSV and it complained that 'condor-cron' wasn't running, so I turned it on. RSV started, but gave some errors:
ERROR: Command returned error code '256': 'condor_cron_q -l -constraint 'OSGRSVUniqueName=="uscms1.fltech-grid3.fit.edu__org.osg.local.hostcert-expiry"''
ERROR: Could not determine if job is running
The first error is strange because the hostcert-expiry test is one of the few RSVs that are green. The htcondor-ce.job-routes RSV complained that it could not ping the CE. Ping printed the following to stderr:
ping: sendmsg: Operation not permitted
Hmm, that's disconcerting. Perhaps there are some faulty permissions somewhere? The error message means that the CE is not allowed to send ICMP packets. The site recommended that I mess with the chain policies in iptables, but the INPUT chain for port 9619 is already set to ACCEPT. RSV keeps on complaining that condor-cron isn't running whenever I restart it, although it has, in fact, highkey been running. If condor-cron is left alone after an RSV restart, RSV will continue to throw a fit until condor-cron is restarted.
cont. 04/07/2017 Squid was not running, and starting it failed. What's up with this, now? The periodic 'gums-host-cron' is also disabled. 'gums-host-cron' is a script that is supposed to keep the CE synced with GUMS. I manually ran `gums-host-cron --gumsdebug` and everything went smoothly. There are several targets of the INPUT chain that are set to REJECT with the error message set to "icmp-port-unreachable"; this is the default error message for the REJECT setting. None of the rejected ports are ones I need, though; 9619 is labelled as ACCEPT. NetworkManager was also stopped, so I started it.
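NOTE: rule order matters in iptables, so to see whether any of those REJECT targets sit in front of the port 9619 ACCEPT, the INPUT chain can be listed with rule numbers:
$ iptables -L INPUT -n --line-numbers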
Several old rsv jobs are held in the condor-cron queue. Whenever I ran `condor_ce_trace`, the ping command always used the READ authorization level. I ran
$ condor_ce_ping -verbose WRITE
on my account to test the WRITE authorization level. Lo and behold, my remote mapping was correctly authenticated! Why isn't the WRITE level used by condor_ce_ping during condor_ce_trace? The condor_ce commands sometimes decide to not work for brief periods of time; the ping breaks.
cont. 04/10/2017 I've officially added port 9620 to the iptables list of accepted ports with:
$ iptables -A INPUT -p tcp --dport 9620 -m conntrack --ctstate NEW,ESTABLISHED -j ACCEPT
$ service iptables save
"condor_ce_ping" has decided to keep failing. Upon further investigation, it was discovered that port 9619 is closed (`nmap -p 9619 localhost`), which is strange because `iptables -L` says that it's set to ACCEPT from anywhere. 'nmap' only lists a port as "open" if both iptables allows traffic and a service is listening on that port. In '/etc/condor-ce/config.d/03-ce-shared-port.conf', the SHARED_PORT_ARGS variable was set to 9620, so I'm gonna try setting it to 9619. I restarted condor-ce. `lsof` still does not report anything listening on port 9619. I changed '/etc/condor-ce/config.d/03-ce-shared-port.conf' back to its original state. Strangely enough, `condor_ce_ping -verbose WRITE` works just fine when I run it from my account. condor-ce is configured to use port 9619 (according to '/etc/condor-ce/condor_config'). I scoured the Hypernews and found an article talking about HTCondor-CE. They mentioned that for HTCondor sites, a special line needs to be filled out on OSG. At https://oim.grid.iu.edu/oim/resourceedit?id=163 I edited the 'SAM URI' section to be "htcondor://uscms1.fltech-grid3.fit.edu". I have sent a ticket to OSG.
cont. 04/19/2017 'blah_delegate_renewed_proxies' in '/etc/blah.config' did not have any option selected, so I set it equal to "no". The command that uploads the info from GIP to OSG (`osg-info-services`) is failing:
GIP.Wrapper:WARNING osg_info_wrapper:516: The module /usr/libexec/gip/providers/storage_element timed out!
GIP.Wrapper:WARNING osg_info_wrapper:517: Attempting to kill pgrp 6250
Maybe the inability to upload our information is preventing HTCondor-CE from working? Some python module related to the SE keeps on timing out. Maybe it's trying to ssh? I copied the new ssh key to the SE. Nope, still dies. Maybe it's trying to get into all of the nodes? I'm still resetting their passwords. The ssh is good; let's try again. The connection timed out again, but not because of a failing module. I tried it again and the module failed this time. Huh.
cont. 04/24/2017 It appears that the information propagation procedure above is conducted by 'CEMon'. I looked up troubleshooting for it, and the page said that whether information is being sent up or not can be verified by visiting 'myosg.grid.iu.edu' > "Resource Group" > "Current GIP Validation Status". It says that our "GIP Validation Status" is "Could not get LDIF Entries". LDIF (LDAP Data Interchange Format) files are used to exchange data between LDAP directory servers (between us and OSG). There appears to be a syntax error in '/etc/init.d/glite-ce-check-blparser', found with `service --status-all`. 'glite' is associated with CEMon. The error is caused by the script assigning a string to a variable where an integer is expected. The CEMon troubleshooting page says to check '/var/log/glite-ce-monitor/glite-ce-monitor.log'. Unfortunately, it doesn't exist!
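NOTE: one way to narrow down the timeout might be to run the failing provider by hand (the path is the one from the GIP warning above) and time it:
$ time /usr/libexec/gip/providers/storage_element
(*) if it hangs here too, whatever it's stuck on should be visible with `ps` or `strace` in another terminal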
I checked '/var/log/gip/gip.log', because gip is closely related to CEMon, for errors and found several "CEMonUploader:ERROR" entries. It's just complaining that `osg-info-services` keeps on timing out. '/etc/gip/gip.conf' is empty; I'm not sure if there's supposed to be stuff in there or not.
cont. 04/26/2017 `osg-info-services` now refuses to even try to work; some python variable isn't getting initialized properly.
cont. 04/28/2017 OSG finally responded! I sent them the output of osg-system-profiler, and they said that our condor is outdated. They said to run `yum update condor\*` to update it. That command says condor is up to date, however. The new condor is excluded due to repository priority protections (viewable at yum debug level 3). Why is it excluded (probably for a pretty good reason), and can it be unexcluded (probably, but I wouldn't want to do that because it would mess up all kinds of stuff)? Huh, what do I tell OSG? The priority plugin is responsible for the priority protections; I'm going to check out its configuration.
cont. 05/08/2017 I'm gonna update OSG using the documentation provided by OSG. The documentation said to run:
$ rpm -e osg-release
(*) removes the old yum repositories
$ rpm -Uvh http://repo.grid.iu.edu/osg/3.3/osg-3.3-el6-release-latest.rpm
(*) installs the new repos for CentOS 6
$ yum clean all --enablerepo=*
(*) cleans the yum cache
$ yum update
(*) I did a normal `yum update` instead of a `yumUp` because the situation regarding antlr and GUMS may have changed. If GUMS still doesn't like the new antlr, though, I'll just downgrade it like normal.
They still don't agree, so I downgraded antlr and all is well. Now I'm gonna see how HTCondor-CE behaves:
$ condor_ce_run -r uscms1.fltech-grid3.fit.edu:9619 /bin/env
(*) run from my account (can't from root because security)
(*) reports: ERROR: Can't find address of schedd uscms1.fltech-grid3.fit.edu
$ condor_ce_trace --debug uscms1.fltech-grid3.fit.edu
(*) reports that 163.118.42.1:9619 cannot be pinged by condor_ce_ping
I tried running `osg-configure -v`, but it didn't go through. '/cvmfs' is empty, and some links are broken. Where did the stuff go?
cont. 05/15/2017 OSG wants the condor logs, the condor-ce logs, and both config dumps.
NOTE: to make a tarball: `tar -czvf archive.tar.gz path/to/compressed/directory`; to extract a tarball: `tar -xzvf archive.tar.gz`
cont. 05/17/2017 OSG said that some GSS failures were symptoms of a password-protected hostkey, which GSS doesn't support. The hostkey isn't password protected, but the userkey is, and the userkey in '~/Cluster_System_Files/Cert_Files/certs' is different from the one in '/etc/grid-security'. I updated the cert/key in '/etc/grid-security' and restarted 'condor', 'condor-ce', 'condor-cron', and 'tomcat6'.
cont. 05/19/2017 `condor_ce_trace --debug uscms1.fltech-grid3.fit.edu` successfully pinged the schedd! But it produced this error: "2017-05-19 17:12:00 Could not find an X509 proxy in /tmp/x509up_u502". Hmm, it looks like some certificate issues. I ran `voms-proxy-init`, which fixed that, but then I got a crazy uncaught exception. I've updated OSG.
cont. 05/21/2017 I'm gonna go through the HTCondor-CE documentation to see if anything is awry. Port 9619 is open according to nmap, but port 9620 is closed. `osg-configure -v` produces some errors:
(*) The 'app_dir' variable in '/etc/osg/config.d/10-storage.ini' is set to '/cmssoft/cms', which doesn't exist.
(*) "Option 'app_dir' in section 'Storage' located in /etc/osg/config.d/10-storage.ini: The app_dir and app_dir/etc directories should exist and have permissions of 1777 or 777 on OSG installations." '/cmssoft/cms' is a broken symlink pointing to '/cvmfs/cms.cern.ch/' which doesn't exist. There is nothing in '/cvmfs'. I'm going to try to add the cvmfs repository again (according to the documentation). The '/etc/yum.repos.d/cernvm.repo' file was not found, so I downloaded it with `wget -O /etc/yum.repos.d/cernvm.repo http://cvmrepo.web.cern.ch/cvmrepo/yum/cernvm.repo` I also downloaded the GPG key for the repository: `wget -O /etc/pki/rpm-gpg/RPM-GPG-KEY-CernVM http://cvmrepo.web.cern.ch/cvmrepo/yum/RPM-GPG-KEY-CernVM` I tried to install the proper software using the repositories with `yum install cvmfs cvmfs-init-scripts`, but it said they were already installed. The documentation recommended `cvmfs_config chksetup` to verify the setup, and it reported several errors. It turns out none of the cvmfs configuration files are not readable for some reason. `cvmfs_config showconfig cms.cern.ch` reports: "required configuration repository directory does not exist: /cvmfs/config-osg.opensciencegrid.org/etc/cvmfs" along with a long list of empty configuration variables, only some of which are filled out. The documentation said to try to mount cvmfs to rule out autofs issues: $ mkdir -p /mnt/cvmfs $ mount -t cmvfs cms.cern.ch /mnt/cvmfs But it said that 'cvmfs' was an unknown filesystem. cont. 05/23/2017 Further wiki pages about cvmfs debugging never show the "configuration repository" message, they only check to see if it'll mount. I need to find out what a "configuration repository" is. cont. 05/24/2017 According to some documentation, there are two parameters for cvmfs that I think are assignable in '/etc/cvmfs/local.config'. 'CVMFS_CONFIG_REPOSITORY' determines where the configuration repository is stored, and 'CVMFS_CONFIG_REPO_REQUIRED' determines whether cvmfs should check for a configuration repository. That didn't work, so I posted to the 'T3 Discussion' forum on Hypernews. cont. 05/25/2017 Dave Dykstra responed to the T3 Discussion post, and he said he got the same thing on his test machine. For him, a cvmfs2 process was stuck running for the config-osg repository even though it wasn't shown as mounted. I tried to kill a rouge cvmfs process (revealed by `ps aux | grep cvmfs`), but its PID seemed to be continually changing. cont. 05/26/2017 Dave says that maybe the problem is that 'cvmfs' is not in the 'fuse' group. 'cvmfs' is in the fuse group, however. cont. 05/30/2017 Dave said to try a restart, so I'm trying that. I looked up the error message and someone with a similar problem tried `lsmod | grep fuse`, but they recieved no output, while we do. We do not, however, get ouput from `modprobe fuse`. '/etc/fuse.conf' is set to allow others. cont. 06/01/2017 Dave solved the issue! Turns out 'cvmfs' is hard coded to deal with lines in '/etc/group' that are only 16K in size, but ours had lines over 45K. He discovered this by using 'strace' to look for relevant system calls. Since he knew the issue was with the 'fuse' group, he monitored reading from '/etc/group', and he saw that the whole file wasn't being read. So he made and installed a development version of 'cvmfs' that allocates a line buffer of variable size. 'cvmfs' is mountable again! Unfortunately, 'condor_ce_trace' is still failing due to a failure to ping. Man! 
I tried running `osg-configure -v` again, and it succeeded, but `osg-configure -c` failed:
[root@uscms1 ~]# osg-configure -c
WARNING Can't copy grid3-location file from /etc/osg/grid3-locations.txt to /cmssoft/cms/etc/grid3-locations.txt
CRLs exist, skipping fetch-crl invocation
ERROR Option 'glexec_location' in section 'Misc Services' located in /etc/osg/config.d/10-misc.ini: Can't use glExec because LCMAPS glExec plugin not installed. Install lcmaps-plugins-glexec-tracking or unset glexec_location
CRITICAL Can't configure module, exiting
Can't configure module, exiting
You may be able to get more details rerunning /usr/sbin/osg-configure with the -d option and/or by examining /var/log/osg/osg-configure.log
I'm trying to install 'glexec', but the packages I'm trying (from the suggestion above and the glexec twiki page) don't seem to exist. I've run a `yumUp`, and it updated 'cvmfs'; I may have just undone what Dave did. I highkey totally did. Fortunately, I know what files he accessed, so maybe I can find what he changed. I fixed it! Dave left the rpm he modified in '/tmp', so I just force installed that 'cvmfs' rpm over the one I accidentally installed:
$ rpm -Uv --force /tmp/cvmfs-2.3.5-0.0.20170531211728.dwd.74e701106a94e88784e6049de792df0397fc0824git.el6.x86_64.rpm
Back to 'glexec'. 'glexec' is installed on the nodes, and according to the diagnostics page, they're all fine. I was able to install glexec by first installing osg-release, which somehow got uninstalled:
$ rpm -Uvh https://repo.grid.iu.edu/osg/3.3/osg-3.3-el6-release-latest.rpm
I then installed glexec with:
$ yum install osg-wn-client-glexec
`osg-configure -c` now works! 'condor_ce_trace' is displaying the 'X509' error from before when run in '/tmp'.
cont. 06/09/2017 I am now registered for CMS on my CERN certificate! I'm gonna replace all instances of my current usercert with a version of my CERN cert. Earlier log threads indicated that the usercert.pem and userkey.pem are to be stored in '/etc/grid-security', and the CERN certs are there. `voms-proxy-init -voms cms` doesn't work because it's not using my usercert; it's saying we're the cluster itself (I think it's using the hostcert). `voms-proxy-init`, though, works just fine. `grid-proxy-init` also works now, after I ran `grid-proxy-init -debug` to learn that I had to change the permissions of '~/.globus/usercert.pem' and '~/.globus/userkey.pem' to 600. 'condor_ce_trace' still reports an X509 error, though. Turns out `voms-proxy-init` is user-based: when I was running it as root, it created the proxy for root, which looks like the hostcert. I ran it as me, and it used my certificate just fine. `voms-proxy-init -voms cms` also knows I'm CMS! Something happened when I ran 'condor_ce_trace' from my account! It's sending connection requests to all 600 idle condor jobs. Not sure what it's doing, but it's doing something! Progress! Never mind, it's just trying to send one connection request with a timeout of 600 seconds. Now the error is that condor won't process the job. I've hard restarted condor and condor-ce, and I've run 'condor_ce_trace' again.
cont. 06/20/2017 Brian says that the jobs are sitting idle even though the CE has routed them. He cited the command `condor_ce_q -af:jh jobstatus routedtojobid`, which, when the condor-ce service is running, displays nothing more than a header.
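NOTE: for reading that `condor_ce_q -af:jh jobstatus routedtojobid` output, HTCondor's numeric JobStatus codes are 1 = Idle, 2 = Running, 3 = Removed, 4 = Completed, 5 = Held; for example, just the idle CE jobs can be listed with:
$ condor_ce_q -constraint 'JobStatus == 1'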
It looks like he saw some jobs, though, in the files I sent, and he provided a command for me to try that uses the numbers from the 'routedtojobid', although the IDs may be different because I've restarted since then. I'll wait a bit, then run the first command again to get new numbers. I told Brian about the 4-digit numbers.
cont. 06/22/2017 Brian sent me a link to a twiki page for troubleshooting the event of jobs remaining idle on the CE (our problem). The first step says to check '/var/log/condor-ce/JobRouterLog' for the text 'src=... claimed job'. I grepped that file for 'src=' and nothing was returned; that file is also full of reports saying that no jobs are being submitted via the only route. The idling jobs are not matching any routes. The twiki page recommends using 'condor_ce_job_router_info' to see what's up. `condor_ce_job_router_info -config` displays the routes that jobs will match to; we have only one route. I made a condorTest directory with all of the necessary items to run a test condor job, and I put it in my (Voytella) home directory. I sent the job to the condor queue, and it's idling as expected. I see the job when I run `condor_q`, but there are no jobs listed in `condor_ce_q`. The only jobs that appear to run on condor-ce are rsv jobs. I think the jobs are simply being routed to the old condor, which I had turned off in favor of HTCondor. I probably have to tell something to send the jobs to the new condor.
cont. 06/26/2017 I found a twiki page about HTCondor-CE job routes. It says that the configuration file for default values is '/etc/condor-ce/config.d/02-ce-condor.conf'. `condor_ce_config_val JOB_ROUTER_DEFAULTS | sed 's/;/;\n/g'` provides a list of all the settings. I tried looking at the job routing settings for the old condor, `condor_job_router_info -config`, but it said job routing was disabled. If routing's disabled, how are jobs getting queued? Maybe I have a misconception of what routing really is. I've discovered something! The traditional command 'condor_submit' sends the job to the old condor, whereas 'condor_ce_submit' will send it to condor-ce. The job is still idle, though.
From the page OSG sent me:
Verify Correct Operation between the CE and Local Batch System
Use `condor_ce_config_val -v <variable>` to verify that the "JOB_ROUTER_SCHEDD2_NAME, JOB_ROUTER_SCHEDD2_POOL, and JOB_ROUTER_SCHEDD2_SPOOL configuration variables are set to the hostname of your CE and the hostname of your local HTCondor's collector, and the location of your local HTCondor's spool directory, respectively." The first variable is just the hostname (probably good), the second one is the hostname with port 9618 (not sure if that's the correct port), and the third one is '/var/lib/condor/spool', which does exist and is not full. It also said to make sure that QUEUE_SUPER_USER_MAY_IMPERSONATE of the old condor is set to '.*' with `condor_config_val -v QUEUE_SUPER_USER_MAY_IMPERSONATE`. It is correctly set.
Make Sure the Underlying Batch System Can Run Jobs
It said to examine the ScheddLog. I opened it immediately after trying to submit a job from my (Voytella) account. I saw references to my job in the log, but it was trying to update the collector (the CE IP) on port 9619, rather than port 9618 as listed above. It says that if the underlying batch system (this might be the old condor) doesn't work, then HTCondor-CE will not work either.
Verify Ability to Change Permissions on Key Files
There are no permission errors in the logs, so I don't think this is the problem.
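NOTE: for reference, the condorTest job above boils down to a minimal submit description file along these lines (the executable and output file names are just stand-ins; the log file name matches '~/condorTest/prog.log'):
# ~/condorTest/submit
executable = prog.sh
output     = prog.out
error      = prog.err
log        = prog.log
queue
(*) sent to the local pool with `condor_submit submit`, or through the CE with `condor_ce_submit submit`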
The log file for the job (in the directory from which the job was sent, in this case '~/condorTest/prog.log' (Voytella)) was complaining that HTCondor-CE held the job because of a missing user proxy. I ran `voms-proxy-init` and resubmitted the job; it's still idle, but it's not being held. While troubleshooting, an RSV job got submitted and ran just fine. What's the difference between the RSV jobs and the others? After a few minutes, the submitted job complained of a missing proxy again.
cont. 06/30/2017 Brian said to try increasing the debug level by placing 'ALL_DEBUG = D_FULLDEBUG' in '/etc/condor-ce/config.d/99-local.conf' and saving the changes with `condor_ce_reconfig`. I added the line, but when I tried to reconfigure, I got this error:
Can't find address for local master
Perhaps you need to query another pool.
cont. 07/10/2017 I found a post from someone receiving the same message. They found somewhere that a cause could be that CONDOR_HOST is not set, and ours isn't set in the environment. CONDOR_HOST is, however, set to "$(FULL_HOSTNAME)" in '/etc/condor-ce/condor_config', so I don't think that's the problem.
cont. 07/12/2017 When I run `condor_q`, a lengthy error message appears that says that either the condor_schedd is not running, the SCHEDD_NAME is not defined in condor_config, or something is wrong with the SCHEDD_ADDRESS_FILE. The schedd just had to be started with `condor_schedd`; `condor_q` now works fine. I tried submitting the job to condor-ce again from my (Voytella) home directory, but it's failing to submit:
ERROR: Can't find address of local schedd
The schedd for condor-ce must also be started. Never mind, condor-ce just wasn't turned on; now I can submit jobs from my account like before. They still go idle, then held. I tried `condor_ce_reconfig`, and it went through; turning on the schedd probably fixed that. The JobRouterLog for condor-ce is spouting things, so I'm waiting for it to finish before I begin the trace. It hasn't stopped printing nonsense, so I'm gonna start the trace anyway. The trace isn't providing any new information; it's just saying that the schedd address cannot be found. This is the central problem, since the schedd is responsible for sending the jobs off to the nodes for processing. I've updated OSG.
cont. 07/18/2017 Brian said that the job router is complaining about contacting the HTCondor schedd that is used to submit jobs to the HTCondor backend, rather than complaining about the HTCondor-CE schedd. He also said to make sure that the 'condor' service is running on the CE host and that jobs can be submitted to the pool from the CE via condor_submit/condor_run. I submitted a job to condor from my (Voytella) account, `condor_submit submit`, and '/var/log/condor/SchedLog' reported:
07/18/17 12:48:40 (pid:1894617) Failed to send RESCHEDULE to unknown daemon:
07/18/17 12:48:40 (pid:1894617) attempt to connect to <163.118.42.1:9618> failed: Connection refused (connect errno = 111).
07/18/17 12:48:40 (pid:1894617) ERROR: SECMAN:2003:TCP connection to collector uscms1.fltech-grid3.fit.edu failed.
07/18/17 12:48:40 (pid:1894617) Failed to start non-blocking update to <163.118.42.1:9618>.
I turned on the condor service with `service condor start`, and new errors appeared in the SchedLog:
07/18/17 13:06:27 (pid:1894617) DC_AUTHENTICATE: Command not authorized, done!
07/18/17 13:06:27 (pid:1894617) PERMISSION DENIED to unauthenticated@unmapped from host 10.1.1.1 for command 416 (NEGOTIATE), access level NEGOTIATOR: reason: cached result for NEGOTIATOR; see first case for the full reason
cont. 07/19/2017 I'm following the debug instructions at http://research.cs.wisc.edu/htcondor/CondorWeek2004/presentations/adesmet_admin_tutorial/#DebuggingJobs
1) condor_q
Nothing seemed terribly off with anything in this step.
2) User Log
If I submit the job to HTCondor-CE (condor_ce_submit), it complains of a missing user proxy. If I submit to regular condor (condor_submit), it just says that the job submitted.
3) ShadowLog
The last entries were May 10; no new entries means there are matching problems.
4) Matching Problems
The errors in the SchedLog are posted above (bottom errors). The NegotiatorLog shows the phases that are completed:
07/19/17 13:53:28 Phase 1: Obtaining ads from collector ...
07/19/17 13:53:28 Getting startd private ads ...
07/19/17 13:53:28 Getting Scheduler, Submitter and Machine ads ...
07/19/17 13:53:28 Sorting 166 ads ...
07/19/17 13:53:28 Got ads: 166 public and 160 private
07/19/17 13:53:28 Public ads include 5 submitter, 160 startd
07/19/17 13:53:28 Phase 2: Performing accounting ...
07/19/17 13:53:28 Phase 3: Sorting submitter ads by priority ...
07/19/17 13:53:28 Phase 4.1: Negotiating with schedds ...
It is the schedd negotiation that fails:
07/19/17 13:53:28 SECMAN: FAILED: Received "DENIED" from server for user unauthenticated@unmapped using method (no authentication).
07/19/17 13:53:28 ERROR: SECMAN:2010:Received "DENIED" from server for user unauthenticated@unmapped using method (no authentication).
07/19/17 13:53:28 Failed to send NEGOTIATE command to osg@fltech-grid3.fit.edu (<10.1.1.1:9711?addrs=10.1.1.1-9711>)
07/19/17 13:53:28 Error: Ignoring submitter for this cycle
These errors are repeated for users glow, Voytella, grid0004, and vbhopatkar. Clearly, there are severe matching problems.
cont. 07/21/2017 Brian wants to know the output of `condor_config_val QUEUE_SUPER_USER_MAY_IMPERSONATE`, and whether the local condor collector is on uscms1.fltech-grid3.fit.edu and listening on 9618. The output of the command is:
# condor_config_val -v QUEUE_SUPER_USER_MAY_IMPERSONATE
QUEUE_SUPER_USER_MAY_IMPERSONATE = .*
# at: /etc/condor/config.d/99-condor-ce.conf, line 1
# raw: QUEUE_SUPER_USER_MAY_IMPERSONATE = .*
cont. 07/24/2017 I searched the condor files in /etc for ports 9618 and 9619.
9618:
condor-ce/config.d/50-osg-configure.conf:JOB_ROUTER_SCHEDD2_POOL=uscms1.fltech-grid3.fit.edu:9618
9619:
condor-ce/condor_config:PORT = 9619
condor-ce/config.d/03-ce-shared-port.conf::SHARED_PORT_ARGS= -p 9619
condor-ce/config.d/10-ce-collector-generated.conf:CONDOR_VIEW_HOST = collector1.opensciencegrid.org:9619:9619,collector2.opensciencegrid.org:9619:9619
The condor collector appears to be running on both port 9618 and 9619:
# netstat -tulpn | grep 9618
tcp 0 0 0.0.0.0:9618 0.0.0.0:* LISTEN 1751494/condor_coll
udp 0 0 0.0.0.0:9618 0.0.0.0:* 1751494/condor_coll
# netstat -tulpn | grep 9619
tcp 0 0 0.0.0.0:9619 0.0.0.0:* LISTEN 1902864/condor_shar
udp 0 0 0.0.0.0:9619 0.0.0.0:* 1902866/condor_coll
cont. 07/27/2017 I ran `condor_q -analyze`, and it showed that for the jobs, "Request has not yet been considered by the matchmaker." It recommends looking at the StartLog on the nodes. The 'StartLog' of compute-1-0 is full of the same error:
attempt to connect to <163.118.42.1:9618> failed: Connection refused (connect errno = 111).
ERROR: SECMAN:2004:Failed to create security session to <163.118.42.1:9618> with TCP.|SECMAN:2003:TCP connection to <163.118.42.1:9618> failed.
Failed to start non-blocking update to <163.118.42.1:9618>.
`condor_status -any` shows the collector as '"OSG Cluster Condor at fltech-grid3.fit.e' with the 'du"' cut off. I'm not sure if that's just a display issue or something more. Brian has asked me to do the following:
1) Set 'ALL_DEBUG = D_FULLDEBUG' in /etc/condor/config.d/99-local.conf
(*) '99-local.conf' was not present, so I created the file and put that line in
2) Run `condor_reconfig`
(*) it ran successfully
3) Verify that your user proxy is still valid
(*) on my (Voytella) account, `grid-proxy-init` and `voms-proxy-init` run without issue
4) Run `condor_ce_trace -d uscms1.fltech-grid3.fit.edu`
(*) I ran it from Voytella
5) Wait for the job to go on hold or the trace command to time out
6) Attach /var/log/condor/SchedLog and /var/log/condor/CollectorLog
cont. 07/30/2017 Eduardo responded to the Hypernews post. He confirmed that grid authentication is, in fact, working, and that the problem is with the configuration of the local scheduler (condor, not condor-ce). Since 'condor_submit' worked in the past, he said to check the changes to the condor configuration in '/etc/condor/config.d'. He's also wondering if condor is installed on the nodes. I've sent him the contents of the recently changed configuration files. Brian doesn't see any evidence of the 'condor_ce_trace' in the SchedLog I sent him. He wants me to check the SchedLog for the reason it didn't show up. I'm going to run the command again and check the log. The log had some interesting output: it said that the address for the startd could not be found, and that the NEGOTIATOR authorization policy contained no matching ALLOW entry for the request. I notified Brian.
cont. 08/01/2017 Marguerite from HyperNews had a similar problem at the Maryland cluster. She thinks the problem is due to the version of condor updating along with the change to HTCondor-CE. She says to:
1) Make sure everything is running the same version of condor (`condor_q -version`).
2) Make sure the firewall is open between all the nodes on the appropriate ports.
3) Add the following (she put it in '/etc/condor/config.d/cluster.conf'):
# Here you have to use your network domain, or any comma-separated list of hostnames and IP addresses including all your condor hosts. * can be used as a wildcard.
ALLOW_WRITE = *yourInternalNetwork, 10.1.0.*, SomeIPNumberOfYourCE, name of your CE, fltech-grid3.fit.edu;
### next four lines needed for condor 8.4.8 that came with OSG 3.3
ALLOW_NEGOTIATOR = *fit.edu;, firstIPNumbersForYourPublicNetwork.*
ALLOW_NEGOTIATOR_SCHEDD = $(ALLOW_NEGOTIATOR)
HOSTALLOW_NEGOTIATOR_SCHEDD = $(HOSTALLOW_NEGOTIATOR_SCHEDD), $(HOSTALLOW_WRITE)
HOSTALLOW_WRITE = $(ALLOW_WRITE)
I yum updated the CE and nodes.
cont. 08/03/2017 According to `condor_q -version`, the CE is running version '8.4.11 Feb 24 2017', while the nodes are running version '8.2.10 Oct 27 2015'. Feb 24 is about when the jobs died, which now makes sense. I just did a yum update, so how do I get the updated version of condor? Maybe I also have to update OSG on the nodes. The OSG version on the CE is '3.3.26', while the version on the nodes is '3.2.41'.
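NOTE: to compare versions across the whole cluster in one pass, something like the loop below works; the compute-1-* / compute-2-* hostname pattern and counts are assumptions based on the node names in this log, so adjust as needed:
$ for n in compute-1-{0..15} compute-2-{0..4}; do echo "== $n"; ssh $n 'condor_version | head -1; rpm -q osg-release'; done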
I'm gonna follow the directions to update OSG:
1) remove the old yum repos: `rpm -e osg-release`
2) install the OSG repos: `rpm -Uvh `
3) clean the yum cache: `yum clean all --enablerepo=*`
4) update the software: `yum update`
I'm going to first try these instructions manually on compute-1-0. If they work, I'll make a script for the rest of the nodes. The update went smoothly, but the problem persists. Nevertheless, I'm updating all of the nodes anyway. The update went smoothly.
'ALLOW_WRITE' and 'ALLOW_NEGOTIATOR' were already set properly in '/etc/condor/config.d/00personal_condor.config'. I added:
ALLOW_NEGOTIATOR_SCHEDD=$(ALLOW_NEGOTIATOR)
HOSTALLOW_NEGOTIATOR_SCHED=$(HOSTALLOW_NEGOTIATOR_SCHEDD), $(HOSTALLOW_WRITE)
HOSTALLOW_WRITE=$(ALLOW_WRITE)

cont. 08/04/2017
When I tried to check the status of condor, it said the subsys was locked. I restarted 'condor-ce' and 'condor-cron'. The internet isn't working, so I'll continue later today.

cont. 12/20/2017
Alright, now that NAS-0 is back online (mostly), let's resume trying to fix condor.

cont. 01/07/2018
I thought I turned some nodes on, so that I could work on it before the Physics Building opened up, but I guess not. RIP. I guess I'll just have to wait until tomorrow.

cont. 01/21/2018
Alright, now that NAS-0 is fixed FOR REAL this time, let's get crackin'. Jk, the nodes won't get power. *sigh* The output breakers for the plugs into which the node power strips are connected are acting up. So that I can continue to play with condor in spite of this strange issue, I only have five nodes (2-0 to 2-4) turned on. So far, the UPS seems to be alright with that.

cont. 01/22/2018
Time to play with condor. Let's start off with a classic 'condor_ce_trace' and see where we end up. First, I need to send off my new usercert. The instructions for converting a '.p12' to a '.pem' are found at [10/16/2015]. I copied both the new 'usercert.pem' and 'userkey.pem' to '/etc/grid-security'. I tried `condor_ce_trace -d uscms1.fltech-grid3.fit.edu`, and it told me that it couldn't connect to the CE; the collector daemon appears to be off. Yup, the collector daemon's down, verified by `condor_ce_status`. I did `service condor-ce start` to start it up. Now I'm getting all kinds of output from 'condor_ce_trace'. It's saying it's unable to create a temporary file in the working directory, '/root'. Imma try to run it as Voytella and see if I get anything different. Now it's telling me it can't find an X509 proxy in '/tmp/x509up_u14122'. That's because my user certificate is hella outdated. It says to just throw a copy of it and the key into '/home/Voytella/.globus'. Excellent! I've created a valid temporary proxy! Alright, now it's doing what it was doing before: querying every single idle job in the queue. '/var/log/condor/SchedLog' is also reporting a bunch of 'PERMISSION DENIED' errors like it was doing before.

cont. 01/26/2018
I'm going through the documentation sent by OSG. It says to look for "DC_AUTHENTICATE" and "PERMISSION DENIED" errors in '/var/log/condor-ce/SchedLog'. While I don't have those errors in the condor-ce SchedLog, they're all over the place in the condor SchedLog. The errors are also slightly different from what's described in the documentation. Alright, despite the documentation being for condor-ce, I'm gonna follow its directions to see what I can discover. First, it says to check GUMS or 'grid-mapfile' to ensure that my DN is known to my authentication method.
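A quick way to do that check for my own DN (a sketch; it assumes a proxy already exists from `voms-proxy-init`, so `voms-proxy-info` can report the identity):
$ voms-proxy-info -identity                                                # print my proxy's DN
$ grep -F "$(voms-proxy-info -identity)" /etc/grid-security/grid-mapfile   # see whether that DN is mapped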
I made sure that in '/etc/osg/config.d/10-misc.ini', 'authorization_method' was set to 'xacml' and 'gums_host' was set to our hostname. There is also a note that says that if the local batch system is HTCondor, it will attempt to use the LCMAPS callouts if enabled in '/etc/condor-ce/condor_mapfile', and if that's not the desired behavior, to set 'GSI_AUTHZ_CONF=/dev/null' in '/etc/condor-ce/config.d/99-local.conf'. The GSI thing wasn't set, so I set it. Imma try condor_ce_trace again and see what happens. Nothing seems to have changed. Oh, I forgot to `condor_ce_reconfig`. Now let's see if that does anything.
I ran the 'condor_ce_trace' command from my user account side-by-side with a `tail -f /var/log/condor-ce/SchedLog`. The 'condor_ce_trace' is doing the thing where it queries every single job to report that it's idle and sends a "connection request to schedd at <163.118.42.1:9619>". Every time it makes a new query, it writes the same thing to the SchedLog: the number of active workers is 0, plus something about forking workers and no more child processes to reap. I wonder if 'condor_ce_trace' writes anything to '/var/log/condor/SchedLog'. While there's a bunch of stuff being written to '/var/log/condor/SchedLog', it doesn't look like it's being caused by the 'condor_ce_trace'; it's just a bunch of the 'DC_AUTHENTICATE' and 'PERMISSION DENIED' errors.
NOTE: There are a TON of LCMAPS and GRAM-gatekeeper authentication errors in '/var/log/messages'.
Let's see what doing the GSI thing for regular condor does.
NOTE: In '/etc/condor/config.d', there's a mysterious '99-condor-ce.conf'. What's that doing there? There's also a '50-condor-ce-defaults.conf'. Maybe they're there so condor can talk to condor-ce? They just say that the super user can impersonate anything.
I made the GSI addition and reconfigured condor. Nothing new happened. The next thing it says is to look for LCMAPS errors in '/var/log/messages'. Oh hey! We're drowning in those! Let's investigate! It looks like the error starts with an authentication of a globus user, then it says it can't open the file '/etc/lcmaps/lcmaps.db'. That causes an LCMAPS plugin error, which prevents LCMAPS from initializing. Then that failure breaks everything else. Let's see about that file.
NOTE: LCMAPS (Local Credential MAPping Service) translates grid credentials to local Unix credentials.
Turns out there's only '/etc/lcmaps.db' and no 'lcmaps' directory. I'm gonna try to make that directory and throw the file in it. Now, in '/var/log/messages', a bunch of globus users got authenticated in a row without issue and some other stuff happened. Then it gave a warning about still being "root after the LCMAPS execution. The implicit root-mapping safety is enabled. See documentation for details.", and the next line said that "globus_gss_assist_gridmap() failed authorization" and that the callout returned an unknown error. I'm gonna see about debugging LCMAPS.
There's a whole page for troubleshooting LCMAPS on the wiki. First, it said to set up LCMAPS for maximum debugging by adding the following to '/etc/sysconfig/condor-ce':
export LCMAPS_DEBUG_LEVEL=5
export LCMAPS_LOG_FILE=/tmp/lcmaps.log
Then 'condor-ce' has to be restarted:
$ service condor-ce restart
It also says that disabling HTCondor-CE's caching of authorization lookups is a good idea for testing changes to mapfiles. To disable the caching, create '/etc/condor-ce/config.d/99-disablegsicache.conf' and insert
GSS_ASSIST_GRIDMAP_CACHE_EXPIRATION=0
then restart 'condor-ce'.
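In shell terms, those two debugging steps amount to something like this (a sketch, using only the paths and settings the guide gives):
$ cat >> /etc/sysconfig/condor-ce <<'EOF'
export LCMAPS_DEBUG_LEVEL=5
export LCMAPS_LOG_FILE=/tmp/lcmaps.log
EOF
$ echo 'GSS_ASSIST_GRIDMAP_CACHE_EXPIRATION=0' > /etc/condor-ce/config.d/99-disablegsicache.conf
$ service condor-ce restart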
NOTE: It says that disabling caching could increase the load on the CE (makes sense), so keep an eye on things to make sure nothing gets too out of control.
It gave me a list of configuration files in order of precedence:
/etc/grid-security/ban-mapfile (ban DNs)
/etc/grid-security/ban-voms-mapfile (ban VOs)
/etc/grid-security/grid-mapfile (map DNs)
/etc/grid-security/voms-mapfile (map VOs)
/usr/share/osg/voms-mapfile-default (map VOs default)
'/etc/grid-security/grid-mapfile' is full of grid mappings, but '/etc/grid-security/voms-mapfile' doesn't exist. Strangely enough, it says that LCMAPS is configured in '/etc/lcmaps.db', the file I thought (and it thought) was misplaced earlier. Huh. Either way, it gives me a bunch of stuff to make sure I have in it. It looks like it contains none of what it's supposed to have. Imma go through and add a bunch of stuff, then. Above the 'authorize_only' section, I added the 'gridmapfile', 'banfile', 'banvomsfile', 'vomsmapfile', 'defaultmapfile', and 'verifyproxynokey' parameters. It said to edit the 'authorize_only' section to exactly what it is now; I've commented out what was already there. It also said to make sure '/etc/grid-security/gsi-authz.conf' contains a certain line (that terminates with a newline), but that's already there (including the newline). That's the end of the document. Now let's see what happens. That globus_gss_assist_gridmap() is still failing.
Oh, turns out this troubleshooting guide I was following is just the tail end of the whole LCMAPS page. Imma run down it from the top and see what I can see. It says that to enable the LCMAPS VOMS plugin, I have to add the following to '/etc/osg/config.d/10-misc.ini':
edit_lcmaps = True
authorization_method = vomsmap
It also said to comment out 'glexec_location', and I've commented out the existing 'authorization_method'. It says that a Unix account must be created for each VO, VO role, VO group, and user that I wish to support. I'm not sure if that means every single user in '/usr/share/osg/voms-mapfile-default' or not, because that's a bunch of users. I can probably ask OSG about that. It says the 'allowed_vos' parameter in '/etc/osg/config.d/30-gip.ini' should be populated with the supported VOs per subcluster (worker node hardware) or resourceEntry (set of subclusters) section. Not entirely sure what it means by that, but our 'allowed_vos' is empty and commented out. I'll also ask OSG about that.

cont. 02/03/2018
They think we may not have the OSG version of LCMAPS. To see what version we have, I ran `rpm -q lcmaps`, and it told me we're running version 'osg33', while the latest is 'osg34'. Ah ha! I'll see about fixing that up. I've run a `yumUp`. That didn't cut it; I may have to do other things. Brian also said that I may not have run 'osg-configure', and he's right, I haven't! I've run `osg-configure -v`, and it gave me some info. It said I'll either have to specify a list of VOs or a '*' for the 'allowed_vos' option. It also said that I need to fix the 'gram_ce_hosts' option in '/etc/osg/config.d/30-rsv.ini', since GRAM is no longer supported (the whole reason for this debacle in the first place). In '/etc/osg/config.d/30-gip.ini', I've set 'allowed_vos' to '*'. I'll probably also have to make user accounts for all the VOs in '/usr/share/osg/voms-mapfile-default'. In '/etc/osg/config.d/30-rsv.ini', I edited 'ce_hosts' to just include HTCondor-CE, and I've commented out the 'gram_ce_hosts' setting.
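For my own reference, roughly what those stanzas look like after the edits (a sketch; the '[RSV]' header and the exact ce_hosts value are my reconstruction, while '[SE FLTECH-SE]' is the section already in our 30-gip.ini):
# /etc/osg/config.d/30-gip.ini
[SE FLTECH-SE]
allowed_vos = *

# /etc/osg/config.d/30-rsv.ini
[RSV]
ce_hosts = uscms1.fltech-grid3.fit.edu
; gram_ce_hosts = ...   (left commented out; GRAM is no longer supported)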
`osg-configure -v` gives me a "No allowed_vos specified for section 'Subcluster FLTECH'" warning, and a VO specification warning, saying that either a list of VOs or '*' must be given. I thought I had already taken care of that by modifying 'allowed_vos' in '/etc/osg/config.d/30-gip.ini'. Huh. I'll just go ahead with the `osg-configure -c` and keep these warnings in mind. The configure reported no errors, just the above warnings.

cont. 02/05/2018
OSG also said they wanted an updated `osg-system-profiler`, so I've started that off.

cont. 02/16/2018 (RIP, sorry OSG)
Since it's been so long, I've made a new `osg-system-profiler`.

cont. 02/17/2018
OSG says I've gotta make users for all of the entries in '/usr/share/osg/voms-mapfile-default', so Imma see about doing that. The new users have been created. I've run `osg-configure -c` again and got the following warnings:
WARNING No allowed_vos specified for section 'Subcluster FLTECH'.
WARNING In OSG 3.4, you will be required to specify either a list of VOs, or a '*' to use an autodetected VO list based on the user accounts available on your CE.
WARNING No allowed_vos specified for section 'Subcluster FLTECH'.
WARNING In OSG 3.4, you will be required to specify either a list of VOs, or a '*' to use an autodetected VO list based on the user accounts available on your CE.
WARNING Can't copy grid3-location file from /etc/osg/grid3-locations.txt to /cmssoft/cms/etc/grid3-locations.txt
CRLs exist, skipping fetch-crl invocation
The repetition of the first two warnings is most likely a result of `osg-configure -c` first running `osg-configure -v` and simply printing those warnings for both commands. The last warning, however, I have no explanation for.

cont. 02/20/2018
OSG said I forgot to set 'allowed_vos' to '*' under the '[Subcluster FLTECH]' section of '/etc/osg/config.d/30-gip.ini'; I had only done it in the '[SE FLTECH-SE]' section.

cont. 02/23/2018
Daniel said he fixed some condor stuff ([02/11/2018]), so let's try to run some condor jobs and see what happens. I submitted a job from my account, and it was immediately held.

cont. 02/24/2018
Since so much has changed, I'm going to run through the Condor troubleshooting documentation again to see what it says.

04/06/2017 TAGS: CE cannot ssh unresponsive
Vallary emailed me saying that she couldn't ssh into the cluster, and neither could I! Upon arriving at the high bay, I found the CE unresponsive; just the blue background was visible with the mouse. I power cycled the CE and it rebooted, but condor's not working. `condor_status` returns a communication error stating that it cannot connect to 163.118.42.1:9618. It stopped because /var is 100% full. /var/lib/globus is 3.3G and is full of strange condor files that were created yesterday and the day before. Some are several megabytes while some are empty. The files seem to contain entries for submitted jobs. I'm going to move all of the "condor.*" files to ~/globusCondorJunk and see if that breaks anything. I fully restarted condor, and all seems to be well. If it turns out that the "condor.*" files are indeed useless, then I'll delete them.

04/10/2017 TAGS: mass deletion of users
Users are being deleted in 24 hours. I made a file called ~/userdellist.txt that has all the info in it. The programs at the bottom of the list will stay for now; some of them are important.
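When the time comes, something like this should handle the removals (a rough sketch; it assumes the first field of each line in ~/userdellist.txt is a username and that the program entries at the bottom have been trimmed off first):
$ while read user _; do userdel -r "$user"; done < ~/userdellist.txt   # -r also removes each home directory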
04/11/2017 TAGS: node validation failure tmp full
OSG sent us a ticket a while ago (my email wasn't in the list, Ankit told me about it) saying that CMS and OSG glideins were failing node validation upon startup (https://ticket.opensciencegrid.org/32896). The CMS glideins are failing due to being unable to locate CMS software, and the OSG glideins are failing due to a full '/tmp'.
CMS Failing Nodes: compute-1-1, compute-1-3, compute-1-6, compute-2-1, compute-2-4, compute-2-5, compute-2-6, compute-2-7, compute-2-8
OSG Failing Nodes: compute-2-5, compute-2-6, compute-2-7, compute-2-8
The OSG failing nodes do, in fact, have a completely full primary partition, which is where '/tmp' is located.

cont. 04/12/2017
The problem was that '/scratch' was all filled up because it was the cvmfs cache. I moved the cvmfs cache from '/scratch' to '/var/cache/cvmfs' on all the nodes via a script ('~/Scripts/mvCvmfsCache.sh').

cont. 04/14/2017
The other problem was the CMS failing nodes. The listed nodes contain the script `/var/lib/condor/execute/dir_/glide_/discover_CMSSW.sh`.
NOTE: navigate to '/var/lib/condor/execute' then run `find . -name "discover_CMSSW.sh"` to locate the script.
It hangs upon execution. The script just looks for other scripts and executes them. If it doesn't find what it's looking for, it's supposed to say so. The script, however, doesn't seem to do anything. The discover script is only on some of the nodes listed, and it's not on any that are not listed.

04/13/2017 TAGS: home directory clean
Cleared out the home directory for root so it's usable.

04/14/2017 TAGS: condor not running diagnostics passwords required ssh
The diagnostics page reports that condor is not running on any of the nodes. All of a sudden, I need to enter passwords to ssh from root. Huh, that's strange. Turns out condor's fine, but the monitoring scripts need to ssh into the nodes, which they can't do now because ssh-ing requires passwords for some reason. Riley moved some of the ssh files around when he was reorganizing the home directory, so the CE's ssh keys have been slightly scrambled.

cont. 04/17/2017
Ankit said to investigate ROCKS; it made the ssh keys. The ROCKS documentation said that host-based authentication is controlled by '/etc/ssh/shosts.equiv'; the IPs of the cluster parts are all there. I created a brand new ~/.ssh directory and filled it with a public and private key generated with
$ rocks create keys ~/.ssh/id_rsa > ~/.ssh/id_rsa.pub
The new key was placed on NAS-1 with
$ cat ~/.ssh/id_rsa.pub | ssh nas1 "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
The new key was confirmed placed where it should be, but a password was still requested. Silly me, I didn't check id_rsa.pub for errors, of which there was one. I need to type the command correctly:
$ rocks create keys key=~/.ssh/id_rsa > ~/.ssh/id_rsa.pub
The key was created, and it was correctly put onto NAS-1, but it still doesn't work. Instead of using the rocks command to make the keys, I used the normal `ssh-keygen -t rsa` command, then sent the keys over with the normal command. For installing the new key on all of the nodes, I'm installing `sshpass`, which will allow for the automation of logging into all of the nodes. I added this to osg-node.sh:
cat ~/.ssh/id_rsa.pub | sshpass -p "" ssh -o StrictHostKeyChecking=no compute-fed-nad "mkdir -p ~/.ssh && cat > ~/.ssh/authorized_keys"
Be sure to comment out the normal ssh line! That worked for compute-2-*, but the passwords for compute-1-* are different. I will have to change them to the normal password.
cont. 04/18/2017
To change the root passwords of the other nodes, they must be powercycled and booted into single user mode. After the password has been changed, run `init 5` to resume normal operations. If the node hangs after `init 5`, powercycle it again and allow it to boot normally. I've changed compute-1-0 to compute-1-3 so far.

cont. 04/19/2017
The nodes, the SE, NAS-1, and NAS-0 all have the new keys.

04/19/2017 TAGS: gratia accounting osg website GRACC change no job count
OSG updated their grid monitoring software from Gratia to GRACC (GRAtia Compatible Collector). GRACC is compatible with all existing Gratia probes. Wall hours are being recorded for us, but there is no data for the job count.

04/24/2017 TAGS: squid not running
Squid wasn't running. I checked its status with `squid -k check` and it told me that it couldn't find the cache directory. That's because it was moved during Riley's spring cleaning. I changed the squid directories in '/etc/squid/customize.sh' from "ufs /root/squidAccessLogDump/cache 20000 16 256" to "ufs /root/Cluster_System_Files/squidAccessLogDump/cache 20000 16 256".

cont. 04/26/2017
'customize.sh' will hang, but it does, in fact, edit the file properly after some time. Squid is good again.

04/24/2017 TAGS: NAS0 diagnostics page
The NAS0 diagnostics page had been missing the top table for a while because a newline was missing at the end of /etc/cron.d/nas0chk. The newline was added, so it works now.

04/25/2017 TAGS: NAS1 yum update rpmforge gpg keys
NAS-1 was having some trouble yum updating due to non-existent rpmforge gpg keys. I had some trouble finding the keys, and I had to install a security update, so I just turned off the check for the keys by editing '/etc/yum.repos.d/rpmforge.repo'. I've turned the check back on for now.
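For the record, the edit amounts to toggling one line in that repo file (a sketch; only the gpgcheck line changes, and '[rpmforge]' is the stanza's usual name):
# /etc/yum.repos.d/rpmforge.repo
[rpmforge]
...
gpgcheck = 0    # set to 0 temporarily while the rpmforge GPG keys were missing, then back to 1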