01/09/2017 TAGS: APC UPS battery replacement light red
The battery replacement light on the APC UPS is red again. Since it's connected to everything except the nodes, I'm gonna wait before restarting it.
cont. 01/10/2017 I'm ready to turn everything off and restart the UPS. After the UPS restart, the red light turned off. Everything booted up properly.

01/09/2017 TAGS: NAS-1 NAS1 Curtis is helping
Curtis recommended that I try to mount the filesystem with the inode64 option enabled. On NAS-1, I ran:
$ mount /dev/sdc -o remount,rw,inode64 /nas1
`mount` reported that /nas1 was mounted with inode64, but I was still unable to write to it. Curtis says the server might be hitting the open file limits:
$ ulimit -Hn
1024
$ ulimit -Sn
1024
`ulimit` provides control over the resources allowed to the shell. -Hn shows the hard limit on open file descriptors and -Sn shows the soft limit. The hard limit cannot be raised by a regular user, while the soft limit can be raised up to the hard limit.
$ cat /proc/sys/fs/file-nr
1530 0 1021706
The current number of file handles open across all users is 1530, which exceeds the 1024 limit. I am going to increase the limit on open file descriptors available to root from 1024 to 4096 by editing /etc/security/limits.conf.
cont. 01/10/2017 The changes to /etc/security/limits.conf take effect when a new session is started. /proc/sys/fs/file-nr now shows a 0 where the 1530 used to be, but /nas1 is still "full".
cont. 01/11/2017 /proc/sys/fs/file-nr now shows 1020 and /nas1 is still unwritable. That didn't seem to work, so I'm gonna try Curtis' other test: boot NAS-1 into a CentOS 6 LiveCD and test /nas1 from there (NAS-1 is still on CentOS 5). /nas1 was still not writable from the LiveCD.
cont. 01/12/2017 Stefano suggested I check the size of all of the Trashes on /nas1.
cont. 01/12/2017 Daniel Campos is looking at NAS-1, and he's doing many things. We tried mounting /nas1 on the CentOS 7 LiveCD, and it worked! We could write to /nas1! Daniel says the filesystem probably ran into a bug and panicked, but that bug has been fixed in later versions. Because we're not ready for a system-wide update (which would break everything), we're gonna try to update just the part that we need to. Success! NAS-1 is fixed! When we deleted all of those files a while ago, it triggered a bug in the filesystem. In CentOS 7 that bug is fixed, so all is now well (mostly)!

01/12/2017 TAGS: update NAS-1
Daniel is gonna update NAS-1 to CentOS 7. Backup of old NAS-1 made with rsync:
$ rsync -aH ...
The update was successful.

01/12/2017 TAGS: SE not booting turning on
The SE is refusing to start properly. It boots to the CentOS 6.8 screen with the little loading lines at the bottom, but the white bar fills up and nothing happens afterward. The cluster seems to be working fine except for that, though.
cont. 01/13/2017 I checked the SE when I arrived and was greeted by the usual login screen. It appears to be working fine after all! Perhaps it was just taking extra time to turn on.

01/13/2017 TAGS: SE not ssh-able NAS-0 not mounted
I am unable to ssh into the SE, and NAS-0 is not mounted on it. I can ssh into the SE from my computer via the SE's IP address, but I can't ssh into it with the compute-0-0 designation used on the CE. The SE is unreachable on the local network. I've run out of time today, so I'm just gonna turn everything off for the power outage tomorrow and investigate further next week.
cont. 01/17/2017 Everything booted up properly.
The problem seems to be related to the SE's new, abnormally long boot time; it sits at the CentOS 6.8 loading bar for a long while. Pressing any key during the loading bar screen enables verbose mode. The screen was covered with CRL errors: the CRL for [...] was not retrieved, the 24h grace period had expired, and the CRL needed to be updated. The CRLs are tied to openssl, which could explain why ssh isn't working.
cont. 01/18/2017 `fetch-crl` is not finding the CRLs it needs; it's the command that's taking forever at boot. The SE was unable to resolve any mirrors for a yum update, so maybe it doesn't have internet access. Because it cannot resolve any mirrors, it is taking FOREVER to complete. I'm gonna let it do its thing and come back later.
cont. 01/20/2017 It has internet access because it pings 8.8.8.8 fine. Maybe the yum update is dependent upon some of the CRLs. I tried mounting NAS-0, but it doesn't work; the SE can't ping it either. The SE doesn't seem to be talking to the rest of the cluster at all. I can't ping the SE from the CE. It can probably only mount NAS-1 because NAS-1 isn't technically part of the cluster network; the SE is hardwired directly to NAS-1. Disrupting that direct connection does not seem to have affected anything. The SE can ping the nodes just fine. I'm gonna try investigating the ".info" files for all of the certificates in /etc/certificates. All of the .info files were put into /etc/certificates/infoList.txt. `fetch-crl` retrieves information based upon the .info and .crl_url files in /etc/certificates; the URLs from which the CRLs can be retrieved are listed in that trust anchor meta-data.
cont. 01/27/2017 Turns out there's nothing installed on NAS-1; emacs wasn't there. I ran a `yum install emacs` and it downloaded a whole bunch of stuff. Maybe everything's borked because NAS-1 doesn't have all of its software. Imma investigate the repositories from old NAS-1. The only discrepancy is the lack of the rpmforge repo on the current NAS-1, so I'm installing that. There are still some issues with `yum update` and `yum upgrade`; I'll investigate later.
cont. 02/01/2017 I discovered that a service called `NetworkManager` was turned off on the CE. I turned it on, and now the SE is ssh-able from the CE and NAS-0 can be mounted on the SE. I'm gonna restart the SE and see if anything's changed. All of these problems have been fixed! Make sure `NetworkManager` is turned on on the CE!

01/25/2017 TAGS: NAS-1 website RAID health check
After the NAS-1 update, the RAID health check on the website wasn't working; all of the website files were giving permission errors. The problem was that the CE's ssh key had to be put back onto NAS-1; the health scripts rely on being able to ssh into NAS-1 automatically.
$ cat ~/.ssh/id_rsa.pub | ssh user@123.45.56.78 "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
cont. 01/27/2017 Never mind, I didn't fix it. The red bars now say "bash:" instead of "permissions:". It's probably no big deal.
cont. 01/29/2017 The problem is that the script that checks NAS-1 is trying to use the `storcli64` command, which apparently doesn't exist. Maybe some software needs to be installed onto NAS-1 again so that the command can be used. Either that or the command is outdated.
cont. 01/30/2017 I'm trying to install `storcli64`, but every link I've found is dead.

01/25/2017 TAGS: NAS-1 attempted logins China
There have been repeated attempts to log into NAS-1 since it's been updated (about 100 attempts every few seconds).
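NOTE: a quick way to watch these attempts (assuming the stock CentOS sshd logging to /var/log/secure) and tally them per source IP:
$ grep 'Failed password' /var/log/secure | tail
$ grep 'Failed password' /var/log/secure | grep -o 'from [0-9.]*' | sort | uniq -c | sort -nr | head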
I looked up the IP (59.63.166.80), and it's from China. Well boys, I reckon this is it: cyber warfare, toe to toe with the Chinese. The IP report only gave me information about the hub through which IPs 59.62.0.0 - 59.63.255.255 are run.
cont. 01/27/2017 Daniel Campos recommended the software Fail2Ban. It bans IPs that repeatedly attempt to log in. I'm also looking up how to use the firewall.
cont. 02/03/2017 The attempts seem to have stopped. There have only been 2 failed login attempts in the past hour, from Bulgaria.
cont. 02/06/2017 JK, they just took a couple days off. I've installed fail2ban, and now I'm gonna learn how to use it. fail2ban has been installed and is in use. Get rekt! The default sshd filter seems to be doing the trick nicely.

02/01/2017 TAGS: SAM critical
Only 6 of the usual 15 SAM tests are visible, and almost all of those are red. It looks like the SAM website switched to a new kind of monitoring; anything before about two days ago simply isn't there. The CE monitoring has also switched flavors from CE to GLOBUS. The "cacert-expiry" RSV test has also gone into Warning. I'm checking the globus logs. The SAM site says the change (whatever it was) happened at about 11:00 (or 06:00) on 01/31/2017. The logs for that day, however, stop during the 03:00 hour; the later logs are missing (or were never written). The gram log for grid0004 has repeated "no job found" error messages from 01/30/2017 to 02/01/2017.
cont. 02/03/2017 Since I've fixed the SE, all of the SE SAM tests have gone green!
cont. 02/06/2017 Never mind, they went red again sometime yesterday.

02/03/2017 TAGS: trouble mounting NAS-1 from IP
Stefano is having trouble mounting NAS-1 from a specific IP, although he can mount it fine from a different one. I added the IP he wanted to /etc/exports on NAS-1 and applied the change with `exportfs -ra`.

02/03/2017 TAGS: OSG software missing
OSG sent me an email yesterday saying that some packages required by CMSSW are missing. I've installed the requested packages.
cont. 02/06/2017 Turns out I need to install the packages on more than just the CE, so I'm gonna install them on the nodes and the SE as well.

02/06/2017 TAGS: cluster shutdown script
I wrote a script that will properly shut down the entire cluster: ~/scripts/totalShutdown.sh. It can be run by simply typing `totalShutdown`.

02/06/2017 TAGS: APC UPS battery light red
The check battery light for the APC UPS was red again. I turned the cluster off and restarted the UPS. The red light did not turn back on, so I brought the cluster back online.

02/06/2017 TAGS: Daniel certificate expired
I was looking around the /var/mail files, and the most recent phedex mail was complaining that Daniel's certificate had expired. Daniel's certificate is still on the cluster somewhere! This could be the source of our problems!
cont. 02/10/2017 Today is certificate day: let's find 'em! The phedex mail said that phedex is still using Daniel's expired certificate. The phedex user on the SE doesn't have a home directory, and a cron job that checks the expiration date of the certificate looks at a file buried in that missing home directory. I tried running `voms-proxy-init -cert usercert.pem -key userkey.pem` to set the certificates as my own, but it failed because the CRLs are out of date. `fetch-crl` returned a bunch of CRL retrieval errors. On NAS-1, fetch-crl gets its information from /etc/certificates, but /etc/certificates doesn't exist on either the CE or the SE.
That's because they were reconfigured to be /etc/grid-security/certificates. I tried going to one of the URLs mentioned in a .crl_url file, and the URL is fine; the CRL downloaded. Maybe fetch-crl needs certificates to get the CRLs? I messed around with `certutil` a bit. `certutil -L` reports that the certificate/key database is in an old, unsupported format. It may be that certutil is pointing at the wrong directory, or that the faulty database is contributing to the fetch-crl issues. I came across a command to list all certificates (`certutil -d sql:$HOME/.pki/nssdb -L`), but ~/.pki/nssdb is empty! When I tried to run the command, I got the same error as before. That just means there aren't any NSS databases.
cont. 02/13/2017 `fetch-crl` on the CE also fails; it fails the slow way, the way the SE used to. I'm scouring the log files on the CE. Nothing glaring came up in /var/log/globus-gatekeeper.log. The wiki page for fetch-crl mentioned the installation and maintenance of CA certs; there is an application meant to maintain them, and there is a cron job for it. I ran
$ [ ! -f /var/lock/subsys/osg-info-services ] || /bin/sh -c 'perl -e "sleep rand 300" && http_proxy= /usr/sbin/osg-info-services'
on its own to see the output, and it gave me some juicy config files to look at: /etc/gip/gip.conf and /etc/osg/config.d/. It also talks a lot about condor and its users, so it probably has something to do with authenticating users to use condor. /etc/osg/config.d is a folder full of .ini files; I'm searching them for any information about certificates. The cacert-expiry RSV test is red, so I think I'm looking in the right direction. The RSV test says that the CA "UNLPGrid" is out of sync. I tried running `osg-ca-certs-updater` on its own to see what happens; it said its update succeeded. I ran the critical RSV test by hand, and it still failed. Maybe the missing CA is part of antlr (the thing we refuse to update because it breaks GUMS). Let's see what happens when we update it! Nothing changed, so I downgraded antlr back to where it normally is. I tried running `osg-ca-certs-updater` on the SE, and it said none of the hosts for the certificates could be resolved! I manually checked the URLs, and they worked; the SE does have internet access. The same errors appeared when I tried a yum update. I fixed it by adding "nameserver 8.8.8.8" to /etc/resolv.conf. fetch-crl on the SE now results in only 2 failed CRLs! fetch-crl on the CE results in the same 2 failed CRLs. The `voms-proxy-init -cert usercert.pem -key userkey.pem` from before now works and recognizes me. I found Daniel's old certificate! It's in phedex's home directory on the CE. The script in that directory, `phedex_proxy_update.sh`, is what's been running and returning the Daniel errors. I'm working on updating the certs in that directory.
cont. 02/15/2017 Time to update the certificates! There was some crazy encryption to protect the certificate password. To encrypt a text file:
Generate a public-private key pair.
$ openssl genrsa -out key.pem 1024
Extract the public key.
$ openssl rsa -in key.pem -pubout > key.pub
Encrypt the text file with the public key.
$ cat federer.txt | openssl rsautl -encrypt -pubin -inkey key.pub > federer.txt.enc
The files key.pub and federer.txt can then be deleted (key.pem is needed to decrypt). To decrypt:
$ openssl rsautl -inkey key.pem -decrypt -in federer.txt.enc
I'm replacing the old usercert.pem and userkey.pem with my own info. I didn't know the old password, so the GRID passphrase had to be changed.
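NOTE: before dropping in a replacement usercert.pem/userkey.pem, a standard openssl sanity check is to confirm the cert and key actually belong together by comparing their moduli (the second command prompts for the key's passphrase):
$ openssl x509 -noout -modulus -in usercert.pem | md5sum
$ openssl rsa -noout -modulus -in userkey.pem | md5sum
(*) the two hashes should be identical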
Place the new usercert.pem and userkey.pem into .globus in phedex's home directory. `grid-change-pass-phrase` allows the passphrase to be changed. The passphrase works! But it says "Error: cms: User unknown to this VO." I tried it again using my GRID certificate rather than my CERN certificate, but got the same result.
cont. 02/22/2017 The SAM tests said that condor jobs were idle. `condor_q` reveals 366 idle jobs. Ah-ha! Problemo foundo! Why are the jobs idle? I tried fully shutting down and restarting condor, but the condor_q list remains the same. I searched the output of both `condor_q -analyze -verbose` and `condor_q -better-analyze -verbose` for the IDs of the idle jobs:
$ condor_q | grep I | awk -F' ' '{print $1}' > idleNums.txt
$ while read num; do grep $num condor_qAnalyzeVerbose.out condor_qBetterAnalyzeVerbose.out; done < idleNums.txt

[...] @uscms1.fltech-grid3.fit.edu:/mnt/nas1 works just fine for me; I've asked her to try it. Turns out she just forgot to capitalize '-P'.

03/04/2017 TAGS: emergency documentation
We had to turn off the cluster a couple nights ago for building maintenance. I have written new Emergency Cluster Documentation that provides updated instructions for shutting down and powering up the cluster.

03/08/2017 TAGS: condor idle
Condor is idle; no jobs are running. I restarted condor, but I suspect it's related to the SAM tests.
cont. 03/13/2017 The condor ShadowLog says that condor_shadow keeps exiting with status code 115, which doesn't exist according to Wisconsin's list of shadow exit codes. It exits with code 115 after a job fails (the job terminates with code 1), and with code 100 when the job runs properly (this is normal use). The log stopped being written to when the jobs stopped. The SchedLog, on 03/09/17, shows negotiation for users grid0004, osg, and glow, but no matches are being made in the local pool (16 rejections for grid0004, 17 rejections for osg, 4 rejections for glow). It also keeps reporting that there are 0 active workers. The NegotiatorLog illustrates what's written in the SchedLog. The globus-gatekeeper.log repeatedly reports GSS failures. I found documentation regarding the errors found in globus-gatekeeper.log. The file /etc/grid-security/grid-mapfile contains a list of basically everyone everywhere.
cont. 03/15/2017 I found an OSG ticket where someone else had a similar problem.
$ globusrun -a -r uscms1.fltech-grid3.fit.edu
(*) run from the SE, this returns the same error message as SAM 12: the expected name is the CE, but the SE is found
$ openssl x509 -text -noout -in /etc/grid-security/hostcert.pem
(*) brings up information about the hostcert
$ openssl rsa -text -noout -in /etc/grid-security/hostkey.pem
(*) returns info about the hostkey
There is a discrepancy in the Subject line, revealed by the commands above, between the new and old hostcerts: the old hostcerts have "uscms1" in the subject, while the new one has "uscms1-se". This may be contributing to the SAM 12 problem; it looks like a similar discrepancy. The OSG ticket's problem appeared to have been centered entirely around the hostcert. I updated our hostcert Feb 20, after the SAM tests had initially disappeared and some time before jobs stopped, so I don't think it's our primary concern. I sent off a ticket to OSG. (https://ticket.opensciencegrid.org/33031)
cont. 03/20/2017 Elizabeth from OSG wanted an update, so I told her that I suspect a hidden expired certificate is at fault.
cont. 03/22/2017 Elizabeth said that `osg-system-profiler` will provide a long printout of everything OSG-related.
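NOTE: since the working theory is a hidden expired certificate, a rough sweep for expiry dates is possible with plain openssl; this is just a sketch that assumes the interesting .pem files live under /etc/grid-security and the users' .globus directories:
$ for f in /etc/grid-security/*.pem /home/*/.globus/*.pem; do echo "== $f"; openssl x509 -enddate -noout -in "$f" 2>/dev/null; done
(*) any notAfter date in the past is an expired cert; key files simply print nothing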
NOTE: the osg-wn-setup.sh script will not work if root is logged into via `su -`.
I had forgotten to update the hostcerts on most of the cluster, so I did that. The problem began before the hostcert was updated at all, though. Maybe some of the problems that came about later were a result of using different hostcerts? After all of the new hostcerts were in place, I tried to restart the rsv service, as per the instructions, but it complained that the service condor-cron was not running because it was giving a "condor_master dead but subsys locked" error. I restarted condor-cron the same way I would force restart condor, and it's running again. Maybe that was the whole problem all along?
The `osg-system-profiler` command produces "/root/osg-profile.txt". The profile includes some lines from "/var/log/osg/osg-configure.log", and on Feb 26, about the same time jobs stopped running, it complained that the 'worker_node_temp' variable in "/etc/osg/config.d/10-storage.ini" was not set. Several other variables in that file were also incorrect, so I made the following changes:
worker_node_temp = /scratch
se_available = TRUE
default_se = uscms1-se.fltech-grid3.fit.edu
app_dir = /cmssoft/cms
data_dir = /mnt/nas1/osgDataDir
"/etc/osg/config.d/30-gip.ini" also appears to be misconfigured; the entire SE section was commented out! Since I got a new hostcert with these issues (probably) in place, the new hostcert may also be misconfigured; I may have to get a new one once these configurations are ironed out. NAS-0 could not be mounted on the SE, but I fixed it by adding it to /etc/hosts (it was already listed in /etc/fstab). I have edited /etc/osg/config.d/30-gip.ini to include information about the SE; I'm not sure if I did it correctly. /etc/osg/config.d/40-siteinfo.ini hasn't been changed since 2014, so I'm updating some of the information, namely contact information. The only strange thing related to certificates reported by osg-profile.txt is that we don't have an /etc/grid-security/voms/vomscert.pem certificate. I don't think we've ever had one before, so I don't think it's a big deal.
To save the changes to the OSG files:
$ osg-configure -v
(*) verifies that the changes are properly written
$ osg-configure -c
(*) saves the changes
The changes were successfully made! I restarted the SE to apply the changes I made to it. The RSV tests are still dying, so I'm gonna get a new hostcert and see what that does.
cont. 03/23/2017 I had been accidentally requesting hostcerts specifically for the SE; separate hostcerts must be requested for both the SE and the CE. I have obtained and distributed the new hostcerts, and I've restarted all of the relevant services, but RSV still dies. I've updated OSG.
cont. 03/24/2017 Elizabeth thinks the DN of the certificates may have changed, and she has provided some instructions to re-add them. The instructions require me to do things in the GUMS page, but I'm not registered as a GUMS admin. I tried to run the command to make the change:
$ openssl x509 -subject -noout -in ~/.globus/usercert.pem
(*) returns the DN of the current admin (probably you)
$ gums-add-mysql-admin "[output from above command minus 'subject= ']"
but it required the mysql gums password, which I don't know. I'm changing the password.
To change the mysql gums password:
$ kill `cat /var/run/mysqld/mysqld.pid`
(*) kills the mysql server
$ echo "SET PASSWORD FOR 'gums'@'localhost' = PASSWORD('newPass');" > /var/lib/mysql/mysql-init
(*) creates a file with a line to be run when the server is restarted (for MySQL 5.7.5 or earlier)
(*) be sure to remove the file once the server is all good (no need to have passwords lying around in plain text)
$ mysqld_safe --init-file=/var/lib/mysql/mysql-init &
(*) starts the mysql server and executes the line written in mysql-init on startup (make sure this file is owned by mysql)
The `gums-add-mysql-admin` worked, and I'm now a GUMS admin! Or so I thought; the GUMS website still doesn't think I'm an admin.
To start the mysql server without password verification:
$ mysqld_safe --skip-grant-tables &
Elizabeth's instructions said to do the following in the GUMS homepage:
1) Add the RSV DN to "Manual User Group Members".
(*) was already done
2) In "User Groups", ensure rsv is present with settings: type=manual, name=rsv, permissions='read all'.
(*) the permissions were set to 'read self'; I've changed it
3) In "Group To Account Mappings", for rsv set: user_groups=rsv, account_mapper=rsv.
(*) it was all good
4) In "Host To Group Mappings", add rsv to the list.
(*) it was already there
I restarted the mysql server:
$ kill `cat /var/run/mysqld/mysqld.pid`
$ mysqld_safe &
Then restarted GUMS:
$ service tomcat6 restart
It complained about not being able to correctly read its configuration. The /root/gums.config file didn't have the permissions change I made to rsv, so I manually changed it in the file. I restarted both mysql and tomcat. The configurations are still messed up. The real gums configuration files are in "/etc/gums", but I still don't know what's up. I've updated OSG.
cont. 03/27/2017 A small number of jobs ran over the weekend! I don't know what allowed them to, though, because all of the tests still fail. The GUMS page now provides an error message rather than a simple "cannot read configuration". It's just a Java OutOfMemoryError, which was probably caused by being inactive (unable to read configuration) for too long. Yeah, the configuration still isn't being read properly, and it's also reporting "Database error: cannot open connection", which is no change. "/var/log/gums/gums-developer.root.log" reports that the certificate /tmp/x509up_u0 has expired, and it appears to be the old hostcert (last modified March 13).
cont. 03/29/2017 OSG says that the Globus GRAM gatekeeper is outdated, and to update the system to HTCondor-CE, which replaces GRAM. They have thoughtfully provided a link to instructions:
(*) Disabled the GRAM gateway (/etc/osg/config.d/10-gateway.ini)
(*) Disabled worker node proxy renewal (/etc/blah.config)
(-) blah_disable_wn_proxy_renewal=yes
(-) blah_delegate_renewed_proxies=no
(-) blah_disable_limited_proxy=yes
(*) NOTE: the blah_disable_limited_proxy option was not found, so I added the line and option
`osg-configure -c` is complaining that it can't copy /etc/osg/grid3-locations.txt to /cmssoft/cms/etc/grid3-locations.txt because it's trying to write to /cvmfs/cms.cern.ch/..., which is a read-only filesystem mounted from CERN. Other than that, everything went smoothly. OSG then said to do a `fetch-crl`, which came back with some errors: some CRL verifications and retrievals failed. I've updated OSG.
cont. 03/31/2017 Ankit is here to save the day! When I changed the password for gums, I didn't update the password in "/etc/gums/gums.config". That has now been done. GUMS works again!
And the RSV tests are running too! Some are still broken, but I'm getting actual error messages now, so it's something to work with. Condor is running again! Jobs are running! Ankit saved the day!!! The key signal was that the database connection was being refused. HTCondor should not be set to "true" in /etc/osg/config.d/10-gateway.ini; it has been changed to "false" and now things work again. HTCondor-CE does have to be properly installed, however, and that is a great undertaking that will get its own log entry.

03/13/2017 TAGS: RSV critical
Basically all of the RSV tests are red because they run as jobs, and jobs aren't working at the moment.

03/20/2017 (Hannah) TAGS: Diagnostics Website OSG Map
Fixed the diagnostics OSG map link by editing the old link in ~/diagnostics/index.php.

03/21/17 (Hannah) TAGS: diagnostics website menu
Copied HTML from the CE to the SE so the drop-down menu works on mobile: dashboardCE.php -> dashboardSE.php. Attempted to reinstall storcli64 on NAS-1; just have to configure the software.

03/23/2017 TAGS: NAS-1 unmounted
NAS-1 has decided to unmount itself from the CE, and I can't mount it back. I also can't mount it on the nodes or the SE. The storage partition of NAS-1 is mounted on NAS-1 itself just fine. I'm going to try restarting the cluster. That fixed it.

03/24/2017 TAGS: NAS-1 drive failure beeping
NAS-1 was beeping because drive 0:10 (as read on the casing; "28:10" as read in the software) has failed (the red LED was on). I've installed the MegaCli64 software from http://www.avagotech.com/support/download-search : extracted the .zip using `unzip`, then navigated to the Linux directory and followed the instructions there. The command installs to "/opt/MegaRAID/storcli/storcli64". I created an alias for the command called 'storcli'. `storcli /c0 show` shows the condition of the RAID, and drive 0:10 has indeed failed. To turn the alarm off:
$ storcli /c0 set alarm=off
I'm going to replace the drive when I come back this afternoon. To remove the drive with storcli:
$ storcli /c0/e28/s10 set offline
(*) the left-most column of `storcli /c0 show` lists the drive names in 'enclosureID:slotID' format
$ storcli /c0/e28/s10 set missing
$ storcli /c0/e28/s10 spindown
(*) spins down the drive and makes it safe for removal
The drive can now be safely removed. Once the new drive is in place, it should automatically start rebuilding. If the drive's status doesn't change to "Rbld", the rebuild can be started manually with `storcli /c0/e28/s10 start rebuild`. The rebuild status can be monitored with `storcli /c0/e28/s10 show rebuild`. The rebuild has begun.
cont. 03/27/2017 The rebuild succeeded! Drive 28:13 (front drive 13) has now failed. I've replaced it, and the drive is now rebuilding.

03/27/2017 Hannah fixing the dropdown menu, part 2 (continued from 03/21/2017)
Info on what was switched out in ~/diagnostics/dashboardSE.php and so on is available at ~/diagnostics/mobiledropdownfix.pdf. The dropdowns all work now.

04/03/2017 TAGS: HTCondor-CE install
It is time to install and configure HTCondor-CE! I enabled HTCondor and disabled GRAM in '/etc/osg/config.d/10-gateway.ini'. I changed 'ce-type' in '/etc/rsv/metrics/uscms1.fltech-grid3.fit.edu/allmetrics.conf' from "gram" to "htcondor-ce". I ran `gums-host-cron` to generate a new user-vo-map file.
NOTE: `globus-job-run` can be used to send jobs to the cluster via the grid; `globus-job-run uscms1.fltech-grid3.fit.edu:2119 /bin/hostname` can be used to test jobs.
Jobs run with the above configuration, but RSV doesn't work.
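NOTE: a quick sanity check that the CE service is actually up and that something is listening on its port (standard EL6 tools; 9619 is the HTCondor-CE port):
$ service condor-ce status
$ netstat -tlnp | grep 9619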
The RSV tests are running on port 9619, which is the port for HTCondor-CE, which it was told to use. Port 2119 is the globus gatekeeper port, so jobs are supposed to be sent there (probably). I restarted rsv with `service rsv restart`. I'm thinking port 9619 isn't open.
NOTE: the service name of HTCondor-CE is "condor-ce".
`condor_ce_run` is used to test HTCondor-CE. `condor_ce_run uscms1.fltech-grid3.fit.edu:9619 /bin/hostname` was what the HTCondor-CE troubleshooting page said to run, but it just says that the address of the schedd cannot be found. By changing the port number to 9618, however, an authentication error is thrown. `condor_ce_trace --debug uscms1.fltech-grid3.fit.edu` says that the CE's schedd could not be found.
cont. 04/05/2017 `condor_ce_run` cannot be run by root for security reasons, so I switched to my account. I gave myself the proper usercert.pem and userkey.pem so I could `grid-proxy-init`, and I ran `condor_ce_run`. It is hanging; I'm not sure if it's supposed to or not, so I'm gonna let it run for a while. `globus-job-run` does not work on my account. `condor_ce_trace` still reports that the schedd is not found. `condor_ce_trace` shows the 'Remote Mapping:' to be 'unauthenticated@unmapped' on both my account and root. A complete `condor_ce_trace`, an example of which is given on the HTCondor-CE troubleshooting page, looks an awful lot like the output of a SAM test. Huh! The website says that if 'Remote Mapping:' behaves in this way, authentication needs to be configured. It looks like I've already done everything, though! The "Authorization with GUMS" section of the HTCondor-CE install page just says to add:
authorization_method = xacml
gums_host = uscms1.fltech-grid3.fit.edu
to the /etc/osg/config.d/10-misc.ini file. Maybe the usercert I'm using also needs to be in GUMS? I tried restarting RSV and it complained that 'condor-cron' wasn't running, so I turned it on. RSV started, but gave some errors:
ERROR: Command returned error code '256': 'condor_cron_q -l -constraint 'OSGRSVUniqueName=="uscms1.fltech-grid3.fit.edu__org.osg.local.hostcert-expiry"''
ERROR: Could not determine if job is running
The first error is strange because the hostcert-expiry test is one of the few RSVs that are green. The htcondor-ce.job-routes RSV complained that it could not ping the CE. Ping printed the following to stderr:
ping: sendmsg: Operation not permitted
Hmm, that's disconcerting. Perhaps there are some faulty permissions somewhere? The error message means that the CE is not allowed to send ICMP packets. The site recommended that I mess with the chain policies in iptables, but the INPUT chain for port 9619 is already set to ACCEPT. RSV keeps on complaining that condor-cron isn't running whenever I restart it, although it has, in fact, highkey been running. If condor-cron is left alone after an RSV restart, RSV will continue to throw a fit until condor-cron is restarted.
cont. 04/07/2017 Squid was not running, and starting it failed. What's up with this, now? The periodic 'gums-host-cron' is also disabled. 'gums-host-cron' is a script that is supposed to keep the CE synced with GUMS. I manually ran `gums-host-cron --gumsdebug` and everything went smoothly. There are several targets of the INPUT chain that are set to REJECT with the error message set to "icmp-port-unreachable"; this is the default error message for the REJECT setting. None of the rejected ports are ones I need, though; 9619 is labelled as ACCEPT. NetworkManager was also stopped, so I started it.
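NOTE: rule order matters in iptables, so to see whether any of those REJECT targets sit in front of the port 9619 ACCEPT, the INPUT chain can be listed with rule numbers:
$ iptables -L INPUT -n --line-numbers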
Several old rsv jobs are held in the condor-cron queue. Whenever I ran `condor_ce_trace`, the ping command always used the READ authorization level. I ran
$ condor_ce_ping -verbose WRITE
on my account to test the WRITE authorization level. Lo and behold, my remote mapping was correctly authenticated! Why isn't the WRITE level used by condor_ce_ping during condor_ce_trace? The condor_ce commands sometimes decide to not work for brief periods of time; the ping breaks.
cont. 04/10/2017 I've officially added port 9620 to the iptables list of accepted ports with:
$ iptables -A INPUT -p tcp --dport 9620 -m conntrack --ctstate NEW,ESTABLISHED -j ACCEPT
$ service iptables save
"condor_ce_ping" has decided to keep failing. Upon further investigation, it was discovered that port 9619 is closed (`nmap -p 9619 localhost`), which is strange because `iptables -L` says that it's set to ACCEPT from anywhere. 'nmap' only lists a port as "open" if both iptables allows traffic and a service is listening on that port. In '/etc/condor-ce/config.d/03-ce-shared-port.conf', the SHARED_PORT_ARGS variable was set to 9620, so I'm gonna try setting it to 9619. I restarted condor-ce. `lsof` still does not report anything listening on port 9619. I changed '/etc/condor-ce/config.d/03-ce-shared-port.conf' back to its original state. Strangely enough, `condor_ce_ping -verbose WRITE` works just fine when I run it from my account. condor-ce is configured to use port 9619 (according to '/etc/condor-ce/condor_config'). I scoured the Hypernews and found an article talking about HTCondor-CE. They mentioned that for HTCondor sites, a special line needs to be filled out on OSG. At https://oim.grid.iu.edu/oim/resourceedit?id=163 I edited the 'SAM URI' section to be "htcondor://uscms1.fltech-grid3.fit.edu". I have sent a ticket to OSG.
cont. 04/19/2017 'blah_delegate_renewed_proxies' in '/etc/blah.config' did not have any option selected, so I set it equal to "no". The command that uploads the info from GIP to OSG (`osg-info-services`) is failing:
GIP.Wrapper:WARNING osg_info_wrapper:516: The module /usr/libexec/gip/providers/storage_element timed out!
GIP.Wrapper:WARNING osg_info_wrapper:517: Attempting to kill pgrp 6250
Maybe the inability to upload our information is preventing HTCondor-CE from working? Some python module related to the SE keeps on timing out. Maybe it's trying to ssh? I copied the new ssh key to the SE. Nope, still dies. Maybe it's trying to get into all of the nodes? I'm still resetting their passwords. The ssh is good; let's try again. The connection timed out again, but not because of a failing module. I tried it again and the module failed this time. Huh.
cont. 04/24/2017 It appears that the information propagation procedure above is conducted by 'CEMon'. I looked up troubleshooting for it, and the page said that whether information is being sent up or not can be verified by visiting 'myosg.grid.iu.edu' > "Resource Group" > "Current GIP Validation Status". It says that our "GIP Validation Status" is "Could not get LDIF Entries". LDIF (LDAP Data Interchange Format) files are used to exchange data between LDAP directory servers (between us and OSG). There appears to be a syntax error in '/etc/init.d/glite-ce-check-blparser', found with `service --status-all`. 'glite' is associated with CEMon. The error is caused by the script assigning a string to a variable where an integer is expected. The CEMon troubleshooting page says to check '/var/log/glite-ce-monitor/glite-ce-monitor.log'. Unfortunately, it doesn't exist!
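NOTE: one way to narrow down the timeout might be to run the failing provider by hand (the path is the one from the GIP warning above) and time it:
$ time /usr/libexec/gip/providers/storage_element
(*) if it hangs here too, whatever it's stuck on should be visible with `ps` or `strace` in another terminal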
I checked '/var/log/gip/gip.log', because gip is closely related to CEMon, for errors and found several "CEMonUploader:ERROR" entries. It's just complaining that `osg-info-services` keeps on timing out. '/etc/gip/gip.conf' is empty; I'm not sure if there's supposed to be stuff in there or not.
cont. 04/26/2017 `osg-info-services` now refuses to even try to work; some python variable isn't getting initialized properly.
cont. 04/28/2017 OSG finally responded! I sent them the output of osg-system-profiler, and they said that our condor is outdated. They said to run `yum update condor\*` to update it. That command says condor is up to date, however. The new condor is excluded due to repository priority protections (viewable at yum debug level 3). Why is it excluded (probably for a pretty good reason), and can it be unexcluded (probably, but I wouldn't want to do that because it would mess up all kinds of stuff)? Huh, what do I tell OSG? The priority plugin is responsible for the priority protections; I'm going to check out its configuration.
cont. 05/08/2017 I'm gonna update OSG using the documentation provided by OSG. The documentation said to run:
$ rpm -e osg-release
(*) removes the old yum repositories
$ rpm -Uvh http://repo.grid.iu.edu/osg/3.3/osg-3.3-el6-release-latest.rpm
(*) installs the new repos for CentOS 6
$ yum clean all --enablerepo=*
(*) cleans the yum cache
$ yum update
(*) I did a normal `yum update` instead of a `yumUp` because the situation regarding antlr and GUMS may have changed. If GUMS still doesn't like the new antlr, though, I'll just downgrade it like normal.
They still don't agree, so I downgraded antlr and all is well. Now I'm gonna see how HTCondor-CE behaves:
$ condor_ce_run -r uscms1.fltech-grid3.fit.edu:9619 /bin/env
(*) run from my account (can't from root because security)
(*) reports: ERROR: Can't find address of schedd uscms1.fltech-grid3.fit.edu
$ condor_ce_trace --debug uscms1.fltech-grid3.fit.edu
(*) reports that 163.118.42.1:9619 cannot be pinged by condor_ce_ping
I tried running `osg-configure -v`, but it didn't go through. '/cvmfs' is empty, and some links are broken. Where did the stuff go?
cont. 05/15/2017 OSG wants the condor logs, the condor-ce logs, and both config dumps.
NOTE: to make a tarball: `tar -czvf archive.tar.gz path/to/compressed/directory`; to extract a tarball: `tar -xzvf archive.tar.gz`
cont. 05/17/2017 OSG said that some GSS failures were symptoms of a password-protected hostkey, which GSS doesn't support. The hostkey isn't password protected, but the userkey is, and the userkey in '~/Cluster_System_Files/Cert_Files/certs' is different from the one in '/etc/grid-security'. I updated the cert/key in '/etc/grid-security' and restarted 'condor', 'condor-ce', 'condor-cron', and 'tomcat6'.
cont. 05/19/2017 `condor_ce_trace --debug uscms1.fltech-grid3.fit.edu` successfully pinged the schedd! But it produced this error: "2017-05-19 17:12:00 Could not find an X509 proxy in /tmp/x509up_u502". Hmm, it looks like some certificate issues. I ran `voms-proxy-init`, which fixed that, but then I got a crazy uncaught exception. I've updated OSG.
cont. 05/21/2017 I'm gonna go through the HTCondor-CE documentation to see if anything is awry. Port 9619 is open according to nmap, but port 9620 is closed. `osg-configure -v` produces some errors:
(*) The 'app_dir' variable in '/etc/osg/config.d/10-storage.ini' is set to '/cmssoft/cms', which doesn't exist.
(*) "Option 'app_dir' in section 'Storage' located in /etc/osg/config.d/10-storage.ini: The app_dir and app_dir/etc directories should exist and have permissions of 1777 or 777 on OSG installations." '/cmssoft/cms' is a broken symlink pointing to '/cvmfs/cms.cern.ch/' which doesn't exist. There is nothing in '/cvmfs'. I'm going to try to add the cvmfs repository again (according to the documentation). The '/etc/yum.repos.d/cernvm.repo' file was not found, so I downloaded it with `wget -O /etc/yum.repos.d/cernvm.repo http://cvmrepo.web.cern.ch/cvmrepo/yum/cernvm.repo` I also downloaded the GPG key for the repository: `wget -O /etc/pki/rpm-gpg/RPM-GPG-KEY-CernVM http://cvmrepo.web.cern.ch/cvmrepo/yum/RPM-GPG-KEY-CernVM` I tried to install the proper software using the repositories with `yum install cvmfs cvmfs-init-scripts`, but it said they were already installed. The documentation recommended `cvmfs_config chksetup` to verify the setup, and it reported several errors. It turns out none of the cvmfs configuration files are not readable for some reason. `cvmfs_config showconfig cms.cern.ch` reports: "required configuration repository directory does not exist: /cvmfs/config-osg.opensciencegrid.org/etc/cvmfs" along with a long list of empty configuration variables, only some of which are filled out. The documentation said to try to mount cvmfs to rule out autofs issues: $ mkdir -p /mnt/cvmfs $ mount -t cmvfs cms.cern.ch /mnt/cvmfs But it said that 'cvmfs' was an unknown filesystem. cont. 05/23/2017 Further wiki pages about cvmfs debugging never show the "configuration repository" message, they only check to see if it'll mount. I need to find out what a "configuration repository" is. cont. 05/24/2017 According to some documentation, there are two parameters for cvmfs that I think are assignable in '/etc/cvmfs/local.config'. 'CVMFS_CONFIG_REPOSITORY' determines where the configuration repository is stored, and 'CVMFS_CONFIG_REPO_REQUIRED' determines whether cvmfs should check for a configuration repository. That didn't work, so I posted to the 'T3 Discussion' forum on Hypernews. cont. 05/25/2017 Dave Dykstra responed to the T3 Discussion post, and he said he got the same thing on his test machine. For him, a cvmfs2 process was stuck running for the config-osg repository even though it wasn't shown as mounted. I tried to kill a rouge cvmfs process (revealed by `ps aux | grep cvmfs`), but its PID seemed to be continually changing. cont. 05/26/2017 Dave says that maybe the problem is that 'cvmfs' is not in the 'fuse' group. 'cvmfs' is in the fuse group, however. cont. 05/30/2017 Dave said to try a restart, so I'm trying that. I looked up the error message and someone with a similar problem tried `lsmod | grep fuse`, but they recieved no output, while we do. We do not, however, get ouput from `modprobe fuse`. '/etc/fuse.conf' is set to allow others. cont. 06/01/2017 Dave solved the issue! Turns out 'cvmfs' is hard coded to deal with lines in '/etc/group' that are only 16K in size, but ours had lines over 45K. He discovered this by using 'strace' to look for relevant system calls. Since he knew the issue was with the 'fuse' group, he monitored reading from '/etc/group', and he saw that the whole file wasn't being read. So he made and installed a development version of 'cvmfs' that allocates a line buffer of variable size. 'cvmfs' is mountable again! Unfortunately, 'condor_ce_trace' is still failing due to a failure to ping. Man! 
I tried running `osg-configure -v` again, and it succeeded, but `osg-configure -c` failed:
[root@uscms1 ~]# osg-configure -c
WARNING Can't copy grid3-location file from /etc/osg/grid3-locations.txt to /cmssoft/cms/etc/grid3-locations.txt
CRLs exist, skipping fetch-crl invocation
ERROR Option 'glexec_location' in section 'Misc Services' located in /etc/osg/config.d/10-misc.ini: Can't use glExec because LCMAPS glExec plugin not installed. Install lcmaps-plugins-glexec-tracking or unset glexec_location
CRITICAL Can't configure module, exiting
Can't configure module, exiting
You may be able to get more details rerunning /usr/sbin/osg-configure with the -d option and/or by examining /var/log/osg/osg-configure.log
I'm trying to install 'glexec', but the packages I'm trying (from the suggestion above and the glexec twiki page) don't seem to exist. I've run a `yumUp`, and it updated 'cvmfs'; I may have just undone what Dave did. I highkey totally did. Fortunately, I know what files he accessed, so maybe I can find what he changed. I fixed it! Dave left the rpm he modified in '/tmp', so I just force installed that 'cvmfs' rpm over the one I accidentally installed:
$ rpm -Uv --force /tmp/cvmfs-2.3.5-0.0.20170531211728.dwd.74e701106a94e88784e6049de792df0397fc0824git.el6.x86_64.rpm
Back to 'glexec'. 'glexec' is installed on the nodes, and according to the diagnostics page, they're all fine. I was able to install glexec by first installing osg-release, which somehow got uninstalled:
$ rpm -Uvh https://repo.grid.iu.edu/osg/3.3/osg-3.3-el6-release-latest.rpm
I then installed glexec with:
$ yum install osg-wn-client-glexec
`osg-configure -c` now works! 'condor_ce_trace' is displaying the 'X509' error from before when run in '/tmp'.
cont. 06/09/2017 I am now registered for CMS on my CERN certificate! I'm gonna replace all instances of my current usercert with a version of my CERN cert. Earlier log threads indicated that the usercert.pem and userkey.pem are to be stored in '/etc/grid-security', and the CERN certs are there. `voms-proxy-init -voms cms` doesn't work because it's not using my usercert; it's saying we're the cluster itself (I think it's using the hostcert). `voms-proxy-init`, though, works just fine. `grid-proxy-init` also works now, after I ran `grid-proxy-init -debug` to learn that I had to change the permissions of '~/.globus/usercert.pem' and '~/.globus/userkey.pem' to 600. 'condor_ce_trace' still reports an X509 error, though. Turns out `voms-proxy-init` is user-based: when I was running it as root, it created the proxy for root, which looks like the hostcert. I ran it as me, and it used my certificate just fine. `voms-proxy-init -voms cms` also knows I'm CMS! Something happened when I ran 'condor_ce_trace' from my account! It's sending connection requests to all 600 idle condor jobs. Not sure what it's doing, but it's doing something! Progress! Never mind, it's just trying to send one connection request with a timeout of 600 seconds. Now the error is that condor won't process the job. I've hard restarted condor and condor-ce, and I've run 'condor_ce_trace' again.
cont. 06/20/2017 Brian says that the jobs are sitting idle even though the CE has routed them. He cited the command `condor_ce_q -af:jh jobstatus routedtojobid`, which, when the condor-ce service is running, displays nothing more than a header.
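NOTE: for reading that `condor_ce_q -af:jh jobstatus routedtojobid` output, HTCondor's numeric JobStatus codes are 1 = Idle, 2 = Running, 3 = Removed, 4 = Completed, 5 = Held; for example, just the idle CE jobs can be listed with:
$ condor_ce_q -constraint 'JobStatus == 1'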
It looks like he saw some jobs, though, in the files I sent, and he provided a command for me to try that uses the numbers from the 'routedtojobid', although the IDs may be different because I've restarted since then. I'll wait a bit, then run the first command again to get new numbers. I told Brian about the 4-digit numbers.
cont. 06/22/2017 Brian sent me a link to a twiki page for troubleshooting the event of jobs remaining idle on the CE (our problem). The first step says to check '/var/log/condor-ce/JobRouterLog' for the text 'src=... claimed job'. I grepped that file for 'src=' and nothing was returned; that file is also full of reports saying that no jobs are being submitted via the only route. The idling jobs are not matching any routes. The twiki page recommends using 'condor_ce_job_router_info' to see what's up. `condor_ce_job_router_info -config` displays the routes that jobs will match to; we have only one route. I made a condorTest directory with all of the necessary items to run a test condor job, and I put it in my (Voytella) home directory. I sent the job to the condor queue, and it's idling as expected. I see the job when I run `condor_q`, but there are no jobs listed in `condor_ce_q`. The only jobs that appear to run on condor-ce are rsv jobs. I think the jobs are simply being routed to the old condor, which I had turned off in favor of HTCondor. I probably have to tell something to send the jobs to the new condor.
cont. 06/26/2017 I found a twiki page about HTCondor-CE job routes. It says that the configuration file for default values is '/etc/condor-ce/config.d/02-ce-condor.conf'. `condor_ce_config_val JOB_ROUTER_DEFAULTS | sed 's/;/;\n/g'` provides a list of all the settings. I tried looking at the job routing settings for the old condor, `condor_job_router_info -config`, but it said job routing was disabled. If routing's disabled, how are jobs getting queued? Maybe I have a misconception of what routing really is. I've discovered something! The traditional command 'condor_submit' sends the job to the old condor, whereas 'condor_ce_submit' will send it to condor-ce. The job is still idle, though.
From the page OSG sent me:
Verify Correct Operation between the CE and Local Batch System
Use `condor_ce_config_val -v <variable>` to verify that the "JOB_ROUTER_SCHEDD2_NAME, JOB_ROUTER_SCHEDD2_POOL, and JOB_ROUTER_SCHEDD2_SPOOL configuration variables are set to the hostname of your CE and the hostname of your local HTCondor's collector, and the location of your local HTCondor's spool directory, respectively." The first variable is just the hostname (probably good), the second one is the hostname with port 9618 (not sure if that's the correct port), and the third one is '/var/lib/condor/spool', which does exist and is not full. It also said to make sure that QUEUE_SUPER_USER_MAY_IMPERSONATE of the old condor is set to '.*' with `condor_config_val -v QUEUE_SUPER_USER_MAY_IMPERSONATE`. It is correctly set.
Make Sure the Underlying Batch System Can Run Jobs
It said to examine the ScheddLog. I opened it immediately after trying to submit a job from my (Voytella) account. I saw references to my job in the log, but it was trying to update the collector (the CE IP) on port 9619, rather than port 9618 as listed above. It says that if the underlying batch system (this might be the old condor) doesn't work, then HTCondor-CE will not work either.
Verify Ability to Change Permissions on Key Files
There are no permission errors in the logs, so I don't think this is the problem.
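NOTE: for reference, the condorTest job above boils down to a minimal submit description file along these lines (the executable and output file names are just stand-ins; the log file name matches '~/condorTest/prog.log'):
# ~/condorTest/submit
executable = prog.sh
output     = prog.out
error      = prog.err
log        = prog.log
queue
(*) sent to the local pool with `condor_submit submit`, or through the CE with `condor_ce_submit submit`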
The log file for the job (in the directory from which the job was sent, in this case '~/condorTest/prog.log' (Voytella)) was complaining that HTCondor-CE held the job because of a missing user proxy. I ran `voms-proxy-init` and resubmitted the job; it's still idle, but it's not being held. While troubleshooting, an RSV job got submitted and ran just fine. What's the difference between the RSV jobs and the others? After a few minutes, the submitted job complained of a missing proxy again.
cont. 06/30/2017 Brian said to try increasing the debug level by placing 'ALL_DEBUG = D_FULLDEBUG' in '/etc/condor-ce/config.d/99-local.conf' and saving the changes with `condor_ce_reconfig`. I added the line, but when I tried to reconfigure, I got this error:
Can't find address for local master
Perhaps you need to query another pool.
cont. 07/10/2017 I found a post from someone receiving the same message. They found somewhere that a cause could be that CONDOR_HOST is not set, and ours isn't set in the environment. CONDOR_HOST is, however, set to "$(FULL_HOSTNAME)" in '/etc/condor-ce/condor_config', so I don't think that's the problem.
cont. 07/12/2017 When I run `condor_q`, a lengthy error message appears that says that either the condor_schedd is not running, the SCHEDD_NAME is not defined in condor_config, or something is wrong with the SCHEDD_ADDRESS_FILE. The schedd just had to be started with `condor_schedd`; `condor_q` now works fine. I tried submitting the job to condor-ce again from my (Voytella) home directory, but it's failing to submit:
ERROR: Can't find address of local schedd
The schedd for condor-ce must also be started. Never mind, condor-ce just wasn't turned on; now I can submit jobs from my account like before. They still go idle, then held. I tried `condor_ce_reconfig`, and it went through; turning on the schedd probably fixed that. The JobRouterLog for condor-ce is spouting things, so I'm waiting for it to finish before I begin the trace. It hasn't stopped printing nonsense, so I'm gonna start the trace anyway. The trace isn't providing any new information; it's just saying that the schedd address cannot be found. This is the central problem, since the schedd is responsible for sending the jobs off to the nodes for processing. I've updated OSG.
cont. 07/18/2017 Brian said that the job router is complaining about contacting the HTCondor schedd that is used to submit jobs to the HTCondor backend, rather than complaining about the HTCondor-CE schedd. He also said to make sure that the 'condor' service is running on the CE host and that jobs can be submitted to the pool from the CE via condor_submit/condor_run. I submitted a job to condor from my (Voytella) account, `condor_submit submit`, and '/var/log/condor/SchedLog' reported:
07/18/17 12:48:40 (pid:1894617) Failed to send RESCHEDULE to unknown daemon:
07/18/17 12:48:40 (pid:1894617) attempt to connect to <163.118.42.1:9618> failed: Connection refused (connect errno = 111).
07/18/17 12:48:40 (pid:1894617) ERROR: SECMAN:2003:TCP connection to collector uscms1.fltech-grid3.fit.edu failed.
07/18/17 12:48:40 (pid:1894617) Failed to start non-blocking update to <163.118.42.1:9618>.
I turned on the condor service with `service condor start`, and new errors appeared in the SchedLog:
07/18/17 13:06:27 (pid:1894617) DC_AUTHENTICATE: Command not authorized, done!
07/18/17 13:06:27 (pid:1894617) PERMISSION DENIED to unauthenticated@unmapped from host 10.1.1.1 for command 416 (NEGOTIATE), access level NEGOTIATOR: reason: cached result for NEGOTIATOR; see first case for the full reason
cont. 07/19/2017 I'm following the debug instructions at http://research.cs.wisc.edu/htcondor/CondorWeek2004/presentations/adesmet_admin_tutorial/#DebuggingJobs
1) condor_q
Nothing seemed terribly off with anything in this step.
2) User Log
If I submit the job to HTCondor-CE (condor_ce_submit), it complains of a missing user proxy. If I submit to regular condor (condor_submit), it just says that the job submitted.
3) ShadowLog
The last entries were May 10; no new entries means there are matching problems.
4) Matching Problems
The errors in the SchedLog are posted above (bottom errors). The NegotiatorLog shows the phases that are completed:
07/19/17 13:53:28 Phase 1: Obtaining ads from collector ...
07/19/17 13:53:28 Getting startd private ads ...
07/19/17 13:53:28 Getting Scheduler, Submitter and Machine ads ...
07/19/17 13:53:28 Sorting 166 ads ...
07/19/17 13:53:28 Got ads: 166 public and 160 private
07/19/17 13:53:28 Public ads include 5 submitter, 160 startd
07/19/17 13:53:28 Phase 2: Performing accounting ...
07/19/17 13:53:28 Phase 3: Sorting submitter ads by priority ...
07/19/17 13:53:28 Phase 4.1: Negotiating with schedds ...
It is the schedd negotiation that fails:
07/19/17 13:53:28 SECMAN: FAILED: Received "DENIED" from server for user unauthenticated@unmapped using method (no authentication).
07/19/17 13:53:28 ERROR: SECMAN:2010:Received "DENIED" from server for user unauthenticated@unmapped using method (no authentication).
07/19/17 13:53:28 Failed to send NEGOTIATE command to osg@fltech-grid3.fit.edu (<10.1.1.1:9711?addrs=10.1.1.1-9711>)
07/19/17 13:53:28 Error: Ignoring submitter for this cycle
These errors are repeated for users glow, Voytella, grid0004, and vbhopatkar. Clearly, there are severe matching problems.
cont. 07/21/2017 Brian wants to know the output of `condor_config_val QUEUE_SUPER_USER_MAY_IMPERSONATE`, and whether the local condor collector is on uscms1.fltech-grid3.fit.edu and listening on 9618. The output of the command is:
# condor_config_val -v QUEUE_SUPER_USER_MAY_IMPERSONATE
QUEUE_SUPER_USER_MAY_IMPERSONATE = .*
# at: /etc/condor/config.d/99-condor-ce.conf, line 1
# raw: QUEUE_SUPER_USER_MAY_IMPERSONATE = .*
cont. 07/24/2017 I searched the condor files in /etc for ports 9618 and 9619.
9618:
condor-ce/config.d/50-osg-configure.conf:JOB_ROUTER_SCHEDD2_POOL=uscms1.fltech-grid3.fit.edu:9618
9619:
condor-ce/condor_config:PORT = 9619
condor-ce/config.d/03-ce-shared-port.conf::SHARED_PORT_ARGS= -p 9619
condor-ce/config.d/10-ce-collector-generated.conf:CONDOR_VIEW_HOST = collector1.opensciencegrid.org:9619:9619,collector2.opensciencegrid.org:9619:9619
The condor collector appears to be running on both port 9618 and 9619:
# netstat -tulpn | grep 9618
tcp 0 0 0.0.0.0:9618 0.0.0.0:* LISTEN 1751494/condor_coll
udp 0 0 0.0.0.0:9618 0.0.0.0:* 1751494/condor_coll
# netstat -tulpn | grep 9619
tcp 0 0 0.0.0.0:9619 0.0.0.0:* LISTEN 1902864/condor_shar
udp 0 0 0.0.0.0:9619 0.0.0.0:* 1902866/condor_coll
cont. 07/27/2017 I ran `condor_q -analyze`, and it showed that for the jobs, "Request has not yet been considered by the matchmaker." It recommends looking at the StartLog on the nodes. The 'StartLog' of compute-1-0 is full of the same error:
attempt to connect to <163.118.42.1:9618> failed: Connection refused (connect errno = 111).
ERROR: SECMAN:2004:Failed to create security session to <163.118.42.1:9618> with TCP.|SECMAN:2003:TCP connection to <163.118.42.1:9618> failed.
Failed to start non-blocking update to <163.118.42.1:9618>.
`condor_status -any` shows the collector as '"OSG Cluster Condor at fltech-grid3.fit.e' with the 'du"' cut off. I'm not sure if that's just a display issue or something more. Brian has asked me to do the following:
1) Set 'ALL_DEBUG = D_FULLDEBUG' in /etc/condor/config.d/99-local.conf
(*) '99-local.conf' was not present, so I created the file and put that line in
2) Run `condor_reconfig`
(*) it ran successfully
3) Verify that your user proxy is still valid
(*) on my (Voytella) account, `grid-proxy-init` and `voms-proxy-init` run without issue
4) Run `condor_ce_trace -d uscms1.fltech-grid3.fit.edu`
(*) I ran it from Voytella
5) Wait for the job to go on hold or the trace command to time out
6) Attach /var/log/condor/SchedLog and /var/log/condor/CollectorLog
cont. 07/30/2017 Eduardo responded to the Hypernews post. He confirmed that grid authentication is, in fact, working, and that the problem is with the configuration of the local scheduler (condor, not condor-ce). Since 'condor_submit' worked in the past, he said to check the changes to the condor configuration in '/etc/condor/config.d'. He's also wondering if condor is installed on the nodes. I've sent him the contents of the recently changed configuration files. Brian doesn't see any evidence of the 'condor_ce_trace' in the SchedLog I sent him. He wants me to check the SchedLog for the reason it didn't show up. I'm going to run the command again and check the log. The log had some interesting output: it said that the address for the startd could not be found, and that the NEGOTIATOR authorization policy contained no matching ALLOW entry for the request. I notified Brian.
cont. 08/01/2017 Marguerite from HyperNews had a similar problem at the Maryland cluster. She thinks the problem is due to the version of condor updating along with the change to HTCondor-CE. She says to:
1) Make sure everything is running the same version of condor (`condor_q -version`).
2) Make sure the firewall is open between all the nodes on the appropriate ports.
3) Add the following (she put it in '/etc/condor/config.d/cluster.conf'):
# Here you have to use your network domain, or any comma-separated list of hostnames and IP addresses including all your condor hosts. * can be used as a wildcard.
ALLOW_WRITE = *yourInternalNetwork, 10.1.0.*, SomeIPNumberOfYourCE, name of your CE, fltech-grid3.fit.edu;
### next four lines needed for condor 8.4.8 that came with OSG 3.3
ALLOW_NEGOTIATOR = *fit.edu;, firstIPNumbersForYourPublicNetwork.*
ALLOW_NEGOTIATOR_SCHEDD = $(ALLOW_NEGOTIATOR)
HOSTALLOW_NEGOTIATOR_SCHEDD = $(HOSTALLOW_NEGOTIATOR_SCHEDD), $(HOSTALLOW_WRITE)
HOSTALLOW_WRITE = $(ALLOW_WRITE)
I yum updated the CE and nodes.
cont. 08/03/2017 According to `condor_q -version`, the CE is running version '8.4.11 Feb 24 2017', while the nodes are running version '8.2.10 Oct 27 2015'. Feb 24 is about when the jobs died, which now makes sense. I just did a yum update, so how do I get the updated version of condor? Maybe I also have to update OSG on the nodes. The OSG version on the CE is '3.3.26', while the version on the nodes is '3.2.41'.
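NOTE: to compare versions across the whole cluster in one pass, something like the loop below works; the compute-1-* / compute-2-* hostname pattern and counts are assumptions based on the node names in this log, so adjust as needed:
$ for n in compute-1-{0..15} compute-2-{0..4}; do echo "== $n"; ssh $n 'condor_version | head -1; rpm -q osg-release'; done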
I'm gonna follow the directions to update OSG:
1) remove the old yum repos: `rpm -e osg-release`
2) install the OSG repos: `rpm -Uvh `
3) clean the yum cache: `yum clean all --enablerepo=*`
4) update the software: `yum update`
I'm going to first try these instructions manually on compute-1-0. If they work, I'll make a script for the rest of the nodes. The update went smoothly, but the problem persists. Nevertheless, I'm updating all of the nodes anyway. The update went smoothly.
'ALLOW_WRITE' and 'ALLOW_NEGOTIATOR' were already set properly in '/etc/condor/config.d/00personal_condor.config'. I added:
ALLOW_NEGOTIATOR_SCHEDD=$(ALLOW_NEGOTIATOR)
HOSTALLOW_NEGOTIATOR_SCHED=$(HOSTALLOW_NEGOTIATOR_SCHEDD), $(HOSTALLOW_WRITE)
HOSTALLOW_WRITE=$(ALLOW_WRITE)

cont. 08/04/2017
When I tried to check the status of condor, it said the subsys was locked. I restarted 'condor-ce' and 'condor-cron'. The internet isn't working, so I'll continue later today.

cont. 12/20/2017
Alright, now that NAS-0 is back online (mostly), let's resume trying to fix condor.

cont. 01/07/2018
I thought I turned some nodes on, so that I could work on it before the Physics Building opened up, but I guess not. RIP. I guess I'll just have to wait until tomorrow.

cont. 01/21/2018
Alright, now that NAS-0 is fixed FOR REAL this time, let's get crackin'. Jk, the nodes won't get power. *sigh* The output breakers for the plugs into which the node power strips are connected are acting up. So that I can continue to play with condor in spite of this strange issue, I only have five nodes (2-0 to 2-4) turned on. So far, the UPS seems to be alright with that.

cont. 01/22/2018
Time to play with condor. Let's start off with a classic 'condor_ce_trace' and see where we end up. First, I need to send off my new usercert. The instructions for converting a '.p12' to a '.pem' are found at [10/16/2015]. I copied both the new 'usercert.pem' and 'userkey.pem' to '/etc/grid-security'. I tried `condor_ce_trace -d uscms1.fltech-grid3.fit.edu`, and it told me that it couldn't connect to the CE; the collector daemon appears to be off. Yup, the collector daemon's down, verified by `condor_ce_status`. I did `service condor-ce start` to start it up. Now I'm getting all kinds of output from 'condor_ce_trace'. It's saying it's unable to create a temporary file in the working directory, '/root'. Imma try to run it as Voytella and see if I get anything different. Now it's telling me it can't find an X509 proxy in '/tmp/x509up_u14122'. That's because my user certificate is hella outdated. It says to just throw a copy of it and the key into '/home/Voytella/.globus'. Excellent! I've created a valid temporary proxy! Alright, now it's doing what it was doing before: querying every single idle job in the queue. '/var/log/condor/SchedLog' is also reporting a bunch of 'PERMISSION DENIED' errors like it was doing before.

cont. 01/26/2018
I'm going through the documentation sent by OSG. It says to look for "DC_AUTHENTICATE" and "PERMISSION DENIED" errors in '/var/log/condor-ce/SchedLog'. While I don't have those errors in the condor-ce SchedLog, they're all over the place in the condor SchedLog. The errors are also slightly different from what's described in the documentation. Alright, despite the documentation being for condor-ce, I'm gonna follow its directions to see what I can discover. First, it says to check GUMS or 'grid-mapfile' to ensure that my DN is known to my authentication method.
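A quick way to do that check for my own DN (a sketch; it assumes a proxy already exists from `voms-proxy-init`, so `voms-proxy-info` can report the identity):
$ voms-proxy-info -identity                                                # print my proxy's DN
$ grep -F "$(voms-proxy-info -identity)" /etc/grid-security/grid-mapfile   # see whether that DN is mapped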
I made sure that in '/etc/osg/config.d/10-misc.ini', 'authorization_method' was set to 'xacml' and 'gums_host' was set to our hostname. There is also a note that says that if the local batch system is HTCondor, it will attempt to use the LCMAPS callouts if enabled in '/etc/condor-ce/condor_mapfile', and if that's not the desired behavior, to set 'GSI_AUTHZ_CONF=/dev/null' in '/etc/condor-ce/config.d/99-local.conf'. The GSI thing wasn't set, so I set it. Imma try condor_ce_trace again and see what happens. Nothing seems to have changed. Oh, I forgot to `condor_ce_reconfig`. Now let's see if that does anything.
I ran the 'condor_ce_trace' command from my user account side-by-side with a `tail -f /var/log/condor-ce/SchedLog`. The 'condor_ce_trace' is doing the thing where it queries every single job to report that it's idle and sends a "connection request to schedd at <163.118.42.1:9619>". Every time it makes a new query, it writes the same thing to the SchedLog: the number of active workers is 0, plus something about forking workers and no more child processes to reap. I wonder if 'condor_ce_trace' writes anything to '/var/log/condor/SchedLog'. While there's a bunch of stuff being written to '/var/log/condor/SchedLog', it doesn't look like it's being caused by the 'condor_ce_trace'; it's just a bunch of the 'DC_AUTHENTICATE' and 'PERMISSION DENIED' errors.
NOTE: There are a TON of LCMAPS and GRAM-gatekeeper authentication errors in '/var/log/messages'.
Let's see what doing the GSI thing for regular condor does.
NOTE: In '/etc/condor/config.d', there's a mysterious '99-condor-ce.conf'. What's that doing there? There's also a '50-condor-ce-defaults.conf'. Maybe they're there so condor can talk to condor-ce? They just say that the super user can impersonate anything.
I made the GSI addition and reconfigured condor. Nothing new happened. The next thing it says is to look for LCMAPS errors in '/var/log/messages'. Oh hey! We're drowning in those! Let's investigate! It looks like the error starts with an authentication of a globus user, then it says it can't open the file '/etc/lcmaps/lcmaps.db'. That causes an LCMAPS plugin error, which prevents LCMAPS from initializing. Then that failure breaks everything else. Let's see about that file.
NOTE: LCMAPS (Local Credential MAPping Service) translates grid credentials to local Unix credentials.
Turns out there's only '/etc/lcmaps.db' and no 'lcmaps' directory. I'm gonna try to make that directory and throw the file in it. Now, in '/var/log/messages', a bunch of globus users got authenticated in a row without issue and some other stuff happened. Then it gave a warning about still being "root after the LCMAPS execution. The implicit root-mapping safety is enabled. See documentation for details.", and the next line said that "globus_gss_assist_gridmap() failed authorization" and that the callout returned an unknown error. I'm gonna see about debugging LCMAPS.
There's a whole page for troubleshooting LCMAPS on the wiki. First, it said to set up LCMAPS for maximum debugging by adding the following to '/etc/sysconfig/condor-ce':
export LCMAPS_DEBUG_LEVEL=5
export LCMAPS_LOG_FILE=/tmp/lcmaps.log
Then 'condor-ce' has to be restarted:
$ service condor-ce restart
It also says that disabling HTCondor-CE's caching of authorization lookups is a good idea for testing changes to mapfiles. To disable the caching, create '/etc/condor-ce/config.d/99-disablegsicache.conf' and insert
GSS_ASSIST_GRIDMAP_CACHE_EXPIRATION=0
then restart 'condor-ce'.
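In shell terms, those two debugging steps amount to something like this (a sketch, using only the paths and settings the guide gives):
$ cat >> /etc/sysconfig/condor-ce <<'EOF'
export LCMAPS_DEBUG_LEVEL=5
export LCMAPS_LOG_FILE=/tmp/lcmaps.log
EOF
$ echo 'GSS_ASSIST_GRIDMAP_CACHE_EXPIRATION=0' > /etc/condor-ce/config.d/99-disablegsicache.conf
$ service condor-ce restart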
NOTE: It says that disabling caching could increase the load on the CE (makes sense), so keep an eye on things to make sure nothing gets too out of control.
It gave me a list of configuration files in order of precedence:
/etc/grid-security/ban-mapfile (ban DNs)
/etc/grid-security/ban-voms-mapfile (ban VOs)
/etc/grid-security/grid-mapfile (map DNs)
/etc/grid-security/voms-mapfile (map VOs)
/usr/share/osg/voms-mapfile-default (map VOs default)
'/etc/grid-security/grid-mapfile' is full of grid mappings, but '/etc/grid-security/voms-mapfile' doesn't exist. Strangely enough, it says that LCMAPS is configured in '/etc/lcmaps.db', the file I thought (and it thought) was misplaced earlier. Huh. Either way, it gives me a bunch of stuff to make sure I have in it. It looks like it contains none of what it's supposed to have. Imma go through and add a bunch of stuff, then. Above the 'authorize_only' section, I added the 'gridmapfile', 'banfile', 'banvomsfile', 'vomsmapfile', 'defaultmapfile', and 'verifyproxynokey' parameters. It said to edit the 'authorize_only' section to exactly what it is now; I've commented out what was already there. It also said to make sure '/etc/grid-security/gsi-authz.conf' contains a certain line (that terminates with a newline), but that's already there (including the newline). That's the end of the document. Now let's see what happens. That globus_gss_assist_gridmap() is still failing.
Oh, turns out this troubleshooting guide I was following is just the tail end of the whole LCMAPS page. Imma run down it from the top and see what I can see. It says that to enable the LCMAPS VOMS plugin, I have to add the following to '/etc/osg/config.d/10-misc.ini':
edit_lcmaps = True
authorization_method = vomsmap
It also said to comment out 'glexec_location', and I've commented out the existing 'authorization_method'. It says that a Unix account must be created for each VO, VO role, VO group, and user that I wish to support. I'm not sure if that means every single user in '/usr/share/osg/voms-mapfile-default' or not, because that's a bunch of users. I can probably ask OSG about that. It says the 'allowed_vos' parameter in '/etc/osg/config.d/30-gip.ini' should be populated with the supported VOs per subcluster (worker node hardware) or resourceEntry (set of subclusters) section. Not entirely sure what it means by that, but our 'allowed_vos' is empty and commented out. I'll also ask OSG about that.

cont. 02/03/2018
They think we may not have the OSG version of LCMAPS. To see what version we have, I ran `rpm -q lcmaps`, and it told me we're running version 'osg33', while the latest is 'osg34'. Ah ha! I'll see about fixing that up. I've run a `yumUp`. That didn't cut it; I may have to do other things. Brian also said that I may not have run 'osg-configure', and he's right, I haven't! I've run `osg-configure -v`, and it gave me some info. It said I'll either have to specify a list of VOs or a '*' for the 'allowed_vos' option. It also said that I need to fix the 'gram_ce_hosts' option in '/etc/osg/config.d/30-rsv.ini', since GRAM is no longer supported (the whole reason for this debacle in the first place). In '/etc/osg/config.d/30-gip.ini', I've set 'allowed_vos' to '*'. I'll probably also have to make user accounts for all the VOs in '/usr/share/osg/voms-mapfile-default'. In '/etc/osg/config.d/30-rsv.ini', I edited 'ce_hosts' to just include HTCondor-CE, and I've commented out the 'gram_ce_hosts' setting.
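For my own reference, roughly what those stanzas look like after the edits (a sketch; the '[RSV]' header and the exact ce_hosts value are my reconstruction, while '[SE FLTECH-SE]' is the section already in our 30-gip.ini):
# /etc/osg/config.d/30-gip.ini
[SE FLTECH-SE]
allowed_vos = *

# /etc/osg/config.d/30-rsv.ini
[RSV]
ce_hosts = uscms1.fltech-grid3.fit.edu
; gram_ce_hosts = ...   (left commented out; GRAM is no longer supported)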
`osg-configure -v` gives me a "No allowed_vos specified for section 'Subcluster FLTECH'" warning, and a VO specification warning, saying that either a list of VOs or '*' must be given. I thought I had already taken care of that by modifying 'allowed_vos' in '/etc/osg/config.d/30-gip.ini'. Huh. I'll just go ahead with the `osg-configure -c` and keep these warnings in mind. The configure reported no errors, just the above warnings.

cont. 02/05/2018
OSG also said they wanted an updated `osg-system-profiler`, so I've started that off.

cont. 02/16/2018 (RIP, sorry OSG)
Since it's been so long, I've made a new `osg-system-profiler`.

cont. 02/17/2018
OSG says I've gotta make users for all of the entries in '/usr/share/osg/voms-mapfile-default', so Imma see about doing that. The new users have been created. I've run `osg-configure -c` again and got the following warnings:
WARNING No allowed_vos specified for section 'Subcluster FLTECH'.
WARNING In OSG 3.4, you will be required to specify either a list of VOs, or a '*' to use an autodetected VO list based on the user accounts available on your CE.
WARNING No allowed_vos specified for section 'Subcluster FLTECH'.
WARNING In OSG 3.4, you will be required to specify either a list of VOs, or a '*' to use an autodetected VO list based on the user accounts available on your CE.
WARNING Can't copy grid3-location file from /etc/osg/grid3-locations.txt to /cmssoft/cms/etc/grid3-locations.txt
CRLs exist, skipping fetch-crl invocation
The repetition of the first two warnings is most likely a result of `osg-configure -c` first running `osg-configure -v` and simply printing those warnings for both commands. The last warning, however, I have no explanation for.

cont. 02/20/2018
OSG said I forgot to set 'allowed_vos' to '*' under the '[Subcluster FLTECH]' section of '/etc/osg/config.d/30-gip.ini'; I had only done it in the '[SE FLTECH-SE]' section.

cont. 02/23/2018
Daniel said he fixed some condor stuff ([02/11/2018]), so let's try to run some condor jobs and see what happens. I submitted a job from my account, and it was immediately held.

cont. 02/24/2018
Since so much has changed, I'm going to run through the Condor troubleshooting documentation again to see what it says.

04/06/2017 TAGS: CE cannot ssh unresponsive
Vallary emailed me saying that she couldn't ssh into the cluster, and neither could I! Upon arriving at the high bay, I found the CE unresponsive; just the blue background was visible with the mouse. I power cycled the CE and it rebooted, but condor's not working. `condor_status` returns a communication error stating that it cannot connect to 163.118.42.1:9618. It stopped because /var is 100% full. /var/lib/globus is 3.3G and is full of strange condor files that were created yesterday and the day before. Some are several megabytes while some are empty. The files seem to contain entries for submitted jobs. I'm going to move all of the "condor.*" files to ~/globusCondorJunk and see if that breaks anything. I fully restarted condor, and all seems to be well. If it turns out that the "condor.*" files are indeed useless, then I'll delete them.

04/10/2017 TAGS: mass deletion of users
Users are being deleted in 24 hours. I made a file called ~/userdellist.txt that has all the info in it. The programs at the bottom of the list will stay for now; some of them are important.
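When the time comes, something like this should handle the removals (a rough sketch; it assumes the first field of each line in ~/userdellist.txt is a username and that the program entries at the bottom have been trimmed off first):
$ while read user _; do userdel -r "$user"; done < ~/userdellist.txt   # -r also removes each home directory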
04/11/2017 TAGS: node validation failure tmp full
OSG sent us a ticket a while ago (my email wasn't in the list, Ankit told me about it) saying that CMS and OSG glideins were failing node validation upon startup (https://ticket.opensciencegrid.org/32896). The CMS glideins are failing due to being unable to locate CMS software, and the OSG glideins are failing due to a full '/tmp'.
CMS Failing Nodes: compute-1-1, compute-1-3, compute-1-6, compute-2-1, compute-2-4, compute-2-5, compute-2-6, compute-2-7, compute-2-8
OSG Failing Nodes: compute-2-5, compute-2-6, compute-2-7, compute-2-8
The OSG failing nodes do, in fact, have a completely full primary partition, which is where '/tmp' is located.

cont. 04/12/2017
The problem was that '/scratch' was all filled up because it was the cvmfs cache. I moved the cvmfs cache from '/scratch' to '/var/cache/cvmfs' on all the nodes via a script ('~/Scripts/mvCvmfsCache.sh').

cont. 04/14/2017
The other problem was the CMS failing nodes. The listed nodes contain the script `/var/lib/condor/execute/dir_/glide_/discover_CMSSW.sh`.
NOTE: navigate to '/var/lib/condor/execute' then run `find . -name "discover_CMSSW.sh"` to locate the script.
It hangs upon execution. The script just looks for other scripts and executes them. If it doesn't find what it's looking for, it's supposed to say so. The script, however, doesn't seem to do anything. The discover script is only on some of the nodes listed, and it's not on any that are not listed.

04/13/2017 TAGS: home directory clean
Cleared out the home directory for root so it's usable.

04/14/2017 TAGS: condor not running diagnostics passwords required ssh
The diagnostics page reports that condor is not running on any of the nodes. All of a sudden, I need to enter passwords to ssh from root. Huh, that's strange. Turns out condor's fine, but the monitoring scripts need to ssh into the nodes, which they can't do now because ssh-ing requires passwords for some reason. Riley moved some of the ssh files around when he was reorganizing the home directory, so the CE's ssh keys have been slightly scrambled.

cont. 04/17/2017
Ankit said to investigate ROCKS; it made the ssh keys. The ROCKS documentation said that host-based authentication is controlled by '/etc/ssh/shosts.equiv'; the IPs of the cluster parts are all there. I created a brand new ~/.ssh directory and filled it with a public and private key generated with
$ rocks create keys ~/.ssh/id_rsa > ~/.ssh/id_rsa.pub
The new key was placed on NAS-1 with
$ cat ~/.ssh/id_rsa.pub | ssh nas1 "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
The new key was confirmed placed where it should be, but a password was still requested. Silly me, I didn't check id_rsa.pub for errors, of which there was one. I need to type the command correctly:
$ rocks create keys key=~/.ssh/id_rsa > ~/.ssh/id_rsa.pub
The key was created, and it was correctly put onto NAS-1, but it still doesn't work. Instead of using the rocks command to make the keys, I used the normal `ssh-keygen -t rsa` command, then sent the keys over with the normal command. For installing the new key on all of the nodes, I'm installing `sshpass`, which will allow for the automation of logging into all of the nodes. I added this to osg-node.sh:
cat ~/.ssh/id_rsa.pub | sshpass -p "" ssh -o StrictHostKeyChecking=no compute-fed-nad "mkdir -p ~/.ssh && cat > ~/.ssh/authorized_keys"
Be sure to comment out the normal ssh line! That worked for compute-2-*, but the passwords for compute-1-* are different. I will have to change them to the normal password.
cont. 04/18/2017
To change the root passwords of the other nodes, they must be powercycled and booted into single user mode. After the password has been changed, run `init 5` to resume normal operations. If the node hangs after `init 5`, powercycle it again and allow it to boot normally. I've changed compute-1-0 to compute-1-3 so far.

cont. 04/19/2017
The nodes, the SE, NAS-1, and NAS-0 all have the new keys.

04/19/2017 TAGS: gratia accounting osg website GRACC change no job count
OSG updated their grid monitoring software from Gratia to GRACC (GRAtia Compatible Collector). GRACC is compatible with all existing Gratia probes. Wall hours are being recorded for us, but there is no data for the job count.

04/24/2017 TAGS: squid not running
Squid wasn't running. I checked its status with `squid -k check` and it told me that it couldn't find the cache directory. That's because it was moved during Riley's spring cleaning. I changed the squid directories in '/etc/squid/customize.sh' from "ufs /root/squidAccessLogDump/cache 20000 16 256" to "ufs /root/Cluster_System_Files/squidAccessLogDump/cache 20000 16 256".

cont. 04/26/2017
'customize.sh' will hang, but it does, in fact, edit the file properly after some time. Squid is good again.

04/24/2017 TAGS: NAS0 diagnostics page
The NAS0 diagnostics page had been missing the top table for a while because a newline was missing at the end of /etc/cron.d/nas0chk. The newline was added, so it works now.

04/25/2017 TAGS: NAS1 yum update rpmforge gpg keys
NAS-1 was having some trouble yum updating due to non-existent rpmforge gpg keys. I had some trouble finding the keys, and I had to install a security update, so I just turned off the check for the keys by editing '/etc/yum.repos.d/rpmforge.repo'. I've turned the check back on for now.
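For the record, the edit amounts to toggling one line in that repo file (a sketch; only the gpgcheck line changes, and '[rpmforge]' is the stanza's usual name):
# /etc/yum.repos.d/rpmforge.repo
[rpmforge]
...
gpgcheck = 0    # set to 0 temporarily while the rpmforge GPG keys were missing, then back to 1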