Sysadmins: Ankit, Daniel, Gunnar, Ryan, Riley, Erik, Hannah

08/20/2015 (Gunnar and Daniel)
TAGS: UPS Battery NAS1 hard drive squid diagnostic raid
We checked on the UPS and the failed NAS1 HDD. We swapped the NAS1 HDD; to do this, follow Jordan's documentation.
The top UPS was an atrocity. All of the batteries were corroded and bulging. To get into the UPS we turned it off, unplugged everything from the back, and unplugged the UPS from the wall. We then disconnected the top part from the bottom part. We managed to get all 100+ lbs out of the rack without kicking it once... We unscrewed the top panel and opened it up. The batteries were terrible. We had to pry them apart with a screwdriver because they were so corroded. Upon testing them with a voltmeter we found that four worked (out of 24). They seemed newer than the rest. They were wired in series in groups of 4, and those 6 groups of 4 were wired in parallel.
We also added a line in the diagnostic page to check if squid is running. The command used was
$ squid -k check
This simply returns 0 if it's running.

08/27/2015 (Daniel)
TAGS: UPS Battery NAS1 hard drive raid
NAS1 stopped responding during the rebuild of HDD10 at 16%. After a restart, we could not get back into the webbios raid menu, and df -h showed 48GB of space instead of the normal bunch o' TBs. I removed HDD10 and restarted the computer again. It prompted the raid menu, and we swapped the drive as normal. Hopefully it will pass 16% this time.
We removed the third UPS to check it like we did previously with the second UPS. Only 4 batteries failed to produce a voltage of ~12V. Like the other UPS, it was wired in 6 parallel groups of 4 batteries in series. Both UPS cases need to be cleaned.
As of the time of writing, there are 0 nodes online. The CE, SE, NAS0, and NAS1 are online, plugged into the surge protector strip.
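The squid check from the 08/20 entry relies only on the command's exit status (0 means running). A minimal sketch of how a diagnostic script could turn any such check into an UP/DOWN line is below. The helper name check_status and the demo labels are made up for illustration; the real diagnostic page code is not shown in this log.

```shell
# Report UP/DOWN for a check command based on its exit status (0 = success).
# check_status is a hypothetical helper name, not part of the diagnostic page.
check_status() {
    label=$1
    shift
    # Run the remaining arguments as the check command, discarding its output.
    if "$@" >/dev/null 2>&1; then
        echo "$label: UP"
    else
        echo "$label: DOWN"
    fi
}

# On the CE the call would be: check_status squid squid -k check
# The two demo calls below use true/false so the sketch runs anywhere.
check_status demo-ok true
check_status demo-fail false
```

Any other command that signals health through its exit status can be passed in the same way.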
09/06/2015 (Daniel, Gunnar, Ryan)
TAGS: UPS Redundancy power nodes down battery
The diagnostic website reported the nodes as not operational. Upon reaching the High Bay, we were greeted by a piercing screech emanating from the downed nodes. That screech signals that the redundant power supply has failed and the nodes have lost power. Upon further inspection, we discovered that the socket into which the nodes were plugged had stopped supplying power. To remedy the situation we plugged the power strip into the adjacent socket block, which restored power to the nodes.

09/12/2015 (Daniel, Gunnar, Ryan)
TAGS: UPS Battery Power Supply Redundancy Lights Buttons Signals
We cleaned out and reinstalled one of the UPS boxes. In the process we changed the power scheme of the cluster. In the current configuration, with only one UPS installed, half of the redundant power supplies of the CE, SE, NAS 0, and NAS 1 are plugged into the top UPS. The other half of the CE and NAS 0 are plugged into the middle UPS. The other half of the CE and NAS 1 are plugged into a surge protector. Nodes 2-0 to 2-9 are plugged into the middle UPS.
NOTE: Do NOT connect two batteries together (complete battery circuit); the batteries will become 'sploded.

REFERENCE FOR THE FUNCTIONS OF THE MIDDLE AND BOTTOM UPS LIGHTS (labeled 1-5 from left to right)
1. (Squiggle): AC Power
2. (Squiggle with Arrow): Voltage adjustment
   When AC power voltage is wrong, this light signals that the UPS is adjusting.
3. (Balance): Output Load Level
   Approximate electrical load
   GREEN: light  ORANGE: medium  RED: overload
4. (Battery Charge): When operating from utility power, indicates the approximate charge of the UPS
   GREEN: full  ORANGE: medium  RED: critical
5. (Battery Warning): light is RED and alarm sounds
   UPS batteries need to be recharged or replaced
   If RED, charge batteries for 12 hours and test again
BUTTONS
Power Button: Turns UPS on and off
Mute/Test Button:
   To silence: briefly press and release the test button
   To run self test: with UPS plugged in and turned on, press and hold the Mute/Test Button
   Test will last approximately 10 seconds
   If the output load level light remains RED, UPS outlets are overloaded
   If the battery warning light remains RED, batteries need to be recharged or replaced

09/21/2015 (Ankit, Daniel, Ryan)
TAGS: GUMS, broken, symlinks, antlr
After a "yum update", GUMS may be broken due to broken symlinks that refer to "antlr.jar". To find the symlinks, run
$ ls -l /usr/lib/gums/antlr.jar /var/lib/tomcat6/webapps/gums/WEB-INF/lib/antlr.jar
If the symlinks are broken, find where the proper file is by running:
$ rpm -qlv antlr | grep jar
The proper file will be the largest of the four. The old symlinks must then be deleted:
$ rm /usr/lib/gums/antlr.jar
$ rm /var/lib/tomcat6/webapps/gums/WEB-INF/lib/antlr.jar
To recreate the symlinks, link each path to the proper file (PATH_TO_ANTLR_JAR stands for the file found in the previous step):
$ ln -s PATH_TO_ANTLR_JAR /usr/lib/gums/antlr.jar
$ ln -s PATH_TO_ANTLR_JAR /var/lib/tomcat6/webapps/gums/WEB-INF/lib/antlr.jar
To ensure that the links are indeed fixed, run:
$ ls -l /usr/lib/gums/antlr.jar /var/lib/tomcat6/webapps/gums/WEB-INF/lib/antlr.jar

09/21/2015: (Ankit)
Looking into the VOBOX for proxy renewal for phedex.

09/29/2015 (Ryan)
TAGS: Accounts, Users, bash
I identified all of the users with gid 100 that are not using the /sbin/nologin shell with:
$ awk -F':' '$4==100 && $7!="/sbin/nologin" {print $1" -- "$5}' passwd >& users

09/29/2015: (Ankit, Ryan)
TAGS: Certificate, phedex, gridcert
To check the validity of the certificate:
$ grid-cert-info -file /etc/grid-security/rsv/rsvcert.pem -startdate -enddate
(It is currently set to expire Feb. 19, 2016)

10/04/2015 (Daniel, Gunnar, Ryan, Eric)
TAGS: APC UPS Battery Wire Plug
The APC UPS can be checked without removing the box.
(There are horrible deadly capacitors inside. Don't open the top panel.) Remove the front panel, and there is another panel you can slide left and right. Unscrew it if it is screwed in and slide it left. From here the batteries can be slid out. As of this log all of the batteries in the APC UPS work properly. There is no sign of damage, and they all hold a proper voltage. Also, the UPS wires twist and lock into the wall.

10/05/2015 (Daniel, Ankit, Ryan, Eric, Gunnar)
TAGS: NAS1 Curtis Backup Vegeta Kakarot CentOS Storage
Curtis came and we mounted his storage server into our rack so we can begin backing up NAS1 via a direct link. The wire to connect the storage unit was not with our supplies, so this process has not begun. We installed CentOS on the storage server and are awaiting the proper cable to attach the storage unit.

10/06/2015 (Ryan)
TAGS: Website, HTML, config, usmcs, grid
The config file that sets the directory the website pulls from is:
/etc/httpd/conf/httpd.conf
Currently, the path for the directory is:
/var/www/html
To apply changes, run:
$ service httpd restart

10/06/2015 (Daniel)
TAGS: Diagnostic nodes cron job
I moved the "nodes up" portion of the diagnostic's php script to a separate file, /usr/local/bin/nodes.sh, which is run by a cron job every 60 seconds. It fills a text file DIAGNOSTICPATH/nodes.txt with the number of nodes up, which is read by the diagnostics php script. This greatly improves the loading speed because the page does not have to ping each node every time you load the site.

10/16/2015 (Daniel)
TAGS: certificate proxy PEM p12 pkcs12 key openssl ankit is old
Ankit's user certificate was about to expire (we need it for a few pieces of software, namely phedex), so I replaced it with my own. To do this, execute the following commands using your .p12 file. First back up any existing certificate files. These will be called usercert.pem and userkey.pem.
The .pem files will be stored in /etc/grid-security
$ openssl pkcs12 -in YOURcert.p12 -clcerts -nokeys -out usercert.pem
$ openssl pkcs12 -in YOURcert.p12 -nocerts -out userkey.pem
$ chmod 600 usercert.pem userkey.pem

10/16/2015 (Gunnar, Ankit)
TAGS: Nas-1 beeping .bashrc MegaCLI MegaRAID
Ankit found details on a piece of software called MegaCLI, which monitors the devices and controllers for the Nas-1. We can use this to stop the beeping that is being caused by a missing hard drive. We installed it from the site
http://www.avagotech.com/support/download-search
It downloaded as a .zip file, from which we extracted a .rpm file. After installing the .rpm file, in order for the software to work, we edited the .bashrc file by adding the line
export PATH=$PATH:/opt/MegaRAID/MegaCli:.

10/19/2015 (Daniel, Ankit)
TAGS: SAM TEST 12 13 14 15 Critical Warning gratia transfer storage xrootd gums link broken
The gums link was broken again.
$ ls -l /usr/lib/gums/antlr.jar
will show red if the link is broken. If so, the proper file must be tracked down, or antlr must be downgraded. This caused a failure in gums, and consequently in SAM tests 14 and 15.
The SE SAM tests 12 and 13 were also down. We restarted the SE and some services did not restart. To start them, run
$ service gratia-xrootd-transfer start
$ service gratia-xrootd-storage start
$ service globus-gridftp-server start
I also ran
$ chkconfig xrootd on
so xrootd runs on boot. We will see if this works or not.
To test if file transfer works, run
$ grid-proxy-init
$ touch /tmp/test
$ srm-copy file:////tmp/test srm://uscms1-se.fltech-grid3.fit.edu:8443/srm/v2/server?SFN=/mnt/nas1/store/temp/test_2
To run an rsv metric by hand, run
$ rsv-control --run --host uscms1.fltech-grid3.fit.edu NAME_OF_METRIC
To run all tests by hand, run
$ rsv-control -r --all-enabled

10/20/2015 (Ankit, Ryan)
TAGS: ZFS, RAID, nas-0-2
INSTALLATION: We installed the EPEL repository and ZFS on nas-0-2.
We installed EPEL with this command:
$ sudo rpm -Uvh http://mirrors.kernel.org/fedora-epel/6/i386/epel-release-6-8.noarch.rpm
We installed ZFS following the instructions provided here (EPEL is necessary):
http://prefetch.net/blog/index.php/2012/02/13/installing-zfs-on-a-centos-6-linux-server/
THE INSTRUCTIONS DO NOT SHOW THAT:
rpmbuild must be installed beforehand:
$ yum install rpm-build
After the "$ rpm -Uvh *.x86_64.rpm" command is executed, the system must be rebooted in order for the ZFS module to load properly (done by "$ modprobe zfs").
PROBLEMS: The number of drives fdisk is detecting is double the number of drives in the enclosure (it is also detecting a second enclosure). Zpools can be created with drives from both the bottom and top of the list, so all of the drives are mountable and usable. Some partitions also randomly have ZFS on them when they should not (they were not touched).

10/21/2015 (Daniel)
TAGS: password passwd secure attribute sec_attr
The root password was changed for security reasons.
To change the password on the CE and NAS1 (not part of the rocks network):
$ passwd
To change the password on the nodes, SE, and NAS0:
$ rocks set host sec_attr attr=root_pw
$ rocks sync host sec_attr

10/25/2015 (Daniel)
TAGS: nodes up diagnostic cron job script
I fixed Ryan's code for the nodes up part of the diagnostic page. It was mostly typos. He also forgot to move the media files into the folder. Ryan and I both wrote our own scripts to ping the nodes for the diagnostic page. I merged them into one script.

10/26/2015 (Daniel, Ryan)
TAGS: zfs, zpool, sd, drives
We created a mirrored zpool of 30 groups of 2 drives on the 60-drive nas bay. There are 120 drives detected. To isolate the 60 real drives, arrange the list numerically in base 26 (where sdb comes before sdaa) as opposed to alphabetically, and only take the first 60 while ignoring any sda drives.
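The base-26 ordering described above amounts to sorting the names by length first and alphabetically second. A minimal sketch of that selection follows. The function names sort_drives and pick_drives are hypothetical, the sample names are fed in by hand, and this is not the contents of ~/zpoolCreation.sh.

```shell
# Sort sdX names shortest-first, then alphabetically within each length.
# That is the base-26 order: sdb ... sdz, then sdaa, sdab, ...
# sort_drives and pick_drives are hypothetical names used for this sketch.
sort_drives() {
    awk '{ print length($0), $0 }' | LC_ALL=C sort -k1,1n -k2,2 | cut -d' ' -f2
}

# Drop sda (and any sda partitions), then keep the first N names.
pick_drives() {
    sort_drives | grep -v '^sda[0-9]*$' | head -n "$1"
}

# Demo with a handful of made-up names; on the NAS the list would come
# from the detected drives and N would be 60.
printf '%s\n' sdaa sdc sda sdb sdab sdz | pick_drives 4
```

The demo keeps sdb, sdc, sdz and sdaa, in that order, which is the ordering the entry asks for.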
If, while creating the zpool, an error is produced that says that a drive is referred to twice, the drives were entered incorrectly (a duplicate drive address was entered).
Some key zfs commands are:
$ zpool create -f real mirror sdb sdc
(for initially creating the zpool with the first group of two)
$ zpool add -f real mirror
(for adding additional groups to the existing zpool)
For a script detailing how to resolve this exact situation, refer to ~/zpoolCreation.sh

10/27/2015 (Ankit, Ryan)
We set up a straight ethernet connection between nas-0-1 and nas-0-2 in order to expedite the file transfer from nas-0-1 to nas-0-2. The /etc/sysconfig/static-routes file must be created and the gateway must be put into it so that the change persists when the network is restarted. In order to create the connection, the IP address of nas-0-2 must be the gateway of nas-0-1 and the gateway of nas-0-2 must be the IP address of nas-0-1. The default gateway of nas-0-1 was also changed to 10.1.255.232.
EDIT: When transferring files, using NFS is faster than a direct ethernet connection. The speed problems were coming from the use of the -z option in rsync. -z compresses files before sending them, which severely curtails the speed of the transfer.

10/27/2015 (Ankit, Ryan)
TAGS: ip tables, iptables, nuts, ports, ups, batteries, battery
When adding the port through which NUTS will communicate to the server:
1) add the port number to /etc/sysconfig/iptables
2) run:
   $ iptables -A INPUT -p -m --dport 3493 -j ACCEPT
3) run:
   $ service iptables restart
4) to see if the port is listening, run:
   $ nmap -sS -) -p
5) define the UPS in /etc/ups/ups.conf:
   for the APC UPS: [3000] = apcsmart = 3493 (default)

10/28/2015 (Ankit, Ryan)
TAGS: cvmfs, sam tests
The cvmfs sam test had been in warning for about three weeks. The problem is that cvmfs was trying to connect to a server that no longer existed.
To fix it, we deleted the server from the CVMFS_SERVER_URL list in:
/etc/cvmfs/domain.d/cern.ch.conf (write permissions need to be granted)
In this particular case, we deleted the now-extinct sinica server from the list. After the change is made, run:
$ cvmfs_service reload
Everything must be done on all of the nodes as well. The commands for copying the new cern.ch.conf file to all of the nodes and reloading cvmfs are in the ~/osg-node.sh file (scp is used).
Some important cvmfs commands:
$ cvmfs_service showconfig
- shows where cvmfs is getting all of its information from, and which config files do what

10/29/2015 (Ankit, Ryan)
TAGS: important files directories
DIRECTORIES ON NAS1
Not Good: TurkeyData
Good: BNLZZScan g4hep FNALBeamTest

10/30/2015 (Ankit, Daniel, Ryan)
TAGS: sam tests, critical, 4, bestman
SAM test 4 suddenly went critical. Everything is done in the SE:
We restarted the SE and opened port 2811 (using the "iptables -A" command from above). We referred to the bestman log (/var/log/bestman2).
The SE SAM test was critical. It turned out that the gridftp service was not running. To start it, run:
$ service globus-gridftp-server start
[add to diagnostics page]
$ chkconfig globus-gridftp-server on
(to start it at boot time)

10/31/2015 (Ankit, Ryan)
TAGS: nfsnobody, restart
The files are all suddenly owned by nfsnobody. First, we copied /etc/passwd, /etc/group, and /etc/shadow to nas2 from the CE so that nas2 would have all of the information (passwd: user accounts, group: group IDs, shadow: encrypted passwords). Then we restarted nfs by running "service nfs restart". Within /etc/idmapd.conf we changed "nobody" to "nfsnobody" and input the domain as uscms1.fltech-grid3.fit.edu. We restarted nfs again, and the problem was solved.

11/01/2015 (Ankit, Ryan)
TAGS: nas, connection issues, not connect, down, offline
Both nas-0-1 and nas-0-2 suddenly went offline. Restarting both solved the problem.
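The 10/28 cvmfs fix (dropping one dead server from the CVMFS_SERVER_URL list) can be sketched as a small filter. This assumes the list is a single ';'-separated string. The helper name remove_server and the example.org hosts are made up for illustration, and this is not the contents of ~/osg-node.sh.

```shell
# Drop every entry containing a given host from a ';'-separated URL list,
# the same shape as CVMFS_SERVER_URL. remove_server is a hypothetical helper.
remove_server() {
    # $1 = the URL list, $2 = host (fixed string) to remove
    printf '%s\n' "$1" | tr ';' '\n' | grep -vF -- "$2" | paste -s -d ';' -
}

# Hypothetical server names, for illustration only.
urls='http://a.example.org/opt/@org@;http://dead.example.org/opt/@org@;http://b.example.org/opt/@org@'
remove_server "$urls" 'dead.example.org'
```

The result would still have to be written back into cern.ch.conf by hand or by script, followed by the reload step from the entry.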
11/01/2015 (Ankit, Ryan)
TAGS: nas 1, raid, drives, layout, configuration
nas-0-1 is arranged in RAID60 with 2 groups of 18 drives. The first group of 18 is made up of the first 18 drives on the face of the enclosure (0-0 to 0-17). The second virtual drive is made up of the remaining drives (0-18 to 1-11). The right-most column on the face of the enclosure is a part of the rear group. An important file that refers to the configuration of the enclosure and its drives is cfg.log

11/04/2015 (Ankit)
Completed the storage request form and uploaded the file to the LSI ftp server. To generate the file, first download the lsiget tar file from this site (http://mycusthelp.info/LSI/_cs/AnswerDetail.aspx?s&inc=8264), then extract it on NAS-1 and run ./lsigetlunix.sh inside the extracted folder. Then upload the file to the ftp server ftp://tsupport:tsupport@ftp0.lsil.com/incoming (use the put command) (check ftp://ftp0.lsil.com/incoming/).
Installed the latest version of storcli and enabled pdcache on virtual drive 0 (storcli /c0/v0 set pdcache=On). Need to run it for a day, and then run lsiget.

11/09/2015 (Ankit, Ryan)
TAGS: ups, nuts, serial cable
We moved the box of serial cables and the Tripp Lite CD up to the lab. A normal serial cable cannot be used with any of the UPSs; they have their own special cables. Both of them, however, had a port for a USB cable. The Tripp Lite UPS has also been configured; several configuration files were accessed in /etc/ups

11/10/2015 (Ankit, Ryan)
TAGS: network, slow, speed
nas-0-1 is responding slowly to command line input. After running "free -m", we noticed that almost all of the RAM was being used. Upon further investigation, we found that the actual RAM available to the system is printed in the "-/+ buffers/cache" row, in the "free" column, and that a substantial amount of RAM was still available. A RAM shortage is not the problem.
Maybe network lag is the problem? "mtr" showed a maximum ping of 0.1s, not a hideously large number.
It does not seem as if network lag is the problem, either.

11/12/2015 (Ryan)
TAGS: sam tests, critical, se, srm
SAM test 12 suddenly went critical and tests 13 and 4 were put into warning. The error stated that the test could not copy the test file to SRM. To test SRM connectivity, run the following in the SE:
$ srm-tester -op ping -serviceurl srm://uscms1-se.fltech-grid3.fit.edu:8443/srm/v2/server?SFN=
It said the credentials needed to be renewed. This does not refer to certificates! It refers to proxy renewal, which can be done with:
$ grid-proxy-init
To search all of the files in a directory, use:
$ grep -rie "" -nH
The problem (found in /var/log/bestman2/bestman2.log) was that the SAM tests (specifically gsiftp) were trying to read nonexistent files inside /etc/grid-security/certificates/
To fix this, the osg-client package must be reinstalled, and the GUMS hostname must be added to the /etc/bestman2/conf/bestman2.rc file, then it must be rebooted.

11/12/2015 (Ryan)
TAGS: zfs, nas-0-2, nas2, optimization, configuration, transfer, data inflation
nas-0-2 is displaying inflated data sizes for the data we are transferring to it from nas-0-1. Maybe the data inflation is taking place because the RAID60 parities are being transferred alongside the actual files? ZFS parameters can be found in /sys/module/zfs/parameters/

12/15/2015 (Ankit)
TAGS: sam tests, se, critical, 12
SAM Test 12 went critical again. The bestman2.log file is full of these types of errors:
Exception in thread "qtp1424335915-3690" java.lang.OutOfMemoryError: GC overhead limit exceeded
This, according to Sun, means that "too much time is being spent in garbage collection". It appears that the Java process running thread "qtp1424335915-3690" is running out of heap memory.
SAM SE test is fixed. The issue was the incorrect entry for the gums url in the /etc/bestman2/conf/bestman2.rc file.
The correct entry is:
GUMSserviceURL=https://uscms1.fltech-grid3.fit.edu:8443/gums/services/GUMSXACMLAuthorizationServicePort

12/14/2015 (Ankit)
TAGS: /var, resize, partition, full
/var is now at 77%; removed the tripwire files.

12/17/2015 (Daniel, Gunnar)
TAGS: hard drive, test, lifeguard
All 36 NAS1 drives have been tested. 2 drives indicated failure. Also, for some reason, after clearing the test RAID from earlier (but before we touched anything else), many of the drives said "unconfigured bad". These were: 0, 2, 4, 6, 7, 11-14, 17-23, and all the ones in the back. This seemed to have no connection with the drives we used in our test RAID. The WD lifeguard diagnostic program showed no sign of the bad state that the RAID card reported. Either way we are one drive short, because we only have one backup but two drives that seem to have failed. The hard drives that failed the lifeguard test were 17 and 1-10.

12/26/2015 (Ryan)
TAGS: sam, unknown, undefined, 14, 15, job, submit, BDII
SAM tests 14 and 15, those that determine if jobs can be authenticated by the gatekeeper and run, have become unknown. A "globusrun" command was run successfully, so the gatekeeper is not the issue. A telnet command on port 2119 was run successfully, so the port is not the issue.
According to the twiki page about job submission (https://twiki.cern.ch/twiki/bin/view/CMSPublic/JobSubmission#The_job_aborts_with_the_erro_AN1) and the error message from the SAM test itself (Error: No compatible resource found in BDII.), the CE does not appear in the SAM BDII, and the page lists several reasons why. At the end of the section, a command is provided that should determine the cause:
$ lcg-info --list-ce --query 'CE=uscms1.fltech-grid3.fit.edu*' --attrs CEStatus,CEVOs
The page states that the command must be run from a gLite UI; when run from a normal prompt, it reports:
lcg-info: LCG_GFAL_INFOSYS undefined.
The first bullet is not the problem, because the CE is in the BDII information system (web page at: http://dashb-cms-vo-feed.cern.ch/dashboard/request.py/cmssitemapbdii). The second bullet on the twiki page is not the problem, because, according to the sam test metric result, the CE status is in Production. The third bullet is not the problem, because, according to the sam result, GlueCEAccessControlBaseRule=VO:cms.
The website myosg has a bdii monitoring page. Currently, its status is "(no data)". This site:
https://twiki.grid.iu.edu/bin/view/Documentation/Release3/TroubleShootingCEMonGIP
details how to validate that the data is being published. Section 4.1 of the site provides a link that displays the GIP Validation Status. Our status is that the GIP Validation script is returning unknown overall status: Could not get LDIF Entries.
LDIF works with the command ldapmodify, which, when run, displays:
ldap_sasl_interactive_bind_s: Can't contact LDAP server (-1) (*)
According to , the error could be the result of faulty certificates in the file determined by the TLS_CACERTDIR variable in /etc/openldap/ldap.conf. The certificates are fine, and the problem persists.
(*) offers a solution to the problem: within /etc/openldap/ldap.conf, replace any lines beginning with "TLS_CACERT" with:
TLS_CACERT /etc/ssl/certs/ca-bundle.crt
The error persists.
The problem was with them, not with us; the tests all miraculously went green.