Sysadmins: Ankit, Daniel, Gunnar, Ryan, Riley, Erik, Hannah

08/20/2015
(Gunnar and Daniel)
TAGS: UPS Battery NAS1 hard drive squid diagnostic raid

We checked on the UPS and the failed NAS1 HDD.
We swapped the NAS1 HDD. To do this, follow Jordan's documentation.

The top UPS was an atrocity. All of the batteries were corroded and bulging.
To get into the UPS we turned it off, and unplugged everything from the back,
and unplugged the UPS from the wall. We then disconnected the top part
from the bottom part. We managed to get all 100+lbs out of the rack
without kicking it once... We unscrewed the top panel and opened it up.
The batteries were in terrible shape. We needed a screwdriver to pry them
apart because they were so corroded. Upon testing them with a
voltmeter we found that four worked (out of 24).
They seemed newer than the rest. They were wired serially in groups of 4,
and those 6 groups of 4 were wired in parallel.

We also added a line in the diagnostic page to check if squid is running.
The command utilized was
$ squid -k check
This simply returns 0 if squid is running.
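
A minimal sketch of how an exit-code check like this might be wrapped for the
diagnostic page. The function name check_cmd and the output wording are
hypothetical; only "squid -k check" comes from the entry above.

```shell
#!/bin/sh
# Run a command quietly and report a status line based on its exit code.
# $1 is a label for the output; the remaining arguments are the command.
check_cmd() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "$label: running"
    else
        echo "$label: NOT running"
    fi
}

# Example (assumes squid is installed on this machine):
#   check_cmd squid squid -k check
```

Because the command is passed in as arguments, the same helper could be reused
for other services whose check command exits 0 on success.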

08/27/2015
(Daniel)
TAGS: UPS Battery NAS1 hard drive raid 
NAS1 stopped responding during the rebuilding of HDD10 at 16%.
After a restart, we could not get back into the webbios raid menu,
and df -h showed 48GB of space instead of the normal bunch o' TBs.
I removed hdd 10 and restarted the computer again.
It prompted the raid menu, and we swapped the drive as normal.
Hopefully it will pass 16% this time.

We removed the third UPS to check it like we did previously with the second UPS.
Only 4 batteries could not produce a voltage ~12V. Like the other UPS,
it was wired in 6 parallel groups of 4 batteries in series.
Both UPS cases need to be cleaned.
As of the time of writing, there are 0 nodes online.
The CE, SE, NAS0, and NAS1 are online, plugged into the surge protector strip.

09/6/2015
(Daniel, Gunnar, Ryan)
TAGS: UPS Redundancy power nodes down battery
The diagnostic website reported the nodes as not operational. Upon reaching the
High Bay, we were greeted by a piercing screech emanating from the downed
nodes. That screech signals that the redundant power supply has failed, and the
nodes have lost power. Upon further inspection, we discovered that the socket
into which the nodes were plugged in had stopped supplying power. To remedy the
situation we merely plugged the power strip into the adjacent socket block, 
thus supplying power once again to the nodes.

09/12/2015
(Daniel, Gunnar, Ryan)
TAGS: UPS Battery Power Supply Redundancy Lights Buttons Signals
We cleaned out and reinstalled one of the UPS boxes. In the process we changed
the power scheme of the cluster. In the current configuration, with only one UPS
installed, half of the redundant power supplies of the CE, SE,
NAS 0, and NAS 1 are plugged into the top UPS. The other half of the CE and
NAS 0 are plugged into the middle UPS. The other half of the CE and NAS 1 are
plugged into a surge protector. Nodes 2-0 to 2-9 are plugged into the middle
UPS.
NOTE: Do NOT connect two batteries together (complete battery circuit); the
      batteries will become 'sploded.

REFERENCE FOR THE FUNCTIONS OF THE MIDDLE AND BOTTOM UPS

LIGHTS
(Labeled 1-5 from left to right)
1. (Squiggle):
   AC Power
2. (Squiggle with Arrow):
   Voltage adjustment
   	   When AC power voltage is wrong, this light signals that the UPS is
	   adjusting.
3. (Balance):
   Output Load Level
   	  Approximate electrical load
	  GREEN: light
	  ORANGE: medium
	  RED: overload
4. (Battery Charge):
   When operating from utility power, indicates the approximate charge of the
   UPS
	GREEN: full
	ORANGE: medium
	RED: critical
5. (Battery Warning):
   light is RED and alarm sounds
   	 UPS batteries need to be recharged or replaced
	 If RED charge batteries for 12 hours and test again

BUTTONS
Power Button:
      Turns UPS on and off
Mute/Test Button:
	  To silence: briefly press and release test button
	  To run self test: with UPS plugged in and turned on, press and hold
	  Mute/Test Button
	     Test will last approximately 10 seconds
	     If output load level light remains RED, UPS outlets are overloaded
	     If battery warning light remains RED, batteries need to be 
	     	recharged or replaced

09/21/2015
(Ankit, Daniel, Ryan)
TAGS: GUMS, broken, symlinks, antlr
After a "yum update", GUMS may be broken due to broken symlinks that refer
to "antlr.jar". To find the symlinks, run
$ ls -l /usr/lib/gums/antlr.jar /var/lib/tomcat6/webapps/gums/WEB-INF/lib/antlr.jar

If the symlinks are broken, find where the proper file is by running:
$ rpm -qlv antlr | grep jar
The proper file will be the largest of the four.

The old symlinks must then be deleted:
$ rm -r /usr/lib/gums/antlr.jar
$ rm -r /var/lib/tomcat6/webapps/gums/WEB-INF/lib/antlr.jar

To recreate the symlinks, run one command for each link:
$ ln -s <proper file path> /usr/lib/gums/antlr.jar
$ ln -s <proper file path> /var/lib/tomcat6/webapps/gums/WEB-INF/lib/antlr.jar

To ensure that the links are indeed fixed run:
$ ls -l /usr/lib/gums/antlr.jar /var/lib/tomcat6/webapps/gums/WEB-INF/lib/antlr.jar
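
A small sketch for spotting the broken links without reading the ls output by
eye. The function name check_links and the OK/BROKEN/NOTLINK labels are
hypothetical; it only relies on the standard -L (is a symlink) and -e (target
exists) tests.

```shell
#!/bin/sh
# Print one status line per path: OK (symlink with an existing target),
# BROKEN (symlink whose target is missing), or NOTLINK (anything else).
check_links() {
    for f in "$@"; do
        if [ -L "$f" ] && [ ! -e "$f" ]; then
            echo "BROKEN $f"
        elif [ -L "$f" ]; then
            echo "OK $f"
        else
            echo "NOTLINK $f"
        fi
    done
}

# Example, using the two paths from this entry:
#   check_links /usr/lib/gums/antlr.jar \
#       /var/lib/tomcat6/webapps/gums/WEB-INF/lib/antlr.jar
```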

09/21/2015
(Ankit)
Looking into the VOBOX for proxy renewal for phedex

09/29/2015
(Ryan)
TAGS:Accounts, Users, bash
I identified all of the users with gid 100 and not using /sbin/nologin shell with:
$ awk -F':' '$4==100 && $7!="/sbin/nologin" {print $1" -- "$5}' passwd >& users

09/29/2015
(Ankit, Ryan)
TAGS:Certificate, phedex, gridcert
To check the validity of the certificate:
$ grid-cert-info -file /etc/grid-security/rsv/rsvcert.pem -startdate -enddate
(It is currently set to expire Feb. 19, 2016)

10/04/2015
(Daniel, Gunnar, Ryan, Eric)
TAGS: APC UPS Battery Wire Plug
The APC UPS can be checked without removing the box.
(There are horrible deadly capacitors inside. Don't open the top panel.)
Remove the front panel, and there is another pane you can slide left and right.
Unscrew it if it is screwed in and slide it left.
From here the batteries can be slid out.
As of this log all of the batteries in the APC UPS work properly.
There is no sign of damage, and they all hold a proper voltage.

Also the UPS wires twist and lock into the wall.

10/5/2015
(Daniel, Ankit, Ryan, Eric, Gunnar)
TAGS: NAS1 Curtis Backup Vegeta Kakarot CentOS Storage
Curtis came and we mounted his storage server into our rack so we can
begin backing up NAS1 via a direct link. The wire to connect the storage unit
was not with our supplies so this process has not begun.
We installed CentOS on the storage server and are awaiting the proper cable
to attach the storage unit.

10/06/2015
(Ryan)
TAGS: Website, HTML, config, usmcs, grid
The config file that sets the directory the website pulls from
is found here:
/etc/httpd/conf/httpd.conf
Currently, the path for the directory is:
/var/www/html 
To apply changes, run:
$ service httpd restart

10/06/2015
(Daniel)
TAGS: Diagnostic nodes cron job
I moved the "nodes up" portion of the diagnostic's php script
to a separate file /usr/local/bin/nodes.sh which is run by a cron job
every 60 seconds. It fills a text file DIAGNOSTICPATH/nodes.txt with
the number of nodes up, which is read by the diagnostics php script.
This greatly improves the loading speed because it does not have to
ping each node every time you load the site.
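
A rough sketch of what a node-counting script of this kind could look like.
This is not the contents of /usr/local/bin/nodes.sh; the function name
count_up and the node list file nodelist.txt are hypothetical.

```shell
#!/bin/sh
# Count how many hosts (one per line on stdin) pass the given check command.
# The check command is passed as arguments and gets the host appended.
count_up() {
    up=0
    while read -r host; do
        [ -n "$host" ] || continue
        # </dev/null keeps the check command from consuming the host list
        if "$@" "$host" >/dev/null 2>&1 </dev/null; then
            up=$((up + 1))
        fi
    done
    echo "$up"
}

# Example (nodelist.txt is a hypothetical file with one node name per line):
#   count_up ping -c 1 -W 1 < nodelist.txt > DIAGNOSTICPATH/nodes.txt
# A crontab schedule of "* * * * *" runs a script once a minute.
```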

10/16/2015
(Daniel)
TAGS: certificate proxy PEM p12 pkcs12 key openssl ankit is old
Ankit's user certificate was about to expire (we need it for a few pieces of software, namely phedex).
I replaced it with my own. To do this, execute the following commands using your .p12 file.
First back up any existing certificate files. These will be called usercert.pem and userkey.pem.
The .pem files will be stored in /etc/grid-security.

$ openssl pkcs12 -in YOURcert.p12 -clcerts -nokeys -out usercert.pem
$ openssl pkcs12 -in YOURcert.p12 -nocerts -out userkey.pem
$ chmod 600 usercert.pem userkey.pem

10/16/2015
(Gunnar, Ankit)
TAGS: Nas-1 beeping .bashrc MegaCLI MegaRAID
Ankit found details on a tool called MegaCLI, which monitors the devices and controllers
for the Nas-1. We can use this to stop the beeping that is being caused by a missing hard
drive. We installed it from the site 
http://www.avagotech.com/support/download-search
It downloaded as a .zip file and then we extracted it to a .rpm file.
After installing the .rpm file, in order for the software to work,
we edited the .bashrc file by adding the line

export PATH=$PATH:/opt/MegaRAID/MegaCli:.

10/19/2015
(Daniel, Ankit)
TAGS: SAM TEST 12 13 14 15 Critical Warning gratia transfer storage xrootd gums link broken
The gums link was broken again.

$ ls -l /usr/lib/gums/antlr.jar

will show red if the link is broken. If so, the proper file must be tracked down, or antlr must be downgraded.
This caused a failure in gums, and consequently in SAM tests 14 and 15.

The SE SAM tests 12 and 13 were also down.
We restarted the SE and some services did not restart. To start them, run

$ service gratia-xrootd-transfer start
$ service gratia-xrootd-storage start
$ service globus-gridftp-server start

I also ran

$ chkconfig xrootd on

so xrootd runs on boot. We will see if this works or not.
To test if file transfer works, run

$ grid-proxy-init
$ touch /tmp/test
$ srm-copy file:////tmp/test srm://uscms1-se.fltech-grid3.fit.edu:8443/srm/v2/server?SFN=/mnt/nas1/store/temp/test_2


To run an rsv metric by hand run

$ rsv-control --run --host uscms1.fltech-grid3.fit.edu NAME_OF_METRIC

To run all tests by hand, run

$ rsv-control -r --all-enabled

10/20/2015
(Ankit, Ryan)
TAGS: ZFS, RAID, nas-0-2
INSTALLATION:

We installed the EPEL repository and ZFS on nas-0-2.

We installed EPEL with this command:
$ sudo rpm -Uvh http://mirrors.kernel.org/fedora-epel/6/i386/epel-release-6-8.noarch.rpm

We installed ZFS following the instructions provided here: (EPEL is necessary) 
http://prefetch.net/blog/index.php/2012/02/13/installing-zfs-on-a-centos-6-linux-server/

THE INSTRUCTIONS DO NOT SHOW THAT:
    rpmbuild must be installed beforehand:
    $ yum install rpm-build

    After the "$ rpm -Uvh *.x86_64.rpm" command is executed, the system must be rebooted
    in order for the ZFS module to load properly (done by "$ modprobe zfs").

PROBLEMS:
The number of drives fdisk is detecting is double the number of drives in the enclosure (it is
also detecting a second enclosure). Zpools can be created with drives from both the bottom and
top of the list, so all of the drives are mountable and usable.
Some partitions also randomly have ZFS on them when they should not (they were not touched). 

10/21/2015
(Daniel)
TAGS: password passwd secure attribute sec_attr
The root password was changed for security reasons.
To change the password on the CE and NAS1 (not part of the rocks network):
$ passwd
To change the password on the nodes, SE, NAS0:
$ rocks set host sec_attr attr=root_pw
$ rocks sync host sec_attr

10/25/2015
(Daniel)
TAGS: nodes up diagnostic cron job script
I fixed Ryan's code for the nodes up part of the diagnostic page.
It was mostly typos. He also forgot to move the media files into the folder.
Ryan and I both wrote our own scripts to ping the nodes for the 
diagnostic page. I merged them into one script.

10/26/2015
(Daniel, Ryan)
TAGS: zfs, zpool, sd, drives
We created a mirrored zpool of 30 groups of 2 drives on the 60-drive nas bay.
There are 120 drives detected. To isolate the 60 real drives, arrange the list numerically in
base 26 (where sdb comes before sdaa) as opposed to alphabetically, and only take the first 60
while ignoring any sda drives.
If, while creating the zpool, an error is produced that says that a drive is referred to twice,
the drives were entered incorrectly (a duplicate drive address was entered).
Some key zfs commands are:
$ zpool create -f real mirror sdb sdc
(for initially creating the zpool with the first group of two)
$ zpool add -f real mirror <drive1> <drive2>
(for adding additional groups to the existing zpool)
For a script detailing how to resolve this exact situation, refer to
~/zpoolCreation.sh
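
The ordering and pairing described above can be sketched as follows. This is
not the contents of ~/zpoolCreation.sh; the function names are hypothetical,
and it assumes that "ignoring any sda drives" means skipping only sda itself.
The last function only prints the zpool commands so they can be reviewed
before anything is run.

```shell
#!/bin/sh
# Order sdX names shortest-first, then alphabetically within each length
# (sdb ... sdz, then sdaa, sdab, ...).
sort_drives() {
    awk '{ print length($0), $0 }' | LC_ALL=C sort -k1,1n -k2,2 | cut -d' ' -f2-
}

# Keep the first N drives in that order, skipping sda itself (assumption).
first_drives() {
    grep -v '^sda$' | sort_drives | head -n "$1"
}

# Print (do not run) one zpool command per pair of drives read from stdin:
# "create" for the first pair, "add" for every later pair.
print_zpool_cmds() {
    pool=$1
    verb=create
    while read -r d1 && read -r d2; do
        echo "zpool $verb -f $pool mirror $d1 $d2"
        verb=add
    done
}

# Example:
#   ls /sys/block | grep '^sd' | first_drives 60 | print_zpool_cmds real
```

If the list has an odd number of drives, the last one is left unpaired and no
command is printed for it.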

10/27/2015
(Ankit, Ryan)
We set up a direct ethernet connection between nas-0-1 and nas-0-2 in order to
expedite the file transfer from nas-0-1 to nas-0-2.
The /etc/sysconfig/static-routes file must be created and the gateway must 
be put into it so that the change becomes permanent when the network is restarted.
In order to create the connection, the IP address of nas-0-2 must be the gateway 
of nas-0-1 and the gateway of nas-0-2 must be the IP address of nas-0-1. 
The default gateway of nas-0-1 was also changed to 10.1.255.232.

EDIT:
When transferring files, using NFS is faster than a direct ethernet
connection. The speed problems were coming from the use of the 
-z option in rsync. -z compresses files before sending them, which
severely curtails the speed of the transfer.

10/27/2015
(Ankit, Ryan)
TAGS: ip tables, iptables, nuts, ports, ups, batteries, battery
When adding the port through which NUTS will communicate to the server:
1) add port number to /etc/sysconfig/iptables
2) run command:
   $ iptables -A INPUT -p <tcp/udp> -m <tcp/udp> --dport 3493 -j ACCEPT
3) run:
   $ service iptables restart
4) to see if the port is listening, run:
   $ nmap -sS -O -p<portnumber> <serverIP>
5) define the UPS in /etc/ups/ups.conf:
   for the APC UPS:
       [3000]
       <drivername> = apcsmart
       <portname> = 3493 (default)

10/28/2015
(Ankit, Ryan)
TAGS: cvmfs, sam tests
The cvmfs sam test had been in warning for about three weeks.
The problem is that cvmfs was trying to connect to a server
that no longer existed. To fix it, we deleted the server
from the CVMFS_SERVER_URL list in:
/etc/cvmfs/domain.d/cern.ch.conf
(write permissions need to be granted)
In this particular case, we deleted the now extinct
sinica server from the list.
After the change is made, run:
$ cvmfs_config reload

Everything must be done on all of the nodes as well.
The commands for copying the new cern.ch.conf file to
all of the nodes and reloading cvmfs are in the
~/osg-node.sh file. (scp is used)

Some important cvmfs commands:
$ cvmfs_config showconfig
  -shows where cvmfs is getting all of its information
   from, and which config files do what

10/29/2015
(Ankit, Ryan)
TAGS: important files directories
DIRECTORIES ON NAS1
Not Good:
    TurkeyData
Good:
    BNLZZScan
    g4hep
    FNALBeamTest

10/30/2015
(Ankit, Daniel, Ryan)
TAGS: sam tests, critical, 4, bestman
SAM test 4 suddenly went critical.
Everything is done in the SE:
We restarted the SE and opened port 2811. (using
the "iptables -A" command from above) 
We referred to the bestman log (/var/log/bestman2).
The SE SAM test was critical because the gridftp service was not running. To start it, run:
$ service globus-gridftp-server start [add to diagnostics page]
$ chkconfig globus-gridftp-server on (to start it at boot-time)

10/31/2015
(Ankit, Ryan)
TAGS: nfsnobody, restart
The files are all suddenly owned by nfsnobody. First, we copied
/etc/passwd /etc/group /etc/shadow to nas2 from the CE so that
nas2 would have all of the information. (passwd:group permissions, 
group:group IDs, shadow:encrypted passwords) 
Then we restarted nfs by running "service nfs restart". Within
/etc/idmapd.conf we changed "nobody" to "nfsnobody" and 
input the domain as uscms1.fltech-grid3.fit.edu. We 
restarted nfs again, and the problem was solved.

11/01/2015
(Ankit, Ryan)
TAGS: nas, connection issues, not connect, down, offline
Both nas-0-1 and nas-0-2 suddenly went offline.
Restarting both solved the problem.

11/01/2015
(Ankit, Ryan)
TAGS: nas 1, raid, drives, layout, configuration
nas-0-1 is arranged in RAID60 with 2 groups of 18 drives.
The first group of 18 is made up of the first 18 drives on
the face of the enclosure (0-0 to 0-17). The second virtual
drive is made up of the remaining drives (0-18 to 1-11).
The right-most column on the face of the enclosure is a part
of the rear group. 
An important file that refers to the configuration of
the enclosure and its drives is cfg.log

11/04/2015
(Ankit)
Completed the storage request form, and uploaded the file to the LSI ftp server. 
To generate the file, first download the lsiget tar file from this site (http://mycusthelp.info/LSI/_cs/AnswerDetail.aspx?s&inc=8264), then
extract it on NAS-1 and run ./lsigetlunix.sh inside the extracted folder.
Then upload the file to the ftp server ftp://tsupport:tsupport@ftp0.lsil.com/incoming (use the put command).
(check ftp://ftp0.lsil.com/incoming/)
Installed latest version of storcli, and enabled pdcache on Virtual drive 0 (storcli /c0/v0 set pdcache=On)
Need to run it for a day, and then run lsiget.

11/09/2015
(Ankit, Ryan)
TAGS: ups, nuts, serial cable
We moved the box of serial cables and Tripplite CD up to the lab.
A normal serial cable cannot be used with any of the UPSs; they have
their own special cables. Both of them, however, had a port for a
USB cable. The Tripplite UPS has also been configured; several
configuration files were accessed in /etc/ups

11/10/2015
(Ankit, Ryan)
TAGS: network, slow, speed
nas-0-1 is responding slowly to command line input

After running "free -m", we noticed that almost all of the RAM was being used.
Upon further investigation, it was discovered that the actual RAM available to the system
is printed in the "-/+ buffers/cache" row of the "free" column, and that a substantial amount of RAM was
still available. A RAM shortage is not the problem.
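
A small sketch for pulling that number out directly. The function name
free_after_cache_mb is hypothetical, and it assumes the older "free" output
layout that includes a "-/+ buffers/cache:" row; newer versions of free drop
that row and show an "available" column instead, in which case this prints
nothing.

```shell
#!/bin/sh
# Read "free -m" output on stdin and print the free column of the
# "-/+ buffers/cache:" row, i.e. memory free once buffers and cache
# are not counted as used.
free_after_cache_mb() {
    awk '$2 == "buffers/cache:" { print $4 }'
}

# Example:
#   free -m | free_after_cache_mb
```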

Maybe network lag is the problem? "mtr" showed a maximum ping of 0.1s, not a hideously large
number. It does not seem as if network lag is the problem, either.

11/12/2015
(Ryan)
TAGS: sam tests, critical, se, srm
SAM test 12 suddenly went critical and tests 13 and 4 were put into warning.
The error stated that the test could not copy the test file to SRM.
To test SRM connectivity, run:
$ srm-tester -op ping -serviceurl srm://uscms1-se.fltech-grid3.fit.edu:8443/srm/v2/server?SFN=
in the SE.
It said the credentials needed to be renewed. This does not refer to certificates!
It refers to proxy renewal, which can be done with:
$ grid-proxy-init
To search all of the files in a directory, use:
$ grep -rie "<term\|...>" -nH
The problem (found in /var/log/bestman2/bestman2.log) was that the SAM tests (specifically 
gsiftp) were trying to read nonexistent files inside /etc/grid-security/certificates/
To fix this, the osg-client package must be reinstalled, and the GUMS
hostname must be added to the /etc/bestman2/conf/bestman2.rc file, then it must be rebooted.

11/12/2015
(Ryan)
TAGS: zfs, nas-0-2, nas2, optimization, configuration, transfer, data inflation
nas-0-2 is displaying inflated data sizes for the data we are transferring to it
from nas-0-1. Maybe the data inflation is taking place because the RAID60 parities
are being transferred alongside the actual files?
ZFS parameters can be found in /sys/module/zfs/parameters/

12/15/2015
(Ankit)
TAGS: sam tests, se, critical, 12
SAM Test 12 went critical again. The bestman2.log file is full of these types of errors: 
Exception in thread "qtp1424335915-3690" java.lang.OutOfMemoryError: GC overhead limit exceeded
This, according to Sun, means that "too much time is being spent in garbage collection".
It appears that the JVM is running out of heap memory (the name "qtp1424335915-3690" refers to a thread, not a program).

SAM SE test is fixed. The issue was the incorrect entry for the gums url in the /etc/bestman2/conf/bestman2.rc file.
The correct entry is:
GUMSserviceURL=https://uscms1.fltech-grid3.fit.edu:8443/gums/services/GUMSXACMLAuthorizationServicePort

12/14/2015
(Ankit)
TAGS: /var, resize, partition, full 
/var is now at 77% after removing the tripwire files.

12/17/2015
(Daniel, Gunnar)
TAGS: hard drive, test, lifeguard
All 36 NAS1 drives have been tested. 2 drives indicated failure. Also, for some reason, after clearing the test RAID from earlier (but before we touched anything else), many of the drives said "unconfigured bad". These ones were: 0,2,4,6,7,11-14,17-23, and all the ones in the back. This seemed to have no connection with the drives we used in our test RAID. The WD lifeguard diagnostic program showed no sign of this bad state that the RAID card reported. But either way we are one drive short, because we only have one backup, but two that seem to have failed. The hard drives that failed the lifeguard test were 17 and 1-10.

12/26/2015
(Ryan)
TAGS: sam, unknown, undefined, 14, 15, job, submit, BDII
SAM tests 14 and 15, those that determine if jobs can be authenticated by the gatekeeper and run, have become unknown. A "globusrun" command was run successfully, so the gatekeeper is not the issue. A telnet command on port 2119 was run successfully, so the port is not the issue. 
According to the twiki page about job submission (https://twiki.cern.ch/twiki/bin/view/CMSPublic/JobSubmission#The_job_aborts_with_the_erro_AN1) and the error message from the SAM test itself (Error: No compatible resource found in BDII.), the CE does not appear in the SAM BDII, and it lists several reasons why. At the end of the section, a command is provided that should determine the cause:
$ lcg-info --list-ce --query 'CE=uscms1.fltech-grid3.fit.edu*' --attrs CEStatus,CEVOs 
The page states that the command must be run from a gLite UI, but after being run from a normal prompt, it reports:
lcg-info: LCG_GFAL_INFOSYS undefined.
The first bullet is not the problem, because the CE is in the BDII information system (web page at: http://dashb-cms-vo-feed.cern.ch/dashboard/request.py/cmssitemapbdii). The second bullet on the twiki page is not the problem, because, according to the sam test metric result, the CE status is in Production. The third bullet is not the problem, because, according to the sam result, GlueCEAccessControlBaseRule=VO:cms.
The website myosg has a bdii monitoring page. Currently, its status is "(no data)". This site:
https://twiki.grid.iu.edu/bin/view/Documentation/Release3/TroubleShootingCEMonGIP
details how to validate that the data is being published. The 4.1 section of the site provides a link that displays the GIP Validation Status. Our status is that the GIP Validation script is returning unknown overall status: Could not get LDIF Entries. LDIF works with the command ldapmodify, which, when run, displays:
ldap_sasl_interactive_bind_s: Can't contact LDAP server (-1)
(*) According to <http://www.openldap.org/faq/data/cache/1432.html>, the error could be the result of faulty certificates in the file determined by the TLS_CACERTDIR variable in /etc/openldap/ldap.conf. The certificates are fine, and the problem persists.
(*) <http://support.jumpcloud.com/knowledgebase/articles/442411-ldap-ldapsearch-can-t-contact-ldap-server-1> offers a solution to the problem: within /etc/openldap/ldap.conf, replace any lines beginning with "TLS_CACERT" with:
TLS_CACERT /etc/ssl/certs/ca-bundle.crt
The error persists.

The problem was with them, not with us; the tests all miraculously went green.