01/04/2016 (Daniel) TAGS: NAS1 NAS2 PHEDEX SITECONF GIT COMMIT STORAGE.XML
The backup of NAS1's main data is complete. This is enough to run PhEDEx again. I changed the storage.xml file (on the SE, /home/phedex/SITECONF/T3_US_FIT/PhEDEx/storage.xml) to update the location from nas2 to nas1.
To commit the change, use:
$ git add storage.xml
$ git commit storage.xml -m "Some commit message"
While changing the git user name and email, I accidentally ran git commit without specifying storage.xml. This committed quite a few files, including DBParam. I removed DBParam by first backing it up, then running:
$ git rm -f DBParam
and then returning the copy to /home/phedex/SITECONF/T3_US_FIT/PhEDEx/
I then started the PhEDEx agents, which seemed successful.

01/05/2016 (Ryan) TAGS: sam 11 12 critical se
The test is complaining that the file
/mnt/nas1/store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0013/52D06725-4BAE-E111-A059-001D09F252DA.root
does not exist. Upon investigation, however, the file does exist on the CE. The metrics page states that a "cmsRun" command was run and could not find the file in question. Upon trying to run "cmsRun", however, both the CE and compute-2-3 reported that the command was not found. Test 12 has now gone critical as well.

01/08/2016 (Daniel) TAGS: DIAGNOSTICS NAS DF
The diagnostic page now shows both percent and size for the usage statistics.

01/13/2016 (Eric, Ryan) TAGS: nas1 rocks cluster
We are adding nas1 to the cluster with rocks.
1) Install rocks on the nas
I unmounted nas1 from both the CE and the SE using plain umount, and I unmounted nas2 from nas1 using the same method. We then plugged in an external disc drive with a rocks 6.1.1 Jumbo DVD disc inside and restarted nas1. nas1 not only did not boot straight into the disc, it failed to recognize the disc at all! Upon further investigation, it was discovered that the disc drive was at fault, not the disc. A bootable USB of rocks is required.

01/14/2016 (Ankit, Ryan) TAGS: nas2 nas1 sam critical redirection
While nas1 is down for its addition to the cluster, the sam tests that go to it must be redirected to nas2 to avoid the tests becoming critical.
1) delete the old siteconf directory from the CE
$ rm -rf ~/siteconf/
2) authorize the download:
$ kinit -A -f cernusername@CERN.CH
3) download the siteconf directory:
$ git clone https://:@git.cern.ch/kerberos/siteconf
4) make appropriate changes to ~/siteconf/T3_US_FIT/Phedex/storage.xml
   change the directory paths to the desired one
   NOTE: make use of the replace-string command in emacs
5) implement changes
$ git add storage.xml
$ git commit -m "NAS1 to NAS2"
$ git push origin master
6) repeat step 4 on the SE (the siteconf directory will be in caps)
   NOTE: user must be phedex for steps 6-8
7) restart the PhEDEx agents
   NOTE: user must be in the phedex home directory while executing the commands in steps 7 and 8
$ PHEDEX/Utilities/Master -config ~/SITECONF/T3_US_FIT/PhEDEx/Config.Debug start
8) after about 30 minutes, stop the agents
$ PHEDEX/Utilities/Master -config ~/SITECONF/T3_US_FIT/PhEDEx/Config.Debug stop

01/22/2016 (Ryan) TAGS: CE full
The CE is full of data! I am finding the largest files and investigating them (see the sketch below).
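A minimal sketch for locating the largest directories and files on the CE; the paths scanned here (/ and /home) and the 500M size cutoff are assumptions, and the flags may need adjusting for this system's coreutils version:

# largest first-level directories under /, staying on this filesystem
$ du -xk --max-depth=1 / 2>/dev/null | sort -rn | head -n 20

# largest individual files under /home, sizes in MB
$ find /home -xdev -type f -size +500M -exec du -m {} \; 2>/dev/null | sort -rn | head -n 20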
02/01/2016 (Ankit, Ryan) TAGS: sam test 12 critical
SAM test 12 has gone critical again! The quickest way to fix it is to restart the SE, but a more permanent solution is needed.

02/06/2016 (Ryan) TAGS: nodes local condor nas2
I turned off NAS-2 and turned on the other 10 nodes. Five of those nodes I am reserving for local users (Stefano) by setting START = LOCAL in /etc/condor/config.d/00personal_condor.config on compute-2-5 to compute-2-9. After changing the value on a node, run:
$ condor_reconfig
to apply the change, then run:
$ condor_config_val -v START
to verify that the change has been made.

02/09/2016 (Ryan) TAGS: nodes down du unmount mtab
After NAS-2 was removed, the "du" command hung on the nodes. This was because NAS-2 was never properly unmounted from the nodes. To fix the issue, the entry for NAS-2 must be manually removed from /etc/mtab on all of the nodes.

02/09/2016 (Ryan) TAGS: cleaning
I cleaned up some of the dust on the outside of the cluster with compressed air, swiffer wipes, and the static-free wipes on the desk.

02/12/2016 (Ryan) TAGS: condor jobs priority
A local user (Stefano) would like to pause all currently running jobs, use as many CPUs as possible to quickly run his jobs, then return the nodes to the cluster when his jobs are complete.
NOTES:
(*) condor_q is used to view all jobs
(*) condor_prio can be used to change job priority
(*) ganglia cli monitors several metrics for nodes (including CPU load)
(*) condor_suspend can be used to pause jobs on the CPU -- the job still occupies the slot and is still consuming RAM, but it is not consuming CPU cycles
(*) condor_status can be used to view the status of each CPU on each node
(*) $ condor_config_val -v START can be used to directly view the value of START
QUESTIONS:
(*) Can a new job run on a CPU where there is a paused job?
(*) Can jobs be paused (not killed, not waited on to completion)? [A] yes, with condor_suspend
(*) Is there a way to monitor how many CPUs are idle? [A] yes, with ganglia
SOLUTION:
(*) Stefano queues his jobs, then a script executes:
$ condor_suspend -constraint 'Owner =!= "SRSUser"'
When the jobs are done (can be detected with condor_status), run
$ condor_continue -all
to resume the jobs.
-- this assumes that a new job can run on a CPU where one is paused and that the paused job can be resumed once the new job is complete
[P] Even though a job is suspended, the CPU is still labeled as "Claimed", and no jobs will run on it. A sketch of such a wrapper script follows this entry.
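A minimal sketch of the wrapper described above, built only from the commands already listed (condor_suspend, condor_status, condor_continue); the owner name "SRSUser" comes from the entry, while the polling interval and the way "jobs are done" is detected are assumptions, not a tested implementation:

#!/bin/bash
# suspend everything not owned by the local user, wait for his jobs to drain,
# then resume the suspended jobs
LOCAL_OWNER="SRSUser"   # owner name taken from the entry above

# pause every job that does not belong to the local user
condor_suspend -constraint "Owner =!= \"$LOCAL_OWNER\""

# crude wait loop: poll condor_status until no claimed slot still belongs to the local user
while condor_status -claimed | grep -q "$LOCAL_OWNER"; do
    sleep 300
done

# resume everything that was suspended
condor_continue -all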
02/15/2016 (Ryan) TAGS: rsv warning certificates
The certificate RSV tests went into warning. The cause of the failures is imminent certificate expiration. To install the host certificate, follow the instructions on:
https://twiki.opensciencegrid.org/bin/view/Documentation/Release3/GetHostServiceCertificates#Request_a_Host_Certificate
Once you have approved the certificate and downloaded your copy, transfer it to the cluster. The certificates must be placed in these directories with their respective owners:
CE:
/etc/grid-security/hostkey(hostcert).pem - root
/etc/grid-security/rsv/rsvkey(rsvcert).pem - rsv
/etc/grid-security/http/httpkey(httpcert).pem - tomcat
SE:
/etc/grid-security/hostkey(hostcert).pem - root
/etc/grid-security/bestman/bestmankey(bestmancert).pem - bestman
NODES:
/etc/grid-security/hostkey(hostcert).pem - root
The certificates must also be given their proper names. The services (rsv, globus (service name: tomcat6), bestman2 (on the SE)) must also be restarted.
NOTE: GUMS is the main certificate software; certificate problems are often in GUMS. Change DC (DigiCert) in GUMS. Delete the RSV mapping and map to the DN of the rsv cert:
/etc/grid-security rsv/uscms1... uscms1...

02/15/2016 (Ryan) TAGS: sam 12 critical
SAM test 12 has gone critical (again), and I made it green by restarting the SE.

02/22/2016 (Ryan) TAGS: date configuration
DATE is installed and must be configured on the cluster. I have been provided with Michael Staib's powerpoint on DATE. The /date directory mentioned in the powerpoint is located at /mnt/nas1/test_install/opt/date
The scripts
/mnt/nas1/test_install/opt/date/runControl/do_start_dim.sh
/mnt/nas1/test_install/opt/date/setup.sh
must be modified to contain the correct paths (replace /date with /mnt/nas1/test_install/opt/date).
event.h is an important file used to compile many of the .c files in the /mnt/nas1/test_install/opt/date/db directory, and it is located at /mnt/nas1/test_install/opt/date/commonDefs/event.h
investigate: /mnt/nas1/test_install/opt/date/runControl/do_start_dim.sh

02/24/2016 (Ankit, Ryan) TAGS: iozone default values
Find default values for iozone.

02/29/2016 (Ankit, Ryan) TAGS: GUMS administrator adding .pem .p12
To add a new GUMS administrator:
1. Copy the new admin's OSG certificate (.p12) to the cluster.
2. Convert the .p12 file to a .pem file.
   (openssl pkcs12 -in path.p12 -out newfile.crt.pem -clcerts -nokeys)
3. Determine the DN of the new admin.
   (openssl x509 -in usercert.pem -subject -issuer -dates -noout)
   copy the subject= line
4. Run the add admin command.
   (gums-add-mysql-admin '')
5. Restart tomcat6 (the gums service).
   (service tomcat6 restart)

02/29/2016 (Ryan) TAGS: nas0 nas-0-0 degraded restart rebuild
nas0 had been in a degraded state for about a month. Physical drive 2 was "not-present" and drive 10 was experiencing a "SMART-failure". Upon a restart of the system today, however, drive 10 is now rebuilding, although drive 2 is still "not-present".
To check rebuild status:
$ tw_cli /c0/u0 show rebuildstatus
The rebuild is stuck at 93%. The current solution is to upgrade the firmware. Drive 8 is now listed as "ECC-ERROR" and drive 12 has a "SMART-FAILURE".
The backup of nas0 to nas1 has been started:
$ nohup rsync -av --append /mnt/nas0/home /mnt/nas1/nas0-bak-20160304 &
I found some instructions (originally for Debian, not yet optimized):
1. add: deb http://jonas.genannt.name/debian lenny restricted to /etc/
2. import key with: wget -O - http://jonas.genannt.name/debian/jonas_genannt.pub | apt-key add -
TO RESTART NAS-0:
1. Unmount nas0 from everything.
   $ umount -l /mnt/nas0
   $ umount -l /home
2. Restart nas0.
3. Remount nas0 on everything.
   $ mount /mnt/nas0
   $ service autofs restart

03/01/2016 (Ankit, Ryan) TAGS: glexec critical compute-1-1 SAM
The glexec SAM test is critical on compute-1-1. The fetch-crl cron job was not running. This was discovered by checking the /var/log/cron file and searching for "fetch-crl"; it was absent on compute-1-1 but present on compute-1-2.
To see fetch-crl status:
$ /etc/init.d/fetch-crl-cron status
To restart fetch-crl:
$ /etc/init.d/fetch-crl-cron restart
A quick way to check this across all of the nodes is sketched below.
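A minimal sketch, assuming passwordless ssh from the CE to the compute nodes and the node names used elsewhere in this log (compute-1-0 through compute-2-9):

# how many fetch-crl lines are in each node's cron log? (zero suggests the cron job is not running there)
$ for n in compute-1-{0..9} compute-2-{0..9}; do echo -n "$n: "; ssh $n 'grep -c fetch-crl /var/log/cron' 2>/dev/null; done

# check the fetch-crl-cron service itself on a suspect node
$ ssh compute-1-1 '/etc/init.d/fetch-crl-cron status'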
03/14/2016 (Ankit, Ryan) TAGS: drive swap nas0
Drive p10 has failed and must be replaced. First the drive must be removed from the RAID:
$ tw_cli maint remove c0 p
The drive can then be removed and replaced.
NOTE: a screwdriver is required to remove the drive housing
Once the new drive is in place, the RAID card must be rescanned:
$ tw_cli /c0 rescan
The new drive should start rebuilding. If it does not start automatically, the rebuild process can be manually started with:
$ tw_cli maint rebuild c0 u0 p
NOTE: c0: card number, u0: RAID number
Use
$ tw_cli /c0 show alarms
$ tw_cli /c0/u0 show rebuildstatus
to monitor progress.

03/17/2016 (Ryan) TAGS: compute-2-0 df -h hanging sam failure test nfs
$ df -h hangs when executed on compute-2-0. Ankit says SAM tests are also failing on it, and he mentioned that it could be an NFS issue.

03/17/2016 (Ryan) TAGS: nas0 nas-0 drive 12 SMART failure
On March 15, drive 12 exceeded the SMART threshold, so it must be replaced.

03/18/2016 (Ankit, Ryan) TAGS: compute file system issue sam tests fail df
Some SAM tests were failing on compute-1-0 and compute-2-0. The SAM test was reporting that it could not access some files, and df -h did not work on compute-2-0. For compute-2-0, NAS-2 was still listed in the mtab file, so we removed it. For compute-1-0, NAS-1 was not mounted.
NOTE: For filesystem issues, check /etc/fstab and /etc/mtab
df -h should be monitored to check for future issues.

03/21/2016 (Ryan) TAGS: sam test 1 critical 4 warning glexec compute-1-1
SAM TEST 1: The glexec sam test (SAM 1) has failed for compute-1-1. The fetch-crl is absent from the /var/log/cron file, as before, so I restarted the process using the command previously mentioned. When I tried to run
$ /usr/sbin/glexec
to get the payload uid (the command mentioned by the SAM test), it said:
[gLExec]: environment variable $GLEXEC_CLIENT_CERT is empty.
The SAM test appears to be accessing a non-existent file for $GLEXEC_CLIENT_CERT:
/var/lib/condor/execute/dir_16220/nagios/probes/org.cms.glexec/testjob/tests/payloadproxy
SAM TEST 4: SAM test 4 has gone into warning alongside the critical SAM test 1. It reports that "SIGTERM has been caught" on compute-1-1.
Both SAM tests appear to have been fixed by the simple restart.

03/21/2016 (Ryan) TAGS: partition compute-1-1 nodes
I changed the START value of compute-1-1 to PART to reserve it for partitioning testing.
/etc/condor/config.d/00personal_condor.config

03/24/2016 (Ryan) TAGS: glexec compute-1-1 osg jobs fail
glexec is not working on compute-1-1. When the diagnostic command
$ voms-proxy-init -voms cms:/cms
is run as user amohapatra, it does not connect. Because the first diagnostic command will not work, none of the others will work either. The errors say that there is an SSL handshake error between compute-1-1 and the two cms servers it tries to connect to; contacting them reportedly fails due to an SSL handshake error. It appears that the handshake fails due to outdated certificates: it says that the CRL has expired. The problem was that fetch-crl was not running. Check /var/log/cron for fetch-crl. Use
$ fetch-crl
to run fetch-crl. The automatic running of fetch-crl seems to not be working properly.

03/28/2016 (Ryan) TAGS: xrootd /etc/xrootd/xrootd-clustered.cfg TFC (Trivial File Catalog) ~/siteconf/T3_US_FIT/PhEDEx/storage.xml

03/30/2016 (Ryan) TAGS: repartition compute-1-1
I began the test repartitioning of compute-1-1. I followed the first page of instructions here to shrink the original /scratch partition:
http://www.htmlgraphic.com/how-to-resize-partition-without-data-loss/
I followed these instructions on how to make /var its own partition:
http://unix.stackexchange.com/questions/131311/moving-var-home-to-separate-partition
IMPORTANT: Make sure the UUIDs of the partitions are input correctly into /etc/fstab. The correct UUIDs can be found by running
$ blkid
or
$ ls -l /dev/disk/by-uuid
An illustrative fstab entry is shown below.
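A minimal sketch of what the /etc/fstab line for the new /var partition might look like; the UUID shown is a placeholder to be replaced with the value blkid reports, the device (/dev/sda3) and filesystem type (ext3) are taken from the surrounding entries, and the mount options are generic defaults:

# hypothetical entry for /var on /dev/sda3
UUID=1234abcd-0000-0000-0000-56789abcdef0  /var  ext3  defaults  1 2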
NOTE: Toward the end of the second instruction set, make sure the new /var partition is writeable!
(1) Unmount the /var partition:
$ umount /dev/sdaX
(2) Mount the partition with proper permissions:
$ mount /dev/sdaX /var -t ext3
In case the boot partition is also unwriteable:
$ mount -o remount,rw /
Test to make sure the /var partition is writeable by trying to touch a file in it.
NOTE: the nodes operate on run level 3

04/06/2016 (Ryan) TAGS: repartition nodes
I changed the START values of compute-1-2 and compute-1-3 to PART two days ago to let the jobs currently running on them die. When running:
$ resize2fs /dev/sda3 20000M
this error is printed:
resize2fs: New size smaller than minimum (37039928)
That error was not present on compute-1-1, but it is present on compute-1-2 and compute-1-3. I went forward with the resizing on compute-1-2 and I resized the partition to an abnormally high value (~130G). compute-1-2 is experiencing an issue with:
$ tune2fs -j /dev/sda3
It reports that it "Could not allocate block in ext2 filesystem while trying to create journal file". I deleted the /dev/sda3 partition with fdisk.
NOTE: before deleting a partition, remove it from /etc/fstab
I made a new partition that uses the area on disk also occupied by /scratch. I cleared /scratch. I then followed the set of directions for giving /var its own partition. I tried changing the ext3 tag for /dev/sda3 in /etc/fstab to ext2.
$ fsck -n /dev/sda3
did its normal 5-step check rather than report that /dev/sda3 is clean, like it normally does at this step. A new error was reported when
$ tune2fs -j /dev/sda3
was run. The new error:
tune2fs: No space left on device while trying to create journal file
The rather large minimum size of the partitions may be due to the large amount of data stored in the /scratch partition I'm shrinking. In order to prevent data loss, I cannot make the partition smaller than the amount of data stored in it.
I made a mistake while creating the new partition size for /dev/sda3 on compute-1-3: I told fdisk to make the partition much larger than was possible. I fixed it (forgot what I did). The /var partition of compute-1-3 was resized to 97G (04/22/2016).

04/13/2016 (Ryan, Ankit) TAGS: compute-1-1 repartitioning cvmfs
compute-1-1 made almost all of the SAM tests go critical. Most of the critical tests reported that there was no CMS software on the node. It has been taken off condor by changing its START value from TRUE; this stops the SAM tests from examining the broken node. The symlink between /cmssoft/cms and /cvmfs/cms.cern.ch has been broken because /cvmfs/cms.cern.ch is missing. The files can be obtained by using cvmfs to transfer the appropriate data using the URL given (cms.cern.ch). /cvmfs/cms.cern.ch magically reappeared where it should be (04/15/2016), so I turned condor back on to see what the SAM tests say. It was green again until 04/21/2016, when the tests failed again due to the same problem as before. I turned off its condor. / was very full because the /etc/cvmfs/default.local file was pointing the cache to /. It was fixed by pointing it to /var/cache/cvmfs.
cvmfs can be checked with
$ cvmfs_config probe
A few quick checks for this situation are sketched below.
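A minimal sketch of the checks used to diagnose this kind of failure; the parameter name CVMFS_CACHE_BASE is the usual cvmfs cache-location setting and is an assumption here, since the entry only says the cache was "pointed" somewhere else:

# is the CMS software area reachable, and does the symlink still point at cvmfs?
$ ls -l /cmssoft/cms
$ cvmfs_config probe cms.cern.ch

# where does the cvmfs cache live, and is / filling up because of it?
$ grep CVMFS_CACHE_BASE /etc/cvmfs/default.local
$ df -h /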
04/18/2016 (Ryan) TAGS: UPS test Tripplite
I tested the two Tripp-Lite UPSs, and no errors were reported.

04/19/2016 (Ryan) TAGS: hostname '=' wrong incorrect
The hostname for the CE, rather than the usual 'uscms1.fltech-grid3.fit.edu', is now just '='. When hostname is run, '=' is returned. In /etc/idmapd.conf the domain is correct, but there was whitespace on either side of the '=' after the Domain variable:
Domain = uscms1.fltech-grid3.fit.edu
I tried deleting the space to the left of the '=', and I restarted the service with:
$ service rpcidmapd restart
Nothing changed. SAM tests have started to fail. I changed the hostname with:
$ hostname uscms1.fltech-grid3.fit.edu
After logging back in, the prompt was fixed and $ hostname returned the proper hostname. I will wait to see what the SAM tests think.

REPARTITIONING OF OTHER NODES
compute-1-4
(*) begun 04/22/2016
(*) cleared /scratch
(*) followed instructions and completed partitioning
compute-1-5
(*) already complete (148G)
compute-1-6
(*) already complete (40G)
compute-1-7
(*) begun 04/25/2016
(*) complete
compute-1-8
(*) begun 04/25/2016
    NOTE: when clearing /scratch be sure to run
    $ rm -rfv /scratch/*
    rather than
    $ rm -rfv /scratch
    The first option actually deletes everything; the second option just removes the pointers. When the filesystem is recreated after the second option, the data will remain and cause problems later on.
(*) experiencing boot problems with new partition (rocks won't load properly on boot)
(*) Turns out everything was already done, I just didn't mount /var on /dev/sda3 on boot. The error was because /var was empty without the mount.
compute-1-9
(*) begun 04/29/2016
(*) cleared /scratch
(*) complete
compute-2-0
(*) begun 04/29/2016
(*) cleared /scratch
(*) complete
compute-2-1
(*) begun 05/02/2016
(*) cleared /scratch
(*) complete
compute-2-2
(*) begun 05/03/2016
(*) cleared /scratch
(*) complete
compute-2-3
(*) begun 05/03/2016
(*) NOTE: strange partition scheme with 100G /export partition on /dev/sda6
(*) backed up /export onto nas1 and deleted it on the node
(*) NOTE: the password is the new one
(*) backed up /var into /var.old
(*) deleted /dev/sda5 and /dev/sda6
(*) deleted strange "extended" system partition /dev/sda4
(*) created /dev/sda4, does not work
(*) we had to boot into Single User mode
    (-) enter GRUB menu on boot
    (-) select "Red Hat Enterprise Linux" with the version of the kernel you wish to boot
    (-) type 'a' to append to the line
    (-) go to end of line and type 'single' as a separate word (space before word)
    (-) press ENTER to exit edit mode
(*) /var must be populated on / to boot properly
    (-) we moved the contents of /var.old to /var
(*) everything in /var is now owned by root (it should NOT be)
    (-) refer to other nodes for proper permissions
compute-2-4
(*) already complete (100G)
compute-2-5
(*) begun 05/04/2016
(*) cleared /scratch
(*) complete
compute-2-6
(*) begun 05/04/2016
(*) cleared /scratch
(*) complete
compute-2-7
(*) begun 05/05/2016
(*) cleared /scratch
(*) complete
compute-2-8
(*) begun 05/06/2016
(*) cleared /scratch
(*) complete
compute-2-9
(*) begun 05/06/2016
(*) cleared /scratch
(*) complete

04/27/2016 TAGS: ganglia 1-min loads
The 1-min load on the ganglia load plot has shot up. On the Ganglia website, the page showed that the loads of compute-2-2 and compute-2-3 had shot up (this is reflected in the 1-min load). Both nodes had faulty /etc/mtab files. A symptom of this is that df -h doesn't work. Once the mtab files are fixed, restart Ganglia on the CE:
$ service gmond restart
$ service gmetad restart
A quick check for stale mtab entries is sketched below.
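A minimal sketch for spotting stale NAS entries on a node (the NAS names are the ones used throughout this log); an entry that appears in /etc/mtab but not in the kernel's mount table is a likely cause of df -h hanging:

# what mtab claims is mounted vs. what the kernel says is actually mounted
$ grep -i nas /etc/mtab
$ grep -i nas /proc/mounts

# if a NAS line appears only in /etc/mtab, back the file up and remove that line
$ cp /etc/mtab /etc/mtab.bak
$ vi /etc/mtab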
05/02/2016 TAGS: compute-1-3 /tmp cvmfs low space SAM 9 critical
SAM test 9 went critical on compute-1-3; the reason was that /tmp was full. cvmfs still had its cache in /, which was now full. To change the location of the cvmfs cache, edit /etc/cvmfs/default.local. It was changed from /scratch/cvmfs to /var/cache/cvmfs. The directory cvmfs must be manually created in /var/cache and its ownership must be changed from root:root to cvmfs:cvmfs:
$ chown -R cvmfs:cvmfs /var/cache/cvmfs/
The service must then be reloaded:
$ cvmfs_config reload
Clear the cache with:
$ cvmfs_config wipecache
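A minimal sketch of what the relevant line in /etc/cvmfs/default.local might look like after this change; CVMFS_CACHE_BASE is the usual parameter name for the cache location and is an assumption here, and any other lines already in the file would be left as they are:

# hypothetical /etc/cvmfs/default.local excerpt
CVMFS_CACHE_BASE=/var/cache/cvmfs

After editing, the sequence used in the entry above would be:
$ mkdir -p /var/cache/cvmfs
$ chown -R cvmfs:cvmfs /var/cache/cvmfs/
$ cvmfs_config reload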