1/26/15
On the nodes, to properly enable the fetch-crl cron job and set it to run at boot:
chkconfig fetch-crl-boot on
chkconfig fetch-crl-cron on
service fetch-crl-boot start
service fetch-crl-cron start
fetch-crl runs when "service fetch-crl-boot start" is executed. If any part of the program fails, it returns 1 and the boot lockfile is never created. The lockfile is just an empty file used as a switch: if it is present the action executes, if not it doesn't. To manually create the boot lockfile:
touch /var/lock/subsys/fetch-crl-boot
The cron lockfiles did not exist on nodes 1-4, 2-1, and 2-2; running the four commands above fixed the issue. The boot lockfile did not exist on the CE; it was created manually.

1/28/15
On the SE: edited /etc/gratia/xrootd-transfer/ProbeConfig and /etc/gratia/xrootd-storage/ProbeConfig, changing SiteName from "Generic Site" to T3_US_FIT. The failure of the metric org.cms.SRM-VOPut is possibly due to the Bestman and xRootd configuration.
On the CE: set the job priority factor of glow and osg to 10000, e.g.:
condor_userprio -setfactor osg@uscms1.fltech-grid3.fit.edu 10000

1/29/15
On nodes 2-5 to 2-9: edited /etc/condor/config.d/00personal_condor.config, changing Start = True to Start = FITReserve. Only jobs submitted with "+FITReserve = True" will start on those nodes.

2/13/15
On the SE: installed the SRM tester to see if bestman2 is working. The SRM test was OK.
On the CE, in /etc/condor/config.d/00personal_condor.config:
CONDOR_HOST = uscms1.local
ALLOW_NEGOTIATOR = $(COLLECTOR_HOST)
uscms1.local was also included in ALLOW_WRITE.

2/17/15:
(*) Changed the certificate for Bestman on the SE. It was produced on an SLC6 machine and hence was not working properly; copied the rsv cert files to the SE and renamed them as the bestman cert files. The DNS resolution error in the VOPut SAM test seems to have disappeared.
(*) Also made a directory /mnt/nas1/store and changed a couple of lines in storage.xml. The reason some of the SAM tests fail is that there is no directory called /bestman..., and the SAM tests were trying to copy files to that directory. Let's see what happens when the SAM test runs!
(*) The following nodes have less than 1 GB of free space for running jobs: compute-1-0.local, compute-1-5.local, compute-1-6.local, compute-2-4.local. Changed the EXECUTE variable to /mnt/nas1/execute; let's see what happens!

2/18/2015:
Mounted nas1 successfully. It turned out I had to edit the /etc/exports file, adding the public IP address of the SE, and then run:
exportfs -a

2/23/15
On both the CE and SE, added a symbolic link to /usr/lib64/libglobus_common.so.0 in /usr/lib64/condor. The library is required by /usr/lib64/libcgsi_plugin.so.1, and its absence may be the cause of the SAM test 4 failure.
Also modified /var/www/html/diagnostic/index.php (the web page with the grid diagnostics) to fix the condor status line. It now looks for "is running" and checks condor_q to see if " R " is present, which indicates that jobs are actually running.

2/26/2015:
Saw that only 144 out of 160 slots were being displayed by condor_status; it turned out that compute nodes 1-0 and 2-4 were not running condor. Clearing the /var/lib/condor/execute directory freed up a lot of space in /var, and restarting condor fixed the issue.

3/12/15
Fixed the diagnostic script to time out when running a command on a mounted filesystem. NAS0 was down, and it broke the page while waiting for a connection. I added "timeout 5" in front of all commands inside shell_exec( command here ) that access a mounted filesystem.
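For reference, the timeout behaviour can be checked by hand from a shell on the CE; a minimal sketch, where "ls /mnt/nas0" is only a stand-in for whatever command the PHP passes to shell_exec:
timeout 5 ls /mnt/nas0
echo $?
# an exit status of 124 means the command was killed after 5 seconds instead of hanging on the dead mount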
Drive 8 on the nas failed. We ordered a replacement and it arrived. To enter the RAID setup page, hit alt-3 on startup. The PDF with info on rebuilding the NAS is incorrect. I replaced the drive and selected it to rebuild; hopefully it is actually rebuilding and not just waiting for me to do something else. After selecting rebuild, hit F8 to finish.

3/13/15
Drive 8 on nas0 has been integrated. For some reason, the two nas were not in the fstab file of all of the nodes; they were present in some (at least 1-0). I added the last four lines of the CE's fstab (which were the same as the last four lines of 1-0's) to the fstab on all of the nodes using a modified version of the haha mount/umount script (ffix.sh and subjob4.sh, if they still exist when/if you read this).

3/23/2015:
Still no clue why the pilot jobs are failing the authentication tests (Globus error code 7).

3/24/2015:
Fixed the issue with Globus error code 7. In the lcmaps.db file on the CE, had to comment out the following lines (they stay uncommented on the worker nodes):
## Policy 1: GUMS but not SAZ (most common)
#verifyproxy -> gumsclient
#gumsclient -> glexectracking
The reason is that glexec is not installed on the CE, so there is nothing to track! Waiting for the SAM tests to run again on the site; hopefully they will go back to their original status.

3/25/2015:
The glexec SAM test kept failing. The reason was that the vo-client package had only been updated on the CE and not on the nodes; it needs to be updated because voms changed to voms2 and lcg-voms changed to lcg-voms2, so it couldn't get the proxy properly. vo-client has been updated on 7 nodes (1-0 to 1-7). The certificates also had to be updated on all the nodes by running fetch-crl.

3/28/2015:
Unfortunately, restarting the SE resulted in a rocks installation again! Some of the files are still intact (it didn't format everything), but I am configuring everything again. Ran into issues with file ownerships being displayed as nobody on the SE; edited the /etc/idmapd.conf file (see the fall 14 log), but that didn't fix the issue entirely. Clearing the idmapd cache helped; just ran nfsidmap -c . Phedex requires a lot of packages to run, so run:
yum install glibc coreutils bash tcsh zsh perl tcl tk readline openssl ncurses e2fsprogs krb5-libs freetype compat-readline5 ncurses-libs perl-libs perl-ExtUtils-Embed fontconfig compat-libstdc++-33 libidn libX11 libXmu libSM libICE libXcursor libXext libXrandr libXft mesa-libGLU mesa-libGL e2fsprogs-libs libXi libXinerama libXft libXrender libXpm libcom_err

03/29/2015
Xrootd is configured. Add a fuse mount to the SE fstab:
xrootdfs /mnt/nas1/xrootd fuse rdr=xroot://uscms1.fltech-grid3.fit.edu:1094//path/,uid=xrootd 0 0
Modify the configuration in /etc/gratia/xrootd-transfer/ProbeConfig and /etc/gratia/xrootd-storage/ProbeConfig; change the settings in both files to:
SiteName="T3_US_FIT"
EnableProbe="1"
Make sure everything is running:
$ service globus-gridftp-server start
$ service bestman2 start
$ chkconfig bestman2 on
$ service gratia-xrootd-transfer start
$ service gratia-xrootd-storage start

03/30/2015:
Installed Phedex from the Phedex admin documentation. The installation was not working for myarch=slc6_amd64_gcc481, which is the recommended setting for SLC6; changed it to myarch=slc6_amd64_gcc461 and it worked. Started the agents, and that changed the TFC entries for Phedex. Doug suggested creating subdirectories in the /mnt/nas1/store directory and giving the grid users write access (a sketch of the group setup follows; the next entry records what was actually done).
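A minimal sketch of that group setup, assuming the usual groupadd/usermod route; the group name gridusers is the one used in the chown below, but the account names here are placeholders and the real list lives in addusergroup.sh:
groupadd gridusers
# add each mapped grid user account to the new group (placeholder names)
usermod -aG gridusers uscms01
usermod -aG gridusers uscms02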
Accordingly, created a new user group and added all the grid users to it (look at the addusergroup.sh file), then:
mkdir generator mc merge relval results temp test unmerged user PhEDEx_LoadTest07 data
chown -R root:gridusers /mnt/nas1/store
Used permissions 1777 for the store directory and all its sub-directories.

04/02/2015:
Transferred the host cert and host key files from the CE to the SE and gave them the right permissions; also created a .globus directory and copied the usercert and userkey files, and hence the SAM tests for the SE have all gone green (yay!!!!). Set the file permissions properly for the host certificates:
chmod 444 /etc/grid-security/hostcert.pem
chmod 400 /etc/grid-security/hostkey.pem
SAM test number 4 has also gone green, so only one more critical and two warning SAM tests to go (one of the warnings applies to all the T3 sites).
Added the entry needed for the xrootd fallback SAM test to the site-local-config.xml file.

04/03/2015:
As expected, the fallback SAM test went green. However, the SE SAM test flips between Critical and OK, the same way as the rsv gridftp test. Also, the crl expiry rsv test has been critical since March 30.

04/08/2015
crl expiry has gone green on its own. To view/edit rsv metrics, go to /etc/rsv/.
Changed /etc/xinetd.conf:
per_source = UNLIMITED
instances = UNLIMITED
Restarted xinetd:
/etc/init.d/xinetd restart
The problem was not resolved.

05/11/2015:
Fixed Compute-2-8 finally, and cvmfs finally works. Work in progress on Phedex.
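For future reference, a quick way to confirm cvmfs is actually healthy on a worker node (a hedged example; cms.cern.ch is assumed to be the repository in use here):
cvmfs_config probe cms.cern.ch
# prints OK if the repository mounts and responds
cvmfs_config stat cms.cern.ch
# shows cache usage and which proxy/stratum-1 is actually being used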