8/15/2014: Test installation of CentOS 6.5 on compute-2-3. The installation went without any problems. The next step is to use insert-ethers to make it visible to the frontend (CE). List of software to be installed: 1. Condor 2. Ganglia 3. cvmfs 4. Rocks. We are currently running CentOS 5.10; to check, run lsb_release -i -r. uname -a gives the kernel version and architecture. Installed DRBL (drbl-2.8.25-drbl1.noarch.rpm). Going to read up on the software we are going to use: PXE, Clonezilla, DRBL, etc.

8/30: To find the IP addresses of all nodes of the cluster, look in /etc/hosts. It is also possible to ssh onto the NAS and check /etc/exports. We successfully mounted nas1 on the clean compute node (2-3)! To do so, first make sure NFS is installed:
$yum install nfs-utils nfs-utils-lib
$mkdir -p /mnt/nas1
Then add these lines to /etc/fstab (taken from the /etc/fstab of a node that was already configured):
nas-0-0:/nas0 /mnt/nas0 nfs defaults 0 0
nas-0-1.local:/nas1 /mnt/nas1 nfs defaults,rsize=65536,wsize=65536,intr,noatime,auto 0 0
nas-0-1.local:/general /mnt/general nfs defaults,rsize=65536,wsize=65536,intr,noatime,auto 0 0
nas-0-1.local:/backup /mnt/backup nfs defaults,rsize=65536,wsize=65536,intr,noatime,auto 0 0
$mount nas-0-1.local:/nas1 /mnt/nas1
Repeat for the other exports you want to mount. (Delete is the key to bring up the boot menu.)

9/17: Backed up important files on the CE and started the process of upgrading to CentOS 6.5. We attempted the install using the Rocks 6.1.1 Jumbo DVD. After finishing the install we could not connect to the internet. Attempted many different network configurations with route -n and ifconfig, to no avail.

9/19: Still cannot connect to the outside world. Emailing Daniel Flores, the FIT IT person who helped the cluster before with the installation of the 10 Gb switch.

9/22: Got a response from Daniel; it seems to be exactly what we needed. After trying to install again, it still won't connect.

9/23: Ankit and I are desperately trying to find anything that will help. We made sure we knew which connection actually was eth0 and which was eth2. It turns out eth0 goes to the back of the switch, which we assume goes to the public network, and eth2 is the private one; the opposite of what the cluster was before, for some reason. Trying to install with the connection names swapped in the installer... It works! :) We can now connect to the outside world. While we were at it, we mounted nas1 using the public IP provided by Daniel.

9/24: Attempting to install Rocks on all nodes. Tried to set up the wiki at the same time; this somehow broke the connection to the nodes. Not sure how to fix it, and it is probably faster to just reformat the CE and start again. The install of Rocks 6.1.1 on the CE went nicely. This time we are installing Rocks on the nodes without touching MySQL or any of the other wiki pieces. Installing on the nodes is fairly simple: on the CE enter the command insert-ethers --cabinet=1 and then insert the Rocks DVD into the first node (1-0). After rebooting, the node sends a DHCP request to the CE, and the insert-ethers screen shows the node name followed by a pair of parentheses. When the node has received the kickstart file, a * appears between the parentheses. From there, open another terminal and run rocks-console compute-1-0, which opens a window showing the progress of the installation.
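As a quick recap, the whole per-node sequence is just the two commands described above, run from the CE (the node name here is only an example):
insert-ethers --cabinet=1          # on the CE: listen for new nodes in cabinet 1
                                   # put the Rocks DVD in the node and reboot it;
                                   # wait for the node name, then the * once the kickstart file is delivered
rocks-console compute-1-0          # in a second terminal on the CE: watch the install progress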
The DVD is supposed to eject itself at the end of the installation, but for some reason it didn't for us, so make sure the DVD is out of the node before it reboots or it will try to install the OS all over again, which just wastes time. We successfully installed Rocks on nodes 1-1, 1-2, 1-3, 1-4, 1-7, 1-8, 1-9, 2-0, 2-1, 2-2, 2-3, 2-4, 2-5, 2-6, 2-7, 2-8, and 2-9. For some reason 1-0 gives an error while booting that there is something wrong with scratch; this may have happened because we took the DVD out too soon? Not sure. 1-5 and 1-6 installed fine but are on the wrong physical nodes and need to be switched.
After all that, we installed Rocks on nas0, since it doesn't have a public IP for whatever reason, so we cannot mount it that way, and it is much better to use private IPs for mounting the NAS machines anyway. The installation is very similar; the differences are that you don't give insert-ethers a cabinet number, and when the appliance screen pops up you should choose NAS instead of compute node. After that you can insert the DVD, wait for the kickstart request to show up in insert-ethers, and then switch over to watching nas0. The only questions it asks are about partitioning/formatting. Be careful here and make sure to only format the smaller partition (30 GB at the time of writing). The installation worked nicely; the only problem is that the partition that actually has all of the useful files did not mount on its own for some reason. Shouldn't be too hard to fix. We added this line to /etc/fstab:
/dev/sdb1 /nas0 xfs defaults 0 0
and created the /nas0 directory on nas0. After rebooting, the partition with the actual data mounted! Lots of progress today. :)

9/25: Trying to fix the 1-5 / 1-6 issue. Ran:
insert-ethers --remove="compute-1-5"
insert-ethers --remove="compute-1-6"
Reinstalling now brings them up under the correct names... It's fixed! Note: insert-ethers may print an error here; it shouldn't mean anything, and simply re-running the same command worked for us. Also removed 1-0 and reinstalled it the same way. Also working! Asked Vallary to try to ssh to the cluster; she could not. We think it could be because the ownership of the files on nas0 now shows as "nobody" instead of the actual user that owns each file. Looking around, we found this link: http://blog.laimbock.com/2009/05/21/nfsv4-on-centos-53-and-fedora-11/ Basically what we did was edit /etc/idmapd.conf on both the CE and nas0, changing the Nobody-User and Nobody-Group from nobody to nfsnobody and changing the domain to uscms1.fltech-grid3.fit.edu. This fixed the ownership issue, but Vallary still cannot ssh. Hmm, probably an issue with the /etc/passwd file.

9/26:

10/19/2014: (*)After installing MediaWiki, there were problems executing normal Rocks commands such as rocks list host (which would effectively create problems while adding the SE); it was returning a Python error.
The error message was:
Traceback (most recent call last):
  File "/opt/rocks/bin/rocks", line 300, in <module>
    command.runWrapper(name, args[i:])
  File "/opt/rocks/lib/python2.6/site-packages/rocks/commands/__init__.py", line 2213, in runWrapper
    self.run(self._params, self._args)
  File "/opt/rocks/lib/python2.6/site-packages/rocks/commands/list/host/__init__.py", line 176, in run
    for host in self.getHostnames(args):
  File "/opt/rocks/lib/python2.6/site-packages/rocks/commands/__init__.py", line 752, in getHostnames
    for host, in self.db.fetchall():
TypeError: 'NoneType' object is not iterable
Checked /var/opt/rocks/mysql/uscms1.fltech-grid3.fit.edu.err and found that the rocksdb user had somehow been deleted from the system. Created a user account for rocksdb and ran the following to fix it:
# chgrp -R rocksdb /var/opt/rocks/mysql
# chown -R rocksdb /var/opt/rocks/mysql/
# /sbin/service foundation-mysql restart
rocks list host works properly now!
(*)Somehow the iptables rules were rewritten, no idea how; at the same time the RSV tests were also failing. Going to fix the iptables ASAP. Luckily the old rules were stored in a file called iptables.dump in the root directory; moved it to the sysconfig directory. Going to see if the tests improve.

10/20/2014: (*)It seems the combination of restarting the gatekeeper (although I did restart it some time back), restarting globus-gridftp-server, and restoring the old iptables has fixed some of the RSV tests (running them by hand now to see the results). The RSV page should update every 5 minutes. (**)Update on the RSV issue: 13 out of 14 RSV tests run successfully now!
(*)The Ganglia monitoring page shows all the nodes as down! Started the gmond and gmetad services; that didn't fix the issue. Seems like a firewall issue. Fixed the rules in /etc/sysconfig/iptables and now it works!
(*)Not able to run yum commands on the compute nodes; fixing the iptables seems to have fixed this as well, and yum repolist now runs on the nodes.
(*)Tried adding the SE, but insert-ethers doesn't seem to work. Looks like a dhcpd server error. Tried restarting the DHCP server; that didn't work. Sent an email to the Rocks mailing list; let's see what happens.

10/21/2014: (*)The CE was in unknown status on the OSG map. It seems having the GUMS server on the CE was not a good idea after all. Changed the value of the GUMS server back to the old value; three critical tests have already gone green!

10/29/14: Condor is running simple jobs from us now! Edited the config files on all nodes, making sure that ALLOW_WRITE = * is present in the 00personal file. Also changed in 00personal on the CE:
IN_HIGHPORT = 9999
IN_LOWPORT = 9000
NETWORK_INTERFACE = 10.1.1.1
For some reason both of our certificates are not working; need to look into these things. Made a TWiki account for Aiwu; will start configuring a bit this Thursday and Friday.

10/30/14: (*)Added Christian to the GUMS admins. Removed the grid and cms accounts from the passwd file. voms-proxy-init -voms cms keeps failing; the reason is that the VOMS service at CERN has been down for a day. Will try adding the SE:
Method 1: Change the dhcpd file and make it listen on eth1. (Didn't work!)
Method 2: Add the SE using a rocks command instead of insert-ethers; have to google some more on that.
Daniel and Ankit were trying to add the SE. We reinstalled dhcpd, and the dhcpd service then restarted successfully (Daniel and Craig's master stroke :D).
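Roughly the recovery sequence, as a sketch only; the package and service names assume the stock CentOS 6 DHCP server (Rocks generates the dhcpd configuration itself, so the exact steps on the frontend may differ):
yum reinstall dhcp          # put the dhcpd binaries and default config back
service dhcpd restart       # this time the service came back up cleanly
tail /var/log/messages      # watch for DHCP requests from the SE once insert-ethers is running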
After that some "brown magic" happened: when we booted the SE without the CD in the drive, with insert-ethers running on the CE, the SE added itself, making the night memorable. :D
(*)Should talk about the SAM tests in the Friday meeting; make a list of things to ask!

10/31/2014: (*)Long and eventful Tier-3 meeting with Eduardo and Rob!
(*)Created 5000 uscms pool accounts on the CE and the nodes (just create the user accounts on the CE, then do rocks sync users). The auto.home file got changed, and after doing service nfs restart and service autofs restart, Rob and Eduardo were able to run grid jobs on our cluster! (So after a month or so, we have grid jobs running on our cluster!)
(*)Time to configure autofs and NFS on the compute nodes; it seems they are configured already.
(*)Need to mount nas0 and nas1 on the nodes; just added some lines to their /etc/fstab files.
(*)Created /mnt/nas0 and the other mount-point directories by running rocks run host "mkdir -p /mnt/nas0".
(*)cvmfs has been installed and configured on all the nodes.
(*)Some RSV tests have started failing! Fixed it by adding the rsv user entry back to /etc/auto.home; it was deleted while I was working on the auto.home file. All but one metric is green.
(*)Started the configuration of the SE. Did the basics, like changing the files in /etc/yum.repos.d/. Installed the OSG client on the SE. Preliminary configuration of the SE is done.
(*)Assigned a public IP address to the SE:
/opt/rocks/bin/rocks set host interface ip compute-0-0 eth0 163.118.42.2
/opt/rocks/bin/rocks set host interface subnet compute-0-0 eth0 public
(*)The public IP shows up in rocks list host interface and in /etc/hosts, but we cannot log into it properly; need to do it the way we did the CE configuration.
(*)Changed the settings for eth0; it now comes up when I do service network restart. Can ping 163.118.42.2 and can ssh from within the cluster using the public IP, but cannot ssh from outside the cluster yet!

11/01/2014: (*)Couldn't think of a better way to start this month! Created a script called addpool.sh to create pool accounts on the CE and the local nodes (a rough sketch of what such a script might look like is included after this block of entries).
(*)It looks like the SAM tests may finally go green. As per Bockjoo's suggestion, checked the grid000X directories in /var/lib/globus/gram_job_state to see if the permissions were correct; they were not! Deleted the directories, after which new directories were created with correct permissions, and saw grid jobs running on the cluster!
(*)Will work more tomorrow, depending on how many SUM tests fail/pass; for the moment, fingers crossed.

11/02/2014: (*)The SUM/SAM3 tests have started going green/critical, except the job submission metric, which has gone green. Now that is promising. Come on! (Thinks of Federer!)
(*)Should try to get the remaining SUM tests to go green ASAP.
(*)The file permissions were not set right on the compute nodes (showing nobody nobody), the same error we had on the CE some time back. Fixed it by scp-ing the file to the compute nodes and restarting the nfs service on the nodes; that seems to have fixed the error.

11/03/2014: (*)Configured glexec on all the nodes; hopefully the glexec SAM tests will go green soon.
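The rough shape of addpool.sh, as a sketch only; the account naming scheme is an assumption, and home directories are left to the NAS/auto.home setup described elsewhere in this log rather than created locally:
#!/bin/bash
# addpool.sh (sketch): create the uscms pool accounts on the CE,
# then push them out to the compute nodes.
for i in $(seq -f "%04g" 1 5000); do
    id "uscms${i}" >/dev/null 2>&1 || useradd "uscms${i}"
done
rocks sync users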
11/04/2014: (*)Christian and Ankit tried configuring the squid; still no success.

11/05/2014: (*)Instructions to create a group and give it access to a directory:
groupadd groupname
usermod -G groupname username
chmod 770 directory
chgrp group directory
(*)Added this line to enable image uploading from a URL:
$wgGroupPermissions['user']['upload_by_url'] = true;
Issues with home directories not being mounted. To fix, add the user to /etc/auto.home in this format:
craig nas-0-0.local:/nas0/home/craig
We will probably have to do this for all users (a hypothetical loop for doing it in bulk is sketched after this block of entries).
So I believe I figured out why squid was giving only misses. The way squid is set up, it only allows proxy requests from computers on the private network, so when Craig and Christian were testing from other PCs we only got misses. Not sure why requests from the cluster itself also cause only misses, though.
Installed BeStMan on the SE, requested a host certificate, and configured it as per the twiki page. Running service fetch-crl failed to grab a certificate. Tried to start BeStMan but it failed; there seem to be no error messages, so not sure where to go from here.

11/10/2014: (*)Defined OSG_APP as /cvmfs; let's see if it fixes the glexec SAM test.
(*)Installed BeStMan and xrootd (yum install osg-se-bestman-xrootd); will configure them later.

11/12/2014: Installed lynx (yum install lynx) on the CE and SE in an attempt to determine why the SE is inaccessible from the outside.
-Discovered that they report the same external IP address (lynx --dump http://ipecho.net/plain); is this expected?
-If yes, how does one specify which computer to connect to when connecting from an external network? Ports? Do the SAM tests account for this?
-If no, how do we specify a different IP?
Attempted to determine why glexec is failing the SAM test. The error message states that cmsset_default.sh does not exist, but the test could simply be looking in the wrong place.
-After looking through the various places suggested online, it appears that the file doesn't exist at all.
-Should the program CRAB be installed? It is something that keeps recurring when sifting through online sources. (Example: https://www.physics.purdue.edu/Tier2/content/crab3-purdue)

11/17/2014: More information about the external IP address conflict:
[root@compute-0-0 ~]# lynx --dump http://ipecho.net/plain
163.118.42.1
[root@compute-0-0 ~]# logout
[root@uscms1 ~]# lynx --dump http://ipecho.net/plain
163.118.42.1
I have no reason to believe that lynx is wrong, but it is certainly possible. Need to contact IT about the potential IP conflict.

11/19/2014: The SUM tests seem to be missing, and we are not getting any jobs. Ran fetch-crl by hand and got 3 jobs to run, but no jobs from grid0002. Trying to find out the reason.

11/20/2014: We are getting grid jobs again. Asked the site administrator at UMD about glexec; added a soft link to cmsset_default.sh in /. Let's see if glexec goes green. Changed the SUM test link to the SAM test link. Modified /etc/lcmaps.db so that glexec tracking is enabled. Added the location of glexec to the /etc/osg/config.d/10-mis* file; let's see if FIT passes the glexec SAM test. It didn't pass. Modified the /etc/lcmaps.db file and the gums.config file (changed read self to read all). Added the rest of the documents to the cluster wiki.
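A hypothetical bulk version of the auto.home fix mentioned under 11/05 above. It assumes nas0's home area is visible on this machine at /mnt/nas0/home (per the fstab entries earlier in this log), so treat it as a sketch and check the paths before running anything like it:
# append an auto.home entry for every home directory found on nas0,
# skipping users that already have one, then restart the automounter
for d in /mnt/nas0/home/*; do
    u=$(basename "$d")
    grep -q "^${u} " /etc/auto.home || echo "${u} nas-0-0.local:/nas0/home/${u}" >> /etc/auto.home
done
service autofs restart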
11/21/2014: Changed the permissions of /etc/glexec.conf and /usr/sbin/glexec. Deleted and recreated the glexec user; it turned out the gid was not set properly for glexec. glexec now runs successfully locally, so the glexec SAM test should go green any time now. Edited site-local-config.xml using git. https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsSiteconfMigrationToGit <- this link has instructions on how to do so. Before doing anything from that page you need to run:
kinit -A -f yourcernaccount@CERN.CH
Waiting for the changes to show up on CERN's end, and then waiting for the squid SAM test to run again, hopefully with no errors. Deleted the osg user and recreated it, and did rocks sync users. Changed the ownership of the osg home directory to osg. Deleted its entry in auto.home as well. Deleted the osg directory in /var/lib/globus/gram_job.../ and restarted the globus gatekeeper, and we have OSG jobs running!

11/26/2014: Special note to Ankit: don't, and I mean it, don't run rm -rf on anything as root. Think hard, enjoy! It may have the strength, but the effort, the struggle, is worthy of the heart. It seems that restarting the SE did the trick! Yay! Installed PhEDEx and configured it a bit. Caught up in a bigger issue: voms-proxy-init -voms cms doesn't work. A possible cause is the migration of voms to voms2 and lcg-voms to lcg-voms2; changed the entry in the /etc/vomses file.
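For reference, each line in /etc/vomses is five quoted fields: VO alias, VOMS server hostname, port, server certificate DN, and VO name. The voms2/lcg-voms2 migration amounts to pointing the hostname and DN fields at the new servers; the entries below are only illustrative, and the exact DNs and ports should be taken from the VO's published configuration rather than copied from here:
"cms" "voms2.cern.ch" "15002" "/DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch" "cms"
"cms" "lcg-voms2.cern.ch" "15002" "/DC=ch/DC=cern/OU=computers/CN=lcg-voms2.cern.ch" "cms"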