June 5, 2012
Added 128.142.202.212/255.255.255.255 to HOST_MONITOR in squid.conf, located on the SE at /sandbox/squid/frontier-cache/squid/etc/. From Barry: "Your site should open port 3401/udp to requests from: 128.142.202.212/255.255.255.255"... apparently we are not responding to SNMP? Added the IP stuff and restarted squid along with vdt (to restart the gatekeeper).

June 8, 2012
Restarted the website (restarted vdt) and remounted nas1 today (# mount -a).

June 11, 2012
nas1 unmounted again; did mount -a. The SE is not responding; it had been turned off (found out from Dr. Hohlmann). Dr. Hohlmann turned it back on and I still cannot ssh to it; pinging doesn't work. Waiting on Kim to respond to see if she can reboot the cluster, else Dr. Hohlmann will need to.

June 13, 2012
Cluster reboot due to the power outage; it also needed to happen anyway. Skyped with Dr. Hohlmann, Jessie, and Christian and they turned things off. Will wait until later to turn things back on.

June 14, 2012
Restarted processes on the cluster; noticed /var at 94%???

June 15, 2012
So much wrong with the cluster. Restarted some things and sent another email to Kim; that one directory is at 97% now. I don't know what is causing it.

June 20, 2012
Kim fixed the files. It was the cluster emailing itself again. I knew it! Also, my CERN account got hacked; changed passwords.

June 25, 2012
SE down; restarted vdt and the website on the CE. Himali needs the newest version of CMSSW. I emailed Bockjoo for it.

June 26, 2012
The SE and NAS1 like to turn off for some reason. BAD! The CMSSW versions were already there; installed the newest CRAB version, 2_8_1.

June 29, 2012
Helped Himali with a node that did not want to recognize the CMSSW stuff.

July 3, 2012
Ganglia down; restarted vdt.

July 9, 2012
NAS0 became unmounted; remounted it. Also, condor wasn't running, which I found out via Bockjoo; not sure if there is more to fix or not. Also fixed something on the twiki. All fixed. Condor probably just stopped working because nas0 became unmounted.

July 17, 2012
/dev/sda1 is 100% full. The condor spool files were HUGE; deleted them.

July 25, 2012
Condor was down; restarted vdt, condor, and nfs. Works again. Also, I am becoming the site contact/executive for PhEDEx.

August 24, 2012
Summary of the SPOOL problem: When a user submits a job with checkpointing enabled, a file is created in /scratch/condor/spool. The filename contains the job number, so one can go to condor_q and review the details of the job (i.e. who submitted it, how long it has been running, the size, etc.). Two of our users had jobs that had been running for about two months and were clearly having problems. This causes HUGE spool files that in turn cause us problems. One can submit a job without the ON_EXIT_OR_EVICT setting and instead just use ON_EXIT to avoid this checkpointing. Condor should delete these spool files, though as long as the job is running they will only get bigger. For most users this is not a problem, but if there are issues with the code or a huge file, it is. The solution that needs to be found is a way to limit the spool size or the amount of checkpointing per job. Currently I have asked the two users not to use checkpointing and to see if their jobs that have been in the queue for so long can be deleted.
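For reference, a minimal sketch of the submit-file change I am asking users to make. The executable and file names here are made up; the only line that actually matters is when_to_transfer_output.

  universe                = vanilla
  executable              = run_analysis.sh
  output                  = job.out
  error                   = job.err
  log                     = job.log
  should_transfer_files   = YES
  # ON_EXIT_OR_EVICT tells Condor to checkpoint the job's sandbox into
  # /scratch/condor/spool whenever the job is evicted, which is what grows huge.
  # ON_EXIT only transfers output when the job finishes, so no spool snapshot builds up.
  when_to_transfer_output = ON_EXIT
  queue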
November 2, 2012
List of stuff wrong on the cluster:
1) Network randomly stops working
2) GRAM Authentication Error, possibly a DN not registered properly (this also means that new versions of CMSSW cannot be delivered)
3) PhEDEx shows up as a bash terminal instead of working properly
4) sshing to dev-0-0 gives the error "Address 163.118.42.2 maps to uscms1-se.fltech-grid3.fit.edu, but this does not map back to the address - POSSIBLE BREAK-IN ATTEMPT!" (potentially something to do with /etc/hosts?)
5) Not fully connected to the outside world; cannot receive OSG or GLOW jobs
6) Potentially something wrong with GUMS
7) nas1 unmounts itself sporadically
8) Himali's SRMCopy issue
In /etc/grid-security, changed grid-mapfile-local to include MY information instead of Patrick's. Made a copy of the old one just in case and put it into grid-mapfile.backups.

November 4, 2012
Jordan suggested that part of our authentication issues may be due to our certificates including our old IP address (encrypted in them somehow). So I am going to update our certificates. Boo. It's giving me errors. Apparently it forgot I am a GridAdmin. Thunderbird to the rescue! I think I know the error: between when I became GridAdmin and now, my certificate has been renewed. The cluster only shows my new (unexpired) certificate, while the GridAdmin role is attached to my old (expired) one. Boo, technicalities. Now we wait. In the meantime, things to bring up in the meeting:
1) Renewing the certs to reflect the new IP address; working on GridAdmin status (new cert, no longer connected to my active certificate)
2) Will look into updating the drivers and firmware for the network cards (may cause us some trouble with ROCKS as per some of Xenia's documentation; hope not!)
3) Work more on the GRAM/globus/GUMS error; potential issue with my DN not being registered properly. (Email the OIM people and have them combine my two accounts; this may help)
4) Fix PhEDEx... why is it a bash terminal?

November 5, 2012
GUMS is a big issue. I had to add myself as the GUMS admin, and then I reran the grid mapping info for the site and for RSV. I also updated the VO members.
https://docs.uabgrid.uab.edu/suragrid/Grid_User_Management_System_%28GUMS%29#Configuration_via_Web_Interface
https://uscms1.fltech-grid3.fit.edu:8443 (use the above to check)
To configure GUMS: https://www.opensciencegrid.org/bin/view/ReleaseDocumentation/InstallConfigureAndManageGUMS#Install_a_Certificate_Authority
Still getting some GRAM authentication errors, but I am definitely on the right track. Also, squid.conf needs to be updated with the new IPs (first I need to find the old IPs). Then I'll reconfigure squid and restart it.
GRAM worked for about 20 minutes, then stopped. Boo. Checking /opt/osg/globus/var/globus-gatekeeper.log for clues. The errors were:
"Failure: globus_gss_assist_gridmap() failed authorization. gridmap.c:globus_gss_assist_map_and_authorize:1944: Error invoking callout globus_callout.c:globus_callout_handle_call_type:727: The callout returned an error prima_module.c:Globus Gridmap Callout:394: Gridmap lookup failure: Identity Mapping Service did not permit mapping for /DC=org/DC=doegrids/OU=Services/CN=rsv/uscms1.fltech-grid3.fit.edu"
also "Gridmap lookup failure: Failed to talk to the Identity Mapping Server"
and "PRIMA ERROR prima_module_scas.cpp:88 ts=2012-11-05T21:05:29-05:00 Username_handler: Error: Couldn't find the username 'cmsprod' in the password file."
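For my own reference, grid-mapfile and grid-mapfile-local entries are just a quoted certificate DN mapped to a local account, something like the lines below. The person DN and both local account names are placeholders, and the rsv mapping is my guess at what ours should look like, not a copy of our file. As I understand it, when the PRIMA callout to GUMS is enabled the gatekeeper asks GUMS for this mapping rather than reading the static file, which is why fixing GUMS matters more than editing the file itself.

  "/DC=org/DC=doegrids/OU=People/CN=Some Admin 123456" uscms01
  "/DC=org/DC=doegrids/OU=Services/CN=rsv/uscms1.fltech-grid3.fit.edu" rsvuser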
November 7, 2012
A bunch of nodes are on but not responding (can't ssh to them). According to my research, our GRAM authentication errors are caused by either:
client: Missing grid-mapfile
client: Missing entry in grid-mapfile
But it seems to be having an issue with the SITE, not a specific user. Oddly enough, we have both a grid-mapfile and a grid-mapfile-local. Idea! Add the contents of grid-mapfile-local to grid-mapfile. Did this at 12:09. Now we wait. (Last run at ~11:42.) And that didn't work. Crap. Okay. So I am super stuck. I'll have Jordan introduce himself to the HyperNews community with an email asking for help. AHHHHH. I just want this fixed.

November 8, 2012
lcg-info may potentially be helpful. Must go to /etc/yum.repos.d/lcg/bin.
# ./lcg-info --list-ce --vo VO:cms --query 'CE=uscms1.fltech-grid3.fit.edu'
gives me the error "lcg-info: LCG_GFAL_INFOSYS undefined." Boo. A bunch of nodes are not responsive. I'm going to have to go down to the highbay to check it out.

November 11, 2012
A bunch of nodes are down... boo. I am directly connected to them via a monitor. Node 1-2 was hung up on startup. It put me in a bash shell; I exited that and it restarted. The error now: when being started, the PXE client comes up with the PXE copyright message and completes the DHCP phase, but then displays "TFTP...." After a while, the following error message is displayed: PXE-E32: TFTP open timeout. Depending on the PXE client's boot device list configuration, the node stops. "The PXE-E32 error indicates that the PXE did not get a reply from the TFTP server when sending a request to download its boot file. Possible causes for this problem are:
1. There is no TFTP server
2. The TFTP server is not running
3. TFTP and DHCP/BOOTP services are running on different machines, but the next-server (066) option was not specified"
Got this from http://www.bootix.com/support/problems_solutions/pxe_e32_tftp_open_timeout.html
I control-C'd out of that. It booted. While booting, it said that it "contains a file system with errors". I have no idea what's going on (...nothing new there). And we are back to the original error: "Unexpected Inconsistency; Run fsck manually."
...Okay, so running fsck /dev/sda1 -f fixed 1-2 and 1-5 to the point that they can be ssh'ed to. But condor still isn't running. Sshed in and restarted condor. Nodes that are still down: 1-6, 1-8, 2-7, 2-8. Condor isn't on 2-4??? and it isn't mounted to nas-0-0 (and won't mount!)

November 12, 2012
1-6, 1-8, 2-7, and 2-8 are up... boom! (Small victories!) So, Jordan suggested that VOMS may not be configured correctly with GUMS for authentication purposes. This may be the same with rsvuser, where the account in GUMS never existed. It is actually kicking off the same error. ...And Patrick has nothing on this :(

November 17, 2012
We need to update Condor, OSG, yum, and VDT. Potentially BeStMan as well. Started talking to Burcu and Sam about getting their jobs off the cluster.

November 20-21, 2012
Jordan and I updated some of the yum packages on the CE. He also updated the kernel (Patrick has old documentation on this). With these updates the CE went green on RSV!!!! ...and so did the SE!!! MAJOR SCORE!!!

November 21, 2012
Also, we are missing a lot of dependencies. Boo.
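For posterity, roughly the kind of yum sequence this involved on the CE. Treat it as a sketch rather than the exact commands we ran, and check Patrick's old kernel documentation before repeating it:

  # yum check-update              (list what updates are pending)
  # yum update --exclude=kernel*  (update regular packages, leave the kernel alone)
  # yum update kernel             (then the kernel by itself)
  # reboot                        (so the machine actually boots into the new kernel)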
November 26, 2012
I decided to take the plunge and update VDT. This is scary. I found a website from Xenia (and consequently from Patrick) on how to do this. I followed the instructions at https://twiki.grid.iu.edu/bin/view/ReleaseDocumentation/OSG12NewUpdateInstructions
So, cool. Updating VDT updates OSG and BeStMan as well. :) We went from OSG 1.2.26 to 1.2.31, and everything is updated to the absolute newest versions available (via updating VDT, the kernel, and yum). When I tried to configure OSG after the update (configure-osg -v and configure-osg -c), the cms VO did not appear. This was done on both the SE and the CE. Also, using pacman I updated some of the other services.

November 29, 2012
Updated all the vdt/pacman-affiliated services on the SE and CE using pacman. We are now failing the RSV probes.

December 6, 2012
I got an email about decommissioning the FLTECH VO. I was a bit confused about this and talked to someone at Fermilab. We don't need our own personal VO, so that's fine. But I did ask for her help regarding our unknown status. This led me to an email conversation with Rob Snihur.

December 9, 2012
Apparently, our issues may stem from the config.ini file. This file is saved and then replaces the new config.ini file when VDT is updated. This may not be correct, but it's what the documentation said to do. Will look into this further after finals. Also, someone NEEDS a CERN account. :( We will start failing even more :(
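A rough plan for checking the config.ini suspicion after finals. The directory and backup filename below are assumptions from memory (our install lives under /opt/osg); configure-osg -v and -c are the same commands used after the update:

  # cd /opt/osg/osg/etc                      (assumed location of config.ini in our install)
  # cp config.ini config.ini.carried-over    (snapshot of the file the update carried over)
  # diff config.ini.carried-over config.ini  (after fixing the cms VO section, see exactly what changed)
  # configure-osg -v                         (validate the edited file; the cms VO should show up here)
  # configure-osg -c                         (apply the configuration only once -v looks clean)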