01/07/2018 I thought I turned some nodes on, so that I could work on it before the Physics Building opened up, but I guess not. RIP. I guess I'll just have to wait until tomorrow.

cont. 01/21/2018 Alright, now that NAS-0 is fixed FOR REAL this time, let's get crackin'. Jk, the nodes won't get power. *sigh* The output breakers for the plugs into which the node power strips are connected are weirded out. So that I can continue to play with condor in spite of this strange issue, I only have five nodes (2-0 to 2-4) turned on. So far, the UPS seems to be alright with that.

cont. 01/22/2018 Time to play with condor. Let's start off with a classic 'condor_ce_trace' and see where we end up. First, I need to send off my new usercert. The instructions for converting a '.p12' to a '.pem' are found at [10/16/2015]. I copied both the new 'usercert.pem' and 'userkey.pem' to '/etc/grid-security'. I tried `condor_ce_trace -d uscms1.fltech-grid3.fit.edu`, and it told me that it couldn't connect to the CE; the collector daemon appears to be off. Yup, the collector daemon's down, verified by `condor_ce_status`. I did `service condor-ce start` to start it up. Now I'm getting all kinds of output from 'condor_ce_trace'. It's saying it's unable to create a temporary file in the working directory, '/root'. Imma try to run it as Voytella and see if I get anything different. Now it's telling me it can't find an X509 proxy in '/tmp/x509up_u14122'. That's because my user certificate is hella outdated. It says to just throw a copy of it and the key into '/home/Voytella/.globus'. Excellent! I've created a valid temporary proxy! Alright, now it's doing what it was doing before: querying every single idle job in the queue. '/var/log/condor/SchedLog' is also reporting a bunch of 'PERMISSION DENIED' errors like it was doing before.

cont. 01/26/2018 I'm going through the documentation sent by OSG. It says to look for "DC_AUTHENTICATE" and "PERMISSION DENIED" errors in '/var/log/condor-ce/SchedLog'. While I don't have those errors in the condor-ce SchedLog, they're all over the place in the condor SchedLog. The errors are also slightly different from what's described in the documentation. Alright, despite the documentation being for condor-ce, I'm gonna follow its directions to see what I can discover. First, it says to check GUMS or 'grid-mapfile' to ensure that my DN is known to my authentication method. I made sure that in '/etc/osg/config.d/10-misc.ini', 'authorization_method' was set to 'xacml' and 'gums_host' was set to our hostname. There is also a note that says that if the local batch system is HTCondor, it will attempt to use the LCMAPS callouts if enabled in '/etc/condor-ce/condor_mapfile', and if that's not the desired behavior, to set 'GSI_AUTHZ_CONF=/dev/null' in '/etc/condor-ce/config.d/99-local.conf'. The GSI thing wasn't set, so I set it. Imma try condor_ce_trace again and see what happens. Nothing seems to have changed. Oh, I forgot to `condor_ce_reconfig`. Now let's see if that does anything. I ran the 'condor_ce_trace' command as my user side-by-side with a `tail -f /var/log/condor-ce/SchedLog`. The 'condor_ce_trace' is doing the thing where it queries every single job to report that it's idle and sends a "connection request to schedd at <163.118.42.1:9619>". Every time it makes a new query, it writes the same thing to the SchedLog: the number of active workers is 0, plus something about forking workers and no more children processes to reap.
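NOTE: The '.p12' -> '.pem' conversion referenced above at [10/16/2015] is essentially the standard openssl two-step (that entry is the authoritative version; this is just from memory, run wherever the '.p12' lives):
$ openssl pkcs12 -in usercert.p12 -clcerts -nokeys -out usercert.pem
$ openssl pkcs12 -in usercert.p12 -nocerts -out userkey.pem
$ chmod 644 usercert.pem
$ chmod 400 userkey.pem
The key has to stay readable only by its owner or the grid tools will refuse to use it.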
I wonder if 'condor_ce_trace' writes anything to '/var/log/condor/SchedLog'. While there's a bunch of stuff being written to '/var/log/condor/SchedLog', it doesn't look like it's being caused by the 'condor_ce_trace'; it's just a bunch of the 'DC_AUTHENTICATE' and 'PERMISSION DENIED' errors. NOTE: There are a TON of LCMAPS and GRAM-gatekeeper authentication errors in '/var/log/messages'. Let's see what doing the GSI thing for regular condor does. NOTE: In '/etc/condor/config.d', there's a mysterious '99-condor-ce.conf'. What's that doing there? There's also a '50-condor-ce-defaults.conf'. Maybe they're there so condor can talk to condor-ce? They just say that the super user can impersonate anything. I made the GSI addition and reconfigured condor. Nothing new happened. The next thing it says is to look for LCMAPS errors in '/var/log/messages'. Oh hey! We're drowning in those! Let's investigate! It looks like the error starts with an authentication of a globus user, then it says it can't open file '/etc/lcmaps/lcmaps.db'. That causes an LCMAPS plugin error, which prevents LCMAPS from initializing. Then that failure breaks everything else. Let's see about that file. NOTE: LCMAPS (Local Credential MAPping Service) translates grid credentials to local Unix credentials. Turns out there's only '/etc/lcmaps.db' and no 'lcmaps' directory. I'm gonna try to make that directory and throw the file in it. Now, in '/var/log/messages', a bunch of globus users got authenticated in a row without issue and some other stuff happened. Then it gave a warning about still being "root after the LCMAPS execution. The implicit root-mapping safety is enabled. See documentation for details.", and the next line said that "globus_gss_assist_gridmap() failed authorization" and that the callout returned an unknown error. I'm gonna see about debugging LCMAPS. There's a whole page for troubleshooting LCMAPS on the wiki. First, it said to set up LCMAPS for maximum debugging by adding the following to '/etc/sysconfig/condor-ce':
export LCMAPS_DEBUG_LEVEL=5
export LCMAPS_LOG_FILE=/tmp/lcmaps.log
Then 'condor-ce' has to be restarted:
$ service condor-ce restart
It also says that disabling HTCondor-CE's caching of authorization lookups is a good idea for testing changes to mapfiles. To disable the caching, create '/etc/condor-ce/config.d/99-disablegsicache.conf' and insert
GSS_ASSIST_GRIDMAP_CACHE_EXPIRATION=0
then restart 'condor-ce'. NOTE: It says that disabling caching could increase the load on the CE (makes sense), so keep an eye on things to make sure nothing gets too out of control. It gave me a list of configuration files in order of precedence:
/etc/grid-security/ban-mapfile (ban DNs)
/etc/grid-security/ban-voms-mapfile (ban VOs)
/etc/grid-security/grid-mapfile (map DNs)
/etc/grid-security/voms-mapfile (map VOs)
/usr/share/osg/voms-mapfile-default (map VOs default)
'/etc/grid-security/grid-mapfile' is full of grid mappings, but '/etc/grid-security/voms-mapfile' doesn't exist. Strangely enough, it says that LCMAPS is configured in '/etc/lcmaps.db', the file I thought (and it thought) was misplaced earlier. Huh. Either way, it gives me a bunch of stuff to make sure I have in it. It looks like it contains none of what it's supposed to have. Imma go through and add a bunch of stuff, then. Above the 'authorize_only' section, I added the 'gridmapfile', 'banfile', 'banvomsfile', 'vomsmapfile', 'defaultmapfile', and 'verifyproxynokey' parameters.
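For reference, those parameter definitions look roughly like this (module names and flags here are paraphrased from the OSG LCMAPS docs rather than copied out of our actual '/etc/lcmaps.db', so double-check them against the current guide before trusting this):
gridmapfile = "lcmaps_localaccount.mod"
              " -gridmap /etc/grid-security/grid-mapfile"
banfile = "lcmaps_ban_dn.mod"
          " -ban /etc/grid-security/ban-mapfile"
banvomsfile = "lcmaps_ban_fqan.mod"
              " -ban /etc/grid-security/ban-voms-mapfile"
vomsmapfile = "lcmaps_voms_localaccount.mod"
              " -gridmap /etc/grid-security/voms-mapfile"
defaultmapfile = "lcmaps_voms_localaccount.mod"
                 " -gridmap /usr/share/osg/voms-mapfile-default"
verifyproxynokey = "lcmaps_verify_proxy.mod"
                   " --allow-limited-proxy"
                   " -certdir /etc/grid-security/certificates"
Each parameter just names an LCMAPS plugin and points it at the corresponding mapfile from the precedence list above; the 'authorize_only' policy section then chains them together.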
It said to edit the 'authorize_only' section to exactly what it is now; I've commented out what was already there. It also said to make sure '/etc/grid-security/gsi-authz.conf' contains a certain line (that terminates with a newline), but that's already there (including the newline). That's the end of the document. Now let's see what happens. That globus_gss_assist_gridmap() is still failing. Oh, turns out this troubleshooting guide I was following is just the tail end of the whole LCMAPS page. Imma run down it from the top and see what I can see. It says that to enable the LCMAPS VOMS plugin, I have to add the following to '/etc/osg/config.d/10-misc.ini':
edit_lcmaps = True
authorization_method = vomsmap
It also said to comment out 'glexec_location', and I've commented out the existing 'authorization_method'. It says that a Unix account must be created for each VO, VO role, VO group, and user that I wish to support. I'm not sure if that means every single user in '/usr/share/osg/voms-mapfile-default' or not, because that's a bunch of users. I can probably ask OSG about that. It says the 'allowed_vos' parameter in '/etc/osg/config.d/30-gip.ini' should be populated with the supported VOs per subcluster (worker node hardware) or resourceEntry (set of subclusters) section. I'm not entirely sure what it means by that, but our 'allowed_vos' is empty and commented out. I'll also ask OSG about that.

cont. 02/03/2018 They think we may not have the OSG version of LCMAPS. To see which version we have, I ran `rpm -q lcmaps`, and it told me we're running version 'osg33', while the most updated version is 'osg34'. Ah ha! I'll see about fixing that up. I've run a `yumUp`. That didn't cut it; I may have to do other things. Brian also said that I may not have run 'osg-configure', and he's right, I haven't! I've run `osg-configure -v`, and it gave me some info. It said I'll have to specify either a list of VOs or a '*' for the 'allowed_vos' option. It also said that I need to fix the 'gram_ce_hosts' option in '/etc/osg/config.d/30-rsv.ini', since GRAM is no longer supported (the whole reason for this debacle in the first place). In '/etc/osg/config.d/30-gip.ini', I've set 'allowed_vos' to '*'. I'll probably also have to make user accounts for all the VOs in '/usr/share/osg/voms-mapfile-default'. In '/etc/osg/config.d/30-rsv.ini', I edited 'ce_hosts' to just include HTCondor-CE, and I've commented out the 'gram_ce_hosts' setting. `osg-configure -v` gives me a "No allowed_vos specified for section 'Subcluster FLTECH'" warning, and a VO specification warning, saying that either a list of VOs or '*' must be given. I thought I had already taken care of that by modifying 'allowed_vos' in '/etc/osg/config.d/30-gip.ini'. Huh. I'll just go ahead with the `osg-configure -c` and keep these warnings in mind. The configure reported no errors, just the above warnings.

cont. 02/05/2018 OSG also said they wanted an updated `osg-system-profiler`, so I've started that off.

cont. 02/16/2018 (RIP, sorry OSG) Since it's been so long, I've made a new `osg-system-profiler`.

cont. 02/17/2018 OSG says I've gotta make users for all of the entries in '/usr/share/osg/voms-mapfile-default', so Imma see about doing that. The new users have been created. I've run `osg-configure -c` again and got the following warnings:
WARNING No allowed_vos specified for section 'Subcluster FLTECH'.
WARNING In OSG 3.4, you will be required to specify either a list of VOs, or a '*' to use an autodetected VO list based on the user accounts available on your CE.
WARNING No allowed_vos specified for section 'Subcluster FLTECH'.
WARNING In OSG 3.4, you will be required to specify either a list of VOs, or a '*' to use an autodetected VO list based on the user accounts available on your CE.
WARNING Can't copy grid3-location file from /etc/osg/grid3-locations.txt to /cmssoft/cms/etc/grid3-locations.txt
CRLs exist, skipping fetch-crl invocation
The repetition of the first two warnings is most likely a result of `osg-configure -c` first running `osg-configure -v` and simply printing those warnings for both commands. The last warning, however, I have no explanation for.

cont. 02/20/2018 OSG said I forgot to set 'allowed_vos' to '*' under the '[Subcluster FLTECH]' section of '/etc/osg/config.d/30-gip.ini'; I had only done it in the '[SE FLTECH-SE]' section.

cont. 02/23/2018 Daniel said he fixed some condor stuff, [02/11/2018], so let's try to run some condor jobs and see what happens. I submitted a job from my account, and it was immediately held.

cont. 02/24/2018 Since so much has changed, I'm going to run through the Condor troubleshooting documentation again to see what it says.

04/06/2017 TAGS: CE cannot ssh unresponsive
Vallary emailed me saying that she couldn't ssh into the cluster, and neither could I! Upon arriving at the high bay I found the CE unresponsive; just the blue background was visible with the mouse. I power cycled the CE and it rebooted, but condor's not working. `condor_status` returns a communication error stating that it cannot connect to 163.118.42.1:9618. It stopped because /var is 100% full. /var/lib/globus is 3.3G and is full of strange condor files that were created yesterday and the day before. Some are several megabytes while some are empty. The files seem to contain entries for submitted jobs. I'm going to move all of the "condor.*" files to ~/globusCondorJunk and see if that breaks anything. I fully restarted condor, and all seems to be well. If it turns out that the "condor.*" files are indeed useless, then I'll delete them.

04/10/2017 TAGS: mass deletion of users
Users are being deleted in 24 hours. I made a file called ~/userdellist.txt that has all the info in it. The programs at the bottom will stay for now; some of them are important.

04/11/2017 TAGS: node validation failure tmp full
OSG sent us a ticket a while ago (my email wasn't in the list, Ankit told me about it) saying that CMS and OSG glideins were failing node validation upon startup (https://ticket.opensciencegrid.org/32896). The CMS glideins are failing due to being unable to locate CMS software, and the OSG glideins are failing due to a full '/tmp'.
CMS Failing Nodes: compute-1-1 compute-1-3 compute-1-6 compute-2-1 compute-2-4 compute-2-5 compute-2-6 compute-2-7 compute-2-8
OSG Failing Nodes: compute-2-5 compute-2-6 compute-2-7 compute-2-8
The OSG Failing Nodes do, in fact, have a completely full primary partition, where '/tmp' is located.

cont. 04/12/2017 The problem was that '/scratch' was all filled up because it was the cvmfs cache. I moved the cvmfs cache from '/scratch' to '/var/cache/cvmfs' on all the nodes via a script ('~/Scripts/mvCvmfsCache.sh'); a rough sketch of what it does is below, after the 04/14 note.

cont. 04/14/2017 The other problem was the CMS failing nodes. The listed nodes contain the script `/var/lib/condor/execute/dir_/glide_/discover_CMSSW.sh`. NOTE: navigate to '/var/lib/condor/execute' then run `find . -name "discover_CMSSW.sh"` to locate the script. It hangs upon execution. The script just looks for other scripts and executes them. If it doesn't find what it's looking for, it's supposed to say so. The script, however, doesn't seem to do anything. The discover script is only on some of the nodes listed, and it's not on any that are not listed.
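As promised above, here's roughly what the cvmfs cache move amounts to on each node (this is a sketch of the idea, not a copy of '~/Scripts/mvCvmfsCache.sh'; the old cache path under '/scratch' and the cvmfs user/group are assumptions):
# add (or edit, if it's already set) the cache location in default.local
echo "CVMFS_CACHE_BASE=/var/cache/cvmfs" >> /etc/cvmfs/default.local
mkdir -p /var/cache/cvmfs
chown cvmfs:cvmfs /var/cache/cvmfs
cvmfs_config reload        # pick up the new cache location (a `cvmfs_config setup` / autofs restart may also be needed)
rm -rf /scratch/cvmfs      # assumed old cache path; this is what actually frees up /scratch
Run it on every node, e.g. via a loop over the node names from the CE.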
04/13/2017 TAGS: home directory clean
Cleared out the home directory for root so it's usable.

04/14/2017 TAGS: condor not running diagnostics passwords required ssh
The diagnostics page reports that condor is not running on any of the nodes. All of a sudden, I need to enter passwords to ssh from root. Huh, that's strange. Turns out condor's fine, but the monitoring scripts need to ssh into the nodes, which they can't do now because ssh-ing requires passwords for some reason. Riley moved some of the ssh files around when he was reorganizing the home directory, so the CE's ssh keys have been slightly scrambled.

cont. 04/17/2017 Ankit said to investigate ROCKS; it made the ssh keys. The ROCKS documentation said that host-based authentication is controlled by '/etc/ssh/shosts.equiv'; the IPs of the cluster parts are all there. I created a brand new ~/.ssh directory and filled it with a public and private key generated with
$ rocks create keys ~/.ssh/id_rsa > ~/.ssh/id_rsa.pub
The new key was placed in NAS-1 with
$ cat ~/.ssh/id_rsa.pub | nas1 "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
The new key was confirmed placed where it should be, but a password was still requested. Silly me, I didn't check id_rsa.pub for errors, of which there was one. I need to type the command correctly:
$ rocks create keys key=~/.ssh/id_rsa > ~/.ssh/id_rsa.pub
The key was created, and it was correctly put onto NAS-1, but it still doesn't work. Instead of using the rocks command to make the keys, I used the normal `ssh-keygen -t rsa` command, then sent the keys over with the normal command. For installing the new key on all of the nodes, I'm installing `sshpass`, which will allow for the automation of logging into all of the nodes. I added to the osg-node.sh:
cat ~/.ssh/id_rsa.pub | sshpass -p "" ssh -o StrictHostKeyChecking=no compute-fed-nad "mkdir -p ~/.ssh && cat > ~/.ssh/authorized_keys"
(be sure to comment out the normal ssh line!) A cleaned-up sketch of the loop is below, after the squid note. That worked for compute-2-*, but the passwords for compute-1-* are different. I will have to change them to the normal password.

cont. 04/18/2017 To change the root passwords of the other nodes, they must be powercycled and booted into single user mode. After the password has been changed, run `init 5` to resume normal operations. If the node hangs after `init 5`, powercycle it again and allow it to boot normally. I've changed compute-1-0 to compute-1-3 so far.

cont. 04/19/2017 All of the nodes, the SE, NAS-1, and NAS-0 all have the new keys.

04/19/2017 TAGS: gratia accounting osg website GRACC change no job count
OSG updated their grid monitoring software from Gratia to GRACC (GRAtia Compatible Collector). GRACC is compatible with all existing Gratia probes. It shows that we are amassing wall hours, but there is no data for the job count.

04/24/2017 TAGS: squid not running
Squid wasn't running. I checked its status with `squid -k check` and it told me that it couldn't find the cache directory. That's because it was moved during Riley's spring cleaning. I changed the squid directories in '/etc/squid/customize.sh' from "ufs /root/squidAccessLogDump/cache 20000 16 256" to "ufs /root/Cluster_System_Files/squidAccessLogDump/cache 20000 16 256".

cont. 04/26/2017 'customize.sh' will hang, but it does, in fact, edit the file properly after some time. Squid is good again.
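The key-installation loop from 04/17/2017 boils down to something like this (the node list and the password are placeholders, and the real osg-node.sh wasn't copied here; I also use `>>` instead of `>` so existing keys don't get clobbered):
for node in compute-2-0 compute-2-1 compute-2-2 compute-2-3 compute-2-4 compute-2-5 compute-2-6 compute-2-7 compute-2-8; do
    # push the CE's public key into the node's authorized_keys
    cat ~/.ssh/id_rsa.pub | sshpass -p "NODE_ROOT_PASSWORD" \
        ssh -o StrictHostKeyChecking=no root@$node \
        "mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys"
done
Once the compute-1-* passwords are all set to the same thing, the same loop works for them too.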
04/24/2017 TAGS: NAS0 diagnostics page
The NAS0 diagnostics page had been missing the top table for a while because a newline was missing at the end of /etc/cron.d/nas0chk. The newline was added, so it works now.

04/25/2017 TAGS: NAS1 yum update rpmforge gpg keys
NAS-1 has some trouble yum updating due to non-existent rpmforge gpg keys. I had some trouble finding the keys, and I had to install a security update, so I just turned off the check for the keys by editing '/etc/yum.repos.d/rpmforge.repo'. I've turned the check back on for now.

05/11/2017 TAGS: condor idle ce / bloated
The diagnostics page says that condor is idle on the CE and '/' is bloated with 'core.*' files. Clearly some shenanigans occurred when I updated OSG. I fully restarted condor, but when I tried to run `condor_status`, it said there was a communication error. After waiting a minute, it gives the regular list, but says everything is unclaimed. These "core" files seem to be generated whenever a job crashes. None of the configuration files in '/etc/condor' or '/etc/condor-ce' seem to have been modified by the update, although the directories have been touched. Perhaps files were deleted? Since OSG was updated, '/var/log/condor/MasterLog' reports that condor is unable to create a security session to the CE on port 9618 with TCP. Port 9618 is only listed in the log when it fails; its successful connections are never listed. That port number is listed in '/etc/condor-ce/config.d/50-osg-configure.conf' as the connection port for the 'JOB_ROUTER_SCHEDD2_POOL' variable. That file is said to be managed by 'osg-configure'.

05/15/2017 TAGS: NAS-1 almost full
NAS-1 is almost full, and Vallary needs to put stuff on it! I need to investigate directories: g4hep, backup_g4hep, general_g4hep. 'g4hep/MTSAtFIT' is a primary offender (14TB of the directory's 15TB); there are some large files in there. NOTE: `tree -ifhugD path/to/directory` is a very useful command for mapping the directory structure. I've made trees at '/mnt/nas1/g4hep/MTSAtFIT/tree.txt', '/mnt/nas1/backup_g4hep/tree.txt', and '/mnt/nas1/general_g4hep/treeTrim.txt'.

cont. 05/16/2017 Dr. Hohlmann has said I can safely delete anything with 'alignment' and 'empty' in their names. To see how much space will be freed from one of the three sections:
$ grep -iE 'alignment|empty' tree.txt | awk -F' ' '{print $3}' | grep G | sed 's/G//g' | paste -sd+ | bc

05/18/2017 TAGS: add group
I'm creating a new user group for Vallary and me: Analysis.

06/09/2017 TAGS: glideins down globus error
At the beginning of June, OSG said that our glideins were failing due to a globus error. When Daniel was helping me with Condor, we tried swapping my certificates out for his in '~/.globus', which probably caused the errors. I have replaced his cert with my CERN cert. I've updated OSG.

cont. 06/12/2017 Elizabeth said to copy 'hostcert.pem' and 'hostkey.pem' from '/etc/grid-security' to '~/.globus'. I have done that, and I've restarted GUMS. She's been updated.

cont. 06/15/2017 I misunderstood Elizabeth; she was just making sure the hostcerts weren't expired or otherwise wonked.

06/13/2017 TAGS: batteries UPS check compute-1-7 not working
The warning light on the APC UPS had been quickly turning red after the routine tests, so I whipped out the batteries and took a multimeter to them. The batteries are rated at 12V, and the multimeter measured just over 13V for each of them, so I put them back. When turning the cluster back on, though, compute-1-7 is having trouble mounting the NFS filesystems.
The little ethernet LEDs on the node are off, and the ethernet LED for the node on the router (port 15) is red. Once the node had booted up, the LEDs didn't change. It doesn't seem to have internet, either, which is to be expected.

cont. 06/14/2017 I found a manual for the 'HP ProCurve 2910al-24G' router. The blinking orange 'Fault' light means that "A fault has occurred on the switch, one of the switch ports, module in the rear of the switch, or the fan. The Status LED for the component with the fault will flash simultaneously." In this case, the LED for port 15 is blinking in synchronization with the 'Fault' LED. The 'Test' LED is also blinking along with the others, and it means that one of the components has failed its self-test, so port 15 failed its test. The manual recommends power-cycling the router, so I'll do that tomorrow morning.

cont. 06/15/2017 I turned the cluster off and power cycled the router (unplugged it), and it displayed no warning lights, so I turned the cluster back on, and all is well!

06/27/2017 TAGS: security update
Security update day! I'm yum updating everything and rebooting the cluster. Everything booted up properly!

07/11/2017 TAGS: nas0 drive not-present
Drive 15 in NAS-0 suddenly became labeled as "NOT-PRESENT". I removed the drive and put it back in, and the drive is now rebuilding.

cont. 07/12/2017 Drive 15 has returned to the "NOT-PRESENT" state again, so I'm gonna try replacing the drive. I've replaced the drive, and it's rebuilding.

cont. 07/13/2017 The new drive has experienced a SMART failure, so I'm gonna replace it with the other spare drive. I've started the rebuild.

cont. 07/14/2017 The new drive rebuilt successfully, and everything is good.

08/22/2017 TAGS: NAS-0 not working crash on boot
Everything is not good. During break, a catastrophic hardware calamity had befallen NAS-0. Two drives are dead, and the BBU (Battery Backup Unit) on the RAID card has failed. NAS-0 kernel panics on boot, a reported symptom of a failed BBU. The card itself seems fine, however, because its settings can be accessed during boot. New drives and a battery have been ordered. Another scary symptom of NAS-0's inoperability is the hanging of `df`.

cont. 08/23/2017 I searched the settings of the controller's BIOS for options to boot without the BBU. I found something that would ignore the RAID controller on boot, but then boot failed due to not finding an operating system, which is probably stored in the RAID. It might be a good idea to have the boot disk separate from the RAID in the future.

cont. 08/25/2017 While we wait for the new battery to arrive, I replaced the two failed drives and started the rebuild process from the controller's BIOS.

cont. 09/15/2017 The battery is here! We've installed it and are ready to turn NAS-0 on! But first, I'm shutting the entire cluster down so I can bring everything up in the proper order. Turns out the battery needs to charge first, so I'm gonna have to wait until Monday to do anything.

cont. 09/18/2017 NAS-0 still kernel panics on boot. *sigh* I tried booting from the CentOS 6.5 disc, but no dice; it looked like it booted, but it hung on a black screen with a mouse pointer. I also tried booting from the Rocks 5 disc, but when it couldn't find an IP address it wanted, it restarted and began the loop again. I started playing with GRUB; let's see where that goes.

cont. 09/19/2017 I tried the Rocks CD again (this time we have internet!), and it advanced to the next step! It's looking for a Rocks image and can't find one.
I'd assume that the image would be on the Rocks CD in the drive, but I guess not. None of the hard drives have an image hidden in them either, it seems. Although, Rocks was unable to retrieve a file from somewhere on NAS-0, so maybe that had something to do with it. I found some Rocks 6.1.1 Jumbo DVDs, and I threw one into NAS-0. It has a rescue mode that I've entered. Welp, when I turned NAS-0 on to play with the Jumbo DVD, drive 8 decided to disappear. When I restarted, drive 15 also disappeared. So now drives 8 and 15 are gone with drive 10 still in "rebuild" status. Also, when I try to choose the "Installation Method" for Rocks, it rejects the Rocks DVD already in the slot. It says the installation material isn't present on it. Which disc contains the proper info, then? Drive 15 suddenly reappeared! That's nice.

cont. 09/22/2017 I had replaced both drives 8 and 15 (which disappeared again after replacing drive 8), but it wouldn't let me add the new drives to the RAID group. Perhaps because it was already labeled as "REBUILDING". After the replacements had been made, I exited the controller BIOS to start booting. There was a CentOS 6.5 boot DVD in NAS-0. It didn't hang on a black screen this time; it booted into the live CD properly! I have some bad news: NAS-0 is dead. The 3ware BIOS manager (the RAID card's BIOS) reports the RAID array as "unusable". The 3ware documentation says that an "unusable" array is totally dead; it's suffered too many failures to be brought back. I'm asking Blueshark (Daniel Campos) to take a look at it anyway, though, in case there's some crazy nonsense we can do to resurrect it. Today is a dark day for the cluster. Daniel Campos said that our last hope is to try to image the broken disks and put their information on the good disks, then throw them back into the RAID.

cont. 09/25/2017 I tested the drives. The three that had any data on them are physically busted; they click and are not recognized by the computer at all. The data is lost. NAS-0 is no longer with us.

08/22/2017 TAGS: mount NAS-1 remotely on separate machine
Since no one can log onto the cluster with NAS-0 dead, we need to mount NAS-1 remotely to access it. First, the IP of the machine must be added to '/etc/exports' on NAS-1, then the changes must be saved with `exportfs -ra`. To mount it on a Mac:
$ sudo mount -o resvport 163.118.42.3:/nas1 /location/on/local/machine/

08/23/2017 TAGS: /var full
'/var' is full again. '/var/log/tomcat6/gums-service-cybersecurity.log*' were taking up 100M per file (of which there were five), and they only contained the same java error message repeated several times. I have removed the five old files and kept the latest log. '/var/log/maillog' (1.8G) is full of messages reporting that mail sent to NAS-0 has bounced; I've cleared the log.

08/25/2017 TAGS: nas1 NAS-1 failed drive replace
A drive failed on NAS-1 and we're gonna replace it. To view NAS-1's RAID, run `storcli /c0 show`. To remove the drive with storcli:
$ storcli /c0/e<EnclosureID>/s<SlotID> set offline
(*) the left-most column of `storcli /c0 show` lists the drive names in 'enclosureID:slotID' format
$ storcli /c0/e<EnclosureID>/s<SlotID> set missing
$ storcli /c0/e<EnclosureID>/s<SlotID> spindown
(*) spins down the drive and makes it safe for removal
The drive can now be safely removed. Once the new drive is in place, it should automatically start rebuilding. If the drive's status doesn't change to "Rbld", the rebuild can be manually started with `storcli /c0/e<EnclosureID>/s<SlotID> start rebuild` (and its progress checked with `show rebuild`).

08/28/2017 TAGS: nodes acting funny
The second group of nodes (2-0, ...) is acting kinda strange.
When I logged on, I saw the splash text that usually appears after the nodes are turned back on from a restart, and the diagnostics page shows that they have NAS-0 mounted and a 0 load average, while the other 10 nodes have super high load averages (~5000).

cont. 08/29/2017 Time to exorcise the nodes! The script that gathers data from the nodes is '/usr/local/bin/cn.sh', and it writes to '~/diagnostics/cn.json'. The script checks for a mounted file system by running `df -h /filesystem/mount/point/` and seeing if anything is returned. On the '1-' nodes, `df` just hangs like on the rest of the cluster. On the '2-' nodes, however, it returns the line with the mount point '/'. While that's not NAS-0, it's something, so the website reports a success. The load average is found with `cat /proc/loadavg`. That's not explaining why the load is so high, however. The load average is high because the diagnostic script runs `df`, which hangs on the '1-' nodes; several instances of a hung-up process are trying to run simultaneously. I've restarted the nodes, which will fix the problem; `df` will work fine. The '1-' nodes aren't ssh-able. I'll have to investigate that later. The '1-' nodes all tried to mount NAS-0 on boot, and they all failed to complete booting because they thought NAS-0 was a busy device. I'm gonna powercycle them to see if that'll work. They're good, now. Now all of the nodes have a low load average, and they all falsely report NAS-0 to be mounted.

09/01/2017 TAGS: NAS-1 RAID card
Today some strange nonsense happened. NAS-1 was telling me that its RAID card had suffered some catastrophic failure and was no longer operable. I powercycled NAS-1 because everything on NAS-1 hung. On boot, the RAID card would beep, and nothing would appear on the monitor. Everything on the CE also hung. Scary. I turned the whole cluster off and tested the APC UPS, which yelled at me, so I manually checked all of its batteries. After all of the batteries had passed inspection, I put them back in and turned everything back on. Everything, except NAS-0 of course, booted up just fine. I have no idea what caused the issue in the first place.

09/05/2017 TAGS: new hostcert
OSG emailed me saying that my hostcert is about to expire. The new hostcert and hostkey are obtained.

09/05/2017 TAGS: CE hung
The CE decided to hang; nothing could be performed on it. I restarted the cluster, and it's good, now.

09/14/2017 TAGS: UPS no power not turning on
When we plugged everything back in after the hurricane, the top Tripplite SmartPro UPS refused to accept power. No lights turned on indicating that it sees any kind of power at all. I tried plugging it into different outlets, but the bottom UPS accepted the outlets just fine. The model number of the Tripplite UPSs is "SMART5000RT3U".

cont. 09/15/2017 The power button of the busted UPS feels kinda wonky. It feels like there's not even a button behind the flexible plastic button cover; the plastic just gives with hardly any resistance, unlike the bottom UPS, which has a more solid-feeling button press. However, the button could just feel strange because it's not getting any power; the other button (the alarm button) won't even depress at all. I ripped the UPS's face off to investigate the buttons on the circuit board; they're both fine.

cont. 09/20/2017 I called Tripplite for assistance, and he told me to check the batteries. Just what I feared he'd say! Well, let's get them out of the rack and see what's up. The batteries are all destroyed.
They are all swollen, and there's corrosion everywhere. It's a repeat of 2 years ago! (Fun Fact: We replaced the batteries on 09/21/2015, almost EXACTLY 2 years ago!)

09/26/2017 TAGS: NAS-1 diagnostics strange
The RAID monitoring for NAS-1 on the diagnostics page is a bit wonked out. The script is having trouble when it tries to ssh into NAS-1; some of the drive entries show '/root/.bashrc' errors. Oh, when I tried to install root on NAS-1 earlier, I put some nonsense in its '.bashrc' that spits out errors whenever it's run. The scripts write down whatever was written to standard output, which, in this case, includes error messages for the first two lines. So the website is reading the first two error messages and displaying them. Whoops! Let's fix NAS-1's '.bashrc'. I commented out the broken root line; it's all good, now.

09/26/2017 TAGS: squid not running
The diagnostics page says that 'squid' isn't running. I tried to start it with `service frontier-squid start`, but it complained that '/home/squid' didn't exist. RIP; I guess it's dead until we can resurrect NAS-0.

09/26/2017 TAGS: NAS-0 redo
Welp, NAS-0's dead. But now we have an opportunity to redo its RAID configuration! What shall it be? I really wanted to do ZFS, because it's the best, but it's slowly turning out to not be viable. The hardware may not cooperate nicely with it, and we may need new hardware to connect all of the drives together in the absence of a RAID card. So, I think we're gonna have to stick to the card we've got. Unfortunately, since the card doesn't support RAID-60, we're gonna have to come up with a more creative solution (I wanna see if there are better options than just straight RAID-6).

09/27/2017 TAGS: rack rearrangement
Today, we're taking out the bottom Tripplite UPS to examine its batteries. We're also gonna take the UPSs completely out, put NAS-1 and the SE where the UPSs were, then put the UPSs, spread out, on the left rack.

cont. 10/04/2017 Alright, everything's done. The rearrangement went wonderfully. I even rewired everything! I'm going to make a document showing where I plugged everything in. The batteries also came in, and we installed those. They're charging themselves up and they're working great!

10/16/2017 TAGS: SE no ethernet
All the ethernet ports have their red lights on, so Imma restart everything to see if that does anything. I restarted everything, but the red persists. Huh.

cont. 10/17/2017 Well, we need ethernet to add NAS-0 back to the cluster, so this has got to be fixed. The four weirded-out parts (CE, SE, NAS-1, NAS-0) are all plugged into a group of four dual-personality ports. Maybe the dual-personality ports have the wrong personality? I tried plugging one of the devices into an adjacent, regular ethernet port on the router, but the light is still red. Although, NAS-0's light has mysteriously decided to turn green.

cont. 10/18/2017 Well, I've discovered some things today. It's looking like I'm gonna have to interface with the router's console to see what's up. To do that, though, I need the console cable, which is Ethernet-Serial (RJ-45 to DB-9 (female)). Of course, we don't have that cable, and I found supplies to maybe make one, but that for sure won't work, so I'm probably just gonna have to buy one. *sigh* more waiting...

cont. 10/23/2017 The cable came in early! Imma hook the router up to the CE and see if it'll work. Gotta get that VT-100 emulator up and running first, though. I got the emulator 'minicom'.
cont. 10/24/2017 minicom must have the following configuration:
A-Serial Device: /dev/ttyS0
B-Lockfile Location: /var/lock
C-Callin Program:
D-Callout Program:
E-Bps/Par/Bits: 9600 8N1
F-Hardware Flow Control: No
G-Software Flow Control: No

cont. 10/25/2017 (Yo, the output from the router looks really cool because you can see it written to the screen since it's serial!) Nothing works, though. The switch has been configured with the following important properties:
Default Gateway: 172.16.42.126 (what was already there)
Time Sync Method: SNTP (what was already there)
SNTP Mode: Unicast (what was already there)
Poll Interval: 720 (default)
Server Address: 163.118.171.4 (what was already there)
I have been experimenting with the 'IP Config' settings. Right now, it's set to:
IP Address: 163.118.42.126
Subnet Mask: 255.255.255.128
I've also tried setting it to 'disable', but to no avail.

cont. 10/30/2017 Summary thus far: The high GB/s connections are working fine; the CE, SE, and NAS-1 have internet no problem. The switch shows no error lights on itself, but the ethernet ports of all connected machines display a red LED indicating that the connection is dead. I've adjusted the dimensions of the console window: length: 64, width: 78. `show interfaces brief` displays the statuses of the ports, and it says nothing's wrong. `show interfaces display` reports that there is data running through all of the ports, almost 100M for each of the ethernet ports and between 1.5G and 2G for the high-speed ports, which are operational.

cont. 11/02/2017 Daniel Campos came by and took a look at the switch. He did a bunch of fancy stuff, and it turns out that it matters which ethernet port on the computers is used, and I used the wrong one. *siiiigh* I threw everything in the proper port, but I can't test it now because class. Hopefully it's good now!

cont. 11/06/2017 Ethernet's golden! Now we can play with NAS-0.

10/17/2017 TAGS: creating NAS0 NAS-0 RAID
The time has come to finally reconstruct NAS-0's RAID! We've opted to use RAID-10, which is a staggering improvement in security over the previous configuration (RAID-6), although we're taking a considerable hit to available space; only half of the drives' 12TB is usable. I have included all 16 drives in the array and configured it to heavily favor protection rather than performance. Ok, I'm super sketched out by this RAID card. It won't let me configure how I want RAID-10 done. I would like to make it into 2 groups of 8 drives each, so that the tolerance is a minimum of 4 drives (4 drives all from the same group). Unfortunately, this RAID card is lame af, so it automatically puts the drives into RAID-1 pairs that are all striped together. This only allows for a minimum tolerance of 1 drive; if both drives in a RAID-1 pair fail, the array dies. While this is among the lamest things I've seen, in 14/15 cases it's at least as safe as RAID-60 when 2 drives fail (once one drive is dead, the array only dies if the second failure happens to hit that drive's mirror, which is 1 of the 15 remaining drives), and it's infinitely safer when 3 fail. For that reason, I'm gonna stick with RAID-10 over doing RAID-6 again.

cont. 10/18/2017 Maybe ZFS is a viable option! When searching for a cable, I found a massive cache of RAM in the supply closet. There are several sticks of 2GB, 4GB, and 8GB. While we're waiting for the router console cable, I could play with ZFS on NAS-0, which could be interesting.

cont. 10/20/2017 NAS-0's motherboard is a Supermicro X7DB8. It can support up to 32GB of 667/533MHz DDR2 RAM in sizes of 512MB, 1GB, 2GB, and 4GB. We wouldn't be able to use all of the RAM, but a good bit of it is still available.
Another problem, though, is much more concerning. How will the drives be directly connected to the motherboard without a RAID card? I doubt there are enough slots on the board, so a SATA hub may be necessary.

cont. 11/07/2017 Since this card is actual trash (the only RAID-10 option is literally the worst possible configuration of RAID-10 (it only supports RAID-1 pairs connected in RAID-0)), we're gonna try to use it as a SATA hub for the drives to be run in ZFS. Can the card be configured to run the disks in JBOD? A'ight, so here's the thing. I need to dedicate at least one drive to house the OS, and I'd like that drive to be backed up; we're left with 14 drives, which is still plenty. There are a few good ZFS options we can do:
1) 2 striped RAIDZ2 vdevs (RAID60 with 2 groups of 7) - min: 2, max: 4, 7.5TB
2) 2 striped RAIDZ2 vdevs with 2 hot spares - min: 2 + ~2, max: 4 + ~2, 6TB; immediate replacement of 2 failures in quick succession (effectively 2 base tolerance with 2 extra tolerance per group)
3) 2 striped RAIDZ3 vdevs - min: 3, max: 6, 6TB
Imma try out option 2, just to see if it'll work out. First, I need to make a RAID10 array with 2 drives; this'll be the OS drive. With the small array made, I threw the ROCKS disc in, and it did some things. I formatted the array as ext4, and it installed a bunch of stuff. I whipped the disc out, restarted it, and it booted into CentOS! Unfortunately, though, it's asking for a password that doesn't exist. That's fine, though, because I can ssh into it just fine (nice!). It's yelling at me because the RSA keys are all messed up, but that's fine, I'll fix it later. NAS-0 has an OS again! Now the task is to make the other drives visible to the OS.

cont. 11/13/2017 A'ight, let's get ZFS installed on NAS-0!
INCORRECT MISSTEPS: First we must install some dependencies:
$ yum install kernel-devel zlib-devel libuuid-devel libblkid-devel libselinux-devel parted lsscsi
Actually, nevermind, the link this guide provides doesn't work; let's try a new one. Here are the dependencies for this guide:
$ yum install dkms gcc make kernel-devel perl
Everything was preinstalled except 'dkms' (Dynamic Kernel Module Support: without it, kernel updates could break software), which is a part of the RPMForge repository. Since NAS-0 is 64-bit, to install RPMForge: Nevermind, turns out RPMForge (aka RepoForge) is now deprecated, and big letters on the CentOS Wiki say to not use it. So forget that, Imma install EPEL:
$ wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
$ rpm -ivh epel-release-6-8.noarch.rpm
Except `yum repolist` shows no sign of EPEL. *sigh* Turns out the repo's gotta be turned on. 'enabled' in '/etc/yum.repos.d/epel.repo' needs to be set equal to '1' rather than '0'. Now EPEL shows up in `yum repolist`. Nice! Now dkms can be installed:
$ yum install dkms
The next instruction calls for installing 'spl' and 'zfs':
$ yum install spl zfs
Unfortunately, neither of these packages can be found.
CORRECT METHOD: Fortunately, ZFS can be installed a different way. First, the ZFS repo must be installed:
$ yum install http://download.zfsonlinux.org/epel/zfs-release.el6.noarch.rpm
Then, ZFS itself must be installed:
$ yum install kernel-devel zfs
ZFS is now installed! Hooray! Now we've gotta get those drives visible. An important thing we gotta do is get 'tw_cli' installed, the RAID monitoring software.
First, the ASL repo must be installed:
$ wget http://updates.aslab.com/asl/el/6/x86_64/asl-el-release-6-3.noarch.rpm
$ rpm -Uvh asl-el-release-6-3.noarch.rpm
Then the software needs to be installed:
$ yum install 3ware-3dm*
Now NAS-0 needs to be restarted. 'tw_cli' is installed and works great! I can see the unconfigured drives in 'tw_cli'; hopefully I can work with them. Looks like if I put all the other disks in their own separate units (putting them all in single-disk mode), they'll be visible to the OS. Let's try it! I can see all the drives! Now we can get ZFS up and running!

cont. 11/14/2017 I tried making the zpool, but it didn't like the 1TB replacement drive we threw in there, so I'm just gonna replace it with a normal 750GB. When I tried to remove the drive with 'tw_cli', though, it couldn't. That's because I was trying to remove the only drive in its unit, which it isn't happy with. I'm gonna have to delete the unit and remake it with the new drive. The zpool with option 2 was made:
$ zpool create nas0 raidz2 sdb sdc sdd sde sdf sdg raidz2 sdh sdi sdj sdk sdl sdm spare sdn sdo
Unfortunately, though, it only has 5.2TB of space, which is a bit less than the already expected low amount of 6TB. Imma try option 1, the most spacious one. It wouldn't let me destroy the zpool; it said it was busy. Even after unmounting it, it still complained, so I restarted NAS-0. It's still busy. I'm gonna try to see what's holding it open with `lsof | grep deleted`. Nothing is printed. `lsof` didn't list anything with "nas0", but there are a few processes related to "zfs". `zpool iostat` revealed that there is some IO going on in 'nas0' (also that there are 8.1TB free; suspicious, it's probably got something to do with parity and other ZFS data). Later, I'll try killing all of the ZFS processes.

cont. 11/20/2017 I just ran `zpool destroy nas0` and it seemed to have worked just fine. Huh, well, problem solved, I guess. I'm gonna try to make Option 1 and see how much space that one actually gives us. It only gave us 6.6T of the expected 7.5T. I reported my findings at the meeting, and we've opted to go for Option 2, the RAID-60EE equivalent.

cont. 11/27/2017 Let's make Option 2 and start the copy of the '/home' backup. '/nas0' is busy, so I'm gonna comment out 'nas0' in '/etc/mtab' so that it won't be mounted on restart. After much fandangling, turns out the best course of action is to just restart NAS-0, then `zfs unmount nas0` and `zpool destroy nas0` as quickly as possible, before any crazy processes can start acting on it. Now, I've gotta mount NAS-0 onto the CE so that data from NAS-1 can be sent over.

cont. 11/29/2017 Even though '/etc/fstab' contains an entry for NAS-0, 'mount' doesn't see '/nas0' available. There is a 'sharenfs' property on ZFS that allows ZFS volumes to be shared via NFS; it's set on /nas0. NFS is already good to go on NAS-0, but we've gotta add '/nas0' to '/etc/exports' so that NAS-0 knows to allow the CE to mount '/nas0'. I've added the following line to '/etc/exports':
/nas0 163.118.42.1(rw,sync,no_root_squash)
/nas0: the filesystem to be mounted
163.118.42.1: the high-speed ethernet connection on the CE
rw: allow read/write
sync: server confirms client requests only when the changes have been committed (safety)
no_root_squash: allows root to mount the filesystem
By default there was an entry in '/etc/exports' called '/export/data1'. It caused some problems, so I commented it out. I then ran `exportfs -ra`.
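On the CE side it's just a regular NFS mount; assuming an '/etc/fstab' entry along these lines (the exact options in our fstab weren't written down here, so treat these as placeholders):
nas-0-0.local:/nas0   /mnt/nas0   nfs   defaults   0 0
a plain `mount /mnt/nas0` should then pick it up.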
When I try a `mount /mnt/nas0` on the CE, I get the following error:
mount.nfs: access denied by server while mounting nas-0-0.local:/nas0
The error was because it doesn't like the IP for the CE I gave it; it prefers the LAN IP (10.1.1.1). '/nas0' is mounted fine, now. Now the data transfer can begin! I used the command:
$ rsync -av --append /mnt/nas1/nas0-bak-20160304/home/ /mnt/nas0/home/
I omitted the 'nohup' because it was giving me problems, and I wanted to manually monitor the progress (it took a couple days).

cont. 12/01/2017 Data transfer complete!
Good news: all of the data transferred over just fine.
Bad news: none of the file permissions were saved; I'm gonna have to fix that.
The permissions can be fixed by following the instructions from [10/31/2015]. The home directories also need to be mounted on '/home' rather than '/mnt/nas0/home'. So let's fix that mount point. Oh wait, hold on. Some of the home directories (mine, Ankit's, and a couple others) are already mounted on '/home' from '/mnt/nas0/home'. Looks like we're good! I'm able to log in remotely with an ssh key again! Hooray!!!

10/31/17 Riley TAGS: NAS-0, NAS-0 RAID 10, Batteries, NAS-0 RAID card model, ZFS info
There isn't any literature I can find in the admin log about doing a battery test. I'll look on the twiki, but as for now the project is at a standstill. For some reason the glorious Google (TM) only gives me things online about Microsoft (TM) clusters and UPS systems, so finding something won't be as easy as I initially thought. As for today, I'm ripping out NAS-0 and looking inside. I need to know the model of the RAID card for research, and how many ports it has. This info will be recorded here. I am seeing if it can be used as a hub for ZFS, and if it can, I'm planning on putting a bunch of RAM in it. For glory. Happy Halloween, my cluster friends.

10/31/17 cont. Found the things for the UPS. All the info we have as of right now is the location of the UPS documentation on the cluster: /etc/ups. Ryan has a couple of things from 2 years ago, but there isn't any existing code to check the batteries. I'm going to start working on a code to check the batteries. Moving on to the RAID card, the model is AMCC 9650SE-12ML. It currently goes for $430 on the market, even though it's some dated tech, which leads me to believe that if any RAID card from that era could be used as a hub, this is it. The only problem is everything online says it's possible to use a RAID card as the hub, but no one says how, because they unanimously say it's a terrible decision.

11/2/17 Riley TAGS: NAS-0, RAM, RAID card, Battery test
In order to use the NAS-0 RAID card as a hub for ZFS, we need a metric tonne of RAM. Luckily, the motherboard can support 16 RAM sticks, and the admin log does say that it can handle up to 4GB sticks of DDR2. The only problem is that the RAM in the motherboard isn't plain DDR, it's FB-DIMM. More research is needed to find out if there are any potential compatibility problems. Daniel Campos gave me some amazing resources for running APC diagnostics tests. I'm going to try and make the APC as schnazzie as possible. Hopefully the Tripplite battery tests won't be too much more difficult. The battery info can be found at /etc/ups. BATTERY LOCATION: /etc/ups

11/2/17 cont. TAGS: RAM, NAS-0
It seems that the RAM is an implied DDR2, even though it doesn't say anything about DDR on it. UPDATE: We (with the help of Daniel Campos) found a decent way to solve our issues.
NONE of the RAM fit into the motherboard, which is fine because we don't need it anymore. Daniel suggested we use JBOD to host ZFS, and it doesn't really need a lot of RAM.

11/27/2017 TAGS: CE hang
The CE hung again today, so I powercycled it, and now it's fixed. It took FOREVER to turn on, though. There were some mad NFS timeout times, so I'm gonna try to reduce that. I changed the timeouts in '/etc/auto.master' from 1200 to 500. Hopefully that'll fix the problem.

12/04/2017 TAGS: nas0 dashboard diagnostics page
The RAID health check for NAS-0 is all kinds of messed up because NAS-0 has crazy splash text on login. Let's fix it! It said that line 29 in '/etc/ssh/ssh_known_hosts' on the CE was the offending line. That's the line for the old NAS-0; it was trying, and failing, to match the new NAS-0's key with the old key the CE had. I just deleted that line, and it put the new key on the CE. All is now well!

12/04/2017 TAGS: NAS-0 no root login
Ankit recommended we disable root login for NAS-0, which is probably not a bad idea. I created a user "fakeroot" and put `su -` in its '.bashrc', so that the root password must be entered to gain access to NAS-0. I copied over the CE's ssh key, but it still didn't work. I changed the permissions for '~/.ssh' and '~/.ssh/authorized_keys' in 'fakeroot''s home directory on NAS-0, and I ran `restorecon -Rv ~/.ssh`, which resets the SELinux configuration to default. It works fine! I can log in to NAS-0 from the CE with RSA. I've also added 'fakeroot' to the sudoers group on NAS-0:
$ usermod -aG wheel fakeroot
For changes to take effect, log out and back in. I disabled ssh login for root on NAS-0 by setting 'PermitRootLogin' to 'no' in '/etc/ssh/sshd_config'. I made the root password required for any 'sudo' activity by adding 'Defaults rootpw' to '/etc/sudoers'.

12/19/2017 TAGS: NAS0 ZFS
I tried to work on the cluster remotely, only to find that my certificate wasn't working. Uh oh. Turns out ZFS didn't start up correctly on NAS-0, so '/nas0' wasn't mounted. I logged in as 'root' and tried a `zfs list`, but it just told me that no datasets were found. Maaaaaan. I'm gonna try unmounting NAS-0 from the CE, then restarting the thing. No dice. Imma try an update and restart. No dice x2. `zpool import` gave me data on the pool and told me a drive failed. The error message gave me this URL: http://zfsonlinux.org/msg/ZFS-8000-4J/ Turns out, since 'nas0' is an exported pool, it needs to be imported, which failed because it was degraded. It can still be manually imported, however, so that it can be worked on. *sigh* Turns out the issue is that THREE drives decided to fail IMMEDIATELY after I left. *sigh* Man, c'mon now. There's gotta be a reason why all this nonsense always happens. Why do the drives in NAS-0 fail so often? NAS-0's super important. Maybe it's just 'cause all the drives are super old. I mean, it is a bunch of 750GB drives, which is an outdated size anyway. That's probably it; they're just super old. I guess even the "new" drives we get would be old even if they've never been used. I don't even know how to fix that, though, short of replacing all the drives, but that's super expensive. *sigh* Who knows, man? Who knows? I haven't decided if I'm gonna run down there to replace the drives or not. Since it's still operational, and nothing new's been put on it, I'll probably just leave it.

01/04/2018 TAGS: Intel security
Intel done messed up their processors, and they are vulnerable.
I'm doing a 'yumUp' on the CE now, and will update the nodes when they're operational.

01/08/2018 TAGS: UPS beeping red
I've returned from Christmas break, and the APC UPS is beeping at me. It had been beeping a bit more often than usual before I left, so Imma take the batteries out and test them. EMT ended early, so now I have a full hour to play with batteries! Let's start by shutting everything down. A'ight, so the batteries are mostly fine, but one in the left tray is reading 11V instead of the regular 13V and the required 12V.

01/08/2018 TAGS: NAS0 nas0 drives failed
I've also got those three drives in NAS-0 down; one in a pool and both spares. How do I figure out which hard drives failed so that I replace the right ones? There aren't any helpful red lights. `zpool status -x` gave me the statuses of all the drives. It also told me that the failed drive in the pool was '/dev/sdb1'. The following command can be run to find the slot of the 'sdb' drive:
$ udevadm info --query=all --path=/block/sdb
In the 'DEVPATH' line of the output, we're looking for 'target0:0:2', which indicates that the drive is in the second slot. (sda is 0 and sdb jumps to 2 because sda is made up of two drives; it's the mirrored OS array managed by the RAID card.) To replace the drive, it must first be taken offline:
$ zpool offline nas0 <drive>
Now that the drive is offline, I'm gonna try to remove the drive in slot 2. With the drive removed, the status of the drive is still reported to be 'offline'. Now, I'm gonna insert the new drive. The new drive must now be brought online:
$ sudo zpool online nas0 15433276318644629044
(this step may be able to be skipped because it gave me a warning that said 'zpool replace' should just be used instead) I tried to use `zpool replace nas0 /dev/sdb`, and it told me that no such thing existed. Since it said that the failed drive used to be '/dev/sdb1', I tried using that. It told me that '/dev/sdb1' is already a part of 'nas0'. And it says it's FAULTED like before. Hmm... What's goin' on here? I even tried unmounting the whole pool with `zpool export nas0`, but it couldn't because the device is busy. I'm gonna try a full restart, then. Which works for me, since I have to check the APC batteries anyway. Unfortunately, I don't have enough time for that right now, so I'll have to try later.

cont. 01/09/2018 I'm messing around with it some more, and I'm gonna try to throw a different drive in to see whether my replacement drive was also bad. Interestingly enough, the zpool doesn't show that the drive has been removed, only that it continues to be "offline". "Offline" probably just includes "ejected". Hmm, I wonder what happens if I try to bring the "drive" back online. It told me that the drive was onlined, but remained in a "faulted state". Additionally, `zpool status -x` now hangs. Interesting! A'ight, so I took the phantom drive back offline, and I was able to run a `zpool status -x`. What does this thing think it's doing? It's resilvering two drives in the other RAIDZ2 pool for some reason. Por que? Well, now I'm scared to interrupt it, so Imma just let it sit for a bit and sort itself out.

cont. 01/10/2018 Daniel Campos came by and taught me some ZFS and general drive things. SCSI commands and numbers are useful. `dmesg` will tell me the name of a newly plugged-in drive, which is real nice. Also, I have to use the version of 'zpool replace' that takes two drives.
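In other words, something like this (the device names here are just examples, not what our pool actually reported):
$ zpool replace nas0 15433276318644629044 /dev/sdb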
The name of the old drive is just the long number that `zpool status -x` provides, and the name of the new drive is the name gotten from `dmesg` or the other commands. The directory '/dev/disk/by-path' is very interesting. It shows the physical locations of all of the drives. When I remove the drive from slot 2 again, though, the entry in '/dev/disk/by-path' doesn't change; it still thinks there's something in slot 2. It's also not counting all 16 drives when all are inserted; slot 15 isn't mentioned.

cont. 01/12/2018 Daniel came by again, and we discovered that the disk wasn't being seen because I had configured the drives in the RAID card to all be "single-disks" rather than JBOD. I did that because the documentation had said that "single-disk" was better than JBOD, but that was only in terms of fault tolerance; we've got ZFS taking care of that for us. Daniel reconfigured the drives to be JBOD and the card to automatically export unconfigured drives as JBOD. So now we've got just a bunch of disks for ZFS to play with! First, though, I'm going to test whether we have the ability to hot swap now; we couldn't do that in "single-disk". Turns out ZFS just kind of auto-fixed itself, which is nice. It says all the drives are fine, and the zpool has been restored. When I tried to whip out a drive suddenly, though, to see what would happen, nothing did. `dmesg` reported that the drive had been removed from the slot, but `zpool status nas0` hasn't changed. I'm going to solve this problem first, before trying to test hot-swappability. Why won't it mount? `zfs list` displays 'nas0' just fine, so it's not like it's invisible or anything. OK, so the mount point for 'nas0' is '/nas0', which, as `mount /nas0` reports, doesn't show up in '/etc/fstab' or '/etc/mtab'. Which, turns out, shouldn't matter, since there never were entries for it. Which I guess makes sense, since I suppose ZFS takes care of all that nonsense with `zfs mount`. Oh, I'm just dumb; `zfs mount` isn't going to do anything without a zpool to mount. The correct command is:
$ zfs mount nas0
Whoops! Now let's try to whip a drive out. The RAID card is totally cool with JBOD! Taking drives in and out is no problem at all. Excellent! Now let's put all the data back onto NAS-0! I started the data transfer with the command from last time:
$ rsync -av --append /mnt/nas1/nas0-bak-20160304/home/ /mnt/nas0/home/
I'm anticipating the same initial problems as before, but the solutions to those are documented, so we're good.

cont. 01/15/2018 The transfer hung, so I've stopped it, and am gonna try restarting it. It looked like it hung because NAS-0 weirded out; it's full of "rejecting I/O to offline device" errors. It's not letting me log in, though, so I'm thinking I'll just have to restart NAS-0. When I tried to restart NAS-0, the 'shutdown' command gave me an I/O error. Apparently, this means the drive is having mad issues. It looks like I found an alternate restart method, though:
$ echo 1 > /proc/sys/kernel/sysrq
$ echo b > /proc/sysrq-trigger
This will tell the computer to restart, but if the RAID card fails to initialize, the machine must be powercycled. Now let's try to resume the transfer.

cont. 01/16/2018 Transfer's still going...

cont. 01/18/2018 It looked like the transfer had finished, but some errors were reported. I'm going to run the command again to make sure everything's actually over.

cont. 01/19/2018 Transfer's still going...

cont. 01/20/2018 Transfer's still going... (It's on Vallary now, though, so it's almost done!)
cont. 01/21/2018
Transfer's done with no errors to speak of! Now to prepare everything like I did before. Since I've already copied over the required files from before, all the permissions are already good to go. Nice! Now to get my ssh key back up and running. Everything's good on the logging-in front! Now to pick up from where I left off...

01/08/2018 TAGS: nas1 no video out
NAS-1 won't give me any video output. When I turn it on, it just beeps at me, and that's it.

cont. 01/10/2018
Riley and I whipped NAS-1 out (watch your fingers!) to get the model of the motherboard: "AMIBIOS 786Q 2000 American Megatrends". When I turned NAS-1 on to hear the beeping, it just turned on normally. I guess it just needed to be unplugged and plugged back in. OK.

01/09/2018 TAGS: UPS software configuration
Riley: Starting to work on the UPS software again, picking up from where I left off. Ryan and I are going to have weekly meetings; hopefully this issue will be done in about a week. Basically, the software is there, it just needs to be configured.

01/19/2018 TAGS: drive LED red nas1 NAS1
One of the drives on NAS-1 has a red LED! I logged into NAS-1 to see what was up with
$ storcli /c0 show
and it didn't say anything was amiss; it thinks everything's green. Strange. Maybe it just needs a reboot or something, but it's still transferring data, so I'm gonna have to wait on that.

cont. 01/20/2018
The red light's turned off, so I guess we're good.

01/21/2018 TAGS: nodes no power
Welp, I went to go turn on the nodes so that I could resume working on condor, but they won't get power. I turned on the top ten, since they already had their lights on, but the bottom ten didn't even have their little power lights in the back illuminated. Huh. I tried turning the UPS that powers the bottom nodes off and on again, but that only served to kill the power to the top set of nodes. "What?", you may be asking yourself. I, too, am asking myself that same question. Why would the bottom UPS, the one exclusively dedicated to powering the bottom set of nodes, kill the power to the top set of nodes? It's truly a mystery, fo' sho'. Well, I tried the same thing with the top UPS to no change. Now I have two sets of nodes without the slightest inkling of power even though they are both plugged into fully powered UPSs. *sigh* Alright, it looks like the power strips aren't being powered for some reason. I've plugged the desk lamp into one of them so I'll know if it miraculously begins to work again. I plugged one of the power strips into one of the big power strips on the ground, and it worked fine! I guess the strips are fine; the UPSs just aren't supplying their power correctly. They're supplying power to everything else that's redundantly plugged in, though; both lights are green on everything else. Turns out the output breakers for the row of outlets into which the node power strips are plugged keep getting popped. Everything else turns on like before, though. (Side note: the screen for the CE has changed; the picture is much brighter for some reason. Huh. Spooky.) How to troubleshoot breakers, I have no clue. In the meantime, I have only five nodes on so that I can still try to fix condor.

cont. 01/23/2018
I talked to Daniel about the problem. He says to see what the UPSs think their load is; I can plug my computer into them and investigate using their software. If that doesn't yell at me, he says to see what the UPS thinks its power draw is.
If it's tripping at a lower point than it's supposed to, the breakers are going bad, but if the load has gotten too high, then something else has changed. My computer's having a hard time finding the UPS that's plugged into it. I plugged the UPS into the CE, and I'm gonna try accessing it that way. I found it at '/dev/usb/hiddev0', but with no way to access it. `lsusb` also shows that the UPS is detected.

>>>>> IMPORTANT
01/29/2018 (Riley): Daniel has been helping out. I haven't been posting here, but he's helping a lot. He keeps having to call out, but I'm pretty sure that next time we meet we're going to get everything done with the Tripp Lite software. He said the APC is easier and more usable, so I think that means once the Tripp Lite is done, we're maybe a day from being finished. After everything is done, I need to write the output to a cron job, which I can do on my own, but we'll see. I have to report that I've been lost for about three months, and now that I have help it's just a matter of meeting with Daniel. I don't know how anyone is supposed to learn this on their own; Daniel has been doing this for what he says is about ten years, and even he is having issues. Well, at least he has some semblance of an idea of what to do. Even when he tries to help, there just doesn't seem to be any rhyme or reason to how this nonsense works. I vote we ditch Tripp Lite entirely and use only the APC software/hardware. Hopefully next time I can report actual headway on this issue, which I've been dealing with for literally four months with no real progress. Once this is taken care of, though, we still have:
- the website
- NAS-0
- NAS-1
- the nodes
- making the SE run an actual OS
- making sure there are no bugs
- replacing the file-management system BeStMan with Hadoop (somehow)
- doing a bunch of yum updates and yum provides for whatever needs it
- hoping the CE isn't totally destroyed
- hoping we don't need to replace the batteries again before this is over
Also, a side note: the batteries are hooked up using red/black alligator clips and copper wire. How are we going to get an individual battery report? If we want that, we need to completely tear the units down and start over. Not to mention the current "storage solution" is the sole reason the batteries last an eighth of their lifetime and need to be forcibly removed. Maybe a change to the battery storage isn't a bad idea. Also: the website. Riley out.

01/31/2018 TAGS: Daniel, UPS
Today Daniel and I sat down to work on the Tripp Lite. We need to upgrade to CentOS 7: we can't use the proprietary software with CentOS 6, and upgrading will fix just about every issue. For the things that need to stay on CentOS 6, we can do some weird 'hmount' thing that keeps them in 6 while all the real software is on 7. Also, I think we hit something, because the entire cluster started screaming. Daniel eventually figured it out, but we need the APC to be on. It started screaming right before I did anything, which was weird timing; I may have knocked a wire loose. Ryan hasn't put anything in the log for a while, so I guess he hasn't made any progress recently. RIP Ryan?

cont. Today I am teaching Sam how to use bash. Hopefully she'll stop being a scrub and become a sysadmin.
"Are you adding this to the official log?" - Samantha Worjlsthaer, 2018
"I got it" - Sam, 2018
"No, don't put that where anyone else can see it!"
- still Sam

02/03/2018 TAGS: nas0 drive failure
A drive has failed in NAS-0. While this would normally be bad news, it served as an excellent test to see if everything works. Once the drive failed, the hot spare immediately took over with no problem, so it's all working great! I still gotta change out the drive, though, so that's what I'm doing today. It says that 'sdi' is the one that failed, so I ran the following command to find 'sdi's physical port number:
$ udevadm info --query=all --path=/block/sdi
It looks like it's in slot 8. I brought the drive offline with:
$ sudo zpool offline nas0 sdi
and whipped drive 8 out of the array. I've inserted the new drive and run:
$ sudo zpool clear nas0 sdi
It now says it's repairing both 'sdi' and the spare 'sdn'. Another drive, 'sdh', is now 'faulted', though. I'm going to wait for the repairing to complete before messing with 'sdh'. I was going to replace the battery in the APC UPS today, but I don't wanna turn off NAS-0 in the middle of this repair, so I'll save that for tomorrow.

cont. 02/05/2018
I've replaced 'sdh', but now 'sdg' has faulted. I've been extracting drives from the incorrect slots: the `udevadm` command seems to count the first two OS drives as one drive, so I've effectively been working my way up the array pulling the wrong drives. I'm gonna wait for 'sdh' to get itself fixed up before I play with 'sdg'.

cont. 02/13/2018
Alright, 'sdi' is degraded, and what used to be 'sdg' is unavailable. Let's get 'sdg' situated first. The 'udevadm' command listed 'sdg's position as slot 6, so I'm going to remove the drive in slot 7, because that's the actual slot when taking 'sda' (two physical drives for one logical drive) into account. Sliding the drive already in the slot back in, since that drive was a replacement anyway, and running `zpool replace nas0 /dev/sdg` did the trick! The drive is now being resilvered. 'sdi' is also being resilvered, so I'm going to wait until it's done doing what it's doing before I fix that one next.

cont. 02/16/2018
Time to replace 'sdi'. I've offlined it and replaced the drive. 'replace' doesn't work, though; it gives me a "cannot label" error. Huh, maybe I just accidentally threw in a bad drive. I've tried a different drive with the same result. When I try
$ sudo zpool replace nas0 /dev/sdi1
instead of
$ sudo zpool replace nas0 /dev/sdi
I get a "one or more devices is currently unavailable" error. Hmm. I glanced at the history to remind myself of how I did the previous drive, but it's just a 'zpool replace'. Man, what's up?

cont. 03/02/2018
Now the drives are all good, so Imma throw it in.

02/05/2018 TAGS: APC UPS battery low
One of the batteries in the APC UPS measured 10V instead of the rated 12V and the regularly reported 13V. We ordered new batteries, and I'm gonna throw in the new one while checking the other batteries. I've replaced the low one with the new one, and everything seems to be fine.

2018/02/11 (Daniel C.)
Fixed the condor scheduler: /etc/condor/config.d/00personal_condor.config had CONDOR_HOST set to a local address instead of FULL_HOSTNAME. Needs further investigation: /etc/hosts defines the listening IP (10.1.1.1) as uscms1.local. The preferred solution is to make 10.1.1.1 resolve to uscms1.fltech-grid3.fit.edu. The current solution is to add exceptions to /etc/condor/config.d/00personal_condor.config and add 10.1.1.1 to COLLECTOR_HOST and ALLOW_NEGOTIATOR. Investigated HTCondor-CE authentication issues and determined that only LCMAPS VOMS is supported for OSG 3.4; that may require some new setup.
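For reference, that workaround presumably looks something like the following in '/etc/condor/config.d/00personal_condor.config' (a sketch only; the file's exact existing values are assumed, not copied):
CONDOR_HOST = $(FULL_HOSTNAME)
COLLECTOR_HOST = $(CONDOR_HOST), 10.1.1.1
ALLOW_NEGOTIATOR = $(ALLOW_NEGOTIATOR), 10.1.1.1
followed by a `condor_reconfig` so the running daemons pick it up.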
2018/02/12 (Daniel C.)
Fixed dashboardNAS0.php disk updates. dashboardNAS0.php reads from nas0check.txt in /var/www/html/diagnostics. That file is updated by a cron job, nas0chk, which runs /usr/local/bin/nas0check.sh. This script was not working and needed tweaking. Instead of ssh'ing to nas-0-0 as root, it needed to connect as fakeroot, and the ssh shell only needed to run tw_cli, not awk as well, so the awk was moved client side. nas-0-0 had a sudoers.d file added (sudoers.d/tw_cli-nas0check) with the following contents:
fakeroot ALL=(ALL) NOPASSWD: /usr/sbin/tw_cli /c0 show
For whatever reason, fakeroot is not root. (I mean, no duh, but I don't understand why it exists.) fakeroot is given sudo access to run 'tw_cli /c0 show', and that command only, with no password. The script now works and is reporting correctly.

02/13/2018 TAGS: nas1 drive failure
Drive 41:2 (physical slot 1:2) on NAS-1 has failed. I've run the following commands to replace it:
$ storcli /c0/e41/s2 set offline
$ storcli /c0/e41/s2 set missing
$ storcli /c0/e41/s2 spindown
Once the new drive is in place, it should automatically start to rebuild. The status of the rebuild can be checked with:
$ storcli /c0/e41/s2 show rebuild
If it does not automatically begin, the rebuild can be manually started with:
$ storcli /c0/e41/s2 start rebuild

02/16/2018 - Riley TAGS: Fail2Ban, OSG User Accounts, CERN User Accounts
Installed fail2ban on the SE, and will install it on NAS-1. The SE has a basic configuration; NAS-1 will have some fancier stuff on it. I'm putting off configuring the batteries until I have time to redo the process for CentOS 6, or until I just make a CentOS 7 chroot. I am also making user accounts for myself (Riley) for CERN and OSG. I think we are actually getting close to having a functioning cluster, or at least closer. The battery check is not necessary, and I think I'm just going to leave it as something to do later; realistically it's just nice to have, and there is no reason to do it now. There are much more pressing issues.

02/19/2018 TAGS: APC UPS tripping off
[import from offline adminlog]

02/20/2018 TAGS: NAS0 mount incorrectly
When turning everything back on, it seems that the CE mounted the wrong part of NAS-0; only NAS-0's OS drive was mounted rather than the storage pool. `zpool status -x` says that no pools are found, which is worrisome because it's hooked up to nas0. I guess I'll just have to come back later and restart the whole thing again to see if that'll fix it.

cont. 02/23/2018
Riley must have restarted the cluster, because nas0 is back online!

02/21/2018 TAGS: Fail2Ban
Strange issue with Fail2Ban: it looks like the Debian flavor. I used `yum install fail2ban` to get the files, but somehow they may not be the Red Hat files the cluster needs. It doesn't seem to be working yet, because somebody from China tried to log in about 50 times over 5 days; the current configuration shouldn't allow more than 25 login attempts over the course of 5 days if they're just spamming. Either that, or they're sitting in a jail somewhere and are still allowed to ping for some reason. I don't know, man.

02/23/2018 TAGS: Fail2Ban, Drives
Fail2Ban is now complete. I have become the official ban hammer of the cluster. Daniel showed me some dank commands to run to check disks, which pretty much eliminates the need to rip disks out. He said this is what normal people do, and I'm upset that I've never even seen it before.
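The exact commands aren't written down here yet, but my guess is something from smartmontools along these lines (a sketch, not necessarily what Daniel ran):
$ sudo smartctl -H /dev/sdX    # quick SMART health verdict for a disk
$ sudo smartctl -a /dev/sdX    # full SMART attributes (reallocated/pending sectors, error logs, etc.)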
Anyway, rest in peace, Ryan. I hope I can get the actual command into the adminlog when I'm feeling less lazy.
TODO: install postfix and configure it to use fit.edu as the relay for Fail2Ban notifications (rough sketch of the relay setting at the end of this section).

02/23/2018 TAGS: APC UPS battery light red
The battery light on the APC UPS is red again. Imma turn everything off, then run the UPS's self-test. The test was fine, and I've turned everything back on.

02/27/2018 TAGS: Certs
I'm looking into how to make certs for CERN and OSG so there will be 2 SysAdmins with ProCerts. BanHammer out.
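Re: the postfix TODO above, the relay piece is probably just the 'relayhost' setting in /etc/postfix/main.cf (a sketch; the actual FIT relay hostname is an assumption and still needs confirming):
relayhost = [fit.edu]    # placeholder; swap in the real FIT mail relay host
$ sudo service postfix restart    # restart postfix so the change takes effect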