08/22/2017 TAGS: NAS-0 not working crash on boot
Everything is not good. During break, a catastrophic hardware calamity befell NAS-0. Two drives are dead, and the BBU (Battery Backup Unit) on the RAID card has failed. NAS-0 kernel panics on boot, a reported symptom of a failed BBU. The card itself seems fine, however, because its settings can still be accessed during boot. New drives and a battery have been ordered. Another scary symptom of NAS-0's inoperability is that `df` hangs.

cont. 08/23/2017
I searched the controller's BIOS for options to boot without the BBU. I found a setting that ignores the RAID controller on boot, but then boot failed because no operating system could be found; it's probably stored in the RAID. It might be a good idea to keep the boot disk separate from the RAID in the future.

cont. 08/25/2017
While we wait for the new battery to arrive, I replaced the two failed drives and started the rebuild process from the controller's BIOS.

cont. 09/15/2017
The battery is here! We've installed it and are ready to turn NAS-0 on! But first, I'm shutting the entire cluster down so I can bring everything up in the proper order. Turns out the battery needs to charge first, so I'm gonna have to wait until Monday to do anything.

cont. 09/18/2017
NAS-0 still kernel panics on boot. *sigh* I tried booting from the CentOS 6.5 disc, but no dice; it looked like it booted, but it hung on a black screen with a mouse pointer. I also tried booting from the Rocks 5 disc, but when it couldn't find the IP address it wanted, it restarted and began the loop again. I started playing with GRUB; let's see where that goes.

cont. 09/19/2017
I tried the Rocks CD again (this time we have internet!), and it advanced to the next step! It's looking for a Rocks image and can't find one. I'd assume the image would be on the Rocks CD in the drive, but I guess not. None of the hard drives seem to have an image hidden on them either. Although, Rocks was unable to retrieve a file from somewhere on NAS-0, so maybe that had something to do with it. I found some Rocks 6.1.1 Jumbo DVDs, and I threw one into NAS-0. It has a rescue mode that I've entered. Welp, when I turned NAS-0 on to play with the Jumbo DVD, drive 8 decided to disappear. When I restarted, drive 15 also disappeared. So now drives 8 and 15 are gone, with drive 10 still in "rebuild" status. Also, when I try to choose the "Installation Method" for Rocks, it rejects the Rocks DVD already in the slot, saying the installation material isn't present on it. Which disc contains the proper info, then? Drive 15 suddenly reappeared! That's nice.

cont. 09/22/2017
I had replaced both drives 8 and 15 (15 disappeared again after I replaced drive 8), but the controller wouldn't let me add the new drives to the RAID group, perhaps because the array was already labeled "REBUILDING". After the replacements had been made, I exited the controller BIOS to start booting. There was a CentOS 6.5 boot DVD in NAS-0. It didn't hang on a black screen this time; it booted into the live CD properly! I have some bad news: NAS-0 is dead. The 3ware BIOS manager (the RAID card's BIOS) reports the RAID array as "unusable". The 3ware documentation says that an "unusable" array is totally dead; it's suffered too many failures to be brought back. I'm asking Blueshark (Daniel Campos) to take a look at it anyway, though, in case there's some crazy nonsense we can do to resurrect it. Today is a dark day for the cluster. Daniel Campos said that our last hope is to try to image the broken disks, put their information on the good disks, then throw them back into the RAID.
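For future reference: once a live CD is up, the array and BBU state can also be read from the OS with 3ware's CLI instead of rebooting into the card BIOS. A sketch only; we didn't have `tw_cli` installed at this point, and '/c0' assumes the card shows up as controller 0:

$ tw_cli /c0 show          # unit, drive, and rebuild status for controller 0
$ tw_cli /c0/bbu show all  # BBU status, charge level, and last test results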
cont. 09/25/2017
I tested the drives. The three that had any data on them are physically busted; they click and aren't recognized by the computer at all. The data is lost. NAS-0 is no longer with us.

08/22/2017 TAGS: mount NAS-1 remotely on separate machine
Since no one can log onto the cluster with NAS-0 dead, we need to mount NAS-1 remotely to access it. First the IP of the machine must be added to '/etc/exports' on NAS-1, then the changes must be applied with `exportfs -ra`. To mount it on a Mac:
$ sudo mount -o resvport 163.118.42.3:/nas1 /location/on/local/machine/

08/23/2017 TAGS: /var full
'/var' is full again. The '/var/log/tomcat6/gums-service-cybersecurity.log*' files were taking up 100M per file (there were five rotated ones), and they only contained the same Java error message repeated over and over. I removed the five old rotated files and kept the latest log. '/var/log/maillog' (1.8G) is full of messages reporting that mail sent to NAS-0 has bounced; I've cleared the log.

08/25/2017 TAGS: nas1 NAS-1 failed drive replace
A drive failed on NAS-1 and we're gonna replace it. To view NAS-1's RAID, run `storcli /c0 show`. To remove the drive with storcli:
$ storcli /c0/e/s set offline
(*) the left-most column of `storcli /c0 show` gives the drive names in 'enclosureID:slotID' format; those fill in the 'e' and 's' here
$ storcli /c0/e/s set missing
$ storcli /c0/e/s spindown
(*) spins down the drive and makes it safe for removal
The drive can now be safely removed. Once the new drive is in place, it should automatically start rebuilding. Progress can be checked with `storcli /c0/e/s show rebuild`, and if the drive's status doesn't change to "Rbld", the rebuild can be manually started with `storcli /c0/e/s start rebuild`.

08/28/2017 TAGS: nodes acting funny
The second group of nodes (2-0, ...) is acting kinda strange. When I logged on, I saw the splash text that usually appears after the nodes are turned back on from a restart, and the diagnostics page shows that they have NAS-0 mounted and a 0 load average, while the other 10 nodes have super high load averages (~5000).

cont. 08/29/2017
Time to exorcise the nodes! The script that gathers data from the nodes is '/usr/local/bin/cn.sh', and it writes to '~/diagnostics/cn.json'. The script checks for a mounted file system by running `df -h /filesystem/mount/point/` and seeing if anything is returned. On the '1-' nodes, `df` just hangs like on the rest of the cluster. On the '2-' nodes, however, it returns the line with the mount point '/'. While that's not NAS-0, it's something, so the website reports a success. The load average is found with `cat /proc/loadavg`. That doesn't explain why the load is so high, though. The load average is high because the diagnostic script runs `df`, which hangs on the '1-' nodes; several instances of the hung process pile up and try to run simultaneously. I've restarted the nodes, which will fix the problem; `df` will work fine. The '1-' nodes aren't ssh-able. I'll have to investigate that later. The '1-' nodes all tried to mount NAS-0 on boot, and they all failed to complete booting because they thought NAS-0 was a busy device. I'm gonna powercycle them to see if that'll work. They're good now. All of the nodes have a low load average, and they all falsely report NAS-0 to be mounted (a sketch of a more robust check is below).
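A bounded version of the mount check would keep the pileup from recurring. A minimal sketch of what the test in 'cn.sh' could look like; the 5-second limit is my assumption, not the script's actual contents:

# probe the mount without risking a stack of hung df processes
if timeout 5 df -h /filesystem/mount/point/ >/dev/null 2>&1; then
    echo "mounted"
else
    # a df stuck in uninterruptible NFS I/O may still linger,
    # but the script itself moves on after 5 seconds
    echo "unmounted (or df hung)"
fi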
09/01/2017 TAGS: NAS-1 RAID card
Today some strange nonsense happened. NAS-1 told me that its RAID card had suffered some catastrophic failure and was no longer operable. I powercycled NAS-1 because everything on it hung. On boot, the RAID card would beep, and nothing would appear on the monitor. Everything on the CE also hung. Scary. I turned the whole cluster off and tested the APC UPS, which yelled at me, so I manually checked all of its batteries. After all of the batteries passed inspection, I put them back in and turned everything back on. Everything, except NAS-0 of course, booted up just fine. I have no idea what caused the issue in the first place.

09/05/2017 TAGS: new hostcert
OSG emailed me saying that my hostcert is about to expire. The new hostcert and hostkey have been obtained.

09/05/2017 TAGS: CE hung
The CE decided to hang; nothing could be done on it. I restarted the cluster, and it's good now.

09/14/2017 TAGS: UPS no power not turning on
When we plugged everything back in after the hurricane, the top Tripplite SmartPro UPS refused to accept power. No lights turned on to indicate that it sees any kind of power at all. I tried plugging it into different outlets, but the bottom UPS accepted those outlets just fine. The model number of the Tripplite UPSs is "SMART5000RT3U".

cont. 09/15/2017
The power button of the busted UPS feels kinda wonky. It feels like there's not even a button behind the flexible plastic button cover; the plastic just gives with hardly any resistance, unlike the bottom UPS, which has a more solid-feeling button press. However, the button could just feel strange because it's not getting any power; the other button (the alarm button) won't even depress at all. I ripped the UPS's face off to investigate the buttons on the circuit board; they're both fine.

cont. 09/20/2017
I called Tripplite for assistance, and he told me to check the batteries. Just what I feared he'd say! Well, let's get them out of the rack and see what's up. The batteries are all destroyed. They are all swollen, and there's corrosion everywhere. It's a repeat of 2 years ago! (Fun Fact: We replaced the batteries 09/21/2015, almost EXACTLY 2 years ago!)

09/26/2017 TAGS: NAS-1 diagnostics strange
The RAID monitoring for NAS-1 on the diagnostics page is a bit wonked out. The script is having trouble when it tries to ssh into NAS-1; some drives show '/root/.bashrc' errors. Oh, right: when I tried to install ROOT on NAS-1 earlier, I put some nonsense in its '.bashrc' that spits out errors whenever it's run. The script writes down whatever goes to standard output, which, in this case, includes error messages for the first two lines. So the website is reading the first two error messages and displaying them. Whoops! Let's fix NAS-1's '.bashrc'. I commented out the broken ROOT line; it's all good now.
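A quick way to catch this kind of breakage in the future (a sketch, not part of the monitoring script): anything a non-interactive ssh prints besides the command's own output is coming from the remote shell's startup files.

$ ssh nas-1 'echo OK'
# anything printed before "OK" came from the remote .bashrc
# and will end up in whatever the diagnostics script parses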
09/26/2017 TAGS: squid not running
The diagnostics page says that 'squid' isn't running. I tried to start it with `service frontier-squid start`, but it complained that '/home/squid' didn't exist. RIP; I guess it's dead until we can resurrect NAS-0.

09/26/2017 TAGS: NAS-0 redo
Welp, NAS-0's dead. But now we have an opportunity to redo its RAID configuration! What shall it be? I really wanted to do ZFS, because it's the best, but it's slowly turning out not to be viable. The hardware may not cooperate nicely with it, and we may need new hardware to connect all of the drives together in the absence of a RAID card. So, I think we're gonna have to stick to the card we've got. Unfortunately, since the card doesn't support RAID-60, we're gonna have to come up with a more creative solution (I wanna see if there are better options than just straight RAID-6).

09/27/2017 TAGS: rack rearrangement
Today, we're taking out the bottom Tripplite UPS to examine its batteries. We're also gonna take the UPSs completely out, put NAS-1 and the SE where the UPSs were, then put the UPSs, spread out, on the left rack.

cont. 10/04/2017
Alright, everything's done. The rearrangement went wonderfully. I even rewired everything! I'm going to make a document showing where I plugged everything in. The batteries also came in, and we installed those. They're charging themselves up and working great!

10/16/2017 TAGS: SE no ethernet
All the ethernet ports have their red lights on, so Imma restart everything to see if that does anything. I restarted everything, but the red persists. Huh.

cont. 10/17/2017
Well, we need ethernet to add NAS-0 back to the cluster, so this has got to be fixed. The four weirded-out parts (CE, SE, NAS-1, NAS-0) are all plugged into a group of four dual-personality ports. Maybe the dual-personality ports have the wrong personality? I tried plugging one of the devices into an adjacent, regular ethernet port on the router, but the light is still red. Although, NAS-0's light has mysteriously decided to turn green.

cont. 10/18/2017
Well, I've discovered some things today. It's looking like I'm gonna have to interface with the router's console to see what's up. To do that, though, I need the console cable, which is Ethernet-Serial (RJ-45 to female DB-9). Of course, we don't have that cable. I found supplies to maybe make one, but that for sure won't work, so I'm probably just gonna have to buy one. *sigh* more waiting...

cont. 10/23/2017
The cable came in early! Imma hook the router up to the CE and see if it'll work. Gotta get that VT-100 emulator up and running first, though. I got the emulator 'minicom'.

cont. 10/24/2017
minicom must have the following configuration:
A - Serial Device: /dev/ttyS0
B - Lockfile Location: /var/lock
C - Callin Program: (empty)
D - Callout Program: (empty)
E - Bps/Par/Bits: 9600 8N1
F - Hardware Flow Control: No
G - Software Flow Control: No

cont. 10/25/2017
(Yo, the output from the router looks really cool, because you can see it being written to the screen since it's serial!) Nothing works. The switch has been configured with the following important properties:
Default Gateway: 172.16.42.126 (what was already there)
Time Sync Method: SNTP (what was already there)
SNTP Mode: Unicast (what was already there)
Poll Interval: 720 (default)
Server Address: 163.118.171.4 (what was already there)
I have been experimenting with the 'IP Config' settings. Right now, it's set to:
IP Address: 163.118.42.126
Subnet Mask: 255.255.255.128
I've also tried setting it to 'disable', but to no avail.

cont. 10/30/2017
Summary thus far: the high-speed connections are working fine; the CE, SE, and NAS-1 have internet no problem. The switch shows no error lights on itself, but the ethernet ports of all connected machines display a red LED indicating that the connection is dead. I've adjusted the dimensions of the console window: length 64, width 78. `show interfaces brief` displays the statuses of the ports, and it says nothing's wrong. `show interfaces display` reports that there is data running through all of the ports: almost 100M for each of the ethernet ports, and between 1.5G and 2G for the high-speed ports, which are operational.
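For the record, the settings above live in minicom's setup menu; a quick way to get there and to connect (assuming the serial device really is /dev/ttyS0):

$ minicom -s                     # opens the configuration menu to set the options above
$ minicom -D /dev/ttyS0 -b 9600  # or connect directly with the right port and speed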
cont. 11/02/2017
Daniel Campos came by and took a look at the switch. He did a bunch of fancy stuff, and it turns out that it matters which ethernet port on the computers is used, and I used the wrong one. *siiiigh* I threw everything in the proper port, but I can't test it now because class. Hopefully it's good now!

cont. 11/06/2017
Ethernet's golden! Now we can play with NAS-0.

10/17/2017 TAGS: creating NAS0 NAS-0 RAID
The time has come to finally reconstruct NAS-0's RAID! We've opted for RAID-10, which is a staggering improvement in security over the previous configuration (RAID-6), although we're taking a considerable hit to available space: only half of the drives' 12TB is usable. I have included all 16 drives in the array and configured it to heavily favor protection over performance. OK, I'm super sketched out by this RAID card. It won't let me configure how I want RAID-10 done. I would like to make it 2 groups of 8 drives each, so that the minimum tolerance is 4 drives (4 drives all from the same group). Unfortunately, this RAID card is lame af, so it automatically puts the drives into RAID-1 pairs that are all striped together. This only allows for a minimum tolerance of 1 drive; if both drives in a RAID-1 pair fail, the array dies. While this is among the lamest things I've seen, in 14/15 cases it's at least as safe as RAID-60 when 2 drives fail, and infinitely safer when 3 fail. For that reason, I'm gonna stick with RAID-10 over doing RAID-6 again.

cont. 10/18/2017
Maybe ZFS is a viable option! When searching for a cable, I found a massive cache of RAM in the supply closet. There are several sticks of 2GB, 4GB, and 8GB. While we're waiting for the router console cable, I could play with ZFS on NAS-0, which could be interesting.

cont. 10/20/2017
NAS-0's motherboard is a Supermicro X7DB8. It can support up to 32GB of 667/533MHz DDR2 RAM in sizes of 512MB, 1GB, 2GB, and 4GB. We wouldn't be able to use all of the RAM we found, but a good bit of it is still usable. Another problem, though, is much more concerning: how will the drives be directly connected to the motherboard without a RAID card? I doubt there are enough slots on the board, so a SATA hub may be necessary.

cont. 11/07/2017
Since this card is actual trash (the only RAID-10 option is literally the worst possible configuration of RAID-10: it only supports RAID-1 pairs connected in RAID-0), we're gonna try to use it as a SATA hub for the drives to be run in ZFS. Can the card be configured to run the disks in JBOD? A'ight, so here's the thing. I need to dedicate at least one drive to house the OS, and I'd like that drive to be backed up; that leaves us with 14 drives, which is still plenty. There are a few good ZFS options we can do:
1) 2 striped RAIDZ2 vdevs (RAID-60 with 2 groups of 7) - min: 2, max: 4, 7.5TB
2) 2 striped RAIDZ2 vdevs with 2 hot spares - min: 2 + ~2, max: 4 + ~2, 6TB; immediate replacement of 2 failures in quick succession (effectively 2 base tolerance with 2 extra tolerance per group)
3) 2 striped RAIDZ3 vdevs - min: 3, max: 6, 6TB
Imma try out option 2, just to see if it'll work out. First, I need to make a RAID-10 array with 2 drives (effectively a RAID-1 mirror); this'll be the OS drive. With the small array made, I threw the ROCKS disc in, and it did some things. I formatted the array as ext4, and it installed a bunch of stuff. I whipped the disc out, restarted it, and it booted into CentOS! Unfortunately, though, it's asking for a password that doesn't exist. That's fine, though, because I can ssh into it just fine (nice!). It's yelling at me because the RSA keys are all messed up, but that's fine; I'll fix it later (see the note below). NAS-0 has an OS again! Now the task is to make the other drives visible to the OS.
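About those RSA key complaints: since this is a fresh OS behind the old hostname, the CE's cached host key no longer matches. Presumably (this is my assumption about the cause) the fix is just clearing the stale entry on the machine doing the ssh'ing:

$ ssh-keygen -R nas-0-0.local  # drop the stale known_hosts entry; the new key gets accepted on the next connect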
cont. 11/13/2017
A'ight, let's get ZFS installed on NAS-0!
INCORRECT MISSTEPS: First we must install some dependencies:
$ yum install kernel-devel zlib-devel libuuid-devel libblkid-devel libselinux-devel parted lsscsi
Actually, never mind; the link this guide provides doesn't work. Let's try a new one. Here are the dependencies for this guide:
$ yum install dkms gcc make kernel-devel perl
Everything was preinstalled except 'dkms' (Dynamic Kernel Module Support: without it, kernel updates could break the ZFS modules), which is part of the RPMforge repository. Since NAS-0 is 64-bit, to install RPMforge: never mind, turns out RPMforge (aka RepoForge) is now deprecated, and big letters on the CentOS wiki say not to use it. So forget that; Imma install EPEL:
$ wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
$ rpm -ivh epel-release-6-8.noarch.rpm
Except `yum repolist` shows no sign of EPEL. *sigh* Turns out the repo's gotta be turned on: 'enabled' in '/etc/yum.repos.d/epel.repo' needs to be set to '1' rather than '0'. Now EPEL shows up in `yum repolist`. Nice! Now dkms can be installed:
$ yum install dkms
The next instruction calls for installing 'spl' and 'zfs':
$ yum install spl zfs
Unfortunately, neither of these packages can be found.
CORRECT METHOD: Fortunately, ZFS can be installed a different way. First, the ZFS repo must be installed:
$ yum install http://download.zfsonlinux.org/epel/zfs-release.el6.noarch.rpm
Then, ZFS itself must be installed:
$ yum install kernel-devel zfs
ZFS is now installed! Hooray! Now we've gotta get those drives visible. An important thing we gotta do is get 'tw_cli' installed, the RAID monitoring software. First, the ASL repo must be installed:
$ wget http://updates.aslab.com/asl/el/6/x86_64/asl-el-release-6-3.noarch.rpm
$ rpm -Uvh asl-el-release-6-3.noarch.rpm
Then the software needs to be installed:
$ yum install 3ware-3dm*
Now NAS-0 needs to be restarted. 'tw_cli' is installed and works great! I can see the unconfigured drives in 'tw_cli'; hopefully I can work with them. Looks like if I put all the other disks in their own separate units (putting them all in single-disk mode), they'll be visible to the OS. Let's try it! I can see all the drives! Now we can get ZFS up and running!
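For the record, roughly what the single-disk units looked like in tw_cli (a sketch; the port number is an example, not our actual layout):

$ tw_cli /c0 show                    # list the controller's ports and existing units
$ tw_cli /c0 add type=single disk=2  # wrap the drive on port 2 in its own single-drive unit
# repeat for each remaining port; each unit then appears to the OS as its own block device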
cont. 11/14/2017
I tried making the zpool, but it didn't like the 1TB replacement drive we threw in there, so I'm just gonna replace it with a normal 750GB. When I tried to remove the drive with 'tw_cli', though, it couldn't do it. That's because I was trying to remove the only drive in its unit, which it isn't happy with. I'm gonna have to delete the unit and remake it with the new drive. The zpool with option 2 was made:
$ zpool create nas0 raidz2 sdb sdc sdd sde sdf sdg raidz2 sdh sdi sdj sdk sdl sdm spare sdn sdo
Unfortunately, though, it only has 5.2TB of space, which is a bit less than the already-expected low amount of 6TB. Imma try option 1, the most spacious one. It wouldn't let me destroy the zpool; it said it was busy. Even after unmounting it, it still complained, so I restarted NAS-0. It's still busy. I'm gonna try to see what's holding it open with `lsof | grep deleted`. Nothing is printed. `lsof` didn't list anything with "nas0", but there are a few processes related to "zfs". `zpool iostat` revealed that there is some I/O going on in 'nas0' (also that there are 8.1TB free; suspicious, it's probably got something to do with parity and other ZFS data). Later, I'll try killing all of the ZFS processes.

cont. 11/20/2017
I just ran `zpool destroy nas0`, and it seems to have worked just fine. Huh, well, problem solved, I guess. I'm gonna try to make option 1 and see how much space that one actually gives us. It only gave us 6.6T of the expected 7.5T. I reported my findings at the meeting, and we've opted to go for option 2, the RAID-60EE equivalent.

cont. 11/27/2017
Let's make option 2 and start the copy of the '/home' backup. '/nas0' is busy, so I'm gonna comment out 'nas0' in '/etc/mtab' so that it won't be mounted on restart. After much fandangling, turns out the best course of action is to just restart NAS-0, then `zfs unmount nas0` and `zpool destroy nas0` as quickly as possible, before any crazy processes can start acting on it. Now I've gotta mount NAS-0 onto the CE so that data from NAS-1 can be sent over.

cont. 11/29/2017
Even though '/etc/fstab' contains an entry for NAS-0, `mount` doesn't see '/nas0' as available. There is a 'sharenfs' property on ZFS that allows ZFS volumes to be shared via NFS; it's set on '/nas0'. NFS is already good to go on NAS-0, but we've gotta add '/nas0' to '/etc/exports' so that NAS-0 knows to allow the CE to mount '/nas0'. I've added the following line to '/etc/exports':
/nas0 163.118.42.1(rw,sync,no_root_squash)
/nas0: the filesystem to be exported
163.118.42.1: the high-speed ethernet connection on the CE
rw: allow read/write
sync: the server confirms client requests only once the changes have been committed (safety)
no_root_squash: doesn't map the client's root user to 'nobody', so root can use the mount
By default there was an entry in '/etc/exports' called '/export/data1'. It caused some problems, so I commented it out. I then ran `exportfs -ra`. When I try a `mount /mnt/nas0` on the CE, I get the following error:
mount.nfs: access denied by server while mounting nas-0-0.local:/nas0
The error was because it doesn't like the IP for the CE I gave it; it prefers the LAN IP (10.1.1.1). '/nas0' is mounted fine now. Now the data transfer can begin! I used the command:
$ rsync -av --append /mnt/nas1/nas0-bak-20160304/home/ /mnt/nas0/home/
I omitted the 'nohup' because it was giving me problems, and I wanted to manually monitor the progress (it took a couple of days).

cont. 12/01/2017
Data transfer complete! Good news: all of the data transferred over just fine. Bad news: none of the file permissions were saved; I'm gonna have to fix that (a sketch of the fix is below). The permissions can be fixed by following the instructions from [10/31/2015]. The home directories also need to be mounted on '/home' rather than '/mnt/nas0/home'. So let's fix that mount point. Oh wait, hold on. Some of the home directories (mine, Ankit's, and a couple of others) are already mounted on '/home' from '/mnt/nas0/home'. Looks like we're good! I'm able to log in remotely with an ssh key again! Hooray!!!
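I don't have the [10/31/2015] instructions in front of me, but the usual shape of that fix is a chown sweep over the home directories. A sketch only; it assumes each directory name matches its owning user and group, which may not hold for every account:

$ cd /mnt/nas0/home
$ for u in *; do chown -R "$u:$u" "$u"; done  # hand each home dir back to its user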
10/31/17 Riley TAGS: NAS-0, NAS-0 RAID 10, Batteries, NAS-0 RAID card model, ZFS info
There isn't any literature I can find in the admin log about doing a battery test. I'll look on the twiki, but for now the project is at a standstill. For some reason, the glorious Google (TM) only gives me things about Microsoft (TM) clusters and UPS systems, so finding something won't be as easy as I initially thought. As for today, I'm ripping out NAS-0 and looking inside. I need to know the model of the RAID card for research, and how many ports it has. This info will be recorded here. I am seeing if it can be used as a hub for ZFS, and if it can, I'm planning on putting a bunch of RAM in it. For glory. Happy Halloween, my cluster friends.

10/31/17 cont.
Found the things for the UPS. All the info we have as of right now is the location of the UPS documentation on the cluster: '/etc/ups'. Ryan has a couple of things from 2 years ago, but there isn't any existing code to check the batteries. I'm going to start working on code to check the batteries. Moving on to the RAID card: the model is an AMCC 9650SE-12ML. It currently goes for $430 on the market, even though it's some dated tech, which leads me to believe that if any RAID card from that era could be used as a hub, this is it. The only problem is that everything online says it's possible to use a RAID card as the hub, but no one says how, because they unanimously say it's a terrible decision.

11/2/17 Riley TAGS: NAS-0, RAM, RAID card, Battery test
In order to use the NAS-0 RAID card as a hub for ZFS, we need a metric tonne of RAM. Luckily, the motherboard can support 16 RAM sticks, and the admin log does say it can handle up to 4GB sticks of DDR2. The only problem is that the RAM in the motherboard isn't regular DDR2; it's FB-DIMM. More research is needed to find out if there are any potential compatibility problems. Daniel Campos gave me some amazing resources for running APC diagnostics tests (a sketch of one is below). I'm going to try and make the APC as schnazzy as possible. Hopefully the Tripplite battery tests won't be too much more difficult. The battery info can be found at '/etc/ups'. BATTERY LOCATION: /etc/ups

11/2/17 cont. TAGS: RAM, NAS-0
It seems that the RAM is an implied DDR2, even though it doesn't say anything about DDR on it. UPDATE: We (with the help of Daniel Campos) found a decent way to solve our issues. NONE of the RAM fit into the motherboard, which is fine, because we don't need it anymore. Daniel suggested we use JBOD to host ZFS, and it doesn't really need a lot of RAM.

11/27/2017 TAGS: CE hang
The CE hung again today, so I powercycled it, and now it's fixed. It took FOREVER to turn on, though. There were some mad NFS timeouts, so I'm gonna try to reduce those. I changed the timeouts in '/etc/auto.master' from 1200 to 500. Hopefully that'll fix the problem.

12/04/2017 TAGS: nas0 dashboard diagnostics page
The RAID health check for NAS-0 is all kinds of messed up because NAS-0 has crazy splash text on login. Let's fix it! It said that line 29 in '/etc/ssh/ssh_known_hosts' on the CE was the offending line. That's the line for the old NAS-0; it was trying, and failing, to match the new NAS-0's key against the old key the CE had. I just deleted that line, and it put the new key on the CE. All is now well!

12/04/2017 TAGS: NAS-0 no root login
Ankit recommended we disable root login on NAS-0, which is probably not a bad idea. I created a user "fakeroot" and put `su -` in its '.bashrc', so that the root password must be entered to gain access to NAS-0. I copied over the CE's ssh key, but it still didn't work. I changed the permissions for '~/.ssh' and '~/.ssh/authorized_keys' in 'fakeroot''s home directory on NAS-0 (see the sketch below), and I ran `restorecon -Rv ~/.ssh`, which resets the SELinux context to default. It works fine! I can log in to NAS-0 from the CE with RSA. I've also added 'fakeroot' to the sudoers group on NAS-0:
$ usermod -aG wheel fakeroot
For the changes to take effect, log out and back in. I disabled ssh login for root on NAS-0 by setting 'PermitRootLogin' to 'no' in '/etc/ssh/sshd_config'. I made the root password required for any 'sudo' activity by adding 'Defaults rootpw' to '/etc/sudoers'.
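The permission changes were presumably the standard ones sshd insists on before it will read a key file (the exact values below are the usual convention, and the /home/fakeroot path is my assumption):

$ chmod 700 /home/fakeroot/.ssh                  # sshd rejects keys if the dir is group/world accessible
$ chmod 600 /home/fakeroot/.ssh/authorized_keys  # same for the key file itself
$ restorecon -Rv /home/fakeroot/.ssh             # restore the SELinux context so sshd can read it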
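On the APC diagnostics Riley mentions in the 11/2/17 entry above: one commonly used route (assuming the APC is managed by the apcupsd daemon, which I haven't verified, and these aren't necessarily Daniel's resources) is apcupsd's query and test tools:

$ apcaccess status  # dump charge %, runtime, line voltage, and last transfer reason
$ apctest           # interactive battery self-test/calibration (stop apcupsd first)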
12/19/2017 TAGS: NAS0 ZFS
I tried to work on the cluster remotely, only to find that my certificate wasn't working. Uh oh. Turns out ZFS didn't start up correctly on NAS-0, so '/nas0' wasn't mounted. I logged in as 'root' and tried a `zfs list`, but it just told me that no datasets were found. Maaaaaan. I'm gonna try unmounting NAS-0 from the CE, then restarting the thing. No dice. Imma try an update and restart. No dice x2. `zpool import` gave me data on the pool and told me a drive failed. The error message gave me this URL: http://zfsonlinux.org/msg/ZFS-8000-4J/ Turns out, since 'nas0' is an exported pool, it needs to be imported, which failed because it was degraded. It can still be manually imported, however, so that it can be worked on (see the sketch below). *sigh* Turns out the issue is that THREE drives decided to fail IMMEDIATELY after I left. *sigh* Man, c'mon now. There's gotta be a reason why all this nonsense always happens. Why do the drives in NAS-0 fail so often? NAS-0's super important. Maybe it's just 'cause all the drives are super old. I mean, it is a bunch of 750GB drives, which is an outdated size anyway. That's probably it; they're just super old. I guess even the "new" drives we get would be old, even if they've never been used. I don't even know how to fix that, though, short of replacing all the drives, but that's super expensive. *sigh* Who knows, man? Who knows? I haven't decided if I'm gonna run down there to replace the drives or not. Since it's still operational, and nothing new's been put on it, I'll probably just leave it.
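For next time, the manual recovery path the ZFS-8000-4J page describes boils down to something like this (a sketch; '-f' may be needed if the pool wasn't exported cleanly):

$ zpool import          # list importable pools and their reported health
$ zpool import nas0     # import the degraded pool anyway so it can be worked on
$ zpool status -v nas0  # see exactly which drives failed and what to replace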