08/22/2017 TAGS: NAS-0 not working crash on boot
Everything is not good. During break, a catastrophic hardware calamity befell NAS-0. Two drives are dead, and the BBU (Battery Backup Unit) on the RAID card has failed. NAS-0 kernel panics on boot, a reported symptom of a failed BBU. The card itself seems fine, however, because its settings can still be accessed during boot. New drives and a battery have been ordered. Another scary symptom of NAS-0's inoperability is that `df` hangs.

cont. 08/23/2017
I searched the controller's BIOS for options to boot without the BBU. I found a setting that ignores the RAID controller on boot, but then boot failed because no operating system could be found; it's probably stored in the RAID. It might be a good idea to keep the boot disk separate from the RAID in the future.

cont. 08/25/2017
While we wait for the new battery to arrive, I replaced the two failed drives and started the rebuild process from the controller's BIOS.

cont. 09/15/2017
The battery is here! We've installed it and are ready to turn NAS-0 on! But first, I'm shutting the entire cluster down so I can bring everything up in the proper order. Turns out the battery needs to charge first, so I'm gonna have to wait until Monday to do anything.

cont. 09/18/2017
NAS-0 still kernel panics on boot. *sigh* I tried booting from the CentOS 6.5 disc, but no dice; it looked like it booted, but it hung on a black screen with a mouse pointer. I also tried booting from the Rocks 5 disc, but when it couldn't find the IP address it wanted, it restarted and began the loop again. I started playing with GRUB; let's see where that goes.

cont. 09/19/2017
I tried the Rocks CD again (this time we have internet!), and it advanced to the next step! It's looking for a Rocks image and can't find one. I'd assume the image would be on the Rocks CD in the drive, but I guess not. None of the hard drives seem to have an image hidden on them either. Although, Rocks was unable to retrieve a file from somewhere on NAS-0, so maybe that had something to do with it. I found some Rocks 6.1.1 Jumbo DVDs, and I threw one into NAS-0. It has a rescue mode that I've entered. Welp, when I turned NAS-0 on to play with the Jumbo DVD, drive 8 decided to disappear. When I restarted, drive 15 also disappeared. So now drives 8 and 15 are gone, with drive 10 still in "rebuild" status. Also, when I try to choose the "Installation Method" for Rocks, it rejects the Rocks DVD already in the slot, saying the installation material isn't present on it. Which disc contains the proper info, then? Drive 15 suddenly reappeared! That's nice.

cont. 09/22/2017
I had replaced both drives 8 and 15 (15 disappeared again after I replaced drive 8), but the controller wouldn't let me add the new drives to the RAID group, perhaps because the array was already labeled "REBUILDING". After the replacements had been made, I exited the controller BIOS to start booting. There was a CentOS 6.5 boot DVD in NAS-0. It didn't hang on a black screen this time; it booted into the live CD properly! I have some bad news: NAS-0 is dead. The 3ware BIOS manager (the RAID card's BIOS) reports the RAID array as "unusable". The 3ware documentation says that an "unusable" array is totally dead; it's suffered too many failures to be brought back. I'm asking Blueshark (Daniel Campos) to take a look at it anyway, though, in case there's some crazy nonsense we can do to resurrect it. Today is a dark day for the cluster. Daniel Campos said that our last hope is to try to image the broken disks, put their information on the good disks, then throw them back into the RAID.
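For future reference: once a live CD is up, the array and BBU state can also be read from the OS with 3ware's CLI instead of rebooting into the card BIOS. A sketch only; we didn't have `tw_cli` installed at this point, and '/c0' assumes the card shows up as controller 0:

$ tw_cli /c0 show          # unit, drive, and rebuild status for controller 0
$ tw_cli /c0/bbu show all  # BBU status, charge level, and last test results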
cont. 09/25/2017
I tested the drives. The three that had any data on them are physically busted; they click and aren't recognized by the computer at all. The data is lost. NAS-0 is no longer with us.

08/22/2017 TAGS: mount NAS-1 remotely on separate machine
Since no one can log onto the cluster with NAS-0 dead, we need to mount NAS-1 remotely to access it. First the IP of the machine must be added to '/etc/exports' on NAS-1, then the changes must be applied with `exportfs -ra`. To mount it on a Mac:
$ sudo mount -o resvport 163.118.42.3:/nas1 /location/on/local/machine/

08/23/2017 TAGS: /var full
'/var' is full again. The '/var/log/tomcat6/gums-service-cybersecurity.log*' files were taking up 100M per file (there were five rotated ones), and they only contained the same Java error message repeated over and over. I removed the five old rotated files and kept the latest log. '/var/log/maillog' (1.8G) is full of messages reporting that mail sent to NAS-0 has bounced; I've cleared the log.

08/25/2017 TAGS: nas1 NAS-1 failed drive replace
A drive failed on NAS-1 and we're gonna replace it. To view NAS-1's RAID, run `storcli /c0 show`. To remove the drive with storcli:
$ storcli /c0/e/s set offline
(*) the left-most column of `storcli /c0 show` gives the drive names in 'enclosureID:slotID' format; those fill in the 'e' and 's' here
$ storcli /c0/e/s set missing
$ storcli /c0/e/s spindown
(*) spins down the drive and makes it safe for removal
The drive can now be safely removed. Once the new drive is in place, it should automatically start rebuilding. Progress can be checked with `storcli /c0/e/s show rebuild`, and if the drive's status doesn't change to "Rbld", the rebuild can be manually started with `storcli /c0/e/s start rebuild`.

08/28/2017 TAGS: nodes acting funny
The second group of nodes (2-0, ...) is acting kinda strange. When I logged on, I saw the splash text that usually appears after the nodes are turned back on from a restart, and the diagnostics page shows that they have NAS-0 mounted and a 0 load average, while the other 10 nodes have super high load averages (~5000).

cont. 08/29/2017
Time to exorcise the nodes! The script that gathers data from the nodes is '/usr/local/bin/cn.sh', and it writes to '~/diagnostics/cn.json'. The script checks for a mounted file system by running `df -h /filesystem/mount/point/` and seeing if anything is returned. On the '1-' nodes, `df` just hangs like on the rest of the cluster. On the '2-' nodes, however, it returns the line with the mount point '/'. While that's not NAS-0, it's something, so the website reports a success. The load average is found with `cat /proc/loadavg`. That doesn't explain why the load is so high, though. The load average is high because the diagnostic script runs `df`, which hangs on the '1-' nodes; several instances of the hung process pile up and try to run simultaneously. I've restarted the nodes, which will fix the problem; `df` will work fine. The '1-' nodes aren't ssh-able. I'll have to investigate that later. The '1-' nodes all tried to mount NAS-0 on boot, and they all failed to complete booting because they thought NAS-0 was a busy device. I'm gonna powercycle them to see if that'll work. They're good now. All of the nodes have a low load average, and they all falsely report NAS-0 to be mounted (a sketch of a more robust check is below).
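A bounded version of the mount check would keep the pileup from recurring. A minimal sketch of what the test in 'cn.sh' could look like; the 5-second limit is my assumption, not the script's actual contents:

# probe the mount without risking a stack of hung df processes
if timeout 5 df -h /filesystem/mount/point/ >/dev/null 2>&1; then
    echo "mounted"
else
    # a df stuck in uninterruptible NFS I/O may still linger,
    # but the script itself moves on after 5 seconds
    echo "unmounted (or df hung)"
fi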
09/01/2017 TAGS: NAS-1 RAID card
Today some strange nonsense happened. NAS-1 told me that its RAID card had suffered some catastrophic failure and was no longer operable. I powercycled NAS-1 because everything on it hung. On boot, the RAID card would beep, and nothing would appear on the monitor. Everything on the CE also hung. Scary. I turned the whole cluster off and tested the APC UPS, which yelled at me, so I manually checked all of its batteries. After all of the batteries passed inspection, I put them back in and turned everything back on. Everything, except NAS-0 of course, booted up just fine. I have no idea what caused the issue in the first place.

09/05/2017 TAGS: new hostcert
OSG emailed me saying that my hostcert is about to expire. The new hostcert and hostkey have been obtained.

09/05/2017 TAGS: CE hung
The CE decided to hang; nothing could be done on it. I restarted the cluster, and it's good now.

09/14/2017 TAGS: UPS no power not turning on
When we plugged everything back in after the hurricane, the top Tripplite SmartPro UPS refused to accept power. No lights turned on to indicate that it sees any kind of power at all. I tried plugging it into different outlets, but the bottom UPS accepted those outlets just fine. The model number of the Tripplite UPSs is "SMART5000RT3U".

cont. 09/15/2017
The power button of the busted UPS feels kinda wonky. It feels like there's not even a button behind the flexible plastic button cover; the plastic just gives with hardly any resistance, unlike the bottom UPS, which has a more solid-feeling button press. However, the button could just feel strange because it's not getting any power; the other button (the alarm button) won't even depress at all. I ripped the UPS's face off to investigate the buttons on the circuit board; they're both fine.

cont. 09/20/2017
I called Tripplite for assistance, and he told me to check the batteries. Just what I feared he'd say! Well, let's get them out of the rack and see what's up. The batteries are all destroyed. They are all swollen, and there's corrosion everywhere. It's a repeat of 2 years ago! (Fun Fact: We replaced the batteries 09/21/2015, almost EXACTLY 2 years ago!)

09/26/2017 TAGS: NAS-1 diagnostics strange
The RAID monitoring for NAS-1 on the diagnostics page is a bit wonked out. The script is having trouble when it tries to ssh into NAS-1; some drives show '/root/.bashrc' errors. Oh, right: when I tried to install ROOT on NAS-1 earlier, I put some nonsense in its '.bashrc' that spits out errors whenever it's run. The script writes down whatever goes to standard output, which, in this case, includes error messages for the first two lines. So the website is reading the first two error messages and displaying them. Whoops! Let's fix NAS-1's '.bashrc'. I commented out the broken ROOT line; it's all good now.
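A quick way to catch this kind of breakage in the future (a sketch, not part of the monitoring script): anything a non-interactive ssh prints besides the command's own output is coming from the remote shell's startup files.

$ ssh nas-1 'echo OK'
# anything printed before "OK" came from the remote .bashrc
# and will end up in whatever the diagnostics script parses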
09/26/2017 TAGS: squid not running
The diagnostics page says that 'squid' isn't running. I tried to start it with `service frontier-squid start`, but it complained that '/home/squid' didn't exist. RIP; I guess it's dead until we can resurrect NAS-0.

09/26/2017 TAGS: NAS-0 redo
Welp, NAS-0's dead. But now we have an opportunity to redo its RAID configuration! What shall it be? I really wanted to do ZFS, because it's the best, but it's slowly turning out not to be viable. The hardware may not cooperate nicely with it, and we may need new hardware to connect all of the drives together in the absence of a RAID card. So, I think we're gonna have to stick to the card we've got. Unfortunately, since the card doesn't support RAID-60, we're gonna have to come up with a more creative solution (I wanna see if there are better options than just straight RAID-6).

09/27/2017 TAGS: rack rearrangement
Today, we're taking out the bottom Tripplite UPS to examine its batteries. We're also gonna take the UPSs completely out, put NAS-1 and the SE where the UPSs were, then put the UPSs, spread out, on the left rack.

cont. 10/04/2017
Alright, everything's done. The rearrangement went wonderfully. I even rewired everything! I'm going to make a document showing where I plugged everything in. The batteries also came in, and we installed those. They're charging themselves up and working great!

10/16/2017 TAGS: SE no ethernet
All the ethernet ports have their red lights on, so Imma restart everything to see if that does anything. I restarted everything, but the red persists. Huh.

cont. 10/17/2017
Well, we need ethernet to add NAS-0 back to the cluster, so this has got to be fixed. The four weirded-out parts (CE, SE, NAS-1, NAS-0) are all plugged into a group of four dual-personality ports. Maybe the dual-personality ports have the wrong personality? I tried plugging one of the devices into an adjacent, regular ethernet port on the router, but the light is still red. Although, NAS-0's light has mysteriously decided to turn green.

cont. 10/18/2017
Well, I've discovered some things today. It's looking like I'm gonna have to interface with the router's console to see what's up. To do that, though, I need the console cable, which is Ethernet-Serial (RJ-45 to female DB-9). Of course, we don't have that cable. I found supplies to maybe make one, but that for sure won't work, so I'm probably just gonna have to buy one. *sigh* more waiting...

cont. 10/23/2017
The cable came in early! Imma hook the router up to the CE and see if it'll work. Gotta get that VT-100 emulator up and running first, though. I got the emulator 'minicom'.

cont. 10/24/2017
minicom must have the following configuration:
A - Serial Device: /dev/ttyS0
B - Lockfile Location: /var/lock
C - Callin Program: (empty)
D - Callout Program: (empty)
E - Bps/Par/Bits: 9600 8N1
F - Hardware Flow Control: No
G - Software Flow Control: No

cont. 10/25/2017
(Yo, the output from the router looks really cool, because you can see it being written to the screen since it's serial!) Nothing works. The switch has been configured with the following important properties:
Default Gateway: 172.16.42.126 (what was already there)
Time Sync Method: SNTP (what was already there)
SNTP Mode: Unicast (what was already there)
Poll Interval: 720 (default)
Server Address: 163.118.171.4 (what was already there)
I have been experimenting with the 'IP Config' settings. Right now, it's set to:
IP Address: 163.118.42.126
Subnet Mask: 255.255.255.128
I've also tried setting it to 'disable', but to no avail.

cont. 10/30/2017
Summary thus far: the high-speed connections are working fine; the CE, SE, and NAS-1 have internet no problem. The switch shows no error lights on itself, but the ethernet ports of all connected machines display a red LED indicating that the connection is dead. I've adjusted the dimensions of the console window: length 64, width 78. `show interfaces brief` displays the statuses of the ports, and it says nothing's wrong. `show interfaces display` reports that there is data running through all of the ports: almost 100M for each of the ethernet ports, and between 1.5G and 2G for the high-speed ports, which are operational.
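For the record, the settings above live in minicom's setup menu; a quick way to get there and to connect (assuming the serial device really is /dev/ttyS0):

$ minicom -s                     # opens the configuration menu to set the options above
$ minicom -D /dev/ttyS0 -b 9600  # or connect directly with the right port and speed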
cont. 11/02/2017
Daniel Campos came by and took a look at the switch. He did a bunch of fancy stuff, and it turns out that it matters which ethernet port on the computers is used, and I used the wrong one. *siiiigh* I threw everything in the proper port, but I can't test it now because class. Hopefully it's good now!

cont. 11/06/2017
Ethernet's golden! Now we can play with NAS-0.

10/17/2017 TAGS: creating NAS0 NAS-0 RAID
The time has come to finally reconstruct NAS-0's RAID! We've opted for RAID-10, which is a staggering improvement in security over the previous configuration (RAID-6), although we're taking a considerable hit to available space: only half of the drives' 12TB is usable. I have included all 16 drives in the array and configured it to heavily favor protection over performance. OK, I'm super sketched out by this RAID card. It won't let me configure how I want RAID-10 done. I would like to make it 2 groups of 8 drives each, so that the minimum tolerance is 4 drives (4 drives all from the same group). Unfortunately, this RAID card is lame af, so it automatically puts the drives into RAID-1 pairs that are all striped together. This only allows for a minimum tolerance of 1 drive; if both drives in a RAID-1 pair fail, the array dies. While this is among the lamest things I've seen, in 14/15 cases it's at least as safe as RAID-60 when 2 drives fail, and infinitely safer when 3 fail. For that reason, I'm gonna stick with RAID-10 over doing RAID-6 again.

cont. 10/18/2017
Maybe ZFS is a viable option! When searching for a cable, I found a massive cache of RAM in the supply closet. There are several sticks of 2GB, 4GB, and 8GB. While we're waiting for the router console cable, I could play with ZFS on NAS-0, which could be interesting.

cont. 10/20/2017
NAS-0's motherboard is a Supermicro X7DB8. It can support up to 32GB of 667/533MHz DDR2 RAM in sizes of 512MB, 1GB, 2GB, and 4GB. We wouldn't be able to use all of the RAM we found, but a good bit of it is still usable. Another problem, though, is much more concerning: how will the drives be directly connected to the motherboard without a RAID card? I doubt there are enough slots on the board, so a SATA hub may be necessary.

cont. 11/07/2017
Since this card is actual trash (the only RAID-10 option is literally the worst possible configuration of RAID-10: it only supports RAID-1 pairs connected in RAID-0), we're gonna try to use it as a SATA hub for the drives to be run in ZFS. Can the card be configured to run the disks in JBOD? A'ight, so here's the thing. I need to dedicate at least one drive to house the OS, and I'd like that drive to be backed up; that leaves us with 14 drives, which is still plenty. There are a few good ZFS options we can do:
1) 2 striped RAIDZ2 vdevs (RAID-60 with 2 groups of 7) - min: 2, max: 4, 7.5TB
2) 2 striped RAIDZ2 vdevs with 2 hot spares - min: 2 + ~2, max: 4 + ~2, 6TB; immediate replacement of 2 failures in quick succession (effectively 2 base tolerance with 2 extra tolerance per group)
3) 2 striped RAIDZ3 vdevs - min: 3, max: 6, 6TB
Imma try out option 2, just to see if it'll work out. First, I need to make a RAID-10 array with 2 drives (effectively a RAID-1 mirror); this'll be the OS drive. With the small array made, I threw the ROCKS disc in, and it did some things. I formatted the array as ext4, and it installed a bunch of stuff. I whipped the disc out, restarted it, and it booted into CentOS! Unfortunately, though, it's asking for a password that doesn't exist. That's fine, though, because I can ssh into it just fine (nice!). It's yelling at me because the RSA keys are all messed up, but that's fine; I'll fix it later (see the note below). NAS-0 has an OS again! Now the task is to make the other drives visible to the OS.
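About those RSA key complaints: since this is a fresh OS behind the old hostname, the CE's cached host key no longer matches. Presumably (this is my assumption about the cause) the fix is just clearing the stale entry on the machine doing the ssh'ing:

$ ssh-keygen -R nas-0-0.local  # drop the stale known_hosts entry; the new key gets accepted on the next connect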
cont. 11/13/2017
A'ight, let's get ZFS installed on NAS-0!
INCORRECT MISSTEPS: First we must install some dependencies:
$ yum install kernel-devel zlib-devel libuuid-devel libblkid-devel libselinux-devel parted lsscsi
Actually, never mind; the link this guide provides doesn't work. Let's try a new one. Here are the dependencies for this guide:
$ yum install dkms gcc make kernel-devel perl
Everything was preinstalled except 'dkms' (Dynamic Kernel Module Support: without it, kernel updates could break the ZFS modules), which is part of the RPMforge repository. Since NAS-0 is 64-bit, to install RPMforge: never mind, turns out RPMforge (aka RepoForge) is now deprecated, and big letters on the CentOS wiki say not to use it. So forget that; Imma install EPEL:
$ wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
$ rpm -ivh epel-release-6-8.noarch.rpm
Except `yum repolist` shows no sign of EPEL. *sigh* Turns out the repo's gotta be turned on: 'enabled' in '/etc/yum.repos.d/epel.repo' needs to be set to '1' rather than '0'. Now EPEL shows up in `yum repolist`. Nice! Now dkms can be installed:
$ yum install dkms
The next instruction calls for installing 'spl' and 'zfs':
$ yum install spl zfs
Unfortunately, neither of these packages can be found.
CORRECT METHOD: Fortunately, ZFS can be installed a different way. First, the ZFS repo must be installed:
$ yum install http://download.zfsonlinux.org/epel/zfs-release.el6.noarch.rpm
Then, ZFS itself must be installed:
$ yum install kernel-devel zfs
ZFS is now installed! Hooray! Now we've gotta get those drives visible. An important thing we gotta do is get 'tw_cli' installed, the RAID monitoring software. First, the ASL repo must be installed:
$ wget http://updates.aslab.com/asl/el/6/x86_64/asl-el-release-6-3.noarch.rpm
$ rpm -Uvh asl-el-release-6-3.noarch.rpm
Then the software needs to be installed:
$ yum install 3ware-3dm*
Now NAS-0 needs to be restarted. 'tw_cli' is installed and works great! I can see the unconfigured drives in 'tw_cli'; hopefully I can work with them. Looks like if I put all the other disks in their own separate units (putting them all in single-disk mode), they'll be visible to the OS. Let's try it! I can see all the drives! Now we can get ZFS up and running!
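For the record, roughly what the single-disk units looked like in tw_cli (a sketch; the port number is an example, not our actual layout):

$ tw_cli /c0 show                    # list the controller's ports and existing units
$ tw_cli /c0 add type=single disk=2  # wrap the drive on port 2 in its own single-drive unit
# repeat for each remaining port; each unit then appears to the OS as its own block device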
cont. 11/14/2017
I tried making the zpool, but it didn't like the 1TB replacement drive we threw in there, so I'm just gonna replace it with a normal 750GB. When I tried to remove the drive with 'tw_cli', though, it couldn't do it. That's because I was trying to remove the only drive in its unit, which it isn't happy with. I'm gonna have to delete the unit and remake it with the new drive. The zpool with option 2 was made:
$ zpool create nas0 raidz2 sdb sdc sdd sde sdf sdg raidz2 sdh sdi sdj sdk sdl sdm spare sdn sdo
Unfortunately, though, it only has 5.2TB of space, which is a bit less than the already-expected low amount of 6TB. Imma try option 1, the most spacious one. It wouldn't let me destroy the zpool; it said it was busy. Even after unmounting it, it still complained, so I restarted NAS-0. It's still busy. I'm gonna try to see what's holding it open with `lsof | grep deleted`. Nothing is printed. `lsof` didn't list anything with "nas0", but there are a few processes related to "zfs". `zpool iostat` revealed that there is some I/O going on in 'nas0' (also that there are 8.1TB free; suspicious, it's probably got something to do with parity and other ZFS data). Later, I'll try killing all of the ZFS processes.

cont. 11/20/2017
I just ran `zpool destroy nas0`, and it seems to have worked just fine. Huh, well, problem solved, I guess. I'm gonna try to make option 1 and see how much space that one actually gives us. It only gave us 6.6T of the expected 7.5T. I reported my findings at the meeting, and we've opted to go for option 2, the RAID-60EE equivalent.

cont. 11/27/2017
Let's make option 2 and start the copy of the '/home' backup. '/nas0' is busy, so I'm gonna comment out 'nas0' in '/etc/mtab' so that it won't be mounted on restart. After much fandangling, turns out the best course of action is to just restart NAS-0, then `zfs unmount nas0` and `zpool destroy nas0` as quickly as possible, before any crazy processes can start acting on it. Now I've gotta mount NAS-0 onto the CE so that data from NAS-1 can be sent over.

cont. 11/29/2017
Even though '/etc/fstab' contains an entry for NAS-0, `mount` doesn't see '/nas0' as available. There is a 'sharenfs' property on ZFS that allows ZFS volumes to be shared via NFS; it's set on '/nas0'. NFS is already good to go on NAS-0, but we've gotta add '/nas0' to '/etc/exports' so that NAS-0 knows to allow the CE to mount '/nas0'. I've added the following line to '/etc/exports':
/nas0 163.118.42.1(rw,sync,no_root_squash)
/nas0: the filesystem to be exported
163.118.42.1: the high-speed ethernet connection on the CE
rw: allow read/write
sync: the server confirms client requests only once the changes have been committed (safety)
no_root_squash: doesn't map the client's root user to 'nobody', so root can use the mount
By default there was an entry in '/etc/exports' called '/export/data1'. It caused some problems, so I commented it out. I then ran `exportfs -ra`. When I try a `mount /mnt/nas0` on the CE, I get the following error:
mount.nfs: access denied by server while mounting nas-0-0.local:/nas0
The error was because it doesn't like the IP for the CE I gave it; it prefers the LAN IP (10.1.1.1). '/nas0' is mounted fine now. Now the data transfer can begin! I used the command:
$ rsync -av --append /mnt/nas1/nas0-bak-20160304/home/ /mnt/nas0/home/
I omitted the 'nohup' because it was giving me problems, and I wanted to manually monitor the progress (it took a couple of days).

cont. 12/01/2017
Data transfer complete! Good news: all of the data transferred over just fine. Bad news: none of the file permissions were saved; I'm gonna have to fix that (a sketch of the fix is below). The permissions can be fixed by following the instructions from [10/31/2015]. The home directories also need to be mounted on '/home' rather than '/mnt/nas0/home'. So let's fix that mount point. Oh wait, hold on. Some of the home directories (mine, Ankit's, and a couple of others) are already mounted on '/home' from '/mnt/nas0/home'. Looks like we're good! I'm able to log in remotely with an ssh key again! Hooray!!!
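I don't have the [10/31/2015] instructions in front of me, but the usual shape of that fix is a chown sweep over the home directories. A sketch only; it assumes each directory name matches its owning user and group, which may not hold for every account:

$ cd /mnt/nas0/home
$ for u in *; do chown -R "$u:$u" "$u"; done  # hand each home dir back to its user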
10/31/17 Riley TAGS: NAS-0, NAS-0 RAID 10, Batteries, NAS-0 RAID card model, ZFS info
There isn't any literature I can find in the admin log about doing a battery test. I'll look on the twiki, but for now the project is at a standstill. For some reason, the glorious Google (TM) only gives me things about Microsoft (TM) clusters and UPS systems, so finding something won't be as easy as I initially thought. As for today, I'm ripping out NAS-0 and looking inside. I need to know the model of the RAID card for research, and how many ports it has. This info will be recorded here. I am seeing if it can be used as a hub for ZFS, and if it can, I'm planning on putting a bunch of RAM in it. For glory. Happy Halloween, my cluster friends.

10/31/17 cont.
Found the things for the UPS. All the info we have as of right now is the location of the UPS documentation on the cluster: '/etc/ups'. Ryan has a couple of things from 2 years ago, but there isn't any existing code to check the batteries. I'm going to start working on code to check the batteries. Moving on to the RAID card: the model is an AMCC 9650SE-12ML. It currently goes for $430 on the market, even though it's some dated tech, which leads me to believe that if any RAID card from that era could be used as a hub, this is it. The only problem is that everything online says it's possible to use a RAID card as the hub, but no one says how, because they unanimously say it's a terrible decision.

11/2/17 Riley TAGS: NAS-0, RAM, RAID card, Battery test
In order to use the NAS-0 RAID card as a hub for ZFS, we need a metric tonne of RAM. Luckily, the motherboard can support 16 RAM sticks, and the admin log does say it can handle up to 4GB sticks of DDR2. The only problem is that the RAM in the motherboard isn't regular DDR2; it's FB-DIMM. More research is needed to find out if there are any potential compatibility problems. Daniel Campos gave me some amazing resources for running APC diagnostics tests (a sketch of one is below). I'm going to try and make the APC as schnazzy as possible. Hopefully the Tripplite battery tests won't be too much more difficult. The battery info can be found at '/etc/ups'. BATTERY LOCATION: /etc/ups

11/2/17 cont. TAGS: RAM, NAS-0
It seems that the RAM is an implied DDR2, even though it doesn't say anything about DDR on it. UPDATE: We (with the help of Daniel Campos) found a decent way to solve our issues. NONE of the RAM fit into the motherboard, which is fine, because we don't need it anymore. Daniel suggested we use JBOD to host ZFS, and it doesn't really need a lot of RAM.

11/27/2017 TAGS: CE hang
The CE hung again today, so I powercycled it, and now it's fixed. It took FOREVER to turn on, though. There were some mad NFS timeouts, so I'm gonna try to reduce those. I changed the timeouts in '/etc/auto.master' from 1200 to 500. Hopefully that'll fix the problem.

12/04/2017 TAGS: nas0 dashboard diagnostics page
The RAID health check for NAS-0 is all kinds of messed up because NAS-0 has crazy splash text on login. Let's fix it! It said that line 29 in '/etc/ssh/ssh_known_hosts' on the CE was the offending line. That's the line for the old NAS-0; it was trying, and failing, to match the new NAS-0's key against the old key the CE had. I just deleted that line, and it put the new key on the CE. All is now well!

12/04/2017 TAGS: NAS-0 no root login
Ankit recommended we disable root login on NAS-0, which is probably not a bad idea. I created a user "fakeroot" and put `su -` in its '.bashrc', so that the root password must be entered to gain access to NAS-0. I copied over the CE's ssh key, but it still didn't work. I changed the permissions for '~/.ssh' and '~/.ssh/authorized_keys' in 'fakeroot''s home directory on NAS-0 (see the sketch below), and I ran `restorecon -Rv ~/.ssh`, which resets the SELinux context to default. It works fine! I can log in to NAS-0 from the CE with RSA. I've also added 'fakeroot' to the sudoers group on NAS-0:
$ usermod -aG wheel fakeroot
For the changes to take effect, log out and back in. I disabled ssh login for root on NAS-0 by setting 'PermitRootLogin' to 'no' in '/etc/ssh/sshd_config'. I made the root password required for any 'sudo' activity by adding 'Defaults rootpw' to '/etc/sudoers'.
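The permission changes were presumably the standard ones sshd insists on before it will read a key file (the exact values below are the usual convention, and the /home/fakeroot path is my assumption):

$ chmod 700 /home/fakeroot/.ssh                  # sshd rejects keys if the dir is group/world accessible
$ chmod 600 /home/fakeroot/.ssh/authorized_keys  # same for the key file itself
$ restorecon -Rv /home/fakeroot/.ssh             # restore the SELinux context so sshd can read it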
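On the APC diagnostics Riley mentions in the 11/2/17 entry above: one commonly used route (assuming the APC is managed by the apcupsd daemon, which I haven't verified, and these aren't necessarily Daniel's resources) is apcupsd's query and test tools:

$ apcaccess status  # dump charge %, runtime, line voltage, and last transfer reason
$ apctest           # interactive battery self-test/calibration (stop apcupsd first)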
12/19/2017 TAGS: NAS0 ZFS
I tried to work on the cluster remotely, only to find that my certificate wasn't working. Uh oh. Turns out ZFS didn't start up correctly on NAS-0, so '/nas0' wasn't mounted. I logged in as 'root' and tried a `zfs list`, but it just told me that no datasets were found. Maaaaaan. I'm gonna try unmounting NAS-0 from the CE, then restarting the thing. No dice. Imma try an update and restart. No dice x2. `zpool import` gave me data on the pool and told me a drive failed. The error message gave me this URL: http://zfsonlinux.org/msg/ZFS-8000-4J/ Turns out, since 'nas0' is an exported pool, it needs to be imported, which failed because it was degraded. It can still be manually imported, however, so that it can be worked on (see the sketch below). *sigh* Turns out the issue is that THREE drives decided to fail IMMEDIATELY after I left. *sigh* Man, c'mon now. There's gotta be a reason why all this nonsense always happens. Why do the drives in NAS-0 fail so often? NAS-0's super important. Maybe it's just 'cause all the drives are super old. I mean, it is a bunch of 750GB drives, which is an outdated size anyway. That's probably it; they're just super old. I guess even the "new" drives we get would be old, even if they've never been used. I don't even know how to fix that, though, short of replacing all the drives, but that's super expensive. *sigh* Who knows, man? Who knows? I haven't decided if I'm gonna run down there to replace the drives or not. Since it's still operational, and nothing new's been put on it, I'll probably just leave it.
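For next time, the manual recovery path the ZFS-8000-4J page describes boils down to something like this (a sketch; '-f' may be needed if the pool wasn't exported cleanly):

$ zpool import          # list importable pools and their reported health
$ zpool import nas0     # import the degraded pool anyway so it can be worked on
$ zpool status -v nas0  # see exactly which drives failed and what to replace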