01/07/2018 I thought I turned some nodes on, so that I could work on it before the Physics Building opened up, but I guess not. RIP. I guess I'll just have to wait until tomorrow.

cont. 01/21/2018 Alright, now that NAS-0 is fixed FOR REAL this time, let's get crackin'. Jk, the nodes won't get power. *sigh* The output breakers for the plugs into which the node power strips are connected are weirded out. So that I can continue to play with condor in spite of this strange issue, I only have five nodes (2-0 to 2-4) turned on. So far, the UPS seems to be alright with that.

cont. 01/22/2018 Time to play with condor. Let's start off with a classic 'condor_ce_trace' and see where we end up. First, I need to send off my new usercert. The instructions for converting a '.p12' to a '.pem' are found at [10/16/2015]. I copied both the new 'usercert.pem' and 'userkey.pem' to '/etc/grid-security'. I tried `condor_ce_trace -d uscms1.fltech-grid3.fit.edu`, and it told me that it couldn't connect to the CE; the collector daemon appears to be off. Yup, the collector daemon's down, verified by `condor_ce_status`. I did `service condor-ce start` to start it up. Now I'm getting all kinds of output from 'condor_ce_trace'. It's saying it's unable to create a temporary file in the working directory, '/root'. Imma try to run it as Voytella and see if I get anything different. Now it's telling me it can't find an X509 proxy in '/tmp/x509up_u14122'. That's because my user certificate is hella outdated. It says to just throw a copy of it and the key into '/home/Voytella/.globus'. Excellent! I've created a valid temporary proxy! Alright, now it's doing what it was doing before: querying every single idle job in the queue. '/var/log/condor/SchedLog' is also reporting a bunch of 'PERMISSION DENIED' errors like it was doing before.

cont. 01/26/2018 I'm going through the documentation sent by OSG. It says to look for "DC_AUTHENTICATE" and "PERMISSION DENIED" errors in '/var/log/condor-ce/SchedLog'. While I don't have those errors in the condor-ce SchedLog, they're all over the place in the condor SchedLog. The errors are also slightly different from what's described in the documentation. Alright, despite the documentation being for condor-ce, I'm gonna follow its directions to see what I can discover. First, it says to check GUMS or 'grid-mapfile' to ensure that my DN is known to my authentication method. I made sure that in '/etc/osg/config.d/10-misc.ini', 'authorization_method' was set to 'xacml' and 'gums_host' was set to our hostname. There is also a note that says that if the local batch system is HTCondor, it will attempt to use the LCMAPS callouts if enabled in '/etc/condor-ce/condor_mapfile', and if that's not the desired behavior, to set 'GSI_AUTHZ_CONF=/dev/null' in '/etc/condor-ce/config.d/99-local.conf'. The GSI thing wasn't set, so I set it. Imma try condor_ce_trace again and see what happens. Nothing seems to have changed. Oh, I forgot to `condor_ce_reconfig`. Now let's see if that does anything. I ran the 'condor_ce_trace' command as my user side-by-side with a `tail -f /var/log/condor-ce/SchedLog`. The 'condor_ce_trace' is doing the thing where it queries every single job to report that it's idle and sends a "connection request to schedd at <163.118.42.1:9619>". Every time it makes a new query, it writes the same thing to the SchedLog: the number of active workers is 0, plus something about forking workers and no more children processes to reap.
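NOTE: The '.p12' -> '.pem' conversion referenced above at [10/16/2015] is essentially the standard openssl two-step (that entry is the authoritative version; this is just from memory, run wherever the '.p12' lives):
$ openssl pkcs12 -in usercert.p12 -clcerts -nokeys -out usercert.pem
$ openssl pkcs12 -in usercert.p12 -nocerts -out userkey.pem
$ chmod 644 usercert.pem
$ chmod 400 userkey.pem
The key has to stay readable only by its owner or the grid tools will refuse to use it.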
I wonder if 'condor_ce_trace' writes anything to '/var/log/condor/SchedLog'. While there's a bunch of stuff being written to '/var/log/condor/SchedLog', it doesn't look like it's being caused by the 'condor_ce_trace'; it's just a bunch of the 'DC_AUTHENTICATE' and 'PERMISSION DENIED' errors. NOTE: There are a TON of LCMAPS and GRAM-gatekeeper authentication errors in '/var/log/messages'. Let's see what doing the GSI thing for regular condor does. NOTE: In '/etc/condor/config.d', there's a mysterious '99-condor-ce.conf'. What's that doing there? There's also a '50-condor-ce-defaults.conf'. Maybe they're there so condor can talk to condor-ce? They just say that the super user can impersonate anything. I made the GSI addition and reconfigured condor. Nothing new happened. The next thing it says is to look for LCMAPS errors in '/var/log/messages'. Oh hey! We're drowning in those! Let's investigate! It looks like the error starts with an authentication of a globus user, then it says it can't open file '/etc/lcmaps/lcmaps.db'. That causes an LCMAPS plugin error, which prevents LCMAPS from initializing. Then that failure breaks everything else. Let's see about that file. NOTE: LCMAPS (Local Credential MAPping Service) translates grid credentials to local Unix credentials. Turns out there's only '/etc/lcmaps.db' and no 'lcmaps' directory. I'm gonna try to make that directory and throw the file in it. Now, in '/var/log/messages', a bunch of globus users got authenticated in a row without issue and some other stuff happened. Then it gave a warning about still being "root after the LCMAPS execution. The implicit root-mapping safety is enabled. See documentation for details.", and the next line said that "globus_gss_assist_gridmap() failed authorization" and that the callout returned an unknown error. I'm gonna see about debugging LCMAPS. There's a whole page for troubleshooting LCMAPS on the wiki. First, it said to set up LCMAPS for maximum debugging by adding the following to '/etc/sysconfig/condor-ce':
export LCMAPS_DEBUG_LEVEL=5
export LCMAPS_LOG_FILE=/tmp/lcmaps.log
Then 'condor-ce' has to be restarted:
$ service condor-ce restart
It also says that disabling HTCondor-CE's caching of authorization lookups is a good idea for testing changes to mapfiles. To disable the caching, create '/etc/condor-ce/config.d/99-disablegsicache.conf' and insert
GSS_ASSIST_GRIDMAP_CACHE_EXPIRATION=0
then restart 'condor-ce'. NOTE: It says that disabling caching could increase the load on the CE (makes sense), so keep an eye on things to make sure nothing gets too out of control. It gave me a list of configuration files in order of precedence:
/etc/grid-security/ban-mapfile (ban DNs)
/etc/grid-security/ban-voms-mapfile (ban VOs)
/etc/grid-security/grid-mapfile (map DNs)
/etc/grid-security/voms-mapfile (map VOs)
/usr/share/osg/voms-mapfile-default (map VOs default)
'/etc/grid-security/grid-mapfile' is full of grid mappings, but '/etc/grid-security/voms-mapfile' doesn't exist. Strangely enough, it says that LCMAPS is configured in '/etc/lcmaps.db', the file I thought (and it thought) was misplaced earlier. Huh. Either way, it gives me a bunch of stuff to make sure I have in it. It looks like it contains none of what it's supposed to have. Imma go through and add a bunch of stuff, then. Above the 'authorize_only' section, I added the 'gridmapfile', 'banfile', 'banvomsfile', 'vomsmapfile', 'defaultmapfile', and 'verifyproxynokey' parameters.
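For reference, those parameter definitions look roughly like this (module names and flags here are paraphrased from the OSG LCMAPS docs rather than copied out of our actual '/etc/lcmaps.db', so double-check them against the current guide before trusting this):
gridmapfile = "lcmaps_localaccount.mod"
              " -gridmap /etc/grid-security/grid-mapfile"
banfile = "lcmaps_ban_dn.mod"
          " -ban /etc/grid-security/ban-mapfile"
banvomsfile = "lcmaps_ban_fqan.mod"
              " -ban /etc/grid-security/ban-voms-mapfile"
vomsmapfile = "lcmaps_voms_localaccount.mod"
              " -gridmap /etc/grid-security/voms-mapfile"
defaultmapfile = "lcmaps_voms_localaccount.mod"
                 " -gridmap /usr/share/osg/voms-mapfile-default"
verifyproxynokey = "lcmaps_verify_proxy.mod"
                   " --allow-limited-proxy"
                   " -certdir /etc/grid-security/certificates"
Each parameter just names an LCMAPS plugin and points it at the corresponding mapfile from the precedence list above; the 'authorize_only' policy section then chains them together.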
It said to edit the 'authorize_only' section to exactly what it is now; I've commented out what was already there. It also said to make sure '/etc/grid-security/gsi-authz.conf' contains a certain line (that terminates with a newline), but that's already there (including the newline). That's the end of the document. Now let's see what happens. That globus_gss_assist_gridmap() is still failing. Oh, turns out this troubleshooting guide I was following is just the tail end of the whole LCMAPS page. Imma run down it from the top and see what I can see. It says that to enable the LCMAPS VOMS plugin, I have to add the following to '/etc/osg/config.d/10-misc.ini':
edit_lcmaps = True
authorization_method = vomsmap
It also said to comment out 'glexec_location', and I've commented out the existing 'authorization_method'. It says that a Unix account must be created for each VO, VO role, VO group, and user that I wish to support. I'm not sure if that means every single user in '/usr/share/osg/voms-mapfile-default' or not, because that's a bunch of users. I can probably ask OSG about that. It says the 'allowed_vos' parameter in '/etc/osg/config.d/30-gip.ini' should be populated with the supported VOs per subcluster (worker node hardware) or resourceEntry (set of subclusters) section. I'm not entirely sure what it means by that, but our 'allowed_vos' is empty and commented out. I'll also ask OSG about that.

cont. 02/03/2018 They think we may not have the OSG version of LCMAPS. To see which version we have, I ran `rpm -q lcmaps`, and it told me we're running version 'osg33', while the most updated version is 'osg34'. Ah ha! I'll see about fixing that up. I've run a `yumUp`. That didn't cut it; I may have to do other things. Brian also said that I may not have run 'osg-configure', and he's right, I haven't! I've run `osg-configure -v`, and it gave me some info. It said I'll have to specify either a list of VOs or a '*' for the 'allowed_vos' option. It also said that I need to fix the 'gram_ce_hosts' option in '/etc/osg/config.d/30-rsv.ini', since GRAM is no longer supported (the whole reason for this debacle in the first place). In '/etc/osg/config.d/30-gip.ini', I've set 'allowed_vos' to '*'. I'll probably also have to make user accounts for all the VOs in '/usr/share/osg/voms-mapfile-default'. In '/etc/osg/config.d/30-rsv.ini', I edited 'ce_hosts' to just include HTCondor-CE, and I've commented out the 'gram_ce_hosts' setting. `osg-configure -v` gives me a "No allowed_vos specified for section 'Subcluster FLTECH'" warning, and a VO specification warning, saying that either a list of VOs or '*' must be given. I thought I had already taken care of that by modifying 'allowed_vos' in '/etc/osg/config.d/30-gip.ini'. Huh. I'll just go ahead with the `osg-configure -c` and keep these warnings in mind. The configure reported no errors, just the above warnings.

cont. 02/05/2018 OSG also said they wanted an updated `osg-system-profiler`, so I've started that off.

cont. 02/16/2018 (RIP, sorry OSG) Since it's been so long, I've made a new `osg-system-profiler`.

cont. 02/17/2018 OSG says I've gotta make users for all of the entries in '/usr/share/osg/voms-mapfile-default', so Imma see about doing that. The new users have been created. I've run `osg-configure -c` again and got the following warnings:
WARNING No allowed_vos specified for section 'Subcluster FLTECH'.
WARNING In OSG 3.4, you will be required to specify either a list of VOs, or a '*' to use an autodetected VO list based on the user accounts available on your CE.
WARNING No allowed_vos specified for section 'Subcluster FLTECH'.
WARNING In OSG 3.4, you will be required to specify either a list of VOs, or a '*' to use an autodetected VO list based on the user accounts available on your CE.
WARNING Can't copy grid3-location file from /etc/osg/grid3-locations.txt to /cmssoft/cms/etc/grid3-locations.txt
CRLs exist, skipping fetch-crl invocation
The repetition of the first two warnings is most likely a result of `osg-configure -c` first running `osg-configure -v` and simply printing those warnings for both commands. The last warning, however, I have no explanation for.

cont. 02/20/2018 OSG said I forgot to set 'allowed_vos' to '*' under the '[Subcluster FLTECH]' section of '/etc/osg/config.d/30-gip.ini'; I had only done it in the '[SE FLTECH-SE]' section.

cont. 02/23/2018 Daniel said he fixed some condor stuff, [02/11/2018], so let's try to run some condor jobs and see what happens. I submitted a job from my account, and it was immediately held.

cont. 02/24/2018 Since so much has changed, I'm going to run through the Condor troubleshooting documentation again to see what it says.

04/06/2017 TAGS: CE cannot ssh unresponsive
Vallary emailed me saying that she couldn't ssh into the cluster, and neither could I! Upon arriving at the high bay I found the CE unresponsive; just the blue background was visible with the mouse. I power cycled the CE and it rebooted, but condor's not working. `condor_status` returns a communication error stating that it cannot connect to 163.118.42.1:9618. It stopped because /var is 100% full. /var/lib/globus is 3.3G and is full of strange condor files that were created yesterday and the day before. Some are several megabytes while some are empty. The files seem to contain entries for submitted jobs. I'm going to move all of the "condor.*" files to ~/globusCondorJunk and see if that breaks anything. I fully restarted condor, and all seems to be well. If it turns out that the "condor.*" files are indeed useless, then I'll delete them.

04/10/2017 TAGS: mass deletion of users
Users are being deleted in 24 hours. I made a file called ~/userdellist.txt that has all the info in it. The programs at the bottom will stay for now; some of them are important.

04/11/2017 TAGS: node validation failure tmp full
OSG sent us a ticket a while ago (my email wasn't in the list, Ankit told me about it) saying that CMS and OSG glideins were failing node validation upon startup (https://ticket.opensciencegrid.org/32896). The CMS glideins are failing due to being unable to locate CMS software, and the OSG glideins are failing due to a full '/tmp'.
CMS Failing Nodes: compute-1-1 compute-1-3 compute-1-6 compute-2-1 compute-2-4 compute-2-5 compute-2-6 compute-2-7 compute-2-8
OSG Failing Nodes: compute-2-5 compute-2-6 compute-2-7 compute-2-8
The OSG Failing Nodes do, in fact, have a completely full primary partition, where '/tmp' is located.

cont. 04/12/2017 The problem was that '/scratch' was all filled up because it was the cvmfs cache. I moved the cvmfs cache from '/scratch' to '/var/cache/cvmfs' on all the nodes via a script ('~/Scripts/mvCvmfsCache.sh'); a rough sketch of what it does is below, after the 04/14 note.

cont. 04/14/2017 The other problem was the CMS failing nodes. The listed nodes contain the script `/var/lib/condor/execute/dir_/glide_/discover_CMSSW.sh`. NOTE: navigate to '/var/lib/condor/execute' then run `find . -name "discover_CMSSW.sh"` to locate the script. It hangs upon execution. The script just looks for other scripts and executes them. If it doesn't find what it's looking for, it's supposed to say so. The script, however, doesn't seem to do anything. The discover script is only on some of the nodes listed, and it's not on any that are not listed.
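As promised above, here's roughly what the cvmfs cache move amounts to on each node (this is a sketch of the idea, not a copy of '~/Scripts/mvCvmfsCache.sh'; the old cache path under '/scratch' and the cvmfs user/group are assumptions):
# add (or edit, if it's already set) the cache location in default.local
echo "CVMFS_CACHE_BASE=/var/cache/cvmfs" >> /etc/cvmfs/default.local
mkdir -p /var/cache/cvmfs
chown cvmfs:cvmfs /var/cache/cvmfs
cvmfs_config reload        # pick up the new cache location (a `cvmfs_config setup` / autofs restart may also be needed)
rm -rf /scratch/cvmfs      # assumed old cache path; this is what actually frees up /scratch
Run it on every node, e.g. via a loop over the node names from the CE.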
04/13/2017 TAGS: home directory clean
Cleared out the home directory for root so it's usable.

04/14/2017 TAGS: condor not running diagnostics passwords required ssh
The diagnostics page reports that condor is not running on any of the nodes. All of a sudden, I need to enter passwords to ssh from root. Huh, that's strange. Turns out condor's fine, but the monitoring scripts need to ssh into the nodes, which they can't do now because ssh-ing requires passwords for some reason. Riley moved some of the ssh files around when he was reorganizing the home directory, so the CE's ssh keys have been slightly scrambled.

cont. 04/17/2017 Ankit said to investigate ROCKS; it made the ssh keys. The ROCKS documentation said that host-based authentication is controlled by '/etc/ssh/shosts.equiv'; the IPs of the cluster parts are all there. I created a brand new ~/.ssh directory and filled it with a public and private key generated with
$ rocks create keys ~/.ssh/id_rsa > ~/.ssh/id_rsa.pub
The new key was placed in NAS-1 with
$ cat ~/.ssh/id_rsa.pub | nas1 "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
The new key was confirmed placed where it should be, but a password was still requested. Silly me, I didn't check id_rsa.pub for errors, of which there was one. I need to type the command correctly:
$ rocks create keys key=~/.ssh/id_rsa > ~/.ssh/id_rsa.pub
The key was created, and it was correctly put onto NAS-1, but it still doesn't work. Instead of using the rocks command to make the keys, I used the normal `ssh-keygen -t rsa` command, then sent the keys over with the normal command. For installing the new key on all of the nodes, I'm installing `sshpass`, which will allow for the automation of logging into all of the nodes. I added to the osg-node.sh:
cat ~/.ssh/id_rsa.pub | sshpass -p "" ssh -o StrictHostKeyChecking=no compute-fed-nad "mkdir -p ~/.ssh && cat > ~/.ssh/authorized_keys"
(be sure to comment out the normal ssh line!) A cleaned-up sketch of the loop is below, after the squid note. That worked for compute-2-*, but the passwords for compute-1-* are different. I will have to change them to the normal password.

cont. 04/18/2017 To change the root passwords of the other nodes, they must be powercycled and booted into single user mode. After the password has been changed, run `init 5` to resume normal operations. If the node hangs after `init 5`, powercycle it again and allow it to boot normally. I've changed compute-1-0 to compute-1-3 so far.

cont. 04/19/2017 All of the nodes, the SE, NAS-1, and NAS-0 all have the new keys.

04/19/2017 TAGS: gratia accounting osg website GRACC change no job count
OSG updated their grid monitoring software from Gratia to GRACC (GRAtia Compatible Collector). GRACC is compatible with all existing Gratia probes. It shows that we are amassing wall hours, but there is no data for the job count.

04/24/2017 TAGS: squid not running
Squid wasn't running. I checked its status with `squid -k check` and it told me that it couldn't find the cache directory. That's because it was moved during Riley's spring cleaning. I changed the squid directories in '/etc/squid/customize.sh' from "ufs /root/squidAccessLogDump/cache 20000 16 256" to "ufs /root/Cluster_System_Files/squidAccessLogDump/cache 20000 16 256".

cont. 04/26/2017 'customize.sh' will hang, but it does, in fact, edit the file properly after some time. Squid is good again.
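The key-installation loop from 04/17/2017 boils down to something like this (the node list and the password are placeholders, and the real osg-node.sh wasn't copied here; I also use `>>` instead of `>` so existing keys don't get clobbered):
for node in compute-2-0 compute-2-1 compute-2-2 compute-2-3 compute-2-4 compute-2-5 compute-2-6 compute-2-7 compute-2-8; do
    # push the CE's public key into the node's authorized_keys
    cat ~/.ssh/id_rsa.pub | sshpass -p "NODE_ROOT_PASSWORD" \
        ssh -o StrictHostKeyChecking=no root@$node \
        "mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys"
done
Once the compute-1-* passwords are all set to the same thing, the same loop works for them too.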
04/24/2017 TAGS: NAS0 diagnostics page
The NAS0 diagnostics page had been missing the top table for a while because a newline was missing at the end of /etc/cron.d/nas0chk. The newline was added, so it works now.

04/25/2017 TAGS: NAS1 yum update rpmforge gpg keys
NAS-1 has some trouble yum updating due to non-existent rpmforge gpg keys. I had some trouble finding the keys, and I had to install a security update, so I just turned off the check for the keys by editing '/etc/yum.repos.d/rpmforge.repo'. I've turned the check back on for now.

05/11/2017 TAGS: condor idle ce / bloated
The diagnostics page says that condor is idle on the CE and '/' is bloated with 'core.*' files. Clearly some shenanigans occurred when I updated OSG. I fully restarted condor, but when I tried to run `condor_status`, it said there was a communication error. After waiting a minute, it gives the regular list, but says everything is unclaimed. These "core" files seem to be generated whenever a job crashes. None of the configuration files in '/etc/condor' or '/etc/condor-ce' seem to have been modified by the update, although the directories have been touched. Perhaps files were deleted? Since OSG was updated, '/var/log/condor/MasterLog' reports that condor is unable to create a security session to the CE on port 9618 with TCP. Port 9618 is only listed in the log when it fails; its successful connections are never listed. That port number is listed in '/etc/condor-ce/config.d/50-osg-configure.conf' as the connection port for the 'JOB_ROUTER_SCHEDD2_POOL' variable. That file is said to be managed by 'osg-configure'.

05/15/2017 TAGS: NAS-1 almost full
NAS-1 is almost full, and Vallary needs to put stuff on it! I need to investigate directories: g4hep, backup_g4hep, general_g4hep. 'g4hep/MTSAtFIT' is a primary offender (14TB of the directory's 15TB); there are some large files in there. NOTE: `tree -ifhugD path/to/directory` is a very useful command for mapping the directory structure. I've made trees at '/mnt/nas1/g4hep/MTSAtFIT/tree.txt', '/mnt/nas1/backup_g4hep/tree.txt', and '/mnt/nas1/general_g4hep/treeTrim.txt'.

cont. 05/16/2017 Dr. Hohlmann has said I can safely delete anything with 'alignment' and 'empty' in their names. To see how much space will be freed from one of the three sections:
$ grep -iE 'alignment|empty' tree.txt | awk -F' ' '{print $3}' | grep G | sed 's/G//g' | paste -sd+ | bc

05/18/2017 TAGS: add group
I'm creating a new user group for Vallary and me: Analysis.

06/09/2017 TAGS: glideins down globus error
At the beginning of June, OSG said that our glideins were failing due to a globus error. When Daniel was helping me with Condor, we tried swapping my certificates out for his in '~/.globus', which probably caused the errors. I have replaced his cert with my CERN cert. I've updated OSG.

cont. 06/12/2017 Elizabeth said to copy 'hostcert.pem' and 'hostkey.pem' from '/etc/grid-security' to '~/.globus'. I have done that, and I've restarted GUMS. She's been updated.

cont. 06/15/2017 I misunderstood Elizabeth; she was just making sure the hostcerts weren't expired or otherwise wonked.

06/13/2017 TAGS: batteries UPS check compute-1-7 not working
The warning light on the APC UPS had been quickly turning red after the routine tests, so I whipped out the batteries and took a multimeter to them. The batteries are rated at 12V, and the multimeter measured just over 13V for each of them, so I put them back. When turning the cluster back on, though, compute-1-7 is having trouble mounting the NFS filesystems.
The little ethernet LEDs on the node are off, and the ethernet LED for the node on the router (port 15) is red. Once the node had booted up, the LEDs didn't change. It doesn't seem to have internet, either, which is to be expected.

cont. 06/14/2017 I found a manual for the 'HP ProCurve 2910al-24G' router. The blinking orange 'Fault' light means that "A fault has occurred on the switch, one of the switch ports, module in the rear of the switch, or the fan. The Status LED for the component with the fault will flash simultaneously." In this case, the LED for port 15 is blinking in synchronization with the 'Fault' LED. The 'Test' LED is also blinking along with the others, and it means that one of the components has failed its self-test, so port 15 failed its test. The manual recommends power-cycling the router, so I'll do that tomorrow morning.

cont. 06/15/2017 I turned the cluster off and power cycled the router (unplugged it), and it displayed no warning lights, so I turned the cluster back on, and all is well!

06/27/2017 TAGS: security update
Security update day! I'm yum updating everything and rebooting the cluster. Everything booted up properly!

07/11/2017 TAGS: nas0 drive not-present
Drive 15 in NAS-0 suddenly became labeled as "NOT-PRESENT". I removed the drive and put it back in, and the drive is now rebuilding.

cont. 07/12/2017 Drive 15 has returned to the "NOT-PRESENT" state again, so I'm gonna try replacing the drive. I've replaced the drive, and it's rebuilding.

cont. 07/13/2017 The new drive has experienced a SMART failure, so I'm gonna replace it with the other spare drive. I've started the rebuild.

cont. 07/14/2017 The new drive rebuilt successfully, and everything is good.

08/22/2017 TAGS: NAS-0 not working crash on boot
Everything is not good. During break, a catastrophic hardware calamity had befallen NAS-0. Two drives are dead, and the BBU (Battery Backup Unit) on the RAID card has failed. NAS-0 kernel panics on boot, a reported symptom of a failed BBU. The card itself seems fine, however, because its settings can be accessed during boot. New drives and a battery have been ordered. Another scary symptom of NAS-0's inoperability is the hanging of `df`.

cont. 08/23/2017 I searched the settings of the controller's BIOS for options to boot without the BBU. I found something that would ignore the RAID controller on boot, but then boot failed due to not finding an operating system, which is probably stored in the RAID. It might be a good idea to have the boot disk separate from the RAID in the future.

cont. 08/25/2017 While we wait for the new battery to arrive, I replaced the two failed drives and started the rebuild process from the controller's BIOS.

cont. 09/15/2017 The battery is here! We've installed it and are ready to turn NAS-0 on! But first, I'm shutting the entire cluster down so I can bring everything up in the proper order. Turns out the battery needs to charge first, so I'm gonna have to wait until Monday to do anything.

cont. 09/18/2017 NAS-0 still kernel panics on boot. *sigh* I tried booting from the CentOS 6.5 disc, but no dice; it looked like it booted, but it hung on a black screen with a mouse pointer. I also tried booting from the Rocks 5 disc, but when it couldn't find an IP address it wanted, it restarted and began the loop again. I started playing with GRUB; let's see where that goes.

cont. 09/19/2017 I tried the Rocks CD again (this time we have internet!), and it advanced to the next step! It's looking for a Rocks image and can't find one.
I'd assume that the image would be on the Rocks CD in the drive, but I guess not. None of the hard drives have an image hidden in them either, it seems. Although, Rocks was unable to retrieve a file from somewhere on NAS-0, so maybe that had something to do with it. I found some Rocks 6.1.1 Jumbo DVDs, and I threw one into NAS-0. It has a rescue mode that I've entered. Welp, when I turned NAS-0 on to play with the Jumbo DVD, drive 8 decided to disappear. When I restarted, drive 15 also disappeared. So now drives 8 and 15 are gone with drive 10 still in "rebuild" status. Also, when I try to choose the "Installation Method" for Rocks, it rejects the Rocks DVD already in the slot. It says the installation material isn't present on it. Which disc contains the proper info, then? Drive 15 suddenly reappeared! That's nice.

cont. 09/22/2017 I had replaced both drives 8 and 15 (which disappeared again after replacing drive 8), but it wouldn't let me add the new drives to the RAID group. Perhaps because it was already labeled as "REBUILDING". After the replacements had been made, I exited the controller BIOS to start booting. There was a CentOS 6.5 boot DVD in NAS-0. It didn't hang on a black screen this time; it booted into the live CD properly! I have some bad news: NAS-0 is dead. The 3ware BIOS manager (the RAID card's BIOS) reports the RAID array as "unusable". The 3ware documentation says that an "unusable" array is totally dead; it's suffered too many failures to be brought back. I'm asking Blueshark (Daniel Campos) to take a look at it anyway, though, in case there's some crazy nonsense we can do to resurrect it. Today is a dark day for the cluster. Daniel Campos said that our last hope is to try to image the broken disks and put their information on the good disks, then throw them back into the RAID.

cont. 09/25/2017 I tested the drives. The three that had any data on them are physically busted; they click and are not recognized by the computer at all. The data is lost. NAS-0 is no longer with us.

08/22/2017 TAGS: mount NAS-1 remotely on separate machine
Since no one can log onto the cluster with NAS-0 dead, we need to mount NAS-1 remotely to access it. First, the IP of the machine must be added to '/etc/exports' on NAS-1, then the changes must be saved with `exportfs -ra`. To mount it on a Mac:
$ sudo mount -o resvport 163.118.42.3:/nas1 /location/on/local/machine/

08/23/2017 TAGS: /var full
'/var' is full again. '/var/log/tomcat6/gums-service-cybersecurity.log*' were taking up 100M per file (of which there were five), and they only contained the same java error message repeated several times. I have removed the five old files and kept the latest log. '/var/log/maillog' (1.8G) is full of messages reporting that mail sent to NAS-0 has bounced; I've cleared the log.

08/25/2017 TAGS: nas1 NAS-1 failed drive replace
A drive failed on NAS-1 and we're gonna replace it. To view NAS-1's RAID, run `storcli /c0 show`. To remove the drive with storcli:
$ storcli /c0/e<EnclosureID>/s<SlotID> set offline
(*) the left-most column of `storcli /c0 show` lists the drive names in 'enclosureID:slotID' format
$ storcli /c0/e<EnclosureID>/s<SlotID> set missing
$ storcli /c0/e<EnclosureID>/s<SlotID> spindown
(*) spins down the drive and makes it safe for removal
The drive can now be safely removed. Once the new drive is in place, it should automatically start rebuilding. If the drive's status doesn't change to "Rbld", the rebuild can be manually started with `storcli /c0/e<EnclosureID>/s<SlotID> start rebuild` (and its progress checked with `show rebuild`).

08/28/2017 TAGS: nodes acting funny
The second group of nodes (2-0, ...) is acting kinda strange.
When I logged on, I saw the splash text that usually appears after the nodes are turned back on from a restart, and the diagnostics page shows that they have NAS-0 mounted and a 0 load average, while the other 10 nodes have super high load averages (~5000).

cont. 08/29/2017 Time to exorcise the nodes! The script that gathers data from the nodes is '/usr/local/bin/cn.sh', and it writes to '~/diagnostics/cn.json'. The script checks for a mounted file system by running `df -h /filesystem/mount/point/` and seeing if anything is returned. On the '1-' nodes, `df` just hangs like on the rest of the cluster. On the '2-' nodes, however, it returns the line with the mount point '/'. While that's not NAS-0, it's something, so the website reports a success. The load average is found with `cat /proc/loadavg`. That's not explaining why the load is so high, however. The load average is high because the diagnostic script runs `df`, which hangs on the '1-' nodes; several instances of a hung-up process are trying to run simultaneously. I've restarted the nodes, which will fix the problem; `df` will work fine. The '1-' nodes aren't ssh-able. I'll have to investigate that later. The '1-' nodes all tried to mount NAS-0 on boot, and they all failed to complete booting because they thought NAS-0 was a busy device. I'm gonna powercycle them to see if that'll work. They're good, now. Now all of the nodes have a low load average, and they all falsely report NAS-0 to be mounted.

09/01/2017 TAGS: NAS-1 RAID card
Today some strange nonsense happened. NAS-1 was telling me that its RAID card had suffered some catastrophic failure and was no longer operable. I powercycled NAS-1 because everything on NAS-1 hung. On boot, the RAID card would beep, and nothing would appear on the monitor. Everything on the CE also hung. Scary. I turned the whole cluster off and tested the APC UPS, which yelled at me, so I manually checked all of its batteries. After all of the batteries had passed inspection, I put them back in and turned everything back on. Everything, except NAS-0 of course, booted up just fine. I have no idea what caused the issue in the first place.

09/05/2017 TAGS: new hostcert
OSG emailed me saying that my hostcert is about to expire. The new hostcert and hostkey are obtained.

09/05/2017 TAGS: CE hung
The CE decided to hang; nothing could be performed on it. I restarted the cluster, and it's good, now.

09/14/2017 TAGS: UPS no power not turning on
When we plugged everything back in after the hurricane, the top Tripplite SmartPro UPS refused to accept power. No lights turned on indicating that it sees any kind of power at all. I tried plugging it into different outlets, but the bottom UPS accepted the outlets just fine. The model number of the Tripplite UPSs is "SMART5000RT3U".

cont. 09/15/2017 The power button of the busted UPS feels kinda wonky. It feels like there's not even a button behind the flexible plastic button cover; the plastic just gives with hardly any resistance, unlike the bottom UPS, which has a more solid-feeling button press. However, the button could just feel strange because it's not getting any power; the other button (the alarm button) won't even depress at all. I ripped the UPS's face off to investigate the buttons on the circuit board; they're both fine.

cont. 09/20/2017 I called Tripplite for assistance, and he told me to check the batteries. Just what I feared he'd say! Well, let's get them out of the rack and see what's up. The batteries are all destroyed.
They are all swollen, and there's corrosion everywhere. It's a repeat of 2 years ago! (Fun Fact: We replaced the batteries on 09/21/2015, almost EXACTLY 2 years ago!)

09/26/2017 TAGS: NAS-1 diagnostics strange
The RAID monitoring for NAS-1 on the diagnostics page is a bit wonked out. The script is having trouble when it tries to ssh into NAS-1; some of the drive entries show '/root/.bashrc' errors. Oh, when I tried to install root on NAS-1 earlier, I put some nonsense in its '.bashrc' that spits out errors whenever it's run. The scripts write down whatever was written to standard output, which, in this case, includes error messages for the first two lines. So the website is reading the first two error messages and displaying them. Whoops! Let's fix NAS-1's '.bashrc'. I commented out the broken root line; it's all good, now.

09/26/2017 TAGS: squid not running
The diagnostics page says that 'squid' isn't running. I tried to start it with `service frontier-squid start`, but it complained that '/home/squid' didn't exist. RIP; I guess it's dead until we can resurrect NAS-0.

09/26/2017 TAGS: NAS-0 redo
Welp, NAS-0's dead. But now we have an opportunity to redo its RAID configuration! What shall it be? I really wanted to do ZFS, because it's the best, but it's slowly turning out to not be viable. The hardware may not cooperate nicely with it, and we may need new hardware to connect all of the drives together in the absence of a RAID card. So, I think we're gonna have to stick to the card we've got. Unfortunately, since the card doesn't support RAID-60, we're gonna have to come up with a more creative solution (I wanna see if there are better options than just straight RAID-6).

09/27/2017 TAGS: rack rearrangement
Today, we're taking out the bottom Tripplite UPS to examine its batteries. We're also gonna take the UPSs completely out, put NAS-1 and the SE where the UPSs were, then put the UPSs, spread out, on the left rack.

cont. 10/04/2017 Alright, everything's done. The rearrangement went wonderfully. I even rewired everything! I'm going to make a document showing where I plugged everything in. The batteries also came in, and we installed those. They're charging themselves up and they're working great!

10/16/2017 TAGS: SE no ethernet
All the ethernet ports have their red lights on, so Imma restart everything to see if that does anything. I restarted everything, but the red persists. Huh.

cont. 10/17/2017 Well, we need ethernet to add NAS-0 back to the cluster, so this has got to be fixed. The four weirded-out parts (CE, SE, NAS-1, NAS-0) are all plugged into a group of four dual-personality ports. Maybe the dual-personality ports have the wrong personality? I tried plugging one of the devices into an adjacent, regular ethernet port on the router, but the light is still red. Although, NAS-0's light has mysteriously decided to turn green.

cont. 10/18/2017 Well, I've discovered some things today. It's looking like I'm gonna have to interface with the router's console to see what's up. To do that, though, I need the console cable, which is Ethernet-Serial (RJ-45 to DB-9 (female)). Of course, we don't have that cable, and I found supplies to maybe make one, but that for sure won't work, so I'm probably just gonna have to buy one. *sigh* more waiting...

cont. 10/23/2017 The cable came in early! Imma hook the router up to the CE and see if it'll work. Gotta get that VT-100 emulator up and running first, though. I got the emulator 'minicom'.
cont. 10/24/2017 minicom must have the following configuration:
A-Serial Device: /dev/ttyS0
B-Lockfile Location: /var/lock
C-Callin Program:
D-Callout Program:
E-Bps/Par/Bits: 9600 8N1
F-Hardware Flow Control: No
G-Software Flow Control: No

cont. 10/25/2017 (Yo, the output from the router looks really cool because you can see it written to the screen since it's serial!) Nothing works, though. The switch has been configured with the following important properties:
Default Gateway: 172.16.42.126 (what was already there)
Time Sync Method: SNTP (what was already there)
SNTP Mode: Unicast (what was already there)
Poll Interval: 720 (default)
Server Address: 163.118.171.4 (what was already there)
I have been experimenting with the 'IP Config' settings. Right now, it's set to:
IP Address: 163.118.42.126
Subnet Mask: 255.255.255.128
I've also tried setting it to 'disable', but to no avail.

cont. 10/30/2017 Summary thus far: The high GB/s connections are working fine; the CE, SE, and NAS-1 have internet no problem. The switch shows no error lights on itself, but the ethernet ports of all connected machines display a red LED indicating that the connection is dead. I've adjusted the dimensions of the console window: length: 64, width: 78. `show interfaces brief` displays the statuses of the ports, and it says nothing's wrong. `show interfaces display` reports that there is data running through all of the ports, almost 100M for each of the ethernet ports and between 1.5G and 2G for the high-speed ports, which are operational.

cont. 11/02/2017 Daniel Campos came by and took a look at the switch. He did a bunch of fancy stuff, and it turns out that it matters which ethernet port on the computers is used, and I used the wrong one. *siiiigh* I threw everything in the proper port, but I can't test it now because class. Hopefully it's good now!

cont. 11/06/2017 Ethernet's golden! Now we can play with NAS-0.

10/17/2017 TAGS: creating NAS0 NAS-0 RAID
The time has come to finally reconstruct NAS-0's RAID! We've opted to use RAID-10, which is a staggering improvement in security over the previous configuration (RAID-6), although we're taking a considerable hit to available space; only half of the drives' 12TB is usable. I have included all 16 drives in the array and configured it to heavily favor protection rather than performance. Ok, I'm super sketched out by this RAID card. It won't let me configure how I want RAID-10 done. I would like to make it into 2 groups of 8 drives each, so that the tolerance is a minimum of 4 drives (4 drives all from the same group). Unfortunately, this RAID card is lame af, so it automatically puts the drives into RAID-1 pairs that are all striped together. This only allows for a minimum tolerance of 1 drive; if both drives in a RAID-1 pair fail, the array dies. While this is among the lamest things I've seen, in 14/15 cases it's at least as safe as RAID-60 when 2 drives fail (once one drive is dead, the array only dies if the second failure happens to hit that drive's mirror, which is 1 of the 15 remaining drives), and it's infinitely safer when 3 fail. For that reason, I'm gonna stick with RAID-10 over doing RAID-6 again.

cont. 10/18/2017 Maybe ZFS is a viable option! When searching for a cable, I found a massive cache of RAM in the supply closet. There are several sticks of 2GB, 4GB, and 8GB. While we're waiting for the router console cable, I could play with ZFS on NAS-0, which could be interesting.

cont. 10/20/2017 NAS-0's motherboard is a Supermicro X7DB8. It can support up to 32GB of 667/533MHz DDR2 RAM in sizes of 512MB, 1GB, 2GB, and 4GB. We wouldn't be able to use all of the RAM, but a good bit of it is still available.
Another problem, though, is much more concerning. How will the drives be directly connected to the motherboard without a RAID card? I doubt there are enough slots on the board, so a SATA hub may be necessary.

cont. 11/07/2017 Since this card is actual trash (the only RAID-10 option is literally the worst possible configuration of RAID-10 (it only supports RAID-1 pairs connected in RAID-0)), we're gonna try to use it as a SATA hub for the drives to be run in ZFS. Can the card be configured to run the disks in JBOD? A'ight, so here's the thing. I need to dedicate at least one drive to house the OS, and I'd like that drive to be backed up; we're left with 14 drives, which is still plenty. There are a few good ZFS options we can do:
1) 2 striped RAIDZ2 vdevs (RAID60 with 2 groups of 7) - min: 2, max: 4, 7.5TB
2) 2 striped RAIDZ2 vdevs with 2 hot spares - min: 2 + ~2, max: 4 + ~2, 6TB; immediate replacement of 2 failures in quick succession (effectively 2 base tolerance with 2 extra tolerance per group)
3) 2 striped RAIDZ3 vdevs - min: 3, max: 6, 6TB
Imma try out option 2, just to see if it'll work out. First, I need to make a RAID10 array with 2 drives; this'll be the OS drive. With the small array made, I threw the ROCKS disc in, and it did some things. I formatted the array as ext4, and it installed a bunch of stuff. I whipped the disc out, restarted it, and it booted into CentOS! Unfortunately, though, it's asking for a password that doesn't exist. That's fine, though, because I can ssh into it just fine (nice!). It's yelling at me because the RSA keys are all messed up, but that's fine, I'll fix it later. NAS-0 has an OS again! Now the task is to make the other drives visible to the OS.

cont. 11/13/2017 A'ight, let's get ZFS installed on NAS-0!
INCORRECT MISSTEPS: First we must install some dependencies:
$ yum install kernel-devel zlib-devel libuuid-devel libblkid-devel libselinux-devel parted lsscsi
Actually, nevermind, the link this guide provides doesn't work; let's try a new one. Here are the dependencies for this guide:
$ yum install dkms gcc make kernel-devel perl
Everything was preinstalled except 'dkms' (Dynamic Kernel Module Support: without it, kernel updates could break software), which is a part of the RPMForge repository. Since NAS-0 is 64-bit, to install RPMForge: Nevermind, turns out RPMForge (aka RepoForge) is now deprecated, and big letters on the CentOS Wiki say to not use it. So forget that, Imma install EPEL:
$ wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
$ rpm -ivh epel-release-6-8.noarch.rpm
Except `yum repolist` shows no sign of EPEL. *sigh* Turns out the repo's gotta be turned on. 'enabled' in '/etc/yum.repos.d/epel.repo' needs to be set equal to '1' rather than '0'. Now EPEL shows up in `yum repolist`. Nice! Now dkms can be installed:
$ yum install dkms
The next instruction calls for installing 'spl' and 'zfs':
$ yum install spl zfs
Unfortunately, neither of these packages can be found.
CORRECT METHOD: Fortunately, ZFS can be installed a different way. First, the ZFS repo must be installed:
$ yum install http://download.zfsonlinux.org/epel/zfs-release.el6.noarch.rpm
Then, ZFS itself must be installed:
$ yum install kernel-devel zfs
ZFS is now installed! Hooray! Now we've gotta get those drives visible. An important thing we gotta do is get 'tw_cli' installed, the RAID monitoring software.
First, the ASL repo must be installed:
$ wget http://updates.aslab.com/asl/el/6/x86_64/asl-el-release-6-3.noarch.rpm
$ rpm -Uvh asl-el-release-6-3.noarch.rpm
Then the software needs to be installed:
$ yum install 3ware-3dm*
Now NAS-0 needs to be restarted. 'tw_cli' is installed and works great! I can see the unconfigured drives in 'tw_cli'; hopefully I can work with them. Looks like if I put all the other disks in their own separate units (putting them all in single-disk mode), they'll be visible to the OS. Let's try it! I can see all the drives! Now we can get ZFS up and running!

cont. 11/14/2017 I tried making the zpool, but it didn't like the 1TB replacement drive we threw in there, so I'm just gonna replace it with a normal 750GB. When I tried to remove the drive with 'tw_cli', though, it couldn't. That's because I was trying to remove the only drive in its unit, which it isn't happy with. I'm gonna have to delete the unit and remake it with the new drive. The zpool with option 2 was made:
$ zpool create nas0 raidz2 sdb sdc sdd sde sdf sdg raidz2 sdh sdi sdj sdk sdl sdm spare sdn sdo
Unfortunately, though, it only has 5.2TB of space, which is a bit less than the already expected low amount of 6TB. Imma try option 1, the most spacious one. It wouldn't let me destroy the zpool; it said it was busy. Even after unmounting it, it still complained, so I restarted NAS-0. It's still busy. I'm gonna try to see what's holding it open with `lsof | grep deleted`. Nothing is printed. `lsof` didn't list anything with "nas0", but there are a few processes related to "zfs". `zpool iostat` revealed that there is some IO going on in 'nas0' (also that there are 8.1TB free; suspicious, it's probably got something to do with parity and other ZFS data). Later, I'll try killing all of the ZFS processes.

cont. 11/20/2017 I just ran `zpool destroy nas0` and it seemed to have worked just fine. Huh, well, problem solved, I guess. I'm gonna try to make Option 1 and see how much space that one actually gives us. It only gave us 6.6T of the expected 7.5T. I reported my findings at the meeting, and we've opted to go for Option 2, the RAID-60EE equivalent.

cont. 11/27/2017 Let's make Option 2 and start the copy of the '/home' backup. '/nas0' is busy, so I'm gonna comment out 'nas0' in '/etc/mtab' so that it won't be mounted on restart. After much fandangling, turns out the best course of action is to just restart NAS-0, then `zfs unmount nas0` and `zpool destroy nas0` as quickly as possible, before any crazy processes can start acting on it. Now, I've gotta mount NAS-0 onto the CE so that data from NAS-1 can be sent over.

cont. 11/29/2017 Even though '/etc/fstab' contains an entry for NAS-0, 'mount' doesn't see '/nas0' available. There is a 'sharenfs' property on ZFS that allows ZFS volumes to be shared via NFS; it's set on /nas0. NFS is already good to go on NAS-0, but we've gotta add '/nas0' to '/etc/exports' so that NAS-0 knows to allow the CE to mount '/nas0'. I've added the following line to '/etc/exports':
/nas0 163.118.42.1(rw,sync,no_root_squash)
/nas0: the filesystem to be mounted
163.118.42.1: the high-speed ethernet connection on the CE
rw: allow read/write
sync: server confirms client requests only when the changes have been committed (safety)
no_root_squash: allows root to mount the filesystem
By default there was an entry in '/etc/exports' called '/export/data1'. It caused some problems, so I commented it out. I then ran `exportfs -ra`.
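On the CE side it's just a regular NFS mount; assuming an '/etc/fstab' entry along these lines (the exact options in our fstab weren't written down here, so treat these as placeholders):
nas-0-0.local:/nas0   /mnt/nas0   nfs   defaults   0 0
a plain `mount /mnt/nas0` should then pick it up.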
When I try a `mount /mnt/nas0` on the CE, I get the following error:
mount.nfs: access denied by server while mounting nas-0-0.local:/nas0
The error was because it doesn't like the IP for the CE I gave it; it prefers the LAN IP (10.1.1.1). '/nas0' is mounted fine, now. Now the data transfer can begin! I used the command:
$ rsync -av --append /mnt/nas1/nas0-bak-20160304/home/ /mnt/nas0/home/
I omitted the 'nohup' because it was giving me problems, and I wanted to manually monitor the progress (it took a couple days).

cont. 12/01/2017 Data transfer complete!
Good news: all of the data transferred over just fine.
Bad news: none of the file permissions were saved; I'm gonna have to fix that.
The permissions can be fixed by following the instructions from [10/31/2015]. The home directories also need to be mounted on '/home' rather than '/mnt/nas0/home'. So let's fix that mount point. Oh wait, hold on. Some of the home directories (mine, Ankit's, and a couple others) are already mounted on '/home' from '/mnt/nas0/home'. Looks like we're good! I'm able to log in remotely with an ssh key again! Hooray!!!

10/31/17 Riley TAGS: NAS-0, NAS-0 RAID 10, Batteries, NAS-0 RAID card model, ZFS info
There isn't any literature I can find in the admin log about doing a battery test. I'll look on the twiki, but as for now the project is at a standstill. For some reason the glorious Google (TM) only gives me things online about Microsoft (TM) clusters and UPS systems, so finding something won't be as easy as I initially thought. As for today, I'm ripping out NAS-0 and looking inside. I need to know the model of the RAID card for research, and how many ports it has. This info will be recorded here. I am seeing if it can be used as a hub for ZFS, and if it can, I'm planning on putting a bunch of RAM in it. For glory. Happy Halloween, my cluster friends.

10/31/17 cont. Found the things for the UPS. All the info we have as of right now is the location of the UPS documentation on the cluster: /etc/ups. Ryan has a couple of things from 2 years ago, but there isn't any existing code to check the batteries. I'm going to start working on a code to check the batteries. Moving on to the RAID card, the model is AMCC 9650SE-12ML. It currently goes for $430 on the market, even though it's some dated tech, which leads me to believe that if any RAID card from that era could be used as a hub, this is it. The only problem is everything online says it's possible to use a RAID card as the hub, but no one says how, because they unanimously say it's a terrible decision.

11/2/17 Riley TAGS: NAS-0, RAM, RAID card, Battery test
In order to use the NAS-0 RAID card as a hub for ZFS, we need a metric tonne of RAM. Luckily, the motherboard can support 16 RAM sticks, and the admin log does say that it can handle up to 4GB sticks of DDR2. The only problem is that the RAM in the motherboard isn't plain DDR, it's FB-DIMM. More research is needed to find out if there are any potential compatibility problems. Daniel Campos gave me some amazing resources for running APC diagnostics tests. I'm going to try and make the APC as schnazzie as possible. Hopefully the Tripplite battery tests won't be too much more difficult. The battery info can be found at /etc/ups. BATTERY LOCATION: /etc/ups

11/2/17 cont. TAGS: RAM, NAS-0
It seems that the RAM is an implied DDR2, even though it doesn't say anything about DDR on it. UPDATE: We (with the help of Daniel Campos) found a decent way to solve our issues.
NONE of the RAM fit into the motherboard, which is fine because we don't need it anymore. Daniel suggested we use JBOD to host ZFS, and it doesn't really need a lot of RAM.

11/27/2017 TAGS: CE hang
The CE hung again today, so I powercycled it, and now it's fixed. It took FOREVER to turn on, though. There were some mad NFS timeout times, so I'm gonna try to reduce that. I changed the timeouts in '/etc/auto.master' from 1200 to 500. Hopefully that'll fix the problem.

12/04/2017 TAGS: nas0 dashboard diagnostics page
The RAID health check for NAS-0 is all kinds of messed up because NAS-0 has crazy splash text on login. Let's fix it! It said that line 29 in '/etc/ssh/ssh_known_hosts' on the CE was the offending line. That's the line for the old NAS-0; it was trying, and failing, to match the new NAS-0's key with the old key the CE had. I just deleted that line, and it put the new key on the CE. All is now well!

12/04/2017 TAGS: NAS-0 no root login
Ankit recommended we disable root login for NAS-0, which is probably not a bad idea. I created a user "fakeroot" and put `su -` in its '.bashrc', so that the root password must be entered to gain access to NAS-0. I copied over the CE's ssh key, but it still didn't work. I changed the permissions for '~/.ssh' and '~/.ssh/authorized_keys' in 'fakeroot''s home directory on NAS-0, and I ran `restorecon -Rv ~/.ssh`, which resets the SELinux configuration to default. It works fine! I can log in to NAS-0 from the CE with RSA. I've also added 'fakeroot' to the sudoers group on NAS-0:
$ usermod -aG wheel fakeroot
For changes to take effect, log out and back in. I disabled ssh login for root on NAS-0 by setting 'PermitRootLogin' to 'no' in '/etc/ssh/sshd_config'. I made the root password required for any 'sudo' activity by adding 'Defaults rootpw' to '/etc/sudoers'.

12/19/2017 TAGS: NAS0 ZFS
I tried to work on the cluster remotely, only to find that my certificate wasn't working. Uh oh. Turns out ZFS didn't start up correctly on NAS-0, so '/nas0' wasn't mounted. I logged in as 'root' and tried a `zfs list`, but it just told me that no datasets were found. Maaaaaan. I'm gonna try unmounting NAS-0 from the CE, then restarting the thing. No dice. Imma try an update and restart. No dice x2. `zpool import` gave me data on the pool and told me a drive failed. The error message gave me this URL: http://zfsonlinux.org/msg/ZFS-8000-4J/ Turns out, since 'nas0' is an exported pool, it needs to be imported, which failed because it was degraded. It can still be manually imported, however, so that it can be worked on. *sigh* Turns out the issue is that THREE drives decided to fail IMMEDIATELY after I left. *sigh* Man, c'mon now. There's gotta be a reason why all this nonsense always happens. Why do the drives in NAS-0 fail so often? NAS-0's super important. Maybe it's just 'cause all the drives are super old. I mean, it is a bunch of 750GB drives, which is an outdated size anyway. That's probably it; they're just super old. I guess even the "new" drives we get would be old even if they've never been used. I don't even know how to fix that, though, short of replacing all the drives, but that's super expensive. *sigh* Who knows, man? Who knows? I haven't decided if I'm gonna run down there to replace the drives or not. Since it's still operational, and nothing new's been put on it, I'll probably just leave it.

01/04/2018 TAGS: Intel security
Intel done messed up their processors, and they are vulnerable.
I'm doing a 'yumUp' on the CE now, and will update the nodes when they're operational.

01/08/2018 TAGS: UPS beeping red
I've returned from Christmas break, and the APC UPS is beeping at me. It had been beeping a bit more often than usual before I left, so Imma take the batteries out and test them. EMT ended early, so now I have a full hour to play with batteries! Let's start by shutting everything down. A'ight, so the batteries are mostly fine, but one in the left tray is reading 11V instead of the regular 13V and the required 12V.

01/08/2018 TAGS: NAS0 nas0 drives failed
I've also got those three drives in NAS-0 down; one in a pool and both spares. How do I figure out which hard drives failed so that I replace the right ones? There aren't any helpful red lights. `zpool status -x` gave me the statuses of all the drives. It also told me that the failed drive in the pool was '/dev/sdb1'. The following command can be run to find the slot of the 'sdb' drive:
$ udevadm info --query=all --path=/block/sdb
In the 'DEVPATH' line of the output, we're looking for 'target0:0:2', which indicates that the drive is in the second slot. (sda is 0 and sdb jumps to 2 because sda is made up of two drives; it's the mirrored OS array managed by the RAID card.) To replace the drive, it must first be taken offline:
$ zpool offline nas0 <drive>
Now that the drive is offline, I'm gonna try to remove the drive in slot 2. With the drive removed, the status of the drive is still reported to be 'offline'. Now, I'm gonna insert the new drive. The new drive must now be brought online:
$ sudo zpool online nas0 15433276318644629044
(this step may be able to be skipped because it gave me a warning that said 'zpool replace' should just be used instead) I tried to use `zpool replace nas0 /dev/sdb`, and it told me that no such thing existed. Since it said that the failed drive used to be '/dev/sdb1', I tried using that. It told me that '/dev/sdb1' is already a part of 'nas0'. And it says it's FAULTED like before. Hmm... What's goin' on here? I even tried unmounting the whole pool with `zpool export nas0`, but it couldn't because the device is busy. I'm gonna try a full restart, then. Which works for me, since I have to check the APC batteries anyway. Unfortunately, I don't have enough time for that right now, so I'll have to try later.

cont. 01/09/2018 I'm messing around with it some more, and I'm gonna try to throw a different drive in to see whether my replacement drive was also bad. Interestingly enough, the zpool doesn't show that the drive has been removed, only that it continues to be "offline". "Offline" probably just includes "ejected". Hmm, I wonder what happens if I try to bring the "drive" back online. It told me that the drive was onlined, but remained in a "faulted state". Additionally, `zpool status -x` now hangs. Interesting! A'ight, so I took the phantom drive back offline, and I was able to run a `zpool status -x`. What does this thing think it's doing? It's resilvering two drives in the other RAIDZ2 pool for some reason. Por que? Well, now I'm scared to interrupt it, so Imma just let it sit for a bit and sort itself out.

cont. 01/10/2018 Daniel Campos came by and taught me some ZFS and general drive things. SCSI commands and numbers are useful. `dmesg` will tell me the name of a newly plugged-in drive, which is real nice. Also, I have to use the version of 'zpool replace' that takes two drives.
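In other words, something like this (the device names here are just examples, not what our pool actually reported):
$ zpool replace nas0 15433276318644629044 /dev/sdb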
The name of the old drive is just the long number that `zpool status -x` provides, and the name of the new drive is the name gotten from `dmesg` or the other commands. The directory '/dev/disk/by-path' is very interesting. It shows the physical locations of all of the drives. When I remove the drive from slot 2 again, though, the entry in '/dev/disk/by-path' doesn't change; it still thinks there's something in slot 2. It's also not counting all 16 drives when all are inserted; slot 15 isn't mentioned.

cont. 01/12/2018 Daniel came by again, and we discovered that the disk wasn't being seen because I had configured the drives in the RAID card to all be "single-disks" rather than JBOD. I did that because the documentation had said that "single-disk" was better than JBOD, but that was only in terms of fault tolerance; we've got ZFS taking care of that for us. Daniel reconfigured the drives to be JBOD and the card to automatically export unconfigured drives as JBOD. So now we've got just a bunch of disks for ZFS to play with! First, though, I'm going to test whether we have the ability to hot swap now; we couldn't do that in "single-disk". Turns out ZFS just kind of auto-fixed itself, which is nice. It says all the drives are fine, and the zpool has been restored. When I tried to whip out a drive suddenly, though, to see what would happen, nothing did. `dmesg` reported that the drive had been removed from the slot, but `zpool status nas0` hasn't changed. I'm going to solve this problem first, before trying to test hot-swappability. Why won't it mount? `zfs list` displays 'nas0' just fine, so it's not like it's invisible or anything. OK, so the mount point for 'nas0' is '/nas0', which, as `mount /nas0` reports, doesn't show up in '/etc/fstab' or '/etc/mtab'. Which, turns out, shouldn't matter, since there never were entries for it. Which I guess makes sense, since I suppose ZFS takes care of all that nonsense with `zfs mount`. Oh, I'm just dumb; `zfs mount` isn't going to do anything without a zpool to mount. The correct command is:
$ zfs mount nas0
Whoops! Now let's try to whip a drive out. The RAID card is totally cool with JBOD! Taking drives in and out is no problem at all. Excellent! Now let's put all the data back onto NAS-0! I started the data transfer with the command from last time:
$ rsync -av --append /mnt/nas1/nas0-bak-20160304/home/ /mnt/nas0/home/
I'm anticipating the same initial problems as before, but the solutions to those are documented, so we're good.

cont. 01/15/2018 The transfer hung, so I've stopped it, and am gonna try restarting it. It looked like it hung because NAS-0 weirded out; it's full of "rejecting I/O to offline device" errors. It's not letting me log in, though, so I'm thinking I'll just have to restart NAS-0. When I tried to restart NAS-0, the 'shutdown' command gave me an I/O error. Apparently, this means the drive is having mad issues. It looks like I found an alternate restart method, though:
$ echo 1 > /proc/sys/kernel/sysrq
$ echo b > /proc/sysrq-trigger
This will tell the computer to restart, but if the RAID card fails to initialize, the machine must be powercycled. Now let's try to resume the transfer.

cont. 01/16/2018 Transfer's still going...

cont. 01/18/2018 It looked like the transfer had finished, but some errors were reported. I'm going to run the command again to make sure everything's actually over.

cont. 01/19/2018 Transfer's still going...

cont. 01/20/2018 Transfer's still going... (It's on Vallary now, though, so it's almost done!)
cont. 01/21/2018
Transfer's done with no errors to speak of! Now to prepare everything like I did before. Since I've already copied over the required files from before, all the permissions are already good to go. Nice! Now to get my ssh key back up and running. Everything's good on the logging-in front! Now to pick up from where I left off...

01/08/2018 TAGS: nas1 no video out
NAS-1 won't give me any video output. When I turn it on, it just beeps at me, and that's it.

cont. 01/10/2018
Riley and I whipped NAS-1 out (watch your fingers!) to get the model of the motherboard: "AMIBIOS 786Q 2000 American Megatrends". When I turned NAS-1 on to hear the beeping, it just turned on normally. I guess it just needed to be unplugged and plugged back in. OK.

01/09/2018 TAGS: UPS software configuration
Riley: Starting to work on the UPS software again, picking up from where I left off. Ryan and I are going to have weekly meetings; hopefully this issue will be done in about a week. Basically, the software is there, it just needs to be configured.

01/19/2018 TAGS: drive LED red nas1 NAS1
One of the drives on NAS-1 has a red LED! I logged into NAS-1 to see what was up with
$ storcli /c0 show
and it didn't say anything was amiss; it thinks everything's green. Strange. Maybe it just needs a reboot or something, but it's still transferring data, so I'm gonna have to wait on that.

cont. 01/20/2018
The red light's turned off, so I guess we're good.

01/21/2018 TAGS: nodes no power
Welp, I went to go turn on the nodes so that I could resume working on condor, but they won't get power. I turned on the top ten, since they already had their lights on, but the bottom ten didn't even have their little power lights in the back illuminated. Huh. I tried turning the UPS that powers the bottom nodes off and on again, but that only served to kill the power to the top set of nodes. "What?", you may be asking yourself. I, too, am asking myself that same question. Why would the bottom UPS, the one exclusively dedicated to powering the bottom set of nodes, kill the power to the top set of nodes? It's truly a mystery, fo' sho'. Well, I tried the same thing with the top UPS to no change. Now I have two sets of nodes without the slightest inkling of power even though they are both plugged into fully powered UPSs. *sigh* Alright, it looks like the power strips aren't being powered for some reason. I've plugged the desk lamp into one of them so I'll know if it miraculously begins to work again. I plugged one of the power strips into one of the big power strips on the ground, and it worked fine! I guess the strips are fine; the UPSs just aren't supplying their power correctly. They're supplying power to everything else that's redundantly plugged in, though; both lights are green on everything else. Turns out the output breakers for the row of outlets into which the node power strips are plugged keep getting popped. Everything else turns on like before, though. (Side note: the screen for the CE has changed; the picture is much brighter for some reason. Huh. Spooky.) How to troubleshoot breakers, I have no clue. In the meantime, I have only five nodes on so that I can still try to fix condor.

cont. 01/23/2018
I talked to Daniel about the problem. He says to see what the UPSs think their load is; I can plug my computer into them and investigate using their software. If that doesn't yell at me, he says to see what the UPS thinks its power draw is.
If it's tripping at a lower point than it's supposed to, the breakers are going bad, but if the load has gotten too high, then something else has changed. My computer's having a hard time finding the UPS that's plugged into it. I plugged the UPS into the CE, and I'm gonna try accessing it that way. I found it at '/dev/usb/hiddev0', but with no way to access it. `lsusb` also shows that the UPS is detected.

>>>>> IMPORTANT
01/29/2018 (Riley): Daniel has been helping out. I haven't been posting here, but he's helping a lot. He keeps having to call out, but I'm pretty sure that next time we meet we're going to get everything done with the Tripp Lite software. He said the APC is easier and more usable, so I think that means once the Tripp Lite is done, we're maybe a day from being finished. After everything is done, I need to write the output to a cron job, which I can do on my own, but we'll see. I have to report that I've been lost for about three months, and now that I have help it's just a matter of meeting with Daniel. I don't know how anyone is supposed to learn this on their own; Daniel has been doing this for what he says is about ten years, and even he is having issues. Well, at least he has some semblance of an idea of what to do. Even when he tries to help, there just doesn't seem to be any rhyme or reason to how this nonsense works. I vote we ditch Tripp Lite entirely and use only the APC software/hardware. Hopefully next time I can report actual headway on this issue, which I've been dealing with for literally four months with no real progress. Once this is taken care of, though, we still have:
- the website
- NAS-0
- NAS-1
- the nodes
- making the SE run an actual OS
- making sure there are no bugs
- replacing the file-management system BeStMan with Hadoop (somehow)
- doing a bunch of yum updates and yum provides for whatever needs it
- hoping the CE isn't totally destroyed
- hoping we don't need to replace the batteries again before this is over
Also, a side note: the batteries are hooked up using red/black alligator clips and copper wire. How are we going to get an individual battery report? If we want that, we need to completely tear the units down and start over. Not to mention the current "storage solution" is the sole reason the batteries last an eighth of their lifetime and need to be forcibly removed. Maybe a change to the battery storage isn't a bad idea. Also: the website. Riley out.

01/31/2018 TAGS: Daniel, UPS
Today Daniel and I sat down to work on the Tripp Lite. We need to upgrade to CentOS 7: we can't use the proprietary software with CentOS 6, and upgrading will fix just about every issue. For the things that need to stay on CentOS 6, we can do some weird 'hmount' thing that keeps them in 6 while all the real software is on 7. Also, I think we hit something, because the entire cluster started screaming. Daniel eventually figured it out, but we need the APC to be on. It started screaming right before I did anything, which was weird timing; I may have knocked a wire loose. Ryan hasn't put anything in the log for a while, so I guess he hasn't made any progress recently. RIP Ryan?

cont. Today I am teaching Sam how to use bash. Hopefully she'll stop being a scrub and become a sysadmin.
"Are you adding this to the official log?" - Samantha Worjlsthaer, 2018
"I got it" - Sam, 2018
"No, don't put that where anyone else can see it!"
- still Sam

02/03/2018 TAGS: nas0 drive failure
A drive has failed in NAS-0. While this would normally be bad news, it served as an excellent test to see if everything works. Once the drive failed, the hot spare immediately took over with no problem, so it's all working great! I still gotta change out the drive, though, so that's what I'm doing today. It says that 'sdi' is the one that failed, so I ran the following command to find 'sdi's physical port number:
$ udevadm info --query=all --path=/block/sdi
It looks like it's in slot 8. I brought the drive offline with:
$ sudo zpool offline nas0 sdi
and whipped drive 8 out of the array. I've inserted the new drive and run:
$ sudo zpool clear nas0 sdi
It now says it's repairing both 'sdi' and the spare 'sdn'. Another drive, 'sdh', is now 'faulted', though. I'm going to wait for the repairing to complete before messing with 'sdh'. I was going to replace the battery in the APC UPS today, but I don't wanna turn off NAS-0 in the middle of this repair, so I'll save that for tomorrow.

cont. 02/05/2018
I've replaced 'sdh', but now 'sdg' has faulted. I've been extracting drives from the incorrect slots: the `udevadm` command seems to count the first two OS drives as one drive, so I've effectively been working my way up the array pulling the wrong drives. I'm gonna wait for 'sdh' to get itself fixed up before I play with 'sdg'.

cont. 02/13/2018
Alright, 'sdi' is degraded, and what used to be 'sdg' is unavailable. Let's get 'sdg' situated first. The 'udevadm' command listed 'sdg's position as slot 6, so I'm going to remove the drive in slot 7, because that's the actual slot when taking 'sda' (two physical drives for one logical drive) into account. Sliding the drive already in the slot back in, since that drive was a replacement anyway, and running `zpool replace nas0 /dev/sdg` did the trick! The drive is now being resilvered. 'sdi' is also being resilvered, so I'm going to wait until it's done doing what it's doing before I fix that one next.

cont. 02/16/2018
Time to replace 'sdi'. I've offlined it and replaced the drive. 'replace' doesn't work, though; it gives me a "cannot label" error. Huh, maybe I just accidentally threw in a bad drive. I've tried a different drive with the same result. When I try
$ sudo zpool replace nas0 /dev/sdi1
instead of
$ sudo zpool replace nas0 /dev/sdi
I get a "one or more devices is currently unavailable" error. Hmm. I glanced at the history to remind myself of how I did the previous drive, but it's just a 'zpool replace'. Man, what's up?

cont. 03/02/2018
Now the drives are all good, so Imma throw it in.

02/05/2018 TAGS: APC UPS battery low
One of the batteries in the APC UPS measured 10V instead of the rated 12V and the regularly reported 13V. We ordered new batteries, and I'm gonna throw in the new one while checking the other batteries. I've replaced the low one with the new one, and everything seems to be fine.

2018/02/11 (Daniel C.)
Fixed the condor scheduler: /etc/condor/config.d/00personal_condor.config had CONDOR_HOST set to a local address instead of FULL_HOSTNAME. Needs further investigation: /etc/hosts defines the listening IP (10.1.1.1) as uscms1.local. The preferred solution is to make 10.1.1.1 resolve to uscms1.fltech-grid3.fit.edu. The current solution is to add exceptions to /etc/condor/config.d/00personal_condor.config and add 10.1.1.1 to COLLECTOR_HOST and ALLOW_NEGOTIATOR. Investigated HTCondor-CE authentication issues and determined that only LCMAPS VOMS is supported for OSG 3.4; that may require some new setup.
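For reference, that workaround presumably looks something like the following in '/etc/condor/config.d/00personal_condor.config' (a sketch only; the file's exact existing values are assumed, not copied):
CONDOR_HOST = $(FULL_HOSTNAME)
COLLECTOR_HOST = $(CONDOR_HOST), 10.1.1.1
ALLOW_NEGOTIATOR = $(ALLOW_NEGOTIATOR), 10.1.1.1
followed by a `condor_reconfig` so the running daemons pick it up.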
2018/02/12 (Daniel C.)
Fixed dashboardNAS0.php disk updates. dashboardNAS0.php reads from nas0check.txt in /var/www/html/diagnostics. That file is updated by a cron job, nas0chk, which runs /usr/local/bin/nas0check.sh. This script was not working and needed tweaking. Instead of ssh'ing to nas-0-0 as root, it needed to connect as fakeroot, and the ssh shell only needed to run tw_cli, not awk as well, so the awk was moved client side. nas-0-0 had a sudoers.d file added (sudoers.d/tw_cli-nas0check) with the following contents:
fakeroot ALL=(ALL) NOPASSWD: /usr/sbin/tw_cli /c0 show
For whatever reason, fakeroot is not root. (I mean, no duh, but I don't understand why it exists.) fakeroot is given sudo access to run 'tw_cli /c0 show', and that command only, with no password. The script now works and is reporting correctly.

02/13/2018 TAGS: nas1 drive failure
Drive 41:2 (physical slot 1:2) on NAS-1 has failed. I've run the following commands to replace it:
$ storcli /c0/e41/s2 set offline
$ storcli /c0/e41/s2 set missing
$ storcli /c0/e41/s2 spindown
Once the new drive is in place, it should automatically start to rebuild. The status of the rebuild can be checked with:
$ storcli /c0/e41/s2 show rebuild
If it does not automatically begin, the rebuild can be manually started with:
$ storcli /c0/e41/s2 start rebuild

02/16/2018 - Riley TAGS: Fail2Ban, OSG User Accounts, CERN User Accounts
Installed fail2ban on the SE, and will install it on NAS-1. The SE has a basic configuration; NAS-1 will have some fancier stuff on it. I'm putting off configuring the batteries until I have time to redo the process for CentOS 6, or until I just make a CentOS 7 chroot. I am also making user accounts for myself (Riley) for CERN and OSG. I think we are actually getting close to having a functioning cluster, or at least closer. The battery check is not necessary, and I think I'm just going to leave it as something to do later; realistically it's just nice to have, and there is no reason to do it now. There are much more pressing issues.

02/19/2018 TAGS: APC UPS tripping off
[import from offline adminlog]

02/20/2018 TAGS: NAS0 mount incorrectly
When turning everything back on, it seems that the CE mounted the wrong part of NAS-0; only NAS-0's OS drive was mounted rather than the storage pool. `zpool status -x` says that no pools are found, which is worrisome because it's hooked up to nas0. I guess I'll just have to come back later and restart the whole thing again to see if that'll fix it.

cont. 02/23/2018
Riley must have restarted the cluster, because nas0 is back online!

02/21/2018 TAGS: Fail2Ban
Strange issue with Fail2Ban: it looks like the Debian flavor. I used `yum install fail2ban` to get the files, but somehow they may not be the Red Hat files the cluster needs. It doesn't seem to be working yet, because somebody from China tried to log in about 50 times over 5 days; the current configuration shouldn't allow more than 25 login attempts over the course of 5 days if they're just spamming. Either that, or they're sitting in a jail somewhere and are still allowed to ping for some reason. I don't know, man.

02/23/2018 TAGS: Fail2Ban, Drives
Fail2Ban is now complete. I have become the official ban hammer of the cluster. Daniel showed me some dank commands to run to check disks, which pretty much eliminates the need to rip disks out. He said this is what normal people do, and I'm upset that I've never even seen it before.
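The exact commands aren't written down here yet, but my guess is something from smartmontools along these lines (a sketch, not necessarily what Daniel ran):
$ sudo smartctl -H /dev/sdX    # quick SMART health verdict for a disk
$ sudo smartctl -a /dev/sdX    # full SMART attributes (reallocated/pending sectors, error logs, etc.)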
Anyway, rest in peace, Ryan. I hope I can get the actual command into the adminlog when I'm feeling less lazy.
TODO: install postfix and configure it to use fit.edu as the relay for Fail2Ban notifications (rough sketch of the relay setting at the end of this section).

02/23/2018 TAGS: APC UPS battery light red
The battery light on the APC UPS is red again. Imma turn everything off, then run the UPS's self-test. The test was fine, and I've turned everything back on.

02/27/2018 TAGS: Certs
I'm looking into how to make certs for CERN and OSG so there will be 2 SysAdmins with ProCerts. BanHammer out.
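Re: the postfix TODO above, the relay piece is probably just the 'relayhost' setting in /etc/postfix/main.cf (a sketch; the actual FIT relay hostname is an assumption and still needs confirming):
relayhost = [fit.edu]    # placeholder; swap in the real FIT mail relay host
$ sudo service postfix restart    # restart postfix so the change takes effect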