Server and Compute Resource Status - 9/1/09

To see this page, click the STATUS tab at the top of the page.

 


03:44pm 08/17/2015 BIGZFS1 STILL DOWN, REPLACED BY RC-DATA1.  After some consideration, we decided to bring up all user accounts from BIGZFS1 on what was the mirror server, RC-STOR1-R. This server has 10G and Infiniband connections.  All account and /home directory information has been transferred, as well as Unix passwords.  Samba passwords cannot be transferred because of differences between how Solaris and Linux implement Samba -- new passwords are needed.  We are checking for NFS mounts and will use the same paths in order not to break existing links, etc.  The mailing list of BIGZFS users has been notified of the ongoing changes.  At this point, the data appears ok.  It seems probable that the /data2 and /data3 pools on BIGZFS1 are ok, and that /data1 is not.  We will continue on RC-DATA1 and determine the best way to repurpose BIGZFS1.
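For reference only (not part of the original notice): a minimal Python sketch of how a Linux client could list the NFS mounts that still point at the old server, so the same mount points can be recreated against the replacement. The hostname substrings are assumptions about how the server appears in /proc/mounts entries.

    #!/usr/bin/env python3
    # Sketch only: list NFS mounts that reference the retired server so the
    # same paths can be reproduced on the replacement server.
    # The hostname substrings below are assumptions, not confirmed values.
    OLD_SERVERS = ("bigzfs1", "rc-stor1-r")

    def stale_nfs_mounts(path="/proc/mounts"):
        """Yield (source, mountpoint) for NFS entries mentioning the old server."""
        with open(path) as fh:
            for line in fh:
                source, mountpoint, fstype = line.split()[:3]
                if fstype.startswith("nfs") and any(s in source.lower() for s in OLD_SERVERS):
                    yield source, mountpoint

    if __name__ == "__main__":
        for source, mountpoint in stale_nfs_mounts():
            print(f"{source} mounted at {mountpoint}")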


08:00am 08/15/2015 BIGZFS1 SERIOUS ISSUES ON DATA1 POOL.  After a reboot to bring back two shelves that seemed to be down because of a drive controller, the server returned to normal.  The shelves went offline again on the morning of 8/14, and the console port of BIGZFS1 was unavailable.  We powered BIGZFS1 down and switched the drive controller, after which BIGZFS1 did not reboot.  A decision was made to bring up an alternate server head.  It came up with several drives missing, and the "resilvering" process, which repairs drives, is hung.  The plan is to get the system drives from the old BIGZFS1, recover user account information, and move accounts to RC-STOR1-R, which is the mirror of BIGZFS1.  Users will be notified.  -- mht heading to the data center to insert the old BIGZFS1 system drives into the new BIGZFS1 (Sat, 8:15)
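As an aside (not from the original entry): a minimal Python sketch of how the resilver could be watched, assuming the `zpool` command is available on the host and that its status output contains the phrase "resilver in progress" while a resilver runs; exact wording varies by ZFS release.

    #!/usr/bin/env python3
    # Sketch only: report whether a resilver appears to be running on a pool.
    # Assumes `zpool` is on PATH; the default pool name comes from the entry above.
    import subprocess
    import sys

    def resilver_running(pool: str) -> bool:
        out = subprocess.run(["zpool", "status", pool],
                             capture_output=True, text=True, check=True).stdout
        return "resilver in progress" in out

    if __name__ == "__main__":
        pool = sys.argv[1] if len(sys.argv) > 1 else "data1"
        print(f"{pool}: resilver running" if resilver_running(pool)
              else f"{pool}: no resilver in progress")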


09:29am 12/10/2014 RESSRV14 moved to a new chassis.  The motherboard of RESSRV14 failed when power was restored after the partial power outage at the 20 Overland data center on Monday.  As luck would have it, we had another chassis to which we could move the drives, and a spare Areca RAID card that could host the RAID volumes.  The drives and boot drives were moved to the new chassis, the file systems were checked, and the server acquired a new address (172.24.220.241).  The system was back online at about 10:00pm on 12/09/2014.  Any systems that hard-code the address of RESSRV14 should change the entry to the new address.
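For anyone auditing hard-coded entries, here is a minimal Python sketch (not from the original note) that checks whether a host resolves RESSRV14 to the new address. The short hostname "ressrv14" is an assumption about local naming; the address is the one quoted above.

    #!/usr/bin/env python3
    # Sketch only: warn if RESSRV14 still resolves to an address other than the
    # one announced above. The hostname is an assumption; use your site's FQDN.
    import socket

    NEW_ADDRESS = "172.24.220.241"
    HOSTNAME = "ressrv14"

    try:
        resolved = socket.gethostbyname(HOSTNAME)
    except socket.gaierror as err:
        print(f"Could not resolve {HOSTNAME}: {err}")
    else:
        if resolved == NEW_ADDRESS:
            print(f"{HOSTNAME} resolves to the new address {resolved}")
        else:
            print(f"WARNING: {HOSTNAME} resolves to {resolved}, expected {NEW_ADDRESS}")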


08:26am 7/31/2014 Bigzfs (Aberdeen) shelf replacement.  We have replaced shelf S1 and recabled to avoid daisy-chaining of shelves S1/S2.  This restored the 90-drive set.  Data from data2 has been copied to data3.  The plan is to remove the data2 pool and rebuild it.


2:30pm 4/9/11 External name resolution issues fixed.




10:14am 4/9/11 External connections to DFCI fail with NAME NOT FOUND.  There is a problem with name translations for all DFCI users connecting externally.  Internal name translations are working properly.  Network Engineering and InfoSec are aware of the problem and are working on it, but I have no ETA.  Right now, from outside, I ran a test on a server that does internal name lookup -- it results in a DNS server failure notification:

Retrieving DNS records for research.dfci.harvard.edu...Attempt to get a DNS server for research.dfci.harvard.edu failed: research.dfci.harvard.edu The query returned a server failure
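A rough stand-in for that test, in case anyone wants to repeat it from another machine: a minimal Python sketch using only the standard library; a server failure or missing record surfaces as socket.gaierror.

    #!/usr/bin/env python3
    # Sketch only: try to resolve the externally failing name and report the result.
    import socket

    HOST = "research.dfci.harvard.edu"

    try:
        addresses = sorted({info[4][0] for info in socket.getaddrinfo(HOST, None)})
        print(f"{HOST} resolves to: {', '.join(addresses)}")
    except socket.gaierror as err:
        print(f"Lookup for {HOST} failed: {err}")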

16:07 3/2/10 CAMS and DFCI-Online External Login Problems: Right now, users of the CAMS freezer monitoring systems and users of DFCIOnline cannot log in through HTTPS:// connections. The only available workaround is a connection over VPN. DFCIOnline got a new server, and the software underlying the authentication in CAMS was changed; both changes happened at the end of last week.

16:00 3/1/10 DFHCC Web Server Erratic -- to be replaced: The power supply in the DF/HCC Web server is failing. A new server has arrived and needs to be put in the rack. We will restore the site to the new server. Date/time not yet set.

10:47 am, 7/20 Events over the last few days:
1. This morning, the BISCOM/BDS Secure File Transfer Server was not authenticating users with Partners Active Directory usernames and passwords. Restarting the service on the intermediary computer restored the lookups.
2. In the last two weeks, our new server RESSRV18 appeared to go off-line three times. In two cases, it came back on line with no intervention. The cause was found to be a bad port on a network switch. Moving the patch cable to a new port has fixed the problem, we believe.
3. RCBIG1 was not pingable for a half hour last night, presumably as a result of Network Engineering activities.


12:15 pm, 4/28 New backup servers: Our two new backup servers have arrived. One will replace our main server, which is quite old. The other will drive our 3rd tape library and be used for backups of large partitions. This should also speed full backups, which can now be distributed across three machines. This will also allow us to upgrade Networker from 7.2.1 to 7.4, so we can back up computers in the VDMZ.

12:13 pm 4/28 RCBIG /home2 full: RCBIG, the last machine scheduled to be upgraded, has a full /home2 partition. RCBIG will be upgraded either tonight, early tomorrow morning, or tomorrow night. The older RAID cards will be removed and replaced with cards that can use Terabyte-sized drives, and new drive cables will be installed.

8:24 am, 4/25 Exercising RESSRV16: For the last few days, I've been moving partitions around, reformatting the original partition, then copying back to the reformatted partition. All the partitions have now been moved -- about 5 Tb of data have been copied, recopied, then reduced to a single version. No RAID errors occurred during this process. I conclude that the new controllers fixed the problem and that we can make RESSRV16 ready for use again by our community.

10:25 am, 4/12 Full backups completed: Full backups done. We are awaiting the arrival, early this week, of two new backup servers, one to replace our primary server (now 8 years old) and one to go with our third backup tape library. We can then upgrade our backup software and move our outward-facing servers into the VDMZ, as requested by Partners InfoSec.

6:14 pm, 4/11 Full backups continuing: Down to the last partition. Expect completion tonight, followed by an incremental backup to capture this week's new files. RCBIG1 finished with no errors.

8:09 am, 4/10 Full backups continuing: No errors. Down to a few partitions on RESSRV12, RCBIG1, and RESSRV14. RCBIG1 is late because it hung during the backup last weekend and restarted from the beginning. Note that RCBIG1 has been backing up successfully during this period without hanging again -- a lot of activity.

3:40 pm, 4/9 Full backups continuing: No errors. Only a few machines (but with large partitions) to complete. In progress.

11:51 am, 4/8 Full backups continuing: No errors.

11:55 am, 4/7 Ressrv13: Ressrv13 continues to have problems. Trying to assess whether the cause is memory, a disk drive, or the motherboard.

9:55 am, 4/7 Full backups continuing: Looks like this will finish near the end of the week. Keep local copies of your work, please.

9:54 am, 4/7 Ressrv13: Ressrv13 hung. No sign of why. Upgrading to FC8 out of desperation right now, to replace the s/w and see if that's the issue.

1:30 pm, 4/6 Full backups continue: This is a major exercise for the tape backup servers. Primary server software to be updated next week; primary server to be replaced the week after.

6:30 pm, 4/5 Rebooted RCBIG1: RCBIG1 hung at 4:00 am Saturday morning. No problems in the log file. Have started hourly monitoring, which will report to us if there are problems or if the machine is down. Clean reboot, but the automatic restart of Samba was missed. Fixed.

5:44 pm, 4/3 Full backups restarting: TAPES! UGH! Had to restart.

10:29 am, 4/3 Full Backups in Progress: Full backups have been in progress for about 18 hours. This is a long process and will go on for a few days. During this time there will be no incremental backups. Please keep a local copy of your work, as well as a copy on any server you can, during this time. Any backups that fail will be rerun after this full backup cycle completes. During this repeat time, normal incremental backups will be running.
11:00 am, 4/1 RCBIG1: Somehow, while examining RCBIG to get ready to replace its RAID controllers, RCBIG1 got nudged and went off line for a yet-undetermined reason. Rebooted with no errors.

11:00 pm 3/30 RESSRV13: RESSRV13 hung for reasons unknown, for the 2nd time in a month. No evidence in the log file. Rebooted with no errors.

10:00 am, 3/27 VectorNTI: Licenses are in place on the license server. Instructions posted to the web. MUST LOG IN TO SEE the instructions.

5:00 pm, 3/26 RESSRV16!!: New RAID card arrived. Replaced the bad card with the new card. Looked ok. Realized the card pulled from the server /was/ the good card, so replaced the old bad card with the old good card. Rebooted. Server only sees 1 controller instead of 2. Pulling hair out. Calling vendor.

11:20 pm, 3/24 Genome bites the dust: One of our oldest computers bit the dust over the weekend -- genome.dfci.harvard.edu. Once it was the home of a group of SAGE analysis web pages developed by Li Cai. Recently, it has been hosting a number of instances of PhpSchedulit. Those have all been moved to Research4, our main Web server. The name "genome" is now an alias to research4.

9:45 am, 3/21 RCBIG: All directories (with the exception of CJG9) restored to RCBIG.

9:45 am, 3/19 RCBIG: Restore of 265 Gb (of 376) to RCBIG2 in progress. After this completes, we'll restore the other directories. I plan now to offload /all/ users from RCBIG to other servers, then replace the controllers and drives on the rest of RCBIG. Stay tuned.

8:28 pm, 3/18 RCBIG: RAIDset /home1 failed on RCBIG this morning (see the next story below). Restoring that filesystem to RCBIG2. Interrupted by the backup server hanging. Restore of the following directories [ alw1 common djg19 fxz hge mb703 mht4 nh018 receh zucker ] began at 8:58 pm.

RESSRV16 -- It appears that BOTH of the RAID controllers have caused a problem after showing memory and parity errors. They control ALL user files on RESSRV16. Those file systems have been restored to RAIDs on RCBIG1 (which has been upgraded to Terabyte drives). You may log on to RESSRV16 to see the state of your files there. Status updates below.

8:45 am, 3/18 Copied the entire file system to the rebuilt filesystem without error.

12:45 pm, 3/17 Updated the firmware on the RAID controllers for RESSRV16 at the suggestion of a 3Ware tech.

8:32 pm, 3/16 All filesystems restored or moved to RCBIG1. All accounts recreated on RCBIG1. (Actually, the MBCF move is still in process.)

8:18 am, 3/16 All of d1-2 restored. Hong is recreating accounts on RCBIG1.

10:54 am, 3/15 All files restored to d1-3 on RCBIG1. Files currently restoring to d1-2. All file systems on RESSRV16 are now read-only, so users can't put things on the flaky file system. ALL data from RESSRV16 is being restored to RCBIG1. User accounts have not been created on RCBIG1 yet. Partition status on RESSRV16: /mnt/d0-3 (Pellman Lab) appears intact; /mnt/d0-2 (rogrants) appears intact; all other partitions on /mnt/d0-x look safe. /mnt/d1-7 (carrasco) and /mnt/d1-1 (hnf) look ok. d1-4 and d1-5 are unused. d1-2 and d1-3 are damaged.

1:39 am, 3/15 701 Gb of 849 Gb restored to d1-3 on RCBIG1. After this finishes, will restore d1-2 back to RCBIG1. Just checked again, and now there are file system errors on the second controller. Will restore back to RCBIG1. Don't know what's wrong here.

9:04 am, 3/14 So far, 658 GB of 849 GB have been restored.

RESSRV14 and RESSRV15 -- The 500 Gb drives in these servers fail prematurely. Western Digital has given us replacement drives. Through the goodness of RAID, we have already swapped out 20 without any downtime.
RCBIG1 and RCBIG2 -- The 250 gigabyte drives have been replaced with Terabyte drives, quadrupling their storage. At the same time, they were updated from FC2 to FC5.

RESCOMP1 -- We are testing RESCOMP1 as a "home" machine. It is an 8-processor dual-core server and has the processing bandwidth to mount all server partitions via NFS. We'll observe throughput and try to thrash on it (a crude throughput-test sketch appears at the end of this page). If successful, all users, whether connecting to Samba shares or using NFS, will connect to this home machine rather than to individual servers. It also means that if we need to move a user account, such a move will be transparent to the user.

RESCOMP2 -- Being used to test Helicos software for Paul Morrison's MBCF core.

ATLAS -- This is a 48 Terabyte server hosting Ed Fox's Solid sequence data.

RESSRV12, RESSRV13, RESSRV14, RESSRV15 -- No reported problems.

RESSRV17 -- New 48 Terabyte server, not yet placed in use.
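Related to the RESCOMP1 throughput testing mentioned above, here is a minimal Python sketch of the kind of crude sequential write/read check one might run against an NFS-mounted home partition. The mount point and file size are assumptions, and this measures only a single sequential stream, not the multi-user thrashing described.

    #!/usr/bin/env python3
    # Sketch only: crude sequential write/read throughput check for an NFS mount.
    # MOUNT and SIZE_MB are assumptions; adjust for the partition being tested.
    import os
    import time

    MOUNT = "/mnt/home"   # assumed NFS mount point on the test machine
    SIZE_MB = 256
    path = os.path.join(MOUNT, "throughput_test.tmp")
    block = b"\0" * (1024 * 1024)

    start = time.time()
    with open(path, "wb") as fh:
        for _ in range(SIZE_MB):
            fh.write(block)
        fh.flush()
        os.fsync(fh.fileno())          # force data out to the server
    write_rate = SIZE_MB / (time.time() - start)

    start = time.time()
    with open(path, "rb") as fh:
        while fh.read(1024 * 1024):    # read the file back sequentially
            pass
    read_rate = SIZE_MB / (time.time() - start)

    os.remove(path)
    print(f"write: {write_rate:.1f} MB/s, read: {read_rate:.1f} MB/s")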