When redundancy isn't
Sep. 20th, 2018 12:33 am
I'm reminded of a joking phrase used in WAN operations: "backhoe fade."
That's when you lose your connection because some idiot dug up the cable with a backhoe.
A rather infamous incident shut down the Internet in New England back in the late 70s.
It seems that while the customer had specified separate routing for the pair of T-1(?) lines that carried the Internet traffic, the provider had routed the connections via separate cables... in the *same* trench.
Needless to say, the customer had words with the provider. And the provider revised their rules so that "separate routing" meant different cables *routed* along physically different paths, so one accident couldn't take both out...
A manufacturing place I used to work at got separate power feeds from two different power companies because they had processes that didn't take well to sudden power loss.
One power line came in from the north, one from the south. The only single point of failure was the company's substation that they both connected to.
And they had a *large* room full of batteries to enable shutting down those critical processes gracefully.
Sadly, we still lost power a few times in the dozen years I worked there.
no subject
Date: 2018-09-20 02:42 pm (UTC)
That way they had the current backup in the other building.
And if something clobbered both buildings, then the backups weren't needed anyway because we'd have been out of business.
A common failure mode for backups is having just *one* set of tapes/disks/whatever.
If something goes wrong *during* a backup, you've just hosed your only backup.
Also a good idea to make sure the backups are *readable* and actually contain the data they are supposed to.
Lots of places have cheerfully plugged in the backup, only to find that either it isn't readable or the process was mis-configured and the critical data was never written to the tapes.
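A minimal sketch of that kind of check, in Python, under the assumption that the backup is simply a mirrored directory tree (the paths and function names here are made up for illustration): read everything back and compare checksums against the originals, so an unreadable or empty backup shows up before you actually need it.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files don't blow up memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_backup(source_dir: Path, backup_dir: Path) -> list[str]:
    """Return a list of problems: files missing from the backup or with mismatched contents."""
    problems = []
    for src in source_dir.rglob("*"):
        if not src.is_file():
            continue
        copy = backup_dir / src.relative_to(source_dir)
        if not copy.exists():
            problems.append(f"missing: {copy}")
        elif sha256_of(src) != sha256_of(copy):
            problems.append(f"mismatch: {copy}")
    return problems

# Example (hypothetical paths):
# issues = verify_backup(Path("/data/accounts"), Path("/mnt/backup/accounts"))
# if issues:
#     print("Backup NOT trustworthy:", *issues, sep="\n  ")
```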
A recommendation from *way* back (the late 70s) is to have a backup for every day of the week. They can be "incremental" backups.
On the last working day of the week, you do a complete backup, and archive it. You also shuffle the tapes by one day and open a new one which will become the "first day of the week" tape. With the former "first day" becoming second day, etc.
This eliminates worn out media as a problem.
You also make "end of month", "end of quarter", and "end of year" backups (extra tapes in this case), and you archive them. This way you can go back and figure out how things were in the past with reasonable granularity.
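Here's a rough sketch of the weekly part of that rotation in Python, just to make the bookkeeping concrete; the tape labels, the five-day week, and the class name are my own inventions, and the month/quarter/year backups would simply be extra tapes pulled on top of this.

```python
from collections import deque
from datetime import date

class TapeRotation:
    """Model of the weekly scheme described above: incrementals on the
    early days, a full backup on the last working day, then shuffle."""

    def __init__(self, working_days: int = 5):
        self.working_days = working_days
        # One tape slot per working day, starting with fresh tapes.
        self.slots = deque(f"tape-{i}" for i in range(1, working_days + 1))
        self.archive: list[str] = []        # full backups, kept forever
        self._next_label = working_days + 1

    def backup_for(self, day: date) -> str:
        """Say which tape to use today and what kind of backup to run."""
        weekday = day.isoweekday()           # Mon=1 .. Sun=7
        if weekday > self.working_days:
            return "no backup scheduled"
        tape = self.slots[weekday - 1]
        if weekday == self.working_days:
            # Last working day: complete backup, archive that tape,
            # shuffle the others down one day, open a fresh first-day tape.
            self.archive.append(self.slots.pop())
            self.slots.appendleft(f"tape-{self._next_label}")
            self._next_label += 1
            return f"FULL backup to {tape}, then archive it"
        return f"incremental backup to {tape}"

# Example: one working week in September 2018.
if __name__ == "__main__":
    rot = TapeRotation()
    for d in range(17, 22):                  # Mon the 17th .. Fri the 21st
        day = date(2018, 9, d)
        print(day, "->", rot.backup_for(day))
```

Because the tape that just took the full backup leaves the daily rotation every week, no tape stays in service long enough to wear out, which is the point of the shuffle.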
A backup for the planet would really have to be more like the "disaster recovery" centers some big corporations have. Not just "stuff" at a different site, but something set up to be able to run things until you can rebuild.
That'll require a *very* self-sufficient colony.
no subject
Date: 2018-09-21 12:36 pm (UTC)
For some reason I know several people who do computer stuff for businesses. One told the tale of how he got called in (a new client who didn't have regular IT support) for a server emergency and had to replace several failed components. When he tried to restore from backup, he discovered there was no tape in the backup drive.
no subject
Date: 2018-09-21 05:37 pm (UTC)
I was good friends with the manager and the "tech support" person at the Radio Shack Computer Center back in the 80s.
I was talking with the tech support person one day and she told me about one long term problem she finally solved with this one customer.
TRS-DOS had this built-in capability to limit backup copies of programs. Basically as part of the file properties, there was a byte that got checked when you did a "backup" (equivalent of MS-DOS's DISKCOPY) of a floppy.
If the byte was 255 (the default), the file was "unlimited". If the value was 1-254, the master's byte got decremented by one and the copy had it set to 0. If it was zero, the file wasn't copied.
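The rule is easier to see as code. This is just a Python sketch of the decrement logic as described above, not the actual TRS-DOS routine; the names are my own labels for that byte, and the behavior of the unlimited case on the copy is an assumption.

```python
UNLIMITED = 255  # default: no restriction on backups

def backup_copy_limit(master_byte: int) -> tuple[int, int]:
    """Apply the copy-limit rule to one file during a disk backup.

    Returns (new value on the master, value written to the copy).
    Raises PermissionError when the file may no longer be copied.
    """
    if master_byte == UNLIMITED:
        # Assumption: an unlimited master produces an unlimited copy.
        return UNLIMITED, UNLIMITED
    if master_byte == 0:
        raise PermissionError("copy limit exhausted; file not copied")
    # 1-254: the master loses one allowed copy,
    # and the copy itself is marked uncopyable.
    return master_byte - 1, 0

# e.g. a master shipped with a limit of 3:
#   backup_copy_limit(3) -> (2, 0)
# After three backups the master hits 0 and refuses further copies,
# which is why this customer kept needing the count reset.
```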
So, on to the problem.
This one customer kept having his backups of the business program he was using go bad.
So he'd have to bring in the master disk and she'd have to reset the number of backups.
She'd talked him thru the backup process over the phone many times and was about ready to pull her hair out.
She'd been on the phone with him yet another time, going thru things step by step. Having him go into excruciating detail.
cust: "It says backup complete. I take the master out of drive 0 and put it back in the sleeve in the binder."
tech: "Ok, sounds good."
cust: "I take the backup out of drive 1 . put it in the sleeve and stick it up on the file cabinet"
tech: "Wait a sec. 'Stick it up on the file cabinet?"
cust: "Yeah, I take it out of the drive, put it in the sleeve and use this magnet to stick it to the side of the file cabinet..."
tech (with *great* restraint): "Ok, I think we've found the problem..."