When the crap gets sticky, and your boss is the only person who can advise because you're in unknown territory, and he buries his head in the sand because he doesn't want his neck chopped off.
Who's got an axe? 'Cos I'm about to start chopping!
(Critical problem across 2 sites, looking like a viral infection causing 2 clients to suffer downtime, which means they're losing money, fast; no-one has encountered the situation before, and no-one is willing to help with it either... damn, I love my job!)
Sounds like a cue for more "geek talk" :lol:
Take the initiative and sort it yourself :thumbsup:
SoulKiss
15-11-06, 05:10 PM
Advise your boss in writing (email) of the situation, and explain that pending further input from him you will be doing x, y and z, where x, y and z are the ways you see to deal with this.
This covers your back
David
Alpinestarhero
15-11-06, 05:13 PM
Yup, if no-one will help you, just say what you're gonna do. They know the problem, they want it solved, but they refuse to do it themselves or at least advise/help.
Then go ahead and do it. If it messes up, then it's not really your fault, since you received inadequate support.
Do you have any colleagues to talk to about it who might impart their advice?
Matt
Cheers for the words, guys, and I did that before posting to the org ;) (including CC'ing the board of directors)
You can tell how much I care about clients losing money, posting on this forum whilst dealing with the issue :lol:
alpinestarhero, nope, there is NO-ONE in the company that can advise. I'm literally in unknown territory... dealing with a live virus, for which there is no patch, and trying to keep systems running. YAY!
Or at least, I'm treating what I'm seeing as a virus, until I have more information about the things I'm seeing.
SoulKiss
15-11-06, 05:23 PM
Is getting them to pull the plug on what they have and getting "clean" kit to them an option?
That would allow you to ship the infected stuff home and inspect it in a clean-room environment
David
Nope, live environment. Switching to the resilience system would require 3 hours of downtime in itself, and by now it's probably already infected too.
The system requires t'interweb access at all times due to the nature of the business. Nice thought though.
I'm dealing with it, slowly & methodically :)
SoulKiss
15-11-06, 05:28 PM
Teaching you to suck eggs here, but.....
1) Assuming a firewall between the infected boxes and the outside world, log outgoing stuff
2) Shut down all legitimate services
3) Phone your missus - tell her you will see her at the weekend
4) Look at the logs to see what is trying to connect to what (rough sketch of the idea below)
5) Google for those ports etc
6) Make the rest up as you go along
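Something along these lines will do as a quick and dirty stab at point 4 if you can't get at the firewall logs straight away. Just a sketch, assuming Windows boxes with netstat on the path, so adjust to taste:

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Quick and dirty: dump every established TCP connection with its remote endpoint
// and the PID that owns it, so you can see what is talking to what.
public class OutboundDump {
    public static void main(String[] args) throws Exception {
        Process p = Runtime.getRuntime().exec("netstat -ano");
        BufferedReader in = new BufferedReader(new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            String t = line.trim();
            if (!t.startsWith("TCP")) continue;             // skip headers and UDP lines
            String[] f = t.split("\\s+");                   // proto, local, remote, state, PID
            if (f.length < 5 || !"ESTABLISHED".equals(f[3])) continue;
            System.out.println(f[2] + "  <-  PID " + f[4]); // remote endpoint and owning process
        }
    }
}

Anything talking to an address you don't recognise goes straight into Google (point 5).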
David
Filipe M.
15-11-06, 05:33 PM
Point 3 is the most important and easily overlooked.
fizzwheel
15-11-06, 05:50 PM
Might I add another point.
7. Ring Domino's or some other such takeaway establishment and order provisions.
If that was me, I'd be pulling the plug, live system or not. Sod struggling to keep things up and running. Pull the plug, get the pox cleaned up, and then off you go again.
I've lost count of the number of times stuff like this happens at work and the managers insist on keeping the system up, when in hindsight, had we pulled the plug and cleaned the mess up, the systems would have been down for less time.
I hate computers.
We've had some database "issues" over the past few days.
Unfortunately, just before it all went tits-up, the outsource boys in India made an undocumented and unapproved change (for which they received a proper bollocking from me).
Anyway, the application support bods got wind of this, and of course this had to be the cause of the incident (of course it could never be the prehistoric version of Oracle or the ****ty application causing the problem :P )
I tried to explain that me putting the kettle on in Nottingham and causing a power spike in the data centre 50 miles away was just as likely to have caused the problem as the Mumbai lads adding an extra line to syslog.conf, but they wouldn't have it.
What are the sentencing guidelines for pre-meditated ABH ?
Sorry for the derail/rant.
OK, for the geek contingent on the site, the issue is now fixed (I think) so the system is running. It's far from perfect, but it's running.
Just a bit of background information first, I think. Unfortunately I'm tied to an NDA, so I'll have to keep the details vague. Of the two clients affected yesterday, one is a large car manufacturer (quite expensive cars, too), the other a worldwide distribution/logistics company.
For the car manufacturer, our applications control everything from production line robotics right the way through to their own distribution & GPS tracking systems. Because they had downtime, and I mean complete downtime, everything was powered down for safety reasons (ever seen a robotic rivet gun being fed random data when there's a chance of people being nearby doing inspections? NOT GOOD).
For the logistics company, their entire route planning system, GPS tracking, automated warehousing, the lot, went down. This meant that everything for them had to revert to manual, which isn't quite as big a deal as for the car manufacturer, but it still slows down operations.
fizzwheel, I understand fully what you were getting at. For the car manufacturer, I simply shut down the servers. They have to power down their hardware anyway, so what's the point of having servers commanding hardware that has no power? Also, with two clients down, which one do you fix first? I can't be on two systems at the same time. The logistics company, well, it works out that if they have to stop operations completely, they lose just short of £41,700 every minute! :shock: Crippled systems mean they lose less money; yes, they still lose it, but not quite as much.
So, the setup. Each client has 4 servers (at least): a database server and 3 application servers. Of those application servers, 2 deal with users, the other with automated processes (EDI stuff, mainly). All of this runs on various Windows platforms (we don't care, it's up to them to choose), with IIS (yuck) and WebObjects (nice, but rather limiting). Our applications are written in Java and interface like most other web-based applications: the user sees HTML, which is fed back through WebObjects to our application.
OK, so the steps to fix it were basically:
1) Get the call from the clients, have a brief look at both systems, realise that the car manufacturer's system has the potential to actually hurt people if it's left running, so shut that down. Screw the money they lose; policy states you don't put profit ahead of people, ever.
2) Investigate the logistics client a little more, and realise that this is a MAJOR issue. Call the client & get authorisation to "do whatever is needed to ensure productivity". Basically, I now have the green light to do everything up to and including re-imaging live servers in-situ. It's at this point that I turn around to my boss and explain what I see before me, and his response was literally "OK, you'd best deal with it" as he walked out of the office. He never came back, I've no idea where he went, and I really don't care. He can explain himself to those who pay his wages. It was at this point that I emailed him, CC'd the directors of our company & the client, & basically said "This is what I plan on doing. If anyone has any objections I'm on extension 229; you've got 30 seconds before I start. If my first thoughts don't work, I've been left alone to deal with it, so it'll be dealt with however I see fit."
3) Get a call from the client asking if I'm going to be able to fix this within the critical SLA (4 hours). I explain that I'm not sure, as this is something no-one has ever encountered, and he says he'll call me back in an hour to get an update. If I'm still not sure, he'll arrange for transport to site (which means a private jet 8) ). At this point I call the Mrs and let her know that I'm dealing with a problem, and its magnitude, and that I might not be home in the next month (if we go to site, we return whenever the client says so).
4) See that the server is running, and I can keep a connection to it, but our application isn't talking to anything: not via IIS, not via standalone EDI file transactions, not via our own port communications. Start scratching my head, decide to shut the entire thing down temporarily & restart it. Not having any of it.
5) Run some of our database integrity checking tools, which tell me that the DB has 'inconsistencies' (this could mean anything, realistically), so drop to resilience & run the same scripts: same result. Bugger. Live DB, no backup system available. Oh ****.
6) OK, so a shutdown & restart hasn't killed this thing. I've got no access to the applications. Need to start thinking now. I know nothing about it, so I need to know something; how do I do that? Aha, I have a Java compiler. So I quickly knock out a rough application to iterate over the running processes, the memory segments they reside in, etc., run it on one of the servers & start ruling out legitimate services/applications.
7) Repeat 6 on another server, and cross-reference the results (there's a rough sketch of the idea just after this list). This leaves one process standing out from the rest. I take this to be a virus (it might not be, but it damn well looks like it) so I call a college friend of mine who just happens to work as a code monkey for an AV company :D Again, I have to keep things vague with him, which doesn't help the situation.
8) Spend the next 10 mins bouncing ideas off him, explaining that neither the 2 clients' AV software nor our own has caught it. We come up with a plan to watch memory, live, to see what this thing is doing. At least we'll know more about it.
9) Damn, it seems there's a pattern to its processing. That's good. My friend emails me a modified Windows NTx kernel, which I send to one of the application servers before rebooting it (there's another application server, so things will keep running). With the new kernel in place, I now have the power to overwrite any memory segments I choose (but so does the virus). So I quickly fire up a memory hex editor and fill in a few NOPs. It takes a few attempts, but I manage to commit these NOPs at the right time and kill the damn thing. (For those that know about it, I basically used the old NOP sled trick from buffer overflow exploits, but a customised version.)
10) Restore the original kernel, reboot the server, test things (after locking it away from the rest of the LAN by firewall rules), good, she's running.
11) Repeat one server at a time, until the client is sorted out. Thank Allah for that!
12) Report back to the client, tell him to cancel my transport, and do everything above for the other client. By now my head hurts, a lot.
13) Everything fixed, everyone happy, I head home, not too late either :)
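For the geeks wondering what the rough little Java tool in steps 6 & 7 actually looked like: I can't post the real thing (NDA again), and the real one also pulled memory details, but the core idea boiled down to something like this sketch (assuming tasklist is available on the box):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Iterator;
import java.util.TreeSet;

// Dump a sorted, de-duplicated list of every running process image on the box.
// Run it on two or more servers, redirect the output to files, diff the files,
// and whatever only shows up on the sick box sticks out a mile.
public class ProcDump {
    public static void main(String[] args) throws Exception {
        Process p = Runtime.getRuntime().exec("tasklist /fo csv /nh");
        BufferedReader in = new BufferedReader(new InputStreamReader(p.getInputStream()));
        TreeSet names = new TreeSet();                     // sorted + unique, so a plain diff works
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.split("\",\"");         // "Image Name","PID","Session Name",...
            if (fields.length < 2) continue;               // skip blank or odd lines
            names.add(fields[0].replaceAll("\"", ""));     // strip the leading quote from the image name
        }
        for (Iterator it = names.iterator(); it.hasNext();) {
            System.out.println(it.next());
        }
    }
}

Crude, but when you know nothing about what you're hunting, a diff of "what's running here vs. what's running there" tells you something, and that was all I needed.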
Then I get a call from the logistics client at 10pm last night: "We've got the same issue again"... OH ****! So I spend half the night fixing it again.
Now I have to fill in reports about what happened, and try to find out where this thing came from. My college friend will probably come in handy for that, and as a reward, his company will get as much information as I can provide (without breaching the NDA) about the attack.
I'm also going to spell out, when I file the reports, that I'd like a chocolate fireguard as my new boss. It'd have been much more useful!
So there you have it folks :)
21QUEST
16-11-06, 10:15 AM
....Loads of stuff in a foreign language ....
So there you have it folks :)
:? I think blondy ( http://forums.sv650.org/viewtopic.php?t=47889 ) was right, you geeks are not normal :lol: :wink:
Sounds like you did a :thumbsup: job.
Cheers
Ben