joeware - never stop exploring... :)

Information about joeware mixed with wild and crazy opinions...

Makes me proud (at least for now)

by @ 10:13 pm on 7/7/2006. Filed under general

I stopped in to check on the status of one of my long-running experiments today… Basically I went and had lunch with some old friends who I used to work with at my last job (the one I was fired from a little over two years ago). We do this once a month or so to catch up, and they let me know what is going on around the company and how my “baby” (their AD Forest) is doing.

I say I am checking the status of my long-running experiment because the AD they support is one I helped design and build, and for which I helped define the processes, procedures, and rules. I like having immediate feedback on how it is all going; I think it makes me do better now when making suggestions to other folks about what they should be doing.

The short and sweet is that it is doing very well. The 4-member (+1 manager) AD team processes many thousands of help desk tickets a year, but they indicated that less than 1% are actual “problems”; the rest are requests for information, misrouted tickets, and routine requests. The team’s primary work is adding/removing objects from AD, as they are the only ones who can create/delete groups, server accounts, sites, subnets, OUs, etc. The users and contacts are all handled through a corporate provisioning system and workstations are handled by local site admins. Local site admins also handle group membership once the groups are created.

On the Exchange side of the house, the directory is so stable and the Exchange environment is run so well that they have 3-4 “9”s of availability on Exchange (99.9-99.99% uptime, i.e. somewhere between roughly nine hours and under an hour of downtime a year), and this is without using ANY clustering whatsoever. None. The AD is in good shape, the DCs are in good shape, the provisioning system works properly, people don’t have too many rights to screw things up; it all just works the way MSFT pretty much planned in that case.

After seeing a great many companies, none of which run this smoothly and well… that is, in a word, amazing. I know of no other environment even a 10th of the size (they are about 250,000 users) that runs anywhere near as well.

Interestingly, while having lunch with them this time their pagers actually went off. In the two years I have been going down to have lunch with them, that was the FIRST time that had occurred, and our lunches usually go 90-120 minutes easy. The team actually looked shocked that they were being paged. The pagers went off for an urgent ticket that came into their trouble ticket queue… the problem… someone wanted to know who owned a certain server… Now if someone is willing to mark that as urgent, you know there aren’t any real serious issues. Actually, it is evident in other ways too, mostly the fact that everyone is pretty laid back and relaxed, and admins are going on week+ long vacations without any concerns about coverage, etc. Remember what I said: 250,000 users and 4 AD Engineers.

Why/How you ask? Especially you ask if you are a harried and crazy busy admin… It is because we, AS A TEAM, were a complete Ass to anyone who wanted to put anything into AD and the team as a whole adopted processes and procedures and rules to make sure people only brought good things into the directory. If someone can’t follow standardized rules, processes, and procedures, it usually is an indicator of their lack of capability, not that of the system. If all of those processes and procedures weren’t in place, they would be running around crazy doing ad hoc crap just like many hundreds of other places I have seen. The whole thing ONLY works if everyone on the team follows the rules and makes sure everyone they help is following the rules. If one person on the team doesn’t do it, that one person undermines the authority and effectiveness of the entire team and threatens the long term efficiency and capability of the group and, in my opinion, should get beaten nearly to death for even thinking about it. Someone who would do that obviously doesn’t understand proper support and how to run an enterprise environment.

When I still worked there I did a lot for a lot of people and folks often raised the question, “what happens when joe is gone? Isn’t this all going to fall down right away?” My response to that was always no, it should run fine as long as people continue following the rules, processes, and procedures. Assuming that happens, stupid problems will not crop up when maintaining the status quo and you can take time to make sure new things get brought in properly. I said that AD should run very well for years without me because we had good people and good processes.

My fear, as I always stated back then, was with new apps being brought in in a piss poor way. Dev people are prone to handling that poorly unless Support properly jumps on the developers in the development/integration/implementation stages to make sure things are done right; otherwise a piece of crap application will get into place and it will end up destroying everything that we built. Not only did I say that was my fear, I firmly believed it would eventually happen, because there is always a case where someone feels process and procedures shouldn’t apply to them, and eventually someone would fuck up and believe it.

It only takes one person, put in charge of making sure something is done right, caving in and starting to do things wrong to make people happy, to wipe the whole thing out. Unfortunately, from what I was hearing in the various conversations, that is now happening. Hopefully it can get turned around, but who knows. I would really hate to see all the good work we put together to make the environment stable, secure, and efficient get blown apart, and the environment start running like so many others I have seen, because of the poor decisions of one support analyst not standing up to Dev folks and making them do things correctly.

To put it in the simplest terms: Dev is the accelerator and is there to make people move forward; Support is the brake and is there to make sure you don’t hit the wall, and if you do, it acts as the air bag and makes sure people don’t get too hurt. Hopefully Dev keeps a little bit of the support requirements in mind, and hopefully Support keeps a little of the dev requirements in mind, but the instant Support fully caves to Dev, the game is over. A Support person can and should be friends with the Dev people, but they should also recall who they work for and why they are there. It isn’t to make Dev’s life happy or get an app out the door on time, it is to make sure the app gets out the door properly and can be supported now and in the future. If a support analyst doing that kind of work can’t accomplish that, they are worthless for that task and should be off doing process-based work, not dev-related work.

 

  joe


5 Responses to “Makes me proud (at least for now)”

  1. Mike Kline says:

    Fired? That had to be due to a personality conflict. It must have been a rough period for everyone involved.

    Let’s say a senior level manager needs a group created. Is that considered high priority in that environment?

    For instance, in my environment if someone is an Admiral or General their request is always high priority, and even though the actual task may be easy we can get paged for something like that. It’s not that bad… after all, in the end they sign the checks.

  2. joe says:

    Mike:
    Yep, fired. My last two jobs I have been fired from, or at least had my contract terminated.

    In the first case, it was said it was because I did bad things and was a physical danger to the employees; interestingly enough, that company rehired me as a full-time employee later on. The real issue was probably that I had refused several offers of full-time employment, because at the time they required a tremendous salary cut and additional responsibilities. On top of that, I had “too much power”, meaning casual comments from me could be used to sway the opinions of folks fairly easily, even in disregard of what management said/wanted.

    In the second case I was told a bevy of issues, none of which were real, ranging from my services no longer being needed to my having broken the Microsoft NDA. However, the rumours indicate that the second issue from above kicked in again. Basically I had built up a level of respect with folks who saw that things I worked on got done correctly and that my team was one of the best teams around, so once again I had too much power, when in actual reality I really had none officially. Couple that with saying that the work we were doing to go to Linux wasn’t being done properly and would cause us considerable pain if we didn’t get the right people involved who actually understood the environment and the problem statement. The Director of IT seemed to take that to mean I was against going to Linux, which absolutely wasn’t the case. I was actually trying to make sure we did it right, as I saw us heading for a tremendous failure. You see, that Linux stuff was his baby… Interestingly enough, much of the Linux project was pretty much killed within a few months.

    Re: the group creation and priorities, all requests have a one-week SLA. However, during US business hours things are usually handled in less than an hour (sometimes seconds or minutes), and outside of business hours they are usually handled within an hour of business hours resuming. The idea is that a new implementation shouldn’t be something done as an emergency; that indicates a lack of planning, which is a big no-no. You have to at least pretend you are doing proper planning and testing of things.

    Overall the response time is usually so good that if something takes a few hours during US business hours people will actually start to complain until they get redirected to the SLA website. That is a good indicator that while the SLA may say one thing, the level of service is so good people don’t even know what the SLA is. If the group did not run so well and was constantly being driven by pagers going off, etc., then the overall service level would, no doubt in my mind, drop. Pagers going off, phones ringing, and people dropping by for favors all dramatically impact the service levels a group can deliver, in a very bad way. Each one of those items is an interruption and slows everything down. If people are allowed to work undisturbed, they don’t have to keep putting thoughts “on the stack” and then trying to remember what they were doing later.

    I’ll admit it was tough to get it going that way initially because everyone wants special treatment. But the secret was to do the best job possible all of the time and show people that things go well without high-priority and urgent tickets for things that aren’t causing the sky to actually fall. Even if one or 25 users were completely down, that isn’t urgent in an org of 250,000. I used to say that if a whole country isn’t down, our pagers better not be going off, and even if a whole country is down, our pagers are already going off from our monitoring, so we probably already know and are probably already working the issue. Still, in the beginning people would feel they were important enough that they should send pages and urgent tickets, or, best of all, leave messages or tickets that just say “call me”. For the people who didn’t use the priorities properly, we would reduce the ticket priority and make those items wait a while. If they did things properly, we handled it quickly. As for the “call me” phone calls and tickets, the calls were ignored and the tickets were sent back, and the people were told to document the issue; we actually had a set of some 14-16 questions we wanted fully filled out for any issue that didn’t fit the other request formats. If those questions weren’t filled out, or the request formats weren’t properly filled out, the tickets were sent back.

    The request format is so important because it was set up to be cut and pasted into scripts to do the work. That way a request for 100 new subnets took the same amount of time as a request for 1 new subnet. (There is a rough sketch of the idea at the end of this comment.)

    The 14-16 questions were the most boiled-down set of questions for the core data we needed to actually work a problem. Usually there was enough info in there for us to work out who was really responsible for the issue or what the actual issue was. Most of those questions had to do with the configuration of the machines involved and a good description of the issue. I would say that 90% of the time you could see from the answers that the issue was a misconfiguration and that someone wasn’t following standards.
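    To give a rough idea of what I mean by a cut-and-paste-able request format, here is a quick sketch. This is NOT the actual script we used; it is Python with the ldap3 module rather than what we really ran, and the domain, DC name, service account, and CSV column names are all made up for illustration. The requester fills in one CSV line per subnet and the whole file gets fed to the script, so 100 subnets cost the same operator effort as 1:

        # sketch only: bulk-create AD subnet objects from a request CSV
        # (assumed header line: subnet,site,description  e.g. 10.1.2.0/24,Detroit,Plant 12)
        import csv
        import sys

        from ldap3 import Server, Connection, NTLM

        SITES_DN = "CN=Sites,CN=Configuration,DC=example,DC=com"   # hypothetical forest

        def add_subnets(csv_path, conn):
            with open(csv_path, newline="") as fh:
                for row in csv.DictReader(fh):
                    dn = "CN=%s,CN=Subnets,%s" % (row["subnet"], SITES_DN)
                    attrs = {
                        "siteObject": "CN=%s,%s" % (row["site"], SITES_DN),
                        "description": row["description"],
                    }
                    # one add per row; 1 row or 100 rows, same operator effort
                    if not conn.add(dn, "subnet", attrs):
                        print("FAILED %s: %s" % (dn, conn.result["description"]))

        if __name__ == "__main__":
            server = Server("dc01.example.com")                     # hypothetical DC
            conn = Connection(server, user="EXAMPLE\\svc-adreq",    # hypothetical account
                              password="***", authentication=NTLM, auto_bind=True)
            add_subnets(sys.argv[1], conn)

    The point is not the particular tool; it is that the request format maps straight into the work with no re-typing, so the size of the request barely matters.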

  3. Hey Joe:

    I’m very interested in what corporate provisioning system the company is/was running (if it’s ok to tell) – and did you have a hand or a say in the original configuration?

    It’s always interesting to see a successful provisioning install…

    Cheers,

    Pam

  4. joe says:

    Hey Pamela,

    The provisioning system was actually something built internally. It was originally built for managing an X.500 directory that ran on a mainframe; that directory eventually moved to iPlanet running on *nix, but the mainframe web-based front end was kept for it. Then when Windows NT domains came along, they were plugged in using the same system, and obviously when that was migrated to Windows AD the provisioning system was set up to handle that as well. Now the old iPlanet backend stuff is being moved to an ADAM configuration set with some MIIS backend pieces, but as far as I know the web-based mainframe front end will still stay… Oi!

    The system is quite flexible, though really a series of giant hacks. The web page makes some local changes on the mainframe, which then shoots CSV files down to the proper machines (iPlanet or Windows) to do the actual mods. The “shooting” is done via FTP, and after the work is complete the mainframe picks up the results from the same FTP site.

    On the actual Windows side (I wasn’t very involved with the iPlanet stuff), some code written many years ago by an MCS guy (very poorly written code to be sure, and the guy’s name was a variation on the spelling of the word Hack… I couldn’t make this stuff up) implements a service which just watches a folder (the FTP landing zone); when it sees a file there, it executes a VB app that reads the file, processes it, and writes an output file for the mainframe to pull back up. It is all positively evil and horribly written, but it works amazingly well. (There is a rough sketch of the flow at the end of this comment.)

    That was actually another project I wanted to tackle while I was there: rewriting that entire system, as it could be done much better. Alas, I didn’t get the chance. And by better, I don’t necessarily mean implementing MIIS. Overall MIIS annoys me; I strongly dislike the need for SQL Server, do not feel it is necessary, and see it as simply a gambit by MSFT to force more reliance on different parts of their tech.
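    For the curious, the Windows-side flow boils down to something like this. It is a sketch only, in Python rather than the VB it was actually written in, with a made-up landing zone path, result naming, and a stubbed-out apply_change; the real mods and file formats were specific to that environment:

        # sketch only: poll the FTP landing zone, process each CSV the mainframe
        # drops off, and leave a result file behind for it to pull back up
        import csv
        import os
        import time

        LANDING_ZONE = r"C:\ftproot\provisioning"          # hypothetical drop folder
        RESULTS_DIR  = r"C:\ftproot\provisioning\out"      # hypothetical pickup folder

        def apply_change(row):
            # hypothetical stub; the real work (user/group/contact mods) was done
            # by the old VB app the service kicked off
            print("would apply:", row)

        def process(path):
            results = []
            with open(path, newline="") as fh:
                for row in csv.reader(fh):
                    try:
                        apply_change(row)
                        results.append(row + ["OK"])
                    except Exception as exc:
                        results.append(row + ["ERROR", str(exc)])
            out_path = os.path.join(RESULTS_DIR, os.path.basename(path) + ".result")
            with open(out_path, "w", newline="") as fh:
                csv.writer(fh).writerows(results)
            os.remove(path)                                # done; mainframe grabs the .result

        os.makedirs(RESULTS_DIR, exist_ok=True)
        while True:                                        # the "service" is just a polling loop
            for name in os.listdir(LANDING_ZONE):
                if name.lower().endswith(".csv"):
                    process(os.path.join(LANDING_ZONE, name))
            time.sleep(30)

    Evil, yes, but you can see why it keeps working: one dumb contract (drop a CSV, pick up a .result) that the mainframe side never has to change.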

  5. Hey Joe,

    Thanks for the description – even if it is a bit evil and possibly requiring overhaul soon, the presence of the system at all implies that there is a business process being followed that is conducive to automation — a small miracle in itself, in my experience 🙂

    Cheers,

    Pam
