First post in a long, long, long time. Too much has happened, all of which is downright weird. Still a new job, even more fun and exciting, this time back to working with bare-metal (as well as some public and private cloud) and virtualised servers. That’s for a later post, though.
This one’s about what happens when one starts working in companies with bare-metal and virtualised environments in various states of maintenance. Specifically, it is the part that comes just after one starts getting an idea of what does what, how and why, and it’s called “establishing a baseline”.
Wikipedia defines a baseline as “...an agreed description of the attributes of a product, at a point in time, which serves as a basis for defining change... A ‘change’ is a movement from this baseline state to a next state. The identification of significant changes from the baseline state is the central purpose of baseline identification.” (https://en.wikipedia.org/wiki/Baseline_(configuration_management))
Defining a baseline, therefore, involves finding and agreeing on the optimal configuration of something, and documenting it as the default configuration. Any changes from this default are either incorporated into that default when they further optimise or add to it without any loss of its default functionality, or rejected when they subtract from it or when they compromise it (or both).
The benefits of defining a baseline are immediately obvious: Any changes from the defaults are immediately (or with little enough effort) visible, any errors traceable and most often easily correctable by systematically rolling back your changes until you spot the problem and correct it or by reverting to the default configuration.
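To make the “changes are immediately visible” part a bit more concrete, here is a minimal sketch in Python. It assumes the baseline and the current configuration can each be flattened into a simple dictionary of settings; the function name `diff_from_baseline` and all the settings below are made up for illustration, not taken from any real tool.

```python
def diff_from_baseline(baseline, current):
    """Return the settings added, removed or changed relative to the baseline."""
    # Keys present now but not in the agreed baseline.
    added = {k: current[k] for k in current.keys() - baseline.keys()}
    # Keys the baseline expects but which have gone missing.
    removed = {k: baseline[k] for k in baseline.keys() - current.keys()}
    # Keys present in both, with a different value: (baseline value, current value).
    changed = {
        k: (baseline[k], current[k])
        for k in baseline.keys() & current.keys()
        if baseline[k] != current[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical settings, purely for illustration.
baseline = {"max_connections": 200, "log_level": "warn", "swap": "off"}
current = {"max_connections": 500, "log_level": "warn", "tmp_cache": "on"}

drift = diff_from_baseline(baseline, current)
print(drift)
```

Every entry in `drift` is a change someone has to either fold into the baseline or roll back, which is exactly the accept-or-reject decision described above.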
In traditional System Administration, one can define the baseline configuration as the state in which your application/service is running smoothly for a pre-defined period of time, a pre-defined number of users, a pre-defined workload, a pre-defined OS & Hardware configuration and a pre-defined I/O & Network throughput.
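One way to write such a definition down is as a set of expected values with an agreed tolerance around each. The sketch below is only that, a sketch: the metric names, the numbers and the `MetricBaseline` class are all assumptions for the sake of the example, and real figures have to come from your own measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricBaseline:
    name: str
    expected: float   # value observed while the service was running smoothly
    tolerance: float  # allowed relative deviation, e.g. 0.2 means 20%

    def within(self, observed: float) -> bool:
        """True if the observed value is within tolerance of the expected one."""
        return abs(observed - self.expected) <= self.tolerance * abs(self.expected)


def out_of_baseline(baselines, observed):
    """Return the names of the metrics that have moved outside their baseline."""
    return [b.name for b in baselines if not b.within(observed[b.name])]


# Hypothetical numbers, purely for illustration.
baselines = [
    MetricBaseline("concurrent_users", 500, 0.20),
    MetricBaseline("disk_io_mb_s", 120, 0.25),
    MetricBaseline("net_throughput_mbit_s", 800, 0.10),
]
observed = {"concurrent_users": 540, "disk_io_mb_s": 95, "net_throughput_mbit_s": 600}

flagged = out_of_baseline(baselines, observed)
print(flagged)
```

With these made-up numbers only the network throughput is flagged, since 600 is more than 10% away from the expected 800.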
Which is why we NEED to know and understand the system/service/application well enough, and which is where a System Administrator needs to gather metrics, run tests and simulations, see what they have to work with (in terms of resources and time allotted) and, perhaps most importantly, actually talk to the users of that system (be it customers, colleagues or both) and management. That will allow the System Administrator to understand what people expect from it and how they expect it to perform, what its importance to the company is, and what the company reasonably expects to pay to maintain it.
In an ideal world, with no restrictions imposed by funding, customer & management expectations, one can take time to bring the system/service down, untangle the mess they inherited, put it back together and party. In the real world, where we get paid for our work and we require nourishment and rest, this takes skill, knowledge, management buy-in, a line manager that cares about their job and is willing to trust you, and loads and loads of baby-step changes preceded by a series of thinking-cap-on sessions, in between breaks, firefighting and sometimes even in bed, in the small amount of clear-thinking time we get between wakefulness and sleep.
You can start bottom-up or top-down. You can start with the hardware and go up to the application, or the other way around. If you’ve got the time, I’ve found it’s easier in the long term to start with the hardware and move upwards.
Take your time to pick a good enough hardware kit, with a good enough set of support contracts with a good replacement & service policy. Ensure your network infrastructure is fit for purpose and can take the beating under heavy load. Make sure your OS is up to date, your repos official, your packages signed, your OS config sane, your yum excludes stable yet reasonable. Ensure your app/service configs are sane, and your init and shutdown scripts are, too.
But in a pinch (defined as “you have the proverbial ‘balls’, a good and valid support contract with all vendors involved, and a good, tested set of backups plus a backup and restore policy for it”) you can start by stabilising the application/service, using only by-the-book configuration and run-time options with sane and stable settings, and then move down. Just make sure your baby-steps are (at the very least) human-baby steps (baby-turtle, even, if you can stretch the time and budget enough), not 7-foot-alien-baby steps!
That should hopefully get you started on the mind-set you will need to bring one service/app to a baseline. Rinse and repeat for the rest. I did mention this WILL take time, didn’t I? 🙂
It does require management buy-in, and you will have to sweat, but I’ve had visible results in weeks. And the managers that did listen (reluctantly at first) and did allow it, liked it. A lot!
That’s all for now, folks, time for this young man to go to bed and get some rest before his on-call Friday’s next call-out! Stay tuned for more OOH wisdom! 🙂