There is a wealth of literature on data center design, construction, and operation, and yet few data centers are built, redesigned, or operated without significant, costly errors that affect every aspect of an organization. Many of these mistakes can be avoided, not by reinventing the wheel, but by including more people in data center design, ensuring the right people are in place for smooth operation and quality control, and consistently training and testing the skills of those entrusted with running the heart of your company’s digital infrastructure.
This is not a question of arbitrary inclusiveness. It is a question of mitigating human error, the number one cause of errors and failures, which can have significant financial repercussions for any company.
Including the Operations Team in Facility Design
The basic structural and procedural elements of data center design and operation remain consistent across the board. However, the unique aspects and needs of every organization play an important role in how a data center functions. Those in the best possible position to contribute to the brain trust are those who will be running the data center. A common mistake organizations make is not including the operations team in the design (or redesign) of their data center. Theory and planning are all well and good, but success comes down to adapting to unpredictable situations. Those in data center operations face issues daily and are a great resource for planning the smooth operation of your unique data center setup.
Finding and Keeping the Right Staff
As with most divisions of a business, a smoothly operating data center needs not only experts in critical roles but also enough skilled staff in general to fill and support all the roles essential to its operation. When looking at basic staffing plans to determine needs, companies often overlook the need for redundancy. We take redundancy for granted when it comes to hardware because it is now standard practice, but it is rarely hardware failure that causes the biggest problems. Rather, too few staff, or staff who are not properly trained, are the leading cause of operational breakdowns.
When considering staffing for a data center, it is important to keep in mind not only the normal, daily operation of the center (including various needs for staff leave), but also the periodic cycles of activity that ensure the health of your data center (e.g., emergency drills, system checks, quality control procedures, training, and other drills).
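As a rough illustration of the staffing arithmetic described above, the following Python sketch estimates how many people are needed to keep a set of posts filled once leave, training, and drills are taken into account. The function name, its parameters, and every number in the usage note are hypothetical and are not drawn from any particular staffing standard; substitute your own figures.

```python
import math


def required_headcount(posts_per_shift, hours_covered_per_week,
                       hours_worked_per_week, unavailable_fraction):
    """Estimate the staff needed to keep every post filled.

    posts_per_shift:        people who must be on duty at any one time
    hours_covered_per_week: hours the site must be staffed (168 for 24/7)
    hours_worked_per_week:  contracted hours per employee
    unavailable_fraction:   assumed share of contracted time lost to leave,
                            training, drills, and similar activity (0 to <1)
    """
    if not 0 <= unavailable_fraction < 1:
        raise ValueError("unavailable_fraction must be in the range [0, 1)")
    # Total person-hours of coverage the site needs each week.
    coverage_hours = posts_per_shift * hours_covered_per_week
    # Hours one employee can actually spend on shift each week.
    productive_hours = hours_worked_per_week * (1 - unavailable_fraction)
    # Round up: a fraction of a person still means hiring a whole person.
    return math.ceil(coverage_hours / productive_hours)
```

With two posts staffed around the clock and a 40-hour week, `required_headcount(2, 168, 40, 0.0)` returns 9. Assuming 20 percent of each employee's time goes to leave, training, and drills, `required_headcount(2, 168, 40, 0.2)` returns 11. The gap between the two figures is the redundancy that basic staffing plans tend to leave out.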
Training and Developing Talent
Because data center operation affects every single function of a company, there is a high need for very specific technical skills and experience in a multitude of roles. Hiring for such specific skills in this competitive market can be difficult. General technical expertise and problem-solving skills, however, are slightly easier to find. Given how unique data center operations are for each organization, it is a good idea to find people with technical proficiency and intelligence and train them in the highly specific areas you need.
Having a training program in place to develop intelligent workers not only ensures that people perform the specific job functions you require, but also helps in the elusive quest to retain good employees. Research shows that companies that make the effort to train and develop employees have a much higher retention rate than businesses that are always on the hunt for a unicorn to fill a role that only they have, only for all parties to be disappointed a few months down the road when the unicorn turns out not to be real.
Most importantly, turnover in mission-critical roles puts a significant financial and operational strain on the organization, which can lead to further loss of talent, to say nothing of the secondary and tertiary repercussions: system errors, downtime, and client-facing system issues.
Regardless of job-specific experience level or general technical proficiency, all employees need some training, which they in turn pass on to other employees. If that training is not efficient or effective, then you are creating an internal negative spiral of action and consequence. Though this is not new information for most managers and directors, there remains a lack of data-center-specific training. Many companies have a hard time identifying the source of their issues: first narrowing it down to human error, and then understanding that a lack of proper training is a cause. One of the reasons for this oversight is the cost of training, both in area expert fees and in the time it takes for employees to be trained. There is little argument, however, that although there can be a significant upfront cost, it is never greater than the fallout from errors that could have been prevented with proper training.
Drilling and Skills Testing
No matter your level of experience or education, if you don’t use it, you lose it. It’s a cliché, sure, but it is one that can mean the difference between surviving an emergency situation and a total systems failure. Whatever employees do day to day becomes second nature to them. They don’t even have to think too much to know what to do, be it problem solving or maintenance. However, when the unexpected happens, how prepared are the staff and managers to handle the situation? The answer, unfortunately, is that without regular drilling and testing of skills, very few will be ready.
Most organizations have had their share of battles fought, and if not, there are countless written accounts of most things that can and will come up eventually during the lifecycle of your data center. Selecting the scenarios that are most likely, ensuring there is a variety of causes and effects, and creating a monthly drill to test responsiveness will go a long way toward ensuring that when issues do arise, your teams are ready to mitigate their impact on the data center and the company as a whole.
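One way to turn that selection process into a repeatable schedule is sketched below in Python. It is a minimal illustration, not an established method. The `Scenario` fields, the 1-to-5 likelihood scale, and the rule of never drilling the same cause two months in a row are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    cause: str       # hypothetical category, e.g. "power", "cooling", "human"
    likelihood: int  # the team's own estimate: 1 (rare) to 5 (very likely)


def plan_drills(scenarios, months=12):
    """Return one scenario per month.

    Less-drilled scenarios are chosen first, then more likely ones, and the
    same cause is not drilled in consecutive months when an alternative exists.
    """
    if not scenarios:
        raise ValueError("at least one scenario is required")
    times_drilled = {s.name: 0 for s in scenarios}
    plan = []
    for _ in range(months):
        last_cause = plan[-1].cause if plan else None
        # Prefer a different cause from last month; if every scenario shares
        # that cause, fall back to the full list.
        candidates = [s for s in scenarios if s.cause != last_cause] or list(scenarios)
        choice = min(
            candidates,
            key=lambda s: (times_drilled[s.name], -s.likelihood, s.name),
        )
        plan.append(choice)
        times_drilled[choice.name] += 1
    return plan
```

Given a list such as a UPS failure (power, 5), a generator that fails to start (power, 4), a cooling unit shutdown (cooling, 4), and a wrong breaker being opened (human, 3), the plan starts with the most likely scenario and then rotates through the others, so that every scenario is exercised before any is repeated.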
Testing also plays a vital role in identifying areas for improvement, as well as areas of expertise that can be shared in the organizational brain trust. Testing and drilling go hand in hand to ensure that weaknesses are identified and addressed, and that skills and expertise are shared to improve the overall function of the data center and all the teams therein.