Published date: June 28, 2024
Expert tutorial by Egor Karitskii, Head of IT-infrastructure
Know your enemy
Daily operations of data centres may seem trivial to those unfamiliar with their intricacies. Consider, for example, the arrival of a batch of patch cord cables for switching. It looks like a simple task that will take a couple of minutes. However, in a medium-sized data processing centre, such shipments often contain thousands of cords, each neatly packaged with two ties. While unpacking a single cord may not pose a challenge, dealing with thousands presents a significant issue. It’s not simply a matter of untangling and hanging them; it’s about ensuring they’re properly organised and accessible. Unpacking and organising 5,000 patch cords can take a team of three engineers an entire day. And what if there are even more of them? At what point does the transition from manual labour to full automation become a must? Let’s examine how approaches to data centre maintenance vary based on the data centre size.
Going live with 10/100/1000 nodes
At this point, I’ll compare three different DC scales: 10 servers, 100 servers, and 1000 servers. The disparity between them is significant across various aspects. Day-to-day operations extend beyond merely deploying a server once and leaving it. Engineers’ activities within a data centre are continuous and highly dynamic, operating 24×7. Comparing processes across these scales—deploying 10, 100, or 1000 servers—will help us understand the infrastructure needs for automation.
Planning
Planning is an important step regardless of the scale. Yet when dealing with 10 servers, planning is relatively straightforward thanks to short lead times. You can easily procure servers from a supplier within a day or two and will most likely encounter no significant issues.
However, for larger quantities like 100 or 1000 servers, the planning process differs. For procuring such volumes the planning horizon expands dramatically, with timelines stretching to a week or even a month for 100 servers, and up to a month or even half a year for 1000 servers. Therefore, meticulous long-term planning is essential to anticipate and address the requirements for such deployments.
Logistics
A similar contrast exists in terms of logistics between deploying 10, 100, or 1000 servers. Transporting 10 servers poses minimal logistical challenges and is easily manageable by any transport company. However, the logistical complexity escalates considerably with larger quantities. Transporting 100 or 1000 servers requires specialised vehicles to ensure safe and secure transportation, with packaging tailored to prevent damage during transit. Handling such volumes calls for meticulous planning and coordination to overcome potential delays and logistical constraints.
Unpacking and setting up servers become formidable tasks, involving the careful removal of protective packaging and components, as well as the allocation of sufficient space for storage and assembly. This process demands significant time and resources, particularly when dealing with extensive server deployments.
Consequently, when dealing with large volumes, we should always plan logistics carefully. This includes considering the potential need for special vehicles, insurance coverage, adequate storage for the servers, and timely disposal of packaging on-site.
Server mounting
Mounting a server in a rack is quite a complex task. Two people can manage a single unit, but for larger quantities like 5,000 servers it becomes one of the most labour-intensive and time-consuming processes. This is primarily due to the weight and height of the servers, which require stability and careful handling to prevent accidents.
When dealing with such large quantities, it’s often necessary to hire a contractor who can provide a team of engineers to handle the installation. Alternatively, if there’s no urgency, the task can be managed internally, although it will take significantly longer. This process involves meticulous work, including attaching rails and screws, which can be physically demanding and exhausting.
Hiring external contractors with qualified engineers will ensure the task is completed efficiently and safely. This approach allows swift deployment, especially when there’s pressure to generate revenue from the servers quickly. However, if time permits, a more gradual installation process can be adopted with careful planning and execution by internal teams.
Cabling
After mounting the servers, the next task is cabling, which is equally demanding. A single mistake in cable placement can lead to difficulties in locating and rectifying the error later on. Each server in the rack needs to be precisely connected, and even a small misstep can result in issues.
Troubleshooting cabling errors is very challenging and often leads to delays in operations. However, third-party contractors are not always an optimal solution here, as they may struggle with this task due to the specific standards and practices of each company. Teaching external teams these standards can take too much time and may not always yield the desired results.
Hyperscalers address this challenge by pre-assembling and pre-cabling server equipment at the production facility. This approach almost fully eliminates the need for on-site unpacking and assembly, making the deployment process much faster and easier. In my opinion, for any order exceeding 100 servers it is worth ordering pre-assembled servers with the cabling already in place.
TOR-switches installation
TOR-switches, short for Top of Rack switches, are essential components in network infrastructure, present in each project. And once again, while the installation of TOR-switches for a small number of servers is simple, scaling up to larger quantities requires much more time and expertise. Planning becomes a key solution, and in some cases, outsourcing the assembly to a factory may be the most efficient way to ensure timely and accurate deployment.
These switches, too, are often pre-configured and integrated into fully assembled racks by larger DCs. Smaller companies, however, install TOR-switches alongside the servers, mounting them within a rack and connecting them on site.
OS installation
In modest-sized DCs, operating system installation is typically carried out manually, with an engineer connecting a monitor, keyboard, and mouse to each server individually. Whether it’s Windows or Linux, the engineer initiates the installation process on each of the ten or twenty servers.
Scaling this process up to a hundred or a thousand servers requires automation. Such a system must be capable of identifying new servers, initiating the installation process, and managing the entire procedure autonomously.
Thus, it’s not uncommon for companies to run into problems when they grow from 20 to 100 or more servers. They suddenly realise that with a significant increase in server numbers they will need to automate the processes. If their systems are not ready for that, they will first need to build the integration layer that makes automation possible, which can lead to delays and financial losses. So I recommend always planning for scale, even if at the initial stage you have only 10 servers.
New rack configuration
After installing the operating systems on each server it is time for the new rack configuration. Manual configuration is possible for a small number, say 10 or 20 servers. However, when dealing with a much larger quantity, like 100 or 1000, automation becomes inevitable.
Automating the deployment process involves more than just installing the OS — it requires a comprehensive system. This system must include a pre-established database specifying the intended roles of each server, such as database servers, along with considerations for information security, including password management, access control, and encryption keys.
Without such a system in place, managing the deployment of thousands of servers becomes impractical, if not impossible. Even if the servers are physically installed and the electrical systems are operational, without an automated deployment system, their functionality is severely limited.
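To make the idea of a pre-established role database more concrete, here is a minimal sketch in Python. The serial numbers, role names, OS images, and the secret reference format are assumptions for illustration, not a description of any particular product.

```python
# Minimal sketch of a role database consulted by an automated deployment
# system. Serial numbers, roles, and the secret reference format are
# illustrative assumptions, not a real inventory.

from dataclasses import dataclass


@dataclass
class ServerRole:
    role: str        # intended role, e.g. "database" or "web"
    os_image: str    # OS image the installer should apply
    secret_ref: str  # pointer to credentials in a secrets store


# Pre-established database: serial number -> intended configuration.
ROLE_DB = {
    "SN-0001": ServerRole("database", "ubuntu-22.04", "secrets/db/sn-0001"),
    "SN-0002": ServerRole("web", "ubuntu-22.04", "secrets/web/sn-0002"),
}


def plan_for(serial: str) -> ServerRole:
    """Return the deployment plan for a newly discovered server,
    or fail loudly if it was never planned for."""
    try:
        return ROLE_DB[serial]
    except KeyError:
        raise LookupError(f"server {serial} is not in the role database") from None


if __name__ == "__main__":
    print(plan_for("SN-0001"))
```

In practice such a lookup is the first thing an automated installer does when an unknown machine appears on the network: no entry in the database means the server is parked until a human decides what it is for.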
In conclusion, automation becomes increasingly critical as the number of servers grows. This progression is intuitive — while a single server or a dozen servers may be managed manually, the complexity of managing 100 – 1000 servers needs considerable automation.
What Needs Automation
In this section, we’ll talk about the automation of different processes, with a particular focus on capacity planning and lifecycle management.
Capacity Planning and Lifecycle
I consider these processes to be the most important for automation. Surprisingly, some companies overlook the importance of automating them, assuming that they won’t encounter any issues. However, failing to automate capacity planning and lifecycle management can lead to unexpected challenges as operations scale.
At a certain point there will be a need to replace servers, but how can we make this decision without reliable data? Inevitably we’ll have to answer the following questions about each server:
- Is it outdated?
- Is it functional?
- Is it optimised for its task?
- How efficiently is it being utilised, and by which team?
Without automated systems, answering these questions becomes impossible. Consequently, decisions about purchasing new servers become mere guesswork devoid of essential data. Automated systems not only calculate but also forecast future consumption needs, helping determine the optimal number of servers required. They also identify servers that are outdated and inefficient, enabling informed decisions about decommissioning them.
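As a very simplified sketch of the kind of calculation such a forecasting system performs, the snippet below assumes steady compound growth of demand and a fixed capacity per server; both the growth model and the sample figures are my own illustrative assumptions.

```python
# Naive capacity forecast: project demand forward and convert it into a
# whole number of servers. Growth rate, per-server capacity, and the
# sample numbers are illustrative assumptions.

import math


def servers_needed(current_demand: float, monthly_growth: float,
                   months_ahead: int, capacity_per_server: float) -> int:
    """Project demand with simple compound growth, then round up to
    whole servers."""
    projected = current_demand * (1 + monthly_growth) ** months_ahead
    return math.ceil(projected / capacity_per_server)


if __name__ == "__main__":
    # 4000 demand units today, 5% growth per month, planning 6 months ahead,
    # one server covers 50 units.
    print("servers required in 6 months:", servers_needed(4000, 0.05, 6, 50))
```

A real capacity-planning system would, of course, pull consumption data from monitoring rather than constants, but the decision it supports is the same: how many servers must be ordered now so that they are racked and live before demand arrives.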
In large-scale infrastructure management, manual methods simply aren’t feasible. Automation provides the necessary precision and foresight to control thousands of units effectively.
Manufacturing
It is also worth remembering that servers can be tailored to specific preferences right at the manufacturing stage. Especially when dealing with large orders of thousands of units, customisation options become very helpful. This extends beyond the servers themselves to include packaging preferences. For instance, you can request servers to be delivered without packaging, simply with individual patch cords enclosed in the box. With such large orders, manufacturers are often flexible and can accommodate specific requests swiftly.
This level of customisation extends to the servers as well. Whether it’s omitting certain components or including specific features, manufacturers can adapt to meet your needs precisely. This flexibility is highly advantageous and can be leveraged to streamline operations.
Pre-assembled racks
We’ve already discussed this point, but let us stress it once again. The more servers you need, the more important it becomes to order pre-assembled racks.
Incoming Tests
Testing is a critical aspect that can benefit greatly from automation. When managing just 10 servers, it’s feasible to spot issues during manual installation — you might find a server not functioning, a faulty memory module, or a damaged disk. However, with a thousand servers, automating these processes becomes essential. Automated testing identifies failures and errors and allows you to fix them in time. Ideally, testing should occur at both the reception and production stages.
At the production stage, pre-assembled racks are thoroughly tested to ensure all components are functional before being dispatched. Upon receiving a pre-assembled rack, a testing script verifies its integrity, ensuring nothing has gone awry during transportation.
It is then necessary to automate incoming inspection to preempt operational issues. Without it, you may encounter difficulties during operation that lead to last-minute adjustments and repairs.
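As a rough illustration, an incoming test boils down to comparing what a server actually reports against what was ordered. In the sketch below the expected figures and the reported values are made up; a real script would read them from IPMI/Redfish or the running OS.

```python
# Sketch of an incoming test: compare the inventory a server reports against
# the ordered specification and flag mismatches. All figures are illustrative.

EXPECTED = {"dimm_count": 16, "disk_count": 12, "nic_speed_gbps": 25}


def check_server(serial: str, reported: dict) -> list[str]:
    """Return human-readable problems; an empty list means the server
    passed incoming inspection."""
    problems = []
    for key, expected_value in EXPECTED.items():
        actual = reported.get(key)
        if actual != expected_value:
            problems.append(f"{serial}: {key} is {actual}, expected {expected_value}")
    return problems


if __name__ == "__main__":
    # One DIMM apparently went missing in transit.
    reported = {"dimm_count": 15, "disk_count": 12, "nic_speed_gbps": 25}
    for issue in check_server("SN-0042", reported):
        print(issue)
```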
OS installations
The next phase of automation involves installing operating systems, a task that’s both common and well-established. There’s a wide array of products available for this purpose. In about 90% of cases companies use PXE (Preboot Execution Environment), a network-boot standard that enables automated deployment of operating systems over the network.
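As a rough sketch of how this is usually wired together, a provisioning script generates a per-machine boot entry of the kind a PXE/TFTP setup serves, pointing each new server at an unattended installer. The directory layout, kernel names, and the kickstart URL below are assumptions about a typical setup, not a prescription.

```python
# Sketch: generate per-server PXE boot entries so that a machine booting over
# the network picks up its own automated install. Paths, kernel/initrd names,
# and the kickstart URL are illustrative assumptions.

from pathlib import Path

# Assumed TFTP layout; in production this is typically /srv/tftp/pxelinux.cfg.
TFTP_ROOT = Path("pxelinux.cfg")

ENTRY_TEMPLATE = """DEFAULT install
LABEL install
  KERNEL vmlinuz
  APPEND initrd=initrd.img ks=http://deploy.example.internal/ks/{serial}.cfg
"""


def write_pxe_entry(mac: str, serial: str) -> Path:
    """PXELINUX looks for a file named 01-<mac-with-dashes>; writing one per
    server lets each machine boot into its own unattended installation."""
    TFTP_ROOT.mkdir(parents=True, exist_ok=True)
    path = TFTP_ROOT / ("01-" + mac.lower().replace(":", "-"))
    path.write_text(ENTRY_TEMPLATE.format(serial=serial))
    return path


if __name__ == "__main__":
    print(write_pxe_entry("AA:BB:CC:DD:EE:0F", "SN-0042"))
```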
TOR-switches tests and configuration
When it comes to configuring and testing TOR-switches, there aren’t many standardised solutions available. Each company makes its own decisions based on its specific setup, given the multitude of switch vendors, interfaces, and connections involved. Unlike operating systems such as Linux or Windows, where the procedures are clearer, configuring switches poses a broader and more complex challenge. A single misconfigured server may not cause significant issues, but a malfunctioning switch connected to dozens of servers can disrupt an entire rack. Consequently, configuring and testing rack switches manually is a risky endeavour. In environments with numerous servers and racks, automating the configuration and testing of rack switches is required to ensure stability and reliability.
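There is no single standard tool here, but one common building block is a script that checks, rack by rack, that every server behind a freshly configured TOR-switch actually responds. The sketch below only probes reachability with plain TCP connections; the hostnames and the management port are assumptions for illustration.

```python
# Sketch of a post-configuration check for a rack: confirm every server behind
# a newly configured TOR-switch answers on its management interface.
# Hostnames and the port number are illustrative assumptions.

import socket

RACK_HOSTS = [f"rack42-node{i:02d}.mgmt.example.internal" for i in range(1, 5)]
PORT = 22          # management SSH; substitute whatever your environment exposes
TIMEOUT_S = 2.0


def reachable(host: str) -> bool:
    """True if a TCP connection to the management port succeeds."""
    try:
        with socket.create_connection((host, PORT), timeout=TIMEOUT_S):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    unreachable = [h for h in RACK_HOSTS if not reachable(h)]
    if unreachable:
        print("possible switch or cabling problem, unreachable hosts:", unreachable)
    else:
        print("all hosts behind the TOR-switch respond")
```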
DCIM and CMDB
Moving on to DCIM and CMDB systems, let’s start with CMDB, which stands for Configuration Management Database. Essentially, it serves as an inventory, tracking the whereabouts and status of all components. It records details such as which disk is installed in which server, which server is housed in which unit, and so forth. Any changes, such as component replacements, are meticulously logged, including the engineer responsible and the date of the action. This tracking ensures accountability and facilitates maintenance tasks. Without a CMDB, managing even a modest number of assets becomes challenging.
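To illustrate the kind of record a CMDB keeps, here is a minimal sketch using SQLite; the schema, fields, and sample data are assumptions of mine, and any real product (or in-house system) will be far richer.

```python
# Minimal sketch of CMDB-style change logging: which component sits in which
# server, who touched it, and when. Schema and sample data are illustrative.

import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE component_log (
        component   TEXT,   -- e.g. a disk serial number
        server      TEXT,   -- server the component is installed in
        action      TEXT,   -- installed / replaced / removed
        engineer    TEXT,   -- who performed the action
        action_date TEXT
    )
""")

conn.execute(
    "INSERT INTO component_log VALUES (?, ?, ?, ?, ?)",
    ("DISK-7F31", "SN-0042", "replaced", "j.smith", date.today().isoformat()),
)

# A typical CMDB question: what happened to this server, and who is accountable?
for row in conn.execute(
    "SELECT component, action, engineer, action_date FROM component_log WHERE server = ?",
    ("SN-0042",),
):
    print(row)
```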
DCIM, or Data Center Infrastructure Management, serves a similar purpose but focuses on engineering equipment within data processing centres. It covers aspects like ventilation, air conditioning, and backup power systems. DCIM alerts engineers when maintenance is due, ensuring the smooth operation of critical infrastructure. Maintenance tasks are recorded, providing a comprehensive history of service actions.
Expectations and results
Now let us take a look at the expectations business teams usually have of the infrastructure team and the metrics they use to evaluate the outcome.
Time to production
Time to production is a critical metric for businesses, measuring the duration from the conception or order of a server to its deployment. It’s the fundamental aspect that businesses inquire about when seeking resources from the infrastructure team.
Understanding the timeframe for resource availability is important here as businesses base their plans on this information. An incorrect assessment of this metric can have dire consequences, leading to disrupted plans and potentially significant losses. Infrastructure cannot be expedited overnight; it requires meticulous planning and execution.
Staff
Similarly, the number of personnel required is crucial. While two engineers may suffice for a small number of servers, scaling up requires careful planning, hiring, and training. The level of automation plays a significant role; higher automation levels can reduce the manpower required for maintenance.
Warranty
Typically, server and network equipment come with robust warranties and technical support from the manufacturer. These warranties often include services like Next Business Day support, ensuring prompt assistance in case of equipment failure. However, it’s essential to note that the cost of these warranties can sometimes exceed the cost of the equipment itself. Therefore, it’s crucial to plan effectively to maximise the value of these warranties.
It often happens that equipment sits idle in a warehouse due to poor planning. In such cases, the warranty gradually loses its value as it expires over time. Thus, it is important to plan unpacking, logistics, cabling, and installation thoroughly to ensure the warranty period is not wasted and to avoid unnecessary losses. Effective planning not only safeguards the warranty but also helps optimise resources and reduce costs in the long run.
Wrapping Up
In conclusion, it is important for the business team to fully acknowledge that it is impossible to expedite infrastructure deployment. Business expectations should always consider the long-term nature of large infrastructure deployment. This is where effective communication and understanding between the infrastructure team and the business play a key role in successful project execution.