Abstract:
A data center must remain operational around the clock to support the critical applications and services hosted on its servers. Downtime incurs not only recovery costs but also loss of reputation. This research presents a reliability analysis of the RCMS Data Center, in which statistical data are analyzed and failure mechanisms are discussed. Four Systems Engineering techniques are used to assess the maintainability and reliability of the data center: Quality Function Deployment (QFD), calculation of reliabilities from the Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR) parameters, Reliability Block Diagrams (RBDs), and Failure Mode and Effect Analysis (FMEA). QFD and RBDs are used for reliability assessment, while the MTBF/MTTR calculations and FMEA are used for maintainability assessment. Data center components are prioritized according to their importance as determined by QFD; those with a higher weightage (importance value) should be acquired first. The head nodes are the most stable and most available component of the data center, whereas the Storage Area Network (SAN) is the most unstable. Special attention must therefore be paid to increasing the availability of SAN storage. The main cause of its non-availability is the frequent shutdown and restart of the data center, so the practice of turning the data center off daily must be stopped. Failures with high severity and high occurrence ratings should be given priority and addressed first. The Risk Priority Number (RPN) and the occurrence rating are significantly reduced when a recommended action is taken. All components of the RCMS data center are connected in series, so there is no redundancy, except for the compute nodes, which are connected in parallel.
A reliable data center must have its critical components connected in parallel to provide redundancy and ensure round-the-clock availability.
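The quantities named in the abstract can be illustrated with a minimal Python sketch. It assumes the standard textbook definitions: steady-state availability as MTBF / (MTBF + MTTR), series and parallel combination of independent blocks in an RBD, and RPN as the product of severity, occurrence, and detection ratings. The function names are hypothetical, and the sketch is not the computation actually performed in this study.

```python
from math import prod


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from MTBF and MTTR (same time unit)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series(reliabilities: list[float]) -> float:
    """Series RBD: the system works only if every block works."""
    return prod(reliabilities)


def parallel(reliabilities: list[float]) -> float:
    """Parallel RBD: the system fails only if every block fails."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


def rpn(severity: int, occurrence: int, detection: int) -> int:
    """FMEA Risk Priority Number as the product of the three ratings."""
    return severity * occurrence * detection
```

With purely illustrative values (not taken from the study), two blocks of reliability 0.9 give about 0.81 in series but about 0.99 in parallel, which is the redundancy argument made in the final sentence of the abstract.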