Saturday, November 17, 2012

24x7: A pipe dream or reality? And at what cost? Can a business afford not to be 24/7?

Executive Summary

What is 24x7? In simple terms, it means “Always On” or “100% availability”: the information system is available to users at any time, without interruption, no matter what day it is or where users access the system from.
The web definition of 24x7 is stated below:

“24/7 is an abbreviation which stands for "24 hours a day, 7 days a week", usually referring to a business or service available at all times without interruption.”  [1]

The more common technical term for the same concept is “availability”. As taught in graduate software engineering courses, availability is one of the non-functional requirements of an information system. To achieve 24x7 availability, an enterprise has to spend money; it does not come free, as many end users may think. In the long run, however, enterprises get a return on their 24x7 investment. [1]

How 24x7 relates to Information Systems

Today’s information systems are very competitive. For example, consider the social networking systems available today. No doubt the most popular is Facebook, but various similar systems compete with it. Some are older than Facebook, while others are still emerging, e.g. Google+. The most critical factor, and the one that contributes most to the success of the business, is the availability of the system. The system should be available whenever users want to use it, at times convenient for them. Imagine a situation where Facebook is down or unavailable for a couple of hours, or even a couple of minutes. What do you think the impact on Facebook would be? Now imagine such availability issues happening on a regular basis. That would be the point at which users begin to move away from Facebook to more stable systems. Users may slowly switch to a similar system such as Google+ or Hi5. This shift builds up gradually because of poor availability, no matter how many elegant features the system delivers to its users. Facebook is one example of a system in which 24x7 plays a major role.

Consider an airline reservation system. This can be considered a mission-critical system, where availability is the topmost priority. Failure of such a system results in the failure of business operations that are critical to the organization's mission. It is not hard to imagine the impact on passengers when an airline reservation system fails. The impact would not be limited to one country; it would be global. This emphasizes the importance of system availability. The following news article gives evidence of such an airline system failure, which lasted about five hours. [2] An excerpt from the article is given below.

“United said that the system failure affected "departures, airport processing, and reservations" which caused flights not to be able to take off from some locations including its hub in San Francisco International Airport (SFO).
The airline cited "a network connectivity issues" as the reason for the computer systems failure which were fixed through "troubleshooting procedures"”

Similarly, there are many other information systems in which 24x7 plays a major role, e.g. defense systems, stock exchange systems and many other financial applications.
Most importantly, mission-critical systems need 24x7 availability. Those systems cannot afford even a second of downtime.

Challenges of 24x7

Why is it difficult to provide 24x7 availability continuously? To answer that, we need to think about the maintenance of information systems. Any information system runs on top of an operating system such as Windows or Linux. Information systems communicate with other systems through communication channels, which are essentially networks. The following bullet points discuss several challenges faced when providing 24x7 support.

•    OS patching - The operating system must be updated on a regular basis to apply important security patches and updates to other OS components. After security patches are applied, the machine usually has to be restarted. During the restart, the systems hosted on those servers are not accessible to users, which breaks the concept of 24x7. This is one of the most challenging factors for 24x7.

•    System changes - Any software system evolves over time to accommodate changes in the business. The changes may be modifications to the existing code base or implementations of new requirements. Once code is developed and tested, it has to be pushed to the production environment for customers to use. During the release the system may need downtime, either for a single feature or for the entire system. This is another challenge for 24x7.

•    Database modifications - The main challenge here is schema modification. The database is a central component that stores all the enterprise data and retrieves it in response to customer requests. Making schema changes in production typically requires downtime for the systems that access the databases being modified.

•    Database patching - As with OS patching, database systems need patches from time to time, which requires shutting the database system down. During this period, the systems that access data in the database will not function.

•    Administrative work - Like any other product or system, software systems need certain administrative work, especially the rebuilding of database indexes and similar tasks. All of this work may need downtime, which makes the enterprise systems unavailable.

•    Hardware changes - Just as you replace worn-out parts in your automobile, it is necessary to replace parts of the computers that host enterprise applications. You may need to upgrade the servers to newer models with more processing power, increase memory capacity, or replace or add hard disks. This can interrupt enterprise applications. Network outages, which occur for various reasons, can also interrupt systems.

•    Natural threats - The most important challenge is protecting enterprise systems from natural disasters and terrorist activities. There is a plethora of examples in both categories, for example the terrorist attack on the World Trade Center (WTC) in the USA and the floods that have occurred in many countries.

Consider the WTC attack. What happened to the systems and data on site at the WTC? No doubt everything on site could have been destroyed. Then what happened to the systems' data? Was it no longer available to users? Perhaps that was the case for some systems, but there could also have been systems that continued operating.

•    Cyber threats - Protecting the data in enterprise applications is the most difficult task today. All the world’s superpowers are preparing for cyber war, which will be the next phase of war between countries. Some countries have already accused other countries of illegally accessing their valuable data. Data security is the most important factor for 24x7. If the system is compromised by a professional hacker, its availability is in jeopardy.

24x7: a dream or reality?

With all the challenges mentioned above, can enterprise applications achieve 24x7, or is it merely a theoretical concept? Information technology keeps evolving, with new concepts and implementations. With the advanced technologies available today, 24x7 is no longer a dream. It is achievable, and many enterprise applications already use these technologies. That is why systems like Google, Facebook, stock exchange systems, Amazon and eBay are available at any time.

How to overcome challenges?

To overcome the above challenges, one needs to understand how enterprise systems are built. Any enterprise system is built from hardware components, software components, communication components, etc. The challenges listed under “Challenges of 24x7” are directly connected to these different components.
The main hardware components of any system are physical servers, CPUs, memory, UPS (backup power), network interface cards, storage, etc.

Data center concept

Large enterprise systems usually have data centers. A data center is a place where all the physical servers and other hardware components are connected to each other in order to host systems and provide services to end users. A data center consists of many physical servers; for an extremely large system, there can be thousands. There can be several data centers, depending on how large the system is. One data center communicates with another through communication channels. Each data center should have redundant power systems, so that the failure of a single power system does not bring down the entire data center. Most hardware components are deployed redundantly to make sure that one failure does not fail the whole system.
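
To see why redundancy helps, here is a minimal Python sketch of the standard availability arithmetic. It assumes that component failures are independent, which real data centers only approximate, and the function names and the 99% figure are illustrative assumptions rather than anything taken from a specific product.

```python
from typing import Iterable


def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components when any single one is
    enough to keep the service running (failures assumed independent)."""
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The group is down only when every copy is down at the same time.
    return 1.0 - (1.0 - component_availability) ** copies


def series_availability(availabilities: Iterable[float]) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one power system
    print(f"one power system:  {parallel_availability(single, 1):.4f}")
    print(f"two power systems: {parallel_availability(single, 2):.4f}")
    print(f"two non-redundant parts in a chain: {series_availability([single, single]):.4f}")
```

Under these assumptions, a second power system moves availability from 99% to 99.99%, while two non-redundant parts that must both work pull it down to about 98%.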

Data center security


A data center is secured in many ways. It is ideally located in a place with no known risk of natural disasters such as earthquakes, floods or tsunamis. Special security protection mechanisms are built into data centers. Surveillance cameras and access controls keep track of access events, and software and hardware systems are in place to prevent unauthorized access.
Still, there can be threats to data centers regardless of how many measures are taken to secure them. A terrorist attack or an unexpected natural disaster could destroy any data center, leaving the systems hosted there inaccessible.
As discussed in the data center concept section, the solution is to apply redundancy to the data centers as well. Another data center can be added as a passive data center while the active data center provides the services. If the active data center becomes completely unavailable, the passive data center can take over all of its responsibilities. As a result, no systems break and business continuity is preserved. [3]
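
The active/passive idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `FailoverRouter`, the site names and the "three consecutive failed health checks" rule are all hypothetical, and a real failover also has to deal with data replication, which is not shown here.

```python
class FailoverRouter:
    """Sends traffic to the active site and promotes the passive site after
    several consecutive failed health checks (a hypothetical policy)."""

    def __init__(self, active: str, passive: str, failure_threshold: int = 3) -> None:
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.active = active
        self.passive = passive
        self.failure_threshold = failure_threshold
        self._consecutive_failures = 0

    def record_health_check(self, ok: bool) -> None:
        """Feed in the result of one health check against the active site."""
        if ok:
            # Any success resets the counter, so brief glitches do not fail over.
            self._consecutive_failures = 0
            return
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            self._promote_passive()

    def _promote_passive(self) -> None:
        # Swap roles: the old passive site now serves traffic.
        self.active, self.passive = self.passive, self.active
        self._consecutive_failures = 0

    def current_target(self) -> str:
        """Name of the site that should receive traffic right now."""
        return self.active


if __name__ == "__main__":
    router = FailoverRouter("dc-primary", "dc-standby")
    for result in (True, False, False, False):
        router.record_health_check(result)
    print(router.current_target())
```

Requiring several failures in a row is a design choice that avoids switching data centers because of one lost health check.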

Data communications

There have to be communication channels between data centers. Through these channels, enterprise applications can communicate with the data centers and connect to the World Wide Web, so that they can reach many different geographical locations.

System Security

System security is a top priority that needs attention and continuous monitoring. There are hackers in the world who try to break a system's security, with or without a purpose. Measures must be taken to prevent such security breaches; otherwise, business continuity is at risk. The following excerpt from the internet describes how one company ensures system security.

“Under the direction of the Nippon Express Corporate Compliance Committee program and decision, Nippon Express system security best practice is based on the ISO-27001, SOX, J-SOX, HIPPA, ISO-9001 and SAS 70 Type-2 compliances with ISO-17991 and CoBit controls. IT provides robust systems with enhanced security for internal networks, at all network perimeters, and in data exchanges with customers on external networks. All network access that traverses between internal and external networks, such as EDI with clients, Web, FTP and VPN are routed through redundant enterprise-class firewalls managed and monitored by IBM at the e-Business Hosting Center. Perimeter security is controlled using a sophisticated design of multiple DMZ. Routine security tests (external hacking) and IDS (Intrusion Detection Systems) are provided by IBM. Any production change is reviewed by the weekly change control committee meeting and weekly IT management meeting to review, justify, approve, implement and trace. System security incidents and issues are tracked and escalated for prevention and quick resolution. Our IT best practices are being validated by the combination of annual audits by Deloitte and quarterly technical audits through special technical auditors including password management and patch management. Customer data is secured and protected with multiple levels of security access control (internet, operating system, application, database, transmission) including personal information privacy data protection.” [4]

Mitigation planning

A mitigation plan comes into play if something goes wrong. Even with a mitigation plan in place, the plan may not work as expected or may fail outright. If that happens, there should be a plan to get the business back on track as quickly as possible. The following excerpt from the internet describes how business continuity is ensured at a company called Nippon Express.

“Nippon Express Management has developed a Business Continuity Plan on how we will respond to events that significantly disrupt our business situations including, but not limited to, power outages, major water leaks, fire, loss of water, severe weather, and any facilities failures that may cause business interruptions. Since the timing and impact of disasters and disruptions is unpredictable for the global logistics business, we will have to be flexible with business partners in responding to actual events as they occur. Our business continuity plan is to quickly recover and resume business operations after a significant business disruption and respond by safeguarding our employees and property, making a financial and operational assessment, protecting the company assets, and allowing our customers to transact business. In short, our business continuity plan is designed to permit our company to resume operations as quickly as possible, given the scope and severity of the significant business disruption. Our business continuity plan addresses annual reviews of the following: data backup and recovery; all mission critical systems; financial and operational assessments; alternative communications with customers, employees, and regulators; alternative physical location of employees; critical supplier, contractor, bank and counter-party impact; regulator reporting; and assuring our customers prompt access to their business operational data. Other business partners with whom we do business are required to maintain business continuity plans also. Along with all proven redundant systems with multiple business continuity rehearsals at Nippon Express, our current IT Infrastructure including MPLS WAN with two IBM data center solutions in NJ and IL provide an average of 99.99% technical service level ratio results to support Nippon Express business continuity.
Nippon Express continues to improve for the worst system disaster scenarios with a new RTO (recovery time objective) target of zero to 24 minutes while RPO (recovery point objective) is zero to 24 seconds. Furthermore, Nippon Express makes daily data backup on tapes to store at the Iron Mountain offsite storage per corporate compliance record retention policy. Also our IT capacity planning and on-going monitoring allows to provide adequate system capability for today and future business requirements.” [4]
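
To put a figure such as 99.99% into perspective, it helps to convert an availability percentage into the downtime it permits. The following is a small Python sketch of that arithmetic; the function name `allowed_downtime` and the choice of a 365-day year are illustrative assumptions.

```python
from datetime import timedelta


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Maximum total downtime within `period` that still meets the target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    # The unavailable fraction of the period is the downtime budget.
    return period * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    year = timedelta(days=365)
    for target in (99.0, 99.9, 99.99, 99.999):
        budget = allowed_downtime(target, year)
        minutes = budget.total_seconds() / 60
        print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

At 99.99% the budget is roughly 52.6 minutes per year, and at 99.9% it is about 8.76 hours. Literal 100% availability leaves no budget at all, which is why each additional "nine" costs more to achieve.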

Tool support for 24x7

Tools are an essential factor in supporting 24x7. With tools, potential failures can be identified in advance, so system administrators can take remedial action to prevent them.
System administrators need to implement monitoring mechanisms that check the health of all the enterprise's systems on a regular basis. The following excerpt from the internet describes tool support for 24x7.

“24x7 Event Server is a flexible and relentless watchdog. It can detect occurrences of single point-in-time events such as error record or security alert written to system event log, CPU usage or free disk space reaching pre-defined threshold level, startup or shutdown of a specific service or application, and many other. It can also detect persistent or recurring events such as applications consuming too much CPU for a certain period of time, database or web server slow responses to user requests, hanged backup processes and many others. 24x7 Event Server supports a number of event response processing methods, so once it detects an event and alerts you to potentially damaging situations it can then take automated corrective actions like restarting failed application or service, running a fix-it user-defined script and more. 24x7 Event Server provides straightforward wizard-driven user interface that is easy to use and quick to learn. No special training is required.” [4]
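
The threshold-style checks described in the excerpt can be illustrated with a minimal Python sketch. This is not the API of 24x7 Event Server or any other product; the names `Alert`, `evaluate_minimum` and `run_checks` and the 10% free-disk threshold are assumptions made purely for illustration. Only `shutil.disk_usage` comes from the Python standard library.

```python
import shutil
from typing import Callable, Dict, List, NamedTuple, Optional, Tuple


class Alert(NamedTuple):
    check: str
    message: str


def free_disk_percent(path: str = ".") -> float:
    """Percentage of free space on the disk that holds `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def evaluate_minimum(check: str, value: float, minimum: float) -> Optional[Alert]:
    """Return an Alert when `value` has dropped below `minimum`, else None."""
    if value < minimum:
        return Alert(check, f"{check} is {value:.1f}, below the minimum of {minimum:.1f}")
    return None


def run_checks(checks: Dict[str, Tuple[Callable[[], float], float]]) -> List[Alert]:
    """Run every probe once and collect an alert for each breached threshold."""
    alerts: List[Alert] = []
    for name, (probe, minimum) in checks.items():
        alert = evaluate_minimum(name, probe(), minimum)
        if alert is not None:
            alerts.append(alert)
    return alerts


if __name__ == "__main__":
    # Hypothetical policy: alert when less than 10% of the disk is free.
    for alert in run_checks({"free_disk_percent": (free_disk_percent, 10.0)}):
        print(f"ALERT [{alert.check}]: {alert.message}")
```

In practice a scheduler would run such checks at a fixed interval and hand the alerts to a notification or automated-recovery step, so that administrators can act before users notice a failure.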

Conclusion


The author has experience supporting 24x7 availability for SQL Server databases. In the author’s view, 24x7 is very much achievable in today’s context with the support of advanced technology, and the evidence given above supports this idea. However, going 24x7 is not a wise idea for every enterprise. 24x7 does not come free; it has a cost. Beyond the cost, the enterprise should first establish whether it actually requires 24x7, and then look at the financial and technical feasibility.

References

[1] "Wikipedia," [Online]. Available: http://en.wikipedia.org/wiki/24x7. [Accessed 24 March 2012].
[2] "United Airlines Computer System Failure," IBTimes, 18 June 2011. [Online]. Available: http://sanfrancisco.ibtimes.com/articles/165269/20110618/united-airlines-computer-systems-failure-merger-related.htm. [Accessed 15 April 2012].
[3] "Data Centers," [Online]. Available: http://www.secure-24.com/data-centers/. [Accessed 15 April 2012].
[4] "NIPPON EXPRESS," [Online]. Available: http://www.nipponexpressusa.com/information-systems/overview. [Accessed 18 April 2012].
