
Mission-Critical Computing

Lower Latency

Latency is the amount of time a message takes to traverse a computer network, typically measured in milliseconds (ms). Lower latency means a smaller delay in moving and processing data over a network connection; the lower the latency, the closer the connection comes to real-time access. Latency below 100 ms is generally considered good, and below 50 ms very good. Typical DSL or cable Internet connections have latencies of less than 100 ms, while satellite connections usually have latencies of 500 ms or higher. In general, LAN connections are faster and have lower latency, while WAN connections have higher latency.
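As an illustration (not part of the original text), one simple way to estimate the latency an application actually sees is to time a round trip yourself. The minimal Python sketch below times a TCP connection handshake and classifies the result using the rough thresholds above; the host and port are arbitrary examples.

```python
# Minimal sketch: estimate network latency by timing a TCP handshake,
# then classify it with the rough thresholds quoted in the text above.
import socket
import time

def tcp_connect_latency_ms(host: str, port: int = 443, timeout: float = 2.0) -> float:
    """Return the time, in milliseconds, taken to open a TCP connection."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection established; close it immediately
    return (time.perf_counter() - start) * 1000.0

def classify(latency_ms: float) -> str:
    if latency_ms < 50:
        return "very good (< 50 ms)"
    if latency_ms < 100:
        return "good (< 100 ms)"
    return "high latency (>= 100 ms)"

if __name__ == "__main__":
    ms = tcp_connect_latency_ms("example.com")  # example host, not from the original text
    print(f"TCP connect latency: {ms:.1f} ms -> {classify(ms)}")
```

Note that a TCP handshake covers a full round trip plus connection setup, so it only approximates the one-way latency discussed above.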

Latency is the sum of several kinds of delay: propagation delay (a function of distance and the physical medium), transmission delay (a function of packet size and link bandwidth) and processing or queuing delay (added at proxy servers, routers and other network hops). The impact of latency on network throughput can be temporary (lasting a few seconds) or persistent (constant), depending on the source of the delays. Excessive latency creates bottlenecks that prevent data from filling the network pipe, decreasing throughput and limiting the effective bandwidth of a connection.
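To make these components concrete, here is a hedged back-of-the-envelope model (my own illustration; the figures are assumptions, not measurements):

```python
# Back-of-the-envelope latency model: propagation + transmission + per-hop processing.
# All numbers below are illustrative assumptions, not measurements.

SPEED_IN_FIBER_KM_PER_MS = 200.0  # light travels roughly 200 km per millisecond in fiber

def one_way_latency_ms(distance_km: float,
                       packet_bits: int,
                       link_bps: float,
                       hops: int,
                       per_hop_processing_ms: float = 0.1) -> float:
    propagation_ms = distance_km / SPEED_IN_FIBER_KM_PER_MS
    transmission_ms = (packet_bits / link_bps) * 1000.0
    processing_ms = hops * per_hop_processing_ms
    return propagation_ms + transmission_ms + processing_ms

# Example: a 1500-byte packet over a 100 Mbit/s path spanning 3,000 km and 12 hops.
print(one_way_latency_ms(3000, 1500 * 8, 100e6, 12))  # about 16.3 ms
```

Even in this simplified model, propagation delay dominates over long distances, which is one reason geographic proximity to a data center matters for latency-sensitive workloads.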

Lower latency is especially important in industries that rely on real-time applications and live streaming, such as banking, diagnostic imaging, navigation, stock trading, weather forecasting, collaboration, research, ticket sales, video broadcasting and online multi-player gaming. Cloud latency, or the amount of time it takes for a cloud-based service to respond to a user’s request, is an important criterion when choosing a cloud provider. Cloud latency is affected by where users connect to the cloud, which cloud data center they connect to, which network provider is used, the route of network traffic and other factors.

Geo-Clustering

Geo clustering is computer clustering over geographically dispersed sites. A basic cluster is a group of independent computers called nodes, usually housed in the same physical location, that work together to run a common set of applications. The nodes are physically connected by network and storage infrastructure and logically connected by clustering software. Unlike a basic cluster, a geo cluster disperses its nodes to several different physical locations.

Geo clustering protects data and IT resources from location-related disasters such as fires, floods, electrical outages and malicious damage. The cluster nodes are separated geographically, and their data is synchronously mirrored between sites. A geo cluster is unaware of the physical distance between its nodes. In a geographically dispersed cluster, the public and private network interfaces must exist in the same network segment, and the cluster nodes must share the same IP subnet for failover purposes.

Computer clustering relies on centralized management software that makes the nodes available as orchestrated servers. The right enterprise operating system can prevent application downtime with clustering that replicates data across multiple computer clusters and provides service failover across any distance with geo clustering. SUSE Linux Enterprise High Availability Extension can protect workloads across globally distributed data centers, allowing companies to deploy both physical and virtual Linux clusters between data centers located anywhere in the world and ensuring business continuity.

Business Continuity

Business continuity is the ability of an organization to maintain its essential functions during and after a disaster. Business continuity planning focuses on establishing risk management procedures to prevent interruptions to mission-critical services and re-establish full functionality with as little downtime as possible. A business continuity plan considers various unpredictable events such as natural disasters, hardware and software failure, fires, disease outbreaks, loss of key personnel, cyberattacks, supply chain failures, market upheavals and other external threats. All sources of possible business disruption must be identified and assessed for their probability and harm to the organization’s operation.

Business continuity includes building resilience into critical business functions and their supporting IT infrastructures, which must be designed to withstand disruptions, for example through the use of redundancy, spares and mirroring. In addition to risk management, business continuity may also include governance, information security and compliance.

Large enterprises may use geo clustering for both data center high availability and business continuity. Geo clustering allows organizations to have multiple, geographically dispersed sites with a local cluster each. In the event of a regional disaster, mission-critical workloads can automatically failover or be manually transferred to another data center or cluster outside of the disaster zone. Extending data center capabilities across unlimited distances maximizes an organization’s tolerance for regional catastrophic events, ensuring it maintains business continuity, protects data integrity and minimizes unplanned downtime for its mission-critical workloads. Geo Clustering for SUSE Linux Enterprise High Availability Extension enables business continuity by deploying physical and virtual Linux clusters between data centers located anywhere in the world.

Continuous Integration

Continuous integration is a software development approach where developers regularly merge or integrate code changes into a central repository. Automated builds and tests are then run from the shared repository, immediately surfacing any functional or integration errors. In the past, developers often worked in isolation for extended periods, merging code changes only after their work was completed. Merging long-accumulated code changes was difficult, time-consuming and laden with bugs. With continuous integration, developers frequently merge changes to a shared repository using a version control system. Every revision committed to the repository triggers an automated build and test.
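The sketch below is a deliberately simplified, hypothetical illustration of this loop in Python; a real team would use a CI server (for example Jenkins or GitLab CI) driven by repository hooks rather than polling, and the repository path and build/test commands here are assumptions.

```python
# Minimal sketch of the continuous-integration idea: every new commit in the
# shared repository triggers an automated build and test run.
# The repository path and the build/test commands are illustrative assumptions.
import subprocess
import time

REPO = "/srv/shared-repo"                 # hypothetical shared repository
BUILD = ["make", "build"]                 # hypothetical build command
TESTS = ["python", "-m", "pytest", "-q"]  # hypothetical test command

def head_commit() -> str:
    return subprocess.check_output(
        ["git", "-C", REPO, "rev-parse", "HEAD"], text=True
    ).strip()

def build_and_test(commit: str) -> None:
    for step in (BUILD, TESTS):
        if subprocess.run(step, cwd=REPO).returncode != 0:
            print(f"commit {commit[:8]}: FAILED at '{' '.join(step)}' -- notify the team")
            return
    print(f"commit {commit[:8]}: build and tests passed")

if __name__ == "__main__":
    last = None
    while True:  # polling loop; a real CI server reacts to push events instead
        subprocess.run(["git", "-C", REPO, "pull", "--ff-only"], check=False)
        current = head_commit()
        if current != last:  # a new revision was committed -> run the pipeline
            build_and_test(current)
            last = current
        time.sleep(30)
```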

Continuous integration usually refers to the build and unit testing stages of the DevOps software process. The goals of continuous integration are to find and address errors quickly, improve software quality, and accelerate the validation and release of new software updates. This iterative building and testing process overcomes many problems that formerly occurred between the planning, design and implementation stages of the software development lifecycle. Regular merging of team members’ code prevents software projects from drifting too far from a functional state and avoids catastrophic merge conflicts as new code is added by others. With continuous integration, when developers check in bad code, the system automatically notifies the entire team within minutes. This prevents new versions from accidentally building upon or relying on bad code.

Continuous integration ensures bugs are caught early in the development cycle, which makes them less expensive to fix. It also reduces the time to market for new applications and helps businesses respond quickly to changing customer needs. Open source technologies and development tools can support continuous integration and the DevOps model with automation and lifecycle management products. For example, SUSE Manager includes automated deployment and configuration tools for software development projects.

Continuous Deployment

Continuous deployment is a software development approach in which every code change goes through the entire pipeline and is put into production automatically, resulting in many production deployments every day. Unlike continuous delivery, which requires human intervention to approve production deployment, continuous deployment automatically pushes any updated, working version of the software to production. In a sense, continuous deployment is the next step of continuous delivery: every change that passes the automated tests is deployed to production automatically. The goal of continuous deployment is to release applications (and business value) to end users faster and more cost-effectively by managing small, incremental software changes.
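As a rough illustration of that difference (my own sketch, not any product’s pipeline), the Python fragment below deploys automatically as soon as the automated tests pass, with no manual approval gate; the build, test and deploy commands are hypothetical placeholders.

```python
# Sketch of a continuous-deployment pipeline stage: if the automated build and
# tests succeed, the change is released to production automatically.
# All commands below are hypothetical placeholders.
import subprocess

def run(step: list[str]) -> bool:
    return subprocess.run(step).returncode == 0

def deployment_pipeline(commit: str) -> None:
    if not run(["make", "build"]):
        print(f"{commit[:8]}: build failed, stopping")
        return
    if not run(["python", "-m", "pytest", "-q"]):
        print(f"{commit[:8]}: tests failed, stopping")
        return
    # Continuous deployment: no human approval step between tests and production.
    if run(["./deploy.sh", "--env", "production", "--revision", commit]):
        print(f"{commit[:8]}: deployed to production")
    else:
        print(f"{commit[:8]}: deployment failed, roll back")

deployment_pipeline("0123abcd")  # example revision id
```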

Continuous deployment is usually managed by application release automation tools. Historically, the activities of deploying to testing and production were handled by testing and data center teams, separate from the software development team. Agile operations and DevOps lifecycles have modernized the development process with automation, putting engineering, operations and support teams together as one team. Application containers can bring developers and operations engineers closer together by providing a shared resource. Containers can also facilitate the Build step of the DevOps lifecycle. By design, containers are easy to scale and can support the growth of new business applications.

Open source technologies and development tools can support continuous deployment and the DevOps model with automation and lifecycle management products.

Continuous Delivery

Continuous delivery is an iterative software development approach that ensures every change to a system can be released. The goal is to make software releases reliable so organizations can deliver them frequently, reduce risks, get faster feedback from end users, and shorten time-to-market for innovations. In continuous delivery, every code change is built, tested and then pushed to a non-production testing or staging environment. There can be multiple and parallel test stages before a production deployment. Tests may include user interface testing, load testing, integration testing, API reliability testing and others.

Continuous delivery lets developers automate testing so they can verify application updates across multiple dimensions before deploying the software to customers. Every revision triggers an automated flow that builds, tests and then stages the update; the final deployment to a live production environment is triggered manually by the developer. Automated testing helps developers validate updates and discover any issues with the changes. With frequent, automated testing, developers can discover and address bugs early in the product development process, before they create problems.
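To contrast with the continuous-deployment sketch above, the hedged example below stops after staging and waits for an explicit human decision before releasing to production; the commands and environment names are hypothetical.

```python
# Sketch of continuous delivery: every change is built, tested and staged
# automatically, but the final push to production is a human-triggered step.
# Commands and environment names are hypothetical.
import subprocess

def run(step: list[str]) -> bool:
    return subprocess.run(step).returncode == 0

def delivery_pipeline(commit: str) -> None:
    stages = [
        ["make", "build"],
        ["python", "-m", "pytest", "-q"],                   # unit / integration tests
        ["./deploy.sh", "--env", "staging", "--revision", commit],
        ["./run_acceptance_tests.sh", "--env", "staging"],  # UI, load, API tests, etc.
    ]
    for stage in stages:
        if not run(stage):
            print(f"{commit[:8]}: stopped at '{' '.join(stage)}'")
            return
    # Manual gate: the production release is a deliberate, developer-triggered decision.
    answer = input(f"Release {commit[:8]} to production? [y/N] ")
    if answer.strip().lower() == "y":
        run(["./deploy.sh", "--env", "production", "--revision", commit])

delivery_pipeline("0123abcd")  # example revision id
```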

Continuous delivery lets developers more easily perform additional types of tests on their code because the process is automated. Cloud computing allows development teams to cost-effectively automate the creation and replication of multiple testing environments without affecting the on-premises environment. Open source technologies and development tools support continuous delivery with automation and lifecycle management products.

Real-Time

Real time is a guaranteed level of computer responsiveness within a specified time constraint, usually milliseconds or microseconds, between an event and its response deadline. Real time describes a human sense of time (rather than machine time) that seems immediate. For example, real-time weather maps appear to portray immediate changes, when actually several milliseconds may have elapsed between image updates. In contrast to mainframe batch computers, where processes are grouped and scheduled for processing at later times, real-time computer processes occur without delay. In computer simulations, “real time” means the simulation’s clock runs at the same speed as an actual clock, not in fast-motion or slow-motion.

Real-time computing (RTC), also known as reactive computing, describes hardware and software systems guaranteed to respond within a given timeframe. Real-time software applications often require synchronous programming languages, real-time operating systems and real-time networks to execute processes by the specified deadline, regardless of system load. Examples of real-time software include aircraft navigation programs, multimedia broadcasts, multi-player video games, data analysis programs and stock-trading applications.
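The fragment below is only a rough illustration of the deadline idea in plain Python; ordinary Python on a general-purpose kernel cannot guarantee deadlines, which is exactly why real-time workloads use synchronous languages, real-time operating systems or a real-time kernel. The period and deadline values are made up.

```python
# Rough illustration of a periodic task with a response deadline.
# Plain Python cannot *guarantee* deadlines; real-time systems rely on an RTOS
# or a real-time kernel for that. This sketch only detects deadline misses.
import time

PERIOD_MS = 10.0   # the task is released every 10 ms (illustrative value)
DEADLINE_MS = 2.0  # ...and must finish its work within 2 ms (illustrative value)

def do_work() -> None:
    sum(range(1000))  # stand-in for the real workload

def periodic_loop(cycles: int = 100) -> None:
    next_release = time.perf_counter()
    misses = 0
    for _ in range(cycles):
        start = time.perf_counter()
        do_work()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if elapsed_ms > DEADLINE_MS:
            misses += 1  # a hard real-time system would treat this as a failure
        next_release += PERIOD_MS / 1000.0
        time.sleep(max(0.0, next_release - time.perf_counter()))
    print(f"{misses} deadline misses out of {cycles} cycles")

periodic_loop()
```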

The success of many large enterprises depends on the real-time responsiveness of their IT systems. A business that can respond more quickly to new information and changing market conditions gains an advantage over businesses that cannot. Running time-sensitive applications in real time reduces process delays (latencies), offering an advantage over competitors. Reduced latency, combined with virtualization, maximizes application performance. Operating systems with a pre-emptible real-time kernel can provide precise timing and synchronization for advanced application workloads. SUSE Linux Enterprise Real Time is an open source, real-time operating system designed to reduce latency and increase the predictability and reliability of time-sensitive and mission-critical applications.

High-Performance Computing

High Performance Computing (HPC) is the IT practice of aggregating computing power to deliver more performance than a typical computer can provide. Originally used to solve complex scientific and engineering problems, HPC is now used by businesses of all sizes for data-intensive tasks. Companies that provide automotive engineering, pharmaceutical design, oil and gas exploration, renewable energy research, entertainment and media, financial analytics, and consumer product manufacturing rely on HPC for scalable business computing.

An HPC system is typically a cluster of computers or nodes, with each node containing one to four processors and each processor containing two to four cores. A common cluster size in many businesses is between 16 and 64 nodes, with 64 to 256 cores. Linux is the dominant operating system for HPC installations. HPC applications usually require fast network and storage performance, large amounts of memory and very high compute capabilities. An HPC cluster running in the cloud can scale to larger numbers of parallel tasks than most on-premises environments. A cloud-based HPC cluster allows users to focus on their applications and research output instead of IT maintenance. HPC cloud systems only charge for the services clients actually use, so businesses can optimize costs without paying for idle compute capacity.
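Taking the figures above at face value, the aggregate core count is simple arithmetic (a quick illustration, not from the original text):

```python
# Quick arithmetic for the cluster sizes described above:
# total cores = nodes x processors per node x cores per processor.
def total_cores(nodes: int, processors_per_node: int, cores_per_processor: int) -> int:
    return nodes * processors_per_node * cores_per_processor

# e.g. 16 nodes, each with one 4-core processor ...
print(total_cores(16, 1, 4))  # 64 cores
# ... up to 64 nodes, each with one 4-core processor
print(total_cores(64, 1, 4))  # 256 cores
```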

HPC infrastructures are complex in both design and operation, involving a large set of interdependent hardware and software elements that must be precisely configured and seamlessly integrated across a growing number of compute nodes. SUSE Linux Enterprise for HPC is designed for scalability and includes tools that simplify configuration and management. SUSE Linux Enterprise Real Time is ideal for time-sensitive HPC applications.

Mission-Critical Computing

Mission-critical computing, also known as a mission-critical system, is any IT component (software, hardware, database, process, application, etc.) that performs a function essential to business operations. Failure or disruption of a mission-critical factor adversely affects the mission or activity which it supports, and may affect the survival of the business as a whole. For example, online banking systems, traffic control systems and communication systems are mission-critical computing systems that result in business loss if they fail.

Mission-critical computing may also be used as an umbrella term for any IT system or network device whose loss would cause business operations to fail. System resilience, high availability, redundancy, backup and other IT strategies can protect business operations and maintain uptime if a mission-critical component fails. For large enterprises, database systems and process control servers are considered mission-critical computing. Data centers and database servers for Internet applications must be protected from potential failures caused by software corruption, power shortages, faulty hardware and environmental disasters. Replicating data across multiple server clusters is another way to protect mission-critical computing workloads, prevent application downtime and assure data availability. A server cluster is a group of linked servers that work together to improve system performance, load balancing and service availability. If a server fails, other servers in the cluster can take over the functions and workloads of the failed server.

To extend high availability for mission-critical computing, physical and virtual servers can be clustered together. Geo clustering helps minimize the effects of regional disasters, providing service failover across any distance. On Linux-based systems, live kernel patching can reduce downtime by allowing on-the-fly fixes without rebooting. SUSE Linux Enterprise High Availability Extension virtually eliminates unplanned downtime with features that protect mission-critical computing. SUSE Linux Enterprise Server for SAP Applications includes an integrated clustering solution for physical and virtual Linux systems to protect mission-critical SAP HANA databases.

Downtime

Downtime is a computer industry term for the time during which a computer or IT system is unavailable, offline or not operational. Downtime has many causes, including shutdowns for maintenance (known as scheduled downtime), human errors, software or hardware malfunctions, and environmental disasters such as power outages, fires, flooding or major temperature changes. In industrial environments, downtime may refer to failures in production equipment; this type of downtime is often measured as downtime per work shift or per 12- or 24-hour period. Downtime duration is the period of time during which a system fails to perform its primary function. Communications failures, for example, may cause network downtime.

In IT environments, downtime can be one of the metrics used for system availability. Availability is often measured against a 100% operational or never-fails standard. A common standard of availability is 99.999%, known as “five 9s” availability. Two 9s would be a system that guarantees 99% availability in a one-year period, allowing up to 1% downtime, or 3.65 days of unavailability. Service level agreements (SLAs) often use monthly downtime or availability percentages for billing calculation. Scheduled downtime for system updates and routine maintenance is usually not included in the availability percentages for SLA contracts. For provisioning, service level agreements may use uptime and downtime percentages to describe the dependability of the various services available to clients. Such percentages also help determine the value of each service, as most clients desire continuous real-time availability (zero downtime).
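As a worked example of these percentages (my own illustration), the allowed downtime per year follows directly from the availability figure:

```python
# Allowed downtime per year for a given availability percentage:
# downtime = (1 - availability) * one year.
MINUTES_PER_YEAR = 365 * 24 * 60

def allowed_downtime_minutes_per_year(availability_percent: float) -> float:
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR

for label, availability in [("two 9s", 99.0), ("three 9s", 99.9),
                            ("four 9s", 99.99), ("five 9s", 99.999)]:
    minutes = allowed_downtime_minutes_per_year(availability)
    print(f"{label} ({availability}%): {minutes / 60 / 24:.2f} days "
          f"= {minutes:.1f} minutes of downtime per year")
# two 9s allows about 3.65 days per year; five 9s allows only about 5.3 minutes.
```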

Large enterprises increasingly depend on high availability for IT services and applications delivered through the cloud. IT organizations may deploy server clusters, described above, to improve availability and reduce unscheduled downtime: if a server fails, the other servers in the cluster take over its functions and workloads. SUSE Linux Enterprise Server can help businesses minimize downtime by exploiting hardware reliability, availability and serviceability features, by providing server clustering for physical and virtual systems, and by enabling live kernel patching without rebooting.

Uptime

Uptime is a computer industry term for the time during which a computer or IT system is operational. Uptime can also be a metric that represents the percentage of time that hardware, a computer network, or a device is successfully operational. Uptime is often expressed as a number of “nines,” such as “five 9s,” meaning a system that is operational 99.999 percent of the time. Downtime, the opposite of uptime, is the period of time when a system is not operational.

The terms uptime and downtime are used to define the level of success provided by real-time services. A service level agreement (SLA) or other real-time service contract may include uptime/downtime ratios that show how much time a service is expected to remain operational. IT professionals may use uptime to refer to a total consecutive amount of operational time. For example, a computer system that has been running for three weeks has a “three-week uptime.” High availability uses uptime to define an agreed level of operational performance measured against a 100 percent operational standard.
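Conversely, the uptime percentage for a reporting period can be computed from recorded outages. A small illustration (the outage durations are invented):

```python
# Compute an uptime percentage for a reporting period from recorded outage durations.
# The outage list below is invented for illustration.
def uptime_percent(period_hours: float, outage_minutes: list[float]) -> float:
    downtime_hours = sum(outage_minutes) / 60.0
    return 100.0 * (period_hours - downtime_hours) / period_hours

# A 30-day month with two short outages totalling 25 minutes:
print(f"{uptime_percent(30 * 24, [10.0, 15.0]):.3f}%")  # about 99.942% uptime
```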

Uptime is often used as a sign of operating system or network reliability, representing the length of time a system can be left unattended without crashing or needing maintenance. Hardware reliability, availability and serviceability (RAS) are important factors in data center uptime. Redundant servers for backup and failover help maintain data center uptime in case of server failure. Server clustering is another uptime strategy that delivers high availability of IT services and workloads: as noted above, if one server in a cluster fails, the other servers take over its functions and workloads. SUSE Linux Enterprise Server can help maximize uptime by providing server clustering, by exploiting hardware RAS features, and by enabling live kernel patching without rebooting.

High Availability

High availability (HA) is a state of continuous operation in a computer system or IT component for a specified length of time. High availability may also refer to an agreed level of operational performance (usually uptime) assured for a higher than normal period. As described under Downtime above, availability is measured against a 100% operational or never-fails standard; a common target is “five 9s,” or 99.999% availability, while a two 9s (99%) system may be unavailable for up to 3.65 days per year. Service level agreements (SLAs) often use monthly downtime or availability percentages for billing calculation.

The increased demand for reliable infrastructures running business-critical systems has made reducing downtime and eliminating single points of failure just as important as high availability. For example, hospitals and data centers require high availability of their systems, with no unscheduled downtime, to perform daily tasks. Unscheduled downtime may be caused by a hardware or software failure, or by adverse environmental conditions such as power outages, flooding or temperature changes. Scheduled downtime for system updates and maintenance is often not included in availability percentages.

Reliability engineering uses three principles of systems design to help achieve high availability: elimination of single points of failure, reliable crossover or failover points, and failure detection capabilities. High availability of data access and storage is often required in government, healthcare and other compliance-regulated industries. Highly available systems must recover from server or component failure automatically. A distributed approach can achieve this with multiple redundant nodes connected as a cluster, where each node is capable of failure detection and recovery, as sketched below. SUSE Enterprise Storage is an example of a highly available system designed to have no single point of failure.
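The following much-simplified Python sketch illustrates only the failure-detection and failover principle; it is not how any particular cluster manager is implemented, and the health checks are simulated with a flag.

```python
# Very simplified sketch of failure detection and failover across redundant nodes.
# This illustrates the principle only; real cluster managers probe nodes over the
# network, fence failed nodes and move resources automatically.
class Node:
    def __init__(self, name: str):
        self.name = name
        self.healthy = True

    def health_check(self) -> bool:
        return self.healthy  # a real check would probe the node over the network

def run_service(nodes: list[Node]) -> None:
    active = nodes[0]
    for _ in range(3):  # a real cluster monitors continuously
        if active.health_check():
            print(f"service healthy on {active.name}")
            continue
        # Failure detected: fail the service over to the first healthy standby node.
        standbys = [n for n in nodes if n is not active and n.health_check()]
        if not standbys:
            print("no healthy node available -- service down")
            return
        active = standbys[0]
        print(f"failover: service now running on {active.name}")

cluster = [Node("node-a"), Node("node-b"), Node("node-c")]
cluster[0].healthy = False  # simulate a failure of the initially active node
run_service(cluster)
```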

Scalability

Scalability is the ability of IT systems – such as applications, storage, databases and networking – to continue to function properly when changed in size or volume. It often refers to increasing or decreasing resources as needed to meet the higher or lower demands of a business.

Vertical (scale-up) scalability increases the capacity of hardware or software by adding resources to a physical system, such as adding processing power to a server to make it faster. For scale-up storage, this means adding more devices, such as disk drives, to an existing system when more capacity is required.

Horizontal (scale-out) scalability connects multiple items in order to work as a single logical unit. For scale-out storage, this means adding devices in connected arrays or clusters. Each cluster can have many nodes (devices), and nodes can be separated geographically. Scale-out NAS (network-attached storage) grows by adding clustered nodes. Because each node includes storage capacity, processing power and I/O (input/output) bandwidth, performance increases along with storage capacity. In a scale-out storage system, new hardware can be added and configured as the need arises. When a scale-out system reaches its storage limit, another array can be added to expand the system capacity. Scale-out storage can use the added storage across arrays. A scale-out architecture allows the initial storage investment to be small, because future storage can be added as needed.
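As an illustration of the scale-out idea (an assumption-laden sketch, not a description of any particular storage product), capacity grows simply by adding nodes to the cluster, and data is spread across whichever nodes are present:

```python
# Sketch of scale-out growth: capacity increases by adding nodes, and data is
# distributed across the nodes in the cluster. The per-node capacity and the
# naive hash-based placement are illustrative assumptions.
import hashlib

class ScaleOutCluster:
    def __init__(self) -> None:
        self.nodes: list[str] = []

    def add_node(self, name: str) -> None:
        self.nodes.append(name)  # scaling out = adding another node to the cluster

    def capacity_tb(self, tb_per_node: float = 10.0) -> float:
        return len(self.nodes) * tb_per_node

    def place(self, object_name: str) -> str:
        # Naive hash placement; real scale-out systems use smarter schemes
        # (for example, consistent hashing) to limit data movement when nodes change.
        index = int(hashlib.md5(object_name.encode()).hexdigest(), 16) % len(self.nodes)
        return self.nodes[index]

cluster = ScaleOutCluster()
for node in ["node-1", "node-2", "node-3", "node-4"]:
    cluster.add_node(node)
print(cluster.capacity_tb())  # 40.0 TB with four hypothetical 10 TB nodes
print(cluster.place("backup-2024-06.tar"))  # the node chosen to hold this object
```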

Software-defined storage (SDS) creates a virtualized network of storage resources by separating the management software from its underlying storage hardware. SDS resources may be spread across multiple servers and shared as if they reside on one physical device. This type of storage enables unlimited scalability on demand. SUSE Enterprise Storage is an SDS solution that provides limitless storage capacity and scalability.
