Resiliency
Resiliency is the ability to recover from disruptions to critical processes and supporting technology systems. This includes anything from hardware or software issues to network outages or data center failures. A resilient Nexus Repository implementation is designed to minimize downtime.
This involves redundancy, failover mechanisms, and robust data management strategies to protect against data loss and service interruptions.
Choosing the appropriate resiliency options is your primary goal when designing your Nexus Repository architecture.
See the Migrating to a Resilient Deployment documentation for details.
Understanding the specific terminology used is crucial for understanding the guidance recommended in this section. Here's a breakdown of the key terms:
Node
A node represents a single, running instance of the Nexus Repository application. It encompasses the software, resources (CPU, memory, storage), and configuration necessary for the repository to function. A standalone deployment consists of a single node, while clustered deployments involve multiple nodes. We generally recommend that each node reside on its own server so that a single hardware failure does not affect multiple nodes.
System of Record
A system of record is the authoritative Nexus Repository instance for a given artifact.
Cluster
A cluster is a group of interconnected nodes working together to provide high availability and scalability for Nexus Repository. Clustering allows for distributing the workload, providing redundancy, and ensuring continued operation even if some nodes fail. The nodes in the cluster connect to a shared external database and write to the same blob stores.
High Availability (HA)
HA refers to the system's ability to remain operational and accessible with minimal downtime. In Nexus Repository, HA is typically achieved through clustering, where multiple nodes can take over if one node becomes unavailable.
Failover
Failover is the automatic process of switching to a redundant or backup node in the event of a primary node's failure. This ensures continuous service availability.
Replication
Replication involves copying data from one node to another. In Nexus Repository, replication can be used to synchronize data across multiple nodes in a cluster, ensuring data consistency and enabling failover.
Load Balancing
Load balancing distributes incoming requests across multiple nodes in a cluster. This helps to prevent any single node from becoming overloaded and improves overall performance.
Data Consistency
Data consistency ensures that all nodes in a cluster have the same, up-to-date data. This is crucial for maintaining data integrity and preventing conflicts.
Disaster Recovery (DR)
DR refers to the process of restoring the Nexus Repository system and its data after a major outage or disaster. This typically involves backups, failover to a geographically separate location, and a recovery plan.
Backup and Restore
Backup and restore procedures are essential for protecting Nexus Repository data. Regular backups allow you to recover from data loss or system failures.
Backups should cover documentation of the service environment and the system configuration, as well as the application binaries, blob stores, and database. You may want different schedules or priorities for each depending on the frequency of deployments and your audit requirements.
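As an illustration of what such a backup can capture, the following sketch dumps an external PostgreSQL database with pg_dump and archives the blob stores and configuration from the data directory. The host, database name, and paths are assumptions for illustration only; database authentication (for example, via a .pgpass file) and copying the result to an offsite location are left to your environment.

```python
#!/usr/bin/env python3
"""Minimal nightly backup sketch for a Nexus Repository instance.

Assumes an external PostgreSQL database named "nexus" and a local
sonatype-work directory; adjust hosts, credentials, and paths for
your environment. Database authentication (e.g. .pgpass) is assumed.
"""
import datetime
import pathlib
import subprocess
import tarfile

# Hypothetical locations -- replace with your own.
DATA_DIR = pathlib.Path("/opt/sonatype/sonatype-work/nexus3")
BACKUP_DIR = pathlib.Path("/backups/nexus")
DB_HOST, DB_NAME, DB_USER = "db.example.com", "nexus", "nexus"

stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
target = BACKUP_DIR / stamp
target.mkdir(parents=True, exist_ok=True)

# 1. Dump the database in PostgreSQL custom format (restorable with pg_restore).
subprocess.run(
    ["pg_dump", "-h", DB_HOST, "-U", DB_USER, "-Fc",
     "-f", str(target / f"{DB_NAME}.dump"), DB_NAME],
    check=True,
)

# 2. Archive blob stores and configuration from the data directory.
with tarfile.open(target / "sonatype-work.tar.gz", "w:gz") as tar:
    tar.add(DATA_DIR / "blobs", arcname="blobs")
    tar.add(DATA_DIR / "etc", arcname="etc")

print(f"Backup written to {target}; copy it to an offsite location.")
```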
Recovery Point Objective (RPO)
The amount of data loss that is acceptable when a restore becomes necessary, measured as the amount of time since the last backup. For example, nightly backups imply an RPO of up to 24 hours.
Recovery Time Objective (RTO)
The length of time required to restore the service. The objective may describe a partial recovery, such as bringing the service back in a read-only state, or a full recovery that restores the service to the RPO.
Monitoring
Monitoring of Nexus Repository nodes and the cluster as a whole is crucial for detecting potential issues and ensuring system health. Monitoring tools can track metrics like CPU usage, memory consumption, and network activity.
Resiliency involves balancing three factors: the cost of your maintenance plan, the time needed to recover, and the risk of disruptions to the service. The following best practices are tools for improving the resiliency of your Nexus Repository.
Regular Backups
Implement a robust backup strategy that includes regular, automated backups of the Nexus Repository data, configuration, and application files. Store backups securely, preferably in a separate location such as a data center in another region.
High Availability
Deploy Nexus Repository in a clustered configuration to ensure high availability and automatic failover in case of node failures.
Comprehensive Monitoring
Implement comprehensive monitoring of Nexus Repository nodes, the cluster (if applicable), the database, and the underlying infrastructure. Monitor key metrics such as CPU usage, memory consumption, disk space, network activity, and application logs. Set up alerts for critical events.
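As a minimal example of an external health check, the sketch below polls the status endpoints assumed to be exposed by the Nexus Repository REST API (/service/rest/v1/status and /service/rest/v1/status/writable) on each node and flags any node that stops answering. The node URLs are placeholders, and the alert action is a stub to be replaced by your own monitoring or paging integration.

```python
#!/usr/bin/env python3
"""Minimal health-check sketch for Nexus Repository nodes.

Polls the REST status endpoints on each node and reports any node
that fails to answer. Node URLs and the alerting hook are placeholders.
"""
import urllib.error
import urllib.request

NODES = ["http://nexus-1.example.com:8081", "http://nexus-2.example.com:8081"]
ENDPOINTS = ["/service/rest/v1/status", "/service/rest/v1/status/writable"]

def check(node: str) -> bool:
    """Return True when the node answers both status probes with HTTP 200."""
    for endpoint in ENDPOINTS:
        try:
            with urllib.request.urlopen(node + endpoint, timeout=10) as resp:
                if resp.status != 200:
                    return False
        except (urllib.error.URLError, OSError):
            return False
    return True

if __name__ == "__main__":
    for node in NODES:
        if check(node):
            print(f"OK      {node}")
        else:
            # Replace with a real alert (PagerDuty, Slack, email, ...).
            print(f"FAILING {node}")
```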
Disaster Recovery Plan
Develop and regularly test a disaster recovery plan that outlines the steps to restore Nexus Repository in the event of a major outage or disaster. This plan should include procedures for backup restoration, failover to a secondary location, and communication with stakeholders.
Security Hardening
Implement security best practices to protect Nexus Repository from security breaches. This includes strong password policies, least privilege access controls, updating to the latest versions, and vulnerability scanning.
Regular Testing
Regularly test failover and recovery procedures to ensure they work as expected. This includes simulating node failures, database failures, and network outages. Test the backup and restore process to verify its effectiveness.
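Part of that testing can be automated. The following sketch, which assumes the backup layout used in the earlier backup example, checks that the latest database dump is readable by pg_restore and that the blob store archive opens cleanly; full failover simulations remain specific to your infrastructure.

```python
#!/usr/bin/env python3
"""Sketch: sanity-check the most recent backup as part of regular testing.

Verifies that the database dump can be read by pg_restore and that the
blob store archive opens and contains data. Paths follow the backup
layout assumed in the earlier example and are placeholders.
"""
import pathlib
import subprocess
import tarfile

BACKUP_ROOT = pathlib.Path("/backups/nexus")

# Pick the newest timestamped backup directory by name.
latest = max(BACKUP_ROOT.iterdir(), key=lambda p: p.name)

# 1. A readable dump can be listed without touching any database.
subprocess.run(
    ["pg_restore", "--list", str(latest / "nexus.dump")],
    check=True,
    stdout=subprocess.DEVNULL,
)

# 2. The archive must open cleanly and contain at least the blob store tree.
with tarfile.open(latest / "sonatype-work.tar.gz", "r:gz") as tar:
    names = tar.getnames()
    assert any(name.startswith("blobs") for name in names), "no blob store data in archive"

print(f"Backup {latest.name} passed basic integrity checks.")
```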
Failure Scenarios and Recovery
This section provides a breakdown of potential failure scenarios for Nexus Repository in both single-node and clustered environments. The scope of interruption you need to mitigate, balanced against the cost of ownership of each deployment, determines which architecture you need to achieve your required level of resiliency.
A node failure encompasses failures of the physical or virtual server hosting Nexus Repository. Hardware failures include hard drive crashes, memory issues, or power supply problems. Software failures can involve operating system crashes, application bugs, or dependency conflicts.
Loss of service for artifacts hosted on the failed node. Builds dependent on those artifacts will fail.
Single Node
The Nexus Repository instance is completely unavailable. There is a high chance of data loss when components are stored on the same server, and the embedded database may be corrupted when the failure happens in the middle of a transaction. Failover systems may attempt to restore service using standby nodes.
Restore the Nexus Repository data and configuration from the most recent backup. The backup should include the application files, database, and any custom configurations.
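Assuming backups follow the layout sketched earlier, a single-node restore might look like the following: with the service stopped, restore the database dump with pg_restore and unpack the blob stores and configuration into the data directory. All paths and connection details are illustrative.

```python
#!/usr/bin/env python3
"""Minimal restore sketch matching the backup layout assumed earlier.

Run with the Nexus Repository service stopped. Paths, hosts, and the
chosen backup directory are placeholders for illustration.
"""
import pathlib
import subprocess
import tarfile

DATA_DIR = pathlib.Path("/opt/sonatype/sonatype-work/nexus3")
BACKUP = pathlib.Path("/backups/nexus/20240101-020000")  # chosen backup
DB_HOST, DB_NAME, DB_USER = "db.example.com", "nexus", "nexus"

# 1. Restore the database dump, dropping and recreating existing objects first.
subprocess.run(
    ["pg_restore", "-h", DB_HOST, "-U", DB_USER, "-d", DB_NAME,
     "--clean", "--if-exists", str(BACKUP / f"{DB_NAME}.dump")],
    check=True,
)

# 2. Unpack blob stores and configuration back into the data directory.
with tarfile.open(BACKUP / "sonatype-work.tar.gz", "r:gz") as tar:
    tar.extractall(DATA_DIR)

print("Restore complete; start Nexus Repository and verify repository health.")
```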
High Availability Cluster
Users might experience temporary slowdowns as traffic is redirected by the load balancer to other nodes continuing to serve requests.
The cluster should automatically detect the node failure and initiate failover to a healthy, available node. Repair or replace the failed node and rejoin it to the cluster.
A data center or zone outage refers to a significant outage affecting an entire data center or a large portion of its infrastructure. This could be caused by natural disasters (e.g., floods, fires, earthquakes), widespread power outages, or major infrastructure problems.
Single Node
Depending on the disaster recovery plan and backup strategy, there might be a risk of data loss if backups are not stored in a geographically separate location.
High Availability Cluster
HA clusters are confined to a single data center or zone, so we recommend failing over to a cluster in another availability zone to mitigate this risk. Once the primary data center is restored, plan and execute a failback to the primary environment if desired. This might involve replicating data from the secondary location back to the primary.
Develop a comprehensive disaster recovery plan that includes procedures for data backup and restoration, infrastructure recovery, application deployment, and failback.
Problems with the network infrastructure, such as switch failures, router misconfigurations, or network outages, can disrupt communication between Nexus Repository nodes (in a cluster) or between clients and the repository.
Single Node
Intermittent or complete loss of client access. Network issues do not directly impact the internal database unless it uses network storage; however, high latency may lead to lost transactions or even database corruption.
High latency between remote proxy nodes may cause long download times and build failures.
High Availability Cluster
Network problems may affect node communication with the external database and blob stores, leading to high request latency and intermittent issues. Monitor the logs and set up alerts to troubleshoot and resolve network issues.
This refers to failures of the database used by Nexus Repository to store artifact metadata and server configuration. Database failures can be caused by hardware problems, software bugs, high latency, and data corruption.
A database failure typically leads to a complete Nexus Repository outage. When restoring a backup, there will be inconsistencies between the database and the artifacts in the blob stores that must be reconciled.
Recovery
Contact Sonatype Support for guidance in case of a database failure.
Restore the database from the most recent, valid backup. After restoring the database, verify its integrity to ensure no data corruption occurred during the backup or restore process.
Under support supervision, use the repair and reconcile task to resolve inconsistencies between the database and blob stores.
Failure of the hard drives or storage volumes where Nexus Repository is installed or where its data is stored. The impact depends on which disks fail. A failure of the OS disk may bring down the server. A failure of a blob store disk could lead to loss of the artifact data on that disk.
Single Node
A failure of either the OS disk or a blob store disk requires a restore from backup. Expect downtime while the infrastructure is replaced and restored.
Use a dynamic failover instance to improve recovery time.
High Availability Cluster
Similar to a node failure, users might experience temporary slowdowns as traffic is redirected by the load balancer to other nodes continuing to serve requests.
Object storage replication is used to maintain copies of artifacts across different storage locations in other data centers or even across different geographical regions. Using the object storage failover configuration, Nexus Repository automatically switches to a new location in case of unavailability or failure.
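If your blob stores use Amazon S3, cross-region replication is one way to maintain such copies. The sketch below uses boto3 to enable versioning on the source bucket and add a replication rule targeting a bucket in another region; bucket names and the IAM role ARN are placeholders, the role must already permit replication between the two buckets, and the destination bucket must exist with versioning enabled. Whether Nexus Repository switches to the replica automatically still depends on the object storage failover configuration mentioned above.

```python
#!/usr/bin/env python3
"""Sketch: replicate an S3 blob store bucket to a bucket in another region.

Bucket names and the IAM role ARN are placeholders; the destination
bucket is assumed to exist in the other region with versioning enabled.
"""
import boto3

SOURCE_BUCKET = "nexus-blobstore-us-east-1"
DEST_BUCKET_ARN = "arn:aws:s3:::nexus-blobstore-us-west-2"
REPLICATION_ROLE_ARN = "arn:aws:iam::123456789012:role/nexus-blobstore-replication"

s3 = boto3.client("s3")

# Replication also requires versioning on the source bucket.
s3.put_bucket_versioning(
    Bucket=SOURCE_BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Replicate every new object written to the source bucket into the destination.
s3.put_bucket_replication(
    Bucket=SOURCE_BUCKET,
    ReplicationConfiguration={
        "Role": REPLICATION_ROLE_ARN,
        "Rules": [
            {
                "ID": "nexus-blobstore-dr",
                "Priority": 1,
                "Filter": {},
                "Status": "Enabled",
                "Destination": {"Bucket": DEST_BUCKET_ARN},
                "DeleteMarkerReplication": {"Status": "Disabled"},
            }
        ],
    },
)

print(f"Replication from {SOURCE_BUCKET} to {DEST_BUCKET_ARN} configured.")
```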
Data corruption refers to situations where the data stored within Nexus Repository becomes damaged or inconsistent. This can be caused by software bugs, hardware issues, or improper shutdowns.
The impact varies depending on the extent of the corruption. It might affect a few artifacts, a specific repository, or the entire instance.
Recovery
Identify and isolate the corrupted data. This might involve examining logs, comparing checksums, or running repository integrity checks.
Restore the affected artifacts from backups. In some cases, it might be necessary to rebuild repository metadata if it has become significantly corrupted.
A security breach occurs when unauthorized individuals gain access to the Nexus Repository system or its data. This can be due to compromised credentials, software vulnerabilities, or misconfigurations.
Recovery
Isolate the affected server to prevent further damage. Investigate the breach to determine how it occurred and what data was accessed.
Reset all passwords and revoke any compromised credentials. Consider restoring the Nexus Repository from clean backups if the system has been significantly compromised.
Keeping Nexus Repository on the latest version ensures you have the latest security fixes.
Use the principle of least privilege when designing your access controls to limit future exposure.
Library of Patterns
The sections below list various patterns to use depending on your resiliency requirements.
Single Node with Backup
Single active node with a cold backup that can be used to recover from data loss.
Supported with Community Edition (CE)
Use Cases: This model reduces data loss.
Limitations: The backup is manual and requires downtime to recover.
Examples:
Single Node with Dynamic Failover
A single active node in one availability zone. Should the node or its availability zone fail, Kubernetes activates a second node in either the same or a different availability zone.
Supported with Community Edition (CE)
Use Cases: This model reduces downtime, infrastructure costs, and data loss.
Limitations: Requires more downtime than active patterns.
Examples:
Active-Active Node Clustering
A cluster of redundant active Nexus Repository instances within a single region or on-premises data center. The number of instances may be manually scaled or leverage Kubernetes to automatically scale instances.
Use Cases: This model maximizes uptime while protecting against application and hardware failures. May be scaled to multiple availability zones in a single region for additional protection.
Limitations: Uses multiple technology stacks that require in-house expertise to manage appropriately. Requires high infrastructure and maintenance overhead which may not yield your desired return on investment.
Examples: