Resiliency and High Availability

What is Resiliency?

Choosing the appropriate resiliency options to meet your needs should be your primary goal when designing your Nexus Repository architecture. Resiliency refers to the ability to recover from disruptions to critical processes and supporting technology systems. Disruptions may include any of the following:

  • failure of a single service (the repository node, the external relational database, or the artifact storage)
  • a data center outage for the production environment
  • a regional outage in the case of cloud services

The scope of the disruption you plan to mitigate determines which architecture you need to reach the required level of resiliency.

Backup and Restoration

As you review backup strategies, there are two important terms to remember:

  • Recovery Point Objective (RPO) - the amount of data loss that is acceptable if a restore becomes necessary
  • Recovery Time Objective (RTO) - the maximum acceptable length of time to restore the service
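
As a rough illustration of how these two objectives constrain a backup plan, the sketch below checks a schedule against a Recovery Point Objective (rpo) and a Recovery Time Objective (rto). It assumes that the worst-case data loss equals the interval between backups and that the restore time comes from your own recovery tests. The function name and the example figures are hypothetical.

```python
from datetime import timedelta


def meets_objectives(backup_interval, measured_restore_time, rpo, rto):
    """Return (rpo_ok, rto_ok) for a backup plan.

    Assumption: the worst-case data loss equals the backup interval,
    because a failure can occur just before the next backup runs.
    """
    rpo_ok = backup_interval <= rpo
    rto_ok = measured_restore_time <= rto
    return rpo_ok, rto_ok


# Hypothetical figures: nightly backups, a 3-hour restore measured in a drill,
# a 4-hour RPO, and a 4-hour RTO.
result = meets_objectives(
    backup_interval=timedelta(hours=24),
    measured_restore_time=timedelta(hours=3),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=4),
)
print(result)  # (False, True): the restore is fast enough, but backups are too infrequent
```

In this example, tightening the schedule to a backup every four hours or less would satisfy the RPO, at the cost of more storage and maintenance.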

Your backup plan will need to balance the cost of maintenance against the risk of data loss and service disruption. Requiring fast recovery with minimal risk increases infrastructure complexity and maintenance cost. You will also need to test the recovery process regularly to confirm that it works and to train process owners. Regardless of implementation size, document your plan and keep it up to date with any infrastructure changes.

You can configure your architecture to schedule database exports or use third-party tooling to transfer and back up files from one location to another.

For OrientDB or H2, Nexus Repository provides tasks to create database snapshots and relocate them to a target disk. Other directories in your local instance (or instances) should also be copied and rebuilt on a backup disk (see Prepare a Backup).

You will need to back up blob storage outside of the repository service. 
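
A minimal sketch of such an external file backup, assuming a file-based blob store that is reachable as a local directory. The function name and the paths in the usage comment are hypothetical, and in practice many teams rely on dedicated backup tooling or storage snapshots instead.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def back_up_directory(source, backup_root):
    """Copy `source` into a new timestamped folder under `backup_root`."""
    source = Path(source)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = Path(backup_root) / f"{source.name}-{stamp}"
    # copytree creates `target` (and any missing parents) and raises an error
    # if `target` already exists, so an earlier backup is never overwritten.
    shutil.copytree(source, target)
    return target


# Hypothetical paths; substitute your blob store location and backup disk.
# back_up_directory("/nexus-data/blobs", "/mnt/backup/nexus")
```

Each run produces a separate timestamped copy, which keeps older restore points available until you prune them.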

See Backup and Restore (for H2 and OrientDB) and Backup and Restore in Amazon Web Services for further information.

Library of Patterns 

The matrix below lists various deployment patterns that you might use depending on the level of resiliency you wish to achieve.

Pattern Name | Description | Use Cases | Limitations | Examples*
Single Node with Backup | Single active node with a cold backup that can be used to recover from a data loss. | Reduce data loss | Manual backup; more downtime on recovery vs. other patterns; less availability than other patterns |
Single Node with Dynamic Failover | Single active node in one availability zone/region; in the event of a node or regional failure, a second node is automatically spun up in a second region. | Achieve automatic failover; reduce downtime; reduce data loss | More downtime on recovery vs. other patterns; less availability than other patterns |
Active-Active | A cluster of peer nodes, all of which are able to handle reads and writes. Not yet supported in the newer H2 or PostgreSQL deployment options. | Either site can handle the entire workload for the business; achieve maximum availability and minimum data loss | More costly compared to a single node with dynamic failover; high bandwidth requirement; low latency network requirement |

* We will continue to update this section with more examples as we validate them.
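
Automatic failover in the Single Node with Dynamic Failover pattern depends on detecting that the active node is unhealthy. This is normally handled by your cloud platform or orchestrator rather than by custom code. The sketch below only illustrates one common detection rule: trigger failover after a run of consecutive failed health checks. The function name and threshold are hypothetical and are not part of Nexus Repository.

```python
def should_fail_over(health_checks, failure_threshold=3):
    """Return True once `failure_threshold` consecutive checks have failed.

    `health_checks` is an iterable of booleans, oldest first, where True
    means the active node answered its health probe.
    """
    consecutive_failures = 0
    for healthy in health_checks:
        if healthy:
            consecutive_failures = 0  # a success resets the run
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return True
    return False


print(should_fail_over([True, False, True, False, False]))  # False
print(should_fail_over([True, False, False, False]))        # True
```

Requiring several consecutive failures avoids spinning up a second node because of a single dropped probe, at the cost of slightly slower detection.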

What is Legacy High Availability Clustering?

High Availability Clustering (HA-C) is a legacy Nexus Repository feature, available only with OrientDB, that is meant to improve uptime by running a cluster of redundant Nexus Repository instances (i.e., nodes) within a single data center. The cluster is intended to keep Nexus Repository available if one of the nodes becomes unavailable. However, if you are using the newer H2 or PostgreSQL database options, use the newer resiliency patterns outlined in Library of Patterns.