Operating your cluster


Wherever possible, start the nodes in the reverse of the order in which they were shut down (for example, if they were shut down in the order A, B, C, then C should be started first, followed by B and then A). The last node shut down will have the most current data and should be the first one started after a full cluster shutdown.
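As a minimal sketch, the startup order can be derived by reversing a recorded shutdown order (the node names here are hypothetical placeholders):

```shell
# Hypothetical example: nodes were shut down A, then B, then C.
shutdown_order="A B C"

# Reverse the list to get the startup order: last down, first up.
startup_order=$(echo "$shutdown_order" | tr ' ' '\n' | tac | xargs)
echo "$startup_order"
# -> C B A
```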

Startup and Confirming Node Connectivity

After enabling HA-C for your Nexus Repository Manager installation, launch each clustered node and check the console or log file to confirm that all nodes have been detected.

When you start the nodes, you will see a message in the nexus.log confirming the connection of the cluster members, like the one below:

2017-11-06 11:04:12,045-0500 INFO [hz.nexus.generic-operation.thread-0] *SYSTEM com.hazelcast.internal.cluster.ClusterService - [172.19.3.78]:5703 [nexus] [3.7.8] 


Members [3] { 
        Member [172.19.3.78]:5701 - 6f4df1bc-606d-4821-8a3b-0980f093abd1 
        Member [172.19.3.78]:5702 - 436158e4-2865-40fe-bfaa-d350db8dedb1 
        Member [172.19.3.78]:5703 - 7ff93c39-23b3-4bf5-9b8d-51264ccecee2 this 
}

Running in Docker

Running your Nexus Repository Manager cluster within a container network is supported. However, there is an important consideration: when running Nexus Repository Manager inside a Docker container, the default behavior of the docker stop command is to first send a TERM signal and then, if the process has not exited within 10 seconds, a KILL signal. This has been known to corrupt Nexus Repository Manager clusters.

WARNING

When running Nexus Repository Manager nodes inside docker containers, you will need to specify an extended timeout, such as

docker stop --time=120 <CONTAINER_NAME>

to allow nodes to exit the cluster cleanly.
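If you manage the containers with Docker Compose, the same extended timeout can be applied per service with stop_grace_period (the service name and image tag below are hypothetical; adjust for your deployment):

```yaml
services:
  nexus:
    image: sonatype/nexus3        # hypothetical tag; use your own image
    stop_grace_period: 120s       # equivalent of `docker stop --time=120`
```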

Shutdown

A node leaves the cluster through a clean shutdown of Nexus Repository Manager. A proper shutdown is performed with the nexus.exe stop command on Windows-based systems and nexus stop on all others. During a clean shutdown, the node is unregistered from the cluster. If a registered node becomes abruptly unavailable (for example, its network link is broken or its operating system crashes), that node may remain registered in the cluster despite no longer participating in database replication.

The clean shutdown is an important consideration when designing scripts for stopping a Nexus Repository Manager instance.

Nexus Repository Manager initiates a shutdown when it receives the TERM signal, which is the default signal sent by the kill command on Unix-like operating systems. A clean shutdown is also important for recovering from an outage event.
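As an illustrative sketch (not Nexus-specific), the snippet below shows why TERM matters: a process that traps TERM gets a chance to shut down cleanly, whereas KILL cannot be trapped:

```shell
# A stand-in long-running process that performs cleanup on TERM.
sh -c 'trap "echo clean shutdown; exit 0" TERM; while :; do sleep 1; done' &
pid=$!
sleep 1          # give the trap time to install
kill "$pid"      # `kill` sends TERM by default, triggering the cleanup
wait "$pid"
# -> clean shutdown
```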

Backup

Take some time to read the section on designing your backup and restore plan. Running Nexus Repository Manager in a cluster does not reduce the need for a robust backup strategy. With a cluster, you will need to specify the repository manager instance where the backup task will execute. This is necessary because the backup produces exported database backup files, which are placed in the configured location on the chosen node's file system. However, because the task is assigned to a single node, the backup may not execute if that node is no longer participating in the cluster. In some environments it may be preferable to configure two different nodes to execute backup tasks on alternating schedules. While the Backup and Restore task is running, the Nexus Repository Manager cluster will be in read-only mode.

Testing your Backup/Restore process

We encourage you to test your restore process for disaster recovery at least every six months. We also strongly suggest testing the disaster recovery plan before upgrading Nexus Repository Manager to a new release.

Please keep in mind that Nexus Repository Manager requires you to choose and execute a separate backup process for your blob store(s) appropriate to the file system.

Node Names

When clustered, Nexus Repository Manager offers the ability to specify an alias for each node. Behind the scenes, Nexus Repository Manager generates a unique identifier for each node that joins a cluster. The alias that is specified by the administrator allows a logical mapping between the system-generated identifier and the specified name. The alias for a node is stored in sonatype-work/nexus3/etc/node-name.properties. The name can be updated in the Nexus Repository Manager user interface or directly in the properties file. If the name is updated directly in the properties file, it is necessary to restart the renamed node for the change to take effect. When an alias has been provided for a node, that alias will be used in place of the identifier where appropriate. For instance, the alias will be used in logging messages and in the user interface. 

Maintaining log files

Clustered Nexus Repository Manager nodes will create the same set of log files as unclustered installations. However, in a cluster, the name of the node will be added to each log message. The name that is used in the log messages will be the given alias or default to the node's system-generated identifier. The log files on each node will be rotated daily. The log files for the previous day will be renamed and compressed to save disk space.

The Nexus Repository Manager user interface contains a log viewer available to administrators. The log viewer displays only log messages emitted on the local node. In an environment where traffic is routed through a load balancer, this view may not be helpful; in that case, we suggest putting a log aggregation system in place. Having a node identifier in each log entry should enable better searching within such a system.

Verifying Synchronization

At run-time, the repository manager user interface allows you to view the status of the clustered nodes.

See Nodes for details on viewing active nodes in a cluster.

In the event a single node loses connection to the cluster, the remaining nodes will continue to make decisions on which data changes are valid. The disconnected node will reject further writes until it rejoins the cluster.

Database Quorum and Cluster Reset

To maintain data consistency, Nexus Repository Manager requires database transactions to be accepted and written to multiple nodes in the cluster. To allow continued operation even during some failures, a transaction is not required to replicate to every node before it is accepted; instead, it must be replicated to a majority of the nodes registered in the cluster. This means that in a cluster with three nodes, no fewer than two nodes must accept a write before the transaction is considered a success. To calculate the size of the majority, Nexus Repository Manager uses the following equation:

ROUNDED_DOWN( $N / 2 ) + 1

Where $N represents the number of nodes registered in the cluster and ROUNDED_DOWN finds the largest whole number less than or equal to the given parameter.
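The calculation can be sketched in shell, where integer division on positive values already rounds down:

```shell
# Majority size for a cluster of N registered nodes:
# ROUNDED_DOWN(N / 2) + 1
quorum() { echo $(( $1 / 2 + 1 )); }

quorum 3   # -> 2 (a 3-node cluster tolerates 1 lost node)
quorum 4   # -> 3
quorum 5   # -> 3 (a 5-node cluster tolerates 2 lost nodes)
```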

Nodes register with the cluster when they are launched. 

Although Nexus Repository Manager clustering replicates data across all nodes in the cluster, a cluster can be reduced to a single node. A single node cluster will accept write transactions. You can still write to your Nexus Repository Manager cluster environment with only one running node as long as the other members of the cluster have either been shutdown cleanly or they have not started yet. If nodes in your cluster unexpectedly lose connectivity or die and fewer than the calculated majority are running, then attempts to write (such as publishing assets, changing system configuration) will fail.

Because of the quorum calculation and cluster registration algorithms, Nexus Repository Manager's cluster table can enter a state where it does not accurately reflect the true state of your cluster. For example, if a registered member is abruptly removed from the cluster and a fresh installation is added to replace the failed node, clustering may still count the removed node when calculating the write quorum for database transactions. In this example, Nexus Repository Manager will continue to operate. However, with each abrupt removal of a node, it becomes more likely that the cluster will have difficulty achieving write quorum.

Troubleshooting

If your cluster enters a state where write quorum cannot be achieved, Nexus Repository Manager will present the option to reset your cluster. The write quorum not reached warning appears as a horizontal bar across the top of the Nexus Repository Manager user interface.

Before resetting your Nexus Repository Manager cluster, you should attempt to manually resolve the cluster issues. There are a few ways you can manually resolve the issues.

  • Try to cleanly shut down all clustered nodes except one, then bring your cluster back online by restarting the nodes one at a time, waiting for each to finish starting before starting the next.
  • You can also try to reconnect or restart nodes that may have been killed or disconnected. If the restarted nodes are not meant to be permanent members of the cluster, performing a clean shutdown after restarting or reconnecting will properly remove them from the cluster.

As a last resort, you can click the Troubleshoot link in the write quorum warning. On the cluster reset page, you will see the Reset Database Quorum button. When you click the button, all entries are removed from the cluster table except for the node that processes the reset request. In practice, this means that all other nodes should be shut down and removed from the load balancer rotation before resetting the cluster.

WARNING

It is only safe to reset the cluster once all other nodes have been cleanly shut down.

After the reset completes, the cluster will immediately begin accepting writes on the remaining node. Nodes can then be added to the cluster and placed back into the load balancer rotation. 

Monitoring Node Health

Once your HA environment is set up, you should monitor the health of the nodes in your cluster. The read-only status of the cluster is visible at the http://<serveripaddress>:<port>/service/metrics/data endpoint, as "readonly.enabled", under "gauges". See the support article for the HTTP endpoints that HA-C exposes and/or the support article for enabling JMX, for more information.
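A minimal sketch of extracting that flag, assuming the payload shape described above (the inline sample is hypothetical; in practice you would fetch the JSON from your own instance with appropriate credentials):

```shell
# Trimmed, hypothetical sample of the /service/metrics/data payload.
sample='{"gauges":{"readonly.enabled":{"value":false}}}'

# Crude extraction without a JSON tool; a live check might instead pipe
#   curl -s -u <user>:<password> http://<host>:<port>/service/metrics/data
# through jq '.gauges["readonly.enabled"].value'.
readonly_flag=$(echo "$sample" | sed -n 's/.*"readonly.enabled":{"value":\([a-z]*\)}.*/\1/p')
echo "$readonly_flag"
# -> false
```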

You can also use the load balancer friendly status resource to check the health of the nodes in an HA cluster.

Task Management

In a Nexus Repository Manager cluster, it is necessary to consider which node or nodes execute a task. Knowing how tasks run in a cluster is an important element of monitoring cluster health. Nexus Repository Manager places the logging output for each task execution in a separate file. Many tasks execute on a single, randomly-selected node to balance resource usage across the cluster; other tasks may execute simultaneously on all nodes. Monitoring a task must take its execution behavior into account. The node requirements for common tasks are listed below.

Admin - Execute script

In a Nexus Repository Manager cluster, when you create an Admin - Execute script task, you are able to specify whether the task runs on all nodes simultaneously or whether the script is only executed on a single node. When a task manipulates a resource that is shared within the cluster, such as repository configuration or a blob store, the task should only be executed on a single node. However, if the task is interacting with a resource that is local to the node, such as the search indices, then the task needs to be configured with Run task on all nodes in the cluster selected. 

Export Configuration & Metadata for Backup Task

Please see the section on designing your backup plan.

Single-node Tasks

The following Single-node Tasks will run on a single, randomly-selected node in a Nexus Repository Manager cluster:

  • Admin - Cleanup tags
  • Admin - Compact blob store
  • Admin - Delete orphaned API keys
  • Admin - Log database table record counts
  • Admin - Remove a member from a blob store group
  • Docker - Delete incomplete uploads
  • Docker - Delete unused manifests and images
  • Maven - Delete SNAPSHOT
  • Maven - Delete unused SNAPSHOT
  • Maven - Publish Maven Indexer files
  • Maven - Unpublish Maven Indexer files
  • Repair - Rebuild Maven repository metadata (maven-metadata.xml)
  • Repair - Rebuild Yum repository metadata (repodata)
  • Repair - Reconcile component database from blob store
  • Repository - Delete unused components

The following Single-node Tasks will run on a single selected node in a Nexus Repository Manager cluster:

  • Admin - Export database for backup

  • Repair - Rebuild repository browse
  • Repair - Reconcile date metadata for blob store

Multi-node Tasks

The following tasks may run on all nodes or on the current node (i.e. the node the UI is interacting with):

  • Admin - Execute script
  • Repair - Rebuild repository search
  • Repair - Reconcile npm /-/v1/search metadata