High-availability monitor for VPN tunnels.
```
       Side A                                    Side B

   +----------+            active             +----------+
   |  Active  | <===========================> |  Active  |
   +----------o                               x----------+
        .          o                     x          .
        .              o             x              .
        .                  o     x                  .
        .                  x     o                  .
        .   backup     x             o     backup   .
        .          x                     o          .
   +----------x                               o----------+
   |  Backup  | < - - - - - - - - - - - - - > |  Backup  |
   +----------+            backup             +----------+
```
A typical setup would be:

- 2x VPN hosts on each side of the bridge.
- One host on each side configured as `active`, the other as `standby`.
- Each host on each side has VPN tunnels configured to both of the other side's hosts.
- Only the `active`-`active` tunnel is used; the others are there for backup.
In the event of downtime (a tunnel is broken, or one of the active hosts is down), the monitor would:

- Promote the remaining tunnel to become `active`.
- Trigger custom command scripts to adjust for the situation (to re-configure the routes, for example).
In order to achieve that, `vpnham` does the following:
- Regularly send UDP datagrams to each of the peers on the other side (connections `<==>`, `< - >`, `< x >`, and `< o >` on the diagram above).
  - The peer adds its bit into the datagram and sends it back.
  - This is how `vpnham` determines the `up`/`down` status of the tunnel (it accounts for the sent probes with their sequence numbers, and expects them to come back). A rough sketch of this probe loop appears after this list.
- Regularly poll the partner's bridge (i.e. the `active` bridge polls the `standby` one, and vice versa; connections `< . >` above).
  - Failure to poll means the partner is `down`.
  - If both of the tunnels are `down`, the bridge marks itself `down` as well and reports itself accordingly to its partner.
- If the `active` tunnel is `down`, the `standby` one is promoted to `active`.
- If the `active` bridge is `down`, the `standby` one is promoted to `active`.
- Once the `active` (by configuration) tunnel gets back `up`, it will reclaim the `active` status (in other words, ties are broken via configuration).
- A similar approach applies to the bridges.
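The probe loop referenced above can be summarised in code. The sketch below is only an illustration under stated assumptions: the actual wire format, type names, and bookkeeping in `vpnham` are not documented here, so the 9-byte payload, the `Prober` class, and the consecutive-success/failure counting against `threshold_up`/`threshold_down` (options shown in the configuration further down) are all hypothetical.

```python
import socket
import struct

# Illustrative probe payload: sequence number + "returned" flag.
# The real vpnham wire format is not documented here; this is an assumption.
PROBE = struct.Struct("!QB")  # (seq: u64, returned: u8)


class Prober:
    """Tracks a single tunnel's up/down status from echoed UDP probes."""

    def __init__(self, probe_addr, threshold_down=5, threshold_up=3, timeout=1.0):
        self.probe_addr = probe_addr
        self.threshold_down = threshold_down
        self.threshold_up = threshold_up
        self.timeout = timeout
        self.seq = 0
        self.failures = 0
        self.successes = 0
        self.up = False

    def probe_once(self, sock):
        """Send one probe and wait for it to come back with the peer's bit set."""
        self.seq += 1
        sock.sendto(PROBE.pack(self.seq, 0), self.probe_addr)
        sock.settimeout(self.timeout)
        try:
            data, _ = sock.recvfrom(PROBE.size)
            seq, returned = PROBE.unpack(data)
            ok = returned == 1 and seq == self.seq
        except OSError:          # timeout or send/receive error
            ok = False
        self._account(ok)

    def _account(self, ok):
        # Consecutive successes/failures are compared against the thresholds,
        # mirroring the threshold_up / threshold_down options in the config.
        if ok:
            self.successes += 1
            self.failures = 0
            if not self.up and self.successes >= self.threshold_up:
                self.up = True    # tunnel considered `up`
        else:
            self.failures += 1
            self.successes = 0
            if self.up and self.failures >= self.threshold_down:
                self.up = False   # tunnel considered `down`


def responder(listen_addr):
    """Peer side: set the 'returned' bit and send the datagram back."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(listen_addr)
    while True:
        data, addr = sock.recvfrom(PROBE.size)
        seq, _ = PROBE.unpack(data)
        sock.sendto(PROBE.pack(seq, 1), addr)
```

Per the configuration comments below, the same threshold counting is applied to the HTTP status polls between partner bridges, not only to the UDP probes.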
There are configurable scripts (per bridge, or globally):
- `bridge_activate` is triggered when a bridge is promoted to `active`.
  - Recognised placeholders are: `${proto}`, `${bridge_peer_cidr}`, `${bridge_interface}`, `${bridge_interface_ip}`.
- `tunnel_activate` is triggered when a tunnel is marked `active`.
  - Recognised placeholders are the same as for `bridge_activate`, plus: `${tunnel_interface}`, `${tunnel_interface_ip}`.
- `tunnel_deactivate` is triggered when the tunnel's `active` mark is removed.
  - Recognised placeholders are the same as for `tunnel_activate`.
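The expansion mechanism itself is not specified beyond the list above, so the following is only a sketch of how a configured command (a list of argv lists, as in the example configuration further down) might have its `${...}` placeholders substituted and be run with a timeout (cf. `scripts_timeout`). The function name and all values passed in are hypothetical.

```python
import subprocess
from string import Template


def run_script(commands, placeholders, timeout=5.0):
    """Expand ${...} placeholders in each command and run the commands in order."""
    for argv in commands:
        expanded = [Template(arg).safe_substitute(placeholders) for arg in argv]
        subprocess.run(expanded, timeout=timeout, check=False)


# Hypothetical values, matching the placeholder names from the list above.
run_script(
    commands=[
        ["sh", "-c", "echo 'activate ${tunnel_interface} ${tunnel_interface_ip}'"],
    ],
    placeholders={
        "proto": "ipv4",
        "bridge_peer_cidr": "10.1.0.0/16",
        "bridge_interface": "eth0",
        "bridge_interface_ip": "10.0.0.2",
        "tunnel_interface": "eth1",
        "tunnel_interface_ip": "192.168.255.2",
    },
    timeout=5.0,
)
```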
In addition, there is a metrics endpoint where `vpnham` reports the following:
- `vpnham_bridge_active` is a gauge for the count of active bridges. `0` means there is no connectivity to the other side (bad). `1` means all is good (yay). `2` means both we and the partner consider ourselves `active` (which indicates a bug).
- `vpnham_bridge_up` is a gauge for the count of online bridges (from `0` to `2`, the more the merrier).
- `vpnham_tunnel_interface_active` is a gauge for the count of active tunnels.
- `vpnham_tunnel_interface_up` is a gauge for the count of online tunnels.
Also (since we have that info at our fingertips through probing), the following metrics are exposed:
- `vpnham_probes_sent_total` is a counter for probes sent.
- `vpnham_probes_returned_total` is a counter for probes returned.
- `vpnham_probes_failed_total` is a counter for probes that failed to be sent or received.
- `vpnham_probes_latency_forward_microseconds` is a histogram for the probes' forward latency (on their trip "there").
- `vpnham_probes_latency_return_microseconds` is a histogram for the probes' return latency (on their trip "back").
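For orientation only, a scrape of the metrics endpoint could look roughly like the snippet below. The metric names and types come from the lists above; the sample values are made up, the latency histograms are omitted, and any labels (for example a per-bridge, per-tunnel, or `probe_location` label) are left out, so the real output will differ in those details.

```
# TYPE vpnham_bridge_active gauge
vpnham_bridge_active 1
# TYPE vpnham_bridge_up gauge
vpnham_bridge_up 2
# TYPE vpnham_tunnel_interface_active gauge
vpnham_tunnel_interface_active 1
# TYPE vpnham_tunnel_interface_up gauge
vpnham_tunnel_interface_up 2
# TYPE vpnham_probes_sent_total counter
vpnham_probes_sent_total 8640
# TYPE vpnham_probes_returned_total counter
vpnham_probes_returned_total 8633
# TYPE vpnham_probes_failed_total counter
vpnham_probes_failed_total 7
```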
An example configuration:

```yaml
bridges:
  vpnham-dev-lft:                        # bridge name (must match one at the partner's side)
    role: active                         # our role (`active` or `standby`)
    bridge_interface: eth0               # interface on which the bridge connects to VPC
    peer_cidr: 10.1.0.0/16               # CIDR range of the VPC we are bridging into
    status_addr: 10.0.0.2:8080           # address where our partner polls our status
    partner_url: http://10.0.0.3:8080/   # url where we poll the status of the partner
    probe_interval: 1s                   # interval between UDP probes or status polls
    probe_location: left/active          # location label for the latency metrics
    tunnel_interfaces:
      eth1:                              # interface on which the VPN tunnel is running
        role: active                     # tunnel role (`active` or `standby`)
        addr: 192.168.255.2:3003         # address where we respond to UDP probes
        probe_addr: 192.168.255.3:3003   # address where we send the UDP probes to
        threshold_down: 5                # count of failed probes/polls to mark peer/partner "down"
        threshold_up: 3                  # count of successful probes/polls to mark peer/partner "up"
      eth2:
        role: standby
        addr: 192.168.255.18:3003
        probe_addr: 192.168.255.19:3003
        threshold_down: 7
        threshold_up: 5
    scripts_timeout: 5s                  # max amount of time for script commands to finish

metrics:
  listen_addr: 0.0.0.0:8000              # where we expose the metrics (at the `/metrics` path)
  latency_buckets_count: 33              # count of histogram buckets for latency metrics
  max_latency_us: 1000000                # max latency bucket in [us]; the buckets are computed
                                         # exponentially, so that
                                         # max_latency == pow(min_latency, buckets_count)

default_scripts:                         # default scripts (complement the `scripts` on bridge config)
  bridge_activate:                       # script that we will run when a bridge becomes `active`
    - ["sh", "-c", "echo ${bridge_interface} ${bridge_interface_ip} ${bridge_peer_cidr}"]
    - ["sleep", "15"]
  interface_activate:                    # script that we will run when a tunnel becomes `active`
    - ["sh", "-c", "echo 'activate ${tunnel_interface} ${tunnel_interface_ip}'"]
  interface_deactivate:                  # script that we will run when a tunnel becomes `inactive`
    - ["sh", "-c", "echo 'deactivate ${tunnel_interface} ${tunnel_interface_ip}'"]
```
Note: see the following files for the full example:
Also: `make docker-compose`
`vpnham` takes only one CLI parameter, `--config`, which should point to the YAML file with the full configuration. By default, it will look for a `.vpnham.yaml` file in the working directory.
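For example (the config path here is just a placeholder):

```sh
vpnham --config /etc/vpnham/config.yaml
```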