Multi-node training across regions

Note: This was all tested on AWS, but should in theory work the same way on any other cloud platform.

Problem description: Multi-node training within a region under the same VPC is straightforward: use torchrun, following the explanations in https://pytorch.org/docs/stable/elastic/run.html, and set up multiple instances with supporting configs.
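
As a rough illustration of the single-region case, a two-node launch might look like the sketch below. The flags are the ones described in the torchrun documentation linked above; the script name train.py, the job id, the port, the process count and the master address are placeholder assumptions, not values from a tested setup. The helper only prints the command, so it can be inspected before being run on each node.

```shell
# Placeholder values (assumptions for illustration only).
MASTER_ADDR="172.31.40.46"   # private IP of the node hosting the rendezvous
MASTER_PORT="29500"          # assumed free TCP port on that node
NNODES="2"                   # number of participating instances
NPROC_PER_NODE="8"           # assumed number of GPUs per instance

# Hypothetical helper: print the torchrun command to run on every node.
torchrun_cmd() {
  echo "torchrun --nnodes=${NNODES} --nproc_per_node=${NPROC_PER_NODE}" \
       "--rdzv_id=job1 --rdzv_backend=c10d" \
       "--rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} train.py"
}

torchrun_cmd
```

Within one VPC every node can reach MASTER_ADDR directly, which is why this setup works without any extra network configuration.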

The problem starts when trying to do the same across regions, where instances belong to distinct VPCs with no bridge between them. Distributed training in PyTorch has the participating nodes announce themselves using the IP address tied to the default Ethernet device on the instance. For example, running “ip addr” gives the following output:

(base) ubuntu@ip-172-31-40-46:~$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 0e:18:fb:3c:bb:43 brd ff:ff:ff:ff:ff:ff
    inet 172.31.40.46/20 metric 100 brd 172.31.47.255 scope global dynamic ens5
       valid_lft 3509sec preferred_lft 3509sec
    inet6 fe80::c18:fbff:fe3c:bb43/64 scope link
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:f7:5f:c8:e3 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever

Here the “ens5” device carries the private IP address assigned by the cloud platform. PyTorch picks up this address (in this case “172.31.40.46”) and announces the node with it. This is not the public address that external hosts see; it is internal to the VPC in the region. The translation between private and public addresses happens at the VPC gateway.
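
To read that private address without scanning the full listing, the one-line output mode of ip can be filtered. This is a minimal sketch: iface_ipv4 is a hypothetical helper name, and the interface name ens5 is taken from the output above and may differ on other instance types.

```shell
# Hypothetical helper: extract the IPv4 address of one interface.
# Reads `ip -o -4 addr show` output on stdin; $1 is the interface name.
iface_ipv4() {
  awk -v dev="$1" '$2 == dev && $3 == "inet" { split($4, a, "/"); print a[1]; exit }'
}

# Demo on a sample line in the `ip -o -4 addr show` format.
printf '%s\n' '2: ens5    inet 172.31.40.46/20 brd 172.31.47.255 scope global dynamic ens5' \
  | iface_ipv4 ens5
# prints 172.31.40.46
```

On a real node the call would be `ip -o -4 addr show | iface_ipv4 ens5`. Comparing the result with the address reported by an external service shows the private/public mismatch described above.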

If two nodes sit in separate VPCs, which is often the case in cross-region setups, they both announce their private IPs to each other and routing between them fails. This is a known problem in torchrun at the moment, documented at https://github.com/pytorch/pytorch/issues/85300, but it has not been fixed so far.

Solutions:

There are various solutions to this problem; I will mention two that worked for me.

  1. Make sure the VPCs in the two regions are bridged. In AWS this is called “peering connections” under the VPC service. This requires setting up two VPCs (one in each region) whose address spaces don’t overlap; for example, one can be 10.1.0.0/16 and the other 10.2.0.0/16. Each VPC needs subnets so that instances can be allocated an IP from its address space, and an internet gateway can optionally be set up (if nodes need access to the internet during training). Finally, “route tables” (again under VPC) should be set up so that traffic is routed between the subnets. Once all this is in place, the nodes have a route to each other, and things work when they announce their private IPs.
  2. A somewhat hacky but quicker solution is to leave the VPCs in the different regions unconnected and let the PyTorch nodes keep announcing their private addresses, while each node translates every other node's private address to its public one using NAT. For a two-node example, this is the process to perform on each node:

a. First get the public and private IP addresses on each of the nodes

node1:

PUBLIC_IP_1=$(curl -s https://api.ipify.org)
PRIVATE_IP_1=$(hostname -I | awk '{print $1}')

node2:

PUBLIC_IP_2=$(curl -s https://api.ipify.org)
PRIVATE_IP_2=$(hostname -I | awk '{print $1}')

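b. On each node, update the NAT table to translate the other node's private address to its public address. The other node's variables are not defined locally, so set them first from the values obtained in step a.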

node1:

sudo iptables -t nat -A OUTPUT -d $PRIVATE_IP_2 -j DNAT --to-destination=$PUBLIC_IP_2

node2:

sudo iptables -t nat -A OUTPUT -d $PRIVATE_IP_1 -j DNAT --to-destination=$PUBLIC_IP_1
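
With more than two nodes the same rule has to be added once per peer, so it may help to wrap it in a small function. The sketch below is one possible way to do that, not a tested deployment script: add_peer_nat and DRY_RUN are hypothetical names, and the addresses in the example are made up (203.0.113.0/24 is a range reserved for documentation).

```shell
# Hypothetical helper: map a peer's private IP to its public IP on this node.
# $1 = peer private IP, $2 = peer public IP.
# With DRY_RUN=1 the iptables command is only printed, not executed.
add_peer_nat() {
  peer_private="$1"
  peer_public="$2"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "sudo iptables -t nat -A OUTPUT -d ${peer_private} -j DNAT --to-destination ${peer_public}"
  else
    sudo iptables -t nat -A OUTPUT -d "${peer_private}" \
      -j DNAT --to-destination "${peer_public}"
  fi
}

# Example with made-up addresses; prints the rule instead of applying it.
DRY_RUN=1 add_peer_nat 172.31.40.46 203.0.113.10
```

On each node the function would be called once for every other node, with that node's private and public addresses from step a. Printing the rule first makes it easier to spot a swapped pair of addresses before the NAT table is changed.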

To check that the NAT table is correct on a node, run the following command: