Note: This was all tested on AWS, but in theory it should work the same way on any other cloud platform.
Problem description: Multi-node training within a region under the same VPC is straightforward: use torchrun as explained in https://pytorch.org/docs/stable/elastic/run.html and set up multiple instances with matching configs.
The problem starts when trying to do the same across regions, where the instances belong to distinct VPCs with no bridge between them. In this case, distributed training in PyTorch runs into trouble because each participating node announces itself using the IP address tied to the default Ethernet device on the instance. For example, running “ip addr” gives the following output:
(base) ubuntu@ip-172-31-40-46:~$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
link/ether 0e:18:fb:3c:bb:43 brd ff:ff:ff:ff:ff:ff
inet 172.31.40.46/20 metric 100 brd 172.31.47.255 scope global dynamic ens5
valid_lft 3509sec preferred_lft 3509sec
inet6 fe80::c18:fbff:fe3c:bb43/64 scope link
valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:f7:5f:c8:e3 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
Here the “ens5” device carries the private IP address assigned by the cloud platform. PyTorch picks up this address (in this case “172.31.40.46”) and announces the node with it. This is not the public address that external hosts see; it is internal to the VPC in that region. The translation between private and public addresses happens at the VPC gateway.
If two nodes sit in separate VPCs, which is often the case in cross-region setups, they both announce their private IPs to each other and routing between them fails. This is a known problem in torchrun at the moment, documented at https://github.com/pytorch/pytorch/issues/85300, but it isn’t being fixed.
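To check quickly whether the address a node would announce is private, a small shell test against the RFC 1918 ranges is enough. This is only a sketch: is_private_ipv4 is a hypothetical helper name, not part of torchrun or any AWS tooling, and it covers only the three standard private IPv4 ranges.

```shell
# Sketch: succeed (exit status 0) if the given IPv4 address is in an RFC 1918 private range.
# is_private_ipv4 is a hypothetical helper, not a torchrun or AWS command.
is_private_ipv4() {
  ip="$1"
  case "$ip" in
    10.*|192.168.*)
      return 0 ;;
    172.*)
      # 172.16.0.0/12 covers second octets 16 through 31.
      second="${ip#172.}"
      second="${second%%.*}"
      if [ "$second" -ge 16 ] 2>/dev/null && [ "$second" -le 31 ]; then
        return 0
      fi
      return 1 ;;
    *)
      return 1 ;;
  esac
}

# Example use on a node (left as a comment because it depends on the machine):
#   is_private_ipv4 "$(hostname -I | awk '{print $1}')" && echo "announced address is private"
```

For the address in the output above, is_private_ipv4 172.31.40.46 succeeds, which confirms that a peer in another VPC cannot route to it directly.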
Solutions:
There are various solutions to this problem; I will mention two that worked for me.
a. First get the public and private IP addresses on each of the nodes
node1:
PUBLIC_IP_1=$(curl -s https://api.ipify.org)
PRIVATE_IP_1=$(hostname -I | awk '{print $1}')
node2:
PUBLIC_IP_2=$(curl -s https://api.ipify.org)
PRIVATE_IP_2=$(hostname -I | awk '{print $1}')
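b. On each node, update the NAT table so that traffic addressed to the other node’s private IP is rewritten to that node’s public IP. The other node’s values were computed on that node, so copy them over and set them by hand before running the command.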
node1:
sudo iptables -t nat -A OUTPUT -d $PRIVATE_IP_2 -j DNAT --to-destination=$PUBLIC_IP_2
node2:
sudo iptables -t nat -A OUTPUT -d $PRIVATE_IP_1 -j DNAT --to-destination=$PUBLIC_IP_1
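Because the rule on each node uses the other node's addresses, it can help to print the command and inspect it before running it with sudo. The following is a minimal sketch, assuming the peer's two addresses are passed in by hand. dnat_rule is a hypothetical helper name, and 203.0.113.7 is a documentation-only placeholder, not a real public IP.

```shell
# Sketch: print (do not execute) the DNAT rule for one peer.
# dnat_rule is a hypothetical helper; both arguments must be supplied manually.
dnat_rule() {
  peer_private="$1"
  peer_public="$2"
  if [ -z "$peer_private" ] || [ -z "$peer_public" ]; then
    echo "usage: dnat_rule PEER_PRIVATE_IP PEER_PUBLIC_IP" >&2
    return 1
  fi
  # Same rule as in step b, with the peer's addresses filled in.
  echo "iptables -t nat -A OUTPUT -d $peer_private -j DNAT --to-destination=$peer_public"
}

# Example with placeholder addresses:
dnat_rule 172.31.40.46 203.0.113.7
```

Once the printed line looks right, it can be run by prefixing it with sudo, exactly as in step b.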
To verify that the NAT table is correct on a node, run the following command: