High Inter-node Latency for Small Messages for UCX-Py using InfiniBand #563

Open
aamirshafi opened this issue Jul 19, 2020 · 4 comments

@aamirshafi

We are trying to reproduce the host-based (numpy objects) UCX-Py numbers shown on slide 22 of the GTC 2019 talk (https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9679-ucx-python-a-flexible-communication-library-for-python-applications.pdf), but we are getting much higher latency. We see around 54 us for a 4-byte message in latency-bound mode and around 82 us in throughput-bound mode. The equivalent benchmark at the UCX level reports 2 to 3 us.

The latency here seems to be on the high side. What could be the reason for this?

Some details on the test setup:

The test runs across two nodes connected via InfiniBand. The benchmark is https://github.com/rapidsai/ucx-py/blob/branch-0.15/benchmarks/local-send-recv.py; we only modified it to report latency numbers (see latency.patch.txt).

Exact commands are as follows.

Server:

UCX_NET_DEVICES=mlx5_0:1 UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=rc,sm,self python local-send-recv.py --n-bytes 4 --reuse-alloc --server-only --n-iter 10000 --object_type numpy

The server prints "Server Running at X:Y".

Client:

UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=rc,sm,self python local-send-recv.py --n-bytes 4 --client-only --server-address X --reuse-alloc --port Y --n-iter 10000 --object_type numpy
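
For reference, a per-message latency figure is typically derived by averaging a timed round-trip loop over its iterations; a rough Python sketch of that idea is below (an illustration only, not the actual contents of latency.patch.txt; clock, ep, args and the message lists are the names used in local-send-recv.py):

    # Illustration: average round-trip (ping-pong) latency over n_iter
    # iterations, reported in microseconds.
    start = clock()
    for i in range(args.n_iter):
        await ep.send(msg_send_list[i], args.n_bytes)
        await ep.recv(msg_recv_list[i], args.n_bytes)
    stop = clock()
    latency_us = (stop - start) / args.n_iter * 1e6
    print(f"Average round-trip latency: {latency_us:.2f} us")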

@quasiben
Member

I think there are a few things here (@pentschev / @madsbk please chime in if I'm incorrect or if you have additional thoughts):

With host memory transfers you will probably need UCXPY_NON_BLOCKING_MODE=1 due to a UCX issue. We have also found that setting a CPU affinity for the process (especially with shared memory) can increase performance. With the following I measure 35-40 us (with your latency patch):

UCXPY_NON_BLOCKING_MODE=1 UCX_NET_DEVICES=mlx5_0:1 UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=rc,sm,self python local-send-recv.py --n-bytes 4 --client-only --server-address 10.33.228.80 --reuse-alloc --port 39880 --n-iter 1000 --object_type numpy --server-cpu-affinity 2
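
For what it's worth, the pinning that a CPU-affinity option applies boils down to something like the snippet below on Linux (a sketch only; the core number and the exact mechanism behind the benchmark's --server-cpu-affinity flag are assumptions here):

    import os

    # Sketch: pin the current process to core 2 using the Linux-only
    # sched_setaffinity API; the benchmark's flag may do this differently.
    os.sched_setaffinity(0, {2})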

I have a vague recollection that the benchmark originally measured only one direction, which I believe is also what ucx_perftest does. Something equivalent to:

            start = clock()
            await ep.send(msg_send_list[i], args.n_bytes)
            stop = clock()

Currently this test does the following:

            start = clock()
            await ep.send(msg_send_list[i], args.n_bytes)
            await ep.recv(msg_recv_list[i], args.n_bytes)
            stop = clock()

So we are effectively measuring the bidirectional send/recv. If we change the test to measure only the send, I observe a 9-10 us latency, which seems more in line with what UCX is capable of.
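
For example, a send-only variant of the timed section could look like the sketch below (the matching recv still has to complete, just outside the timed region; send_times is a hypothetical accumulator, not a name from the benchmark):

    # Sketch: time only the send, then finish the matching recv outside
    # the timed region before the next iteration.
    start = clock()
    await ep.send(msg_send_list[i], args.n_bytes)
    stop = clock()
    await ep.recv(msg_recv_list[i], args.n_bytes)
    send_times.append(stop - start)  # send_times: hypothetical list of per-send times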

@pentschev
Member

With host memory transfers you will probably need UCXPY_NON_BLOCKING_MODE=1 due to a UCX issue.

That's only true for shared memory, not for other transports. For shared memory to work, we also need #565, which I just opened.

Also, I'm seeing much higher latencies than both of you, around 90-100 us in blocking mode and 50-60 us in non-blocking mode, and I don't know why that is. Could you try it in non-blocking mode with the patch I mentioned above and see how the latency looks then?

@jakirkham
Member

Might also be worth trying a bigger message size to get something more representative.

@pentschev
Member

Might also be worth trying a bigger message size to get something more representative.

I don't think this applies to latency benchmarks the same way it does to bandwidth. According to slide 22 of the GTC presentation referenced in #563 (comment), the latency for UCX-Py used to be under 3 us for 4-byte messages. We really haven't put much effort into improving latency since the UCX-Py rewrite, but it's probably something we should do.
