High Inter-node Latency for Small Messages for UCX-Py using InfiniBand #563

Open
aamirshafi opened this issue Jul 19, 2020 · 4 comments

@aamirshafi

We are trying to reproduce the host-based (numpy objects) UCX-Py numbers shown on slide 22 of the GTC 2019 talk (https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9679-ucx-python-a-flexible-communication-library-for-python-applications.pdf), but we are getting much higher latency. We see around 54 us for a 4-byte message in latency-bound mode and around 82 us in throughput-bound mode. The equivalent benchmark at the UCX level reports 2 to 3 us.

The latency here seems to be on the high side. What could be the reason for this?

Some details on the test setup:

The test runs across two nodes connected via InfiniBand. The benchmark is https://github.com/rapidsai/ucx-py/blob/branch-0.15/benchmarks/local-send-recv.py; we only modified it to report latency numbers (see latency.patch.txt).

Exact commands are as follows.

Server:

UCX_NET_DEVICES=mlx5_0:1 UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=rc,sm,self python local-send-recv.py --n-bytes 4 --reuse-alloc --server-only --n-iter 10000 --object_type numpy

The server prints "Server Running at X:Y".

Client:

UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=rc,sm,self python local-send-recv.py --n-bytes 4 --client-only --server-address X --reuse-alloc --port Y --n-iter 10000 --object_type numpy
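
For reference, a per-message latency figure is typically derived by averaging a timed round-trip loop over its iterations; a rough Python sketch of that idea is below (an illustration only, not the actual contents of latency.patch.txt; clock, ep, args and the message lists are the names used in local-send-recv.py):

    # Illustration: average round-trip (ping-pong) latency over n_iter
    # iterations, reported in microseconds.
    start = clock()
    for i in range(args.n_iter):
        await ep.send(msg_send_list[i], args.n_bytes)
        await ep.recv(msg_recv_list[i], args.n_bytes)
    stop = clock()
    latency_us = (stop - start) / args.n_iter * 1e6
    print(f"Average round-trip latency: {latency_us:.2f} us")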

@quasiben
Member

I think there are a few things here (@pentschev / @madsbk please chime in if I'm incorrect or if you have additional thoughts):

With host memory transfers you will probably need UCXPY_NON_BLOCKING_MODE=1 due to a UCX issue. We have also found that setting a CPU affinity for the process (especially with shared memory) can increase performance. With the following I measure 35-40 us (with your latency patch):

UCXPY_NON_BLOCKING_MODE=1 UCX_NET_DEVICES=mlx5_0:1 UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=rc,sm,self python local-send-recv.py --n-bytes 4 --client-only --server-address 10.33.228.80 --reuse-alloc --port 39880 --n-iter 1000 --object_type numpy --server-cpu-affinity 2
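
For what it's worth, the pinning that a CPU-affinity option applies boils down to something like the snippet below on Linux (a sketch only; the core number and the exact mechanism behind the benchmark's --server-cpu-affinity flag are assumptions here):

    import os

    # Sketch: pin the current process to core 2 using the Linux-only
    # sched_setaffinity API; the benchmark's flag may do this differently.
    os.sched_setaffinity(0, {2})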

I have a vague recollection that the benchmark originally measured only one direction, which I believe is also what ucx_perftest does. Something equivalent to:

            start = clock()
            await ep.send(msg_send_list[i], args.n_bytes)
            stop = clock()

Currently this test does the following:

            start = clock()
            await ep.send(msg_send_list[i], args.n_bytes)
            await ep.recv(msg_recv_list[i], args.n_bytes)
            stop = clock()

So we are effectively measuring the bidirectional send/recv. If we change the test to measure only the send, I observe a 9-10 us latency, which seems more in line with what UCX is capable of.
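
For example, a send-only variant of the timed section could look like the sketch below (the matching recv still has to complete, just outside the timed region; send_times is a hypothetical accumulator, not a name from the benchmark):

    # Sketch: time only the send, then finish the matching recv outside
    # the timed region before the next iteration.
    start = clock()
    await ep.send(msg_send_list[i], args.n_bytes)
    stop = clock()
    await ep.recv(msg_recv_list[i], args.n_bytes)
    send_times.append(stop - start)  # send_times: hypothetical list of per-send times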

@pentschev
Member

With host memory transfers you will probably need UCXPY_NON_BLOCKING_MODE=1 due to a UCX issue.

That's only true for shared memory, not for other transports. For shared memory to work, we also need #565, which I just opened.

Also, I'm seeing much higher latencies than both of you, around 90-100 us in blocking mode and 50-60 us in non-blocking mode, and I don't know why that is. Could you try it in non-blocking mode with the patch I mentioned above and see how the latency looks then?

@jakirkham
Member

Might also be worth trying a bigger message size to get something more representative.

@pentschev
Member

Might also be worth trying a bigger message size to get something more representative.

I don't think this applies to latency benchmarks the same way it does to bandwidth. According to slide 22 of the GTC presentation referenced in #563 (comment), the latency for UCX-Py used to be under 3 us for 4-byte messages. We really haven't put much effort into improving latency since the UCX-Py rewrite, but it's probably something we should do.
