High Inter-node Latency for Small Messages for UCX-Py using InfiniBand #563
Comments
I think there are a few things here (@pentschev / @madsbk please chime in if I'm incorrect or you have additional thoughts). With host memory transfers you probably will need …

I have a vague recollection that the test in the benchmark was only measuring one direction? I think it was something like:

```python
start = clock()
await ep.send(msg_send_list[i], args.n_bytes)
stop = clock()
```

Currently the test does the following:

```python
start = clock()
await ep.send(msg_send_list[i], args.n_bytes)
await ep.recv(msg_recv_list[i], args.n_bytes)
stop = clock()
```

So we are perhaps measuring the bi-directional send/recv. If we change the test to only measure the send, we would be back to timing a single direction; alternatively, the measured round trip can simply be halved, as sketched below.
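As a concrete illustration of the second option, here is a minimal sketch of a ping-pong timing loop: keep the echoed recv so both sides stay in lockstep, but halve the measured round trip to report a one-way latency. It reuses names from local-send-recv.py (`ep`, `msg_send_list`, `msg_recv_list`, `args`, `clock`) as assumptions about the surrounding harness, and the `send`/`recv` signatures may differ between UCX-Py versions.

```python
async def timed_pingpong(ep, msg_send_list, msg_recv_list, args, clock):
    """Hypothetical helper: estimate one-way latency (us) from a send/recv loop."""
    start = clock()
    for i in range(args.n_iter):
        await ep.send(msg_send_list[i], args.n_bytes)  # one direction
        await ep.recv(msg_recv_list[i], args.n_bytes)  # peer echoes it back
    stop = clock()
    round_trip = (stop - start) / args.n_iter          # seconds per iteration
    return round_trip / 2 * 1e6                        # halve for one direction
```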
That's only true for shared memory, not for other transports. For shared memory to work, we also need #565, which I just opened. Also, I'm seeing much higher latencies than both of you, around 90-100 us in blocking mode and 50-60 us in non-blocking mode, and I don't know why that is. Could you try it in non-blocking mode with the patch I mentioned above and see how the latency looks then?
Might also be worth trying a bigger message size to get something more representative.
I don't think this applies to latency benchmarks the same way it does to bandwidth. According to slide 22 of the GTC presentation referenced in #563 (comment), the latency for UCX-Py used to be under 3 us for 4-byte messages. We really haven't put much effort into improving latency since the UCX-Py rewrite, but it's probably something we should do.
We are trying to reproduce the host-based (NumPy objects) UCX-Py numbers shown on slide 22 of the GTC 2019 talk (https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9679-ucx-python-a-flexible-communication-library-for-python-applications.pdf), but we are getting much higher latency: around 54 us for a 4-byte message in latency-bound mode, and around 82 us in throughput-bound mode. The equivalent benchmark at the UCX level reports 2 to 3 us.
The latency here seems to be on the high side. What could be the reason for this?
Some details on the test setup:
The two nodes are connected via InfiniBand. The benchmark is https://github.com/rapidsai/ucx-py/blob/branch-0.15/benchmarks/local-send-recv.py; we only modified it to report latency numbers (see latency.patch.txt).
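For reference, here is a sketch of how per-iteration timings can be turned into a latency summary. This is illustrative only and not the contents of latency.patch.txt; the function name and the `bidirectional` flag are hypothetical.

```python
import numpy as np

def summarize_latency(iteration_times_s, bidirectional=True):
    """Illustrative only: turn per-iteration wall times into latency stats (us)."""
    t = np.asarray(iteration_times_s, dtype="f8")
    if bidirectional:
        t = t / 2.0  # each iteration timed a send plus the matching recv
    return {
        "median_us": float(np.median(t) * 1e6),
        "p99_us": float(np.percentile(t, 99) * 1e6),
    }
```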
The exact commands are as follows.

Server:

```sh
UCX_NET_DEVICES=mlx5_0:1 UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=rc,sm,self python local-send-recv.py --n-bytes 4 --reuse-alloc --server-only --n-iter 10000 --object_type numpy
```

which prints:

```
Server Running at X:Y
```

Client:

```sh
UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=rc,sm,self python local-send-recv.py --n-bytes 4 --client-only --server-address X --reuse-alloc --port Y --n-iter 10000 --object_type numpy
```
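To help separate UCX-Py overhead from the InfiniBand fabric, a minimal single-host loopback ping-pong can be useful. The sketch below is not the benchmark above and makes several assumptions: a UCX-Py version whose `Endpoint.send`/`recv` accept just a buffer (older releases such as branch-0.15 also take an explicit size), that `ucp.get_address()` picks a usable interface, and an arbitrary port number.

```python
# Hypothetical loopback sanity check: client and echo server in one process.
import asyncio
import time
import numpy as np
import ucp

PORT = 13337       # arbitrary
N_ITER = 10_000
MSG_SIZE = 4       # bytes, matching the 4-byte case in the issue

async def main():
    async def echo_handler(ep):
        buf = np.empty(MSG_SIZE, dtype="u1")
        for _ in range(N_ITER):
            await ep.recv(buf)   # receive the ping
            await ep.send(buf)   # echo it back
        await ep.close()

    listener = ucp.create_listener(echo_handler, PORT)
    ep = await ucp.create_endpoint(ucp.get_address(), PORT)

    ping = np.ones(MSG_SIZE, dtype="u1")
    pong = np.empty(MSG_SIZE, dtype="u1")
    start = time.monotonic()
    for _ in range(N_ITER):
        await ep.send(ping)
        await ep.recv(pong)
    stop = time.monotonic()

    one_way_us = (stop - start) / N_ITER / 2 * 1e6  # halve the round trip
    print(f"estimated one-way latency: {one_way_us:.1f} us")

    await ep.close()
    listener.close()

if __name__ == "__main__":
    asyncio.get_event_loop().run_until_complete(main())
```

If a loopback run like this already shows tens of microseconds, the overhead is likely in the Python/UCX-Py layer rather than the IB link itself.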