DDS Tuning for ROS 2

May 17, 2024

Picture this: you're developing a robot that you aim to control using a nice, clean user interface. Control goes through this UI on one laptop, over a network connection, to the robot, which runs its own single-board computer (SBC), all using standard ROS 2. You do a ton of testing, at first just with an Ethernet cable since you don't feel like dragging your radios out of the closet, and everything looks good! And then, you show up to an event with your robot, and you realize that whenever you switch views in the UI, your entire comms link suddenly grinds to a halt.

So, what gives?

Discovery Traffic and Fragmentation

This issue is a classic case of discovery traffic trashing your link. Whenever the UI subscribes to a new topic in ROS, the Data Distribution Service (DDS) middleware needs to figure out which node on the network is publishing that topic, so it sends a ton of multicast UDP traffic out onto the network. This works fine over a high-bandwidth Ethernet link, but causes problems with a bandwidth-constrained radio connection.

When our team hit this, we took some Wireshark captures before and after. Here's the normal, working state of the network:

And here's after we generate a bit of DDS discovery traffic:

The unassembled fragments! The horror...

If you're not familiar with fragmentation, it's something that happens at the network layer to break large packets into smaller ones that fit your particular network's maximum transmission unit (MTU), which is usually 1500 bytes. This fragmentation and reassembly is typically transparent to the user, and is handled at the OS level. The fact that Wireshark is showing these unassembled fragments implies that they couldn't be reassembled correctly -- some of the fragments were dropped, and now the message can't be recovered.

In other words, this is not cute: computer networks only do this when they're in extreme distress.
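If you want to confirm that this is what's happening to you, the Linux kernel keeps fragmentation and reassembly counters in /proc/net/snmp. Here's a minimal Python sketch (the filename is just illustrative) that prints the relevant ones; a ReasmFails count that climbs while your link is struggling is a strong hint that fragments are being lost.

check_reassembly.py
# Print the kernel's IP fragmentation/reassembly counters from /proc/net/snmp.
# ReasmFails counts datagrams that could not be reassembled -- exactly the
# "unassembled fragments" symptom described above. (Linux-only sketch.)

def read_ip_counters(path="/proc/net/snmp"):
    with open(path) as f:
        lines = [line.split() for line in f if line.startswith("Ip:")]
    # /proc/net/snmp stores each protocol as a header row followed by a value row.
    header, values = lines[0][1:], lines[1][1:]
    return dict(zip(header, (int(v) for v in values)))

if __name__ == "__main__":
    counters = read_ip_counters()
    for key in ("ReasmReqds", "ReasmOKs", "ReasmFails", "ReasmTimeout", "FragCreates"):
        print(f"{key}: {counters.get(key, 'n/a')}")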

Tunable Parameters

The ROS 2 documentation suggests a few knobs which you can turn to hopefully improve performance:

  • Reduce net.ipv4.ipfrag_time. By default, Linux allows 30 seconds for a packet to be reassembled before giving up and dropping the fragments from memory. If a lot of fragmentation is happening, allowing this much time can fill buffers quickly, and on a small network (on the scale of one robot and its base station), it shouldn't ever take more than a few seconds for a fragment to get from one end to the other. ROS recommends a value of 3 seconds.
  • Increase net.ipv4.ipfrag_high_thresh. This controls the amount of memory used for packet defragmentation. By default it's 256 KiB, but in a scenario with lots of fragmentation happening, you can bump this up as high as 128 MB.
  • Increase net.core.rmem_max. This controls the size of the buffer that the Linux kernel uses for receiving data on a socket. ROS suggests anywhere from 4 MiB to 2 GiB depending on the DDS vendor.
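Before (or after) applying these, it's handy to see where your system currently stands. Here's a small Python sketch, assuming a Linux host, that reads the current values from /proc/sys and compares them against the figures mentioned above (the exact targets are yours to choose):

check_sysctls.py
# Compare current kernel settings against the values discussed above.
# The "suggested" numbers here are just the ones from this post -- tune to taste.
SUGGESTED = {
    "net/ipv4/ipfrag_time": 3,                          # seconds
    "net/ipv4/ipfrag_high_thresh": 128 * 1024 * 1024,   # bytes (upper bound)
    "net/core/rmem_max": 4 * 1024 * 1024,               # bytes (low end of the suggested range)
}

def current(key):
    # sysctl knobs are exposed as files under /proc/sys, e.g. /proc/sys/net/ipv4/ipfrag_time
    with open(f"/proc/sys/{key}") as f:
        return int(f.read().strip())

if __name__ == "__main__":
    for key, suggested in SUGGESTED.items():
        print(f"{key}: current={current(key)} suggested={suggested}")

To make the changes persistent, you'd put the equivalent net.ipv4.ipfrag_time = 3 style lines in a sysctl configuration file (e.g. under /etc/sysctl.d/).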

Alternative DDS Middleware

One thing to try is changing the actual DDS implementation you're using. DDS is an open standard, and there are plenty of implementations to choose from. The popular implementations all have corresponding ROS middleware (RMW) layers, and it's trivial to switch from one to another.
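Switching is usually just a matter of setting the RMW_IMPLEMENTATION environment variable before your nodes start (assuming the corresponding rmw package is installed). Here's a sketch of a hypothetical launch file that pins its nodes to Cyclone; the package and executable names are placeholders:

cyclone_launch.py
# Hypothetical launch file: pin the launched nodes to the Cyclone DDS RMW layer.
# Requires the rmw_cyclonedds_cpp package to be installed; the node below is a
# placeholder for your own system.
from launch import LaunchDescription
from launch.actions import SetEnvironmentVariable
from launch_ros.actions import Node

def generate_launch_description():
    return LaunchDescription([
        # Equivalent to `export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp` before `ros2 run`.
        SetEnvironmentVariable("RMW_IMPLEMENTATION", "rmw_cyclonedds_cpp"),
        Node(package="my_robot", executable="drive_node"),  # placeholder node
    ])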

The default middlewares so far have been:

  • Ardent, Bouncy, Crystal, Dashing, Eloquent, Foxy: FastDDS by eProsima
  • Galactic: CycloneDDS by Eclipse
  • Humble, Jazzy: FastDDS by eProsima

In addition to FastDDS and CycloneDDS, there are ConnextDDS and GurumDDS, but those lack open-source licenses.

Interestingly, Cyclone seems to be more loved by the community than FastDDS: the MoveIt motion planning framework recommends Cyclone, for instance. For what it's worth, switching to Cyclone largely fixed our discovery traffic problems:

DDS Overhead

This slide deck by Charles Cross goes into detail about some techniques for optimizing ROS 2 traffic. One often-overlooked aspect of networking in ROS 2 is the amount of overhead attached to each message.

Here's a packet containing a command that we sent to the left wheel of our robot. It's a single 32-bit floating point number, so the value itself takes up 4 bytes. How much overhead does transmitting this incur?

That's 138 bytes over the wire:

  • 14 bytes of Ethernet header
  • 20 bytes of IPv4 header
  • 8 bytes of UDP header
  • 96 bytes of RTPS data
    • 20 bytes of RTPS header
    • 12 bytes of INFO_TS submessage
    • 32 bytes of DATA submessage
      • 24 bytes of header
      • 4 bytes of encapsulation kind/options
      • 4 bytes of data (this is our actual message)
    • 32 bytes of HEARTBEAT submessage

In other words, sending each message incurs 134 bytes of overhead (the 138 bytes on the wire minus the 4 bytes of payload), regardless of the actual size of the message. With this in mind, we can greatly reduce the load on the network by combining as much data as possible into as few topics as possible. In practice, here's how we applied this:

Custom Messages

Wherever possible, make use of ROS 2 custom messages. Each new system we build starts with a *_msgs package, which includes message definitions tailored to that stack.

There are lots of reasons to do this other than performance: it makes the development experience much better by giving meaningful names to values, it catches semantic "mismatches" caused by improperly hooking up a publisher and subscriber, and it helps make invalid states unrepresentable (can you tell I'm a Rust fan?).

For the above example, our team now uses a DriveCommand, which encodes the speeds of all six of our motors in a single message, with each field named semantically:

DriveCommand.msg
float32 left_front
float32 left_middle
float32 left_back
float32 right_front
float32 right_middle
float32 right_back
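To make the savings concrete, assuming the same framing as the capture above: six separate Float32 topics cost roughly 6 × 138 = 828 bytes per update, while a single DriveCommand carries 24 bytes of payload for roughly 134 + 24 = 158 bytes total. A minimal publisher for it might look like the sketch below; the package name rover_msgs and the topic name are assumptions for illustration, not our exact setup.

drive_publisher.py
# Minimal sketch of publishing the combined DriveCommand message with rclpy.
# Assumes the message above lives in a package called rover_msgs (placeholder name).
import rclpy
from rclpy.node import Node
from rover_msgs.msg import DriveCommand  # hypothetical *_msgs package

class DrivePublisher(Node):
    def __init__(self):
        super().__init__("drive_publisher")
        self.pub = self.create_publisher(DriveCommand, "/drive/command", 10)

    def send(self, left, right):
        # One message carries all six wheel speeds instead of six tiny messages.
        msg = DriveCommand()
        msg.left_front = msg.left_middle = msg.left_back = left
        msg.right_front = msg.right_middle = msg.right_back = right
        self.pub.publish(msg)

def main():
    rclpy.init()
    node = DrivePublisher()
    node.send(0.5, 0.5)
    rclpy.spin_once(node, timeout_sec=0.1)
    rclpy.shutdown()

if __name__ == "__main__":
    main()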

Large Messages and Fragmentation

Our team ran into another issue with ROS 2: heavy fragmentation with large message types.

While filming a demo video of our system, we wanted to include a visualization of point-cloud data from our stereoscopic camera (a ZED 2i by Stereolabs). Last year, we got the shot (video here):

We had no trouble doing this in ROS 1 -- as our communication link degraded, the point cloud data just started to slow down. However, in ROS 2, this demo completely broke down.

The key difference here is that ROS 1 used TCP, while ROS 2 uses UDP. Each point cloud message is big: our sensor is quite high-resolution, and each pixel needs a color and a position in space. With a TCP connection, congestion control takes care of things: the sender slows down and frames get dropped before they ever hit the wire -- this lowers the framerate at the output but otherwise works fine.

In ROS 2 using a DDS middleware, messages are sent over UDP. However, remember that frames on most networks are limited to an MTU of 1500 bytes. When you send a large message using a DDS, the DDS layer just hands the OS one massive UDP datagram, and the OS is responsible for fragmenting it appropriately. If the computer on the other end can reassemble the fragments properly, you're in business. On a congested network, the chances of successful reassembly fall to near zero, meaning you'll get nothing out of the other end of the connection.

Here, you've got a few options:

  • Try to control the fragmentation process yourself: avoid sending large messages and instead send several smaller ones (a sketch of this follows the list).
  • If you're able to, try to reduce the traffic over the network link to pull it out of a congested state (and hope for the best).
  • Send this data through a different transport mechanism entirely. (We do this for image streaming.)
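As a sketch of the first option, you can at least measure how bad the problem is before a message hits the wire: rclpy can serialize a message in-process, so you can estimate how many IP fragments it would turn into and decide to decimate or skip it. The 1500-byte MTU, the fragment threshold, and the "drop it" policy below are illustrative assumptions, not a prescription:

fragment_check.py
# Estimate how many IP fragments a message will produce before publishing it.
# Purely illustrative: the MTU and the "skip if too big" policy are assumptions
# you would replace with your own logic (e.g. decimating the point cloud).
from rclpy.serialization import serialize_message

MTU = 1500  # bytes, typical Ethernet/Wi-Fi MTU

def estimate_fragments(msg, mtu=MTU):
    size = len(serialize_message(msg))
    # Rough estimate: payload per fragment is the MTU minus IPv4 + UDP headers,
    # ignoring RTPS overhead. -(-a // b) is ceiling division.
    payload_per_fragment = mtu - 28
    return size, -(-size // payload_per_fragment)

def publish_if_small_enough(publisher, msg, max_fragments=16):
    size, fragments = estimate_fragments(msg)
    if fragments > max_fragments:
        # Too likely to be lost on a congested link -- drop (or downsample) instead.
        return False
    publisher.publish(msg)
    return True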

QoS parameters

ROS 2 introduced Quality-of-Service parameters on subscribers and publishers. A QoS "profile" includes a few parameters (listed here roughly from most to least important):

  • History (Queue Depth): The number of samples to keep in the queue. Lower numbers improve efficiency, but may lead to old messages being lost if the network drops for a second or so. A sensor readout can usually have a queue depth of 1 since the system only cares about the last read value, but a button press value should have a larger queue depth or it's possible to miss inputs.
  • Reliability: "Best Effort" or "Reliable." Controls whether publishers should retry transmission to subscribers if it fails.
  • Durability: Whether the publisher holds on to samples to give to subscribers who join after the publisher starts ("Transient Local") or not ("Volatile").
  • Deadline: The expected maximum amount of time between messages on a topic.
  • Lifespan: The amount of time until a message becomes stale. Messages which are published but not received until after their lifespan expires are dropped.
  • Liveliness: Whether publishers are implicitly considered alive if they publish to a topic ("Automatic") or if they must manually use the publisher API to assert liveliness ("Manual By Topic").
  • Lease Duration: The amount of time between assertions of liveliness (see above) before a node is considered to have lost liveliness.

While subscribers and publishers use the same profiles, they work in slightly different ways. A subscriber's profile is the minimum quality it's willing to accept, and the publisher's profile is the maximum quality it's willing to provide. This makes a bit more sense when you look at examples:

Publisher     | Subscriber   | Communication
Reliable      | Reliable     | Reliable
Reliable      | Best Effort  | Best Effort
Best Effort   | Best Effort  | Best Effort
Best Effort   | Reliable     | (doesn't work)
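In rclpy, a profile is just a QoSProfile object passed when creating a publisher or subscription. Here's a sketch of the two common cases discussed above -- a best-effort, depth-1 sensor stream and a reliable, deeper-queue input topic. The topic names and message types are placeholders:

qos_examples.py
# Sketch of typical QoS profiles in rclpy; topic names and message types are placeholders.
from rclpy.node import Node
from rclpy.qos import QoSProfile, ReliabilityPolicy, HistoryPolicy, DurabilityPolicy
from sensor_msgs.msg import PointCloud2
from std_msgs.msg import Bool

# Sensor data: only the latest sample matters, and dropped samples are acceptable.
SENSOR_QOS = QoSProfile(
    history=HistoryPolicy.KEEP_LAST,
    depth=1,
    reliability=ReliabilityPolicy.BEST_EFFORT,
    durability=DurabilityPolicy.VOLATILE,
)

# Button presses: keep a deeper queue and retry so inputs aren't silently lost.
COMMAND_QOS = QoSProfile(
    history=HistoryPolicy.KEEP_LAST,
    depth=10,
    reliability=ReliabilityPolicy.RELIABLE,
)

def make_endpoints(node: Node, cloud_callback, button_callback):
    node.create_subscription(PointCloud2, "/zed/points", cloud_callback, SENSOR_QOS)
    node.create_subscription(Bool, "/ui/button", button_callback, COMMAND_QOS)

Note that, per the table above, a subscriber constructed with COMMAND_QOS will never match a publisher that only offers best-effort delivery.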

Zenoh

What if, however, you weren't at the mercy of the DDS protocol at all? What if you let go of the purism of the DDS standard and instead substituted a more efficient protocol that could better handle the congested, dynamic conditions of a lossy wireless network?

This is where Zenoh comes in. Zenoh is a pub/sub protocol designed with one goal: reduce network overhead as much as possible. Most importantly, it has plugins aimed at interoperating with existing DDS systems.

When working with ROS 2, Zenoh dramatically reduces the amount of discovery traffic generated (by up to 97%) compared to a traditional DDS middleware. We've also found it to be much better at communication in general, even after discovery has finished. We're not the only ones: an Indy Autonomous Challenge team made the same observations.

So how is Zenoh some kind of miracle for communication? It's a well-designed protocol, but more importantly, it's not held down by the restrictions of the DDS standard. DDS is based on a protocol designed in 2003, and is designed foremost for interoperability, not efficiency. You can see this in the protocol's design: instead of defining a standard byte order, the byte order is indicated by a flags field in each message; information about the DDS vendor is sent with every packet; each node is required to maintain a full graph of the network; and entity IDs are duplicated between submessages in a packet instead of being sent once per packet.

ROS 2 was initially pitched as a DDS-centric version of ROS, replacing the bespoke TCPROS with DDS as an open, standard communication protocol. However, frustrations with the complexity of the DDS implementations and their poor performance on lossy wireless networks motivated Open Robotics to search for an alternative middleware. Unsurprisingly, they chose Zenoh.

If you're reading this after May 23, 2024 (World Turtle Day), ROS 2 Jazzy Jalisco will be out, and hopefully rmw_zenoh with it.

In the meantime, we're stuck running a DDS on each host and using plugins to bridge the gaps. The best approach as of writing is to use zenoh-plugin-ros2dds, setting ROS_DOMAIN_ID to a different value on each host to prevent DDS communication over the network. Old tutorials used zenoh-plugin-dds, which works for ROS 2 applications but doesn't support ROS 2 tooling as well.

Zenoh ended up being the solution to our team's DDS dilemma. We plan to run Zenoh at the 2024 University Rover Challenge.

UDP vs TCP

When running over a network, Zenoh can use either TCP or UDP. TCP is typically chosen as a network protocol for its reliable transmission guarantees, so it may seem like a poor fit for a system where some topics only need best-effort transmission. We've found, however, that our system tends to work much better over TCP than UDP, and we believe this is due to TCP's congestion control mechanism.

TCP uses a complex congestion control algorithm to avoid sending too much traffic over a congested network and causing undesirable behavior. With our radio network, the constrained link is specifically between our two IP radios. Each radio has a full 100 Mbps link to its computer (either the Jetson on the rover or the laptop in the base station), so the computers running ROS have no direct knowledge of network conditions. When naive UDP protocols like DDS send way too much traffic over the network, packets get dropped at the radios, and the only feedback each host receives is a lack of acknowledgement after a timeout. On the other hand, TCP congestion control minimizes packet drops in the system, resulting in more reliable transmission.

ros2-sunburst

In true breq.dev tradition, this post leaves off with a tool I've written to help solve some tiny facet of this problem. In this case, it's a bandwidth usage visualization tool for ROS 2 traffic.

Here's a plot of our team's bandwidth usage by topic:

In ROS, topics tend to follow a hierarchical structure: your robot's drive stack might fall under /drive, commands might be under /drive/cmd_vel, etc. This structure isn't inherent to the transport protocol in any sense, but it does help you reason about which parts of your system are using the most bandwidth.
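The roll-up itself is simple. Here's a sketch, with made-up numbers, of how per-topic byte counts can be grouped by their top-level namespace to produce the kind of hierarchy the plot shows:

namespace_rollup.py
# Group per-topic byte counts by top-level namespace (illustrative numbers only).
from collections import defaultdict

def rollup(topic_bytes):
    totals = defaultdict(int)
    for topic, nbytes in topic_bytes.items():
        # "/drive/cmd_vel" -> "/drive"; topics at the root fall under "/"
        parts = topic.strip("/").split("/")
        totals["/" + parts[0] if parts[0] else "/"] += nbytes
    return dict(totals)

if __name__ == "__main__":
    sample = {"/drive/cmd_vel": 12_000, "/drive/status": 3_000, "/camera/points": 950_000}
    for namespace, total in sorted(rollup(sample).items(), key=lambda kv: -kv[1]):
        print(f"{namespace}: {total} bytes")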

Using this tool works as follows:

  • Capture a sample of DDS traffic over your network with Wireshark. (Ensure you include the initial discovery traffic -- otherwise there is no way to map DDS writer/reader IDs to ROS 2 topic names).
  • Run this script on your packet capture file.
  • View the output in the terminal (the top topic names by value) and in the plot (the sunburst visualization).

You can grab the script from this gist.