Task #1624

Design congestion control scheme

Added by Junxiao Shi almost 3 years ago. Updated 7 months ago.

Status:
New
Priority:
High
Category:
Forwarding
Target version:
-
Start date:
Due date:
% Done:

0%


Description

Design a congestion control scheme.

design-20160229.pdf (138 KB) Jeff Burke, 03/06/2016 08:37 AM

draft_own_cong_marks.pdf - data for note 5 (816 KB) Jeff Burke, 03/13/2016 05:37 PM

draft_multi_for_bic3.pdf - TCP BIC Performance (1.06 MB) Klaus Schneider, 03/29/2016 10:22 PM

draftmulti_for_aimd3.pdf - TCP Reno Performance (1.07 MB) Klaus Schneider, 03/29/2016 10:22 PM


Related issues

Related to NFD - Feature #3797: Congestion Control: generic congestion marks Feedback
Related to NFD - Feature #3823: Congestion Control: design Local Link Loss Detection In Progress

History

#1 Updated by Jeff Burke about 1 year ago

Is this the correct issue for tracking current congestion control work to support NDN-RTC? It seems abandoned.

#2 Updated by Junxiao Shi about 1 year ago

  • Assignee set to Klaus Schneider

Klaus started working on congestion control last month, but I forgot to assign this task.

#3 Updated by Klaus Schneider about 1 year ago

I think we are talking about two different designs:

  • A long-term general (application-agnostic) congestion control design.
  • A stripped-down short-term version for NDN-RTC.

I will send around the current state of the first one next week.

  • We can turn the short-term design into a Hackathon project at the NDN Retreat.

#4 Updated by Jeff Burke about 1 year ago

The design proposed by Klaus on 2/29 is attached.

#5 Updated by Jeff Burke about 1 year ago

  • File draft_own_cong_marks.pdf added

Per Klaus -

The problem is the following: in the document, I described using the
"time that the sojourn time remained above the target value" (further
called "queuing time") as a measure of the queue size.

Currently, I only use a binary notion of congestion (marked/not marked).
I tried to use the quantitative congestion notion ("queuing time") to
influence either the reaction at the consumer or the adjustment of the
forwarding ratio, but could never achieve a notable performance benefit.

I just figured out the problem. Look at figures "Congestion Window" and
"Congestion Marks" on slide 5 and "producer1, 0" on the last slide of
the pdf.

The "queuing time" is actually negatively related to the mismatch
between the congestion window and the current queue size at the router.

  • the queuing time is highest around the time the congestion window has reached the optimal size
  • it is highest right before the queuing delay at the router falls below the target value (5 ms).

A better metric of the amount of congestion would be "the minimum
queuing delay over the last interval (100 ms)". However, I think there
is a good reason why CoDel doesn't use it: It is much more complex in
terms of CPU usage.

Basically, it would require calculating a minimum over all packets in
the last 100 ms for every incoming packet. The current design only
needs 1 comparison (check if queuing delay > target) instead of this
minimum calculation.

Does anyone know how to do this efficiently?

If there isn't an efficient algorithm for that, I think we should stick
to a stream of binary congestion marks. These also have the benefit of
lower design complexity and traffic overhead.
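For what it's worth, a streaming minimum over a sliding time window can be maintained with amortized O(1) work per packet using a monotonic deque (the "ascending minima" technique). A rough Python sketch with illustrative names; the window size and delay units are just for the example:

```python
from collections import deque

class SlidingWindowMin:
    """Minimum queuing delay over the last `window` ms,
    maintained with a monotonic deque (amortized O(1) per sample)."""

    def __init__(self, window):
        self.window = window
        self.samples = deque()  # (timestamp, delay), delays strictly ascending

    def add(self, now, delay):
        # Older samples with a delay >= the new one can never be the
        # minimum again, so drop them from the back.
        while self.samples and self.samples[-1][1] >= delay:
            self.samples.pop()
        self.samples.append((now, delay))
        # Expire samples that have fallen out of the time window.
        while self.samples[0][0] <= now - self.window:
            self.samples.popleft()
        return self.samples[0][1]  # current minimum over the window
```

Each packet still costs a few deque operations in the worst case, so whether this beats a single comparison in a fast path is debatable; it does at least avoid rescanning all packets of the last 100 ms.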

#6 Updated by Jeff Burke about 1 year ago

  • File deleted (draft_own_cong_marks.pdf)

#8 Updated by susmit shannigrahi about 1 year ago

The design document talks about TCP RENO, additive increase and multiplicative decrease. This performs quite badly in high-capacity networks. I understand this may or may not be applicable for this particular application, but we should consider the trade-offs.

Here are some interesting numbers: https://fasterdata.es.net/assets/fasterdata/JT-201010.pdf, slide 21 onwards.

#9 Updated by Klaus Schneider about 1 year ago

  • Priority changed from Normal to High

Hi susmit,

thanks for your comments.

Since our design is different from TCP RENO (e.g. it doesn't rely on packet drops as the congestion signal), I think we should look at the specifics of why AIMD performs badly in high bandwidth-delay product networks.

  1. One problem is that, on high-speed networks, TCP RENO requires a very low packet loss rate (slide 22 of your link). This is an issue, but it may be handled by hop-by-hop loss recovery like NDNLP (#3437)

  2. Another problem is the "oscillation" around the optimal value of the receive window (or "pipe size"): It may take a long time for the additive increase step to grab the whole bandwidth.

We have seen these issues in our evaluation and are currently working on replacing the consumer reaction part with an equivalent of TCP CUBIC.

Other suggestions are highly appreciated.

#10 Updated by susmit shannigrahi about 1 year ago

Since our design is different from TCP RENO (e.g. it doesn't rely on packet drops as the congestion signal), I think we should look at the specifics of why AIMD performs badly in high bandwidth-delay product networks.

How we know about congestion is a bit irrelevant, I think. The point is, if packet loss or a Nack results in AIMD, performance will be bad, since you will take many RTTs to ramp up again. This paper gives a nice overview - http://conferences.sigcomm.org/sigcomm/2002/papers/xcp.pdf.

We have seen these issues in our evaluation and are currently working on replacing the consumer reaction part with an equivalent of TCP CUBIC.

Is the design document attached here outdated? Can you please update it if that's the case?

Other suggestions are highly appreciated.

Curious if you have looked at HTCP - https://tools.ietf.org/html/rfc3649.

#11 Updated by Klaus Schneider about 1 year ago

susmit shannigrahi wrote:

Since our design is different from TCP RENO (e.g. it doesn't rely on packet drops as the congestion signal), I think we should look at the specifics of why AIMD performs badly in high bandwidth-delay product networks.

How we know about congestion is a bit irrelevant, I think. The point is, if packet loss or a Nack results in AIMD, performance will be bad, since you will take many RTTs to ramp up again. This paper gives a nice overview - http://conferences.sigcomm.org/sigcomm/2002/papers/xcp.pdf.

Yes, I agree that it may take many RTTs to grab the full bandwidth again (see my second point above).

We use many of the high-level ideas from the XCP paper. For example, the notion to use explicit multi-bit feedback instead of relying on packet losses and timeouts.

We have seen these issues in our evaluation and are currently working on replacing the consumer reaction part with an equivalent of TCP CUBIC.

Is the design document attached here outdated? Can you please update it if that's the case?

The document is still work in progress and I'll update it periodically. Currently, all the anticipated changes are mentioned in this comment section.

Other suggestions are highly appreciated.

Curious if you have looked at HTCP - https://tools.ietf.org/html/rfc3649.

As I understand it, HighSpeed TCP uses a more aggressive AIMD than standard TCP once the congestion window is large.

This should be easily usable in our scheme and we can compare its performance to a TCP CUBIC-like consumer reaction. The four parts of our scheme (congestion detection, signaling, consumer reaction, and forwarding strategy) should be easy to change independently. However, some combinations will likely perform better than others.

#12 Updated by Klaus Schneider about 1 year ago

Here is an update on the adaptation for high capacity networks.

@Susmit: Thanks for the pointer to HighSpeed TCP. I think there are 4 popular TCP algorithms that we could use for the consumer adaptation (ordered by their complexity, as estimated by lines of source code in the Linux Kernel Implementation - https://github.com/torvalds/linux/tree/master/net/ipv4):

  1. Scalable TCP (49 sloc)
  2. High Speed TCP (165 sloc)
  3. TCP BIC (203 sloc)
  4. TCP CUBIC (449 sloc)

Any preferences?

The implementation of these algorithms should be straightforward when accounting for NDN's use of selective acknowledgements (each Interest is acknowledged by its corresponding Data packet). We can use the algorithm described in RFCs 3517 and 6675 (https://datatracker.ietf.org/doc/rfc3517/ https://datatracker.ietf.org/doc/rfc6675/) to only consider one loss event per RTT, even if a burst of marked packets is received.
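As an illustration of the "one loss event per RTT" rule, a minimal sketch in Python: on the first mark we halve the window, then ignore further marks for one RTT. Class and parameter names are made up for the example, not NFD/ndn-cxx API:

```python
class MarkedWindowConsumer:
    """Consumer window adaptation that reacts to at most one
    congestion mark per RTT, in the spirit of RFC 3517 / RFC 6675.
    Names and constants are illustrative, not NFD's API."""

    def __init__(self, init_cwnd=1.0, beta=0.5):
        self.cwnd = init_cwnd
        self.beta = beta         # multiplicative decrease factor
        self.ignore_until = 0.0  # marks before this time are ignored

    def on_data(self, now, marked, rtt):
        if marked:
            if now >= self.ignore_until:
                # First mark of this RTT: one multiplicative decrease,
                # then ignore further marks for one RTT.
                self.cwnd = max(1.0, self.cwnd * self.beta)
                self.ignore_until = now + rtt
        else:
            self.cwnd += 1.0 / self.cwnd  # additive increase per Data
        return self.cwnd
```

A burst of marked Data packets arriving within the same RTT thus triggers only a single window reduction.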

#13 Updated by Klaus Schneider 12 months ago

Here is another question. Let's say Router A has 3 outgoing interfaces to R1 (10 ms delay), R2 (50 ms), and R3 (100 ms) -- there might be multiple hops between these routers and the final producer.

The current design starts with the shortest path (R1), but eventually reaches an even split (33%/33%/33%) on all paths, regardless of the delay.

  • Should we prioritize a path with a lower delay, even if this leads to a loss in total throughput?

I think this decision depends on the application and it might be useful to introduce a header bit that says "only use the shortest path" for certain delay-sensitive traffic.

  • Should the strategy shift traffic back to the faster paths once the bandwidth is no longer fully used?

This seems useful to me, but I still have to think about how to implement it.

#14 Updated by susmit shannigrahi 12 months ago

Klaus Schneider wrote:
Here is an update on the adaptation for high capacity networks.

@Susmit: Thanks for the pointer to HighSpeed TCP. I think there are 4 popular TCP algorithms that we could use for the consumer adaptation (ordered by their complexity, as estimated by lines of source code in the Linux Kernel Implementation - https://github.com/torvalds/linux/tree/master/net/ipv4):

  1. Scalable TCP (49 sloc)
  2. High Speed TCP (165 sloc)
  3. TCP BIC (203 sloc)
  4. TCP CUBIC (449 sloc)

Any preferences?

I don't have any preferences. Whatever works well.

The implementation of these algorithms should be straightforward when accounting for NDN's use of selective acknowledgements (each Interest is acknowledged by its corresponding Data packet). We can use the algorithm described in RFCs 3517 and 6675 (https://datatracker.ietf.org/doc/rfc3517/ https://datatracker.ietf.org/doc/rfc6675/) to only consider one loss event per RTT, even if a burst of marked packets is received.

What are the trade-offs if we used NACK?

#15 Updated by susmit shannigrahi 12 months ago

Klaus Schneider wrote:
Here is another question. Let's say Router A has 3 outgoing interfaces to R1 (10 ms delay), R2 (50 ms), and R3 (100 ms) -- there might be multiple hops between these routers and the final producer.

The current design starts with the shortest path (R1), but eventually reaches an even split (33%/33%/33%) on all paths, regardless of the delay.

  • Should we prioritize a path with a lower delay, even if this leads to a loss in total throughput?

I think this decision depends on the application and it might be useful to introduce a header bit that says "only use the shortest path" for certain delay-sensitive traffic.

Yes, this is application specific, but do you need a header bit or can you leave it to strategy?

  • Should the strategy shift traffic back to the faster paths once the bandwidth is no longer fully used?

Do you intend to keep a running tab on per-face bandwidth?

#16 Updated by Klaus Schneider 12 months ago

susmit shannigrahi wrote:

Klaus Schneider wrote:
Here is an update on the adaptation for high capacity networks.

@Susmit: Thanks for the pointer to HighSpeed TCP. I think there are 4 popular TCP algorithms that we could use for the consumer adaptation (ordered by their complexity, as estimated by lines of source code in the Linux Kernel Implementation - https://github.com/torvalds/linux/tree/master/net/ipv4):

  1. Scalable TCP (49 sloc)
  2. High Speed TCP (165 sloc)
  3. TCP BIC (203 sloc)
  4. TCP CUBIC (449 sloc)

Any preferences?

I don't have any preferences. Whatever works well.

The implementation of these algorithms should be straightforward when accounting for NDN's use of selective acknowledgements (each Interest is acknowledged by its corresponding Data packet). We can use the algorithm described in RFCs 3517 and 6675 (https://datatracker.ietf.org/doc/rfc3517/ https://datatracker.ietf.org/doc/rfc6675/) to only consider one loss event per RTT, even if a burst of marked packets is received.

What are the trade-offs if we used NACK?

Are you talking about the trade-offs of using the mentioned algorithms? (I'm not sure what you mean by NACK.)

I would say: Implementation complexity vs. performance.

CUBIC is now the default in the Linux kernel and would probably work best with our scheme. However, I've settled on TCP BIC for now, because it was much easier to implement and retains two key benefits of CUBIC:

  • It doesn't overshoot the optimal cwnd as much as AIMD slow start does.
  • After slow start has finished, it can grab available bandwidth much faster than a linear increase.

I've attached a comparison example. Look at second 20, when Consumer1 finishes (pages 4 and 5). Even STCP and HSTCP would only do a linear increase at that point; BIC/CUBIC start another "slow start"-like phase (binary search).
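For reference, BIC's binary-search increase can be sketched as follows. This is a simplified model with illustrative constants; real TCP BIC's probing phase beyond w_max is more elaborate:

```python
class BicWindow:
    """Simplified TCP BIC increase: binary search toward w_max (the
    window size at the last congestion event), then cautious probing
    past it. Constants are illustrative, not BIC's exact defaults."""

    S_MAX = 32.0  # cap on the per-RTT increase
    S_MIN = 0.01  # minimum per-RTT increase

    def __init__(self, cwnd):
        self.cwnd = cwnd
        self.w_max = cwnd

    def on_congestion(self):
        # Remember where congestion occurred, then back off.
        self.w_max = self.cwnd
        self.cwnd /= 2.0
        return self.cwnd

    def on_rtt_elapsed(self):
        if self.cwnd < self.w_max:
            # Binary search: jump halfway to w_max, capped by S_MAX.
            step = min((self.w_max - self.cwnd) / 2.0, self.S_MAX)
            step = max(step, self.S_MIN)
        else:
            # Past the old maximum: probe slowly for new bandwidth.
            step = self.S_MIN
        self.cwnd += step
        return self.cwnd
```

After a reduction from 100 to 50, the window jumps 50 → 75 → 87.5 → ..., converging quickly toward the previous maximum instead of increasing linearly.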

#17 Updated by Klaus Schneider 12 months ago

susmit shannigrahi wrote:

Klaus Schneider wrote:
Here is another question. Let's say Router A has 3 outgoing interfaces to R1 (10 ms delay), R2 (50 ms), and R3 (100 ms) -- there might be multiple hops between these routers and the final producer.

The current design starts with the shortest path (R1), but eventually reaches an even split (33%/33%/33%) on all paths, regardless of the delay.

  • Should we prioritize a path with a lower delay, even if this leads to a loss in total throughput?

I think this decision depends on the application and it might be useful to introduce a header bit that says "only use the shortest path" for certain delay-sensitive traffic.

Yes, this is application specific, but do you need a header bit or can you leave it to strategy?

How should the strategy know what the application wants?

An alternative would be to use a different pre-defined name prefix for all applications that want "low delay" vs. "more bandwidth". In any case, some agreement between strategy layer and applications is necessary.

  • Should the strategy shift traffic back to the faster paths once the bandwidth is no longer fully used?

Do you intend to keep a running tab on per-face bandwidth?

I'm not exactly sure what you mean. I'm not thinking about explicit bandwidth measurements. More something like: "If there was no congestion for some time, try to shift the traffic to the lowest delay path and see what happens. Then when this path shows congestion again, shift some of it back."

#18 Updated by Davide Pesavento 12 months ago

Where is the congestion control algorithm going to be implemented? Is it supposed to be used by every consumer application? In that case, should it go in the library (as opposed to in the forwarder)?

#19 Updated by Klaus Schneider 12 months ago

It depends on which part of the scheme we are talking about.

The Congestion Detection (1) should run on each in-network router and likely needs access to some low-level Linux API: http://www.infradead.org/~tgr/libnl/

Congestion Signaling (2) and Multipath Forwarding Strategy (4) need to be implemented at the strategy layer.

The Consumer Reaction (3) should indeed be used by each consumer and some library implementation would be great. Do you know where the best place for this would be?

#20 Updated by Davide Pesavento 12 months ago

I think I was referring to (1) and (3).

I agree that (3) should go in the library/-ies (ndn-cxx, ndn-cpp, etc...). Although it means it has to be reimplemented for each client library, it's still way better than leaving it to each app developer.

I'm still a bit confused on (1). What is an "in-network router"? If it means every node that runs NFD, it's basically equivalent to saying every node in the network, since almost every node will run NFD (I don't have a problem with this, just wanted to clarify). If, on the other hand, you mean that it's a software component (piece of code) of the forwarder, then ok it makes sense.

I also have a few more questions:

  1. On what queues do you intend to measure the "packet sojourn time", as defined by CoDel, and how?

  2. What's the rationale behind not using CoDel's control loop?

  3. The proposed alternative to CoDel's control loop seems to be tagging Data packets with the max queue size encountered on the return path and letting strategy and consumer react on that tag. What do we do if an Interest packet needs to be dropped (according to CoDel's estimator+setpoint)? Can we immediately send back a Nack instead of silently dropping?

#21 Updated by Klaus Schneider 12 months ago

Thanks for your comments.

Davide Pesavento wrote:
I think I was referring to (1) and (3).

I agree that (3) should go in the library/-ies (ndn-cxx, ndn-cpp, etc...). Although it means it has to be reimplemented for each client library, it's still way better than leaving it to each app developer.

I'm still a bit confused on (1). What is an "in-network router"? If it means every node that runs NFD, it's basically equivalent to saying every node in the network, since almost every node will run NFD (I don't have a problem with this, just wanted to clarify). If, on the other hand, you mean that it's a software component (piece of code) of the forwarder, then ok it makes sense.

Yeah, basically every node that runs NFD. Sorry for the confusion.

I also have a few more questions:

  1. On what queues do you intend to measure the "packet sojourn time", as defined by CoDel, and how?

Mostly on the actual link layer queues. In Linux this can be done with the libnl API mentioned above. You can see these by running:
tc -s qdisc

Moreover, as Beichuan pointed out in our discussion, it might be necessary to run CoDel on the incoming buffer inside NFD whenever NFD is not fast enough to process packets at line speed.

  2. What's the rationale behind not using CoDel's control loop?

First, we want to avoid dropping packets, because explicit feedback doesn't require retransmissions. Second, instead of using CoDel's ECN marking (in place of drops), we want more fine-grained congestion feedback. Thus, we put the time that the sojourn time has remained above the target into each packet whenever it is above the limit. This multi-bit feedback allows the forwarding strategy to adapt much faster than with CoDel's original feedback.
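A minimal sketch of this per-packet marking rule (illustrative Python, not NFD code; the 5 ms target follows CoDel's default):

```python
def congestion_mark(sojourn_ms, over_since_ms, now_ms, target_ms=5.0):
    """Tag a packet with how long the queue's sojourn time has stayed
    above the target, instead of CoDel's spaced-out drops/ECN marks.
    Returns (mark_ms, new_over_since_ms); a mark of 0 means no
    congestion. Names and units are illustrative only."""
    if sojourn_ms <= target_ms:
        return 0.0, None            # below target: reset state, no mark
    if over_since_ms is None:
        over_since_ms = now_ms      # first packet above the target
    return now_ms - over_since_ms, over_since_ms
```

Unlike CoDel, which emits at most one drop/mark per control interval, this tags every packet while the queue stays above the target, so downstream strategies receive a continuous congestion signal.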

  3. The proposed alternative to CoDel's control loop seems to be tagging Data packets with the max queue size encountered on the return path and letting strategy and consumer react on that tag. What do we do if an Interest packet needs to be dropped (according to CoDel's estimator+setpoint)? Can we immediately send back a Nack instead of silently dropping?

Packets only need to be dropped when the queue buffer overflows. This shouldn't happen in normal operation, as the consumers should react to the marked packets. Buffers should be sized large enough to catch any short-term traffic bursts.

When some consumers don't reduce their window on receiving marks (like a UDP flow), the buffer might still overflow. In this worst case, you might send a NACK back, but what for? If the consumer doesn't react to marks, it probably also won't react to NACKs. Moreover, NACKs are "wasting" the bandwidth of the already congested link. So you might as well just drop the packets.

I hope this makes sense. Let me know if you still have questions.

#22 Updated by Davide Pesavento 12 months ago

Klaus Schneider wrote:

Mostly on the actual link layer queues. In Linux this can be done with the libnl API mentioned above.

Hmm ok, but those queues are under OS control, so they will have a qdisc already, which might be dropping packets... can you still reliably obtain the sojourn time in that case?

Moreover, as Beichuan pointed out in our discussion, it might be necessary to run CoDel on the incoming buffer inside NFD whenever NFD is not fast enough to process packets at line speed.

Agreed, but note that NFD does not have an "incoming buffer" at the moment :)

  2. What's the rationale behind not using CoDel's control loop?

First, we want to avoid dropping packets, because explicit feedback doesn't require retransmissions. Second, instead of using CoDel's ECN marking (in place of drops), we want more fine-grained congestion feedback. Thus, we put the time that the sojourn time has remained above the target into each packet whenever it is above the limit. This multi-bit feedback allows the forwarding strategy to adapt much faster than with CoDel's original feedback.

Got it. I asked because in the attached pdf the tagQueueSize is only ever compared against zero, i.e. you're using it as a "binary" value (congestion/no-congestion).

  3. The proposed alternative to CoDel's control loop seems to be tagging Data packets with the max queue size encountered on the return path and letting strategy and consumer react on that tag. What do we do if an Interest packet needs to be dropped (according to CoDel's estimator+setpoint)? Can we immediately send back a Nack instead of silently dropping?

Packets only need to be dropped when the queue buffer overflows. This shouldn't happen in normal operation, as the consumers should react to the marked packets. Buffers should be sized large enough to catch any short-term traffic bursts.

Oh I see the reasoning. You're basically willing to wait one RTT (or rather, the remaining part of the RTT) since the congestion event to give the consumer the chance to slow down when it eventually receives a marked Data packet. What I'm asking is: if the congestion is on the "Interest path", would it be better (or worse) to immediately send back a Nack to hopefully trigger a quicker window size reduction at the consumer? (consider for example the case of a consumer app that is overloading its next-hop NFD with too many interests)

I hope this makes sense. Let me know if you still have questions.

Yes, thanks for the answers.

#23 Updated by Klaus Schneider 12 months ago

Davide Pesavento wrote:
Klaus Schneider wrote:

Mostly on the actual link layer queues. In Linux this can be done with the libnl API mentioned above.

Hmm ok, but those queues are under OS control, so they will have a qdisc already, which might be dropping packets... can you still reliably obtain the sojourn time in that case?

I think so. Either you enable CoDel + ECN (which doesn't drop packets), or you use a regular FIFO queue and implement a modified CoDel on top of it.

Moreover, as Beichuan pointed out in our discussion, it might be necessary to run CoDel on the incoming buffer inside NFD whenever NFD is not fast enough to process packets at line speed.

Agreed, but note that NFD does not have an "incoming buffer" at the moment :)

As long as NFD runs at line speed, the incoming buffer is unnecessary. However, it might be useful as a workaround when NFD processing is too slow.

  2. What's the rationale behind not using CoDel's control loop?

First, we want to avoid dropping packets, because explicit feedback doesn't require retransmissions. Second, instead of using CoDel's ECN marking (in place of drops), we want more fine-grained congestion feedback. Thus, we put the time that the sojourn time has remained above the target into each packet whenever it is above the limit. This multi-bit feedback allows the forwarding strategy to adapt much faster than with CoDel's original feedback.

Got it. I asked because in the attached pdf the tagQueueSize is only ever compared against zero, i.e. you're using it as a "binary" value (congestion/no-congestion).

Right, but the tagQueueSize is still put into every packet as long as the queue is above the limit. CoDel, on the other hand, spaces out the markings/drops (maybe once every couple hundred ms) in order to achieve the linear decrease of TCP senders. Marking every packet gives routers more feedback to work with, even if each mark is just binary.

  3. The proposed alternative to CoDel's control loop seems to be tagging Data packets with the max queue size encountered on the return path and letting strategy and consumer react on that tag. What do we do if an Interest packet needs to be dropped (according to CoDel's estimator+setpoint)? Can we immediately send back a Nack instead of silently dropping?

Packets only need to be dropped when the queue buffer overflows. This shouldn't happen in normal operation, as the consumers should react to the marked packets. Buffers should be sized large enough to catch any short-term traffic bursts.

Oh I see the reasoning. You're basically willing to wait one RTT (or rather, the remaining part of the RTT) since the congestion event to give the consumer the chance to slow down when it eventually receives a marked Data packet.

Yes, we are willing to wait one RTT in order to avoid the complexity that a NACK, followed by a retransmission, would create. Here are some questions that follow from Cheng Yi et al.'s NACK-based design (from the "adaptive forwarding" paper):

  • Who should retransmit: in-network routers or only the consumer?
  • If it's the consumer, then each NACK requires an additional RTT (or a large part of one) for the retransmission.
  • If routers retransmit: how often should they try? How much delay is acceptable for the application? (Answer: you don't know without some mechanism for the application to tell you.)
  • NACKs are likely to occur in bursts, which 1) wastes a lot of bandwidth, 2) requires a lot of processing at routers, and 3) doesn't give you much more information about the congestion status than a single NACK or marking.

Markings also occur in bursts, but come with much less overhead: only one additional byte in the packet header (for the "time above limit" in ms), or maybe just one bit (congested/not congested). Also, the consumer still gets the Data packet, which avoids retransmissions and the issues above.

What I'm asking is: if the congestion is on the "Interest path", would it be better (or worse) to immediately send back a Nack to hopefully trigger a quicker window size reduction at the consumer?

Congestion caused by interests is an interesting problem. Some questions:

  1. How large is the typical size ratio between Interest and Data packets? I would guess that Interests are maybe 50 bytes (?), while Data packets can be many kilobytes (or even megabytes?). Thus, I would expect that Data packets contribute much more to congestion.

  2. In some cases (like IoT), Interests might be larger than Data packets. This actually makes the congestion reaction at the forwarding strategy much easier: the strategy can just look at its own outgoing queues to adjust the forwarding ratio. When Data packets are larger, it has to wait for the markings to indicate the congestion status of the reverse path.

Still, this would require considering the upstream queue at each router, which I discussed with Beichuan but which is not yet in this design document. We still have to figure out how to weigh the importance of the upstream queue (indicating congestion caused by Interests) against the markings (which indicate the congestion status of the downstream links at the upstream routers, i.e., the path of the Data packets).

(consider for example the case of a consumer app that is overloading its next-hop NFD with too many interests)

With "overloading", do you mean 1) too many Interests for NFD's processing to keep up with, or 2) overloading the link with a rate that is too high? 1) would require some consideration, but only happens when NFD is too slow. For 2):

As long as the consumer app follows the protocol of starting with a congestion window of 1 and using some "slow start" mechanism, this case should never happen; the consumer waits for returning data before increasing its window.
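The consumer protocol mentioned here (start at a window of 1, grow via slow start, only on returning Data) is essentially TCP-style window growth; a toy sketch of the per-Data update, with an assumed ssthresh parameter:

```python
def next_cwnd(cwnd, ssthresh):
    """Window growth per received Data packet: slow start (+1 per
    Data, i.e. doubling per RTT) below ssthresh, congestion
    avoidance (+1/cwnd per Data, i.e. +1 per RTT) above it.
    A toy model, not the NFD/ndn-cxx API."""
    if cwnd < ssthresh:
        return cwnd + 1.0
    return cwnd + 1.0 / cwnd
```

Since the window only grows on returning Data, a well-behaved consumer cannot sustain a rate above what the path delivers back to it.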

If the consumer doesn't adhere to the specification, there is also no guarantee that it will react to NACKs instead of just keeping up its sending rate. We decided to leave these "DoS attack" scenarios to future work. They would probably require some fair queuing at in-network routers, which leads to a whole lot of other problems.

I hope this makes sense. Let me know if you still have questions.

Yes, thanks for the answers.

#24 Updated by Klaus Schneider 7 months ago

The updated design is part of our paper at the ICN'16: https://named-data.net/wp-content/uploads/2015/01/practical_congestion_control_scheme.pdf

We can continue the discussion here on how to implement the parts in NFD. Beichuan's group is currently working on this, and I'll post an update soon.

#25 Updated by Junxiao Shi 6 months ago

  • Related to Feature #3797: Congestion Control: generic congestion marks added

#26 Updated by Klaus Schneider 5 months ago

  • Related to Feature #3823: Congestion Control: design Local Link Loss Detection added
