In order to solve the flow control and congestion control problems, the Transport layer has to be able to identify that a problem exists. Since flow control is strictly a transport layer issue (for transport layer protocols), it can be handled by adding fields to the protocol header that let each endpoint know the conditions at the other end. Of course, the more information you provide, the more complicated the protocol becomes, so you want to limit it as much as possible.
Congestion control is actually a Network layer problem, but it needs some help from the Transport layer and the Transport layer protocol has to have some clues that the problem exists. You could create an interaction between the layers, but that is a complication that makes it difficult to modularize the layers and allow them to work with other protocols, so a better scheme is to use information that is already available. For TCP, that would be packets timing out because they can't get through the subnet in a reasonable time.
The sender limits itself to an Effective Window Size (EWS), which it derives from the Advertised Window Size (AWS) reported by the receiver:
EWS = AWS - (LBS - LAR)
where LBS is Last Byte Sent and LAR is Last ACK Received for the sliding window protocol. The sender never sends more than EWS bytes, and the receiver acknowledges every arriving packet with the Next Byte Expected (NBE) and the Advertised Window Size. The AWS decreases if the receiver builds up a backlog of unprocessed data due to a slow application, but it can also change if the receiver finds itself too busy to process data or is running out of memory resources.
For example, the following sequence shows what might happen if you had a very slow receiver in comparison to the sender:
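| Event | Sender AWS | Sender LBS | Sender LAR | Sender EWS | Receiver NBE | Receiver LAS | Receiver AWS |
|---|---|---|---|---|---|---|---|
| Start | 10,000 | 0 | -1 | 10,000 | 0 | -1 | 10,000 |
| Send 0-999 | 10,000 | 999 | -1 | 9,000 | 1,000 | 999 | 9,000 |
| Send 1000-2999 | 9,000 | 2,999 | 999 | 7,000 | 3,000 | 2,999 | 7,000 |
| Send 3000-5999 | 7,000 | 5,999 | 2,999 | 4,000 | 6,000 | 5,999 | 4,000 |
| Send 6000-9999 | 4,000 | 9,999 | 5,999 | 0 | 10,000 | 9,999 | 0 |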
As you can see, the receiver is acknowledging the receipt of data, but is not able to process it and remove it from the buffer, so the AWS shrinks until it eventually becomes zero and all data transfers halt.
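The shrinking window can be reproduced with a short simulation. This is a minimal sketch in Python, assuming the effective window is computed as EWS = AWS - (LBS - LAR), which matches the rows of the table in which data has been sent. The function name and the trace format are hypothetical, and the receiver is assumed to acknowledge each segment before the next one is sent while its application reads nothing.

```python
def simulate_slow_receiver(buffer_size, segment_sizes):
    """Trace sender state when the receiving application never drains its buffer.

    Returns one (aws, lbs, lar, ews) tuple per segment, recorded just after
    the segment is sent and before its acknowledgement arrives.
    """
    aws = buffer_size   # advertised window as last reported to the sender
    lbs = -1            # last byte sent (nothing sent yet)
    lar = -1            # last ACK received (nothing acknowledged yet)
    buffered = 0        # bytes sitting unread in the receiver's buffer
    trace = []
    for size in segment_sizes:
        lbs += size
        ews = aws - (lbs - lar)          # assumed effective-window formula
        trace.append((aws, lbs, lar, ews))
        # The receiver stores the data, then acknowledges it and advertises
        # whatever buffer space is left.
        buffered += size
        aws = buffer_size - buffered
        lar = lbs
    return trace


for row in simulate_slow_receiver(10_000, [1000, 2000, 3000, 4000]):
    print(row)
```

With a 10,000-byte buffer and segments of 1000, 2000, 3000 and 4000 bytes, the last row has an EWS of 0, which is the point at which the sender has to stop.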
The basic method uses a smoothing function:
ERTT = a * ERTT + (1 - a) * SRTT
This recursive definition bases the new Estimated Round Trip Time (ERTT) on the previous ERTT and the latest actual value, the Sample Round Trip Time (SRTT). a is a constant in the interval [0,1] that specifies the weight to be placed on each value. A large a places more weight on the previous ERTT and therefore smooths the value more, while a small a is more responsive to changes in the actual round trip times that occur.
For example, if a is 0.6 and we set the initial ERTT to 50 ms, then a sample round trip time of 75 ms would result in an updated ERTT of:
ERTT = 0.6 * 50 + 0.4 * 75 = 60 ms
If the next round trip time was 40 ms, then:
ERTT = 0.6 * 60 + 0.4 * 40 = 52 ms
Typical values of a are 0.7 to 1.0.
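The smoothing update is easy to experiment with in code. The following is a minimal sketch in Python, assuming the update has the form ERTT = a * ERTT + (1 - a) * SRTT, as the description of the weights implies; the function name update_ertt is hypothetical.

```python
def update_ertt(ertt, srtt, a):
    """Return the new estimate: a weights the old estimate, (1 - a) the new sample."""
    return a * ertt + (1 - a) * srtt


ertt = 50.0                      # initial estimate in ms
for sample in (75.0, 40.0):      # measured round trip times in ms
    ertt = update_ertt(ertt, sample, a=0.6)
    print(f"sample {sample:.0f} ms -> ERTT {ertt:.1f} ms")
```

Rerunning the loop with a larger a, such as 0.9, shows the estimate moving much more slowly toward each new sample.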
If this is the round trip time you expect, what value do you choose for the timeout? Typically, a value of 2 times the ERTT is used. This is conservative, effectively saying that a delay in resending is not as bad as resending an unnecessary duplicate.
where typical values are d = 0.5, u = 1 and f = 4. For example, assume that ERTT is 100 and the current deviation is 50, all times in ms:
Most TCP implementations use some form of the Jacobson/Karels algorithm because it reacts well to changing conditions in the network, but these are simply implementation choices and are not a defined part of the TCP standard.
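As a rough illustration of how a deviation-based timeout behaves, here is a minimal sketch in Python. It assumes the commonly published form of the Jacobson/Karels computation, in which d weights the new sample, u scales the estimate and f scales the deviation; that form is an assumption and may differ in detail from the version intended here. The function name is hypothetical, and the 120 ms sample is a made-up input.

```python
def jacobson_karels(ertt, deviation, srtt, d=0.5, u=1.0, f=4.0):
    """Return (new_ertt, new_deviation, timeout) after one RTT sample.

    Assumed form of the update (an assumption, not taken from these notes):
        difference = srtt - ertt
        ertt       = ertt + d * difference
        deviation  = deviation + d * (|difference| - deviation)
        timeout    = u * ertt + f * deviation
    """
    difference = srtt - ertt
    ertt = ertt + d * difference
    deviation = deviation + d * (abs(difference) - deviation)
    timeout = u * ertt + f * deviation
    return ertt, deviation, timeout


# ERTT = 100 ms and deviation = 50 ms are the starting values used in the text;
# the 120 ms sample is hypothetical.
new_ertt, new_deviation, timeout = jacobson_karels(100.0, 50.0, 120.0)
print(new_ertt, new_deviation, timeout)
```

Under these assumptions the timeout tracks both the average and the variability of the round trip time, so a steady connection ends up with a timeout close to the ERTT while a jittery one gets a much larger margin.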
The next problem is congestion control, where the difficulty is detecting that congestion exists. TCP could certainly take advantage of Source Quench or other ICMP messages, but it prefers to rely on data that is readily available in its own protocol: timeouts and increasing round trip times, which are good indicators of increasing load in the network. When these things happen, TCP senders attempt to slow down their rate of sending and reduce the number of duplicates that they are sending.
The one measure that TCP senders have to control their data rate is the window size. If it shrinks, it forces the sender to wait more frequently for acknowledgements before continuing. The Effective Window Size measures the amount of data that can be allowed, and an additional parameter, the Congestion Window Size (CWS), is used to reflect the desirability of controlling congestion. In general, the sender will only send up to:
minimum {EWS, CWS}
So the CWS can be used to slow down the sender if it is smaller than the EWS. If congestion is a problem (as indicated by timeouts and increasing RTTs), decrease the CWS, and if not, increase the CWS.
The algorithm for window management is based on the number of segments that can be sent, which is based on the current segment size. For example, if the window size is minimum {30,000, 20,000} = 20,000 bytes and the segment size is 500 bytes, then the window size is 20,000/500 = 40 segments.
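The conversion from bytes to segments can be written as a one-line helper. This is a minimal sketch in Python, assuming the usable window is the smaller of the EWS and the CWS and that a partial segment does not count; the function and parameter names are hypothetical.

```python
def window_in_segments(ews_bytes, cws_bytes, mss_bytes):
    """Return how many whole segments fit in the smaller of the two windows."""
    limit_bytes = min(ews_bytes, cws_bytes)
    return limit_bytes // mss_bytes      # integer division drops a partial segment


print(window_in_segments(30_000, 20_000, 500))
```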
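The algorithm is:
Each time a full window of segments is acknowledged: CWS = CWS + C
Each time a timeout occurs: CWS = CWS * M
A typical value for C is 1 MSS, and a typical value for M is 0.5. The following graph shows the changes in the congestion window for this situation.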
Remember that the increase is based on an acknowledgement for a group of segments representing an entire congestion window size. This would only be an issue for connections moving large amounts of data, but connections moving small amounts of data won't create congestion problems. In a typical instance, the entire group of segments will be sent as a group or flight, and the acknowledgement will return for the entire flight.
Typically, it would be a lot of effort to keep track of the outstanding segments and which are part of a particular congestion window-sized group, so the additive increase is accomplished by adding 1/CWS * MSS, where the CWS is in segments, each time an acknowledgement arrives. CWS acknowledgements will result in an increase of one MSS.
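The per-acknowledgement increment can be checked numerically. The following is a minimal sketch in Python, assuming one acknowledgement arrives per segment in the flight; the function name and the recompute_each_ack flag are hypothetical and exist only to compare two ways of applying the increment.

```python
def grow_for_one_window(cws_segments, mss, recompute_each_ack=False):
    """Apply the per-ACK additive increase for one window's worth of ACKs.

    Returns the new congestion window size in bytes.
    """
    cws_bytes = float(cws_segments * mss)
    increment = mss / cws_segments            # MSS / CWS, with CWS in segments
    for _ in range(cws_segments):             # one ACK per segment in the flight
        if recompute_each_ack:
            # Use the window as it stands now rather than at the start of the flight.
            increment = mss / (cws_bytes / mss)
        cws_bytes += increment
    return cws_bytes


print(grow_for_one_window(20, 1000))
print(grow_for_one_window(20, 1000, recompute_each_ack=True))
```

When the increment is fixed at the start of the flight, 20 acknowledgements of 50 bytes each add exactly one MSS. If the increment is recomputed from the growing window on every acknowledgement, the total comes out slightly under one MSS, which is close enough for the purpose.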
When other timeouts occur there are two possibilities.
A coarse-grained timeout means that the entire window has been sent, so that the sender's window is full and data transfer is held up waiting for a single acknowledgement at the beginning of the window. If that ACK arrives and the sender responds by sending an entire window, it could make congestion problems in the network worse. So this type of timeout is handled by performing a complete restart at 1 MSS for the CWS and then entering slow start again. Note that this also resets the threshold value to half the current CWS.
A fine-grained timeout is handled by staying in Additive Increase Multiplicative Decrease congestion avoidance mode and dividing the CWS by 2 (typically).
One way to view this process is that, given the CWS in segments, every acknowledgement in slow start mode adds 1 MSS to the CWS, while every acknowledgement in congestion avoidance mode increases the CWS by 1/CWS times MSS bytes. For example, given an MSS of 1000 bytes, if the current CWS is 20 segments and an acknowledgement arrives, the new size is 21 segments in slow start mode and 20 segments + 1000/20 = 20 segments + 50 bytes in congestion avoidance mode. Any timeout where there are no incoming acknowledgements or outgoing packets for a period of time causes slow start to be entered: the threshold is set to one-half the CWS and the CWS is set to 1 MSS. Any other timeout results in the CWS being halved, with slow start used up to the threshold value before congestion avoidance begins. In the latter case, if the new CWS is above the current threshold, slow start is not used.
In an effort to bring all of these ideas together, here is the algorithm for a TCP sender using additive increase, multiplicative decrease and slow start.
Note that the threshold is never allowed to drop below 2 times MSS. Also, many of these parameters are adjustable depending on the desired properties of your TCP implementation. While the interaction between endpoints is carefully defined by TCP, the operation of the endpoints is largely a matter of individual choice. However, the methods described here have been found to work quite well.
ACK Received:
    if mode == slowstart
        CWS = CWS + MSS
        if CWS > threshold
            mode = aimd
    else
        CWS = CWS + MSS * MSS / CWS    // CWS in bytes: one MSS per window of ACKs
Timeout Event:
    threshold = max {CWS/2, 2*MSS}
    if no messages in transit or mode == slowstart    // dead channel: coarse-grained timeout
        CWS = MSS
        mode = slowstart
    else    // fine-grained timeout
        CWS = CWS / 2
        if CWS < threshold
            mode = slowstart
        else
            mode = aimd
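The same rules can be expressed as a small state machine. This is a minimal sketch in Python, not a real TCP implementation: it assumes all sizes are kept in bytes, that the caller decides whether a timeout is a dead-channel (coarse-grained) timeout, and that the connection starts in slow start with a CWS of 1 MSS. The class name, method names and the initial threshold are hypothetical.

```python
class TcpSender:
    """Toy model of the window rules; every size is in bytes."""

    def __init__(self, mss, threshold):
        self.mss = mss
        self.cws = mss                 # assumed starting point: one segment
        self.threshold = threshold
        self.mode = "slowstart"

    def on_ack(self):
        if self.mode == "slowstart":
            self.cws += self.mss       # one MSS per acknowledgement
            if self.cws > self.threshold:
                self.mode = "aimd"
        else:
            # Additive increase: about one MSS per window's worth of ACKs.
            self.cws += self.mss * self.mss / self.cws

    def on_timeout(self, dead_channel):
        # The threshold never drops below 2 * MSS.
        self.threshold = max(self.cws / 2, 2 * self.mss)
        if dead_channel or self.mode == "slowstart":
            self.cws = self.mss        # complete restart
            self.mode = "slowstart"
        else:
            self.cws = self.cws / 2    # multiplicative decrease
            self.mode = "slowstart" if self.cws < self.threshold else "aimd"


sender = TcpSender(mss=1000, threshold=8000)
for _ in range(8):
    sender.on_ack()
print(sender.mode, sender.cws)
sender.on_timeout(dead_channel=False)
print(sender.mode, sender.cws, sender.threshold)
```

In this run the sender leaves slow start once the CWS passes the 8000-byte threshold, and the fine-grained timeout halves the window while leaving the sender in additive increase mode.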
An example of all of this is: