1. Kernel Data Structures for Networking
The following structures are vital components of networking in the Linux kernel:
socket - The basis for the implementation of the BSD socket interface. The socket() system call sets up and initializes this structure.
sk_buff - Represents an individual communications packet arriving at or leaving the host. The structure acts as a buffer that holds a packet until it is either handed to the network interface for transmission or passed up to the higher layers for processing, eventually reaching the application layer.
INET - Administers the network-specific parts of sockets, e.g. for TCP, UDP and RAW sockets.
proto - Holds the operations of a particular protocol (e.g. tcp_prot, udp_prot); the socket layer calls through this structure so that its own code can remain protocol independent.
sockaddr - Supports the different address formats of the different address families (e.g. PF_UNIX, PF_INET).
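As a quick illustration of how these pieces meet user space, here is a minimal, hedged example using the standard BSD socket API; the address 192.0.2.1 and port 80 are arbitrary placeholders:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    /* socket() makes the kernel allocate and initialize the
     * socket/sock structures for the PF_INET family. */
    int fd = socket(PF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    /* sockaddr_in is the PF_INET-specific address format, passed to
     * the kernel through the generic struct sockaddr. */
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(80);                        /* example port */
    inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr);  /* example address */

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        perror("connect");

    close(fd);
    return 0;
}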
The TCP header can be viewed conceptually as in the figure below:
.-------------------------------+-------------------------------.
|          Source Port          |        Destination Port       |
|-------------------------------+-------------------------------|
|                        Sequence Number                        |
|---------------------------------------------------------------|
|                     Acknowledgment Number                     |
|-------+-----------+-+-+-+-+-+-+-------------------------------|
|  Data |           |U|A|P|R|S|F|                               |
| Offset|  Reserved |R|C|S|S|Y|I|             Window            |
|       |           |G|K|H|T|N|N|                               |
|-------+-----------+-+-+-+-+-+-+-------------------------------|
|            Checksum           |         Urgent Pointer        |
`---------------------------------------------------------------'
Some important fields in struct sock are:
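(The listing below is a condensed, hedged sketch rather than the full definition; field names follow the 2.4-era include/net/sock.h, trimmed to the members this article refers to.)

struct sock {
    ...
    struct dst_entry    *dst_cache;     /* cached routing entry (section 4) */
    int                  rcvbuf;        /* size limit of the receive buffer */
    int                  sndbuf;        /* size limit of the send buffer */
    atomic_t             rmem_alloc;    /* bytes committed to the receive queue */
    atomic_t             wmem_alloc;    /* bytes committed to the send queue */
    struct sk_buff_head  receive_queue; /* incoming packets */
    struct sk_buff_head  write_queue;   /* packets queued for sending */
    ...
};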
The actual definition of the sk_buff structure can be found in include/linux/skbuff.h.
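A condensed, hedged sketch of its 2.4-era layout, trimmed to the members used in this article:

struct sk_buff {
    struct sk_buff      *next;   /* next buffer in the list */
    struct sk_buff      *prev;   /* previous buffer in the list */
    struct sk_buff_head *list;   /* list this buffer is on */
    struct sock         *sk;     /* socket that owns this buffer */
    struct net_device   *dev;    /* device we arrived on / leave by */
    unsigned int         len;    /* length of the actual data */
    unsigned char       *head;   /* start of the allocated buffer */
    unsigned char       *data;   /* start of the valid data */
    unsigned char       *tail;   /* end of the valid data */
    unsigned char       *end;    /* end of the allocated buffer */
};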
2. Memory Allocation
The Linux kernel allocates memory in two ways, as sketched below:
1) contiguous memory blocks, using kmalloc();
2) objects from a SLAB cache.
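The following contrasts the two interfaces. This is a hedged sketch: the cache name "my_cache" and struct my_obj are made up, and the six-argument kmem_cache_create() shown is the 2.4-era signature.

#include <linux/slab.h>
#include <linux/errno.h>

struct my_obj { int id; };          /* hypothetical object type */
static kmem_cache_t *my_cachep;     /* hypothetical dedicated cache */

int my_alloc_demo(void)
{
    /* 1) A contiguous block from the general-purpose allocator. */
    void *block = kmalloc(4096, GFP_KERNEL);
    if (block == NULL)
        return -ENOMEM;
    kfree(block);

    /* 2) Fixed-size objects from a dedicated SLAB cache; the
     *    2.4-era kmem_cache_create() takes ctor/dtor pointers. */
    my_cachep = kmem_cache_create("my_cache", sizeof(struct my_obj),
                                  0, SLAB_HWCACHE_ALIGN, NULL, NULL);
    if (my_cachep != NULL) {
        struct my_obj *obj = kmem_cache_alloc(my_cachep, GFP_KERNEL);
        if (obj != NULL)
            kmem_cache_free(my_cachep, obj);
        kmem_cache_destroy(my_cachep);
    }
    return 0;
}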
SLAB caches are used, for example, for inodes and buffer heads (created by a call to kmem_cache_create()), and the actual objects are allocated with kmem_cache_alloc(). The send and receive buffers of a socket are made up of a linked list of segments; each packet received from the network is placed in its own buffer, a struct sk_buff (include/linux/skbuff.h). The important elements of this structure are the list pointers (next, prev), the owning socket (sk) and the data pointers (head, data, tail and end); see the sk_buff sketch in section 1.
For its receive and send queues, each socket holds two list heads: sk->receive_queue and sk->write_queue, both of type struct sk_buff_head.
At its most basic level, a list of buffers is managed using functions like these:
static struct sk_buff_head my_list; /* initialized elsewhere with skb_queue_head_init(&my_list) */
static unsigned long my_dropped;    /* count of frames we had to drop */

void append_frame(char *buf, int len)
{
    struct sk_buff *skb = alloc_skb(len, GFP_ATOMIC);

    if (skb == NULL)
        my_dropped++;
    else {
        /* Reserve len bytes in the buffer and copy the frame in. */
        memcpy(skb_put(skb, len), buf, len);
        skb_queue_tail(&my_list, skb);
    }
}

void process_queue(void)
{
    struct sk_buff *skb;

    /* Pull buffers off the head of the queue until it is empty. */
    while ((skb = skb_dequeue(&my_list)) != NULL) {
        process_data(skb);
        kfree_skb(skb);
    }
}
Linux provides functions such as skb_queue_head(), skb_queue_tail(), skb_dequeue(), skb_insert(), skb_append() and skb_unlink() (all in include/linux/skbuff.h) for managing these lists.
3. Tweaking
For kernel 2.4.21 there are three sysctl variables that limit the amount of memory allocated to buffers: sysctl_tcp_mem[], sysctl_tcp_rmem[] and sysctl_tcp_wmem[], exposed to user space as net.ipv4.tcp_mem, net.ipv4.tcp_rmem and net.ipv4.tcp_wmem.
The initial values for these arrays are set in net/ipv4/tcp.c::tcp_init(), where

    initial order = log2[(number of physical pages) / 2^x]

with x = 11 when the number of pages is below 128K. For a machine with 256MB of physical memory and 4KB pages, the order is of value 2.
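To inspect the live values, one can simply read the corresponding /proc files; a minimal userspace sketch (the path is the standard procfs location for these sysctls):

#include <stdio.h>

int main(void)
{
    /* Each of these files holds three values:
     *   tcp_rmem / tcp_wmem: min, default, max buffer size (bytes)
     *   tcp_mem:             low, pressure, high limits (in pages) */
    FILE *f = fopen("/proc/sys/net/ipv4/tcp_rmem", "r");
    long min, def, max;

    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    if (fscanf(f, "%ld %ld %ld", &min, &def, &max) == 3)
        printf("tcp_rmem: min=%ld default=%ld max=%ld\n", min, def, max);
    fclose(f);
    return 0;
}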
Each socket stores the size of its receive and send buffers in sk->rcvbuf and sk->sndbuf. Alongside these, the sk->rmem_alloc and sk->wmem_alloc elements track the number of bytes in use in each queue. The buffer sizes are upper limits and do not correspond to the amount of memory actually allocated; memory is only allocated when storage for a segment is needed. When the socket is created, the function net/ipv4/tcp_ipv4.c::tcp_v4_init_sock() initializes the buffer sizes to sysctl_tcp_rmem[1] and sysctl_tcp_wmem[1]. These initial values are then adjusted as soon as the connection enters the established state.
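From user space these per-socket limits are visible through the SO_RCVBUF and SO_SNDBUF socket options. A small hedged example follows; note that the kernel clamps requested values against the sysctl maxima, so getsockopt() reports what the kernel actually chose:

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(PF_INET, SOCK_STREAM, 0);
    int size = 128 * 1024;          /* requested send buffer size */
    socklen_t len = sizeof(size);

    if (fd < 0) {
        perror("socket");
        return 1;
    }
    /* Ask for a bigger send buffer (backed by sk->sndbuf)... */
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size));
    /* ...and read back what the kernel actually granted. */
    getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, &len);
    printf("effective SO_SNDBUF: %d bytes\n", size);
    close(fd);
    return 0;
}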
During startup, the function net/ipv4/af_inet.c::inet_init() creates the TCP control socket, which serves a special purpose: sending an RST in response to a packet arriving for a non-existent socket. Its send and receive buffers are initialized to sysctl_rmem_default and sysctl_wmem_default by net/core/sock.c::sock_init_data(). These in turn are set from SK_RMEM_MAX and SK_WMEM_MAX, which are defined in include/linux/skbuff.h with the value 65535:
#define SK_WMEM_MAX 65535
#define SK_RMEM_MAX 65535
4. Socket Statistics
An entry in the routing table [struct rtentry in include/linux/route.h] contains various statistics that are cached between connections. These caches are stored in a dst_entry structure [include/net/dst.h]. When connect() is called on a socket, the function include/net/route.h::ip_route_connect() performs a routing table lookup and points sk->dst_cache at the routing table entry. When the socket is closed, the function net/ipv4/tcp_input.c::tcp_update_metrics() updates the destination cache. The advertised Maximum Segment Size (tp->advmss) is also cached, by net/ipv4/route.c::rt_set_nexthop(). The destination cache additionally stores the maximum window clamp [tp->window_clamp], which remains at zero after this point. It may be useful to note that destination caching can be disabled by turning off the DST_HOST flag in dst.h, and that the destination cache can be flushed by calling ip_route_flush_cache().
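A hedged sketch of the connect-time flow, loosely condensed from 2.4's net/ipv4/tcp_ipv4.c::tcp_v4_connect() (error handling and surrounding context are trimmed, so this is illustrative only):

/* sk is the connecting socket; nexthop is the destination address. */
struct rtable *rt;
int err;

/* Look up a route for the destination; on success rt points at the
 * routing cache entry, with its dst_entry embedded inside. */
err = ip_route_connect(&rt, nexthop, sk->saddr,
                       RT_CONN_FLAGS(sk), sk->bound_dev_if);
if (err < 0)
    return err;

/* Attach the cached route to the socket: later lookups, and the
 * metrics discussed above, come straight from sk->dst_cache. */
__sk_dst_set(sk, &rt->u.dst);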
5. PF_PACKET Sockets
The Linux kernel implements a general-purpose protocol family, called PF_PACKET, which allows the creation of a (raw) socket that receives packets directly from the NIC driver. All other protocol handling is thus skipped, and any packet can be received. These sockets let a programmer develop network monitoring applications, particularly when the network device is placed in promiscuous mode.
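A minimal hedged sketch of such a monitor, using the standard PF_PACKET API documented in packet(7); the interface name "eth0" is an assumption:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <net/if.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>

int main(void)
{
    /* A PF_PACKET socket delivers frames straight from the driver. */
    int fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    struct packet_mreq mr;
    unsigned char frame[2048];
    ssize_t n;

    if (fd < 0) {
        perror("socket (are you root?)");
        return 1;
    }

    /* Put the interface into promiscuous mode for monitoring. */
    memset(&mr, 0, sizeof(mr));
    mr.mr_ifindex = if_nametoindex("eth0");   /* assumed interface */
    mr.mr_type = PACKET_MR_PROMISC;
    setsockopt(fd, SOL_PACKET, PACKET_ADD_MEMBERSHIP, &mr, sizeof(mr));

    /* Each read returns one complete link-layer frame. */
    n = recv(fd, frame, sizeof(frame), 0);
    if (n > 0)
        printf("captured a %zd-byte frame\n", n);

    close(fd);
    return 0;
}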
PF_PACKET sockets differ from PF_INET sockets right from the socket() system call. The socket() call ends up in the sys_socket() function in net/socket.c, which uses sock_create() to determine the protocol family operations. This information comes from the net_families[] array, which is populated at kernel startup time by calls to sock_register(). The kernel can then call the create() function specific to the given family, which produces an appropriate sock structure and completes the creation.
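As a hedged illustration of that registration step, condensed from 2.4's net/packet/af_packet.c (the 2.4-era struct net_proto_family holds just the family number and the create() hook; modern designated-initializer syntax is used here):

#include <linux/net.h>
#include <linux/init.h>

/* The family's create() implementation; body omitted here. */
static int packet_create(struct socket *sock, int protocol);

static struct net_proto_family packet_family_ops = {
    .family = PF_PACKET,
    .create = packet_create,
};

static int __init packet_init(void)
{
    /* Insert ourselves into net_families[PF_PACKET], so that
     * sock_create() can dispatch socket(PF_PACKET, ...) to us. */
    sock_register(&packet_family_ops);
    return 0;
}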
More information about PF_PACKET can be found at this link.
6. Congestion Control Algorithms
Flow control algorithms limit the amount of transmitted traffic based on the estimated network capacity and utilization. TCP Vegas is a congestion control algorithm, implemented in Linux since v2.2, that reduces queuing and packet loss, and thus reduces latency and increases overall throughput, by matching the sending rate to the rate at which packets are actually being drained by the network. Like almost every TCP congestion control algorithm, Vegas is purely a sender-side mechanism: enabling it helps when sending a lot of data, but gives no benefit on the receiving side. Most of this topic requires a more detailed treatment of the packet flow and is beyond the scope of this project.
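For flavour, the core of the Vegas idea fits in a few lines. The following is a hedged sketch of the classical once-per-RTT update; the variable names and the alpha/beta thresholds follow the original Vegas paper, not the kernel source:

/* Classical Vegas once-per-RTT update (sketch): cwnd, alpha and beta
 * are in packets; base_rtt and rtt are in seconds. */
double vegas_update(double cwnd, double base_rtt, double rtt,
                    double alpha, double beta)
{
    double expected = cwnd / base_rtt;  /* rate we hoped for */
    double actual   = cwnd / rtt;       /* rate we actually got */
    double diff     = (expected - actual) * base_rtt; /* pkts queued */

    if (diff < alpha)
        return cwnd + 1;    /* little queuing: grow the window */
    if (diff > beta)
        return cwnd - 1;    /* queues building: shrink it */
    return cwnd;            /* in the sweet spot: hold steady */
}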