1. Kernel Data Structures for Networking
The following structures are vital components of networking in the Linux kernel:
socket - The basis for the implementation of the BSD socket interface. The socket() system call sets up and initializes this structure.
sk_buff - Represents an individual communications packet arriving at or leaving the host. The structure acts as a buffer that holds a packet until it is either handed to the network interface for transmission or passed up to the higher layers for processing, eventually reaching the application layer.
INET - Administers the network-specific parts of sockets, e.g. for TCP, UDP and RAW sockets.
proto - Holds the operations of a particular protocol (e.g. tcp_prot, udp_prot); the socket layer calls through this structure so that its own code can remain protocol independent.
sockaddr - Supports the different address formats of the different address families (e.g. PF_UNIX, PF_INET).
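As a quick illustration of how these pieces meet user space, here is a minimal, hedged example using the standard BSD socket API; the address 192.0.2.1 and port 80 are arbitrary placeholders:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    /* socket() makes the kernel allocate and initialize the
     * socket/sock structures for the PF_INET family. */
    int fd = socket(PF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    /* sockaddr_in is the PF_INET-specific address format, passed to
     * the kernel through the generic struct sockaddr. */
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(80);                        /* example port */
    inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr);  /* example address */

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        perror("connect");

    close(fd);
    return 0;
}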
The TCP header can be viewed conceptually as in the figure below:
.-------------------------------+-------------------------------.
|          Source Port          |        Destination Port       |
|-------------------------------+-------------------------------|
|                        Sequence Number                        |
|---------------------------------------------------------------|
|                     Acknowledgment Number                     |
|-------+-----------+-+-+-+-+-+-+-------------------------------|
|  Data |           |U|A|P|R|S|F|                               |
| Offset|  Reserved |R|C|S|S|Y|I|             Window            |
|       |           |G|K|H|T|N|N|                               |
|-------+-----------+-+-+-+-+-+-+-------------------------------|
|            Checksum           |         Urgent Pointer        |
`---------------------------------------------------------------'
Some important fields in struct sock are:
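(The listing below is a condensed, hedged sketch rather than the full definition; field names follow the 2.4-era include/net/sock.h, trimmed to the members this article refers to.)

struct sock {
    ...
    struct dst_entry    *dst_cache;     /* cached routing entry (section 4) */
    int                  rcvbuf;        /* size limit of the receive buffer */
    int                  sndbuf;        /* size limit of the send buffer */
    atomic_t             rmem_alloc;    /* bytes committed to the receive queue */
    atomic_t             wmem_alloc;    /* bytes committed to the send queue */
    struct sk_buff_head  receive_queue; /* incoming packets */
    struct sk_buff_head  write_queue;   /* packets queued for sending */
    ...
};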
The actual definition of the sk_buff structure can be found in include/linux/skbuff.h.
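A condensed, hedged sketch of its 2.4-era layout, trimmed to the members used in this article:

struct sk_buff {
    struct sk_buff      *next;   /* next buffer in the list */
    struct sk_buff      *prev;   /* previous buffer in the list */
    struct sk_buff_head *list;   /* list this buffer is on */
    struct sock         *sk;     /* socket that owns this buffer */
    struct net_device   *dev;    /* device we arrived on / leave by */
    unsigned int         len;    /* length of the actual data */
    unsigned char       *head;   /* start of the allocated buffer */
    unsigned char       *data;   /* start of the valid data */
    unsigned char       *tail;   /* end of the valid data */
    unsigned char       *end;    /* end of the allocated buffer */
};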
2. Memory Allocation
The Linux kernel allocates memory in two ways, as sketched below:
1) contiguous memory blocks, using kmalloc();
2) objects from a SLAB cache.
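The following contrasts the two interfaces. This is a hedged sketch: the cache name "my_cache" and struct my_obj are made up, and the six-argument kmem_cache_create() shown is the 2.4-era signature.

#include <linux/slab.h>
#include <linux/errno.h>

struct my_obj { int id; };          /* hypothetical object type */
static kmem_cache_t *my_cachep;     /* hypothetical dedicated cache */

int my_alloc_demo(void)
{
    /* 1) A contiguous block from the general-purpose allocator. */
    void *block = kmalloc(4096, GFP_KERNEL);
    if (block == NULL)
        return -ENOMEM;
    kfree(block);

    /* 2) Fixed-size objects from a dedicated SLAB cache; the
     *    2.4-era kmem_cache_create() takes ctor/dtor pointers. */
    my_cachep = kmem_cache_create("my_cache", sizeof(struct my_obj),
                                  0, SLAB_HWCACHE_ALIGN, NULL, NULL);
    if (my_cachep != NULL) {
        struct my_obj *obj = kmem_cache_alloc(my_cachep, GFP_KERNEL);
        if (obj != NULL)
            kmem_cache_free(my_cachep, obj);
        kmem_cache_destroy(my_cachep);
    }
    return 0;
}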
SLAB caches are used, for example, for inodes and buffer heads (created by a call to kmem_cache_create()), and the actual objects are allocated with kmem_cache_alloc(). The send and receive buffers of a socket are made up of a linked list of segments; each packet received from the network is placed in its own buffer, a struct sk_buff (include/linux/skbuff.h). The important elements of this structure are the list pointers (next, prev), the owning socket (sk) and the data pointers (head, data, tail and end); see the sk_buff sketch in section 1.
For its receive and send queues, each socket holds two list heads: sk->receive_queue and sk->write_queue, both of type struct sk_buff_head.
At its most basic level, a list of buffers is managed using functions like these:
static struct sk_buff_head my_list; /* initialized elsewhere with skb_queue_head_init(&my_list) */
static unsigned long my_dropped;    /* count of frames we had to drop */

void append_frame(char *buf, int len)
{
    struct sk_buff *skb = alloc_skb(len, GFP_ATOMIC);

    if (skb == NULL)
        my_dropped++;
    else {
        /* Reserve len bytes in the buffer and copy the frame in. */
        memcpy(skb_put(skb, len), buf, len);
        skb_queue_tail(&my_list, skb);
    }
}

void process_queue(void)
{
    struct sk_buff *skb;

    /* Pull buffers off the head of the queue until it is empty. */
    while ((skb = skb_dequeue(&my_list)) != NULL) {
        process_data(skb);
        kfree_skb(skb);
    }
}
Linux provides functions such as skb_queue_head(), skb_queue_tail(), skb_dequeue(), skb_insert(), skb_append() and skb_unlink() (all in include/linux/skbuff.h) for managing these lists.
3. Tweaking
For kernel 2.4.21 there are three sysctl variables that limit the amount of memory allocated to buffers: sysctl_tcp_mem[], sysctl_tcp_rmem[] and sysctl_tcp_wmem[], exposed to user space as net.ipv4.tcp_mem, net.ipv4.tcp_rmem and net.ipv4.tcp_wmem.
The initial values for these arrays are set in net/ipv4/tcp.c::tcp_init(), where

    initial order = log2[(number of physical pages) / 2^x]

with x = 11 when the number of pages is below 128K. For a machine with 256MB of physical memory and 4KB pages, the order is of value 2.
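To inspect the live values, one can simply read the corresponding /proc files; a minimal userspace sketch (the path is the standard procfs location for these sysctls):

#include <stdio.h>

int main(void)
{
    /* Each of these files holds three values:
     *   tcp_rmem / tcp_wmem: min, default, max buffer size (bytes)
     *   tcp_mem:             low, pressure, high limits (in pages) */
    FILE *f = fopen("/proc/sys/net/ipv4/tcp_rmem", "r");
    long min, def, max;

    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    if (fscanf(f, "%ld %ld %ld", &min, &def, &max) == 3)
        printf("tcp_rmem: min=%ld default=%ld max=%ld\n", min, def, max);
    fclose(f);
    return 0;
}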
Each socket stores the size of its receive and send buffers in sk->rcvbuf and sk->sndbuf. Alongside these, the sk->rmem_alloc and sk->wmem_alloc elements track the number of bytes in use in each queue. The buffer sizes are upper limits and do not correspond to the amount of memory actually allocated; memory is only allocated when storage for a segment is needed. When the socket is created, the function net/ipv4/tcp_ipv4.c::tcp_v4_init_sock() initializes the buffer sizes to sysctl_tcp_rmem[1] and sysctl_tcp_wmem[1]. These initial values are then adjusted as soon as the connection enters the established state.
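From user space these per-socket limits are visible through the SO_RCVBUF and SO_SNDBUF socket options. A small hedged example follows; note that the kernel clamps requested values against the sysctl maxima, so getsockopt() reports what the kernel actually chose:

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(PF_INET, SOCK_STREAM, 0);
    int size = 128 * 1024;          /* requested send buffer size */
    socklen_t len = sizeof(size);

    if (fd < 0) {
        perror("socket");
        return 1;
    }
    /* Ask for a bigger send buffer (backed by sk->sndbuf)... */
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size));
    /* ...and read back what the kernel actually granted. */
    getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, &len);
    printf("effective SO_SNDBUF: %d bytes\n", size);
    close(fd);
    return 0;
}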
During startup, the function net/ipv4/af_inet.c::inet_init() creates the TCP control socket, which serves a special purpose: sending an RST in response to a packet arriving for a non-existent socket. Its send and receive buffers are initialized to sysctl_rmem_default and sysctl_wmem_default by net/core/sock.c::sock_init_data(). These in turn are set from SK_RMEM_MAX and SK_WMEM_MAX, which are defined in include/linux/skbuff.h with the value 65535:
#define SK_WMEM_MAX 65535
#define SK_RMEM_MAX 65535
4. Socket Statistics
An entry in the routing table [struct rtentry in include/linux/route.h] contains various statistics that are cached between connections. These caches are stored in a dst_entry structure [include/net/dst.h]. When connect() is called on a socket, the function include/net/route.h::ip_route_connect() performs a routing table lookup and points sk->dst_cache at the routing table entry. When the socket is closed, the function net/ipv4/tcp_input.c::tcp_update_metrics() updates the destination cache. The advertised Maximum Segment Size (tp->advmss) is also cached, by net/ipv4/route.c::rt_set_nexthop(). The destination cache additionally stores the maximum window clamp [tp->window_clamp], which remains at zero after this point. It may be useful to note that destination caching can be disabled by turning off the DST_HOST flag in dst.h, and that the destination cache can be flushed by calling ip_route_flush_cache().
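A hedged sketch of the connect-time flow, loosely condensed from 2.4's net/ipv4/tcp_ipv4.c::tcp_v4_connect() (error handling and surrounding context are trimmed, so this is illustrative only):

/* sk is the connecting socket; nexthop is the destination address. */
struct rtable *rt;
int err;

/* Look up a route for the destination; on success rt points at the
 * routing cache entry, with its dst_entry embedded inside. */
err = ip_route_connect(&rt, nexthop, sk->saddr,
                       RT_CONN_FLAGS(sk), sk->bound_dev_if);
if (err < 0)
    return err;

/* Attach the cached route to the socket: later lookups, and the
 * metrics discussed above, come straight from sk->dst_cache. */
__sk_dst_set(sk, &rt->u.dst);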
5. PF_PACKET Sockets
The Linux kernel implements a general-purpose protocol family, called PF_PACKET, which allows the creation of a (raw) socket that receives packets directly from the NIC driver. All other protocol handling is thus skipped, and any packet can be received. These sockets let a programmer develop network monitoring applications, particularly when the network device is placed in promiscuous mode.
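A minimal hedged sketch of such a monitor, using the standard PF_PACKET API documented in packet(7); the interface name "eth0" is an assumption:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <net/if.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>

int main(void)
{
    /* A PF_PACKET socket delivers frames straight from the driver. */
    int fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    struct packet_mreq mr;
    unsigned char frame[2048];
    ssize_t n;

    if (fd < 0) {
        perror("socket (are you root?)");
        return 1;
    }

    /* Put the interface into promiscuous mode for monitoring. */
    memset(&mr, 0, sizeof(mr));
    mr.mr_ifindex = if_nametoindex("eth0");   /* assumed interface */
    mr.mr_type = PACKET_MR_PROMISC;
    setsockopt(fd, SOL_PACKET, PACKET_ADD_MEMBERSHIP, &mr, sizeof(mr));

    /* Each read returns one complete link-layer frame. */
    n = recv(fd, frame, sizeof(frame), 0);
    if (n > 0)
        printf("captured a %zd-byte frame\n", n);

    close(fd);
    return 0;
}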
PF_PACKET sockets differ from PF_INET sockets right from the socket() system call. The socket() call ends up in the sys_socket() function in net/socket.c, which uses sock_create() to determine the protocol family operations. This information comes from the net_families[] array, which is populated at kernel startup time by calls to sock_register(). The kernel can then call the create() function specific to the given family, which produces an appropriate sock structure and completes the creation.
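As a hedged illustration of that registration step, condensed from 2.4's net/packet/af_packet.c (the 2.4-era struct net_proto_family holds just the family number and the create() hook; modern designated-initializer syntax is used here):

#include <linux/net.h>
#include <linux/init.h>

/* The family's create() implementation; body omitted here. */
static int packet_create(struct socket *sock, int protocol);

static struct net_proto_family packet_family_ops = {
    .family = PF_PACKET,
    .create = packet_create,
};

static int __init packet_init(void)
{
    /* Insert ourselves into net_families[PF_PACKET], so that
     * sock_create() can dispatch socket(PF_PACKET, ...) to us. */
    sock_register(&packet_family_ops);
    return 0;
}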
More information about PF_PACKET can be found at this link.
6. Congestion Control Algorithms
Flow control algorithms limit the amount of transmitted traffic based on the estimated network capacity and utilization. TCP Vegas is a congestion control algorithm, implemented in Linux since v2.2, that reduces queuing and packet loss, and thus reduces latency and increases overall throughput, by matching the sending rate to the rate at which packets are actually being drained by the network. Like almost every TCP congestion control algorithm, Vegas is purely a sender-side mechanism: enabling it helps when sending a lot of data, but gives no benefit on the receiving side. Most of this topic requires a more detailed treatment of the packet flow and is beyond the scope of this project.
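For flavour, the core of the Vegas idea fits in a few lines. The following is a hedged sketch of the classical once-per-RTT update; the variable names and the alpha/beta thresholds follow the original Vegas paper, not the kernel source:

/* Classical Vegas once-per-RTT update (sketch): cwnd, alpha and beta
 * are in packets; base_rtt and rtt are in seconds. */
double vegas_update(double cwnd, double base_rtt, double rtt,
                    double alpha, double beta)
{
    double expected = cwnd / base_rtt;  /* rate we hoped for */
    double actual   = cwnd / rtt;       /* rate we actually got */
    double diff     = (expected - actual) * base_rtt; /* pkts queued */

    if (diff < alpha)
        return cwnd + 1;    /* little queuing: grow the window */
    if (diff > beta)
        return cwnd - 1;    /* queues building: shrink it */
    return cwnd;            /* in the sweet spot: hold steady */
}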