
Monitoring and Tuning the Linux Networking Stack: Sending Data

Feb 6, 2017 • packagecloud

linux

TL;DR

This blog post explains how computers running the Linux kernel send
packets, as well as how to monitor and tune each component of the
networking stack as packets flow from user programs to network hardware.

This post forms a pair with our previous post Monitoring and Tuning the
Linux Networking Stack: Receiving Data.


It is impossible to tune or monitor the Linux networking stack without reading the source code of the kernel and having a deep understanding of what exactly is happening.

This blog post will hopefully serve as a reference to anyone looking to do this.

TL;DR
General advice on monitoring and tuning the Linux networking stack
Overview
Detailed Look
Protocol family registration
Sending network data via a socket
sock_sendmsg , __sock_sendmsg , and __sock_sendmsg_nosec
inet_sendmsg

UDP protocol layer


udp_sendmsg
UDP corking

Get the UDP destination address and port


Socket transmit bookkeeping and timestamping
Ancillary messages, via sendmsg
Setting custom IP options
Multicast or unicast?
Routing
Prevent the ARP cache from going stale with MSG_CONFIRM
Fast path for uncorked UDP sockets: Prepare data for transmit
ip_make_skb
Transmit the data!
Slow path for corked UDP sockets with no preexisting corked data
ip_append_data
__ip_append_data
Flushing corked sockets
Error accounting
udp_send_skb
Monitoring: UDP protocol layer statistics
/proc/net/snmp
/proc/net/udp

Tuning: Socket send queue memory


IP protocol layer
ip_send_skb
ip_local_out and __ip_local_out
netfilter and nf_hook
Destination cache
ip_output

ip_finish_output

Path MTU Discovery


ip_finish_output2
dst_neigh_output
neigh_hh_output
n->output

neigh_resolve_output

Monitoring: IP protocol layer


/proc/net/snmp
/proc/net/netstat

Linux netdevice subsystem


Linux traffic control
dev_queue_xmit and __dev_queue_xmit

netdev_pick_tx
__netdev_pick_tx

Transmit Packet Steering (XPS)


skb_tx_hash

Resuming __dev_queue_xmit
__dev_xmit_skb
Tuning: Transmit Packet Steering (XPS)
Queuing disciplines!
qdisc_run_begin and qdisc_run_end
__qdisc_run
qdisc_restart
dequeue_skb

sch_direct_xmit
handle_dev_cpu_collision
dev_requeue_skb

Reminder, while loop in __qdisc_run


__netif_schedule
net_tx_action

net_tx_action completion queue


net_tx_action output queue

Finally time to meet our friend dev_hard_start_xmit


Monitoring qdiscs
Using the tc command line tool
Tuning qdiscs
Increasing the processing weight of __qdisc_run
Increasing the transmit queue length
Network Device Driver
Driver operations registration
Transmit data with ndo_start_xmit
igb_tx_map

Dynamic Queue Limits (DQL)


Transmit completions
Transmit completion IRQ
igb_poll
igb_clean_tx_irq
igb_poll return value
Monitoring network devices

Using ethtool -S
Using sysfs
Using /proc/net/dev
Monitoring dynamic queue limits
Tuning network devices
Check the number of TX queues being used
Adjust the number of TX queues used
Adjust the size of the TX queues
The End
Extras
Reducing ARP traffic ( MSG_CONFIRM )
UDP Corking
Timestamping
Conclusion
Help with Linux networking or other systems
Related posts

General advice on monitoring and tuning the Linux networking stack

As mentioned in our previous article, the Linux network stack is complex and there is no one-size-fits-all solution for monitoring or tuning. If you truly want to tune the network stack, you will have no choice but to invest a considerable amount of time, effort, and money into understanding how the various parts of the networking system interact.

Many of the example settings provided in this blog post are used solely for
illustrative purposes and are not a recommendation for or against a certain
configuration or default setting. Before adjusting any setting, you should
develop a frame of reference around what you need to be monitoring to
notice a meaningful change.

Adjusting networking settings while connected to the machine over a


network is dangerous; you could very easily lock yourself out or completely
take out your networking. Do not adjust these settings on production
machines; instead, make adjustments on new machines and rotate them
into production, if possible.

Overview

For reference, you may want to have a copy of the device data sheet handy.
This post will examine the Intel I350 Ethernet controller, controlled by the
igb device driver. You can find that data sheet (warning: LARGE PDF) here
for your reference.

The high-level path network data takes from a user program to a network
device is as follows:

1. Data is written using a system call (like sendto , sendmsg , et. al.).
2. Data passes through the socket subsystem on to the socket’s protocol family’s system (in our case, AF_INET ).

3. The protocol family passes data through the protocol layers which
(in many cases) arrange the data into packets.
4. The data passes through the routing layer, populating the
destination and neighbour caches along the way (if they are cold).
This can generate ARP traffic if an ethernet address needs to be
looked up.
5. After passing through the protocol layers, packets reach the device
agnostic layer.
6. The output queue is chosen using XPS (if enabled) or a hash
function.
7. The device driver’s transmit function is called.
8. The data is then passed on to the queue discipline (qdisc) attached
to the output device.
9. The qdisc will either transmit the data directly if it can, or queue it
up to be sent during the NET_TX softirq.
10. Eventually the data is handed down to the driver from the qdisc.
11. The driver creates the needed DMA mappings so the device can
read the data from RAM.
12. The driver signals the device that the data is ready to be transmitted.
13. The device fetches the data from RAM and transmits it.
14. Once transmission is complete, the device raises an interrupt to
signal transmit completion.
15. The driver’s registered IRQ handler for transmit completion runs.
For many devices, this handler simply triggers the NAPI poll loop to
start running via the NET_RX softirq.
16. The poll function runs via a softIRQ and calls down into the driver
to unmap DMA regions and free packet data.
This entire flow will be examined in detail in the following sections.

The protocol layers examined below are the IP and UDP protocol layers.
Much of the information presented will serve as a reference for other
protocol layers, as well.

Detailed Look

This blog post will be examining the Linux kernel version 3.13.0 with links
to code on GitHub and code snippets throughout this post, much like the
companion post.

Let’s begin by examining how protocol families are registered in the kernel
and used by the socket subsystem, then we can proceed to sending data.

Protocol family registration

What happens when you run a piece of code like this in a user program to
create a UDP socket?

sock = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP)

In short, the Linux kernel looks up a set of functions exported by the UDP
protocol stack that deal with many things including sending and receiving
network data. To understand exactly how this works, we have to look into the
AF_INET address family code.

The Linux kernel executes the inet_init function early during kernel initialization. This function registers the AF_INET protocol family, the individual protocol stacks within that family (TCP, UDP, ICMP, and RAW), and calls initialization routines to get protocol stacks ready to process network data. You can find the code for inet_init in ./net/ipv4/af_inet.c.

The AF_INET protocol family exports a structure that has a create function.
This function is called by the kernel when a socket is created from a user
program:

static const struct net_proto_family inet_family_ops = {


.family = PF_INET,
.create = inet_create,
.owner = THIS_MODULE,
};

The inet_create function takes the arguments passed to the socket system
call and searches the registered protocols to find a set of operations to link
to the socket. Take a look:

	/* Look for the requested type/protocol pair. */
lookup_protocol:
	err = -ESOCKTNOSUPPORT;
	rcu_read_lock();
	list_for_each_entry_rcu(answer, &inetsw[sock->type], list) {

		err = 0;
		/* Check the non-wild match. */
		if (protocol == answer->protocol) {
			if (protocol != IPPROTO_IP)
				break;
		} else {
			/* Check for the two wild cases. */
			if (IPPROTO_IP == protocol) {
				protocol = answer->protocol;
				break;
			}
			if (IPPROTO_IP == answer->protocol)
				break;
		}
		err = -EPROTONOSUPPORT;
	}

Later, answer , which holds a reference to a particular protocol stack, has its ops field copied into the socket structure:

sock->ops = answer->ops;

You can find the structure definitions for all of the protocol stacks in
af_inet.c . Let’s take a look at the TCP and UDP protocol structures:

/* Upon startup we insert all the elements in inetsw_array[] into


* the linked list inetsw.
*/
static struct inet_protosw inetsw_array[] =
{
{
.type = SOCK_STREAM,
.protocol = IPPROTO_TCP,
.prot = &tcp_prot,
.ops = &inet_stream_ops,
.no_check = 0,
.flags = INET_PROTOSW_PERMANENT |
INET_PROTOSW_ICSK,
},

{
.type = SOCK_DGRAM,
.protocol = IPPROTO_UDP,
.prot = &udp_prot,
.ops = &inet_dgram_ops,
.no_check = UDP_CSUM_DEFAULT,
.flags = INET_PROTOSW_PERMANENT,
},

/* .... more protocols ... */

In the case of IPPROTO_UDP , an ops structure is linked into place which


contains functions for various things, including sending and receiving data:

const struct proto_ops inet_dgram_ops = {


.family = PF_INET,
.owner = THIS_MODULE,

/* ... */

.sendmsg = inet_sendmsg,
.recvmsg = inet_recvmsg,

/* ... */
};
EXPORT_SYMBOL(inet_dgram_ops);

and a protocol-specific structure prot , which contains function pointers to all the internal UDP protocol stack functions. For the UDP protocol, this structure is called udp_prot and is exported by ./net/ipv4/udp.c:

struct proto udp_prot = {


.name = "UDP",
.owner = THIS_MODULE,

/* ... */

.sendmsg = udp_sendmsg,
.recvmsg = udp_recvmsg,
/* ... */

};
EXPORT_SYMBOL(udp_prot);

Now, let’s turn to a user program that sends UDP data to see how
udp_sendmsg is called in the kernel!


Sending network data via a socket

A user program wants to send UDP network data and so it uses the sendto
system call, maybe like this:

ret = sendto(socket, buffer, buflen, 0, &dest, sizeof(dest));

This system call passes through the Linux system call layer and lands in this
function in ./net/socket.c :

/*
* Send a datagram to a given address. We move the address into kernel
* space and check the user space data area is readable before invoking
* the protocol.
*/

SYSCALL_DEFINE6(sendto, int, fd, void __user *, buff, size_t, len,


unsigned int, flags, struct sockaddr __user *, addr,
int, addr_len)
{

/* ... code ... */

err = sock_sendmsg(sock, &msg, len);

/* ... code ... */


}

The SYSCALL_DEFINE6 macro unfolds into a pile of macros, which in turn, set
up the infrastructure needed to create a system call with 6 arguments
(hence DEFINE6 ). One of the results of this is that inside the kernel, system
call function names have sys_ prepended to them.

The system call code for sendto calls sock_sendmsg after arranging the
data in a way that the lower layers will be able to handle. In particular, it
takes the destination address passed into sendto and arranges it into a
structure, let’s take a look:

iov.iov_base = buff;
iov.iov_len = len;
msg.msg_name = NULL;
msg.msg_iov = &iov;
msg.msg_iovlen = 1;
msg.msg_control = NULL;
msg.msg_controllen = 0;
msg.msg_namelen = 0;
if (addr) {
err = move_addr_to_kernel(addr, addr_len, &address);
if (err < 0)
goto out_put;
msg.msg_name = (struct sockaddr *)&address;
msg.msg_namelen = addr_len;
}

This code is copying addr , passed in via the user program, into the kernel data structure address , which is then embedded into a struct msghdr structure as msg_name . This is similar to what a userland program would do if it were calling sendmsg instead of sendto . The kernel provides this mutation because both sendto and sendmsg call down to sock_sendmsg .
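To make that equivalence concrete, here is a minimal user-space sketch (not from the original post; the socket and buffer names are assumptions) that sends the same datagram once with sendto and once with sendmsg and a hand-built struct msghdr , mirroring the structure the kernel assembles internally:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>

/* Illustrative helper: send the same buffer twice, once via sendto() and
 * once via sendmsg() with a manually constructed struct msghdr. Both calls
 * end up in the kernel's sock_sendmsg(), as described above. */
static int send_both_ways(int sock, const void *buf, size_t len,
                          const struct sockaddr_in *dest)
{
	struct iovec iov;
	struct msghdr msg;

	if (sendto(sock, buf, len, 0,
	           (const struct sockaddr *)dest, sizeof(*dest)) < 0)
		return -1;

	iov.iov_base = (void *)buf;
	iov.iov_len  = len;

	memset(&msg, 0, sizeof(msg));
	msg.msg_name    = (void *)dest;   /* destination address, like addr above */
	msg.msg_namelen = sizeof(*dest);
	msg.msg_iov     = &iov;           /* data buffer(s) */
	msg.msg_iovlen  = 1;

	return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}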

sock_sendmsg , __sock_sendmsg , and __sock_sendmsg_nosec

sock_sendmsg performs some error checking before calling __sock_sendmsg , which does its own error checking before calling __sock_sendmsg_nosec .
__sock_sendmsg_nosec passes the data deeper into the socket subsystem:

static inline int __sock_sendmsg_nosec(struct kiocb *iocb, struct socket *sock,


struct msghdr *msg, size_t size)
{
struct sock_iocb *si = ....

/* other code ... */

return sock->ops->sendmsg(iocb, sock, msg, size);


}

As seen in the previous section explaining socket creation, the sendmsg


function registered to this socket ops structure is inet_sendmsg .

inet_sendmsg

As you may have guessed from the name, this is a generic function provided by the AF_INET protocol family. This function starts by calling sock_rps_record_flow to record the last CPU that the flow was processed on; this is used by Receive Packet Steering. Next, this function looks up the sendmsg function on the socket’s internal protocol operations structure and calls it:

int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
size_t size)
{
struct sock *sk = sock->sk;

sock_rps_record_flow(sk);

/* We may need to bind the socket. */


if (!inet_sk(sk)->inet_num && !sk->sk_prot->no_autobind &&
inet_autobind(sk))
return -EAGAIN;

return sk->sk_prot->sendmsg(iocb, sk, msg, size);


}
EXPORT_SYMBOL(inet_sendmsg);

When dealing with UDP, sk->sk_prot->sendmsg above is udp_sendmsg as


exported by the UDP protocol layer, via the udp_prot structure we saw
earlier. This function call transitions from the generic AF_INET protocol
family on to the UDP protocol stack.

UDP protocol layer

udp_sendmsg

The udp_sendmsg function can be found in ./net/ipv4/udp.c. The entire function is quite long, so we’ll examine pieces of it below. Follow the previous link if you’d like to read it in its entirety.

UDP corking

After variable declarations and some basic error checking, one of the first things udp_sendmsg does is check if the socket is “corked”. UDP corking is a feature that allows a user program to request that the kernel accumulate data from multiple calls to send into a single datagram before sending. There are two ways to enable this option in your user program:

1. Use the setsockopt system call and pass UDP_CORK as the socket
option.
2. Pass MSG_MORE as one of the flags when calling send , sendto , or
sendmsg from your program.

These options are documented in the UDP man page and the send / sendto
/ sendmsg man page, respectively.
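As an illustration, both mechanisms look like the following sketch from user space (assumptions: sock is an already-connected UDP socket, and UDP_CORK may live in <linux/udp.h> rather than <netinet/udp.h> on some systems):

#include <netinet/in.h>
#include <netinet/udp.h>
#include <sys/socket.h>

#ifndef UDP_CORK
#define UDP_CORK 1	/* assumption: value from the kernel's linux/udp.h */
#endif

/* Accumulate two writes into a single datagram, first with UDP_CORK and
 * then (for a second datagram) with MSG_MORE. */
static void corked_sends(int sock)
{
	int on = 1, off = 0;
	const char part1[] = "hello, ";
	const char part2[] = "world";

	/* Option 1: cork the socket, write twice, then uncork to flush. */
	setsockopt(sock, IPPROTO_UDP, UDP_CORK, &on, sizeof(on));
	send(sock, part1, sizeof(part1) - 1, 0);
	send(sock, part2, sizeof(part2) - 1, 0);
	setsockopt(sock, IPPROTO_UDP, UDP_CORK, &off, sizeof(off));

	/* Option 2: pass MSG_MORE on every call except the last. */
	send(sock, part1, sizeof(part1) - 1, MSG_MORE);
	send(sock, part2, sizeof(part2) - 1, 0);
}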

The code from udp_sendmsg checks up->pending to determine if the socket


is currently corked, and if so, it proceeds directly to appending data. We’ll
see how data is appended later.

int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
size_t len)
{

/* variables and error checking ... */

fl4 = &inet->cork.fl.u.ip4;
if (up->pending) {
/*
* There are pending frames.
* The socket lock must be held while it's corked.
*/
lock_sock(sk);
if (likely(up->pending)) {

if (unlikely(up->pending != AF_INET)) {
release_sock(sk);
return -EINVAL;
}
goto do_append_data;
}
release_sock(sk);
}

Get the UDP destination address and port

Next, the destination address and port are determined from one of two
possible sources:

1. The socket itself has the destination address stored because the
socket was connected at some point.
2. The address is passed in via an auxiliary structure, as we saw in the
kernel code for sendto .

Here’s how the kernel deals with this:

/*
* Get and verify the address.
*/
if (msg->msg_name) {
struct sockaddr_in *usin = (struct sockaddr_in *)msg->msg_name;
if (msg->msg_namelen < sizeof(*usin))
return -EINVAL;
if (usin->sin_family != AF_INET) {
if (usin->sin_family != AF_UNSPEC)
return -EAFNOSUPPORT;
}
daddr = usin->sin_addr.s_addr;
dport = usin->sin_port;

if (dport == 0)
return -EINVAL;
} else {
if (sk->sk_state != TCP_ESTABLISHED)
return -EDESTADDRREQ;
daddr = inet->inet_daddr;
dport = inet->inet_dport;
/* Open fast path for connected socket.
Route will not be used, if at least one option is set.
*/
connected = 1;
}

Yes, that is a TCP_ESTABLISHED in the UDP protocol layer! The socket states
for better or worse use TCP state descriptions.

Recall earlier that we saw how the kernel arranges a struct msghdr
structure on behalf of the user when the user program calls sendto . The
code above shows how the kernel parses that data back out in order to set
daddr and dport .

If the udp_sendmsg function was reached by a kernel function which did not
arrange a struct msghdr structure, the destination address and port are
retrieved from the socket itself and the socket is marked as “connected.”

In either case daddr and dport will be set to the destination address and
port.
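In user-space terms, the two sources correspond roughly to the two calling styles sketched below (illustrative only; the function and variable names are assumptions):

#include <netinet/in.h>
#include <sys/socket.h>

/* Source 2: the destination is supplied on every call; udp_sendmsg reads
 * it from msg->msg_name. */
static void send_with_address(int sock, const struct sockaddr_in *dst,
                              const void *buf, size_t len)
{
	sendto(sock, buf, len, 0, (const struct sockaddr *)dst, sizeof(*dst));
}

/* Source 1: the destination was stored on the socket by connect(); send()
 * passes no address, so udp_sendmsg falls back to inet->inet_daddr and
 * inet->inet_dport. */
static void send_connected(int sock, const struct sockaddr_in *dst,
                           const void *buf, size_t len)
{
	connect(sock, (const struct sockaddr *)dst, sizeof(*dst));
	send(sock, buf, len, 0);
}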

Socket transmit bookkeeping and timestamping


Next, the source address, device index, and any timestamping options which
were set on the socket (like SOCK_TIMESTAMPING_TX_HARDWARE ,
SOCK_TIMESTAMPING_TX_SOFTWARE , SOCK_WIFI_STATUS ) are retrieved and
stored:

ipc.addr = inet->inet_saddr;

ipc.oif = sk->sk_bound_dev_if;

sock_tx_timestamp(sk, &ipc.tx_flags);
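The internal socket flags listed above are driven from user space by the SO_TIMESTAMPING socket option. A rough, hedged sketch follows (the SOF_TIMESTAMPING_* constants come from linux/net_tstamp.h; exact header requirements vary by kernel and libc version):

#include <linux/net_tstamp.h>	/* SOF_TIMESTAMPING_* flags */
#include <sys/socket.h>

/* Request software transmit timestamps for this socket; the kernel then
 * fills in ipc.tx_flags for each outgoing packet, as shown above. */
static int enable_tx_timestamps(int sock)
{
	int flags = SOF_TIMESTAMPING_TX_SOFTWARE |	/* stamp on transmit */
		    SOF_TIMESTAMPING_SOFTWARE;		/* report via cmsg */

	return setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING,
			  &flags, sizeof(flags));
}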

Ancillary messages, via sendmsg

The sendmsg and recvmsg system calls allow the user to set or request
ancillary data in addition to sending or receiving packets. User programs
can make use of this ancillary data by crafting a struct msghdr with the
request embedded in it. Many of the ancillary data types are documented in
the man page for IP.

One popular example of ancillary data is IP_PKTINFO . In the case of


sendmsg this data type allows the program to set a struct in_pktinfo to
be used when sending data. The program can specify the source address to
be used on the packet by filling in fields in the struct in_pktinfo
structure. This is a useful option if the program is a server program listening
on multiple IP addresses. In this case, the server program may want to reply
to the client with the same IP address that the client used to contact the
server. IP_PKTINFO enables precisely this use case.
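A hedged sketch of the sending side of this use case follows (not from the original post; the helper name and the _GNU_SOURCE requirement for struct in_pktinfo on older glibc versions are assumptions):

#define _GNU_SOURCE		/* struct in_pktinfo on older glibc (assumption) */
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send buf to dst while pinning the packet's source address to src_addr
 * using an IP_PKTINFO control message. */
static int send_from(int sock, const struct sockaddr_in *dst,
		     struct in_addr src_addr, const void *buf, size_t len)
{
	char cbuf[CMSG_SPACE(sizeof(struct in_pktinfo))];
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	struct msghdr msg;
	struct cmsghdr *cmsg;
	struct in_pktinfo *pi;

	memset(&msg, 0, sizeof(msg));
	memset(cbuf, 0, sizeof(cbuf));
	msg.msg_name       = (void *)dst;
	msg.msg_namelen    = sizeof(*dst);
	msg.msg_iov        = &iov;
	msg.msg_iovlen     = 1;
	msg.msg_control    = cbuf;
	msg.msg_controllen = sizeof(cbuf);

	cmsg             = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = IPPROTO_IP;
	cmsg->cmsg_type  = IP_PKTINFO;
	cmsg->cmsg_len   = CMSG_LEN(sizeof(struct in_pktinfo));

	pi = (struct in_pktinfo *)CMSG_DATA(cmsg);
	pi->ipi_spec_dst = src_addr;	/* source address for this packet only */

	return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}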

Similarly, the IP_TTL and IP_TOS ancillary messages allow the user to set the IP packet TTL and TOS values on a per-packet basis, when passed with data to sendmsg from the user program. Note that both IP_TTL and IP_TOS may be set at the socket level for all outgoing packets by using setsockopt , instead of on a per-packet basis if desired. The Linux kernel translates the TOS value specified to a priority using an array. The priority affects how and when a packet is transmitted from a queuing discipline. We’ll see more about what this means later.
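For reference, the socket-wide variants look like the sketch below (assumptions: sock is a UDP socket and IPTOS_LOWDELAY is just an example value); the per-packet variants would carry the same IP_TOS / IP_TTL values in a control message, exactly like the IP_PKTINFO example above:

#include <netinet/in.h>
#include <netinet/ip.h>		/* IPTOS_* values */
#include <sys/socket.h>

/* Set TOS and TTL once, for every packet subsequently sent on this socket.
 * The TOS value is what the kernel later maps to an skb priority. */
static void set_tos_ttl(int sock)
{
	int tos = IPTOS_LOWDELAY;
	int ttl = 16;

	setsockopt(sock, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
	setsockopt(sock, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl));
}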

We can see how the kernel handles ancillary messages for sendmsg on UDP
sockets:

if (msg->msg_controllen) {
err = ip_cmsg_send(sock_net(sk), msg, &ipc,
sk->sk_family == AF_INET6);
if (err)
return err;
if (ipc.opt)
free = 1;
connected = 0;
}

The internals of parsing the ancillary messages are handled by ip_cmsg_send


from ./net/ipv4/ip_sockglue.c. Note that simply providing any ancillary data
marks this socket as not connected.

Setting custom IP options

Next, sendmsg will check to see if the user specified any custom IP options
with ancillary messages. If options were set, they will be used. If not, the
options already in use by this socket will be used:

if (!ipc.opt) {
struct ip_options_rcu *inet_opt;

rcu_read_lock();
inet_opt = rcu_dereference(inet->inet_opt);
if (inet_opt) {
memcpy(&opt_copy, inet_opt,
sizeof(*inet_opt) + inet_opt->opt.optlen);
ipc.opt = &opt_copy.opt;
}
rcu_read_unlock();
}

Next up, the function checks to see if the source record route (SRR) IP
option is set. There are two types of source record routing: loose and strict
source record routing. If this option was set, the first hop address is
recorded and stored as faddr and the socket is marked as “not connected”.
This will be used later:

ipc.addr = faddr = daddr;

if (ipc.opt && ipc.opt->opt.srr) {


if (!daddr)
return -EINVAL;
faddr = ipc.opt->opt.faddr;
connected = 0;
}

After the SRR option is handled, the TOS IP flag is retrieved either from the value the user set via an ancillary message or the value currently in use by the socket. This is followed by a check to determine if:

SO_DONTROUTE was set on the socket (with setsockopt ), or


MSG_DONTROUTE was specified as a flag when calling sendto or
sendmsg , or
is_strictroute was set, indicating that strict source record routing is desired

Then, the tos has 0x1 ( RTO_ONLINK ) added to its bit set and the socket is
considered not “connected”:

tos = get_rttos(&ipc, inet);


if (sock_flag(sk, SOCK_LOCALROUTE) ||
(msg->msg_flags & MSG_DONTROUTE) ||
(ipc.opt && ipc.opt->opt.is_strictroute)) {
tos |= RTO_ONLINK;
connected = 0;
}

Multicast or unicast?

Next, the code attempts to deal with multicast. This is a bit tricky, as the
user could specify an alternate source address or device index of where to
send the packet from by sending an ancillary IP_PKTINFO message, as
explained earlier.

If the destination address is a multicast address:

1. The device index of where to write the packet will be set to the
multicast device index, and
2. The source address on the packet will be set to the multicast
source address.

Unless, that is, the user has overridden the device index by sending the IP_PKTINFO ancillary message. Let’s take a look:

if (ipv4_is_multicast(daddr)) {
if (!ipc.oif)
ipc.oif = inet->mc_index;
if (!saddr)
saddr = inet->mc_addr;
connected = 0;
} else if (!ipc.oif)
ipc.oif = inet->uc_index;

If the destination address is not a multicast address, the device index is set
unless it was overridden by the user with IP_PKTINFO .
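The mc_index and mc_addr values the kernel falls back on here are typically populated from user space with the IP_MULTICAST_IF socket option; a minimal sketch (assuming ifindex refers to a valid network interface):

#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Choose the outgoing interface (and, implicitly, source address) used for
 * multicast sends on this socket; this is what udp_sendmsg reads back as
 * inet->mc_index / inet->mc_addr above. */
static int set_multicast_if(int sock, int ifindex)
{
	struct ip_mreqn mreq;

	memset(&mreq, 0, sizeof(mreq));
	mreq.imr_ifindex = ifindex;

	return setsockopt(sock, IPPROTO_IP, IP_MULTICAST_IF,
			  &mreq, sizeof(mreq));
}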

Routing

Now it’s time for routing!

The code in the UDP layer that deals with routing begins with a fast path. If
the socket is connected, try to get the routing structure:

if (connected)
rt = (struct rtable *)sk_dst_check(sk, 0);

If the socket was not connected, or if it was but the routing helper sk_dst_check decided the route was obsolete, the code moves into the slow path to generate a routing structure. This begins by calling flowi4_init_output to construct a structure describing this UDP flow:

if (rt == NULL) {
struct net *net = sock_net(sk);

fl4 = &fl4_stack;
flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
RT_SCOPE_UNIVERSE, sk->sk_protocol,
inet_sk_flowi_flags(sk)|FLOWI_FLAG_CAN_SLEEP,
faddr, saddr, dport, inet->inet_sport);

Once this flow structure has been constructed, the socket and its flow structure are passed along to the security subsystem so that systems like SELinux or SMACK can set a security id value on the flow structure. Next, ip_route_output_flow will call into the IP routing code to generate a routing structure for this flow:

security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
rt = ip_route_output_flow(net, fl4, sk);

If a routing structure could not be generated and the error was


ENETUNREACH , the OUTNOROUTES statistic counter is incremented.

if (IS_ERR(rt)) {
err = PTR_ERR(rt);
rt = NULL;
if (err == -ENETUNREACH)
IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
goto out;
}

The location of the file holding these statistics counters, the other available counters, and their meanings will be discussed below in the UDP monitoring section.

Next, if the route is for broadcast, but the socket option SOCK_BROADCAST
was not set on the socket, the code terminates. If the socket is considered
“connected” (as described throughout this function), the routing structure is
cached on the socket:

err = -EACCES;
if ((rt->rt_flags & RTCF_BROADCAST) &&
!sock_flag(sk, SOCK_BROADCAST))
goto out;
if (connected)
sk_dst_set(sk, dst_clone(&rt->dst));


Prevent the ARP cache from going stale with MSG_CONFIRM

If the user specified the MSG_CONFIRM flag when calling send , sendto , or


sendmsg , the UDP protocol layer will now handle that:

if (msg->msg_flags&MSG_CONFIRM)
goto do_confirm;
back_from_confirm:

This flag indicates to the system to confirm that the ARP cache entry is still valid and prevents it from being garbage collected. The dst_confirm function simply sets a flag on the destination cache entry which will be checked much later when the neighbour cache has been queried and an entry has been found. We’ll see this again later. This feature is commonly used in UDP networking applications to reduce unnecessary ARP traffic. The do_confirm
label is found near the end of this function, but it is straightforward:

do_confirm:
dst_confirm(&rt->dst);
if (!(msg->msg_flags&MSG_PROBE) || len)
goto back_from_confirm;
err = 0;
goto out;

This code confirms the cache entry and jumps back to back_from_confirm , if
this was not a probe.
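From user space the flag is simply passed on the send call. A sketch, assuming sock is a connected UDP socket that has just received a reply from its peer (the condition under which MSG_CONFIRM is appropriate):

#include <sys/socket.h>
#include <sys/types.h>

/* Tell the kernel the peer is known to be reachable, so its neighbour/ARP
 * entry should be confirmed rather than re-validated with new ARP traffic. */
static ssize_t send_confirm(int sock, const void *buf, size_t len)
{
	return send(sock, buf, len, MSG_CONFIRM);
}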

Once the do_confirm code jumps back to back_from_confirm (or no jump


happened to do_confirm in the first place), the code will attempt to deal
with both the UDP cork and uncorked cases next.

Fast path for uncorked UDP sockets: Prepare data for transmit

If UDP corking is not requested, the data can be packed into a struct
sk_buff and passed on to udp_send_skb to move down the stack and closer
to the IP protocol layer. This is done by calling ip_make_skb . Note that the
routing structure generated earlier by calling ip_route_output_flow is
passed in as well. It will be affixed to the skb and used later in the IP
protocol layer.

/* Lockless fast path for the non-corking case. */


if (!corkreq) {
skb = ip_make_skb(sk, fl4, getfrag, msg->msg_iov, ulen,
sizeof(struct udphdr), &ipc, &rt,
msg->msg_flags);
err = PTR_ERR(skb);
if (!IS_ERR_OR_NULL(skb))
err = udp_send_skb(skb, fl4);
goto out;
}

The ip_make_skb function will attempt to construct an skb taking into


consideration a wide range of things, like:

The MTU.
UDP corking (if enabled).
UDP Fragmentation Offloading (UFO).
Fragmentation, if UFO is unsupported and the size of the data to transmit is larger than the MTU.

Most network device drivers do not support UFO because the network
hardware itself does not support this feature. Let’s take a look through this
code, keeping in mind that corking is disabled. We’ll look at the corking
enabled path next.

ip_make_skb

The ip_make_skb function can be found in ./net/ipv4/ip_output.c. This


function is a bit tricky. The lower level code that ip_make_skb needs to use in order to build an skb requires a corking structure and a queue (where the skb will be queued) to be passed in. In the case where the socket is not corked, a faux corking structure and an empty queue are passed in as dummies.

Let’s take a look at how the faux corking structure and queue are set up:

struct sk_buff *ip_make_skb(struct sock *sk, /* more args */)


{
struct inet_cork cork;
struct sk_buff_head queue;
int err;

if (flags & MSG_PROBE)


return NULL;

__skb_queue_head_init(&queue);

cork.flags = 0;
cork.addr = 0;
cork.opt = NULL;
err = ip_setup_cork(sk, &cork, /* more args */);
if (err)
return ERR_PTR(err);


As seen above, both the corking structure ( cork ) and the queue ( queue ) are stack-allocated; neither is needed by the time ip_make_skb has completed. The faux corking structure is set up with a call to ip_setup_cork
which allocates memory and initializes the structure. Next,
__ip_append_data is called and the queue and corking structure are passed
in:

err = __ip_append_data(sk, fl4, &queue, &cork,


&current->task_frag, getfrag,
from, length, transhdrlen, flags);

We’ll see how this function works later, as it is used in both cases whether
the socket is corked or not. For now, all we need to know is that
__ip_append_data will create an skb, append data to it, and add that skb to
the queue passed in. If appending the data failed,
__ip_flush_pending_frames is called to drop the data on the floor and the
error code is passed back upward:

if (err) {
__ip_flush_pending_frames(sk, &queue, &cork);
return ERR_PTR(err);
}

Finally, if no error occurred, __ip_make_skb will dequeue the queued skb,


add the IP options, and return an skb that is ready to be passed on to lower
layers for sending:

return __ip_make_skb(sk, fl4, &queue, &cork);

Transmit the data!



If no errors occurred, the skb is handed to udp_send_skb which will pass the
skb to the next layer of the networking stack, the IP protocol stack:

err = PTR_ERR(skb);
if (!IS_ERR_OR_NULL(skb))
err = udp_send_skb(skb, fl4);
goto out;

If there was an error, it will be accounted later. See the “Error Accounting”
section below the UDP corking case for more information.

Slow path for corked UDP sockets with no preexisting corked data

If UDP corking is being used, but no preexisting data is corked, the slow
path commences:

1. Lock the socket.


2. Check for an application bug: a corked socket that is being “re-corked”.
3. The flow structure for this UDP flow is prepared for corking.
4. The data to be sent is appended to existing data.

You can see this in the next piece of code, continuing down udp_sendmsg :

lock_sock(sk);
if (unlikely(up->pending)) {
/* The socket is already corked while preparing it. */
/* ... which is an evident application bug. --ANK */
release_sock(sk);

LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("cork app bug 2\n"));
err = -EINVAL;
goto out;
}
/*
* Now cork the socket to pend data.
*/
fl4 = &inet->cork.fl.u.ip4;
fl4->daddr = daddr;
fl4->saddr = saddr;
fl4->fl4_dport = dport;
fl4->fl4_sport = inet->inet_sport;
up->pending = AF_INET;

do_append_data:
up->len += ulen;
err = ip_append_data(sk, fl4, getfrag, msg->msg_iov, ulen,
sizeof(struct udphdr), &ipc, &rt,
corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);

ip_append_data

The ip_append_data function is a small wrapper which does two major things prior to calling down to __ip_append_data :

1. Checks if the MSG_PROBE flag was passed in from the user. This flag indicates that the user does not want to really send data. The path should be probed (for example to determine the PMTU).
2. Checks if the socket’s send queue is empty. If so, this means that
there is no corked data pending, so ip_setup_cork is called to
setup corking.

Once the above conditions are dealt with, the __ip_append_data function is called, which contains the bulk of the logic for processing data into packets.

__ip_append_data

This function is called either from ip_append_data if the socket is corked or from ip_make_skb if the socket is not corked. In either case, this function will either allocate a new buffer to store the data passed in or will append the data to existing data.

The way this works centers around the socket’s send queue. Existing data
waiting to be sent (for example, if the socket is corked) will have an entry in
the queue where additional data can be appended.

This function is complex; it performs several rounds of calculations to


determine how to construct the skb that will be passed to the lower level
networking layers and examining the buffer allocation process in detail is
not strictly necessary for understanding how network data is transmitted.

The important highlights of this function include:

1. Handling UDP fragmentation offloading (UFO), if supported by the hardware. The vast majority of network hardware does not support UFO. If your network card’s driver does support it, it will set the feature flag NETIF_F_UFO .
2. Handling network cards that support scatter/gather IO. Many cards support this and it is advertised with the NETIF_F_SG feature flag. The availability of this feature indicates that a network card can deal with transmitting a packet where the data has been split amongst a set of buffers; the kernel does not need to spend time coalescing multiple buffers into a single buffer. Avoiding this additional copying is desired and most network cards support this.

3. Tracking the size of the send queue via calls to sock_wmalloc .


When a new skb is allocated, the size of the skb is charged to the
socket which owns it and the allocated bytes for a socket’s send
queue are incremented. If there was not sufficient space in the send
queue, the skb is not allocated and an error is returned and tracked.
We’ll see how to set the socket send queue size in the tuning
section below.
4. Incrementing error statistics. Any error in this function increments
“discard”. We’ll see how to read this value in the monitoring section
below.

Upon successful completion of this function, 0 is returned and the data to be transmitted will be assembled into an skb that is appropriate for the network device and is waiting on the send queue.

In the uncorked case, the queue holding the skb is passed to __ip_make_skb
described above where it is dequeued and prepared to be sent to the lower
layers via udp_send_skb .

In the corked case, the return value of __ip_append_data is passed upward.


The data sits on the send queue until udp_sendmsg determines it is time to
call udp_push_pending_frames which will finalize the skb and call
udp_send_skb .

Flushing corked sockets

Now, udp_sendmsg will move on to check the return value ( err below) from
ip_append_data :

if (err)
udp_flush_pending_frames(sk);
else if (!corkreq)
err = udp_push_pending_frames(sk);
else if (unlikely(skb_queue_empty(&sk->sk_write_queue)))
up->pending = 0;
release_sock(sk);

Let’s take a look at each of these cases:

1. If there is an error ( err is non-zero), then


udp_flush_pending_frames is called, which cancels corking and
drops all data from the socket’s send queue.
2. If this data was sent without MSG_MORE specified, udp_push_pending_frames is called, which will attempt to deliver the data to the lower networking layers.
3. If the send queue is empty, mark the socket as no longer corking.

If the append operation completed successfully and there is more data to


cork coming, the code continues by cleaning up and returning the length of
the data appended:

ip_rt_put(rt);
if (free)
kfree(ipc.opt);
if (!err)
return len;

That is how the kernel deals with corked UDP sockets.

Error accounting

If:

1. The non-corking fast path failed to make an skb or udp_send_skb reports an error, or
2. ip_append_data fails to append data to a corked UDP socket, or
3. udp_push_pending_frames returns an error received from udp_send_skb when trying to transmit a corked skb

the SNDBUFERRORS statistic will be incremented only if the error received


was ENOBUFS (no kernel memory available) or the socket has SOCK_NOSPACE
set (the send queue is full):

/*
* ENOBUFS = no kernel mem, SOCK_NOSPACE = no sndbuf space. Reporting
* ENOBUFS might not be good (it's not tunable per se), but otherwise
* we don't have a good statistic (IpOutDiscards but it can be too many
* things). We could add another new stat but at least for now that
* seems like overkill.
*/
if (err == -ENOBUFS || test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
UDP_INC_STATS_USER(sock_net(sk),
UDP_MIB_SNDBUFERRORS, is_udplite);
}
return err;

We’ll see how to read these counters in the monitoring section below.

udp_send_skb

The udp_send_skb function is how udp_sendmsg will eventually push an


skb down to the next layer of the networking stack, in this case the IP
protocol layer. This function does a few important things:

1. Adds a UDP header to the skb.


2. Deals with checksums: software checksums, hardware checksums,
or no checksum (if disabled).
3. Attempts to send the skb to the IP protocol layer by calling ip_send_skb .

4. Increments statistics counters for successful or failed transmissions.

Let’s take a look. First, a UDP header is created:

static int udp_send_skb(struct sk_buff *skb, struct flowi4 *fl4)


{
/* useful variables ... */

/*
* Create a UDP header
*/
uh = udp_hdr(skb);
uh->source = inet->inet_sport;
uh->dest = fl4->fl4_dport;
uh->len = htons(len);
uh->check = 0;

Next, checksumming is handled. There are a few cases:

1. UDP-Lite checksums are handled first.


2. Next, if the socket is set to not generate checksums at all (via
setsockopt with SO_NO_CHECK ), it will be marked as such.
3. Next, if the hardware supports UDP checksums, udp4_hwcsum will
be called to set that up. Note that the kernel will generate
checksums in software if the packet is fragmented. You can see this
in the source for udp4_hwcsum .
4. Lastly, a software checksum is generated with a call to udp_csum .

if (is_udplite) /* UDP-Lite */
csum = udplite_csum(skb);

else if (sk->sk_no_check == UDP_CSUM_NOXMIT) { /* UDP csum disabled */

skb->ip_summed = CHECKSUM_NONE;

goto send;

} else if (skb->ip_summed == CHECKSUM_PARTIAL) { /* UDP hardware csum */

udp4_hwcsum(skb, fl4->saddr, fl4->daddr);


goto send;

} else
csum = udp_csum(skb);
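The “checksum disabled” branch above corresponds to the Linux-specific SO_NO_CHECK socket option, which only affects IPv4 UDP transmit checksums. A hedged sketch (the fallback definition is an assumption for libcs that do not expose the constant):

#include <sys/socket.h>

#ifndef SO_NO_CHECK
#define SO_NO_CHECK 11	/* assumption: value from asm-generic/socket.h */
#endif

/* Ask the kernel not to generate UDP checksums on transmit for this socket,
 * so sk->sk_no_check matches the UDP_CSUM_NOXMIT case above. Disabling
 * checksums is rarely a good idea outside of controlled environments. */
static int disable_udp_tx_checksum(int sock)
{
	int one = 1;

	return setsockopt(sock, SOL_SOCKET, SO_NO_CHECK, &one, sizeof(one));
}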

Next, the pseudo header is added:

uh->check = csum_tcpudp_magic(fl4->saddr, fl4->daddr, len,


sk->sk_protocol, csum);
if (uh->check == 0)
uh->check = CSUM_MANGLED_0;

If the computed checksum is 0, its one’s complement equivalent (all ones) is set as the checksum instead, per RFC 768. Finally, the skb is passed to the IP protocol stack and statistics are incremented:

send:
err = ip_send_skb(sock_net(sk), skb);
if (err) {
if (err == -ENOBUFS && !inet->recverr) {
UDP_INC_STATS_USER(sock_net(sk),
UDP_MIB_SNDBUFERRORS, is_udplite);
err = 0;
}
} else
UDP_INC_STATS_USER(sock_net(sk),
UDP_MIB_OUTDATAGRAMS, is_udplite);
return err;

If ip_send_skb completes successfully, the OUTDATAGRAMS statistic is


incremented. If the IP protocol layer reports an error, SNDBUFERRORS is
incremented, but only if the error is ENOBUFS (lack of kernel memory) and
there is no error queue enabled.

Before moving on to the IP protocol layer, let’s take a look at how to


monitor and tune the UDP protocol layer in the Linux kernel.

Monitoring: UDP protocol layer statistics


Two very useful files for getting UDP protocol statistics are:

/proc/net/snmp
/proc/net/udp

/proc/net/snmp

Monitor detailed UDP protocol statistics by reading /proc/net/snmp .

$ cat /proc/net/snmp | grep Udp\:


Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufEr
Udp: 16314 0 0 17161 0 0

In order to understand precisely where these statistics are incremented, you will need to carefully read the kernel source. There are a few cases where some errors are counted in more than one statistic.

InDatagrams : Incremented when recvmsg was used by a userland program to read a datagram. Also incremented when a UDP packet is encapsulated and sent back for processing.
NoPorts : Incremented when UDP packets arrive destined for a port
where no program is listening.
InErrors : Incremented in several cases: no memory in the receive
queue, when a bad checksum is seen, and if sk_add_backlog fails
to add the datagram.
OutDatagrams : Incremented when a UDP packet is handed down
without error to the IP protocol layer to be sent.
RcvbufErrors : Incremented when sock_queue_rcv_skb reports that
no memory is available; this happens if sk->sk_rmem_alloc is
greater than or equal to sk->sk_rcvbuf .
SndbufErrors : Incremented if the IP protocol layer reported an
error when trying to send the packet and no error queue has been
setup. Also incremented if no send queue space or kernel memory
are available.
InCsumErrors : Incremented when a UDP checksum failure is detected. Note that in all cases I could find, InCsumErrors is incremented at the same time as InErrors . Thus, InErrors - InCsumErrors should yield the count of memory related errors on the receive side.

Note that some errors discovered by the UDP protocol layer are reported in
the statistics files for other protocol layers. One example of this: routing
errors. A routing error discovered by udp_sendmsg will cause an increment
to the IP protocol layer’s OutNoRoutes statistic.
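A small sketch (not from the original post) of how the two Udp rows of /proc/net/snmp — one row of field names, one row of values — can be paired up programmatically:

#include <stdio.h>
#include <string.h>

/* Print each UDP statistic from /proc/net/snmp as "name value". The file
 * contains a header row ("Udp: InDatagrams ...") followed by a value row
 * ("Udp: 16314 ..."); the two rows are read and then zipped together. */
int main(void)
{
	char names[512] = "", values[512] = "", line[512];
	FILE *f = fopen("/proc/net/snmp", "r");

	if (!f)
		return 1;

	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, "Udp:", 4) != 0)
			continue;	/* skips UdpLite: rows as well */
		if (!names[0])
			strncpy(names, line + 4, sizeof(names) - 1);
		else
			strncpy(values, line + 4, sizeof(values) - 1);
	}
	fclose(f);

	char *save_n = NULL, *save_v = NULL;
	char *n = strtok_r(names, " \n", &save_n);
	char *v = strtok_r(values, " \n", &save_v);

	while (n && v) {
		printf("%s %s\n", n, v);
		n = strtok_r(NULL, " \n", &save_n);
		v = strtok_r(NULL, " \n", &save_v);
	}
	return 0;
}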

/proc/net/udp

Monitor UDP socket statistics by reading /proc/net/udp

$ cat /proc/net/udp
sl local_address rem_address st tx_queue rx_queue tr tm->when r
515: 00000000:B346 00000000:0000 07 00000000:00000000 00:00000000
558: 00000000:0371 00000000:0000 07 00000000:00000000 00:00000000
588: 0100007F:038F 00000000:0000 07 00000000:00000000 00:00000000
769: 00000000:0044 00000000:0000 07 00000000:00000000 00:00000000
812: 00000000:006F 00000000:0000 07 00000000:00000000 00:00000000

The first line describes each of the fields in the lines following:

sl : Kernel hash slot for the socket.
local_address : Hexadecimal local address of the socket and port number, separated by : .
rem_address : Hexadecimal remote address of the socket and port number, separated by : .
st : The state of the socket. Oddly enough, the UDP protocol layer seems to use some TCP socket states. In the example above, 7 is TCP_CLOSE .
tx_queue : The amount of memory allocated in the kernel for outgoing UDP datagrams.
rx_queue : The amount of memory allocated in the kernel for incoming UDP datagrams.
tr , tm->when , retrnsmt : These fields are unused by the UDP protocol layer.
uid : The effective user id of the user who created this socket.
timeout : Unused by the UDP protocol layer.
inode : The inode number corresponding to this socket. You can use this to help you determine which user process has this socket open. Check /proc/[pid]/fd , which will contain symlinks to socket:[inode] (a short sketch of this appears just below).
ref : The current reference count for the socket.
pointer : The memory address in the kernel of the struct sock .
drops : The number of datagram drops associated with this socket. Note that this does not include any drops related to sending datagrams (on corked UDP sockets or otherwise); this is only incremented in receive paths as of the kernel version examined by this blog post.

The code which outputs this can be found in net/ipv4/udp.c .
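
As an example of mapping an inode back to a process, a standalone sketch (not part of the original post; the inode number is passed on the command line) can walk every /proc/[pid]/fd directory and readlink each descriptor looking for the matching socket:[inode] entry:

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        char want[64], path[64], link[512], target[128];
        DIR *proc, *fds;
        struct dirent *p, *f;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <socket-inode>\n", argv[0]);
                return 1;
        }
        snprintf(want, sizeof(want), "socket:[%s]", argv[1]);

        proc = opendir("/proc");
        if (!proc)
                return 1;

        while ((p = readdir(proc))) {
                if (!isdigit((unsigned char)p->d_name[0]))
                        continue;        /* only numeric directories are PIDs */
                snprintf(path, sizeof(path), "/proc/%s/fd", p->d_name);
                fds = opendir(path);
                if (!fds)
                        continue;        /* permission denied, process exited, ... */
                while ((f = readdir(fds))) {
                        ssize_t n;

                        if (f->d_name[0] == '.')
                                continue;
                        snprintf(link, sizeof(link), "%s/%s", path, f->d_name);
                        n = readlink(link, target, sizeof(target) - 1);
                        if (n > 0) {
                                target[n] = '\0';
                                if (strcmp(target, want) == 0)
                                        printf("pid %s fd %s -> %s\n",
                                               p->d_name, f->d_name, target);
                        }
                }
                closedir(fds);
        }
        closedir(proc);
        return 0;
}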

Tuning: Socket send queue memory

The maximum size of the send queue (also called the write queue) can be
adjusted by setting the net.core.wmem_max sysctl.

Increase the maximum send buffer size by setting a sysctl .

$ sudo sysctl -w net.core.wmem_max=8388608

sk->sk_write_queue starts at the net.core.wmem_default value, which can


also be adjusted by setting a sysctl, like so:

Adjust the default initial send buffer size by setting a sysctl .

$ sudo sysctl -w net.core.wmem_default=8388608

You can also set the sk->sk_write_queue size by calling setsockopt from your application and passing SO_SNDBUF . The maximum you can set with setsockopt is net.core.wmem_max .

However, you can override the net.core.wmem_max limit by calling


setsockopt and passing SO_SNDBUFFORCE , but the user running the application needs the CAP_NET_ADMIN capability.
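
For example, a minimal sketch (not from the original post; the 4 MiB value is arbitrary) that raises the send buffer on a UDP socket with SO_SNDBUF and reads back the value the kernel actually applied might look like this. The kernel doubles the requested value to account for bookkeeping overhead and caps it at net.core.wmem_max.

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int requested = 4 * 1024 * 1024;   /* 4 MiB, arbitrary for this sketch */
        int actual = 0;
        socklen_t len = sizeof(actual);

        if (fd < 0)
                return 1;

        if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
                       &requested, sizeof(requested)) < 0)
                perror("setsockopt(SO_SNDBUF)");

        /* Read back what the kernel actually set (doubled and capped). */
        if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &actual, &len) == 0)
                printf("send buffer is now %d bytes\n", actual);

        close(fd);
        return 0;
}

SO_SNDBUFFORCE is used the same way; only the option name changes and the caller needs CAP_NET_ADMIN.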

The sk->sk_wmem_alloc is incremented each time an skb is allocated by


calls to __ip_append_data . As we'll see, UDP datagrams are transmitted quickly
and typically don’t spend much time in the send queue.

IP protocol layer

The UDP protocol layer hands skbs down to the IP protocol by simply
calling ip_send_skb , so let’s start there and map out the IP protocol layer!

ip_send_skb

The ip_send_skb function is found in ./net/ipv4/ip_output.c and is very


short. It simply calls down to ip_local_out and bumps an error statistic if
ip_local_out returns an error of some sort. Let’s take a look:

int ip_send_skb(struct net *net, struct sk_buff *skb)
{
        int err;

        err = ip_local_out(skb);
        if (err) {
                if (err > 0)
                        err = net_xmit_errno(err);
                if (err)
                        IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
        }

        return err;
}

As seen above, ip_local_out is called and the return value is dealt with
after that. The call to net_xmit_errno helps to “translate” any errors from
lower levels into an error that is understood by the IP and UDP protocol
layers. If any error happens, the IP protocol statistic “OutDiscards” is
incremented. We'll see later which files to read to obtain this statistic. For
now, let’s continue down the rabbit hole and see where ip_local_out takes
us.

ip_local_out and __ip_local_out

Luckily for us, both ip_local_out and __ip_local_out are simple.


ip_local_out simply calls down to __ip_local_out and based on the
return value, will call into the routing layer to output the packet:

int ip_local_out(struct sk_buff *skb)


{
int err;

err = __ip_local_out(skb);
if (likely(err == 1))
err = dst_output(skb);

return err;
}

We can see from the source to __ip_local_out that the function does two
important things first:
1. Sets the length of the IP packet

2. Calls ip_send_check to compute the checksum to be written in the


IP packet header. The ip_send_check function will call a function
named ip_fast_csum to compute the checksum. On the x86 and
x86_64 architectures, this function is implemented in assembly. You
can read the 64bit implementation here and the 32bit
implementation here. A portable sketch of the same checksum algorithm follows below.
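
For reference, the checksum that ip_fast_csum computes in assembly is the standard 16-bit ones' complement sum over the IP header described in RFC 1071. The following is my own portable C sketch of that algorithm, not the kernel's code; the function name is made up and the checksum field is assumed to be zero when the sum is computed.

#include <stdint.h>
#include <stddef.h>

static uint16_t ip_header_checksum(const void *iph, size_t ihl_words)
{
        const uint16_t *p = iph;
        uint32_t sum = 0;
        size_t i;

        /* Sum the header as 16-bit words (two per 32-bit header word). */
        for (i = 0; i < ihl_words * 2; i++)
                sum += p[i];

        /* Fold any carries back into the low 16 bits. */
        while (sum > 0xffff)
                sum = (sum & 0xffff) + (sum >> 16);

        /* The ones' complement of the sum is the header checksum. */
        return (uint16_t)~sum;
}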

Next, the IP protocol layer will call down into netfilter by calling nf_hook .
The return value of the nf_hook function will be passed back up to
ip_local_out . If nf_hook returns 1 , this indicates that the packet was
allowed to pass and that the caller should pass it along itself. As we saw
above, this is precisely what happens: ip_local_out checks for the return
value of 1 and passes the packet on by calling dst_output itself. Let’s take
a look at the code for __ip_local_out :

int __ip_local_out(struct sk_buff *skb)


{
struct iphdr *iph = ip_hdr(skb);

iph->tot_len = htons(skb->len);
ip_send_check(iph);
return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT, skb, NULL,
skb_dst(skb)->dev, dst_output);
}

netfilter and nf_hook

In the interest of brevity (and my RSI), I’ve decided to skip my deep dive into
netfilter, iptables, and conntrack. You can dive into the source for netfilter by starting here and here.

The short version is that nf_hook is a wrapper which calls nf_hook_thresh . nf_hook_thresh first checks whether any filters are installed for the specified protocol family and hook type ( NFPROTO_IPV4 and NF_INET_LOCAL_OUT in this case, respectively); if there are none, it returns execution back to the IP protocol layer, avoiding a deeper trip into netfilter and anything that hooks in below it, like iptables and conntrack.

Keep in mind: if you have numerous or very complex netfilter or iptables
rules, those rules will be executed in the CPU context of the user process
which initiated the original sendmsg call. If you have CPU pinning set up to
restrict execution of this process to a particular CPU (or set of CPUs), be
aware that the CPU will spend system time processing outbound iptables
rules. Depending on your system’s workload, you may want to carefully pin
processes to CPUs or reduce the complexity of your ruleset if you measure a
performance regression here.

For the purposes of our discussion, let’s assume nf_hook returns 1


indicating that the caller (in this case, the IP protocol layer) should pass the
packet along itself.

Destination cache

The dst code implements the protocol independent destination cache in the Linux kernel. To understand how dst entries are set up to proceed with the sending of UDP datagrams, we need to briefly examine how dst entries and routes are generated. The destination cache, routing, and neighbour subsystems can all be examined in extreme detail on their own. For our purposes, we can take a quick look to see how this all fits together.

The code we’ve seen above calls dst_output(skb) . This function simply
looks up the dst entry attached to the skb and calls the output function.
Let’s take a look:

/* Output packet to network from transport. */


static inline int dst_output(struct sk_buff *skb)
{
return skb_dst(skb)->output(skb);
}

Seems simple enough, but how does that output function get attached to
the dst entry in the first place?

It’s important to understand that destination cache entries are added in


many different ways. One way we’ve seen so far in the code path we’ve
been following is with the call to ip_route_output_flow from
udp_sendmsg . The ip_route_output_flow function calls
__ip_route_output_key which calls __mkroute_output . The
__mkroute_output function creates the route and the destination cache
entry. When it does so, it determines which of the output functions is
appropriate for this destination. Most of the time, this function is
ip_output .

ip_output

So, dst_output executes the output function, which in the UDP IPv4 case is ip_output . The ip_output function is straightforward:

int ip_output(struct sk_buff *skb)


{
struct net_device *dev = skb_dst(skb)->dev;

IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUT, skb->len);

skb->dev = dev;
skb->protocol = htons(ETH_P_IP);

return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING, skb, NULL, dev,


ip_finish_output,
!(IPCB(skb)->flags & IPSKB_REROUTED));
}

First, the IPSTATS_MIB_OUT statistics counter is updated. The IP_UPD_PO_STATS macro will increment both the number of bytes and the number of packets. We'll see in a later section how to obtain the IP protocol layer statistics and what each of them means. Next, the device this skb will be transmitted on is set, as is the protocol.

Finally, control is passed off to net lter with a call to NF_HOOK_COND .


Looking at the function prototype for NF_HOOK_COND will help make the
explanation of how it works a bit clearer. From ./include/linux/netfilter.h:

static inline int


NF_HOOK_COND(uint8_t pf, unsigned int hook, struct sk_buff *skb,
struct net_device *in, struct net_device *out,
int (*okfn)(struct sk_buff *), bool cond)

NF_HOOK_COND works by checking the conditional, which is passed in. In this


case, that conditional is !(IPCB(skb)->flags & IPSKB_REROUTED) . If this conditional is true, then the skb will be passed on to netfilter. If netfilter

allows the packet to pass, the okfn is called. In this case, the okfn is
ip_finish_output .

ip_finish_output

The ip_finish_output function is also short and clear. Let’s take a look:

static int ip_finish_output(struct sk_buff *skb)


{
#if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM)
/* Policy lookup after SNAT yielded a new policy */
if (skb_dst(skb)->xfrm != NULL) {
IPCB(skb)->flags |= IPSKB_REROUTED;
return dst_output(skb);
}
#endif
if (skb->len > ip_skb_dst_mtu(skb) && !skb_is_gso(skb))
return ip_fragment(skb, ip_finish_output2);
else
return ip_finish_output2(skb);
}

If netfilter and packet transformation are enabled in this kernel, the skb 's flags are updated and it is sent back through dst_output . The two more
common cases are:

1. If the packet's length is larger than the MTU and the packet's segmentation will not be offloaded to the device, ip_fragment is called to help fragment the packet prior to transmission.
2. Otherwise, the packet is passed straight through to
ip_finish_output2 .

Let’s take a short detour to talk about Path MTU Discovery before continuing
our way through the kernel.

Path MTU Discovery

Linux provides a feature I’ve avoided mentioning until now: Path MTU
Discovery. This feature allows the kernel to automatically determine the
largest MTU for a particular route. Determining this value and sending
packets that are less than or equal to the MTU for the route means that IP
fragmentation can be avoided. This is the preferred setting because
fragmenting packets consumes system resources and is seemingly easy to
avoid: simply send small enough packets and fragmentation is unnecessary.

You can adjust the Path MTU Discovery settings on a per-socket basis by
calling setsockopt in your application with the SOL_IP level and
IP_MTU_DISCOVER optname. The optval can be one of the several values
described in the IP protocol man page. The value you’ll likely want to set is:
IP_PMTUDISC_DO which means “Always do Path MTU Discovery.” More
advanced network applications or diagnostic tools may choose to
implement RFC 4821 themselves to determine the PMTU at application
start for a particular route or routes. In this case, you can use the
IP_PMTUDISC_PROBE option which tells the kernel to set the “Don’t
Fragment” bit, but allows you to send data larger than the PMTU.

Your application can retrieve the PMTU by calling getsockopt , with the
SOL_IP and IP_MTU optname. You can use this to help guide the size of the
UDP datagrams your application will construct prior to attempting
transmissions.
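
As a concrete example, a minimal sketch (not from the original post; the destination 192.0.2.1 and port 9 are placeholders) that enables Path MTU Discovery with IP_PMTUDISC_DO and then reads the current PMTU with IP_MTU on a connected UDP socket could look like this:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int pmtudisc = IP_PMTUDISC_DO;   /* always set the "Don't Fragment" bit */
        int mtu = 0;
        socklen_t len = sizeof(mtu);
        struct sockaddr_in dst;

        if (fd < 0)
                return 1;

        setsockopt(fd, SOL_IP, IP_MTU_DISCOVER, &pmtudisc, sizeof(pmtudisc));

        /* IP_MTU is only meaningful on a connected socket. */
        memset(&dst, 0, sizeof(dst));
        dst.sin_family = AF_INET;
        dst.sin_port = htons(9);
        inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);
        connect(fd, (struct sockaddr *)&dst, sizeof(dst));

        if (getsockopt(fd, SOL_IP, IP_MTU, &mtu, &len) == 0)
                printf("path MTU toward 192.0.2.1: %d bytes\n", mtu);

        close(fd);
        return 0;
}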

If you have enabled PMTU discovery, any attempt to send UDP data larger than the PMTU will result in the application receiving the error code
EMSGSIZE . The application can then retry, but with less data.

Enabling PMTU discovery is strongly encouraged, so I'll avoid describing the IP fragmentation code path in detail. When we take a look at the IP protocol layer statistics, I'll explain all the statistics including the fragmentation related statistics. Many of them are incremented in ip_fragment . In both the fragment and non-fragment cases ip_finish_output2 is called, so let's continue there.

ip_finish_output2

The ip_finish_output2 function is called after IP fragmentation and also directly


from ip_finish_output . This function handles bumping various statistics
counters prior to handing the packet down to the neighbour cache. Let’s see
how this works:

static inline int ip_finish_output2(struct sk_buff *skb)


{

/* variable declarations */

if (rt->rt_type == RTN_MULTICAST) {
IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUTMCAST, skb->len);
} else if (rt->rt_type == RTN_BROADCAST)
IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUTBCAST, skb->len);

/* Be paranoid, rather than too clever. */


if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
struct sk_buff *skb2;

        skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));
        if (skb2 == NULL) {
                kfree_skb(skb);
                return -ENOMEM;
        }
if (skb->sk)
skb_set_owner_w(skb2, skb->sk);
consume_skb(skb);
skb = skb2;
}

If the routing structure associated with this packet is of type multicast, both
the OutMcastPkts and OutMcastOctets counters are bumped by using the
IP_UPD_PO_STATS macro. Otherwise, if the route type is broadcast the
OutBcastPkts and OutBcastOctets counters are bumped.

Next, a check is performed to ensure that the skb structure has enough
room for any link layer headers that need to be added. If not, additional
room is allocated with a call to skb_realloc_headroom and the cost of the
new skb is charged to the associated socket.

rcu_read_lock_bh();
nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
if (unlikely(!neigh))
neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);

Continuing on, we can see that the next hop is computed by querying the routing layer followed by a lookup against the neighbour cache. If the neighbour is not found, one is created by calling __neigh_create . This could be the case, for example, the first time data is sent to another host. Note that this function is called with arp_tbl (defined in ./net/ipv4/arp.c) to create the neighbour entry in the ARP table. Other systems (like IPv6 or DECnet) maintain their own ARP tables and would pass a different structure into __neigh_create . This post does not aim to cover the neighbour cache in full detail, but it is worth noting that if the neighbour has to be created, it is possible that this creation can cause the cache to grow. This post will cover some more details about the neighbour cache in the sections below. At any rate, the neighbour cache exports its own set of statistics so that this growth can be measured. See the monitoring sections below for more information.


if (!IS_ERR(neigh)) {
int res = dst_neigh_output(dst, neigh, skb);

rcu_read_unlock_bh();
return res;
}
rcu_read_unlock_bh();

net_dbg_ratelimited("%s: No header cache and no neighbour!\n",


__func__);
kfree_skb(skb);
return -EINVAL;
}

Finally, if no error is returned, dst_neigh_output is called to pass the skb


along on its journey to be output. Otherwise, the skb is freed and EINVAL is
returned. An error here will ripple back and cause OutDiscards to be
incremented way back up in ip_send_skb . Let’s continue on in
dst_neigh_output and continue approaching the Linux kernel’s netdevice
subsystem.

dst_neigh_output

The dst_neigh_output function does two important things for us. First, recall from earlier in this blog post that if a user specified MSG_CONFIRM when calling sendmsg , a flag is flipped to indicate that the destination cache entry for the remote host is still valid and should not be garbage collected. That check happens here and the confirmed field on the neighbour is set to the current jiffies count.

static inline int dst_neigh_output(struct dst_entry *dst, struct neighbour *n,


struct sk_buff *skb)
{
const struct hh_cache *hh;

if (dst->pending_confirm) {
unsigned long now = jiffies;

dst->pending_confirm = 0;
/* avoid dirtying neighbour */
if (n->confirmed != now)
n->confirmed = now;
}

Second, the neighbour’s state is checked and the appropriate output


function is called. Let’s take a look at the conditional and try to understand
what’s going on:

hh = &n->hh;
if ((n->nud_state & NUD_CONNECTED) && hh->hh_len)
return neigh_hh_output(hh, skb);
else
return n->output(n, skb);
}

If a neighbour is considered NUD_CONNECTED , meaning it is one or more of:

NUD_PERMANENT : A static route.


NUD_NOARP : Does not require an ARP request (for example, the
destination is a multicast or broadcast address, or a loopback
device).
NUD_REACHABLE : The neighbour is “reachable.” A destination is
marked as reachable whenever an ARP request for it is successfully
processed.

and the “hardware header” ( hh ) is cached (because we’ve sent data before
and have previously generated it), call neigh_hh_output . Otherwise, call the
output function. Both code paths end with dev_queue_xmit , which passes the
skb down to the Linux net device subsystem where it will be processed a bit
more before hitting the device driver layer. Let’s follow both the
neigh_hh_output and n->output code paths until we reach
dev_queue_xmit .

neigh_hh_output

If the destination is NUD_CONNECTED and the hardware header has been


cached, neigh_hh_output will be called, which does a small bit of
processing before handing the skb over to dev_queue_xmit . Let’s take a
look, from ./include/net/neighbour.h:

static inline int neigh_hh_output(const struct hh_cache *hh, struct sk_buff *skb)
{
        unsigned int seq;
        int hh_len;

        do {
                seq = read_seqbegin(&hh->hh_lock);
                hh_len = hh->hh_len;
                if (likely(hh_len <= HH_DATA_MOD)) {
                        /* this is inlined by gcc */
                        memcpy(skb->data - HH_DATA_MOD, hh->hh_data, HH_DATA_MOD);
                } else {
                        int hh_alen = HH_DATA_ALIGN(hh_len);

                        memcpy(skb->data - hh_alen, hh->hh_data, hh_alen);
                }
        } while (read_seqretry(&hh->hh_lock, seq));

        skb_push(skb, hh_len);
        return dev_queue_xmit(skb);
}

This function is a bit tricky to understand, partially due to the locking primitive used to synchronize reading and writing of the cached hardware header. This code uses something called a seqlock. You can imagine the do { } while() loop above as a simple retry mechanism which will attempt to perform the operations in the loop until they can be performed successfully.

The loop body checks whether the hardware header's length needs to be aligned prior to being copied. This is required because some hardware headers (like the IEEE 802.11 header) are larger than HH_DATA_MOD (16 bytes).

Once the data is copied to the skb and the skb’s internal pointers tracking
the data are updated with skb_push , the skb is passed to dev_queue_xmit
to enter the Linux net device subsystem.

n->output
If the destination is not NUD_CONNECTED or the hardware header has not been cached, the code proceeds down the n->output path. What is attached to the output function pointer on the neighbour structure? Well, it depends. To understand how this is set up, we'll need to understand a bit more about how the neighbour cache works.

A struct neighbour contains several important fields: the nud_state field we saw above, an output function, and an ops structure. Recall how earlier we saw that __neigh_create is called from ip_finish_output2 if no existing entry was found in the cache. When __neigh_create is called, a neighbour is allocated with its output function initially set to neigh_blackhole . As the __neigh_create code progresses, it will adjust the value of output to point to appropriate output functions based on the state of the neighbour.

For example, neigh_connect will be used to set the output pointer to


neigh->ops->connected_output when the code determines the neighbour to
be connected. Alternatively, neigh_suspect will be used to set the output
pointer to neigh->ops->output when the code suspects that the neighbour
may be down (for example, if it has been more than
/proc/sys/net/ipv4/neigh/default/delay_first_probe_time seconds since
a probe was sent).

In other words: neigh->output is set to another pointer, either neigh->ops->connected_output or neigh->ops->output , depending on its state. Where does neigh->ops come from?

After the neighbour is allocated, arp_constructor (from ./net/ipv4/arp.c) is called to set some of the fields of the struct neighbour . In particular, this function checks the device associated with this neighbour and, if the device exposes a header_ops structure that contains a cache function (ethernet devices do), neigh->ops is set to the following structure defined in ./net/ipv4/arp.c:

static const struct neigh_ops arp_hh_ops = {


.family = AF_INET,
.solicit = arp_solicit,
.error_report = arp_error_report,
.output = neigh_resolve_output,
.connected_output = neigh_resolve_output,
};

So, regardless of whether or not the neighbour is considered “connected” or


“suspect” by the neighbour cache code, the neigh_resolve_output function
will be attached to neigh->output and will be called when n->output is
called above.

neigh_resolve_output

This function’s purpose is to attempt to resolve a neighbour that is not


connected or one which is connected, but has no cached hardware header.
Let’s take a look at how this function works:

/* Slow and careful. */

int neigh_resolve_output(struct neighbour *neigh, struct sk_buff *skb)


{
struct dst_entry *dst = skb_dst(skb);
int rc = 0;

if (!dst)
goto discard;
        if (!neigh_event_send(neigh, skb)) {
                int err;
struct net_device *dev = neigh->dev;


unsigned int seq;

The code starts by doing some basic checks and proceeds to call neigh_event_send . The neigh_event_send function is a short wrapper around __neigh_event_send which will do the heavy lifting to resolve the neighbour. You can read the source for __neigh_event_send in ./net/core/neighbour.c, but the high-level takeaway from the code is that there are three cases users will be most interested in:

1. Neighbours in state NUD_NONE (the default state when allocated)


will cause an immediate ARP request to be sent assuming the
values set in /proc/sys/net/ipv4/neigh/default/app_solicit and
/proc/sys/net/ipv4/neigh/default/mcast_solicit allow probes to
be sent (if not, the state is marked as NUD_FAILED ). The neighbour
state will be updated and set to NUD_INCOMPLETE .
2. Neighbours in state NUD_STALE will be updated to NUD_DELAYED and
a timer will be set to probe them later (later is the time now +
/proc/sys/net/ipv4/neigh/default/delay_first_probe_time
seconds).
3. Any neighbours in NUD_INCOMPLETE (including things from case 1
above) will be checked to ensure that the number of queued
packets for an unresolved neighbour is less than or equal to
/proc/sys/net/ipv4/neigh/default/unres_qlen . If there are more,
packets are dequeued and dropped until the length is below or
equal to the value in proc. A statistics counter in the neighbour
cache stats is bumped for all occurrences of this.

If an immediate ARP probe is needed, it will be sent. __neigh_event_send will return either 0 , indicating that the neighbour is considered "connected" or "delayed", or 1 otherwise. The return value of 0 allows neigh_resolve_output to continue:

if (dev->header_ops->cache && !neigh->hh.hh_len)


neigh_hh_init(neigh, dst);

If the device’s protocol implementation (ethernet in our case) associated


with the neighbour supports caching the hardware header and it is
currently not cached, the call to neigh_hh_init will cache it.

do {
__skb_pull(skb, skb_network_offset(skb));
seq = read_seqbegin(&neigh->ha_lock);
err = dev_hard_header(skb, dev, ntohs(skb->protocol),
neigh->ha, NULL, skb->len);
} while (read_seqretry(&neigh->ha_lock, seq));

Next, a seqlock is used to synchronize access to the neighbour structure’s


hardware address which will be read by dev_hard_header when attempting
to create the ethernet header for the skb. Once the seqlock has allowed
execution to continue, error checking takes place:

if (err >= 0)
rc = dev_queue_xmit(skb);
else
goto out_kfree_skb;
}

If the ethernet header was written without returning an error, the skb is handed down to dev_queue_xmit to pass through the Linux network device subsystem for transmit. If there was an error, a goto will drop the skb, set the return code, and return the error:

out:
return rc;
discard:
neigh_dbg(1, "%s: dst=%p neigh=%p\n", __func__, dst, neigh);
out_kfree_skb:
rc = -EINVAL;
kfree_skb(skb);
goto out;
}
EXPORT_SYMBOL(neigh_resolve_output);

Before we proceed into the Linux network device subsystem, let's take a look at some files for monitoring and tuning the IP protocol layer.

Monitoring: IP protocol layer

/proc/net/snmp

Monitor detailed IP protocol statistics by reading /proc/net/snmp .

$ cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDa
Ip: 1 64 25922988125 0 0 15771700 0 0 25898327616 22789396404 129878
...

This file contains statistics for several protocol layers. The IP protocol layer appears first. The first line contains space-separated names for each of the corresponding values in the next line.

In the IP protocol layer, you will find statistics counters being bumped. Those counters are referenced by a C enum. All of the valid enum values and the field names they correspond to in /proc/net/snmp can be found in include/uapi/linux/snmp.h:

enum
{
IPSTATS_MIB_NUM = 0,
/* frequently written fields in fast path, kept in same cache line */
IPSTATS_MIB_INPKTS, /* InReceives */
IPSTATS_MIB_INOCTETS, /* InOctets */
IPSTATS_MIB_INDELIVERS, /* InDelivers */
IPSTATS_MIB_OUTFORWDATAGRAMS, /* OutForwDatagrams */
IPSTATS_MIB_OUTPKTS, /* OutRequests */
IPSTATS_MIB_OUTOCTETS, /* OutOctets */

/* ... */

Some interesting statistics:

OutRequests : Incremented each time an IP packet is attempted to


be sent. It appears that this is incremented for every send,
successful or not.
OutDiscards : Incremented each time an IP packet is discarded. This
can happen if appending data to the skb (for corked sockets) fails,
or if the layers below IP return an error.
OutNoRoute : Incremented in several places, for example in the UDP protocol layer ( udp_sendmsg ) if no route can be generated for a given destination. Also incremented when an application calls "connect" on a UDP socket but no route can be found.

FragOKs : Incremented once per packet that is fragmented. For


example, a packet split into 3 fragments will cause this counter to
be incremented once.
FragCreates : Incremented once per fragment that is created. For
example, a packet split into 3 fragments will cause this counter to
be incremented thrice.
FragFails : Incremented if fragmentation was attempted, but is not
permitted (because the “Don’t Fragment” bit is set). Also
incremented if outputting the fragment fails.

Other statistics are documented in the receive side blog post.

/proc/net/netstat

Monitor extended IP protocol statistics by reading /proc/net/netstat .

$ cat /proc/net/netstat | grep IpExt


IpExt: InNoRoutes InTruncatedPkts InMcastPkts OutMcastPkts InBcastPk
IpExt: 0 0 0 0 277959 0 14568040307695 32991309088496 0 0 58649349 0

The format is similar to /proc/net/snmp , except the lines are prefixed with
IpExt .

Some interesting statistics:

OutMcastPkts : Incremented each time a packet destined for a


multicast address is sent.
OutBcastPkts : Incremented each time a packet destined for a
broadcast address is sent.
OutOctets : The number of packet bytes output.
OutMcastOctets : The number of multicast packet bytes output.

OutBcastOctets : The number of broadcast packet bytes output.

Other statistics are documented in the receive side blog post.

Note that each of these is incremented in really specific locations in the IP


layer. Code gets moved around from time to time and double counting
errors or other accounting bugs can sneak in. If these statistics are
important to you, you are strongly encouraged to read the IP protocol layer
source code for the metrics that are important to you so you understand
when they are (and are not) being incremented.

Linux netdevice subsystem

Before we pick up on the packet transmit path with dev_queue_xmit , let’s


take a moment to talk about some important concepts which will appear in
the coming sections.

Linux traffic control

Linux supports a feature called traffic control. This feature allows system administrators to control how packets are transmitted from a machine. This blog post will not dive into the details of every aspect of Linux traffic control. This document provides a great in-depth examination of the system, its control, and its features. There are a few concepts that are worth mentioning to make the code seen next easier to understand.

The traffic control system contains several different sets of queuing systems that provide different features for controlling traffic flow. Individual queuing systems are commonly called qdisc and also known as queuing disciplines.
We use cookies to enhance the user experience on packagecloud.
By using our site, you acknowledge that you have read and understand our 
Cookie Policy, Privacy Policy, and our Terms of Service. back to top
https://blog.packagecloud.io/eng/2017/02/06/monitoring-tuning-linux-networking-stack-sending-data/ 63/151
3/26/2019 Monitoring and Tuning the Linux Networking Stack: Sending Data - Packagecloud Blog

You can think of qdiscs as schedulers; qdiscs decide when and how packets
are transmitted.

On Linux every interface has a default qdisc associated with it. For network
hardware that supports only a single transmit queue, the default qdisc
pfifo_fast is used. Network hardware that supports multiple transmit
queues uses the default qdisc of mq . You can check your system by running
tc qdisc .

It is also important to note that some devices support traffic control in hardware, which can allow an administrator to offload traffic control to the network hardware and conserve CPU resources on the system.

Now that those ideas have been introduced, let’s proceed down
dev_queue_xmit from ./net/core/dev.c.

dev_queue_xmit and __dev_queue_xmit

dev_queue_xmit is a simple wrapper around __dev_queue_xmit :

int dev_queue_xmit(struct sk_buff *skb)


{
return __dev_queue_xmit(skb, NULL);
}
EXPORT_SYMBOL(dev_queue_xmit);

Following that, __dev_queue_xmit is where the heavy lifting gets done. Let’s
take a look and step through this code piece by piece. Follow along:

static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)


{
        struct net_device *dev = skb->dev;
        struct netdev_queue *txq;

struct Qdisc *q;


int rc = -ENOMEM;

skb_reset_mac_header(skb);

/* Disable soft irqs for various locks below. Also


* stops preemption for RCU.
*/
rcu_read_lock_bh();

skb_update_prio(skb);

The code above starts out by:

1. Declaring variables.
2. Preparing the skb to be processed by calling
skb_reset_mac_header . This resets the skb’s internal pointers so
that the ethernet header can be accessed.
3. rcu_read_lock_bh is called to prepare for reading RCU protected
data structures in the code below. Read more about safely using
RCU.
4. skb_update_prio is called to set the skb’s priority, if the network
priority cgroup is being used.

Now, we’ll get to the more complicated parts of transmitting data ;)

txq = netdev_pick_tx(dev, skb, accel_priv);

Here the code attempts to determine which transmit queue to use. As you’ll
see later in this post, some network devices expose multiple transmit
queues for transmitting data. Let’s see how this works in detail.
netdev_pick_tx

The netdev_pick_tx code lives in ./net/core/flow_dissector.c. Let's take a


look:

struct netdev_queue *netdev_pick_tx(struct net_device *dev,


struct sk_buff *skb,
void *accel_priv)
{
int queue_index = 0;

if (dev->real_num_tx_queues != 1) {
const struct net_device_ops *ops = dev->netdev_ops;
if (ops->ndo_select_queue)
queue_index = ops->ndo_select_queue(dev, skb,
accel_priv);
else
queue_index = __netdev_pick_tx(dev, skb);

if (!accel_priv)
queue_index = dev_cap_txqueue(dev, queue_index);
}

skb_set_queue_mapping(skb, queue_index);
return netdev_get_tx_queue(dev, queue_index);
}

As you can see above, if the network device supports only a single TX queue, the more complex code is skipped and that single TX queue is returned. Most devices used on higher end servers will have multiple TX queues. There are two cases for devices with multiple TX queues:

1. The driver implements ndo_select_queue , which can be used to choose a TX queue more intelligently in a hardware or feature specific way, or
2. The driver does not implement ndo_select_queue , so the kernel should pick the queue itself.

As of the 3.13 kernel, not many drivers implement ndo_select_queue . The


bnx2x and ixgbe drivers implement this function, but it is only used for fibre
channel over ethernet (FCoE). In light of this, let’s assume that the network
device does not implement ndo_select_queue and/or that FCoE is not being
used. In that case, the kernel will choose the tx queue with
__netdev_pick_tx .

Once __netdev_pick_tx determines the queue index, skb_set_queue_mapping will cache that value (it will be used later in the traffic control code) and netdev_get_tx_queue will look up and return a
pointer to that queue. Let’s take a look at how __netdev_pick_tx works
before going back up to __dev_queue_xmit .

__netdev_pick_tx

Let’s take a look at how the kernel chooses the TX queue to use for
transmitting data. From ./net/core/flow_dissector.c:

u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)


{
struct sock *sk = skb->sk;
int queue_index = sk_tx_queue_get(sk);

if (queue_index < 0 || skb->ooo_okay ||


            queue_index >= dev->real_num_tx_queues) {
                int new_index = get_xps_queue(dev, skb);

                if (new_index < 0)
new_index = skb_tx_hash(dev, skb);

if (queue_index != new_index && sk &&


rcu_access_pointer(sk->sk_dst_cache))
sk_tx_queue_set(sk, new_index);

queue_index = new_index;
}

return queue_index;
}

The code begins by first checking if the transmit queue has already been cached on the socket by calling sk_tx_queue_get . If it hasn't been cached, -1 is returned.

The next if-statement checks if any of the following are true:

The queue_index is < 0. This will happen if the queue hasn't been set yet.
If the ooo_okay flag is set. If this flag is set, this means that out of order packets are allowed now. The protocol layers must set this flag appropriately. The TCP protocol layer sets this flag when all outstanding packets for a flow have been acknowledged. When this happens, the kernel can choose a different TX queue for this packet. The UDP protocol layer does not set this flag – so UDP packets will never have ooo_okay set to a non-zero value.
If the queue index is larger than the number of queues. This can happen if the user has recently changed the queue count on the device via ethtool . More on this later.

In any of those cases, the code descends into the slow path to get the transmit queue. This begins with get_xps_queue , which attempts to use a user-configured map linking transmit queues to CPUs. This is called "Transmit Packet Steering." We'll look more closely at what Transmit Packet Steering (XPS) is and how it works shortly.
If get_xps_queue returns -1 because this kernel does not support XPS, or


XPS was not configured by the system administrator, or the mapping configured refers to an invalid queue, the code will continue on to call
skb_tx_hash .

Once the queue is selected by either XPS or by the kernel automatically


with skb_tx_hash , the queue is cached on the socket object with
sk_tx_queue_set and returned. Let’s see how XPS and skb_tx_hash work
before continuing through dev_queue_xmit .

Transmit Packet Steering (XPS)

Transmit Packet Steering (XPS) is a feature that allows the system


administrator to determine which CPUs can process transmit operations for
each available transmit queue supported by the device. The aim of this
feature is mainly to avoid lock contention when processing transmit
requests. Other benefits like reducing cache evictions and avoiding remote
memory access on NUMA machines are also expected when using XPS.

You can read more about how XPS works by checking the kernel
documentation for XPS. We’ll examine how to tune XPS for your system
below, but for now, all you need to know is that to configure XPS the system administrator can define a bitmap mapping transmit queues to CPUs.

The function call in the code above to get_xps_queue will consult this user-
specified map in order to determine which transmit queue should be used.
If get_xps_queue returns -1 , skb_tx_hash will be used instead.
skb_tx_hash

If XPS is not included in the kernel, or is not configured, or suggests a


queue that is not available (because perhaps the user adjusted the queue
count) skb_tx_hash takes over to determine which queue the data should
be sent on. Understanding precisely how skb_tx_hash works is important
depending on your transmit workload. Note that this code has been
adjusted over time, so if you are using a different kernel version than this
document, you should consult your kernel source directly.

Let’s take a look at how it works, from ./include/linux/netdevice.h:

/*
* Returns a Tx hash for the given packet when dev->real_num_tx_queues is used
* as a distribution range limit for the returned value.
*/
static inline u16 skb_tx_hash(const struct net_device *dev,
const struct sk_buff *skb)
{
return __skb_tx_hash(dev, skb, dev->real_num_tx_queues);
}

The code simply calls down to __skb_tx_hash , from


./net/core/flow_dissector.c. There's some interesting code in this function, so
let’s take a look:

/*
* Returns a Tx hash based on the given packet descriptor a Tx queues' number
* to be used as a distribution range.
*/
u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
unsigned int num_tx_queues)
{
u32 hash;
        u16 qoffset = 0;
        u16 qcount = num_tx_queues;

if (skb_rx_queue_recorded(skb)) {
hash = skb_get_rx_queue(skb);
while (unlikely(hash >= num_tx_queues))
hash -= num_tx_queues;
return hash;
}

The first if stanza in this function is an interesting short circuit. The function name skb_rx_queue_recorded is a bit misleading. An skb has a queue_mapping field that is used both for rx and tx. At any rate, this if
statement can be true if your system is receiving packets and forwarding
them elsewhere. If that isn’t the case, the code continues.

if (dev->num_tc) {
u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
qoffset = dev->tc_to_txq[tc].offset;
qcount = dev->tc_to_txq[tc].count;
}

To understand this piece of code, it is important to mention that a program


can set the priority of data sent on a socket. This can be done by using
setsockopt with the SOL_SOCKET and SO_PRIORITY level and optname,
respectively. See the socket(7) man page for more information about
SO_PRIORITY .

Note that if you have used the setsockopt option IP_TOS to set the TOS flags on the IP packets sent on a particular socket (or on a per-packet basis if passed as an ancillary message to sendmsg ) in your application, the kernel will translate the TOS options set by you to a priority which ends up in skb->priority .
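
As an illustration, a minimal sketch (not from the original post; the priority value 6 is arbitrary) that sets a socket's priority with SO_PRIORITY looks like this. Values from 0 to 6 are allowed for unprivileged processes; higher values require CAP_NET_ADMIN.

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int priority = 6;   /* arbitrary example value, 0..6 without CAP_NET_ADMIN */

        if (fd < 0)
                return 1;

        if (setsockopt(fd, SOL_SOCKET, SO_PRIORITY,
                       &priority, sizeof(priority)) < 0)
                perror("setsockopt(SO_PRIORITY)");

        close(fd);
        return 0;
}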
As was mentioned earlier, some network devices support hardware based traffic control systems. If num_tc is non-zero, that means this device supports hardware based traffic control. In that case, the priority map, which maps packet priority to a hardware traffic class, will be consulted, and the appropriate traffic class for the data's priority will be selected based on this map.

Next, the range of appropriate transmit queues for the traffic class will be
generated. They will be used to determine the transmit queue.

If num_tc was zero (because the network device does not support hardware
based traffic control), the qcount and qoffset variables are set to the
number of transmit queues and 0 , respectively.

Using qcount and qoffset , the index of the transmit queue will be
calculated:

if (skb->sk && skb->sk->sk_hash)


hash = skb->sk->sk_hash;
else
hash = (__force u16) skb->protocol;
hash = __flow_hash_1word(hash);

return (u16) (((u64) hash * qcount) >> 32) + qoffset;


}
EXPORT_SYMBOL(__skb_tx_hash);

Finally, the appropriate queue index is returned back up to


__netdev_pick_tx .
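
The return statement above maps the 32-bit hash onto the queue range by scaling rather than by a modulo: the hash is treated as a fraction of 2^32 and multiplied by qcount, then shifted into the [qoffset, qoffset + qcount) range. A small standalone sketch of that arithmetic (the helper name and values are made up for illustration) looks like this:

#include <stdint.h>
#include <stdio.h>

static uint16_t pick_queue(uint32_t hash, uint16_t qcount, uint16_t qoffset)
{
        /* (hash / 2^32) * qcount, computed in 64 bits to avoid overflow */
        return (uint16_t)(((uint64_t)hash * qcount) >> 32) + qoffset;
}

int main(void)
{
        /* 8 transmit queues starting at queue 0 */
        printf("%u\n", pick_queue(0x00000000u, 8, 0)); /* prints 0 */
        printf("%u\n", pick_queue(0x80000000u, 8, 0)); /* prints 4 */
        printf("%u\n", pick_queue(0xffffffffu, 8, 0)); /* prints 7 */
        return 0;
}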


Resuming __dev_queue_xmit

At this point the appropriate transmit queue has been selected.


__dev_queue_xmit can continue:

q = rcu_dereference_bh(txq->qdisc);

#ifdef CONFIG_NET_CLS_ACT
skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_EGRESS);
#endif
trace_net_dev_queue(skb);
if (q->enqueue) {
rc = __dev_xmit_skb(skb, q, dev, txq);
goto out;
}

It starts by obtaining a reference to the queuing discipline associated with


this queue. Recall that earlier we saw that the default for single transmit
queue devices is the pfifo_fast qdisc, whereas for multiqueue devices it is
the mq qdisc.

Next, the code assigns a traffic classification "verdict" to the outgoing data, if the packet classification API has been enabled in your kernel. Next, the
queue discipline is checked to see if there is a way to queue data. Some
queuing disciplines like the noqueue qdisc do not have a queue. If there is a
queue, the code calls down to __dev_xmit_skb to continue processing the
data for transmit. Afterward, execution jumps to the end of this function.
We'll take a look at __dev_xmit_skb shortly. For now, let's see what happens if there is no queue, starting with a very helpful comment:

/* The device has no queue. Common case for software devices:


loopback, all the sorts of tunnels...

Really, it is unlikely that netif_tx_lock protection is necessary


here. (f.e. loopback and IP tunnels are clean ignoring statistics
counters.)
However, it is possible, that they rely on protection
made by us here.

Check this and shot the lock. It is not prone from deadlocks.
Either shot noqueue qdisc, it is even simpler 8)
*/
if (dev->flags & IFF_UP) {
int cpu = smp_processor_id(); /* ok because BHs are off */

As the comment illustrates, the only devices that could have a qdisc with no queues are the loopback device and tunnel devices. If the device is currently up, then the current CPU is saved. It is used for the next check, which is a bit tricky, so let's take a look:

if (txq->xmit_lock_owner != cpu) {

if (__this_cpu_read(xmit_recursion) > RECURSION_LIMIT)


goto recursion_alert;

There are two cases: either the transmit lock on this device queue is owned by this CPU, or it is not. If it is, a per-CPU counter variable, xmit_recursion , is checked here to determine whether the count is over RECURSION_LIMIT . It is possible that one program could attempt to send data and get preempted right around this place in the code, another program could then be selected by the scheduler to run, and that second program could attempt to send data as well and land here too. So, the xmit_recursion counter is used to prevent more than RECURSION_LIMIT programs from racing here to transmit data. Let's keep going:

HARD_TX_LOCK(dev, txq, cpu);

if (!netif_xmit_stopped(txq)) {
__this_cpu_inc(xmit_recursion);
rc = dev_hard_start_xmit(skb, dev, txq);
__this_cpu_dec(xmit_recursion);
if (dev_xmit_complete(rc)) {
HARD_TX_UNLOCK(dev, txq);
goto out;
}
}
HARD_TX_UNLOCK(dev, txq);
net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
dev->name);
} else {
/* Recursion is detected! It is possible,
* unfortunately
*/
recursion_alert:
net_crit_ratelimited("Dead loop on virtual device %s, fix it urgently!\n",
dev->name);
}
}

The remainder of the code starts by trying to take the transmit lock. The device’s transmit queue is then checked to see if transmit is stopped. If it is not, the xmit_recursion variable is incremented and the data is passed down closer to the device to be transmit. We’ll see dev_hard_start_xmit in more detail later. If the transmit completes, the lock is released and execution jumps to the end of the function; otherwise, the lock is released and a warning is printed.

Alternatively, if the current CPU is the transmit lock owner, or if the RECURSION_LIMIT is hit, no transmit is done, but a warning is printed. The remaining code in the function sets the error code and returns.

Since we are interested in real ethernet devices, let’s continue down the
code path that would have been taken for those earlier via __dev_xmit_skb .

__dev_xmit_skb

And now we descend into __dev_xmit_skb from ./net/core/dev.c, armed with the queuing discipline, network device, and transmit queue reference:

static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
struct net_device *dev,
struct netdev_queue *txq)
{
spinlock_t *root_lock = qdisc_lock(q);
bool contended;
int rc;

qdisc_pkt_len_init(skb);
qdisc_calculate_pkt_len(skb, q);
/*
* Heuristic to force contended enqueues to serialize on a
* separate lock before trying to get qdisc main lock.
* This permits __QDISC_STATE_RUNNING owner to get the lock more often
* and dequeue packets faster.
*/
contended = qdisc_is_running(q);
if (unlikely(contended))
spin_lock(&q->busylock);

This code begins by using qdisc_pkt_len_init and qdisc_calculate_pkt_len to compute an accurate length for the data that will be used by the qdisc later. This is necessary for skbs that will pass through hardware-based send offloading (such as UDP Fragmentation Offloading, as we saw earlier), as the additional headers that will be added when fragmentation occurs need to be taken into account.

Next, a lock is used to help reduce contention on the qdisc’s main lock (a
second lock we’ll see later). If qdisc is currently running, then other
programs attempting to transmit will contend on the qdisc’s busylock . This
allows the running qdisc to process packets and contend with a smaller
number of programs for the second, main lock. This trick increases
throughput as the number of contenders is reduced. You can read the
original commit message describing this here. Next the main lock is taken:

spin_lock(root_lock);

Now, we approach an if statement that handles 3 possible cases:

1. The qdisc is deactivated.
2. The qdisc allows packets to bypass the queuing system, there are no other packets to send, and the qdisc is not currently running. Packet bypass is allowed for “work-conserving” qdiscs; in other words, qdiscs that do not delay packet transmit for traffic shaping purposes.
3. All other cases.

Let’s take a look at what happens in each of these cases, in order starting
with a deactivated qdisc:

if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
kfree_skb(skb);
rc = NET_XMIT_DROP;

This is straightforward. If the qdisc is deactivated, free the data and set the
return code to NET_XMIT_DROP . Next, a qdisc allowing packet bypass, with
no other outstanding packets, that is not currently running:

} else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
qdisc_run_begin(q)) {
/*
* This is a work-conserving queue; there are no old skbs
* waiting to be sent out; and the qdisc is not running -
* xmit the skb directly.
*/
if (!(dev->priv_flags & IFF_XMIT_DST_RELEASE))
skb_dst_force(skb);

qdisc_bstats_update(q, skb);

if (sch_direct_xmit(skb, q, dev, txq, root_lock)) {
if (unlikely(contended)) {
spin_unlock(&q->busylock);
contended = false;
}
__qdisc_run(q);
} else
qdisc_run_end(q);

rc = NET_XMIT_SUCCESS;

This if statement is a bit tricky. The entire statement evaluates as true if all
of the following are true:

1. q->flags & TCQ_F_CAN_BYPASS : The qdisc allows packets to bypass the queuing system. This will be true for “work-conserving” qdiscs; i.e. qdiscs that do not delay packet transmit for traffic shaping purposes are considered “work-conserving” and allow packet bypass. The pfifo_fast qdisc allows packets to bypass the queuing system.
2. !qdisc_qlen(q) : The qdisc’s queue has no data in it that is waiting to be transmit.
3. qdisc_run_begin(p) : This function call will either set the qdisc’s state as “running” and return true, or return false if the qdisc was already running.

If all of the above evaluate to true, then:

The IFF_XMIT_DST_RELEASE flag is checked. If enabled, this flag indicates that the kernel is allowed to free the skb’s destination cache structure. The code in this function checks if the flag is disabled and forces a reference count on that structure.
qdisc_bstats_update is used to increment the number of bytes and packets sent by the qdisc.
sch_direct_xmit is used to attempt to transmit the packet. We’ll dive more into sch_direct_xmit shortly, as it is used in the slower code path, too.

The return value of sch_direct_xmit is checked for two cases:

1. The queue is not empty ( >0 returned). In this case, the lock preventing contention from other programs is released and __qdisc_run is called to restart qdisc processing.
2. The queue was empty ( 0 is returned). In this case qdisc_run_end is used to turn off qdisc processing.

In either case, the return value NET_XMIT_SUCCESS is set as the return code. That wasn’t too bad. Let’s check the last case, which is the catch-all:

} else {
skb_dst_force(skb);
rc = q->enqueue(skb, q) & NET_XMIT_MASK;
if (qdisc_run_begin(q)) {
if (unlikely(contended)) {
spin_unlock(&q->busylock);
contended = false;
}
__qdisc_run(q);
}
}

In all other cases:

1. Call skb_dst_force to force a reference count bump on the skb’s destination cache reference.
2. Queue the data to the qdisc by calling the enqueue function of the qdisc. Store the return code.
3. Call qdisc_run_begin(p) to mark the qdisc as running. If it was not already running, the busylock is released and __qdisc_run(p) is called to start qdisc processing.

The function then finishes up by releasing some locks and returning the return code:

spin_unlock(root_lock);
if (unlikely(contended))
spin_unlock(&q->busylock);
return rc;


Tuning: Transmit Packet Steering (XPS)

For XPS to work, it must be enabled in the kernel configuration (it is on Ubuntu for kernel 3.13.0), and a bitmask must be configured describing which CPUs should process packets for a given interface and TX queue.

These bitmasks are similar to the RPS bitmasks, and you can find some documentation about these bitmasks in the kernel documentation.

In short, the bitmasks to modify are found in:

/sys/class/net/DEVICE_NAME/queues/QUEUE/xps_cpus

So, for eth0 and transmit queue 0, you would modify the file /sys/class/net/eth0/queues/tx-0/xps_cpus with a hexadecimal number indicating which CPUs should process transmit completions from eth0 ’s transmit queue 0. As the documentation points out, XPS may be unnecessary in certain configurations.
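
For example, to allow CPUs 0 and 1 to use eth0 ’s first transmit queue, you could write the bitmask 0x3 into that file. This is only a sketch; the interface name, queue number, and CPU mask are placeholders you would adapt to your own system:

$ # bitmask 0x3 = binary 11 = CPU 0 and CPU 1 may transmit on this queue
$ sudo sh -c 'echo 3 > /sys/class/net/eth0/queues/tx-0/xps_cpus'
$ # read the mask back to confirm
$ cat /sys/class/net/eth0/queues/tx-0/xps_cpus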

Queuing disciplines!

To follow the path of network data, we’ll need to move into the qdisc code a
bit. This post does not intend to cover the specific details of each of the
different transmit queue options. If you are interested in that, check this
excellent guide.

For the purpose of this blog post, we’ll continue the code path by examining
how the generic packet scheduler code works. In particular, we’ll explore
how qdisc_run_begin , qdisc_run_end , __qdisc_run , and sch_direct_xmit
work to move network data closer to the driver for transmit.

Let’s start by examining how qdisc_run_begin works and proceed from there.

qdisc_run_begin and qdisc_run_end

The qdisc_run_begin function can be found in ./include/net/sch_generic.h:

static inline bool qdisc_run_begin(struct Qdisc *qdisc)
{
if (qdisc_is_running(qdisc))
return false;
qdisc->__state |= __QDISC___STATE_RUNNING;
return true;
}

This function is simple: the qdisc __state flag is checked. If it’s already running, false is returned. Otherwise, __state is updated to enable the __QDISC___STATE_RUNNING bit.

Similarly, qdisc_run_end is anti-climactic:

static inline void qdisc_run_end(struct Qdisc *qdisc)
{
qdisc->__state &= ~__QDISC___STATE_RUNNING;
}

It simply disables the __QDISC___STATE_RUNNING bit from the qdisc’s __state field. It is important to note that both of these functions simply flip bits; neither actually starts or stops processing itself. The function __qdisc_run , on the other hand, will actually start processing.
__qdisc_run

The code for __qdisc_run is deceptively brief:

void __qdisc_run(struct Qdisc *q)


{
int quota = weight_p;

while (qdisc_restart(q)) {
/*
* Ordered by possible occurrence: Postpone processing if
* 1. we've exceeded packet quota
* 2. another process needs the CPU;
*/
if (--quota <= 0 || need_resched()) {
__netif_schedule(q);
break;
}
}

qdisc_run_end(q);
}

This function begins by obtaining the weight_p value. This is set typically
via a sysctl and is also used in the receive path. We’ll see later how to adjust
this value. This loop does two things:

1. It calls qdisc_restart in a busy loop until it returns false (or the break below is triggered).
2. Determines if either the quota drops below zero or need_resched() returns true. If either is true , __netif_schedule is called and the loop is broken out of.

Remember: up to now the kernel is still executing on behalf of the original call to sendmsg by the user program; the user program is currently accumulating system time. If the user program has exhausted its time quota in the kernel, need_resched will return true. If there’s still available quota and the user program hasn’t used its time slice up yet, qdisc_restart will be called over again.

Let’s see how qdisc_restart(q) works and then we’ll dive into
__netif_schedule(q) .

qdisc_restart

Let’s jump into the code for qdisc_restart :

/*
* NOTE: Called under qdisc_lock(q) with locally disabled BH.
*
* __QDISC_STATE_RUNNING guarantees only one CPU can process
* this qdisc at a time. qdisc_lock(q) serializes queue accesses for
* this queue.
*
* netif_tx_lock serializes accesses to device driver.
*
* qdisc_lock(q) and netif_tx_lock are mutually exclusive,
* if one is grabbed, another must be free.
*
* Note, that this procedure can be called by a watchdog timer
*
* Returns to the caller:
* 0 - queue is empty or throttled.
* >0 - queue is not empty.
*
*/
static inline int qdisc_restart(struct Qdisc *q)
{
struct netdev_queue *txq;
struct net_device *dev;
spinlock_t *root_lock;
struct sk_buff *skb;

/* Dequeue packet */
skb = dequeue_skb(q);
if (unlikely(!skb))
return 0;
WARN_ON_ONCE(skb_dst_is_noref(skb));
root_lock = qdisc_lock(q);
dev = qdisc_dev(q);
txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));

return sch_direct_xmit(skb, q, dev, txq, root_lock);
}

The qdisc_restart function begins with a useful comment describing some of the locking constraints for calling this function. The first operation this function performs is to attempt to dequeue an skb from the qdisc.

The function dequeue_skb will attempt to obtain the next packet to transmit. If the queue is empty, qdisc_restart will return false (causing the loop in __qdisc_run above to bail).

Assuming there is data to transmit, the code continues by obtaining a reference to the qdisc queue lock, the qdisc’s associated device, and the transmit queue.

All of these are passed through to sch_direct_xmit . Let’s take a look at dequeue_skb and then we’ll come back to sch_direct_xmit .


dequeue_skb

Let’s take a look at dequeue_skb from ./net/sched/sch_generic.c. This function handles two major cases:

1. Dequeuing data that was requeued because it could not be sent before, or
2. Dequeuing new data from the qdisc to be processed.

Let’s take a look at the first case:

static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
{
struct sk_buff *skb = q->gso_skb;
const struct netdev_queue *txq = q->dev_queue;

if (unlikely(skb)) {
/* check the reason of requeuing without tx lock first */
txq = netdev_get_tx_queue(txq->dev, skb_get_queue_mapping(skb));
if (!netif_xmit_frozen_or_stopped(txq)) {
q->gso_skb = NULL;
q->q.qlen--;
} else
skb = NULL;

Note that the code begins by taking a reference to the gso_skb field of the qdisc. This field holds a reference to data that was requeued. If no data was requeued, this field will be NULL . If that field is not NULL , the code continues by getting the transmit queue for the data and checking if the queue is stopped. If the queue is not stopped, the gso_skb field is cleared and the queue length counter is decreased. If the queue is stopped, the data remains attached to gso_skb , but NULL will be returned from this function.

Let’s check the next case, where there is no data that was requeued:

} else {
if (!(q->flags & TCQ_F_ONETXQUEUE) || !netif_xmit_frozen_or_stopped(txq))
skb = q->dequeue(q);
}

return skb;
}

In the case where no data was requeued, another tricky compound if statement is evaluated. If:

1. The qdisc does not have a single transmit queue, or
2. The transmit queue is not stopped

Then, the qdisc’s dequeue function will be called to obtain new data. The
internal implementation of dequeue will vary depending on the qdisc’s
implementation and features.

The function finishes by returning the data that is up for processing.

sch_direct_xmit

Now we come to sch_direct_xmit (in ./net/sched/sch_generic.c), which is an important participant in moving data down toward the network device. Let’s walk through it, piece by piece:

/*
* Transmit one skb, and handle the return status as required. Holding the
* __QDISC_STATE_RUNNING bit guarantees that only one CPU can execute this
* function.
*
* Returns to the caller:
* 0 - queue is empty or throttled.
* >0 - queue is not empty.
*/
int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
struct net_device *dev, struct netdev_queue *txq,
spinlock_t *root_lock)
{
int ret = NETDEV_TX_BUSY;

/* And release qdisc */
spin_unlock(root_lock);

HARD_TX_LOCK(dev, txq, smp_processor_id());

if (!netif_xmit_frozen_or_stopped(txq))
ret = dev_hard_start_xmit(skb, dev, txq);

HARD_TX_UNLOCK(dev, txq);

The code begins by unlocking the qdisc lock and then locking the transmit
lock. Note that HARD_TX_LOCK is a macro:

#define HARD_TX_LOCK(dev, txq, cpu) { \
if ((dev->features & NETIF_F_LLTX) == 0) { \
__netif_tx_lock(txq, cpu); \
} \
}

This macro is checking if the device has the NETIF_F_LLTX flag set in its feature flags. This flag is deprecated and should not be used by new device drivers. Most drivers in this kernel version do not use this flag, so this check will evaluate to true and the lock for the transmit queue for this data will be obtained.
3/26/2019 Monitoring and Tuning the Linux Networking Stack: Sending Data - Packagecloud Blog

Next, the transmit queue is checked to ensure that it is not stopped and
then dev_hard_start_xmit is called. As we’ll see later,
dev_hard_start_xmit handles transitioning the network data from the
Linux kernel’s network device subsystem into the device driver itself for
transmission. The return code from this function is stored and will be
checked next to determine if the transmit succeeded.

Once this has run (or been skipped because the queue is stopped), the
queue’s transmit lock is released. Let’s continue:

spin_lock(root_lock);

if (dev_xmit_complete(ret)) {
/* Driver sent out skb successfully or skb was consumed */
ret = qdisc_qlen(q);
} else if (ret == NETDEV_TX_LOCKED) {
/* Driver try lock failed */
ret = handle_dev_cpu_collision(skb, txq, q);

Next, the lock for this qdisc is taken again and then the return value of
dev_hard_start_xmit is examined. The first case is checked by calling
dev_xmit_complete which simply checks the return value to determine if
the data was sent successfully. If so the qdisc queue length is set as the
return value.

If dev_xmit_complete returns false, the return value will be checked to see if dev_hard_start_xmit returned NETDEV_TX_LOCKED up from the device driver. Devices with the deprecated NETIF_F_LLTX feature flag can return NETDEV_TX_LOCKED when the driver attempts to do its own locking of the transmit queue and fails. In this case, handle_dev_cpu_collision is called to deal with the lock contention. We’ll take a closer look at handle_dev_cpu_collision shortly, but for now, let’s continue down sch_direct_xmit and check out the catch-all case:

} else {
/* Driver returned NETDEV_TX_BUSY - requeue skb */
if (unlikely(ret != NETDEV_TX_BUSY))
net_warn_ratelimited("BUG %s code %d qlen %d\n",
dev->name, ret, q->q.qlen);

ret = dev_requeue_skb(skb, q);


}

So if the driver did not transmit the data and it was not due to the transmit
lock being held, it is probably due to NETDEV_TX_BUSY (if not, a warning is
printed). NETDEV_TX_BUSY can be returned by a driver to indicate that either
the device or the driver were “busy” and the data can not be transmit right
now. In this case, dev_requeue_skb is used to queue the data to be retried.

The function wraps up by (possibly) adjusting the return value:

if (ret && netif_xmit_frozen_or_stopped(txq))
ret = 0;

return ret;

Let’s take a dive into handle_dev_cpu_collision and dev_requeue_skb .

handle_dev_cpu_collision

The code for handle_dev_cpu_collision , from ./net/sched/sch_generic.c, handles two cases:

1. The transmit lock is held by the current CPU.
2. The transmit lock is held by some other CPU.

In the first case, this is handled as a configuration problem and thus a warning is printed. In the second case a statistic counter cpu_collision is incremented and the data is sent through dev_requeue_skb to be requeued for transmission later. Recall earlier we saw code in dequeue_skb that dealt specifically with requeued skbs.

The code for handle_dev_cpu_collision is short and worth a quick read:

static inline int handle_dev_cpu_collision(struct sk_buff *skb,
struct netdev_queue *dev_queue,
struct Qdisc *q)
{
int ret;

if (unlikely(dev_queue->xmit_lock_owner == smp_processor_id())) {
/*
* Same CPU holding the lock. It may be a transient
* configuration error, when hard_start_xmit() recurses. We
* detect it by checking xmit owner and drop the packet when
* deadloop is detected. Return OK to try the next skb.
*/
kfree_skb(skb);
net_warn_ratelimited("Dead loop on netdevice %s, fix it urgently!\n",
dev_queue->dev->name);
ret = qdisc_qlen(q);
} else {
/*
* Another cpu is holding lock, requeue & delay xmits for
* some time.
*/
__this_cpu_inc(softnet_data.cpu_collision);
ret = dev_requeue_skb(skb, q);
}

return ret;
}
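
As an aside, the cpu_collision counter bumped in the second branch is one of the per-CPU values exported in /proc/net/softnet_stat . Exactly which hexadecimal column it occupies depends on your kernel version (check softnet_seq_show in the kernel source for your version), but a quick look will tell you whether this path is being hit at all:

$ # one line per CPU; the cpu_collision counter is one of the hex columns
$ cat /proc/net/softnet_stat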

Let’s take a look at what dev_requeue_skb does, as we’ll see this function
called from sch_direct_xmit .

dev_requeue_skb

Thankfully, the source for dev_requeue_skb is short and straight to the point, from ./net/sched/sch_generic.c:

/* Modifications to data participating in scheduling must be protected with
* qdisc_lock(qdisc) spinlock.
*
* The idea is the following:
* - enqueue, dequeue are serialized via qdisc root lock
* - ingress filtering is also serialized via qdisc root lock
* - updates to tree and tree walking are only done under the rtnl mutex.
*/

static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
{
skb_dst_force(skb);
q->gso_skb = skb;
q->qstats.requeues++;
q->q.qlen++; /* it's still part of the queue */
__netif_schedule(q);

return 0;
}

This function does a few things:


1. It forces a reference count on the skb.
2. It attaches the skb to the qdisc’s gso_skb field. Recall earlier we saw that this field is checked in dequeue_skb before data is pulled off the qdisc’s queue.
3. A statistics counter is bumped.
4. The size of the queue is increased.
5. __netif_schedule is called.

Simple and straightforward. Let’s refresh how we got here and then
examine __netif_schedule .


Reminder, while loop in __qdisc_run

Recall that we got here by examining the function __qdisc_run , which contained the following code:

void __qdisc_run(struct Qdisc *q)
{
int quota = weight_p;

while (qdisc_restart(q)) {
/*
* Ordered by possible occurrence: Postpone processing if
* 1. we've exceeded packet quota
* 2. another process needs the CPU;
*/
if (--quota <= 0 || need_resched()) {
__netif_schedule(q);
break;
}
}

qdisc_run_end(q);
}

This code works by repeatedly calling qdisc_restart in a loop which, internally, dequeues skbs, attempts to transmit them by calling sch_direct_xmit , which calls dev_hard_start_xmit to get down to the driver to do the actual transmit. Anything that could not be transmit is requeued to be transmit in the NET_TX softirq.

The next step in the transmit process is examining dev_hard_start_xmit to see how the drivers are invoked for sending data. Before doing that, we should examine __netif_schedule to fully understand how both __qdisc_run and dev_requeue_skb work.

__netif_schedule

Let’s jump into __netif_schedule from ./net/core/dev.c:

void __netif_schedule(struct Qdisc *q)
{
if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state))
__netif_reschedule(q);
}
EXPORT_SYMBOL(__netif_schedule);

This code checks and sets the __QDISC_STATE_SCHED bit in the qdisc’s state. If the bit was flipped (meaning that it was not previously in the __QDISC_STATE_SCHED state), the code will call __netif_reschedule , which is not much longer but has very interesting side effects. Let’s take a look:

static inline void __netif_reschedule(struct Qdisc *q)


{
struct softnet_data *sd;
unsigned long flags;

local_irq_save(flags);
sd = &__get_cpu_var(softnet_data);
q->next_sched = NULL;
*sd->output_queue_tailp = q;
sd->output_queue_tailp = &q->next_sched;
raise_softirq_irqoff(NET_TX_SOFTIRQ);
local_irq_restore(flags);
}

This function does several things:

1. Save the current local IRQ state and disable IRQs with a call to
local_irq_save .
2. Get the current CPU’s softnet_data structure.
3. Add the qdisc to the softnet_data ’s output queue.
4. Raise the NET_TX_SOFTIRQ softirq.
5. Restore the IRQ state and re-enable interrupts.

You can read more about the initialization of the softnet_data data
structures by reading our previous post about the receive side of the
networking stack.

The important piece of code in the above function is raise_softirq_irqoff , which triggers the NET_TX_SOFTIRQ softirq. softirqs and their registration are also covered in our previous post. Briefly, you can think of softirqs as kernel threads that execute with a very high priority and process data on behalf of the kernel. They are used for processing incoming network data and also for processing outgoing data.

As you’ll see from the previous post, the NET_TX_SOFTIRQ softirq has the
function net_tx_action registered to it. This means that there is a kernel
thread executing net_tx_action . That thread is occasionally paused and
raise_softirq_irqoff resumes it. Let’s take a look at what net_tx_action
does so we can understand how the kernel processes transmit requests.
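
Incidentally, if you want a quick sense of how often the NET_TX softirq is being raised on each CPU, the per-CPU counters in /proc/softirqs include a NET_TX row:

$ # per-CPU counts of how many times the NET_TX softirq has run
$ grep NET_TX /proc/softirqs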

net_tx_action

The net_tx_action function from ./net/core/dev.c handles two main things when it runs:

1. The completion queue of the softnet_data structure for the executing CPU.
2. The output queue of the softnet_data structure for the executing CPU.

In fact, the code for the function is two large if blocks. Let’s take them one
at a time, remembering all the while that this code is executing in the
softirq context as an independent kernel thread. The purpose of
net_tx_action is to execute code that cannot be executed in hot paths
throughout the transmit side of the network stack; work is deferred and
later processed by the thread executing net_tx_action .

net_tx_action completion queue

The softnet_data ’s completion queue is simply a queue of skbs that are waiting to be freed. The function dev_kfree_skb_irq can be used to add skbs to a queue to be freed later. This is commonly used by device drivers to defer freeing consumed skbs. The reason why a driver would want to defer freeing the skb instead of simply freeing the skb is that freeing memory can take time and there are instances (like hardirq handlers) where code needs to execute as quickly as possible and return.

Take a look at the net_tx_action code which deals with freeing skbs on the
completion queue:

if (sd->completion_queue) {
struct sk_buff *clist;

local_irq_disable();
clist = sd->completion_queue;
sd->completion_queue = NULL;
local_irq_enable();

while (clist) {
struct sk_buff *skb = clist;
clist = clist->next;

WARN_ON(atomic_read(&skb->users));
trace_kfree_skb(skb, net_tx_action);
__kfree_skb(skb);
}
}

If the completion queue has entries, the while loop will walk through the
linked list of skbs and call __kfree_skb on each of them to free their
memory. Remember, this code is running in a separate “thread” called a
softirq – it is not running on behalf of any user program in particular.

net_tx_action output queue

The output queue serves a different purpose entirely. As we saw earlier, data is added to the output queue by calls to __netif_reschedule , which is typically called from __netif_schedule . The __netif_schedule function is called in two instances we’ve seen so far:

dev_requeue_skb : As we saw, this function can be called if the driver reports back the error code NETDEV_TX_BUSY or if there is a CPU collision.
__qdisc_run : We saw this function earlier, as well. It also calls
__netif_schedule once the quota has been exceeded or if the
process needs to be rescheduled.

In either of those cases, the __netif_schedule function will be called, which will add the qdisc to the softnet_data ’s output queue for processing. I’ve split out the output queue processing code into three blocks. Let’s take a look at the first:

if (sd->output_queue) {
struct Qdisc *head;

local_irq_disable();
head = sd->output_queue;
sd->output_queue = NULL;
sd->output_queue_tailp = &sd->output_queue;
local_irq_enable();

This block simply ensures that there are qdiscs on the output queue, and if
so, it sets head to the first entry and moves the tail pointer of the queue.

Next, the while loop for traversing the list of qdiscs starts:

while (head) {
struct Qdisc *q = head;
spinlock_t *root_lock;

head = head->next_sched;

root_lock = qdisc_lock(q);
if (spin_trylock(root_lock)) {
smp_mb__before_clear_bit();
clear_bit(__QDISC_STATE_SCHED,
&q->state);
qdisc_run(q);
spin_unlock(root_lock);

The above section of code moves the head pointer forward and obtains a
reference to the qdisc lock. spin_trylock is used to check if the lock can be
obtained; note that this call is used specifically because it does not block. If
the lock is already held, spin_trylock will return immediately instead of
waiting to obtain the lock.

If spin_trylock successfully obtains the lock, it returns a non-zero value. In this case, the qdisc’s state field has its __QDISC_STATE_SCHED bit cleared and qdisc_run is invoked, which sets the __QDISC___STATE_RUNNING bit and begins executing __qdisc_run .

This is important. What’s happening here is that the processing loop we examined before, which was running on behalf of the system call made by the user, is now running again, but in the softirq context, because the skb transmit for this qdisc could not be completed earlier. This distinction is important because it affects how you monitor CPU usage of applications which send large amounts of data. Let me state this another way:

Your program’s system time will include time spent calling down to
the driver to try to send data, regardless of whether the send
completes or the driver returns an error.
If that send is unsuccessful at the driver layer (e.g. because the device was busy sending something else), the qdisc will be added to the output queue and processed later by a softirq thread. In this case, softirq (si) time will be spent attempting to transmit your data.

So, the total time spent sending data is a combination of both the system
time of send-related system calls and the softirq time for the NET_TX
softirq.
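
One practical way to see this split is to watch the per-CPU %sys and %soft columns while your application is sending, for example with mpstat from the sysstat package (assuming it is installed on your system):

$ # refresh per-CPU usage once per second; softirq time shows up under %soft
$ mpstat -P ALL 1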

At any rate, the code above completes by releasing the qdisc lock. If the
spin_trylock call above fails to obtain the lock, the following code is
executed:

} else {
if (!test_bit(__QDISC_STATE_DEACTIVATED,
&q->state)) {
__netif_reschedule(q);
} else {
smp_mb__before_clear_bit();
clear_bit(__QDISC_STATE_SCHED,
&q->state);
}
}
}
}

This code, which only executes if the qdisc lock couldn’t be obtained,
handles two cases. Either:

1. The qdisc is not deactivated, but the lock couldn’t be obtained for
executing qdisc_run . So, call __netif_reschedule . Calling
__netif_reschedule here puts the qdisc back on the queue that
this function is currently dequeuing from. This allows the qdisc to
be checked again later when perhaps the lock has been given up.
2. The qdisc is marked as deactivated; in this case, the __QDISC_STATE_SCHED state flag is cleared as well.
Finally time to meet our friend dev_hard_start_xmit


So, we’ve traversed the entire network stack down to dev_hard_start_xmit . Maybe you’ve arrived here directly from a sendmsg system call or you arrived here via a softirq thread processing network data on the qdisc. dev_hard_start_xmit will call down to the device driver to actually do the transmit operation.

The dev_hard_start_xmit function handles two major cases:

Network data that is ready to send, or
Network data that has segmentation offloading that needs to be dealt with.

We’ll see how both cases are handled, starting with the case of network data that is ready to send. Let’s take a look (follow along here: ./net/core/dev.c):

int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
struct netdev_queue *txq)
{
const struct net_device_ops *ops = dev->netdev_ops;
int rc = NETDEV_TX_OK;
unsigned int skb_len;

if (likely(!skb->next)) {
netdev_features_t features;

/*
* If device doesn't need skb->dst, release it right now while
* its hot in this cpu cache
*/
if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
skb_dst_drop(skb);

features = netif_skb_features(skb);

This code starts by obtaining a reference to the device driver’s exposed operations with ops . This will be used later when it’s time to get the driver to do some work to transmit data. The code checks skb->next to ensure that this data is not part of a chain of already-segmented data ready to go, and moves on to do two things:

1. First, it checks if the IFF_XMIT_DST_RELEASE flag is set on the device. This flag isn’t used by any of the “real” ethernet devices in this kernel. It is used by the loopback device and some other software devices, though. If this flag is enabled, the reference count on the destination cache entry can be decreased, since it won’t be needed by the driver.
2. Next, netif_skb_features is used to get the feature flags from the device and modify them a bit based on the protocol for which the data is destined ( dev->protocol ). For example, if the protocol is one the device can checksum for, the skb will be marked as such. The VLAN tag (if it is set) will also cause additional feature flags to be flipped.

Next, the vlan tag will be checked and if the device can’t offload VLAN tagging, __vlan_put_tag will be used to do this in software:

if (vlan_tx_tag_present(skb) &&
!vlan_hw_offload_capable(features, skb->vlan_proto)) {
skb = __vlan_put_tag(skb, skb->vlan_proto,
vlan_tx_tag_get(skb));
if (unlikely(!skb))
goto out;

skb->vlan_tci = 0;
}
}

Following that, the data will be checked to see if it’s an encapsulation offload request, perhaps for GRE, for example. In this case, the feature flags will be updated to include any device-specific hardware encapsulation features that are available:

/* If encapsulation offload request, verify we are testing
* hardware encapsulation features instead of standard
* features for the netdev
*/
if (skb->encapsulation)
features &= dev->hw_enc_features;

Next, netif_needs_gso is used to determine whether or not an skb itself needs segmentation at all. If the skb needs segmentation, but the device does not support it, then netif_needs_gso will return true indicating that segmentation should occur in software. In this case, dev_gso_segment is called to do the segmentation and the code will jump down to gso to transmit the packets. We’ll see the GSO path later.

if (netif_needs_gso(skb, features)) {
if (unlikely(dev_gso_segment(skb, features)))
goto out_kfree_skb;
if (skb->next)
goto gso;
}

If the data does not need segmentation, a few other cases are handled. First: does the data need to be linearized? That is, can the device support sending network data if the data is spread out across multiple buffers, or does it all need to be combined into a single linear buffer first? The vast majority of network cards do not require the data to be linearized before transmit, so in almost all cases this will evaluate to false and will be skipped.

else {
if (skb_needs_linearize(skb, features) &&
__skb_linearize(skb))
goto out_kfree_skb;

A helpful comment is provided next, explaining the next case. The packet
will be checked to determine if it still needs a checksum. If the device does
not support checksumming, a checksum will be generated in software now:

/* If packet is not checksummed and device does not
* support checksumming for this protocol, complete
* checksumming here.
*/
if (skb->ip_summed == CHECKSUM_PARTIAL) {
if (skb->encapsulation)
skb_set_inner_transport_header(skb,
skb_checksum_start_offset(skb));
else
skb_set_transport_header(skb,
skb_checksum_start_offset(skb));
if (!(features & NETIF_F_ALL_CSUM) &&
skb_checksum_help(skb))
goto out_kfree_skb;
}
}


Now we move on to packet taps! Recall in the receive side blog post, we
saw how packets were passed off to packet taps (like PCAP). The next chunk
of code in this function hands packets which are about to be transmit over
to the packet taps (if there are any).

if (!list_empty(&ptype_all))
dev_queue_xmit_nit(skb, dev);

Finally, the driver’s ops are used to pass the data down to the device by
calling ndo_start_xmit :

skb_len = skb->len;
rc = ops->ndo_start_xmit(skb, dev);

trace_net_dev_xmit(skb, rc, dev, skb_len);


if (rc == NETDEV_TX_OK)
txq_trans_update(txq);
return rc;
}

The return value of ndo_start_xmit is returned, indicating whether the packet was transmit or not. We saw how this return value will affect the upper layers: if the transmit failed, the data may be requeued by the qdisc above this function so it can be transmit again later.

Let’s take a look at the GSO case. This code will run if the skb was already
separated into a chain of packets due to segmentation which happened in
this function or a packet that was previously segmented, but failed to send
and was queued to be sent again.

gso:
do {
struct sk_buff *nskb = skb->next;

skb->next = nskb->next;
nskb->next = NULL;

if (!list_empty(&ptype_all))
dev_queue_xmit_nit(nskb, dev);

skb_len = nskb->len;
rc = ops->ndo_start_xmit(nskb, dev);
trace_net_dev_xmit(nskb, rc, dev, skb_len);
if (unlikely(rc != NETDEV_TX_OK)) {
if (rc & ~NETDEV_TX_MASK)
goto out_kfree_gso_skb;
nskb->next = skb->next;
skb->next = nskb;
return rc;
}
txq_trans_update(txq);
if (unlikely(netif_xmit_stopped(txq) && skb->next))
return NETDEV_TX_BUSY;
} while (skb->next);

As you may have guessed, this code is a while loop that iterates over the list
of skbs that were generated when the data was segmented.

Each packet is:

Passed through the packet taps (if there are any).
Passed through to the driver via ndo_start_xmit to be transmit.

Any error in transmitting a packet is dealt with by adjusting the list of skbs
that need to be sent. The error will be returned up the stack and the unsent
skbs may be requeued to be sent again later.
The last piece of this function handles cleaning up and potentially freeing data in the event of any errors hit above:

out_kfree_gso_skb:
if (likely(skb->next == NULL)) {
skb->destructor = DEV_GSO_CB(skb)->destructor;
consume_skb(skb);
return rc;
}
out_kfree_skb:
kfree_skb(skb);
out:
return rc;
}
EXPORT_SYMBOL_GPL(dev_hard_start_xmit);

Before continuing into the device driver, let’s take a look at some
monitoring and tuning that can be done for the code that we just walked
through.

Monitoring qdiscs


Using the tc command line tool

Monitor your qdisc statistics by using tc

$ tc -s qdisc show dev eth1


qdisc mq 0: root
Sent 31973946891907 bytes 2298757402 pkt (dropped 0, overlimits 0 requeues 1776429)
backlog 0b 0p requeues 1776429

In order to monitor the packet transmit health of your system, it is vital to examine the statistics of the queue discipline(s) attached to your network device(s). You can check the status by running the command line tool tc . The example above shows how to check the statistics for the eth1 interface.

bytes : The number of bytes that were pushed down to the driver
for transmit.
pkt : The number of packets that were pushed down to the driver
for transmit.
dropped : The number of packets that were dropped by the qdisc.
This can happen if transmit queue length is not large enough to fit
the data being queued to it.
overlimits : Depends on the queuing discipline, but can be either
the number of packets that could not be enqueued due to a limit
being hit, and/or the number of packets which triggered a
throttling event when dequeued.
requeues : Number of times dev_requeue_skb has been called to
requeue an skb. Note that an skb which is requeued multiple times
will bump this counter each time it is requeued.
backlog : Number of bytes currently on the qdisc’s queue. This
number is usually bumped each time a packet is enqueued.

Some qdiscs may export additional statistics. Each qdisc is different and
may bump these counters at different times. You may want to study the
source for the qdisc you are using to understand precisely when these
values can be incremented on your system to help understand what the
consequences are for you.
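
A simple way to catch problems as they happen is to watch these counters while generating load; eth0 below is just a placeholder for whichever interface you care about:

$ # refresh qdisc statistics every second; look for growth in dropped or requeues
$ watch -n 1 tc -s qdisc show dev eth0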

Tuning qdiscs

Increasing the processing weight of __qdisc_run

You can adjust the weight of the __qdisc_run loop seen earlier (the quota variable seen above). Increasing it allows more packets to be dequeued and transmitted before the loop gives up the CPU and calls __netif_schedule to defer the remaining work, which should result in additional processing of transmit packets per pass.

Example: increase the `__qdisc_run` quota for all qdiscs with `sysctl`.

$ sudo sysctl -w net.core.dev_weight=600

Increasing the transmit queue length

Each network device has a txqueuelen tuning knob that can be modified. Most qdiscs will check whether the device has sufficient txqueuelen room when enqueuing data that should eventually be transmit by the qdisc. You can adjust this parameter to increase the amount of data that may be queued by a qdisc.

Example: increase the `txqueuelen` of `eth0` to `10000`.

$ sudo ifconfig eth0 txqueuelen 10000

The default value for ethernet devices is 1000 . You can check the
txqueuelen for network devices by reading the output of ifconfig .
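
On systems without ifconfig , the same value can be read and set with the ip tool from iproute2 ( eth0 is again just an example interface name):

$ # the current value appears as "qlen" in the link output
$ ip link show eth0
$ # raise it to 10000
$ sudo ip link set eth0 txqueuelen 10000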

Network Device Driver

We’re nearing the end of our journey. There’s an important concept to understand about packet transmit. Most devices and drivers deal with packet transmit as a two-step process:

1. Data is arranged properly and the device is triggered to DMA the data from RAM and write it to the network.
2. After the transmit completes, the device will raise an interrupt so the driver can unmap buffers, free memory, or otherwise clean its state.

The second phase of this is commonly called the “transmit completion” phase. We’re going to examine both, but we’ll start with the first phase: the transmit phase.

We saw that dev_hard_start_xmit calls ndo_start_xmit (with a lock held) to transmit data, so let’s start by examining how a driver registers an ndo_start_xmit function and then we’ll dive into how that function works.

As in the previous blog post we’ll be examining the igb driver.

Driver operations registration

Drivers implement a series of functions for a variety of operations, like:

Sending data ( ndo_start_xmit )
Getting statistical information ( ndo_get_stats64 )
Handling device ioctls ( ndo_do_ioctl )
And more.

The functions are exported as a series of function pointers arranged in a structure. Let’s take a look at the structure definition for these operations in the igb driver source:

static const struct net_device_ops igb_netdev_ops = {
.ndo_open = igb_open,
.ndo_stop = igb_close,
.ndo_start_xmit = igb_xmit_frame,
.ndo_get_stats64 = igb_get_stats64,

/* ... more fields ... */
};

This structure is registered in the igb_probe function:

static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
{
/* ... lots of other stuff ... */

netdev->netdev_ops = &igb_netdev_ops;

/* ... more code ... */
}

As we saw in the previous section, higher layers of code will obtain a reference to a device’s netdev_ops structure and call the appropriate function. If you are curious to learn more about how exactly PCI devices are brought up and when/where igb_probe is called, check out the driver initialization section from our other blog post.


Transmit data with ndo_start_xmit

The higher layers of the networking stack use the net_device_ops structure
to call into a driver to perform various operations. As we saw earlier, the
qdisc code calls ndo_start_xmit to pass data down to the driver for
transmit. The ndo_start_xmit function is called while a lock is held, for
most hardware devices, as we saw above.

In the igb device driver, the function registered to ndo_start_xmit is called
igb_xmit_frame , so let’s start at igb_xmit_frame and learn how this driver
transmits data. Follow along in ./drivers/net/ethernet/intel/igb/igb_main.c
and keep in mind that a lock is being held the entire time the following
code is executing:

netdev_tx_t igb_xmit_frame_ring(struct sk_buff *skb,
struct igb_ring *tx_ring)
{
struct igb_tx_buffer *first;
int tso;
u32 tx_flags = 0;
u16 count = TXD_USE_COUNT(skb_headlen(skb));
__be16 protocol = vlan_get_protocol(skb);
u8 hdr_len = 0;

/* need: 1 descriptor per page * PAGE_SIZE/IGB_MAX_DATA_PER_TXD,
* + 1 desc for skb_headlen/IGB_MAX_DATA_PER_TXD,
* + 2 desc gap to keep tail from touching head,
* + 1 desc for context descriptor,
* otherwise try next time
*/
if (NETDEV_FRAG_PAGE_MAX_SIZE > IGB_MAX_DATA_PER_TXD) {
unsigned short f;

for (f = 0; f < skb_shinfo(skb)->nr_frags; f++)
count += TXD_USE_COUNT(skb_shinfo(skb)->frags[f].size);
} else {
count += skb_shinfo(skb)->nr_frags;
}

The function starts out by using the TXD_USE_COUNT macro to determine how
many transmit descriptors will be needed to transmit the data passed in. The
count value is initialized to the number of descriptors needed to fit the skb.
It is then adjusted to account for any additional fragments that need to be
transmitted.

if (igb_maybe_stop_tx(tx_ring, count + 3)) {


/* this is a hard error */
return NETDEV_TX_BUSY;
}

The driver then calls an internal function igb_maybe_stop_tx which will
check the number of descriptors needed to ensure that the transmit queue
has enough available. If not, NETDEV_TX_BUSY is returned here. As we saw
earlier in the qdisc code, this will cause the qdisc to requeue the data to be
retried later.
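To make the back-pressure mechanics concrete, here is a simplified, hedged
sketch of the “maybe stop” pattern (the real igb_maybe_stop_tx in igb_main.c
does a bit more work, including re-checking after the queue is stopped to
close a race with the completion path):

/* Simplified sketch, not the driver's exact code: stop the queue when
 * descriptors run low so the stack stops handing us packets, and report
 * BUSY for this request. igb_desc_unused and netif_stop_subqueue are
 * real helpers used by the igb driver.
 */
static int example_maybe_stop_tx(struct igb_ring *tx_ring, const u16 size)
{
	if (igb_desc_unused(tx_ring) >= size)
		return 0;

	netif_stop_subqueue(tx_ring->netdev, tx_ring->queue_index);
	return -EBUSY;
}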

/* record the location of the first descriptor for this packet */


first = &tx_ring->tx_buffer_info[tx_ring->next_to_use];
first->skb = skb;
first->bytecount = skb->len;
first->gso_segs = 1;

The code then obtains a reference to the next available buffer info in the
transmit queue. This structure will track the information needed for setting
up a buffer descriptor later. A reference to the packet and its size are copied
into the buffer info structure.

skb_tx_timestamp(skb);

The code above starts by calling skb_tx_timestamp which is used to obtain
a software based transmit timestamp. An application can use the transmit
timestamp to determine the amount of time it takes for a packet to travel
through the transmit path of the network stack.

Some devices also support generating timestamps for transmitted packets in
hardware. This allows the system to offload timestamping to the device and
it allows the programmer to obtain a more accurate timestamp, as it will be
taken much closer to when the actual transmit by the hardware occurs.
We’ll see the code for this now:

if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP)) {


struct igb_adapter *adapter = netdev_priv(tx_ring->netdev);

if (!(adapter->ptp_tx_skb)) {
skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
tx_flags |= IGB_TX_FLAGS_TSTAMP;

adapter->ptp_tx_skb = skb_get(skb);
adapter->ptp_tx_start = jiffies;
if (adapter->hw.mac.type == e1000_82576)
schedule_work(&adapter->ptp_tx_work);
}
}

Some network devices can timestamp packets in hardware using the
Precision Time Protocol. The driver code handles that here when a user
requests hardware timestamping.

The if statement above checks for the SKBTX_HW_TSTAMP flag. This flag
indicates that the user requested hardware timestamping. If the user
requested hardware timestamping, the code will next check if ptp_tx_skb
is set. One packet can be timestamped at a time, so a reference to the
packet being timestamped is taken here and the SKBTX_IN_PROGRESS flag is
set on the skb. The tx_flags are updated to mark the IGB_TX_FLAGS_TSTAMP
flag. The tx_flags variable will be copied into the buffer info structure
later.

A reference is taken to the skb, and the current jiffies count is copied to
ptp_tx_start . This value will be used by other code in the driver to ensure
that the TX hardware timestamping is not hanging. Finally, the
schedule_work function is used to kick the workqueue if this is an 82576
ethernet hardware adapter.

if (vlan_tx_tag_present(skb)) {
tx_flags |= IGB_TX_FLAGS_VLAN;
tx_flags |= (vlan_tx_tag_get(skb) << IGB_TX_FLAGS_VLAN_SHIFT);
}

The code above will check if the vlan_tci field of the skb was set. If it is
set, then the IGB_TX_FLAGS_VLAN flag is enabled and the VLAN ID is stored.

/* record initial flags and protocol */
first->tx_flags = tx_flags;
first->protocol = protocol;

The flags and protocol are recorded to the buffer info structure.

tso = igb_tso(tx_ring, first, &hdr_len);


if (tso < 0)
goto out_drop;
else if (!tso)
igb_tx_csum(tx_ring, first);

Next, the driver calls its internal function igb_tso . This function will
determine if an skb needs fragmentation. If so, the buffer info reference
( first ) will have its flags updated to indicate to the hardware that TSO is
required.

igb_tso will return 0 if TSO is unnecessary; otherwise 1 is returned. If 0 is
returned, igb_tx_csum will be called to deal with enabling checksum
offloading if needed and if supported for this protocol. The igb_tx_csum
function will check the properties of the skb and flip some flag bits in the
buffer info first to signal that checksum offloading is needed.

igb_tx_map(tx_ring, first, hdr_len);

The igb_tx_map function is called to prepare the data to be consumed by
the device for transmit. We’ll examine this function in detail next.

/* Make sure there is space in the ring for the next send. */
igb_maybe_stop_tx(tx_ring, DESC_NEEDED);

return NETDEV_TX_OK;

Once the transmit is complete, the driver checks to ensure that there is
sufficient space available for another transmit. If not, the queue is
shut down. In either case NETDEV_TX_OK is returned to the higher layer (the
qdisc code).

out_drop:
igb_unmap_and_free_tx_resource(tx_ring, first);

return NETDEV_TX_OK;
}

Finally, some error handling code. This code is only hit if igb_tso hits an
error of some kind. The igb_unmap_and_free_tx_resource function is used to
clean up data. NETDEV_TX_OK is returned in this case as well. The transmit
was not successful, but the driver freed the associated resources and there
is nothing left to do. Note that this driver does not increment packet drops
in this case, but it probably should.
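For illustration only, a driver could account for such drops with the standard
net_device statistics; a hedged sketch (not code from igb ) might look like:

out_drop:
	igb_unmap_and_free_tx_resource(tx_ring, first);
	/* hypothetical accounting so the drop shows up in interface stats */
	tx_ring->netdev->stats.tx_dropped++;

	return NETDEV_TX_OK;
}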

igb_tx_map


The igb_tx_map function handles the details of mapping skb data to DMA-
able regions of RAM. It also updates the transmit queue’s tail pointer on the
device, which is what triggers the device to “wake up”, fetch the data from
RAM, and begin transmitting the data.

Let’s take a look, briefly, at how this function works:

static void igb_tx_map(struct igb_ring *tx_ring,
struct igb_tx_buffer *first,
const u8 hdr_len)
{
struct sk_buff *skb = first->skb;

/* ... other variables ... */

u32 tx_flags = first->tx_flags;
u32 cmd_type = igb_tx_cmd_type(skb, tx_flags);
u16 i = tx_ring->next_to_use;

tx_desc = IGB_TX_DESC(tx_ring, i);

igb_tx_olinfo_status(tx_ring, tx_desc, tx_flags, skb->len - hdr_len);

size = skb_headlen(skb);
data_len = skb->data_len;

dma = dma_map_single(tx_ring->dev, skb->data, size, DMA_TO_DEVICE);

The code above does a few things:

1. Declares a set of variables and initializes them.
2. Uses the IGB_TX_DESC macro to obtain a reference to the
next available descriptor.
3. igb_tx_olinfo_status will update the tx_flags and copy them
into the descriptor ( tx_desc ).
4. The size and data length are captured so they can be used later.
5. dma_map_single is used to construct any memory mapping
necessary to obtain a DMA-able address for skb->data . This is done
so that the device can read the packet data from memory.

What follows next is a very dense loop in the driver to deal with generating
a valid mapping for each fragment of a skb. The details of how exactly this
happens are not particularly important, but are worth mentioning:

The driver iterates across the set of packet fragments.
The current descriptor has the DMA address of the data filled in.
If the size of the fragment is larger than what a single IGB
descriptor can transmit, multiple descriptors are constructed to
point to chunks of the DMA-able region until the entire fragment is
pointed to by descriptors.
The descriptor iterator is bumped.
The remaining length is reduced.
The loop terminates when either no fragments are remaining or
the entire data length has been consumed.

The code for the loop is provided below for reference with the above
description. This should illustrate further to readers that avoiding
fragmentation, if at all possible, is a good idea. Lots of additional code
needs to run to deal with it at every layer of the stack, including the driver.

tx_buffer = first;

for (frag = &skb_shinfo(skb)->frags[0];; frag++) {


if (dma_mapping_error(tx_ring->dev, dma))
goto dma_error;

/* record length, and DMA address */


dma_unmap_len_set(tx_buffer, len, size);
dma_unmap_addr_set(tx_buffer, dma, dma);

tx_desc->read.buffer_addr = cpu_to_le64(dma);

while (unlikely(size > IGB_MAX_DATA_PER_TXD)) {


tx_desc->read.cmd_type_len =
cpu_to_le32(cmd_type ^ IGB_MAX_DATA_PER_TXD);

i++;
tx_desc++;
if (i == tx_ring->count) {
tx_desc = IGB_TX_DESC(tx_ring, 0);
i = 0;
}
tx_desc->read.olinfo_status = 0;

dma += IGB_MAX_DATA_PER_TXD;
size -= IGB_MAX_DATA_PER_TXD;
tx_desc->read.buffer_addr = cpu_to_le64(dma);
}

if (likely(!data_len))
break;

tx_desc->read.cmd_type_len = cpu_to_le32(cmd_type ^ size);

i++;
tx_desc++;
if (i == tx_ring->count) {
tx_desc = IGB_TX_DESC(tx_ring, 0);
i = 0;
}
tx_desc->read.olinfo_status = 0;

size = skb_frag_size(frag);
data_len -= size;

dma = skb_frag_dma_map(tx_ring->dev, frag, 0,


size, DMA_TO_DEVICE);

tx_buffer = &tx_ring->tx_buffer_info[i];
}

Once all the necessary descriptors have been constructed and all of the skb’s
data has been mapped to DMA-able addresses, the driver proceeds to its
final steps to trigger a transmit:

/* write last descriptor with RS and EOP bits */


cmd_type |= size | IGB_TXD_DCMD;
tx_desc->read.cmd_type_len = cpu_to_le32(cmd_type);

A terminating descriptor is written to indicate to the device that it is the last
descriptor.

netdev_tx_sent_queue(txring_txq(tx_ring), first->bytecount);

/* set the timestamp */


first->time_stamp = jiffies;

The netdev_tx_sent_queue function is called with the number of bytes
being added to this transmit queue. This function is part of the byte queue
limit feature that we’ll see shortly in more detail. The current jiffies are
stored in the first buffer info structure.

Next, something a bit tricky:

/* Force memory writes to complete before letting h/w know there


* are new descriptors to fetch. (Only applicable for weak-ordered
* memory model archs, such as IA-64).
*
* We also need this memory barrier to make certain all of the
* status bits have been updated before next_to_watch is written.
*/
wmb();

/* set next_to_watch value indicating a packet is present */


first->next_to_watch = tx_desc;

i++;
if (i == tx_ring->count)
i = 0;

tx_ring->next_to_use = i;

writel(i, tx_ring->tail);

/* we need this if more than one processor can write to our tail
* at a time, it synchronizes IO on IA64/Altix systems
*/
mmiowb();

return;

The code above is doing a few important things:

1. The wmb function is called to force memory writes to complete. This
executes as a special instruction appropriate for the CPU platform and is
commonly referred to as a “write barrier.” This is important on certain CPU
architectures because if we trigger the device to start DMA without ensuring
that all memory writes to update internal state have completed, the device
may read data from RAM that is not in a consistent state. This article and
this lecture dive into the details on memory ordering.
2. The next_to_watch field is set. It will be used later during the
completion phase.
3. Counters are bumped, and the next_to_use field of the transmit
queue is updated to the next available descriptor.
4. The transmit queue’s tail is updated with the writel function.
writel writes a “long” to a memory mapped I/O address. In this
case, the address is tx_ring->tail (which is a hardware address)
and the value to be written is i . This write notifies the device
that additional data is ready to be DMA’d from RAM and written to
the network.
5. Finally, the mmiowb function is called. This function will execute the
appropriate instruction for the CPU architecture causing memory
mapped write operations to be ordered. It is also a write barrier, but
for memory mapped I/O writes.

You can read some excellent documentation about memory barriers
included with the Linux kernel, if you are curious to learn more about wmb ,
mmiowb , and when to use them.

Finally, the code wraps up with some error handling. This code only
executes if an error was returned from the DMA API when attempting to
map skb data addresses to DMA-able addresses.

dma_error:
dev_err(tx_ring->dev, "TX DMA map failed\n");

/* clear dma mappings for failed tx_buffer_info map */


for (;;) {
tx_buffer = &tx_ring->tx_buffer_info[i];
igb_unmap_and_free_tx_resource(tx_ring, tx_buffer);
if (tx_buffer == first)
break;
if (i == 0)
i = tx_ring->count;
i--;
}

tx_ring->next_to_use = i;

Before moving on to the transmit completion, let’s examine something we
passed over above: dynamic queue limits.

Dynamic Queue Limits (DQL)


As you’ve seen throughout this post, network data spends a lot of time
sitting in queues at various stages as it moves closer and closer to the device
for transmission. As queue sizes increase, packets spend longer sitting in
queues not being transmitted; i.e. packet transmit latency increases as queue
size increases.

One way to fight this is with back pressure. The dynamic queue limit (DQL)
system is a mechanism that device drivers can use to apply back pressure to
the networking system to prevent too much data from being queued for
transmit when the device is unable to transmit.

To use this system, network device drivers need to make a few simple API
calls during their transmit and completion routines. The DQL system
internally will use an algorithm to determine when sufficient data is in
transmit. Once this limit is reached, the transmit queue will be temporarily
disabled. This queue disabling is what produces the back pressure against
the networking system. The queue will be automatically re-enabled when
the DQL system determines enough data has finished transmission.

Check out this excellent set of slides about the DQL system for some
performance data and an explanation of the internal algorithm in DQL.

The function netdev_tx_sent_queue called in the code we just saw is part
of the DQL API. This function is called when data is queued to the device for
transmit. Once the transmit completes, the driver calls
netdev_tx_completed_queue . Internally, both of these functions will call into
the DQL library (found in ./lib/dynamic_queue_limits.c and
./include/linux/dynamic_queue_limits.h) to determine if the transmit queue
should be disabled, re-enabled, or left as-is.
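To make the shape of this API concrete, here is a hedged sketch of where
these two calls sit in a driver. The example_* names are illustrative;
netdev_tx_sent_queue , netdev_tx_completed_queue , and netdev_get_tx_queue
are the real kernel helpers:

/* Hedged sketch of the DQL/BQL call sites in a single-queue driver. */
static netdev_tx_t example_ndo_start_xmit(struct sk_buff *skb,
					   struct net_device *dev)
{
	struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);

	/* ... map the skb and fill transmit descriptors ... */

	/* tell DQL how many bytes were handed to the hardware */
	netdev_tx_sent_queue(txq, skb->len);

	/* ... write the tail pointer to kick the device ... */
	return NETDEV_TX_OK;
}

static void example_tx_completion(struct net_device *dev,
				  unsigned int packets, unsigned int bytes)
{
	/* tell DQL how much finished; this may re-enable a stopped queue */
	netdev_tx_completed_queue(netdev_get_tx_queue(dev, 0), packets, bytes);
}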

DQL exports statistics and tuning knobs in sysfs. Tuning DQL should not be
necessary; the algorithm will adjust its parameters over time. Nevertheless,
in the interest of completeness we’ll see later how to monitor and tune
DQL.

Transmit completions

Once the device has transmitted the data, it will generate an interrupt to
signal that transmission is complete. The device driver can then schedule
some long running work to be completed, like unmapping memory regions
and freeing data. How exactly this works is device specific. In the case of
the igb driver (and its associated devices), the same IRQ is fired for
transmit completion and packet receive. This means that for the igb driver
the NET_RX softirq is used to process both transmit completions and
incoming packet receives.

Let me re-state that to emphasize the importance of this: your device may
raise the same interrupt for receiving packets that it raises to signal that a
packet transmit has completed. If it does, the NET_RX softirq runs to process
both incoming packets and transmit completions.

Since both operations share the same IRQ, only a single IRQ handler
function can be registered and it must deal with both possible cases. Recall
the following flow when network data is received:

1. Network data is received.
2. The network device raises an IRQ.
3. The device driver’s IRQ handler executes, clearing the IRQ and
ensuring that a softIRQ is scheduled to run (if not running already).
The softIRQ that is triggered here is the NET_RX softIRQ.
4. The softIRQ executes essentially as a separate kernel thread. It runs
and implements the NAPI poll loop.
5. The NAPI poll loop is simply a piece of code which executes in a loop
harvesting packets as long as sufficient budget is available.
6. Each time a packet is processed, the budget is decreased until there
are no more packets to process, the budget reaches 0, or the time
slice has expired.

Step 5 above in the igb driver (and the ixgbe driver [greetings, tyler])
processes TX completions before processing incoming data. Keep in mind
that depending on the implementation of the driver, both processing
functions for TX completions and incoming data may share the same
processing budget. The igb and ixgbe drivers track the TX completion and
incoming packet budgets separately, so processing TX completions will not
necessarily exhaust the RX budget.

That said, the entire NAPI poll loop runs within a hard coded time slice. This
means that if you have a lot of TX completion processing to handle, TX
completions may eat more of the time slice than processing incoming data
does. This may be an important consideration for those running network
hardware in very high load environments.

Let’s see how this happens in practice for the igb driver.

Transmit completion IRQ

Instead of restating information already covered in the Linux kernel receive
side networking blog post, this post will list the steps in order and
link to the appropriate sections in the receive side blog post until transmit
completions are reached.

So, let’s start from the beginning:

1. The network device is brought up.
2. IRQ handlers are registered.
3. The user program sends data to a network socket. The data travels
through the network stack until the device fetches it from memory and
transmits it.
4. The device finishes transmitting the data and raises an IRQ to
signal transmit completion.
5. The driver’s IRQ handler executes to handle the interrupt.
6. The IRQ handler calls napi_schedule in response to the IRQ.
7. The NAPI code triggers the NET_RX softirq to execute.
8. The NET_RX softirq function, net_rx_action , begins to execute.
9. The net_rx_action function calls the driver’s registered NAPI poll
function.
10. The NAPI poll function, igb_poll , is executed.

The poll function igb_poll is where the code splits off and processes both
incoming packets and transmit completions. Let’s dive into the code for this
function and see where that happens.

igb_poll

Let’s take a look at the code for igb_poll (from
./drivers/net/ethernet/intel/igb/igb_main.c):

/**
* igb_poll - NAPI Rx polling callback
* @napi: napi polling structure
* @budget: count of how many packets we should handle
**/
static int igb_poll(struct napi_struct *napi, int budget)
{
struct igb_q_vector *q_vector = container_of(napi,
struct igb_q_vector,
napi);

bool clean_complete = true;

#ifdef CONFIG_IGB_DCA
if (q_vector->adapter->flags & IGB_FLAG_DCA_ENABLED)
igb_update_dca(q_vector);
#endif
if (q_vector->tx.ring)
clean_complete = igb_clean_tx_irq(q_vector);

if (q_vector->rx.ring)
clean_complete &= igb_clean_rx_irq(q_vector, budget);

/* If all work not completed, return budget and keep polling */


if (!clean_complete)
return budget;

/* If not enough Rx work done, exit the polling mode */


napi_complete(napi);
igb_ring_irq_enable(q_vector);

return 0;
}

This function performs a few operations, in order:

1. If Direct Cache Access (DCA) support is enabled in the kernel, the
CPU cache is warmed so that accesses to the RX ring will hit CPU
cache. You can read more about DCA in the Extras section of the
receive side networking post.
2. igb_clean_tx_irq is called which performs the transmit
completion operations.
3. igb_clean_rx_irq is called next which performs the incoming
packet processing.
4. Finally, clean_complete is checked to determine if there was still
more work that could have been done. If so, the budget is returned.
If this happens, net_rx_action will move this NAPI structure to the
end of the poll list to be processed again later.

To learn more about how igb_clean_rx_irq works, read this section of the
previous blog post.

This blog post is concerned primarily with the transmit side, so we’ll
continue by examining how igb_clean_tx_irq above works.


igb_clean_tx_irq

Take a look at the source for this function in
./drivers/net/ethernet/intel/igb/igb_main.c.

It’s a bit long, so we’ll break it into chunks and go through it:

static bool igb_clean_tx_irq(struct igb_q_vector *q_vector)


{
struct igb_adapter *adapter = q_vector->adapter;
struct igb_ring *tx_ring = q_vector->tx.ring;
struct igb_tx_buffer *tx_buffer;
union e1000_adv_tx_desc *tx_desc;
unsigned int total_bytes = 0, total_packets = 0;
unsigned int budget = q_vector->tx.work_limit;
unsigned int i = tx_ring->next_to_clean;

if (test_bit(__IGB_DOWN, &adapter->state))
return true;

The function begins by initializing some useful variables. One important
one to take a look at is budget . As you can see above, budget is initialized
to this queue’s tx.work_limit . In the igb driver, tx.work_limit is
initialized to a hardcoded value IGB_DEFAULT_TX_WORK (128).

It is important to note that while the TX completion code we are looking at
now runs in the same NET_RX softirq as receive processing does, the TX and
RX functions do not share a processing budget with each other in the igb
driver. Since the entire poll function runs within the same time slice, it is
not possible for a single run of the igb_poll function to starve incoming
packet processing or transmit completions. As long as igb_poll is called,
both will be handled.

Moving on, the snippet of code above finishes by checking if the network
device is down. If so, it returns true and exits igb_clean_tx_irq .

tx_buffer = &tx_ring->tx_buffer_info[i];
tx_desc = IGB_TX_DESC(tx_ring, i);
i -= tx_ring->count;

1. The tx_buffer variable is initialized to the transmit buffer info
structure at location tx_ring->next_to_clean (which itself is
initialized to 0 ).
2. A reference to the associated descriptor is obtained and stored in
tx_desc .
3. The counter i is decreased by the size of the transmit queue. This
value can be adjusted (as we’ll see in the tuning section), but is
initialized to IGB_DEFAULT_TXD (256).

Next, a loop begins. It includes some helpful comments to explain what is
happening at each step:

do {
union e1000_adv_tx_desc *eop_desc = tx_buffer->next_to_watch;

/* if next_to_watch is not set then there is no work pending */


if (!eop_desc)
break;

/* prevent any other reads prior to eop_desc */


read_barrier_depends();

/* if DD is not set pending work has not been completed */


if (!(eop_desc->wb.status & cpu_to_le32(E1000_TXD_STAT_DD)))
break;

/* clear next_to_watch to prevent false hangs */


tx_buffer->next_to_watch = NULL;

/* update the statistics for this packet */


total_bytes += tx_buffer->bytecount;
total_packets += tx_buffer->gso_segs;

/* free the skb */


dev_kfree_skb_any(tx_buffer->skb);

/* unmap skb header data */


dma_unmap_single(tx_ring->dev,
dma_unmap_addr(tx_buffer, dma),
dma_unmap_len(tx_buffer, len),
DMA_TO_DEVICE);

/* clear tx_buffer data */


tx_buffer->skb = NULL;
dma_unmap_len_set(tx_buffer, len, 0);

1. First eop_desc is set to the buffer’s next_to_watch field. This was
set in the transmit code we saw earlier.
2. If eop_desc (eop = end of packet) is NULL , then there is no work
pending.
3. The read_barrier_depends function is called, which will execute
the appropriate CPU instruction for this CPU architecture to prevent
reads from being reordered over this barrier.
4. Next, a status bit is checked in the end of packet descriptor
eop_desc . If the E1000_TXD_STAT_DD bit is not set, then the
transmit has not completed yet, so break from the loop.
5. Clear the tx_buffer->next_to_watch . A watchdog timer in the
driver will be watching this field to determine if a transmit was
hung. Clearing this field will prevent the watchdog from triggering.
6. Statistics counters are updated for total bytes and packets sent.
These will be copied into the statistics counters that the driver
reads once all descriptors have been processed.
7. The skb is freed.
8. dma_unmap_single is used to unmap the skb data region.
9. The tx_buffer->skb is set to NULL and the tx_buffer is
unmapped.

Next, another loop is started inside of the loop above:

/* clear last DMA location and unmap remaining buffers */


while (tx_desc != eop_desc) {
tx_buffer++;
tx_desc++;
i++;
if (unlikely(!i)) {
i -= tx_ring->count;
tx_buffer = tx_ring->tx_buffer_info;
tx_desc = IGB_TX_DESC(tx_ring, 0);
}

/* unmap any remaining paged data */


if (dma_unmap_len(tx_buffer, len)) {
dma_unmap_page(tx_ring->dev,
dma_unmap_addr(tx_buffer, dma),
dma_unmap_len(tx_buffer, len),
DMA_TO_DEVICE);
dma_unmap_len_set(tx_buffer, len, 0);
}
}

This inner loop will loop over each transmit descriptor until tx_desc arrives
at the eop_desc . This code unmaps data referenced by any of the additional
descriptors.

The outer loop continues:

/* move us one more past the eop_desc for start of next pkt */
tx_buffer++;
tx_desc++;
i++;
if (unlikely(!i)) {
i -= tx_ring->count;
tx_buffer = tx_ring->tx_buffer_info;
tx_desc = IGB_TX_DESC(tx_ring, 0);
}

/* issue prefetch for next Tx descriptor */


prefetch(tx_desc);

/* update budget accounting */


budget--;
} while (likely(budget));

The outer loop increments iterators and reduces the budget value. The loop
invariant is checked to determine if the loop should continue.

netdev_tx_completed_queue(txring_txq(tx_ring),
total_packets, total_bytes);
i += tx_ring->count;
tx_ring->next_to_clean = i;
u64_stats_update_begin(&tx_ring->tx_syncp);
tx_ring->tx_stats.bytes += total_bytes;
tx_ring->tx_stats.packets += total_packets;
u64_stats_update_end(&tx_ring->tx_syncp);
q_vector->tx.total_bytes += total_bytes;
q_vector->tx.total_packets += total_packets;

This code:

1. Calls netdev_tx_completed_queue , which is part of the DQL API
explained above. This will potentially re-enable a transmit queue if
enough completions were processed.
2. Statistics are added to their appropriate places so that they can be
accessed by the user as we’ll see later.

The code continues by first checking if the IGB_RING_FLAG_TX_DETECT_HANG
flag is set. A watchdog timer sets this flag each time the timer callback is
run, to enforce periodic checking of the transmit queue. If that flag happens
to be on now, the code will continue and check if the transmit queue is
hung:

if (test_bit(IGB_RING_FLAG_TX_DETECT_HANG, &tx_ring->flags)) {
struct e1000_hw *hw = &adapter->hw;

/* Detect a transmit hang in hardware, this serializes the
* check with the clearing of time_stamp and movement of i
*/
clear_bit(IGB_RING_FLAG_TX_DETECT_HANG, &tx_ring->flags);
if (tx_buffer->next_to_watch &&
time_after(jiffies, tx_buffer->time_stamp +
(adapter->tx_timeout_factor * HZ)) &&
!(rd32(E1000_STATUS) & E1000_STATUS_TXOFF)) {

/* detected Tx unit hang */


dev_err(tx_ring->dev,
"Detected Tx Unit Hang\n"
" Tx Queue <%d>\n"
" TDH <%x>\n"
" TDT <%x>\n"
" next_to_use <%x>\n"
" next_to_clean <%x>\n"
"buffer_info[next_to_clean]\n"
" time_stamp <%lx>\n"
" next_to_watch <%p>\n"
" jiffies <%lx>\n"
" desc.status <%x>\n",
tx_ring->queue_index,
rd32(E1000_TDH(tx_ring->reg_idx)),
readl(tx_ring->tail),
tx_ring->next_to_use,
tx_ring->next_to_clean,
tx_buffer->time_stamp,
tx_buffer->next_to_watch,
jiffies,
tx_buffer->next_to_watch->wb.status);
netif_stop_subqueue(tx_ring->netdev,
tx_ring->queue_index);

/* we are about to reset, no point in enabling stuff */


return true;
}

The if statement above checks:

tx_buffer->next_to_watch is set, and
That the current jiffies is greater than the time_stamp recorded
on the transmit path to the tx_buffer with a timeout factor added,
and
The device’s transmit status register is not set to
E1000_STATUS_TXOFF .

If those three tests are all true, then an error is printed that a hang has been
detected. netif_stop_subqueue is used to turn off the queue and true is
returned.

Let’s continue reading the code to see what happens if there was no
transmit hang check, or if there was, but no hang was detected:

#define TX_WAKE_THRESHOLD (DESC_NEEDED * 2)


if (unlikely(total_packets &&
netif_carrier_ok(tx_ring->netdev) &&
igb_desc_unused(tx_ring) >= TX_WAKE_THRESHOLD)) {
/* Make sure that anybody stopping the queue after this
* sees the new next_to_clean.
*/
smp_mb();
if (__netif_subqueue_stopped(tx_ring->netdev,
tx_ring->queue_index) &&
!(test_bit(__IGB_DOWN, &adapter->state))) {
netif_wake_subqueue(tx_ring->netdev,
tx_ring->queue_index);

u64_stats_update_begin(&tx_ring->tx_syncp);
tx_ring->tx_stats.restart_queue++;
u64_stats_update_end(&tx_ring->tx_syncp);
}
}
return !!budget;

In the above code the driver will restart the transmit queue if it was
previously disabled. It first checks if:

Some packets were processed for completions ( total_packets is
non-zero), and
netif_carrier_ok is checked to ensure the device has not been brought
down, and
The number of unused descriptors in the transmit queue is greater
than or equal to TX_WAKE_THRESHOLD . This threshold value appears
to be 42 on my x86_64 system.

If all conditions are satisfied, a write barrier is used ( smp_mb ). Next, another
set of conditions is checked:

If the queue is stopped, and
The device is not down

Then netif_wake_subqueue is called to wake up the transmit queue and signal
to the higher layers that they may queue data again. The restart_queue
statistics counter is incremented. We’ll see how to read this value next.

Finally, a boolean value is returned. If there was any remaining unused
budget, true is returned; otherwise false . This value is checked in
igb_poll to determine what to return back to net_rx_action .

igb_poll return value

The igb_poll function has this code to determine what to return to
net_rx_action :

if (q_vector->tx.ring)
clean_complete = igb_clean_tx_irq(q_vector);

if (q_vector->rx.ring)
clean_complete &= igb_clean_rx_irq(q_vector, budget);

/* If all work not completed, return budget and keep polling */


if (!clean_complete)
return budget;

In other words, if either RX or TX processing could not complete (because
there was more work to do), then the entire budget amount (which is
hardcoded to 64 for most drivers, including igb ) will be returned so that
net_rx_action keeps this NAPI structure on the poll list. If instead:

igb_clean_tx_irq cleared all transmit completions without
exhausting its transmit completion budget, and
igb_clean_rx_irq cleared all incoming packets without exhausting
its packet processing budget

then all of the work is done: NAPI polling is stopped with a call to
napi_complete , device IRQs are re-enabled, and 0 is returned:

/* If not enough Rx work done, exit the polling mode */


napi_complete(napi);
igb_ring_irq_enable(q_vector);

return 0;
}

Monitoring network devices


There are several different ways to monitor your network devices, offering
different levels of granularity and complexity. Let’s start with the most
granular and move to the least granular.

Using ethtool -S

You can install ethtool on an Ubuntu system by running: sudo apt-get
install ethtool .

Once it is installed, you can access the statistics by passing the -S flag
along with the name of the network device you want statistics about.

Monitor detailed NIC device statistics (e.g., transmit errors) with `ethtool -S`.

$ sudo ethtool -S eth0


NIC statistics:
rx_packets: 597028087
tx_packets: 5924278060
rx_bytes: 112643393747
tx_bytes: 990080156714
rx_broadcast: 96
tx_broadcast: 116
rx_multicast: 20294528
....

Monitoring this data can be difficult. It is easy to obtain, but there is no
standardization of the field values. Different drivers, or even different
versions of the same driver, might produce different field names that have
the same meaning.

You should look for values with “drop”, “buffer”, “miss”, “errors”, etc. in the
label. Next, you will have to read your driver source. You’ll be able to
determine which values are accounted for totally in software (e.g.,
incremented when there is no memory) and which values come directly
from hardware via a register read. In the case of a register value, you should
consult the data sheet for your hardware to determine what the meaning of
the counter really is; many of the labels given via ethtool can be
misleading.

Using sysfs

sysfs also provides a lot of statistics values, but they are slightly higher
level than the direct NIC level stats provided.

You can find the number of, for example, transmit errors for eth0 by using
cat on a file.

Monitor higher level NIC statistics with sysfs.

$ cat /sys/class/net/eth0/statistics/tx_aborted_errors
2

The counter values will be split into files like tx_aborted_errors ,
tx_carrier_errors , tx_compressed , tx_dropped , etc.

Unfortunately, it is up to the drivers to decide what the meaning of each
field is, and thus, when to increment them and where the values come from.
You may notice that some drivers count a certain type of error condition as a
drop, but other drivers may count the same as a miss.

If these values are critical to you, you will need to read your driver source
and device data sheet to understand exactly what your driver thinks each of
these values means.
Using /proc/net/dev

An even higher level file is /proc/net/dev which provides high-level
summary-esque information for each network adapter on the system.

Monitor high level NIC statistics by reading /proc/net/dev .

$ cat /proc/net/dev
Inter-| Receive |
face |bytes packets errs drop fifo frame compressed multicast|by
eth0: 110346752214 597737500 0 2 0 0 0 2096
lo: 428349463836 1579868535 0 0 0 0 0

This file shows a subset of the values you’ll find in the sysfs files mentioned
above, but it may serve as a useful general reference.

The caveat mentioned above applies here, as well: if these values are
important to you, you will still need to read your driver source to
understand exactly when, where, and why they are incremented to ensure
your understanding of an error, drop, or fifo is the same as your driver’s.

Monitoring dynamic queue limits

You can monitor dynamic queue limits for a network device by reading the
files located under: /sys/class/net/NIC/queues/tx-
QUEUE_NUMBER/byte_queue_limits/ .

Replacing NIC with your device name ( eth0 , eth1 , etc) and tx-
QUEUE_NUMBER with the transmit queue number ( tx-0 , tx-1 , tx-2 , etc).

Some of those files are:

hold_time : Initialized to HZ (a single hertz). If the queue has been
full for hold_time , then the maximum size is decreased.
inflight : This value is equal to (number of packets queued -
number of packets completed). It is the current number of packets
being transmitted for which a completion has not been processed.
limit_max : A hardcoded value, set to DQL_MAX_LIMIT ( 1879048192
on my x86_64 system).
limit_min : A hardcoded value, set to 0 .
limit : A value between limit_min and limit_max which
represents the current maximum number of objects which can be
queued.

Before modifying any of these values, it is strongly recommended to read
these presentation slides for an in-depth explanation of the algorithm.

Monitor packet transmits in flight by reading
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight .

$ cat /sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight
350

Tuning network devices

Check the number of TX queues being used

If your NIC and the device driver loaded on your system support multiple
transmit queues, you can usually adjust the number of TX queues (also
called TX channels), by using ethtool .

Check the number of NIC transmit queues with ethtool


$ sudo ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX: 0
TX: 0
Other: 0
Combined: 8
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 4

This output is displaying the pre-set maximums (enforced by the driver and
the hardware) and the current settings.

Note: not all device drivers will have support for this operation.

Error seen if your NIC doesn't support this operation.

$ sudo ethtool -l eth0


Channel parameters for eth0:
Cannot get device channel parameters
: Operation not supported

This means that your driver has not implemented the ethtool get_channels
operation. This could be because the NIC doesn’t support adjusting the
number of queues, doesn’t support multiple transmit queues, or your driver
has not been updated to handle this feature.

Adjust the number of TX queues used

Once you’ve found the current and maximum queue count, you can adjust
the values by using sudo ethtool -L .

Note: some devices and their drivers only support combined queues that
are paired for transmit and receive, as in the example in the above
section.

Set combined NIC transmit and receive queues to 8 with ethtool -L

$ sudo ethtool -L eth0 combined 8

If your device and driver support individual settings for RX and TX and you’d
like to change only the TX queue count to 8, you would run:

Set the number of NIC transmit queues to 8 with ethtool -L .

$ sudo ethtool -L eth0 tx 8

Note: making these changes will, for most drivers, take the interface
down and then bring it back up; connections to this interface will be
interrupted. This may not matter much for a one-time change, though.

Adjust the size of the TX queues


Some NICs and their drivers also support adjusting the size of the TX queue.
Exactly how this works is hardware specific, but luckily ethtool provides a
generic way for users to adjust the size. Increasing the size of the TX queue
may not make a drastic difference because DQL is used to prevent higher
layer networking code from queueing more data at times. Nevertheless, you
may want to increase the TX queues to the maximum size and let DQL sort
everything else out for you:

Check current NIC queue sizes with ethtool -g

$ sudo ethtool -g eth0


Ring parameters for eth0:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 512
RX Mini: 0
RX Jumbo: 0
TX: 512

The above output indicates that the hardware supports up to 4096 receive
and transmit descriptors, but it is currently only using 512.

Increase size of each TX queue to 4096 with ethtool -G

$ sudo ethtool -G eth0 tx 4096

Note: making these changes will, for most drivers, take the interface
down and then bring it back up; connections to this interface will be
interrupted. This may not matter much for a one-time change, though.

The End

The end! Now you know everything about how packet transmit works on
Linux: from the user program to the device driver and back.

Extras

There are a few extra things worth mentioning that didn’t seem to fit quite
right anywhere else.

Reducing ARP traffic ( MSG_CONFIRM )

The send , sendto , and sendmsg system calls all take a flags
parameter. If you pass the MSG_CONFIRM flag to these system calls from your
application, it will cause the dst_neigh_output function in the kernel on
the send path to update the timestamp of the neighbour structure. The
consequence of this is that the neighbour structure will not be garbage
collected. This prevents additional ARP traffic from being generated as the
neighbour cache entry will stay warmer, longer.
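As a hedged, minimal example of what this looks like from an application
(the destination address and port below are placeholders), a UDP sender
might pass MSG_CONFIRM like this:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in dst = {
		.sin_family = AF_INET,
		.sin_port = htons(9000),                /* placeholder port */
	};
	const char msg[] = "hello";
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0) {
		perror("socket");
		return 1;
	}
	inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr); /* placeholder addr */

	/* MSG_CONFIRM hints that the peer is known to be reachable (e.g.,
	 * we recently saw a reply), so the neighbour entry is refreshed
	 * instead of being re-probed with ARP. */
	if (sendto(fd, msg, sizeof(msg), MSG_CONFIRM,
		   (struct sockaddr *)&dst, sizeof(dst)) < 0)
		perror("sendto");

	close(fd);
	return 0;
}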

UDP Corking

We examined UDP corking extensively throughout the UDP protocol stack. If
you want to use it in your application, you can enable UDP corking by
calling setsockopt with level set to IPPROTO_UDP , optname set to
UDP_CORK , and optval set to 1 .
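A minimal, hedged sketch of toggling the cork from an application (sends
between the two setsockopt calls are accumulated and flushed as a single
datagram when the cork is removed):

#include <netinet/in.h>
#include <netinet/udp.h>   /* UDP_CORK */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	int on = 1, off = 0;

	if (fd < 0) {
		perror("socket");
		return 1;
	}

	/* start corking: subsequent writes are appended, not transmitted */
	if (setsockopt(fd, IPPROTO_UDP, UDP_CORK, &on, sizeof(on)) < 0)
		perror("setsockopt(UDP_CORK, 1)");

	/* ... one or more send()/sendto() calls on this socket ... */

	/* remove the cork: the accumulated data goes out as one datagram */
	if (setsockopt(fd, IPPROTO_UDP, UDP_CORK, &off, sizeof(off)) < 0)
		perror("setsockopt(UDP_CORK, 0)");

	close(fd);
	return 0;
}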

Timestamping

As mentioned earlier in this post, the networking stack can collect
timestamps of outgoing data. See the network stack walkthrough above to
see where transmit timestamping happens in software. Some NICs even
support timestamping in hardware, too.

This is a useful feature if you’d like to try to determine how much latency
the kernel network stack is adding to sending packets.

The kernel documentation about timestamping is excellent and there is
even an included sample program and Makefile you can check out!

Determine which timestamp modes your driver and device support with
ethtool -T .

$ sudo ethtool -T eth0


Time stamping parameters for eth0:
Capabilities:
software-transmit (SOF_TIMESTAMPING_TX_SOFTWARE)
software-receive (SOF_TIMESTAMPING_RX_SOFTWARE)
software-system-clock (SOF_TIMESTAMPING_SOFTWARE)
PTP Hardware Clock: none
Hardware Transmit Timestamp Modes: none
Hardware Receive Filter Modes: none

This NIC, unfortunately, does not support hardware transmit timestamping,
but software timestamping can still be used on this system to help me
determine how much latency the kernel is adding to my packet transmit
path.
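Requesting software transmit timestamps from an application looks roughly
like the hedged sketch below; the completed timestamps are read back from
the socket error queue with recvmsg(..., MSG_ERRQUEUE) , which the kernel’s
sample program demonstrates in full:

#include <linux/net_tstamp.h>   /* SOF_TIMESTAMPING_* flags */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_TIMESTAMPING
#define SO_TIMESTAMPING 37      /* fallback for older libc headers */
#endif

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	int flags = SOF_TIMESTAMPING_TX_SOFTWARE |  /* stamp on transmit */
		    SOF_TIMESTAMPING_SOFTWARE;      /* report software stamps */

	if (fd < 0) {
		perror("socket");
		return 1;
	}

	if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
		       &flags, sizeof(flags)) < 0)
		perror("setsockopt(SO_TIMESTAMPING)");

	/* ... send data, then read timestamps from the error queue ... */

	close(fd);
	return 0;
}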


Conclusion

The Linux networking stack is complicated.

As we saw above, even something as simple as the NET_RX softirq can’t be
guaranteed to work as we expect it to. Even though RX is in the name,
transmit completions are still processed in this softIRQ.

This highlights what I believe to be the core of the issue: optimizing and
monitoring the network stack is impossible unless you carefully read and
understand how it works. You cannot monitor code you don’t understand at
a deep level.

Help with Linux networking or other systems

Need some extra help navigating the network stack? Have questions about
anything in this post or related things not covered? Send us an email and
let us know how we can help.

Related posts

If you enjoyed this post, you may enjoy some of our other low-level
technical posts:

Monitoring and Tuning the Linux Networking Stack: Receiving Data
Illustrated Guide to Monitoring and Tuning the Linux Networking
Stack: Receiving Data
The Definitive Guide to Linux System Calls
How does strace work?
How does ltrace work?
APT Hash sum mismatch
HOWTO: GPG sign and verify deb packages and APT repositories
HOWTO: GPG sign and verify RPM packages and yum repositories
