In this post, I’m going to talk about what IP fragmentation is, how it works and why it’s needed. And while learning that, we’re going to touch on subjects like
OSI Layers /
Knowledge about them is required to truly understand the IP fragmentation process and to troubleshoot network connection issues in general.
The OSI model is basically a standardized model for network communications which breaks them into what are called abstraction layers.
The original model consists of 7 layers:
Going up the OSI model hierarchy, layers stack up on top of one another.
Each layer adds overhead to the final size (yes, even the physical layer!) [1]
More info can be found Here.
In the OSI model, each layer exchanges data in units called Protocol Data Units (PDUs).
Any transmission between two entities on a layer is done by sending PDUs of that layer.
If a unit is not fully received, it must be dropped (unless it can somehow be reconstructed), and if the information it carried was important, there must be a way to signal the other end to resend it.
Depending on the application, each PDU could be hundreds (or even thousands) of bytes in size.
Setting aside hardware limitations (which are usually the bottleneck), a balance has to be struck between the maximum acceptable unit size, resource usage and link stability. As I said earlier, a unit that is not fully received is lost, so the bigger the unit, the more data has to be resent.
As a rule of thumb, assuming a stable enough link, the bigger the unit size, the lower the overhead (each unit carries its own header) and the greater the speed. Bigger units also tend to use fewer resources, because every unit’s header has to be validated, which, depending on the application, could also mean verifying the unit’s data.
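To put rough numbers on that rule of thumb, here is a minimal sketch that counts how many units a transfer needs and how much of the traffic is header overhead. The 40-byte header, the 1,000,000-byte payload and the three unit sizes are illustrative assumptions, not values taken from any standard, and `transfer_cost` is a made-up helper name.

```python
def transfer_cost(payload_bytes, unit_size, header_size):
    """Return (units, total_bytes_on_wire, overhead_fraction) for one transfer.

    Assumes every unit carries exactly one fixed-size header and that
    nothing is lost or resent along the way.
    """
    if unit_size <= header_size:
        raise ValueError("unit size must be larger than the header")
    data_per_unit = unit_size - header_size
    # Ceiling division: the last unit may be only partially filled.
    units = -(-payload_bytes // data_per_unit)
    header_bytes = units * header_size
    total = payload_bytes + header_bytes
    return units, total, header_bytes / total


# Hypothetical 1,000,000-byte transfer with a 40-byte header per unit.
for size in (500, 1000, 1500):
    units, total, overhead = transfer_cost(1_000_000, size, 40)
    print(f"{size:>4}-byte units: {units:>4} units, {overhead:.2%} header overhead")
```

The overhead share shrinks as the unit grows, which is the effect described above. The sketch ignores the other side of the trade-off: on an unstable link, each lost unit costs more to resend.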
The Maximum Transmission Unit (MTU) is the maximum size of a unit that can be successfully sent over a network to the other end.
Likewise, the Maximum Receive Unit (MRU) is the maximum size of a unit that can be received from the other end [2].
Nowadays, MTU/MRU generally refer to packets (explained below). But do remember that in each layer, the corresponding PDU could have its own MTU/MRU. And as you go up the OSI layers, the overhead added by the layers below shrinks the MTU/MRU available to the PDUs above. From now on in this article, unless otherwise stated, MTU/MRU refers to the packet MTU/MRU.
Packets are the PDUs that travel over a link on the network layer.
IP is the most widely used network layer protocol, and it is also the protocol that TCP (the most widely used transport layer protocol) relies on.
In our secure-webpage-loading example, everything above IP (which would be
HTTP) is packed into an IP packet, with an IP header added.
Minimum acceptable packet size is:
According to the standard, all devices along the way must be able to accept and route packets of that size.
Absolute maximum packet size is:
Note: The practical packets MTU/MRU, is much less than that. In a normal LAN, the maximum size of a packet is likely
Network nodes (including the NICs at both ends as well as the switches/routers along the way) might not support packets bigger than that at all.
Now, on a WAN (or basically your typical Internet connection), there is absolutely no guarantee that the MTU between you and different endpoints will be consistent. For example, your MTU might be 1492 when connecting to one server and 1400 when connecting to another. To make matters even worse, because of things like dynamic routing, even your MTU to a single server might change over time.
As a client, your ISP has set an upper limit on your link’s MTU/MRU, and you cannot successfully send or receive any packet bigger than that. On an ideal link, that upper limit is 1500 bytes (which is also the default for most OSes and NIC drivers).
But things aren’t always that simple. Sometimes lower layers (e.g. the data link layer) add extra headers to their own units, leaving less space for the payload (the upper layers’ data embedded in them). As an example, a typical PPPoE connection adds 8 bytes of extra header, limiting your packet MTU/MRU to only 1492 bytes (1500 - 8).
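The same subtraction works for any stack of lower-layer headers. A tiny sketch, where `payload_mtu` is a made-up helper name and the only concrete numbers are the 1500-byte link and the 8-byte PPPoE header from the example above:

```python
def payload_mtu(link_mtu, *encapsulation_overheads):
    """Return the MTU left for packets once lower-layer headers are subtracted."""
    remaining = link_mtu - sum(encapsulation_overheads)
    if remaining <= 0:
        raise ValueError("encapsulation overhead uses up the whole link MTU")
    return remaining


# A 1500-byte link carrying PPPoE (8 bytes of extra header).
print(payload_mtu(1500, 8))  # prints 1492
```

Every extra layer of encapsulation on the path would be one more argument to the call, and one more reason the usable packet size drops below 1500.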
But what if the data you want to send doesn’t fit in a single packet? That’s where IP fragmentation comes into play.
In IPv4, when a router has to forward a packet that is bigger than the MTU of its next hop [5], it seamlessly fragments the packet into smaller ones before sending them along the way. The fragments then travel as independent packets and are put back together (defragmented) by the destination host; routers along the way do not reassemble them, even when a later hop could handle the full packet size.
For further clarification, take a look at these examples:
All this (de)fragmentation is costly, especially for the routers that do the fragmenting: they have to tear the packets apart, split their payloads, and calculate and add a header for each new one.
The work adds up quickly when a router is dealing with tens or hundreds of thousands of packets per second. This is one of the reasons why, in IPv6, fragmentation at the router level is no longer allowed and hosts are solely responsible for (de)fragmenting their own packets.
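To make the splitting work concrete, here is a minimal sketch of the bookkeeping involved in fragmenting one IPv4 packet. It assumes a 20-byte header without options, and `fragment` is a made-up helper name. It only computes sizes, offsets and the "more fragments" flag; it does not build real packets or checksums.

```python
def fragment(total_length, mtu, header_len=20):
    """Plan the IPv4 fragments needed to fit a packet through a given MTU.

    Returns a list of (fragment_total_length, offset_in_8_byte_units,
    more_fragments) tuples. Every fragment carries its own copy of the header.
    """
    if total_length <= mtu:
        return [(total_length, 0, False)]  # fits as-is, nothing to split
    # Fragment offsets are expressed in 8-byte units, so every fragment
    # except the last must carry a payload that is a multiple of 8 bytes.
    max_payload = (mtu - header_len) // 8 * 8
    if max_payload <= 0:
        raise ValueError("MTU too small to carry any payload")
    payload = total_length - header_len
    fragments = []
    sent = 0
    while sent < payload:
        chunk = min(max_payload, payload - sent)
        more_fragments = sent + chunk < payload
        fragments.append((header_len + chunk, sent // 8, more_fragments))
        sent += chunk
    return fragments


# A hypothetical 4000-byte packet crossing a link with a 1500-byte MTU.
for length, offset, more in fragment(4000, 1500):
    print(f"length={length} offset={offset} more_fragments={more}")
```

For this input the plan has three fragments of 1500, 1500 and 1040 bytes, so 40 extra header bytes are sent compared with the original packet, on top of the per-packet work done by the device that splits it.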
The Don’t Fragment (DF) flag is a special bit in the IPv4 packet header. Its sole purpose is to tell the routers along the way that the packet should not be fragmented.
Well-configured routers have no choice but to drop the packet if it’s bigger than the MTU of their next hop. This flag has 2 main uses:
- Well-designed applications/protocols can optimize the packet size themselves (instead of relying on routers’ IPv4 fragmentation capability). If a packet gets dropped because it exceeds the MTU somewhere along the path, the application/protocol can use that to adjust its packet size so that its packets no longer need to be fragmented.
- As we will see shortly, the DF flag also plays a huge role in actually identifying the MTU of a path.
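For reference, the DF bit lives in the 16-bit field at byte offset 6 of the IPv4 header, next to the "more fragments" bit and the 13-bit fragment offset. The sketch below reads those fields from raw header bytes. The sample header is hand-built for illustration (documentation addresses, checksum left at zero), and `fragmentation_fields` is a made-up helper name.

```python
import struct

DF_MASK = 0x4000      # "Don't Fragment" bit
MF_MASK = 0x2000      # "More Fragments" bit
OFFSET_MASK = 0x1FFF  # 13-bit fragment offset, counted in 8-byte units


def fragmentation_fields(ipv4_header: bytes):
    """Extract the fragmentation-related fields from a raw IPv4 header."""
    if len(ipv4_header) < 20:
        raise ValueError("an IPv4 header is at least 20 bytes long")
    # Bytes 6-7 hold the 3 flag bits followed by the fragment offset.
    (flags_and_offset,) = struct.unpack_from("!H", ipv4_header, 6)
    return {
        "dont_fragment": bool(flags_and_offset & DF_MASK),
        "more_fragments": bool(flags_and_offset & MF_MASK),
        "fragment_offset_bytes": (flags_and_offset & OFFSET_MASK) * 8,
    }


# Hand-built sample header: version/IHL, TOS, total length, ID,
# flags+offset (DF set), TTL, protocol (6 = TCP), checksum (not computed),
# source and destination addresses from the documentation ranges.
sample_header = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 1500, 0x1234, DF_MASK, 64, 6, 0,
    bytes([192, 0, 2, 1]), bytes([198, 51, 100, 7]),
)
print(fragmentation_fields(sample_header))
```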
So how would an OS know about the MTU/MRU of a path?
Given that your MRU is effectively the MTU of your immediate next hop, MTU is our main concern most of the time.
(I.e., if your ISP knows that its MTU on the path to you is X, that means your MRU is X, and the ISP will never send you a packet larger than that.)
By now we know that it’s in our best interest to send packets that are as big as possible, but no bigger than our MTU. Also, because of the different factors in play, we usually can’t know the MTU of a path beforehand.
Path MTU Discovery (PMTUD) comes to the rescue! Originally intended for routers (since they were traditionally responsible for fragmenting IPv4 packets), it is now a necessary feature of every client OS.
When the DF flag is set on an IPv4 packet and a router along the way needs to drop it (because the packet is larger than its next-hop MTU), it should also do something else…
When a router is unable to forward a datagram because it exceeds the MTU of the next-hop network and its Don’t Fragment bit is set, the router is required to return an ICMP Destination Unreachable message to the source of the datagram, with the Code indicating “fragmentation needed and DF set”. To support the Path MTU Discovery technique specified in this memo, the router MUST include the MTU of that next-hop network in the low-order 16 bits of the ICMP header field that is labelled “unused” in the ICMP specification. The high-order 16 bits remain unused, and MUST be set to zero.
So when an IPv4 packet with the DF flag set is dropped by a router, the router is required to send a special ICMP Type 3, Code 4 message (“fragmentation needed and DF set”) to alert the host that sent the packet about the loss.
The ICMP message should also include the acceptable MTU for that router’s next hop.
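Following the layout in the quoted text (the next-hop MTU sits in the low-order 16 bits of the otherwise unused header field), a host could read the value roughly as sketched here. The input is assumed to be the ICMP message itself, without the enclosing IP header. `next_hop_mtu` is a made-up helper name, and the sample message is built by hand with its checksum left at zero.

```python
import struct

ICMP_DEST_UNREACHABLE = 3
ICMP_FRAG_NEEDED = 4


def next_hop_mtu(icmp_message: bytes):
    """Return the advertised next-hop MTU, or None for any other ICMP message."""
    if len(icmp_message) < 8:
        raise ValueError("an ICMP header is 8 bytes long")
    # Layout: type (1), code (1), checksum (2), unused (2), next-hop MTU (2).
    icmp_type, code, _checksum, _unused, mtu = struct.unpack_from(
        "!BBHHH", icmp_message
    )
    if icmp_type == ICMP_DEST_UNREACHABLE and code == ICMP_FRAG_NEEDED:
        return mtu
    return None


# Hand-built "fragmentation needed and DF set" message advertising 1492.
sample = struct.pack("!BBHHH", 3, 4, 0, 0, 1492)
print(next_hop_mtu(sample))  # prints 1492
```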
TCP does a good job of optimizing its packet size, and it works better and faster when its packets don’t get fragmented. It’s common for TCP packets to have the DF flag set.
As you can see from the above example, not only can the MTU of a path be obtained like this, the packet size can also keep being adjusted in an ever-changing environment.
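The discovery loop described in this section can be simulated in a few lines. This is a toy model under stated assumptions: the path is just a list of made-up next-hop MTUs, every router behaves correctly and reports its next-hop MTU, and no ICMP message is ever lost. `discover_path_mtu` is a hypothetical helper, not an OS API.

```python
def discover_path_mtu(local_mtu, hop_mtus):
    """Simulate PMTUD for packets sent with the DF flag set.

    hop_mtus lists the MTU of each successive hop on the path.
    Returns (path_mtu, sizes_tried).
    """
    size = local_mtu
    sizes_tried = []
    while True:
        sizes_tried.append(size)
        # The first hop that cannot carry the packet drops it and reports
        # its own MTU back (ICMP Type 3, Code 4 in IPv4).
        reported = next((mtu for mtu in hop_mtus if mtu < size), None)
        if reported is None:
            return size, sizes_tried  # the packet made it to the other end
        size = reported  # retry with the size the router told us about


# Hypothetical path: a 1500-byte LAN, a PPPoE link, then a 1400-byte hop.
print(discover_path_mtu(1500, [1500, 1492, 1400]))
# prints (1400, [1500, 1492, 1400])
```

Each dropped packet costs one round of feedback, and the sender ends up at the smallest MTU on the path. If the route changes later and a smaller hop appears, the same mechanism lowers the size again.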
In IPv6, the burden of fragmenting packets is entirely on the hosts, and PMTUD plays a huge role in that. Routers only drop the oversized packets and notify the hosts via ICMPv6 Type 2 (“Packet Too Big”) messages, which again include the MTU of their next hop.
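The ICMPv6 variant carries the MTU in a full 32-bit field right after the checksum. A minimal parsing sketch, assuming the input is the ICMPv6 message without the IPv6 header; `packet_too_big_mtu` is a made-up helper name and the sample message is hand-built with a zero checksum.

```python
import struct

ICMPV6_PACKET_TOO_BIG = 2


def packet_too_big_mtu(icmpv6_message: bytes):
    """Return the MTU from an ICMPv6 Packet Too Big message, else None."""
    if len(icmpv6_message) < 8:
        raise ValueError("message too short for an ICMPv6 header plus MTU")
    # Layout: type (1), code (1), checksum (2), MTU (4).
    icmp_type, _code, _checksum, mtu = struct.unpack_from("!BBHI", icmpv6_message)
    return mtu if icmp_type == ICMPV6_PACKET_TOO_BIG else None


# Hand-built Packet Too Big message advertising a 1400-byte next hop.
sample_v6 = struct.pack("!BBHI", 2, 0, 0, 1400)
print(packet_too_big_mtu(sample_v6))  # prints 1400
```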
PMTUD Black hole
Some system administrators (whether because they don’t realize the importance of these ICMP messages, don’t know about them at all, or maybe want to protect their network from probing, DoS attacks, etc.) opt to disable the required ICMP generation and just silently drop bigger-than-MTU packets that have the DF flag set [6]. An inadequate firewall setup on the host might also filter out those ICMP messages.
This effectively cripples the host’s ability to correctly determine the MTU of a path and is referred to as a PMTUD black hole.
Typical PMTUD black hole symptoms are as follows (you will usually experience more than one):
- Some connections (like SSH) establish successfully but stall as soon as you start using them, even though the hosts stay pingable the whole time.
- Users complain about not being able to load “some sites”, either at all or only partially (usually as a result of SSL connections stalling).
- Connections may work fine (with some initial delay) on one device (e.g., an Android phone) while failing on others (e.g., a desktop PC).
- The network feels slow overall, even though the link to the ISP is reliable and should be faster.
Although different OSes have introduced some workarounds to ease the effect, because of the nature of the issue they are not without their drawbacks.
In a later post, I will discuss how OSes try to guess the presence of a black hole in the path and work around it. I will also talk about different methods that you could deploy to help the situation and even “fix” the PMTUD black hole problem altogether.
And as always, your comments would be most appreciated!
1. https://en.wikipedia.org/wiki/Ethernet_frame
2. This term is mostly used for PPP connections.
3. At that point, the space reserved (16 bits) in the packet header for its size will be exhausted.
4. https://en.wikipedia.org/wiki/IPv6_packet#Jumbogram
5. The next router in the path that the packet should be routed to.
6. Some even remove the DF flag from your packets and then fragment them (please don’t do that).