Understanding Network IP Fragmentation

Hamy

Oct 9, 2018 11 min read Tutorial

Overview

In this post, I’m going to talk about what IP fragmentation is, how it works and why it’s needed. And while learning that, we’re going to touch on subjects like OSI Layers / PDU / MTU / MRU and PMTUD.

Knowing about them is required for truly understanding the IP fragmentation process, and for troubleshooting network connection issues in general.

OSI model

This is basically a standardized model for network communications which breaks them into what’s called abstraction layers.

The original model consists of 7 layers: Physical, Data Link, Network, Transport, Session, Presentation, and Application.

Going up the OSI model hierarchy, layers stack up on top of one another.

Example 1:

Listed in reverse order (from the top layer down), loading a secure web page would consist of:

HTTP (Application layer)
TLS (Presentation Layer)
TCP (Transport Layer)
IP (Network Layer)
Ethernet (Data Link Layer)
Ethernet physical layer (Physical Layer)

Each layer would add an overhead to the final size (Yes, even the physical layer!)¹

More info can be found Here.

PDU

In the OSI model, each layer consists of units. They are called Protocol Data Units.

Any transmission between 2 entities on a layer is done by sending PDUs of that layer.

If a unit does not get fully received, it must be dropped (provided that it couldn’t be reconstructed somehow) and if the carried information was important, there must be a way to signal the other end to re-send it.

PDU size

Depending on the application, each PDU could be hundreds (or even thousands) of bytes in size.

Not counting hardware limitations (which are usually the bottleneck), a balance needs to be struck between the maximum acceptable unit size, resource usage and link stability. As I said earlier, if a unit does not get fully received, it is lost, so the bigger the unit, the more data has to be re-sent.

As a rule of thumb, assuming enough stability, the bigger the unit size, the lower the overhead (each unit is bundled with its own header) and the greater the speed. It also likely has less impact on resources (the header of each unit might be validated which, depending on the application, could also mean verifying the unit’s data as well).
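To put that rule of thumb into numbers, here is a minimal sketch. It assumes every unit carries a fixed 40-byte header (a 20-byte IPv4 header plus a 20-byte TCP header, both without options); the function name and the sample sizes are only illustrative.

```python
import math

HEADER_BYTES = 40  # assumption: 20-byte IPv4 header + 20-byte TCP header, no options


def overhead_ratio(data_bytes: int, unit_size: int, header_bytes: int = HEADER_BYTES) -> float:
    """Return the fraction of all transmitted bytes that is spent on headers."""
    payload_per_unit = unit_size - header_bytes
    if payload_per_unit <= 0:
        raise ValueError("unit size must be larger than the header")
    units = math.ceil(data_bytes / payload_per_unit)  # how many units the data needs
    header_total = units * header_bytes
    return header_total / (data_bytes + header_total)


# Sending ~1 MB of data with different maximum unit sizes
for size in (576, 1500, 9000):
    print(f"unit size {size}: {overhead_ratio(1_000_000, size):.2%} header overhead")
```

The larger the unit, the smaller the share of bytes spent on headers, which is exactly the trade-off described above.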

MTU/MRU

Maximum Transmission Unit is the maximum size of a unit that can be successfully sent over a network to the other end.

Likewise, Maximum Receive Unit is the maximum size of a unit that can be received from the other end².

Nowadays, MTU/MRU generally refer to the packet units (explained below). But do remember that in each layer, the corresponding PDU could have its own MTU/MRU. And as you go up the OSI layers, because of the overhead of the previous ones, MTU/MRU of later PDUs would decrease. From now on in this article unless otherwise stated, MTU/MRU would refer to the packet units.

Packets

Packets are the PDUs that travel over a link on the network layer.

IP is the most widely used network layer protocol and also the protocol on which TCP (the most widely used transport layer protocol) relies.

In our secure-webpage-loading example, everything above IP (which would be TCP, TLS and HTTP) should be packed in an IP packet with some added IP related header.

Packets size

Minimum acceptable packet size is:

IPv4: 576 bytes
IPv6: 1280 bytes

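According to the standards, these are the sizes that every device is expected to cope with: any IPv4 host must be able to accept a packet of 576 bytes (whether it arrives whole or in fragments), and every IPv6 link must be able to carry a 1280-byte packet.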

Absolute maximum packet size is:

IPv4: 64KiB³
IPv6: 4 freaking GiB!⁴
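Both limits come straight from the width of a length field. As a quick sanity check (a small sketch, with constant and function names of my own choosing): IPv4 stores the packet size in a 16-bit field, while an IPv6 jumbogram stores its payload length in a 32-bit field.

```python
IPV4_LENGTH_FIELD_BITS = 16        # IPv4 "Total Length" header field
IPV6_JUMBO_LENGTH_FIELD_BITS = 32  # IPv6 "Jumbo Payload Length" option field


def max_length(field_bits: int) -> int:
    """Largest value an unsigned field of the given width can hold."""
    return (1 << field_bits) - 1


print(max_length(IPV4_LENGTH_FIELD_BITS))        # 65535 bytes, just under 64 KiB
print(max_length(IPV6_JUMBO_LENGTH_FIELD_BITS))  # 4294967295 bytes, just under 4 GiB
```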

Note: The practical packet MTU/MRU is much lower than that. In a normal LAN, the maximum size of a packet is likely 1500 bytes. Network nodes (including the NICs at each end as well as switches/routers along the way) might not support packets bigger than that at all.

Now in a network like a WAN (or basically your typical Internet connection), there is absolutely no guarantee that your MTU to different endpoints will be consistent. For example, your MTU might be 1492 when connecting to one server and 1400 when connecting to another. To make matters even worse, due to things like Dynamic Routing, even your MTU to a single server might change over time.

As a client, your ISP has set an upper limit on your link’s MTU/MRU and you cannot successfully send/receive any packet bigger than that size. In an ideal link, that upper limit is 1500 bytes (which is also the default for most OS’s and NIC drivers).

But things aren’t always that simple. Sometimes lower layers (e.g., data link) add some extra headers to their own unit, leaving less space available for their payload (in which the upper layers’ data is embedded). As an example, your typical PPPoE connection adds 8 bytes of extra header, limiting your packet MTU/MRU to only 1492 (1500 - 8).
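The PPPoE arithmetic generalizes to any extra encapsulation: every additional header on a lower layer eats into the room left for the packet. A tiny sketch of that subtraction (the helper name is hypothetical; the 8 bytes are the 6-byte PPPoE header plus the 2-byte PPP protocol ID):

```python
ETHERNET_MTU = 1500
PPPOE_OVERHEAD = 8  # 6-byte PPPoE header + 2-byte PPP protocol ID


def effective_mtu(link_mtu: int, *encapsulation_overheads: int) -> int:
    """Return the packet MTU left after subtracting each encapsulation header."""
    mtu = link_mtu - sum(encapsulation_overheads)
    if mtu <= 0:
        raise ValueError("encapsulation overhead leaves no room for packets")
    return mtu


print(effective_mtu(ETHERNET_MTU, PPPOE_OVERHEAD))  # 1492
```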

IP Fragmentation

But what if the data you want to send doesn’t fit in a single packet? That’s where IP Fragmentation comes into play.

For IPv4, every router knows the MTU of the link to its next-hop⁵. If a packet exceeds that size (and is allowed to be fragmented), the router seamlessly fragments it into smaller packets before sending them along the way. According to the IPv4 specification, the fragments are put back together only at the final destination, although some devices along the way (NAT gateways and firewalls, for example) may defragment them earlier.

For further clarification, take a look at these examples:

Example 2:

The application on your host tries sending an IPv4 packet bigger than 1500 bytes.
Since it’s bigger than the default OS MTU, the OS splits it into smaller packets (fragments), each with its own special header so that they can be reconstructed later on.
The fragmented packets are sent and, assuming no device along the way reassembles them, they stay fragmented for the whole trip.
The other end receives the fragments. Its OS defragments them and hands the original packet to the receiving application.
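The split-and-rebuild cycle of Example 2 can be sketched in a few lines. This is a simplified model, not a real IP stack: it assumes a 20-byte IPv4 header without options, and represents each fragment as a tuple of (fragment offset in 8-byte units, "more fragments" flag, data). The function names are my own.

```python
IPV4_HEADER_BYTES = 20  # assumption: IPv4 header without options


def fragment_payload(payload: bytes, mtu: int) -> list[tuple[int, bool, bytes]]:
    """Split a payload into (offset_in_8_byte_units, more_fragments, chunk) tuples."""
    room = mtu - IPV4_HEADER_BYTES
    if room < 8:
        raise ValueError("MTU too small to carry a fragment")
    if len(payload) <= room:
        return [(0, False, payload)]  # fits as-is, no fragmentation needed
    step = room // 8 * 8  # every fragment except the last carries a multiple of 8 bytes
    fragments = []
    for start in range(0, len(payload), step):
        more = start + step < len(payload)
        fragments.append((start // 8, more, payload[start:start + step]))
    return fragments


def reassemble(fragments: list[tuple[int, bool, bytes]]) -> bytes:
    """Rebuild the payload at the receiver; fragments may arrive in any order."""
    if not fragments:
        raise ValueError("nothing to reassemble")
    ordered = sorted(fragments, key=lambda frag: frag[0])
    payload = b""
    for offset_units, _more, chunk in ordered:
        if offset_units * 8 != len(payload):
            raise ValueError("a fragment is missing")
        payload += chunk
    if ordered[-1][1]:
        raise ValueError("the last fragment has not arrived")
    return payload


data = bytes(range(256)) * 16               # 4096 bytes of payload
frags = fragment_payload(data, 1500)
print([(off, more, len(chunk)) for off, more, chunk in frags])
print(reassemble(frags[::-1]) == data)      # order of arrival does not matter
```

Note how losing a single fragment makes the whole packet unrecoverable in this model, which mirrors what was said earlier about partially received units.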

Example 3:

The application on your host, realizing that the host’s MTU is 1500 bytes, constructs its transmission data in a way that no IPv4 packet exceeds that size.
The OS receives the packets and, finding no problem, sends them along.
Your home router accepts the packets (since they don’t exceed its NIC’s MRU).
The router decides that it needs to route the packets through its PPPoE interface.
Because of the extra PPPoE headers, the MTU of that interface is 1492 bytes. So the router fragments the packets and sends them through the link (without the host knowing about it!).
Your ISP’s router receives the fragmented packets and decides the routing path. If it is one of the devices that reassemble fragments, and its next-hop⁵ can handle an MTU of 1500 bytes, it re-assembles them and sends the defragmented packets along.
In that case, the other end receives the full packets with a size of 1500 bytes and no extra defragmentation is needed. Otherwise, the fragments travel all the way to the other end, whose OS defragments them.
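To see what the home router in Example 3 actually puts on the wire, here is a rough size-only calculation. It assumes 20-byte IPv4 headers without options, and the function name is hypothetical.

```python
IPV4_HEADER_BYTES = 20  # assumption: IPv4 header without options


def fragment_sizes(packet_bytes: int, mtu: int) -> list[int]:
    """Return the on-the-wire size of each fragment of one IPv4 packet."""
    if packet_bytes <= mtu:
        return [packet_bytes]  # no fragmentation needed
    step = (mtu - IPV4_HEADER_BYTES) // 8 * 8  # payload bytes per non-final fragment
    if step <= 0:
        raise ValueError("MTU too small to carry a fragment")
    remaining = packet_bytes - IPV4_HEADER_BYTES  # payload still to be sent
    sizes = []
    while remaining > 0:
        chunk = min(step, remaining)
        sizes.append(IPV4_HEADER_BYTES + chunk)  # every fragment gets its own header
        remaining -= chunk
    return sizes


print(fragment_sizes(1500, 1492))  # [1492, 28]
```

So each full-sized packet turns into one full fragment plus a tiny 28-byte one (20 of which are header), doubling the number of packets on the PPPoE link.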

Now all this (de)fragmentation is costly, especially for routers, which usually do the work. They have to tear the packets apart, split their payloads, and calculate and add a header for each new one.

The work adds up quickly when a router is dealing with tens/hundreds of thousands of packets per second. This is one of the reasons why on IPv6, fragmentation on router level is not allowed anymore and hosts are solely responsible for (de)fragmenting their own packets.

DF flag

The Don’t Fragment flag is a special bit in the IPv4 packet header. Its sole purpose is to tell the routers along the way that the packet should not be fragmented.

Well-configured routers have no choice but to drop the packet if it’s bigger than the MTU of their next-hop. This flag has 2 main uses:

Well-designed applications/protocols can optimize the packet size themselves (instead of relying on the routers’ IPv4 fragmentation capability). If a packet gets dropped because it violates the MTU somewhere along the path, the application/protocol can use that to its own advantage and adjust its packet size so that its packets no longer need fragmenting.
As we will see shortly, the DF flag also plays a huge role in actually identifying the MTU of a path.
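For the curious, the DF bit lives in the 16-bit word at bytes 6 and 7 of the IPv4 header, next to the "more fragments" bit and the 13-bit fragment offset. The sketch below reads those fields from raw header bytes; `build_sample_header` is a made-up helper that fabricates a header (with documentation-range addresses and a zeroed checksum) purely so there is something to parse.

```python
import struct

DF_MASK = 0x4000      # "Don't Fragment" bit
MF_MASK = 0x2000      # "More Fragments" bit
OFFSET_MASK = 0x1FFF  # 13-bit fragment offset, counted in 8-byte units


def parse_fragment_fields(ipv4_header: bytes) -> dict:
    """Extract the fragmentation-related fields from a raw IPv4 header."""
    if len(ipv4_header) < 20:
        raise ValueError("an IPv4 header is at least 20 bytes long")
    (flags_and_offset,) = struct.unpack_from("!H", ipv4_header, 6)
    return {
        "dont_fragment": bool(flags_and_offset & DF_MASK),
        "more_fragments": bool(flags_and_offset & MF_MASK),
        "fragment_offset_bytes": (flags_and_offset & OFFSET_MASK) * 8,
    }


def build_sample_header(flags_and_offset: int) -> bytes:
    """Hypothetical helper: a 20-byte header for illustration (checksum left as 0)."""
    return struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 1500,      # version/IHL, DSCP/ECN, total length
        0x1234,             # identification
        flags_and_offset,   # flags + fragment offset
        64, 6, 0,           # TTL, protocol (TCP), checksum (not computed here)
        bytes([192, 0, 2, 1]), bytes([198, 51, 100, 7]),
    )


print(parse_fragment_fields(build_sample_header(DF_MASK)))
```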

PMTUD

So how would an OS know about the MTU/MRU of a path?

Given the fact that your MRU is always the MTU of your next immediate hop, our main concern is the MTU most of the time.

(i.e., if your ISP knows its path MTU to you is X, that would mean your MRU is X and the ISP would never send a packet larger than that)

By now we know that it’s in our best interest to send as big a packet as possible, but not bigger than our MTU. Also, because of the different factors in play, we usually can’t know the MTU of a path beforehand.

Path MTU Discovery is here to the rescue! Originally intended for routers (since they were traditionally responsible for fragmenting IPv4), it is now a needed feature for every client OS.

When the DF flag is set on an IPv4 packet and a router along the way needs to drop it (because the packet size is higher than its next-hop MTU), it should also do something else…

RFC 1191 explains:

When a router is unable to forward a datagram because it exceeds the MTU of the next-hop network and its Don’t Fragment bit is set, the router is required to return an ICMP Destination Unreachable message to the source of the datagram, with the Code indicating “fragmentation needed and DF set”. To support the Path MTU Discovery technique specified in this memo, the router MUST include the MTU of that next-hop network in the low-order 16 bits of the ICMP header field that is labelled “unused” in the ICMP specification. The high-order 16 bits remain unused, and MUST be set to zero.

So when an IPv4 packet with the DF flag set is dropped by a router, the router is required to send a special ICMP Type 3, Code 4 message (“fragmentation needed and DF set”) to alert the host that sent the packet about the loss.

The ICMP message should also include the acceptable MTU size for the said router’s next-hop.
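Following the layout quoted from the RFC (type, code, checksum, 16 unused bits, then the 16-bit next-hop MTU), a host could pull the MTU out of such a message roughly like this. It is a sketch with a function name of my own, and it does not verify the checksum.

```python
import struct
from typing import Optional

ICMP_DEST_UNREACHABLE = 3
ICMP_FRAG_NEEDED = 4


def next_hop_mtu(icmp_message: bytes) -> Optional[int]:
    """Return the next-hop MTU from a "fragmentation needed and DF set" message.

    Returns None for any other ICMP message. A value of 0 means the router
    left the field unused (the behaviour that predates RFC 1191).
    """
    if len(icmp_message) < 8:
        raise ValueError("an ICMP header is at least 8 bytes long")
    icmp_type, code, _checksum, _unused, mtu = struct.unpack_from("!BBHHH", icmp_message)
    if icmp_type != ICMP_DEST_UNREACHABLE or code != ICMP_FRAG_NEEDED:
        return None
    return mtu


# A fabricated message for illustration (checksum left as 0)
sample = struct.pack("!BBHHH", 3, 4, 0, 0, 1492)
print(next_hop_mtu(sample))  # 1492
```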

TCP does a good job of optimizing its packet size, and works better and faster when its packets don’t get fragmented. It’s common for TCP packets to have the DF flag set.

Example 4:

An application using IPv4/TCP sends a packet with a size of 1500 bytes and the DF flag set.
The home router receives it and, realizing that it can’t send it over its PPPoE connection and can’t fragment it either, drops the packet and sends the host an ICMP message which includes the MTU of the router’s next-hop (1492 bytes).
The host’s TCP stack takes note, adjusts its packet size to be no higher than 1492 bytes and re-sends the data (again with the DF flag set).
Now routers along the way start routing the packet. If any of them has a next-hop MTU lower than 1492 bytes, it drops the packet, notifies the host and the cycle continues...
The receiving end finally gets the packet without it ever being fragmented. The sending host caches the final successful MTU to that server for further use.
If at any point the MTU of the path decreases, the same process results in finding the new MTU.
Every once in a while (usually every 10 minutes, depending on the OS), the host tries to send a packet bigger than its cached MTU to see if it gets through, and potentially adjusts the cached value.

As you can see from the above example, not only can the MTU of a path be obtained like this, it can also be kept optimized in an ever-changing environment.
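The drop, notify and retry cycle of Example 4 boils down to a short loop. Here is a toy simulation, under the assumption that every router on the path behaves well and reports its next-hop MTU; the path is just a list of link MTUs and the function name is made up.

```python
def discover_path_mtu(link_mtus: list[int], initial_size: int) -> tuple[int, int]:
    """Simulate PMTUD with the DF flag set.

    Returns (discovered path MTU, number of packets the host had to send).
    """
    size = initial_size
    packets_sent = 0
    while True:
        packets_sent += 1
        # The first link too small for the packet makes its router drop the
        # packet and report that link's MTU back to the host via ICMP.
        reported = next((mtu for mtu in link_mtus if size > mtu), None)
        if reported is None:
            return size, packets_sent  # the packet reached the other end
        size = reported  # shrink to the reported MTU and try again


# Host link, PPPoE link, some narrower link further away, destination link
print(discover_path_mtu([1500, 1492, 1400, 1500], 1500))  # (1400, 3)
```

Each reported MTU is strictly smaller than the packet that triggered it, so the loop always ends at the smallest MTU on the path.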

For IPv6, the burden of fragmenting packets is completely on the hosts and PMTUD plays a huge role in that. Routers only drop the packets and notify the hosts via ICMPv6 Type 2 (“Packet Too Big”) which again also includes the MTU size of their next-hop.
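The ICMPv6 message is laid out slightly differently from its IPv4 cousin: after the type, code and checksum comes a full 32-bit MTU field. A minimal parsing sketch (again with a function name of my own, and no checksum verification):

```python
import struct
from typing import Optional

ICMPV6_PACKET_TOO_BIG = 2


def packet_too_big_mtu(icmpv6_message: bytes) -> Optional[int]:
    """Return the MTU carried by an ICMPv6 "Packet Too Big" message, else None."""
    if len(icmpv6_message) < 8:
        raise ValueError("message too short for a Packet Too Big header")
    icmp_type, _code, _checksum, mtu = struct.unpack_from("!BBHI", icmpv6_message)
    if icmp_type != ICMPV6_PACKET_TOO_BIG:
        return None
    return mtu


# A fabricated message for illustration (checksum left as 0)
sample = struct.pack("!BBHI", 2, 0, 0, 1280)
print(packet_too_big_mtu(sample))  # 1280
```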

PMTUD Black hole

Some system administrators (whether not realizing the importance of these ICMP messages, not knowing about them at all, or maybe trying to protect their network from probing, DoS attacks, etc.) opt to disable the required ICMP generation and just silently drop the bigger-than-MTU packets that have the DF flag set⁶. An inadequate firewall setup on the host might also filter those ICMP messages.

This effectively cripples the host’s ability to correctly determine the MTU of a path and is referred to as a PMTUD black hole.

Typical PMTUD black hole symptoms are as follows (You should usually experience more than one):

Some connections (like SSH) establish successfully but stall right when you start using them, even though the hosts are pingable all the time.
Users complain about not being able to visit “some sites”, either completely or not fully (usually as the result of SSL connections stalling).
Connections may work fine (with some initial delay) on one device (e.g., an Android phone), while failing on others (e.g., a desktop PC).
An overall slow network experience, even though the link to the ISP is reliable and should be faster.
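The first symptom is easy to reproduce in a toy model: small packets never hit the MTU limit, so they get through, while big packets with the DF flag set vanish without any feedback. The sketch below uses made-up names, and the 84-byte size is an assumption (a typical default ping on Linux).

```python
from typing import Optional


def send_with_df(size: int, link_mtus: list[int], icmp_filtered: bool) -> tuple[str, Optional[int]]:
    """Send one DF-flagged packet along the path and report what the host observes."""
    for mtu in link_mtus:
        if size > mtu:
            if icmp_filtered:
                return ("timeout", None)  # dropped silently: the black hole
            return ("too_big", mtu)       # dropped, but the host learns the MTU
    return ("delivered", None)


path = [1500, 1492, 1500]
print(send_with_df(84, path, icmp_filtered=True))     # small ping: ('delivered', None)
print(send_with_df(1500, path, icmp_filtered=True))   # full-sized packet: ('timeout', None)
print(send_with_df(1500, path, icmp_filtered=False))  # healthy path: ('too_big', 1492)
```

From the host’s point of view, the second case looks like ordinary packet loss, so it keeps re-sending the same oversized packet and the connection stalls.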

Although some workarounds have been introduced by different OS’s to ease the effect, because of the nature of the issue, they are not without their drawbacks.

In a later post, I will discuss how OS’s try to guess the presence of a black hole in the path and work around it. I will also talk about different methods that you could deploy to help the situation and even “fix” the PMTUD black hole problem altogether.

And as always, your comments would be most appreciated!

¹ https://en.wikipedia.org/wiki/Ethernet_frame
² This term is mostly used for PPP connections
³ At that point, the space reserved in the packet’s header for its size (16 bits) will be exhausted
⁴ https://en.wikipedia.org/wiki/IPv6_packet#Jumbogram
⁵ The next router in the path that the packet should be routed to
⁶ Some even remove the DF flag from your packets and then fragment them (please don’t do that)
