These are unedited transcripts and may contain errors.
Plenary session, Tuesday 25 September 2012
2:00 p.m.:
FILIZ YILMAZ: OK everyone. Hello. Don't make me scream over the mic. Please get seated as soon as possible so we can start. We have a packed agenda for the afternoon, with two sessions. And before we start with the talks, I have a housekeeping agenda point: You heard about the ratings and the prize. We now have the first winner, and I will attempt to say the name, and I apologise if I say it incorrectly. If you are ??, please come by the registration desk; you have a voucher, you are the winner for rating the talks.
(Applause)
Not for my spelling of the name, obviously. But there will be another one, so keep rating, please; there will be another announcement by the end of the day. So now we will start quickly with Randy, as our first speaker of the slots, talking about propagation. That doesn't look right, right, Randy? No.
RANDY BUSH: You have got a font problem. That cannot be mine.
FILIZ YILMAZ: Yes, we do have a font problem, Randy. Thank you.
RANDY BUSH: So, the guys... there we go. So, this is a research talk, it's actually two talks in one, and the research is: can you stare at Comic Sans for half an hour without losing your lunch? You should know the guys who cracked the Higgs boson; their talk was in Comic Sans.
So, don't panic, I am an engineer. This is an engineering and research talk; engineers like to talk about, and think about, problems. I am a researcher and we are only interested in problems. In actuality, the RPKI is doing really well, especially in the RIPE region, but I want to talk to you about the problems because that is what interests me.
And also, we are all friends. So when I point out problems, I have worked on them with friends, I continue to work on them with friends, etc. That doesn't mean they are not problems, but we are doing what we can. Just a quick review of the RPKI infrastructure: there are issuing parties, which publish into the repository. There are relying parties, which pull it down with a gatherer into validating caches. All of this is object security, because it starts at a Trust Anchor, like DNSSEC, and all these objects are validated. From here on out, especially to get to the router, it's transport security, so that should be in your POP.
My routing is going to rely, to some extent, on the RPKI, so I really care about the reliability of the publication. Of course, my relying party software, the stuff that sucks it down and uses it, is going to be built so it can deal with failures, so failures by the publishing parties are not fatal. But when we look at publication today, what we see is not very good quality. This has to be fixed, and it is being worked on very well.
Here are some graphs; all God's children have graphs, and mine is better than yours. RIPE has Alex, and that gang have done some sexy software; you should look at it and play with it. We have more geeky software; ours actually produces a website, you run the caches, etc., and you can display it on your website. Those of you who recognise RRD will know what you are looking at. So here we go: this is the pilot of DTAG, Deutsche Telekom, my favourite monopoly, and to sync their data down took about 3.14 seconds from Seattle; I don't think that is going to expand fully to pi. There are very few objects, but remember, the objects kind of aggregate up. Here is an RIR, LACNIC; you can tell they were playing around, going from a mean of a couple of hundred objects to 2.5K. So, that's pretty reasonable.
I didn't mean that, I pushed the wrong button. What do I do? So, here you also get a nice little report that counts all the different ugly things that can happen. Here is ARIN; these things say last week, but these were done in July, OK? So this has been fixed by now, but here is bad key usage, objects rejected, though it was fairly stable. Here is pretty bad. This is APNIC; this is 3 July out here. Anybody know what happened on the 30th of June? Anybody remember it? Leap second. They didn't make the leap. It didn't take a second; it took two-and-a-half days. So they had the leap crawl down there in Australia. And they don't monitor, they don't have a real NOC, they do not work weekends. I actually finally picked up Skype or Jabber or something and said, George, fix it, and it was fixed within five seconds, right?
So, here we have our dearly beloved RIPE. Here we have some failures to fetch, and we will get into that in a little more ugly detail. Big sync time, because it's a flat organisation, not a hierarchy; they are fixing that. But they had a whole lot of objects, which is wonderful. RIPE, you guys are registering your objects.
We look in detail at some of those little orangey guys and it's ugly. Those are failures to get the data from the publication point. This was an NFS problem; NFS is evil. It went on for a long time. We had logs showing it, but nothing can be wrong at an RIR. Finally it was fixed, thank you, guys from RIPE Ops, but small problems still remain.
So I was going to go on; I pulled the data down for today, thinking I would make a slide that showed everything was fixed. Unfortunately, this isn't correct.
Now, this gap at RIPE was the migration to the new system, that's understandable, but we are still getting occasional hits, so we are going to work on this. This is euro transit, APNIC, we are still getting hits. AfriNIC, this has been going on forever. Ever seen that? These are files, the actual ones you transfer, and they have wonderful permissions on them. Very, very inventive; very helpful. We wrote to them multiple times and got snarky responses; the RIRs are not RIRs, they are PTTs, there can be no problem. But the relying party software saves us. We expect failures, etc. The relying party software, if it can't fetch new data, uses the old data, so the RPKI data will be fairly stable, so this is kind of OK. Until you look at things like the Internet disaster that lasted five days.
So, some statistics: I am going to go through these quickly. This is all caused by the flat versus hierarchical organisation of both APNIC and RIPE, which is being fixed. This is connection counts, which is again a similar problem. But this is wonderful: this is the number of objects in the RIPE repository and it keeps climbing. Hang in there. RIPE, the bottom line: deployment is serious in the RIPE region, thanks Alex and Tim and the RIPE crew. RIRs are not operator-quality. JPNIC is going to be deploying a prototype in a month; they are hoping to be better. The publication structure needs to be fixed. Relying party software solves all this by being stubborn and patient. Where do we go from here? I believe that asking publishers, and some of you are going to be publishers, to be five nines reliable is silly, expensive and doomed. The Internet is about building a reliable network from unreliable components. We need to develop a model that takes advantage of that: distributed caches that all feed off each other, etc., and you will see in the next presentation why that works really well.
Here we have got the publishing party and the relying party again, and we want to see what the performance of this is. Imagine a very large global ISP; they are sucking off RIPE and APNIC, etc., but they have got a head cache in every continent, or multiple ones, and they feed all their POPs from there, minimising the load and the relying party problem on the global RPKI. So what is this going to behave like? What are the propagation characteristics for the infrastructure? How sensitive is it to cache fetch timers? How much is propagation, how much is validation? Is it delay-sensitive? We know the publication hierarchy, IANA, the RIRs, ISPs, etc., but what this really looks like is: here is the publication, here are the RIRs, and that ISP has what we call gatherers, top-tier caches fetching down, which feed more caches, which feed other caches, which finally feed routers. So there are publication points, there are gatherers, there are caches and there are routers. Each cache syncs data from parent caches or gatherers. The timing you see here is ten minutes. Each cache has a root Trust Anchor so it can validate everything. What is propagation? Propagation is the time from when the CA publishes a certificate or ROA to when the party you care about receives it, cache or router. So we measure this by time stamping the publication and time stamping every place it is received, and then we analyse. We don't care about routers as routers and BGP because they don't change the propagation measurement; we just stamp them when they get there. Everything logs.
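As an illustration of the timestamp-and-subtract analysis just described, here is a minimal sketch; the CSV log format and field names are my assumptions, not the test-bed's actual tooling.

```python
# A minimal sketch (log format is assumed) of the analysis step: each object is
# timestamped when published and when each cache or router first sees it; the
# difference per object/node pair is the propagation delay, and sorting those
# delays gives the CDF shown on the later slides.
import csv
from collections import defaultdict

def load_events(path):
    """Read rows of: object_id,node,event,unix_time (event is publish|receive)."""
    events = []
    with open(path) as f:
        for obj, node, event, ts in csv.reader(f):
            events.append((obj, node, event, float(ts)))
    return events

def propagation_delays(events):
    published = {}                      # object_id -> earliest publish time
    first_seen = defaultdict(dict)      # object_id -> {node: first receive time}
    for obj, node, event, ts in events:
        if event == "publish":
            published[obj] = min(ts, published.get(obj, ts))
        elif event == "receive" and ts < first_seen[obj].get(node, float("inf")):
            first_seen[obj][node] = ts
    delays = []
    for obj, pub_ts in published.items():
        for node, rcv_ts in first_seen.get(obj, {}).items():
            delays.append(rcv_ts - pub_ts)
    return sorted(delays)

def cdf_point(delays, fraction):
    """Delay below which `fraction` of object/node pairs were reached."""
    return delays[int(fraction * (len(delays) - 1))]
```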
So, here is a tier one provider, here are the connections to all the caches, here are the routers, and it looks pretty ugly. This is the small test-bed; we have run a big one with three tier ones. This is a one tier one model, so it only has about 300 servers in it. How do you deploy a 300-server test-bed? You use a large cluster: StarBED. We used about 50 of their VMs. How do I configure 300 servers? You don't do it by hand, especially when the time slot you have is only three days. So there is something called AutoNetkit, which was originally from Roma Tre, went to Adelaide, and in the UK had a bunch of work done on it. Literally, you draw this with something like yEd, which produces GraphML, and it actually builds the deployment and deploys it on the target. Yes, magic. It is cool. It was originally, as I said, from Roma Tre; Lorenzo is partially to blame, he is here. It went to Adelaide and we were hacking on it at Loughborough. It used to only do routing; it now does servers and even understands the RPKI. To understand whether latency is important, we hooked up Japan and StarBED via a VLAN tunnel using Open vSwitch to a 75-router simulation in Dallas, so if you want delay, you insert a router in the configuration.
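The following is not AutoNetkit itself, only a small sketch of the kind of workflow it automates: reading a yEd-drawn GraphML topology and emitting a stub configuration per node. The node attribute name is an assumption for the example.

```python
# Sketch of "draw it in yEd, generate configs from the GraphML" (assumes networkx).
import networkx as nx

def build_configs(graphml_path):
    g = nx.read_graphml(graphml_path)
    configs = {}
    for node, attrs in g.nodes(data=True):
        role = attrs.get("role", "router")       # e.g. gatherer, cache, router
        neighbours = sorted(g.neighbors(node))
        lines = ["hostname %s" % node, "role %s" % role]
        lines += ["link %s" % n for n in neighbours]
        configs[node] = "\n".join(lines)
    return configs

if __name__ == "__main__":
    for name, cfg in build_configs("topology.graphml").items():
        print("=== %s ===\n%s\n" % (name, cfg))
```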
So, you can see that to get from here to here you have to go through a router, and that introduced 500 ms of delay. Inter-router delay, since they are all in Dallas, is negligible. As I said, if you have got a dot, that's a router in the delay path; no dots, none. We did one run with two Tier 1s, and the result was that essentially delay had no effect.
Between publication and gatherers, no effect. The numbers were essentially all the same. So the cache deployment architecture can be based on what you need; do your architecture, don't worry about latency.
So, we will buy a two-to-three-star dinner for somebody who can take a BGP table dump, give us code to take a table dump and create a realistic CA hierarchy. This is non-trivial. So, what we did instead is start the run with one CA per entity and run it for some hours; every second we generate another ROA, and we end up with a lot of them, 45 and a bunch of ROAs, about 14,000 for this small run I am going to show you today.
It takes about two hours to generate and upload it to StarBED. The upload is about 250 meg and takes about ?? we have to move some images around within StarBED. We run it at a one-to-one time ratio, so it runs for a full day. It produces a little bit of output, and then we have to transfer the logs to a server; the analysis is now pretty quick, we have got it cooked.
This would be the ideal, Pubd to the gatherers: running once an hour, you would expect a mean for the CDF of 30 minutes. I know. This is the ideal. The time for the protocol transfer is negligible. The expensive operation is the sort of the new state, which is order N log N, about what one would get if one did a cache reset. Feeding new data from the server to the client is just dropping pre-computed data, so that's fast. This is the actual measurement; the red is the ideal. This is Pubd to gatherer at start-up, so it has the start-up load: how fast can it just get the initial load down? So, we are dealing with a mean of about 40 minutes instead of 30. Here it is running steady state, adding events, one a second. And we can compare that; it's better because it doesn't have this big load to dump, it's negligible, this is going to be boring. This is good. This is getting it from the original publication points all the way down to the routers, the initial load, and here it is with a running load.
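To see where the 30-minute ideal comes from, here is a toy simulation under the timers mentioned in the talk (gatherers polling hourly, lower caches every ten minutes); it is an illustration of the expectation, not the measurement code.

```python
# Each polling tier adds, on average, half its interval: an object appears at a
# random point in the cycle and waits for the next poll.
import random

def simulate(poll_intervals, trials=100_000):
    total = 0.0
    for _ in range(trials):
        delay = 0.0
        for interval in poll_intervals:
            delay += random.uniform(0, interval)   # wait for this tier's next poll
        total += delay
    return total / trials

if __name__ == "__main__":
    print("publication -> gatherer:", simulate([60]), "minutes (mean)")
    print("publication -> router  :", simulate([60, 10, 10]), "minutes (mean)")
```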
It's all really reasonable, all being driven by how often the gatherers poll, and that is once an hour. You don't want it to be much more frequent because it overloads the gatherer, and poor Tim is going to faint. So we have to squeeze on that.
We also wanted to test other protocols. Here is BitTorrent. We just got one measurement done, and what we have proven is that BitTorrent is hard to configure. Maybe we will have reasonable results for NANOG. BitTorrent is hard to configure, and Richard points out it's not really the protocol you want to try here; maybe something like XMPP. And that is my story and I am sticking to it. Thanks to a whole bunch of people who threw in resources and caches. Anybody got any questions?
FILIZ YILMAZ: No?
RANDY BUSH: Nobody lost their lunch over the Comic Sans.
(Applause)
FILIZ YILMAZ: Thank you, Randy. Before I invite our next speaker to the mic, I also want to apologise for my jet lag to the NCC people. They told me, after I made the announcement, that the prize winner doesn't need to send a mail; they will e-mail him or her, so you don't need to drop by the registration desk. I hope you know who you are now; the prize winner is Anastasia. Oh great, we have you here. I don't know if you heard this at my first attempt at the beginning of the session, but now you know it. Please wait for an e-mail. You will receive one.
So now we have David Lebrun, and he is going to talk about routing configuration changes, forwarding changes and the correlation between them. He is being set up over there with his mic. I will leave the floor to him. Being a super star is not easy, you see.
DAVID LEBRUN: Hi everyone. So I am David Lebrun, I am doing a masters degree at UCL in Belgium, and this is my work at the IIJ Innovation Institute, which I performed this summer as an internship student.
So, correlating routing configuration changes with forwarding changes. We got 14 configurations from a tier 1 ISP, which give us information like syslogs. The changes we get from our own measurements: we want to measure the latency and the path changes from inside the ISP network, and we would like to detect eBGP events.
So, what measurements? We will do pings for latency and traceroutes to detect path changes, from servers or Atlas probes. The probes are provided by the RIPE NCC. And to what? To ISP routers or some reachable IPs.
So, about Atlas probes: they are probes distributed worldwide, I think you should know them. They are distributed worldwide and there are currently about 1,500 active probes. It's just perfect to measure a tier one ISP because it's widely distributed, but we need to know whether it's usable in all cases.
So, to sum up what we have: we have the syslogs of the routers, which give the exact time of every modification; RANCID, which gives us the IPs in the neighbourhood of the measured ISP; the Atlas probe list; and three dedicated servers provided by Randy, thank you Randy.
And we have all this raw data and we need to organise it. For the reachable IPs, we cluster them with respect to the exit point of presence of the ISP, so we perform traceroutes to know where the traffic exits the measured ISP, and the same for the probes, except that we need to know the entry point of presence. It's the same principle: we make traceroutes from the probes to some IP to know where the packet enters the ISP.
So the idea for the measurements is: for each probe cluster we take a probe, and for each IP cluster we select an IP, so we perform grid-like measurements to try to cover the whole ISP network, and we repeat these measurements.
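A rough sketch of the clustering and grid selection just described might look as follows; the input structures and the PoP-lookup helper are hypothetical, not the tooling used in the study.

```python
# Group targets by the exit PoP their traceroute uses, group probes by entry PoP,
# then pick one representative per cluster so the measurement grid covers the ISP.
from collections import defaultdict

def cluster_by_pop(traceroutes, pop_of_hop, isp_asn):
    """traceroutes: {endpoint: [(hop_ip, hop_asn), ...]} -> {pop: [endpoints]}"""
    clusters = defaultdict(list)
    for endpoint, hops in traceroutes.items():
        inside = [ip for ip, asn in hops if asn == isp_asn]
        if inside:
            # Last hop inside the ISP approximates the exit PoP
            # (for probes, the first inside hop gives the entry PoP).
            clusters[pop_of_hop(inside[-1])].append(endpoint)
    return clusters

def build_grid(probe_clusters, target_clusters):
    """One probe per entry PoP measuring one target per exit PoP."""
    return [(probes[0], targets[0])
            for probes in probe_clusters.values()
            for targets in target_clusters.values()]
```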
So for pings, there will be some problems. We want to detect eBGP events, but if we ping IPs that are in the neighbourhood of the ISP, in the neighbouring ASes, we can detect related events, but we have no guarantee that the ping will go through the ISP unless the route is enforced. Because the ISP is a transit provider, it costs money to go through it, and people tend to avoid this and go through peering instead. Moreover, the greater the distance from the ISP, the more noise will appear in the measurements.
But if we ping the ISP's border routers, we are sure we will go through the ISP, because they are the ISP's border routers. Of course it will be more difficult to detect eBGP changes, but we can detect many more internal changes and perhaps some congested links due to eBGP events, although that is quite difficult.
So, we have the starting point of the measurements, the attached probe here, and we want to ping the destination, for example. This is a scheme of the ISP with different POPs. If we ping this destination outside the ISP, we can have something like a link problem here, or something between the neighbouring AS and the destination, which will skew the results in ways we do not want. Or if the destination has a problem too, it will not be relevant for our case.
So this is why we will ping the ISP border routers, to measure that, and traceroute to routes outside. For the traceroutes there is also a problem, which is load balancers. Standard traceroute has problems with multipath; it can lead to non-existent links. We are here and we want to traceroute to the destination, so we send packets with incremented TTLs. This is a load balancer; it can load balance to A or B. For example, the first packet will be load balanced to A and then D and E, and we will infer a non-existent link between A and D. This we do not want, and in order to solve it there is this beautiful tool, Paris traceroute, which uses a multipath detection algorithm to fix the problem. Load balancers distinguish flows using some fields in the packet headers; it can be simplified as source IP, destination port, and there are some more, but not all are interesting here. So the idea is to mark packets with fields that are not used to distinguish the flows, so we can track the packets while maintaining constant flow parameters; if we maintain constant parameters across different traceroutes, we will get a consistent path between the source and the destination.
And then we are not affected by per-flow load balancers, but for per-packet load balancers there is still the issue.
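As a sketch of the idea behind Paris traceroute described above, the probes below keep the classic five-tuple constant across TTLs so per-flow load balancers see a single flow. The real tool does more (for example managing checksums), the target address is a placeholder, and scapy plus root privileges are assumed.

```python
# Minimal Paris-style UDP traceroute sketch: only the TTL changes between probes.
from scapy.all import IP, UDP, sr1, conf

TARGET = "192.0.2.1"          # hypothetical destination
SPORT, DPORT = 33434, 33434   # fixed ports -> fixed flow identifier

def paris_style_trace(dst, max_ttl=30, timeout=2):
    conf.verb = 0
    path = []
    for ttl in range(1, max_ttl + 1):
        pkt = IP(dst=dst, ttl=ttl) / UDP(sport=SPORT, dport=DPORT)
        reply = sr1(pkt, timeout=timeout)      # ICMP time-exceeded or port-unreachable
        hop = reply.src if reply is not None else None
        path.append((ttl, hop))
        if reply is not None and reply.src == dst:
            break
    return path

if __name__ == "__main__":
    for ttl, hop in paris_style_trace(TARGET):
        print(ttl, hop or "*")
```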
The probes. The problem with the probes is that most of them do not use the ISP as their default provider, as was to be expected. But some probe/IP pairs do go through the ISP, though they are far away from it, two ASes away or something like that. Moreover, we detected a high rate of path changes for traceroutes from the probes, which suggests there is some problem with the traceroute implementation, which does not maintain constant parameters. I don't know which parameter changes, but something must change.
So ping data for the probes is a little noisy; I will show some examples of graphs that are kind of noisy. For example, here we have packet loss; this is RTT in milliseconds over time. We have packet loss here which should not appear, and it does not appear for the same destination from the servers. Here you have packet loss again, and missing data. This missing data can appear on some graphs from time to time, and it can last for days. Here we have missing data, too. Well, I guess it's because a lot of probes are on home connections, so that can be a problem. Of course, there are not only bad results; there is also clean data like this, this one is very cool. We also have this one, where we can see a clear cut here, and some latency peaks.
Yes, this one is a little strange; we have a peak of 5 minutes RTT. I am not sure what happened, probably some overloaded buffer on the home connection.
Now the results we got. Another example of a ping graph: this is from the server in Seattle to a point of presence in Europe. We can see clear transitions here, and we would like to automate the detection of these if we want to make automatic correlations. So, I devised a heuristic to detect the transitions, and it gives something like this. It alternates light green/dark green for transitions, so here is one period, then another; it gets a bit confused by the peak here, but we can clearly see the different periods, and there is another one here.
Yeah, another example of the algorithm. For this one, for example, we see there is a drop in RTT here, but we can't really see where the cut is, and the algorithm says it is here. As we can see, this is quite accurate, because if we add the commits on the graph, we get this: each vertical line is a commit on the point of presence of the pinged router, and I have highlighted the transition in orange. I am not sure it is quite visible on the screen, but here it is, and there is the transition. What happened here is that a SONET link was added to an aggregate link, so probably the load was spread and the RTT decreased.
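The transition-detection heuristic itself is not shown in the talk; the following is a simple stand-in that compares medians in adjacent windows, just to illustrate the kind of detector meant, with assumed window and threshold values.

```python
# Flag a transition when the RTT level shifts between two adjacent windows by
# more than a threshold, so isolated peaks are ignored.
from statistics import median

def detect_transitions(rtts, window=30, threshold_ms=2.0):
    """rtts: list of RTT samples in ms (e.g. one per minute). Returns indices."""
    transitions = []
    for i in range(window, len(rtts) - window):
        before = median(rtts[i - window:i])
        after = median(rtts[i:i + window])
        if abs(after - before) > threshold_ms:
            # Keep only the first index of a sustained shift.
            if not transitions or i - transitions[-1] > window:
                transitions.append(i)
    return transitions
```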
Some other examples: here we have an IGP metric change. The metric is set at 255, then the metric is set at 1,000, we have the commit here, then it is set back to 255, and the result is a decrease in RTT.
Link maintenance. This is a bit more noisy, but we can still see it. Here are the orange commits, which brought down some link, and the RTT increased; the link was brought up and the RTT decreased; and the reverse, the link went down again and the RTT increased. This is not always the case: from time to time, when a link is down, the RTT will increase or decrease, depending on the context, the load of the routers and the configuration of the network.
Yes, some MPLS changes. This is a change in LSP bandwidth configuration. Here we have a ping from server 1 in the US to some POP in Asia, and this is from server 2 in the US to the same POP, the same router. What do we have here? Some commits changed the configuration: from server 1 the RTT decreased and from server 2 the RTT increased. So what happened in the configuration?
From server 1 to POP 2 in Asia, the bandwidth increased by 38 percent, and from POP 2 to server 1 the bandwidth increased by 300%, and the RTT decreased. Between server 2 and POP 2, the bandwidth decreased by 7%, and it increased by 90% from POP 2 to server 2, which led to an increase in RTT. So that explains this one.
Router upgrade, from another POP in Asia. We see several commits related to the router upgrade, and there was a clear drop in RTT, like 25 milliseconds or something like that. An interesting thing to notice is that this POP is the Internet provider for this other POP; we can see here a drop to 0, which means packet loss, at the same time as the router upgrade here.
Yes, this one is very interesting, too. It's not very easy to see on the screen, but before each latency peak there is a commit, and these commits are prefix list changes, and it happens all the time: every time there is a prefix list change, there is a latency peak right after. It doesn't last long, but it's still there. To give you an idea, the ping is every minute, and each point is an aggregation of 15 packets.
For the traceroutes, it's to 5,000 targets every hour, and each round takes like 15 minutes to complete. Over three weeks only 16% of the paths had changes, so it's not much; it's quite stable, actually. Of the 16% of paths with changes, some change more often than others. So what does it give? It is not very easy to see. Around 16% of the paths did not change more than 1% of the time, and only 2% of the paths that changed changed more than 5% of the time. So it's fairly stable.
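A small sketch of how such stability numbers can be computed from hourly traceroutes; the input format (one list of hourly paths per probe/target pair) is an assumption.

```python
# A path "changes" when two consecutive hourly traceroutes differ.
def change_rate(paths):
    """paths: list of hop tuples, one per hour. Fraction of hours with a change."""
    changes = sum(1 for a, b in zip(paths, paths[1:]) if a != b)
    return changes / max(len(paths) - 1, 1)

def summarise(all_paths):
    rates = {pair: change_rate(p) for pair, p in all_paths.items()}
    changed = [r for r in rates.values() if r > 0]
    return {
        "paths that changed at all": len(changed) / len(rates),
        "changed > 5% of the time": sum(r > 0.05 for r in changed) / len(rates),
    }
```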
So we took two classes of traceroutes: traceroutes that led to an internal path change only, where the exit POP did not change, and path changes where the exit point of presence changed. The first is caused by an iBGP change or a router or link failure, and the exit POP change is of course an eBGP change, as expected.
The internal path changes are about 50/50. When the path changes it can become longer, shorter or stay the same length; longer and shorter are about the same proportion, which means that it shrank and then increased again, or increased and then shrank. So, about 15% for path changes.
So, some results for the traceroutes. This is the absolute number of path changes, from one server to 5,000 reachable IPs, with data every hour. For example, here we have like one internal path change; here we have a lot of changes, and it was related to the shutdown of a tier 1 neighbour. It led, quite surprisingly, to more internal path changes than exit POP changes. And as you can see, the usual number of path changes is around ten or something like that.
Here we have lots of path changes, almost 250. These were external events; there was no correlation with the configuration of the ISP. It was BGP events from a US ISP, which led to huge path changes involving exit POP changes. And these events are related to the next one, which is for the same US ISP: it changed its status from peering to customer, so the policy was changed, and most paths changed from one exit POP to another, which is what we can see here.
And the same thing appeared here for the same ISP, but on a different point of presence, only with more intensity here.
OK. So, most of the events we had are IGP related. There are many fewer eBGP events than we expected, and most of them come from the outside, so it's quite difficult to correlate them with configuration changes. But important events are very easily detected, even from one or two sources; it's better to have two sources, but they are quite easily detected. For the Atlas probes, well, I think they need more work to be usable in a project of this scale. I will explain later what the issues are.
So what did we see? Prefix list commits lead to latency peaks on the router, which was quite surprising. Interface shutdowns cause a permanent RTT increase or decrease and internal path changes. MPLS changes cause RTT changes too, and eBGP events unsurprisingly cause exit POP changes.
IGP events are quite easy to detect and correlate because it's IGP, so we have the changes in our own configuration. Significant eBGP events can be easily detected with traceroutes, but it's more difficult to correlate them, since most of them come from the outside and we do not always have a related configuration change in the routers.
So for the Atlas probes, yes, for us the JSON interface is very, very cool for automation, scripting and so on. The manual probe specification was indispensable. The geographical dispersion is the most important asset of the probes, so it was quite good overall, and thanks again for the credits.
But some things can be improved. For example, when we wanted to start large measurements, we hit limitations on probes or UDMs; we got limited to, I think, 27 UDMs per probe, or something, I don't remember. And to fetch the results, when you have lots of traceroutes, pings and so on, it takes a long time for the interface to respond. Besides, if you are doing it from Japan it takes even longer.
Regarding the traceroute implementation, I didn't have the occasion to tcpdump the outgoing packets of the probes, but I suspect something is not correct in the implementation, because the parameters are not kept constant between different traceroutes for the same destination, assigned to the same UDM, I mean, of course. Any questions?
GEOFF HUSTON: In your data about path changes that you observed, what was the nature of those changes? I am kind of curious as to whether you were seeing a perturbation where you have state A and then some change and the routing went back to state A or were you seeing changes where you had state A and it changed to a stable state B?
DAVID LEBRUN: Usually it's in state A, it changed to state B and then state A again. So it's very rare that it changed from state A to state B and keeping the state B. This case happened in here, in this case it happens. It switched to state B and kept in this state because the policy changed and was not changed back.
SPEAKER: Jen Linkova, Google. You've done historical analysis. Are you looking into making these results usable for prediction? Like, what would I expect if I am going to make that particular change on the network?
DAVID LEBRUN: Yes, that can be the goal of this. Well, the goal is to build tools that can automate the detection of this, and we can automate up to the point where we can get the configuration changes with the related events, have some data reports of this, and then make predictions from what we have.
SPEAKER: Thank you.
DANIEL KARRENBERG: Responsible for RIPE Atlas. Thank you very much; this is one of the uses that I actually foresee for an infrastructure like this. I just wanted to make one remark here about the limitations that you want lifted. We are very carefully tuning the machine right now and we don't want to overload it. We definitely have the plan to lift all restrictions up to the point where there are only two: one is the amount of credits that you have, and the other is a daily spending limit. Think about it as an ATM: you have a bank balance, you cannot get out more than the balance, and there is a limit per day. But what I would like to stress, and we also did it in this case, is that if somebody comes with an interesting project, with an interesting thing to measure, please by all means contact us. We try to accommodate any interesting work in any way we can, but we have to be careful that the machine remains stable and that we do not impose too much on the hospitality of the probe hosts, of course.
FILIZ YILMAZ: Anybody else? Any questions? Well, thank you, David. I just want to make an extra remark here.
(Applause)
I specifically like this session because we have young people presenting this time, and David was one of them, and our next speakers are also young students. Please approach to get wired up as I introduce you. We have Maikel de Boer and, I have to be careful about the names, Jeffrey Bosma, and they are going to be talking about black holes, discovering them using RIPE Atlas. This is their first RIPE meeting as well, so go easy on them, but it's really great, I think, to have new people joining us every meeting and also having them as presenters. There you go.
JEFFREY BOSMA: Thank you, Filiz, and good afternoon and welcome. My name is Jeffrey and the guy standing over here is Maikel. What we will be presenting to you today is a research project that was part of our masters thesis; we both just got our masters degree in system and network engineering at the University of Amsterdam, so we are really newcomers here. Our research project is about the discovery of path MTU black holes in the Internet using RIPE Atlas, so this is a project that uses RIPE Atlas in, we think, a very creative way.
But let's first look at the definition of a black hole. According to the dictionary, it is a sphere of influence into which or from which communication or similar activity is precluded. As fancy as this may sound, in layman's terms it all boils down to: what goes in is forever lost. The current Internet is full of these black holes, as you might already know.
There are many possible causes for these, such as misconfigured nodes, let's say firewalls, and in our research we tried to focus exclusively on path MTU black holes. So, in our research we came up with two leading questions. The first is where the black holes are actually located, where they occur, and we do this for both the IPv4 and the IPv6 Internet, so that at the end we can make a comparison between both protocols and see if there is a larger occurrence in one or the other.
First, some concepts. You all already know that the Internet is a huge set of network links connected together, networks, etc., and each of the interfaces connected to those links has a maximum transmission unit. The maximum transmission unit defines the amount of data that can be sent and received, so it works both ways. The path MTU defines the amount of data that can be sent and received from point A in a path to point B. As the picture illustrates, you can see a bottleneck, a smaller-MTU link, and this link defines the path MTU for the entire path. The path MTU in the Internet is commonly 1,500 bytes, but this is not always the case, as you also know. This is because of different transmission media or encryption or other stuff.
So, we really need a detection mechanism to find out what the path MTU is in order to successfully deliver packets at the other end point, and this is exactly what path MTU discovery is used for. The simple example illustrates a conceptual overview of the protocol: on the far left we see a client connected to the Internet through a firewall, and on the far right we see a DNSSEC server, also connected to the Internet but through a different firewall. On the first line we can see that the client issues a DNS query to the server. The DNS server then replies with a rather large packet of 1444 bytes; that's because of the DNSSEC data being large and containing all kinds of hashes and signatures. But we can see that there is a problem. The intermediate node in the Internet just before the bottleneck of 1280 bytes sends back an ICMP packet too big message; this is because between firewall number one and the Internet there is a link of 1280 bytes, so the packet of 1444 bytes cannot travel to the other side. The DNSSEC server in turn, once it receives the ICMP packet, will fragment the DNS reply into two different packets, both with a size that can travel successfully to the client.
Unfortunately, we do not live in a perfect world and things obviously go wrong. What we can see here is that the second firewall in this example is configured to simply drop ICMP packets. So what happens is that the DNSSEC server again replies to the query that the client sent, but the ICMP packet too big, the packet that signifies that the reply is too big, does not get all the way to the server, because firewall two is configured to drop ICMP packets. This is one example of a case that could cause path MTU black holes.
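For illustration, here is a rough, Linux-only sketch of probing for such a black hole from the sender side using the kernel's path MTU cache. The socket option values are Linux constants, the target is a placeholder, and this is not the authors' measurement code.

```python
# Send full-size UDP datagrams with DF set and watch whether the kernel ever
# learns a smaller path MTU from an ICMP "packet too big"/"fragmentation needed".
import errno
import socket
import time

# Linux values, used as fallbacks if the socket module does not expose the names.
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)
IP_MTU = getattr(socket, "IP_MTU", 14)

def probe_pmtu(target, port=33434, size=1472, tries=5, wait=1.0):
    """Return the path MTU the kernel cached for this destination."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)  # set DF
    s.connect((target, port))
    for _ in range(tries):
        try:
            s.send(b"\x00" * size)          # full-size probe with DF set
        except OSError as e:
            if e.errno == errno.EMSGSIZE:   # kernel already learned a smaller PMTU
                break
            raise
        time.sleep(wait)                    # give an ICMP PTB time to arrive
    pmtu = s.getsockopt(socket.IPPROTO_IP, IP_MTU)
    s.close()
    # If the cached PMTU never drops and large probes go unanswered, either the
    # path really is 1,500 bytes or the PTB messages are filtered; that ambiguity
    # is exactly why the experiment later creates its own bottleneck.
    return pmtu

if __name__ == "__main__":
    print(probe_pmtu("192.0.2.1"))          # hypothetical target
```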
Another example is the filtering of IP fragments, which in this case is what firewall one does. We can see that the DNSSEC server transmits a fragmented reply to the client, but firewall one simply drops these packets. So, we thought of an experimental set-up in order to determine where on the Internet these kinds of problems occur, and we came up with three options. The first option has a measurement node that sends large packets, such as the DNS data of 1444 bytes we have seen before, to the other nodes, and then receives small packets in the other direction.
The second option is basically the opposite; large packets are being sent to the measurement node and small answers or packets, you could say, travel back.
The third option is a hybrid, where all nodes are used as measurement nodes, so all nodes receive and send large packets.
In the end we had to decide which kind of set-up to use, and we decided to use the first and second options. What we do is basically capture all interface traffic on the one measurement node we have, and this means that the other nodes only have to log basic success and failure information. Why this is the case, I can tell you later.
A con of this set-up is that there is no possibility for triangulation, unfortunately. As I showed you in the research title, we use RIPE Atlas for the measurements of our path MTU black holes. We do so by looking at where ICMP packet too big messages and IP fragments are filtered.
David has already talked about RIPE Atlas probes, but this is actually how one looks, for those who don't know. It is a small device powered via USB, and they have basic measurement functionalities such as ping and traceroute. Currently there are around 1,800 probes up and running, and these are their locations. You can see that most are in the RIPE NCC service region, but there are also probes elsewhere around the world.
Having said this, I will give the word to Maikel, and he is going to tell you more about the results and our practical approach.
MAIKEL DE BOER: So Jeffrey talked about what black holes are and how they work, so the concepts should be clear. This leads us to our research question, which we tried to answer: where on the Internet do path MTU black holes occur, and do they occur more often on the IPv4 Internet compared to the IPv6 Internet?
So, to make it a little bit more practical, we built two testing set-ups, of which this is the first, in which we test where ICMP packet too big messages are filtered. We see two servers in our laboratory here in Amsterdam; Belgrade is the name, not the location, they are both located in Amsterdam. The Belgrade server is running an Apache 2.2 web server and runs tcpdump, capturing all the packets travelling in and out of the machine, and the same goes for the other server. Between these two servers we have a link for which we can control the maximum transmission unit, if necessary, for the experiments.
What we did is we let every RIPE Atlas probe which we could use at a specific point in time do an HTTP POST request to our web server. If we see the small packets, like the three-way handshake, in our log, but the bigger data from the POST does not arrive at our set-up, there is probably a black hole occurring. The [chummy] server is the place where the bottleneck is; it will send an ICMP packet back to the probe, and if the probe responds to it, the POST will actually work. If this is not the case, then we assume we have found a black hole.
For IP fragment filtering, to find out where IP fragments are filtered on the Internet, we built a second set-up. This time we used the Berlin server, located in Amsterdam, not in Berlin, and the chummy server. We can control the maximum transmission unit, and we set it to the protocol minimum MTU. Both servers, one for IPv4, that's the Berlin server, while the chummy server does the IPv6 tests, run the ldns-testns testing name server, which is capable of responding with big replies. What we see here is a 1590-byte version.bind reply which we send back, and which is fragmented because our maximum transmission unit is set low.
And then if the actual request works, we can check the logs of the RIPE Atlas probes and if it failed we could also see that.
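A minimal sketch of the client side of such a fragment test, assuming dnspython and a test name server like the one described; the server address is a placeholder.

```python
# Ask for a TXT record large enough that the UDP answer must be fragmented,
# and treat a timeout as a possible fragment black hole.
import dns.exception
import dns.message
import dns.query
import dns.rdataclass
import dns.rdatatype

SERVER = "192.0.2.53"   # hypothetical test server configured to give big replies

def fragment_test(server, timeout=5.0):
    q = dns.message.make_query("version.bind", dns.rdatatype.TXT,
                               dns.rdataclass.CH)
    q.use_edns(edns=0, payload=4096)   # allow large UDP answers
    try:
        r = dns.query.udp(q, server, timeout=timeout)
    except dns.exception.Timeout:
        return "no reply: the fragments (or the query) were probably dropped"
    return "reply of %d bytes arrived intact" % len(r.to_wire())

if __name__ == "__main__":
    print(fragment_test(SERVER))
```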
We get to some results. This is for the ICMP packet too big filtering, and what we see here is that 0 paths from our infrastructure to the probes filtered the ICMP packet too big messages. You would think that's a good thing. Same for IPv6: a relatively small number of paths do filter. But we believe that, because we set the maximum transmission unit to 1,500 bytes, the complete path is 1,500 bytes and actually no ICMP packet too big messages are generated, which would explain why we don't have interesting results, as I will call them.
Now we see this in percentages: with a complete path of 1,500 bytes, 0% of the paths drop the ICMP packet too big messages, and for IPv6 it is between a quarter of a percent and half a percent. As I said before, we think no ICMP packet too big messages are generated, so we created the bottleneck ourselves: we set the maximum transmission unit of the interface to 1280 bytes, and for three runs of experiments we observed between 43 and 69 paths actually filtering the ICMP packet too big messages in IPv4.
And for IPv6, we ran the same experiment, and between three and four paths filtered ICMP packet too big messages.
To compare those in percentages, that's between 4 and 6% for IPv4 and between three-quarters of a percent and 1% for IPv6. We believe the results for IPv4 are rather higher because the complete path in IPv4 networks is usually 1,500 bytes and path MTU discovery almost never happens, is our belief.
And in IPv6 most administrators will probably know the importance of the ICMP packet too big messages and will not filter them out.
So, we also did the experiments for fragment filtering. What we see here is fragment filtering for IPv4 when the maximum transmission unit is set to 1,500 bytes. This is for six runs, and I guess the numbers are rather self-explanatory. We did the same for IPv6. The thing is, for IPv6 the proportion of fragments which are filtered is rather high, so we wondered how that is possible: 40% for IPv6 filtering its fragments and 13% for IPv4. Because the maximum transmission unit is set to 1,500 bytes, somewhere in the path a bottleneck could occur, and our ldns-testns name server does not respond to ICMP packet too big messages, does not start a new transmission or anything; it just fails. So what we can learn from this is that our name server does not really respond to the ICMP packet too big messages; it just stops and everything is lost.
So, when we set the interfaces to the protocol minimum MTU, we believe we get better results, so we can actually see where fragments are filtered. For IPv4 this is 84 paths from our infrastructure to the probes which are filtering the fragments; for IPv6, maybe it's easier to see them in percentages next to each other: approximately 6% of the IPv4 paths filtered the fragments and approximately 10% of the paths filtered the fragments for IPv6. So, this problem is real and could become problematic when more and more people start using, for example, DNSSEC, which responds with big fragmented replies, which in 6 to 9% of the cases are just filtered out.
So this answers the questions where do ??
AUDIENCE SPEAKER: A quick question, so these are DNS packets, this is UDP not TCP?
MAIKEL DE BOER: Yes.
AUDIENCE SPEAKER: You didn't test TCP?
MAIKEL DE BOER: With HTTP, but sorry, not for TCP fragments, no.
AUDIENCE SPEAKER: This is all UDP?
MAIKEL DE BOER: Yes, UDP fragments.
AUDIENCE SPEAKER: I would be really interested to see TCP.
MAIKEL DE BOER: We were looking at TCP fragments, but it was difficult to get real TCP fragments for this, unfortunately. Thank you. So we answered the questions of whether fragment filtering and ICMP packet too big filtering happen often and on which protocol they happen more often. Now we try to make a guess where on the Internet this happens. Does it happen near the customer premises, or in the core? To make this educated guess, we developed a technique we like to call hop counting, as sketched below. What we basically do is make two lists of IPs of probes: a list of IPs of probes for which the paths do filter fragments or packet too big messages and a list of IPs of probes for which they don't. Then we do a traceroute to all those IPs, and after we have collected all these traceroutes we walk through them step by step. For example, for a traceroute along a path which does not filter the fragments or packet too big messages, we put a plus one on the left side, for example here at router one, so two paths travelling through router one do not filter IP fragments or packet too big messages and four of the paths do. Our theory is that if you have a router like router 3, which has zero paths which went OK and three paths which were faulty, then router 3 is probably responsible for filtering out the ICMP packet too big messages or the fragments. And then, if this router is a relatively big router through which many traceroutes pass, it would indicate it's probably in the core of the Internet; otherwise it could be somewhere on the edges, in business premises or customer premises.
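Here is a compact sketch of the hop counting technique as described; the traceroute input format is an assumption.

```python
# Every router on a "good" path gets +1 ok, every router on a "faulty" path gets
# +1 fail; routers seen only on faulty paths are the prime suspects for dropping
# the fragments or packet too big messages.
from collections import defaultdict

def hop_count(traceroutes):
    """traceroutes: list of (hops, ok) where hops is a list of router IPs."""
    counts = defaultdict(lambda: {"ok": 0, "fail": 0})
    for hops, ok in traceroutes:
        for router in hops:
            counts[router]["ok" if ok else "fail"] += 1
    suspects = []
    for router, c in counts.items():
        total = c["ok"] + c["fail"]
        if c["ok"] == 0 and c["fail"] > 0:
            # 100% failure; a small total hints at the edge, a large one at the core.
            suspects.append((router, c["fail"], total))
    return counts, sorted(suspects, key=lambda s: -s[1])
```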
So, these are the results for ICMP packet too big filtering for IPv4. What we see here, the first, actually the second line, is the router to which our set-up is connected, so this is a SURFnet router, and then if we go down the list to the dotted line we see all the traffic scattered to different routers on the Internet. If we look at the first routers which have an error percentage of 100, we see these are routers only two traceroutes passed through, so this is relatively small, indicating this is probably somewhere around the edges of the Internet.
Same for IPv6. Again we see the SURFnet router on the second line, only this time we did not find any routers with a 100 percent failure rate.
And for fragments, we did the same again. Again the SURFnet router, and three out of three, which is again a relatively small router, and the same for IPv6 where fragments are filtered, only then it's a bit bigger, six out of six, so it could be somewhere more in the core, but we don't know for sure. There is one small side note I need to make about these results, because every traceroute is done to our infrastructure. If we have, for example, an Atlas probe in South Korea, where there is only one the last time I checked, and this specific core router of the South Korean ISP filters the ICMP packet too big messages, then we would think it's somewhere in the customer edges or something, but it could be a core router of this ISP. So, to make this research more concrete, it would be nice to have more vantage points where we put our infrastructure or to traceroute to.
Conclusion: ICMP packet too big messages get dropped more often in IPv4, but nobody notices because the paths are probably all 1,500 bytes and the DF bit, the don't fragment bit, helps. Maybe we forgot to mention that in IPv4 it's possible for the core to fragment packets and have them reassembled while travelling through the network. In IPv6 this is not possible, so we believe this is why IPv4 is more resilient to these problems.
IP fragments also get dropped, more in IPv6 than in IPv4, and our DNS server did not respond to ICMP packet too big messages, which could be a problem for clients requesting DNS data.
And using our hop count algorithm we found that the problems probably occur on the edges of the Internet. So before you go home, we would like to give you something you can think about when you get home or at work.
The path MTU discovery protocol is an important protocol for making the Internet work like it should, so please all read RFC 4890, which has recommendations about filtering ICMP packet too big messages in firewalls. Don't filter type 3 code 4 messages for IPv4, and if those two do not help, you could also think about using packetisation layer path MTU discovery. This does not rely on ICMP packets, but on TCP packets growing and growing and probing the path.
We know there are problems with fragmentation attacks, but for the sake of DNSSEC we think it's not a good idea to filter fragments. But whatever you do, please don't reduce the maximum transmission unit on your interface. By doing that we no longer have dynamic ways of finding out what the path MTU is, and we are stuck with legacy small segments, which are bad for the performance of the network; the same goes for MSS clamping.
During our research we were helped a lot by NLnet Labs, which bought us the tickets to speak here, thank you very much, and we would like to thank the RIPE NCC for making it possible to use the RIPE Atlas probes and to speak here. If you have any questions, please ask them now, or read our full report at NLnet Labs/publications.
GEOFF HUSTON: From APNIC. I have two comments, if you would like to go back two slides. DNS servers don't respond to ICMP path too big because they are UDP. Once you send the packet, it's gone, and whatever you get back is irrelevant because it's UDP and you have forgotten it. That's the problem with large packets and DNS and v6.
Next slide.
MAIKEL DE BOER: Can I comment on that, please? Because I believe that the information in a packet too big message is actually enough for the DNS server to give a better reply back.
GEOFF HUSTON: It's forgotten the query. The server has forgotten everything, the transaction is over, it's gone, the path MTU response is gratuitous. It does nothing with it.
The next slide.
MAIKEL DE BOER: Next or previous?
GEOFF HUSTON: "Don't reduce the MTU on the interface". I am not sure that's the best advice in the world that is still has all these crappy filters around. Actually doing, if you look at dub dub dub.google.com it runs at 1274; in other words, it deliberately has reduced the MTU in order to try and slide under that 1280 limit that everybody in theory should be supporting. And I am not sure that that advice about don't reduce it is actually good advice today. If we have got rid of all this ICMP filtering, I'd' agree with you, but while it's till there, what do you do because at some point if there is a filter you are sending out the big packet; you don't know that it didn't get there, and there is no information coming back because the path too big message that would help you has been lost. So I would seriously say that that particular recommendation folks should think about a lot because sometimes the quick work?around is to actually drop the MTU. Thank you.
AUDIENCE SPEAKER: One bit at a time, that's the best way to do it.
GERT DORING: I am not actually sure what I want to comment on now because now I have to challenge what Geoff said. Could you go back one slide? That one. I think it's very much unsurprising that IP fragments get dropped if you don't listen to the packet too big message because there is no 1,500 byte MTU to the edge. So, that result is not that enlightening. And in IPv4 everybody ignores the don't fragment bit anyway, so you don't need the ICMPs.
But I want to challenge what Geoff said, and that's that while the DNS server software might have forgotten your query, the DNS server host will see the ICMP packet too big, and when the client retransmits its packet, the next response will have smaller fragments, because that's not happening in the DNS server software but in the routing layer, where the MTU per destination is cached. So querying again, if the ICMP packet too big reached the DNS server, will get you smaller fragments, and running this test with just one packet per destination will give different results compared to running multiple packets to the same destination and listening to the ICMP packets too big.
MAIKEL DE BOER: Thank you.
AUDIENCE SPEAKER: Andreas Polyrakis from GRNET. You have a slide where there is a big percentage of IPv6 fragment drops, 40% if I remember correctly; yes, this one. So from our side, we can confirm that we see this kind of drop in fragmented v6 traffic, and actually we are also scratching our heads to figure out why, trying to conduct comparisons of various sites, and we don't have an answer yet. But we found out that one of the biggest router vendors has a bug in their firewall implementation, so fragmented traffic bypasses the firewall, and I assume that, as a measure against this bug, some ISPs have put in a filter and drop IPv6 fragmented traffic entirely. I don't know if someone in the audience can confirm this, or has a better answer, but this could probably explain this big number.
MAIKEL DE BOER: We personally believe that this is because the server sends a big reply and there is a tunnel or a bottleneck somewhere in the path; an ICMP packet too big should be sent back, but the server does not respond to it, and that's why we think we have a relatively high failure rate when we put the MTU at 1,500 bytes, we believe.
AUDIENCE SPEAKER: Jen Linkova, Google: Could you please show the slide with the v4 and v6 difference for ICMP filtering? If I got you correctly, and sorry if I misunderstood, you mentioned that the difference is probably explained by the different paths, yes? What is your explanation for this difference in the numbers?
MAIKEL DE BOER: We believe that in IPv4 administrators just block all ICMP traffic, and in IPv6 most administrators probably know the implications of filtering every ICMP packet. We believe.
AUDIENCE SPEAKER: It sounds right. I think it's probably just because in IPv4 you just say deny ICMP unreachable, yeah.
MAIKEL DE BOER: Exactly.
AUDIENCE SPEAKER: I would like to thank all the people involved in writing ??; it's a dedicated ICMP type, just for packet too big. Yes.
MAIKEL DE BOER: Thank you.
FILIZ YILMAZ: No? Thanks a lot.
(Applause)
So, we are ten minutes early for the coffee, to which I am sure you don't have many objections. It will be good to have some coffee and come back right on time to follow the Internet governance panel. Just before you leave the room, though, keep rating the presentations and the talks you see. There will be another prize for that. Thank you.
(Coffee break)