RIPE 84, plenary session
17 May 2022
WOLFGANG TREMMEL: Welcome to the Tuesday afternoon session of the RIPE 84 plenary. I am Wolfgang and I am going to chair this session together with Jan. One word in advance: you can still put your name into the hat for the PC elections. Just check the website to see what the Programme Committee does, and if you are interested, just send an e-mail.
And with that, I'd like to introduce the first speaker, who is Stephan Schroeder of BENOCS; he will talk about the invisible impact of network hand-overs within content delivery.
STEPHAN SCHROEDER: Thanks for having me. It's great to be back at in-person meetings and obviously great to have you here in Berlin, which is our home town. I am from BENOCS, a spin-off of DT, and we are focusing on flow optimisation and visualisation for network analytics.
So, what I want to share with you today is a couple of experiences we have gathered over the last few years in scaling up networks to deliver all that predominantly streaming media into subscriber access networks, and what typically fails and what works well in the handover between CDNs and networks, obviously larger networks where you have a choice of ingress points to select from.
I think I need to check this one here. That works. So what we have seen over the past years, and I think it will likely continue, is a trend to the edge: more and more content and also compute power will move deeper into the networks, and alongside that trend, over time, this will shrink the backbones. There are a couple of positive elements in deploying content closer to the edge: it's good for customer experience and, if it's done right, it should also reduce the cost of traffic delivery in the networks themselves. On the other side, you need to allow for more headroom, because capacity dedicated closer to a user group cannot be overbooked as much, and currently we also see that fail-over concepts are either non-existent or poorly synchronised and aligned.
One warning stands here: the backbones will, over time, shrink to a level where they are not able to hand over the entire traffic from one region to another any more. And if you don't shrink them to that level, you will just not enjoy the savings which all that localisation should bring over time. So that means localisation, and the stability of that localisation, will be key to make sure that delivery actually works fine.
So I have an example here of how that can fail and how that looks in a network. This is a multi-dimensional flow view of traffic, about two weeks old, so pretty recent. What you see in the different dimensions here: this is the source of the traffic, so that CDN is delivering its own traffic here; they have a couple of handfuls of ingress points into that network; then you see here the backbone links which go into a random region A of that network operator; and then it is distributed among a couple of hundred BNGs to the subscriber base. So that's a typical example of how traffic gets allocated. You would probably start wondering why that CDN is using all of the different ingress points to deliver traffic into that region A; that could be load balancing or whatever. If you look at it on a timeline, it turns out to be more random. This is a one-week time frame now, showing the relative distribution into that region A by ingress point. What you see there, there doesn't seem to be a concept behind it, or that CDN's mapping is a bit confused by whatever. But what that does to the network is that the network operator needs to maintain three times the egress capacity of that region from all the different ingress points on the backbone, because just within a single week, all of the capacity was utilised on three different routes. So that is obviously a pretty bad example of how localisation can work, because if it's just flipping around like that, there's just no way that you can reduce backbone capacities accordingly.
So that means when you actually start investing, regionalising services and doing all that effort to get to the edge, make sure your localisation actually holds up.
So why is localisation important? There are two elements to it. One is the user experience involved, so we obviously have network distance, we also have hop counts; then it typically comes down to network expenses on the operator side, so it is a relevant factor and can be measured as well. This is obviously not a new topic, so there are different solutions and work-arounds for how to address that mapping question. A lot of CDNs still simply do Anycast, which means they just reply to the request at the location where they received it; pretty easy, but the problem is you can't really do a lot of load balancing and you need to hold the content everywhere, so there are a couple of downsides to that as well. Some operators and CDNs agree that they will virtualise the DNS locations, or say I have a DNS location which represents the full region; that needs a lot of configuration, and we see in real life that it doesn't seem to work as promised. Larger CDNs just do pings and random measurements all over, which also comes with upsides and downsides. Others buy geoIP data or determine that themselves. What we actually do, for a couple of CDNs on the DT network, is exchange realtime data so that the decision about where traffic should be delivered from is based on the routing tables. I will take the highlighted ones, the DNS-related mapping and the external geoIP, as examples for some observations we made.
So using the DNS resolver address to locate users has a couple of upsides. One is you only have to manage a handful of addresses, so that's pretty easy. The DNS locations are fairly stable and easy to communicate, so it's very straightforward. And you also don't need ECS activated, so that keeps cache hit rates at resolvers pretty decent. What we also see, on the downside, is that with DNS resolvers you typically have a primary and a secondary, and that secondary may not be in the same location, so it might be far off; and it fails for DoH and for smart TV DNS settings which are hooked up to whatever operator you would have. So it's not really a fantastic solution.
I have two examples of what also goes wrong there. In the top graph here, you see this one: that operator had, historically, configured their DNS resolvers on the mobile network in a round-robin load balance. That was pretty nice to start with, but once the CDNs started basing their mapping on the DNS location, they were obviously tricked around all the time. It was not easy to determine that, so you need some sophistication and tools to actually see what's happening. Once that was repaired, it became far smoother, and you see the impact on quality of experience here: there were a couple of outliers in TCP handshake duration, and that got repaired pretty quickly and was within expected levels afterwards.
Another one here was that for a handful of BNGs there was just a very remote DNS server configured for that subscriber base. That caused problems, and it also needed some investigation to identify: the users in random cities, some of the population there, just had a pretty poor experience because traffic came from the other corner of the country. Interesting here is that there are no daily forced cut-offs any more on broadband subscriber lines, so it took more than two months to age out after it was reconfigured, because all the subscribers obviously had to have a downtime so the home router would get a new DNS assigned. So these things happen. Good if you can identify them, and then there are obviously different ways to repair that stuff.
Then, getting to the second example of how mapping is done, geolocating users: the good thing is, databases are broadly available, it's pretty straightforward, and it seems to work, but I clearly recommend double-checking whether it's really working. And there's no active engagement needed between CDN and ISP, so specifically if you cover a whole region with hundreds of ISPs around, this is a pretty straightforward way to go.
There are a couple of downsides to that. A very frequent one is that the announcements you get on eBGP don't really equal the iBGP allocation in the network. You have an update delay if something is reallocated, and then geodistance doesn't always equal network distance, and I have an example for that here. And also the outbound path, if you have asymmetrical paths, is an important thing to consider, also in my example.
So, first, iBGP versus eBGP. What you would see in the route announcement at your peering or interconnect position would be, I have just taken whatever /12 here, so that's about a million addresses, and if you take, say, the average weighted location, that would probably be around Braunschweig in Germany, just as an example. But in reality that prefix is actually allocated to about 220 major prefixes in the network, and they are literally spread over the entire country. So you see here the different co-locations of that operator, and if they are dark green they have about 100,000 users allocated; that's how the network looks from the inside. If you now try to send everything towards Braunschweig, it's probably wrong in most cases.
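To make the scale of that mismatch concrete, here is a minimal Python sketch, with an invented /12 in private address space and made-up site allocations rather than the operator's real data, of how a single eBGP announcement can hide a country-wide internal spread:

import ipaddress
from collections import Counter

# Externally visible announcement (eBGP view): one /12, about a million addresses.
announced = ipaddress.ip_network("10.16.0.0/12")

# Internal allocation (iBGP view) -- invented example sites; the real case in the
# talk had roughly 220 major prefixes spread over the whole country.
internal_allocations = {
    "10.16.0.0/20": "Hamburg",
    "10.16.16.0/20": "Munich",
    "10.17.0.0/20": "Frankfurt",
    "10.20.0.0/20": "Cologne",
}

per_site = Counter()
for prefix, site in internal_allocations.items():
    net = ipaddress.ip_network(prefix)
    if net.subnet_of(announced):
        per_site[site] += net.num_addresses

print(f"{announced} covers {sum(per_site.values())} internally routed addresses here")
for site, count in per_site.most_common():
    print(f"  {site}: {count} addresses")
# Geolocating the whole /12 to one 'average' point misses this spread entirely.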
Another topic, which I mentioned before, is the update delay. This is an example of a major video player who was actually pretty accurate with their regionalisation and localisation. The problem here is that v4 prefixes in particular get reallocated every now and then because of the scarcity around them. So the traffic you see here is traffic into a region of that operator which comes from a remote ingress point, traffic which shouldn't be there in a properly localised situation, but then all of a sudden it pops up, and it takes two weeks for that video company to recognise that this IP prefix has changed. It's not too bad, two weeks is pretty good, but with some geoIP data providers it might also take a couple of months, even, so better check also the speed of updates you would get if you use a geoIP provider there.
Then, fail 3 here; I will spend a few more minutes explaining what happens here. The first step is you have been able to break down that /12 into the /20s which are allocated to the different regions, and I now have an example here from Germany, a user group which sits between Frankfurt and Stuttgart, and you identify a couple of prefixes for end users here. In the second step you need to decide whether you serve these subscribers from Frankfurt or from Munich, the two ingress points you have available. You take a geolocation measurement here and see Munich is 414 kilometres, Frankfurt is 234; seems like an easy decision, that is 80% more for Munich, so you take Frankfurt. If you actually had insight into the network topology of that operator, you would see that the traffic from Frankfurt actually takes a scenic tour and goes to Stuttgart first. Then it's not that obvious any more, because Munich to Stuttgart to Heilbronn is 460 and Frankfurt to Stuttgart to Heilbronn is 390, so that's pretty close already.
But now the biggest problem: the outbound routing path. In reality that operator is egressing that traffic to that CDN in Munich, for whatever reason. So what you see is: from Munich via Stuttgart and Heilbronn and back you are at 460 kilometres, but the traffic from Frankfurt actually goes to Stuttgart first, and the return traffic goes to Munich where it egresses to that CDN; well, it's a session, so it needs to get back to the server in Frankfurt, the scenic route, so in total you end up with 735 kilometres, which is three times what you expected in the first place. So you see, geographical distance is really a pretty poor representative of network distance, for two reasons: the topology might be different, and also the egress routing. And just to prove that this is a factual case: that's a screenshot of that traffic, you see 99% is ingressing in Frankfurt and goes to Heilbronn, and then this is the egressing portion of it, and 76% of that traffic actually egressed in Munich and then had to be carried back. So these things happen. Visibility helps a lot, and also collaboration between CDNs and operators to make that as efficient as possible. With these huge events now, we see traffic levels which will probably force all of us to be more efficient with the network resources we deploy, just to be able to carry all that traffic, and I think localisation and traffic management and traffic engineering become an ever more important element of the game here.
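As a back-of-the-envelope illustration of that gap, using the approximate kilometre figures quoted above (treated as rough values from the talk, not measurements of my own):

# Approximate kilometre figures as quoted in the talk (rough values only).
geo_km     = {"Frankfurt": 234, "Munich": 414}   # great-circle distance to the users
network_km = {"Frankfurt": 735, "Munich": 460}   # forward + return path actually taken

for ingress in geo_km:
    print(f"{ingress}: geo {geo_km[ingress]} km, "
          f"network path {network_km[ingress]} km "
          f"({network_km[ingress] / geo_km[ingress]:.1f}x)")

# The 'closer' Frankfurt ingress ends up with roughly three times the geographic
# distance once the Stuttgart detour and the Munich egress are counted in.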
Okay. So that's it from my side and I am open for questions now.
JAN ZORZ: Okay, thank you very much.
(Applause)
Any questions? There is nobody in the queue, online, nobody is running to the microphones in the rooms. Are there any questions ‑‑
WOLFGANG TREMMEL: Any questions?
JAN ZORZ: I think everything was clear.
STEPHAN SCHROEDER: Fantastic. If there are any questions popping up, stop me in the show, I will be around.
JAN ZORZ: Thank you very much.
STEPHAN SCHROEDER: Thank you.
(Applause)
WOLFGANG TREMMEL: While the next presenter is getting ready, I would like to remind everybody to rate the talks. You can also rate talks from this morning if you have not done so, and you can still put your name forward for the Programme Committee. And I see Matthias walking up to the stage, so I would like to introduce the next presenter, who is Matthias Wichtlhuber of DE-CIX, talking about peeking into black boxes: automated fuzzing of router resource usage. Matthias, go ahead.
MATTHIAS WICHTLHUBER: Obviously you can hear me, that's nice. This is joint work with appreciated colleagues from ‑‑ and from DE-CIX. What we did there, as per the title, is trying to peek into black boxes, which are the routers that every one of us is using, and trying to find out how they internally map their resources to configurations and vice versa.
So, why are we doing this? You are all operators of some critical infrastructure, so you all know that doing changes requires testing; everyone who has committed a change on a Friday afternoon and had weekend plans cancelled afterwards knows that. One thing that is often a blind spot in router test plans is the resource usage of the boxes: how are your configurations laid out onto the internal hardware resources of the box, or, more specifically, where are the router's limits? TCAM space is a good example here; it's the hardware part of the router that does the header matching and is used for all the forwarding of the box, so it is a critical structure, and that is something we would like to measure here. Also, you are probably aware that network hardware has certain life cycles, so you want to make sure you can use your hardware for five years or longer, and you still want to know whether you have enough headroom for future innovation in three or four years.
So, we found it really difficult to look into these issues, and we at DE-CIX are probably running into them a bit more frequently than other providers because we are always doing things at scale and in very dense deployments, often in a way that is not necessarily anticipated by the vendors. And while it's hard to measure that stuff, vendors, at least in our experience, are pretty tight-lipped on hardware resources; you do not want to make yourself comparable to the competition. The other thing is that this is a really complex topic. Because, you know, you are already starting with some deployment on your platforms, so you already have a certain base resource usage, then there are different features everyone can use, and it's really not easy to give a satisfying answer there. Let me give you a little anecdote on that.
A few years back, I was sitting in a meeting with a few engineers from my vendor and we were discussing a certain configuration construct, and I asked them: how much of this can I use, if I'm applying this to every port, will it still work? And the first answer was, okay, we cannot tell you. After I insisted a little bit, they told me: of course we have this information, but it's like a 50-page Excel sheet with 300 footnotes, and nobody really knows how to read it. Of course there are some people in the company that know, but it would be really hard for us to find them and finally tell you about that.
So, at this point, I realised we probably have to do this on our own, and what we did there is apply something that is well known in security testing, called fuzzing or fuzz testing. It's an automated testing approach, and the idea in the security context is that you generate a lot of more or less random inputs, throw them against some implementation, monitor the implementation's behaviour and look whether, for instance, it shows any unusual behaviour. We are adopting this approach, but we are not looking for the next buffer overflow in the router OS; we are trying to find the hardware limits of the router under test. The general idea is that we generate masses of guided, valid configuration changes scaling up resource usage on the box, we measure the router behaviour during deployment and check for any run-time errors, as far as we can scrape them from the box, and we correlate the scaling behaviour and identify possible bottlenecks that appear during the deployment of the changes.
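In essence, the approach is a loop like the following minimal Python sketch; this is an assumed structure for illustration, not the actual DE-CIX framework code, and the router object is a placeholder:

def fuzz_router(base_config, extension_configs, router):
    """Apply guided configuration changes one by one and record what the box reports."""
    results = []
    for ext in extension_configs:                 # guided, valid configuration changes
        router.apply(base_config)                 # always start from a known-good base
        errors, resources = router.apply_and_monitor(ext)   # scrape logs and counters
        results.append({"config": ext, "errors": errors, "resources": resources})
        router.cleanup()                          # rollback or reboot to a clean state
    return results                                # correlated later to find bottlenecks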
I came up with a couple of scripts for that; the first version was cobbled together in an afternoon, and meanwhile this has developed into a little framework which I call the fuzzing framework. It is a modular, five-stage framework, mainly based on Python and Jinja2 for templating; we will get to that in a minute. It's quite flexible and adaptable, could be adapted for different vendors, and most of it is plug-in based. The idea here is that you can push in an existing configuration, so you are working from a valid base. And you can evaluate the scalability of single routers; that means vertical scalability, not horizontal, like adding more boxes: we want to know how far a single router scales. As for the design goals, we wanted to automate repetitive tests, so we are generating a large amount of configuration changes and measurement points, and the idea is that, in the end, you can visualise and identify the bottleneck resources in the router, with support from the framework in all stages, so it really keeps the repetitive steps away from the person using this little framework.
So let me guide you a bit through the framework design here. We are starting with the first stage, which is parameterisation. Usually the user will provide some sort of production configuration from a running router that we are going to test, and the first thing we do is take the production configuration and do a sort of configuration parsing. That sounds bigger than it is. What this does is it converts the usually tree-based configuration file and makes it traversable and easily workable for the following code.
And this, like, tree data structure that we create out of the configuration is called context, it gives you all the main information to work with in further stages of the framework and also you can add here the scaling parameters so the parts of the configuration that you want to scale and test. And additionally we have the base configuration which is just a copy of the running configuration, this will always serve as a basis for the test, our environment from which we are rolling out changes.
And in the second part we are doing the actual scaling of the configuration. What you see here is, on top, the context, which is the parsed configuration, so here you see one part of it; for instance, those are policies that were found in the actual configuration. It's a pretty simplified example, but it's going from policy 1 to 6, 7, 8. What the user provides is the Jinja template for producing an extension configuration on top of the base configuration. What you can do with this templating engine is mix Python code with configuration stuff, and then you can traverse this context here and add new items at certain parts of the configuration. We are traversing all the policies found in the context here and adding a new policy ID and description, so it's just a very simplified example.
And what that gives you is an extended configuration which you see here on the right which would in that case add a description to each of these policies that we have found in the context above.
So, a simple way to extend the running configuration here.
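Since the framework itself is not public, what follows is only a rough sketch of that templating step, using Jinja2 as mentioned in the talk; the context structure and the configuration syntax are invented for illustration:

from jinja2 import Template

# "Context": the parsed base configuration, reduced here to a list of policies.
context = {"policies": [{"id": 1}, {"id": 6}, {"id": 7}, {"id": 8}]}

# User-supplied extension template: traverse the context and emit additional,
# still-valid configuration lines on top of what is already there.
extension_template = Template("""
{%- for policy in policies %}
policy {{ policy.id + 100 }}
  description "fuzz-generated, derived from policy {{ policy.id }}"
{%- endfor %}
""")

extended_config = extension_template.render(**context)
print(extended_config)   # appended to the base configuration before roll-out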
The next step is roll-out, so what are we doing here? We are applying the configuration on a test router. From the parameterisation we get the base and extension configurations, which do the upscaling of the parameters we want to test. And we are simply rolling that out via SSH, and during the application of the configuration to the box we gather all the logs, like, for instance: did the CLI at some point say I cannot apply this any more, I am surrendering? Any errors? And occasionally we find things like hardware counter dumps for resources that are used; we are scraping everything we can get from the box that tells us something about what the box is doing, whatever we can grab.
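A minimal version of such an SSH roll-out step might look like the sketch below, using paramiko; the CLI commands and the resource-dump command are placeholders, since the real ones are vendor-specific:

import paramiko

def apply_and_collect(host, username, password, config_lines):
    """Push configuration lines over SSH and collect whatever the box reports back."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=username, password=password)

    logs = []
    for line in config_lines:
        # One command per exec for simplicity; a real framework would drive an
        # interactive configuration session instead.
        _, stdout, stderr = client.exec_command(line)
        logs.append({"cmd": line,
                     "stdout": stdout.read().decode(),
                     "stderr": stderr.read().decode()})

    # Scrape whatever resource counters the platform exposes (placeholder command).
    _, stdout, _ = client.exec_command("show hardware resource-usage")
    resource_dump = stdout.read().decode()

    client.close()
    return logs, resource_dump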
Afterwards we create a clean test environment; that means we are either rebooting the router or rolling back to the base configuration, if that's possible. This gives us a clean environment, and we repeat the step until we are done with all the extension configurations that we wanted to test. The output of this is execution logs, error logs, resource dumps, whatever we can get. This can be quite a few hundred megabytes of resource measurement data.
In the next step we are doing a data clean-up. Usually, up to this point, you will have the error dumps in some vendor-specific format, and what we are doing there is standardising the output and combining a data pool, combining the data from all runs, and we end up with something like you see here on the top right: we have a description for resource X on line card, let's say, 20 here, whether we had an error during the application of this extension configuration, and the error description, as far as we get one. As I said, we are trying to grab what we can get.
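A simple sketch of that normalisation step, with invented example rows standing in for the vendor-specific dumps:

import csv

# Invented example rows, after parsing vendor-specific dumps from two runs.
raw_runs = [
    {"resource": "X", "line_card": 20, "qos_policies": 400, "acls_per_policy": 40,
     "error": ""},
    {"resource": "X", "line_card": 20, "qos_policies": 500, "acls_per_policy": 40,
     "error": "hardware resource allocation failed"},
]

fieldnames = ["resource", "line_card", "qos_policies", "acls_per_policy",
              "error_occurred", "error"]
with open("results.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=fieldnames)
    writer.writeheader()
    for run in raw_runs:
        writer.writerow({**run, "error_occurred": bool(run["error"])})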
And in the final step, we are doing things with the data. We currently have two data consumers here. One is visualisation, so we can generate plots of the measured errors and visualise the resource usage, and usually this will give you something like this plot here in two dimensions: let's say we have scaled two parameters of the config, which are X and Y; you get a visualisation of areas that are safe, areas where you are getting close to the border of what is possible with the router, and the white area here means applying this configuration failed and you ran into some resource problem on your router. The other nice thing we can do with this is, once you have the data, you can do a sort of predictive modelling. I will go into detail on that a little bit later, but you can use nice machine learning models, like decision tree models, and provide flowchart-like predictions, and we found this quite useful for management and for doing predictive maintenance without looking at the box. I have a separate part on this in the talk, so it will get a bit clearer in a few minutes.
So, quite a bit theoretical up to that point, so I chose to show you a little case study here, on QoS and ACLs. We wanted to test whether we can drop DDoS or other unwanted traffic, and the question was: how many QoS policies and ACLs per port can we apply before running out of resources? As I said, we do this at scale at DE-CIX; at some locations we have very dense deployments with a lot of ports, and this is really a relevant question here. So how does the resource usage scale is one of the questions, and another is what the bottleneck resources are. The test set-up was a complex router with multiple line cards; as a base configuration, we used the configuration of a multi-terabit router with more than 100 configured ports, and we generated extension configurations to scale the number of QoS policies and the number of IP filtering rules, or ACLs, per QoS policy. Of course, you know, vendors are picky when it comes to this, so I have to obfuscate some of the axes here in the following, forgive me, that's a necessity, but it gives you an idea of how it works in general.
So, let's first talk about the timescale of the experiments that we ran in the lab. You see here a table with different run times for experiments with different resolutions. The one thing you need to decide is what resolution you want for the data points you are actually measuring. We chose for these experiments a resolution of ten, which means in each configuration step we are adding 10 QoS policies and/or 10 ACLs per QoS policy, and with this resolution we were able to generate roughly 2,300 data points within 24 hours. The nice thing with the framework is, you get into the lab in the morning, click enter, go do something different, and the next morning you get your results.
This time here includes everything: configuration scaling, roll-out, data collection from the box and environment clean-up. What turned out is that environment clean-up is really the part that consumes the most time in these experiments. This can really break your neck; it's the difference between hours and days. So we have implemented two things we can do here: one is roll back to the base configuration and the other is reboot. You see a little plot, do we have a pointer here? On the difference in time: what you see on the Y axis is the time to apply a certain configuration and roll it back to the initial state. On top you see the time you need for a reboot, which is pretty constant, around 50 seconds in our case, and the other is the roll-back time. One nice thing about using roll-back is that the times gradually increase depending on the size of the extension config that you are actually applying to the box, and at some point it starts to plateau; that's where the red marker here is. This indicates large failing configurations, but you are very fast with rolling back to the initial state here, because anything that goes beyond that can't be applied, it's not applied at all, and you can immediately roll back to the base configuration.
And with roll backs, we found that we are at least 2.5 times faster than with reboot, for the experiment we had before that would mean 2.5 days versus roughly one day so quite a speedup.
So once you have gathered all the data, you can start to drill down into the data with respect to different criteria. One thing we could do here, because we really had a resource dump with upper limits for certain things that we could scrape from the box, is identify the bottleneck resource. I had to obfuscate them, but you can imagine they are different types of hardware building blocks, and each of these A, B, C, D represents one of the building blocks, and in this case we immediately saw that resource A is the problem. Once you have identified that, you can drill down into this specific resource per line card. That is also really interesting because, of course, in a big service router, different line cards have different configurations and resource usages, so using this method you can find out that the most likely to fail here are line cards 1 and 2, and on 4 you are pretty safe.
An additional thing that we found out by playing around with the data is that you can actually do something like monitoring without measurements, so this is a sort of predictive maintenance. The idea there is that you look at the production configuration and simply extract the scaling parameters you had in there; in the example before, that would have been how many QoS policies and how many ACLs per QoS policy we have. You can use this as training data for some machine learning model, and after you have trained the model, you can simply put in the production configuration and predict the resource usage. So it's pretty nice for doing predictive maintenance and also very nice for mitigating problems even before the deployment, so you can see whether you might be running into some problem even before you roll out the configuration to production.
So what does that look like in detail? Here is an example. What we do with the data set is train a decision tree machine learning model. Personally, I am very fond of decision tree models because they remain understandable, at least up to a certain depth of the tree. We are learning on a train/test split of 70/30 of the measurement data, and we can reach very high prediction accuracy, up to 98% or more. What we see here is how the model tries to approximate the resource usage of the box at different depths of the tree. We are starting here with a maximum depth of one, which essentially approximates just a line, either horizontal or vertical, and even that is pretty valuable, because it immediately gives you a safe area: if you stay below this black line here, you are safe, no matter what you do in the other dimension. And as the depth of the tree increases, you see that there is a trade-off between understandability and accuracy: depths 2 and 4 are still pretty understandable, and with a maximum depth of 4 you already reach an accuracy of 89%, which is probably enough for most cases; at depth 8 you don't gain more accuracy any more, but you really start to lose track of what is happening in this model.
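A minimal sketch of that modelling step with scikit-learn, on synthetic data standing in for the (obfuscated) measurements; the resource limit used to label the data is made up:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
# Features: number of QoS policies and ACLs per policy; label: did applying fail?
X = rng.integers(0, 1000, size=(2300, 2))
y = (X[:, 0] * X[:, 1] > 250_000).astype(int)   # toy stand-in for a resource limit

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

for depth in (1, 2, 4, 8):
    model = DecisionTreeClassifier(max_depth=depth).fit(X_train, y_train)
    print(f"max_depth={depth}: accuracy={model.score(X_test, y_test):.2f}")

# A shallow tree stays readable as a flowchart-like set of thresholds.
shallow = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
print(export_text(shallow, feature_names=["qos_policies", "acls_per_policy"]))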
Quick wrap-up. Testing is crucial, and when vendors are very tight-lipped on hardware resources it can become pretty complex pretty quickly. Our solution to this is this little framework for network resource testing automation; we can generate a really large number of configuration changes and data points within a day, so less than 24 hours. You can identify bottlenecks per router and per line card, and we can create pretty accurate and readable prediction models, which is really a win here. And we have applied this general method in practice at DE-CIX: one thing is assessing configuration changes when dimensioning products. The other thing is new work with simulated routers in labs; it was also pretty nice to apply this to compare a simulated router with hardware, with quite nice results. And you can also use this for validating claims on hardware capability that your vendor sells you. And that's all. Thank you for your attention and feel free to ask questions.
(Applause)
JAN ZORZ: Thank you very much. Nobody in online queue. We have a question, please.
SPEAKER: James Rice, actually three questions if that's acceptable. This is all quite cool. Lots of fun and really useful. Have you Open Sourced any of the tools or data sets?
MATTHIAS WICHTLHUBER: It's not Open Sourced, but you can write me a mail if you are interested, I am sure we will find ‑‑
SPEAKER: Certainly are. QoS I use for down-prioritising denial of service attacks; are you doing anything similar to that?
MATTHIAS WICHTLHUBER: Exactly that.
SPEAKER: Is that documented anywhere?
MATTHIAS WICHTLHUBER: Am I allowed to do ads here? We have a product called Blackholing Advanced which is using exactly this. I think I presented on the product here some RIPE meetings ago; if you check the website or talk to me afterwards, I can point you to the right documentation.
SPEAKER: Is that always on, or ‑‑
MATTHIAS WICHTLHUBER: You need to ask for it to be enabled; if you ask me, I can enable it for you.
JAN ZORZ: That was all three questions?
SPEAKER: The other one was would you be interested in some more hardware to test, like if the community was willing to run your tools on other hardware, having this collated somewhere?
MATTHIAS WICHTLHUBER: I mean, this always depends a bit on the workload of course. I wouldn't say no, but I wouldn't say yes either.
SPEAKER: Also, can you give any information on what particular chassis this is ‑‑
MATTHIAS WICHTLHUBER: Sorry, I can't tell you that.
Gordon: Hi. I have a question regarding the results themselves. You mentioned that you have anonymised or blurred the results a bit; would it not make more sense to make vendors responsible for what they claim the hardware is capable of, and to Open Source the actual results?
MATTHIAS WICHTLHUBER: Yes, of course, but that's wishful thinking. If you are a big company like Deutsche Telekom, maybe you can do it; if you are smaller, you do not have the clout to Open Source something like that. I think it's a pity, because this would be really valuable, but of course I see that vendors don't want it because it makes the competition comparable, so ‑‑
Gordon: I see, thank you very much.
SPEAKER: From BT. How do you do the parsing of the configurations? My understanding is it's a multi-vendor tool, so do you use any plug-ins to parse the outputs from the devices, and how do you store them as a data structure?
MATTHIAS WICHTLHUBER: It's actually much simpler than it sounds. Most configurations are somehow tree-based, right, and what we are simply doing is parsing them line by line, applying certain patterns to the lines to pull out the information that we want, and we represent everything, annotated, as a tree-like Python structure so you can easily traverse it. It's not more than this.
SPEAKER: I have another one. In terms of Cisco, would you consider using, for example, pyATS, with which you could define exactly what you want in the data structure, run it in the test-bed and then run the script based on that, because it is quite a useful parser for Cisco?
MATTHIAS WICHTLHUBER: A lot of automation tools immediately come to mind when you look at this; you could also apply Ansible or something like that. I found it useful to work directly with SSH because it allows you to do some prototyping; none of these test cases are the same, in the end you have to adapt things every time you use it. So you could probably do something like this, it would be nice to have, but I personally prefer to work with CLI commands.
SPEAKER: When you are running your tests, do you do this in parallel, like connecting to all the devices at the same time, or ‑‑
MATTHIAS WICHTLHUBER: The times I told you are for a single device, a single lab instance, and that was actually a pretty large lab instance; you don't want to have to do this side by side. You could replicate this and parallelise it, but it's a matter of a cost/utility trade-off.
SPEAKER: James Bensley, Sky. Thanks, this talk is really good. I have spent years testing hardware for, I guess, the same reasons as you, and you have made a great effort, but I question what the point of this work is. The reason I say that is because, after years of testing, what I have learned is that you never, ever want to run your hardware anywhere near 100% of its capacity, so I don't even care what the maximum is that it can do. I know what I need from my network, for my users, so what I need to do is test two things: approximately the number of ACL routes, or whatever I need for my network, and a little bit above that, so I have some growth room if I go beyond that. The second thing is, however much scale you need from your device, that may not be how much your vendor will support either. So really, for me, it's: my device needs to support this level, does it support that level; beyond that, not important. I need the vendor to support this level, so I would say if you are having to do this fuzzing to work out what the device supports, really you should ask the vendor whether they can support this, they need to give you a yes or no, and if not, switch vendors.
MATTHIAS WICHTLHUBER: You are perfectly right, you don't want to run your infrastructure at the level of what it supports. I did it partly out of curiosity; I wanted to know whether it works, and it turns out it does. The other thing is, I still think it's better to know where the limits are than not to know, and the third point here is that I think DE-CIX and IXPs are probably special in that regard because there are very dense deployments. If you are talking about an egress configuration on a port somewhere, you have to keep in mind you have to do this potentially 1,000 times for your set-up, so this is huge; you still need to know whether this will work or not, or whether it kills you in five years because you want to implement a different feature then.
SPEAKER: Generally I agree, perhaps I will see you afterwards, thank you for the talk, it was a good talk.
JAN ZORZ: Rudiger, you have eleven minutes.
RUDIGER VOLK: Oh, that's even a challenge. Rudiger Volk, still retired. Looking at what you are doing, I wonder whether looking at the boxes as black boxes is actually the thing we should be doing. Regarding the complexities of how the various variables demand resources within the box, I think I understand the vendors; disclosing that is very close to impossible, and, as you mentioned, as far as something like that is disclosed, it is questionable whether the recipient can work with it. The thing that in the real world I think we have to ask for, and maybe push more on the vendors, is making transparent what the current usage and utilisation of the critical resources is, and a list of what the critical resources in the boxes are. And at least for a few critical resources, I think monitoring of the utilisation is possible in most real-world products, and asking for better information is something that I remember doing many times with my upstream vendors when I had some and, well, okay, progress on ‑‑
MATTHIAS WICHTLHUBER: Did it work?
RUDIGER VOLK: Progress is slow, but it can be made, and I think it should be made. And the question is: what are the resources that you, in your environment, know you can actually monitor in real operation? If hopefully there are a few, why don't you do the monitoring and wrap it into your testing, so that you actually see how the reporting back of the critical resource utilisations matches with the crash or don't-crash behaviour of your routers? And as for the remark that, in real life, you do not want to run close to 100% or 98% or 95%: well, seeing how much water is beneath the keel is something that such a testing environment should provide, additional information about the security of operations that results from using your test results.
MATTHIAS WICHTLHUBER: Yes, you are totally right. First, regarding the push for more transparency: I don't know how you can get vendors to be more transparent on that. It's a really difficult topic because there are business interests, and this is simply colliding with the way they perceive competition.
The other thing is, I totally agree on monitoring; as I said, this was also one of the intentions here, to see how much water is under the keel, so that's all I would like to add here.
JAN ZORZ: Thank you. Do we have any questions on the Internet?
WOLFGANG TREMMEL: No questions.
JAN ZORZ: Nobody else is running to the mics. Thank you very much.
(Applause)
WOLFGANG TREMMEL: I don't want to repeat myself, but I am doing it anyway: please rate the talks. And the next speaker is Tom Strickx from Cloudflare, and he is talking about the anatomy of a route leak. Please.
TOM STRICKX: Hi everyone. It's me again. I am Tom Strickx, I am a network software engineer, and I am going to talk about a thing that happened in 2019, which is three years ago.
So, this talk is made up of a couple of things, as you could have seen in the introduction slide. This is a presentation that we used to do with the three of us: Martin Levy, another colleague and me. Martin Levy is retired and he is staying retired, weirdly. That's where the Internet history comes from, because he is a history buff; some BGP route leak history, because we have got at least one of the ASNs that did one of them in here; a talk about what happened in June 2019; certain comments about certain vendors of certain products; and some graphs that hopefully indicate why you should do a specific thing.
So, as some of you might know, the Internet has been around for a while, roughly since March 1977. The thing is, when all of that was built, when the ARPANET was built, security wasn't top of mind, and that's not just evident from the initial set-ups, it's evident from the first RFCs. In the first definition of IP, there's an Internet header and a security option, but I don't think anyone actually ever uses it, besides maybe certain companies that use it for load balancing reasons more than anything else; nobody really looks at it any more and nobody actually really cares.
And the same thing applies to everything that was built on top of IP. When Tim Berners-Lee came out with the World Wide Web, he explicitly made it clear that information exchange takes priority over information secrecy or information security. So, basically, everything that we have built since has been built on relatively insecure foundations. Ever since, we have kind of been going back and forth and trying to improve things, but we still have a very critical communication protocol within the Internet, BGP, where, again, security issues are not discussed in the spec, so everything again is kind of just bolted on top of it, and that's unfortunately a pretty big problem. Because we started, like I said, unsecured: at the top, that entire chain was unencrypted and unverified, and, since then, we have started realising that maybe we need to start verifying things, maybe we need to start encrypting things, because not everyone is on the Internet with good intentions. So, the connection to the DNS name is now nicely encrypted with PKIs and certificates, I think about 90% of the Internet at this point is TLS-secured, so we are good; for DNS resolution we now have protocols like DNS over TLS and DNS over HTTPS that add a level of encryption, and we have got DNSSEC for verification; but that last bit, between just the IP addresses, the lower layer, there are still some issues there. Because whenever we fail at verifying that lower layer, that layer 3, we run into issues, and it's becoming more and more public, because whenever there is a failure there, you get massive headlines that say Amazon or Facebook Internet outage, or Google traffic hijacked; most of the time it might not be the most accurate reporting, but the end result is still the same: people are worried about Internet security. And they are not wrong, because, unfortunately, BGP has had a long history of routing leaks and routing hijacks. The first one was the AS7007 incident, hi Aaron, where a routing bug caused the entire routing table to be deaggregated and things kind of broke in fun ways. You had the Pakistan hijack, where they were trying to obey the regulators and block YouTube within the country, at which point the whole world's YouTube traffic ended up going to Pakistan. Malaysia; the Google leak to Verizon that caused impact in Japan because traffic was being routed to Chicago; it goes on, and it ends with what I am going to be talking about, which is the route leak via Verizon in 2019.
But it keeps going on; this is a tweet from Doug in 2020. It's interesting: apparently it's not safe to run any quad addresses, it's not just 1.1.1.0/24, as I was talking about this morning, but also 78.78.78.78 or any of the other quad addresses. This is a really fun and interesting one; it didn't impact anything, it was just a weird thing to see, which is why Doug picked up on it.
Let's go back to 24 June 2019. I'm sure all of you are aware of what happened: a massive route leak from a specific provider caused worldwide impact for a bunch of different networks, but Cloudflare was the most impacted. That was also what you could see in most of the reporting: Cloudflare was front and centre. But it wasn't just us. This Cedexis data is showing issues for pretty much all of the large CDNs. You can quite easily see Cloudflare is the one at the bottom; we are an Anycast, not a Unicast based network, which means if any of our prefixes are leaked, the impact is global, while for the other CDNs it's going to be slightly more local and slightly less impactful. As I said, this wasn't just Cloudflare, it was pretty much the entire Internet community that was impacted; we just yelled about it and blogged about it the most. This was the impact we saw. The interesting thing you can see is that it wasn't just impacting us in North America. The initial expectation with route leaks like this is that it's going to stay relatively local; unfortunately, the way that this route leak, or this route hijack, happened caused us to lose significantly more traffic than we really wanted to, and I will go into why and how a bit later.
How did we fix it? We didn't use any fancy APIs or JSON or anything fancy; we had to be on a phone call for like an hour, trying to explain that what they were doing was wrong, could they please not do that thing. And unfortunately that takes a while, because you need to explain, you are leaking our routes, could you not do that, and the other side replies with no, we are not, and you are kind of stuck, because it's he said/they said, right? That's how we fixed it: we just told them, hey, you are leaking our routes, could you check your export filters and please fix this, you are causing impact.
And the thing is, this is a document from 2016, kind of defining what a BGP leak is. There are multiple definitions and multiple subsets of BGP leaks out there. You have got an inadvertent leak where someone is leaking their entire routing table, for example; those occur quite frequently, but there are more nefarious ones, or something in between. The interesting thing with this one is that it's not just a leak, it was actually a prefix hijack, because what you can see here is, first of all, a /21, and the AS path here is strange, to say the least, right? We should never see a Level 3 route behind a Verizon route, followed by, I think, these two ASNs here. Cloudflare doesn't advertise /21s, or at least not for these prefixes; we do advertise /21s, but this specific prefix is a /20, and the majority of our Anycast routes are /20s. So what we were seeing in the previous slide and in the leaked routes are /21s, and as you all know, longest prefix matching is always going to win. No matter how well peered and well connected Cloudflare is, that /21 being advertised is always going to win, and that's the main reason why Cloudflare was so severely impacted. It wasn't just someone winning on path; it was them advertising a /21 that, through providers like Verizon, hit the default-free zone and suddenly started attracting all of our traffic.
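A small illustration of why the more specific wins, using Python's ipaddress module on made-up example routes (private address space and a documentation ASN, not the actual Cloudflare prefixes):

import ipaddress

# Made-up example routes standing in for the legitimate /20 and the leaked /21.
routes = {
    ipaddress.ip_network("10.0.0.0/20"): "legitimate Anycast announcement (AS64496)",
    ipaddress.ip_network("10.0.0.0/21"): "leaked more-specific created by the optimiser",
}

destination = ipaddress.ip_address("10.0.1.10")

candidates = [net for net in routes if destination in net]
best = max(candidates, key=lambda net: net.prefixlen)   # longest prefix wins
print(f"{destination} -> {best} ({routes[best]})")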
And that's where the Tier 1s come in, because, like I said, it didn't just stay local, didn't just stay at a single IX, didn't just stay within the Pittsburgh area or the east coast; it went global, and it went global real quick, because pretty much every single one of these transit providers was either accepting it from Verizon or accepting it directly. And that's kind of what we see here. You see Cloudflare, but it's not really us, because we don't do the /21s: Level 3, DQE, Verizon, and it goes global. And that was really problematic. We wrote two blogs about it, and when Cloudflare writes two blogs about something, it's serious. But it was kind of a really fun exercise for us, because it allowed us to flex our engineering muscles a bit, to really dig into what was happening, to really understand everything that we were seeing and everything that everyone else was seeing, and Martin was nice enough to write a bunch of shell scripts that made it a lot easier for everyone to do some analysis themselves, have a look at what we were seeing and why we were seeing it, so we made sure that everything was researched in an open-sourced, easy and understandable way.
But, like I showed you, this is an IPv4 prefix; I have been talking about IPv4, and IPv6 has been the address family of the future for the last 15 years, so why am I not talking about IPv6? Well, we kind of got lucky with that one, because Allegheny Technologies is an IPv4-only network; they were only advertising IPv4 prefixes, which meant that even if they had wanted to, they couldn't leak v6, because they didn't support the address family. So the interesting thing we saw, for the combination of a bunch of different ASNs: the red here is IPv6 utilisation; it's a bit sad, but we do what we do. But then you can see that during the incident IPv6 actually goes up, and that's where happy eyeballs comes into play. Happy eyeballs normally is there to prefer IPv6 over IPv4, but I guess in this case IPv6 is so bad that IPv4 still wins. During the incident, though, because everything was so degraded, IPv6 actually won; IPv6 actually won those timing races on the clients to get traffic across, so that was kind of cool. It also meant there was some self-recovery happening, so that's always good.
So how can we fix this? Because it's all fine and dandy for me to be here on stage and complain for half an hour about everything being broken, please stop breaking things, if I don't suggest some solutions. And there are solutions, there is a bunch of solutions, there's always solutions, we are good at that. A concept primarily talked about a lot by Job Snijders, called peerlock, relies on a very clear attribute of what we call Tier 1 providers: all Tier 1 providers are basically in a mesh with each other, all of them are interconnected with each other, which means that you should always be just a single hop removed from another Tier 1; that's kind of the entire concept of this. Which means that, to keep the Level 3 example going, for any of the customer prefixes from Level 3, Cogent should never be receiving them from Verizon; it should just be receiving them directly, because that direct connection is there. So, what you can start doing with peerlock is create explicit configuration on all of your routers that makes it impossible for multiple Tier 1 or transit ASNs to be chained behind each other. But you don't need to just stick to Tier 1s. Obviously Tier 1s make it significantly easier, it's basically the defining attribute of a Tier 1 transit network, but you can do the same thing with a bunch of different networks where you know for a fact that the relationship specifically excludes them from being transited by this network. You can do this with content networks towards transits: from Cloudflare's perspective, because we are so well peered with Google or Amazon, we can explicitly state in our filtering that we don't want to see any of those specific ASNs through transit, because they shouldn't be seen that way. IXPs are a tonne harder to do, but it's still feasible and might be worth a thought and something to talk about.
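A rough sketch of the peerlock idea in Python; the ASNs are real Tier 1 examples (plus documentation ASNs for the customers), but the filter logic is simplified for illustration and is not a drop-in router policy:

# Real Tier 1 ASNs used only as examples; AS64496/AS64511 are documentation ASNs.
TIER1_ASNS = {174, 701, 1299, 2914, 3257, 3320, 3356, 6762, 7018}

def peerlock_reject(as_path, neighbour_asn):
    """Reject a route if another Tier 1 shows up behind the Tier 1 neighbour it came from."""
    # Ignore the neighbour's own (possibly prepended) ASN; everything else is behind it.
    behind = [asn for asn in as_path if asn != neighbour_asn]
    return any(asn in TIER1_ASNS for asn in behind)

# A Level 3 (AS3356) customer route arriving via Verizon (AS701): reject it,
# because a direct session to Level 3 exists and should carry it instead.
print(peerlock_reject([701, 3356, 64496], neighbour_asn=701))   # True  -> reject
print(peerlock_reject([701, 64511], neighbour_asn=701))         # False -> accept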
So, where did that /21 come from? It came through a system, a piece of software, called a BGP optimiser. What do they do? Very straightforward: they receive the BGP routing table from their peers, from their transits, from any of the upstreams they want, and they lie, because that's what they do, they are liars. They will deaggregate all of the received routes and split those across all of the available ports, allowing you to balance traffic or load balance in a very fine-grained way, right? I can understand where certain networks might find that useful, but it's incredibly dangerous, as we have seen, because the more specifics will always win, so even if you make the slightest, tiniest mistake in your export filter and even one of them escapes, you are receiving all of that traffic.
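As a toy illustration of that deaggregation step (example prefix only, same made-up /20 as before):

import ipaddress

received = ipaddress.ip_network("10.0.0.0/20")            # the legitimate route
more_specifics = list(received.subnets(new_prefix=21))    # fabricated /21s
print(f"{received} deaggregated into: " + ", ".join(map(str, more_specifics)))
# -> 10.0.0.0/20 deaggregated into: 10.0.0.0/21, 10.0.8.0/21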
So that's what we saw. Like I said, that /21 shouldn't have been in the routing table; it isn't in the routing table. So, do route optimisers cause fake routes? Definitely they do. To have a vendor of a BGP optimiser tell us how to do best practices is weird, because they are not secure by default. You would expect that setting the NO_EXPORT community on those leaked, fake routes would be a sane and very useful secure default. They specifically say it's not enabled by default. They have got an entire reasoning for that: it's that it's not perfect, and it's actually the fault of everyone else, not us, which is, sure. Apparently, NO_EXPORT isn't a viable option for a specific subset of their customer base, therefore they are not enabling it for anyone. There is a bunch of complaints about that. It's not the right response, or at least that's not what we think is the right response.
Now, onwards to RPKI, because it wouldn't be right if there wasn't at least one talk a day about RPKI, and I think Job already did his yesterday, so I guess it's my turn today, and on Thursday Frederick can maybe do his and we have got everything covered. Yeah, I mean, RPKI would have fixed this, if everyone did RPKI origin validation: because it's a /21, because it's a more specific, no one would have accepted it, because we only sign our /20s as /20s and that's the only thing we do. So if everyone had ROV enabled I wouldn't be up here talking, so this is all on you. But it's not just RPKI, right? MANRS is an amazing resource to help you figure out how to do actual best practices, and not just follow a specific vendor telling you what their best practices are; it's a neutral entity that helps you from very specific perspectives, from service provider to IXP participant to content delivery network, to get the right things going. Obviously IRR filtering is a very useful tool. Unfortunately there is a bunch of unanswered questions and scaling issues; especially for well peered, very big networks it becomes very, very cumbersome: which of the databases are we to trust, what about any of the free IRR databases, how do we automate this, and how frequently do we update our prefix lists? How do we tell our peers that we are currently not accepting a route because their IRR filters aren't correct or entries just don't exist? There is a bunch of remaining issues there that are still unanswered, still unsolved. But we have been making amazing strides over the last couple of years; we have seen a massive amount of progress, right? We have seen a consistent and very strong increase in RPKI-signed routes, and big networks saying, and us verifying, that they are doing route origin validation. This is slightly outdated, but networks like AT&T and a bunch of different ones are dropping RPKI invalids, which definitely helps a lot; it significantly reduces the blast radius of incidents like this. I will show that in a second, but the interesting thing here is, even where route origin validation was enabled on some networks, it didn't actually protect them, and the main reason for that is the ARIN TAL. I think since then they have slightly changed the legalese that you need to agree to to actually download the TAL, but when this incident happened, Seacom were doing origin validation but they weren't including the ARIN TAL. Unfortunately, because the Cloudflare prefixes that we are talking about are issued from ARIN, they need to be signed under the ARIN TAL, so for all intents and purposes, for Seacom they were unsigned prefixes, and at that point, you know, you are kind of lost, and that's unfortunate. So, again, there are clearly still some issues, right? I can't just say RPKI is the lord and saviour and it will save the day for everything that we do, but it would have helped. There is a really, really good research paper by Christopher and David from the University of Pennsylvania on how we can make RPKI easier for operators to use; this is basically an open letter to ARIN, I hope everyone understands that, but yeah, it actually is really useful to deploy RPKI. What you can see here is a network that will remain unnamed, although it's not, it's AT&T, that had filtering enabled, and you can see everything is fine and dandy, like sunshine and roses, and this is without filtering, and that's not sunshine and roses.
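For illustration, here is a minimal sketch of what route origin validation checks against a ROA, with made-up data; real deployments use a validator such as Routinator feeding routers over RTR rather than code like this:

import ipaddress

# Single made-up ROA: the holder signed the /20 with maxLength 20 for AS64496.
roa = {"prefix": ipaddress.ip_network("10.0.0.0/20"), "max_length": 20,
       "origin_asn": 64496}

def rov_state(prefix_str, origin_asn):
    prefix = ipaddress.ip_network(prefix_str)
    if not prefix.subnet_of(roa["prefix"]):
        return "not-found"          # no covering ROA for this prefix
    if origin_asn != roa["origin_asn"] or prefix.prefixlen > roa["max_length"]:
        return "invalid"            # wrong origin AS, or more specific than allowed
    return "valid"

print(rov_state("10.0.0.0/20", 64496))   # valid: matches the ROA exactly
print(rov_state("10.0.0.0/21", 64496))   # invalid: more specific than maxLength allows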
It was a massive issue for us, but it's amazing when you see super big networks like AT&T and Telia go, well, we are going to lead the charge on this, we are going to do what is right and start doing RPKI validation and rejecting invalids. So, the summary of this is: it's 2022 and it's still broken. I wish I could have another message. I wish my message could be that we fixed BGP route security, but we are not there yet. There are a bunch of different issues out there still. RPKI is only going to fix a very specific subset of the issues we see. We need BGP path validation, BGPsec, and a bunch of additional security tools on top of the existing stacks that we use day‑to‑day to make sure everything is okay. We are making progress; we are seeing RPKI being used day‑to‑day in the biggest networks out there, so there is nothing stopping any operator, today, from running RPKI route origin validation. You should be doing this today.
Unfortunately, Verizon still isn't doing it, and we are three years later, because, as some of you may have seen, we had a tiny hiccup with a network in Brazil that was leaking a /29 to Verizon, who were happily accepting that and propagating it within their network, causing Verizon customers to end up in Brazil, which makes perfect sense. So we are not there yet, let's not kid ourselves; there is a lot of work still to be done, and I am excited about the progress we have made.
I kind of flew through this, so we have eight minutes left. Does anyone have any questions? If you don't right now ‑‑ I see Rudiger is standing up ‑‑ if you have any questions later, please don't hesitate to e‑mail. I think if you send an e‑mail to the Martin e‑mail address it will end up in ‑‑ well, you know where to find Martin, you see him driving around with 1.1.1 number plates.
THOMAS SCHMID: One part of the story is the input side ‑‑ what was going wrong with Level 3 at that time?
TOM STRICKX: At that point, it's a combination of things, right. The reason why we are complaining primarily about Verizon in this case is that they are the source; they are the ones that should have stopped it. But yeah, you are 100% right, Level 3 and the other transit networks should have stopped the leak at their side. It's one of those things: the moment it gets out, it's incredibly difficult ‑‑
THOMAS SCHMID: I would expect if they have a customer relationship they have proper filters in place there.
TOM STRICKX: And we have a proper relationship with Level 3; they were doing the right thing, they were providing transit, so they were sending them a full table. The thing is, we can't just shut down all of our Level 3 sessions globally because ‑‑
THOMAS SCHMID: That's clear, sure. I mean the other side.
TOM STRICKX: Yeah, and we asked them to shut it down, and that takes a while.
WOLFGANG TREMMEL: I have one online question from Kurt Kaiser, no affiliation: Is the time right to set a date for all route objects to be signed? If such errors continue, this is not fun. What about RIPE 100 ‑‑ could we aim to have 100% of all Internet routes be valid by then?
TOM STRICKX: That would be amazing, but I am realistic enough to know that's not going to happen. Hindsight being 20/20, I think it would have been nicer if, five years ago, we had mandated a policy where, if you want a new prefix assigned, it needs to have a ROA signed; unfortunately that never happened, so we are where we are. I think it's good we are making progress, but I don't see this being a fixed issue in five years.
RUDIGER VOLK: Okay. An RPKI talk and me not commenting hasn't occurred for a long time. Unfortunately, I don't have real questions.
(Applause)
TOM STRICKX: In that case, if you want to stand on stage and talk, that might be easier ‑‑ I will just go.
RUDIGER VOLK: I could do that. I have no prepared slides, and I have survived a long career with an amazingly small number of slides. But yes, we are not yet done. This area is something that is actually absolutely not trivial to solve. And for dealing with a non‑trivial subject, sometimes it is extremely important to speak precisely, because people who are not in the know easily pick up models of things that are simplified in weird ways. So let me first respond to Kurt. We should not be talking about route objects ‑‑ I think he was asking about route objects ‑‑ because IRR, as you were mentioning, well okay, can be used for pulling some information, under some circumstances, with some validity, but it is only a possibility to be used if you are, well okay, completely in despair because there is absolutely no information available to make decisions.
TOM STRICKX: I completely agree with that.
RUDIGER VOLK: And all best practice papers that claim that IRR is a good thing are essentially wrong by not making clear that, well, okay, yes, this may be the available thing for certain people at a certain time ‑‑ which is in the past ‑‑ but it is not really a solution. It cannot be.
Your suggestion of allowing only prefix assignments with ROAs ‑‑ okay, not really, unless you accept ROAs for AS0, which one could do. Well, the thing we should be looking forward to is that the market dynamics, the interplay within IP routing, mean that routes that do not have a ROA just are not worth being kept. That is something that may develop; we may run into a phase where, well, okay, people who are a little bit backwards just don't get universal connectivity because they are not signing up their stuff.
Speaking ‑‑ well, okay, for speaking precisely, I think it is extremely important to use the words "we are using RPKI", "we are doing RPKI", as meaning maintaining the cryptographically armed database. The use of that database can have a couple of purposes, some of which are already kind of available and in play, and others are going to become more important and helpful. So when people are saying, well okay, we are deploying RPKI, I'm always asking: are you signing your routes, are you dropping invalids, are you authenticating your peers by the AS signatures ‑‑ probably not?
So let me point at just two numbers for the agenda that is straight ahead. The other week, RFC 9234 was finished, which defines an extension of BGP ‑‑ the open policy work, using roles in the OPEN message and the Only‑to‑Customer attribute ‑‑ for preventing and detecting route leaks. That's not using RPKI, but it is absolutely something that would help with the examples that you showed, and everybody should be asking their vendors when that will be in their implementation. The same goes for RFC 8893, which is just the hint that you should also be validating the announcements that you are giving to other people ‑‑ you should be doing that and not rely on the other people to fix your leaks. And then let's come back later on to ASPA and BGPsec.
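For readers unfamiliar with RFC 9234, the core of it is the Only‑to‑Customer (OTC) attribute and per‑session roles: once a route has gone "down" to a customer or lateral peer, it must not travel back up. The sketch below illustrates the ingress and egress checks under simplified assumptions; the role enum and function shapes are invented for illustration and real implementations live inside the BGP speaker itself.

# Hedged sketch of the RFC 9234 Only-to-Customer (OTC) checks.
from enum import Enum
from typing import Optional

class Role(Enum):
    PROVIDER = "provider"
    CUSTOMER = "customer"
    PEER = "peer"
    RS = "route-server"
    RS_CLIENT = "rs-client"

def ingress_is_leak(otc: Optional[int], neighbor_as: int, neighbor_role: Role) -> bool:
    """A received route is a leak if OTC shows it already went 'down' once."""
    if otc is None:
        return False
    if neighbor_role in (Role.CUSTOMER, Role.RS_CLIENT):
        return True          # customers / RS-clients must not re-export OTC routes upward
    if neighbor_role is Role.PEER and otc != neighbor_as:
        return True          # a lateral peer may only pass on routes it marked itself
    return False

def egress_allowed(otc: Optional[int], neighbor_role: Role) -> bool:
    """Routes already marked with OTC must not be sent to providers, peers or RSes."""
    return not (otc is not None and neighbor_role in (Role.PROVIDER, Role.PEER, Role.RS))

# A route marked by a provider (AS 64500) that a customer wrongly re-advertises to us:
print(ingress_is_leak(otc=64500, neighbor_as=64499, neighbor_role=Role.CUSTOMER))  # True -> drop
# The same route must also never be exported back up to another provider:
print(egress_allowed(otc=64500, neighbor_role=Role.PROVIDER))                      # False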
MARIA MATEJKA: It's time for a provocative question: Is it not now the time to implement ASPA in BIRD, or should we wait another five years?
TOM STRICKX: We should start immediately.
MARIA MATEJKA: Okay.
TOM STRICKX: Thanks for that, that was an easy one.
JAN ZORZ: All right. I see no other people rushing to the mic. The queue is empty; are there any questions online?
WOLFGANG TREMMEL: There is nothing online.
JAN ZORZ: Thank you, thank you very much, very informative.
TOM STRICKX: Thank you.
(Applause)
WOLFGANG TREMMEL: Okay, that's it. Happy coffee break, see you in half an hour and please rate the talks.
LIVE CAPTIONING BY AOIFE DOWNES, RPR
DUBLIN, IRELAND