Routing Working Group
19 May 2022
9 a.m.
CHAIR: Morning, the Routing Working Group will start in one minute, if you could please find your way to your seats.
IGNAS BAGDONAS: Good morning again. So let's start off the RIPE Routing Working Group session. It's me, Ignas, and my fellow co‑chairs are here. Actually, for both of them it's the first ‑‑ I stand corrected on that, apologies, I'm not an expert in that language. So for both of them it's the first physical meeting, so please welcome them.
(Applause)
JOB SNIJDERS: We have done one physical meeting pre‑Covid.
IGNAS BAGDONAS: As always we have quite a few topics to discuss, and at the beginning the administrative topics, but the clicker doesn't seem to work ‑‑ it works now.
So, the queue etiquette and how to get in line with questions: please use the app and use the questions and answers. Don't raise questions in the chat.
Minutes for the previous meeting, they are available in the usual place and as business as usual.
One topic that we as the Chairs would like to encourage you to look at a little bit more comes from our friends in the Database Requirements Task Force. They have a set of requirements on how they would see the database ‑‑ well, the whole system ‑‑ evolving, and some of those requirements are directly related to the Routing Working Group. There is nothing special or really new; it's about general practice, how things should be done. You have a link there if you are interested in the details. We, as a working group, haven't done much tangible work on that, and we, as the Chairs, don't believe that this is for us to decide and do. It's for us to encourage you as a group, if you care and if you are interested, to look into that and comment or maybe try to address those topics. It's not a hard push, it's just a general reminder that we have this activity ongoing and, if you are interested and if you care, please take a look at it.
And then we have our agenda for today.
PAUL HOOGSTEDER: We start with a presentation about the BGP tools toolset by Ben. Ignas will tell us about BGPSec. Mikhail will give a presentation about publish in parent. Job has got a lot more to tell us about working with the RPKI. Alexander will tell us about automatic updating of prefix lists, which is a good idea in my opinion, and then a lightning talk by Tim.
IGNAS BAGDONAS: Then, Ben Cox, welcome on the stage and tell us about BGP tools.
JOB SNIJDERS: My hope is that by the end of Ben's presentation all of you will set up eBGP towards Ben's project. That is primarily the goal of this talk, to promote that activity. Ben, take it away.
BEN COX: This is the unending misery of BGP tools.
BGP tools is this: if you have ever used it before, this is the home page; it is immediately what you are confronted with. It is effectively a search engine. It's not a very refined site on the front page; there is not much to do other than give you what you are looking for.
It's designed to avoid you having to telnet into things like Route Views in order to get data about what's going on. It's designed to provide a better user interface than the other things, which I have some usability concerns with.
So, one of the things that I have put a lot of effort into is making sure things like mobile support work correctly. Nothing is worse than being paged in the pub, or being woken by a page, with somebody saying something is happening in your network, and then trying to load one of these sites and finding that you have to pinch to zoom or fight various other annoying JavaScript problems. It is also functional on the desktop, because that's probably where you are making your BGP changes. It's also built with modern‑day problems in mind. So, for example, the prefix and ASN lists show what your upstreams or peers are on a per‑prefix level rather than a per‑network level. A lot of the time, when you are looking at problems with an individual prefix, it's handy to know who the upstreams are on that particular prefix, or who is more likely announcing something in that prefix.
It also tries to help out with gotchas like the Hurricane Electric and Cogent split, and will put banners where things are potentially broken.
I make a lot of effort in making sure that the data is accurate and timely, and so the website will actually publish what its last poll dates for the data are. Alternatively, if the website looks like it's broken, you can quickly go into the features tab and see if most of the BGP sessions are down, because if they are, the site is going to be broken.
And there are some basic APIs if you want to script against it. I generally avoid adding APIs, because I have commitment issues: I don't want to commit to maintaining APIs that I then break in the near future. I do provide things like table dumps, like who announces each prefix, which is useful. And if you really do feel like you want to scrape the website, which I highly discourage, you are more than welcome to scrape the Gopher version instead. I want to see more Gopher on the Internet.
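As a rough illustration of scripting against a table dump, a Python sketch along these lines could fetch and index a prefix‑to‑origin dump. The endpoint URL, the JSON‑lines format and the field names used here are assumptions, not documented API details; check the site itself before relying on them.

```python
# Hypothetical sketch: fetch a prefix-to-origin table dump and index it by origin ASN.
# The URL, the JSON-lines layout and the "ASN"/"CIDR" field names are assumptions.
import json
import urllib.request

DUMP_URL = "https://bgp.tools/table.jsonl"  # assumed endpoint

def fetch_table(url: str = DUMP_URL) -> dict[int, list[str]]:
    # Public dumps usually want an identifying User-Agent with contact details.
    req = urllib.request.Request(url, headers={"User-Agent": "example-script ops@example.net"})
    by_origin: dict[int, list[str]] = {}
    with urllib.request.urlopen(req) as resp:
        for raw_line in resp:
            entry = json.loads(raw_line)
            by_origin.setdefault(entry["ASN"], []).append(entry["CIDR"])
    return by_origin

if __name__ == "__main__":
    table = fetch_table()
    print(f"{len(table)} origin ASNs seen in the dump")
```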
BGP tools also tries to be more adventurous in terms of data sources than normal websites would go for. For example, in the Internet Exchange tabs, you'll notice there are two more things attached to ASNs on the side. One of them is the detected vendor and one is a tick to indicate they were online. The vendor detection ‑‑ that's the logos ‑‑ is enabled by my friends at Internet Exchanges dumping their MAC address tables to me. Alternatively, I can get them from scraping the local addresses out of BGP MRTs, or just looking at the looking glasses that some exchanges provide.
Online detection is more interesting. IX looking glasses provide some good table dumps where you can see the next hops that are available, and if they are available, they are probably online on the Internet Exchange. There is also the abuse of larger networks who don't seem to ACL access to the IXP ranges, which is a real bonus for me ‑‑ it's actually pretty bad, but it's very useful for BGP tools, hence I won't list which networks are doing it.
The prefix page gives you this neat cute little graph so you can see whether the prefix you are looking at has stuff in it or not. It also attempts to give you a better view on DNS. Reverse DNS, which is a standard feature, is there, but reverse DNS works on v6 too, using some DNS server tweaks or tricks. I also go through things like certificate transparency and figure out which domain names are pointing into that prefix, so if you are looking at a prefix that might be from a hosting provider, you can immediately see which websites are pointing inside it.
It handles annoying edge cases too. For example, you have this horrifying prefix which is announced by 22 different ASNs, where the website implodes when you see this because it is not an expected thing to see. There are some other networks that do things differently: Gandi announces themselves from one prefix and then upstreams themselves from about 20 or 30 upstreams, and that's handled as well. A lot of my effort and time goes into finding these horrifying prefixes and saying okay, time to redesign how that actually works.
One of the biggest features is that networks are broken up into different policies. The era of a network having a continuous backbone and one announcement policy is kind of gone. Networks these days don't have a backbone between a lot of their prefixes. Think of Cloudflare: they have some backbone, but not every one of their prefixes is routable from each other, so they have independent announcement policies. This is not useful if you are looking at a wide view of the network; you are going to see two million upstreams. In this case the site automatically computes which prefixes have relatively the same, or practically the same, announcement path to tier 1s and then groups them together into things that you can individually scroll down.
It avoids situations exactly like this.
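A minimal sketch of that grouping idea, assuming we already have the AS paths per prefix; this is an illustration of the concept, not the site's actual algorithm:

```python
# Illustrative sketch: group a network's prefixes by a "policy signature", here simply
# the set of trimmed AS paths towards tier 1s observed for each prefix. Prefixes with
# identical signatures can then be displayed as one group.
from collections import defaultdict

TIER1S = {174, 1299, 2914, 3257, 3320, 3356, 6762, 6939}  # illustrative, incomplete list

def path_to_tier1(as_path: tuple[int, ...]) -> tuple[int, ...]:
    """Trim an AS path at the first tier 1 hop seen from the collector side."""
    for i, asn in enumerate(as_path):
        if asn in TIER1S:
            return as_path[i:]
    return as_path

def group_prefixes(paths_per_prefix: dict[str, list[tuple[int, ...]]]) -> dict[frozenset, list[str]]:
    groups: dict[frozenset, list[str]] = defaultdict(list)
    for prefix, paths in paths_per_prefix.items():
        signature = frozenset(path_to_tier1(p) for p in paths)
        groups[signature].append(prefix)
    return groups
```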
You may think about it: how hard can it be? So, this is the general data flow ‑‑ or last week's data flow ‑‑ and I feel like it's a bit of a cloud diagram. It's not that hard. Basically there are MRT importers; stuff comes in from RIPE RIS and Route Views and they crunch every six or three hours, depending on the publishing schedule of those route collectors. All that data gets put into Redis ‑‑ any data that lives for 24 hours or less goes into Redis, and that data is ultimately ephemeral anyway ‑‑ and practically anything else that has permanence, like for example Whois or PeeringDB or IRR databases, gets stored here.
And there was some realtime support to try and track new prefixes coming into the table.
This setup was remarkably cheap to run. It ran on a small amount of storage; the only real storage cost was the Whois database, which is its own separate pain point. And Redis sat at around 4 gigabytes of RAM to consume all of the full tables, which is, I think, about 200 full tables using RIS combined, some of them being the same. You can read the slides.
The site is generally very fast. The page render target is about 200 milliseconds, which is adhered to ‑‑ I have alerts for it ‑‑ trying to make sure the site never generates slow responses, where I can.
So, you may be wondering what is the absolute hardest feature of all of these platforms, and it's definitely not what you might think. You might think it's routing. It's this thing here. It's not Akamai that's the problem, although the 'access denied' is. The actual problem is naming ASNs. You may think this is very easy, but it is mind‑bogglingly difficult, and I'll give you some quick examples of the things that give me the most misery in my life.
This is my ASN's Whois block, slightly truncated. It's pretty obvious in this case: the AS name is 'Ben'. That's not useful to display to a user; that's what they put in when they first registered the ASN, and it might not even be true. Instead it's probably wise to use the organisation that RIPE gave you, which is hopefully in the same reality. That's an easy case. However, you can't apply the organisation rule to everyone. So, for example, this renders that. Here is a new example: this is an APNIC ASN that has been sponsored by someone. Their policy says that if you sponsor someone, the organisation is the sponsor, and the AS name, description, etc. get to be set by the person who is being sponsored. So obviously if you use the exact same policy as on RIPE on APNIC, you will find that every single ASN that has been sponsored obtains the name of its sponsor, and that's not useful. You have to yield a bit and allow APNIC sponsored ASNs in particular to set their own AS name. I try to avoid doing this, because if you let people set their own description for the network they will put the worst things in there, like 'super high tier 1 network', or alternatively just dashes. In general I try as hard as possible to avoid letting users set their own AS names; if they can, they must set it using the edit feature in BGP tools.
This renders that.
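A hypothetical simplification of that naming heuristic, just to illustrate the shape of the rules (the real rule set is considerably messier):

```python
# Hypothetical simplification of the ASN naming heuristic: prefer the registry's
# organisation name, except for APNIC-sponsored ASNs where the org record belongs to
# the sponsor, and fall back to the raw as-name only when nothing better exists.
def display_name(rir: str, as_name: str, org_name: str | None, sponsored: bool) -> str:
    if rir == "APNIC" and sponsored:
        # The organisation object is the sponsoring LIR's, so it would mislabel the ASN.
        return as_name
    if org_name:
        return org_name
    return as_name
```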
You occasionally have quite exotic Whois servers. For example, JPNIC has this, unlike anything else I have seen, which makes it creative to find the correct names. This is attempting to find out what is the correct name you should put on a prefix. The answer is, I guess, KDDI. But it can be very unassuming.
Occasionally you just have things like LACNIC, who give you broken unicode over the Whois socket. They have recently fixed this, and I don't know if they know they fixed it, but for a long time it was actually giving out what I call double mojibake: they had corrupted the UTF‑8, but in a normal situation this is recoverable using some careful unicode rules. Unfortunately they had then double re‑encoded it, making it completely poisonous, and it meant that there was no way to scrape accurate data out of LACNIC. This changed recently. LACNIC also has some of the strictest Whois rate limits that I know of, meaning that it's very hard to keep up with their changes.
So, eventually ‑‑ unfortunately ‑‑ most LACNIC ASNs end up looking like this.
Falsehoods programmers believe about Whois ‑‑ this will become a trend. Whois is an incredibly hard thing to deal with, because it's just dealing with unstructured text and trying to figure out roughly what people's intentions were when they were doing things. Before people go and ask: things like RDAP are basically just parsed Whois in my eyes; it doesn't solve the root problem. The data quality is questionable at most.
While we're on the topic, falsehoods programmers believe about BGP ‑‑ every one of these bullet points has cost me several hours of my life. I think the worst one is 'there is only one AS path per route'. You may question what that means. Unfortunately, if you write anything that touches BGP, there actually turn out to be two AS paths in the BGP message, and it's easy to pick the wrong one and get really wild data. That cost me probably several days, because it was very insidious and only happened on a few peers.
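The two AS paths here are presumably the AS_PATH attribute and the AS4_PATH attribute carried for 4‑byte AS numbers; a sketch of roughly the RFC 6793 reconciliation a parser has to do (segment types are ignored for brevity):

```python
# Sketch of reconciling AS_PATH and AS4_PATH: a 2-byte-only speaker substitutes
# AS_TRANS (23456) in AS_PATH and carries the real 4-byte ASNs in AS4_PATH, so picking
# the wrong attribute, or merging them incorrectly, gives wild-looking paths.
AS_TRANS = 23456

def effective_as_path(as_path: list[int], as4_path: list[int] | None) -> list[int]:
    if not as4_path or len(as4_path) > len(as_path):
        # An AS4_PATH longer than AS_PATH is inconsistent and has to be ignored.
        return as_path
    # Keep the leading hops only AS_PATH knows about, then splice in AS4_PATH.
    return as_path[: len(as_path) - len(as4_path)] + as4_path

assert effective_as_path([64500, AS_TRANS], [4200000001]) == [64500, 4200000001]
```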
So, generally speaking, websites like mine work by importing RIPE RIS and Route Views. However, these slides make it look simple: there is a web server, you can download the data from it, and they are generally MRT files. However, I have grown a small disliking to MRTs, in that the format is not consistent ‑‑ there is an RFC, but what different route collectors produce sometimes differs. The files are sometimes not published consistently, or they can become stale without you realising. Sometimes the actual update times on the files can be recent but the data is two months old, which is not good.
I have a big axe to grind against MRT collectors these days, including that some MRT collectors don't screen their peers at all, and peers will submit outright false information about the network to you, which kind of poisons your view of the Internet.
Looking down the graph of BGP: processing is easy until the billion‑node graph turns up. Rather than going down that route ‑‑ instead, BGP tools cheats a bit and has a hard‑coded list of tier 1 networks, and then all upstreams of a network are derived from how someone can get to a tier 1. This is not necessarily true ‑‑ for example, some people peer with NTT, Level 3, etc. ‑‑ it's just easier this way.
So, for example, this is a set of AS paths. We can assume, because 1299 is a tier 1 and 3170 sits between it and my ASN, that my ASN is being upstreamed by 3170.
This was absolutely fine, right up to the point where Sprint and Zayo started accepting routes from route collectors on Internet Exchanges. That was particularly annoying. So, suddenly everybody was connected.
Unfortunately, the normal mitigation I would have used here is to assume that one of the effective rules of tier 1 relationships ‑‑ see peer lock, etc. ‑‑ is that they should never be more than two AS hops away, or, in general, that you should be able to tell whether it's an upstream relationship by whether it gets handed to another tier 1. Unfortunately you can't enforce this policy too hard, because certain networks have a lot of connectivity to a lot of different tier 1s, and if you run this assumption then all of Cloudflare's upstream relationships will disappear. Instead, there is some particular handling for Sprint, Zayo and sometimes Hurricane Electric, where it's double hops.
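A toy version of that tier‑1‑anchored inference, again only to illustrate the idea and not the real implementation with all its special cases:

```python
# Toy upstream inference: for every AS path containing the target ASN, the AS sitting
# between the target and the nearest tier 1 is treated as an upstream. Real data needs
# the special-casing described above (route-collector routes on IXPs, double hops for
# Sprint, Zayo and sometimes Hurricane Electric, and so on).
TIER1S = {174, 1299, 2914, 3257, 3320, 3356, 6762, 6939}  # illustrative list

def infer_upstreams(target: int, as_paths: list[list[int]]) -> set[int]:
    upstreams: set[int] = set()
    for path in as_paths:  # paths are ordered collector first, origin last
        if target not in path:
            continue
        idx = path.index(target)
        for i in range(idx - 1, -1, -1):
            if path[i] in TIER1S:
                # The AS just after the tier 1 (or the tier 1 itself if adjacent)
                # is taken to be the upstream of the target.
                upstreams.add(path[i + 1] if i + 1 < idx else path[i])
                break
    return upstreams
```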
The awkward truth: in general, these sites are only as good as the data they are provided with. MRT files are really quite biased, and I'll get to that later. All of them have stuck routes ‑‑ I have lost several days of my life trying to debug weird data thanks mostly to problematic peers sticking routes in control planes ‑‑ and de‑biasing the data is a lot of work. You have to be prepared to send out a lot of e‑mails asking networks: is this broken? Is your router broken? What's going on here? For the most part people have been responsive to these e‑mails, thank you. Writing these tools is very easy up to the 90%, and the rest is where all the misery is. The long tail of bugs is extremely long.
MRT files are also generally quite biased. This is a graph showing the number of paths that intersect each tier 1, and you will note Hurricane Electric is by far the highest. Most carriers are not that visible; for example, Verizon is barely visible at all in RIS, and if we look at Route Views, it is relatively the same story. None of these reach the full table size, which implies that from the standard MRT data sources it is very hard to find full tables from Verizon, Orange and Sprint ‑‑ you can for some, but there is a certain exotic tier 1 that you cannot find full tables for.
So what am I doing about this? We built our own collector. If you can't beat them, just simply recreate what they have.
So, introducing the BGP tools route collector. It is frictionless and easy: you don't have to e‑mail me unless there is a problem. You can just log in at the bottom with PeeringDB, and it will ask you to set up communication methods. There are some exotic communication methods ‑‑ I can send you messages through Slack or other notifications, whatever you want, as long as I can reach you.
You can instantly create new sessions and bring them up as soon as possible ‑‑ there is no wait or propagation delay ‑‑ and you can see the status of every single session on the backend in realtime.
This is the form that, hopefully, you will fill out in order to create a new BGP session. It's very basic. When you submit this page, a capacity search goes and finds a route collector that has memory capacity for your session and gives you that there, and once it gives you that IP it's literally good to go. I highly recommend you go and set one up.
When you go and set one up it will ask you these three questions. The first is whether you want to export data outside of BGP tools, in case you want to give this data to the academic community. The second one is whether you want the MRT files split out into a separate pool ‑‑ I feel like one of the reasons that people don't want to feed these tools is because commercial tools go and use the data as well, so you get to choose.
And the third one is whether you want to be notified if your session goes down.
In order to pull this off I wrote a BGP daemon. It does a couple of quite nice things, for example splitting every session into an individual process so they can crash independently. Generally they co‑exist, but there is nothing worse than dealing with a 64‑gig to 128‑gig mega process; in this case you can go and kill sessions off or slowly roll them without having too much operational misery.
It makes a massive difference. The data quality has already improved despite only having around 80 sessions ‑‑ much better visibility of certain tier 1s, especially things like Zayo and Verizon. So thank you to the people who are already feeding me and who have sent me sessions and brought them up. Thank you.
Unfortunately, that was a GIF ‑‑ I didn't realise we were using PDFs for this. Basically, if you keep prefix pages open, the actual website will update in realtime without refreshing as changes happen. So you can use this to plan maintenance or anything you are doing, like dropping a carrier; you can go and watch the change in the global Internet.
I won't go into this too much. This is what BGP tools looks like these days: BGP sessions now feed it rather than Redis. And if you like this tool, I am particularly looking for these tables ‑‑ if you have these, talk to me afterwards and set up sessions.
If you want better peering numbers ‑‑ because who doesn't want a better, more attractive looking network ‑‑ you should feed it anyway.
Also, obviously, helping the academic community. Questions, requests, etc?
SPEAKER: Have you considered ingesting BMP streams?
BEN COX: Yeah, I find those difficult to use because it's very hard to track when BMP streams disconnect. I may be stupid in this situation and not understand BMP streams correctly, but I always fear that if I miss a session‑drop notification then effectively every single route from there on is stale and I have no idea. Which sounds like a catastrophic failure ‑‑ to me that is beyond catastrophic, because everything from there on is stale.
ROBERT KISTELEKI: I applaud you for the amount of information you conveyed in the amount of time you had, and also the stenographers for keeping up with you.
The question of naming ASNs came up many, many times before. Would you say that there would be a point in making some kind of a crowdsource database that actually has useful AS names?
BEN COX: Someone earlier today asked about effectively an API ‑‑ I have an API hidden behind the sofa for dumping ASN information right out of BGP tools. I didn't realise there was so much demand for this. I'll happily make it basically dump a CSV of what everyone's name is, and I go through this list and make sure it's accurate.
ROBERT KISTELEKI: I think what you think is the correct name is useful, but if people ‑‑ you know, the crowd ‑‑ can help you in there, that's probably even more useful.
BEN COX: There is an edit function on every page; if you think that something is obviously false or wrong you can submit an edit.
SPEAKER: Davide. I have a question regarding data retention. You mentioned like 13 gigabytes of data. We have a similar database ‑‑ we have 77 terabytes ‑‑ and so my question is: how long do you retain your data, the updates, in your database?
BEN COX: I don't put updates into the database at all. Right now, until I start actually exporting MRT files publicly, basically no data is retained. If an update comes in, it alters the table and the previous data is thrown away. This is probably going to change in the coming months when I purchase hard drives to store this. Right now I don't have the storage capacity.
SPEAKER: Well, plan for that and get a lot of storage capacity.
IGNAS BAGDONAS: We have one question online, from Anton Versuren: have you considered or investigated using RDAP instead of Whois where available?
BEN COX: What was the thing instead?
IGNAS BAGDONAS: RRDP.
SPEAKER: ‑‑ RDAP. RRDP is a different thing.
BEN COX: I have looked at that. I may be wrong or misunderstanding things, but it effectively looked like it does the job of just parsing the Whois into JSON fields, which isn't always what I am looking for. I can parse the Whois fields; it's about knowing which field to pick the data from correctly, per RIR and per sub‑RIR.
RUDIGER VOLK: Sorry, no question. Kind of a flashback into history. I tried to get the schema of the SRI NIC in '89 and I was told: okay, we can't hand it out, and anyway it's based on ancient software running on 36‑bit computers which were still around at the time ‑‑ well, okay, for that time anyway.
Unfortunately, as we managed to get something like the RIRs out, the question of coming up with a common defined schema actually didn't happen, because the Americans stayed with the old stuff, the Europeans did something nice which actually got standardised as RPSL essentially, but kind of APNIC broke some rules, LACNIC invented its own thing looking like the RIPE stuff but with a definitely completely different schema, and the IETF work to recover from this essentially ended in RDAP, which provides a common access protocol for a coordinated schema. So, kind of saying, well, okay, the presentation layer that you see you find annoying ‑‑ yes, I would prefer XML, please throw eggs ‑‑ but the presentation layer really is not the interesting thing. The interesting thing is to first have a clear schema idea, and then work with the presentation layers, and RDAP there is the right direction to orient towards.
BEN COX: I agree, thank you.
SPEAKER: Not a question ‑‑ we have got a question online from Mike Mershel asking what's the link for setting up a BGP feed?
BEN COX: It's bgp.tools/ ‑‑ go to BGP tools and scroll to the bottom; there is a button called 'contribute data'. Click that and there will be a brief page telling you what you are getting into. If you cannot log in with PeeringDB, send me an e‑mail and I'll sort it out.
IGNAS BAGDONAS: Hello again, let's discuss a little bit about BGPSec.
So, all of you are well aware that BGPSec is just evil. It doesn't exist; it doesn't work; if it works, it's too slow. If you want to deploy it you need to replace all your kit with new shiny kit because it has large resource requirements. You might also argue that origin validation is more than enough for securing the routing system, and that there are other options available for path validation. It just doesn't scale, it may leak things which you don't want others to see, maybe it's just a nice academic experiment that doesn't address the real problem, and dealing with the keys is always complex. And BGP is secure anyway, so why do we need to do anything else?
So, that's quite a lot of negativity. Let's focus on a couple of those areas and look deeper into them.
It's about the performance of BGPSec. A reminder of what BGPSec is: it is a path validation solution ‑‑ we cryptographically sign the AS path and also forward‑sign to whom we are advertising the prefix. This way we get a cryptographically verifiable chain that can give you an answer whether this path is really valid or not.
Each individual prefix is signed separately, and that is one of the contributing factors to the complexity. And I really thank you for pointing the light straight into my eyes ‑‑ now I cannot even read my own slides.
So, let's do some experiments and see how it operates in close to realistic world environment.
We model operation as a route server in an exchange of a moderate size: take the publicly available data, slice it, and get a realistic distribution of paths, prefix‑to‑path ratios, path length distributions and things like that. The total setup is 450 full feeds. BGPSec is pre‑computed on a feeder node; all the cryptography, all the heavy operations, are done in advance, ahead of time, for the simple reason that, for example, generating several hundred million random numbers is not a trivial task. Then that is all fed into the device under test, and the measurement is the total convergence time: first prefix in, last prefix out. BGPSec verification is done before best path selection, therefore we are testing the performance of BGPSec operations on all feasible paths, not just the best one.
And also, we try to use caching of various types and sorts where available and where feasible.
Now, the results themselves are relative, not absolute; they just show the ratio of plain BGP versus a specific implementation of BGPSEC on specific software in a specific environment, and should not be seen as an absolute source of truth. What is important is the difference in convergence time, and that difference is quite noticeable: if something takes around a minute and a half with plain BGP, redoing exactly the same experiment with BGPSEC takes over half an hour.
Now, you can argue that in a realistic network design you will have multiple route servers or other BGP central nodes, and taking one out of service for half an hour is probably doable. However, those numbers seem to be a little bit off, so let's look deeper at what is causing that. Is that the protocol's fault or is it something else?
Stepping back into an analysis of general compute platforms: if we look at the contemporary platforms, today and for the last decade they have had more than enough raw compute capacity. The speed is not a limiting factor; there is definitely more than needed. The limiting factor is memory bandwidth and latency, and that is not getting better. Successive generations of memory systems basically operate at the same speed, so the core operational speed is mostly the same. What is increasing is the interface bandwidth, but that can only be used if you are able to sustain multiple ‑‑ and by multiple I mean multiple hundreds of ‑‑ active memory operations; otherwise you are just getting hit by the memory latency cost.
If we are looking at scalar compute platform performance increase, that is single‑digit percentages generation to generation at best. The major increase in compute platform performance is in vectorisation: basically having operations which operate on wide or multiple parallel sets of data, and if you want to use the compute platform to its potential, that is the only practical way.
Caching and all sorts of batching approaches ‑‑ doing multiple operations in one instance of time ‑‑ are normal software engineering practice. However, what is important is that the protocol mechanics and data structures need to be friendly to that. If today you write an implementation that is not friendly to the underlying hardware, you should not expect to get reasonable performance.
Looking into the details of how BGPSEC operates on the wire: on the receive side, we receive the signed path, which is a rather long message by BGP standards ‑‑ today it's about 100 bytes per AS path hop. We don't sign and verify that message directly. We take a hash of the message, get a much smaller fixed‑length value, and then sign the value of the hash, or validate it.
Hashing is a computationally not that expensive operation, but it touches memory, and touching memory is expensive.
For signatures, what the current BGPSEC specification uses is a noticeably compute‑intensive operation: it consists of trivial operations, but on long integers, 256 bits and above. The good thing about it is that it doesn't touch memory, therefore you are limited only by the capabilities of your compute platform.
Both of those operations can be vectorised, provided that memory accesses do not interfere with your platform operation and also the protocol allows for that.
Here is how the parallelised operation of signature validation can look. You need to do validation for each hop recursively: starting from the most recent hop, you recursively continue towards the origin. The operation performed is exactly the same ‑‑ you calculate the hash ‑‑ but on different data. Then you have a set of hashes, and then you perform the validation procedure with different keys. This is an ideal fit for vectorised processing. However, not everything is so fine. The wire format of BGPSEC is directly opposed to our requirements for efficient vectorisation. Those 100 bytes per hop are spread around multiple places in the message, in fragments that are not multiples of 4 or 8, and if you want the efficiency that the hardware platform provides you, your granularity needs to be 4 or 8. Moreover, the SHA‑2 hash function also requires that the minimum element on which it operates is 4 bytes. Therefore, in order to use the efficiency of the platform, first we need to copy memory back and forth ‑‑ again paying the cost of memory bandwidth ‑‑ then do our actual work, and then, because the wire format is not our compute format, we need to copy that data back again only to send it out. We end up paying the memory latency cost multiple times.
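Conceptually, the receive side does the same two steps for every hop: hash the data that hop signed, then verify the hop's signature over that digest with the hop's router key (the specified suite is ECDSA P‑256 over SHA‑256). A schematic sketch using the pyca/cryptography library; how the per‑hop signed data is assembled from the wire format is deliberately left out, because that is exactly where the copying cost discussed above lives:

```python
# Schematic per-hop verification loop for a BGPsec path: verify each hop's ECDSA P-256
# signature over the data that hop signed. Assembling `signed_data` from the
# Secure_Path / Signature_Block wire format is omitted here.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec

def validate_path(hops: list[tuple[ec.EllipticCurvePublicKey, bytes, bytes]]) -> bool:
    """hops: (router public key, signature, data that hop signed), one tuple per AS hop."""
    for pubkey, signature, signed_data in hops:
        try:
            # verify() hashes signed_data with SHA-256 and checks the ECDSA signature.
            pubkey.verify(signature, signed_data, ec.ECDSA(hashes.SHA256()))
        except InvalidSignature:
            return False
    return True
```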
On the transmit side, BGPSEC signs the target (next hop) AS number together with the prefix. The hash is taken over the whole constructed message, and signing it is simple. What is not so simple is the actual location of the target AS, which is at the very beginning of the message to be hashed. That means that if we, in this particular experiment context, have 450 outgoing neighbours, we will have to rehash exactly the same message, which differs only in the first four bytes, 450 times. If we move the location of the target AS to the back, we can calculate the hash once and cache its value ‑‑ it's a block hash function that works in 64‑byte increments, so that's trivially cacheable ‑‑ and the last block will be different anyway, with the different target AS and the different prefix. And again this operation can be vectorised efficiently if the data layout allows for that.
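The caching trick works because SHA‑256 is a block‑wise, incremental hash: you can feed the shared part of the message once and fork the hash state per neighbour. A small illustration of the idea in Python (not BGPsec's actual message layout):

```python
# Illustration of "hash once, reuse per neighbour": SHA-256 processes input in 64-byte
# blocks, so a hash state already fed with the common part of a message can be copied
# cheaply and finished with each neighbour-specific tail (target AS and so on).
import hashlib

def per_neighbour_digests(common_part: bytes, neighbour_tails: list[bytes]) -> list[bytes]:
    base = hashlib.sha256()
    base.update(common_part)      # done once, not once per neighbour
    digests = []
    for tail in neighbour_tails:
        h = base.copy()           # cheap copy of the internal hash state
        h.update(tail)            # only the differing suffix is hashed per neighbour
        digests.append(h.digest())
    return digests
```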
So, with these experimental fixes and changes to the protocol layout and its image on the wire, we can get, without too much effort, an order of magnitude of performance increase.
That is not BGPSEC as you have it today. Does it mean that BGPSEC is completely broken? No, it's not completely broken. The biggest problem is that the way the protocol is specified today is suboptimal and openly unfriendly to contemporary compute platforms, and going forward that is not going to change.
Memory cost and memory latency will just increase, and we will have far more plain raw compute capacity. Therefore, what can we do about this?
The current specification of BGPSEC allows for some flexibility: it allows you to define different algorithm suites. And there are certain advances in the cryptography industry ‑‑ cryptographic systems which are far better suited for this than what is currently used in BGPSEC, because the prime field on which they are based is much closer to binary numbering, and certainly everyone else using those universally as a default algorithm cannot be wrong. That needs to be taken into account. There are also hash functions more suitable for this use case than SHA‑2, and the wire format needs to be rearranged.
So there is clearly work in scope for the IETF to do, and that work is slowly trying to progress.
So, some open questions about the overall scope and feasibility of this. Can we avoid all of that by just passing some secret switch to our favourite compiler so it will take care of it and do it automatically? No, it cannot. We are talking, first of all, about the wire image, the layout on the wire, and that requires protocol changes. We are talking about a beyond‑trivial layout of data structures. Compilers do auto‑vectorisation, and they are good at covering rather routine tasks, but at this scope and coverage a magic compiler switch will not help you. What if I rewrite my BGP implementation in whatever programming language is fashionable today? Well, you will get mostly the same, if not worse. That's not a problem of language selection; that's a problem of the data structure layout.
Availability of vectorisation is universal. The laptop you are typing your e‑mails on very likely has vectorisation ‑‑ in particular we are talking about the x86 platform, where AVX has been available since 2011 and AVX‑512 since 2015. You just need to use it, and you need to use it specifically; it's not something you get the benefit of by default.
Memory system evolution will certainly continue, but you cannot run against the laws of physics. The speed of the capacity array, which is the centre of the memory system, is mostly constant; what is increasing is the width and the speed of the interface into that capacity array. And yes, if you can sustain multiple operations, you will benefit from that. If you have scalar code which tends to have its memory accesses scattered around, you will still continue to pay the price, and that price is increasing: the newer generations of memory are overall slower for this type of usage than the previous ones.
Therefore, BGPSEC can be made reasonably performant. Yes, you need to make some tweaks to the protocol, and given that BGPSEC version zero, which is the current specification, has zero percent worldwide domination in deployment, that doesn't seem to be too backwards‑breaking a change.
The other aspect is that the majority of contemporary hardware, both specialised network kit and general compute platforms, has more than enough capacity and plenty of means for doing what we need here: crypto extensions for crypto acceleration, vectorisation of various flavours, and there are also advances in plain software engineering where you can get benefit out of using your platform. Therefore it's not true that BGPSEC as such requires a change in your hardware. Of course, if you have an ageing platform you might have good reasons to replace it, but not only because of BGPSEC.
So with this, we addressed a couple of the problems with BGPSEC. Of course, there is still a large list of remaining items. This doesn't mean that doing only this will magically allow global BGPSEC deployment to happen tomorrow; no, that is not going to happen. We still have a long way to go. Therefore, we are taking this a step at a time and addressing one problem after the other.
And that is basically end of my story here.
Any questions, comments, or radical disagreements?
JOB SNIJDERS: Before people jump to the queue, we have time for one question and I am going to pick it from the Q&A. Kurt Kayser asks: "Would an overhaul of BGPSEC ideally replace RPKI? I understand the implementation needs to be within the routers rather than parallel to the net as with RPKI."
IGNAS BAGDONAS: BGPSEC is the wire protocol. Rudiger, you have a comment?
JOB SNIJDERS: I am going to do strict queue management. We have no time. Rudiger ‑‑
RUDIGER VOLK: There is one basic false statement, Ignas, that you made. You said BGPSEC is AS path policy. That's wrong.
IGNAS BAGDONAS: Did I say that?
RUDIGER VOLK: I think I missed the second word, but I said it is path stuff, and you missed the essential thing: BGPSEC is authenticity of a BGP attribute. Before BGPSEC, everything in BGP is not authenticated, is not sure to be the truth. BGPSEC does that for the path, and that gives us the possibility to make sane decisions about how we use the route, because we know this is true. And the RPKI is a cryptographic system which is the basis for actually doing that. So the question kind of shows basic misunderstandings that are very widespread.
IGNAS BAGDONAS: Those things are ‑‑ one is a protocol and the other is a database; in order to get something you have to have both of them, and one relies on the other. BGPSEC is not a replacement for RPKI. BGPSEC could be used for other attributes if you wish ‑‑ there is nothing in the protocol precluding that, it's just not specified today. I think we need to look into that, but that's a completely separate topic.
JOB SNIJDERS: Rudiger, it is rude to jump the queue ahead of other people. Don't do it again.
RUDIGER VOLK: I was first there.
JOB SNIJDERS: We are chairing this session. You don't get to decide this.
We're out of time on this topic. Sorry Tom. The next presentation is Mikhail from RIPE NCC, who will tell us about a project they have been working on recently called publish in parent.
Thank you for being here, I look forward to your presentation.
MICHAIL PUZANOV: This is going to be a relatively short talk, essentially an update on what we are busy with in regard to this topic. It's a slightly clumsy name, RPKI publication in parent.
What is it about? It is essentially about the model that you can have for the certification authority when you create your RPKI setup. There are two choices: the hosted or the delegated CA, which is pretty much supported by every RIR nowadays, and as you can see in these big cartoonish pluses and minuses, there are obviously pros and cons to both models. The hosted one is actually very easy ‑‑ let's take, for simplicity, the RIPE NCC version: it is basically clicking around in the UI for a little bit, and from that moment on we, the RIPE NCC, will create the objects, update them, rotate them, make sure that they don't expire and so on and so forth. But we also store your private keys in this case.
So, that requires some extra trust, and for people who are not really okay with that, there is the delegated CA version. The downside of it is that all the objects ‑‑ all the certificates, ROAs, manifests, whatnot ‑‑ are supposed to be maintained by software running on the premises of the child CA: the LIR, the end user, whatever. That obviously requires more work and effort to maintain.
So here comes in the publication‑in‑parent thing. Essentially, it's a service for delegated CAs. You have probably heard about it before, called hybrid RPKI, or publish in parent, from APNIC ‑‑ I think historically APNIC was the first one to introduce it, maybe two or three years ago, probably even more. It's also been working in ARIN for about half a year or so. So we are not the innovators in this area, at least.
So, essentially, the delegated CA does create all the objects, and does maintain them and rotate them and all that, but it sends them to the parent CA, in this case the RIPE NCC. There is RFC 8181, the publication protocol in other words, which is used for that.
So, that's the bit that essentially makes it unnecessary to run your own repository, which is probably the most time‑ and effort‑consuming part of running your own delegated CA. We, as the parent CA, will basically keep the repository for you.
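Conceptually the delegated CA still creates and rotates everything itself; the only thing that changes is where freshly signed objects end up. A very rough sketch of that flow follows; the function and class names are placeholders, not a real client library, and the real exchange is the CMS‑protected XML publish/withdraw protocol of RFC 8181 carried over HTTPS:

```python
# Hypothetical sketch of publish-in-parent: the child CA signs its own objects and then
# asks the parent's publication service to publish (or withdraw) them, instead of
# writing them into a locally hosted rsync/RRDP repository.
from dataclasses import dataclass

@dataclass
class PublishRequest:
    uri: str        # where the object should appear in the parent's repository
    content: bytes  # the DER-encoded signed object produced by the child CA

def publish_to_parent(service_uri: str, requests: list[PublishRequest]) -> None:
    # Placeholder: a real client wraps these requests in RFC 8181 XML, protects the
    # message with CMS, and POSTs it to the parent's publication service URI.
    for req in requests:
        print(f"would publish {len(req.content)} bytes at {req.uri} via {service_uri}")

def publish_roas(roas: dict[str, bytes], service_uri: str) -> None:
    publish_to_parent(service_uri, [PublishRequest(uri=u, content=der) for u, der in roas.items()])
```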
So what's the point of it? Why would you want to do that? First of all, it has relatively good availability. Our repository is pretty nicely available ‑‑ I'm not going to beat my chest and say it's a hundred percent, but it's pretty good. It's also backed up ‑‑ well, cached ‑‑ by two CDNs at the moment, so it can hold really high loads and it can absorb pretty serious traffic spikes. So things like changing the session ID in the RRDP protocol, which usually results in very high traffic, are actually not a problem.
And the other, probably even more important, thing that I already mentioned is that having a delegated CA then doesn't mean you have to maintain your own repository and make it available ‑‑ set up the rsync, set up the HTTPS, all of that together. And that seems to be a real issue for quite a lot of current CAs, because out of the 25‑plus delegated repositories that we have under our trust anchor, at least one, two or three are usually not available: having things expired, having things broken, rsync not set up, something like that.
So people do tend to, I don't know, set up an experiment and then forget about it, or just forget about it. That does happen, and this thing would really help.
From the technical point of view: this is the minimal viable product that we are working on now ‑‑ I'm not reporting a finished thing, it's work in progress. The work is based on the Krill publication server software. We are essentially not expecting a very high number of such delegated CAs: at the moment of making this slide, APNIC had four clients and ARIN probably more, but anyway it's below ten in any case. So we don't expect a very high churn of data or objects. The redundancy setup that we are going for at the moment is essentially some sort of primary/standby arrangement based on instances of Krill; there is no big fancy cluster‑based solution at the moment because it probably doesn't make sense. But of course, if the feature has serious uptake, which may happen ‑‑ because at least in theory it sounds like quite a useful thing to have ‑‑ then we'll have to change the implementation to something more suitable for, let's say, higher data volumes.
The status of this work: it's been our main focus for probably the last couple of months. There is a lot of infrastructural work, thinking and rethinking, and it's probably already the third version of the same thing ‑‑ we kind of had to throw away the previous two versions and start again. The production launch is planned for Q2, which means we have probably six or seven weeks left to do that.
And we would like to have people who will participate in testing this, because we definitely need someone to tell us that we're doing something terribly wrong, if we are. So, if you want to be included in this sort of trial, drop us an e‑mail ‑‑ there is a channel. It would be nice to have people who would try this thing in production.
That was it. If you have any questions, please go ahead.
SPEAKER: Blake Willis, Zayo. Thanks for this. Do you have an idea of what scale we're talking about before you need to reevaluate like hundreds, thousands, ten thousand, in terms of CAs on the NCC platform?
MICHAIL PUZANOV: So for the ones that are hosted, we have at the moment I believe 20,000 of them. For the delegated it's tens, so, I mean tens of the repositories that are actually running and doing something because I believe that there is much more of them in our database but they are not doing anything. So, it's below 30 or so.
SPEAKER: How far do you think that can go before you need to reevaluate?
MICHAIL PUZANOV: I guess it's not just the number of CAs, it's the amount of queries. We don't expect a lot of data, but we may expect a lot of requests, basically clients knocking at the door all the time, and if we have to be more realtime and have multiple instances running without much time to set that up, that may be an issue.
SPEAKER: Thanks.
SPEAKER: Tim Bruijnzeels, NLnet Labs, a data point: in Brazil they are using this model effectively for well over 1,000 delegated CAs, so I think that proves that running your own CA and publishing at your parent is relatively easy to do.
With regard to the availability: this is the availability for CAs to publish; that's independent from the availability of the data to relying parties. And I'll leave it at that for now.
MICHAIL PUZANOV: That is what I meant by the availability. Availability of the repository.
JOB SNIJDERS: Not a question for you but an address to the group. I believe publish in parent is the best current practice when it becomes available as a production service. Imagine that there are 10,000 CAs and your RPKI validator needs to connect to 10,000 separate end points; that has scalability issues for the whole ecosystem. So sending your data upwards to RIPE reduces the number of connections each validator needs to make, improving performance for everybody. So, it's not just that it's hard to host your own publication point ‑‑ you are literally doing everybody a favour by using RIPE's publication service instead of running your own. So, use publish in parent. And Tim, thanks for coding support.
IGNAS BAGDONAS: Thanks a lot for presenting this, and then ‑‑ and next up is Job with his presentation.
(Applause)
JOB SNIJDERS: All right. We are running over time. I already feel bad.
My name is Job Snijders, I work for Fastly, and I code on the OpenBSD project and as part of the OpenBSD project, an operating system that tries to integrate cryptography as far as they can, we also run and develop an RPKI validator. And in this particular presentation, I want to teach you some things about the RPKI data structures and how you can look at that from a command line perspective.
This talk is trying to zoom in on the underlying file structures.
In this presentation, we'll go over what a signed object is, a little bit of an overview of the various object types that exist, how exploring and validation work at a high level, and some remarks about timers.
Now, what is an RPKI object? It's super simple. You download an object and you can use the Unix cat utility to look at the encoded garbage, and if you can read this without external tooling, you have some more studying to do.
This is a huge departure from, say, IRR‑based data or even RDAP data, because as network operators we are, I think, generally somewhat used to being able to look at the data and have it presented to us in a somewhat human‑readable format, which greatly speeds up our debugging abilities. But with RPKI we lost that ability, and that means it needs to be supplemented by other tooling, and that tooling is now somewhat reaching a level of maturity where it's usable by the wider public.
So, what is a signed object? A signed object is a binary blob and it follows a serialisation format from ASN.1, and each blob is identifiable by its hash. Each signed object is non‑malleable. That means that there is only one way to binary‑encode a specific object, which means that a unique hash exists for each object and there is a strong one‑to‑one mapping between the two. Why this is important we will get into later.
Signed objects are encoded according to the Cryptographic Message Syntax, CMS. This is an IETF standard to facilitate cryptographic operations and to put data in a specific way into a structure which then lends itself to signing.
You can think of CMS as sort of an envelope, where on the outside of the envelope you have a signature and a hash of the contents of the envelope, and this allows a validator to peel back layer by layer, jumping towards the actual substance of the signed object in a safe and secure manner. What is in the envelope is mainly two things. The first is the so‑called eContent, or encapsulated content, and this is the meat of it: in the case of a ROA, this is the origin ASN and one or more prefixes.
Also, inside the CMS envelope is an X.509 EE certificate. This X.509 certificate contains various pointers to other objects in the RPKI that you use to create a chain ‑‑ a validation chain, and some hash and signature stuff.
You can think of the RPKI as a sort of tree where you go from root to edge. The root is a so‑called trust anchor; you find this trust anchor using a trust anchor locator. There are intermediate nodes in this graph, certificate authorities, and at the far edges, the leaves, you have signed objects such as ROAs, GBRs, CRLs, what not. You can trust those objects through a mechanism called derived trust: the trust you put into the trust anchor is assumed ‑‑ you choose to trust the right trust anchor ‑‑ and from there it logically flows, so you derive trust in relation to, say, ROAs or BGPSEC certificates.
You download the TAL; this is usually a one‑time operation. That TAL contains a reference to the trust anchor certificate, which is a self‑signed certificate. You download the trust anchor certificate. This contains a thing called the subject information access, which is basically a pointer to the manifest, so you download the manifest. The manifest is a listing of files, which could be other certificates, at least a CRL, and ROAs or GBRs or future signed objects. You then in turn download those files. You can verify that you downloaded the correct files by comparing the hash of what you downloaded with the hash on the manifest, and if one of the files that you downloaded and that was listed on the manifest was a CA certificate, this repeats. Of course, in reality validators do lots of optimisations and download stuff in bulk and go through it very efficiently, but in concept these steps are what happens for each certificate, its manifest and then the subsequent objects.
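In rough pseudocode, that walk looks like this; a conceptual sketch with a toy in‑memory repository standing in for real RRDP/rsync fetching and ASN.1 decoding, not how any particular validator is structured:

```python
# Conceptual sketch of the top-down RPKI walk: CA certificate -> manifest -> listed
# files, checking every file against the hash on the manifest and recursing whenever a
# listed file is itself a CA certificate.
import hashlib
from dataclasses import dataclass, field

@dataclass
class Obj:
    kind: str                                                  # "cer", "mft", "crl", "roa", ...
    manifest: str = ""                                         # for "cer": manifest URI (the SIA)
    file_list: dict[str, bytes] = field(default_factory=dict)  # for "mft": name -> SHA-256

REPO: dict[str, bytes] = {}     # uri -> raw object bytes (toy stand-in for fetching)
PARSED: dict[str, Obj] = {}     # uri -> decoded object (toy stand-in for ASN.1 decoding)

def explore(cert_uri: str) -> list[str]:
    """Return the URIs of all objects that check out under this CA certificate."""
    valid: list[str] = []
    manifest = PARSED[PARSED[cert_uri].manifest]
    for name, expected_hash in manifest.file_list.items():
        data = REPO.get(name)
        if data is None or hashlib.sha256(data).digest() != expected_hash:
            continue                            # missing or corrupted file: skip it
        valid.append(name)
        if PARSED[name].kind == "cer":          # a child CA certificate: repeat the process
            valid.extend(explore(name))
    return valid
```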
I made a beautiful drawing this very morning; I am going to put it up as an NFT. At the top we have a certificate. This is like a pattern that repeats throughout the RPKI chain, so I took, say, a molecule that is repeated over and over in the data structure. You have the certificate, which has an SKI. This is the subject key identifier, and a way to think of this is that it's the self: that value identifies that particular certificate authority, so the SKI is the identifier of the thing itself. It also has an AKI, the authority key identifier, and that is a pointer to its parent.
It points to the SKI of the parent. So every AKI points to an SKI, and it's these parent pointers that create the validation chain. A certificate also contains the reference to a manifest; this is the subject information access.
So, this points to the parent, and this is the identifier. The manifest ‑‑ which is, well, called the manifest ‑‑ contains a reference to its certificate, and a listing of files.
In this listing of files there is a reference to the CRL. The CRL identifies who its parent is ‑‑ so we go back up to the SKI of the CA ‑‑ and it is a list of revoked certificates. The manifest, for instance, lists a ROA; we go that way for discoverability. The ROA points to who its parent is, again via the AKI. The ROA itself has an identifier, the SKI. The ROA references what URL should be checked to confirm that the EE certificate of the ROA is not revoked. And then there is the eContent: the origin ASN and the list of one or more prefixes that the ROA contains.
There are more links and validation steps that wouldn't fit on this slide, but there are lots of linkages inside RPKI objects towards each other.
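The SKI/AKI linkage in the drawing amounts to a simple parent lookup: every object's AKI names the SKI of the certificate that issued it. An illustrative sketch of following that chain upwards:

```python
# Illustration of walking the SKI/AKI pointers: index CA certificates by their SKI,
# then follow an object's AKI upwards until a self-referencing certificate (the trust
# anchor) is reached. A missing SKI along the way means the chain is broken.
from dataclasses import dataclass

@dataclass
class CACert:
    ski: str   # subject key identifier: "the self"
    aki: str   # authority key identifier: points at the parent's SKI

def chain_to_anchor(obj_aki: str, ca_by_ski: dict[str, CACert]) -> list[CACert]:
    chain: list[CACert] = []
    aki = obj_aki
    while True:
        ca = ca_by_ski[aki]          # KeyError here would mean a broken chain
        chain.append(ca)
        if ca.aki == ca.ski:         # self-signed trust anchor: we are done
            return chain
        aki = ca.aki                 # step up to the issuer
```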
Now, how can we actually look at this stuff? There are multiple validators, but for this particular presentation we're going to look at OpenBSD's rpki‑client, which I like to think of as the tcpdump of the RPKI. Installing it on Debian unstable is easy: you install it, you type this in, you start the thing, you wait a little bit, and you can jump into the cache directory. rpki‑client uses the file system as its database; if you compare this to, for instance, the rpki.net implementation from Rob Austein and Randy Bush, there the back‑end was a SQL database, and there are various models to store RPKI data. This particular implementation uses the file system itself, which makes using tools like find and grep much easier.
Inside the cache directory, you see a list of directories. Each one of those represents a publication point. And as I mentioned in my earlier comment about publish in parent, we as a community have to keep this cache directory as small as possible, because if we have thousands of repositories, that means thousands of connections for each validator, and that has scaling implications. On the other hand, it's really interesting to dive into the repositories and inspect people's individual objects.
So, you can see ARIN's repository, for instance, here; you can just change directory into it. NLnet Labs... that's myself, that's the largest one. And rpki‑client will remove all invalid objects from this cache directory, so when you are navigating this directory, you are looking at the current valid state of the RPKI global database.
You can use rpki‑client in so‑called file mode ‑‑ am I pronouncing it right? File mode is activated with a dash F command line option, and then you provide as its argument a file that is a signed object. So in this case, we provide it with a manifest as input, and then the rpki‑client utility will decode the CMS structure inside that file ‑‑ that binary garbage from my second slide ‑‑ and display in human‑readable format what the contents of the object are.
The hash identifier is the SHA‑256 hash of this file ‑‑ why that is important we'll get to. Then the subject key identifier, or self reference; the authority key identifier; the serial; where you can download the authority; what version or number the manifest is ‑‑ higher is better, manifests can only go forward, there cannot be a step backward; the validity window of the manifest; the validation result; and then the listing itself. For each file it lists the SHA‑256 hash that should be the result when you compute the hash of that particular file.
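Checking a listed file against its manifest entry is just a SHA‑256 comparison, for example:

```python
# Check that a file on disk matches the SHA-256 hash listed for it on a manifest.
import hashlib

def matches_manifest(path: str, listed_hash_hex: str) -> bool:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest() == listed_hash_hex.lower()
```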
If we look at a CRL, again you just use file mode. The CRL has a hash identifier, and that could, for instance, be listed on a manifest; this is how you can link a particular file to a particular manifest.
A CRL contains a list of the serial numbers of the certificates that were revoked, and each invocation of this utility shows the certificate serial if applicable. CRLs also have a validity window, and if you are outside the validity window the CRL is not usable.
There is an interesting type of object that is not widely used at this point in time, but I think has many benefits when troubleshooting RPKI related data.
This is called a GBR, a Ghostbusters record. It is like dropping a business card in your repository: in case you run into some kind of issue with the RPKI data itself, you then know who to reach out to to discuss what is going on with a particular set of RPKI data.
Like any signed object, the Ghostbusters record or RPKI business card has a reference to itself, the key it was signed with, and its parent via the authority key identifier.
We can also look at a ROA. Again, same invocation, you just use file mode. It will print the aspects that are common to all signed objects, which means you can find the parent, or the URI of the parent, and in purple you can see the content: the AS that is authorised to originate one or more prefixes, and the max length of each prefix. The 'validation OK' part means that the signature chain all the way up to the RIPE trust anchor is correct, that none of the intermediate certificates have been revoked through a CRL, and that none of the CRLs have expired. And if you are writing an RPKI CA implementation or running your own publication point, you can use this utility to confirm that your objects are in fact valid.
And then finally, since it is sometimes claimed there is no BGPsec deployment: I am the only person on the planet publishing BGPsec router keys, and this is what a BGPsec key looks like.
It is a signed object, so it has the SKI, the AKI pointing to where you can find the parent, and then finally a subjectPublicKeyInfo string encoded in base64, which is the public ECDSA component that you can use to verify BGP announcements I send to you, signed with the private key associated with this public key. And it has the AS number this public key is associated with.
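As a small sketch of what you could do with that base64 string: the helper below, using the third-party cryptography package, decodes it into a public key object. The function name and workflow are illustrative, not part of any tool mentioned in the talk.

    import base64
    from cryptography.hazmat.primitives.serialization import load_der_public_key

    def load_bgpsec_router_key(spki_base64: str):
        # Decode a base64 subjectPublicKeyInfo string, as printed in file mode,
        # and return an ECDSA public key object that could be used to verify
        # signatures made with the corresponding private key.
        der = base64.b64decode(spki_base64)
        return load_der_public_key(der)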
What is very, very different in the RPKI compared to the IRR is that the RPKI is full of timers. Every signed object has an EE certificate which has a not-before and not-after date. Every certificate authority has a not-before and not-after. Every CRL associated with a particular CA has a so-called "don't use it before this date, don't use it after this date". The manifest eContent, as you saw a few slides earlier, has a window in which it is valid. And validators will look at the transitive expiration moment: they look at all the dates on which various things could expire, and the one that is soonest is the one that ultimately determines whether an object is valid or not.
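The "transitive expiration moment" described here boils down to taking the minimum over all the deadlines in the chain, roughly like this sketch; the function and argument names are illustrative.

    from datetime import datetime

    def transitive_expiry(ee_not_after: datetime, ca_not_afters: list,
                          crl_next_update: datetime,
                          manifest_not_after: datetime) -> datetime:
        # The object is only usable until the earliest deadline anywhere in
        # the chain: the EE certificate, every parent CA certificate, the CRL
        # and the manifest validity window.
        return min([ee_not_after, crl_next_update, manifest_not_after]
                   + list(ca_not_afters))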
And this is a completely different world from the IRR, especially the third party databases that don't require a recurring payment, where objects will be around forever and ever until the end of the universe. I think the RPKI mechanism is superior in this regard, because the existence of expiration timers forces clean-up to a degree: either because people took their publication point offline, and then the expiration dates will trigger evaporation of this data in local caches elsewhere, or, for instance, because contracts with the RIR expire. All in all, expiration dates are good. I like to think of it as the expiration date on a carton of milk: that date is very useful when you want to decide whether you want to drink that milk or not.
Other ways through which objects can be considered invalid is if adjacent objects are missing. A few slides back we saw the manifest listing. How the system works is that a manifest is a very robust grouping of a set of objects. The manifest tells you these objects belong together, and that means that if one of those objects is broken, which could for instance happen if the file is corrupted, so that the file and the listed hash do not match each other, or if the file is just entirely missing, then all of the files are considered invalid. The advantage is that if you have a specific ROA configuration where you, for instance, combine ROAs that contain some supernets with separate ROAs that contain more specifics, you can tell the rest of the ecosystem: either use these files together exactly as I intended them to be used, or don't consider them at all. And this bundling of files, the ability to communicate to validators that these belong together, is a feature we do not have in the IRR but which I think is a fantastic feature of the RPKI.
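A sketch of that all-or-nothing behaviour, again illustrative rather than any validator's actual implementation: if any file listed on the manifest is missing or its hash does not match, the whole set is rejected.

    import hashlib
    import os

    def publication_point_complete(directory: str, manifest_listing) -> bool:
        # manifest_listing: iterable of (file_name, sha256_hex) pairs as
        # listed on the manifest. Reject the whole set if anything is off.
        for file_name, expected_hex in manifest_listing:
            path = os.path.join(directory, file_name)
            if not os.path.exists(path):
                return False
            with open(path, "rb") as f:
                if hashlib.sha256(f.read()).hexdigest() != expected_hex.lower():
                    return False
        return True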
That's how things can be invalid. If you don't want to play with binary encoded files, I have a JSON dump of about 10 megabytes that you can download, which contains a JSON decoding of the entire global RPKI database, and this is fun to pull through a visualisation process or to use in monitoring software.
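If you have a local copy of that dump, pulling it into a script is straightforward. The file name "dump.json" below is a hypothetical local copy, and the summary deliberately does not assume a particular schema.

    import json

    with open("dump.json") as f:   # hypothetical local copy of the JSON dump
        data = json.load(f)

    # Print a rough summary without assuming a particular structure.
    if isinstance(data, dict):
        print("top-level keys:", sorted(data.keys()))
    else:
        print("number of records:", len(data))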
And with that, I have gone 90 seconds over. My apologies to my fellow Chairs. If you have questions, you can either, if there is time, ask them now or e‑mail me.
IGNAS BAGDONAS: Please contact Job via his contact details. There is a question in the virtual queue but we don't have time to answer it; Job will take that into account and respond, and we have to move on to our next presentation.
(Applause)
ALEXANDER ZUBKOV: Hello. My name is Alexander Zubkov, I work at Qrator Labs. We know about the importance of filtering routing information; there are many articles about it, and you can find examples of how to use tools like bgpq to generate prefix lists. But if you want to do it automatically and on a regular basis, you need to build some kind of infrastructure around it to tie the pieces together. So we decided to provide an example of such an infrastructure: we took some scripts and tools that we use in our own company and polished them, and this is the example.
You can find it there. What we have: we run Linux servers with the BIRD daemon, and because of that we have some requirements. For example, we do not want to handle and maintain all configuration by hand, because that causes errors. Also, we want to reuse the prefix lists so that we do not overload the routing registry with our requests. So we published the scripts at that URL, and there are three main components.
There is an Ansible playbook with templates that allows you to generate your BGP configuration and filter configuration; there is a Docker image that provides an HTTPS API for fetching prefix lists; and there are scripts that do the jobs: fetch the prefix lists and generate the configuration.
For example, in Ansible you can configure your peering sessions with a configuration like this, and you will get output something like that.
Then I have many slides, just for reference.
And in that Docker image, under the hood we use bgpq to actually fetch the prefix lists and we cache the result with Nginx. It also provides an HTTPS interface from which you can fetch a prefix list.
Also, in that image we aggregate prefixes with our own tool, which we also published. We wrote it because I did not find a good existing solution that works fast: other tools are either slow or do not support IPv6, and some of them do not give you a perfect aggregation of your prefix list. This tool gives the minimal aggregation that is possible.
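Their tool is a separate, published project; purely as an illustration of what minimal aggregation means, Python's standard library can collapse a prefix list into the smallest equivalent set like this:

    import ipaddress

    prefixes = ["192.0.2.0/25", "192.0.2.128/25", "198.51.100.0/24"]
    collapsed = ipaddress.collapse_addresses(ipaddress.ip_network(p) for p in prefixes)
    print([str(n) for n in collapsed])   # ['192.0.2.0/24', '198.51.100.0/24']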
And in the scripts that generate the prefix lists, we use URLs like this to fetch the aggregated prefix lists.
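A minimal sketch of that fetch step, assuming a hypothetical endpoint that returns one prefix per line; the real URL layout of the published scripts may differ.

    import urllib.request

    # Hypothetical endpoint of the Docker image's HTTPS API; one prefix per line.
    url = "https://prefix-api.example.net/aggregated/AS-EXAMPLE?family=4"

    with urllib.request.urlopen(url) as resp:
        prefixes = [line.strip() for line in resp.read().decode().splitlines()
                    if line.strip()]

    # Render a BIRD-style prefix set that a filter template could include.
    print("define AS_EXAMPLE_V4 = [")
    print(",\n".join("  " + p for p in prefixes))
    print("];")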
And I think that's all. And if you have questions.
PAUL HOOGSTEDER: Please raise your questions with Alexander directly. We are short on time and we still have a lightning talk.
(Applause)
And now, to finish up the Routing Working Group session, a short lightning talk.
SPEAKER: Many of you will know that at NLnet Labs we make Krill and Routinator. But what not everybody may realise is that we also have a common RPKI library that is open source and available to anybody who wants to do stuff with RPKI. It has support for the basic object types, let's say, used in RPKI, all the logic for validation, for creating and signing things, and the other stuff. The intent is to keep it small and simple.
If you made a different life choice and you do Java, then I recommend that you look at rpki-commons, open sourced by the RIPE NCC. If you have any questions or comments about this, you can go to GitHub, e-mail us or talk to us. That's actually all I have to say.
IGNAS BAGDONAS: Excellent. Thank you so much.
And with that, we are officially four minutes over time and that is the end of the Routing Working Group session. Enjoy your break.
(Coffee break)
LIVE CAPTIONING BY
MARY McKEON, RMR, CRR, CBC
DUBLIN, IRELAND