Monday, 16 May 2022
At 4 p.m.:
WOLFGANG TREMMEL: Take a seat.
FRANZISKA LICHTBLAU: People get in or decide to keep out, take your seats, we would like to kick off this plenary session. All right.
WOLFGANG TREMMEL: I am a member of the Programme Committee, and we now come to this Monday afternoon plenary session. So, might you perhaps close the doors at the back so we keep the chatter, keep the noise out. Thank you. And as a first speaker, I would like to introduce Constanze Buerger. She is working for the German Federal Ministry of the Interior and Community, so it's the German government, and she is going to talk about IPv6 in ‑‑ IPv6 deployment in public authorities. Constanze.
CONSTANZE BUERGER: Thank you. Hello. I am really happy to be here. My first RIPE meeting was also in Berlin, it was RIPE 56. I came here fully overdressed, so the first thing I did was go home, take my jeans and my T‑shirt from the cupboard and come back. So this was my first action at RIPE 56. And after that, I found very, very good people in this community, friends and advisers, and I'm thankful and really happy that I can be here and tell you our story.
My name is Constanze Buerger, I am working for the Ministry of the Interior, and I studied computer science years ago. The first slide I want to show you is this one: you know it, that's the IANA structure, and embedded in it is our Local Internet Registry, because the German public administration is operating its own Local Internet Registry. That's very special, and we are proud to do this. I have been doing this for 10, 12 years, and it's really fun and I learned a lot over the years. You see this in this part.
First of all, we have a very interesting and complicated heterogeneous state structure. After World War 2, the Allies separated the power, and therefore you can find this in our state structure. I have to go back. Can I go back one slide? Okay. You see here our structure, which is really heterogeneous: we have a federal level, federal states, we have authorities, we have municipalities. After World War 2 the power was separated and, therefore, under our constitution, we have a lot of independencies in our administration. Therefore, we were really proud and really happy that we got one address space for the whole public administration of Germany. We have a /23 address space, with worldwide unique addresses. We are independent of network providers and we manage the addresses centrally and securely. We have transparency over the addresses, and this is a new fact. The Local Internet Registry is managed by the Federal Ministry of the Interior and the agency of frequency digitisation. We designed address frames for the federal level, for the states and for municipalities. This is really a heterogeneous structure, and within this heterogeneous structure we want to hand over independence to our administrations, and therefore we developed the organisational picture you see here. The Local Internet Registry is led by the ministry and by the agency, and we founded new organisational parts called sub‑LIRs. These are small LIRs: we hand over blocks of our address space, and they are able to use these blocks in their own manner. The states and the municipalities who can justify the address space, all of them can run a sub‑LIR in our environment.
You see here we have 17 sub‑LIRs from the states, we have the sub‑LIR from the finance sector, we have a sub‑LIR from the retirement sector, from the police, from the military, and from the federal level. All of them can be independent and can make independent decisions, but the first main point is that RIPE policies are valid, and then we hand over some other policies because of security issues, laws and administrative requirements.
Overall, we have a small expert Working Group and I'm happy to find many of our experts here today because the RIPE meeting is in Germany, and they advise us in themes of address policy, security and so on.
This is the next slide on ‑‑ the first slide you had seen the technical issue, and we have, here, the address range and we separated our address range in themes of security, laws and administration requirements, and, for instance, we have the basic German law, called interconnection law, and we have here, you see in this red block, the separation of our address space and the communication background, the exchange of data between federal government and the states takes place via these interconnection networks. You see we separated the space for the federal states, for the governments and we have also non‑governmental blocks.
This is a technical challenge, and in many areas we are not as far as we want to be. Some people from the RIPE community ask me: Constanze, you have had this address space for nine years, or for ten years, what are you doing with it? What about the implementation, what about the deployment? It's so slow. And I can tell you, yes, it's slow. It's slow because the federal structure is really slow and we have to convince many people, we have to work in a multi‑stakeholder way nationwide and worldwide, so we have to convince and to find really good solutions for all.
Heterogeneous infrastructure is followed by heterogeneous IPv6 deployment. We have the federal government, and here we find a federal programme for IT and network consolidation. We set a timeline in 2020 to deploy IPv6 in the core infrastructure: we want to make the transition to dual stack by 2025, and we want to make the IPv6‑only transition by 2030.
Because of the federal structure, this leads to increased complexity and variability of the network infrastructure, and the v6 migration status at the state level ranges from very far advanced to not yet started. But, for constitutional reasons, I have no power to apply pressure and to change it. I can only convince the people to work together with us.
The public administration has to provide IT services for everyone, so we have to take care of the deployment of IPv6 everywhere, without discrimination. And this makes IPv6 mandatory for all: for the federal level, for the federal states and for the municipalities.
And now, let's have a look, a really technical slide. I have three slides. The first slide is the slide, it's our wish slide, our ideal slide. The second slide is the worst case slide. And the third slide will be the idea of one solution.
We have a problem: we can't regulate and we don't have the power to make laws. So you see on this slide the ideal situation: routing by one single provider. We have n public administrations with different prefixes, and we have one provider‑aggregated route in one AS. This is the ideal situation because we have one AS entry worldwide. This would be cool, and I would be happy about this situation. But in the worst case, we start from the same situation: we have n administrations with these prefixes, but I can't aggregate them. This is administration 1, 3, 5, and if I assume two providers, I can't aggregate them, and I have at least n additional AS entries worldwide. This would be a catastrophe. This is the worst case, and we have to work on this and, yes, we have to engage the people to find a solution. What could be a solution for this challenge?
This is an example, an idea of a solution. We have two providers with maximum aggregation. On the strategic level, we are going to decide on and identify aggregatable blocks. We are thinking about which blocks could be aggregatable, for instance for the police, for the schools, for the finance sector. If we assume two providers, we have here the block with administrations 1 to 25, or, here, 33 to 36, or to 56, and we can aggregate them, so that in the end we have two additional new AS entries.
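As an illustration of the difference between the worst case and the aggregated case described above, here is a minimal Python sketch. The prefixes are hypothetical documentation prefixes taken from 2001:db8::/32, not the actual government address space, and the function name `aggregate` is an assumption made for this example only.

```python
import ipaddress

def aggregate(prefixes):
    """Collapse a list of prefixes into the smallest set of covering routes."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(networks)]

# Hypothetical example: four neighbouring administrations behind one provider.
contiguous = [
    "2001:db8:0::/48",
    "2001:db8:1::/48",
    "2001:db8:2::/48",
    "2001:db8:3::/48",
]

# Hypothetical worst case: administrations 1, 3 and 5, with gaps in between.
scattered = [
    "2001:db8:1::/48",
    "2001:db8:3::/48",
    "2001:db8:5::/48",
]

print(aggregate(contiguous))  # one aggregated route
print(aggregate(scattered))   # three separate routes, nothing can be merged
```

Neighbouring blocks that fill a whole larger block collapse into a single route, while blocks 1, 3 and 5 stay as three separate entries. This is why deciding on aggregatable blocks at the planning stage matters.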
So, this is one solution of one of our problems we have because of our federal and really heterogenous infrastructure. We work on this and I see many people in the room who help us to solve all these problems.
On the next slide, I can show you a wonderful service, and this service is a really good success story. ELSTER is a service from the tax environment, it stands for the German electronic tax declaration. ELSTER has fully implemented IPv6, from core to edge. They started the project in 2016, and IPv4/IPv6 reachability was achieved in August 2020. This was a project of my Bavarian colleagues, and they also worked together, I think, with Gert. This is a good success story, and you see here that the v6 deployment has been successful.
Connection statistics for October last year: We had IPv6 traffic of 52% and IPv4 traffic of 48%. And the URL is, you see here.
Last but not least, why is IPv6 not only a technical issue?
We learned a lot over the years, and the deployment of IPv6 is infrastructure development. IPv6 forces us to think beyond our organisations and our processes. We have to work in new organisational forms. We have to engage within open multi‑stakeholder groups, and we have to develop Internet standards and policy in common with you. We have to learn from the community, and I'm thankful that we have a good base in this community.
What we also learned in these days is about standardisation. V6 leads to new fields of action in standardisation for IT networks and regulatory authorities, as well as administration and politics. Because of the war, we see how important sovereign networks are, and democracy in the culture of developing things, and we stand for RFCs that are developed in a democratic way. So I think our goal is to develop sovereign, scalable and secure network infrastructures.
Thank you for your attention. And if you have questions.
FRANZISKA LICHTBLAU: Thank you, Constanze. So, do we have any questions? We are very good on time. Yes, we have Paolo.
PAOLO VOLPATO: Thank you for your presentation. I am currently involved in writing a paper, an RFC, in the v6ops group of the IETF dedicated to IPv6 deployment status, and I have to say, you mentioned that in Germany the adoption of IPv6 is probably a bit disappointing for you, but actually, if you compare it to other European countries, you are in much better shape, I would say. Anyway, by the way, I will share some findings of our paper on Thursday in the v6 session. I'd just like to understand your point of view, looking from the public administration: what are the challenges or the issues you are facing which are delaying the adoption of IPv6? Thanks.
CONSTANZE BUERGER: Thank you. Yeah, this question is really hard to answer. I think infrastructure development is not seen on the political level. The political level sees apps and websites running, and they understand that there is a new feature online, but it's too hard to explain how it works, how an IP packet is transmitted. And this gap, I have to say, unfortunately I can't overcome this gap either. It's really hard to say. I can't go to my State Secretary and say, hey, we need IPv6 and, therefore, we get a new structure, we have new address frames. No one would understand that, yeah. I think the wording is really important, communication is really important, but we don't have time for communication because we have to work, we have to run the LIR. That's one of our problems.
SPEAKER: So my main question was how this, you talked ‑‑
FRANZISKA LICHTBLAU: Tell us who you are.
SPEAKER: I am representing myself mostly. I am just here for fun, really.
FRANZISKA LICHTBLAU: Which is the best reason.
SPEAKER: Well, my question would be: since you talked about how you want it to be aggregated and only use one ASN, how does this interact with how you are currently deployed for IPv4? Is it done in a similar way, or completely spread out over multiple things for IPv4? Do you want to have the same network structure for v4 and v6, or do you want them to have very different structures?
CONSTANZE BUERGER: In our networks, v4 is the main protocol until now and we have to overcome this and find solutions to do this, and we have to find a mixture between v4 and v6 but the pressure is really hard to find the way to go forward. But I can't ‑‑ there are so many key players in this game and I can't advise and say you have to turn it and you have to turn it, I have no chance to do this.
SPEAKER: Yeah, it's more specifically like how ‑‑ one of the earlier slides you mentioned how it would be just one ASN for all the v6 would be the ideal thing and one aggregated route, I believe
CONSTANZE BUERGER: Yes
SPEAKER: My question is is that how it currently looks for v4 in German government with one ASN or multiple ones for the IPv4 stuff?
CONSTANZE BUERGER: We should do this after, afterwards, and I can't explain, I'm sorry, I need my colleague to answer this.
FRANZISKA LICHTBLAU: Tom, please.
TOM HILL: Tom Hill from British Telecom. I wanted to thank you in particular, actually, for the response you gave to Paolo, because I think you very clearly and very well articulated a problem that we all face, especially in service providers and every organisation around the world that is looking for a good reason to deploy IPv6: when faced with the sheer benefit of it, it's very difficult to justify. So I like that you have very clearly said we can't just explain that this is how IP routing works, we have to have another reason. That's very good. And again, it's been very delightful to have such a forthright presentation from a government department speaking about this, because there are not enough government departments speaking up about this, and I thank you for speaking to us about it. Just the fact that this is recorded is going to be incredibly useful throughout the world. So my last question to you was: have you any plans to solicit this with your counterparts in other countries? Please do it in the UK.
CONSTANZE BUERGER: Okay.
CONSTANZE BUERGER: We learn from each other, and we should learn how to support each other as well, yeah, thank you.
MARIA MATEJKA: I am from CZ.NIC. I would like to ask you why you are aiming for aggregation into a /23 route when there is routing software that can route all your deaggregated routes to the places where they are needed? For example, if you have one place in Hamburg and another place in Munich, why is it necessary to aggregate these two routes when it may be possible to route Munich by one route and Hamburg by another route, through different geographical places and through different network places? Is there any good reason to aggregate the routes for such a big place, for such a big country as Germany is?
CONSTANZE BUERGER: It would be a wonderful idea, but because of the federal structure, they can be independent and they can decide: I don't want to aggregate my route in your AS. So that's the first point. They are able to say no, and: we don't trust you.
GERT DORING: Hi, well, long‑time IPv6 person. I think the main reason why Constanze wants to do this is because she has been listening to the Internet ISPs out there. Look at how big their address space is: they have a /23, and if they deaggregate to /48s, that's something like 10 billion extra routes in the global routing system, so please do all you can to keep my routers from exploding. So this is a public benefit if you do the aggregate.
GERT DORING: Thank you for trying. I understand it's hard.
CONSTANZE BUERGER: Yes, but I'm not at the end for this colleague. First, I can't decide that I do this over one AS, that is the first point. I see the problem if I have the worst case; I don't want the worst case. I have to be careful with these Internet resources, and I want to have transparency. In former times, every municipality, every administration could order address space at the RIPE NCC, and no one had a clue which IP addresses had gone to which office, to which administration, and there was no overview. All systems were provided to everybody, and there was no structure, no organisation, nothing. So the first thing we wanted to have was transparency and structure; we wanted to force people to set up their old, historically grown network structures in a new world with new frames, new designs, newly ordered address space. So there are many reasons on the political side, on the organisational side, but they have to match with the technical reasons. So, this is my task.
SPEAKER: Alex, I represent ‑‑ company as you can tell from my outfit, it's my first RIPE meeting as well. I have a quick question: First off, thank you so much for your presentation but in your presentation you talked a lot about v6 networks, but what's ‑‑ is the government policy towards v4 networks, will they be abandoned after all or what's the government viewpoint on that? Thank you.
CONSTANZE BUERGER: I would love not to see v4 any more, but ‑‑
CONSTANZE BUERGER: But I have no chance. As I told you, I can't make a law, and so many administrations are independent and they can decide: I do this my way. So I can only convince them, and if I have good arguments, then I can bring them onto my side. And this we try to do. So we will run v4, I think, for the next ten, 20 years. So I fear, yeah.
SPEAKER: Thank you.
WOLFGANG TREMMEL: There was one more written question and that was about the AS numbers on the slide but I guess the two AS numbers were just an example
CONSTANZE BUERGER: Yes after the RFC, we learned this
WOLFGANG TREMMEL: So, supposed to be separate networks versus using one for everything, and I guess that has also been answered: like you said, the AS numbers on the slide were just an example and the entities are free to choose the AS number they want to go with.
CONSTANZE BUERGER: Okay, thank you.
FRANZISKA LICHTBLAU: Now, you can all thank Constanze. Thank you for your good collaboration.
FRANZISKA LICHTBLAU: Our next speaker is Massimo from NTT and he will report on experience of one year of RPKI operations.
MASSIMO CANDELA: Good afternoon, everybody. My name is Massimo Candela. I am really, really happy that finally we are back to doing these RIPE meetings in person, it was about time. I am a senior software engineer at NTT, a large company that offers a lot of services, among which we are a tier 1 provider, and I work in our Global IP Network.
As the name suggests, it is a large network so we need software to keep it under control and that's what I do, together with some amazing colleagues, some are also in this room. We do automation and monitoring of the network.
So, last year, exactly one year ago, at RIPE 82, I presented a system that I created to reduce human error and to ease our internal RPKI operations. At the end of that presentation, I got a question from Randy Bush, who is always on point with his questions, and the question was: okay, but what type of errors were you trying to reduce, what type of human errors, and how many are you making now? At that time, I didn't have these numbers because, I mean, the system was brand new, but I thought it was a really good question and that it deserved an entire presentation, which is this one. I think that by sharing this we can all learn from it, and this can be helpful for somebody who is embarking on their RPKI journey, or who has just embarked and has to deal with that.
So, the presentation is a one‑year review of RPKI operations. But let's start with some facts. On 24th March 2020, this is a screenshot I took, we started doing route origin validation, so we started rejecting RPKI invalids. Together with creating the ROAs, these are usually mentioned as the two steps to implement RPKI. Now, maybe this is the overall message of this presentation: one does not just implement RPKI and then be finished with it. In reality, you will have to maintain it and keep an eye on it, and it will involve your daily operations.
In particular, it will for sure require some additional knowledge in your team, but it will also require some additional procedures, and while for other technical solutions we already have some best practices in place, for RPKI we as a community are still developing some of them.
So when I say errors, or in this slide, mistakes, what am I talking about? I am talking essentially about making RPKI invalid announcements. So, you manage an autonomous system, you implemented RPKI, so basically you created your own ROAs and you are doing route origin validation ‑‑ so you are RPKI aware, at least ‑‑ but at the same time, you occasionally announce RPKI invalids. There are various reasons: often it can be a typo or whatever, but there are two main logical reasons. The first one is that you want to announce a prefix and you forget completely about RPKI, and afterwards an invalid happens. Often I got this as feedback: but it was a new prefix, it was supposed to be unknown, so life was going to go on like RPKI never existed. But did you check? The reality is that there can be a ROA for a less specific, and in that case you are going to announce an RPKI invalid. At least you should check.
The second case is that you know about RPKI, you did everything, but you forgot that there is some timing involved. There is the publication time: before the ROA you created is in the publication server, the Trust Anchor may sometimes require some hours, up to 24 hours. And there is also some propagation time, which is the normal, natural time that other players on the Internet need to download the repository again and validate it.
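The "did you check?" step from the first case can be sketched as a small route origin validation function that follows the valid / invalid / not-found outcomes described in RFC 6811. This is a simplified illustration, not the logic of any particular validator or of the NTT platform: the `Roa` tuple, the `validate` function, the AS numbers and the prefixes are all hypothetical, and the function takes a ready-made list of ROA entries instead of fetching real RPKI data.

```python
import ipaddress
from typing import NamedTuple

class Roa(NamedTuple):
    prefix: str      # prefix authorised by the ROA
    max_length: int  # longest prefix length the origin may announce
    asn: int         # authorised origin AS

def validate(route_prefix, origin_asn, roas):
    """Return 'valid', 'invalid' or 'not-found' for one announcement."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # A ROA only matters if it covers the announced prefix.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        if roa.asn == origin_asn and route.prefixlen <= roa.max_length:
            return "valid"
    # Covered by at least one ROA but matched by none: invalid.
    return "invalid" if covered else "not-found"

# Hypothetical ROA for a less specific prefix, max length equal to its length.
roas = [Roa("2001:db8::/32", 32, 64500)]

print(validate("2001:db8::/32", 64500, roas))    # the exact prefix
print(validate("2001:db8:1::/48", 64500, roas))  # a "new" more specific
print(validate("203.0.113.0/24", 64500, roas))   # no covering ROA at all
```

The second call shows the situation from the talk: the more specific prefix looks new, but a ROA for the less specific already covers it, so announcing it without a matching ROA produces an invalid, not a not-found.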
And we will come back to the topic of timing in a bit, but this is just to give an idea of what we are talking about. So what I did is a review of our calendar year 2021, a year in which we were already doing route origin validation. I collected all the RPKI alerts that we received from BGPalerter, which is Open Source software that we use for BGP and RPKI monitoring, and, in addition to the alerts, I took the various tickets and conversations that were created after the alerts, because my goal was to understand what the reason behind each one really was. I divided them into three categories. The first one is wrong max length, which is essentially what I initially thought of and mentioned before: we had a ROA for a /20 and at some point we started announcing a /24. The second case is that we start announcing a customer prefix because we have a service going on, but they had no ROA for 2914, which is our autonomous system. You may think that's the customer's problem. I don't really agree with that, but anyway, at least it shows there is some margin for improving communication with the customer.
And the third category is that we migrated prefixes from one autonomous system to another and we forgot about the ROA, or at least we forgot to do it in time and properly.
We had 71 cases of this, and when I plot them on this pie chart, as I was expecting, the biggest one, almost 60%, is the max length. The announced customer prefix without the customer putting in a ROA is almost 17%, and I'm actually surprised that the migration, which I was not expecting, is 25.2%, but it was mostly one single event. So we already have an idea of what the percentages of the various errors are.
So now let's go back to the timing that I was talking about. As I said, there is some timing involved with publication and propagation, and you may think: well, it's fine, right, at some point it's going to be valid? Well, that really depends on you, on what service you have on those resources. If you are managing resources for a large CDN, the fact that for some hours they will have sub‑optimal routing in the best case, and no reachability on those prefixes in the worst case, is something that, at least for us, is important. But even if you do not care, how do you define transient, how do you know it's going to be okay if you are not monitoring? So let me explain this question better.
I took here all the RIPE RIS data, which is a project from RIPE NCC that collects BGP data from various points of view, and I also took the historic RPKI data that RIPE NCC also has, and I did an analysis, for 2021 only, of the RPKI invalids for max length reasons. I only took this one, the max length, because, first of all, we saw that in our case at least it is the most common error, and I think other organisations are in the same situation. The other thing is that, again in my mind, it is the reason more prone to human error, let's say. When you analyse this data there are a lot of exceptions that you have to take into consideration, which I am not going to go into in detail, and which include, for example, how much you can trust what you read in an AS path, but this is just to give you an overall idea of the numbers.
So in particular, you can see that 40% of RPKI invalids for max length get resolved in one day or less ‑‑ the granularity here is one day, so it could be a second, it could be 20 hours. Okay, that's good. What is not good, though, is the fact that the rest, 60%, actually takes more than one day, and there is one bump at the end of the time window of one month, which means that at the end of the 30 days the announcement was still invalid and it had been invalid for all those 30 days. The window was one month, so it could be that it is still invalid now or that it ended one day after, we don't know. But it clearly gives you an idea that there are some organisations that are making RPKI invalid announcements and don't realise it.
So what did we do to address our situation? We introduced a new automation platform, the one I presented last year; we improved our monitoring, introducing some RPKI‑specific monitoring which we released as Open Source in BGPalerter, so you can benefit from it in the Open Source tool; we introduced a strict procedure that we have to follow; and we improved communication with the customers. While these are four points, they are in fact all integrated together.
I'll give you an example:
When we have to start announcing a prefix for a customer, we input that in our automation platform and the platform says the ROA is missing, so we ask the customer. When the customer has added the ROA, the monitoring and automation is able to see it and says: now everything is good, green light, you can proceed. So you can see that there is a procedure, the communication with the customer, and all of this is supported by a technical solution of automation and monitoring.
So now, I repeat, the cases were only 71 in the previous chart, and for the number of operations that a company like NTT does per day, it's a blip, it's nothing. But still, maybe it's this thing that we are originally a Japanese company: we sometimes really obsess about things and we want to reach perfection. So at some point this became a really hot topic: we have to address this, it has to be addressed. And 26th March 2021, the blue dot you see here, is when we put this solution in place, with which we achieved a reduction of 87% of involuntary RPKI invalid announcements. Of course we are trying to reach zero; there are some occasional ones that happen because the procedure is not followed correctly, but we are almost there. I would say it's a good result.
A quick overview.
This is the platform I showed last year. Basically, there is a dashboard with the list of resources, and when we click on these resources we can make various changes, among which ‑‑ this is where we manage our RPKI. I divided this page into two columns: the one on the left is the current status of RPKI for that resource, and the one on the right is the future status. The current status is according to the public RPKI. The future status is what is going to happen after all these ROAs here that we added, changed or removed are applied. And how is this calculated? Well, I had some trial and error, and in the end I came up with what I call the four stages of ROA. And if you want to make the joke that the first stage is denial, that is already copyrighted, it has already been made millions of times.
So now, the first stage is when we create the ROA: the ROA is staged, and it's just a fake ROA, existing only in our database. Together with the real ROAs, the staged ones contribute to calculating the future status, and if everything is correct, if what we are announcing or would like to announce agrees with the future status of RPKI, then we can commit all the ROAs.
Now, when they are committed, it means they can be sent to the public repositories, and when they are visible there they get marked as public, which is the third stage. From that moment on we will monitor these ROAs forever. However, after 24 hours without any issue, we additionally mark them as stable. Nothing really changes, but the stable stage is mostly there to automatically close, for example, tickets and stuff like this: it's okay, this is done, at least for now, everything is fine. Life goes on.
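The four stages described above (staged, committed, public, stable) can be read as a small state machine. The following Python sketch is only an interpretation of the talk, not the code of the NTT platform: the `Stage` enum, the `next_stage` function and its parameter names are assumptions made for illustration, and the 24‑hour threshold is the one mentioned by the speaker.

```python
from enum import Enum

class Stage(Enum):
    STAGED = 1     # exists only in the local database
    COMMITTED = 2  # approved, can be sent to the public repositories
    PUBLIC = 3     # visible in the public RPKI, monitored from now on
    STABLE = 4     # public for 24 hours without any issue

STABLE_AFTER_SECONDS = 24 * 3600

def next_stage(stage, *, future_status_ok=False,
               visible_in_public_rpki=False,
               seconds_public_without_issue=0):
    """Return the stage a ROA should move to, or the same stage if nothing changes."""
    if stage is Stage.STAGED and future_status_ok:
        return Stage.COMMITTED
    if stage is Stage.COMMITTED and visible_in_public_rpki:
        return Stage.PUBLIC
    if (stage is Stage.PUBLIC
            and seconds_public_without_issue >= STABLE_AFTER_SECONDS):
        return Stage.STABLE
    return stage

# Walk one hypothetical ROA through its life cycle.
stage = Stage.STAGED
stage = next_stage(stage, future_status_ok=True)
stage = next_stage(stage, visible_in_public_rpki=True)
stage = next_stage(stage, seconds_public_without_issue=25 * 3600)
print(stage.name)
```

Keeping the transition rules in one function makes it easy to hook side effects to a transition, for example closing a ticket when a ROA reaches the stable stage.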
Now, most of the logic is implemented in BGPalerter, which is an Open Source tool that you can find in this repository, for real‑time monitoring of both BGP and RPKI. It is an application that you just run. You just have to input your autonomous system number, because it uses public data, so you do not have to provide your own data. And there is also auto‑configuration, so it just self‑configures. For the BGP part there is a huge list of things that it alerts for, among which hijack detection, visibility loss and path monitoring. But for the RPKI monitoring, which is the part we are more interested in today, you will get notified if, for example, your autonomous system is announcing an RPKI invalid prefix, or if your autonomous system is announcing a prefix not covered by a ROA; you can disable the alerts that you don't want. You get notified if your ROAs disappear: they can disappear because maybe somebody deleted them, one of your colleagues, or because there is some malfunction, or because you ignored the expiration warnings and now they have expired. Or you can get informed of any change that involves any ROA impacting any of your prefixes or any of your ASes. For example, we have it in a channel; it's really nice that if any of your colleagues changes something, there is a place where you can see the diff every time.
There is also ‑‑ okay. I take it as feedback. There is also Trust Anchor malfunction alerting ‑‑ we will briefly discuss why it is important ‑‑ and corrupted VRP file alerting; that file is the output of your validation and it's important for your implementation. And last but not least, ROA expiration, including all the certificates in the chain of validation: you want to be informed if something is expiring and when it is expiring, and you don't want to just put stuff in your calendar and hope for the best.
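As a rough illustration of expiration alerting, the sketch below flags entries that expire within a warning window. It is not how BGPalerter implements the feature: the record layout (dictionaries with `prefix`, `asn` and an `expires` Unix timestamp) is an assumption loosely modelled on the kind of metadata a validator can export, and the function name `expiring_soon` and the seven‑day window are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def expiring_soon(records, now, within=timedelta(days=7)):
    """Return (prefix, asn, expiry) tuples expiring within the window, soonest first."""
    deadline = now + within
    hits = []
    for rec in records:
        # Assumption: 'expires' is a Unix timestamp in seconds.
        expiry = datetime.fromtimestamp(rec["expires"], tz=timezone.utc)
        if expiry <= deadline:
            hits.append((rec["prefix"], rec["asn"], expiry))
    return sorted(hits, key=lambda item: item[2])

# Hypothetical records, using documentation prefixes and a private-use ASN.
now = datetime(2022, 5, 16, tzinfo=timezone.utc)
records = [
    {"prefix": "2001:db8::/32", "asn": 64500,
     "expires": int((now + timedelta(days=2)).timestamp())},
    {"prefix": "192.0.2.0/24", "asn": 64500,
     "expires": int((now + timedelta(days=30)).timestamp())},
]

for prefix, asn, expiry in expiring_soon(records, now):
    print(f"ROA for {prefix} (AS{asn}) expires at {expiry.isoformat()}")
```

Running a check like this on every validation cycle replaces the calendar reminder with an alert that is driven by the data itself.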
Now, this is the ‑‑ an example of what you can receive as alert, for example this is the diff I was talking about the last one is when RPKI invalid is announced by your autonomous system.
Now, some shout out to some amazing project that they make my life much easier, the first one is RIPE RIS, in particular RIPE RIS Live, this is a project where there are various route collectors distributed in the world and you peer with these route collectors, they store the routes that they receive and they can be used for various reasons, among which the monitoring, that's what I do with BGP alerter for for research. And if you want to peer here is the link and I'm sure that there are a lot of RIPE NCC people around here that you can talk to and have more information.
The second one is OpenBSD rpki‑client which is an RPKI validator stable robust, we use internally, but it's also the only, at the moment, that is exporting various metadata that I find really useful for monitoring, in particular for monitoring, for example, for ROA expiration, and thanks to job Snyder for implementing these features. Of course MANRS, which I recently became one of the ambassadors, I think it's really interesting because it has a list of really concrete actions that you can read and implement for ISPs, examinerships and CDN and hardware vendors and especially good community so if you are embarking on this they can help you also with that.
And now, before closing, let's go to why it is important that there is alerting for Trust Anchor malfunctions and why you may be interested in it. First of all, trust anchors are an important part of the global RPKI infrastructure, and if something happens you may be impacted and you want to be aware of it. Another reason is that you don't want to waste time trying to debug something that is not related to you, when you would be much better off spending that time reporting it to whoever is in charge so that they can fix it, and that way you also contribute to a community effort.
Now, the first Trust Anchor malfunction that we have on record is 12 August 2020, where we see that various users start reporting, also on the GitHub repository, that some of their prefixes are no longer covered by a ROA, but they didn't delete the ROA, so what's going on? It must be a false positive. We spent a good amount of time trying to understand what was going on, and afterwards Job Snijders, with the rpki‑client team and the ARIN team, did a call together and tried to debug this more deeply, because it was involving only ARIN resources, and they discovered that ARIN was publishing some corrupted certificates; you can find more details in the repository there. At that time the main lesson, especially for me, was the amount of time spent to bootstrap the entire investigation, and that we possibly needed some Trust Anchor monitoring feature for that. I started implementing that, and some of these features started to be useful. We have unique ROAs disappearing, we send alerts, and it gets reported as a hardware failure.
So, on 18 March 2021, we see that during a validation cycle some ROAs were missing: before, we had a certain amount of ROAs; 15 minutes later, fewer ROAs; and after that they were back to normal. It had already happened once in the past but we didn't give it too much attention. This time we did, and especially thanks to our colleague, Coleen, who did some ‑‑ we discovered that some manifests contained references to certificates that were not available. We sent a report to RIPE NCC and basically, as also explained in this announcement, they were doing updates in the repository, and if you were doing rsync at that moment you may be downloading files while they were being updated.
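The symptom described above, a validation cycle that suddenly yields fewer ROAs for one trust anchor, can be caught by comparing per‑trust‑anchor VRP counts between consecutive runs. The sketch below is an illustration only: the `(trust_anchor, prefix)` input layout, the function names and the 20% threshold are all assumptions, not BGPalerter's actual logic.

```python
from collections import Counter

def count_by_ta(vrps):
    """Count VRPs per trust anchor; vrps is an iterable of (trust_anchor, prefix)."""
    return Counter(ta for ta, _prefix in vrps)

def suspicious_drops(previous, current, max_drop_ratio=0.2):
    """Return {trust_anchor: (previous_count, current_count)} for large drops.

    max_drop_ratio is an arbitrary illustrative threshold (20%).
    """
    alerts = {}
    prev_counts, cur_counts = count_by_ta(previous), count_by_ta(current)
    for ta, prev_n in prev_counts.items():
        cur_n = cur_counts.get(ta, 0)
        if prev_n and (prev_n - cur_n) / prev_n > max_drop_ratio:
            alerts[ta] = (prev_n, cur_n)
    return alerts

# Made-up data: trust anchor "ta-a" loses half of its VRPs, "ta-b" is stable.
previous = [("ta-a", f"prefix-{i}") for i in range(10)] + [("ta-b", f"prefix-{i}") for i in range(10)]
current = [("ta-a", f"prefix-{i}") for i in range(5)] + [("ta-b", f"prefix-{i}") for i in range(10)]

alerts = suspicious_drops(previous, current)
print(alerts)
```

A trust anchor that disappears completely is reported as well, since its current count falls back to zero.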
On 17 June 2021, we discovered LACNIC disappearing over rsync. We immediately sent a WhatsApp to our friends at LACNIC and they fixed it on the fly, in not even 30 minutes, and basically nobody possibly ever noticed this. It was something with load balancers.
Now, on 1 February 2022, we have the JPNIC partial Trust Anchor malfunction. For some of our ROAs we received this alert that says these ROAs are expiring, but that was weird, so we started digging and we sent a report to JPNIC, and they discovered that indeed they had an issue that was preventing certificate renewal. They did a report and, funny story, they managed to fix it one minute before the expiration of these ROAs.
And on 16 February 2022, this is the RIPE repository that becomes unreachable; we also get too many connections to rsync because everybody was trying to go on rsync. BGPalerter alerted on the issue, but when we sent an e‑mail to RIPE NCC they were already aware of it and were already fixing it ‑‑ basically, it was a DNS misconfiguration, and you can read more about it there.
So this was the last one and my presentation is over. If you have questions, you can ask now or send them by e‑mail; I will be available the entire conference. You can also follow me on Twitter, where sometimes I share useful information, other times not, but I will do my best. Thank you.
SPEAKER: Nice to see you, Massimo. I am Kostas Zorbadelos from CANAL+ Telecom and we are also a good customer of NTT. So, one question I would like to ask is: we have this need to create on‑demand announcements, for example for anti‑DDoS reasons or whatever. This could generate alerts in BGPalerter due to missing ROAs and stuff like that. So, two questions, actually: the first is, does NTT support this kind of stuff, customers doing on‑demand announcements, and the second is, do you have any best practices to recommend to customers that have this kind of need on how to configure their ROAs?
MASSIMO CANDELA: My answer is I don't have an answer for you, and we will have to talk about this later.
SPEAKER: We will discuss it off‑line.
MASSIMO CANDELA: With some of my more knowledgeable colleagues about some of the specific topics.
SPEAKER: From AMS‑IX, nice to see you around after the years. You mentioned you use BGPalerter ‑‑ we know what it is. My question is: what do you do with false positives? Because I also have my own tool with which I monitor prefixes, and we also have a lot of false positives with customers, and I have spent many months chasing customers: okay, I see that you announced this prefix with this origin AS, why? Or you appear to be a transit provider while you are not, that's an example; why do you do it? And they say: we don't announce it to the public Internet but only to RIPE RIS, because it's a collector. I guess you have similar situations. You cannot force the customers not to announce things to RIPE RIS, because it's a nice collaboration tool that we base our tools on. But in that case the customer is legitimate, he doesn't announce anything to the global Internet, so if you go to the global table you don't see it. So have you come across this problem and how do you fix it?
MASSIMO CANDELA: Okay. I get what you mean. Essentially you mean there are some routes that are only announced to RIPE RIS and they create noise in the monitoring. Yes, that is true, and we already dealt with this basically at the beginning of the project. I have to say, the reality is these are not a lot, and occasionally they still reappear, but mostly what we did at that time, and we still keep them updated, is define thresholds: most of these routes are visible only from one, two or a small number of peers, and you can safely discard those based on thresholds. If you look at the BGPalerter configuration, you will see in the config file that I periodically update the thresholds, among other things, to remove this kind of peers. Fortunately they are not a lot, though.
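The threshold idea in this answer can be sketched as follows: count from how many distinct collector peers each route is seen, and keep only routes seen by at least a minimum number of peers. This is a hedged illustration; the observation layout, the function name `visible_routes` and the `min_peers` value are assumptions for the example and do not reflect BGPalerter's actual configuration keys.

```python
def visible_routes(observations, min_peers=3):
    """Keep (prefix, origin) pairs seen by at least min_peers distinct peers.

    observations: iterable of (prefix, origin_asn, peer_id) tuples.
    """
    peers = {}
    for prefix, origin, peer in observations:
        peers.setdefault((prefix, origin), set()).add(peer)
    return {route for route, seen_by in peers.items() if len(seen_by) >= min_peers}

# Made-up observations: the first route is seen by three peers,
# the second only by a single collector peer.
observations = [
    ("192.0.2.0/24", 64496, "peerA"),
    ("192.0.2.0/24", 64496, "peerB"),
    ("192.0.2.0/24", 64496, "peerC"),
    ("198.51.100.0/24", 64497, "peerA"),
]

kept = visible_routes(observations, min_peers=2)
print(kept)
```

Routes announced only to a route collector fall below the threshold and never reach the alerting stage.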
SPEAKER: We do exactly the same approach, so good to see you do the same stuff. Thank you very much.
FRANZISKA LICHTBLAU: Last question.
JOB SNIJDERS: I wanted to reply to the engineer asking about DDoS mitigation and ROAs. If you have a need to be able to deaggregate in BGP, the best current practice is to pre‑populate RPKI data such that you can deaggregate. So that means create your ROAs for /24s ahead of time, before you need to pull them through a DDoS mitigator.
MASSIMO CANDELA: You should never have moved to Fastly.
FRANZISKA LICHTBLAU: Thank you, Massimo. I can only recommend to follow his Twitter account, he has really cool stuff up there. So thank you.
WOLFGANG TREMMEL: Okay. Perhaps the people in the room haven't noticed but this is a hybrid meeting, and so the next presenter will be a remote presentation, I see him already behind me on the screen, Doug, you have the stage somehow, thank you.
DOUG MADORY: I cannot share my screen yet. All right. Hello, RIPE, I am from Kentik. This is a talk I put together with Job Snijders about measuring RPKI using NetFlow, and I want to thank everybody for the opportunity to present this. I was not able to make it to Germany this month but hopefully I will be in attendance at the next conference.
In this talk, I believe we have ‑‑ this is a good news talk around RPKI, with stats looking at how far we have come. So where are we with ROV adoption? RPKI stands presently as the Internet's best defence against hijacks due to typos and other BGP mishaps, like origination leaks. As with any distributed security mechanism, a challenge is that you need many individuals to take an action and make a decision to adopt, to participate in the system. And at the outset you have a chicken and egg issue: why would you bother going through the process of rejecting invalids if no one is creating ROAs, and why create ROAs if no one is rejecting invalids? I guess I would like to present some data to officially close that phase of the RPKI evolution.
So, in the past couple of years, we have had a lot of movement on both sides of that question. There's been a lot of adoption of rejecting RPKI invalids by the Tier 1s, Arelion and Lumen and a bunch of other companies who have yet to go through a rebranding. And on the other side of it, the creation of ROAs, there was an inflection point a couple of years ago where the stats started moving in the right direction. This is a graph from NIST, a US Government agency that keeps statistics on a lot of things; they have a web page they keep up to date that's very handy and has a lot of graphs along these lines, looking at world trends. In this case, the yellow line is how many routes in circulation in the global routing table have no ROA and therefore cannot be protected in the RPKI system. And then the green line is those routes that are valid, and so those are moving in the right direction.
And then if you squint your eyes very closely you can find there's a red line that looks like it's the x axis, and these are the persistently invalid routes; there is a very, very small number, due to some misconfiguration or another. This is the v4 plot. The v6 one looks very similar.
So, it takes two steps to reject an invalid BGP route. As we mentioned, on one side the owner or the person responsible for the address space has to create a ROA to assert the valid origin, and on the other side you need networks that are dropping invalids. For step one, we have the NIST website I mentioned, and RIPEstat has got some neat tools for tracking similar stats over time. And on the flip side, this is an active area of research and a known hard problem: how do you passively determine which ASs are dropping invalids?
In this talk we are looking more at the left‑hand side here; we are going back to the measurement of how many ROAs are created, because I think we have got some new data to bring to bear that might help our understanding of where we are at with the progress.
So, if we were to look, again, at just a recent day in v4 on the NIST RPKI monitor website, we see something like 34.1% of the BGP routes in the global routing table are valid, and about twice as many are not found or unknown ‑‑ I may use those terms interchangeably here ‑‑ routes without a ROA. So it's two to one, unknown to valid. But I guess the question is: what proportion of traffic is safeguarded by that 34%? Because I think we all know that not every route is created equal and not every route carries the same amount of traffic. Again, if you were to pull up the stats for v6, they look pretty similar to v4.
At this point in the talk, I'll make a little detour to go back and tell a bit of history here.
So, back in 2019, a long, long time ago, long before the pandemic, another world, Job Snijders, my co‑author, along with Paolo, were trying to alleviate some of the concerns around RPKI adoption, specifically: would a network lose important customer traffic as a result of rejecting invalids? There was a way to allay those fears by answering precisely what would be disrupted had they taken this measure. So they worked together to extend the pmacct tool to incorporate RPKI validation along with NetFlow analysis, so you could have that information for your own network before you make that decision. And after they did this, Job sent out an e‑mail to the list, possibly RIPE as well, and put out a call to action to the premier NetFlow and analytics platforms that they ought to adopt this feature because it would be good for their customer base and for the Internet. I am at Kentik now, and they heeded this challenge, and within a few months we had this feature in our product, so a lot of this is going to be based on using that functionality and some of our aggregate data, and what we can tell about RPKI.
So, again, going way back to 2019, that e‑mail was in February, and then the following month, March, Job, who was at NTT at the time, gave a talk at DKNOG about this development, and also about running it over NTT data just to get a sense for it: if they were to look at the volume of traffic they handle, how would it break down into the various buckets? This was a graph presented at the time; you can see that the amount of RPKI unknown or not found traffic was the vast majority. The orange or yellow is valid, and then there is a very small sprinkling of invalid along the top. We have come a long way from here, so remember this breakdown because it will be different. And then finally, he gave a maybe provocative slide here saying that maybe not everybody needs to do RPKI. Given the consolidation in the Internet industry, if you were to get a bunch of content providers and the largest eyeball networks and maybe some of the DNS service providers, you would have a very secure core: a small number of companies that would deploy this but with a very large amount of benefit, since they have a lot of traffic and a lot of the most important Internet functionality.
Right. So, what does Kentik have to say on this? Kentik has 300 plus customers that we do NetFlow analysis for, and about half of those have opted in to allow their data to be used as part of aggregate analysis, and this is what I use ‑‑ when I am in the news reporting on a country going off‑line, like Myanmar is off‑line, I can use our data to make an assessment on whether the country or a particular network is up or down. It's important to note that any of this analysis, and the work I do with this data, is subject to the biases that exist in our customer set; we have NSPs, CDNs, various types of digital businesses in there. We do skew towards the US in our customer base, so it's worth noting; I still think this data is pretty informative. And as I mentioned earlier, we adopted this feature at the call of Job several years ago, mostly to answer that question: if you were to start dropping invalids, what would happen to your customer traffic? But what's neat is I can take that same functionality on this very versatile analytics platform and answer a variety of questions, like how are we doing by traffic volume on the various states of RPKI? So let's see.
So, as was done in the pmacct tool, there are four cases of RPKI evaluation: there is valid, unknown, invalid, and also invalid but covered by a valid or unknown route. Job wanted me to emphasise that this is not part of any kind of IETF standard or RFC; this is simply for the analysis. On the right here is a screenshot of BGP.at.net, many of you guys probably recognise this. This is an example of a /24 that is RPKI invalid; if you are trying to send traffic to this particular IP, this dot 48, it would use the /17 route, which would be unknown because there is no ROA, and that's where you would deliver that traffic. So those are the four categories.
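The four categories can be sketched in Python as follows. This is an illustrative reconstruction, not the pmacct or Kentik implementation: `validate` follows the usual origin validation logic (a route is valid if a covering VRP matches its origin AS and its length does not exceed maxLength, invalid if it is covered but nothing matches, unknown if nothing covers it), and `classify` adds the fourth, analysis‑only state for destinations whose most specific route is invalid but which still have a less specific valid or unknown route. All prefixes, AS numbers and names are assumptions made for the example.

```python
import ipaddress

def validate(prefix, origin, vrps):
    """Origin validation of one route against VRPs given as (prefix, max_length, asn)."""
    net = ipaddress.ip_network(prefix)
    covered = False
    for vrp_prefix, max_length, asn in vrps:
        vrp_net = ipaddress.ip_network(vrp_prefix)
        if vrp_net.version != net.version or not net.subnet_of(vrp_net):
            continue  # this VRP does not cover the route
        covered = True
        if asn == origin and net.prefixlen <= max_length:
            return "valid"
    return "invalid" if covered else "unknown"

def classify(dst_ip, rib, vrps):
    """Classify traffic to dst_ip given a RIB of (prefix, origin_asn) routes."""
    addr = ipaddress.ip_address(dst_ip)
    nets = [(ipaddress.ip_network(p), o) for p, o in rib]
    matches = [(n, o) for n, o in nets if n.version == addr.version and addr in n]
    matches.sort(key=lambda m: m[0].prefixlen, reverse=True)  # most specific first
    if not matches:
        return "no_route"
    states = [validate(str(n), o, vrps) for n, o in matches]
    if states[0] != "invalid":
        return states[0]
    # The most specific route is invalid: is there a usable less specific route?
    if any(s != "invalid" for s in states[1:]):
        return "invalid_covered"
    return "invalid"

# Hypothetical data mirroring the slide: an invalid /24 inside an unknown /17.
vrps = [("198.51.100.0/24", 24, 64496)]
rib = [("198.51.100.0/24", 64500), ("198.51.0.0/17", 64501)]

print(classify("198.51.100.48", rib, vrps))
```

With these inputs the /24 is invalid because the origin does not match the ROA, while the /17 has no covering ROA and is unknown, so traffic to the address lands in the "invalid but covered" bucket.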
And this is what we came upon. Looking at data from a week I picked from a couple of months ago, we are seeing that the majority of traffic is going to valid routes, and this looks a lot better than the 2:1 of unknown to valid, if you remember that from the beginning of the presentation. We are seeing 56.4% of traffic is valid, and yeah, the persistent invalid traffic is again very small, very likely not a reason to not drop invalids, but if you don't want to take my word for it, take a look at the pmacct tool or our tool; you can run the same analysis to see if you would be losing any customer traffic.
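The difference between counting routes and counting traffic can be made concrete with a small calculation. The sketch below uses made‑up numbers (they are not the figures from the talk) and a hypothetical `shares` helper: it computes the share of each RPKI state once by number of routes and once weighted by bits per second.

```python
from collections import defaultdict

def shares(routes):
    """routes: list of (rpki_state, bits_per_second), one entry per route.

    Returns (share_by_route_count, share_by_traffic) as dicts of fractions.
    """
    by_count = defaultdict(int)
    by_bps = defaultdict(int)
    for state, bps in routes:
        by_count[state] += 1
        by_bps[state] += bps
    total_routes = len(routes)
    total_bps = sum(by_bps.values())
    return (
        {s: n / total_routes for s, n in by_count.items()},
        {s: b / total_bps for s, b in by_bps.items()},
    )

# Made-up sample: one busy valid route and two quieter unknown routes.
sample = [("valid", 600), ("unknown", 300), ("unknown", 100)]
route_share, traffic_share = shares(sample)

print(f"valid by routes:  {route_share['valid']:.1%}")
print(f"valid by traffic: {traffic_share['valid']:.1%}")
```

In this toy case a third of the routes are valid but they carry most of the traffic, which is the same effect the route‑based and NetFlow‑based statistics show.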
Let's pivot now. We can use the RIPEstat tool, it has an RPKI by country functionality, and look at how these things differ when you are counting either routes or IP space versus looking at NetFlow to a country. I am based in the US, and when we run this we see a big difference in these stats. For v4 and v6, we are seeing 20, 24% of the IP space covered with ROAs, but then we are seeing 58.5% of bits per second, measured in NetFlow, combined v4/v6, going to routes with ROAs ‑‑ we'll call that valid traffic for lack of a better term. And why is that? What's happening there? It's the result of some major RPKI deployments. In the US, Comcast and Spectrum are major access networks and eyeball networks and have done tremendous work in RPKI deployment, and they are seeing near complete coverage: almost all the traffic going to their routes is going to RPKI valid routes. On the flip side, for content providers, Amazon and Google are at 100%, Cloudflare is very high, again due to recent RPKI deployments, and so while these networks may account for just a minority of the BGP routes or the IP address space, they account for a very large, maybe not a majority by themselves, but a very large amount of traffic. And this is pushing up the stats.
So we can take this technique and kind of travel around the world. Again, we looked at the US; in this graph the blue here is the NetFlow‑based bits per second, combined v4/v6, of traffic going to RPKI valid routes, orange is v4 space covered by ROAs, and grey is v6. Canada and Brazil also look better in NetFlow; Mexico goes the other way, a little bit lower. We skip around Europe, and you can see a variety of outcomes here. Turkey and Ireland have big RPKI deployments, and that shows up both in those RIPEstat numbers of covered v4 address space as well as in NetFlow. This conference is in Germany, and there we are seeing again another improvement: the numbers look better in NetFlow than just looking at BGP alone. The largest contributor there, or the biggest destination, is DTAG, which isn't a big surprise, and what's pushing up the stats there is a huge RPKI deployment at DTAG. As you go down the spectrum here, Great Britain, GB, is another example, maybe one of the bigger ones in Europe, where the stats for traffic to the country look better in NetFlow than in a strict BGP analysis.
If you go over to Asia, these numbers are all over the map, just like Asia. Taiwan and the Philippines have big RPKI deployments, and the numbers agree in both NetFlow and BGP analysis, while South Korea and China have very low RPKI as it stands.
Also, with the ability to do NetFlow analysis with RPKI you can look at traffic in different dimensions. One of them is just by protocol: if we break it out by v6 and v4, we are seeing a much higher percentage of traffic to valid routes in v6 than v4. You can form your own theory; let me know if you have got a good theory on that. If we go by port, you will see port 443 has a very high percentage, and I would theorise that this is probably related to what we talked about a few slides ago, how some of these major content providers are doing RPKI; that's going to push up a lot of the 443 ‑‑ I'm happy to discuss that with anybody. Another neat thing was in RIPEstat, with its temporal plotting of how things change over time: I was able to pick out times when we could see movement in those stats and movement in our stats. In this case, there was a big RPKI deployment in Poland in late September last year, when the percentage of IPv4 address space with ROAs went from 39% to 64% and, at the same time, we saw a change from 46% to 60%. So it's not a one‑to‑one movement, but these things are correlated.
In the previous talk Massimo was talking about some of the outages that have taken ROAs out of circulation. There was an outage in Taiwan last fall, so I looked into that when it occurred, and no traffic seemed to be disrupted, which is good. In our system, we were just classifying all the traffic as going from valid to unknown, as it's supposed to, and then back to valid once the system was back up, and that's how it looked to us.
There are also some interesting phenomena when you break it down by the hours of the day. The breakdown between RPKI valid traffic and unknown, which are the two main categories, does not stay consistent through the day, so there is a little bit of undulation. I think the explanation there is that the breakdown is different on fixed line networks than it is for mobile carriers, and not all of them have got large RPKI deployments, so as people shift their daily activities, maybe more on fixed line in the evening versus when they are on their phones, it changes this distribution. That's just another interesting observation as we are digging into this data.
So to summarise, I would reiterate that the best current practice is to reject RPKI invalid BGP routes. By rejecting these invalid routes you are protecting your outbound traffic from getting misdirected by accepting a route that you shouldn't, whether due to a typo or an origination leak. And I would argue it's not a risk to legitimate traffic, but you don't have to take my word for it; we have got tools out there and you can check this for yourself.
And the other recommendation is to not set things like local pref or communities based on validation states. The concern, especially for much larger networks, is that if there is some sort of loss of connection to the validator, your routers are then going to announce potentially a lot of routes in a different state, with a different community or a different local pref, based on this new transition from valid to unknown, and you are going to create a lot of churn if you do that. So that is a concern.
And if you have any questions, please reach out to myself or Job, we are the BGP agents ‑‑ that was me and his cat there pictured on the right.
FRANZISKA LICHTBLAU: Thank you, Doug. And let's open the floor for questions. I think Jen was the first in the queue.
JEN LINKOVA: Question: if I have got you right, you were looking at the destination addresses, right?
DOUG MADORY: Yes.
JEN LINKOVA: I have a theory that the situation for invalids may be even better. I'm curious whether you looked at the sources, because I guess some of the traffic is just white noise and scans, just random traffic sent to random destinations, so not real bi‑directional flows. Do you see any actual responses from the invalid prefixes?
DOUG MADORY: Okay, you are thinking that's pushing up the traffic ‑‑
JEN LINKOVA: If you looked at the sources, do you see any actual bi‑directional flows, where someone is sending traffic to invalids and getting some responses back?
DOUG MADORY: I did not try to filter for bi‑directional flows. I think my assumption is that the vast majority of the traffic is actual traffic, and I don't think a scan would amount to the same amount of traffic as the real traffic these networks carry. It's my theory. Good point.
CYNTHIA REVSTROM: My question is regarding the temporal analysis, which I found quite interesting, and I am wondering, was that looking at the entirety of the world or were you looking at some specific place? Because, of course, I would assume it's going to be more similar if you look at the entire world because of different time zones; you might see more interesting data if you just look at Western Europe or something?
DOUG MADORY: You are right. I was looking at the whole world, as far as what we see, so you are right, everything is going to be out of phase across the whole world, and I think what we are seeing is just dominated by US traffic because we probably get a lot more of that. If I were to isolate a different country, if I were to take Germany, just in one time zone, it would likely be even more pronounced. It's a good point.
PETER HESSLER: My question is: So we are familiar with the ARIN TAL situation with RPKI, and I wonder how much of the traffic you saw is covered under the ARIN TAL and would be unknown in many locations who are not willing or able to agree to the ARIN TAL licence?
DOUG MADORY: I don't know the answer to that question but that is an excellent future investigation. I like that idea but I don't have the answer.
PETER HESSLER: Thanks.
FRANZISKA LICHTBLAU: Okay. I don't see any people queuing at the microphone or virtually. This is your last chance. Otherwise, thank you, Doug.
FRANZISKA LICHTBLAU: I am tempted to wave at the big monitor but that's totally useless. Okay, housekeeping.
Again, please do rate the talks and do nominate yourself for the RIPE PC if you are interested in working with us. You can nominate yourself until 3:30 tomorrow by sending a mail to pc [at] ripe [dot] net. Until now we have received one self‑nomination for two open slots, so please don't let us die there alone. And, yes, with that, we will close this plenary session. After this 30 minute break there will be a BoF on future Internet and protocol design and what we expect from the Internet in the future; it's hosted by Jelte Jansen and his colleagues and I think it's going to be quite interesting. So see you back in half an hour.
WOLFGANG TREMMEL: I think it's in the side room. That's across this way. See you.
LIVE CAPTIONING BY AOIFE DOWNES, RPR