Devops | 2i2c

Protecting our hubs against the CopyFail kernel exploit

Mon, 04 May 2026 00:00:00 +0000

The recently disclosed CopyFail Linux kernel zero-day (CVE-2026-31431) opens up a way for code running inside a container to break out onto the underlying node. We took a close look at our hubs to check whether they were exposed, found that they are likely not at risk, and added another layer of protection just in case.

Are 2i2c’s hubs at risk? #

No - based on our testing and mitigation efforts, our hubs are not vulnerable to CopyFail.

Why do we think we’re not at risk? #

We tried to reproduce the exploit on a staging hub by following the public Kubernetes proof-of-concept on both AWS and EKS, and the exploit was unable to break out of the container.
Existing JupyterHub hardening on Kubernetes from jupyterhub/kubespawner#545 (originally added by Yuvi in 2021 in response to a different security issue) had already significantly reduced our risk exposure, and the exposure of anyone else running Z2JH (the standard way to deploy JupyterHub on Kubernetes).
As an extra layer of protection, we deployed copyfail-ebpf-k8s as a daemonset across all of our clusters in 2i2c-org/infrastructure#8227. This runs on every node and covers all of our hubs (including those on non-commercial cloud infrastructure, like JetStream2). It blocks the specific kernel features that CopyFail depends on. See the project’s explanation for how that works.
We’ve upgraded all GKE clusters to use a patched image in 2i2c-org/infrastructure#8230.

What else did we look into? #

Deckhouse’s mitigation was too platform-specific for us.
OVHcloud’s modprobe blocking likely won’t work on Amazon Linux 2023, since the relevant module is built into the kernel image.
AL2023 security advisories - no patched AL2023 image is available yet, so we can’t rely on a kernel-level fix from AWS for now.

Acknowledgements #

Huge thanks to Georgiana for the deep dive into the exploit and whether we’re exposed here.
Thanks to Yuvi for the PR that reduces JupyterHub’s exposure to this back in 2021!
Thanks to iwanhae for the eBPF daemonset we deployed in Kubernetes, and to JupyterHub for the upstream kubespawner hardening that lowered our exposure.
Thanks to our collaborators at NASA VEDA for the ongoing conversations about hub security.
Thanks to our collaborators at Pythia for supporting ongoing work around security in JupyterHub and BinderHub, especially on non-commercial cloud like JetStream.

Upgrading community infrastructure to Kubernetes 1.34 and JupyterHub 4.3.3

Wed, 08 Apr 2026 00:00:00 +0000

We’ve completed a major round of infrastructure upgrades across all 2i2c-managed hubs - every hub is now running Kubernetes 1.34 and Z2JH helm chart 4.3.3.

Running up-to-date versions of both Kubernetes and the JupyterHub helm chart ensures that our communities get the best support and reliability, both in terms of features and security.

A new approach to infrastructure upgrades: upgrading in rounds #

This was the first time we rolled out JupyterHub helm chart upgrades in rounds rather than all at once. By upgrading a subset of hubs at a time, we could identify and fix issues in isolation before they affected the broader network. This made the process safer and more predictable.
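
The rounds-based rollout described above can be sketched roughly as follows. This is a minimal illustration, not our actual tooling: the hub names, the `round_size` parameter, and the `upgrade` callback are all hypothetical, and a real rollout involves manual checks between rounds.

```python
from typing import Callable


def split_into_rounds(hubs: list[str], round_size: int) -> list[list[str]]:
    """Group hubs into consecutive rounds of at most `round_size` hubs."""
    if round_size < 1:
        raise ValueError("round_size must be at least 1")
    return [hubs[i:i + round_size] for i in range(0, len(hubs), round_size)]


def upgrade_in_rounds(
    rounds: list[list[str]],
    upgrade: Callable[[str], bool],
) -> tuple[list[str], list[str]]:
    """Upgrade round by round and stop before the next round if any hub failed.

    Returns (upgraded, failed). Hubs in later rounds are left untouched.
    """
    upgraded: list[str] = []
    for hubs in rounds:
        failed = [hub for hub in hubs if not upgrade(hub)]
        upgraded.extend(hub for hub in hubs if hub not in failed)
        if failed:
            # Fix the problem in isolation before touching the remaining rounds.
            return upgraded, failed
    return upgraded, []


# Hypothetical hub names and a fake upgrade function, for illustration only.
hubs = ["staging", "hub-a", "hub-b", "hub-c", "hub-d"]
rounds = split_into_rounds(hubs, round_size=2)
upgraded, failed = upgrade_in_rounds(rounds, upgrade=lambda hub: hub != "hub-b")
```

In this example the failure of `hub-b` in the second round means `hub-d`, which sits in the third round, is never touched until the issue is understood.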

We’re planning to perform these kinds of upgrades on a regular schedule for our member communities. Around every 6 months we’ll create an issue to make sure nothing falls through the cracks (here’s an example config for creating our reminder issues).

Check out our process docs for multi-hub upgrades for more information.

Learn more #

Check out these pages for what kinds of improvements we’ve brought into our clusters / hubs with these latest updates.

Acknowledgements #

Thanks to Georgiana Dolocan for leading this upgrade effort and establishing the rounds-based approach.
Thanks to Chris Holdgraf for adapting and editing Georgiana’s notes into a blog post.

How regularly upgrading core infrastructure leads to upstream improvements and better infrastructure

Fri, 03 Apr 2026 00:00:00 +0000

Our collaborators at NASA VEDA recently asked us about the rationale behind policies for upgrading our infrastructure relatively quickly when new versions come out. Here’s the explanation that we shared with them, in case it’s useful for others as well.

In this case, the decision was whether to upgrade to Helm 4, and you can find our rationale in the /initiatives repository. Here’s a brief summary from Yuvi:

Fundamentally, it helps keep moving us and the ecosystem forward, and drive improvements upstream, in both JupyterHub and Helm.

It has driven these PRs in JupyterHub:

jupyterhub/action-k3s-helm#126 (merged)
jupyterhub/zero-to-jupyterhub-k8s#3797 (validated, but not merged yet)

It’s also driven improvements to helm itself - see this bug report that is being worked on:

helm/helm#31919

Upgrading helm versions can break things (and it has for some of our other communities in the past - see this example). So it’s important we do that on a reasonable timeframe and carefully, to avoid disruptions.

We’re also discovering for example that potentially the new nginx-ingress controller we had to move to has some issues working with older helm versions (ongoing WIP in 2i2c-org/infrastructure#7995). That feels much more tractable because we can now go ‘ok, let us just apply a quick fix now, and wait for the helm 4 rollout, and try again’ instead of being totally stuck.

This is similar to the other part of our VEDA objective - rolling out new versions of jupyterhub. If we need to roll out security fixes, it’s much easier now because we already did the hard work of being up to date:

2i2c-org/infrastructure#7996

This isn’t the case quite yet for helm v3, as it’s still supported, but it’s much better to do this work earlier than wait.

If you encounter a bug in a popular open source software, often you can just ‘wait’ for it to be fixed. But this isn’t just about time - someone somewhere has to put in the effort of getting it fixed, filing helpful upstream bug reports, and testing to make sure it works. This is an example of 2i2c continuing to contribute this effort upstream wherever we can.

Acknowledgements #

Thanks to NASA VEDA for collaborating deeply with us on infrastructure questions like this.

Enabling CloudBank to safely manage their own cluster infrastructure

Tue, 20 Jan 2026 00:00:00 +0000

We recently enabled CloudBank to run Terraform changes for their cluster without needing to wait on 2i2c engineers for each request. They run 50+ hubs for various community colleges, and we want to enable them to self-serve as much of that as possible. When we introduced home directory quotas, they were no longer able to set up hubs by themselves without help from 2i2c engineers. Our goal was to empower them to set up new hubs in a safe way while still benefiting from the home directory limits work.

CloudBank simplifies cloud access for computer science research and education.

To do this safely, we needed to avoid granting access to shared Terraform state that could impact other communities. Following Yuvi’s suggestion, we migrated CloudBank’s Terraform state to CloudBank’s own GCP project so that infrastructure changes from the CloudBank team are isolated to their cluster only, making this safe to try. This lets the CloudBank team run commands like terraform plan and terraform apply themselves, so they can deploy and update a hub without 2i2c engineers in the loop.

This is a good example of how we aim to balance community autonomy with infrastructure safety. CloudBank can now self-serve routine operations while our broader infrastructure remains protected.

Learn more #

Acknowledgements #

Thanks to Sean Morris and the CloudBank team at UC Berkeley for collaborating on this workflow.

Improving our community hub reliability and stability in Q4 2025

Tue, 16 Dec 2025 00:00:00 +0000

This year we’ve prioritized making the cloud safe to try for our member communities. This has driven work in monitoring, alerting, and automating infrastructure so that we resolve small problems before they become big problems. In the last quarter of 2025, we wrapped up this effort by testing the following hypothesis:

We can reduce P1 incidents if we shorten the time to act on current alerts and learnings from prior incidents.

Here’s what we accomplished and what we learned.

What we accomplished #

In short: we’re now much more confident in the stability of community infrastructure. Here’s a snapshot of our new incident dashboard, which shows high-level trends for the stability of our infrastructure:

See the real-time status of our community hubs at status.2i2c.org

We improved infrastructure reliability for our communities #

We made several technology and team process improvements that led to these benefits for our communities:

We are now more likely to catch outages before a community reports them to us.
We are now less likely to have an outage happen more than once, or affect more than one community, because we consistently fix the issues that cause outages.

We saw a consistent drop in critical alerts that required immediate response:

For August and September we had an average of 7 outages/month (6 from alerts, 1 from community)
In October, November, and December we had an average of 3 outages/month (9 in October, 0 in November, 1 in December, with only one of these being reported by a community)
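
As a quick check of the figures above, the quarterly average and the relative drop can be worked out from the quoted monthly counts. The snippet uses only the numbers stated in this post. The rounding and the percentage are our own arithmetic, not additional measurements.

```python
# Monthly outage counts quoted above for Q4 2025.
q4_outages = {"October": 9, "November": 0, "December": 1}
q4_average = sum(q4_outages.values()) / len(q4_outages)

# August/September average quoted above: 6 from alerts + 1 from a community.
earlier_average = 6 + 1

reduction = 1 - q4_average / earlier_average
print(f"Q4 average: {q4_average:.1f} outages/month")
print(f"Reduction vs. August-September: {reduction:.0%}")
```

Note that nearly all of the Q4 outages fall in October, so the monthly average understates how quiet November and December were.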

We became more efficient, responsive, and focused #

We also got several team benefits from this work:

We get fewer interruptions and distractions from deeper work.
We have clear assignment policies, so everyone knows who is responsible for acting in response to alerts.
We avoid the invisible work of falling down rabbit holes when responding to outages.
We decreased the stress and pressure of doing upgrades, making them easier to split into sprint items and more likely to get done consistently.

The improvements we made #

Infrastructure improvements #

Created a status page for all 2i2c community hubs, giving our team and communities visibility into the status of our infrastructure.
Created an alert that triggers when two servers fail to start consecutively in a 30-minute time window.
Improved deployment infrastructure so that we can roll out sub-chart upgrades to individual clusters, allowing us to roll out major changes in batches.
Removed our “configurator” application from community hubs, because it was causing more confusion than it was resolving.
Allowed servers to start even when users hit their storage quotas.
Provided a number of upgrades to Kubernetes and the support services that we run alongside each community hub.
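
To illustrate the consecutive-failure alert from the list above, here is a minimal sketch of the rule in plain Python. It assumes one interpretation of the rule (two back-to-back failed server starts no more than 30 minutes apart, with any successful start resetting the count). The event format and the function name are hypothetical, and the real alert is evaluated by our monitoring stack rather than by code like this.

```python
from datetime import datetime, timedelta
from typing import Optional

WINDOW = timedelta(minutes=30)


def should_alert(events: list[tuple[datetime, bool]]) -> bool:
    """Return True if two back-to-back server starts failed within WINDOW.

    `events` is a list of (timestamp, succeeded) pairs sorted by time.
    """
    previous_failure: Optional[datetime] = None
    for timestamp, succeeded in events:
        if succeeded:
            # A successful start breaks the run of consecutive failures.
            previous_failure = None
            continue
        if previous_failure is not None and timestamp - previous_failure <= WINDOW:
            return True
        previous_failure = timestamp
    return False


# Illustrative events only.
t0 = datetime(2025, 12, 1, 9, 0)
two_quick_failures = [(t0, False), (t0 + timedelta(minutes=10), False)]
failure_then_success = [
    (t0, False),
    (t0 + timedelta(minutes=10), True),
    (t0 + timedelta(minutes=20), False),
]
failures_far_apart = [(t0, False), (t0 + timedelta(minutes=45), False)]
```

Requiring two consecutive failures, rather than one, is what keeps a single flaky start from paging anyone.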

Process improvements #

Made a team commitment to prioritize issues from incident reports and other stability-related problems.
Defined incident escalation policies using the status page to calibrate the urgency of our response to the severity of incidents.
Defined “on-call” procedures so our team knows when and how to be more responsive to outages.
Time-boxed our alert response process to avoid accidentally falling down rabbit holes for non-urgent problems.
Created a more reliable process for responding to incidents and writing incident reports.

Looking forward #

After this push around infrastructure reliability, we’re significantly more confident in the stability and transparency of our community hub infrastructure. This will deliver better service for our member communities and free up more of our time to engage with them instead of fighting infrastructure fires.

We will continue to improve our infrastructure, and have a better foundation to do so incrementally in the coming quarters. Here are a few things we’d still like to improve:

We still need to improve how reliably we complete follow-up actions from incidents (e.g., writing incident reports). When a process doesn’t fit into planning & scoping ceremonies, we struggle to follow it consistently.
We’d like to improve our testing framework for major upgrades across all hubs (e.g., Kubernetes version upgrades) to catch bugs before communities do.

Learn More #

Faster reporting of user home directory sizes

Tue, 09 Dec 2025 00:00:00 +0000

Storage quotas help users avoid running out of space unexpectedly and give administrators visibility into capacity planning. However, storage usage can change rapidly, so administrators need timely information to know whether they are close to hitting limits.

We’ve improved how quickly hub administrators can see user home directory sizes across our JupyterHubs. This makes monitoring more responsive and adds quota limit visibility that wasn’t possible before.

Using `jupyterhub-home-nfs` for near-instant disk usage metrics #

Our existing storage monitoring tool, prometheus-dirsize-exporter, deliberately runs slowly to avoid excessive disk I/O. This meant home directory metrics could be hours out of date on systems with many users or large directories. Plus, there was no way to report user quota limits at all.

Our home directory storage is managed by jupyterhub-home-nfs, which enforces per-user quotas. It could also expose usage and limit information as Prometheus metrics using data from the underlying filesystem quota system. Because this information is already tracked by the filesystem, it’s available immediately without scanning individual files.

We made two key improvements:

Make disk usage reporting almost instantaneous. We made jupyterhub-home-nfs export total_size_bytes and hard_limit_bytes metrics to Prometheus for near-instant reporting. We used the same metric names and namespace as prometheus-dirsize-exporter for compatibility. See 2i2c-org/jupyterhub-home-nfs#76
Allow this to be used upstream in JupyterHub Grafana Dashboards so that it can support both types of disk usage reporting. This means users of the upstream JupyterHub Grafana dashboards get the same useful view about home directory usage, regardless of whether the metric comes from prometheus-dirsize-exporter or jupyterhub-home-nfs. See 2i2c-org/prometheus-dirsize-exporter#29
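
To make the shape of these metrics concrete, here is a small sketch that renders usage and quota values in the Prometheus text exposition format. It is not the jupyterhub-home-nfs implementation: the `dirsize` namespace and the `directory` label are assumptions about the naming shared with prometheus-dirsize-exporter, the help strings are invented, and the byte values are supplied by hand rather than read from the filesystem quota system.

```python
def render_metrics(
    usage: dict[str, tuple[int, int]],
    namespace: str = "dirsize",  # assumed namespace, see the note above
) -> str:
    """Render per-directory usage and quota as Prometheus text-format lines.

    `usage` maps a directory name to (used_bytes, hard_limit_bytes).
    """
    lines = []
    for metric, index, help_text in (
        ("total_size_bytes", 0, "Total size of the directory in bytes"),
        ("hard_limit_bytes", 1, "Hard quota limit for the directory in bytes"),
    ):
        name = f"{namespace}_{metric}"
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        for directory, values in sorted(usage.items()):
            # The "directory" label name is an assumption for illustration.
            lines.append(f'{name}{{directory="{directory}"}} {values[index]}')
    return "\n".join(lines) + "\n"


# Hypothetical user with 5 GiB used out of a 10 GiB quota.
text = render_metrics({"alice": (5 * 1024**3, 10 * 1024**3)})
```

Because both values come from quota accounting that the filesystem already maintains, producing this output does not require walking any user's files.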

These changes were deployed across all our communities, so administrators can now access current home directory information within minutes regardless of directory size.

Home Directory Usage dashboard showing total size metrics from jupyterhub-home-nfs and other data from prometheus-dirsize-exporter

Try it out #

2i2c member organizations can try this out now. If you have access to your hub’s Grafana instance, you can see these new metrics in the Home Directory Usage dashboard:

Open your hub’s Grafana dashboard.
Go to Dashboards -> JupyterHub Default Dashboards -> Home Directory Usage.
Check the table for up-to-date total size and quota limit values.

For more details, see our docs on filesystem and disk dashboards.

Coming next #

We’d like to build on this work to enable alerting when individual users near their disk quotas. This will make it easier to more reliably track user disk usage across a community. See this issue for tracking: 2i2c-org/infrastructure#7166
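
As a rough idea of what such an alert could check, the sketch below flags users whose usage is at or above a fraction of their hard limit. This work is not implemented yet, so everything here is hypothetical: the 90% threshold, the function name, and the input format are illustrative choices only.

```python
def users_near_quota(
    usage: dict[str, tuple[int, int]],
    threshold: float = 0.9,  # hypothetical default, not a decided value
) -> list[str]:
    """Return users whose used bytes are at least `threshold` of their limit.

    `usage` maps a user to (total_size_bytes, hard_limit_bytes). Users with
    no limit (a limit of 0) are skipped.
    """
    return sorted(
        user
        for user, (used, limit) in usage.items()
        if limit > 0 and used / limit >= threshold
    )


# Illustrative values in bytes.
sample = {
    "alice": (95, 100),   # 95% of quota
    "bob": (40, 100),     # 40% of quota
    "carol": (500, 0),    # no quota set
}
near_limit = users_near_quota(sample)
```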

Acknowledgements #

This was a directed contribution supported by NASA VEDA to enable more proactive monitoring and alerting for hub administrators.

Tech update: Multiple JupyterHubs, multiple clusters, one repository.

Tue, 19 Apr 2022 00:00:00 +0000

2i2c manages the configuration and deployment of multiple Kubernetes clusters and JupyterHubs from a single open infrastructure repository. This is a challenging problem, as it requires us to centralize information about a number of independent cloud services, and deploy them in an efficient and reliable manner. Our initial attempt at this had a number of inefficiencies, and we recently completed an overhaul of its configuration and deployment infrastructure.

This post is a short description of what we did and the benefit that it had. It covers the technical details and provides links to more information about our deployment setup. We hope that it helps other organizations make similar improvements to their own infrastructure.

Our problem #

2i2c’s problem is similar to that of many large organizations that have independent sub-communities within them. We must centralize the operation and configuration of JupyterHubs in order to boost our efficiency in developing and operating them, but must also treat these hubs independently because their user communities are not necessarily related, and because we want communities to be able to replicate their infrastructure on their own.

A year ago, we built the first version of our deployment infrastructure at github.com/2i2c-org/infrastructure. Over the last year of operation, we identified a number of major shortcomings:

Within a Kubernetes cluster, we deployed hubs sequentially, not in parallel. This grew out of a common practice of Canary deployments that allowed us to test changes on a staging hub before rolling them out to a production hub.
We used a single configuration file for all hubs within a cluster, which led to confusion and difficulty in identifying a hub-specific configuration.
Moreover, any change to a hub within a cluster caused a re-deploy of all hubs on that cluster. This is because we did not know whether a given change touched cluster-wide configuration or hub-specific configuration.

Our goal #

So, we spent several weeks discussing a plan to resolve these major problems - here were our goals:

We should be able to upgrade a specific hub alone, by inspecting which configuration files have been added or modified.
Production hubs should be upgraded in parallel when they are effectively run independently.
We should use staging hubs as “canary” deployments and not continue upgrading production hubs if the staging hub fails.

An overview of our changes #

To accomplish this, we needed to identify which hub required an upgrade based on file additions/modifications. This took a lot of discussion and iteration on design, and so we share it below in the hopes that it is helpful to others!

Improvements to our code and structure #

We made a few major changes to the infrastructure repository to facilitate the deployment logic described above. Here are the major changes we implemented:

We separated each hub’s configuration into its own file, or set of files. For example, here is 2i2c’s staging hub configuration.
We created a separate cluster.yaml file that holds the canonical list of hubs deployed to that cluster and the configuration file(s) associated with each one. For example, here is 2i2c’s GKE cluster configuration, which contains a reference to the previously mentioned staging hub.
We updated our deployer module to do the following things:
- Inspect the list of files modified in a Pull Request.
- From this list, calculate the name of each hub that requires an upgrade, and the name of its respective cluster.
- Trigger a GitHub Actions workflow that deploys changes in parallel for each cluster/hub pair.
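
The file-to-hub mapping in the steps above can be sketched as follows. This is a simplified illustration rather than the deployer's actual code: the `config/clusters/<cluster>/` directory layout, the file names, and the function name are assumptions, and the `clusters` dictionary stands in for what each cluster.yaml records.

```python
from pathlib import PurePosixPath


def hubs_to_upgrade(
    changed_files: list[str],
    clusters: dict[str, dict[str, list[str]]],
) -> set[tuple[str, str]]:
    """Map changed files to the (cluster, hub) pairs that need an upgrade.

    `clusters` maps each cluster name to its hubs, and each hub to the
    config file name(s) it uses.
    """
    pairs: set[tuple[str, str]] = set()
    for changed in changed_files:
        path = PurePosixPath(changed)
        for cluster, hubs in clusters.items():
            # Assumed layout: one directory of config files per cluster.
            cluster_dir = PurePosixPath("config/clusters") / cluster
            if path.parent != cluster_dir:
                continue
            if path.name == "cluster.yaml":
                # Cluster-wide change: every hub on the cluster is affected.
                pairs.update((cluster, hub) for hub in hubs)
            else:
                pairs.update(
                    (cluster, hub)
                    for hub, files in hubs.items()
                    if path.name in files
                )
    return pairs


# Hypothetical clusters and hubs, for illustration only.
clusters = {
    "2i2c": {"staging": ["staging.values.yaml"], "demo": ["demo.values.yaml"]},
    "other": {"prod": ["prod.values.yaml"]},
}
hub_change = hubs_to_upgrade(
    ["config/clusters/2i2c/staging.values.yaml", "docs/index.md"], clusters
)
cluster_change = hubs_to_upgrade(["config/clusters/other/cluster.yaml"], clusters)
```

A change to a single hub's file selects only that hub, while a change to cluster.yaml selects every hub on the cluster. Unrelated files such as documentation select nothing.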

In addition to these structural and code changes, we also developed new GitHub Actions workflows that control the entire process.

A GitHub Actions workflow for upgrading our JupyterHubs #

We defined a new GitHub Actions workflow that carries out the logic described above. It is defined in this deploy-hubs.yaml configuration file. Here are the major jobs in this workflow, and what each does:

generate-jobs: Generate a list of clusters/hubs that must be upgraded, given the files that are changed in a Pull Request.
- Evaluate an input list of added/modified files in a PR
- Decide if the added/modified files warrant an upgrade of a hub
- Generate a list of hubs and clusters that require upgrades, and some extra details:
  - Does the support chart that is deployed to the cluster also need an upgrade?
  - Does a staging hub on this cluster require an upgrade?
This produces two outputs to be used in subsequent steps:
- A human-readable table including information on why a given deployment requires an upgrade (using the excellent Rich library).
- JSON outputs that can be interpreted by GitHub Actions as sets of matrix jobs to run.
Our staging and support hub job matrix tells GitHub Actions to deploy staging and support upgrades that act as canaries and stop production deploys if they fail.
upgrade-support-and-staging: Update the support and staging Helm charts on each cluster. These are “shared infrastructure” Helm charts that control services that are shared across all hubs.
- Accepts the JSON list described above to determine what to do next
- Parallelises over clusters
- Upgrades the support chart of each if required
- Upgrades a staging hub for the cluster if required (for canary deployments, this is always required if at least one production hub is to be upgraded on the cluster)
filter-generate-jobs: Allows us to treat the support / staging hubs as canary deployments for all the production hubs on a cluster.
- If a staging/support hub deploy fails, removes any jobs for the corresponding cluster.
- Allows production deploys to continue on other clusters.
Our production hub job matrix tells GitHub Actions which hubs to update with new changes. These are triggered if a cluster’s staging/support job does not fail.
upgrade-prod-hubs: Deploy updates to each production hub.
- Accepts the JSON list described above to determine what to do next
- Parallelises over each production hub that requires an upgrade
- Deploys the relevant changes to that hub
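
The canary filtering performed by filter-generate-jobs can be sketched as below. This is an illustrative approximation, not the code in our repository: the function names and the job dictionary keys are hypothetical. The `{"include": [...]}` shape is one form that a GitHub Actions job matrix accepts when it is built from JSON.

```python
import json


def filter_prod_jobs(
    prod_jobs: list[dict[str, str]],
    staging_results: dict[str, bool],
) -> list[dict[str, str]]:
    """Drop production jobs for clusters whose staging/support deploy failed.

    `prod_jobs` is a list of {"cluster": ..., "hub": ...} dicts and
    `staging_results` maps a cluster name to True (succeeded) or False.
    Clusters missing from `staging_results` are treated as failed.
    """
    return [job for job in prod_jobs if staging_results.get(job["cluster"], False)]


def as_matrix_json(jobs: list[dict[str, str]]) -> str:
    """Serialise jobs in the {"include": [...]} shape used for a job matrix."""
    return json.dumps({"include": jobs}, sort_keys=True)


# Hypothetical clusters and hubs, for illustration only.
prod_jobs = [
    {"cluster": "cluster-a", "hub": "prod-1"},
    {"cluster": "cluster-b", "hub": "prod-2"},
]
staging_results = {"cluster-a": True, "cluster-b": False}
remaining = filter_prod_jobs(prod_jobs, staging_results)
matrix = as_matrix_json(remaining)
```

Here the failed canary on `cluster-b` removes only that cluster's production job, so the deploy to `cluster-a` still goes ahead.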

Concluding Remarks #

We think that this is a nice balance of infrastructure complexity and flexibility. It allows us to separate the configuration of each hub and cluster, which makes each more maintainable by us, and is more aligned with a community’s Right to Replicate their infrastructure. It allows us to remove the interdependence of deploy jobs that do not need to be dependent, which makes our deploys more efficient. Finally, it allows us to make targeted deploys more effectively, which reduces the amount of toil and unnecessary waiting associated with each change. (It also reduces our carbon footprint by reducing unnecessary GitHub Actions time.)

We hope that this is a useful resource for others to follow if they also maintain JupyterHubs for multiple communities. If you have any ideas of how we could further improve this infrastructure, please reach out on GitHub! If you know of a community that would like 2i2c to manage a hub for them, please send us an email.

Acknowledgements: The infrastructure described in this post was developed by the 2i2c engineering team, and this post was edited by Chris Holdgraf.
