Merged
Changes from 64 commits
73 commits
f75b833
let's try a simpler MR to clean this up
djeebus Aug 6, 2025
a4e24b7
bring the tests back
djeebus Aug 6, 2025
f67a44f
fix test
djeebus Aug 6, 2025
10655b9
linting, bring back caching
djeebus Aug 6, 2025
3271334
lint, actually use the cache provider
djeebus Aug 6, 2025
94b1299
finish implementation
djeebus Aug 6, 2025
1b9b4cd
sort imports
djeebus Aug 6, 2025
828e82f
Merge branch 'main' into nfs-file-cache
djeebus Aug 6, 2025
3ca3783
add the terraform back
djeebus Aug 6, 2025
aeef165
test slab cache with integration tests
djeebus Aug 6, 2025
86ba16d
clean up some error messages
djeebus Aug 6, 2025
34cab05
dump out the expected total size so we can math
djeebus Aug 6, 2025
9995842
fix silly math
djeebus Aug 6, 2025
3696ef2
get body when http requests fail
djeebus Aug 6, 2025
f00ca7a
more silliness
djeebus Aug 6, 2025
66630bd
add some more validation to see what's going on
djeebus Aug 6, 2025
39beb80
fix tests
djeebus Aug 6, 2025
92d9e16
Merge branch 'main' into nfs-file-cache
djeebus Aug 7, 2025
6fefa2d
Merge branch 'main' into nfs-file-cache
djeebus Aug 18, 2025
f51a8b2
fix interface issue
djeebus Aug 18, 2025
d341c29
Use env var instead of '~'
djeebus Aug 18, 2025
aca1919
learn how to use github env vars
djeebus Aug 18, 2025
4b970d1
stop being fancy
djeebus Aug 18, 2025
8492e8f
only need to hash this file, really
djeebus Aug 18, 2025
8917055
Merge branch 'main' into nfs-file-cache
djeebus Aug 18, 2025
fd50041
Merge branch 'main' into nfs-file-cache
djeebus Aug 18, 2025
dad9b9e
fix bug in a loop. thanks, cursor!
djeebus Aug 18, 2025
bf73fb8
fix some terraform
djeebus Aug 18, 2025
8e96623
ignore some things
djeebus Aug 19, 2025
601e15d
only cache builds, not pauses (progress)
djeebus Aug 19, 2025
cb16411
reduce blast radius
djeebus Aug 19, 2025
a1e28f1
support enabling/disabling filestore cache
djeebus Aug 19, 2025
0546411
fix race condition in test
djeebus Aug 19, 2025
11b40d3
remove the 'moved' block
djeebus Aug 19, 2025
1b0bc50
some clean up
djeebus Aug 19, 2025
0c59e24
use launch darkly to gate nfs cache usage
djeebus Aug 19, 2025
f19c61e
Merge branch 'main' into nfs-file-cache
djeebus Aug 19, 2025
2431e32
support "make plan-only-jobs && make apply"
djeebus Aug 19, 2025
a922d5a
Merge remote-tracking branch 'origin/main' into nfs-file-cache
djeebus Aug 19, 2025
fe7d83a
Merge branch 'main' into nfs-file-cache
djeebus Aug 20, 2025
c204c3f
Merge remote-tracking branch 'origin/main' into nfs-file-cache
djeebus Aug 20, 2025
0854a8a
fix tests
djeebus Aug 20, 2025
07b1dfe
support configuration of filestore
djeebus Aug 20, 2025
1f567d3
Merge branch 'main' into nfs-file-cache
djeebus Aug 20, 2025
05d3b65
write to temp file, then rename file
djeebus Aug 21, 2025
923dd7d
filestore maintenance script
djeebus Aug 20, 2025
dd8bf55
clean up messages, hcl
djeebus Aug 21, 2025
153e622
cross platform is fun!
djeebus Aug 21, 2025
a1d9bdc
linting
djeebus Aug 21, 2025
9c3233b
Merge branch 'main' into nfs-file-cache
djeebus Aug 21, 2025
bff3701
fix the nomad job
djeebus Aug 21, 2025
7a9e172
reverse target (was based on free space, now based on used space)
djeebus Aug 21, 2025
803d330
add some data about the files that were deleted
djeebus Aug 21, 2025
ca96079
add standard deviation, fix some logic
djeebus Aug 21, 2025
dc42367
fix linux version
djeebus Aug 21, 2025
54e39fe
fork the upstream implementation of times to get size
djeebus Aug 21, 2025
1f0b775
go work sync
djeebus Aug 21, 2025
5419c3b
fix the linux version
djeebus Aug 21, 2025
4006bb1
fix the osx version
djeebus Aug 21, 2025
df084f4
add gcp alerting to filestore
djeebus Aug 22, 2025
f159ab3
remove the 'import' block
djeebus Aug 22, 2025
b98c8ff
Merge commit '19778a715e8df3adea83858c798582d289bd7159' into nfs-file…
djeebus Aug 22, 2025
c0a10d4
linting
djeebus Aug 22, 2025
2b67939
bring back the old version of the nomad provider
djeebus Aug 22, 2025
05a0c84
remove tracing, fix a bug
djeebus Aug 22, 2025
6e0f9fb
fix a test, remove an empty test
djeebus Aug 22, 2025
cb92317
Simplify for deployment
ValentaTomas Aug 24, 2025
9750894
Fix old name
ValentaTomas Aug 24, 2025
7a32e89
Add automatic protocol; Update recommended values
ValentaTomas Aug 24, 2025
12e02cf
Fix job condition
ValentaTomas Aug 24, 2025
7827f69
Change invalid default value
ValentaTomas Aug 24, 2025
f2d3df4
Add comment
ValentaTomas Aug 24, 2025
a485cf9
Merge branch 'main' into nfs-file-cache
ValentaTomas Aug 24, 2025
2 changes: 2 additions & 0 deletions .github/actions/start-services/action.yml
@@ -88,7 +88,9 @@ runs:
ENVIRONMENT: "local"
OTEL_COLLECTOR_GRPC_ENDPOINT: "localhost:4317"
MAX_PARALLEL_MEMFILE_SNAPSHOTTING: "2"
+LOCAL_TEMPLATE_CACHE_PATH: "./.e2b-slab-cache"
run: |
+mkdir -p $LOCAL_TEMPLATE_CACHE_PATH
mkdir -p ~/logs

# Start otel-collector
1 change: 1 addition & 0 deletions .gitignore
@@ -5,6 +5,7 @@
!.env.template
.last_used_env
.tfplan.*
terraform.tfvars
terraform.tfstate
terraform.tfstate.backup
**/.tfplan.*
8 changes: 8 additions & 0 deletions .mockery.yaml
@@ -7,3 +7,11 @@ packages:
config:
dir: packages/envd/internal/services/legacy
pkgname: legacy

+  github.com/e2b-dev/infra/packages/shared/pkg/storage:
+    interfaces:
+      StorageObjectProvider:
+        config:
+          dir: packages/shared/pkg/storage
+          filename: storage_test.go
+          pkgname: storage
41 changes: 25 additions & 16 deletions .terraform.lock.hcl

Some generated files are not rendered by default.

9 changes: 7 additions & 2 deletions Makefile
@@ -51,7 +51,8 @@ tf_vars := TF_VAR_environment=$(TERRAFORM_ENVIRONMENT) \
$(call tfvar, TEMPLATE_BUCKET_LOCATION) \
$(call tfvar, ENVD_TIMEOUT) \
$(call tfvar, REDIS_MANAGED) \
-	$(call tfvar, GRAFANA_MANAGED)
+	$(call tfvar, GRAFANA_MANAGED) \
+	$(call tfvar, USE_FILESTORE_CACHE) \

# Login for Packer and Docker (uses gcloud user creds)
# Login for Terraform (uses application default creds)
@@ -180,10 +181,14 @@ build/%:
build-and-upload:build-and-upload/api
build-and-upload:build-and-upload/client-proxy
build-and-upload:build-and-upload/docker-reverse-proxy
+build-and-upload:build-and-upload/clean-nfs-cache
build-and-upload:build-and-upload/orchestrator
build-and-upload:build-and-upload/template-manager
build-and-upload:build-and-upload/envd
build-and-upload:build-and-upload/clickhouse-migrator
+build-and-upload/clean-nfs-cache:
+	./scripts/confirm.sh $(TERRAFORM_ENVIRONMENT)
+	GCP_PROJECT_ID=$(GCP_PROJECT_ID) $(MAKE) -C packages/orchestrator build-and-upload/clean-nfs-cache
build-and-upload/template-manager:
./scripts/confirm.sh $(TERRAFORM_ENVIRONMENT)
GCP_PROJECT_ID=$(GCP_PROJECT_ID) $(MAKE) -C packages/orchestrator build-and-upload/template-manager
@@ -238,7 +243,7 @@ switch-env:
import:
@ printf "Importing resources for env: `tput setaf 2``tput bold`$(ENV)`tput sgr0`\n\n"
./scripts/confirm.sh $(TERRAFORM_ENVIRONMENT)
-	$(tf_vars) $(TF) import $(TARGET) $(ID)
+	$(tf_vars) $(TF) import "$(TARGET)" "$(ID)" -no-color

.PHONY: setup-ssh
setup-ssh:
8 changes: 7 additions & 1 deletion main.tf
@@ -22,7 +22,7 @@ terraform {
}
nomad = {
source = "hashicorp/nomad"
-      version = "2.1.0"
+      version = "~> 2.4.0"
}
random = {
source = "hashicorp/random"
@@ -137,6 +137,8 @@ module "cluster" {
consul_acl_token_secret = module.init.consul_acl_token_secret
nomad_acl_token_secret = module.init.nomad_acl_token_secret

+  filestore_cache = var.filestore_cache

labels = var.labels
prefix = var.prefix
}
@@ -255,6 +257,10 @@ module "nomad" {
redis_port = var.redis_port

launch_darkly_api_key_secret_name = module.init.launch_darkly_api_key_secret_version.secret

+  # Filestore
+  filestore_cache = var.filestore_cache
+  slab_cache_path = module.cluster.nfs_slab_cache_path
}

module "redis" {
7 changes: 7 additions & 0 deletions packages/cluster-disk-image/main.pkr.hcl
@@ -80,6 +80,13 @@ build {
]
}

+  provisioner "shell" {
+    inline = [
+      "sudo apt-get -y update",
+      "sudo apt-get install -y nfs-common",
+    ]
+  }

provisioner "shell" {
inline = [
"sudo snap install go --classic"
132 changes: 132 additions & 0 deletions packages/cluster/filestore/main.tf
@@ -0,0 +1,132 @@
+resource "google_filestore_instance" "slab-cache" {
+  name        = var.name
+  description = "High performance slab cache"
+  tier        = var.tier
+  protocol    = "NFS_V4_1"
+
+  deletion_protection_enabled = true
+  deletion_protection_reason  = "If this gets removed, the orchestrator will throw tons of errors"
+
+  file_shares {
+    capacity_gb = var.capacity_gb
+    name        = "slabs"
+  }
+
+  networks {
+    modes = [
+      "MODE_IPV4",
+    ]
+    network = var.network_name
+  }
+}
+
+data "google_monitoring_notification_channel" "notification" {
+  count        = var.notification_display_name == null ? 0 : 1
+  display_name = var.notification_display_name
+}
+
+resource "google_monitoring_alert_policy" "warning" {
+  count = var.free_space_warning_threshold == 0 ? 0 : 1
+
+  combiner = "OR"
+
+  display_name = "memory-cache-disk-usage-low"
+
+  notification_channels = var.notification_display_name == null ? [] : [
+    data.google_monitoring_notification_channel.notification[0].id
+  ]
+
+  severity = "WARNING"
+
+  alert_strategy {
+    notification_prompts = [
+      "OPENED",
+      "CLOSED",
+    ]
+  }
+
+  conditions {
+    display_name = "Over ${var.free_space_warning_threshold}% of the memory cache disk has been used"
+
+    condition_threshold {
+      comparison = "COMPARISON_GT"
+      duration   = "0s"
+      filter     = <<EOT
+resource.type = "filestore_instance"
+AND metric.type = "file.googleapis.com/nfs/server/used_bytes_percent"
+AND metric.labels.file_share = "${google_filestore_instance.slab-cache.file_shares[0].name}"
+EOT
+      threshold_value = var.free_space_warning_threshold
+
+      aggregations {
+        alignment_period   = "300s"
+        group_by_fields    = []
+        per_series_aligner = "ALIGN_MAX"
+      }
+
+      trigger {
+        count   = 1
+        percent = 0
+      }
+    }
+  }
+
+  documentation {
+    mime_type = "text/markdown"
+    subject   = "Memory cache disk usage has gone over ${var.free_space_warning_threshold}%"
+    content   = "Your memory cache filestore instance disk usage has gone over ${var.free_space_warning_threshold}%. "
+  }
+}
+
+resource "google_monitoring_alert_policy" "error" {
+  count = var.free_space_error_threshold == 0 ? 0 : 1
+
+  combiner = "OR"
+
+  display_name = "memory-cache-disk-usage-very-low"
+
+  notification_channels = var.notification_display_name == null ? [] : [
+    data.google_monitoring_notification_channel.notification[0].id
+  ]
+
+  severity = "ERROR"
+
+  alert_strategy {
+    notification_prompts = [
+      "OPENED",
+      "CLOSED",
+    ]
+  }
+
+  conditions {
+    display_name = "Over ${var.free_space_error_threshold}% of the memory cache disk has been used"
+
+    condition_threshold {
+      comparison = "COMPARISON_GT"
+      duration   = "0s"
+      filter     = <<EOT
+resource.type = "filestore_instance"
+AND metric.type = "file.googleapis.com/nfs/server/used_bytes_percent"
+AND metric.labels.file_share = "${google_filestore_instance.slab-cache.file_shares[0].name}"
+EOT
+      threshold_value = var.free_space_error_threshold
+
+      aggregations {
+        alignment_period   = "300s"
+        group_by_fields    = []
+        per_series_aligner = "ALIGN_MAX"
+      }
+
+      trigger {
+        count   = 1
+        percent = 0
+      }
+    }
+  }
+
+  documentation {
+    mime_type = "text/markdown"
+    subject   = "Memory cache disk usage has gone over ${var.free_space_error_threshold}%"
+    content   = "Your memory cache filestore instance disk usage has gone over ${var.free_space_error_threshold}%. "
+  }
+}
3 changes: 3 additions & 0 deletions packages/cluster/filestore/output.tf
@@ -0,0 +1,3 @@
+output "nfs_ip_addresses" {
+  value = google_filestore_instance.slab-cache.networks[0].ip_addresses
+}
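The output above exposes only the instance's IP addresses, so a consumer still has to combine one of them with the share name to get an NFS export. A minimal sketch of that step, assuming the module is instantiated as `module.filestore` (a hypothetical name, as are both locals) and reusing the `slabs` share name from main.tf; Filestore exports take the form `<ip>:/<share>`:

```hcl
# Hypothetical consumer; `module.filestore` and the local names are assumptions.
locals {
  # First IP address reported by the Filestore instance.
  nfs_server = module.filestore.nfs_ip_addresses[0]

  # Export string to hand to an NFS client, shaped like "10.0.0.2:/slabs".
  nfs_export = "${local.nfs_server}:/slabs"
}
```

How this PR actually derives `module.cluster.nfs_slab_cache_path` is not visible in the files shown here, so treat the sketch as one possible wiring rather than the PR's own.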
29 changes: 29 additions & 0 deletions packages/cluster/filestore/variables.tf
@@ -0,0 +1,29 @@
+variable "name" {
+  description = "The name of the Nomad cluster (e.g. nomad-stage). This variable is used to namespace all resources created by this module."
+  type        = string
+}
+
+variable "network_name" {
+  description = "The name of the VPC Network where all resources should be created."
+  type        = string
+}
+
+variable "tier" {
+  type = string
+}
+
+variable "capacity_gb" {
+  type = number
+}
+
+variable "notification_display_name" {
+  type = string
+}
+
+variable "free_space_warning_threshold" {
+  type = number
+}
+
+variable "free_space_error_threshold" {
+  type = number
+}
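None of these variables declares a default, so a caller has to set all seven. A hedged sketch of a call site follows; the module path, the instance and network names, and every value are illustrative assumptions, not recommendations taken from this PR. Per the `count` expressions in main.tf, passing `null` for `notification_display_name` skips the channel lookup, and a threshold of `0` disables the corresponding alert policy.

```hcl
# Hypothetical call site; the source path and all values are assumptions.
module "filestore" {
  source = "./filestore"

  name         = "example-slab-cache"
  network_name = "example-network"
  tier         = "ZONAL"
  capacity_gb  = 1024

  # null means no notification channel is looked up or attached.
  notification_display_name = null

  # Compared against used_bytes_percent, so these are "percent used" values
  # despite the free_space_ prefix; 0 disables the policy.
  free_space_warning_threshold = 80
  free_space_error_threshold   = 90
}
```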