Handle async deletion for lambda managed instances#13751
anisaoshafi wants to merge 2 commits into main
Conversation
Test Results (amd64) - Integration, Bootstrap: 5 files, 5 suites, 2h 0m 3s ⏱️ — results for commit 6bda2af.
LocalStack Community integration with Pro: 2 files ±0, 2 suites ±0, 1h 33m 37s ⏱️ (-31m 4s) — results for commit 6bda2af, compared against base commit 58d61f3. This pull request removes 1436 tests and adds 1. Note that renamed tests count towards both.
Failing tests (in APIGW) are unrelated and are being handled here: #13752
joe4dev
left a comment
We need to fix some breaking changes and race conditions before merging this PR. See PR comments for details and suggestions.
Test coverage is incomplete for lambda.DeleteFunction (see TODOs in the lambda provider), so we need to be cautious not to break things and cannot rely on CI to catch regressions.
    self, function: Function, version: FunctionVersion, qualifier: str
):
    def _cleanup(function: Function, qualifier: str):
        time.sleep(0.1)
docs: Can we clearly document the motivation for using a sleep here?
For example: Simulate async function version cleanup
Could we clarify the async nature in the function name? (e.g., delete_function_async suffix)
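One way to combine both suggestions — a documented sleep and an async-suffixed name — could look like the following sketch. The helper name, the registry dict, and the key layout are illustrative stand-ins, not LocalStack's actual internals:

```python
import threading
import time

def delete_function_version_async(
    functions: dict, function_name: str, qualifier: str, delay: float = 0.5
) -> threading.Thread:
    """Remove a function version from the (illustrative) store on a background
    thread. The sleep simulates AWS's asynchronous version cleanup, during
    which the version is still retrievable but no longer invokable."""

    def _cleanup() -> None:
        # Simulate async function version cleanup latency observed on AWS.
        time.sleep(delay)
        functions.pop((function_name, qualifier), None)

    thread = threading.Thread(
        target=_cleanup, name=f"lambda-cleanup-{function_name}", daemon=True
    )
    thread.start()
    return thread
```

A caller can join the returned thread in tests to deterministically wait for the simulated cleanup to finish instead of sleeping again.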
    self, function: Function, version: FunctionVersion, qualifier: str
):
    def _cleanup(function: Function, qualifier: str):
        time.sleep(0.1)
race condition: What's an appropriate sleep time to mitigate flakes in CI (if 2× get_function in the test takes longer than this)?
I wonder whether we should bump this slightly (e.g., 0.5) to reduce the likelihood of flakiness, given that 0.01 consistently fails and 10x deviation in CI could happen 🤷
I thought 0.1s was okay, but with a longer sleep we're safer; I'll bump it to 0.5 in case CI operations are slower.
Increased it to 0.5 ✔️
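An alternative to tuning a fixed sleep on the test side is a poll-with-timeout helper — a generic sketch, not LocalStack's existing retry utility — which tolerates slow CI without inflating the common-case wait:

```python
import time
from typing import Callable

def wait_until(
    predicate: Callable[[], bool], timeout: float = 5.0, interval: float = 0.05
) -> bool:
    """Poll `predicate` until it returns True or `timeout` elapses.
    Returns the final predicate result, so a flaky-slow CI run only
    costs extra polling, not a hard failure at a fixed sleep boundary."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()
```

A test could then assert `wait_until(lambda: function_is_gone(...))` instead of sleeping longer than the provider-side cleanup delay.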
localstack-core/localstack/services/lambda_/invocation/lambda_service.py (outdated, resolved)
    # TODO: introduce locking for safe deletion: We could create a new version at the API layer before
    # the old version gets cleaned up in the internal lambda service.
    - function = store.functions.pop(function_name)
    + function = store.functions.get(function_name)
blocking: This function removal is not properly mapped to LMI (Lambda Managed Instances) and introduces several problems:
- Why do we map this function-level removal to a loop that removes the function multiple times, once for each existing function version?
- Why do we switch the order of removing the function from the store and stopping the service? This leads to a race condition in which the function is visible at the API level while being destroyed!
- (minor) Why do we delete each function version individually when we can remove the top-level function once? The top-level function contains the versions.
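The ordering this comment asks for could be sketched as follows, with simplified stand-ins for the store and the lambda service (not the actual LocalStack code): pop the top-level function once so it disappears from the API layer first, then stop each published version.

```python
from dataclasses import dataclass, field

@dataclass
class FakeFunction:
    versions: dict = field(default_factory=dict)  # qualifier -> qualified ARN

def delete_function(store: dict, lambda_service, function_name: str) -> None:
    # Remove the function from the store first, so it is no longer visible
    # at the API level while its versions are being destroyed...
    function = store.pop(function_name, None)
    if function is None:
        return
    # ...then stop each published version once. $LATEST is skipped because
    # only published versions are invokable with managed instances and no
    # version manager exists for $LATEST.
    for qualifier, arn in function.versions.items():
        if qualifier != "$LATEST":
            lambda_service.stop_version(arn)
```

This avoids both the per-version re-removal loop and the window in which a half-destroyed function can still be retrieved.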
    # published versions are invokable.
    if version.config.capacity_provider_config:
        function_has_capacity_provider = True
        if version.id.qualifier == "$LATEST":
blocking: This useless condition indicates that we're missing something related to the delete lifecycle
The current implementation does NOT stop any Lambda containers spawned by Lambda Managed Instances because we don't invoke stop_version. We can test this by setting a breakpoint at the end of the testcase and validating that all Lambda containers are gone. The following code fixes this:
    # Stop all function versions except $LATEST because only published versions are invokable with
    # Lambda Managed Instances, and we do NOT want to create a version manager for $LATEST.
    if version.id.qualifier != "$LATEST":
        self.lambda_service.stop_version(qualified_arn=version.id.qualified_arn())
I checked that containers were removed in my local implementation, but then I refactored the code locally a few times and forgot to recheck whether any Docker containers were left; now I see that they're not cleaned up properly.
Thanks for flagging this, will fix it 🤞🏼
    else:
        function.versions.pop(qualifier, None)
    self.lambda_service.stop_version(version.id.qualified_arn())
    destroy_code_if_not_used(code=version.config.code, function=function)
question: Does destroy_code_if_not_used work with simulated async store deletion? The same question applies to version.config.code.destroy() below.
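The concern can be illustrated with a small refcount-style sketch (hypothetical `FakeCode`/`FakeVersion` types, not LocalStack's actual classes): the code artifact may be shared across versions, so it should be destroyed only once no remaining version references it — which matters when versions disappear asynchronously.

```python
from dataclasses import dataclass

@dataclass
class FakeCode:
    id: str
    destroyed: bool = False

    def destroy(self) -> None:
        self.destroyed = True

@dataclass
class FakeVersion:
    code: FakeCode

def destroy_code_if_not_used(code: FakeCode, remaining_versions: list) -> None:
    # If the async cleanup has already removed every version referencing this
    # code from the store, it is safe to destroy; otherwise keep it for the
    # versions still using it.
    if not any(v.code.id == code.id for v in remaining_versions):
        code.destroy()
```

Under simulated async deletion, the answer hinges on whether `remaining_versions` is read before or after the delayed store removal runs.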
Motivation
Support async deletion for Lambda Managed Instances functions.
Deleting a Lambda Managed Instances function returns a 202 response, which means the deletion is handled asynchronously: the function can no longer be invoked, but it can still be retrieved.
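A minimal sketch of the described behavior — assumed handler shape and names, not LocalStack's actual provider code: DeleteFunction returns 202 immediately, marks the function as Deleting with last update status InProgress, and finishes the removal on a background thread, so GetFunction still succeeds until cleanup completes.

```python
import threading

class FakeLambdaProvider:
    def __init__(self):
        self.functions = {"my-fn": {"State": "Active"}}
        self._cleanup = None

    def delete_function(self, name: str) -> int:
        fn = self.functions[name]
        # Mark the function as being deleted; it stays retrievable meanwhile.
        fn["State"] = "Deleting"
        fn["LastUpdateStatus"] = "InProgress"
        # Finish the actual removal asynchronously.
        self._cleanup = threading.Thread(
            target=self.functions.pop, args=(name, None), daemon=True
        )
        self._cleanup.start()
        return 202  # Accepted: deletion continues in the background
```

Joining `_cleanup` in a test deterministically waits for the background removal.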
Changes
If the function is of the lambda managed instances type:
- Set the function state to Deleting and the last update status to InProgress while the function is being deleted.

Tests
Related
Part of DRG-157