-
Notifications
You must be signed in to change notification settings - Fork 92
BSQIK/Search: Indexing #4422
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
BSQIK/Search: Indexing #4422
Changes from all commits
Commits
Show all changes
150 commits
Select commit
Hold shift + click to select a range
5189190
index pkg entries
sir-sigurd 9792c46
CI
sir-sigurd d6bc135
lint
sir-sigurd fb0822c
fix decorator
sir-sigurd ef11bcb
use non-deprecated name
sir-sigurd bd11a04
do not index dir entries & add debugging
sir-sigurd 25d6b5a
use prefix for pkg fields, parse pk
sir-sigurd 69666d4
use hash for ids
sir-sigurd 9ce314a
pointers as child docs, is_packaged field
sir-sigurd 6136100
fix pkg_stats name
sir-sigurd 99dd238
change fields names, fixes
sir-sigurd ec35e53
fix
sir-sigurd ef2b90f
fixes
sir-sigurd 6708669
ignore manifests with invalid hashes
sir-sigurd 941b664
fix total_files
sir-sigurd 28e42d9
attempt to fix parent ids
nl0 bc0a58a
fix check
sir-sigurd b6cb8ef
do not store more than CHUNK_LIMIT_DOCS
sir-sigurd 5354113
log errors
sir-sigurd 4060819
small adjustments
sir-sigurd 49616db
try http_compress
sir-sigurd 3f02276
disable http_compress, increase CHUNK_LIMIT_DOCS
sir-sigurd 9d3de8e
fix, increase CHUNK_LIMIT_DOCS
sir-sigurd 561aa46
increase CHUNK_LIMIT_DOCS
sir-sigurd 375b1eb
change log level
sir-sigurd 3a16832
rework indexing
sir-sigurd c065497
fix
sir-sigurd 0fc4c33
fix
sir-sigurd 0368b10
fix
sir-sigurd acee91b
fix
sir-sigurd a493eda
fix
sir-sigurd c132d68
fix
sir-sigurd 6821dda
fix
sir-sigurd 7b50bcf
fix
sir-sigurd f565614
fix
sir-sigurd 4c93b73
fix
sir-sigurd 267563d
Revert "fix"
sir-sigurd f3426b4
fix
sir-sigurd 70781c9
fix
sir-sigurd 9b7b34f
fix
sir-sigurd e8acfcf
delete objects
sir-sigurd 43140cb
test
sir-sigurd 237f5c8
error handling
sir-sigurd 05f4ae0
log retries
sir-sigurd 11a66d4
wait more
sir-sigurd e7f3067
adjust backoff
sir-sigurd ea50e64
more aggresive backoff
sir-sigurd f79ba13
more aggresive backoff
sir-sigurd 8358ed6
remove unneeded
sir-sigurd 5cb85f5
sleep after slow responses
sir-sigurd 9fca19b
more agressive backoff
sir-sigurd 58119ef
more aggressive backoff
sir-sigurd 68c59e7
more aggressive backoff
sir-sigurd 720abb8
more aggressive backoff
sir-sigurd e43a5c6
more aggressive backoff
sir-sigurd 4d4cd33
more
sir-sigurd 61f5702
more
sir-sigurd 91e9045
tmp
sir-sigurd 6a01419
tmp
sir-sigurd ed48279
tmp
sir-sigurd 7121784
tmp
sir-sigurd bc9c170
tmp
sir-sigurd 431be7b
more work
sir-sigurd bdb01ab
remove unused stuff
sir-sigurd 0f9ce90
remove fcsparser test data
sir-sigurd 2c47d9f
Revert "remove fcsparser test data"
sir-sigurd 1d4ea7e
try to build
sir-sigurd 5ea51a0
fix path
sir-sigurd 6ad380a
try to fix
sir-sigurd 899a351
fix
sir-sigurd a62aadc
try to fix
sir-sigurd 969d178
add test deps
sir-sigurd b0bd132
correct step name
sir-sigurd 2f2938a
do not deploy thumnail
sir-sigurd 74aa760
Merge branch 'master' into index-pkg-entries-2
sir-sigurd 3dc3b41
isort
sir-sigurd 3437f4f
fix some linting
sir-sigurd 0209922
fix logging
sir-sigurd b67840d
add pytest-cov
sir-sigurd 62f7efc
Merge branch 'master' into index-pkg-entries-2
sir-sigurd 01b449f
revert CHUNK_LIMIT_DOCS
sir-sigurd e4f3c42
adjust get_es_client
sir-sigurd a286699
use updated quilt_shared.es
sir-sigurd 2e0491a
polish/revert
sir-sigurd a4abc4c
update requirements.txt
sir-sigurd f9abe14
try to fix
sir-sigurd ee8b1a2
add prefix
sir-sigurd 5bb1066
update lock
sir-sigurd 11a0442
fix
sir-sigurd 3f0047a
fix dep
sir-sigurd 88ff687
move pytest-cov to test group
sir-sigurd ccd193d
cleanup
sir-sigurd 25812f2
use orjson
sir-sigurd 5edd631
bump py-shared
sir-sigurd 03e138d
switch to orjson
sir-sigurd 9be1c64
cleanup
sir-sigurd 5a1856f
update requirements.txt
sir-sigurd 97c9a25
fix some tests
sir-sigurd afdcf31
cleanup
sir-sigurd 5252337
cleanup
sir-sigurd 2f1119a
add test
sir-sigurd c4a35b5
remove log from quilt-shared
sir-sigurd 573dc86
fix tests
sir-sigurd 301d92a
cleanup
sir-sigurd eca588c
cleanup
sir-sigurd 0e1f231
cleanup
sir-sigurd bb9fef7
add some tests
sir-sigurd a06b308
isort
sir-sigurd 9569aab
remove duplicated test
sir-sigurd 8c9b804
remove readmes
sir-sigurd 88cf525
more tests
sir-sigurd f86ccd0
fix handling no version id
sir-sigurd 59fded7
update and more tests
sir-sigurd 4d0b6fd
use mocker
sir-sigurd d2d0b17
supress linter
sir-sigurd 9c16074
remove leftovers
sir-sigurd 5acf8d9
remove more leftovers
sir-sigurd 9d6f6f6
linters
sir-sigurd 3c2f213
add missing stats
sir-sigurd 8adb4b8
fix test
sir-sigurd 947c798
update base images
sir-sigurd dd9ab72
refresh lock
sir-sigurd 02782d4
remove dummy description
sir-sigurd 7611ee2
fix normalize_object_version_id
sir-sigurd e916221
bump
sir-sigurd 67d4575
fix pointer deletion
sir-sigurd 300c876
fix comment
sir-sigurd 6abb662
remove usage of old name
sir-sigurd e35f475
send manifest event from indexer to manifest indexer queue
sir-sigurd 1466340
fix
sir-sigurd cdeb9de
add timeout
sir-sigurd 69333fc
fix
sir-sigurd a547751
fix tests
sir-sigurd c35a5bc
rework batcher
sir-sigurd a6e790a
use update quilt-shared
sir-sigurd 76a002f
use const
sir-sigurd 0251b9a
add changelogs
sir-sigurd f900b26
add changelog stuf for py-shared
sir-sigurd 2fa4910
indexer changelog
sir-sigurd 839064c
Merge branch 'master' into index-pkg-entries-2
sir-sigurd 103bc82
update hash
sir-sigurd 7a0e367
update locks
sir-sigurd 5443f91
revert a bit
sir-sigurd a3c8a5e
add comment
sir-sigurd 9801de7
revert CI
sir-sigurd 11a6268
update some comments
sir-sigurd 3a2ab47
adjust test
sir-sigurd bd63b07
handle versioned objects
sir-sigurd e8731ac
Revert "revert CI"
sir-sigurd 697fd2e
Reapply "revert CI"
sir-sigurd File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
3.11 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
<!-- markdownlint-disable line-length --> | ||
# Changelog | ||
|
||
Changes are listed in reverse chronological order (newer entries at the top). | ||
The entry format is | ||
|
||
```markdown | ||
- [Verb] Change description ([#<PR-number>](https://github.com/quiltdata/quilt/pull/<PR-number>)) | ||
``` | ||
|
||
where verb is one of | ||
|
||
- Removed | ||
- Added | ||
- Fixed | ||
- Changed | ||
|
||
## Changes | ||
|
||
- [Added] Bootstrap the change log ([#4422](https://github.com/quiltdata/quilt/pull/4422)) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
[project] | ||
name = "t4_lambda_es_ingest" | ||
version = "0.1.0" | ||
authors = [ | ||
{ name = "Sergey Fedoseev", email = "[email protected]" } | ||
] | ||
requires-python = ">=3.11" | ||
dependencies = [ | ||
"quilt-shared[boto,es]", | ||
"t4-lambda-shared", | ||
] | ||
|
||
[build-system] | ||
requires = ["hatchling"] | ||
build-backend = "hatchling.build" | ||
|
||
[tool.uv.sources] | ||
t4-lambda-shared = { url = "https://github.com/quiltdata/quilt/archive/d496dffbfb4b7a2ae05f6c1f7f0cb7d5d43bc984.zip", subdirectory = "lambdas/shared" } | ||
quilt-shared = { url = "https://github.com/quiltdata/quilt/archive/df53c9ce125ea051e0d1ac41d58796336e202256.zip", subdirectory = "py-shared" } | ||
|
||
[dependency-groups] | ||
test = [ | ||
"pytest~=8.4", | ||
"pytest-cov~=6.2", | ||
"pytest-env~=1.1", | ||
"pytest-mock~=3.14", | ||
] | ||
|
||
[tool.uv] | ||
default-groups = ["test"] |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
[pytest] | ||
env = | ||
ES_ENDPOINT = http://localhost:9200 | ||
AWS_ACCESS_KEY_ID = test_key | ||
AWS_SECRET_ACCESS_KEY = test_secret | ||
AWS_DEFAULT_REGION = us-east-1 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,90 @@ | ||
import json | ||
import os | ||
import random | ||
import time | ||
|
||
import boto3 | ||
import elasticsearch | ||
|
||
from quilt_shared.es import make_elastic | ||
from t4_lambda_shared.utils import get_quilt_logger | ||
|
||
s3_client = boto3.client("s3") | ||
es = make_elastic(os.environ["ES_ENDPOINT"]) | ||
logger = get_quilt_logger() | ||
|
||
|
||
EXPECTED_ES_RESPONSE_TIME = 10 # seconds | ||
|
||
|
||
class BulkDocumentError(Exception): | ||
pass | ||
|
||
|
||
class TooManyRequestsError(Exception): | ||
pass | ||
|
||
|
||
def sleep_until_timeout(context): | ||
"""Sleep until the lambda timeout""" | ||
remaining = context.get_remaining_time_in_millis() / 1000 - 1 | ||
logger.warning("Sleeping for %s seconds just before lambda timeout", remaining) | ||
# good night, sweet prince | ||
time.sleep(remaining) | ||
|
||
|
||
def bulk(context, es, data: bytes): | ||
t0 = time.time() | ||
try: | ||
resp = es.bulk( | ||
data, | ||
filter_path="errors", | ||
# wait as much as possible because it's better die trying than just die | ||
# leave a second to avoid lambda timeout | ||
request_timeout=context.get_remaining_time_in_millis() / 1000 - 1, | ||
) | ||
except elasticsearch.exceptions.TransportError as e: | ||
if e.status_code == 429: | ||
# at this point ES seems to be *very* overloaded, so we just sleep until lambda timeout | ||
logger.warning("Got a 429 Too Many Requests error, sleeping until lambda timeout") | ||
sleep_until_timeout(context) | ||
raise TooManyRequestsError | ||
raise | ||
|
||
t1 = time.time() | ||
delta = t1 - t0 | ||
logger.info("Bulk request took %s seconds", delta) | ||
overtime = delta - EXPECTED_ES_RESPONSE_TIME | ||
if overtime > 0: | ||
# if the request took so long ES seems to be overloaded, so it's better to sleep | ||
# now to avoid 429 Too Many Requests error later causing lambda failure and retry | ||
time_to_sleep = min( | ||
random.uniform(overtime / 2, overtime) + 15, | ||
context.get_remaining_time_in_millis() / 1000 - 1, | ||
) | ||
sir-sigurd marked this conversation as resolved.
Show resolved
Hide resolved
|
||
logger.warning("Sleeping for %s seconds to avoid ES overload", time_to_sleep) | ||
time.sleep(time_to_sleep) | ||
if resp["errors"]: | ||
# TODO: log errors from items.*.error? | ||
# TODO: ignore index_not_found_exception for delete operations? | ||
sir-sigurd marked this conversation as resolved.
Show resolved
Hide resolved
|
||
raise BulkDocumentError | ||
sir-sigurd marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
|
||
def handler(event, context): | ||
logger.debug("Invoked with event: %s", event) | ||
assert len(event["Records"]) == 1, "Expected exactly on SQS message" | ||
(event,) = event["Records"] | ||
event = json.loads(event["body"]) | ||
assert len(event["Records"]) == 1, "Expected exactly one S3 event record" | ||
(event,) = event["Records"] | ||
|
||
bucket = event["s3"]["bucket"]["name"] | ||
key = event["s3"]["object"]["key"] | ||
version_id = event["s3"]["object"].get("versionId") | ||
params = {"Bucket": bucket, "Key": key} | ||
if version_id: | ||
params["VersionId"] = version_id | ||
|
||
data = s3_client.get_object(**params)["Body"].read() | ||
bulk(context, es, data) | ||
s3_client.delete_object(**params) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,91 @@ | ||
import json | ||
|
||
import elasticsearch | ||
import pytest | ||
from botocore.stub import Stubber | ||
|
||
import t4_lambda_es_ingest | ||
|
||
|
||
def test_bulk_error(mocker): | ||
mock_bulk = mocker.patch("elasticsearch.Elasticsearch.bulk", return_value={"errors": True}) | ||
mock_context = mocker.MagicMock() | ||
with pytest.raises(t4_lambda_es_ingest.BulkDocumentError): | ||
t4_lambda_es_ingest.bulk(mock_context, t4_lambda_es_ingest.es, b"data") | ||
|
||
mock_bulk.assert_called_once_with( | ||
b"data", | ||
filter_path=mocker.ANY, | ||
request_timeout=mocker.ANY, | ||
) | ||
|
||
|
||
def test_bulk_too_many_requests(mocker): | ||
mocker.patch("elasticsearch.exceptions.TransportError.status_code", 429) | ||
sir-sigurd marked this conversation as resolved.
Show resolved
Hide resolved
|
||
mock_bulk = mocker.patch("elasticsearch.Elasticsearch.bulk", side_effect=elasticsearch.exceptions.TransportError) | ||
mock_context = mocker.MagicMock() | ||
mock_sleep_until_timeout = mocker.patch("t4_lambda_es_ingest.sleep_until_timeout") | ||
|
||
with pytest.raises(t4_lambda_es_ingest.TooManyRequestsError): | ||
t4_lambda_es_ingest.bulk(mock_context, t4_lambda_es_ingest.es, b"data") | ||
|
||
mock_bulk.assert_called_once_with( | ||
b"data", | ||
filter_path=mocker.ANY, | ||
request_timeout=mocker.ANY, | ||
) | ||
mock_sleep_until_timeout.assert_called_once_with(mock_context) | ||
|
||
|
||
@pytest.mark.parametrize("version_id", ["test-version-id", None]) | ||
def test_handler(mocker, version_id): | ||
mock_context = mocker.MagicMock() | ||
s3_record = { | ||
"s3": { | ||
"bucket": {"name": "test-bucket"}, | ||
"object": { | ||
"key": "test-key", | ||
}, | ||
} | ||
} | ||
if version_id: | ||
s3_record["s3"]["object"]["versionId"] = version_id | ||
|
||
mock_event = { | ||
"Records": [ | ||
{"body": json.dumps({"Records": [s3_record]})}, | ||
] | ||
} | ||
|
||
with Stubber(t4_lambda_es_ingest.s3_client) as stubber: | ||
mock_bulk = mocker.patch("t4_lambda_es_ingest.bulk") | ||
get_object_params = {"Bucket": "test-bucket", "Key": "test-key"} | ||
if version_id: | ||
get_object_params["VersionId"] = version_id | ||
|
||
stubber.add_response( | ||
"get_object", | ||
{ | ||
"Body": mocker.MagicMock(read=lambda: b"test data"), | ||
"LastModified": "2023-10-01T00:00:00Z", | ||
}, | ||
get_object_params, | ||
) | ||
stubber.add_response( | ||
"delete_object", | ||
{}, | ||
( | ||
{"Bucket": "test-bucket", "Key": "test-key"} | ||
if version_id is None | ||
else { | ||
"Bucket": "test-bucket", | ||
"Key": "test-key", | ||
"VersionId": version_id, | ||
} | ||
), | ||
) | ||
|
||
t4_lambda_es_ingest.handler(mock_event, mock_context) | ||
|
||
stubber.assert_no_pending_responses() | ||
mock_bulk.assert_called_once_with(mock_context, t4_lambda_es_ingest.es, b"test data") |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.