base: 0.12
Add MongoDB Adapter #509
Conversation
```php
final class ClientWrapper
{
    public function __construct(
        private readonly Client $client,
        private readonly string $databaseName,
    ) {
    }

    public function getClient(): Client
    {
        return $this->client;
    }

    public function getDatabase(): Database
    {
        return $this->client->getDatabase($this->databaseName);
    }
}
```
I'm not sure about this wrapper class. I didn't want the services to always require

private readonly Client $client,
private readonly string $databaseName,

and I wanted a handy getDatabase method to get the currently configured database for this "Adapter" instance.

I'm also not sure whether my DSN will conflict with MongoDB URI parsing; I'm using the DSN path part to select the database:
search/packages/seal-mongodb-adapter/src/MongoDBAdapterFactory.php, lines 75 to 85 in 903fe74:
```php
\assert(!isset($dsn['path']), 'A selected database is required.');
$databaseName = \ltrim($dsn['path'] ?? 'default', '/');

if (isset($dsn['query'])) {
    $dsnUri .= '?' . \http_build_query($dsn['query']);
}

if (isset($dsn['fragment'])) {
    $dsnUri .= '#' . $dsn['fragment'];
}

return new ClientWrapper(new Client($dsnUri), $databaseName);
```
The default database can be specified in the "path" part of the DSN: https://www.mongodb.com/docs/manual/reference/connection-string/#connection-string-database-options

The MongoDB\Client class doesn't have the notion of a default database; you have to extract the default database name from the DSN (or other options).
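For illustration, a naive way to pull the database name out of a single-host URI path with PHP's parse_url; as noted later in this thread, this breaks down for multi-host and mongodb+srv connection strings, so take it as a sketch only:

```php
// Works only for simple single-host URIs; multi-host lists and SRV lookups
// are not handled by parse_url().
$uri = 'mongodb://localhost:27017/my_database?retryWrites=true';
$databaseName = \ltrim((string) \parse_url($uri, PHP_URL_PATH), '/') ?: 'default';
// $databaseName === 'my_database'
```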
I would try to avoid extracting the database name from the connection string - it would require you to parse the connection string and thus be conformant with the above specification. The default database was used by some tools, but the MongoDB drivers don't leverage it for anything other than authentication.
Force-pushed from 490dcfb to 4326138.
```php
$filter instanceof Condition\SearchCondition => $filterQueries[]['$search'] = [
    'index' => $index->name . '-search-index',
    'text' => [
        'query' => $filter->query,
        'path' => $index->searchableFields,
    ],
],
```
I created a quick prototype, but where I'm stuck is with the query. I implemented the basic filtering (=, !=, <, <=, >, >=, IN and NOT IN), but I'm stuck on how to combine this with the search. Maybe you have some hint for me :)

Is there a way to combine the search and filter functionality in collection->find?
Search using the Lucene index can only be done with the $search stage; it must be the first stage of the aggregation pipeline.

The text query operator doesn't use the Lucene index but the specific "text" index that existed in MongoDB server long before the introduction of Atlas Search.

In order to combine Lucene-based search criteria with basic filters, you have to use the compound search operator.

Is there a way to combine the search and filters functionality in collection->find?

You must use $collection->aggregate($pipeline) with a $search stage first. Atlas Search is incompatible with $collection->find($query).
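A hedged sketch of such a pipeline built as plain arrays; the index name, field names, and values are made up, and only the compound/text/equals/range operator shapes follow the Atlas Search documentation:

```php
// Illustrative only: combine a full-text query with basic filters in one
// $search stage using the compound operator (names and values are invented).
$pipeline = [
    [
        '$search' => [
            'index' => 'articles-search-index',
            'compound' => [
                // "must" clauses contribute to the relevance score
                'must' => [
                    ['text' => ['query' => 'mongodb', 'path' => ['title', 'body']]],
                ],
                // "filter" clauses restrict results without affecting the score
                'filter' => [
                    ['equals' => ['path' => 'published', 'value' => true]],
                    ['range' => ['path' => 'rating', 'gte' => 3]],
                ],
            ],
        ],
    ],
    ['$limit' => 10],
];

// $collection->aggregate($pipeline);
```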
I see, so the search has its own completely separate API via the aggregate method. Do you think it makes sense for SEAL to then support find for basic search, or to always go through the aggregate method with Atlas Search? Would aggregate also work if I only have filters, without a search query?
When there are only basic filters, it might be more performant to use find($query) or an aggregation with aggregate([['$match' => $query]]); both will behave the same. But I would start with an implementation that uses only the $search stage and operators, as this is the target of this package.
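A minimal sketch of the filters-only case (field names are illustrative):

```php
// The same filter document works for both APIs when no $search is needed.
$query = [
    'category' => ['$in' => ['news', 'blog']],
    'rating'   => ['$gte' => 3],
];

// Option 1: $collection->find($query);
// Option 2: $collection->aggregate([['$match' => $query]]);
$pipeline = [['$match' => $query]];
```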
I will add that if you require the latest version of the MongoDB driver, you will have access to the aggregation pipeline builder, which gives you a programmatic API to assemble a $search stage (and any other stages). For examples, check out the tests for SearchStage and CompoundOperator. In essence, these would allow you to generate the correct operator based on the various conditions, then combine them all in a single must array for the compound operator and pass that to Stage::search.
It's off to a great start! I've tried to answer your questions.
```php
private readonly SearcherInterface $searcher;

public function __construct(
    ClientWrapper $client,
```
You can inject the MongoDB\Database object instead; instantiation does not open the connection.
I like the idea; if I can do everything over the database, I think that would be the best option.
```php
{
    // $searchDefinition = $this->createPropertiesMapping($index->fields);

    $this->client->getDatabase()->createCollection($index->name); // TODO check with mongodb team how we could add stricter schema / validator
```
You can set the validator option with a $jsonSchema. Example:

```php
$database->createCollection('planes', [
    'validator' => [
        '$jsonSchema' => [
            'bsonType' => 'object',
            'required' => ['_id', 'name', 'created_at'],
            'properties' => [
                '_id' => ['bsonType' => 'objectId'],
                'name' => ['bsonType' => 'string'],
                'created_at' => ['bsonType' => 'date'],
                'seats' => [
                    'bsonType' => 'array',
                    'items' => [
                        'bsonType' => 'object',
                        'required' => ['_id', 'number', 'row', 'column'],
                        'additionalProperties' => false,
                        'properties' => [
                            '_id' => ['bsonType' => 'objectId'],
                            'number' => ['bsonType' => 'int'],
                            'row' => ['bsonType' => 'int'],
                            'column' => ['bsonType' => 'int'],
                        ],
                    ],
                ],
            ],
            'additionalProperties' => false,
        ],
    ],
]);
```
```php
    return null;
}

return new SyncTask(null); // TODO wait for index drop
```
Here is the code I use to query the server until the index is ready: https://github.com/mongodb/mongo-php-library/blob/ae6a3a53d507fe38e447dfde4e77c8efe5724925/tests/SpecTests/SearchIndexSpecTest.php#L54
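The linked helper essentially polls the index list until the index reports itself queryable. A rough, hedged sketch of that loop; the callable stands in for a real driver call like $collection->listSearchIndexes(['name' => $name]), and the queryable field name follows the linked test:

```php
// Poll a search-index listing until one entry reports queryable, or give up
// after $maxAttempts rounds. $listIndexes is a stand-in for the driver call.
function waitUntilQueryable(callable $listIndexes, int $maxAttempts = 60): bool
{
    for ($attempt = 0; $attempt < $maxAttempts; ++$attempt) {
        foreach ($listIndexes() as $index) {
            if ($index['queryable'] ?? false) {
                return true;
            }
        }

        \usleep(100_000); // back off before polling again
    }

    return false;
}
```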
@GromNaN Thank you very much, that helps. I will keep you up to date and let you know. I only added one question here: #509 (comment). The other parts, I think, are now a matter of doing and trying out.
```php
 * fragment?: string,
 * } $dsn
 */
public function createClient(array $dsn): ClientWrapper
```
I would avoid creating the connection string from parts. Since MongoDB drivers support connecting to a range of different topologies, the DSN can't easily be split into the parts you used above. For example, when using a replica set or sharded cluster, you don't have a single host:port pair, but rather multiple, with each server using a different port:

mongodb://host-a:12345,host-b:23456

Authentication is also not always done using user and password, but also through X.509 and other authentication mechanisms. And to top it off, the Initial DNS Seedlist Discovery specification allows you to specify hosts and connection options through DNS entries which the driver will then look up; this is where the mongodb+srv:// scheme comes in. That also complicates extracting the database name (see earlier comment), as any connection options can be specified in a TXT record.

Long story short: require users to pass you a connection string and let them deal with the rest.
```php
{
    $identifierField = $index->getIdentifierField();

    foreach (BulkHelper::splitBulk($saveDocuments, $bulkSize) as $bulkSaveDocuments) {
```
While the API is a little cumbersome due to it requiring array structures, the Bulk Write API allows you to combine multiple different write operations (such as insertOne, replaceOne, deleteMany) into a single write. The driver will then take care of batching these writes to not exceed the maximum OP_MSG message size. By creating an ordered bulk write (which is the default), writes will stop after the first error; an unordered write will continue writing despite errors but also without guaranteeing that operations are executed in the specified order.
While the API is a little cumbersome due to it requiring array structures

You mean the documents themselves? They are later targeted to be mapped to objects in #81, but currently you can use any normalizer/denormalizer to get the objects into that structure. The bulk parameters themselves support any kind of iterable; I mostly recommend using \Generators or something similar that keeps memory usage low when reindexing all documents.

The driver will then take care of batching these writes to not exceed the maximum OP_MSG message size.

That sounds quite interesting. Maybe the better approach in my case would be to pass the iterables to the underlying layer for as long as possible and, as long as no own bulk size is given, let the underlying client of that search engine take care of batching.

an unordered write will continue writing despite errors but also without guaranteeing that operations are executed in the specified order.

Thanks for the insights here 🙏.
You mean the documents themselves?

I was talking about the bulk write API in the MongoDB driver; it requires an array of operations like this:
```php
[
    [ 'deleteMany' => [ $filter ] ],
    [ 'deleteOne'  => [ $filter ] ],
    [ 'insertOne'  => [ $document ] ],
    [ 'replaceOne' => [ $filter, $replacement, $options ] ],
    [ 'updateMany' => [ $filter, $update, $options ] ],
    [ 'updateOne'  => [ $filter, $update, $options ] ],
]
```
As for batching and such, you would create a single bulk write containing all operations (insertions and deletions) and then send that to the server. When assembling the actual bulk write command, the driver will automatically split the write into batches if it exceeds the maximum size supported by MongoDB. While this size is dependent on information returned in the MongoDB handshake, it currently supports 100k operations in a single batch, with the message size limited to 48 MB. Considering this, you can assume that most cases will only need a single round trip and no batching.
If you were to go for that option, creating the operations is basically an array_map returning the corresponding insertOne or deleteMany operation, then combining both lists and passing that to the Collection::bulkWrite() method.
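A sketch of that mapping; SEAL's actual document shape and the _id identifier field are assumptions here:

```php
// Hypothetical inputs: documents to upsert and identifiers to delete.
$saveDocuments = [
    ['_id' => 'doc-1', 'title' => 'First'],
    ['_id' => 'doc-2', 'title' => 'Second'],
];
$deleteIdentifiers = ['doc-3', 'doc-4'];

// Map each input to the corresponding bulk-write operation shape.
$operations = [
    ...\array_map(
        static fn (array $document) => ['replaceOne' => [['_id' => $document['_id']], $document, ['upsert' => true]]],
        $saveDocuments,
    ),
    ...\array_map(
        static fn (string $id) => ['deleteOne' => [['_id' => $id]]],
        $deleteIdentifiers,
    ),
];

// $collection->bulkWrite($operations); // ordered bulk write by default
```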
Force-pushed from aa0d258 to 16b4d15.
This PR is streamlining a first prototype to use MongoDB as a Search Adapter.
ToDo
Test this out: