
Add MongoDB Adapter #509


Draft · wants to merge 1 commit into base: 0.12

Conversation


@alexander-schranz commented on Apr 4, 2025

This PR streamlines a first prototype for using MongoDB as a Search Adapter.

ToDo

  • Create CI Task
    • Replace pull_request with pull_request_target
  • Adapter
    • Indexer
    • Searcher
    • SchemaManager
  • Client Creation
    • Check Database selection
  • SchemaManager
    • Exist
    • Create
      • Validator for Document Schema?
      • Create Search Index
        • Strict Schema
    • Drop
  • Searcher
    • getDocument (by Identifier)
    • Search
    • Search with Highlighting
    • Limit and Offset
    • Equal Condition
    • Equal condition with boolean
    • Equal condition with special string
    • Multi Equal condition
    • Equal condition with search condition
    • Not equal condition
    • Greater than condition
    • Greater than equal condition
    • Greater than equal condition multi value
    • Less than condition
    • Less than equal condition
    • Geo distance condition
    • Geo bounding box condition
    • Less than equal condition multi value
    • In condition
    • Not in condition
    • Sort by asc
    • Sort by desc
    • Searching with nested and or conditions

Test this out:

git clone git@github.com:PHP-CMSIG/search.git
cd search
git remote add alex git@github.com:alexander-schranz/schranz-search.git
git fetch alex
git checkout feature/mongodb-adapter

cd packages/seal-mongodb-adapter
composer install

docker compose up
vendor/bin/phpunit

@alexander-schranz added the "Adapter: MongoDB" label on Apr 4, 2025
Comment on lines +19 to +36
final class ClientWrapper
{
    public function __construct(
        private readonly Client $client,
        private readonly string $databaseName,
    ) {
    }

    public function getClient(): Client
    {
        return $this->client;
    }

    public function getDatabase(): Database
    {
        return $this->client->getDatabase($this->databaseName);
    }
}

@alexander-schranz commented on Apr 4, 2025

I'm not sure about this wrapper class. I didn't want the services to always require:

        private readonly Client $client,
        private readonly string $databaseName,

and wanted a handy getDatabase() method to get the currently configured database for this "Adapter" instance.

I'm also not sure whether my DSN will conflict with MongoDB URI parsing; I'm using the DSN path part to select the database:

\assert(isset($dsn['path']), 'A selected database is required.');
$databaseName = \ltrim($dsn['path'] ?? 'default', '/');

if (isset($dsn['query'])) {
    $dsnUri .= '?' . \http_build_query($dsn['query']);
}

if (isset($dsn['fragment'])) {
    $dsnUri .= '#' . $dsn['fragment'];
}

return new ClientWrapper(new Client($dsnUri), $databaseName);


The default database can be specified in the "path" part of the DSN: https://www.mongodb.com/docs/manual/reference/connection-string/#connection-string-database-options

The MongoDB\Client class doesn't have the notion of default database, you have to extract the default database name from the DSN (or other options).


I would try to avoid extracting the database name from the connection string - it would require you to parse the connection string and thus be conformant with the above specification. The default database was used by some tools, but the MongoDB drivers don't leverage it for anything other than authentication.

@alexander-schranz force-pushed the feature/mongodb-adapter branch 5 times, most recently from 490dcfb to 4326138 on April 4, 2025 at 18:09
Comment on lines +178 to +184
$filter instanceof Condition\SearchCondition => $filterQueries[]['$search'] = [
    'index' => $index->name . '-search-index',
    'text' => [
        'query' => $filter->query,
        'path' => $index->searchableFields,
    ],
],

@alexander-schranz commented on Apr 4, 2025

Hello @GromNaN @alcaeus 👋,

I created a quick prototype, but where I'm stuck is the query. I implemented the basic filtering: =, !=, <, <=, >=, >, IN and NOT IN. But I'm stuck on how to combine this now with the search. Maybe you have a hint for me :)

Is there a way to combine the search and filters functionality in collection->find ?


Search using the Lucene index can only be done with the $search stage; it must be the first stage of the aggregation pipeline.
The text query operator doesn't use the Lucene index but the specific "text" index that has existed in the MongoDB server since long before the introduction of Atlas Search.

In order to combine Lucene-base search criteria with basic filters, you have to use the compound search operator.

Is there a way to combine the search and filters functionality in collection->find ?

You must use $collection->aggregate($pipeline) with a $search stage first. Atlas Search is incompatible with $collection->find($query).
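As a rough sketch of such a pipeline (the index name 'blog-search-index', the field names, and the values below are invented for illustration), the compound operator is what combines the full-text query with the basic filters:

```php
<?php

// Sketch only: index name, paths, and values are hypothetical.
// A $search stage must come first; basic conditions go into compound.filter,
// the full-text query into compound.must.
$pipeline = [
    [
        '$search' => [
            'index' => 'blog-search-index',
            'compound' => [
                'must' => [
                    ['text' => ['query' => 'mongodb', 'path' => ['title', 'body']]],
                ],
                'filter' => [
                    ['equals' => ['path' => 'published', 'value' => true]],
                    ['range' => ['path' => 'rating', 'gte' => 3]],
                ],
            ],
        ],
    ],
    ['$limit' => 10],
];

// $cursor = $collection->aggregate($pipeline);
```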

@alexander-schranz commented:

I see, so the search has its own completely separate API via the aggregate method. Do you think it makes sense for SEAL to then support both find and basic search, or should it always go through the aggregate method with Atlas Search? Would aggregate also work if I only have filters without a search query?


When there are only basic filters, it might be more performant to use find($query) or an aggregation with aggregate([['$match' => $query]]); both behave the same. But I would start with an implementation that uses only the $search stage and operators, as that is the target of this package.
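A minimal sketch of that equivalence (field names invented):

```php
<?php

// Hypothetical filter with basic conditions only; no $search stage needed.
$query = ['published' => true, 'rating' => ['$gte' => 3]];

// Simple query API:
// $cursor = $collection->find($query);

// The same filter as an aggregation pipeline:
// $cursor = $collection->aggregate([['$match' => $query]]);
```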


I will add that if you require the latest version of the MongoDB driver, you will have access to the aggregation pipeline builder which gives you a programmatic API to assemble a $search stage (and any other stages). For examples, check out the tests for SearchStage and CompoundOperator. In essence, these would allow you to generate the correct operator based on the various conditions, then combine them all in a single must array for the compound operator and pass that to Stage::search.

@GromNaN left a comment

It's off to a great start! I've tried to answer your questions.


private readonly SearcherInterface $searcher;

public function __construct(
ClientWrapper $client,

You can inject the MongoDB\Database object instead; instantiation does not open the connection.

@alexander-schranz commented:

I like the idea. If I can do everything via the Database object, I think that would be the best option.

{
// $searchDefinition = $this->createPropertiesMapping($index->fields);

$this->client->getDatabase()->createCollection($index->name); // TODO check with mongodb team how we could add stricter schema / validator

@GromNaN commented on Apr 7, 2025

You can set the validator option with a $jsonSchema. Example:

$database->createCollection('planes', [
    'validator' => [
        '$jsonSchema' => [
            'bsonType' => 'object',
            'required' => ['_id', 'name', 'created_at'],
            'properties' => [
                '_id' => ['bsonType' => 'objectId'],
                'name' => ['bsonType' => 'string'],
                'created_at' => ['bsonType' => 'date'],
                'seats' => [
                    'bsonType' => 'array',
                    'items' => [
                        'bsonType' => 'object',
                        'required' => ['_id', 'number', 'row', 'column'],
                        'additionalProperties' => false,
                        'properties' => [
                            '_id' => ['bsonType' => 'objectId'],
                            'number' => ['bsonType' => 'int'],
                            'row' => ['bsonType' => 'int'],
                            'column' => ['bsonType' => 'int'],
                        ],
                    ],
                ],
            ],
            'additionalProperties' => false,
        ],
    ],
]);


return null;
}

return new SyncTask(null); // TODO wait for index drop

@GromNaN commented on Apr 7, 2025

@alexander-schranz

@GromNaN Thank you very much, that helps. I will keep you up to date and let you know. I only added one question here: #509 (comment). The other parts, I think, are now more a matter of doing and trying things out.


* fragment?: string,
* } $dsn
*/
public function createClient(array $dsn): ClientWrapper

I would avoid creating the connection string from parts. Since MongoDB drivers support connecting to a range of different topologies, the DSN can't easily be split into the parts you used above.

For example, when using a replica set or sharded cluster, you don't have a single host:port pair, but rather multiples, with each server using different ports:
mongodb://host-a:12345,host-b:23456

Authentication is also not always done using user and password, but also through X509 and other authentication mechanisms. And to top it off, the Initial DNS Seedlist Discovery specification allows you to specify hosts and connection options through DNS entries which the driver will then look up - this is where the mongodb+srv:// scheme comes in. That also complicates extracting the database name (see earlier comment), as any connection options can be specified in a TXT record.

Long story short: require users to pass you a connection string and let them deal with the rest.
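A minimal sketch of that suggestion, assuming the database name is taken as a separate, explicit option (the function shape here is hypothetical, not part of the PR):

```php
<?php

use MongoDB\Client;
use MongoDB\Database;

// Hypothetical factory: the connection string is forwarded to the driver
// untouched, so replica-set host lists (mongodb://host-a:12345,host-b:23456)
// and mongodb+srv:// URIs keep working without any parsing on our side.
function createDatabase(string $connectionUri, string $databaseName): Database
{
    // Instantiating the Client does not open a connection yet.
    return (new Client($connectionUri))->getDatabase($databaseName);
}
```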

{
$identifierField = $index->getIdentifierField();

foreach (BulkHelper::splitBulk($saveDocuments, $bulkSize) as $bulkSaveDocuments) {

While the API is a little cumbersome due to it requiring array structures, the Bulk Write API allows you to combine multiple different write operations (such as insertOne, replaceOne, deleteMany) into a single write. The driver will then take care of batching these writes to not exceed the maximum OP_MSG message size. By creating an ordered bulk write (which is the default), writes will stop after the first error; an unordered write will continue writing despite errors but also without guaranteeing that operations are executed in the specified order.

@alexander-schranz commented:

While the API is a little cumbersome due to it requiring array structures

You mean the documents themselves? They are later targeted to be mapped to objects in #81, but currently you can use any normalizer/denormalizer to get the objects into that structure. The bulk parameters themselves support any kind of iterable; I mostly recommend using \Generators or similar to keep memory usage low when reindexing all documents.

The driver will then take care of batching these writes to not exceed the maximum OP_MSG message size.

That sounds quite interesting. Maybe it would be better in my case to pass the iterables down to the underlying layer for as long as possible, so that, as long as no explicit bulk size is given, the underlying client of that search engine takes care of batching.

an unordered write will continue writing despite errors but also without guaranteeing that operations are executed in the specified order.

Thx for the insights here 🙏 .


You mean the documents itself?

I was talking about the bulk write API in the MongoDB driver - it requires an array of operations like this:

[
    [ 'deleteMany' => [ $filter ] ],
    [ 'deleteOne'  => [ $filter ] ],
    [ 'insertOne'  => [ $document ] ],
    [ 'replaceOne' => [ $filter, $replacement, $options ] ],
    [ 'updateMany' => [ $filter, $update, $options ] ],
    [ 'updateOne'  => [ $filter, $update, $options ] ],
]

As for batching and such, you would create a single bulk write containing all operations (insertions and deletions) and then send that to the server. When assembling the actual bulk write command, the driver will automatically split the write into batches if it exceeds the maximum size supported by MongoDB. While this size is dependent on information returned in the MongoDB handshake, it currently supports 100k operations in a single batch, with the message size limited to 48 MB. Considering this, you can assume that most cases will only need a single round trip and no batching.
If you were to go for that option, creating the operations is basically an array_map returning the corresponding insertOne or deleteMany operation, then combining both lists and passing that to the Collection::bulkWrite() method.
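Under the assumptions above, and with invented input shapes (plain document arrays keyed by an identifier field; the helper name is hypothetical), assembling such a combined bulk write could be sketched as:

```php
<?php

// Sketch: build one ordered bulkWrite() payload from documents to save and
// identifiers to delete. The shapes of $saveDocuments and $deleteIdentifiers
// are assumptions for this example.
function buildBulkOperations(array $saveDocuments, array $deleteIdentifiers, string $identifierField): array
{
    $saves = \array_map(
        static fn (array $document) => ['replaceOne' => [
            [$identifierField => $document[$identifierField]], // filter by identifier
            $document,                                         // full replacement document
            ['upsert' => true],                                // insert when missing
        ]],
        $saveDocuments,
    );

    $deletes = \array_map(
        static fn (string $identifier) => ['deleteOne' => [[$identifierField => $identifier]]],
        $deleteIdentifiers,
    );

    // One ordered bulk write; the driver splits it into batches as needed:
    // $collection->bulkWrite([...$saves, ...$deletes]);
    return [...$saves, ...$deletes];
}
```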


@alexander-schranz force-pushed the feature/mongodb-adapter branch from aa0d258 to 16b4d15 on August 2, 2025 at 13:57