
Add MongoDB Adapter #509


Draft · wants to merge 1 commit into base: 0.12

Conversation


@alexander-schranz commented on Apr 4, 2025

This PR streamlines a first prototype for using MongoDB as a Search Adapter.

ToDo

  • Create CI Task
    • Replace pull_request with pull_request_target
  • Adapter
    • Indexer
    • Searcher
    • SchemaManager
  • Client Creation
    • Check Database selection
  • SchemaManager
    • Exist
    • Create
      • Validator for Document Schema?
      • Create Search Index
        • Strict Schema
    • Drop
  • Searcher
    • getDocument (by Identifier)
    • Search
    • Search with Highlighting
    • Limit and Offset
    • Equal Condition
    • Equal condition with boolean
    • Equal condition with special string
    • Multi Equal condition
    • Equal condition with search condition
    • Not equal condition
    • Greater than condition
    • Greater than equal condition
    • Greater than equal condition multi value
    • Less than condition
    • Less than equal condition
    • Geo distance condition
    • Geo bounding box condition
    • Less than equal condition multi value
    • In condition
    • Not in condition
    • Sort by asc
    • Sort by desc
    • Searching with nested and or conditions

Test this out:

git clone git@github.com:PHP-CMSIG/search.git
cd search
git remote add alex git@github.com:alexander-schranz/schranz-search.git
git fetch alex
git checkout feature/mongodb-adapter

cd packages/seal-mongodb-adapter
composer install

docker compose up
vendor/bin/phpunit

@alexander-schranz added the "Adapter: MongoDB" label on Apr 4, 2025
Comment on lines +19 to +36
final class ClientWrapper
{
    public function __construct(
        private readonly Client $client,
        private readonly string $databaseName,
    ) {
    }

    public function getClient(): Client
    {
        return $this->client;
    }

    public function getDatabase(): Database
    {
        return $this->client->getDatabase($this->databaseName);
    }
}

@alexander-schranz commented on Apr 4, 2025

I'm not sure about this wrapper class. I didn't want the services to always require:

        private readonly Client $client,
        private readonly string $databaseName,

and wanted a handy getDatabase() method to get the currently configured database for this "Adapter" instance.

I'm also not sure whether my DSN will conflict with MongoDB URI parsing; I'm using the DSN path part to select the database:

\assert(isset($dsn['path']), 'A selected database is required.');
$databaseName = \ltrim($dsn['path'] ?? 'default', '/');

if (isset($dsn['query'])) {
    $dsnUri .= '?' . \http_build_query($dsn['query']);
}

if (isset($dsn['fragment'])) {
    $dsnUri .= '#' . $dsn['fragment'];
}

return new ClientWrapper(new Client($dsnUri), $databaseName);


The default database can be specified in the "path" part of the DSN: https://www.mongodb.com/docs/manual/reference/connection-string/#connection-string-database-options

The MongoDB\Client class doesn't have the notion of default database, you have to extract the default database name from the DSN (or other options).


I would try to avoid extracting the database name from the connection string - it would require you to parse the connection string and thus be conformant with the above specification. The default database was used by some tools, but the MongoDB drivers don't leverage it for anything other than authentication.

@alexander-schranz force-pushed the feature/mongodb-adapter branch 5 times, most recently from 490dcfb to 4326138 on April 4, 2025 at 18:09
Comment on lines +178 to +184
$filter instanceof Condition\SearchCondition => $filterQueries[]['$search'] = [
    'index' => $index->name . '-search-index',
    'text' => [
        'query' => $filter->query,
        'path' => $index->searchableFields,
    ],
],

@alexander-schranz commented on Apr 4, 2025

Hello @GromNaN @alcaeus 👋,

I created a quick prototype, but where I'm stuck is the query. I implemented the basic filtering: =, !=, <, <=, >=, >, IN and NOT IN. But I'm stuck on how to combine this now with the search. Maybe you have a hint for me :)

Is there a way to combine the search and filters functionality in collection->find ?


Search using the Lucene index can only be done with the $search stage; it must be the first stage of the aggregation pipeline.
The text query operator doesn't use the Lucene index but the specific "text" index that has existed in the MongoDB server since long before the introduction of Atlas Search.

In order to combine Lucene-base search criteria with basic filters, you have to use the compound search operator.

Is there a way to combine the search and filters functionality in collection->find ?

You must use $collection->aggregate($pipeline) with a $search stage first. Atlas Search is incompatible with $collection->find($query).
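As a rough sketch of such a pipeline (the index name 'blog-search-index', the field names, and the values below are invented for illustration), the compound operator is what combines the full-text query with the basic filters:

```php
<?php

// Sketch only: index name, paths, and values are hypothetical.
// A $search stage must come first; basic conditions go into compound.filter,
// the full-text query into compound.must.
$pipeline = [
    [
        '$search' => [
            'index' => 'blog-search-index',
            'compound' => [
                'must' => [
                    ['text' => ['query' => 'mongodb', 'path' => ['title', 'body']]],
                ],
                'filter' => [
                    ['equals' => ['path' => 'published', 'value' => true]],
                    ['range' => ['path' => 'rating', 'gte' => 3]],
                ],
            ],
        ],
    ],
    ['$limit' => 10],
];

// $cursor = $collection->aggregate($pipeline);
```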

@alexander-schranz commented:

I see, so the search has its own completely separate API via the aggregate method. Do you think it makes sense for SEAL to then support both find and basic search, or should it always go through the aggregate method with Atlas Search? Would aggregate also work if I only have filters without a search query?


When there are only basic filters, it might be more performant to use find($query) or an aggregation with aggregate([['$match' => $query]]); both behave the same. But I would start with an implementation that uses only the $search stage and operators, as that is the target of this package.
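A minimal sketch of that equivalence (field names invented):

```php
<?php

// Hypothetical filter with basic conditions only; no $search stage needed.
$query = ['published' => true, 'rating' => ['$gte' => 3]];

// Simple query API:
// $cursor = $collection->find($query);

// The same filter as an aggregation pipeline:
// $cursor = $collection->aggregate([['$match' => $query]]);
```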


I will add that if you require the latest version of the MongoDB driver, you will have access to the aggregation pipeline builder which gives you a programmatic API to assemble a $search stage (and any other stages). For examples, check out the tests for SearchStage and CompoundOperator. In essence, these would allow you to generate the correct operator based on the various conditions, then combine them all in a single must array for the compound operator and pass that to Stage::search.

@GromNaN left a comment

It's off to a great start! I've tried to answer your questions.


private readonly SearcherInterface $searcher;

public function __construct(
ClientWrapper $client,

You can inject the MongoDB\Database object instead; instantiation does not open the connection.

@alexander-schranz commented:

I like the idea. If I can do everything via the Database object, I think that would be the best option.

{
// $searchDefinition = $this->createPropertiesMapping($index->fields);

$this->client->getDatabase()->createCollection($index->name); // TODO check with mongodb team how we could add stricter schema / validator

@GromNaN commented on Apr 7, 2025

You can set the validator option with a $jsonSchema. Example:

$database->createCollection('planes', [
    'validator' => [
        '$jsonSchema' => [
            'bsonType' => 'object',
            'required' => ['_id', 'name', 'created_at'],
            'properties' => [
                '_id' => ['bsonType' => 'objectId'],
                'name' => ['bsonType' => 'string'],
                'created_at' => ['bsonType' => 'date'],
                'seats' => [
                    'bsonType' => 'array',
                    'items' => [
                        'bsonType' => 'object',
                        'required' => ['_id', 'number', 'row', 'column'],
                        'additionalProperties' => false,
                        'properties' => [
                            '_id' => ['bsonType' => 'objectId'],
                            'number' => ['bsonType' => 'int'],
                            'row' => ['bsonType' => 'int'],
                            'column' => ['bsonType' => 'int'],
                        ],
                    ],
                ],
            ],
            'additionalProperties' => false,
        ],
    ],
]);


return null;
}

return new SyncTask(null); // TODO wait for index drop

@GromNaN commented on Apr 7, 2025

@alexander-schranz

@GromNaN Thank you very much, that helps. I will keep you up to date and let you know. I only added one question here: #509 (comment). The other parts, I think, are now more a matter of doing and trying things out.


* fragment?: string,
* } $dsn
*/
public function createClient(array $dsn): ClientWrapper

I would avoid creating the connection string from parts. Since MongoDB drivers support connecting to a range of different topologies, the DSN can't easily be split into the parts you used above.

For example, when using a replica set or sharded cluster, you don't have a single host:port pair, but rather multiples, with each server using different ports:
mongodb://host-a:12345,host-b:23456

Authentication is also not always done using user and password, but also through X509 and other authentication mechanisms. And to top it off, the Initial DNS Seedlist Discovery specification allows you to specify hosts and connection options through DNS entries which the driver will then look up - this is where the mongodb+srv:// scheme comes in. That also complicates extracting the database name (see earlier comment), as any connection options can be specified in a TXT record.

Long story short: require users to pass you a connection string and let them deal with the rest.
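A minimal sketch of that suggestion, assuming the database name is taken as a separate, explicit option (the function shape here is hypothetical, not part of the PR):

```php
<?php

use MongoDB\Client;
use MongoDB\Database;

// Hypothetical factory: the connection string is forwarded to the driver
// untouched, so replica-set host lists (mongodb://host-a:12345,host-b:23456)
// and mongodb+srv:// URIs keep working without any parsing on our side.
function createDatabase(string $connectionUri, string $databaseName): Database
{
    // Instantiating the Client does not open a connection yet.
    return (new Client($connectionUri))->getDatabase($databaseName);
}
```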

{
$identifierField = $index->getIdentifierField();

foreach (BulkHelper::splitBulk($saveDocuments, $bulkSize) as $bulkSaveDocuments) {

While the API is a little cumbersome due to it requiring array structures, the Bulk Write API allows you to combine multiple different write operations (such as insertOne, replaceOne, deleteMany) into a single write. The driver will then take care of batching these writes to not exceed the maximum OP_MSG message size. By creating an ordered bulk write (which is the default), writes will stop after the first error; an unordered write will continue writing despite errors but also without guaranteeing that operations are executed in the specified order.

@alexander-schranz commented:

While the API is a little cumbersome due to it requiring array structures

You mean the documents themselves? They are later targeted to be mapped to objects in #81, but currently you can use any normalizer/denormalizer to get the objects into that structure. The bulk parameters themselves support any kind of iterable; I mostly recommend using \Generators or similar to keep memory usage low when reindexing all documents.

The driver will then take care of batching these writes to not exceed the maximum OP_MSG message size.

That sounds quite interesting. Maybe it would be better in my case to pass the iterables down to the underlying layer for as long as possible, so that, as long as no explicit bulk size is given, the underlying client of that search engine takes care of batching.

an unordered write will continue writing despite errors but also without guaranteeing that operations are executed in the specified order.

Thx for the insights here 🙏 .


You mean the documents itself?

I was talking about the bulk write API in the MongoDB driver - it requires an array of operations like this:

[
    [ 'deleteMany' => [ $filter ] ],
    [ 'deleteOne'  => [ $filter ] ],
    [ 'insertOne'  => [ $document ] ],
    [ 'replaceOne' => [ $filter, $replacement, $options ] ],
    [ 'updateMany' => [ $filter, $update, $options ] ],
    [ 'updateOne'  => [ $filter, $update, $options ] ],
]

As for batching and such, you would create a single bulk write containing all operations (insertions and deletions) and then send that to the server. When assembling the actual bulk write command, the driver will automatically split the write into batches if it exceeds the maximum size supported by MongoDB. While this size is dependent on information returned in the MongoDB handshake, it currently supports 100k operations in a single batch, with the message size limited to 48 MB. Considering this, you can assume that most cases will only need a single round trip and no batching.
If you were to go for that option, creating the operations is basically an array_map returning the corresponding insertOne or deleteMany operation, then combining both lists and passing that to the Collection::bulkWrite() method.
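Under the assumptions above, and with invented input shapes (plain document arrays keyed by an identifier field; the helper name is hypothetical), assembling such a combined bulk write could be sketched as:

```php
<?php

// Sketch: build one ordered bulkWrite() payload from documents to save and
// identifiers to delete. The shapes of $saveDocuments and $deleteIdentifiers
// are assumptions for this example.
function buildBulkOperations(array $saveDocuments, array $deleteIdentifiers, string $identifierField): array
{
    $saves = \array_map(
        static fn (array $document) => ['replaceOne' => [
            [$identifierField => $document[$identifierField]], // filter by identifier
            $document,                                         // full replacement document
            ['upsert' => true],                                // insert when missing
        ]],
        $saveDocuments,
    );

    $deletes = \array_map(
        static fn (string $identifier) => ['deleteOne' => [[$identifierField => $identifier]]],
        $deleteIdentifiers,
    );

    // One ordered bulk write; the driver splits it into batches as needed:
    // $collection->bulkWrite([...$saves, ...$deletes]);
    return [...$saves, ...$deletes];
}
```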


@alexander-schranz force-pushed the feature/mongodb-adapter branch from aa0d258 to 16b4d15 on August 2, 2025 at 13:57