Description
Hi Team,
I’m wondering if there’s a plan to apply multiprocessing to the publishers. We have a large amount of metadata in our production, which ends up running about 3 million queries on neo4j and takes about 90 minutes to finish.
To investigate the bottleneck, I looked into the code and logged the time elapsed for each step of a single iteration of the _publish_node function. This is the result:
Surprisingly, the bottleneck is not the db query, it’s the statement creation. The process is basically:
- loop through each row in the csv
- parse the row into a dictionary
- loop through each key-value pair in the dictionary to get the props
- fill the statement Jinja template with the props
- execute the query with the statement
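The per-row flow above can be sketched roughly as follows. All names here are illustrative, not the actual Amundsen publisher internals, and `str.format` stands in for the Jinja rendering; the point is that one statement is built and executed per csv row:

```python
# Rough sketch of the per-row publish loop (illustrative names only).
import csv
import io

# stand-in for the Jinja statement template
CREATE_NODE_TEMPLATE = "CREATE (n:{label} {{{props}}})"

def build_statements(csv_text: str, label: str):
    statements = []
    for row in csv.DictReader(io.StringIO(csv_text)):  # loop through each row
        props = dict(row)                              # parse the row into a dict
        # fill the statement template with the props (Jinja in the real code)
        prop_clause = ", ".join(f"{k}: ${k}" for k in props)
        statements.append((CREATE_NODE_TEMPLATE.format(label=label, props=prop_clause), props))
    return statements  # each statement is then executed one by one
```

Because the template is rendered once per row, the rendering cost scales linearly with the row count, which matches the observation that statement creation, not the db query, dominates.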
I’m thinking that instead of reading a row => creating a node in the graph db one by one, maybe we could use multiprocessing to speed up the process. I believe there will be no dependency issue as long as we publish all the nodes before publishing the relations, which is already handled in the current codebase.
I’m planning to implement multiprocessing for this. Are there any potential problems, e.g. dependencies, graph db load, etc.?
Expected Behavior or Use Case
Speed up the performance of the publisher. Currently, a 90 min sync is not acceptable for our use case 😢
Service or Ingestion ETL
Ingestion ETL, publisher
Possible Implementation
Thanks to @dkunitsk's idea, I think there are three possible implementations:
- Multiprocessing on call side
- Multiprocessing on Neo4j publisher
- Neo4j UNWIND (Batch processing)
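The UNWIND option can be sketched in a few lines: batch the rows and send one parameterized statement per batch instead of one per row. Names here are illustrative; `run_query` stands in for a driver call such as a neo4j session's `run`, and the batch size is arbitrary:

```python
# Sketch of the UNWIND (batch processing) option: one round trip per
# batch of rows instead of one per row. Illustrative names only.

UNWIND_CREATE = """
UNWIND $rows AS row
CREATE (n:Table)
SET n = row
"""

def batches(rows, batch_size):
    """Split rows into fixed-size batches (last one may be smaller)."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

def publish_in_batches(run_query, rows, batch_size=1000):
    # each call executes one parameterized statement for a whole batch
    for batch in batches(rows, batch_size):
        run_query(UNWIND_CREATE, rows=batch)
```

This also composes with the multiprocessing options: each worker could publish its own shard in batches.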
import multiprocessing

from pyhocon import ConfigFactory

from databuilder.extractor.hive_table_metadata_extractor import HiveTableMetadataExtractor
from databuilder.job.job import DefaultJob
from databuilder.loader.file_system_neo4j_csv_loader import FsNeo4jCSVLoader
from databuilder.publisher import neo4j_csv_publisher
from databuilder.publisher.neo4j_csv_publisher import Neo4jCsvPublisher
from databuilder.task.task import DefaultTask


class HiveParallelIndexer:
    # Shim for adding all node labels to the NEO4J_DEADLOCK_NODE_LABELS config,
    # which enables retries for those node labels. This is important for parallel
    # writing since we see intermittent Neo4j deadlock errors relatively often.
    class ContainsAllList(list):
        def __contains__(self, item):
            return True

    def __init__(self, publish_tag: str, parallelism: int):
        self.publish_tag = publish_tag
        self.parallelism = parallelism

    def __call__(self, worker_index: int):
        # Sharding:
        # - take the md5 hash of the schema.table_name
        # - convert the first 3 characters of the hash to decimal (3 chosen arbitrarily)
        # - mod by the total number of processes
        where_clause_suffix = """
        WHERE MOD(CONV(LEFT(MD5(CONCAT(d.NAME, '.', t.TBL_NAME)), 3), 16, 10), {total_parallelism}) = {worker_index}
        AND t.TBL_TYPE IN ('EXTERNAL_TABLE', 'MANAGED_TABLE', 'VIRTUAL_VIEW')
        AND (t.VIEW_EXPANDED_TEXT != '/* Presto View */' OR t.VIEW_EXPANDED_TEXT IS NULL)
        """.format(total_parallelism=self.parallelism,
                   worker_index=worker_index)

        # configs relevant for multiprocessing
        job_config = ConfigFactory.from_dict({
            'extractor.hive_table_metadata.{}'.format(HiveTableMetadataExtractor.WHERE_CLAUSE_SUFFIX_KEY):
                where_clause_suffix,
            # keeping this relatively low, in our experience, reduces neo4j deadlocks
            'publisher.neo4j.{}'.format(neo4j_csv_publisher.NEO4J_TRANSACTION_SIZE):
                100,
            'publisher.neo4j.{}'.format(neo4j_csv_publisher.NEO4J_DEADLOCK_NODE_LABELS):
                HiveParallelIndexer.ContainsAllList(),
        })
        job = DefaultJob(conf=job_config,
                         task=DefaultTask(
                             extractor=HiveTableMetadataExtractor(),
                             loader=FsNeo4jCSVLoader()),
                         publisher=Neo4jCsvPublisher())
        job.launch()


parallelism = 16
indexer = HiveParallelIndexer(
    publish_tag='2021-12-03',
    parallelism=parallelism)

with multiprocessing.Pool(processes=parallelism) as pool:
    def callback(_):
        # fast fail in case of exception in any process
        print('terminating due to exception')
        pool.terminate()

    res = pool.map_async(indexer, [i for i in range(parallelism)], error_callback=callback)
    res.get()
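For reference, the sharding expression in the WHERE clause above has a direct Python equivalent: take the md5 of `schema.table_name`, read the first 3 hex characters as an integer, and mod by the worker count. This is handy for sanity-checking that the shards are disjoint (the function name is mine, not part of the codebase):

```python
# Python equivalent of MOD(CONV(LEFT(MD5(name), 3), 16, 10), parallelism)
# from the Hive metastore WHERE clause above.
import hashlib

def shard_for(full_table_name: str, parallelism: int) -> int:
    digest = hashlib.md5(full_table_name.encode('utf-8')).hexdigest()
    # first 3 hex chars as a decimal integer, mod the number of workers
    return int(digest[:3], 16) % parallelism
```

Since md5 is deterministic, every table maps to exactly one worker index in `[0, parallelism)`, so the per-worker WHERE clauses partition the table set with no overlap and no gaps.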