Labels: bug
Description
What happens?
I'm trying to run Splink on a Spark cluster to learn how it can be used in distributed environments but am getting an unexpected error:
`pyspark.errors.exceptions.captured.AnalysisException: [UNABLE_TO_INFER_SCHEMA] Unable to infer schema for Parquet. It must be specified manually.`
when I run `linker.training.estimate_u_using_random_sampling`.
This impacts one's ability to use Splink on Spark.
To Reproduce
Run this example on Spark. I've found that this happens both in a local cluster and on EMR Serverless (emr-7.6.0, ARM). Everything works until I run:
`linker.training.estimate_u_using_random_sampling(max_pairs=5e5)`
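For context, a minimal sketch of the kind of Spark setup that reaches this call (import paths follow the Splink 4 docs; the input path, blocking rules, and comparisons below are illustrative assumptions, not the exact example from this report):

```python
# Hypothetical minimal repro sketch -- assumes pyspark and Splink 4 are
# installed and a Spark cluster is available. Not the exact linked example.
from pyspark.sql import SparkSession

import splink.comparison_library as cl
from splink import Linker, SettingsCreator, block_on
from splink.backends.spark import SparkAPI

spark = SparkSession.builder.appName("splink-repro").getOrCreate()

# Illustrative input; any parquet source with these columns would do.
df = spark.read.parquet("path/to/input.parquet")

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[block_on("first_name")],
    comparisons=[cl.ExactMatch("first_name"), cl.ExactMatch("surname")],
)

linker = Linker(df, settings, db_api=SparkAPI(spark_session=spark))

# The call that raises UNABLE_TO_INFER_SCHEMA in the environments above:
linker.training.estimate_u_using_random_sampling(max_pairs=5e5)
```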
OS:
NA
Splink version:
4.0.7
Have you tried this on the latest master branch?
- I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- I agree