Unable to infer schema for Parquet on Spark #2689

@edufschmidt

Description

What happens?

I'm trying to run Splink on a Spark cluster to learn how it can be used in distributed environments but am getting an unexpected error:

pyspark.errors.exceptions.captured.AnalysisException: [UNABLE_TO_INFER_SCHEMA] Unable to infer schema for Parquet. It must be specified manually.

when I run linker.training.estimate_u_using_random_sampling.

This effectively prevents using Splink on Spark.

To Reproduce

Run this example on Spark. I've found that this happens both on a local cluster and on EMR Serverless (emr-7.6.0, ARM). Everything works until I run:

linker.training.estimate_u_using_random_sampling(max_pairs=5e5)
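
For reference, my setup is roughly the following. This is a minimal sketch based on the linked example, not my exact config: the dataset, comparisons, and blocking rule below are illustrative placeholders.

```python
from pyspark.sql import SparkSession

import splink.comparison_library as cl
from splink import Linker, SettingsCreator, SparkAPI, block_on, splink_datasets

spark = SparkSession.builder.appName("splink-repro").getOrCreate()

# splink_datasets.fake_1000 is a small pandas demo dataset bundled with Splink;
# convert it to a Spark DataFrame so the Spark backend can use it
df = spark.createDataFrame(splink_datasets.fake_1000)

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.NameComparison("surname"),
    ],
    blocking_rules_to_generate_predictions=[block_on("first_name")],
)

db_api = SparkAPI(spark_session=spark)
linker = Linker(df, settings, db_api)

# This is the call that raises UNABLE_TO_INFER_SCHEMA for me,
# both on a local cluster and on EMR Serverless
linker.training.estimate_u_using_random_sampling(max_pairs=5e5)
```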

OS:

NA

Splink version:

4.0.7

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

Labels

    bug (Something isn't working)