Unable to infer schema for Parquet on Spark #2689

@edufschmidt

Description

What happens?

I'm trying to run Splink on a Spark cluster to learn how it can be used in distributed environments but am getting an unexpected error:

pyspark.errors.exceptions.captured.AnalysisException: [UNABLE_TO_INFER_SCHEMA] Unable to infer schema for Parquet. It must be specified manually.

when I run linker.training.estimate_u_using_random_sampling.

This effectively prevents using Splink on Spark.

To Reproduce

Run this example on Spark. I've found that this happens both on a local cluster and on EMR Serverless (emr-7.6.0, ARM). Everything works until I run:

linker.training.estimate_u_using_random_sampling(max_pairs=5e5)
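
For reference, my setup is roughly the following. This is a minimal sketch based on the linked example, not my exact config: the dataset, comparisons, and blocking rule below are illustrative placeholders.

```python
from pyspark.sql import SparkSession

import splink.comparison_library as cl
from splink import Linker, SettingsCreator, SparkAPI, block_on, splink_datasets

spark = SparkSession.builder.appName("splink-repro").getOrCreate()

# splink_datasets.fake_1000 is a small pandas demo dataset bundled with Splink;
# convert it to a Spark DataFrame so the Spark backend can use it
df = spark.createDataFrame(splink_datasets.fake_1000)

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.NameComparison("surname"),
    ],
    blocking_rules_to_generate_predictions=[block_on("first_name")],
)

db_api = SparkAPI(spark_session=spark)
linker = Linker(df, settings, db_api)

# This is the call that raises UNABLE_TO_INFER_SCHEMA for me,
# both on a local cluster and on EMR Serverless
linker.training.estimate_u_using_random_sampling(max_pairs=5e5)
```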

OS:

NA

Splink version:

4.0.7

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

Labels

    bug (Something isn't working)