Description
SparkSubmitHook should track the status of YARN cluster-mode applications with the YARN CLI rather than relying on spark-submit process logs. This would reduce excessive memory usage and make it much easier to make the operator deferrable later on.
Use case/motivation
While running most of our Spark workloads in YARN cluster mode using SparkSubmitHook, we observed that Celery workers were consistently low on memory. The main driver of the high memory consumption was the spark-submit processes started by SparkSubmitHook, which took about 500 MB of memory even though in YARN cluster mode they do next to no actual work. We refactored the hook to kill the spark-submit process right after YARN accepts the application and to track the status with `yarn application -status` calls, similar to how Spark standalone mode is tracked.
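As a rough illustration of the tracking step (not the actual refactored hook), the sketch below polls `yarn application -status` and reads the `State` and `Final-State` fields from the report. The function names, the poll interval, the set of terminal states, and the assumption that the report uses `Key : Value` lines are all illustrative choices and should be checked against the Hadoop version in use.

```python
import subprocess
import time

# Assumption: these are the terminal values of the "State" field in the report.
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}


def parse_application_report(report: str) -> dict:
    """Extract State and Final-State from `yarn application -status` output.

    Assumes the report consists of "Key : Value" lines.
    """
    fields = {}
    for line in report.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return {"state": fields.get("State"), "final_state": fields.get("Final-State")}


def fetch_report(application_id: str) -> str:
    """Run the YARN CLI once and return its stdout (requires `yarn` on PATH)."""
    result = subprocess.run(
        ["yarn", "application", "-status", application_id],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def wait_for_application(application_id, fetch=fetch_report, poll_interval=10.0, sleep=time.sleep):
    """Poll until the application reaches a terminal state; return its final state."""
    while True:
        status = parse_application_report(fetch(application_id))
        if status["state"] in TERMINAL_STATES:
            return status["final_state"]
        # No long-lived spark-submit JVM is needed between polls.
        sleep(poll_interval)
```

Each poll is a short-lived CLI call, so the worker holds no JVM between checks. The `fetch` and `sleep` parameters are injectable only to make the loop testable without a cluster.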
Another motivation for the change is to prepare the operator to be made deferrable later on. Polling that depends on an external Java process, which must be kept alive until the operator exits, cannot be made deferrable. Polling with the YARN CLI would fit naturally with how deferrable operators work.
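To show why CLI polling suits a deferrable design, here is a minimal asyncio sketch of the same loop. It is not Airflow trigger code: `fetch_state`, `wait_until_terminal`, the poll interval, and the terminal-state set are hypothetical, and the `Key : Value` report format is assumed. In an actual deferrable operator, a loop like this would presumably live in a trigger's async `run` method.

```python
import asyncio

# Assumption: these are the terminal values of the "State" field in the report.
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}


async def fetch_state(application_id: str) -> str:
    """Run `yarn application -status` without blocking the event loop."""
    proc = await asyncio.create_subprocess_exec(
        "yarn", "application", "-status", application_id,
        stdout=asyncio.subprocess.PIPE,
    )
    stdout, _ = await proc.communicate()
    # Assumes the report consists of "Key : Value" lines.
    for line in stdout.decode().splitlines():
        key, sep, value = line.partition(" : ")
        if sep and key.strip() == "State":
            return value.strip()
    return "UNKNOWN"


async def wait_until_terminal(application_id, fetch=fetch_state, poll_interval=30.0):
    """Poll asynchronously until the application reaches a terminal state."""
    while True:
        state = await fetch(application_id)
        if state in TERMINAL_STATES:
            return state
        # Yield control so other coroutines can run between polls.
        await asyncio.sleep(poll_interval)
```

Because nothing has to stay alive between polls, the only state to carry across a deferral is the application ID.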
Related issues
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct