-
Notifications
You must be signed in to change notification settings - Fork 408
Pinterest open source: early shuffle deletion #3564
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: branch-0.4
Are you sure you want to change the base?
Conversation
|
|
||
| trait RunningStageManager { | ||
| def isRunningStage(stageId: Int): Boolean | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a version based on the original implementation of that PR, I will clean it up to merge to master branch
|
Is there any design doc for this feature? This feature is what I needed. |
|
@FMX I will post the design description/doc while trying to merge this feature to master branch (as it was developed based on 0.4 internally)... on the other side, I am discussing with @YaoRazor , it seems we missed certain important pieces of code when upstreaming, so please do not bother look at the current implementation, it missed many things |
73dc785 to
c0c4019
Compare
|
I have started merging code to main branch at #3569, should be ready before the new year |
|
Hi, all #3569 is ready to review , appreciate the feedbacks in advance |
mridulm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not go over the entire PR.
From a quick look, this appears to be an unsound change. Shuffle gets used not just in the currently active stages/jobs - but also 'future' jobs. AQE based sql query execution, RDD reuse, etc all leverage this - I want to make sure I am not missing something here.
|
|
||
| import org.apache.spark.scheduler.{EventLoggingListener, SparkListenerInterface} | ||
|
|
||
| object CelebornSparkContextHelper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cleanup methods from this object which are not required ?
It looks like the only thing needed is eventLogger right now.
|
|
||
| import org.apache.spark.SparkContext | ||
|
|
||
| trait RunningStageManager { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this being used ?
| private def dagScheduler = SparkContext.getActive.get.dagScheduler | ||
|
|
||
| override def isRunningStage(stageId: Int): Boolean = { | ||
| dagScheduler.runningStages.map(_.id).contains(stageId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unsafe - runningStages is expected to be used only from within DAGScheduler

What changes were proposed in this pull request?
This PR represents Pinterest work to aggressively delete shuffle data
Ack
The main work is done by @CodingCat while he is working here, I am currently rolling out the feature and see good results
Why are the changes needed?
Does this PR resolve a correctness bug?
Does this PR introduce any user-facing change?
How was this patch tested?