Skip to content

Conversation

noot
Copy link
Contributor

@noot noot commented Jan 7, 2022

Changes

  • refactor babe epoch handling logic to be separated into an epochHandler module, which will handle the authorship slots for that epoch

Tests

go test ./lib/babe

Issues

Primary Reviewer

@noot noot self-assigned this Jan 7, 2022
@noot noot marked this pull request as ready for review January 19, 2022 22:10
@@ -340,8 +342,9 @@ func (ps *PeerSet) reportPeer(change ReputationChange, peers ...peer.ID) error {
return err
}

// TODO: change Noop to Drop once #2098 is fixed
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me know if i should put commenting out this ban peer stuff into another PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you not just uncomment this out for now? I wasn't able to reproduce the nodes banning at this point at least locally.

"github.com/stretchr/testify/require"
)

var keyring, _ = keystore.NewSr25519Keyring() //nolint:typecheck
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when I don't declare keyring here I get an undeclared error, but I then get a lint error since it's defined in babe_integration_test.go. how do i fix this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm, i just gotta put it in a non integration test file

@noot noot requested a review from kishansagathiya January 21, 2022 19:30
Comment on lines 38 to 44
require.Equal(t, 200, len(eh.slotToProof))
require.Equal(t, uint64(1), eh.epochNumber)
require.Equal(t, getSlotStartTime(9999, sd), eh.startTime)
require.Equal(t, uint64(9999), eh.firstSlot)
require.Equal(t, constants, eh.constants)
require.Equal(t, epochData, eh.epochData)
require.NotNil(t, eh.handleSlot)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you assert the entire eh? Also eh is eh variable naming wise 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't assert the entire epochHandler because the slotToProof value is uses randomness in the VRF calculation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we assign it some predictable value in the test? Or perhaps dependency inject the source of randomness? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm it would probably require overriding the claimPrimarySlot function which is used to generate the proof... I can do it if you'd really like but seems like that sorta veers into territory of changing the code for tests

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in if it's a lot of code change, don't; otherwise I'd say yes your code should be changed to be more testable 😉

lib/babe/babe.go Outdated
return err
}
err = func() error {
ctx, cancel := context.WithCancel(b.ctx)
Copy link
Contributor

@timwu20 timwu20 Jan 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm of the proponent where we should not be using context.Context within package boundaries. I know we already do this in most of the Service implementations. If what we're looking for is cancellation, then we can just create a channel for that purpose. context.Context is meant to traverse through API boundaries and contain request scoped data. I've seen a lot of examples of go code using them to handle long running processes and I feel like it introduces more weight than required.

In this case the Service.ctx is being wrapped to add cancellation to the current epochHandler. So to my knowledge that the Service.ctx was created with context.WithCancel, then ctx.Done() would close if the original CancelFunc was called. But we wrap this context again in line 347. So does that mean that if we call the local cancel CancelFunc than any goroutines listening on Service.ctx would receive the channel close? I would assume no, but these are things that make the code more difficult to understand. I think as a best practice you shouldn't call context.WithCancel or any of the other context.WithXXX functions on supplied contexts.

I think epochHandler.run() should be modified to accept a context.Context that's not the original one stored in Service. This runEngine function should listen on the Service one and trigger cancellation within this function for the current epochHandler.

@noot
Copy link
Contributor Author

noot commented Jan 24, 2022

@timwu20 made your suggested change regarding the context, let me know what you think now!

@noot noot requested review from timwu20 and qdm12 January 24, 2022 18:20
Copy link
Member

@EclesioMeloJunior EclesioMeloJunior left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just small changes!

qdm12
qdm12 previously requested changes Jan 24, 2022
lib/babe/babe.go Outdated

if err = babeService.setupParameters(cfg); err != nil {
return nil, err
constants: &constants{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is a pointer by the way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't know lol. want me to change it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes pleaaaase

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

lib/babe/babe.go Outdated
Comment on lines 265 to 267
auths := make([]types.Authority, len(b.epochHandler.epochData.authorities))
copy(auths, b.epochHandler.epochData.authorities)
return auths
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you can copy types.Authority directly, you would need a deep copy method I guess?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// retry to run the engine at some point (maybe the next epoch) if
// there's an error.
if err := b.runEngine(); err != nil {
logger.Criticalf("failed to run block production engine: %s", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This 💯

lib/babe/babe.go Outdated
if err != nil {
logger.Errorf("failed to get current epoch: %s", err)
return err
return nil, fmt.Errorf("failed to initiate epoch %d: %s", epoch, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil, fmt.Errorf("failed to initiate epoch %d: %s", epoch, err)
return nil, fmt.Errorf("failed to initiate epoch %d: %w", epoch, err)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done (also fixed all the other error returns)

lib/babe/babe.go Outdated
}
epochStartSlot, err := b.epochState.GetStartSlotForEpoch(epoch)
if err != nil {
return nil, fmt.Errorf("failed to get start slot for current epoch %d: %s", epoch, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil, fmt.Errorf("failed to get start slot for current epoch %d: %s", epoch, err)
return nil, fmt.Errorf("failed to get start slot for current epoch %d: %w", epoch, err)

func (b *Service) runEngine() error {
epoch, err := b.epochState.GetCurrentEpoch()
if err != nil {
return fmt.Errorf("failed to get current epoch: %s", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return fmt.Errorf("failed to get current epoch: %s", err)
return fmt.Errorf("failed to get current epoch: %w", err)

lib/babe/babe.go Outdated
return err
}
for {
err = func() error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have this one as a separate global scope function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

lib/babe/babe.go Outdated
defer cancel()
b.epochHandler, err = b.initiateAndGetEpochHandler(epoch)
if err != nil {
return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return err
return fmt.Errorf("cannot initiate epoch handler: %w", err)

lib/babe/babe.go Outdated
// get start slot for current epoch
nextEpochStart, err := b.epochState.GetStartSlotForEpoch(epoch + 1)
if err != nil {
return fmt.Errorf("failed to get start slot for next epoch %d: %s", epoch+1, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return fmt.Errorf("failed to get start slot for next epoch %d: %s", epoch+1, err)
return fmt.Errorf("failed to get start slot for next epoch %d: %w", epoch+1, err)

// TODO: sometimes grandpa fails to initiate due to a "Key not found"
// error, this shouldn't happen.
if err := s.initiate(); err != nil {
logger.Criticalf("failed to initiate: %s", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Panic? Error channel? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the same sort of issue as with BABE, we probably need to add an error channel so the higher-level service handler can either panic or restart the service. I'll make an issue for it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 416 to 417
epochData.authorities = make([]types.Authority, 1)
epochData.authorities[0] = types.Authority{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
epochData.authorities = make([]types.Authority, 1)
epochData.authorities[0] = types.Authority{
epochData.authorities = []types.Authority{

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

epochData.threshold = maxThreshold
epochData.authorityIndex = 0

var slotNumber uint64 = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var slotNumber uint64 = 1
const slotNumber = 1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -214,14 +230,18 @@ func TestVerificationManager_VerifyBlock_Secondary(t *testing.T) {
}

func TestVerificationManager_VerifyBlock_MultipleEpochs(t *testing.T) {
t.Skip() // no idea why it's complaining it can't find the epoch data. fix later
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO 😉

Comment on lines +49 to +53

if genCfg == nil {
genCfg = genesisBABEConfig
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this moved?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this isn't here the verifier tests panic :p

Comment on lines +76 to +121
mockBlockState := NewMockBlockState(ctrl)
mockEpochState0 := NewMockEpochState(ctrl)
mockEpochState1 := NewMockEpochState(ctrl)
mockEpochState2 := NewMockEpochState(ctrl)

mockEpochState0.EXPECT().GetStartSlotForEpoch(gomock.Eq(uint64(0))).Return(uint64(1), nil)
mockEpochState1.EXPECT().GetStartSlotForEpoch(gomock.Eq(uint64(1))).Return(uint64(201), nil)
mockEpochState2.EXPECT().GetStartSlotForEpoch(gomock.Eq(uint64(1))).Return(uint64(201), nil)

mockEpochState1.EXPECT().HasEpochData(gomock.Eq(uint64(1))).Return(true, nil)
mockEpochState2.EXPECT().HasEpochData(gomock.Eq(uint64(1))).Return(true, nil)

kp := keyring.Alice().(*sr25519.Keypair)
authority := types.NewAuthority(kp.Public(), uint64(1))
testEpochData := &types.EpochData{
Randomness: [32]byte{1},
Authorities: []types.Authority{*authority},
}

mockEpochState1.EXPECT().GetEpochData(gomock.Eq(uint64(1))).Return(testEpochData, nil)
mockEpochState2.EXPECT().GetEpochData(gomock.Eq(uint64(1))).Return(testEpochData, nil)

mockEpochState1.EXPECT().HasConfigData(gomock.Eq(uint64(1))).Return(true, nil)
mockEpochState2.EXPECT().HasConfigData(gomock.Eq(uint64(1))).Return(false, nil)

testConfigData := &types.ConfigData{
C1: 1,
C2: 1,
}

mockEpochState1.EXPECT().GetConfigData(gomock.Eq(uint64(1))).Return(testConfigData, nil)

testLatestConfigData := &types.ConfigData{
C1: 1,
C2: 2,
}

mockEpochState2.EXPECT().GetLatestConfigData().Return(testLatestConfigData, nil)

testEpochDataEpoch0 := &types.EpochData{
Randomness: [32]byte{9},
Authorities: []types.Authority{*authority},
}

mockEpochState0.EXPECT().GetLatestEpochData().Return(testEpochDataEpoch0, nil)
mockEpochState0.EXPECT().GetLatestConfigData().Return(testConfigData, nil)
Copy link
Contributor

@qdm12 qdm12 Jan 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps you could configure your mocks according to test cases?

Like this for example.

There is also a reason where I believe having NewController(t) with the parent test might give some funny logs/behavior especially when running parallel subtests inside it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that would be nice, I'll work on that. gotta work on making my tests cleaner haha

lib/babe/babe.go Outdated
defer timer.Stop()
select {
case <-timer.C:
if errors.Is(err, errServicePaused) || errors.Is(err, context.Canceled) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we want to return an error if the context gets canceled 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can? idk, the only time the context would be cancelled is if the node shutdowns afaik, so i don't think it matters to return an error

babeService.epochHandler, err = babeService.initiateAndGetEpochHandler(0)
require.NoError(t, err)

authoringSlots := getAuthoringSlots(babeService.epochHandler.slotToProof)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit just in case

Suggested change
authoringSlots := getAuthoringSlots(babeService.epochHandler.slotToProof)
authoringSlots := getAuthoringSlots(babeService.epochHandler.slotToProof)
require.NotEmpty(t, authoringSlots)

}

rt, err := babeService.blockState.GetRuntime(nil)
require.NoError(t, err)

_, err = babeService.buildBlock(parentHeader, slot, rt)
_, err = babeService.buildBlock(parentHeader, slot, rt, 0, &VrfOutputAndProof{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's 0 here? Perhaps have it as a named constant above this line? 🤔

@@ -77,7 +77,7 @@ func TestCalculateThreshold(t *testing.T) {
}

func Test_checkPrimaryThreshold(t *testing.T) {
keyring, _ := keystore.NewSr25519Keyring()
keyring, _ := keystore.NewSr25519Keyring() //nolint:govet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is govet complaining here 🤔
If it's because of the error, just assert it's nil?

Comment on lines 132 to 135
if err := h.handleSlot(h.epochNumber, swt.slotNum, h.epochData.authorityIndex, h.slotToProof[swt.slotNum]); err != nil { //nolint:lll
logger.Warnf("failed to handle slot %d: %s", swt.slotNum, err)
continue
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit don't use the short err check form with a big line like this.

Suggested change
if err := h.handleSlot(h.epochNumber, swt.slotNum, h.epochData.authorityIndex, h.slotToProof[swt.slotNum]); err != nil { //nolint:lll
logger.Warnf("failed to handle slot %d: %s", swt.slotNum, err)
continue
}
err := h.handleSlot(h.epochNumber, swt.slotNum,
h.epochData.authorityIndex, h.slotToProof[swt.slotNum])
if err != nil {
logger.Warnf("failed to handle slot %d: %s", swt.slotNum, err)
continue
}

Comment on lines 143 to 147
authoringSlots := make([]uint64, len(slotToProof))
i := 0
for authoringSlot := range slotToProof {
authoringSlots[i] = authoringSlot
i++
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit simpler

Suggested change
authoringSlots := make([]uint64, len(slotToProof))
i := 0
for authoringSlot := range slotToProof {
authoringSlots[i] = authoringSlot
i++
authoringSlots := make([]uint64, 0, len(slotToProof))
for authoringSlot := range slotToProof {
authoringSlots = append(authoringSlots, authoringSlot)

Comment on lines 84 to 87
timer := time.After(sd * 100)
select {
case <-timer:
require.Equal(t, 100-(firstExecutedSlot-startSlot), callsToHandleSlot)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit Have 100 as a shared inline constant please


errCh := make(chan error)
go eh.run(ctx, errCh)
timer := time.After(sd * 100)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit Ideally use time.NewTimer(sd * 100) so you can stop the timer in the error receiving select case.


ctx, cancel := context.WithCancel(context.Background())
defer cancel()
eh, err := newEpochHandler(1, startSlot, epochData, constants, testHandleSlotFunc, keypair)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit Please rename eh to epochHandler or handler or any full work 😸

lib/babe/babe.go Outdated
Comment on lines 341 to 344
for {
if err = b.handleEpoch(epoch); err != nil {
if errors.Is(err, errServicePaused) || errors.Is(err, context.Canceled) {
return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
for {
if err = b.handleEpoch(epoch); err != nil {
if errors.Is(err, errServicePaused) || errors.Is(err, context.Canceled) {
return nil
for {
err = b.handleEpoch(epoch)
if errors.Is(err, errServicePaused) || errors.Is(err, context.Canceled) {
return nil
} else if err != nil {
return err
}

lib/babe/babe.go Outdated
@@ -263,7 +263,9 @@ func (b *Service) Stop() error {
// Authorities returns the current BABE authorities
func (b *Service) Authorities() []types.Authority {
auths := make([]types.Authority, len(b.epochHandler.epochData.authorities))
copy(auths, b.epochHandler.epochData.authorities)
for i, auth := range b.epochHandler.epochData.authorities {
auths[i] = *(auth.DeepCopy())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit do you need the parentheses? I think this would work too

Suggested change
auths[i] = *(auth.DeepCopy())
auths[i] = *auth.DeepCopy()

lib/babe/babe.go Outdated
defer cancel()
b.epochHandler, err = b.initiateAndGetEpochHandler(epoch)
if err != nil {
return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return err
return fmt.Errorf("cannot initiate and get epoch handler for epoch %d: %w", epoch, err)

Comment on lines +353 to +354
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit perhaps move this further down before it's needed at go b.epochHandler.run(ctx, errCh)

lib/babe/babe.go Outdated
// stop current epoch handler
cancel()
case err := <-errCh:
logger.Errorf("error from epochHandler: %w", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.Errorf("error from epochHandler: %w", err)
logger.Errorf("error from epochHandler: %s", err)

case <-epochTimer.C:
// stop current epoch handler
cancel()
case err := <-errCh:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should also call cleanup() here

Suggested change
case err := <-errCh:
case err := <-errCh:
cleanup()

@noot
Copy link
Contributor Author

noot commented Jan 27, 2022

@qdm12 I think I addressed all your comments, feel free to rereview!

@noot noot requested a review from qdm12 January 27, 2022 16:50
@noot
Copy link
Contributor Author

noot commented Jan 27, 2022

i think this also closes #2217 (don't see the error anymore when running this branch), will try it more to confirm though.

@noot noot changed the title maintainence(lib/babe): refactor code to use epochHandler for control of epoch logic maintainence(lib/babe): refactor code to use epochHandler for control of epoch logic~ Jan 28, 2022
@noot noot changed the title maintainence(lib/babe): refactor code to use epochHandler for control of epoch logic~ maintainence(lib/babe): refactor code to use epochHandler for control of epoch logic Jan 28, 2022
@noot noot merged commit 656fd8d into development Jan 28, 2022
@noot noot deleted the noot/babe-refactor branch January 28, 2022 19:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

devnet: BABE does not restart after pausing, and pauses at wrong times
5 participants