commit 6567c374ab5471615f5e7430ce6c1b37212ba731
Author: Victor Lowther <victor@rackn.com>
Date: Thu Mar 3 12:48:12 2022 -0600
perf(etags): Parallelize etag bulk processing.
This refactors etag bulk checking to operate in as parallel a fashion
as possible without overwhelming the system if it winds up needing to
recalculate a bunch of checksums.
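A minimal sketch of bounded parallelism, under stated assumptions: the
names bulkProcess, checksum, and limit are hypothetical and are not taken
from datastack/etags.go. It only illustrates the general idea of running
many checks concurrently while capping how many expensive checksum
recalculations run at once.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// checksum stands in for an expensive etag recalculation (hypothetical).
func checksum(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// bulkProcess computes a checksum for every entry in items, running at
// most limit computations at the same time.
func bulkProcess(items map[string][]byte, limit int) map[string]string {
	if limit < 1 {
		limit = 1
	}
	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)
	results := make(map[string]string, len(items))
	sem := make(chan struct{}, limit) // counting semaphore

	for name, data := range items {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit workers are already busy
		go func(name string, data []byte) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			sum := checksum(data)
			mu.Lock()
			results[name] = sum
			mu.Unlock()
		}(name, data)
	}
	wg.Wait()
	return results
}

func main() {
	items := map[string][]byte{
		"a": []byte("alpha"),
		"b": []byte("beta"),
	}
	for name, sum := range bulkProcess(items, 2) {
		fmt.Println(name, sum)
	}
}
```

The buffered channel acts as a semaphore, so raising or lowering limit
trades throughput against load on the host.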
M datastack/etags.go
commit 737e985d9e03e71783d116af73db18611afa6bc9
Author: Victor Lowther <victor@rackn.com>
Date: Mon Feb 28 11:40:46 2022 -0600
perf(etags): Avoid opening files to calculate etags.
There appears to be a huge performance penalty when running
dr-provision on systems that have any sort of monitoring that hooks
the open system call. To work around this, refactor most places we
check etags to do so without opening the file involved.
This patch also adds two utilities that can be used to benchmark and
identify this issue.
cmds/etagerator/etags.go will run just the etag BulkProcess function on
all the directories passed in as command line args. It can be used to
get an idea how long the etag process will take in various
environments.
cmds/start_io_trace/startTrace.go runs the complete dr-provision
startup sequence up to the point that we would start joining a cluster
or loading data from the database. It runs with full tracing enabled,
and emits a go trace log once it finishes the startup process.
M backend/dataTracker_test.go
A cmds/etagerator/etags.go
A cmds/start_io_trace/startTrace.go
M datastack/etags.go
M datastack/stack.go
M midlayer/fake_midlayer_server_test.go
M midlayer/static_test.go
M midlayer/tftp_test.go
M server/args.go
commit aaab33873d0573d89eb22a9b68384c61ce5cde7d
Author: Victor Lowther <victor@rackn.com>
Date: Mon Feb 28 13:31:31 2022 -0600
fix(panic): Fix race when removing a server that can lead to panic.
The Raft FSM can get into an inconsistent state that allows a
LastArtifactOp operation to succeed at the same time the node issuing
the request is being removed from the cluster. Depending on the exact
timing, the command can be committed after the node removal command,
leading to a panic when replaying the log on the followers and on the
server.
Work around this for now by silently ignoring LastArtifactApply
operations from nodes that we have removed from the cluster. A longer
term fix will require adding a dedicated API path for updating this
that can check to see if the operation is allowed before committing it
through Raft.
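A minimal sketch of the workaround's shape, under stated assumptions: the
fsm and op types and their methods are hypothetical stand-ins and do not
reflect the code in consensus/raftFSM.go or any Raft library API. The
point is that an update from a node that is no longer a member is dropped
instead of being applied, so replaying the same log gives the same result
on every node.

```go
package main

import "fmt"

// op is a hypothetical log entry recording the last artifact a node saw.
type op struct {
	NodeID   string
	Artifact string
}

// fsm is a hypothetical, simplified state machine.
type fsm struct {
	members      map[string]bool
	lastArtifact map[string]string
}

func newFSM(ids ...string) *fsm {
	f := &fsm{members: map[string]bool{}, lastArtifact: map[string]string{}}
	for _, id := range ids {
		f.members[id] = true
	}
	return f
}

// removeNode drops a node and any state tracked for it.
func (f *fsm) removeNode(id string) {
	delete(f.members, id)
	delete(f.lastArtifact, id)
}

// applyLastArtifact applies the update only if the issuing node is still
// a member. It reports whether the update was applied; an update from a
// removed node is silently ignored rather than treated as fatal.
func (f *fsm) applyLastArtifact(o op) bool {
	if !f.members[o.NodeID] {
		return false
	}
	f.lastArtifact[o.NodeID] = o.Artifact
	return true
}

func main() {
	f := newFSM("n1", "n2")
	f.removeNode("n2")
	// This entry was committed after the removal; it is dropped.
	fmt.Println(f.applyLastArtifact(op{NodeID: "n2", Artifact: "a-17"}))
	fmt.Println(f.applyLastArtifact(op{NodeID: "n1", Artifact: "a-18"}))
}
```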
M consensus/raftFSM.go
M frontend/consensus.go
M server/args.go