When the PrometheusTSDBCompactionsFailing alert fired for me, the cause was corrupted WAL files. The logs contained error messages like this: WAL truncation in Compact: create checkpoint: read segments: corruption in segment /prometheus/wal/00018151 at 72: unexpected full record
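To find the affected segment quickly, the path can be pulled out of the log output. This is only a minimal sketch: the function name `extract_corrupt_segment` is made up for illustration, and it assumes the error message has the exact wording quoted above.

```shell
#!/bin/sh
# Hypothetical helper: reads Prometheus log lines on stdin and prints the
# path of every WAL segment reported as corrupted (one per line, deduplicated).
# Assumes the message format "corruption in segment <path> at <offset>: ...".
extract_corrupt_segment() {
    sed -n 's/.*corruption in segment \([^ ]*\) at [0-9]*:.*/\1/p' | sort -u
}
```

It could be fed from the container logs, for example `kubectl logs <pod> -c prometheus | extract_corrupt_segment`, where `<pod>` stands for the name of your Prometheus pod.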
With the following procedure I was able to fix the issue:
1. Exec into the pod (or find the mount path of the PersistentVolumeClaim on the host) and delete the corrupted file (in the example above: rm /prometheus/wal/00018151).
2. Delete all WAL files in /prometheus/wal that are older than the file deleted in the previous step (for example: rm /prometheus/wal/00018150).
3. Create empty files in place of all the files deleted in the previous steps (for example: touch /prometheus/wal/00018150 /prometheus/wal/00018151).
4. Make sure the ownership and permissions of the new files match those of the other WAL files (e.g. chown 1000:2000 /prometheus/wal/00018150 /prometheus/wal/00018151 and chmod g+w /prometheus/wal/00018150 /prometheus/wal/00018151).
5. Restart the pod.
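The delete, recreate, and fix-permissions steps above could be scripted roughly as follows. This is a sketch of what I did by hand, not a tested tool: the function name `reset_wal_segments` is made up, the owner (e.g. 1000:2000) is an optional argument because it will differ between setups, and it assumes WAL segments are the plain files with purely numeric names in the WAL directory.

```shell
#!/bin/sh
# Hypothetical helper: replaces the corrupted WAL segment and every older
# segment with an empty file.
#   $1  WAL directory, e.g. /prometheus/wal
#   $2  corrupted segment (name or full path), e.g. 00018151
#   $3  optional owner for chown, e.g. 1000:2000
reset_wal_segments() {
    wal_dir=$1
    bad=${2##*/}      # strip any leading directory from the segment name
    owner=${3:-}

    for path in "$wal_dir"/*; do
        name=${path##*/}
        # Only touch regular files; this skips checkpoint directories.
        [ -f "$path" ] || continue
        # Only touch files whose name consists of digits only.
        case $name in
            ''|*[!0-9]*) continue ;;
        esac
        # Leave segments newer than the corrupted one alone.
        [ "$name" -le "$bad" ] || continue

        rm -f -- "$path"            # delete the segment
        touch -- "$path"            # recreate it as an empty file
        if [ -n "$owner" ]; then
            chown "$owner" "$path"  # match ownership of the other WAL files
        fi
        chmod g+w "$path"           # match the group-write permission
    done
}
```

For the example above the call would be `reset_wal_segments /prometheus/wal 00018151 1000:2000`, run inside the pod or against the mounted volume on the host. Check the owner and mode of an untouched WAL file first (ls -l) and adjust the arguments accordingly.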
Depending on how long ago the last successful compaction was, the next compaction might use a lot of memory and take a while. Watch for the pod being out-of-memory-killed; if that happens, (temporarily) increase the memory requests and limits of the prometheus container. If the container terminates with exit code zero and you see the message "See you next time!" in the logs together with a failed startup probe in the pod events (kubectl describe), disable the startupProbe and the livenessProbe.
I do not know if this is good practice, though.
Should I open a pull request to extend the PrometheusTSDBCompactionsFailing runbook?