PrometheusTSDBCompactionsFailing instructions for corrupted WAL files #53

Open
elchenberg opened this issue May 3, 2023 · 0 comments

When I got PrometheusTSDBCompactionsFailing alerts, the cause turned out to be corrupted WAL files. The logs contained error messages like this one: WAL truncation in Compact: create checkpoint: read segments: corruption in segment /prometheus/wal/00018151 at 72: unexpected full record
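To pick the segment path out of such a log line, a small filter can help. This is only a sketch: extract_corrupt_segment is a hypothetical helper name, and the pattern assumes the message has the same shape as the one quoted above.

```shell
# Hypothetical helper: read log lines on stdin and print the segment path
# named in each "corruption in segment <path> at <offset>" message.
extract_corrupt_segment() {
  sed -n 's/.*corruption in segment \([^ ]*\) at [0-9][0-9]*.*/\1/p'
}

# Example using the message from this issue. In practice the input would
# come from the pod logs instead of a fixed string.
msg='WAL truncation in Compact: create checkpoint: read segments: corruption in segment /prometheus/wal/00018151 at 72: unexpected full record'
printf '%s\n' "$msg" | extract_corrupt_segment
```

Lines that do not contain a corruption message produce no output, so the filter can be fed the whole log.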

With the following procedure I was able to fix the issue:

  1. Exec into the pod (or find the mount path of the PersistentVolumeClaim on the host) and delete the corrupted file (in the example above: rm /prometheus/wal/00018151).
  2. Delete all WAL files in /prometheus/wal that are older than the file deleted in the previous step, that is, those with a lower segment number (for example rm /prometheus/wal/00018150).
  3. Create empty files in place of all the files deleted in the previous steps (for example touch /prometheus/wal/00018150 /prometheus/wal/00018151).
  4. Make sure the ownership and permissions of the new files match those of the other WAL files (e.g. chown 1000:2000 /prometheus/wal/00018150 /prometheus/wal/00018151 and chmod g+w /prometheus/wal/00018150 /prometheus/wal/00018151).
  5. Restart the pod.
  6. Depending on how long ago the last successful compaction was, the next compaction might use a lot of memory and take a while. Watch for the pod being out-of-memory-killed and, if that happens, (temporarily) increase the memory requests and limits of the prometheus container. If the container terminates with exit code zero, the logs show the message "See you next time!", and the pod events (kubectl describe) show a failed startup probe, disable the startupProbe and the livenessProbe.
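Steps 1 to 4 can be sketched as a single shell function. This is an illustration under assumptions, not a vetted procedure: reset_wal_segments is a hypothetical name, segment files are assumed to have purely numeric, zero-padded names, and the 1000:2000 owner in the usage comment is just the example value from step 4. It destroys the data in the affected segments, so check the arguments before running it.

```shell
# Hypothetical helper, not part of Prometheus: replace the corrupted segment
# and every older (lower-numbered) segment with an empty file.
# Usage: reset_wal_segments <wal-dir> <corrupted-segment-name> [owner:group]
reset_wal_segments() {
  wal_dir=$1
  corrupt=$2
  owner=${3:-}
  for f in "$wal_dir"/[0-9]*; do
    [ -f "$f" ] || continue                   # skip directories and unmatched globs
    name=${f##*/}
    case $name in *[!0-9]*) continue ;; esac  # only purely numeric segment names
    # expr compares the two names as integers; true when name <= corrupt
    if expr "$name" \<= "$corrupt" >/dev/null; then
      rm -f -- "$f"                           # steps 1 and 2: delete the segment
      touch -- "$f"                           # step 3: recreate it empty
      chmod g+w -- "$f"                       # step 4: group-writable like the others
      if [ -n "$owner" ]; then
        chown "$owner" -- "$f"                # step 4: usually needs root
      fi
    fi
  done
}

# Example call for the segment from this issue (commented out on purpose):
# reset_wal_segments /prometheus/wal 00018151 1000:2000
```

Newer segments and the checkpoint directories are left untouched. The owner argument is optional because chown typically requires root inside the container.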

I do not know if this is good practice, though.

Should I open a pull request to extend the PrometheusTSDBCompactionsFailing runbook?
