What is the problem?
When a reboot is triggered while a task is processing, processes can sometimes hang for abnormally long times when power is restored. This is usually fixed by canceling the hanging task and restarting it.
However, with 8 tasks queued, after cancelling the hanging task the node did not begin processing another task (or project): every project indicated it was queued, and further attempts to cancel queued projects and relaunch the tasks left the queue length unchanged with no active process (again, I checked every project and every task).
Further investigation turned up a few symptoms:
CPU/GPU/memory/disk usage were all elevated, and the Docker container appeared to be running when monitoring statistics from the server itself (via remote desktop).
The Docker GPU container took 20-30 seconds to spin up when it normally takes about 3 or less:
docker run -dp 3001:3000 --gpus all --name nodeodmgpu opendronemap/nodeodm:gpu || docker start nodeodmgpu && ../webodm.sh start
The issue was eventually traced to the corresponding /var/lib/docker/overlay2/[Very-Long-ID]/diff/var/www/data/tasks.json file for the hanging node: the status codes recorded for each task were not consistent with the web client, and failed to re-sync automatically during any reboot/stop/start.
I.e. a task that was canceled and should read {"code":50} instead showed {"code":10} (Queued) or {"code":20} (Processing?).
The solution was to stop the Docker container, issue a webodm.sh down, and then use the commands below to manually mark the tasks as canceled:
sudo sed -i -e "s/:10}/:50}/g" /var/lib/docker/overlay2/[Very-Long-Node-ID]/diff/var/www/data/tasks.json
sudo sed -i -e "s/:20}/:50}/g" /var/lib/docker/overlay2/[Very-Long-Node-ID]/diff/var/www/data/tasks.json
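The blunt sed replace above works, but it will rewrite any `:10}` or `:20}` substring in the file. A slightly safer sketch (a hypothetical helper, not part of WebODM or NodeODM, and it assumes task status codes live under a "code" key as in the examples above) is to parse the JSON and rewrite only those fields:

```python
import json

# NodeODM task status codes: 10 = QUEUED, 20 = RUNNING,
# 30 = FAILED, 40 = COMPLETED, 50 = CANCELED.
QUEUED, RUNNING, CANCELED = 10, 20, 50

def cancel_stuck_tasks(doc):
    """Recursively walk a parsed tasks.json and flip any queued/running
    status code to CANCELED. The exact tasks.json layout may differ
    between NodeODM versions, so this walks the whole structure."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            if key == "code" and value in (QUEUED, RUNNING):
                doc[key] = CANCELED
            else:
                cancel_stuck_tasks(value)
    elif isinstance(doc, list):
        for item in doc:
            cancel_stuck_tasks(item)
    return doc

# Usage sketch (path placeholder as in the sed commands above):
# path = "/var/lib/docker/overlay2/[Very-Long-Node-ID]/diff/var/www/data/tasks.json"
# with open(path) as f:
#     data = json.load(f)
# with open(path, "w") as f:
#     json.dump(cancel_stuck_tasks(data), f)
```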
The result was that the task list was retained and ~100 GB of task resources were not orphaned. Everything functions normally after restarting the services.
How can we reproduce this? (What steps trigger the problem? What parameters are you using for processing? Include screenshots. If you are having issues processing a dataset, you must include a copy of your dataset uploaded on Dropbox, Google Drive or https://dronedb.app)
It seemed to be triggered by an abrupt reboot from another user account's session, which may have prevented Docker from stopping/starting cleanly. This caused a de-sync between the web client's container information and the node: the node was still commanded to process tasks that the web client showed as canceled or queued.
I would recommend queueing a few tasks, killing the Docker process (or otherwise stopping the Docker container abruptly), and then cancelling the remaining tasks. This should replicate the issue, though I have not yet been able to test it because our production and test environments are both occupied.
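The suspected repro above could be scripted roughly as follows. This is a sketch only: `nodeodmgpu` is the container name from the run command earlier, and the task queueing/cancellation steps still have to be done from the WebODM UI.

```shell
# 1. Start WebODM and the GPU node, then queue several tasks from the UI.
# 2. Simulate an abrupt shutdown of the processing node:
docker kill nodeodmgpu          # hard stop, no clean shutdown
# 3. Bring the node back up:
docker start nodeodmgpu
# 4. In the WebODM UI, cancel the remaining queued tasks in rapid succession.
# 5. Check whether the node's task list still reports queued/running codes:
docker exec nodeodmgpu cat /var/www/data/tasks.json
```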
Update: The issue also reproduces when cancelling projects in rapid succession. Clicking cancel on multiple queued projects replicates the de-sync about 50% of the time, and the queue length remains >0 until the manual fix above is employed. Reboots do not fix it, UNLESS the hanging queue belongs to the default CPU processing node that is set up out of the box, in which case restarting WebODM from the terminal WILL correctly update and clear the queue.
Additionally launched Docker containers for GPU processing will not automatically sync, and upon completion of a project they generate orphaned files in the /var/lib/docker/overlay2/ directory that are not viewable, deletable, or otherwise accessible from the web client. In this case the files must be removed manually after issuing a webodm.sh stop and a docker stop command.
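To find the correct /var/lib/docker/overlay2/[Very-Long-ID]/diff directory for a given node container without guessing, `docker inspect` can report the container's writable overlay2 layer directly. The command below assumes the `nodeodmgpu` container name used earlier:

```shell
# Print the container's writable overlay2 layer (the ".../diff" directory):
docker inspect --format '{{ .GraphDriver.Data.UpperDir }}' nodeodmgpu

# The stuck task list then lives under that directory:
# <UpperDir>/var/www/data/tasks.json
```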
How did you install WebODM (docker, installer, etc.)?
I installed WebODM (and its prerequisites) from a Bash CLI script on Ubuntu 20.04 LTS.
What's your browser and operating system? (Copy/paste the output of https://www.whatismybrowser.com/)
Server: Ubuntu 20.04 LTS
Host: Windows 10 Enterprise | Firefox