-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add WACZ support #770
base: master
Are you sure you want to change the base?
Add WACZ support #770
Changes from all commits
779978a
e82a4f7
133a9e4
9436999
a422b47
25e91ad
f9447d2
2566b7c
e4b9998
7e2589d
038d2bb
b6e789a
aae8b39
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -17,6 +17,7 @@ | |
import zlib | ||
import surt | ||
import ntpath | ||
import shutil | ||
import traceback | ||
import tempfile | ||
|
||
|
@@ -29,6 +30,8 @@ | |
# from requests.exceptions import ConnectionError | ||
|
||
from ipwb.util import iso8601_to_digits14, ipfs_client | ||
from ipwb.util import is_wacz, extract_warcs_from_wacz | ||
from ipwb.util import cleanup_warc_files_extracted_from_wacz | ||
|
||
import requests | ||
import datetime | ||
|
@@ -119,6 +122,21 @@ def index_file_at(warc_paths, encryption_key=None, | |
for warc_path in warc_paths: | ||
verify_file_exists(warc_path) | ||
|
||
# Extract WARCs from any WACZ files | ||
warc_paths_to_append = [] | ||
wacz_paths = [] | ||
for warc_path in warc_paths: | ||
if is_wacz(warc_path): | ||
(w_paths, dirs_to_cleanup) = extract_warcs_from_wacz(warc_path) | ||
warc_paths_to_append += w_paths | ||
wacz_paths.append(warc_path) | ||
|
||
# Manipulate list of WARCs extracted from WACZ | ||
for ptr in wacz_paths: | ||
warc_paths.remove(ptr) | ||
|
||
warc_paths = warc_paths + warc_paths_to_append | ||
machawk1 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
cdxj_lines = [] | ||
|
||
if outfile: | ||
|
@@ -167,6 +185,8 @@ def index_file_at(warc_paths, encryption_key=None, | |
cdxj_metadata_lines = generate_cdxj_metadata(cdxj_lines) | ||
cdxj_lines = cdxj_metadata_lines + cdxj_lines | ||
|
||
cleanup_warc_files_extracted_from_wacz(warc_paths_to_append) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not a big issue, but I think the temporary folders created by the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The docs say that the creator is responsible for the deletion, so I think we should handle this. Given each WARC gets a new temp directory, it might be better to just retain the copy of this directory path and delete it along with its contents instead of deleting the WARC then the directory, which would require tracking the directory path, too. Which approach would you rather be implemented, @ibnesayeed? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Not really! It is possible to get the path of the directory if the path of a file is known that it contains. That said, I would perhaps preferred not holding onto the list of WARC files, instead, operate on each WARC as we discover them, whether those are regular WARC files or those extracted from WACz files. I would deal with one file at a time and loop over for the next one. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @ibnesayeed This seems like it requires a revamp outside of the scope of this GH issue/PR. I agree that dealing with one WARC at a time would likely be more computationally optimal. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I understand that it would require change in the workflow. When done, it would be more space efficient as not all the WARCs need to be extracted from WACZ files upfront, duplicating them on the disk, before processing them. It is okay to leave it as things are right now and get back to this when we have a WARC record iterator for the WACZ files, when most of these changes will be rendered useless. |
||
|
||
if quiet: | ||
return cdxj_lines | ||
|
||
|
@@ -180,6 +200,10 @@ def index_file_at(warc_paths, encryption_key=None, | |
else: | ||
print('\n'.join(cdxj_lines)) | ||
|
||
# Cleanup, e.g., dirs for WARCs from WACZ | ||
for dir_to_cleanup in dirs_to_cleanup: | ||
shutil.rmtree(dir_to_cleanup) | ||
|
||
|
||
def sanitize_cdxj_line(cdxj_line): | ||
return cdxj_line | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
dirs_to_cleanup
here is overwritten in each loop, so at the end it will only hold the reference to the temporary dirs of the last WACZ file (unless I am missing something) for cleanup.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch on
dirs_to_cleanup
not being retained.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the logic of extracted WARCs and temp directories can be simplified. There is a bit too much disk maintenance. The general gist is that paths to WACZ files could be inter-mingled with WARCs here and thus the WARCs extracted from the WACZ files need to be removed but not the WARCs that were passed in.