Added pseudolabel_frames.py #19
base: main
Conversation
Moreover, crucially, you save the result as
Hey Yahya, yes thanks for making the PR and including the pictures! :) Before merging in, +1 to Markus' comments and suggestions. To consolidate our suggestions together, can you make the following changes?
Thanks! And let me know if you're unsure about how to tackle any of these :)
fourm/pseudolabel_frames.py (Outdated)
import cv2
from ultralytics import YOLO

SHARDS = "/cluster/work/cotterell/mm_swissai/datasets/hdvila/1000_hd_vila_shuffled/0000000000.tar"
This should be a configurable input
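A minimal sketch of what making the shard path configurable could look like; the flag names and defaults here are illustrative, not taken from the actual PR:

```python
import argparse

def parse_args(argv=None):
    # Hypothetical CLI: replace the hard-coded SHARDS constant with an argument.
    parser = argparse.ArgumentParser(description="Pseudolabel frames from video shards")
    parser.add_argument("--shards", required=True,
                        help="Path to the input .tar shard (or a directory of shards)")
    parser.add_argument("--output_dir", default="root/data/video_det",
                        help="Where to write pseudolabel outputs")
    return parser.parse_args(argv)
```

The script body would then read `args.shards` instead of the module-level constant.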
fourm/pseudolabel_frames.py (Outdated)
from ultralytics import YOLO

SHARDS = "/cluster/work/cotterell/mm_swissai/datasets/hdvila/1000_hd_vila_shuffled/0000000000.tar"
OUTPUT_DIR = "bbox-yolo/extracted_frames"
This should be root/data/video_det
fourm/pseudolabel_frames.py (Outdated)
video.release()

# Apply pseudolabeling to the extracted frames
results = model(frame_paths, project=LABELED_OUTPUT_DIR, name=file[:-4])
Can you extract the bounding box representations as JSONs?
Made the changes, so now the shard path is an argument (the save paths will be changed on Todi; I kept them as they are on Euler for now, because the YOLO model is a bit weird about where it puts the save directory). Also did the nth-frame selection, and it now saves both the bounding-box image and the JSON output like this:
Great! Thanks! It is also tested on Euler, right? I realize you extract the tarfile to a directory. I am not sure if we want this. Maybe it is better not to keep them and just use a temp dir/temp file instead. I am also doing so in tokenization #14. Another thing we need to keep in mind is if we have different fps for different videos. In this case, should we normalize the
fourm/pseudolabel_frames.py (Outdated)

# Load the YOLO model
- model = YOLO('bbox-yolo/yolov8n.pt')  # pretrained YOLOv8n model
+ model = YOLO('/cluster/work/cotterell/yemara/ml-4m/bbox-yolo/yolov8n.pt')  # pretrained YOLOv8n model
super-nit: for more configurability, maybe even make this path an arg in the argparse (and set this as the default)?
fourm/pseudolabel_frames.py (Outdated)

# Extract the tar file
with tarfile.open(SHARDS, "r") as tar:
-     tar.extractall(path="bbox-yolo/extracted_files")
+     tar.extractall(path="extracted_files")
+1 to Markus's comment -- better perhaps to do everything within a tempdir like this https://stackoverflow.com/questions/3223604/how-do-i-create-a-temporary-directory-in-python
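A sketch of the tempdir approach being suggested: extract the shard into a `tempfile.TemporaryDirectory` so nothing persists after processing. The function and variable names are illustrative, and the YOLO call is elided:

```python
import tarfile
import tempfile
from pathlib import Path

def process_shard(shard_path):
    # Extract into a temp dir that is deleted automatically on exit,
    # instead of a persistent extracted_files/ directory.
    with tempfile.TemporaryDirectory() as tmpdir:
        with tarfile.open(shard_path, "r") as tar:
            tar.extractall(path=tmpdir)
        mp4s = sorted(Path(tmpdir).glob("*.mp4"))
        # ... extract frames and run the YOLO model on each video here ...
        return [p.name for p in mp4s]  # tmpdir and its contents are gone after this
```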
fourm/pseudolabel_frames.py (Outdated)
conf = box.conf.item()  # get confidence score
cls = int(box.cls.item())  # get class id
json_data.append({
    "bbox": xyxy,
Can we use key names consistent with the og 4M's bounding-box JSON key names? (I'm 90% sure it's the ones in the example here: #11 (comment))
So for one frame that'd be like:
{
"num_instances": 5,
"image_height": 512,
"image_width": 906,
"instances": [
{
"boxes": [
0.4229210317134857,
0.00020096010121051222,
0.5715101361274719,
0.13699540495872498
],
"score": 0.9029952883720398,
"class_id": 74,
"class_name": "clock",
"segmentation": [
[
0.5055187637969095,
0.1337890625,
...
]
]
},
{
"boxes": [
...
],
...
},
...
]
},
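One way to assemble that per-frame schema is a small pure helper that takes the detections as plain lists, keeping it decoupled from the ultralytics `Results` object. The helper name and its exact inputs are assumptions for illustration (segmentation is omitted here):

```python
def to_4m_frame_json(boxes, scores, class_ids, class_names, height, width):
    """Build a 4M-style per-frame dict. `boxes` are [x1, y1, x2, y2] normalized to [0, 1]."""
    instances = [
        {
            "boxes": list(box),
            "score": score,
            "class_id": cls_id,
            "class_name": class_names[cls_id],
        }
        for box, score, cls_id in zip(boxes, scores, class_ids)
    ]
    return {
        "num_instances": len(instances),
        "image_height": height,
        "image_width": width,
        "instances": instances,
    }
```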
fourm/pseudolabel_frames.py (Outdated)
video.release()

# Apply pseudolabeling to the extracted frames
results = model(frame_paths, project=LABELED_OUTPUT_DIR, name=file[:-4])

for i, result in enumerate(results):
    # Save labeled image
    result.save(filename=f'{file[:-4]}_labeled_frame_{i}.jpg')
Can you actually make this optional in an arg (default false)? While this is useful for debugging, when we run at scale I think we won't want to save every image.
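A sketch of gating the image dump behind a flag that defaults to false; the flag name is illustrative:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser()
    # Off by default: saving every labeled frame is only useful for debugging,
    # not for at-scale runs.
    parser.add_argument("--save_labeled_images", action="store_true",
                        help="Also save each pseudolabeled frame as a .jpg (debug only)")
    return parser
```

The labeling loop would then call `result.save(...)` only when `args.save_labeled_images` is set.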
fourm/pseudolabel_frames.py (Outdated)
})

# Save JSON file
json_filename = os.path.join(JSON_OUTPUT_DIR, f"{file[:-4]}_frame_{i}_boxes.json")
Can we actually do 2 more things for saving the results?
- Aggregate the results into a list of JSONs and save them as jsonl (https://jsonlines.readthedocs.io/en/latest/). Each video should be saved with the same name as its mp4 (so if a file is named 00004.mp4, it should be saved as 00004.jsonl).
- Repackage the jsonls back into tar files whose names correspond to the tarfiles containing those mp4s. For example, all videos extracted from video_rgb/00043.tar should have their corresponding jsonls in video_det/00043.tar. See "Transform from video_rgb format into video_det format and save in video_det/ directory" (#11 (comment)) for more deets!
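Both steps can be done with the standard library alone (no jsonlines dependency needed); the function names here are illustrative:

```python
import json
import tarfile
from pathlib import Path

def save_video_jsonl(frame_dicts, jsonl_path):
    # jsonlines format: one JSON object per line, one line per frame.
    with open(jsonl_path, "w") as f:
        for frame in frame_dicts:
            f.write(json.dumps(frame) + "\n")

def repackage_as_tar(jsonl_dir, tar_path):
    # Bundle the per-video .jsonl files into a video_det shard whose name
    # mirrors the source video_rgb shard (e.g. 00043.tar -> 00043.tar).
    with tarfile.open(tar_path, "w") as tar:
        for jsonl_file in sorted(Path(jsonl_dir).glob("*.jsonl")):
            tar.add(jsonl_file, arcname=jsonl_file.name)
```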
Hey Yahya, looks much better! Just have a few more requests regarding how the outputs of yolo should be saved that I left in-line, can you take a look please?
This is a good point. I think we have a couple of options here: (1) we keep the same FPS for all videos for a given modality, or (2) we record the FPS of the video for each modality in metadata. I'd be a proponent of (2), but would also like to get a take from Ali/other 4M experts.
Thx for coming up with the options. (1) may be enough for a start but could restrict us later, so it may indeed be better to implement this more flexibly already. So I agree that (2) is better.
Another TODO: instead of doing every-nth-frame, keep a fixed FPS for each modality. While less flexible, it's easier to implement, and it doesn't seem necessary now to engineer for differing FPS for different videos. The cost of changing this later is just needing to re-pseudolabel everything, but while we work with small amounts of data to start, this shouldn't be a concern. Also do the same for #14
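Fixed-FPS sampling reduces to picking frame indices from the video's native frame rate; a small sketch (the function name is illustrative, and the native fps/frame count would come from the video reader, e.g. cv2's `CAP_PROP_FPS`/`CAP_PROP_FRAME_COUNT`):

```python
def frame_indices(video_fps, num_frames, target_fps):
    # Indices of the frames to keep so the sampled stream runs at ~target_fps.
    if target_fps >= video_fps:
        return list(range(num_frames))  # can't upsample; keep every frame
    step = video_fps / target_fps  # may be fractional (e.g. 29.97 fps video)
    idx, out = 0.0, []
    while round(idx) < num_frames:
        out.append(round(idx))
        idx += step
    return out
```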
@yahya010 thanks, please also move to fps instead of every_nth_frame (see comment above) |
Force-pushed (004794a to dab7df1): …it take in a dir of tars, move things into tempdirs
Pseudolabeling code:
Goes through each tar file, checks the mp4 files, and pseudolabels each frame in the video.
Here is an example of a frame before and after pseudolabeling:
Before:
After pseudolabeling the frame: