Functions to extract all Darwin cut notes based on image dimensions, written in Python and Spark for efficient parallel processing

HackTheStacks/darwin-image-preprocessing

darwin-image-preprocessing

Functions to extract all Darwin cut notes based on image dimensions and discard full-page notes (non-cut notes). They work by comparing each image's dimensions to the mean image dimensions within its folder. Written in PySpark for efficient parallel processing, since the dataset is ~350GB and ~60k images.
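The folder-mean comparison described above can be sketched in plain Python. This is a minimal illustration, not the repository's actual code: the function name `split_cut_notes`, the `(folder, filename, width, height)` record layout, the `ratio` threshold of 0.8, and the rule "smaller than the folder mean in either dimension means cut note" are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def split_cut_notes(records, ratio=0.8):
    """Split images into cut notes and full-page notes.

    records: iterable of (folder, filename, width, height) tuples.
    ratio:   hypothetical threshold; an image counts as a cut note when its
             width or height is below ratio * the folder mean (assumption).
    """
    # Group images by the folder they live in.
    by_folder = defaultdict(list)
    for folder, name, width, height in records:
        by_folder[folder].append((name, width, height))

    cut_notes, full_pages = [], []
    for folder, images in by_folder.items():
        # Mean dimensions are computed per folder, as the README describes.
        mean_w = mean(w for _, w, _ in images)
        mean_h = mean(h for _, _, h in images)
        for name, w, h in images:
            is_cut = w < ratio * mean_w or h < ratio * mean_h
            (cut_notes if is_cut else full_pages).append((folder, name))
    return cut_notes, full_pages


if __name__ == "__main__":
    sample = [
        ("box1", "page_a.jpg", 1000, 1400),
        ("box1", "page_b.jpg", 1000, 1400),
        ("box1", "slip_c.jpg", 400, 300),
    ]
    cuts, pages = split_cut_notes(sample)
    print("cut notes:", cuts)
    print("full pages:", pages)
```

In PySpark, the same per-folder means could plausibly be computed with `groupBy("folder").agg(avg("width"), avg("height"))` and joined back onto the image rows, which lets the comparison run in parallel across the dataset.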
