I'm writing a package using PyAthena that may or may not produce extremely large result sets, so I've been interested in memory usage.
It seems that the default cursor is always fine: it streams (1000 rows at a time by default) and uses very little memory because of that. Your benchmarks show that it's a little slow.
_as_pandas always downloads the entire CSV result into memory and then builds a Pandas DataFrame, also in memory. That's the largest possible memory usage, and your benchmarks show that it's fast.
There's no easy built-in way to fetch the result CSV to local disk, which would be the preferred call for bigger-than-RAM results. I could write something that uses the standard Cursor and writes to disk, but it would probably be slower than the already well-tuned boto download_file.
So, what am I suggesting?
The standard cursor / result_set should auto-tune itself. I don't see a way to find out the total number of result rows in the CSV, but you can still auto-tune by increasing arraysize by about 10% on each call to the __fetch() method. (If the user has set arraysize explicitly, you can still respect that.)
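To make the growth rule concrete, here is a minimal sketch; the names and the cap are illustrative assumptions, not PyAthena internals:

```python
# Illustrative sketch of the auto-tuning rule: grow the fetch size by ~10%
# per __fetch() call unless the user pinned arraysize explicitly.
# DEFAULT_ARRAYSIZE matches the documented 1000-row default; MAX_ARRAYSIZE
# is an assumed cap, not an Athena or PyAthena constant.
DEFAULT_ARRAYSIZE = 1000
MAX_ARRAYSIZE = 10_000


def next_arraysize(current_arraysize: int, user_set: bool = False) -> int:
    if user_set:
        # Respect an arraysize the user set explicitly.
        return current_arraysize
    return min(int(current_arraysize * 1.1), MAX_ARRAYSIZE)
```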
There should be a standard method to download the result CSV to a local file using s3.Bucket().download_file(). This handles the bigger-than-memory results case very nicely.
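A minimal sketch of what such a method could look like, assuming the query has already finished and the result CSV sits at the OutputLocation reported by get_query_execution (the function name download_result_csv is hypothetical):

```python
from urllib.parse import urlparse

import boto3


def download_result_csv(query_execution_id, local_path, region_name=None):
    """Download the Athena result CSV for a finished query straight to disk."""
    athena = boto3.client("athena", region_name=region_name)
    execution = athena.get_query_execution(QueryExecutionId=query_execution_id)
    output_location = execution["QueryExecution"]["ResultConfiguration"]["OutputLocation"]

    # OutputLocation looks like s3://bucket/prefix/<query_execution_id>.csv
    parsed = urlparse(output_location)
    bucket, key = parsed.netloc, parsed.path.lstrip("/")

    # Let boto3's tuned transfer manager do the actual download.
    boto3.resource("s3").Bucket(bucket).download_file(key, local_path)
    return local_path
```

Once the CSV is on disk, pandas can load it with pd.read_csv(local_path) without holding the raw CSV bytes and the DataFrame in memory at the same time, which also covers the next point.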
For the pandas case, at least document the memory usage; better still would be an option to download to disk and then load into a DataFrame, which roughly cuts peak memory usage in half.
And for a better streaming Pandas experience, introduce a new Cursor class whose result_set yields chunks of a DataFrame instead of the entire DataFrame. Its memory usage could be similar to the standard cursor's.
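This is not an existing PyAthena API, just a rough sketch of what the chunked result could look like, built on the hypothetical download_result_csv above:

```python
import pandas as pd


def iter_result_chunks(query_execution_id, chunksize=100_000,
                       local_path="/tmp/athena_result.csv"):
    """Yield the query result as DataFrame chunks instead of one big frame."""
    download_result_csv(query_execution_id, local_path)
    # pd.read_csv(..., chunksize=...) returns an iterator of DataFrames,
    # so peak memory is roughly one chunk rather than the full result set.
    yield from pd.read_csv(local_path, chunksize=chunksize)
```

Each chunk is an ordinary DataFrame, so this would also be a natural boundary for handing work off to dask.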
These suggestions don't address issue #61, but the Pandas chunks API suggestion might be a good starting point for dask -- perhaps someone who knows dask better than I do could comment.
I have not been able to find the time to do much new implementation work on this library, although I have considered implementing a new cursor that uses S3Fs and a CSV reader without Pandas. #272