Documenting/improving memory behavior #257

Open
wumpus opened this issue Jan 7, 2022 · 2 comments

wumpus commented Jan 7, 2022

I'm writing a package using PyAthena that may or may not produce extremely large result sets, so I've been interested in memory usage.

It seems that the default cursor is always fine: it streams results (1,000 rows at a time by default) and uses very little memory because of that. Your benchmarks show that it's a little slow.
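
For reference, roughly the streaming pattern I mean with the default cursor; the staging dir, region, and table here are placeholders:

```python
from pyathena import connect

# Placeholder connection details; the default Cursor pages results with
# fetchmany(), so only about `arraysize` rows are held in memory at a time.
conn = connect(s3_staging_dir="s3://your-bucket/staging/", region_name="us-east-1")
cursor = conn.cursor()
cursor.arraysize = 1000  # the default page size
cursor.execute("SELECT * FROM some_large_table")

while True:
    rows = cursor.fetchmany()
    if not rows:
        break
    for row in rows:
        pass  # handle each row here without accumulating them
```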

_as_pandas always downloads the entire CSV result into memory and then builds a Pandas DataFrame, also in memory. That uses the most memory possible, and your benchmarks show that it's fast.
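
For comparison, the in-memory Pandas path looks roughly like this (connection details are placeholders again):

```python
from pyathena import connect
from pyathena.pandas.cursor import PandasCursor

# The whole result CSV is fetched and materialized as a single DataFrame in memory.
conn = connect(s3_staging_dir="s3://your-bucket/staging/",
               region_name="us-east-1",
               cursor_class=PandasCursor)
df = conn.cursor().execute("SELECT * FROM some_large_table").as_pandas()
print(df.memory_usage(deep=True).sum())
```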

There's no easy built-in way to fetch the result CSV to local disk, which would be the preferred call for bigger-than-RAM results. I could write something that uses the standard Cursor and writes to disk, but it would probably be slower than the already well-tuned boto download_file.
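
What I have in mind is something like the sketch below; it assumes the cursor's output_location attribute points at the result CSV that Athena wrote, and the bucket, table, and filename are placeholders:

```python
import boto3
from urllib.parse import urlparse

from pyathena import connect

conn = connect(s3_staging_dir="s3://your-bucket/staging/", region_name="us-east-1")
cursor = conn.cursor()
cursor.execute("SELECT * FROM some_large_table")

# output_location is the s3:// path of the result CSV written by Athena.
parsed = urlparse(cursor.output_location)
bucket, key = parsed.netloc, parsed.path.lstrip("/")

# download_file streams to disk in chunks, so memory usage stays flat
# even for bigger-than-RAM results.
boto3.resource("s3").Bucket(bucket).download_file(key, "result.csv")
```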

So, what am I suggesting?

  • The standard cursor / result_set should auto-tune itself. I don't see a way to find out the total number of rows in the result CSV, but you can still auto-tune by increasing arraysize by 10% on each call to the __fetch() method (and still respect an arraysize the user has set explicitly). A rough sketch of the growth logic follows this list.
  • There should be a standard method to download the result CSV to a local file using s3.Bucket().download_file(). This handles the bigger-than-memory results case very nicely.
  • For the Pandas case, at least document the memory usage; better still, add an option to download to disk and then load into a DataFrame, which cuts memory usage roughly in half.
  • And for a better streaming Pandas experience, introduce a new Cursor class whose result_set yields chunks of a DataFrame instead of the entire DataFrame. This could have memory usage similar to the standard cursor.
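
A rough sketch of the arraysize growth logic I mean; the class and attribute names here are illustrative, not the library's actual internals:

```python
class AutoTuningFetchSketch:
    """Illustrative only: grow the page size by 10% per fetch unless the
    user has pinned arraysize explicitly."""

    DEFAULT_ARRAYSIZE = 1000
    MAX_ARRAYSIZE = 100_000  # arbitrary cap for the sketch

    def __init__(self, user_arraysize=None):
        self._user_pinned = user_arraysize is not None
        self._arraysize = user_arraysize or self.DEFAULT_ARRAYSIZE

    def next_arraysize(self):
        # Called once per fetch; a user-set arraysize is respected as-is.
        if not self._user_pinned:
            self._arraysize = min(int(self._arraysize * 1.1) + 1,
                                  self.MAX_ARRAYSIZE)
        return self._arraysize
```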

These suggestions don't address issue #61, but the Pandas chunks API suggestion might be a good start for Dask; perhaps someone smarter than me about Dask could comment.
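
For the chunked-DataFrame idea, once the result CSV is on local disk (e.g. via download_file above), the per-chunk behavior could look like this, with the path and chunk size as placeholders:

```python
import pandas as pd

def iter_dataframe_chunks(path, chunksize=100_000):
    """Yield DataFrame chunks so only ~chunksize rows are in memory at once."""
    yield from pd.read_csv(path, chunksize=chunksize)

for chunk in iter_dataframe_chunks("result.csv"):
    print(len(chunk), chunk.memory_usage(deep=True).sum())
```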


wumpus commented Mar 29, 2022

I'm happy to do a PR for the above suggestions, but I hesitate to do so in communities where there is no discussion of issues.

laughingman7743 (Owner) commented

Downloading S3 files locally and loading them by chunk size is a good idea.
However, I believe Pandas 1.2 or 1.3 already supports reading in chunks from S3 via S3Fs.
https://pandas.pydata.org/docs/user_guide/io.html#reading-writing-remote-files

The current implementation downloads files into memory with Boto3.
https://github.com/laughingman7743/PyAthena/blob/master/pyathena/pandas/result_set.py#L143
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.get_object
If a chunk size is specified and the S3Fs library is installed, I think it would be a good idea to pass the s3:// URL directly to Pandas so it can read by chunk size.
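
Something along these lines, assuming S3Fs is installed so Pandas can open the s3:// URL itself; the result path is a placeholder for the query's output location:

```python
import pandas as pd

# Requires s3fs: pandas opens the object via fsspec and reads it per chunk,
# so the full result never has to fit in memory at once.
for chunk in pd.read_csv("s3://your-bucket/staging/abcd1234.csv",
                         chunksize=100_000):
    print(len(chunk))  # handle each chunk here
```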

I have not been able to take the time to do much new implementation work on this library, although I have considered implementing a new cursor that uses S3Fs and the CSV reader without Pandas.
#272
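
A rough sketch of what that cursor's read path might look like, not the actual implementation; the bucket and key are placeholders:

```python
import csv

import s3fs

# Stream the Athena result CSV row by row without Pandas.
fs = s3fs.S3FileSystem()
with fs.open("s3://your-bucket/staging/abcd1234.csv", "r", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        record = dict(zip(header, row))  # handle each row here
```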
