Documenting/improving memory behavior #257

Open
wumpus opened this issue Jan 7, 2022 · 2 comments

wumpus commented Jan 7, 2022

I'm writing a package using PyAthena that may or may not produce extremely large result sets, so I've been interested in memory usage.

It seems that the default cursor is always fine: it streams results (1,000 rows at a time by default) and uses very little memory because of that. Your benchmarks show that it's a little slow.
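
For reference, roughly the streaming pattern I mean with the default cursor; the staging dir, region, and table here are placeholders:

```python
from pyathena import connect

# Placeholder connection details; the default Cursor pages results with
# fetchmany(), so only about `arraysize` rows are held in memory at a time.
conn = connect(s3_staging_dir="s3://your-bucket/staging/", region_name="us-east-1")
cursor = conn.cursor()
cursor.arraysize = 1000  # the default page size
cursor.execute("SELECT * FROM some_large_table")

while True:
    rows = cursor.fetchmany()
    if not rows:
        break
    for row in rows:
        pass  # handle each row here without accumulating them
```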

_as_pandas always downloads the entire CSV result into memory and then builds a Pandas DataFrame, also in memory. That uses the most memory possible, and your benchmarks show that it's fast.
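
For comparison, the in-memory Pandas path looks roughly like this (connection details are placeholders again):

```python
from pyathena import connect
from pyathena.pandas.cursor import PandasCursor

# The whole result CSV is fetched and materialized as a single DataFrame in memory.
conn = connect(s3_staging_dir="s3://your-bucket/staging/",
               region_name="us-east-1",
               cursor_class=PandasCursor)
df = conn.cursor().execute("SELECT * FROM some_large_table").as_pandas()
print(df.memory_usage(deep=True).sum())
```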

There's no easy built-in way to fetch the result CSV to local disk, which would be the preferred call for bigger-than-RAM results. I could write something that uses the standard Cursor and writes to disk, but it would probably be slower than the already well-tuned boto download_file.
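
What I have in mind is something like the sketch below; it assumes the cursor's output_location attribute points at the result CSV that Athena wrote, and the bucket, table, and filename are placeholders:

```python
import boto3
from urllib.parse import urlparse

from pyathena import connect

conn = connect(s3_staging_dir="s3://your-bucket/staging/", region_name="us-east-1")
cursor = conn.cursor()
cursor.execute("SELECT * FROM some_large_table")

# output_location is the s3:// path of the result CSV written by Athena.
parsed = urlparse(cursor.output_location)
bucket, key = parsed.netloc, parsed.path.lstrip("/")

# download_file streams to disk in chunks, so memory usage stays flat
# even for bigger-than-RAM results.
boto3.resource("s3").Bucket(bucket).download_file(key, "result.csv")
```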

So, what am I suggesting?

  • The standard cursor / result_set should auto-tune itself. I don't see a way to find out the total number of rows in the result CSV, but you can still auto-tune by increasing arraysize by 10% on each call to the __fetch() method (and still respect an arraysize the user has set explicitly). A rough sketch of the growth logic follows this list.
  • There should be a standard method to download the result CSV to a local file using s3.Bucket().download_file(). This handles the bigger-than-memory results case very nicely.
  • For the Pandas case, at least document the memory usage; better still, add an option to download to disk and then load into a DataFrame, which cuts memory usage roughly in half.
  • And for a better streaming Pandas experience, introduce a new Cursor class whose result_set yields chunks of a DataFrame instead of the entire DataFrame. This could have memory usage similar to the standard cursor.
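
A rough sketch of the arraysize growth logic I mean; the class and attribute names here are illustrative, not the library's actual internals:

```python
class AutoTuningFetchSketch:
    """Illustrative only: grow the page size by 10% per fetch unless the
    user has pinned arraysize explicitly."""

    DEFAULT_ARRAYSIZE = 1000
    MAX_ARRAYSIZE = 100_000  # arbitrary cap for the sketch

    def __init__(self, user_arraysize=None):
        self._user_pinned = user_arraysize is not None
        self._arraysize = user_arraysize or self.DEFAULT_ARRAYSIZE

    def next_arraysize(self):
        # Called once per fetch; a user-set arraysize is respected as-is.
        if not self._user_pinned:
            self._arraysize = min(int(self._arraysize * 1.1) + 1,
                                  self.MAX_ARRAYSIZE)
        return self._arraysize
```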

These suggestions don't address issue #61, but the Pandas chunks API suggestion might be a good start for Dask; perhaps someone smarter than me about Dask could comment.
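
For the chunked-DataFrame idea, once the result CSV is on local disk (e.g. via download_file above), the per-chunk behavior could look like this, with the path and chunk size as placeholders:

```python
import pandas as pd

def iter_dataframe_chunks(path, chunksize=100_000):
    """Yield DataFrame chunks so only ~chunksize rows are in memory at once."""
    yield from pd.read_csv(path, chunksize=chunksize)

for chunk in iter_dataframe_chunks("result.csv"):
    print(len(chunk), chunk.memory_usage(deep=True).sum())
```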


wumpus commented Mar 29, 2022

I'm happy to do a PR for the above suggestions, but I hesitate to do so in communities where there is no discussion of issues.

laughingman7743 (Owner) commented

Downloading S3 files locally and loading them by chunk size is a good idea.
However, I believe Pandas 1.2 or 1.3 already supports reading in chunks from S3 via S3Fs.
https://pandas.pydata.org/docs/user_guide/io.html#reading-writing-remote-files

The current implementation downloads files into memory with Boto3.
https://github.com/laughingman7743/PyAthena/blob/master/pyathena/pandas/result_set.py#L143
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.get_object
If a chunk size is specified and the S3Fs library is installed, I think it would be a good idea to pass the s3:// URL directly to Pandas so it can read by chunk size.
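
Something along these lines, assuming S3Fs is installed so Pandas can open the s3:// URL itself; the result path is a placeholder for the query's output location:

```python
import pandas as pd

# Requires s3fs: pandas opens the object via fsspec and reads it per chunk,
# so the full result never has to fit in memory at once.
for chunk in pd.read_csv("s3://your-bucket/staging/abcd1234.csv",
                         chunksize=100_000):
    print(len(chunk))  # handle each chunk here
```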

I have not been able to take the time to do much new implementation work on this library, although I have considered implementing a new cursor that uses S3Fs and the CSV reader without Pandas.
#272
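
A rough sketch of what that cursor's read path might look like, not the actual implementation; the bucket and key are placeholders:

```python
import csv

import s3fs

# Stream the Athena result CSV row by row without Pandas.
fs = s3fs.S3FileSystem()
with fs.open("s3://your-bucket/staging/abcd1234.csv", "r", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        record = dict(zip(header, row))  # handle each row here
```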
