Converter class does not convert Athena string data to pandas str type #148

krishanunandy · 2020-07-01T21:44:23Z

First of all, thank you for creating this library! It's been immensely helpful and I've used it in multiple contexts over several years and would love to contribute - especially if it helps solve my current problem!

With pyathena=1.10.7 and pandas=1.0.5 I am running the following code with the expectation that the converter class will cast the Athena string data type as an str pandas dtype.

from pyathena import connect
from pyathena.pandas_cursor import PandasCursor
from pyathena.converter import Converter

class CustomPandasTypeConverter(Converter):

    def __init__(self):
        super(CustomPandasTypeConverter, self).__init__(
            mappings=None,
            types={
                'boolean': bool,
                'tinyint': int,
                'smallint': int,
                'integer': int,
                'bigint': int,
                'float': float,
                'real': float,
                'double': float,
                'decimal': float,
                'char': str,
                'varchar': str,
                'array': str,
                'map': str,
                'row': str,
                'varbinary': str,
                'json': str,
                'string': str
            }
        )

    def convert(self, type_, value):
        # Not used in PandasCursor.
        pass
    
cur = connect(s3_staging_dir='<staging_directory_url>',
                region_name='<aws_region>',
                cursor_class = PandasCursor,
                converter=CustomPandasTypeConverter(),
                work_group = '<workgroup_name>').cursor()

query = 'SELECT * FROM <schema>.<table>'
df = cur.execute(query).as_pandas()
df.dtypes

When I inspect the dtypes, Athena ints are converted to pandas ints, decimals are converted to floats and strings are consistently returned as object dtypes. However Athena string NULLs are cast as NaNs which require explicit column-by-column fillna operations. This is particularly inconvenient, since I'm trying to subsequently convert the pandas dataframe to a Spark dataframe. Now that I've typed all this out, I'm guessing this is related to #118?

Also, I'm not sure where the right place to ask this is, but are there any plans to implement a PySparkCursor for PyAthena? If not can I help by contributing?

The text was updated successfully, but these errors were encountered:

laughingman7743 · 2020-07-02T12:41:24Z

The problem with #118 is that the string type cannot distinguish between empty and null characters. Currently, all empty characters are treated as null.

The reason Dtype is an Object is probably because the type is represented as a Python String object.
Isn't the missing value of the String object a NaN? Is there any other way to describe it?

I don't plan to implement PySparkCursor.
Contributions are free. However, you are required to provide your own test environment with AWS.

krishanunandy · 2020-07-02T17:12:17Z

The reason Dtype is an Object is probably because the type is represented as a Python String object.

That's my understanding too.

Isn't the missing value of the String object a NaN? Is there any other way to describe it?

A NaN is a contraction of 'Not a Number' - it probably shouldn't be used to describe the missing value for a string though I think it is. Until very recently, pandas int dtypes didn't support null values - only float did - and datetimes have their own NaT (Not a Time) type. While not ideal, giving the user the choice to explicitly cast an Athena NULL string as an empty string ('') might be a good solution until someone comes up with an NaS type?

I don't plan to implement PySparkCursor.
Contributions are free. However, you are required to provide your own test environment with AWS.

Thanks for the information. I'll look at the source code and see what I can come up with.

laughingman7743 · 2020-07-03T00:31:43Z

https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values
If Pnadas 1.0.0 or higher, it seems to good to handle all missing values in pd.NA.

krishanunandy · 2020-07-09T16:19:23Z

Good find. It seems though, that -

PyAthena casts strings as object type (i.e. Numpy arrays) instead of the new dedicated StringDtype
The StringDtype is still experimental and probably shouldn't be a dependency of PyAthena

Any thoughts on a potential solution? Pinging @EdwardJRoss since he seems to understand the codebase better than I do.

Liam3851 · 2020-08-14T02:52:49Z

@krishanunandy As noted upthread in Pandas the canonical way of representing strings, before the dedicated StringDtype, was as objects. dtype=str is accepted but results in an object dtype. This is just how Pandas works and has nothing to do with PyAthena, at least not directly:

>>> pd.Series(['a', 'b'], dtype=str).reindex([0, 1, 2])
0    a
1    b
2    NaN
dtype: object

Pandas of course is developing the new String dtype which is the ultimate solution here. You could change your custom converter to use "string" instead of str and use the new dtype.

However, given your use case, it is relevant that through version 3.0, Spark has no support for pd.NA, in any of the new nullable dtypes (including Int64, which PyAthena supports). See SPARK-30966 in the Spark Jira for the ticket on this. Realistically you're probably looking at Spark 3.1... maybe early 2021?... at the earliest.

My suggestion would just be to write a wrapper method that calls .as_pandas(), takes the result, loops over the cursor description to get the type of each column, and runs .fillna('') on any columns with string type.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Converter class does not convert Athena string data to pandas str type #148

Converter class does not convert Athena string data to pandas str type #148

krishanunandy commented Jul 1, 2020 •

edited

Loading

laughingman7743 commented Jul 2, 2020

krishanunandy commented Jul 2, 2020 •

edited

Loading

laughingman7743 commented Jul 3, 2020

krishanunandy commented Jul 9, 2020

Liam3851 commented Aug 14, 2020

Converter class does not convert Athena string data to pandas str type #148

Converter class does not convert Athena string data to pandas str type #148

Comments

krishanunandy commented Jul 1, 2020 • edited Loading

laughingman7743 commented Jul 2, 2020

krishanunandy commented Jul 2, 2020 • edited Loading

laughingman7743 commented Jul 3, 2020

krishanunandy commented Jul 9, 2020

Liam3851 commented Aug 14, 2020

krishanunandy commented Jul 1, 2020 •

edited

Loading

krishanunandy commented Jul 2, 2020 •

edited

Loading