-
Notifications
You must be signed in to change notification settings - Fork 358
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implements Index.putmask #1560
base: master
Are you sure you want to change the base?
Implements Index.putmask #1560
Conversation
Could you also delete the @@ -58,7 +58,6 @@ class MissingPandasLikeIndex(object):
is_type_compatible = _unsupported_function("is_type_compatible")
join = _unsupported_function("join")
map = _unsupported_function("map")
- putmask = _unsupported_function("putmask")
ravel = _unsupported_function("ravel")
reindex = _unsupported_function("reindex")
searchsorted = _unsupported_function("searchsorted")
@@ -131,7 +130,6 @@ class MissingPandasLikeMultiIndex(object):
is_type_compatible = _unsupported_function("is_type_compatible")
join = _unsupported_function("join")
map = _unsupported_function("map")
- putmask = _unsupported_function("putmask")
ravel = _unsupported_function("ravel")
reindex = _unsupported_function("reindex")
remove_unused_levels = _unsupported_function("remove_unused_levels") |
@beobest2 can you fix the test? |
@HyukjinKwon okay I'll fix the test |
Codecov Report
@@ Coverage Diff @@
## master #1560 +/- ##
==========================================
- Coverage 94.55% 94.25% -0.31%
==========================================
Files 38 38
Lines 8767 8715 -52
==========================================
- Hits 8290 8214 -76
- Misses 477 501 +24
Continue to review full report at Codecov.
|
databricks/koalas/indexes.py
Outdated
masking_col = verify_temp_column_name(sdf, "__masking_column__") | ||
|
||
if isinstance(value, (list, tuple)): | ||
replace_udf = udf(lambda x: value[x], _infer_type(value[0])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to use pandas_udf
instead of udf
? If possible, could you replace with it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is possible! I modified it to use pandas_udf
databricks/koalas/indexes.py
Outdated
sdf = sdf.withColumn(replace_col, replace_udf(dist_sequence_col_name)) | ||
elif isinstance(value, (Index, Series)): | ||
value = value.to_numpy().tolist() | ||
replace_udf = udf(lambda x: value[x], _infer_type(value[0])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
databricks/koalas/indexes.py
Outdated
elif not isinstance(mask, list) and not isinstance(mask, tuple): | ||
raise TypeError("Mask data doesn't support type " "{0}".format(type(mask).__name__)) | ||
|
||
masking_udf = udf(lambda x: mask[x], BooleanType()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
databricks/koalas/indexes.py
Outdated
sdf = sdf.withColumn(replace_col, F.lit(value)) | ||
|
||
if isinstance(mask, (Index, Series)): | ||
mask = mask.to_numpy().tolist() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should do this.
databricks/koalas/indexes.py
Outdated
# | 4| e| 500| false| | ||
# +-------------------------------+-----------------+------------------+------------------+ | ||
|
||
cond = F.when(sdf[masking_col], sdf[replace_col]).otherwise(sdf[scol_name]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you use scol_for(sdf, scol_name)
?
self.assert_eq( | ||
kidx.putmask(kidx < "c", ks.Series(["g", "h", "i", "j", "k"])).sort_values(), | ||
pidx.putmask(pidx < "c", pd.Series(["g", "h", "i", "j", "k"])).sort_values(), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the length of value
is not same as the index length? Could you add the tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ueshin Thanks for the comment! I will address it as you comments. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ueshin
If the length of the mask
in the pandas is different, ValueError
is raised.
>>> pidx
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
>>> pidx.putmask([True, False], pd.Series(["g", "h", "i", "j", "k"])).sort_values()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/hwpark/Desktop/git_koalas/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4041, in putmask
raise err
File "/Users/hwpark/Desktop/git_koalas/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4037, in putmask
np.putmask(values, mask, self._convert_for_op(value))
File "<__array_function__ internals>", line 6, in putmask
ValueError: putmask: mask and data must be the same size
So I fixed Koalas to raise the same error as well.
>>> kidx.putmask([True, False], ks.Series(["g", "h", "i", "j", "k"])).sort_values()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/hwpark/Desktop/git_koalas/koalas/databricks/koalas/indexes.py", line 1612, in putmask
raise ValueError("mask and data must be the same size")
ValueError: mask and data must be the same size
If the value
have different length in pandas, it works like this:
>>> pidx
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
>>> pidx.putmask(pidx > 'c', pd.Series(["g", "h"])).sort_values()
Index(['a', 'b', 'c', 'g', 'h'], dtype='object')
>>> pidx.putmask(pidx < 'c', pd.Series(["g", "h"])).sort_values()
Index(['c', 'd', 'e', 'g', 'h'], dtype='object')
>>> pidx.putmask(pidx < 'c', pd.Series(["g"])).sort_values()
Index(['c', 'd', 'e', 'g', 'g'], dtype='object')
>>> pidx.putmask([True, False, True, False, True], pd.Series(["g", "h"])).sort_values()
Index(['b', 'd', 'g', 'g', 'g'], dtype='object')
I thought the behavior of Pandas was ambiguous, so I left the comments at line 1593 for now.
# TODO: We can't support different size of value for now.
@beobest2 could you rebase this when available ? |
@itholic sure :) |
if isinstance(value, (list, tuple, Index, Series)): | ||
if isinstance(value, (list, tuple)): | ||
pandas_value = pd.Series(value) | ||
elif isinstance(value, (Index, Series)): | ||
pandas_value = value.to_pandas() | ||
|
||
if self.size != pandas_value.size: | ||
# TODO: We can't support different size of value for now. | ||
raise ValueError("value and data must be the same size") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can support for only same size, I think we shouldn't support this API for non-scalar objects for now.
Since we're using pd.Series(value)
and value.to_pandas()
above, It looks quite dangerous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we better support this API only for the ks.Index
so that we can avoid the collect all the data into single machine.
Maybe I think we can apply almost same concept with implementation of Series.where
. (https://koalas.readthedocs.io/en/latest/_modules/databricks/koalas/series.html#Series.where)
Would you tell me what do you think about this way when you available, @ueshin @HyukjinKwon ?
Hi @beobest2, since Koalas has been ported to Spark as pandas API on Spark, would you like to migrate this PR to the Spark repository? Here is the ticket https://issues.apache.org/jira/browse/SPARK-36403. Otherwise, I may do that for you next week. |
Hi @xinrong-databricks I would like to migrate this PR to the Spark repository. I will try to finish it by next week. |
Please take your time :) Thank you! |
@xinrong-databricks I created a PR at apache/spark#33744 . Please take a look :) |
Certainly, let's discuss in the new PR then! Thanks for the porting. |
Implementing
Index.putmask