
scalability of geometry encoding #534

Open
rabernat opened this issue Sep 10, 2024 · 3 comments · Fixed by #535

@rabernat

I've been playing with encoding larger datasets using the cf-xarray geometry approach. Here's some code:

import geopandas as gp
import xvec
import xarray as xr

url = (
    "s3://overturemaps-us-west-2/release/2024-08-20.0/theme=buildings/type=building/"
    "part-00000-2ad9544f-1d68-4a5a-805c-7a5d020d084d-c000.zstd.parquet"
)

# ~ 30s
df = gp.read_parquet(url, columns=['id', 'geometry', 'level'])
# 11447790 rows

# all very fast
ds = (
    xr.Dataset(df)
    .set_coords("geometry")
    .swap_dims({"dim_0": "geometry"})
    .drop_vars("dim_0")
)
ds = ds.xvec.set_geom_indexes('geometry', crs=df.crs)

# encode only 100_000 rows
%time ds_enc = ds.isel(geometry=slice(0, 100_000)).xvec.encode_cf()
# -> Wall time: 4.25 s

I confirmed that the scaling is roughly linear, so at this rate it will take roughly 500 seconds (4.25 s × 11,447,790 / 100,000 ≈ 490 s) to encode the whole dataset. Decoding is about 20x faster.

I'm wondering if there is some low-hanging fruit here that could speed this up.

Alternatively, we could explore storing geometries as WKB, as geoparquet does.
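
For comparison, here's a minimal sketch of what the WKB route could look like using shapely's vectorized helpers. This is not what cf-xarray/xvec currently does, and the geometry_wkb variable name is just an illustration:

import shapely

# serialize every geometry to its WKB bytes in one vectorized GEOS call
wkb = shapely.to_wkb(ds.geometry.values)

# store the bytes as an ordinary 1-D variable along the existing dimension
ds_wkb = ds.drop_vars("geometry").assign(geometry_wkb=("geometry", wkb))

# decoding is a single vectorized call as well
geoms = shapely.from_wkb(ds_wkb.geometry_wkb.values)

Whether object arrays of bytes like this are convenient to persist is a separate question, but the round trip itself is just the two vectorized calls above.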

@dcherian dcherian changed the title scalability of encoding scalability of geometry encoding Sep 10, 2024
dcherian added a commit that referenced this issue Sep 10, 2024
@dcherian (Contributor)

[profiling screenshot]

The whole thing:
[profiling screenshot]

I think we can do better though, primarily by supporting chunked arrays as input. The arrays get passed to GEOS, so we'll need to map_blocks at the appropriate places.
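
As a rough sketch of that idea (not something xvec does today; the chunk size and the choice of get_coordinate_dimension as the per-block GEOS call are just for illustration):

import dask.array as da
import numpy as np
import shapely

# chunk the geometry array so each GEOS call only ever sees one block
geoms = da.from_array(np.asarray(ds.geometry.values), chunks=1_000_000)

# map_blocks hands plain numpy blocks to the vectorized shapely function
coord_dims = geoms.map_blocks(shapely.get_coordinate_dimension, dtype="int64")

# lazy reduction; compute() runs the blocks through the scheduler
# (missing/empty geometries would need the same filtering shapely itself applies)
include_z = bool((coord_dims == 3).any().compute())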

Everything else is spent inside shapely.to_ragged_array. That function has roughly 10% overhead from this kind of error checking:

   250         1        282.0    282.0      0.0      if include_z is None:
   251         2     725395.0 362697.5      0.0          include_z = np.any(
   252         1  749690266.0    7e+08      5.5              get_coordinate_dimension(geometries[~is_empty(geometries)]) == 3
   253                                                   )
   254                                           
   255         1  552494652.0    6e+08      4.0      geom_types = np.unique(get_type_id(geometries))
   256                                               # ignore missing values (type of -1)
   257         1      21755.0  21755.0      0.0      geom_types = geom_types[geom_types >= 0]

@martinfleis a couple of questions:

  1. is it possible for us to set include_z based on what GeometryIndex knows?
  2. np.unique can be slow for large arrays (it sorts first). Perhaps a get_unique_type_id at the C layer would be nice. (A Python-level sketch of both points follows below.)
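
Purely to illustrate both points, a Python-level sketch (include_z=False and the bincount trick are assumptions here, not existing xvec or shapely behaviour, and the type-id part would have to live inside shapely to actually help):

import numpy as np
import shapely

geoms = np.asarray(ds.geometry.values)

# (1) if the dimensionality is known out of band, passing include_z explicitly
# skips the get_coordinate_dimension scan shown in the profile above
geometry_type, coords, offsets = shapely.to_ragged_array(geoms, include_z=False)

# (2) type ids are small non-negative integers (-1 for missing), so a bincount
# gives the set of present types in O(n) without the sort inside np.unique
type_ids = shapely.get_type_id(geoms)
present_types = np.flatnonzero(np.bincount(type_ids[type_ids >= 0]))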

@dcherian dcherian reopened this Sep 11, 2024
@martinfleis (Member)

is it possible for us to set include_z based on what GeometryIndex knows?

Not without writing the line of code shown above in our own code. GeometryIndex has no knowledge of this; to determine the presence of Z we would need to check all the geometries, exactly as shapely does.

np.unique can be slow for large arrays (it sorts first). Perhaps a get_unique_type_id at C-layer would be nice.

Could you open a request in shapely? That would need to be implemented there by someone who speaks C.

@dcherian (Contributor)

Could you open a request in shapely?

OK, let me do some more thorough profiling and I'll open an issue there.
