
scalability of geometry encoding #534

Open
rabernat opened this issue Sep 10, 2024 · 3 comments · Fixed by #535

@rabernat

I've been playing with encoding larger datasets using the cf-xarray geometry approach. Here's some code:

import geopandas as gp
import xvec
import xarray as xr

url = (
    "s3://overturemaps-us-west-2/release/2024-08-20.0/theme=buildings/type=building/"
    "part-00000-2ad9544f-1d68-4a5a-805c-7a5d020d084d-c000.zstd.parquet"
)

# ~ 30s
df = gp.read_parquet(url, columns=['id', 'geometry', 'level'])
# 11447790 rows

# all very fast
ds = (
    xr.Dataset(df)
    .set_coords("geometry")
    .swap_dims({"dim_0": "geometry"})
    .drop_vars("dim_0")
)
ds = ds.xvec.set_geom_indexes('geometry', crs=df.crs)

# encode only 100_000 rows
%time ds_enc = ds.isel(geometry=slice(0, 100_000)).xvec.encode_cf()
# -> Wall time: 4.25 s

I confirmed that the scaling is roughly linear, so at this rate it will take roughly 500 seconds (4.25 s × 11,447,790 / 100,000 ≈ 490 s) to encode the whole dataset. Decoding is about 20x faster.

I'm wondering if there is some low-hanging fruit here that could speed this up.

Alternatively, we could explore storing geometries as WKB, as geoparquet does.
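
For comparison, here's a minimal sketch of what the WKB route could look like using shapely's vectorized helpers. This is not what cf-xarray/xvec currently does, and the geometry_wkb variable name is just an illustration:

import shapely

# serialize every geometry to its WKB bytes in one vectorized GEOS call
wkb = shapely.to_wkb(ds.geometry.values)

# store the bytes as an ordinary 1-D variable along the existing dimension
ds_wkb = ds.drop_vars("geometry").assign(geometry_wkb=("geometry", wkb))

# decoding is a single vectorized call as well
geoms = shapely.from_wkb(ds_wkb.geometry_wkb.values)

Whether object arrays of bytes like this are convenient to persist is a separate question, but the round trip itself is just the two vectorized calls above.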

@dcherian dcherian changed the title scalability of encoding scalability of geometry encoding Sep 10, 2024
dcherian added a commit that referenced this issue Sep 10, 2024
@dcherian (Contributor)

[profiling screenshot]

The whole thing:
[profiling screenshot]

I think we can do better though, primarily by supporting chunked arrays as input. The arrays get passed to GEOS, so we'll need to map_blocks at the appropriate places.
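
As a rough sketch of that idea (not something xvec does today; the chunk size and the choice of get_coordinate_dimension as the per-block GEOS call are just for illustration):

import dask.array as da
import numpy as np
import shapely

# chunk the geometry array so each GEOS call only ever sees one block
geoms = da.from_array(np.asarray(ds.geometry.values), chunks=1_000_000)

# map_blocks hands plain numpy blocks to the vectorized shapely function
coord_dims = geoms.map_blocks(shapely.get_coordinate_dimension, dtype="int64")

# lazy reduction; compute() runs the blocks through the scheduler
# (missing/empty geometries would need the same filtering shapely itself applies)
include_z = bool((coord_dims == 3).any().compute())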

Everything else is spent inside shapely.to_ragged_array. That function has roughly 10% overhead from this kind of error checking:

   250         1        282.0    282.0      0.0      if include_z is None:
   251         2     725395.0 362697.5      0.0          include_z = np.any(
   252         1  749690266.0    7e+08      5.5              get_coordinate_dimension(geometries[~is_empty(geometries)]) == 3
   253                                                   )
   254                                           
   255         1  552494652.0    6e+08      4.0      geom_types = np.unique(get_type_id(geometries))
   256                                               # ignore missing values (type of -1)
   257         1      21755.0  21755.0      0.0      geom_types = geom_types[geom_types >= 0]

@martinfleis a couple of questions:

  1. is it possible for us to set include_z based on what GeometryIndex knows?
  2. np.unique can be slow for large arrays (it sorts first). Perhaps a get_unique_type_id at the C layer would be nice. (A Python-level sketch of both points follows below.)
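
Purely to illustrate both points, a Python-level sketch (include_z=False and the bincount trick are assumptions here, not existing xvec or shapely behaviour, and the type-id part would have to live inside shapely to actually help):

import numpy as np
import shapely

geoms = np.asarray(ds.geometry.values)

# (1) if the dimensionality is known out of band, passing include_z explicitly
# skips the get_coordinate_dimension scan shown in the profile above
geometry_type, coords, offsets = shapely.to_ragged_array(geoms, include_z=False)

# (2) type ids are small non-negative integers (-1 for missing), so a bincount
# gives the set of present types in O(n) without the sort inside np.unique
type_ids = shapely.get_type_id(geoms)
present_types = np.flatnonzero(np.bincount(type_ids[type_ids >= 0]))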

@dcherian dcherian reopened this Sep 11, 2024
@martinfleis (Member)

is it possible for us to set include_z based on what GeometryIndex knows?

Not without writing the line of code shown above in our own code. GeometryIndex has no knowledge of this; to determine the presence of Z we would need to check all the geometries, exactly as shapely does.

np.unique can be slow for large arrays (it sorts first). Perhaps a get_unique_type_id at C-layer would be nice.

Could you open a request in shapely? That would need to be implemented there by someone who speaks C.

@dcherian (Contributor)

Could you open a request in shapely?

OK, let me do some more thorough profiling and I'll open an issue there.
