Add endpoint for querying a folder or asset path in a Dandiset #1837

jwodder · 2024-01-26T15:22:28Z

Implementing https://github.com/dandi/dandi-webdav and some other things would be much easier if it were possible to make a single API request that provided a path and received in return either the asset at that path or an indication that the path is a folder, possibly along with the folder's children. Note that the current /dandisets/000108/versions/draft/assets/paths/ endpoint is insufficient for this, as it does not return asset properties or metadata, and even if it did, setting path_prefix to an asset path returns an empty list, so one cannot make a single request that returns either an asset or the contents of a folder at a path.

Possible endpoint: GET /dandiset/{dandiset_id}/versions/{version_id}/assets/atpath/?path={path}[&metadata=<0|1>]

If path is omitted from the request, return a 4xx error.
If path is not an asset path and there are no assets under {path}/, return 404.

If path points to an asset, the following is returned:

{
    "type": "asset",
    "asset": {
         // Asset properties (ID, size, created, etc.) as would be returned while paginating over `/dandisets/{dandiset_id}/versions/{version_id}/assets/`
         // If `metadata=1` was included in the query, "asset" also contains a "metadata" field containing the asset metadata
    }
}

If there are assets under {path}/ (i.e., if path is a folder), the following or similar is returned:

{
    "type": "folder",
    "folder": {
         "aggregate_files": INT,
         "aggregate_size": INT
    }
}

Ideally, I'd like it if including a children=1 query parameter would, if path turns out to be a folder, cause the response to include a list of all direct children of path, both assets (with properties, and including metadata if metadata=1) and paths of subfolders. However, this would need to be paginated, and having a response that (a) is sometimes paginated and (b) only paginates a subset of the data returned seems awkward.

A parallel endpoint could also be implemented for Zarr entries.

The text was updated successfully, but these errors were encountered:

satra · 2024-01-26T15:28:24Z

@jwodder - in the current paths api the absence of asset i think indicates it's a folder. doesn't that not satisfy your requirements except for the metadata return parameter?

current return of `paths` api at the top level of 000108

{
  "count": 7,
  "next": null,
  "previous": null,
  "results": [
    {
      "path": "dataset_description.json",
      "version": 500,
      "aggregate_files": 1,
      "aggregate_size": 71,
      "asset": {
        "asset_id": "2847011b-f9fb-4933-a6b5-1641e0c1886b",
        "url": "https://dandiarchive.s3.amazonaws.com/blobs/c07/71a/c0771a4f-3483-47e7-821e-b28ac8df46a5"
      }
    },
    {
      "path": "samples.tsv",
      "version": 500,
      "aggregate_files": 1,
      "aggregate_size": 572,
      "asset": {
        "asset_id": "a8a5fed9-02bf-4d54-b457-12427a4d3e70",
        "url": "https://dandiarchive.s3.amazonaws.com/blobs/369/c1b/369c1b33-7311-402a-a236-373b5af3a800"
      }
    },
    {
      "path": "sub-mEhm",
      "version": 500,
      "aggregate_files": 168,
      "aggregate_size": 3102758281057,
      "asset": null
    },
    {
      "path": "sub-MITU01",
      "version": 500,
      "aggregate_files": 6375,
      "aggregate_size": 290203968395869,
      "asset": null
    },
    {
      "path": "sub-MITU01h3",
      "version": 500,
      "aggregate_files": 475,
      "aggregate_size": 52424119215488,
      "asset": null
    },
    {
      "path": "sub-SChmi53",
      "version": 500,
      "aggregate_files": 574,
      "aggregate_size": 12364425736517,
      "asset": null
    },
    {
      "path": "sub-U01hm15x",
      "version": 500,
      "aggregate_files": 490,
      "aggregate_size": 16635711419780,
      "asset": null
    }
  ]
}

jwodder · 2024-01-26T15:32:07Z

@satra

That doesn't return asset properties.
If I want to check whether, say, foo/bar is an asset (and get its properties) or a folder (and get its children), there is no single paths request that gives me what I want:
- If I set path_prefix to foo/bar:
  - If foo/bar is an asset, I get nothing — BAD.
  - If foo/bar is a folder, I get all of its children — GOOD.
- If I set path_prefix to foo:
  - If foo/bar is an asset, I get its asset ID, along with a bunch of other irrelevant stuff that I may have to paginate through — SUBOPTIMAL.
  - If foo/bar is a folder, I get confirmation that it's a folder, but I don't get its children, and there's still a bunch of other irrelevant stuff to filter out — BAD.

satra · 2024-01-26T16:42:43Z

If I set path_prefix to foo/bar:
If foo/bar is an asset, I get nothing — BAD.

this could be fixed i think. the current api assumes one is using it for navigation and hence this situation never comes up. on that front can you describe your use case in webdav that would be different from the tree navigator in the web app?

If I set path_prefix to foo:
If foo/bar is a folder, I get confirmation that it's a folder, but I don't get its children, and there's still a bunch of other irrelevant stuff to filter out — BAD.

i'm not sure why i would return children of anything that is a folder, unless we parameterize that with a depth field. again something we could add to the api. however, it would be good to know what the webdav use case for it is. an intent behind that api was to provide a limited view so as to not overwhelm the server and to reflect what the ui looks like.

i think the specific use case of webdav that's different from the online file browser would be good to know. if not, it would be good to examine whether the filebrowser could be improved.

jwodder · 2024-01-26T16:56:47Z

@satra In order to make dandi-webdav more efficient (cf. dandi/dandi-webdav#5), it needs to be able to take a path and do the following operations on it:

For a GET request or a PROPFIND request with Depth: 1, dandi-webdav needs to be able to efficiently fetch whatever's at the path as follows:
- If the path points to a blob asset, the blob's S3 download URL (for GET) or the asset's properties & metadata (for PROPFIND) must be retrieved.
- If the path points to an asset folder, all assets (including their properties & metadata) and names of subfolders in the folder must be retrieved.
- If the path points to a Zarr asset, then the Zarr's properties & metadata and all top-level entries (including their S3 properties) and subfolders must be retrieved.
- If the path is a Zarr asset path concatenated with a Zarr entry path, then the Zarr asset's S3 URL must be retrieved, followed by a query to S3 (because the Archive currently doesn't have an equivalent of …/assets/paths/ for Zarr entries) to retrieve whatever's at the entry path:
  - If the entry path points to an entry, its S3 download URL (for GET) or S3 properties (for PROPFIND) must be retrieved.
  - If the entry path points to a folder, all entries (including their S3 properties) and names of subfolders in the folder must be retrieved.
For a PROPFIND request with Depth: 0, same as above, except that folder entries do not need to be retrieved.

satra · 2024-01-30T02:58:58Z

let's consolidate this with the web file browser. @mvandenburgh - do you have thoughts on the issues raised here.

If the path points to a Zarr asset, then the Zarr's properties and all top-level entries (including their S3 properties) and subfolders must be retrieved.

not sure why this is a requirement for webdavfs. i would suggest any zarr reading part be left to a zarr library and a zarr object is treated the same way as any other binary file.

jwodder · 2024-01-30T03:07:53Z

@satra It's a requirement because that's what @yarikoptic told me to do. You can't treat a Zarr as a binary file, as it isn't a file in the first place.

waxlamp · 2024-01-31T21:59:56Z

let's consolidate this with the web file browser

I'd prefer we avoid this design approach. The web file browser and this webdav server have fundamentally different requirements and therefore should not be served by one consolidated API. And I'd rather not add to the public API for a single client's needs (unless and until that API shows promise for general usefulness).

What I'd like to try is creating some private API tailored to the dandidav's specific needs (recognizing that dandidav, like the web client and the CLI, is an official interface to the DANDI datastore and therefore has special privileges). Later on, we can factor out anything that is suitable for the public API.

I realize this may sound like an odd approach, but it has advantages:

it can immediately deliver the customized performance value that dandidav is seeking
it means we don't need to agonize over the design of this new API--it can instead be whatever John needs
it leaves open the future possibility of creating a public version (if needed)

yarikoptic · 2024-02-01T15:06:52Z

what is "private API" @waxlamp in the context of DANDI?

yarikoptic · 2024-02-01T15:12:59Z

If the path points to a Zarr asset, then the Zarr's properties and all top-level entries (including their S3 properties) and subfolders must be retrieved.

not sure why this is a requirement for webdavfs. i would suggest any zarr reading part be left to a zarr library and a zarr object is treated the same way as any other binary file.

@satra It's a requirement because that's what @yarikoptic told me to do. You can't treat a Zarr as a binary file, as it isn't a file in the first place.

the goal here for having zarr as a browsable folder is multifold

navigate entire dandiarchive "space" as if it was just a filesystem to browse - to review files, to FUSE mount it and access any file or "formatted folder" like zarr is
to provide access to zarrs as if they are just hosted on the web (thus not just a "binary thing" for zarr).

Just try it at https://dandi.centerforopenneuroscience.org/dandisets/000108/draft/sub-MITU01h3/ses-20211130h19m15s39/micr/ and around there ;-)

yarikoptic · 2024-02-01T15:53:58Z

I think we first need to acknowledge that current API related to paths is quite odd in

(already known and thus acknowledged) swapped meaning of path vs path_prefix: Fix naming of path and path_prefix parameter names #1562
(already known and thus acknowledged. edit: actually fixed at API level) /paths should 404 if there are not assets with path_prefix/ #546
its behavior of returning a record with count 0 and empty results when you hit an asset:

{
  "count": 0,
  "next": null,
  "previous": null,
  "results": []
}

it is odd because it is an implicit knowledge to give meaning to such a result. It at least does return 404 when pointing to non-existing path (phewph).
4. implicit type of a record - no clear indication on what result is (folder? asset? and further type of asset - file? zarr folder?). Implicit logic: "if there is an "asset" -- it is a 'file'; if no "asset" - it is a 'folder'":

    {
      "path": "sub-RAT123",
      "version": 419,
      "aggregate_files": 1,
      "aggregate_size": 18792,
      "asset": null
    },
    {
      "path": "test1234",
      "version": 419,
      "aggregate_files": 1,
      "aggregate_size": 9,
      "asset": {
        "asset_id": "28155d24-9342-48bc-a3a7-b69191e86e13",
        "url": "https://dandiarchive.s3.amazonaws.com/blobs/031/7cf/0317cf5a-4047-4e19-aae1-4f7b7434d2d7"
      }
    },

From this point of acknowledging that it sucks quite a bit, it sounds like we really need to strive to make it better, and then benefit from an improvement.

What I see possible to be done to make it better without breaking anything:

add to result records date_modified so could be used by file browser
add to result records some kind of result_type and make it as useful as we can: e.g. Folder, AssetBlob, AssetZarr - to remove ambiguity of 4. above

And our file browser could immediately benefit and expose Date Modified column and allow sorting etc. Nobody complained yet but I bet it would be UX improvement.

And now we can think about possible changes which we can expose "conditionally" based on some option to the request without breaking API (we already did that for metadata of /assets/ end point and only benefited IMHO from that).

E.g. we could add include_hit (or alike) which would default to False and thus retaining the current behavior; but for True it would

add a record for the exact path (well -- path_prefix here due to oddity 1.), so instead of that empty record with 0 we would get count: 1 with a record

    {
      "path": null,
      "version": 419,
      "aggregate_files": 1,
      "aggregate_size": 9,
      "date_modified": "2023-03-16T11:50:08.973069-04:00",
      "result_type": "AssetBlob",
      "asset": {
        "asset_id": "28155d24-9342-48bc-a3a7-b69191e86e13",
        "url": "https://dandiarchive.s3.amazonaws.com/blobs/031/7cf/0317cf5a-4047-4e19-aae1-4f7b7434d2d7"
      }

and if path points to the folder -- it would be that folder's record as when listed from its parent, but with result_type "Folder". It will be pretty much the "." you see in ls -la and thus could even be included in the listing of the folder we have now. Again -- our file browser could benefit from that and present that information to the user.

@jwodder do you think this would suffice for our needs in webdav implementation?

jwodder · 2024-02-01T16:01:17Z

@yarikoptic This would improve things, but WebDAV would still have to make further requests to get asset properties & metadata (both for an asset at a path and for assets under a folder). Note that I already described what information WebDAV needs per path in #1837 (comment).

yarikoptic · 2024-02-01T17:37:42Z

I see:

the asset's properties & metadata (for PROPFIND) must be retrieved

could you @jwodder list explicitly specific metadata fields and properties which are of interest here? I guess adding metadata flag to return actual metadata record as we have done for /assets would be overkill here.

jwodder · 2024-02-01T17:40:30Z

@yarikoptic

Asset properties used by dandidav:
- asset_id
- blob_id
- zarr_id
- path
- size
- created
- modified
  - Note: It might be more accurate to use the blobDateModified metadata field for this instead; cf. Use blobDateModified metadata instead of modified property for blob assets' "modified" timestamps dandidav#17
Asset metadata used by dandidav:
- encodingFormat (blobs only)
- contentUrl (API download URL for blobs, S3 URL for Zarrs)
- digest["dandi:dandi-etag"] (blobs only)

yarikoptic · 2024-02-01T19:37:41Z

Let's "side discuss" at #1851 where it might be easier "per line"

jwodder added the enhancement New feature or request label Jan 28, 2024

jwodder mentioned this issue Jan 31, 2024

404 for a legit subject "folder" dandi/dandidav#39

Closed

yarikoptic added a commit that referenced this issue Feb 1, 2024

Just a discussion pasted from #1837

c87cc8c

yarikoptic mentioned this issue Feb 1, 2024

A discussion on needed /paths properties and metadata #1851

Draft

yarikoptic mentioned this issue Feb 13, 2024

Systematically compare/investigate performance on Zarrs dandi/dandidav#54

Open

jwodder mentioned this issue Aug 13, 2024

Design doc for Zarr versioning/publishing support via Zarr Manifest Files #1892

Draft

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add endpoint for querying a folder or asset path in a Dandiset #1837

Add endpoint for querying a folder or asset path in a Dandiset #1837

jwodder commented Jan 26, 2024 •

edited

Loading

satra commented Jan 26, 2024

jwodder commented Jan 26, 2024 •

edited

Loading

satra commented Jan 26, 2024

jwodder commented Jan 26, 2024 •

edited

Loading

satra commented Jan 30, 2024

jwodder commented Jan 30, 2024

waxlamp commented Jan 31, 2024

yarikoptic commented Feb 1, 2024

yarikoptic commented Feb 1, 2024

yarikoptic commented Feb 1, 2024 •

edited

Loading

jwodder commented Feb 1, 2024

yarikoptic commented Feb 1, 2024

jwodder commented Feb 1, 2024 •

edited

Loading

yarikoptic commented Feb 1, 2024

Add endpoint for querying a folder or asset path in a Dandiset #1837

Add endpoint for querying a folder or asset path in a Dandiset #1837

Comments

jwodder commented Jan 26, 2024 • edited Loading

satra commented Jan 26, 2024

jwodder commented Jan 26, 2024 • edited Loading

satra commented Jan 26, 2024

jwodder commented Jan 26, 2024 • edited Loading

satra commented Jan 30, 2024

jwodder commented Jan 30, 2024

waxlamp commented Jan 31, 2024

yarikoptic commented Feb 1, 2024

yarikoptic commented Feb 1, 2024

yarikoptic commented Feb 1, 2024 • edited Loading

jwodder commented Feb 1, 2024

yarikoptic commented Feb 1, 2024

jwodder commented Feb 1, 2024 • edited Loading

yarikoptic commented Feb 1, 2024

jwodder commented Jan 26, 2024 •

edited

Loading

jwodder commented Jan 26, 2024 •

edited

Loading

jwodder commented Jan 26, 2024 •

edited

Loading

yarikoptic commented Feb 1, 2024 •

edited

Loading

jwodder commented Feb 1, 2024 •

edited

Loading