Upload Image Files to IDR #54

Open
shntnu opened this issue Dec 11, 2020 · 149 comments

Comments

@shntnu
Collaborator

shntnu commented Dec 11, 2020

We will upload image files to the Image Data Resource and add URL and metadata information to the Broad Bioimage Benchmark Collection.

We will use this issue to outline the required steps.

From IDR:

  1. study file describing the overall study and the screens that were performed e.g. cell health
  2. library file(s) describing the plate layout of each screen e.g. cell health
  3. processed data file(s) containing summary results and/or a 'hit' list for each screen

All files should be in tab-delimited text format.
Templates are provided but can be modified to suit your experiment.
Add or remove columns from the templates as necessary.

@gwaygenomics Did you have a processed data file for cell health?

@shntnu
Collaborator Author

shntnu commented Feb 26, 2021

A conclusion from our internal discussion: Let's also include the LKCP dataset when submitting to IDR.

@gwaybio
Member

gwaybio commented Mar 5, 2021

The first step is to reach out to IDR to see if they would be interested in hosting these data. I plan on doing this today.

Becki will be taking notes on the submission process, on the wiki. I will use this issue to jot down specific metadata information that we'll likely need to track for IDR.

A couple immediate answers to track:

How many files?
How big is the total set?

Batch                          Plates   File count    Size
2016_04_01_a549_48hr_batch1    136      ~2,200,000    22 TB
2017_12_05_Batch2              135      ~2,700,000    23.8 TB

Could you send us a draft of the publication, or is it submitted to a pre-print archive such as bioRxiv? Do you have a timeline for publication?

We hope to submit a preprint in 2-3 months.

What is the image file format?

.tiff

Do you have feature level data, ROIs or tracking data available for this dataset?

Yes. Feature level data are available at https://github.com/broadinstitute/lincs-cell-painting/

Could this dataset be integrated with other datasets, e.g. through genes (orthologs) or phenotypes?

Definitely. These data are morphologies after thousands of drug perturbations. Data can be linked by drug information.

@gwaybio
Member

gwaybio commented Mar 16, 2021

Initial inquiry sent on March 16, 2021 with ipLINCS project tag and subject: "[IDR] LINCS Cell Painting - a 45TB benchmark dataset of drug perturbations"

@gwaybio
Member

gwaybio commented May 21, 2021

On March 23, 2021, we received word from the IDR staff that they will not accept our data without a manuscript draft first.

I believe the current plan is to introduce this dataset with the LINCS profiling complementarity paper.

@shntnu
Collaborator Author

shntnu commented Oct 13, 2021

I've created a checklist based on an email from Frances Wong:


Transfer images

We’ve recently set up the Globus platform for file transfer (https://www.globus.org/).

  • @shntnu You (or the person transferring the data) will need to set up a Globus account using the Globus Web App (https://app.globus.org/).
  • @shntnu Once you have an account set up, please email me your full Globus username/identity.
  • @shntnu I will set up a shared folder and send you an email from my Globus account with details on how to access this shared folder and you can upload your files to it.
  • @shntnu Do a test run: "As you have a large volume of data to transfer, we would like to suggest that you upload one complete plate to us first for testing. We will check that this plate passes validation prior to the full upload of your data. Please upload one complete plate and drop us an email when the transfer is finished."
  • @shntnu Let IDR know that the test run is done
  • @shntnu Continue uploading the full dataset to S3
  • @shntnu Wait for confirmation that all checks have passed

When preparing your image files for transfer, you may wish to refer to your previous submission (idr0080) as scripts like https://github.com/IDR/idr0080-way-perturbation/blob/master/scripts/illumcorrect_plate_symlinks.sh may be useful.

Note: We will not create illumination corrected files; we don't have the capacity to do that. See broadinstitute/cell-health#106 to understand why this is a very labor-intensive task.

Steps

du -h --max-depth 0 /cmap/imaging/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1/images/
26T     /cmap/imaging/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1/images/

Upload script

TOP_LEVEL_FOLDER=/cmap/imaging/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1/images/
aws s3 sync \
  --profile jump-cp-role \
  --acl bucket-owner-full-control \
  ${TOP_LEVEL_FOLDER} \
  s3://cellpainting-gallery/lincs/broad/images/2016_04_01_a549_48hr_batch1/images/

I should exclude 4 plates because these were bad plates (they got left behind in the freezer, and the images were terrible once we did image them; they were excluded from all analyses)

 parallel aws s3 rm --recursive --profile jump-cp-role s3://cellpainting-gallery/lincs/broad/images/2016_04_01_a549_48hr_batch1/images/{1} ::: SQ00015225__2016-10-29T16_09_17-Measurement1 SQ00015226__2016-10-29T17_50_20-Measurement1 SQ00015227__2016-10-29T19_31_37-Measurement1 SQ00015228__2016-10-29T21_13_50-Measurement1
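
As an alternative to uploading and then deleting, the four bad plates could be skipped at sync time. The following is a minimal Python sketch, not something that was run in this thread. It assumes the plate folder names from the command above and relies on the `--exclude` filter of `aws s3 sync`; the helper name `build_sync_command` is hypothetical, and the command is only assembled and printed, never executed.

```python
import shlex

# Plate folders that were excluded from all analyses (names from the command above).
BAD_PLATES = [
    "SQ00015225__2016-10-29T16_09_17-Measurement1",
    "SQ00015226__2016-10-29T17_50_20-Measurement1",
    "SQ00015227__2016-10-29T19_31_37-Measurement1",
    "SQ00015228__2016-10-29T21_13_50-Measurement1",
]

SRC = ("/cmap/imaging/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/"
       "2016_04_01_a549_48hr_batch1/images/")
DEST = "s3://cellpainting-gallery/lincs/broad/images/2016_04_01_a549_48hr_batch1/images/"


def build_sync_command(src, dest, bad_plates, profile="jump-cp-role"):
    """Return an `aws s3 sync` argument list that skips the given plate folders."""
    cmd = ["aws", "s3", "sync",
           "--profile", profile,
           "--acl", "bucket-owner-full-control"]
    for plate in bad_plates:
        # --exclude patterns are evaluated relative to the source directory.
        cmd += ["--exclude", f"{plate}/*"]
    cmd += [src, dest]
    return cmd


if __name__ == "__main__":
    # Print a copy-pasteable shell command instead of running it.
    print(shlex.join(build_sync_command(SRC, DEST, BAD_PLATES)))
```

Printing the command first makes it easy to review the exclusions before anything touches the bucket.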

Fill templates

As before with idr0080, we need some information about the study and the images for this new submission. We have some metadata templates for this information. Empty templates can be downloaded here https://github.com/IDR/idr0000-lastname-example/archive/master.zip.

  • @shntnu Please choose the templates for HCS data (including templates in screenA folder).

There are 3 template files to fill in.

There are examples of completed templates for other studies here https://github.com/IDR/idr-metadata/. Please try to fill in as much information as you can.

Our most recent submission is idr0080:

Wrap up

Please keep using [email protected] email address for any future communication.

@shntnu
Collaborator Author

shntnu commented Oct 26, 2021

@gwaygenomics Any thoughts on this?

@gwaybio
Member

gwaybio commented Oct 26, 2021

I didn't add phenotypes or any quantification to the cell health submission.

The info on the right is all I provided:

image

https://idr.openmicroscopy.org/webclient/?show=screen-2701

Thanks!

@shntnu
Collaborator Author

shntnu commented Nov 5, 2021

  • @gwaygenomics Please note accepted datasets will be published under the Creative Commons Attribution 4.0 International license (CC BY 4.0, https://creativecommons.org/licenses/by/4.0/). If this is not suitable please discuss your preferred licence with us before submitting any data.

This is less permissive than the license that we will use in the s3://cellpainting-gallery (CC0 https://github.com/awslabs/open-data-registry/blob/899c7a0e44e331dfc9c844a2a28261406ad73eb7/datasets/cellpainting-gallery.yml#L29) but that's ok I think. Do you see any issues @gwaygenomics ?

@gwaybio
Member

gwaybio commented Nov 5, 2021

Sounds good to me 👍 - as long as people are free to use, I'm good

@francesw

francesw commented Nov 9, 2021

It's fine, we are happy to go with a more permissive license than CC BY 4.0, so CC0 is good for us. Thanks

@gwaybio
Member

gwaybio commented Feb 14, 2022

Hi all! Sorry to not have pinged sooner, but how are we doing with this upload?

We received favorable reviews, but the journal will not publish without public data. Thanks! (hope all is well!)

@shntnu
Collaborator Author

shntnu commented Mar 10, 2022

@joshmoore said:

The next step is for you to send us a link to the Docker you already have for doing these conversions, and then Erin will create a tool using the template https://github.com/CellProfiler/Distributed-Something to do the conversion, and then actually do the conversion.

I gave everything a trial run earlier this week sans upload. (Scripts below) Download went well enough. With default parameters, conversion took 4 hours. If the download/upload are not part of the blackbox, then I can create this as an official image like “openmicroscopy/bioformats2raw”. If we want more logic within, then we’ll need to discuss naming.

head -n 100 download.sh

aws --no-sign-request --region us-east-1 s3 ls --summarize --human-readable --recursive s3://cellpainting-gallery/lincs/broad/images/2016_04_01_a549_48hr_batch1/images/SQ00014812__2016-05-23T20_44_31-Measurement1/Images/ 2>&1 | tail -n 2

exec time \
    conda run -n aws \
    aws --no-sign-request --region us-east-1 s3 sync \
    s3://cellpainting-gallery/lincs/broad/images/2016_04_01_a549_48hr_batch1/images/SQ00014812__2016-05-23T20_44_31-Measurement1/Images/ \
    SQ00014812__2016-05-23T20_44_31-Measurement1/Images/ | tee "$(date "+%F_%T").log"

head -n 100 run.sh

time sudo docker run \
    -u $(id -u) -v $PWD:/src --rm josh-bf2raw \
    --debug=INFO \
    /src/SQ00014812__2016-05-23T20_44_31-Measurement1/Images/Index.idx.xml \
    /src/SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr

head -n 100 docker/Dockerfile

# See https://github.com/mamba-org/micromamba-docker
FROM mambaorg/micromamba

COPY --chown=$MAMBA_USER:$MAMBA_USER env.yaml /tmp/env.yaml

RUN micromamba install -y -f /tmp/env.yaml && \
    micromamba clean --all --yes

ENTRYPOINT ["/usr/local/bin/_entrypoint.sh", "bioformats2raw"]

du -sh SQ00014812__2016-05-23T20_44_31-Measurement1*
151G     SQ00014812__2016-05-23T20_44_31-Measurement1
202G     SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr
ome_zarr info SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/
/data/josh/cellpainting/SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr [zgroup]

- metadata
   - Plate

- data
   - (1, 5, 1, 2160, 3240)
ome_zarr info SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/0/0/0
/data/josh/cellpainting/SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/0/0/0 [zgroup]

- metadata
   - Multiscales

- data

   - (1, 5, 1, 2160, 2160)
   - (1, 5, 1, 1080, 1080)
   - (1, 5, 1, 540, 540)
   - (1, 5, 1, 270, 270)
   - (1, 5, 1, 135, 135)
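
The five shapes listed for a single field of view are what you get by halving the Y and X axes at each level. Here is a small sketch of that arithmetic, assuming a (T, C, Z, Y, X) axis order and integer division; `pyramid_shapes` is an illustrative helper, not a function from ome_zarr or bioformats2raw.

```python
def pyramid_shapes(base_shape, levels=5, scale=2):
    """Shapes of each resolution level, downscaling only the last two (Y, X) axes."""
    *leading, y, x = base_shape
    shapes = []
    for level in range(levels):
        factor = scale ** level  # 1, 2, 4, 8, 16 for scale=2
        shapes.append((*leading, y // factor, x // factor))
    return shapes


if __name__ == "__main__":
    for shape in pyramid_shapes((1, 5, 1, 2160, 2160)):
        print(shape)
```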

@shntnu
Collaborator Author

shntnu commented Mar 10, 2022

Over to @ErinWeisbart and @bethac07

@bethac07

@joshmoore -

If the download/upload are not part of the blackbox, then I can create this as an official image like “openmicroscopy/bioformats2raw”. If we want more logic within, then we’ll need to discuss naming

I think it makes sense to leave the download and upload out.

@joshmoore

@bethac07 : Perfect. Thanks.

@shntnu
Collaborator Author

shntnu commented Mar 10, 2022

151G SQ00014812__2016-05-23T20_44_31-Measurement1
202G SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr

@joshmoore – do we expect this 33% increase in storage?

From David Logan, I had learned this:

Switching on zlib compression when running bioformats2raw causes a small (~5%) but acceptable increase in storage space for cell painting-style HCS images.

I wonder if the zlib compression switch was off in your conversion?

@joshmoore

joshmoore commented Mar 10, 2022

@joshmoore – do we expect this 33% increase in storage?

On average the TIFFs are 9 MB and 5 make up the equivalent of one OME Image. Looking at the pyramid of an OME-Zarr:

(base) [jamoore@pilot-zarr1-dev cellpainting]$ du -sh SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/4/0/6/*
45M	SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/4/0/6/0
12M	SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/4/0/6/1
2.8M	SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/4/0/6/2
720K	SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/4/0/6/3
184K	SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/4/0/6/4

the full resolution matches 5*9MB. So the extra space should come primarily from the extra four levels of the pyramid (1080x1080, 540x540, 270x270 and 135x135):

>>> 45 * 0.34
15.3
>>> 12 + 2.8 + .72 + .184
15.704

A different compression might help, but configuring the pyramid levels will definitely make a difference:

  "compressor" : {
    "clevel" : 5,
    "blocksize" : 0,
    "shuffle" : 1,
    "cname" : "lz4",
    "id" : "blosc"
  },
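
The overhead estimate above can be reproduced directly from the `du -sh` figures. This is a minimal sketch that assumes `du -h` style suffixes (K, M, G, T as powers of 1024); the helper names `parse_du_size` and `measured_overhead` are made up for illustration.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def parse_du_size(text):
    """Convert a `du -h` size such as '2.8M' or '720K' to a number of bytes."""
    text = text.strip()
    suffix = text[-1].upper()
    if suffix in UNITS:
        return float(text[:-1]) * UNITS[suffix]
    return float(text)  # plain byte count, no suffix


def measured_overhead(level_sizes):
    """Storage used by levels 1..n as a fraction of the full-resolution level 0."""
    sizes = [parse_du_size(s) for s in level_sizes]
    return sum(sizes[1:]) / sizes[0]


# Sizes of resolution levels 0-4 for one field, as listed in the du output above.
overhead = measured_overhead(["45M", "12M", "2.8M", "720K", "184K"])
print(f"extra storage from pyramid levels: {overhead:.1%}")
```

The result lands at roughly a third of the full-resolution size, in line with the 151G to 202G growth seen for the whole plate.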

@shntnu
Collaborator Author

shntnu commented Mar 11, 2022

Thanks for the explainer @joshmoore!

A different compression might help, but configuring the pyramid levels will definitely make a difference:

Is this a decision that IDR will make (to keep it standard across all datasets) or do we need to / get to decide? If we need to decide, we might need some help on understanding what we'd trade if we went with fewer levels (better storage-wise but worse interactivity-wise?)

@sbesson

sbesson commented Mar 11, 2022

At least from the OMERO perspective, the individual fields of view would not classify under the category of large images (aka larger than 3K x 3K) where the server would mandate pyramidal levels. This means all data access operations would only happen using tiled access to the top-level resolution.

Said otherwise, the intermediate resolutions generated are not critical for OMERO/IDR and we can likely make compromises in order to keep the data volumes largely equivalent between both representations. I think pre-computing some intermediate resolution levels in the NGFF representation is valuable. In particular, the lowest resolution typically corresponds to the thumbnail representation of a field of view. OMERO currently recomputes these levels internally but with growing usage of NGFF, I could certainly imagine it could natively make use of these resolutions if they exist.

Maybe rather than using scale factor of 2 which will create a 1/4+1/16+1/64+1/256 i.e. a 33% increase, we could use a scale factor of 4 which would bring us to 1/16+1/256 i.e. a 6.7 % increase for the conversion?
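
The two percentages come from a short geometric series: each extra level of a 2D pyramid holds 1/scale² of the pixels of the level above it. Below is a sketch of that calculation, assuming every level compresses equally well (which real data may not); `downsample_overhead` is a hypothetical helper name.

```python
def downsample_overhead(scale, extra_levels):
    """Extra pixels, as a fraction of full resolution, for a 2D pyramid.

    Each level shrinks Y and X by `scale`, so it holds 1/scale**2 of the
    pixels of the previous level.
    """
    ratio = 1 / scale ** 2
    return sum(ratio ** k for k in range(1, extra_levels + 1))


if __name__ == "__main__":
    # Scale factor 2, four extra levels: 1/4 + 1/16 + 1/64 + 1/256
    print(f"scale 2: {downsample_overhead(2, 4):.2%}")
    # Scale factor 4, two extra levels: 1/16 + 1/256
    print(f"scale 4: {downsample_overhead(4, 2):.2%}")
```

With two extra levels the scale-4 sum is about 6.6%; the 6.7% quoted above matches the limit of the infinite series, 1/15.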

@ErinWeisbart
Member

I've talked to Beth and I think I'm up to speed on my part of this project, at least as up-to-speed as I can get without starting to get my hands dirty. (Sorry for missing out on joining the meeting, but I'm on the West Coast which makes it quite a challenge to reasonably schedule meetings with folks across the pond).

It sounds like I should go ahead with a scale factor of 4?
@joshmoore It doesn't look like scale factor is being passed to the docker. Is this easily configurable?

@joshmoore

Hi @ErinWeisbart. If a US-timezone call is necessary in the next few weeks, let me know. The docker has the ENTRYPOINT set to the equivalent of the bioformats2raw executable so all arguments should be passed directly including -h to see all available options.

@ErinWeisbart
Member

Thanks @joshmoore. This project is right at the edge of my knowledge base, so I apologize for asking naive questions.

Our "Distributed-Something" usually points to a Docker on Dockerhub. Were you planning on creating an official openmicroscopy/bioformats2raw docker?

@shntnu
Collaborator Author

shntnu commented Mar 23, 2022

Were you planning on creating an official openmicroscopy/bioformats2raw docker?

@joshmoore would you recommend that @ErinWeisbart creates a docker herself using this?:

# See https://github.com/mamba-org/micromamba-docker
FROM mambaorg/micromamba

COPY --chown=$MAMBA_USER:$MAMBA_USER env.yaml /tmp/env.yaml

RUN micromamba install -y -f /tmp/env.yaml && \
    micromamba clean --all --yes

ENTRYPOINT ["/usr/local/bin/_entrypoint.sh", "bioformats2raw"]

@joshmoore

@shntnu : I've failed to find an automated mechanism that will keep the conda-based docker above up-to-date with the latest tag of glencoesoftware/bioformats2raw. Instead, I've built that repo directly and pushed it to openmicroscopy/bioformats2raw:0.4.0 (link). Note: there's no automation there either but adding it will be straightforward if we choose to stick with this strategy.

@shntnu
Collaborator Author

shntnu commented Apr 12, 2022

@joshmoore I believe @ErinWeisbart will be following up on this once she is back from vacation.

Meanwhile, is it possible to get an IDR identifier while we are working through this pilot? Very soon, we will be submitting our revision for the paper associated with this dataset, and they require an identifier for us to be able to submit.

@joshmoore

@shntnu: re: @ErinWeisbart 👍. I'll be here. 😉

For the IDR identifier, I'd gently push you back to the standard IDR channels.

@shntnu
Collaborator Author

shntnu commented Apr 12, 2022

For the IDR identifier, I'd gently push you back to the standard IDR channels.

Of course, will do

@shntnu
Collaborator Author

shntnu commented Apr 13, 2022

Frances said

Your IDR accession number is idr0125. To cite your submission in a manuscript, please include your IDR accession number and the URL to the IDR homepage (https://idr.openmicroscopy.org/). For example, “Data was deposited to the Image Data Resource (https://idr.openmicroscopy.org/) under accession number idr0125.” Please note, this accession number won’t be active until your submission is publicly available in IDR.

🎉

cc @gwaygenomics

@ErinWeisbart
Member

@joshmoore I'm running some tests to optimize instance specs for our distributed deployment and it looks like I'm getting slightly different outputs than you. Is it obvious to you what I'm missing? Thanks in advance for your help.

I have an EBS volume mounted as /ebs_tmp with the images downloaded to it (for PLATE I used the same SQ00014812__2016-05-23T20_44_31-Measurement1 as you).

# Enter shell in docker, allowing access to ebs_tmp:
sudo docker run -it --rm --entrypoint /bin/sh -v ~/ebs_tmp:/ebs_tmp openmicroscopy/bioformats2raw:latest
# Run bioformats2raw:
sh /opt/bioformats2raw/bin/bioformats2raw /ebs_tmp/PLATE/Images/Index.idx.xml /ebs_tmp/images_zarr/PLATE.ome.zarr

du -sh SQ00014812__2016-05-23T20_44_31-Measurement1*
151G
du -sh images_zarr/SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/
194G

ome_zarr info SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/
doesn't return anything. This is what concerns me.

ome_zarr info images_zarr/SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/0/0/0

WARNING:ome_zarr.io:version mismatch: detected:FormatV02, requested:FormatV04
WARNING:ome_zarr.io:version mismatch: detected:FormatV04, requested:FormatV02
/home/ubuntu/ebs_tmp/images_zarr/SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/0/0/0 [zgroup]
 - metadata
   - Multiscales
 - data
   - (1, 5, 1, 2160, 2160)
   - (1, 5, 1, 1080, 1080)
   - (1, 5, 1, 540, 540)
   - (1, 5, 1, 270, 270)
   - (1, 5, 1, 135, 135)

I don't know if the warnings matter, but the output otherwise matches yours
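
One way to dig into why `ome_zarr info` is silent at the plate level is to look at the raw group attributes. The sketch below only assumes the Zarr v2 on-disk layout, where each group keeps its attributes in a `.zattrs` JSON file, and the OME-NGFF convention that `plate`, `well` and `multiscales` blocks carry a `version` field. `summarize_zattrs` is an illustrative helper, not part of ome_zarr.

```python
import json
import sys
from pathlib import Path

NGFF_KEYS = ("plate", "well", "multiscales")


def summarize_zattrs(group_path):
    """Report which OME-NGFF metadata blocks a Zarr v2 group declares, with versions."""
    attrs_file = Path(group_path) / ".zattrs"
    if not attrs_file.exists():
        return {}  # no attributes stored for this group
    attrs = json.loads(attrs_file.read_text())
    summary = {}
    for key in ("plate", "well"):
        if key in attrs:
            summary[key] = attrs[key].get("version")
    if "multiscales" in attrs:
        summary["multiscales"] = [m.get("version") for m in attrs["multiscales"]]
    # Anything else at the top level, e.g. tool-specific markers.
    summary["other_keys"] = sorted(k for k in attrs if k not in NGFF_KEYS)
    return summary


if __name__ == "__main__":
    # Usage: python summarize_zattrs.py PLATE.ome.zarr PLATE.ome.zarr/0/0/0
    for path in sys.argv[1:]:
        print(path, summarize_zattrs(path))
```

Comparing the output for the plate root and for a single field such as `0/0/0` shows whether the plate metadata is present at all and which spec version each level declares.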

@will-moore

Hi, apologies for the long pause and thanks for bringing this back to my attention...

I guess there's a few threads to catch up on...

  • Downsampling / thumbnails: the creation of thumbnails on import into IDR was very slow because of the lack of downsampled resolutions. The option to create a single extra thumbnail resolution by using a large downsample factor in bioformats2raw isn't going to be supported (see Downscale factor option glencoesoftware/bioformats2raw#193). That leaves a couple of options: either we post-process the data to add a thumbnail resolution layer (I need to find or rewrite the script I had above for that) or we use bioformats2raw and generate multiple resolutions (test first to see how much bigger the data is).

  • Preparation of IDR server with ZarrReader. The existing IDR server doesn't include the new Bio-Formats ZarrReader needed to read OME-Zarr data. However, it does contain various custom Bio-Formats readers that are IDR-specific (added to support various studies in the past). Maintenance of these custom readers, especially during the upgrade necessary to add ZarrReader, has become too costly so we have decided to remove them and convert all the custom data into OME-Zarr. We are making good progress on this but it has been a fair bit of work. We are getting close to starting the import into the "next" IDR release server and at that point we'll be ready to start importing the cellpainting data (but will need some solution to thumbnailing first).

  • Metadata - I need to find and start validating the metadata you've provided to check that it corresponds to every Plate/Well/Image in the data. I'll let you know of any issues...

@shntnu
Collaborator Author

shntnu commented Sep 1, 2023

Thanks a lot for recapping the status, Will!

The preparation of IDR server and metadata is in your hands, so we can only help with the thumbnails. I tried reading the past few comments to determine whether we preferred 1 vs. 2, but I couldn't conclude.

  1. We post-process the data to add a thumbnail resolution layer (I need to find or rewrite the script I had above for that)
  2. We use bioformats2raw and generate multiple resolutions (test first to see how much bigger the data is)

@ErinWeisbart - do you have an opinion? Your past comments #54 (comment) might help remind

@will-moore

Created an issue at IDR/idr0125-way-cellpainting#2 wrt the annotation.csv file.

@ErinWeisbart
Member

  1. We post-process the data to add a thumbnail resolution layer (I need to find or rewrite the script I had above for that)
  2. We use bioformats2raw and generate multiple resolutions (test first to see how much bigger the data is)

I'm happy enough with either, though I have a preference for (1). It sounds like either way I would need to reprocess this dataset?

For (1) I would add Will's script to our Distributed-OMEZarrCreator which would add the thumbnail creation functionality so the thumbnail creation could be triggered either independently (for this dataset) or as part of the OMEZarr conversion (for any/all future datasets).

For (2) it's @shntnu 's call how much larger we can expand the data if we were to add the whole pyramid down to the necessary thumbnail size. Alternatively, we can re-create the whole pyramid and then delete the layers we don't want, but that isn't a very elegant approach. I'd much rather add functionality than just make stuff and delete part of it ;)

@shntnu
Collaborator Author

shntnu commented Sep 1, 2023

It sounds like 2. would be a lot simpler to accomplish. Based on the previous notes linked below, it looks like we can get away with a 6.7% increase if we go with a scale factor of 4. Does that sound right to you @ErinWeisbart? If so, I am all for this approach.

#54 (comment)
#54 (comment)

Specifically this:

Maybe rather than using scale factor of 2 which will create a 1/4+1/16+1/64+1/256 i.e. a 33% increase, we could use a scale factor of 4 which would bring us to 1/16+1/256 i.e. a 6.7 % increase for the conversion?

@shntnu
Collaborator Author

shntnu commented Sep 1, 2023

Oh wait, now that I read #54 (comment) I'm not sure if 4x factor is allowed. If it is then the 1/16 resolution would be perfect: it will produce 135x135 thumbnails

@will-moore

@shntnu No, the 4x factor won't be supported. So it could be up to a 33% increase in data, but it would be good to test since it's possible that it might compress further.

I'll try scripting the thumbnails too so we have that option available...

@will-moore

will-moore commented Sep 4, 2023

@shntnu - I found and uploaded the script I was using for testing downsampling: https://github.com/IDR/idr0125-way-cellpainting/blob/main/scripts/add_downsampling.py
It only works for a single image at the moment, e.g. to add a downsample resolution to an image.zarr, scaling by a factor of 8... (which would give suitable thumbnail performance and still a good resolution):

python add_downsampling.py /path/to/image.zarr 8

EDIT: (This seems to be fine now - see next comment below) I was reminded in my testing that vizarr doesn't support images with downsamplings of factors other than 2.
I can look at scaling this up to work for a whole plate and testing in OMERO/IDR...
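
For readers who want a feel for what a factor-8 downsample does to a 2160x2160 field, here is a toy sketch. It is not the contents of add_downsampling.py, which is not reproduced in this thread; it simply keeps every 8th pixel along Y and X (nearest-neighbour striding) on a plain list-of-rows image, and the linked script may well use Zarr arrays and a different resampling method.

```python
def downsample_plane(plane, factor):
    """Keep every `factor`-th pixel along Y and X of a 2D list-of-rows image."""
    return [row[::factor] for row in plane[::factor]]


# 16x16 toy image whose pixel value encodes its position (y * 16 + x).
toy = [[y * 16 + x for x in range(16)] for y in range(16)]
small = downsample_plane(toy, 8)
print(len(small), "x", len(small[0]))
```

Applied to a 2160x2160 plane, the same striding gives 270x270, the thumbnail-scale level discussed above.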

@will-moore

@shntnu I have processed a sample plate to create a single extra resolution level at a factor-8 downsampling.

Described at
IDR/idr0125-way-cellpainting#3 (comment)

View in ome-ngff-validator at https://ome.github.io/ome-ngff-validator/?source=https://uk1s3.embassy.ebi.ac.uk/idr0125/SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr

This looks pretty good in vizarr - haven't imported into IDR yet but don't see any issues there.

This option looks like it could be a viable solution, but it still is more work than using bioformats2raw to generate a full pyramid.

So I guess it comes down to your workflow (whether you can include a python add_downsample.py step for each Plate) or whether you can afford the extra space for a full pyramid?

@ErinWeisbart
Member

ErinWeisbart commented Sep 13, 2023

That's great @will-moore !
It should be pretty simple for me to add add_downsample.py to our Distributed-OMEZarrCreator Docker such that when we create .ome.zarr's we can optionally pass an extra flag to perform the downsample at the same time. Do you want to make a PR to add the script to the repo (in the worker folder) so you have credit for the contribution and I can do the extra work of integrating it?

@will-moore

Thanks @ErinWeisbart - I opened a PR at DistributedScience/Distributed-OMEZarrCreator#6

@will-moore

I see that you've added the downsampling to the data - e.g. https://ome.github.io/ome-ngff-validator/?source=https://uk1s3.embassy.ebi.ac.uk/idr0125/SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/A/1/0/
Sorry, I'd not checked earlier.

We are making progress with the update of the IDR to support NGFF data - a few more things to cover but the end is in sight. We're currently thinking of releasing the upgrade to IDR, followed by a separate release with the cellpainting data but we'll let you know when the schedule is clearer.

@shntnu
Collaborator Author

shntnu commented Feb 10, 2024

We are making progress with the update of the IDR to support NGFF data - a few more things to cover but the end is in sight. We're currently thinking of releasing the upgrade to IDR, followed by a separate release with the cellpainting data but we'll let you know when the schedule is clearer.

@will-moore did you get any closer? :D

@will-moore

Hi @shntnu - apologies for not updating you on progress. Unfortunately the IDR upgrade to support NGFF data is taking longer than expected. We finally have all the NGFF data and software updates in place but are finding that reduced performance of reading NGFF data from s3 (mounted as a file-system) is causing issues with the server stability. So we are looking at installing microservices to spread the load...

Once the upgrade is released we will focus on getting your study in.
Frances was wondering if you'd got her e-mail "Annotations for idr0125" on the 16th November about compounds, concentration units etc? Thx

@shntnu
Collaborator Author

shntnu commented Feb 21, 2024

Once the upgrade is released we will focus on getting your study in.

Thanks for the update, @will-moore

Frances was wondering if you'd got her e-mail "Annotations for idr0125" on the 16th November about compounds, concentration units etc? Thx

We have not done this yet but I will paste her email in here so we can keep track of it

As Will continues to import your plates into IDR, your library file has also been curated (attached) to provide annotations for your images. Are the identifiers in column I reagent identifiers? If not, please amend column header. Please can you confirm that the unit concentration for your compounds is in microMolar (column L), if not please amend the unit. Please could you also provide the InChIKey for each compound (column M) if available. If an InChIKey is not available, please leave blank. Please email the updated library file to us when ready.

idr0125-screenA-library.csv.zip

@will-moore

Starting to look at this again since we are getting closer to releasing OME-NGFF support in IDR (apologies for the delay).

I noticed that I'd got a bit confused at #54 (comment) and mixed up the URLs to our sample data on embassy.ebi and the original data on your cellpainting-gallery.s3.

As far as I can see, the original data doesn't yet have down-sampled resolution levels: E.g. this shows a single multiscales resolution of shape 1,5,1,2160,2160:

https://ome.github.io/ome-ngff-validator/?source=https://cellpainting-gallery.s3.amazonaws.com/cpg0004-lincs/broad/images/2016_04_01_a549_48hr_batch1/images_zarr/SQ00015118__2016-04-13T19_52_28-Measurement1.ome.zarr/A/1/0/

Am I looking at the right data there? Are you still considering whether to add downsampling to that data?
It's possible that our recent OME-NGFF performance improvements can mitigate some of the issues we were seeing previously with lack of downsampling, but I think that it would still make the data more user-friendly to have the extra resolutions.

Cheers,
Will

@ErinWeisbart
Member

@will-moore Honestly, I lost track of this. I had thought I put a test in at s3://cellpainting-gallery/cpg0004-lincs/broad/images/2016_04_01_a549_48hr_batch1/test_downsample/ but it doesn't look any different from the images_zarr or images_zarr_50 folders (if I'm reading them right).

If you/your team are focusing on this again, I can add it back to my priority list (though I am out the next two weeks so there will be a delay on my end).

@will-moore

Thanks @ErinWeisbart. It's not urgent for us. We are still several weeks away from moving this towards release and I was just starting to test again ahead of time. But if you have a chance to look at it sometime after you're back that would be great.

@ErinWeisbart
Member

Hi @will-moore.
I've implemented your downsample script into Distributed-OMEZarrCreator and tested it on a single plate.
https://ome.github.io/ome-ngff-validator/?source=https://cellpainting-gallery.s3.amazonaws.com/cpg0004-lincs/broad/images/2016_04_01_a549_48hr_batch1/images_zarr_withdownscale8/SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/

Can you confirm that this looks and performs as expected? Then I can convert the rest of the cpg0004-lincs dataset.

@will-moore

Thanks - I'll let you know...

@will-moore

@ErinWeisbart - That plate worked well and reduced the time for generating thumbnails in IDR from about 12 hours to approx 1.5 hours - Big improvement!
So please go ahead with the other plates.
Will you update the previously published plates to add the lower resolutions or use new locations?
Thanks!

@ErinWeisbart
Member

@will-moore The plates are now all updated in the original s3://cellpainting-gallery/cpg0004-lincs/broad/images/2016_04_01_a549_48hr_batch1/images_zarr location.

@will-moore

Great, thanks @ErinWeisbart. I'll get working on them - although I might be delayed a bit due to our OME meeting next week...

@will-moore

Just a quick update.... I've been importing all the plates into a test server and they're looking good but they take a while to import - About 5 hours a plate - approx a month for all the plates, so I'll see if I can do this in parallel...
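
The "approx a month" figure, and what parallel imports might buy, can be sketched with a few lines of arithmetic. This assumes every plate takes the same 5 hours and that workers scale perfectly, which a shared server may not allow. The plate count of 136 is the batch 1 figure from the table earlier in this thread and is used here only as an assumption; the number actually imported may differ.

```python
import math


def import_days(n_plates, hours_per_plate=5, workers=1):
    """Wall-clock days to import all plates with `workers` imports running at once."""
    rounds = math.ceil(n_plates / workers)  # each round imports up to `workers` plates
    return rounds * hours_per_plate / 24


if __name__ == "__main__":
    for workers in (1, 2, 4, 8):
        print(f"{workers} worker(s): {import_days(136, workers=workers):.1f} days")
```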

On the good-news side, we finally released the OME-Zarr support in IDR - see https://forum.image.sc/t/ome-ngff-data-in-the-idr/98630 so we are one step closer.
Still some work to do but we are making progress...

@shntnu
Collaborator Author

shntnu commented Aug 4, 2024

@will-moore – could you post an update when you get the chance? An upcoming (smallish) dataset would benefit a lot from IDR's tools, so if this overall approach is looking promising, we will start converting that dataset in the format needed.

(For Erin – this is cpg0038-tegtmeyer-neuropainting)

@will-moore

Hi @shntnu, I was just comparing the names of the NGFF plates e.g. SQ00014812 as defined in the plate metadata with the plate names in the library.csv file, e.g. SQ00014812__2016-05-23T20_44_31-Measurement1.

These need to match in order that our annotation scripts can assign rows to the imported Plates.
Currently, NGFF data imported into the server is given the shorter name SQ00014812. If you are happy to use this name in the IDR, then the library file would need to be updated to use those names.

https://ome.github.io/ome-ngff-validator/?source=https://cellpainting-gallery.s3.amazonaws.com/cpg0004-lincs/broad/images/2016_04_01_a549_48hr_batch1/images_zarr/SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr
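
If the shorter names are adopted, the library file could be updated mechanically. This is a minimal sketch that assumes the long names always take the form `<short name>__<timestamp>-Measurement1` and that the library file has a plate column; the column name `Plate` and both helper names are assumptions for illustration, not taken from the actual idr0125 library file.

```python
import csv
import io


def short_plate_name(full_name):
    """'SQ00014812__2016-05-23T20_44_31-Measurement1' -> 'SQ00014812'."""
    return full_name.split("__", 1)[0]


def shorten_plate_column(csv_text, column="Plate"):
    """Return CSV text with the plate column rewritten to the short names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = short_plate_name(row[column])
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    example = "Plate,Well\nSQ00014812__2016-05-23T20_44_31-Measurement1,A1\n"
    print(shorten_plate_column(example))
```

Names that are already short pass through unchanged, so the rewrite is safe to run more than once.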

I've also been checking the thumbnail generation for Plates in our IDR test server and identified a bunch of images where this failed due to the images being all black. I went through and manually triggered generation of black thumbnails, and I listed these images so I know where to do this again. I thought I'd share this list to check that it corresponds with your expectations.

  • SQ00015043 P9 Field 1
  • SQ00015120 P1 Field 6
  • SQ00015148 P24 Field 6
  • SQ00015173 O19 Field 2
  • SQ00015195 B23 Field 5
  • SQ00015197 P1 Field 1
  • SQ00015198 P1 Fields 2-8, P21 Field 2
  • SQ00015207 O1 Fields 1 & 3-9 (P1 missing Well - as discussed above)
  • SQ00015208 B22 Field 5

@shntnu
Collaborator Author

shntnu commented Sep 16, 2024

A note to ourselves that progress here is partially blocked by us (Broadies), because the request from Frances linked below needs to be addressed

#54 (comment)

This is currently on my plate
