Upload Image Files to IDR #54

shntnu · 2020-12-11T00:16:35Z

We will upload image files to the Image Data Resource and add URL and metadata information to the Broad Bioimage Benchmark Collection.

We will use this issue to outline the required steps.

From IDR:

study file describing the overall study and the screens that were performed e.g. cell health
library file(s) describing the plate layout of each screen e.g. cell health
processed data file(s) containing summary results and/or a 'hit' list for each screen

All files should be in tab-delimited text format.
Templates are provided but can be modified to suit your experiment.
Add or remove columns from the templates as necessary.

@gwaygenomics Did you have a processed data file for cell health?

shntnu · 2021-02-26T11:24:51Z

A conclusion from our internal discussion: Let's also include the LKCP dataset when submitting to IDR.

gwaybio · 2021-03-05T15:12:47Z

The first step is to reach out to IDR to see if they would be interested in hosting these data. I plan on doing this today.

Becki will be taking notes on the submission process, on the wiki. I will use this issue to jot down specific metadata information that we'll likely need to track for IDR.

A couple immediate answers to track:

How many files?
How big is the total set?

Batch	Plates	File count	Size
2016_04_01_a549_48hr_batch1	136	~2,200,000	22 TB
2017_12_05_Batch2	135	~2,700,000	23.8 TB

Could you send us a draft of the publication or is it submitted to a pre-print archive such as BioRxiv? Do you have a timeline for publication?

We hope to submit a preprint in 2-3 months.

What is the image file format?

.tiff

Do you have feature level data, ROIs or tracking data available for this dataset?

Yes. Feature level data are available at https://github.com/broadinstitute/lincs-cell-painting/

Could this dataset be integrated with other datasets e.g. through genes (orthologs) or phenotypes?

Definitely. These data are morphologies after thousands of drug perturbations. Data can be linked by drug information.

gwaybio · 2021-03-16T15:16:22Z

Initial inquiry sent on March 16, 2021 with ipLINCS project tag and subject: "[IDR] LINCS Cell Painting - a 45TB benchmark dataset of drug perturbations"

gwaybio · 2021-05-21T20:42:53Z

On March 23, 2021, we received word from the IDR staff that they will not accept our data without a manuscript draft first.

I believe the current plan is to introduce this dataset with the LINCS profiling complementarity paper.

shntnu · 2021-10-13T19:57:28Z

I've created a checklist based on an email from Frances Wong:

Transfer images

We’ve recently set up the Globus platform for file transfer (https://www.globus.org/).

@shntnu You (or the person transferring the data) will need to set up a Globus account using the Globus Web App (https://app.globus.org/).
@shntnu Once you have an account set up, please email me your full Globus username/identity.
@shntnu I will set up a shared folder and send you an email from my Globus account with details on how to access this shared folder and you can upload your files to it.
@shntnu do a test run "As you have a large volume of data to transfer, we would like to suggest that you upload one complete plate to us first for testing. We will check that this plate passes validation prior to the full upload of your data. Please upload one complete plate and drop us an email when the transfer is finished."
@shntnu Let IDR know that the test run is done
@shntnu Continue uploading the full dataset to S3
@shntnu Wait for confirmation that all checks have passed

When preparing your image files for transfer, you may wish to refer to your previous submission (idr0080) as scripts like https://github.com/IDR/idr0080-way-perturbation/blob/master/scripts/illumcorrect_plate_symlinks.sh may be useful.

Note: We will not create illumination corrected files; we don't have the capacity to do that. See broadinstitute/cell-health#106 to understand why this is a very labor-intensive task.

Steps

@shntnu (abandoned) Review notes from Uploading Image Files to IDR and BBBC cell-health#106
@shntnu (in progress) Upload from /cmap/imaging/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1/images/. Note the size below.

du -h --max-depth 0 /cmap/imaging/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1/images/
26T     /cmap/imaging/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1/images/

Upload script

TOP_LEVEL_FOLDER=/cmap/imaging/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1/images/
aws s3 sync \
  --profile jump-cp-role \
  --acl bucket-owner-full-control \
  ${TOP_LEVEL_FOLDER} \
  s3://cellpainting-gallery/lincs/broad/images/2016_04_01_a549_48hr_batch1/images/

I should exclude 4 plates because these were bad plates (they got left behind in the freezer, and the images were terrible once we did image them; they were excluded from all analyses)

 parallel aws s3 rm --recursive --profile jump-cp-role s3://cellpainting-gallery/lincs/broad/images/2016_04_01_a549_48hr_batch1/images/{1} ::: SQ00015225__2016-10-29T16_09_17-Measurement1 SQ00015226__2016-10-29T17_50_20-Measurement1 SQ00015227__2016-10-29T19_31_37-Measurement1 SQ00015228__2016-10-29T21_13_50-Measurement1
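An alternative to removing the four bad plates after the fact would be to skip them during the sync itself with `--exclude` patterns. The following is only a minimal sketch under that assumption, not the command that was actually run: the helper name `build_sync_command` is hypothetical, and it only assembles the argument list without calling AWS.

```python
import shlex

# Plates excluded from all analyses (names taken from the removal command above).
BAD_PLATES = [
    "SQ00015225__2016-10-29T16_09_17-Measurement1",
    "SQ00015226__2016-10-29T17_50_20-Measurement1",
    "SQ00015227__2016-10-29T19_31_37-Measurement1",
    "SQ00015228__2016-10-29T21_13_50-Measurement1",
]

SOURCE = (
    "/cmap/imaging/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/"
    "2016_04_01_a549_48hr_batch1/images/"
)
DESTINATION = (
    "s3://cellpainting-gallery/lincs/broad/images/2016_04_01_a549_48hr_batch1/images/"
)


def build_sync_command(source, destination, profile, excluded_plates):
    """Assemble an `aws s3 sync` argument list that skips the given plate folders.

    Hypothetical helper for illustration: it builds the command but does not run it.
    """
    command = [
        "aws", "s3", "sync",
        "--profile", profile,
        "--acl", "bucket-owner-full-control",
    ]
    for plate in excluded_plates:
        # --exclude patterns are evaluated relative to the source directory.
        command += ["--exclude", f"{plate}/*"]
    command += [source, destination]
    return command


command = build_sync_command(SOURCE, DESTINATION, "jump-cp-role", BAD_PLATES)
# Print a copy-pasteable shell command for review before running it by hand.
print(shlex.join(command))
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; the output can be inspected and pasted into a shell once it looks right.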

Fill templates

As before with idr0080, we need some information about the study and the images for this new submission. We have some metadata templates for this information. Empty templates can be downloaded here https://github.com/IDR/idr0000-lastname-example/archive/master.zip.

@shntnu Please choose the templates for HCS data (including templates in screenA folder).

There are 3 template files to fill in.

@shntnu create gsheets with appropriate templates and insert links here
@tnat1031 create the study file, which is for top-level information about the study - title, description, protocols etc. e.g. cell health. Done in adding IDR files #81. The file is
- https://github.com/broadinstitute/lincs-cell-painting/blob/58c86d50ec58af5adae330ac7e4329841c1e30e7/metadata/idr/idr0000-study_batch_1.csv
@tnat1031 create the library file, which is where you list all the images per well and describe the samples that were imaged and explain what is shown in each channel. Each row corresponds to a well as described in the "Well" column. Then in the other columns in each row corresponding to an image file per well the sample is described and any treatment to the sample. Please feel free to add a free text description of the image as well in a "Comment [Sample Description]" column if that is useful. Then there is a column for Channels to describe both the stain/label and what is shown with that stain/label. e.g. cell health. Done in adding IDR files #81. The file is
- https://github.com/broadinstitute/lincs-cell-painting/blob/58c86d50ec58af5adae330ac7e4329841c1e30e7/metadata/idr/idr0000-screenA-library_batch_1.txt.gz
@gwaygenomics the final file is the processed data file. It is for any results relating to each image. It could include such things as phenotypes observed or quantification of label intensities. Whatever we did for Cell Health should work here; maybe just point them to this repo? Or share spherized collated files e.g. https://github.com/broadinstitute/lincs-cell-painting/tree/master/spherized_profiles/consensus? Decision: nothing to do here; see Upload Image Files to IDR #54 (comment).

There are examples of completed templates for other studies here https://github.com/IDR/idr-metadata/. Please try to fill in as much information as you can.

Our most recent submission is idr0080:

Wrap up

@shntnu Don't give a go-ahead to pull all the data in until this PR is wrapped up (Initialize cellpainting-gallery on RODA, awslabs/open-data-registry#1003), because the data should first be announced on the AWS Open Data Registry
@shntnu Please drop us an email when your ftp transfer is complete and completed metadata templates can be sent by email to [email protected].
@gwaygenomics Please note accepted datasets will be published under the Creative Commons Attribution 4.0 International license (CC BY 4.0, https://creativecommons.org/licenses/by/4.0/). If this is not suitable please discuss your preferred licence with us before submitting any data.

Please keep using [email protected] email address for any future communication.

shntnu · 2021-10-26T20:06:24Z

@gwaygenomics the final file is the processed data file. It is for any results relating to each image. It could include such things as phenotypes observed or quantification of label intensities. Whatever we did for Cell Health should work here; maybe just point them to this repo? Or share spherized collated files e.g. https://github.com/broadinstitute/lincs-cell-painting/tree/master/spherized_profiles/consensus?

@gwaygenomics Any thoughts on this?

gwaybio · 2021-10-26T20:43:02Z

I didn't add phenotypes or any quantification to the cell health submission.

The info on the right is all I provided:

https://idr.openmicroscopy.org/webclient/?show=screen-2701

Thanks!

shntnu · 2021-11-05T14:01:21Z

@gwaygenomics Please note accepted datasets will be published under the Creative Commons Attribution 4.0 International license (CC BY 4.0, https://creativecommons.org/licenses/by/4.0/). If this is not suitable please discuss your preferred licence with us before submitting any data.

This is less permissive than the license that we will use in the s3://cellpainting-gallery (CC0 https://github.com/awslabs/open-data-registry/blob/899c7a0e44e331dfc9c844a2a28261406ad73eb7/datasets/cellpainting-gallery.yml#L29) but that's ok I think. Do you see any issues @gwaygenomics ?

gwaybio · 2021-11-05T16:35:54Z

Sounds good to me 👍 - as long as people are free to use, I'm good

francesw · 2021-11-09T11:13:40Z

It's fine, we are happy to go with a more permissive license than CC BY 4.0, so CC0 is good for us. Thanks

gwaybio · 2022-02-14T23:34:10Z

Hi all! Sorry to not have pinged sooner, but how are we doing with this upload?

We received favorable reviews, but the journal will not publish without public data. Thanks! (hope all is well!)

shntnu · 2022-03-10T18:17:46Z

@joshmoore said:

The next step is for you to send us a link to the Docker you already have for doing these conversions, and then Erin will create a tool using the template https://github.com/CellProfiler/Distributed-Something to do the conversion, and then actually do the conversion.

I gave everything a trial run earlier this week sans upload. (Scripts below) Download went well enough. With default parameters, conversion took 4 hours. If the download/upload are not part of the blackbox, then I can create this as an official image like “openmicroscopy/bioformats2raw”. If we want more logic within, then we’ll need to discuss naming.

head -n 100 download.sh

aws --no-sign-request --region us-east-1 s3 ls --summarize --human-readable --recursive s3://cellpainting-gallery/lincs/broad/images/2016_04_01_a549_48hr_batch1/images/SQ00014812__2016-05-23T20_44_31-Measurement1/Images/ 2>&1 | tail -n 2

exec time \
    conda run -n aws \
    aws --no-sign-request --region us-east-1 s3 sync \
    s3://cellpainting-gallery/lincs/broad/images/2016_04_01_a549_48hr_batch1/images/SQ00014812__2016-05-23T20_44_31-Measurement1/Images/ \
    SQ00014812__2016-05-23T20_44_31-Measurement1/Images/ | tee "$(date "+%F_%T").log"

head -n 100 run.sh

time sudo docker run \
    -u $(id -u) -v $PWD:/src --rm josh-bf2raw \
    --debug=INFO \
    /src/SQ00014812__2016-05-23T20_44_31-Measurement1/Images/Index.idx.xml \
    /src/SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr

head -n 100 docker/Dockerfile

# See https://github.com/mamba-org/micromamba-docker

FROM mambaorg/micromamba

COPY --chown=$MAMBA_USER:$MAMBA_USER env.yaml /tmp/env.yaml

RUN micromamba install -y -f /tmp/env.yaml && \
    micromamba clean --all --yes

ENTRYPOINT ["/usr/local/bin/_entrypoint.sh", "bioformats2raw"]

du -sh SQ00014812__2016-05-23T20_44_31-Measurement1*

151G     SQ00014812__2016-05-23T20_44_31-Measurement1
202G     SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr

ome_zarr info SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/

/data/josh/cellpainting/SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr [zgroup]

- metadata
   - Plate

- data
   - (1, 5, 1, 2160, 3240)

ome_zarr info SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/0/0/0

/data/josh/cellpainting/SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/0/0/0 [zgroup]

- metadata
   - Multiscales

- data

   - (1, 5, 1, 2160, 2160)
   - (1, 5, 1, 1080, 1080)
   - (1, 5, 1, 540, 540)
   - (1, 5, 1, 270, 270)
   - (1, 5, 1, 135, 135)

shntnu · 2022-03-10T18:21:17Z

Over to @ErinWeisbart and @bethac07

bethac07 · 2022-03-10T18:25:58Z

@joshmoore -

If the download/upload are not part of the blackbox, then I can create this as an official image like “openmicroscopy/bioformats2raw”. If we want more logic within, then we’ll need to discuss naming

I think it makes sense to leave the download and upload out.

joshmoore · 2022-03-10T18:43:30Z

@bethac07 : Perfect. Thanks.

shntnu · 2022-03-10T23:36:52Z

151G SQ00014812__2016-05-23T20_44_31-Measurement1
202G SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr

@joshmoore – do we expect this 33% increase in storage?

From David Logan, I had learned this:

Switching on zlib compression when running bioformats2raw causes a small (~5%) but acceptable increase in storage space for cell painting-style HCS images.

I wonder if the zlib compression switch was off in your conversion?

joshmoore · 2022-03-10T23:50:21Z

@joshmoore – do we expect this 33% increase in storage?

On average the TIFFs are 9 MB and 5 make up the equivalent of one OME Image. Looking at the pyramid of an OME-Zarr:

(base) [jamoore@pilot-zarr1-dev cellpainting]$ du -sh SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/4/0/6/*
45M	SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/4/0/6/0
12M	SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/4/0/6/1
2.8M	SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/4/0/6/2
720K	SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/4/0/6/3
184K	SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/4/0/6/4

the full resolution matches 5*9MB. So the extra space should come primarily from the extra four levels of the pyramid (1080x1080, 540x540, 270x270 and 135x135):

>>> 45 * 0.34
15.3
>>> 12 + 2.8 + .72 + .184
15.704

A different compression might help, but configuring the pyramid levels will definitely make a difference:

  "compressor" : {
    "clevel" : 5,
    "blocksize" : 0,
    "shuffle" : 1,
    "cname" : "lz4",
    "id" : "blosc"
  },

shntnu · 2022-03-11T00:05:05Z

Thanks for the explainer @joshmoore!

A different compression might help, but configuring the pyramid levels will definitely make a difference:

Is this a decision that IDR will make (to keep it standard across all datasets) or do we need to / get to decide? If we need to decide, we might need some help on understanding what we'd trade if we went with fewer levels (better storage-wise but worse interactivity-wise?)

sbesson · 2022-03-11T08:49:23Z

At least from the OMERO perspective, the individual fields of view would not classify under the category of large images (aka larger than 3K x 3K) where the server would mandate pyramidal levels. This means all data access operations would only happen using tiled access to the top-level resolution.

Said otherwise, the intermediate resolutions generated are not critical for OMERO/IDR and we can likely make compromises in order to keep the data volumes largely equivalent between both representations. I think pre-computing some intermediate resolution levels in the NGFF representation is valuable. In particular, the lowest resolution typically corresponds to the thumbnail representation of a field of view. OMERO currently recomputes these levels internally but with growing usage of NGFF, I could certainly imagine it natively making use of these resolutions if they exist.

Maybe rather than using scale factor of 2 which will create a 1/4+1/16+1/64+1/256 i.e. a 33% increase, we could use a scale factor of 4 which would bring us to 1/16+1/256 i.e. a 6.7 % increase for the conversion?
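The overhead figures quoted above can be reproduced with a few lines of arithmetic. This is a minimal sketch that ignores compression and chunk padding; the function name `pyramid_overhead` is made up for illustration and is not part of bioformats2raw.

```python
def pyramid_overhead(scale_factor, extra_levels):
    """Return the extra storage of a pyramid relative to the full-resolution level.

    Each extra level shrinks both X and Y by `scale_factor`, so it holds
    1 / scale_factor**2 of the pixels of the level above it.
    Compression is ignored, so this is an uncompressed pixel-count estimate.
    """
    per_level = 1 / scale_factor ** 2
    return sum(per_level ** level for level in range(1, extra_levels + 1))


factor_2 = pyramid_overhead(2, 4)  # 1/4 + 1/16 + 1/64 + 1/256
factor_4 = pyramid_overhead(4, 2)  # 1/16 + 1/256

print(f"scale factor 2, 4 extra levels: {factor_2:.1%}")
print(f"scale factor 4, 2 extra levels: {factor_4:.1%}")
```

With a scale factor of 2 the four extra levels add roughly a third on top of the full-resolution data, while a scale factor of 4 with two extra levels adds a little under 7%, which is in line with the plate growing from 151G to 202G in the trial conversion.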

ErinWeisbart · 2022-03-15T21:10:35Z

I've talked to Beth and I think I'm up to speed on my part of this project, at least as up-to-speed as I can get without starting to get my hands dirty. (Sorry for missing out on joining the meeting, but I'm on the West Coast which makes it quite a challenge to reasonably schedule meetings with folks across the pond).

It sounds like I should go ahead with a scale factor of 4?
@joshmoore It doesn't look like scale factor is being passed to the docker. Is this easily configurable?

joshmoore · 2022-03-15T22:45:14Z

Hi @ErinWeisbart. If a US-timezone call is necessary in the next few weeks, let me know. The docker has the ENTRYPOINT set to the equivalent of the bioformats2raw executable so all arguments should be passed directly including -h to see all available options.

ErinWeisbart · 2022-03-16T18:25:18Z

Thanks @joshmoore. This project is right at the edge of my knowledge base, so I apologize for asking naive questions.

Our "Distributed-Something" usually points to a Docker on Dockerhub. Were you planning on creating an official openmicroscopy/bioformats2raw docker?

shntnu · 2022-03-23T19:26:07Z

Were you planning on creating an official openmicroscopy/bioformats2raw docker?

@joshmoore would you recommend that @ErinWeisbart creates a docker herself using this?:

# See https://github.com/mamba-org/micromamba-docker

FROM mambaorg/micromamba

COPY --chown=$MAMBA_USER:$MAMBA_USER env.yaml /tmp/env.yaml

RUN micromamba install -y -f /tmp/env.yaml && \
    micromamba clean --all --yes

ENTRYPOINT ["/usr/local/bin/_entrypoint.sh", "bioformats2raw"]

joshmoore · 2022-03-24T00:06:02Z

@shntnu : I've failed to find an automated mechanism that will keep the conda-based docker above up-to-date with the latest tag of glencoesoftware/bioformats2raw. Instead, I've built that repo directly and pushed it to openmicroscopy/bioformats2raw:0.4.0 (link). Note: there's no automation there either but adding it will be straightforward if we choose to stick with this strategy.

shntnu · 2022-04-12T16:35:41Z

@joshmoore I believe @ErinWeisbart will be following up on this once she is back from vacation.

Meanwhile, is it possible to get an IDR identifier while we are working through this pilot? Very soon, we will be submitting our revision for the paper associated with this dataset, and they require an identifier for us to be able to submit.

joshmoore · 2022-04-12T18:44:45Z

@shntnu: re: @ErinWeisbart 👍. I'll be here. 😉

For the IDR identifier, I'd gently push you back to the standard IDR channels.

shntnu · 2022-04-12T18:59:00Z

For the IDR identifier, I'd gently push you back to the standard IDR channels.

Of course, will do

shntnu · 2022-04-13T15:26:55Z

Frances said

Your IDR accession number is idr0125. To cite your submission in a manuscript, please include your IDR accession number and the URL to the IDR homepage (https://idr.openmicroscopy.org/). For example, “Data was deposited to the Image Data Resource (https://idr.openmicroscopy.org/) under accession number idr0125.” Please note, this accession number won’t be active until your submission is publicly available in IDR.

🎉

cc @gwaygenomics

ErinWeisbart · 2022-04-25T19:51:13Z

@joshmoore I'm running some tests to optimize instance specs for our distributed deployment and it looks like I'm getting slightly different outputs than you. Is it obvious to you what I'm missing? Thanks in advance for your help.

I have an EBS volume mounted as /ebs_tmp with the images downloaded to it (for PLATE I used the same SQ00014812__2016-05-23T20_44_31-Measurement1 as you).

# Enter shell in docker, allowing access to ebs_tmp:
sudo docker run -it --rm --entrypoint /bin/sh -v ~/ebs_tmp:/ebs_tmp openmicroscopy/bioformats2raw:latest
# Run bioformats2raw:
sh /opt/bioformats2raw/bin/bioformats2raw /ebs_tmp/PLATE/Images/Index.idx.xml /ebs_tmp/images_zarr/PLATE.ome.zarr

du -sh SQ00014812__2016-05-23T20_44_31-Measurement1*
151G
du -sh images_zarr/SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/
194G

ome_zarr info SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/
doesn't return anything. This is what concerns me.

ome_zarr info images_zarr/SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/0/0/0

WARNING:ome_zarr.io:version mismatch: detected:FormatV02, requested:FormatV04
WARNING:ome_zarr.io:version mismatch: detected:FormatV04, requested:FormatV02
/home/ubuntu/ebs_tmp/images_zarr/SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/0/0/0 [zgroup]
 - metadata
   - Multiscales
 - data
   - (1, 5, 1, 2160, 2160)
   - (1, 5, 1, 1080, 1080)
   - (1, 5, 1, 540, 540)
   - (1, 5, 1, 270, 270)
   - (1, 5, 1, 135, 135)

I don't know if the warnings matter, but the output otherwise matches yours

will-moore · 2023-09-01T09:55:01Z

Hi, apologies for the long pause and thanks for bringing this back to my attention...

I guess there's a few threads to catch up on...

Downsampling / thumbnails: the creation of thumbnails on import into IDR was very slow because of the lack of downsampled resolutions. The option to create a single extra thumbnail resolution by using a large downsample factor in bioformats2raw isn't going to be supported (see Downscale factor option glencoesoftware/bioformats2raw#193). That leaves a couple of options: either we post-process the data to add a thumbnail resolution layer (I need to find or rewrite the script I had above for that) or we use bioformats2raw and generate multiple resolutions (test first to see how much bigger the data is).
Preparation of IDR server with ZarrReader. The existing IDR server doesn't include the new Bio-Formats ZarrReader needed to read OME-Zarr data. However, it does contain various custom Bio-Formats readers that are IDR-specific (added to support various studies in the past). Maintenance of these custom readers, especially during the upgrade necessary to add ZarrReader, has become too costly so we have decided to remove them and convert all the custom data into OME-Zarr. We are making good progress on this but it has been a fair bit of work. We are getting close to starting the import into the "next" IDR release server and at that point we'll be ready to start importing the cellpainting data (but will need some solution to thumbnailing first).
Metadata - I need to find and start validating the metadata you've provided to check that it corresponds to every Plate/Well/Image in the data. I'll let you know of any issues...

shntnu · 2023-09-01T14:40:49Z

Thanks a lot for recapping the status, Will!

The preparation of IDR server and metadata is in your hands, so we can only help with the thumbnails. I tried reading the past few comments to determine whether we preferred 1 vs. 2, but I couldn't conclude.

1. We post-process the data to add a thumbnail resolution layer (I need to find or rewrite the script I had above for that)
2. We use bioformats2raw and generate multiple resolutions (test first to see how much bigger the data is)

@ErinWeisbart - do you have an opinion? Your past comments #54 (comment) might help remind

will-moore · 2023-09-01T15:48:59Z

Created an issue at IDR/idr0125-way-cellpainting#2 wrt the annotation.csv file.

ErinWeisbart · 2023-09-01T16:34:04Z

1. We post-process the data to add a thumbnail resolution layer (I need to find or rewrite the script I had above for that)

2. We use bioformats2raw and generate multiple resolutions (test first to see how much bigger the data is)

I'm happy enough with either, though I have a preference for (1). It sounds like either way I would need to reprocess this dataset?

For (1) I would add Will's script to our Distributed-OMEZarrCreator which would add the thumbnail creation functionality so the thumbnail creation could be triggered either independently (for this dataset) or as part of the OMEZarr conversion (for any/all future datasets).

For (2) it's @shntnu 's call how much larger we can expand the data if we were to add the whole pyramid down to the necessary thumbnail size. Alternatively, we can re-create the whole pyramid and then delete the layers we don't want, but that isn't a very elegant approach. I'd much rather add functionality than just make stuff and delete part of it ;)

shntnu · 2023-09-01T16:45:37Z

It sounds like 2. would be a lot simpler to accomplish. Based on the previous notes below, it looks like we can get away with a 6.7% increase if we go with a scale factor of 4. Does that sound right to you @ErinWeisbart? If so, I am all for this approach

#54 (comment)
#54 (comment)

Specifically this:

Maybe rather than using scale factor of 2 which will create a 1/4+1/16+1/64+1/256 i.e. a 33% increase, we could use a scale factor of 4 which would bring us to 1/16+1/256 i.e. a 6.7 % increase for the conversion?

shntnu · 2023-09-01T16:50:08Z

Oh wait, now that I read #54 (comment) I'm not sure if 4x factor is allowed. If it is then the 1/16 resolution would be perfect: it will produce 135x135 thumbnails

will-moore · 2023-09-04T10:15:19Z

@shntnu No, the 4x factor won't be supported. So it could be up to a 33% increase in data, but it would be good to test since it's possible that it might compress further.

I'll try scripting the thumbnails too so we have that option available...

will-moore · 2023-09-04T12:21:02Z

@shntnu - I found and uploaded the script I was using for testing downsampling: https://github.com/IDR/idr0125-way-cellpainting/blob/main/scripts/add_downsampling.py
It only works for a single image at the moment, e.g. to add a downsample resolution to an image.zarr, scaling by a factor of 8... (which would give suitable thumbnail performance and still a good resolution):

python add_downsampling.py /path/to/image.zarr 8

EDIT: (This seems to be fine now - see next comment below) I was reminded in my testing that vizarr doesn't support images with downsamplings of factors other than 2.
I can look at scaling this up to work for a whole plate and testing in OMERO/IDR...
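To make the effect of a single factor-8 level concrete, the shape arithmetic can be sketched as below. This is not the logic of `add_downsampling.py` itself, only an illustration; the helper name `downsampled_shape` is hypothetical and the axis order (T, C, Z, Y, X) is assumed from the shapes reported earlier in this thread.

```python
def downsampled_shape(shape, factor):
    """Shrink the last two axes (assumed to be Y and X) of a shape by `factor`."""
    *leading, height, width = shape
    return (*leading, height // factor, width // factor)


full = (1, 5, 1, 2160, 2160)  # full-resolution field of view from the thread
thumb = downsampled_shape(full, 8)

# Fraction of the full-resolution pixel count held by the extra level.
pixel_fraction = (thumb[-2] * thumb[-1]) / (full[-2] * full[-1])

print(thumb)
print(f"extra pixels relative to full resolution: {pixel_fraction:.2%}")
```

A single factor-8 level therefore adds under 2% in uncompressed pixel count, compared with about a third for a full factor-2 pyramid.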

will-moore · 2023-09-13T14:05:17Z

@shntnu I have processed a sample plate to create a single extra resolution level at a factor-8 downsampling.

Described at
IDR/idr0125-way-cellpainting#3 (comment)

View in ome-ngff-validator at https://ome.github.io/ome-ngff-validator/?source=https://uk1s3.embassy.ebi.ac.uk/idr0125/SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr

This looks pretty good in vizarr - haven't imported into IDR yet but don't see any issues there.

This option looks like it could be a viable solution, but it still is more work than using bioformats2raw to generate a full pyramid.

So I guess it comes down to your workflow (whether you can include a python add_downsample.py step for each Plate) or whether you can afford the extra space for a full pyramid?

ErinWeisbart · 2023-09-13T16:59:28Z

That's great @will-moore !
It should be pretty simple for me to add add_downsample.py to our Distributed-OMEZarrCreator Docker such that when we create .ome.zarr's we can optionally pass an extra flag to perform the downsample at the same time. Do you want to make a PR to add the script to the repo (in the worker folder) so you have credit for the contribution and I can do the extra work of integrating it?

will-moore · 2023-09-14T16:53:27Z

Thanks @ErinWeisbart - I opened a PR at DistributedScience/Distributed-OMEZarrCreator#6

will-moore · 2023-10-09T19:27:23Z

I see that you've added the downsampling to the data - e.g. https://ome.github.io/ome-ngff-validator/?source=https://uk1s3.embassy.ebi.ac.uk/idr0125/SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/A/1/0/
Sorry, I'd not checked earlier.

We are making progress with the update of the IDR to support NGFF data - a few more things to cover but the end is in sight. We're currently thinking of releasing the upgrade to IDR, followed by a separate release with the cellpainting data but we'll let you know when the schedule is clearer.

shntnu · 2024-02-10T03:47:10Z

We are making progress with the update of the IDR to support NGFF data - a few more things to cover but the end is in sight. We're currently thinking of releasing the upgrade to IDR, followed by a separate release with the cellpainting data but we'll let you know when the schedule is clearer.

@will-moore did you get any closer? :D

will-moore · 2024-02-12T12:13:41Z

Hi @shntnu - apologies for not updating you on progress. Unfortunately the IDR upgrade to support NGFF data is taking longer than expected. We finally have all the NGFF data and software updates in place but are finding that reduced performance of reading NGFF data from s3 (mounted as a file-system) is causing issues with the server stability. So we are looking at installing microservices to spread the load...

Once the upgrade is released we will focus on getting your study in.
Frances was wondering if you'd got her e-mail "Annotations for idr0125" on the 16th November about compounds, concentration units etc? Thx

shntnu · 2024-02-21T13:03:54Z

Once the upgrade is released we will focus on getting your study in.

Thanks for the update, @will-moore

Frances was wondering if you'd got her e-mail "Annotations for idr0125" on the 16th November about compounds, concentration units etc? Thx

We have not done this yet but I will paste her email in here so we can keep track of it

As Will continues to import your plates into IDR, your library file has also been curated (attached) to provide annotations for your images. Are the identifiers in column I reagent identifiers? If not, please amend column header. Please can you confirm that the unit concentration for your compounds is in microMolar (column L), if not please amend the unit. Please could you also provide the InChIKey for each compound (column M) if available. If an InChIKey is not available, please leave blank. Please email the updated library file to us when ready.

idr0125-screenA-library.csv.zip

will-moore · 2024-05-01T11:01:50Z

Starting to look at this again since we are getting closer to releasing OME-NGFF support in IDR (apologies for the delay).

I noticed that I'd got a bit confused at #54 (comment) and mixed up the URLs to our sample data on embassy.ebi and the original data on your cellpainting-gallery.s3.

As far as I can see, the original data doesn't yet have down-sampled resolution levels: E.g. this shows a single multiscales resolution of shape 1,5,1,2160,2160:

https://ome.github.io/ome-ngff-validator/?source=https://cellpainting-gallery.s3.amazonaws.com/cpg0004-lincs/broad/images/2016_04_01_a549_48hr_batch1/images_zarr/SQ00015118__2016-04-13T19_52_28-Measurement1.ome.zarr/A/1/0/

Am I looking at the right data there? Are you still considering whether to add downsampling to that data?
It's possible that our recent OME-NGFF performance improvements can mitigate some of the issues we were seeing previously with lack of downsampling, but I think that it would still make the data more user-friendly to have the extra resolutions.

Cheers,
Will

ErinWeisbart · 2024-05-01T16:17:53Z

@will-moore Honestly, I lost track of this. I had thought I put a test in at s3://cellpainting-gallery/cpg0004-lincs/broad/images/2016_04_01_a549_48hr_batch1/test_downsample/ but it doesn't look any different from the images_zarr or images_zarr_50 folders (if I'm reading them right).

If you/your team are focusing on this again, I can add it back to my priority list (though I am out the next two weeks so there will be a delay on my end).

will-moore · 2024-05-01T16:38:21Z

Thanks @ErinWeisbart. It's not urgent for us. We are still several weeks away from moving this towards release and I was just starting to test again ahead of time. But if you have a chance to look at it sometime after you're back that would be great.

ErinWeisbart · 2024-05-16T21:35:08Z

Hi @will-moore.
I've implemented your downsample script into Distributed-OMEZarrCreator and tested it on a single plate.
https://ome.github.io/ome-ngff-validator/?source=https://cellpainting-gallery.s3.amazonaws.com/cpg0004-lincs/broad/images/2016_04_01_a549_48hr_batch1/images_zarr_withdownscale8/SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/

Can you confirm that this looks and performs as expected? Then I can convert the rest of the cpg0004-lincs dataset.

will-moore · 2024-05-17T19:21:48Z

Thanks - I'll let you know...

will-moore · 2024-05-20T20:02:18Z

@ErinWeisbart - That plate worked well and reduced the time for generating thumbnails in IDR from about 12 hours to approx 1.5 hours - Big improvement!
So please go ahead with the other plates.
Will you update the previously published plates to add the lower resolutions or use new locations?
Thanks!

ErinWeisbart · 2024-05-22T17:58:00Z

@will-moore The plates are now all updated in the original s3://cellpainting-gallery/cpg0004-lincs/broad/images/2016_04_01_a549_48hr_batch1/images_zarr location.
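Since this thread switches between `s3://` paths and the HTTPS links used by the validator, here is a small sketch of converting one to the other. It assumes a publicly readable bucket addressed with the same `<bucket>.s3.amazonaws.com` pattern as the links earlier in this thread. The function names are hypothetical helpers, not part of any existing tool.

```python
from urllib.parse import urlparse

VALIDATOR = "https://ome.github.io/ome-ngff-validator/?source="


def s3_to_https(s3_uri: str) -> str:
    """Map s3://bucket/key to https://bucket.s3.amazonaws.com/key."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {s3_uri!r}")
    return f"https://{parsed.netloc}.s3.amazonaws.com{parsed.path}"


def validator_url(s3_uri: str) -> str:
    """Build an ome-ngff-validator link for a Zarr group stored on S3."""
    return VALIDATOR + s3_to_https(s3_uri)


print(validator_url(
    "s3://cellpainting-gallery/cpg0004-lincs/broad/images/"
    "2016_04_01_a549_48hr_batch1/images_zarr"
))
```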

will-moore · 2024-05-23T16:39:40Z

Great, thanks @ErinWeisbart. I'll get working on them - although I might be delayed a bit due to our OME meeting next week...

will-moore · 2024-07-03T21:24:16Z

Just a quick update: I've been importing all the plates into a test server and they're looking good, but they take a while to import: about 5 hours per plate, so approximately a month for all the plates. I'll see if I can do this in parallel.

On the good-news side, we finally released the OME-Zarr support in IDR - see https://forum.image.sc/t/ome-ngff-data-in-the-idr/98630 so we are one step closer.
Still some work to do but we are making progress...
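As a rough check of the import estimate above, here is the arithmetic as a short sketch. It assumes the 136 plates listed for 2016_04_01_a549_48hr_batch1 at the top of this issue. The worker count is hypothetical and perfect parallel scaling is assumed, which a real server is unlikely to achieve.

```python
import math


def import_days(plates: int, hours_per_plate: float, workers: int = 1) -> float:
    """Estimated wall-clock days if `workers` imports run side by side."""
    rounds = math.ceil(plates / workers)  # each round imports up to `workers` plates
    return rounds * hours_per_plate / 24


print(round(import_days(136, 5), 1))             # serial: 28.3 days
print(round(import_days(136, 5, workers=4), 1))  # 4 parallel imports: 7.1 days
```

The serial figure agrees with "approximately a month"; the parallel figure is only an optimistic lower bound.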

shntnu · 2024-08-04T18:25:27Z

@will-moore – could you post an update when you get the chance? An upcoming (smallish) dataset would benefit a lot from IDR's tools, so if this overall approach is looking promising, we will start converting that dataset into the format needed.

(For Erin – this is cpg0038-tegtmeyer-neuropainting)

will-moore · 2024-08-13T16:30:20Z

Hi @shntnu, I was just comparing the names of the NGFF plates e.g. SQ00014812 as defined in the plate metadata with the plate names in the library.csv file, e.g. SQ00014812__2016-05-23T20_44_31-Measurement1.

These need to match so that our annotation scripts can assign rows to the imported Plates.
Currently, NGFF data imported into the server is given the shorter name SQ00014812. If you are happy to use this name in the IDR, then the library file would need to be updated to use those names.

https://ome.github.io/ome-ngff-validator/?source=https://cellpainting-gallery.s3.amazonaws.com/cpg0004-lincs/broad/images/2016_04_01_a549_48hr_batch1/images_zarr/SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr
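If the shorter names are adopted, the library file could be updated along these lines. This is only a sketch: it assumes the short name is everything before the first double underscore, as in the SQ00014812 example, and the `Plate` column name and the sample row are placeholders rather than the real library.csv layout.

```python
import csv
import io


def short_plate_name(full_name: str) -> str:
    """Drop the '__<timestamp>-Measurement<n>' suffix, if present."""
    return full_name.split("__", 1)[0]


def rename_plates(rows, column: str = "Plate"):
    """Return copies of the rows with the plate column shortened."""
    return [{**row, column: short_plate_name(row[column])} for row in rows]


# Stand-in for the library file; the column names here are assumptions.
sample = io.StringIO(
    "Plate,Well\n"
    "SQ00014812__2016-05-23T20_44_31-Measurement1,A1\n"
)
rows = rename_plates(csv.DictReader(sample))
print(rows[0]["Plate"])
```

Names that are already short pass through unchanged, so the function is safe to run on a partly updated file.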

I've also been checking the thumbnail generation for Plates in our IDR test server and identified a number of images where this failed because the images are all black. I went through and manually triggered generation of black thumbnails, and I listed these images so I know where to do this again. I thought I'd share the list so you can check whether it matches your expectations:

SQ00015043 P9 Field 1
SQ00015120 P1 Field 6
SQ00015148 P24 Field 6
SQ00015173 O19 Field 2
SQ00015195 B23 Field 5
SQ00015197 P1 Field 1
SQ00015198 P1 Fields 2-8, P21 Field 2
SQ00015207 O1 Fields 1 & 3-9 (P1 missing Well - as discussed above)
SQ00015208 B22 Field 5
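To cross-check the list above against acquisition records, it can be expanded into one (plate, well, field) entry per image. The parser below is a sketch written only for the notation used in this comment (ranges such as `2-8`, `&` between field groups, parenthetical notes ignored); it is not an IDR tool, and the function names are hypothetical.

```python
import re

# A well (row A-P, column 1-24 on a 384-well plate) followed by its field spec.
ENTRY = re.compile(r"([A-P]\d{1,2}) Fields? ([\d\s&\-]+)")

LINES = [
    "SQ00015043 P9 Field 1",
    "SQ00015120 P1 Field 6",
    "SQ00015148 P24 Field 6",
    "SQ00015173 O19 Field 2",
    "SQ00015195 B23 Field 5",
    "SQ00015197 P1 Field 1",
    "SQ00015198 P1 Fields 2-8, P21 Field 2",
    "SQ00015207 O1 Fields 1 & 3-9 (P1 missing Well - as discussed above)",
    "SQ00015208 B22 Field 5",
]


def expand_fields(spec: str) -> list[int]:
    """Turn '1 & 3-9' into [1, 3, 4, 5, 6, 7, 8, 9]."""
    fields = []
    for part in spec.split("&"):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-")
            fields.extend(range(int(low), int(high) + 1))
        else:
            fields.append(int(part))
    return fields


def parse_line(line: str) -> list[tuple[str, str, int]]:
    """Expand one list line into (plate, well, field) tuples."""
    plate, rest = line.split(" ", 1)
    rest = re.sub(r"\([^)]*\)", "", rest)  # drop parenthetical notes
    return [
        (plate, well, field)
        for well, spec in ENTRY.findall(rest)
        for field in expand_fields(spec)
    ]


expanded = [entry for line in LINES for entry in parse_line(line)]
print(len(expanded), "all-black images listed")
```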

shntnu · 2024-09-16T14:32:26Z

A note to ourselves that progress here is partially blocked by us (Broadies), because the request below from Frances still needs to be addressed:

#54 (comment)

This is currently on my plate

shntnu mentioned this issue Mar 9, 2021

Processing data using DeepProfiler #2

Open

ErinWeisbart mentioned this issue Sep 1, 2023

Missing rows in annotation.csv IDR/idr0125-way-cellpainting#2

Closed

will-moore mentioned this issue Sep 14, 2023

Add worker/add_downsampling.py script DistributedScience/Distributed-OMEZarrCreator#6

Merged

will-moore mentioned this issue May 1, 2024

idr0125 NGFF import IDR/idr0125-way-cellpainting#4

Open
