Upload Image Files to IDR #54
Comments
A conclusion from our internal discussion: Let's also include the LKCP dataset when submitting to IDR. |
The first step is to reach out to IDR to see if they would be interested in hosting these data. I plan on doing this today. Becki will be taking notes on the submission process, on the wiki. I will use this issue to jot down specific metadata information that we'll likely need to track for IDR. A couple immediate answers to track:
We hope to submit a preprint in 2-3 months.
.tiff
Yes. Feature level data are available at https://github.com/broadinstitute/lincs-cell-painting/
Definitely. These data are morphologies after thousands of drug perturbations. Data can be linked by drug information. |
Initial inquiry sent on March 16, 2021 with ipLINCS project tag and subject: "[IDR] LINCS Cell Painting - a 45TB benchmark dataset of drug perturbations" |
On March 23, 2021, we received word from the IDR staff that they will not accept our data without a manuscript draft first. I believe the current plan is to introduce this dataset with the LINCS profiling complementarity paper. |
I've created a checklist based on an email from Frances Wong:
Transfer images
We’ve recently set up the Globus platform for file transfer (https://www.globus.org/).
When preparing your image files for transfer, you may wish to refer to your previous submission (idr0080) as scripts like https://github.com/IDR/idr0080-way-perturbation/blob/master/scripts/illumcorrect_plate_symlinks.sh may be useful. Note: We will not create illumination-corrected files; we don't have the capacity to do that. See broadinstitute/cell-health#106 to understand why this is a very labor-intensive task.
Steps
du -h --max-depth 0 /cmap/imaging/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1/images/
26T /cmap/imaging/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1/images/
Upload script
TOP_LEVEL_FOLDER=/cmap/imaging/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1/images/
aws s3 sync \
--profile jump-cp-role \
--acl bucket-owner-full-control \
${TOP_LEVEL_FOLDER} \
s3://cellpainting-gallery/lincs/broad/images/2016_04_01_a549_48hr_batch1/images/
I should exclude 4 plates because these were bad plates (they got left behind in the freezer, and the images were terrible once we did image them; they were excluded from all analyses)
parallel aws s3 rm --recursive --profile jump-cp-role s3://cellpainting-gallery/lincs/broad/images/2016_04_01_a549_48hr_batch1/images/{1} ::: SQ00015225__2016-10-29T16_09_17-Measurement1 SQ00015226__2016-10-29T17_50_20-Measurement1 SQ00015227__2016-10-29T19_31_37-Measurement1 SQ00015228__2016-10-29T21_13_50-Measurement1
Fill templates
As before with idr0080, we need some information about the study and the images for this new submission. We have some metadata templates for this information. Empty templates can be downloaded here https://github.com/IDR/idr0000-lastname-example/archive/master.zip.
There are 3 template files to fill in.
There are examples of completed templates for other studies here https://github.com/IDR/idr-metadata/. Please try to fill in as much information as you can. Our most recent submission is idr0080:
Wrap up
Please keep using [email protected] email address for any future communication. |
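For the upload step in the checklist above, one alternative to uploading everything and then removing the four bad plates is to skip them during the sync. The sketch below only prints the command it would run, so it can be reviewed first. `sync_command` is a hypothetical helper name, not part of the AWS CLI; it relies on the `--exclude` filter of `aws s3 sync`, and the profile, paths, and plate names are copied from the checklist.

```shell
#!/bin/sh
# Sketch only: prints the sync command instead of running it.

# sync_command SRC DST PLATE...  (hypothetical helper, not part of the AWS CLI)
sync_command() {
  src=$1
  dst=$2
  shift 2
  printf 'aws s3 sync --profile jump-cp-role --acl bucket-owner-full-control'
  for plate in "$@"; do
    # --exclude patterns are matched relative to the source folder
    printf " --exclude '%s/*'" "$plate"
  done
  printf ' %s %s\n' "$src" "$dst"
}

# Paths and plate names as given in the checklist
sync_command \
  /cmap/imaging/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1/images/ \
  s3://cellpainting-gallery/lincs/broad/images/2016_04_01_a549_48hr_batch1/images/ \
  SQ00015225__2016-10-29T16_09_17-Measurement1 \
  SQ00015226__2016-10-29T17_50_20-Measurement1 \
  SQ00015227__2016-10-29T19_31_37-Measurement1 \
  SQ00015228__2016-10-29T21_13_50-Measurement1
```

If the printed command looks right, it can be pasted into a terminal. Whether excluding up front is preferable to deleting afterwards depends on whether the upload has already started.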
@gwaygenomics Any thoughts on this? |
I didn't add phenotypes or any quantification to the cell health submission. The info on the right is all I provided: https://idr.openmicroscopy.org/webclient/?show=screen-2701 Thanks! |
This is less permissive than the license that we will use in the s3://cellpainting-gallery (CC0 https://github.com/awslabs/open-data-registry/blob/899c7a0e44e331dfc9c844a2a28261406ad73eb7/datasets/cellpainting-gallery.yml#L29) but that's ok I think. Do you see any issues @gwaygenomics ? |
Sounds good to me 👍 - as long as people are free to use, I'm good |
It's fine, we are happy to go with a more permissive license than CC BY 4.0, so CC0 is good for us. Thanks |
Hi all! Sorry to not have pinged sooner, but how are we doing with this upload? We received favorable reviews, but the journal will not publish without public data. Thanks! (hope all is well!) |
@joshmoore said:
aws --no-sign-request --region us-east-1 s3 ls --summarize --human-readable --recursive s3://cellpainting-gallery/lincs/broad/images/2016_04_01_a549_48hr_batch1/images/SQ00014812__2016-05-23T20_44_31-Measurement1/Images/ 2>&1 | tail -n 2
exec time \
conda run -n aws \
aws --no-sign-request --region us-east-1 s3 sync \
s3://cellpainting-gallery/lincs/broad/images/2016_04_01_a549_48hr_batch1/images/SQ00014812__2016-05-23T20_44_31-Measurement1/Images/ \
SQ00014812__2016-05-23T20_44_31-Measurement1/Images/ | tee "$(date "+%F_%T").log"
time sudo docker run \
-u $(id -u) -v $PWD:/src --rm josh-bf2raw \
--debug=INFO \
/src/SQ00014812__2016-05-23T20_44_31-Measurement1/Images/Index.idx.xml \
/src/SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr
du -sh SQ00014812__2016-05-23T20_44_31-Measurement1*
ome_zarr info SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/
ome_zarr info SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/0/0/0
|
Over to @ErinWeisbart and @bethac07 |
I think it makes sense to leave the download and upload out. |
@bethac07 : Perfect. Thanks. |
@joshmoore – do we expect this 33% increase in storage? From David Logan, I had learned this:
I wonder if the zlib compression switch was off in your conversion? |
On average the TIFFs are 9 MB and 5 make up the equivalent of one OME Image. Looking at the pyramid of an OME-Zarr:
the full resolution matches 5*9MB. So the extra space should come primarily from the extra four levels of the pyramid (
A different compression might help, but configuring the pyramid levels will definitely make a difference:
|
Thanks for the explainer @joshmoore!
Is this a decision that IDR will make (to keep it standard across all datasets) or do we need to / get to decide? If we need to decide, we might need some help on understanding what we'd trade if we went with fewer levels (better storage-wise but worse interactivity-wise?) |
At least from the OMERO perspective, the individual fields of view would not classify under the category of large images (aka larger than 3K x 3K) where the server would mandate pyramidal levels. This means all data access operations would only happen using tiled access to the top-level resolution. Said otherwise, the intermediate resolutions generated are not critical for OMERO/IDR and we can likely make compromises in order to keep the data volumes largely equivalent between both representations. I think pre-computing some intermediate resolution levels in the NGFF representation is valuable. In particular, the lowest resolution typically corresponds to the thumbnail representation of a field of view. OMERO currently recomputes these levels internally but with growing usage of NGFF, I could certainly imagine it natively making use of these resolutions if they exist. Maybe rather than using a scale factor of 2, which will create 1/4+1/16+1/64+1/256 i.e. a 33% increase, we could use a scale factor of 4, which would bring us to 1/16+1/256 i.e. a 6.7% increase for the conversion? |
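The arithmetic behind those two figures can be checked with a short sketch. It assumes each extra level shrinks width and height by the scale factor, so its pixel count shrinks by the factor squared, and it ignores compression, which may change the real ratio. `pyramid_overhead_pct` is a hypothetical helper name.

```shell
#!/bin/sh
# pyramid_overhead_pct FACTOR LEVELS
# Extra storage from LEVELS downsampled levels, as a percentage of the
# full-resolution size (uncompressed pixel counts only).
pyramid_overhead_pct() {
  LC_ALL=C awk -v f="$1" -v n="$2" 'BEGIN {
    size = 1
    total = 0
    for (i = 1; i <= n; i++) {
      size = size / (f * f)   # each level has 1/f^2 of the pixels of the previous one
      total += size
    }
    printf "%.1f\n", 100 * total
  }'
}

pyramid_overhead_pct 2 4   # 1/4 + 1/16 + 1/64 + 1/256
pyramid_overhead_pct 4 2   # 1/16 + 1/256
```

This prints 33.2 and 6.6. The two listed terms for a scale factor of 4 therefore sum to roughly 6.6%, while 6.7% is what an unbounded factor-4 pyramid would approach (1/15).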
I've talked to Beth and I think I'm up to speed on my part of this project, at least as up-to-speed as I can get without starting to get my hands dirty. (Sorry for missing out on joining the meeting, but I'm on the West Coast which makes it quite a challenge to reasonably schedule meetings with folks across the pond). It sounds like I should go ahead with a scale factor of 4? |
Hi @ErinWeisbart. If a US-timezone call is necessary in the next few weeks, let me know. The docker has the |
Thanks @joshmoore. This project is right at the edge of my knowledge base, so I apologize for asking naive questions. Our "Distributed-Something" usually points to a Docker on Dockerhub. Were you planning on creating an official openmicroscopy/bioformats2raw docker? |
@joshmoore would you recommend that @ErinWeisbart creates a docker herself using this?:
|
@shntnu : I've failed to find an automated mechanism that will keep the conda-based docker above up-to-date with the latest tag of |
@joshmoore I believe @ErinWeisbart will be following up on this once she is back from vacation. Meanwhile, is it possible to get an IDR identifier while we are working through this pilot? Very soon, we will be submitting our revision for the paper associated with this dataset, and they require an identifier for us to be able to submit. |
@shntnu: re: @ErinWeisbart 👍. I'll be here. 😉 For the IDR identifier, I'd gently push you back to the standard IDR channels. |
Of course, will do |
Frances said
🎉 cc @gwaygenomics |
@joshmoore I'm running some tests to optimize instance specs for our distributed deployment and it looks like I'm getting slightly different outputs than you. Is it obvious to you what I'm missing? Thanks in advance for your help. I have an EBS volume mounted as /ebs_tmp with the images downloaded to it (for PLATE I used the same SQ00014812__2016-05-23T20_44_31-Measurement1 as you).
I don't know if the warnings matter, but the output otherwise matches yours |
Hi, apologies for the long pause and thanks for bringing this back to my attention... I guess there's a few threads to catch up on...
|
Thanks a lot for recapping the status, Will! The preparation of IDR server and metadata is in your hands, so we can only help with the thumbnails. I tried reading the past few comments to determine whether we preferred 1 vs. 2, but I couldn't conclude.
@ErinWeisbart - do you have an opinion? Your past comments #54 (comment) might help remind |
Created an issue at IDR/idr0125-way-cellpainting#2 wrt the annotation.csv file. |
I'm happy enough with either, though I have a preference for (1). It sounds like either way I would need to reprocess this dataset? For (1) I would add Will's script to our Distributed-OMEZarrCreator which would add the thumbnail creation functionality so the thumbnail creation could be triggered either independently (for this dataset) or as part of the OMEZarr conversion (for any/all future datasets). For (2) it's @shntnu 's call how much larger we can expand the data if we were to add the whole pyramid down to the necessary thumbnail size. Alternatively, we can re-create the whole pyramid and then delete the layers we don't want, but that isn't a very elegant approach. I'd much rather add functionality than just make stuff and delete part of it ;) |
It sounds like 2. would be a lot simpler to accomplish. Based on the previous notes below, it looks like we can get away with a 6.7% increase if we go with a scale factor of 4. Does that sound right to you @ErinWeisbart? If so, I am all for this approach. Specifically this:
|
Oh wait, now that I read #54 (comment) I'm not sure if 4x factor is allowed. If it is then the 1/16 resolution would be perfect: it will produce 135x135 thumbnails |
@shntnu No, the 4x factor won't be supported. So it could be up to a 33% increase in data, but it would be good to test since it's possible that it might compress further. I'll try scripting the thumbnails too so we have that option available... |
@shntnu - I found and uploaded the script I was using for testing downsampling: https://github.com/IDR/idr0125-way-cellpainting/blob/main/scripts/add_downsampling.py
EDIT: (This seems to be fine now - see next comment below) I was reminded in my testing that |
@shntnu I have processed a sample plate to create a single extra resolution level at a factor-8 downsampling. Described at View in ome-ngff-validator at https://ome.github.io/ome-ngff-validator/?source=https://uk1s3.embassy.ebi.ac.uk/idr0125/SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr This looks pretty good in vizarr - haven't imported into IDR yet but don't see any issues there. This option looks like it could be a viable solution, but it still is more work than using bioformats2raw to generate a full pyramid. So I guess it comes down to your workflow (whether you can include a |
That's great @will-moore ! |
Thanks @ErinWeisbart - I opened a PR at DistributedScience/Distributed-OMEZarrCreator#6 |
I see that you've added the downsampling to the data - e.g. https://ome.github.io/ome-ngff-validator/?source=https://uk1s3.embassy.ebi.ac.uk/idr0125/SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/A/1/0/ We are making progress with the update of the IDR to support NGFF data - a few more things to cover but the end is in sight. We're currently thinking of releasing the upgrade to IDR, followed by a separate release with the cellpainting data, but we'll let you know when the schedule is clearer. |
@will-moore did you get any closer? :D |
Hi @shntnu - apologies for not updating you on progress. Unfortunately the IDR upgrade to support NGFF data is taking longer than expected. We finally have all the NGFF data and software updates in place but are finding that reduced performance of reading NGFF data from s3 (mounted as a file-system) is causing issues with the server stability. So we are looking at installing microservices to spread the load... Once the upgrade is released we will focus on getting your study in. |
Thanks for the update, @will-moore
We have not done this yet but I will paste her email in here so we can keep track of it:
As Will continues to import your plates into IDR, your library file has also been curated (attached) to provide annotations for your images. Are the identifiers in column I reagent identifiers? If not, please amend the column header. Please can you confirm that the unit concentration for your compounds is in microMolar (column L)? If not, please amend the unit. Please could you also provide the InChIKey for each compound (column M) if available. If an InChIKey is not available, please leave blank. Please email the updated library file to us when ready. |
Starting to look at this again since we are getting closer to releasing OME-NGFF support in IDR (apologies for the delay). I noticed that I'd got a bit confused at #54 (comment) and mixed up the URLs to our sample data on embassy.ebi and the original data on your cellpainting-gallery.s3. As far as I can see, the original data doesn't yet have down-sampled resolution levels: E.g. this shows a single Am I looking at the right data there? Are you still considering whether to add downsampling to that data? Cheers, |
@will-moore Honestly, I lost track of this. I had thought I put a test in at If you/your team are focusing on this again, I can add it back to my priority list (though I am out the next two weeks so there will be a delay on my end). |
Thanks @ErinWeisbart. It's not urgent for us. We are still several weeks away from moving this towards release and I was just starting to test again ahead of time. But if you have a chance to look at it sometime after you're back that would be great. |
Hi @will-moore. Can you confirm that this looks and performs as expected and then I can convert the rest of the cpg004-lincs dataset? |
Thanks - I'll let you know... |
@ErinWeisbart - That plate worked well and reduced the time for generating thumbnails in IDR from about 12 hours to approx 1.5 hours - Big improvement! |
@will-moore The plates are now all updated in the original |
Great, thanks @ErinWeisbart. I'll get working on them - although I might be delayed a bit due to our OME meeting next week... |
Just a quick update.... I've been importing all the plates into a test server and they're looking good but they take a while to import - About 5 hours a plate - approx a month for all the plates, so I'll see if I can do this in parallel... On the good-news side, we finally released the OME-Zarr support in IDR - see https://forum.image.sc/t/ome-ngff-data-in-the-idr/98630 so we are one step closer. |
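As a rough guide to what parallel imports might buy, the estimate above can be sketched as follows. The plate count is not stated in this thread: 144 is a hypothetical value chosen only because 144 plates at 5 hours each comes to 30 days. `import_hours` is a hypothetical helper, and the sketch assumes workers do not slow each other down.

```shell
#!/bin/sh
# import_hours PLATES HOURS_PER_PLATE WORKERS
# Wall-clock hours if each worker imports its share of plates one after another.
import_hours() {
  plates=$1
  hours_per_plate=$2
  workers=$3
  # ceil(plates / workers) plates per worker
  per_worker=$(( (plates + workers - 1) / workers ))
  echo $(( per_worker * hours_per_plate ))
}

import_hours 144 5 1   # 720 hours, i.e. 30 days (plate count is an assumption)
import_hours 144 5 4   # 180 hours, i.e. 7.5 days with four parallel imports
```

In practice the per-plate time may grow under parallel load, so this is an optimistic bound rather than a prediction.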
@will-moore – could you post an update when you get the chance? An upcoming (smallish) dataset would benefit a lot from IDR's tools, so if this overall approach is looking promising, we will start converting that dataset in the format needed. (For Erin – this is |
Hi @shntnu, I was just comparing the names of the NGFF plates e.g. These need to match in order that our annotation scripts can assign rows to the imported Plates. I've also been checking the thumbnail generation for Plates in our IDR test server and identified a bunch of images where this failed due to the images being all black. I went through and manually triggered generation of black thumbnails and I listed these images so I know where to do this again. I thought I'd share this list to check it corresponds with your expectations?
|
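The plate-name comparison mentioned above can be sketched as follows. The example names were not preserved in this thread, so this assumes, hypothetically, that the NGFF plate folders carry the full acquisition name (as in `SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr`) while the annotation file uses only the part before the double underscore. Both helper names are made up for illustration.

```shell
#!/bin/sh
# plate_barcode NAME (hypothetical helper)
# Reduce a plate name to the part before "__".
plate_barcode() {
  name=${1%/}              # drop a trailing slash, if any
  name=${name%.ome.zarr}   # drop the OME-Zarr suffix, if any
  printf '%s\n' "${name%%__*}"
}

# names_match NGFF_NAME ANNOTATION_NAME (hypothetical helper)
# Succeeds when both names reduce to the same barcode.
names_match() {
  [ "$(plate_barcode "$1")" = "$(plate_barcode "$2")" ]
}

if names_match "SQ00014812__2016-05-23T20_44_31-Measurement1.ome.zarr/" "SQ00014812"; then
  echo "match"
else
  echo "mismatch"
fi
```

If the real mismatch is of a different kind, only `plate_barcode` would need to change.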
A note to ourselves that the progress here is partially blocked by us (Broadies), because this below from Frances needs to be addressed. This is currently on my plate. |
We will upload image files to the Image Data Resource and add URL and metadata information to the Broad Bioimage Benchmark Collection.
We will use this issue to outline the required steps.
From IDR:
All files should be in tab-delimited text format.
Templates are provided but can be modified to suit your experiment.
Add or remove columns from the templates as necessary.
@gwaygenomics Did you have a processed data file for cell health?