Merge pull request #107 from RConsortium/106-post-volcalc-blog

volcalc blog post
RConsortium · Sep 23, 2024 · c85586e · c85586e
2 parents c202c14 + 333be9b
commit c85586e
Show file tree

Hide file tree

Showing 2 changed files with 220 additions and 0 deletions.
diff --git a/...emical-volatility-how-the-volcalc-r-package-is-streamlining-scientific-research/index.qmd b/...emical-volatility-how-the-volcalc-r-package-is-streamlining-scientific-research/index.qmd
@@ -0,0 +1,220 @@
+---
+title: "Unlocking Chemical Volatility: How the volcalc R Package is Streamlining Scientific Research"
+description: "Kristina Riemer, director of the CCT Data Science Team at the University 
+of Arizona, and Eric Scott, Scientific Programmer and Educator in the CCT Data Science Team, 
+the developers behind the volcalc package, discuss the motivation and development of this 
+innovative tool designed to automate the calculation of chemical compound volatilities. "
+author: "R Consortium"
+image: "volcalc.webp"
+date: "09/23/2024"
+---
+
+![](volcalc.webp)
+
+The R Consortium recently interviewed [Kristina Riemer](https://datascience.cct.arizona.edu/person/kristina-riemer), director of [the CCT Data Science Team](https://datascience.cct.arizona.edu/) at the University of Arizona, and [Eric Scott](https://datascience.cct.arizona.edu/person/eric-scott), Scientific Programmer and Educator in the CCT Data Science Team, the developers behind the [volcalc package](https://github.com/Meredith-Lab/volcalc), to discuss the motivation and development of this innovative tool designed to automate the calculation of chemical compound volatilities. volcalc streamlines the process by allowing users to input a compound and quickly receive its volatility information, eliminating the need for time-consuming manual calculations. Initially created to assist [Dr. Laura Meredith](https://has.arizona.edu/person/laura-meredith) in managing a large database of volatile compounds, volcalc has since grown into a more versatile tool under Eric’s leadership, now supporting a wider range of researchers. 
+
+Kristina and Eric share insights into the challenges they faced, including managing dependencies, 
+integrating with CRAN and Bioconductor, and refining complex molecular identification methods. 
+They also discuss future enhancements, such as incorporating temperature-specific volatility 
+calculations and expanding the package’s functionality to estimate other compound characteristics.
+ This project was funded by the R Consortium. 
+
+**Could you share what motivated the development of the volcalc package and how it 
+aligns with the broader goals of the R ecosystem, particularly in scientific computing?**
+
+**Kristina**: I was heavily involved in the initial development of volcalc, and later 
+on, Eric took over the project. We developed volcalc because we began collaborating 
+with Dr. Laura Meredith, who was compiling a database of volatile chemical compounds. 
+At the time, she had around 300 compounds, and her students manually gathered details 
+for each one by examining their representations and calculating various associated values. 
+This process was tedious and prone to errors, so we thought there must be a more 
+efficient and automated way to handle it.
+
+That’s when we came up with the idea of creating a pipeline where someone could input a 
+compound and quickly receive its volatility information, eliminating the need for all 
+the manual labor. The purpose of volcalc was to transform the process from taking months 
+to gather details for 300 compounds to obtaining information for thousands in a much 
+shorter time.
+
+**Eric**: volcalc was initially developed specifically for a project where the 
+researchers were mainly interested in chemical compounds from the 
+[KEGG database (Kyoto Encyclopedia of Genes and Genomes)](https://www.genome.jp/kegg/). 
+When I joined the team and learned about the project, I was thrilled because, as a 
+chemical ecologist, I saw its potential. However, I also recognized a limitation: 
+the tool only worked with the KEGG database. This was a drawback because many 
+researchers, including food scientists and others who work with similar compounds, 
+might not find their compounds in that specific database.
+
+This realization inspired me to apply for the R Consortium grant. We saw a significant 
+opportunity to expand volcalc, making it more flexible and applicable to a wider range 
+of researchers. We also wanted to improve its integration within the R ecosystem by 
+adding features like returning the file path of a molecule representation after 
+downloading it, so it could be easily piped into subsequent steps. These enhancements 
+aimed to make the tool more versatile and user-friendly for a broader audience.
+
+**What were the most significant challenges you faced during the development of the 
+initial version of volcalc, and how did you overcome them?**
+
+**Kristina**: One of the most challenging aspects of developing volcalc, which 
+continues to be an issue, is managing dependencies. Specifically, we rely heavily 
+on a command-line program to handle much of the processing. Early on, we struggled 
+with how to enable users to run volcalc without needing to install this program on 
+their own computers, as many of our users aren’t familiar with that kind of setup. 
+I spent a lot of time trying to create a reproducible environment using 
+[Binder](https://mybinder.org/), but I was never able to get it fully working. 
+Even today, there are still issues related to managing these dependencies, which 
+Eric can elaborate on further.
+
+It was incredibly important to have Eric on this project because I don’t have a 
+strong background in chemistry. His ability to come in and figure out some of 
+the intricate details that would have taken me much longer to grasp was a huge 
+advantage. The more we can collaborate with domain experts, the better our results will be.
+
+**Eric**: One thing that has helped with the dependency challenges is that we’ve 
+started building volcalc on [R-Universe](https://r-universe.dev/search), which means 
+binaries are available there. While it’s not on CRAN yet, having these binaries on 
+R-Universe makes installation a bit easier. However, we’ve faced some challenges 
+with dependencies, particularly because two of them are from Bioconductor. We 
+didn’t originally aim to develop this package for Bioconductor, which uses S4 
+objects and has different standards than CRAN. Our goal was to get it on CRAN, 
+but our first submission was rejected because the license field for the Bioconductor 
+package wasn’t formatted to CRAN’s liking. These differences between Bioconductor and 
+CRAN have created barriers, even though the authors of the Bioconductor package have 
+been very responsive. Their package works fine on Bioconductor, but it doesn’t meet 
+CRAN’s criteria, which has been a frustrating challenge.
+
+Another major challenge in developing volcalc relates to the method we use for 
+estimating volatility. This method involves counting the numbers of different 
+functional groups on molecules—such as hydroxyl groups or sulfur atoms—and assigning 
+coefficients to them. To do this programmatically, we use something called SMARTS, 
+which is essentially like regular expressions but for molecular structures. Regular 
+expressions for text are already challenging, but SMARTS is even more complex 
+because it deals with three-dimensional molecules.
+
+Before I joined the group, the first version of volcalc had most of these functional 
+groups figured out, but not all. I spent a significant amount of time trying to 
+develop SMARTS strings to match additional molecules. Moving forward, I hope that 
+if we implement new versions, we can get help from the community to refine these 
+SMARTS strings, as there are likely people out there who are more skilled at it than I am.
+
+**The original project proposal mentions expanding volcalc to work with any chemical 
+compound with a known structure. What are the key technical challenges you anticipate 
+in achieving this goal?**
+
+**Eric**: This task turned out to be less difficult than I initially expected, but 
+let me explain. In the original version of volcalc, before we received the R Consortium 
+funding, the main function started with a KEGG ID—an identifier specific to the KEGG 
+database. The function would download a MOL file, which is a text representation of a 
+molecule corresponding to that ID. It would then identify and count the functional groups 
+in the molecule, and finally, calculate the volatility based on those counts.
+
+The major change we needed to implement to make volcalc more versatile was to decouple 
+these steps. In the current version of volcalc, the functionality to download a MOL file 
+from KEGG is still available, but it’s now separate from the main function that calculates 
+volatility. This means that the inputs for calculating volatility can now be any MOL file, 
+not just ones from KEGG. The file can come from any database, be exported from other software, 
+or even be downloaded manually. Additionally, the tool now supports SMILES, which is 
+another, simpler text-based representation of molecules.
+
+There are various ways to represent chemicals in text, including another format 
+called [InChI.](https://www.inchi-trust.org/) The Bioconductor packages we use, ChemmineR 
+and ChemmineOB, have the ability to translate from InChI and other types of chemical 
+representations. However, that feature isn’t available on Windows. So, I decided to 
+keep volcalc focused on SMILES and MOL files. I believe that chemists and other 
+researchers should be able to obtain data in one of these two formats, or use another 
+tool to translate their data into these formats. I didn’t want to overload volcalc 
+with the responsibility of being a chemical representation translator, as that didn’t 
+seem like its primary purpose.
+
+**Can you walk us through the process of implementing the SIMPOL algorithm within the 
+volcalcc package?**
+
+**Kristina**: The algorithm itself is fairly simple; it’s just basic math. You need 
+to input some constants, the mass of the compound, and the counts of the functional 
+groups we discussed earlier. Writing the code for this was straightforward and not 
+particularly challenging.
+
+**Eric**: Each functional group has a coefficient associated with it, which is 
+multiplied by the number of times that group appears in the molecule. These values 
+are then summed up, and the mass of the molecule is factored in as well. The challenging 
+part wasn’t the algorithm itself, which is straightforward—just multiplying by 
+coefficients and adding them up. The real difficulty was interpreting what the authors 
+of the algorithm meant by each of the functional groups. Some were oddly specific, 
+like how the hydroxyl group that is part of a nitrophenol group isn’t supposed to count 
+toward the total number of hydroxyl groups. I spent a lot of time poring over the paper, 
+particularly one table, to fully understand how they defined each group. That 
+interpretation was the hardest part.
+
+**What future functionalities or expansions do you see as crucial for volcalc, especially 
+in the context of evolving research needs in chemoinformatics?**
+
+**Eric**: Right now, we’re working on allowing users to specify different temperatures. 
+The paper that describes the SIMPOL.1 method includes equations for how the coefficients 
+of each functional group change with temperature. These changes aren’t always linear, 
+and the contributions of functional groups can shift in importance as the temperature 
+varies. This is an important feature to include because the version of volcalc we 
+currently have uses coefficients calculated at 20°C, based on a table from the original 
+paper. To accommodate other temperatures, we need to integrate another table that 
+provides equations for calculating these coefficients based on temperature, and that’s
+ what we’re working on.
+
+Another key feature we want to leave room for in the future is the ability to add other 
+methods for estimating volatility. SIMPOL.1 is just one type of group contribution 
+method, but there are other approaches described in various papers that use different 
+functional groups, equations, and coefficients. The basic idea remains the same: count 
+the functional groups in a molecule, apply an equation, and estimate volatility. We’re 
+trying to structure the code in a way that makes it easy to incorporate additional 
+methods later, even if we don’t add them right away. I think these are the most important 
+features we’re focusing on right now.
+
+**Kristina**: We’re focused on the features I mentioned in the near future, but looking 
+further ahead, I could see volcalc expanding to estimate other characteristics of compounds 
+beyond just volatility. While I’m not a chemistry expert or a chemical ecologist, I 
+imagine that those interested in volatility might also be interested in other compound 
+characteristics that currently lack automated tools for estimation. So, it’s possible 
+the package could evolve to include those features.
+
+That said, one of the things I appreciate about the R package ecosystem is that it 
+allows for specialized tools. Since anyone can build what they need, we don’t end up 
+with massive, overly complex packages that try to do everything and become difficult 
+to maintain. It might be better to keep volcalc focused and leave room for separate 
+packages to handle additional functionality. This way, the tools remain manageable 
+and easier to maintain in the long run.
+
+**How has it been working with the R Consortium? Would you recommend applying for 
+an ISC grant to other R developers?**
+
+**Kristina**: The application process was straightforward, and I found the grant 
+format to be very practical. It was focused on milestones and product development, 
+which is refreshing compared to many academic research grants that tend to avoid 
+specific deliverables. I highly recommend considering this grant. I believe people 
+often overlook smaller funding sources, but even small amounts can make a big impact 
+on the work you’re doing.
+
+**Eric**: The first time I applied for an R Consortium grant was as a grad student, 
+and I strongly encourage trainees to apply as well. It was a great experience for me 
+because I could do it independently—my advisor wasn’t involved as one of the authors, 
+and it wasn’t a complex process like applying for an NSF grant. It was straightforward 
+and really rewarding. The only tricky part was figuring out the payment process, but 
+that’s something people can work out.
+
+I’ve noticed there seem to be fewer projects in recent years, and I don’t think it’s 
+due to a lack of funding. It seems like fewer people are applying, which is why I 
+especially encourage others to give it a shot. From what I’ve seen, there’s a very 
+good chance of getting funded if you apply right now.
+
+People should be creative and think broadly about how their project can benefit the 
+broader R community. This doesn’t mean you need to develop the next big thing like 
+R-Universe or CRAN. It can be something smaller, like a package that other R users 
+will find helpful. For example, with our project, volcalc, our main goal was to 
+encourage chemists—who usually use point-and-click software—to start using R. 
+That was enough of a contribution to the R community to get funded. So, I really 
+encourage people to think creatively about what “benefiting the R community” can mean.
+
+## About ISC Funded Projects
+
+A major goal of the R Consortium is to strengthen and improve the infrastructure 
+supporting the R Ecosystem. We seek to accomplish this by funding projects that will 
+improve both technical infrastructure and social infrastructure.
+
+[Learn More!](/all-projects/)
diff --git a/...cal-volatility-how-the-volcalc-r-package-is-streamlining-scientific-research/volcalc.webp b/...cal-volatility-how-the-volcalc-r-package-is-streamlining-scientific-research/volcalc.webp