Response to Package Reviews

Dear @sckott,

I am very grateful for the two valuable reviews from @grimbough and @naupaka. In the following response to the reviewers, I have tried to address all comments and issues and have incorporated all changes into the new version of the biomartr package (see NEWS).

I hope I have addressed all points sufficiently so that you can consider biomartr for inclusion in the rOpenSci package collection.

I would like to thank you and both reviewers for your detailed and constructive comments. It was a pleasure to work with you and I am looking forward to collaborating with the rOpenSci community in the future.

Kind regards, Hajk

Response to reviewer 1 (@grimbough)

General comments

I actually find the package name a little misleading, since this does way more than just provide an R interface to the BioMart API. Perhaps that was the original intention for the package, but the present functionality exceeds this. It’s probably a bit late to change this though given there are existing publications referring to it.

Response: I agree. The package name is constrained by the history of how biomartr was developed. Initially, I focused on the BioMart database, but it then closed its service, and the NCBI and ENSEMBL databases took over parts of its role while diversifying their services as well. In addition, during the review of the biomartr paper in the journal Bioinformatics, it was requested that the functionality of the package be extended to retrieve genomes, proteomes, etc. from the NCBI and ENSEMBL databases. I apologize that all of this led to a somewhat misleading package name, but I hope that the detailed documentation will make users aware of this shortcoming.

One other small semantic point, which you see a lot, is that this package provides access to ‘Ensembl Biomart’, rather than accessing Biomart services in general. There are in fact many other databases that use Biomart as a way to query their data e.g. SalmonDB or Pancreas Expression Database, which this package currently doesn’t provide access to. This is particularly true since the centralised access via Biomart.org no longer exists, and it might be worth clarifying this in the README and Introductory vignette.

Response: Many thanks for pointing this out. I have now made this point clear in the introductory vignette.

Build/Install

The author might be interested to know that you can use biocLite() to install from CRAN and GitHub as well as Bioconductor, and it will sort out the dependencies from the various repos. So the installation of the CRAN version can be simplified.

Response: This is a great suggestion, and I now use the biocLite() approach in the README.
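For reference, a minimal sketch of the simplified installation (assuming the usual Bioconductor installer script; the GitHub slug is taken from the issue-tracker link below):

```r
# source the Bioconductor installer script (provides biocLite())
source("https://bioconductor.org/biocLite.R")

# biocLite() installs from Bioconductor, CRAN, and GitHub alike,
# resolving dependencies across all three repositories
biocLite("biomartr")        # CRAN release
biocLite("HajkD/biomartr")  # development version from GitHub
```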

One other thing to note is that the importing of functions from biomaRt and Biostrings is never defined explicitly. They are mentioned in the installation instructions and their functions are accessed using the double colon operator, e.g. Biostrings::readBStringSet(). However, they are not declared in the NAMESPACE. Including the relevant functions as imports would be desirable. Otherwise, checking for their existence with requireNamespace() and printing a helpful message if they aren't present may also work.

Response: I fixed this now and explicitly imported biomaRt and Biostrings in the NAMESPACE.
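As an illustration of the fix, imports can be declared with roxygen2 tags so that they are written into the NAMESPACE when the documentation is rebuilt (read_seqs() is a hypothetical example function, not part of biomartr):

```r
#' Read a set of biological sequences from a file
#'
#' @param file path to a FASTA file.
#' @importFrom Biostrings readBStringSet
#' @noRd
read_seqs <- function(file) {
    Biostrings::readBStringSet(filepath = file)
}
```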

Most functions have examples, and those that I tested manually worked. However, they are all wrapped in \dontrun{} blocks, so it is difficult to run them in an automated fashion. This also means they aren't run when the package is built and checked, so there is a chance that some no longer execute successfully. I would certainly encourage the package maintainers to make more of the examples runnable, although I appreciate that they probably take a long time to execute given that almost all of them retrieve online data sets. Running the set of tests that accompany the package took over 30 minutes for me.

Response: Thank you so much for pointing this out, but to be honest, I am not sure how to run these download commands while following the CRAN policy that examples should terminate within 5 seconds. In my initial CRAN submission, I did not wrap examples in \dontrun{} and was asked by the CRAN maintainers to do so to comply with the 5 second rule. For this reason, I started using Travis to automatically check the functionality of all functions whenever I update the package. However, I am very happy to receive any suggestion on how to run download examples and still comply with the 5 second CRAN policy.
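One possible compromise (a sketch, subject to CRAN's discretion) is \donttest{} instead of \dontrun{}: such examples are skipped by default but can still be executed with R CMD check --run-donttest, e.g. on a continuous-integration service. A documentation fragment for getGenome() could then look like this (argument values are illustrative):

```r
#' Retrieve a genome assembly for a given organism
#'
#' @examples
#' \donttest{
#' # long-running download example; skipped by default,
#' # but executed with: R CMD check --run-donttest
#' getGenome(db = "refseq", organism = "Saccharomyces cerevisiae")
#' }
```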

When running devtools::test() I get three failures; however, in each case when I manually ran the code I got the expected output rather than an error. I haven't had a chance to explore this further, but I have included the failure messages below for completeness.

Response: I also explored this peculiarity and fixed the unit tests.

Documentation and vignettes

The vignettes are generally well written and cover a variety of use cases. The vignette naming is also based around these use cases, which I guess makes sense if you approach the package with a particular research domain in mind, but is not so helpful if you are looking for how to access a specific service. For example, if I know I want to query Ensembl BioMart (which doesn't seem unreasonable given the package name), it doesn't feel intuitive to me that I have to look in 'Evolutionary Transcriptomics' to find an example of accessing that particular resource. This would be somewhat mitigated by the suggestion to include services in the function names, so it is at least easier to find the appropriate manual page.

Response: I agree and now renamed the vignette to BioMart Examples and adapted the content of this vignette.

I would highly recommend evaluating more of the vignette examples when the R Markdown documents are rendered. Over time the contents of the databases can change, and having static output in the vignettes can lead to confusion among users when what they see after running a code block no longer matches what they see in the documentation. For example, running the command listDatabases(db = "human") produces a list of files that differs from what is contained in the Database_Retrieval vignette. You can also end up in a situation where examples in a vignette no longer work at all, although having a decent set of tests in the package should reduce the chances of this happening. This is something where a repository like Bioconductor, which builds all of its packages regularly on multiple platforms, can be very helpful, since you are notified if a vignette no longer executes.

Response: I agree, and now vignette examples are evaluated when the Rmarkdown documents are rendered.
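For reference, evaluation is controlled per chunk in the R Markdown sources; the listDatabases() example now looks roughly like this:

````markdown
```{r, eval = TRUE}
# evaluated when the vignette is rendered, so the printed output
# always reflects the current state of the remote database
listDatabases(db = "human")
```
````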

Function/variable naming & general syntax

In general the code is pretty easy to read, with sensible variable names and neatly structured code. However, the function naming within the package seems to be inconsistent. There are examples of camelCaps, snake_case and period.separated. Personally I would avoid using period.separated since there is the potential to conflict with the S3 method dispatch system, but you can find at least one style guide to support each. However, it would be nice to be consistent within the package.

Response: Thanks for pointing this out to me. In the next versions of biomartr, I will consistently use the camelCaps style for new functions.

I also think from a user view point that some function names could perhaps be more explicit about the resource they are accessing. e.g. listDatabases() only accesses the NCBI, but all of the services this package can connect to are ultimately databases. Similarly getDatasets() is BioMart specific. This is explained in the manual pages, but it might make it more intuitive to users to have the service in the function name so you can more easily pick out the functions to use together. I feel this is particularly true when you consider functions like getGenome() actually do have the functionality to access multiple resources. This is addressed a bit by grouping related functions together in the README.

Response: I renamed the listDatabases() function to listNCBIDatabases() and will deprecate listDatabases() in the next version of biomartr. Thanks to the added documentation of the Ensembl BioMart query notation, biomartr::getDatasets() is now more clearly defined.

Console messages

There are quite a few instances of cat() and print() scattered throughout the code to write messages to the user, e.g. in getGenome.R, getGFF.R etc. I would recommend changing these to message() so they can be suppressed if desired. The cat() usage is often in conjunction with sink() and setwd(). I am more in favour of writeLines() over sink() & cat(), but that's personal preference rather than anything concrete. However, I would avoid using setwd() if possible, since this changes things in the global environment, and if your function fails for any reason the working directory has now changed for the user. I would either pass the path to the file-writing function, or, if you really want to change directory, use on.exit(setwd("my/original/dir")) to ensure it is changed back however the current function exits.

Response: I agree and now consistently use message(). I also use file.path() instead of changing the working directory.
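For the record, the two patterns the reviewer suggests look roughly like this (save_file() and its setwd() variant are hypothetical helpers, not functions from the package):

```r
save_file <- function(lines, dir) {
    # preferred: build the full output path instead of changing directory
    out <- file.path(dir, "log.txt")
    writeLines(lines, con = out)
    message("Wrote ", out)
}

save_file_setwd <- function(lines, dir) {
    # if changing directory is unavoidable, restore the caller's
    # working directory however the function exits (error or success)
    old_wd <- getwd()
    on.exit(setwd(old_wd))
    setwd(dir)
    writeLines(lines, con = "log.txt")
}
```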

Is there code duplication in the package that should be reduced?

There are a few instances where variables such as file names are created multiple times throughout the same function. For example, file.path(tempdir(), "_ncbi_downloads", "listDatabases.txt") appears six times in listDatabases.R. It's only a very minor detail, but having it defined only once may save hassle in the future if you ever decide to change that particular file name.

Response: I now store file.path(tempdir(), "_ncbi_downloads", "listDatabases.txt") in the variable db.file.name.

getGenome() has a degree of code replication between the paths followed when db = "ensemblgenomes" and db = "ensembl". I haven't thought about it too much, but I wonder if this function could be modularised with two sub-functions, one of which handles refseq and genbank queries, and one which handles the two ensembl queries? Similarly, there is a lot of overlap between getENSEMBL.Seq.R and getENSEMBLGENOMES.Seq.R. I wonder if these could be combined, and the requested host used to select the few lines that are different?

Response: Due to differences in the internal folder structure of the ENSEMBL and ENSEMBLGENOMES servers, I had to implement special cases. This code redundancy is therefore intentional, to keep a better overview for code maintenance.

Response to reviewer 2 (@naupaka)

Review Comments

I think it would be nice to clarify in the README and elsewhere that the queries work, for the most part (as far as I understand it), against NCBI or ENSEMBL, and not against any of the many other sources that have a BioMart interface (as @grimbough pointed out). For example, I am interested in getting data from the BioMart interface to JGI's Phytozome, since they have a 3.0 version of the Populus trichocarpa genome (the NCBI version is still at 2.0), but that is not possible, I believe, with this package (feature request).

Response: I agree, and I changed it according to @grimbough's point. The JGI Phytozome feature request is filed (https://github.com/HajkD/biomartr/issues/9), and since I work with plant genomes myself I will try my best to provide some interface functions :D

Another thing that I didn't see, but I think would be useful, would be to modify listGenomes() to return a data frame where the first column is the species' Latin binomial and the second column is the database each species' genome is associated with. Otherwise, it's tricky to figure out what to set for the db = parameter when using getGenome().

Response: This is a functionality extension request. I am happy to implement it for the next biomartr version and opened up an extension request.

README

No Travis badge in README.

Response: Travis badge is now included in README.
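For reference, the badge line added to the README (assuming builds run against the master branch of HajkD/biomartr) is:

```markdown
[![Travis-CI Build Status](https://travis-ci.org/HajkD/biomartr.svg?branch=master)](https://travis-ci.org/HajkD/biomartr)
```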

While all exported functions have roxygen-generated documentation, some internal functions have no comments or documentation, e.g. connected.to.internet.R, exists.ftp.file.R, or get.emsembl.info.R. You could add some short roxygen comments and add #' @noRd as recommended in the rOpenSci package guidelines, or just have some non-roxygen style comments in there in case others would like to better understand what's going on.

Response: Many thanks for pointing this out to me. I now commented all internal functions and used #' @noRd to mark that it is an internal function.
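As an example, the internal connectivity check is now documented like this (the function body shown here is a simplified sketch, not the exact implementation):

```r
#' Test whether the machine has a working internet connection
#'
#' Called internally before any download is attempted.
#' @return TRUE if a connection could be established, FALSE otherwise.
#' @noRd
connected.to.internet <- function() {
    probe <- try(suppressWarnings(
        readLines("https://www.ncbi.nlm.nih.gov", n = 1L)
    ), silent = TRUE)
    !inherits(probe, "try-error")
}
```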

I also agree with @grimbough that it would be ideal if at least some of the example code in the documentation could actually be run. The output from devtools::run_examples() suggests that there isn't any code in any of the examples for any of the functions that is actually run. Obviously, it wouldn't make sense to run download.database.all(name = "nr", path = "nr") because that could take days, but surely there are some queries that would be fast enough to run in a reasonable timeframe?

Response: I agree, but please see my response to @grimbough concerning the 5s CRAN policy.

Code

I was a bit confused by which functions did what, in terms of which database(s) they query, and where exactly datasets were coming from. For each download, this information is all very helpfully documented in the doc_*.txt files in the _ncbi_downloads directory, but I felt like it was a bit of a hunt to figure out what had happened. Maybe this is just because I am still not 100% clear on which databases this package interfaces with. This is related to my earlier comment about being confused about how to figure out which db to get a genome from if you don’t know ahead of time.

Response: As stated above, I now clearly list the databases biomartr interfaces with in the vignettes, and I also changed the function name from listDatabases() to listNCBIDatabases(). I hope that it is less confusing now.

rOpenSci's packaging guide suggests the use of message() and/or warning() instead of print() (e.g. lines 144 and 291 of getProteome.R).

Response: I now consistently use message() and/or warning() instead of print() or cat().

Tests

A number of the tests fail for me when run with either R CMD CHECK or devtools::check(). Not sure if it is appropriate to put all that output here, but I figured maybe it would be helpful for diagnosing the problem. I think these errors are related to the occasional ‘weird server error’ messages I was getting.

Response: I fixed the corrupt unit tests, and now devtools::check(document = FALSE) does not throw errors anymore.

Looks like there is a non-ASCII character in the copied tibble output in one of the verbatim blocks in the Functional Annotation vignette (non-executing).

Response: Thank you so much for detecting this non-ASCII character. It is fixed now.

Need line break before line 33 in Introduction.Rmd, or markdown header doesn’t render properly.

Response: I included the line break.

Repeated “No encoding supplied” warnings

Response: The encoding is fixed now.

Sometimes when I would try to run a command, I’d get errors like this. I was connected to the Internet at the time, and similar commands run before and afterwards worked fine. Sometimes the same command, run a second time, worked fine even though it failed at first. Maybe something about the timeout before waiting for a server response could be tweaked to reduce the likelihood of premature failure?

Response: Thank you so much for pointing this server timeout issue out to me. I now implemented a customized download function which uses the best download tool for the underlying operating system. I hope that this issue is now resolved.

===============

Dear @sckott,

I addressed all comments and issues and incorporated all changes into the new version of the biomartr package.

I hope that I could address all points sufficiently so that you can consider biomartr to become a part of the rOpenSci package collection.

Please find my detailed response below.

Thank you so much for all your support.

Kind regards,

Hajk

P.S. I just want to make you aware that the Ensembl services are currently not available. Thus, until the Ensembl servers are up again, some of the biomartr functions will not run properly.

For running egs, I suggest MAYBE running egs wrapped in dontrun on Travis by adding --run-dontrun to your check args (since taking a longish time on Travis I assume is not a big deal) - I'd have one or two examples for each function not wrapped in dontrun when you know they will run quickly. Note: I said MAYBE since check already takes a very long time, but it would still be nice to make sure egs always work.

Response: I tried to solve the running time issue by choosing example test cases that use very small files. It should be better now. Please also see my response to the test coverage and R CMD check.
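For completeness, running \dontrun{} examples on Travis can be configured roughly as follows (a sketch using Travis's community R support; key names may vary between Travis versions):

```yaml
language: r
r: release
cache: packages
# also execute examples wrapped in \dontrun{} during R CMD check
r_check_args: "--run-dontrun --timings"
```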

output of goodpractice::gp() was pointed out and there are some things in there which I think you should address:

- sapply usage: replace with vapply/lapply
- it's a bit of work, but I highly suggest tidying code and documentation to 80 character width so you don't get text running off the page in the CRAN manual (e.g., https://cran.rstudio.com/web/packages/biomartr/biomartr.pdf), and it's easier to see in a narrow-width editor setup
- replace usage of 1:length(...), 1:nrow(...), 1:ncol(...), 1:NROW(...) and 1:NCOL(...) with e.g. seq_along()/seq_len()

Response: I agree. I have now replaced all sapply() commands with lapply(), reformatted all code and documentation to 80 character width, and replaced usage of 1:length(...), 1:nrow(...), 1:ncol(...), 1:NROW(...) and 1:NCOL(...) with seq_len().
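A minimal illustration of both replacements (generic example data, not code from the package):

```r
x <- list(a = 1:3, b = 4:6)

# seq_along(x) is empty when x is empty, whereas 1:length(x)
# would wrongly iterate over c(1, 0)
for (i in seq_along(x)) {
    message("element ", i, " has length ", length(x[[i]]))
}

# vapply() declares the expected return type, unlike sapply()
elem_lengths <- vapply(x, length, FUN.VALUE = integer(1))
```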

Add a package level manual file - so that users can do ?biomartr to get to the man pages easily

Response: A package level manual file is now included and as suggested can be accessed via ?biomartr.

looks like test coverage is about 47% - would be good to get that % up - not necessary now, but in the future.

Response: I now spent a significant amount of time on increasing the test coverage and will continue to improve it in the future.

R CMD check takes a long time; just for ease of quick turn-around, it'd be wise to make that time much shorter - I assume most of the time is in tests, but I haven't looked into how to speed that up.

Response: Most tests cover the functionality of data retrieval functions, so to reduce R CMD check time I now use small file downloads for unit testing.

also related to tests, but not about speed: it looks like many of your test files aren't doing much testing using testthat - that is, you run examples but then don't test their output (e.g., https://github.com/HajkD/biomartr/blob/master/tests/testthat/test-is.genome.available.R#L12-L36). I imagine the majority of time in tests right now is spent doing data requests - adding test expectations shouldn't add a lot more time

Response: I absolutely agree and see where your confusion comes from. I have now commented and specified the unit tests more clearly, so that there is no confusion about which part of each function is actually tested.
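For example, a retrieval call in the test suite can now be followed by explicit expectations on its output (a sketch; the organism is chosen arbitrarily):

```r
library(testthat)

test_that("is.genome.available() returns a logical scalar", {
    result <- is.genome.available(organism = "Saccharomyces cerevisiae")
    # assert on the output rather than only running the call
    expect_type(result, "logical")
    expect_length(result, 1L)
})
```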