Recommendation

Best practices for ecological analysis are required to act on concrete challenges

Based on reviews by Nick Isaac and 1 anonymous reviewer
A recommendation of:

Guidance framework to apply best practices in ecological data analysis: Lessons learned from building Galaxy-Ecology

Submission: posted 12 April 2024, validated 12 April 2024
Recommendation: posted 01 October 2024, validated 07 October 2024
Cite this recommendation as:
Poisot, T. (2024) Best practices for ecological analysis are required to act on concrete challenges. Peer Community in Ecology, 100694. https://doi.org/10.24072/pci.ecology.100694

Recommendation

A core challenge facing ecologists is to work through an ever-increasing amount of data. The accelerating decline in biodiversity worldwide, mounting pressure of anthropogenic impacts, and increasing demand for actionable indicators to guide effective policy mean that monitoring will only intensify, and rely on tools that can generate even more information (Gonzalez et al., 2023). How, then, do we handle this new volume and diversity of data?

This is the question Royaux et al. (2024) are tackling with their contribution. By introducing both a conceptual ("How should we think about our work?") and an operational ("Here is a tool to do our work with") framework, they establish a series of best practices for the analysis of ecological data.

It is easy to think about best practices in ecological data analysis in its most proximal form: is it good statistical practice? Is the experimental design correct? These have formed the basis of many recommendations over the years (see e.g. Popovic et al., 2024, for a recent example). But the contribution of Royaux et al. focuses on a different part of the analysis pipeline: the computer science (and software engineering) aspect of it.

As data grows in volume and complexity, the code needed to handle it follows the same trend. It is not a surprise, therefore, to see that the demand for programming skills in ecologists has doubled recently (Feng et al., 2020), prompting calls to make computational literacy a core component of undergraduate education (Farrell & Carey, 2018). But beyond training, an obvious way to make computational analysis of ecological data more reliable and effective is to build better tools. This is precisely what Royaux et al. have achieved.

They illustrate their approach through their experience building Galaxy-Ecology, a computing environment for ecological analysis: by introducing a clear taxonomy of computing concepts (data exploration, pre-processing, analysis, representation), with a hierarchy between them (formatting, data correction, anonymization), they show that we can think about the pipeline going from data to results in a way that is more systematized, and therefore more amenable to generalization.

We may balk at the idea of yet another ontology, or yet another framework, for our work, but I am convinced that the work of Royaux et al. is precisely what our field needs. Because their levels of atomization (their term for the splitting of complex pipelines into small, single-purpose tasks) are easy to understand, and map naturally onto tasks that we already perform, the framework is likely to see wide adoption. Solving the big, existential challenges of monitoring and managing biodiversity at the global scale requires the adoption of good practices, and a tool like Galaxy-Ecology goes a long way towards this goal.
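To make the idea concrete, here is a minimal sketch of what atomization looks like in practice (in Python; the function names and steps are illustrative, not part of Galaxy-Ecology):

```python
# A monolithic script might parse data, clean it, and summarise it in one
# block. An atomised version gives each single-purpose step its own small
# function, so steps can be reused, tested, and swapped independently.

def parse_counts(raw):
    """Import step: parse raw text with one species count per line."""
    return [int(line) for line in raw.splitlines() if line.strip()]

def drop_negative(counts):
    """Pre-processing step: remove impossible (negative) counts."""
    return [c for c in counts if c >= 0]

def mean_count(counts):
    """Analysis step: a single summary statistic."""
    return sum(counts) / len(counts)

def pipeline(raw):
    """The full workflow is then just a composition of atoms."""
    return mean_count(drop_negative(parse_counts(raw)))
```

Each atom can be documented, reviewed, and replaced on its own, which is exactly what makes the approach generalize beyond the dataset it was first written for.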

References

Farrell, K.J., and Carey, C.C. (2018). Power, pitfalls, and potential for integrating computational literacy into undergraduate ecology courses. Ecol. Evol. 8, 7744-7751.
https://doi.org/10.1002/ece3.4363

Feng, X., Qiao, H., and Enquist, B. (2020). Doubling demands in programming skills call for ecoinformatics education. Frontiers in Ecology and the Environment 18, 123-124.
https://doi.org/10.1002/fee.2179
 
Gonzalez, A., Vihervaara, P., Balvanera, P., Bates, A.E., Bayraktarov, E., Bellingham, P.J., Bruder, A., Campbell, J., Catchen, M.D., Cavender-Bares, J., et al. (2023). A global biodiversity observing system to unite monitoring and guide action. Nat. Ecol. Evol., 1-5. 
https://doi.org/10.1038/s41559-023-02171-0
 
Popovic, G., Mason, T.J., Drobniak, S.M., Marques, T.A., Potts, J., Joo, R., Altwegg, R., Burns, C.C.I., McCarthy, M.A., Johnston, A., et al. (2024). Four principles for improved statistical ecology. Methods Ecol. Evol. 15, 266-281.
https://doi.org/10.1111/2041-210X.14270
 
Coline Royaux, Jean-Baptiste Mihoub, Marie Jossé, Dominique Pelletier, Olivier Norvez, Yves Reecht, Anne Fouilloux, Helena Rasche, Saskia Hiltemann, Bérénice Batut, Marc Eléaume, Pauline Seguineau, Guillaume Massé, Alan Amossé, Claire Bissery, Romain Lorrilliere, Alexis Martin, Yves Bas, Thimothée Virgoulay, Valentin Chambon, Elie Arnaud, Elisa Michon, Clara Urfer, Eloïse Trigodet, Marie Delannoy, Gregoire Loïs, Romain Julliard, Björn Grüning, Yvan Le Bras (2024) Guidance framework to apply best practices in ecological data analysis: Lessons learned from building Galaxy-Ecology. EcoEvoRxiv, ver.3 peer-reviewed and recommended by PCI Ecology. 
https://doi.org/10.32942/X2G033

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.
Funding:
Funding was provided by the European Union through the Erasmus+ Gallantries project; the Agence Nationale de la Recherche through the 65 Million d’Observateurs and the IA-Biodiv projects; the French National Fund for Open Science through the OpenMetaPaper project; the European Commission through the H2020 EOSC-Pillar and GAPARS projects, and the Horizon Europe FAIR EASE project; the GO FAIR initiative through the BiodiFAIRse Implementation Network; the Blue Nature Alliance; and the Antarctic and Southern Ocean Coalition. Finally, funding was provided by the French Ministry of Higher Education and Research for the “Pôle national de données de biodiversité” e-infrastructure.

Evaluation round #2

DOI or URL of the preprint: https://doi.org/10.32942/X2G033

Version of the preprint: 2

Author's Reply, 05 Sep 2024

Decision by the recommender, posted 12 Aug 2024, validated 12 Aug 2024

Both reviewers are satisfied with the improvements, and are only recommending (very) minor wording / clarification changes. I will not send this preprint for review again when they are done.

Reviewed by Nick Isaac, 01 Jul 2024

This revised ms is much clearer and I am pleased to see that many suggestions from the first round of review have been adopted. I would be happy to see this work published: in addition to providing an advertisement for the Galaxy platform, it makes a number of really good points about computational practices in ecological research. Most of my recommended changes (below) are aimed at clarifying the message and simplifying the language:

Some of the new text appears to have been rather hastily inserted and could be improved. For example, on line 162: “propositions have been delimited within several thematic communities in ecology to evaluate and enhance best practices application, notably the SDM communities” should be simplified to “Individual subdisciplines have taken initiative to assert community best practices, e.g. in SDMs” (or similar)

Line 138: “although data accessibility”: this paragraph is not just a single sentence. Please combine with previous or next para.

Line 181: replace “the expertise and research” with “the scientific process”

Line 182: replace “FAIR research data” with “the adoption of FAIR practices”

Line 244: the heading “frameworks towards best practice” is a bit confusing. Perhaps replace with “Principles for best practice”

Line 247: would benefit from a clearer definition of atomisation, perhaps with an example. Something like: “In a maximally-atomised workflow, each small step would be conducted by its own bespoke function”

Line 195-7: replace “mechanically reduces the number of potential users … fragilizing …. ” with “creates barriers to uptake and challenges for peer-review.”

Line 220: replace “single” with “distinct”. Also insert “each” between “steps” and “constituting”.

Line 224-6: Around here it would be appropriate to have a sentence explaining how/why atomisation is part of the solution. (I realise much of this is covered in the later section from line 245, so overlap between sections should be minimised). My own recent experience of working on a colleague’s code has been that atomisation makes it much easier to alter the code, to make it do something different from what the original coder intended. Altering how atomised code is used is, of course, the first step towards generalisation. However, as written the two steps appear to be quite distinct.

Line 227: replace “this framework …. The Galaxy-Ecology” with “Atomisation and Generalisability are central organising principles in the design of the Galaxy-Ecology”

Line 229: I don’t think that “sharing and processing data” does justice to what Galaxy is aiming for. How about “analysing data and sharing outputs”.

Line 284: again, an example would help to illustrate this point more clearly: “This means trying to avoid hard-coding anything that is specific to the structure of the original dataset (e.g. number of years).”
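The reviewer’s suggestion can be made concrete with a short, hypothetical sketch (the function names and years are my own, not from the manuscript):

```python
# Hard-coded to the original dataset: silently wrong for any other study.
def yearly_means_hardcoded(counts_by_year):
    return {y: sum(counts_by_year[y]) / len(counts_by_year[y])
            for y in range(2010, 2020)}  # assumes exactly these 10 years

# Generalised: the study period is derived from the data itself, so the
# same function works unchanged for any number of years.
def yearly_means(counts_by_year):
    """Mean count per year; counts_by_year maps year -> list of counts."""
    return {y: sum(c) / len(c) for y, c in counts_by_year.items()}
```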

Line 290 the word “step” appears twice in this sentence. Replace the second with “element” or similar word.

Line 301: the section title is quite clunky: how about “Practical steps towards atomised and generalised coding”.

Line 319: the figure legend needs more detail here. Explain that the different colours refer to different scripts/categories (1-3) and that the boxes refer to functions/scripts.

Line 321: re “code-writing habits”: I think it would be useful to make some kind of comment about how ecologists learn to code. Most of them learn by analysing their own data, and/or from a statistical ecology module that focusses on specific applications, in which the statistics and the ecology are given primacy over the computation. Few are taught formal programming skills in the way that computer science students would be. This means that most of us have generated a large number of “bad habits”!

Line 345: can you explain why someone should feel less embarrassed or fearful about sharing code if it has been atomised and generalised? Is it because they will feel confident about having followed best practice (I doubt it) or because they will feel confident that someone else will be able to actually read and implement it (more likely)?

Table 2 legend: would be more informative to write as “illustration of how Galaxy implements and conforms to best practice”. The second sentence about “limitations …” makes no sense to me. Perhaps it is a reference to the fact that Galaxy is a work-in-progress and that the table is a snapshot of current status.

Line 444: can you elaborate on the figure 3 legend to relate it back to the concepts of atomisation and generalisability? In particular, is it appropriate to describe the named items with checkboxes as atoms in the workflow?

Line 459: this is very unclear. Does “eventually” mean “when the user becomes expert” or “there is an aspiration for Galaxy to have this new functionality”. If the former then perhaps replace with something like “In addition to using existing tools, users may develop and upload entirely new tools to the Galaxy server”.

Line 462: “utterly” is superfluous

Line 464: “notably” is superfluous

Line 515: there is perhaps another level in this hierarchy: I have authored papers that were fully executable at the time of publication. However, we did not use Docker or other tools to account for changes in the underlying software, so the code no longer works and the work is therefore not reproducible. Distinguishing between “reproducible now vs reproducible forever” might be helpful.
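One lightweight way to make the “reproducible now vs reproducible forever” distinction actionable is to archive a snapshot of the exact software environment alongside the code; a minimal Python sketch (the helper name is my own, not an existing tool):

```python
import sys
import importlib.metadata

def snapshot_environment(packages):
    """Record interpreter and package versions alongside archived code,
    so that an analysis which is reproducible now can at least be re-run
    in a matching environment (e.g. rebuilt in a container) later."""
    env = {"python": sys.version.split()[0]}
    for pkg in packages:
        try:
            env[pkg] = importlib.metadata.version(pkg)
        except importlib.metadata.PackageNotFoundError:
            env[pkg] = None  # not installed when the snapshot was taken
    return env
```

Saving such a snapshot with a publication does not make code run forever, but it documents exactly what “now” meant, which is the prerequisite for rebuilding the environment later.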

Line 531: “additionally”

Line 550: replace “correctly” with “appropriately”

Line 552: “heavier” is not clear. I get that the “heaviness” refers to the amount of time investment required to realise the advantages of using Galaxy (but this could be clearer – please use a different word). What I don’t understand is the comparator: is it heavier for experienced than non-experienced users, heavier to learn Galaxy than the principles of atomisation and generalisation, and are you referring to absolute or relative terms (i.e. cost vs cost:benefit ratio)?


Reviewed by anonymous reviewer 1, 09 Aug 2024

This draft is greatly improved in structure from the previous version and I found it significantly easier to follow. The presentation of reproducibility as the goal and the Galaxy workflow as a solution is strong and clear, and generally the message is more concise. The section describing Galaxy-E and the discussion however have some issues with redundancy of some topics, and the absence of other topics, that make them harder to follow.

The “Entering a new dimension” section lists many of the ways Galaxy meets the criteria for reproducibility, or the benefits it might bring, but it doesn’t describe what it *is*. That piece is critical for this type of introductory paper – where does it live? How does the user engage? What are the key pieces? This part might include existing parts of the section such as the description of the community, or how a user uploads data, but should be more comprehensive and systematic. I think in the previous version many of those ideas lived in the “methods” section, which has rightly been moved to a different venue, but a paragraph or two of description is still necessary. Starting with a clear description also gives grounding for the platform’s benefits, as the reader already has clear evidence of how the tool might achieve those things. One possible structure for the “Entering a new dimension” section would be:

  • What Galaxy-E is (how users engage with it)
  • How Galaxy-E follows the atomisation/generalisation and reproducibility/fairness philosophy
  • Other benefits
  • Examples of its success

I think improving the structure will also reduce some of the redundancy of ideas and language throughout the section. For example, the paragraph starting at line 403 is a repetition of the general philosophy of Galaxy-E that echoes a similar sentiment given many times throughout the paper; I don’t think it’s necessary here.

Similarly, the discussion repeats many of the basic concepts of reproducibility or atomisation/generalisation without linking them back to the platform. The first two paragraphs in particular could either be cut, or should be edited to be directly relevant to the platform and its strengths. The discussion could also benefit from a more detailed comparison to other existing platforms.  

Detailed feedback:

Line 227-243: This description feels out of place here, as it’s followed by a more detailed description of the atomisation/generalisation framework rather than the Galaxy-E approach; I’d move it to the “Entering a new dimension” section

Line 342: How or why might that be true?

Line 346–360: This feels redundant, it could be integrated more concisely into the earlier paragraphs.

Line 376: This feels redundant after the previous paragraph

Line 501: I don’t think this bullet list is appropriate for the discussion; maybe a summarized version in the reproducibility section, but as it stands its relevance to the Galaxy-E tool is not clear.

Line 527-535: These things are all true, but what are their relevance for Galaxy-E? How does it help a user achieve this?

Line 530: I don’t think arborescence is the right word here, are you trying to describe the relationships of analysis pieces to one another?



Evaluation round #1

DOI or URL of the preprint: https://doi.org/10.32942/X2G033

Version of the preprint: 1

Author's Reply, 20 Jun 2024

Decision by the recommender, posted 07 Jun 2024, validated 07 Jun 2024

Both reviewers see the value of the work (and I agree), but make important comments that would improve the readability of the manuscript. I think the separation between philosophy, implementation/architecture, and anticipated use-cases needs to be clarified. This will help readers navigate a very dense manuscript, especially readers with different types of background/expertise.

Reviewed by Nick Isaac, 23 May 2024

This is a paper in three parts. The first part discusses issues around analytical practices in ecology, and the principles by which these can become more reproducible. The second part is a description of the Galaxy Ecology platform, and its potential to realise better practice among the community. The third part describes the technical details of implementation.

The visual abstract is really great, presenting a clear, cohesive message about how atomisation and generalisation can improve reproducibility and FAIR principles. However, the main text does not reflect this clarity of message.

Overall there is a lot of good material in here, but the article does not feel mature in how it is structured and presented. I do not get a clear idea who the article is aimed at, nor who is the target user of Galaxy Ecology. It’s hard to see how the present version would lead to material change in how ecologists go about their work. Another challenge is that much of the material about Galaxy Ecology is likely to become quickly outdated. For these reasons, my overall suggestion would be to greatly reduce the amount of text in the section on Galaxy and in the Methods. Rewrite these sections in a way that uses Galaxy to illustrate the general principles in the first part of the manuscript.

The paper needs to make more comparisons to current workflow platforms. The examples provided are not current: Taverna retired in 2020 and the latest version of Kepler (version 2.5) was released in 2015. More current alternatives include ecologically specific tools such as Bon-in-a-box (https://boninabox.geobon.org/) and more general tools such as the R package targets (https://books.ropensci.org/targets/).

Additionally, the methods section could be better as online supplementary materials (or a Zenodo repository), or simply part of the Galaxy user guide. It is not relevant to the core message of the manuscript. As the Galaxy-E platform develops, the user guide may change. Hosting this content online rather than in the manuscript allows that updating to ensure parity with the Galaxy-E platform.

The introduction needs to be more concise: a lot of text is used to set up the wider concerns about reproducibility in ecology, which is important, but the main contribution is not to review ecological reproducibility but to provide the solution, and at present it takes too long to get to that solution. Large parts of the text are not necessary for the delivery of the core message, e.g. lines 106-131, 151-185, 211-229. In the detailed comments below, I’ve made a few suggestions for how the introductory sections could help to contextualise the issues being discussed. For the section on Galaxy, I think the authors need to give a much clearer exposition of what it is, who it is for and how it can help to deliver the principles outlined in the first section. At present, this text assumes too much knowledge of the system for a naïve reader to properly engage. The Methods section is even more difficult to follow: it’s half-way between a user manual and a conceptual overview but doesn’t quite deliver on either of those goals. Perhaps it would help if these issues were illustrated via the use of one or more case studies.

Detailed comments:

·         Abstract Line 75: perhaps add a few words to explain that the principles described here are applicable across all levels of ecological analysis, e.g. “from individual research projects to production-level analytical pipelines”

·         Abstract line 77: Perhaps explain why “atomisation” is the right word. I thought this was a typo and that the authors meant to write “automation” instead. The authors need to provide a rationale for using the term ‘atomisation’ to describe the process of making a script more modular, as there is no previous usage of this term in the software development literature, whereas modular programming is a widely used and understood term (https://en.wikipedia.org/wiki/Modular_programming). If the authors want to use the term ‘atomisation’ it must be made clear it is a new, non-standard term introduced in this manuscript that may not be widely understood across other sectors.

·         Introduction paragraph 2: a useful citation here would be Cassey & Blackburn, who distinguished between “reproducibility” and “repeatability”, and discussed the relative merits of each. Also, with reference to later discussion of reproducibility around lines 205, it would be good to acknowledge that reproducibility is not an absolute but rather a relative concept (who needs the results to be identical to the 9th significant figure?).

·         Lines 117-119: complicated sentence. I suggest to simplify to “Given the increasing complexity of ecological analyses, there is a clear need for tools that facilitate greater reproducibility.”

·         The same paragraph (lines 119-130, but also the next section, starting at line 140) would really benefit from a bit more context for what the problem is and why atomised workflows are needed. My perspective is that ecology has, until now, been a discipline in which most analysis happened on a single computer, but increasingly we are seeing papers derived from big collaborations involving code developed in different labs. This means we are moving into an era where analytical pipelines are becoming so complex that no individual researcher can understand all the details at a granular level. Other disciplines (e.g. meteorology, particle physics) have already passed through this phase. There is plenty of literature that could help illustrate these issues and clarify why this paper is novel. First, it would be good to include a citation to support the assertion that analyses are becoming more complex. One option would be to find some data on the average length of supplementary information on ecological papers (I know of no such data!). An alternative would be to cite papers that describe highly complex workflows, e.g. Boyd et al (2023) or Jetz et al (2019). Second, it might be worth acknowledging that individual branches of ecology have developed principles to enhance reproducibility within those sub-domains. The SDM community is perhaps the best example: citing papers such as Araujo et al (2019) and Golding et al (2017) might provide a way for the authors to explain what is different about the proposals in this paper.

·         Line 156: try to avoid directly quoting from another article

·         Line 170: I would question whether long-term public archiving of code is as valuable as the authors assert. The most popular coding language among ecologists, R, is in a continuous state of evolution. Most R code written 10 years ago would not execute today. I’m not saying that we should not archive code: I just think it is important to be clear about what we are trying to achieve as a community and make decisions about where to invest resources accordingly.

·         Line 195: in the previous paragraph you made the case that code should be considered as data. So, for clarity, insert the word “observational” before “data” in this sentence.

·         Line 298: “Atomisation refers to dividing …”

·         Line 298-366: this text on atomisation and generalisation is absolutely fine, but I can’t help thinking that these must be fundamental principles of computer science. If so, it may be worth mentioning here, perhaps with a citation.

·         Line 369: “et” -> “at”

·         Line 386: missing word “A” at beginning of sentence

·         Line 412: at the beginning of this section, it would be useful to explain who is the target user. Are you recommending that everyone in ecology use it for all of their analysis? Or is it better suited to large collaborative projects?

·         Line 435: “tools” -> “tool”

 

REFERENCES

Araujo et al (2019) Standards for distribution models in biodiversity assessments. Science Advances https://www.science.org/doi/full/10.1126/sciadv.aat4858

Boyd et al (2023) Biological Reviews https://onlinelibrary.wiley.com/doi/full/10.1111/brv.12961

Cassey & Blackburn (2006) Reproducibility and Repeatability in Ecology. BioScience https://academic.oup.com/bioscience/article/56/12/958/221622

Golding et al (2017) the Zoon package … Methods in Ecology and Evolution. http://doi.wiley.com/10.1111/2041-210X.12858

Jetz et al (2019) Nature Ecology & Evolution https://doi.org/10.1038/s41559-019-0826-1


Reviewed by anonymous reviewer 1, 09 May 2024

This manuscript introduces and describes the Galaxy-Ecology tool, laying out different modes of engagement and the ways the approach addresses reproducibility issues in ecology. The Galaxy-Ecology project and associated community seem like a powerful framework to build and share ecological analyses and this paper includes all the essential introductory pieces for a new user. I particularly appreciate the way the paper discusses different types of users and the value it adds for each. I do however think there is a fair bit of related but ultimately superfluous content that, if pared down, would greatly improve the readability and clarity of the paper.

While the initial discussion of reproducibility is clearly motivating for development of the tool, the current level of depth is unnecessary and even a little misleading for the reader. For example, Galaxy-Ecology isn’t mentioned until the fourth section of the introduction and is introduced in a way that makes the reader unsure if it’s just a nice example to illustrate the reproducibility point or the main message of the paper in and of itself. Given that a history of reproducibility in ecology is not the goal of the paper, I would recommend editing everything prior to “Framework towards good practices” down to a few introductory paragraphs that also immediately introduce Galaxy-Ecology as a solution. If I have misunderstood the purpose of the paper and the goal of the paper is to first give a detailed context for the reproducibility crisis and second discuss Galaxy-Ecology, that structure can be better set up in the abstract and first couple paragraphs of the paper.

In general I found the structure of the paper hard to follow, as technical details about engaging with the tool or development are interspersed with motivation and philosophy. One possible approach for addressing that confusion is to lay out the technical details in an initial description of the tool, then describe which of those pieces different kinds of users might engage with, rather than introducing new information in the user sections. In a related formatting issue, currently the three guidelines sections read as multi-paragraph lists which I found fairly jarring. Rather than referencing different steps of the workflow as a starting bullet point, I would reference steps in the body of the paragraph (for example as “(step A)”) to aid flow of the writing.

As a small linguistic note, I would suggest the phrase “good practices” be replaced with “best practices”, which is a more common way to reference standards and will be immediately recognizable to a reader.

A few line comments:

Line 461: Unnecessary reference; either remove the sentence or integrate it into a paragraph.

Line 560: This heading is unnecessary.

Line 632: The colon here is confusing to me; I can see from the workflow what you’re implying, but the list within a list is quite hard to follow in the writing. This would be much easier to express in a paragraph rather than list format.

Line 644: This is a good example of the kind of concept that should be introduced outside the user descriptions. Is the "Galaxy history" an internal versioning system? At what level is it tracking, just the kinds of modules that are used in the pipeline?


Title and abstract

Does the title clearly reflect the content of the article? [x] Yes, [ ] No (please explain), [ ] I don't know
Does the abstract present the main findings of the study? [x] Yes, [ ] No (please explain), [ ] I don’t know

Introduction

Are the research questions/hypotheses/predictions clearly presented? [ ] Yes, [x] No (please explain), [ ] I don’t know

- As discussed above, it is difficult for the reader to initially figure out that Galaxy-Ecology is the focus of the paper.
Does the introduction build on relevant research in the field? [x] Yes, [ ] No (please explain), [ ] I don’t know

Materials and methods

Are the methods and analyses sufficiently detailed to allow replication by other researchers? [x] Yes, [ ] No (please explain), [ ] I don’t know
Are the methods and statistical analyses appropriate and well described? [ ] Yes, [x] No (please explain), [ ] I don’t know

The paper is missing a detailed overview of the moving pieces of the tool.

Results

In the case of negative results, is there a statistical power analysis (or an adequate Bayesian analysis or equivalence testing)? [ ] Yes, [ ] No (please explain), [ ] I don’t know N/A
Are the results described and interpreted correctly? [x] Yes, [ ] No (please explain), [ ] I don’t know

Discussion

Have the authors appropriately emphasized the strengths and limitations of their study/theory/methods/argument? [ ] Yes, [x] No (please explain), [ ] I don’t know

There could be a little more comparison to other similar efforts to improve reproducibility and the limitations of what Galaxy-Ecology does. 


Are the conclusions adequately supported by the results (without overstating the implications of the findings)? [x] Yes, [ ] No (please explain), [ ] I don’t know
