Recommendation

A flexible pipeline combining clustering and correction tools for prokaryotic and eukaryotic metabarcoding

Stefaniya Kamenova based on reviews by Tiago Pereira and 1 anonymous reviewer

A recommendation of:

A flexible pipeline combining clustering and correction tools for prokaryotic and eukaryotic metabarcoding

Miriam I Brandt, Blandine Trouche, Laure Quintric, Patrick Wincker, Julie Poulain, Sophie Arnaud-Haond (2020), bioRxiv, 717355, ver. 3 recommended and peer-reviewed by Peer Community In Ecology https://doi.org/10.1101/717355

Abstract

Environmental metabarcoding is an increasingly popular tool for studying biodiversity in marine and terrestrial biomes. With sequencing costs decreasing, multiple-marker metabarcoding, spanning several branches of the tree of life, is becoming more accessible. However, bioinformatic approaches need to adjust to the diversity of taxonomic compartments targeted as well as to each barcode gene specificities. We built and tested a pipeline based on Illumina read correction with DADA2 allowing analysing metabarcoding data from prokaryotic (16S) and eukaryotic (18S, COI) life compartments. We implemented the option to cluster Amplicon Sequence Variants (ASVs) into Operational Taxonomic Units (OTUs) with swarm v2, a network-based clustering algorithm, and to further curate the ASVs/OTUs based on sequence similarity and co-occurrence rates using a recently developed algorithm, LULU. Finally, flexible taxonomic assignment was implemented *via* Ribosomal Database Project (RDP) Bayesian classifier and BLAST. We validate this pipeline with ribosomal and mitochondrial markers using eukaryotic mock communities and 42 deep-sea sediment samples. The results show that ASVs, reflecting genetic diversity, may not be appropriate for alpha diversity estimation of organisms fitting the biological species concept. The results underline the advantages of clustering and LULU-curation for producing more reliable metazoan biodiversity inventories, and show that LULU is an effective tool for filtering metazoan molecular clusters, although the minimum identity threshold applied to co-occurring OTUs has to be increased for 18S. The comparison of BLAST and the RDP Classifier underlined the potential of the latter to deliver very good assignments, but highlighted the need for a concerted effort to build comprehensive, ecosystem-specific, databases adapted to the studied communities.

Biodiversity, bioinformatics, environmental DNA, metabarcoding, mock communities

Submission: posted 02 August 2019
Recommendation: posted 30 January 2020, validated 05 February 2020

Cite this recommendation as:
Kamenova, S. (2020) A flexible pipeline combining clustering and correction tools for prokaryotic and eukaryotic metabarcoding. Peer Community in Ecology, 100043. https://doi.org/10.24072/pci.ecology.100043

Recommendation

High-throughput sequencing-based techniques such as DNA metabarcoding are increasingly advocated as providing numerous benefits over morphology-based identifications for biodiversity inventories and ecosystem biomonitoring [1]. These benefits are particularly apparent for highly diversified and/or poorly accessible aquatic and marine environments, where simple water or sediment samples can already produce acceptably accurate biodiversity estimates based on the environmental DNA present in the samples [2,3]. However, sequence-based characterization of biodiversity comes with its own challenges. A major one lies in the capacity to disentangle true biological diversity (be it taxonomic or genetic) from artefactual diversity generated by the accumulation of sequence errors during PCR and sequencing, or by the amplification of non-target genes (i.e. pseudogenes). On the one hand, stringent elimination of sequence variants might lead to biodiversity underestimation through the removal of true species or the clustering of closely related ones. On the other hand, more permissive sequence filtering bears the risk of biodiversity inflation. Recent studies have outlined an excellent methodological framework for addressing this issue by proposing bioinformatic tools that allow amplicon-specific error correction as an alternative or a complement to the more arbitrary approach of clustering into Molecular Operational Taxonomic Units (MOTUs) based on sequence dissimilarity [4,5]. But to date, the relevance of amplicon-specific error-correction tools has been demonstrated only for a limited set of taxonomic groups and gene markers.
The study of Brandt et al. [6] successfully builds upon existing methodological frameworks to fill this gap in the current literature. By proposing a bioinformatic pipeline combining Amplicon Sequence Variant (ASV) curation with MOTU clustering and additional post-clustering curation, the authors show that, contrary to previous recommendations, ASV-based curation alone is not an adequate approach for DNA metabarcoding-based inventories of metazoans. Metazoans do indeed exhibit inherently higher intraspecific and intra-individual genetic variability, which, in the absence of MOTU clustering, necessarily leads to biodiversity estimates biased in favor of species with higher intraspecific diversity. Interestingly, the positive effect of additional clustering proved to depend on the target gene region: it was proportionally stronger for the more polymorphic mitochondrial COI region than for the 18S ribosomal gene. Thus, the major strength of the study lies in providing optimal curation parameters that reflect the best possible balance between minimizing the impact of PCR/sequencing errors and minimizing the loss of true biodiversity across markers with contrasting levels of intragenomic variation. This is important because combining multiple markers is increasingly considered as a way to improve the taxonomic coverage and resolution of data in DNA metabarcoding studies.
Another critical aspect of the study is the taxonomic assignment of curated OTUs (as is the case for the majority of DNA metabarcoding-based biodiversity assessments). Facing the double challenge of focusing on taxonomic groups that are both highly diverse and poorly represented in public sequence reference databases, the authors failed to obtain high-resolution taxonomic assignments for several of the most closely related species. As a result, taxa with low divergence levels were clustered as single taxonomic units, leading to an underestimation of the true biodiversity present. This finding adds to the argument that, in order to be successful, sequence-based techniques still require comprehensive, high-quality reference databases.
Perhaps the only regret we might have with the study is the absence of mock community validation for the prokaryotic compartment. Even though the analyses of natural samples seem to suggest a positive effect of the curation pipeline, the concept of intra- versus inter-species variation in naturally occurring prokaryote communities remains ambiguous at best. Of course, assembling a representative sample of taxonomically resolved prokaryote taxa from deep-sea habitats does not come without difficulties, but it opens opportunities for further studies on the matter.
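
The post-clustering curation step discussed above combines two criteria named in the abstract: sequence similarity and co-occurrence. The Python sketch below is a simplified illustration of that idea, not the LULU implementation itself. The function names, the total-abundance rule, and the threshold values (`min_match=90.0`, `min_cooccurrence=0.95`) are assumptions chosen for illustration and are not taken from the preprint.

```python
from typing import List


def cooccurrence_rate(parent: List[int], child: List[int]) -> float:
    """Fraction of the samples containing the child OTU that also contain the parent."""
    child_samples = [i for i, reads in enumerate(child) if reads > 0]
    if not child_samples:
        return 0.0
    shared = sum(1 for i in child_samples if parent[i] > 0)
    return shared / len(child_samples)


def is_probable_artefact(
    parent: List[int],
    child: List[int],
    identity_percent: float,
    min_match: float = 90.0,         # illustrative threshold (assumption)
    min_cooccurrence: float = 0.95,  # illustrative threshold (assumption)
) -> bool:
    """Flag `child` as a likely erroneous variant of `parent` (simplified rule).

    All three conditions must hold:
      1. the two sequences are similar enough,
      2. the parent is present in nearly all samples where the child occurs,
      3. the parent is more abundant overall than the child.
    """
    if identity_percent < min_match:
        return False
    if cooccurrence_rate(parent, child) < min_cooccurrence:
        return False
    return sum(parent) > sum(child)


# Read counts per sample (hypothetical data for four samples).
abundant_otu = [100, 80, 0, 60]
rare_variant = [5, 3, 0, 2]     # always found together with abundant_otu
independent_otu = [0, 0, 12, 0]  # never found together with abundant_otu

print(is_probable_artefact(abundant_otu, rare_variant, identity_percent=97.0))
print(is_probable_artefact(abundant_otu, independent_otu, identity_percent=97.0))
```

Under these assumptions, raising `min_match` makes the filter more conservative, which is the kind of adjustment the abstract reports as necessary for 18S.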

References

[1] Porter, T. M., and Hajibabaei, M. (2018). Scaling up: A guide to high-throughput genomic approaches for biodiversity analysis. Molecular Ecology, 27(2), 313–338. doi: 10.1111/mec.14478
[2] Valentini, A., Taberlet, P., Miaud, C., Civade, R., Herder, J., Thomsen, P. F., … Dejean, T. (2016). Next-generation monitoring of aquatic biodiversity using environmental DNA metabarcoding. Molecular Ecology, 25(4), 929–942. doi: 10.1111/mec.13428
[3] Leray, M., and Knowlton, N. (2015). DNA barcoding and metabarcoding of standardized samples reveal patterns of marine benthic diversity. Proceedings of the National Academy of Sciences, 112(7), 2076–2081. doi: 10.1073/pnas.1424997112
[4] Callahan, B. J., McMurdie, P. J., and Holmes, S. P. (2017). Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. The ISME Journal, 11(12), 2639–2643. doi: 10.1038/ismej.2017.119
[5] Edgar, R. C. (2016). UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing. BioRxiv, 081257. doi: 10.1101/081257
[6] Brandt, M. I., Trouche, B., Quintric, L., Wincker, P., Poulain, J., and Arnaud-Haond, S. (2020). A flexible pipeline combining clustering and correction tools for prokaryotic and eukaryotic metabarcoding. BioRxiv, 717355, ver. 3 peer-reviewed and recommended by PCI Ecology. doi: 10.1101/717355

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Funding:
no declaration

Reviews

Evaluation round #1

DOI or URL of the preprint: https://doi.org/10.1101/717355

Version of the preprint: 1

Author's Reply, 30 Dec 2019

Download author's reply Download tracked changes file https://doi.org/10.24072/pci.ecology.100064.ar1

Decision by Stefaniya Kamenova, posted 14 Nov 2019

Dear authors,

My apologies for the delay: finding suitable reviewers willing to evaluate your work proved difficult, especially during the summer period. Nevertheless, we managed to collect two high-quality reviews, whose comments and suggestions you can find below as well as directly within the manuscript. Both reviewers and I find that your study addresses an interesting and very relevant topic pertaining to the analysis and interpretation of DNA metabarcoding datasets from biodiversity inventories. Overall, the paper's methodology is sound and the writing is of very high quality. However, further clarification of the methods, as well as a better justification of the choice of bioinformatic pipeline parameters in light of the existing literature, is required. Additional suggestions made by the reviewers (with a few comments from my side) should help improve the overall quality of the manuscript.

I recommend incorporating the suggested revisions and re-submitting your article. My feeling is that there will be no need for a second round of reviews, but I will have to assess this upon receipt of your revision.

Looking forward to seeing the revised paper! Best, Stefaniya Kamenova

Download recommender's annotations

https://doi.org/10.24072/pci.ecology.100064.d1

Reviewed by Tiago Pereira, 17 Sep 2019

The study by Brandt et al. (A flexible pipeline combining bioinformatic correction tools for prokaryotic and eukaryotic metabarcoding) brings new insights into the analysis of metabarcoding datasets covering both prokaryotes and eukaryotes as well as mitochondrial (e.g. COI) and nuclear genes (e.g. 18S and 16S), particularly with the inclusion and testing of new methods/bioinformatic tools (e.g. DADA2 and LULU). It is an interesting and well-written paper, likely to be very useful to many biologists/ecologists dealing with these types of datasets. The reviewer has made minor comments/changes in the pdf file (see attachment). Additionally, the authors should consider the following major points:

Expected relative abundance: the multi-copy nature of rRNA genes, PCR bias, etc., might confound our expectations. How close is close enough?
Intragenomic/intraspecific polymorphism: is this a real problem? Can we alleviate it by using phylogenetic methods?
General trends/patterns: although the different methods produced different results (e.g. alpha/beta diversity), how strongly did they affect the overall pattern?
In the pipeline, what seems to be the crucial step (e.g. clustering methods/thresholds or taxonomic assignment) for producing reliable/accurate findings with respect to biodiversity and ecological patterns?

Finally, the reviewer recommends the preprint to be published after minor revisions.

Download the review https://doi.org/10.24072/pci.ecology.100064.rev11

Reviewed by anonymous reviewer 1, 02 Nov 2019

Brandt et al. present a study on the bioinformatic processing of metabarcoding data that implements two currently widely applied tools (DADA2 and Swarm) in combination with a post-clustering tool (LULU). By proposing to combine DADA2 and Swarm, their study offers another perspective on the debate over whether ASVs or OTUs should be used for metabarcoding datasets. However, this combination (and the further post-clustering process with LULU) raises some major issues, which have to be addressed before I see this manuscript as ready for publication. In particular, the choice of some parameters is not properly justified. I will focus in my review on these major issues.

That being said, large parts of the manuscript read very well and there are few corrections needed on the language. My review of the language will therefore be very short and does not include typos (but there are some!). I would suggest, though, to re-structure the order of some paragraphs, which might improve the reading experience of the manuscript even further.

Major concerns:
i) The authors present their study, and especially the implementation of LULU, as a novel approach for studying metazoan diversity. However, a quick literature search returned another study from 2018 by Stefanni et al. that also targeted the COI and 18S genes for analyzing metazoan metabarcoding data with LULU (Stefanni et al., 2018; Multi-marker metabarcoding approach to study mesozooplankton at basin scale. Scientific Reports 8:12085). Stefanni et al. made some different choices regarding their bioinformatic pipeline, but their work and results should at least be discussed in the context of the current manuscript. In general, I have some doubts regarding the extent of novelty presented by Brandt and colleagues. Using LULU in combination with DADA2 was originally tested by Frøslev et al., 2017 on plant data. I am not convinced that simply applying the same combination to metazoan, eukaryotic and prokaryotic data is enough for a study that proposes a ‘flexible pipeline combining bioinformatic correction tools’, because neither tool was developed by the authors, nor is said combination a novel idea of the authors. Maybe the authors consider the combination of DADA2 and Swarm to be the proposed novel flexible pipeline. If that is what they are aiming at, they may want to consider putting the combination of DADA2 and Swarm (and LULU) in the focus. Currently it reads as if the focus is on DADA2 and LULU.

ii) Several parameters were chosen in the bioinformatic pipeline that are currently not justified in the text. The most prominent example is Swarm’s d value, which is set to 4 for 18S data, 6 for COI data and 1 for 16S data (lines 261-262). I am aware of only a few studies that do not use Swarm’s default of d=1, most likely because the results become harder to interpret. Allowing a difference of one nucleotide between two sequences in one OTU can easily be justified by naturally occurring sequence variation or artificially introduced sequencing errors. Every value beyond d=1 is harder to justify and may be just as arbitrary as the clustering thresholds the authors try to avoid. In fact, I was surprised that the authors use the avoidance of arbitrary sequence similarity clustering thresholds as an argument for Swarm (lines 54-55 and 113-115), but then try to set d to a value that mimics a 1% sequence divergence threshold, which is just the inverse of a 99% sequence similarity threshold (lines 349-351). The situation gets even worse, because Swarm OTUs clustered with different d values are pooled and analyzed in the same context. In my opinion, OTUs that are analyzed together should always be treated as similarly as possible. I suppose the 18S V1/V2 region is nearly as long as the 16S V4/V5 region; why were such different thresholds then chosen for the clustering of the respective OTUs? The authors need to justify these decisions, and if they cannot come up with scientifically sound justifications, they should consider sticking to those values that are justifiable.
Other more or less arbitrary values for which I found no explanation or justification were the maximum error rate for primer removal in CUTADAPT (lines 231-232), the truncation length, maximum expected error rates (line 243) as well as the minimum overlap for paired-end assembly (line 247) in DADA2, the very low identity (70%) cutoff for BLAST (line 254) and the minimum match values for LULU (line 280). All of these parameters have a severe effect on downstream data processing and ultimately on the results. Maybe the authors chose the values for a good reason or they followed default values from the literature. But without further explanations, the readers cannot understand their decisions and I would not recommend using a bioinformatic pipeline that does not inform about such important steps.

iii) In the abstract and introduction, the authors make a point about the importance of multiple-marker metabarcoding approaches. However, they conclude that DADA2 is not fit for analyzing metabarcoding datasets of metazoan organisms (lines 504-507). In contrast to this finding, there are at least two publications that analyzed metazoan metabarcoding datasets with DADA2 and did not report the problems presented by the authors here. One of the publications used the 18S V4 marker region and was cited by the authors (Xiong & Zhan 2018), the other publication used the 18S V9 marker region and was not cited by the authors (Leff et al., 2018; Predicting the structure of soil communities from plant community taxonomy, phylogeny, and traits. ISME Journal 12:1794-1805). These studies show that i) the conclusions about metazoan metabarcoding data drawn by the authors on the basis of the COI region cannot be generalized to all gene regions and ii) the authors may have targeted a less suitable gene region for their approach. In any case, the results of the current study should be discussed in the context of these previous studies.
Although I admit that it is a tedious topic, I was also surprised by the authors’ choice of the 18S V1/V2 region instead of the more commonly used V4 or V9 regions. Can the authors please comment on why V1/V2 was chosen? Much more reference data seems to be available for V4 and V9. Since correct taxonomic assignments were an important topic in the current study, using a marker gene for which more reference data is available would have been beneficial for the authors’ study design.

Minor comments:
- Two sentences I struggled the most with:
‘As metabarcoding with multiple markers, spanning several branches of the tree of life is becoming more accessible, bioinformatic pipelines need to accommodate both micro- and macro biologists.’ (lines 2-4).
‘The results also confirm an important variation in the amplification success across taxa (Bhadury et al., 2006; Carugati, Corinaldesi, Dell’Anno, & Danovaro, 2015), supporting the present approach combining nuclear and mitochondrial markers to achieve more comprehensive biodiversity inventories (Cowart et al., 2015; Drummond et al., 2015; Zhan, Bailey, Heath, & Macisaac, 2014).’ (lines 542-546).
Could you please rephrase to make it clearer to the reader what you want to express?

The numbering of the manuscript sections is askew. Introduction should be ‘1’, but Methods ended up being ‘1’ and so on.
The reference style is not uniform. For instance: ‘Bista et al., 2015’ next to ‘Deiner, Fronhofer, Mächler, Walser, & Altermatt, 2016’ (line 36).
Singletons consist of only one read. If the OTU consists of two reads, it is a doubleton (line 68). By the way, DADA2 is very effective in removing singletons (see Callahan et al., 2016). Thus, if you think that singleton removal ‘…is arbitrary and potentially hinders the detection of rare species.’ you should not use DADA2.
Although several important topics are mentioned in the introduction, it does not become entirely clear what the authors aim to achieve and how they want to do it. In particular, the late mention of Swarm, and how this algorithm connects to what was said before, is confusing.
What do the authors mean by amplicons obtained from negative controls (lines 317-318)? Surely they cannot be referring to PCR negative controls that yielded amplicons? I am sure there must be another explanation, but I could not find it in the manuscript’s methods section. There is just the cryptic sentence ‘Negative extraction controls were included in each extraction run.’ (line 152). Could you please explain what exactly these controls are, what you used them for and why they were pooled with the rest of the amplicons?
Do more abundant species in the mock communities lead to more ASVs/OTUs?
Table 1: Maybe the comparison of the pipelines’ results could also be presented as a figure. The numbers separated by slashes are hard to read, and the results might come across better as, e.g., bar plots.
Table 2: Could also be a ‘real’ colored heatmap.
I struggled with the order of the paragraphs and would ask the authors to disentangle the results of the mock community approach from the results of the ‘true’ samples. One possibility is to restrict oneself first to the mock community results, because they allow for setting the further results in a context. Then present the alpha- and beta-diversity results of the ‘true’ samples.

https://doi.org/10.24072/pci.ecology.100064.rev12