Glossary

What is a data pipeline in analytics?

In analytics, a data pipeline is a collection of systems that covers the entire data journey: extraction from the data sources; ingestion into a file system, database, or storage service; the ETL systems that transform and prepare data for analysis; the analytics data processing engines; and the output of data to dashboards, BI tools, and data applications.
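The stages named above can be sketched as a chain of small functions. This is only an illustrative toy, not a reference design: the sample CSV, the table names (raw_orders, orders), and every function name are hypothetical, and an in-memory SQLite database stands in for whatever storage service and processing engine a real pipeline would use.

```python
import csv
import io
import sqlite3

# Hypothetical source data standing in for an external system.
RAW_CSV = """order_id,region,amount
1,east,120.50
2,west,80.00
3,east,
4,west,45.25
"""


def extract(raw_text):
    """Extraction: read raw records from the source."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def ingest(conn, rows):
    """Ingestion: land the raw records, unmodified, in storage."""
    conn.execute(
        "CREATE TABLE raw_orders (order_id TEXT, region TEXT, amount TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (:order_id, :region, :amount)", rows
    )


def transform(conn):
    """Transformation: drop rows with no amount and cast columns to types."""
    conn.execute(
        "CREATE TABLE orders AS "
        "SELECT CAST(order_id AS INTEGER) AS order_id, region, "
        "CAST(amount AS REAL) AS amount "
        "FROM raw_orders WHERE amount <> ''"
    )


def analyze(conn):
    """Processing: aggregate the prepared data (total amount per region)."""
    cursor = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    )
    return dict(cursor.fetchall())


def publish(summary):
    """Output: format results as lines a dashboard or report could show."""
    return [f"{region}: {total:.2f}" for region, total in summary.items()]


def run_pipeline(raw_text):
    """Run every stage in order against a throwaway in-memory database."""
    conn = sqlite3.connect(":memory:")
    try:
        ingest(conn, extract(raw_text))
        transform(conn)
        return publish(analyze(conn))
    finally:
        conn.close()


if __name__ == "__main__":
    for line in run_pipeline(RAW_CSV):
        print(line)
```

Keeping each stage behind its own function mirrors the separation described above: the raw data is stored untouched before it is transformed, so a stage can be changed or rerun without repeating the extraction.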

Many organizations have several, or even hundreds of, data pipelines that serve different lines of business or use cases. Well-designed and well-implemented data pipelines help organizations gain more and better insights by effectively capturing, organizing, routing, processing and visualizing data.

As more data becomes available from more sources, creating effective data pipelines is essential to connecting and coordinating different data sources, storage layers, data processing systems, analytics tools and applications. Since data scientists, developers and business leaders may all want to work with data in different ways, the pipeline architecture must be flexible enough that the details relevant to each team can be gathered, stored and made available for whatever analysis is needed.

The design goals of an effective data pipeline architecture are that it is scalable, flexible, cost-effective and optimized for a wide variety of analytical tasks.