Paper Title
Automated Data Quality Management System for Big Data Pipelines
Abstract
The exponential growth of big data has created a need for robust pipelines that can process large volumes of data efficiently. Ensuring data quality in these pipelines is difficult because big data arrives quickly, takes many forms, and is often inconsistent. Poor data quality degrades analytics, decision-making, and business performance. This paper explores an automated data quality management system for big data pipelines that addresses inconsistency, incompleteness, and duplication. The system monitors data quality metrics in real time, uses anomaly detection to flag outliers and shifts in the data, and performs cleansing tasks such as deduplication, filling missing values, and standardizing formats. Automating these processes reduces manual effort, improves data reliability, and helps ensure high-quality data for analytics, enabling organizations to make better-informed decisions.
Keywords - Big Data Pipelines, Data Quality Management, Automation, Anomaly Detection, Data Cleansing
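Illustrative sketch. The quality checks and cleansing tasks named in the abstract (a completeness metric, format standardization, deduplication, filling missing values, and outlier flagging) can be sketched in a few lines of Python. This is a minimal single-machine illustration under stated assumptions, not the system's implementation: the function names, the record layout (a list of dictionaries), the sample fields, and the choice of a median-based modified z-score with a threshold of 3.5 for anomaly detection are all hypothetical choices made for this example.

```python
from statistics import median

MISSING = (None, "")  # assumption: these two values count as "missing"


def completeness(records, field):
    """Fraction of records whose `field` is present and not missing."""
    if not records:
        return 1.0
    present = sum(1 for r in records if r.get(field) not in MISSING)
    return present / len(records)


def standardize(records, field):
    """Return copies with string values in `field` trimmed and lower-cased."""
    out = []
    for r in records:
        r = dict(r)
        if isinstance(r.get(field), str):
            r[field] = r[field].strip().lower()
        out.append(r)
    return out


def deduplicate(records, key_fields):
    """Keep the first record seen for each combination of key fields."""
    seen, out = set(), []
    for r in records:
        key = tuple(r.get(f) for f in key_fields)
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out


def fill_missing(records, field, default):
    """Return copies with missing values in `field` replaced by `default`."""
    return [
        {**r, field: default} if r.get(field) in MISSING else dict(r)
        for r in records
    ]


def flag_outliers(values, threshold=3.5):
    """Indices of values whose modified z-score (median/MAD) exceeds threshold."""
    if not values:
        return []
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:  # no spread around the median: nothing can be scored
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical sample batch, invented for this illustration only.
batch = [
    {"id": 1, "email": " A@Example.com ", "amount": 10.0},
    {"id": 1, "email": "a@example.com", "amount": 10.0},
    {"id": 2, "email": None, "amount": 12.0},
    {"id": 3, "email": "c@example.com", "amount": 11.0},
    {"id": 4, "email": "d@example.com", "amount": 500.0},
]

clean = deduplicate(standardize(batch, "email"), ["id", "email"])
email_completeness = completeness(clean, "email")
clean = fill_missing(clean, "email", "unknown")
outlier_rows = flag_outliers([r["amount"] for r in clean])
```

Standardization runs before deduplication here so that records differing only in case or whitespace collapse into one. A production pipeline would apply the same ideas per batch or per stream window on a distributed engine rather than on in-memory lists.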