Mining Data Lineage Patterns Using Machine Learning to Predict Downstream Impact

Pramod Raja Konda

Abstract

This study investigates the use of machine learning to mine data lineage patterns and predict downstream impact across complex enterprise data ecosystems. As organizations increasingly rely on interconnected data pipelines for analytics, reporting, and regulatory compliance, understanding how changes in upstream datasets affect downstream processes has become critical. Traditional lineage tracking methods often rely on manual documentation or static metadata, and so fail to capture evolving pipeline behavior and hidden dependencies. This research proposes a machine learning–driven framework that analyzes historical lineage graphs, transformation logs, schema evolution patterns, and workload metadata to identify recurring dependency structures. Predictive models trained on these lineage-derived features forecast the downstream impact of schema changes, data quality anomalies, and pipeline modifications. Experimental evaluation on large-scale enterprise datasets indicates that the proposed approach predicts affected downstream tables, workflows, and analytical outputs with high accuracy. The findings highlight the value of AI-enabled lineage intelligence for proactive risk mitigation, automated impact analysis, and improved data governance. This work contributes to the broader field of metadata analytics and presents a scalable, model-driven strategy for enhancing the reliability of modern data platforms.
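
To make the idea of lineage-derived features concrete, the following is a minimal sketch, not the framework evaluated in the study. It assumes lineage is available as a list of (upstream, downstream) edges and uses a plain breadth-first traversal to find every asset reachable from a changed dataset, recording the hop distance as one simple feature that a predictive model could consume. The table names, `EDGES`, `build_graph`, and `downstream_depths` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges (upstream -> downstream); names are illustrative only.
EDGES = [
    ("raw.orders", "stg.orders"),
    ("stg.orders", "mart.daily_sales"),
    ("stg.orders", "mart.customer_ltv"),
    ("mart.daily_sales", "report.revenue_dashboard"),
    ("raw.customers", "mart.customer_ltv"),
]


def build_graph(edges):
    """Build an adjacency list mapping each asset to its direct consumers."""
    graph = defaultdict(list)
    for upstream, downstream in edges:
        graph[upstream].append(downstream)
    return graph


def downstream_depths(graph, changed):
    """Return {asset: hop distance} for every asset reachable from `changed`.

    Breadth-first search guarantees each asset is recorded at its shortest
    distance from the changed dataset. The changed dataset itself is excluded.
    """
    depths = {}
    seen = {changed}
    queue = deque([(changed, 0)])
    while queue:
        node, depth = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                depths[child] = depth + 1
                queue.append((child, depth + 1))
    return depths


if __name__ == "__main__":
    graph = build_graph(EDGES)
    for asset, depth in downstream_depths(graph, "raw.orders").items():
        print(f"{asset}: {depth} hop(s) downstream")
```

A traversal like this only reports structural reachability. In a learned approach of the kind the abstract describes, features such as hop distance would presumably be combined with transformation logs, schema history, and workload metadata to estimate which reachable assets are actually likely to break.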

Article Details

How to Cite
Konda, P. R. (2023). Mining Data Lineage Patterns Using Machine Learning to Predict Downstream Impact. International Meridian Journal, 5(5). https://meridianjournal.in/index.php/IMJ/article/view/117
