Integrating Heterogeneous ETL Pipelines: Towards Unified Data Processing Across Cloud and Legacy Systems

Krishna Chaitanya Batchu

Abstract

As organizations maintain a mix of cloud-native and legacy systems, integrating ETL processes across platforms has become a critical technical bottleneck. This article addresses heterogeneous ETL integration by proposing a metadata-driven abstraction layer that decouples data transformation logic from execution environments. We introduce a three-tier architecture, comprising interface, orchestration, and execution layers, that enables platform-agnostic ETL orchestration while preserving system-specific optimizations. The interface layer provides a unified metadata schema that captures data lineage, transformation rules, and quality constraints. The orchestration layer uses a plugin-based architecture with adapters for Apache Airflow, Talend, Apache Spark, and cloud-native services, translating abstract ETL definitions into platform-specific execution plans. The execution layer coordinates the underlying processing engines through standardized telemetry interfaces. Central to the architecture is a graph-based metadata repository that serves as the single source of truth for pipeline definitions and data lineage. Experimental validation across three operational scenarios demonstrates significant improvements in data consistency, development velocity, code reusability, monitoring capabilities, and resource utilization, along with reduced cost. The architecture addresses technical challenges including schema evolution, network latency, type system inconsistencies, and coordination overhead through specialized solutions. The proposed model enables incremental modernization strategies that preserve existing technology investments while progressively adopting cloud-native capabilities, providing a practical solution for enterprises undergoing digital transformation.
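
The layering described in the abstract can be illustrated with a minimal Python sketch. It is not the article's implementation: the names `Step`, `PipelineDefinition`, `AirflowLikeAdapter`, `SparkLikeAdapter`, `ADAPTERS`, and `plan` are hypothetical, the metadata schema is reduced to a few fields, and the adapters only emit descriptive strings rather than calling any real Airflow, Talend, or Spark API. The sketch assumes that a pipeline can be expressed as a list of engine-neutral steps, that lineage can be derived from each step's inputs and output, and that each platform adapter is a plugin that turns the same definition into its own plan.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class Step:
    """One abstract transformation step (hypothetical, simplified schema)."""
    name: str
    inputs: tuple[str, ...]
    output: str
    rule: str  # engine-neutral description of the transformation


@dataclass
class PipelineDefinition:
    """Platform-agnostic pipeline definition: an ordered list of steps."""
    name: str
    steps: list[Step] = field(default_factory=list)

    def lineage(self) -> dict[str, set[str]]:
        """Map each produced dataset to the datasets it is derived from."""
        graph: dict[str, set[str]] = {}
        for step in self.steps:
            graph.setdefault(step.output, set()).update(step.inputs)
        return graph

    def upstream(self, dataset: str) -> set[str]:
        """Return all transitive ancestors of a dataset in the lineage graph."""
        graph = self.lineage()
        seen: set[str] = set()
        stack = list(graph.get(dataset, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, ()))
        return seen


class Adapter(Protocol):
    """Plugin interface: translate an abstract definition into a plan."""
    def translate(self, pipeline: PipelineDefinition) -> list[str]: ...


class AirflowLikeAdapter:
    """Illustrative only: emits task labels, does not use the Airflow API."""
    def translate(self, pipeline: PipelineDefinition) -> list[str]:
        return [f"task:{pipeline.name}.{s.name}" for s in pipeline.steps]


class SparkLikeAdapter:
    """Illustrative only: emits stage descriptions, does not use Spark."""
    def translate(self, pipeline: PipelineDefinition) -> list[str]:
        return [
            f"stage:{s.output} <- {', '.join(s.inputs)} [{s.rule}]"
            for s in pipeline.steps
        ]


# Registry of available plugins, keyed by a platform label (an assumption).
ADAPTERS: dict[str, Adapter] = {
    "airflow": AirflowLikeAdapter(),
    "spark": SparkLikeAdapter(),
}


def plan(pipeline: PipelineDefinition, platform: str) -> list[str]:
    """Look up the adapter for a platform and build its execution plan."""
    try:
        adapter = ADAPTERS[platform]
    except KeyError:
        raise ValueError(f"no adapter registered for {platform!r}") from None
    return adapter.translate(pipeline)


# Hypothetical example pipeline; dataset and step names are made up.
example = PipelineDefinition(
    name="orders",
    steps=[
        Step("clean", ("raw_orders",), "clean_orders", "drop null keys"),
        Step("enrich", ("clean_orders", "customers"), "enriched_orders",
             "join on customer_id"),
    ],
)
```

Calling `plan(example, "airflow")` and `plan(example, "spark")` yields two different plans from the same definition, and `example.upstream("enriched_orders")` walks the lineage graph back to the source datasets. Keeping the definition free of engine-specific detail is what lets one more adapter be registered without touching existing pipelines.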

How to Cite

Integrating Heterogeneous ETL Pipelines: Towards Unified Data Processing Across Cloud and Legacy Systems. (2024). International Journal of Research Publications in Engineering, Technology and Management (IJRPETM), 7(3), 10491-10498. https://doi.org/10.15662/IJRPETM.2024.0703005
