A Framework for Fault-Tolerant Enterprise Applications

Mallikarjun Bellundagi

Abstract

Enterprise application servers have long been the backbone of mission-critical business operations: they host complex, distributed applications that must stay available through hardware failures, software faults, and unpredictable workload surges. Oracle WebLogic Server and Red Hat JBoss (WildFly) are two of the most widely deployed platforms of this kind, each offering Java EE and Jakarta EE compliance, clustering, transaction management, and enterprise-grade security. Despite these built-in resilience features, real-world deployments still suffer unplanned outages, performance degradation, and cascading failures that cause financial loss, reputational damage, and disruption to business continuity. This paper presents a framework for building fault-tolerant enterprise applications that combines the complementary strengths of WebLogic and JBoss deployment architectures with an artificial intelligence-based failure prediction system that anticipates faults and mitigates them before they become service disruptions. The framework has three components: Long Short-Term Memory (LSTM) neural networks for multivariate time-series analysis of server telemetry; a gradient-boosted ensemble model for anomaly detection across application metrics; and a reinforcement learning-driven remediation agent that autonomously executes corrective actions, including workload redistribution, preemptive session migration, and graceful service degradation. In experimental evaluations across enterprise-grade benchmark environments, the framework achieves a 47.3% reduction in unplanned downtime, a 38.6% improvement in mean time to recovery, and 91.2% accuracy in failure prediction with an average lead time of 8.4 minutes before fault occurrence, establishing a new standard for proactive fault tolerance in enterprise middleware deployments.
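
The abstract does not describe how telemetry is prepared for the predictor, so the following is only an illustrative sketch of one common approach: slicing multivariate server telemetry into fixed-length windows and labeling each window by whether a failure follows within a chosen lead-time horizon. The function name `label_windows`, the sample layout, the one-minute sampling interval, and the use of 504 seconds (8.4 minutes) as the horizon are all assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Sequence, Tuple

# Hypothetical sample layout: (timestamp in seconds, vector of metrics).
Sample = Tuple[float, Sequence[float]]


def label_windows(
    samples: List[Sample],
    failure_times: Sequence[float],
    window: int,
    horizon_s: float,
) -> List[Tuple[List[Sequence[float]], int]]:
    """Slice telemetry into overlapping windows and label each one.

    A window gets label 1 if any failure occurs strictly after its last
    sample and no later than `horizon_s` seconds after it; otherwise 0.
    """
    labeled = []
    for end in range(window, len(samples) + 1):
        chunk = samples[end - window:end]
        t_end = chunk[-1][0]
        label = int(any(t_end < f <= t_end + horizon_s for f in failure_times))
        labeled.append(([vec for _, vec in chunk], label))
    return labeled


# Illustrative data: one sample per minute, two made-up metrics per sample.
telemetry = [(60.0 * i, (0.5 + 0.01 * i, 100 + i)) for i in range(10)]
windows = label_windows(telemetry, failure_times=[900.0], window=3, horizon_s=504.0)
labels = [label for _, label in windows]
print(labels)
```

Each labeled window could then be fed to a sequence model such as an LSTM; the window length and horizon would need tuning against real telemetry.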

Article Details

How to Cite
Bellundagi, M. (2024). A Framework for Fault-Tolerant Enterprise Applications. International Meridian Journal, 6(6). https://meridianjournal.in/index.php/IMJ/article/view/122
