100 open source Big Data architecture papers for data professionals. | Anil Madan | LinkedIn
Big Data technology has been extremely disruptive, with open source playing a dominant role in shaping its evolution. That same disruption has produced a complex ecosystem in which new frameworks, libraries and tools are released almost daily, leaving technologists struggling to keep up with the deluge.
If you are a Big Data enthusiast or a technologist ramping up (or scratching your head), it is important to spend serious time understanding the architecture of the key systems in depth, so that you can appreciate how they evolved. Understanding the architectural components and their subtleties will also help you choose and apply the appropriate technology for your use case. In my journey over the last few years, some literature has helped me become a better educated data professional. My goal here is not only to share that literature but also to use the opportunity to bring some sanity to the labyrinth of open source systems.
One caution: most of the reference literature included here is heavily skewed towards deep architectural treatments (in most cases the original research papers) rather than basic overviews. I firmly believe that a deep dive will help you understand the nuances, but it offers no shortcuts if all you want is a quick overview.
Jumping right in…
Key architecture layers
- File Systems – distributed file systems that provide storage, fault tolerance, scalability, reliability, and availability.
- Data Stores – application databases have evolved into polyglot storage, with application-specific databases replacing the one-size-fits-all approach. Common types are key-value, document, column and graph.
- Resource Managers – provide resource management capabilities and support schedulers for high utilization and throughput.
- Coordination – systems that manage state, distributed coordination, consensus and lock management.
- Computational Frameworks – a lot of work is happening at this layer, with highly specialized compute frameworks for streaming, interactive, real-time, batch and iterative graph (BSP) processing. Powering these are complete computation runtimes like BDAS (Spark) and Flink.
- Data Analytics – analytical (consumption) tools and libraries that support exploratory, descriptive, predictive and statistical analysis as well as machine learning.
- Data Integration – includes not only the orchestration tools for managing pipelines but also metadata management.
- Operational Frameworks – scalable frameworks for monitoring and benchmarking.
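The layers above can be written down as a small lookup structure, which may be handy when sorting a new tool or paper into the right bucket. This is only a minimal sketch: the names `Layer`, `CONCERNS` and `layers_for` are hypothetical, and the keyword sets are simply the concerns named in the list, not an authoritative taxonomy.

```python
from enum import Enum


class Layer(Enum):
    """The eight architecture layers, in the order listed above."""
    FILE_SYSTEMS = "File Systems"
    DATA_STORES = "Data Stores"
    RESOURCE_MANAGERS = "Resource Managers"
    COORDINATION = "Coordination"
    COMPUTATIONAL_FRAMEWORKS = "Computational Frameworks"
    DATA_ANALYTICS = "Data Analytics"
    DATA_INTEGRATION = "Data Integration"
    OPERATIONAL_FRAMEWORKS = "Operational Frameworks"


# Assumption: each keyword is lifted from the layer descriptions above;
# the exact wording of the keys is an illustrative choice.
CONCERNS = {
    Layer.FILE_SYSTEMS: {"storage", "fault tolerance", "scalability",
                         "reliability", "availability"},
    Layer.DATA_STORES: {"key-value", "document", "column", "graph"},
    Layer.RESOURCE_MANAGERS: {"resource management", "scheduling"},
    Layer.COORDINATION: {"state", "distributed coordination",
                         "consensus", "lock management"},
    Layer.COMPUTATIONAL_FRAMEWORKS: {"streaming", "interactive", "real time",
                                     "batch", "iterative graph"},
    Layer.DATA_ANALYTICS: {"exploratory", "descriptive", "predictive",
                           "statistical analysis", "machine learning"},
    Layer.DATA_INTEGRATION: {"orchestration", "metadata management"},
    Layer.OPERATIONAL_FRAMEWORKS: {"monitoring", "benchmarking"},
}


def layers_for(concern):
    """Return the layers whose keyword set contains the given concern."""
    needle = concern.strip().lower()
    return [layer for layer, keywords in CONCERNS.items() if needle in keywords]
```

For example, `layers_for("consensus")` points at the Coordination layer, while a concern that appears nowhere in the list returns an empty result. A real classification would need richer matching than exact keywords.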