Using Apache Arrow, Calcite and Parquet to build a Relational Cache

Jacques Nadeau | Dremio


Everybody wants to get to data faster. As we move from general solutions to more specific optimization techniques, the performance impact grows. This talk will discuss how in-memory caching, columnar storage and relational caching can be layered to substantially improve data science and analytical workloads. It will include a detailed overview of how you can use Apache Arrow, Calcite and Parquet to achieve improvements of multiple orders of magnitude over what is currently possible.

We'll start by talking about in-memory caches and the difference between block-based and data-aware caching strategies. We'll discuss the deployment design of each type of solution and cover the strengths of each. There will also be a discussion of the relationship between security and predicate application in these scenarios. Then we'll go into detail about how columnar storage formats can further enhance performance by minimizing read time, optimizing for vectorized in-memory processing and enabling powerful compression techniques.
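To make the columnar point concrete, here is a minimal sketch in plain Python, not how Parquet or Arrow are actually implemented. It assumes a tiny, hypothetical set of rows (the field names `region`, `status` and `amount` are invented for illustration) and shows two of the effects mentioned above: a query that needs one column reads only that column, and repetitive values compress well once they are stored next to each other. Run-length encoding is used here only as a simple stand-in for the compression a real columnar format applies.

```python
from itertools import groupby

# Hypothetical sample rows; the field names are illustrative only.
rows = [
    {"region": "east", "status": "ok", "amount": 10},
    {"region": "east", "status": "ok", "amount": 20},
    {"region": "east", "status": "fail", "amount": 5},
    {"region": "west", "status": "ok", "amount": 7},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


columns = to_columns(rows)

# A query that only needs "amount" scans one list instead of every row dict.
total_amount = sum(columns["amount"])

# Adjacent repeated values collapse into short runs.
encoded_region = rle_encode(columns["region"])
```

In this sketch the `region` column of four values collapses into two runs, and `rle_decode` restores the original column exactly, so the compression is lossless.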

Lastly, we'll introduce a much more advanced way to speed up access to data, called relational caching. Relational caching builds on columnar in-memory caching techniques but also includes a full comprehension of how data is being used and how different forms of data relate to each other. This will include leveraging multiple sorting and partitioning strategies as well as maintaining multiple related derivations of data for different types of access patterns. As part of this, we'll also cover approaches to data TTL, relational cache consistency and several different approaches to data mutation and real-time updates.
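The idea of keeping several related derivations of the same data and answering each query from the cheapest one can be sketched as follows. This is an illustrative toy under stated assumptions, not the approach presented in the talk and not Calcite's API: the `Derivation` class, the `choose_derivation` function, and all table and column names are hypothetical, and a real planner matches relational algebra rather than plain column sets. The sketch also assumes every measure is additive (such as a sum), so a finer-grained derivation can be re-aggregated to answer a coarser query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Derivation:
    """One cached form of the data (hypothetical structure)."""
    name: str
    dimensions: frozenset  # columns this derivation can still group or filter by
    measures: frozenset    # additive measures it carries, e.g. a summed amount
    row_count: int         # rough cost of scanning it


def choose_derivation(derivations, needed_dimensions, needed_measures):
    """Return the smallest derivation that covers the query, or None."""
    candidates = [
        d for d in derivations
        if needed_dimensions <= d.dimensions and needed_measures <= d.measures
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda d: d.row_count)


# Hypothetical cache contents: the raw data plus two pre-aggregated forms.
cache = [
    Derivation("raw_sales", frozenset({"region", "day", "customer"}),
               frozenset({"amount"}), 1_000_000),
    Derivation("sales_by_region_day", frozenset({"region", "day"}),
               frozenset({"amount"}), 3_650),
    Derivation("sales_by_region", frozenset({"region"}),
               frozenset({"amount"}), 10),
]

by_region = choose_derivation(cache, frozenset({"region"}), frozenset({"amount"}))
by_day = choose_derivation(cache, frozenset({"day"}), frozenset({"amount"}))
by_customer = choose_derivation(cache, frozenset({"customer"}), frozenset({"amount"}))
uncovered = choose_derivation(cache, frozenset({"product"}), frozenset({"amount"}))
```

Here a query grouped by `region` is served from the ten-row derivation, a query grouped by `day` falls back to the region-and-day derivation, and a query on a column no derivation carries returns `None`, meaning it would have to go to the source system.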

Jacques Nadeau

Co-founder & CTO @ Dremio | Creator of Apache Arrow

Jacques Nadeau is co-founder and CTO of Dremio. He is also the PMC Chair of the open source Apache Arrow project, spearheading the project’s technology and community. Previously he was MapR’s lead architect for distributed systems technologies. He is an industry veteran with more than 15 years of big data and analytics experience. In addition, he was co-founder and CTO of search engine startup YapMap. Before that, he was director of new product engineering with Quigo (contextual advertising, acquired by AOL in 2007). He also built the Avenue A | Razorfish analytics data warehousing system and associated services practice (acquired by Microsoft).