Publications | durner.dev

Exploiting Cloud Object Storage for High-Performance Analytics

Dominik Durner, Viktor Leis, Thomas Neumann
VLDB 2023, 49th International Conference on Very Large Data Bases

Elasticity of compute and storage is crucial for analytical cloud database systems. All cloud vendors provide disaggregated object stores, which can be used as the storage backend for analytical query engines. Until recently, local storage was unavoidable to process large tables efficiently due to the bandwidth limitations of the network infrastructure in public clouds. However, the gap between remote network and local NVMe bandwidth is closing, making cloud storage more attractive. This paper presents a blueprint for performing efficient analytics directly on cloud object stores. We derive cost- and performance-optimal retrieval configurations for cloud object stores with the first in-depth study of this foundational service in the context of analytical query processing. To achieve high retrieval performance, we present AnyBlob, a novel download manager for query engines that optimizes throughput while minimizing CPU usage. We discuss the integration of high-performance data retrieval in query engines and demonstrate it by incorporating AnyBlob in our database system Umbra. Our experiments show that even without caching, Umbra with integrated AnyBlob achieves similar performance to state-of-the-art cloud data warehouses that cache data on local SSDs while improving resource elasticity.

Crystal: A Unified Cache Storage System for Analytical Databases

Dominik Durner, Badrish Chandramouli, Yinan Li
VLDB 2021, 47th International Conference on Very Large Data Bases

Cloud analytical databases employ a disaggregated storage model, where the elastic compute layer accesses data persisted on remote cloud storage in block-oriented columnar formats. Given the high latency and low bandwidth to remote storage and the limited size of fast local storage, caching data at the compute node is important and has resulted in a renewed interest in caching for analytics. Today, each DBMS builds its own caching solution, usually based on file- or block-level LRU. In this paper, we advocate a new architecture of a smart cache storage system called Crystal that is co-located with compute. Crystal's clients are DBMS-specific "data sources" with push-down predicates. Similar in spirit to a DBMS, Crystal incorporates query processing and optimization components focusing on efficient caching and serving of single-table hyper-rectangles called regions. Results show that Crystal, with a small DBMS-specific data source connector, can significantly improve query latencies on unmodified Spark and Greenplum while also saving on bandwidth from remote storage.

JSON Tiles: Fast Analytics on Semi-Structured Data

Dominik Durner, Viktor Leis, Thomas Neumann
SIGMOD Honorable Mention Award
ACM SIGMOD 2021 International Conference on Management of Data (SIGMOD 2021)

Developers often prefer flexibility over upfront schema design, making semi-structured data formats such as JSON increasingly popular. Large amounts of JSON data are therefore stored and analyzed by relational database systems. In existing systems, however, JSON's lack of a fixed schema results in slow analytics. In this paper, we present JSON tiles, which, without losing the flexibility of JSON, enables relational systems to perform analytics on JSON data at native speed. JSON tiles automatically detects the most important keys and extracts them transparently -- often achieving scan performance similar to columnar storage. At the same time, JSON tiles is capable of handling heterogeneous and changing data. Furthermore, we automatically collect statistics that enable the query optimizer to find good execution plans. Our experimental evaluation compares against state-of-the-art systems and research proposals and shows that our approach is both robust and efficient.
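
The idea of detecting important keys and extracting them can be illustrated with a minimal Python sketch. It is not the algorithm from the paper: the function name `extract_frequent_keys`, the simple frequency threshold, and the dictionary-based column layout are all assumptions made for illustration. It only shows the general principle of materializing keys that occur in most documents of a chunk as columns, while keeping the remaining keys as leftover JSON.

```python
from collections import Counter


def extract_frequent_keys(documents, threshold=0.6):
    """Split a chunk of JSON-like dicts into columns and leftovers.

    Hypothetical helper: keys present in at least `threshold` of the
    documents become columns (missing values are stored as None);
    all other keys stay in a per-document remainder.
    """
    if not documents:
        return {}, []

    # Count in how many documents each top-level key appears.
    counts = Counter(key for doc in documents for key in doc)
    frequent = sorted(
        key for key, count in counts.items()
        if count / len(documents) >= threshold
    )

    # One column per frequent key, aligned with the document order.
    columns = {key: [doc.get(key) for doc in documents] for key in frequent}

    # Infrequent keys remain available in their original form.
    remainder = [
        {key: value for key, value in doc.items() if key not in columns}
        for doc in documents
    ]
    return columns, remainder


docs = [
    {"id": 1, "user": "a"},
    {"id": 2, "user": "b", "geo": "x"},
    {"id": 3},
]
columns, remainder = extract_frequent_keys(docs)
```

In this example `id` and `user` are extracted as columns, while the rare `geo` key stays in the remainder, so heterogeneous documents are still represented without loss.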

No False Negatives: Accepting All Useful Schedules in a Fast Serializable Many-Core System

Dominik Durner, Thomas Neumann
35th IEEE International Conference on Data Engineering (ICDE 2019)

Concurrency control is one of the most performance-critical steps in modern many-core database systems. Achieving higher throughput on multi-socket servers is difficult, and many concurrency control algorithms either reduce the number of accepted schedules in favor of transaction throughput or relax the isolation level, which introduces unwanted anomalies. Both approaches lead to unexpected transaction behavior that is difficult for database users to understand. We introduce a novel multi-version concurrency protocol that achieves high performance while reducing the number of aborted schedules to a minimum and providing the best isolation level. Our approach leverages the idea of a graph-based scheduler that uses the concept of conflict graphs. As conflict serializable histories can be represented by acyclic conflict graphs, our scheduler maintains the conflict graph and allows all transactions that keep the graph acyclic. All conflict serializable schedules can be accepted by such a graph-based algorithm due to the conflict graph theorem. Hence, only transaction schedules that truly violate the serializability constraints need to abort. Our approach is able to accept the useful intersection of commit order preserving conflict serializable (COCSR) and recoverable (RC) schedules, which are the two most desirable classes in terms of correctness and user experience. We show experimentally that our graph-based scheduler has very competitive throughput in pure transactional workloads while providing fewer aborts and improved user experience. Our multi-version extension helps to efficiently perform long-running read transactions on the same up-to-date database. Moreover, our graph-based scheduler can outperform the competitors on mixed workloads.
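
The core idea of a conflict-graph scheduler, accepting an operation only if the conflict graph stays acyclic, can be sketched in a few lines of Python. This is a single-threaded illustration under stated assumptions, not the protocol from the paper: the class name `ConflictGraphScheduler`, the method `add_conflict`, and the depth-first reachability check are all hypothetical, and the sketch ignores multi-versioning, recoverability, commit ordering, and concurrent graph maintenance.

```python
from collections import defaultdict


class ConflictGraphScheduler:
    """Toy conflict graph: an edge a -> b means a must serialize before b."""

    def __init__(self):
        # Hypothetical representation: adjacency sets keyed by transaction id.
        self.edges = defaultdict(set)

    def _reaches(self, start, target):
        """Return True if `target` is reachable from `start` (iterative DFS)."""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges.get(node, ()))
        return False

    def add_conflict(self, before, after):
        """Record the conflict edge before -> after if the graph stays acyclic.

        Returns False (the caller would abort a transaction) when the edge
        would close a cycle, i.e. when `before` is already reachable
        from `after`.
        """
        if before == after:
            return True  # a transaction never conflicts with itself
        if self._reaches(after, before):
            return False
        self.edges[before].add(after)
        return True


scheduler = ConflictGraphScheduler()
first = scheduler.add_conflict("T1", "T2")    # T1 before T2
second = scheduler.add_conflict("T2", "T3")   # T2 before T3
cyclic = scheduler.add_conflict("T3", "T1")   # would close T1 -> T2 -> T3 -> T1
forward = scheduler.add_conflict("T1", "T3")  # consistent with the existing order
```

Only the third call is rejected, because it is the only one that would make the graph cyclic; every other dependency is accepted, which mirrors the goal of aborting only schedules that truly violate serializability.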

TracEx: Understanding and Analyzing Database Traces

Dominik Durner, Lennart Espe, Jana Giceva, Anja Gruenheid
CIDR 2024, Conference on Innovative Data Systems Research

With the shift to databases-as-a-service, vendors are able to collect high-level database traces of executed workloads while retaining the privacy of their customers. In contrast to pure end-to-end latency statistics, traces contain enriched information that is useful for tasks such as workload monitoring and regression testing. Despite their importance, efficient analysis and exploration of traces and their rich feature space remains a challenge. In this paper, we introduce TracEx, an open-source Trace Exploration tool that facilitates workload trace analysis and comparison for database systems. TracEx allows users to understand their workload by providing an intuitive, visual interface that explores the workload along different dimensions, e.g., resource utilization or database operator usage. Additionally, users are able to contrast and compare workloads that have been collected from different hardware configurations or even compare traces between database systems.

On the Impact of Memory Allocation on High-Performance Query Processing

Dominik Durner, Viktor Leis, Thomas Neumann
15th International Workshop on Data Management on New Hardware (DaMoN 2019)

Somewhat surprisingly, the behavior of analytical query engines is crucially affected by the dynamic memory allocator used. Memory allocators highly influence performance, scalability, memory efficiency, and memory fairness to other processes. In this work, we provide the first comprehensive experimental analysis on the impact of memory allocation for high-performance query engines. We test five state-of-the-art dynamic memory allocators and discuss their strengths and weaknesses within our DBMS. The right allocator can increase the performance of TPC-DS (SF 100) by 2.7× on a 4-socket Intel Xeon server.

Experimental Study of Memory Allocation for High-Performance Query Processing

Dominik Durner, Viktor Leis, Thomas Neumann
10th International Workshop on Accelerating Analytics and Data Management Systems (ADMS 2019)

About me

Dominik Durner
Scientific Employee
Technische Universität München (TUM)
Department of Informatics
