Relational Data Access on Big Data

Büchi, Christof and Mathys, Susanne (2013) Relational Data Access on Big Data. Bachelor thesis, HSR Hochschule für Technik Rapperswil.

Abstract

Big Data is a growing topic in information technology, driven by the huge amounts of data that IT systems generate. The ability to store structured and unstructured data at petabyte scale will strengthen business intelligence efforts and widely expand the types of questions these systems can answer. To address changes in data management for large data warehouses (DWH), several new products have been released on top of the Hadoop ecosystem. We focus on storing data for business analytics questions and running queries against it. While the Hadoop Distributed File System (HDFS) has its strengths in parallel processing and in reading files sequentially, we are interested in answering business-related questions based on aggregations. Answering these questions calls for range queries that access subsets of the data. Working with files in Hadoop requires knowledge of the Map-Reduce principle, and different languages are used to access the data. Enterprises therefore have to invest time and money in new systems and know-how for Big Data. A newer approach is to offer the well-known ANSI SQL standard for querying data on Hadoop platforms. Investments already made on the analytics side could then be reused on the new platform. We analyzed the capabilities of IBM BigSQL. To compare performance and semantic differences between IBM BigSQL and a traditional relational DWH solution, we built a distributed IBM DB2 cluster based on IBM Database Partitioning Feature (DPF) technology and used an IBM BigInsights cluster to execute analytical queries. We selected the TPC-H benchmark, which simulates a typical OLAP workload, as the use case. We measured query execution performance as well as scale-out characteristics. The results show that in some cases processing an analytical query on a Big Data platform can be more efficient than on relational DWH systems.
The Hadoop ecosystem has a lot of potential but also brings a number of drawbacks. With these in mind, however, a cost-efficient DWH can be established. Besides the established traditional approach, many Hadoop-based solutions are in development to avoid the negative aspects of the Map-Reduce principle. Based on our research and measurements during this thesis, we concluded that all of the investigated products could be used in a production environment. Which product to recommend for analytical queries depends on many factors, such as costs, query execution time and diversity.

Item Type: Thesis (Bachelor)
Subjects: Topics > Software > Performance
Area of Application > Industry
Area of Application > Data Mining
Technologies > Frameworks and Libraries > Apache Hadoop
Technologies > Databases
Technologies > Databases > Data Warehouse
Technologies > Databases > SQL
Divisions: Bachelor of Science FHO in Informatik > Bachelor Thesis
Depositing User: OST Deposit User
Contributors:
Thesis advisor: Joller, Josef (email: UNSPECIFIED)
Date Deposited: 10 Apr 2014 07:11
Last Modified: 10 Apr 2014 07:11
URI: https://eprints.ost.ch/id/eprint/337
