Loading real-time streaming data into Druid (part 4 of Druid tutorial)
This is part four in a series of tutorials about Druid: a high-performance, scalable, distributed time-series datastore. In this part we will learn how to load real-time streaming data into Druid with Tranquility.
This tutorial, like the previous parts, expects the reader to have basic knowledge of system administration and some experience working in the command line. If you don't yet have a local Druid instance running, or don't have the sample dataset loaded into the database, please refer to the previous parts of this tutorial.
Your questions and suggestions are welcome in the comment section, as always. Examples are tested on macOS, but should work without changes on Linux as well. Windows users are advised to use a virtual machine to emulate a Linux environment, since Druid does not have Windows support.
About Tranquility
Tranquility is a streaming data ingestion tool for the Druid database. It is an official product, developed and maintained by the same team as Druid itself. This utility fits right in with the modern data engineering technology stack: it works out of the box with Kafka, Samza, Spark, Storm, Trident and others. It also provides a simple HTTP API for loading real-time data from other applications.
Tranquility helps in writing streaming applications by handling administrative tasks for you:
- creation of indexing tasks
- partitioning and replication
- service discovery
- schema rollover
For the purpose of this tutorial we will use Tranquility's HTTP API for stream upload.
Tranquility server config
The Druid distribution comes bundled with a sample data generator and a sample config file for Tranquility server. The sample config is located in druid-0.11.0/conf-quickstart/tranquility/server.json. The whole file is just 74 lines of JSON, and its format is somewhat similar to the batch task configuration from the Loading data into Druid database tutorial. A few directives are important to understand:
- dataSources can be either a single JSON object, where keys are datasource names and values are datasource configuration objects (as is the case in the example config), or an array with datasource configuration objects as items
- properties contains global Tranquility properties
- dataSources.<datasource-name>.properties allows you to override global properties for each data source
There are many properties available for configuration in Tranquility, but most of them are specific to one ingestion method. The server.json config file uses seven properties:
- zookeeper.connect - address of the ZooKeeper instance (required)
- druid.discovery.curator.path - service discovery path for Tranquility's internal Apache Curator
- druid.selectors.indexing.serviceName - service name of Druid's overlord node
- http.port - server port (global only)
- http.threads - how many threads to use for HTTP handling (global only)
- task.partitions - how many Druid partitions to create for the task
- task.replicants - how many replicants to create for data in Druid
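To make the relationship between global and per-datasource properties more concrete, here is a minimal Python sketch. The property values and the placement of task.partitions and task.replicants under the datasource are assumptions for illustration, so compare them with your own copy of server.json. The effective_properties helper is hypothetical and only illustrates the override idea; it is not Tranquility's own code.

```python
import json

# Skeleton of a Tranquility server config. Values are illustrative
# assumptions; check them against your own server.json.
config = {
    "dataSources": {
        "metrics": {
            "spec": {},  # dataSchema, ioConfig and tuningConfig omitted here
            "properties": {
                "task.partitions": "1",
                "task.replicants": "1",
            },
        }
    },
    "properties": {
        "zookeeper.connect": "localhost",
        "druid.discovery.curator.path": "/druid/discovery",
        "druid.selectors.indexing.serviceName": "druid/overlord",
        "http.port": "8200",
        "http.threads": "9",
    },
}


def effective_properties(config, datasource):
    """Hypothetical helper: global properties overlaid with per-datasource ones."""
    merged = dict(config.get("properties", {}))
    merged.update(config["dataSources"][datasource].get("properties", {}))
    return merged


if __name__ == "__main__":
    props = effective_properties(config, "metrics")
    print(json.dumps(props, indent=2, sort_keys=True))
```

With these assumed values the merged result contains all seven properties from the list above, and a key repeated under the datasource would win over the global one.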
Datasource dimensions are configured to be schemaless (any dimension key will be accepted by Tranquility). Supported metrics are:
- count
- value_sum (sum aggregation on the value input metric)
- value_min (min aggregation on the value input metric)
- value_max (max aggregation on the value input metric)
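As a rough illustration of what these four metrics mean, the Python sketch below computes them over a handful of events. It deliberately ignores the grouping by timestamp and dimension values that Druid applies before aggregating, so treat it as a simplification rather than a description of Druid's internals. The rollup function name is made up for this example.

```python
def rollup(events, field="value"):
    """Compute count, sum, min and max of `field` over a non-empty list of events."""
    values = [event[field] for event in events]
    return {
        "count": len(values),
        "value_sum": sum(values),
        "value_min": min(values),
        "value_max": max(values),
    }


if __name__ == "__main__":
    # Three events with arbitrary latency values, in the shape the generator emits.
    events = [
        {"page": "/list", "value": 70},
        {"page": "/", "value": 86},
        {"page": "/list", "value": 79},
    ]
    print(rollup(events))
    # {'count': 3, 'value_sum': 235, 'value_min': 70, 'value_max': 86}
```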
Running Tranquility server
First of all, we need to download and unpack Tranquility:
curl -O http://static.druid.io/tranquility/releases/tranquility-distribution-0.8.0.tgz
tar -xzf tranquility-distribution-0.8.0.tgz
cd tranquility-distribution-0.8.0
To start Tranquility server with this config file, execute the following command (assuming the druid-0.11.0 and tranquility-distribution-0.8.0 folders share a common parent folder):
bin/tranquility server -configFile ../druid-0.11.0/conf-quickstart/tranquility/server.json
Keep in mind that Tranquility requires a running ZooKeeper instance. If you receive ZooKeeper connection errors on start, follow the steps described in the first part of the tutorial.
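If you are not sure whether ZooKeeper is up, a quick check before starting Tranquility can save some time. The sketch below only tests that something accepts TCP connections on the given port; it does not verify that the listener is actually ZooKeeper. Port 2181 is ZooKeeper's default client port and is an assumption about your setup.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 2181 is ZooKeeper's default client port (an assumption about your setup).
    print("ZooKeeper port reachable:", is_listening("localhost", 2181))
```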
Generating mock metrics
The sample data generator utility is located in bin/generate-example-metrics inside the druid-0.11.0 folder. Let's run it from there and examine the output:
$ ./bin/generate-example-metrics
{"unit": "milliseconds", "http_method": "GET", "value": 70, "timestamp": "2018-02-28T21:12:00Z", "http_code": "200", "page": "/list", "metricType": "request/latency", "server": "www1.example.com"}
{"unit": "milliseconds", "http_method": "GET", "value": 86, "timestamp": "2018-02-28T21:12:00Z", "http_code": "200", "page": "/", "metricType": "request/latency", "server": "www2.example.com"}
{"unit": "milliseconds", "http_method": "GET", "value": 79, "timestamp": "2018-02-28T21:12:00Z", "http_code": "200", "page": "/list", "metricType": "request/latency", "server": "www1.example.com"}
...
As you can see, generate-example-metrics simulates simple web server logs in JSON. Metrics in this format can be forwarded directly to Tranquility without any additional transformations. We can do it just by piping the output of generate-example-metrics to curl:
bin/generate-example-metrics | curl -XPOST -H'Content-Type: application/json' --data-binary @- http://localhost:8200/v1/post/metrics
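The same upload can also be done from application code. Below is a minimal Python sketch that uses only the standard library and sends events as one JSON object per line, mirroring what the curl command does. The endpoint is taken from the command above; the function names and the sample event are made up for illustration.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Endpoint taken from the curl command in this tutorial.
TRANQUILITY_URL = "http://localhost:8200/v1/post/metrics"


def build_request(events, url=TRANQUILITY_URL):
    """Build a POST request whose body holds one JSON object per line."""
    body = "\n".join(json.dumps(event) for event in events).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(events, url=TRANQUILITY_URL, timeout=10):
    """Send events to Tranquility server and return the response body as text."""
    with urllib.request.urlopen(build_request(events, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Current UTC time, in the same format as the generator output.
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    sample = [{"timestamp": now, "page": "/list", "server": "www1.example.com", "value": 70}]
    request = build_request(sample)
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # Uncomment once Tranquility server is running:
    # print(send(sample))
```

Building the request separately from sending it makes it easy to inspect the payload before any data reaches Druid.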
Posting the events starts a real-time indexing task for our data in the Druid Indexing Service. Uploaded data becomes available for querying immediately.
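To check that the events really arrived, we can ask Druid for per-minute totals over the last hour. The sketch below builds a timeseries query for the metrics datasource. The broker address is the usual quickstart default and is an assumption, as are the aggregator types, which should match the column types in your config. The helper names are made up for this example.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumed quickstart broker address; adjust it to your setup.
BROKER_URL = "http://localhost:8082/druid/v2/?pretty"


def recent_interval(now, hours=1):
    """Return an ISO-8601 interval covering the last `hours` hours before `now`."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start = now - timedelta(hours=hours)
    return f"{start.strftime(fmt)}/{now.strftime(fmt)}"


def build_timeseries_query(interval, datasource="metrics"):
    """Build a Druid timeseries query over the metrics described in this tutorial."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "minute",
        "intervals": [interval],
        "aggregations": [
            {"type": "longSum", "name": "events", "fieldName": "count"},
            {"type": "doubleSum", "name": "value_sum", "fieldName": "value_sum"},
            {"type": "doubleMin", "name": "value_min", "fieldName": "value_min"},
            {"type": "doubleMax", "name": "value_max", "fieldName": "value_max"},
        ],
    }


def run_query(query, url=BROKER_URL, timeout=10):
    """POST the query to the broker and return the response body as text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    query = build_timeseries_query(recent_interval(datetime.now(timezone.utc)))
    print(json.dumps(query, indent=2))
    # Uncomment once Druid is running and data has been posted:
    # print(run_query(query))
```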
Summary
In this part of the Druid tutorial we learned how to load real-time streaming data into Druid using Tranquility. Upcoming tutorials will cover a number of interesting topics, such as:
- advanced Druid queries
- real-time ingestion of Avro-encoded events from Kafka into our Druid cluster using Tranquility
- visualizing data in Druid with Swiv (formerly Pivot) and Superset
Thank you for the contribution. It has been approved.
I think that for future tutorials it would be better if you said what the users will learn per section of your tutorial, instead of just saying what they will learn in general. Also make sure to really explain why they should do it like you show them to, but of course, this is all my opinion!
You can contact us on Discord.
[utopian-moderator]
All fair remarks.
That was my original plan. Unfortunately it's often the case that I change the learning curriculum mid-way for many reasons: topic might end up being simpler than I expected or the other way around - too long and complicated. So I tried to keep it intentionally vague. It sure would be better for users to know the curriculum ahead of time, so I'll do my best to improve in this regard.
I've explained advantages of Tranquility in About Tranquility section. I will keep in mind the need to justify my choices on every step of the way in upcoming tutorials ;)
Thanks for detailed review, @amosbastian.