In this blog post, I'll try to explain why we moved from ElasticStack to Quickwit and Grafana, and why we chose this stack over other solutions.
First, we've been in the observability world for quite some time and have been using ElasticStack for years. I have personally used Elasticsearch for more than 10 years, and Apache Solr before that, for logging and observability use cases, even before Elasticsearch was born!
We also successfully used ElasticStack for IoT (Internet of Things) projects and rebuilt our own images of Kibana and Elasticsearch for ARM32 and ARM64 before Elastic (the company) started releasing official images. We had a lot of fun with it.
However, everyone who runs it on premises knows that ElasticStack is a big distributed system that brings a lot of struggles, such as:
- Log retention, because the data lives on the filesystem and disk storage is expensive [1]
- Like most highly distributed databases written in Java, it has a very large footprint and consumes a lot of RAM
- You also run into issues such as "split brain" when dealing with HA (High Availability)
On the other hand, there are SaaS (Software as a Service) observability solutions such as Datadog or Elastic Cloud, which save you the trouble of managing clusters but are very expensive. And even putting the price aside, most of our customers are required to keep all their data on infrastructure they own.
That being said, Grafana offers an alternative called Grafana Loki, which stores its data on object storage. The idea of using object storage is great because most of the big cloud providers implement HA for it by design, and it lowers the price a lot. Moreover, even on premises, you often want to ensure the HA of as few components as possible, the object storage among them.
However, we weren't convinced, because Loki doesn't implement a real search engine such as Apache Lucene, which is used by both Elasticsearch and Solr. It also appears to be very slow, with bad feedback from the community such as this one.
So we were looking for a solution that combines the advantages of both worlds: an efficient search engine that compensates for the slowness introduced by the object storage API.
And then we discovered Quickwit \o/.
Quickwit is built on top of Tantivy, which is similar to Lucene but written in Rust [2], and it also stores the indexed data on object storage. That's the main reason why Quickwit is better than Loki [3] and Elasticsearch, in my opinion.
Quickwit also brings lots of integrations with the CNCF ecosystem [4]:
- A datasource for Grafana
- OpenTelemetry interoperability for traces and logs ingestion
- Jaeger's gRPC API interoperability, which allows us to use Quickwit as a storage backend for traces and keep the Jaeger UI or the Jaeger datasource in Grafana. To our knowledge, this is the only solution to store Jaeger traces on object storage
- Elasticsearch or OpenSearch [5] API interoperability
- Falcosidekick which can use Quickwit as an output
- Glasskube, which makes installing Quickwit on Kubernetes easier [6]
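To give an idea of what the Elasticsearch API interoperability means in practice, here is a minimal Python sketch that only builds an Elasticsearch-style `_search` request for a Quickwit index, without sending it. The base URL (a local node on port 7280), the `/api/v1/_elastic` prefix, the `body` field and the `build_search_request` helper are assumptions made for illustration, so check them against the Quickwit documentation for your version.

```python
import json
from urllib.parse import quote

# Assumptions (not from this post): a local Quickwit node on its default REST
# port, with the Elasticsearch-compatible routes under /api/v1/_elastic.
QUICKWIT_URL = "http://localhost:7280"
ES_PREFIX = "/api/v1/_elastic"


def build_search_request(index, text, field="body", size=10):
    """Return (url, json_body) for an Elasticsearch-style _search call.

    Hypothetical helper: it only builds the request, it does not send it.
    """
    url = f"{QUICKWIT_URL}{ES_PREFIX}/{quote(index, safe='')}/_search"
    body = {
        # Same query DSL shape an Elasticsearch client would send.
        "query": {"match": {field: text}},
        "size": size,
    }
    return url, json.dumps(body)


url, payload = build_search_request("otel-logs-v0_7", "timeout")
print(url)
print(payload)
```

The resulting URL and payload can then be sent with any HTTP client, or the same query can be issued by a tool that already speaks the Elasticsearch API.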
That's why we decided to offer Quickwit as the main observability solution in the cwcloud DaaS (Deployment as a Service) platform. You can check out this tutorial for more information.
Moreover, we also started migrating most of our customers' infrastructures to Quickwit instances, and we recommend that they design their new applications with the OpenTelemetry SDK available for their stack when possible, or use Vector from Datadog, which brings a lot of advantages as well:
- It's very fast and has a very low footprint compared to some other well-known solutions such as Fluent Bit, Logstash and even Filebeat from ElasticStack (probably because it's written in Rust :p ).
- It provides the very powerful VRL (Vector Remap Language) to remap your logs and make them compliant with existing index mappings [7].
- It works with Kubernetes, but also with Docker and even with logs written to the filesystem by legacy applications. This is very convenient for us because, as explained in my previous blog post Docker in production, is it really bad?, we have a lot of customers who use Docker in production (through cwcloud's DaaS) instead of Kubernetes.
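To illustrate the kind of transformation a remap step performs, here is a small sketch written in Python rather than VRL. It reshapes a Docker log event into a record that looks like an OpenTelemetry log. The input fields (`timestamp`, `message`, `container_name`, `stream`, `image`), the output field names and the stderr-to-ERROR rule are all assumptions for the example, not the actual mapping of the `otel-logs-v0_7` index.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_nanos(rfc3339):
    """Convert an RFC 3339 UTC timestamp to integer nanoseconds since epoch."""
    ts = datetime.fromisoformat(rfc3339.replace("Z", "+00:00"))
    delta = ts - EPOCH
    seconds = delta.days * 86400 + delta.seconds
    # Integer arithmetic avoids float rounding on large nanosecond values.
    return seconds * 10**9 + delta.microseconds * 1000


def remap_docker_log(event):
    """Reshape a (hypothetical) Docker log event into an OTel-like record."""
    return {
        "timestamp_nanos": to_nanos(event["timestamp"]),
        "service_name": event.get("container_name", "unknown"),
        # Illustrative rule only: treat anything written to stderr as an error.
        "severity_text": "ERROR" if event.get("stream") == "stderr" else "INFO",
        "body": {"message": event["message"]},
        "resource_attributes": {"container.image": event.get("image", "")},
    }


record = remap_docker_log({
    "timestamp": "2024-01-01T00:00:00Z",
    "message": "connection refused",
    "container_name": "api",
    "stream": "stderr",
    "image": "example/api:1.0",
})
print(record)
```

In a real pipeline this logic would live in a VRL program inside a Vector remap transform, so that the records reaching Quickwit already match the target index.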
For most of them, as for our own internal use, we have divided compute consumption by at least 3 while increasing retention. Larger companies have successfully built astronomical logging services with Quickwit, such as Binance with 100PB of stored data.
So Quickwit now covers our observability needs for logs and traces, but we still miss metrics. For the metrics use case we're using VictoriaMetrics, which works pretty well but lacks support for object storage. We know that Quickwit plans to handle this use case one day with a real TSDB (Time Series Database), which sounds really promising. I'm quite convinced that separating compute from storage and offering object storage is now a key success factor for building modern observability solutions.
To conclude, I still think ElasticStack is a great product, backed by a bigger company that provides more advanced features, including AI (Artificial Intelligence) capabilities. I might still offer it to customers who are interested in some of those features, or who use Elasticsearch as a full-text search engine that some applications or microservices depend on (Quickwit isn't the best choice in this case; it's more suitable for observability use cases only).
1. We know that Elasticsearch provides object storage compatibility with the searchable snapshot feature, but on the one hand it's not available in the open source version, and on the other hand it's only recommended for cold data that isn't supposed to be fetched too often.
2. Tantivy is 2x faster than Lucene according to this benchmark, which compensates for the slowness introduced by the object storage.
3. Quickwit also provides this benchmark against Loki, trying to make a fair comparison.
4. I'm personally involved in contributing to a lot of them, commissioned by Quickwit Inc. (the company).
5. OpenSearch is a fork of ElasticStack initiated by Amazon Web Services (AWS).
6. I wrote a blog post directly on the Quickwit blog if you want more information.
7. You can see an example of a remap function that makes the Docker logs compliant with the default otel-logs-v0_7 index in this tutorial.