Sunday, February 4, 2024

DataFusion Comet

Hi!

I recently moved to Rust and have been working on several projects (more insights to come). One of them is DataFusion, an extremely fast SQL query engine.

I will share some posts and code with a few interesting findings. One of them is Comet, a side project that can be used inside Spark as a separate executor written in Rust.


Apache DataFusion Comet intro

Comet is an Apache Spark plugin that uses Apache Arrow DataFusion to accelerate Spark workloads. It is designed as a drop-in replacement for Spark’s JVM-based SQL execution engine and offers significant performance improvements for some workloads.




 

Apache Spark is a stable, mature project that has been developed for many years. It is one of the best frameworks for scaling out the processing of large datasets. However, the Spark community has had to address performance challenges over time through various optimizations. Pain points include (not a full list):

  • JVM memory/CPU overhead 
  • Performance issues 
  • Lack of support for native SIMD instructions


There are a few libraries that address these pain points, such as Arrow and DataFusion. Thanks to a native implementation, a columnar data format, and vectorized data processing, these libraries can outperform Spark's JVM-based SQL engine.
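To make "columnar" and "vectorized" a bit more concrete, here is a minimal sketch in plain Rust. It does not use the Arrow or DataFusion API. The `Row` and `Columns` types and the function names are made up for illustration. The idea is that the same query (sum the prices of rows with a minimum quantity) becomes a pair of tight loops over contiguous arrays, which gives the compiler a chance to use SIMD instructions.

```rust
// Hypothetical types for illustration; this is not the Arrow or DataFusion API.

/// Row-oriented layout: all fields of one record sit next to each other.
struct Row {
    quantity: i64,
    price_cents: i64,
}

/// Column-oriented layout: each field is stored in its own contiguous vector.
struct Columns {
    quantity: Vec<i64>,
    price_cents: Vec<i64>,
}

/// Split a slice of rows into one vector per field.
fn to_columns(rows: &[Row]) -> Columns {
    Columns {
        quantity: rows.iter().map(|r| r.quantity).collect(),
        price_cents: rows.iter().map(|r| r.price_cents).collect(),
    }
}

/// Row-at-a-time: sum the price of every row whose quantity is at least `min`.
fn sum_rows(rows: &[Row], min: i64) -> i64 {
    let mut total = 0;
    for r in rows {
        if r.quantity >= min {
            total += r.price_cents;
        }
    }
    total
}

/// Column-at-a-time: the same query as two passes over contiguous arrays.
/// First build a selection mask, then add up the selected prices.
fn sum_columns(cols: &Columns, min: i64) -> i64 {
    let mask: Vec<bool> = cols.quantity.iter().map(|&q| q >= min).collect();
    cols.price_cents
        .iter()
        .zip(&mask)
        .map(|(&p, &keep)| if keep { p } else { 0 })
        .sum()
}

fn main() {
    let rows = vec![
        Row { quantity: 1, price_cents: 250 },
        Row { quantity: 5, price_cents: 1000 },
        Row { quantity: 3, price_cents: 400 },
    ];
    let cols = to_columns(&rows);
    // Both layouts give the same answer; only the memory access pattern differs.
    assert_eq!(sum_rows(&rows, 3), sum_columns(&cols, 3));
    println!("total = {}", sum_columns(&cols, 3));
}
```

Real engines add a lot on top of this (null bitmaps, batches, typed arrays), so treat it only as a picture of the layout difference.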





High-level functionality

  • Offload performance-critical data processing to the native execution engine 
  • Automated conversion of Spark’s physical plan to a DataFusion plan
  • Native operators for Spark execution (Filter/Project/Aggregation/Join/Exchange)
  • Spark built-in expressions 
  • Easy migration of legacy Spark UDFs and UDAFs
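The plan conversion in the list above can be pictured with a small toy sketch. This is not Comet's code. `SparkPlan`, `NativePlan` and `to_native` are hypothetical names, and the rule that an unsupported operator makes the whole subtree stay on the Spark side is a simplifying assumption of mine.

```rust
// Hypothetical, simplified types; Comet's real plan classes are different.

/// A tiny stand-in for a Spark physical plan node.
#[derive(Debug, Clone, PartialEq)]
enum SparkPlan {
    Scan { table: String },
    Filter { predicate: String, child: Box<SparkPlan> },
    Project { columns: Vec<String>, child: Box<SparkPlan> },
    /// An operator this toy native engine does not know about.
    Unsupported { name: String },
}

/// A tiny stand-in for a native (DataFusion-like) plan node.
#[derive(Debug, Clone, PartialEq)]
enum NativePlan {
    Scan { table: String },
    Filter { predicate: String, input: Box<NativePlan> },
    Project { columns: Vec<String>, input: Box<NativePlan> },
}

/// Convert a Spark subtree to a native plan. If any operator in the subtree
/// is unsupported, return an error so the caller can keep it on the JVM.
fn to_native(plan: &SparkPlan) -> Result<NativePlan, String> {
    match plan {
        SparkPlan::Scan { table } => Ok(NativePlan::Scan { table: table.clone() }),
        SparkPlan::Filter { predicate, child } => Ok(NativePlan::Filter {
            predicate: predicate.clone(),
            input: Box::new(to_native(child)?),
        }),
        SparkPlan::Project { columns, child } => Ok(NativePlan::Project {
            columns: columns.clone(),
            input: Box::new(to_native(child)?),
        }),
        SparkPlan::Unsupported { name } => Err(format!("unsupported operator: {name}")),
    }
}

fn main() {
    // Project <- Filter <- Scan: every node has a native counterpart.
    let supported = SparkPlan::Project {
        columns: vec!["id".to_string()],
        child: Box::new(SparkPlan::Filter {
            predicate: "price > 10".to_string(),
            child: Box::new(SparkPlan::Scan { table: "orders".to_string() }),
        }),
    };
    assert!(to_native(&supported).is_ok());

    // One unknown operator below the filter rejects the whole subtree.
    let unsupported = SparkPlan::Filter {
        predicate: "price > 10".to_string(),
        child: Box::new(SparkPlan::Unsupported { name: "CustomStage".to_string() }),
    };
    assert_eq!(
        to_native(&unsupported),
        Err("unsupported operator: CustomStage".to_string())
    );
}
```

Returning a `Result` instead of panicking keeps the fallback decision with the caller, which is what makes a "drop-in" plugin possible: anything that cannot be translated simply runs as before.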


Why it is interesting


The last feature may not sound impressive, but from a business perspective it is massive: it could allow companies that depend on Java to move to Rust ;-)

Another main point: since DataFusion will soon be a top-level ASF project, Comet, as part of it, will gain more potential and align closely with Spark development.



