Sunday, February 4, 2024

Datafusion Comet

Hi!

Recently I moved to Rust and have been working on several projects - more insights to come ... One of them is DataFusion - an extremely fast SQL query engine.

I will have some posts/code to share with a few interesting findings. One of them is Comet - a side project that can be used inside Spark as a separate execution engine (written in Rust).


Apache DataFusion Comet intro

Comet is an Apache Spark plugin that uses Apache Arrow DataFusion to accelerate Spark workloads. It is designed as a drop-in replacement for Spark's JVM-based SQL execution engine and offers significant performance improvements for some workloads.




 

Apache Spark is a stable, mature project that has been developed for many years. It is one of the best frameworks for scaling out the processing of large-scale datasets. However, the Spark community has had to address performance challenges requiring various optimizations over time. Pain points include (not a full list):

  • JVM memory/CPU overhead 
  • Performance issues 
  • Lack of support for native SIMD instructions 


There are a few libraries, such as Arrow and DataFusion, that address these issues. Using native implementation, a columnar data format, and vectorized data processing, these libraries can outperform Spark's JVM-based SQL engine.
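To give a feel for why the columnar, vectorized approach wins, here is an illustrative sketch (this is NOT Comet or DataFusion code - just the underlying idea): summing one field over a contiguous column gives the compiler a tight, SIMD-friendly loop instead of per-row pointer chasing.

```rust
// Row-based layout: each record carries all its fields together.
struct Row {
    price: f64,
    _qty: u32,
}

fn sum_rows(rows: &[Row]) -> f64 {
    // the summed field is interleaved with other fields in memory
    rows.iter().map(|r| r.price).sum()
}

fn sum_column(prices: &[f64]) -> f64 {
    // columnar layout: contiguous f64s, trivially auto-vectorizable
    prices.iter().sum()
}

fn main() {
    let rows: Vec<Row> = (0..4).map(|i| Row { price: i as f64, _qty: 1 }).collect();
    let prices: Vec<f64> = (0..4).map(|i| i as f64).collect();
    assert_eq!(sum_rows(&rows), sum_column(&prices));
    println!("{}", sum_column(&prices)); // prints 6
}
```

Both functions compute the same result; the difference is memory layout, which is exactly what Arrow's columnar format exploits.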





High-level functionality

  • Offload performance-critical data processing to the native execution engine 
  • Automated conversion of Spark's physical plan -> DataFusion plan 
  • Native operators for Spark execution (Filter/Project/Aggregation/Join/Exchange) 
  • Spark built-in expressions 
  • Easy migration of legacy Spark UDFs and UDAFs


Why it is interesting


The last feature may not sound impressive, but from a business perspective it is massive - it could allow companies that are dependent on Java to move to Rust ;-)

Another key point: since DataFusion will soon become a top-level ASF project, Comet, as part of it, will gain more momentum and will closely align with Spark development. 




Others 

Tuesday, October 3, 2023

Application Tracer

Hi,

This is one of my first Rust apps.  

I use it to benchmark long-running applications - like server/streaming solutions.

Tracer

Live terminal monitoring of applications.


Why

Created it for 2 reasons:
- to check/learn how to create and manage full Rust applications using the whole ecosystem - crates/builds/publishing
- a personal need for a simple monitor showing application CPU/memory usage with a "graphical interface" usable in a terminal window. In general, a simplified version of data collectors and Grafana.


Code




Monitor a live application either as a child process or by a separate PID, collecting/displaying stats (CPU usage, memory usage).
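The child-process mode can be sketched roughly like this (hypothetical code, not the actual app_tracer implementation): spawn the target, then poll its resident memory from /proc/<pid>/statm once per refresh interval. Linux-only, and it assumes 4 kB memory pages.

```rust
use std::process::{Command, Stdio};
use std::thread::sleep;
use std::time::Duration;

// Read resident memory (in kB) of a process from /proc/<pid>/statm.
fn rss_kb(pid: u32) -> Option<u64> {
    let statm = std::fs::read_to_string(format!("/proc/{}/statm", pid)).ok()?;
    // second field of statm is the resident set size, in pages
    let pages: u64 = statm.split_whitespace().nth(1)?.parse().ok()?;
    Some(pages * 4) // assumes 4 kB pages
}

fn main() {
    // spawn the monitored application as a child process
    let mut child = Command::new("sleep")
        .arg("2")
        .stdout(Stdio::null())
        .spawn()
        .expect("failed to spawn child");
    // poll and print one reading per refresh interval
    for _ in 0..2 {
        if let Some(kb) = rss_kb(child.id()) {
            println!("memory: {} [kB]", kb);
        }
        sleep(Duration::from_millis(500));
    }
    let _ = child.wait();
}
```

The real tool also samples CPU usage and supports attaching to an existing PID instead of spawning a child, but the polling loop is the same shape.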



UI (TUI)




Build

cargo build -r

Run


Create an example app:
cargo build --example test_app

Run in text mode with output persisted to the out.csv file:
cargo run -r -- -n -o out.csv /opt/workspace/app_tracer/target/debug/examples/test_app


Usage

  
app-tracer 0.4.0
Tracing / benchmarking long running applications (ie: streaming).

USAGE:
    tracer [OPTIONS] [APPLICATION]

ARGS:
    <application>    Application to be run as child process (alternatively provide PID of
                     running app)

OPTIONS:
    -h, --help                 Print help information
    -l, --log <log>            Set custom log level: info, debug, trace [default: info]
    -n, --noui                 No UI - only text output
    -o, --output <output>      Name of output CSV file with all readings - for further investigations
    -p, --pid <pid>            PID of external process
    -r, --refresh <refresh>    Refresh rate in milliseconds [default: 1000]
    -V, --version              Print version information

      

Example output

 cargo run -r -- -n -o out.csv /opt/workspace/app_tracer/target/debug/examples/test_app 

 Compiling app-tracer v0.4.0 (/opt/workspace/app_tracer)
 Finished release [optimized] target(s) in 2.98s
 Running `target/release/tracer -n -o out.csv /opt/workspace/app_tracer/target/debug/examples/test_app`

12:26:12.260 (t: main) INFO - tracer - Application to be monitored is: test_app, in dir /opt/workspace/app_tracer/target/debug/examples/test_app
12:26:12.261 (t: main) INFO - tracer - Refresh rate: 1000 ms.
12:26:12.261 (t: main) INFO - tracer - Output readings persisted into "out.csv".
12:26:12.261 (t: main) INFO - tracer - Starting with PID::15008
12:26:12.296 (t: main) INFO - tracer - Running in TXT mode.
12:26:13.298 (t: main) INFO - tracer - CPU: 0 [%], memory: 2208 [kB]
12:26:14.303 (t: main) INFO - tracer - CPU: 0.0030129354 [%], memory: 2208 [kB]
12:26:15.308 (t: main) INFO - tracer - CPU: 0.0054045436 [%], memory: 2208 [kB]
12:26:16.309 (t: main) INFO - tracer - CPU: 0.0023218023 [%], memory: 2208 [kB]
12:26:17.311 (t: main) INFO - tracer - CPU: 0.006252239 [%], memory: 2208 [kB]
12:26:18.312 (t: main) INFO - tracer - CPU: 0.0036088445 [%], memory: 2208 [kB]
12:26:19.317 (t: main) INFO - tracer - CPU: 0.0057060686 [%], memory: 2208 [kB]
12:26:20.318 (t: main) INFO - tracer - CPU: 0.005099413 [%], memory: 2208 [kB]
12:26:21.318 (t: main) INFO - tracer - CPU: 0.007175615 [%], memory: 2208 [kB]
12:26:22.319 (t: main) INFO - tracer - CPU: 0.005251118 [%], memory: 2208 [kB]
12:26:23.319 (t: main) INFO - tracer - CPU: 0.0021786916 [%], memory: 2208 [kB]
12:26:24.321 (t: main) INFO - tracer - CPU: 0.006866733 [%], memory: 2208 [kB]




CSV persistence


Example output.csv file:
  
Time,Cpu,Mem
11:27:16.394591,0,2136
11:27:17.396917,0.004986567,2136
11:27:18.397440,0.006548807,2136
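Such a file can be produced with plain std I/O; here is a minimal sketch (hypothetical code, not the actual app_tracer implementation) writing the same Time,Cpu,Mem layout as the example above.

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

// Write one header row, then one Time,Cpu,Mem row per reading.
fn write_readings(path: &str, readings: &[(&str, f32, u64)]) -> std::io::Result<()> {
    let mut w = BufWriter::new(File::create(path)?);
    writeln!(w, "Time,Cpu,Mem")?;
    for (time, cpu, mem) in readings {
        writeln!(w, "{},{},{}", time, cpu, mem)?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    write_readings("out.csv", &[("11:27:16.394591", 0.0, 2136)])
}
```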




Note: For monitoring one-shot applications - see https://github.com/yarenty/app_benchmark.

Sunday, September 10, 2023

Benchmarker

Hi,

This is one of my first Rust apps.  

I use it to benchmark an application - run it multiple times and get readings + graphs.

Benchmark

Benchmarking data collector - runs an application as a child process, collecting stats (time, CPU usage, memory usage) and generating benchmarking reports.



Why

Created it for 2 reasons:
- to check/learn how to create and manage full Rust applications using the whole ecosystem - crates/builds/publishing
- a personal need to get benchmarks for various other projects

Code




High-level idea

  • run the application multiple times

  • collect all readings of interest:

    • time
    • CPU
    • memory
  • process outputs and provide results as:

    • CSV/Excel
    • graphs

Save outputs to a local DB/file to track slowdowns/speedups in the next release of an application.


Methodology

For each benchmark run:

  • run multiple times (default 10)
  • remove outliers
  • average output results
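The outlier-trimming step above can be sketched as follows (illustrative code, not the actual benchmark implementation): sort the collected readings, drop the single lowest and highest values, and average the rest.

```rust
// Average a set of readings after dropping the min and max as outliers.
fn trimmed_mean(mut xs: Vec<f64>) -> f64 {
    assert!(xs.len() >= 3, "need at least 3 readings to trim");
    xs.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let kept = &xs[1..xs.len() - 1]; // drop one min and one max
    kept.iter().sum::<f64>() / kept.len() as f64
}

fn main() {
    // e.g. five time readings in ms, with 95 as an obvious outlier
    let runs = vec![30.0, 40.0, 41.0, 42.0, 95.0];
    println!("avg: {} ms", trimmed_mean(runs)); // prints "avg: 41 ms"
}
```

A real run uses 10 samples by default (the `-r/--runs` option), but the trimming logic is the same.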


Build

cargo build -r --bin benchmark 

Usage

benchmark 0.1.0
Benchmarking data collector.

USAGE:
    benchmark [OPTIONS] <APPLICATION>

ARGS:
    <APPLICATION>    Application path (just name if it is in the same directory)

OPTIONS:
    -h, --help           Print help information
    -l, --log <LOG>      Set custom log level: info, debug, trace [default: info]
    -r, --runs <RUNS>    Number of runs to be executed [default: 10]
    -V, --version        Print version information

Example output

09:33:24.899 (t: main) INFO - benchmark - Application to be benchmark is: /opt/workspace/ballista/target/release/examples/example_processing
09:33:24.899 (t: main) INFO - benchmark - Number of runs: 10
09:33:24.902 (t: main) INFO - benchmark - Collecting data::example_processing
09:33:24.902 (t: main) INFO - benchmark::bench::analysis - Run 0 of 10
09:33:24.947 (t: main) INFO - benchmark::bench::analysis - Run 1 of 10
09:33:24.983 (t: main) INFO - benchmark::bench::analysis - Run 2 of 10
09:33:25.016 (t: main) INFO - benchmark::bench::analysis - Run 3 of 10
09:33:25.049 (t: main) INFO - benchmark::bench::analysis - Run 4 of 10
09:33:25.087 (t: main) INFO - benchmark::bench::analysis - Run 5 of 10
09:33:25.132 (t: main) INFO - benchmark::bench::analysis - Run 6 of 10
09:33:25.188 (t: main) INFO - benchmark::bench::analysis - Run 7 of 10
09:33:25.238 (t: main) INFO - benchmark::bench::analysis - Run 8 of 10
09:33:25.288 (t: main) INFO - benchmark::bench::analysis - Run 9 of 10
09:33:25.338 (t: main) INFO - benchmark - Processing outputs
0.04,130,18752,
0.03,140,18664,
0.03,156,18856,
0.03,153,18868,
0.04,152,18884,
0.04,140,18904,
0.05,136,19404,
0.05,145,19220,
0.05,137,18780,
0.05,138,18788,
09:33:25.339 (t: main) INFO - benchmark::bench::collector - SUMMARY:
09:33:25.339 (t: main) INFO - benchmark::bench::collector - Time [ms]:: min: 30, max: 50, avg: 41 ms
09:33:25.339 (t: main) INFO - benchmark::bench::collector - CPU [%]:: min: 130, max: 156, avg: 142.7 %
09:33:25.339 (t: main) INFO - benchmark::bench::collector - Memory [kB]:: min: 18664, max: 19404, avg: 18912 kB

Process finished with exit code 0


Also, in the current directory of the benchmark app an output directory is created, named "bench_<your_app_name>", ie: bench_example_processing, which contains:

Output CSV file:

Time,Cpu,Mem
0.04,130,18752
0.03,140,18664
0.03,156,18856
0.03,153,18868
0.04,152,18884
0.04,140,18904
0.05,136,19404
0.05,145,19220
0.05,137,18780
0.05,138,18788

and output graphs:

summary report: summary_report.txt

TEST

cargo build --example test_app -r   

cargo run --bin benchmark -- /opt/workspace/app_banchmark/target/release/examples/test_app   

cargo run --bin benchmark -- "/opt/workspace/app_banchmark/target/release/examples/test_app -additional -app -params"  


TODO:

  • incremental runs - use date/time in output dir
  • local db / or file struct to see changes with time/application trends
  • move away from the GNU time dependency to sysinfo



Note: For monitoring long-running processes like servers / streaming apps - see https://github.com/yarenty/app_tracer.

Saturday, January 29, 2022

Web 3 - blockchain layers

Layers from a blockchain perspective.


My plan is to write 5 articles: 

1 Intro: Web 1.. 2.. 3..

2 Layers in crypto.  [this one]

3 Applications - not only DeFi!

4 Decentralisation

5 Summary - where we are, where to look, why we should join





Layer 1

Layer 1 refers to the underlying blockchain architecture, i.e., the actual blockchain itself. In the case of Bitcoin, it is the BTC network launched in 2009.


Layer 2

Layer 2 refers to various protocols that are built on top of layer 1 to improve the original blockchain’s functionality. Layer 2 protocols often use off-chain processing elements to solve the speed and cost inefficiencies of the layer 1 network. Examples of layer 2 platforms for Bitcoin include Lightning Network and Liquid Network.


Layer 3

Layer 3 is represented by blockchain-based applications, such as decentralized finance (DeFi) apps, games, or distributed storage apps. Many of these applications also have cross-chain functionality, helping users access various blockchain platforms via a single app.





Layer-1 scaling refers to solutions that improve the base protocol itself to make the overall system much more scalable. The two most common layer-1 solutions are consensus protocol changes and sharding.


When it comes to consensus protocol changes, projects like Ethereum are moving from older, clunky consensus protocols such as proof-of-work (PoW) to much faster and less energy-wasteful protocols such as proof-of-stake (PoS). 


Sharding is one of the most popular layer-1 scalability methods out there as well. Instead of making a network sequentially work on each transaction, sharding breaks these transaction sets into small data sets which are known as "shards," and these can then be processed by the network in parallel. 


One of the pros when it comes to layer-1 solutions is that there is no need to add anything on top of the existing infrastructure.





Layer 2 is a term used for solutions created to help scale an application by processing transactions off of the layer 1 mainnet (for example, the Ethereum Mainnet) while still maintaining the same security measures and decentralization as the mainnet. Layer 2 solutions increase throughput (transaction speed) and reduce gas fees. Popular examples include Polygon for Ethereum, and the Lightning Network and Liquid Network for Bitcoin.


Layer 2 solutions are important because they allow for scalability and increased throughput while still holding the integrity of the Ethereum blockchain, allowing for complete decentralization, transparency, and security while also reducing the carbon footprint (less gas means less energy used, which equates to less carbon).


Although the Ethereum blockchain is the most widely used blockchain and arguably the most secure, that doesn’t mean it doesn’t come with some shortcomings. The Ethereum Mainnet is known to have slow transaction times (around 13 transactions per second) and expensive gas fees. Layer 2s are built on top of the Ethereum blockchain, keeping transactions secure, speedy, and scalable.


Each individual solution has its own pros and cons to consider such as throughput, gas fees, security, scalability, and of course functionality. No single layer 2 solution currently fulfills all these needs. However, there are layer 2 scaling solutions which aim to improve all these aspects; these solutions are called rollups.



There are three properties of a layer 2 rollup: 


1. Transactions are executed outside of layer 1 (reduces gas fees)

2. Data and proof of transactions reside on layer 1 (maintains security)

3. A rollup smart contract, which is found on layer 1, can enforce proper transaction execution on layer 2 by using the transaction data stored on layer 1








Layer 3 is often referred to as the application layer. It is a layer that hosts DApps and the protocols that enable the apps. While some blockchains such as Ethereum or Solana (SOL) have a thriving variety of layer 3 apps, Bitcoin is not optimized to host such applications.


As such, layer 2 solutions are the furthest deviations from the core network that Bitcoin currently has. Some projects are trying to bring DApp functionality to the BTC ecosystem via forks of the original BTC network.


For instance, CakeDeFi is a DeFi app offering services such as staking, lending, and liquidity mining to BTC coin holders. CakeDeFi is based on a fork of Bitcoin called DeFiChain. DeFiChain maintains “an anchor” to the core BTC chain for some of its operations, but technically speaking, it is still a separate blockchain of its own.


Some industry observers believe that the lack of DApp functionality is one of the biggest limitations of BTC. Ever since Ethereum’s arrival in 2015, layer 3 platforms have been growing strongly in popularity and value. Ethereum currently has close to 3,000 layer 3 apps. The DeFi apps based on the blockchain hold a total value of $185 billion by now.


Another leading blockchain, Solana, hosts over 500 layer 3 DApps, and the total value locked in the DeFi apps of the network is approaching $15 billion.


In comparison, BTC has no functioning app that could be clearly defined as a layer 3 application. There is an ongoing debate about whether projects designed to “force in” DApp functionality onto BTC are worth the effort. Some in the industry argue that BTC will always remain a network designed for crypto fund transfers, not DApps.


These people point out that the layer 1 BTC chain enjoys an industry-leading market cap (of $1.3 trillion by now) that dwarfs all the TVL and market cap figures of all layer 3 projects in existence combined. As such, Bitcoin may not be in any urgent need of layer 3 functionality, at least judging from the financial figures.







Summary



Blockchain platforms may have three distinct layers. Layer 1 refers to the actual underlying blockchain, with its core architecture and functionality. Examples of layer 1 networks are the Bitcoin, Ethereum, and Solana blockchains.


Layer 2 protocols are built on top of layer 1 networks and extend some functionality of the underlying blockchain. For example, they may offer faster speeds and lower transaction costs than layer 1.


Layer 2 protocols often use a combination of on-chain and off-chain operations to offer their extended functional capabilities. Examples of layer 2 projects on Bitcoin include the Lightning Network and Liquid Network platforms.


Layer 3 refers to the protocols that enable DApps on the blockchain. While some other blockchains have a large collection of layer 3 apps, the BTC blockchain has none of them. Some projects attempt to bring layer 3 functionality into the BTC ecosystem by using apps designed on forks of BTC.


However, these apps are still based on their own blockchains, not on the core BTC blockchain. There is a debate about whether BTC even needs to move towards enabling the layer 3 functionality. Some industry analysts argue that BTC is worth multiple times more than all these layer 3 apps combined, and therefore, it does not have a pressing need for layer 3 at all.


Web3 - next big bang!

 Short article about Web3 - what it is and why I personally think that for the next 3 years it will be a must-know / must-use / must-be-in!

 

 

My plan is to write 5 articles:

 

1. Intro: Web 1.. 2.. 3..   [this one]

2. Layers in crypto.

3. Applications - not only DeFi!

4. Decentralisation

5. Summary - where we are, where to look, why we should join

 

 

 




 

 

 

Web3 is an idea for a new iteration of the World Wide Web based on the blockchain, which incorporates concepts including decentralization and token-based economics. Some technologists and journalists have contrasted it with Web 2.0, wherein they say data and content are centralized in a small group of companies sometimes referred to as "Big Tech".

 






 

Web 1.0 and Web 2.0 refer to eras in the history of the World Wide Web as it evolved through various technologies and formats. Web 1.0 refers roughly to the period from 1991 to 2004, where most websites were static web pages, and the vast majority of users were consumers, not producers, of content.

Web 2.0 is based around the idea of "the web as platform" and centers on user-created content uploaded to social media and networking services, blogs, and wikis, among other services. Web 2.0 is generally considered to have begun around 2004 and continues to the current day.

 

 

 

Visions for Web3 differ, but they revolve around the idea of decentralization and often incorporate blockchain technologies, such as various cryptocurrencies and non-fungible tokens (NFTs). Bloomberg has described Web3 as an idea that "would build financial assets, in the form of tokens, into the inner workings of almost anything you do online". Some visions are based on the concept of decentralized autonomous organizations (DAOs). Decentralized finance (DeFi) is another key concept; in it, users exchange currency without bank or government involvement. Self-sovereign identity allows users to identify themselves without relying on an authentication system such as OAuth, in which a trusted party has to be reached in order to assess identity. Technology scholars have argued that Web3 would likely run in tandem with Web 2.0 sites likely adopting Web3 technologies in order to keep their services relevant.

 

 





 

To believers, Web3 represents the next phase of the internet and, perhaps, of organizing society. Web 1.0, the story goes, was the era of decentralized, open protocols, in which most online activities involved navigating to individual static webpages. Web 2.0, which we’re living through now, is the era of centralization, in which a huge share of communication and commerce takes place on closed platforms owned by a handful of super-powerful corporations—think Google, Facebook, Amazon—subject to the nominal control of centralized government regulators. Web3 is supposed to break the world free of that monopolistic control.

 

At the most basic level, Web3 refers to a decentralized online ecosystem based on the blockchain. Platforms and apps built on Web3 won’t be owned by a central gatekeeper, but rather by users, who will earn their ownership stake by helping to develop and maintain those services.

 

Gavin Wood coined the term Web3 (originally Web 3.0) in 2014. At the time, he was fresh off helping develop Ethereum, the cryptocurrency that is second only to Bitcoin in prominence and market size. Today he runs the Web3 Foundation, which supports decentralized technology projects, as well as Parity Technologies, a company focused on building blockchain infrastructure for Web3. Wood has spoken at length about where Web 2.0 went wrong, his vision of the future, and why we all need to be less trusting.

 

 




 

 


 

 

As technology continues to take center stage as a key differentiator for companies, new tech trends for 2022 are beginning to emerge. These trends are largely reflective of the changing realities around us. Successive global lockdowns have opened up a world of possibilities around virtual experiences and digital interactions. The increased urgency around resource scarcity, both human and natural, has also led to the introduction of technology geared towards efficiency and sustainability. These growing trends are part of a larger directional shift of the world wide web to Web 3.0.

 

 

To understand the concept of Web 3.0, we need to take a step back and understand the larger evolution of the internet that brought us to this point.

 

Web 1.0, the original conception of the internet, was a largely static, one-to-many format where users could view web pages but do little beyond that.

 

Web 2.0 introduced the concept of a worldwide community of internet users, encouraging them to form social media groups, interact with each other and create virtual experiences. Web 2.0, the current form of the internet, is still by large the most influential, but Web 3.0 is now slowly coming into its own.

 

Web 3.0 takes the notion of ‘community’ and expands it to include community ownership and regulation of the internet as a whole. Three key philosophies are involved in the definition of Web 3.0:

 

The internet is ‘open’ and is built with open-source software that anyone can create, utilize and modify.

Interactions between users are not governed by a trusted third-party regulatory body.

Anyone can participate, without requiring permission from governments or regulatory bodies.

We can already see many manifestations of web 3.0 all around us. Cloud technology and artificial intelligence are some of the most prevalent forms of web 3.0 today.

 

In 2022, we will become increasingly familiar with the concept of a “metaverse” – persistent digital worlds that exist in parallel with the physical world we live in – Forbes

 

 



 

 


 

Here are a few of the top technological trends of 2022 that Web 3.0 has given rise to.

 

 

 

 

 

1.  Advanced applications of Artificial Intelligence

Artificial Intelligence has arguably been one of the biggest technological innovations of our time. As AI begins to get more sophisticated and closely mimic human intelligence, a new form will gain popularity in 2022: generative AI. Unlike traditional AI models that simply understand repetitive patterns and recreate them, generative AI is capable of producing completely new material. It uses the underlying principle of AI, learning patterns, to identify how input is linked together and uses it to create new content from code, images, text or video inputs. 

One of the biggest applications of generative AI is in customer service. Chatbots of 2022 could be so human-like that it would be practically impossible to differentiate between an AI chatbot and a human rep. AI could also change the way we consume content. Social media algorithms could get a lot smarter with AI and offer more accurate content recommendations. This could deliver even more personalized experiences for customers.

 

2. Low-code application building software

Creating a strong digital presence has become increasingly important for businesses, especially in the wake of the lockdowns. Mobile applications are one of the most important virtual assets a business can have because it’s the closest way to mimic an in-store experience. Historically, however, applications have been expensive to develop. Businesses needed to either have an in-house software development team or enough financial budgets to outsource the project to a vendor. These have acted as strong barriers to entry and have meant that small businesses could not afford to launch a mobile app. But with web 3.0, a number of low-code or no-code app development platforms have begun to sprout. These platforms make it extremely easy to develop an app with little to no coding knowledge required. Usually, they come with preset templates and features and users have to simply drag and drop them to build their app. These services will undoubtedly democratize the development of mobile applications and help businesses engage with their customers online.

 

3. Dominance of cloud technology trends

The cloud was one of the biggest game-changing technologies of 2021 as it enabled businesses to work remotely, from any part of the world. But even as offices begin to re-open, it’s unlikely that we’re ever going to see a shift to old on-premise models again. Global scenarios are unpredictable and businesses need to remain agile. As companies plan for this volatile future, they are more likely to move away from hastily implemented cloud setups to cloud-native platforms. 

The ‘lift and shift’ approach involved using the same organizational processes and simply moving them to the cloud. This approach might have worked as a time-sensitive solution, but they don’t deliver the full benefits of the cloud in the long term. A major focus for CIOs in the coming year will be implementing more sustainable cloud structures that are scalable and flexible. 

One study by Gartner estimates that as much as 95% of new digital initiatives in 2022 will be built upon cloud-native platforms. This underscores the importance of cloud technology in the coming year.

 

4. Secure data fabric

In line with creating flexible and accessible platforms, data fabric is a type of data architecture that ties together different platforms and users. It’s the antidote to data silos of the past that would often result in a loss of critical insights and make access extremely restricted. A data fabric promises greater efficiency and security because data is stored on a secure cloud platform. This minimizes the cost of storing data and ensures the highest level of encryption. 

Think of a data fabric like a self-driving car. When the driver is active, the autonomous mode takes a back seat. If the driver gets lazy and is not alert, the semi-autonomous mode kicks in and makes course corrections. This is similar to how a data fabric works: it monitors all data pipelines as a passive observer, and once it understands them, it starts to suggest more productive alternatives. For instance, a supply chain organization that uses a data fabric can keep integrating newly encountered data and improve its decisions based on that data. 

To implement data fabric design, new technologies like semantic knowledge graphs, embedded machine learning etc. will be needed.

 

5. Growth of 5G technology

5G technology will ultimately be the backbone that drives all of the above technology trends. 5G networks have a much wider reach than 4G, offer faster internet speeds and even enable heavy code to download seamlessly. 2022 will most likely see 5G enter the mainstream market and become the most widespread form of mobile broadband connectivity. This presents a huge opportunity for businesses. As the new year quickly approaches, CIOs should brainstorm new ways to present their business digitally and new channels that they can open up, powered by 5G technology.

 

 

 

Technology is constantly evolving and these 5 tech trends are just a few of the many we expect to see develop over the next year. As the very form of the internet evolves with the introduction of web 3.0, there’s only one thing that’s for certain: businesses need to respond positively to change and leverage the latest technological advances if they want to maintain their competitive edge.

 




 

Web 2 was a boom - lots of new companies emerged. Web 3 will go in the same direction, with the potential to be a much bigger change to our way of living!
