Sunday, February 3, 2013

Performance Processing

Multi-threaded solutions: complete as much work as possible in the shortest time possible. Parallel processing using multiple threads is necessary to bring all available computing resources to bear on the problem. Today's multiple core, multiple CPU servers are especially well suited for multi-threaded processing.


 

The basic software architecture pattern to follow is a dispatcher-worker pattern. This pattern consists of a centralized controller (the dispatcher) that sends chunks of work to multiple worker threads. This basic pattern is often used in software applications that must do a high volume of work.

The central problem is to determine how to partition the chunks of work to maximize the efficiency of threads and database interactions. Clearly, each thread must operate on an independent logical unit of work. Otherwise, concurrent threads might end up waiting on one another or incorrectly altering results related to other threads. Each thread must be able to complete its job independently from all other threads.

Subsequently, the units of work should be designed to minimize database interactions because these are very expensive. They involve asking the database to retrieve some data (which could require spinning a physical hard drive) and then moving that data over a network to the JEE server.Minimizing the frequency and volume of these interactions is the single most important factor in JEE processing performance. There are many ways to minimize database access and optimize the necessary database interactions.

 

Only Get the Required Data

The easiest way to minimize database interactions is to carefully construct the algorithm to only get and operate on the data it really needs. This may seem obvious but layers such as Data Access Objects (DAOs) or perhaps an object model based on Hibernate tend to return fully populated objects rather than just the few fields required. It's often convenient to reuse an existing data layer that does these things, but only do so if the time required to retrieve the extra data is acceptable. Otherwise, create new DAOs or JDBC statements to get just the specific data required.

 

Only Do the Work Required for Each Run

Another way to avoid bringing back unnecessary data and doing pointless work is to create a configurable  process. Processes often do several different but related operations and not all of them are always necessary. A little extra development work is required to provide input parameters that allow certain operations to be switched off for certain runs, but avoiding unnecessary work can provide worthwhile performance improvements.

 

Only Work on Data that Has Changed

In this same vein of avoiding unnecessary work, it is often possible to implement a feature that tracks which data has changed (and requires new operations) and which data has not changed (and can safely be ignored). Depending on the rate of change of the data, ignoring unchanged values can lead to a large performance improvement.

 

Use Data Warehousing Techniques to Compress Data over Time

The size of the dataset can be further reduced by exploiting common data warehouse data modelling techniques such as the concept of slowly changing dimensions. Data warehouses are often modelled to contain dimension tables and fact tables. The dimension tables contain all the descriptive attributes on which data is sliced. Fact tables contain the actual aggregated data. For example, there may be a fact table containing order totals with a foreign key to a dimension table that captures the name of the salesperson for the order, allowing the creation of a report to slice order totals by salesperson.

Slowly changing dimensions and slowly changing facts are methods that can be used to compress the volume of this data if the data changes over time. The idea is to put date ranges on the dimensions and facts rather than repeating the same values for each date in the time period. For example, a salesperson's name could change over time if she gets married. Without date range effectiveness on this dimension, it is necessary to capture the name as it was at each run to preserve historical data even though the data most likely does not change often. This is repetitive and wasteful. If the dimension has a date range, then the batch process need only store a row for each different value.

The same can be done with facts if the model requires storing facts at different times. If the result of the computation happens to be the same value as it was the last time, the process could just store a date range with the answer rather than storing the same answer multiple times.

 

Optimize Database Interactions

Eliminating unnecessary work is the best way to limit database interactions, but, clearly, some interactions must happen. Further strategies can be used to make sure those interactions are as efficient as possible.

 

Caching

One approach is to take advantage of caching technologies. Processes often require access to some set of master data that is reused throughout the process. This master data should be loaded from the database just once and then cached in memory within the application server context and reused.

This can be done using singletons or static variables that hold the data, or caching tools like JBoss Cache, GigaSpaces, Tangosol Coherence, etc. These latter tools provide benefits such as replicating the cached values across multiple JVM instances but introduce added complexity to the application.

One caveat for caching is that it may solve a database interaction problem but create a memory constraint problem because the in-memory cache in the application server tier may grow too large. RAM has become much cheaper in recent years, but most 32-bit JVMs are still limited to 2-4GB of heap space. Be careful that the cache will not exceed the memory space available and cause disk swapping on the application server. Also, consider 64-bit computing architectures that allow larger heap space.

 

Data Streaming

Another approach for optimizing database interactions is to favour a smaller number of de-normalized queries that retrieve large volumes of data rather than a larger number of more granular queries that retrieve small volumes of data. Relational databases are very good at creating an execution plan for a few complex queries and then streaming back the results as quickly as possible. They perform less well when asked to execute lots of small queries that appear to be randomly organized.

For example, consider a process that needs to compute the shipping cost on a large set of orders. You could choose to define each chunk of work to be a single order. The dispatcher could ask the database for the master list of order IDs and send each ID to a worker thread for processing. That worker thread could then ask the database for the details of each order, do its work, and save the answer back to the database.

To the database, this approach will feel like its getting slammed by lots of concurrent users asking for different orders all at the same time. There will be high contention for resources such as database connections and access to the order table. It will look sort of like a denial of service attack.

On the other hand, you could choose to define the chunks as an arbitrary number of orders, perhaps 1,000. The dispatcher could query the database for all required order columns rather than just the order ID. As each row is returned, the dispatcher would send all the order data required to compute shipping costs to a worker. Each worker would not have to query the database to do its work because everything it needs is provided as input.

As each worker completes its work, it would place the results on a persistence queue rather than immediately sending an individual insert or update to the database. Every time this queue reaches 1,000 entries, the process would send a bulk insert/update statement to the database.

The result is that the database is allowed to do a few, high-volume things as fast as it is able, rather than swapping between numerous small tasks.

 

Optimize Physical Database Access

Databases often respond slowly when they receive multiple requests that contend for data located on the same physical media. Avoiding this contention will speed the batch process. It's often possible to specify how database tables are segregated on different physical disk drives and divide tables that are likely to receive large numbers of concurrent requests onto different physical drives.

 

Use Database Tricks

Relational databases offer many configuration options and interaction methods that can be used intelligently to optimize a process. Performance monitoring tools should be used to watch the behavior of the database as the  process runs. This will allow the optimal configuration of settings such as how much memory to commit to the database's shared cache.

 

Transactions

A database uses transactions to group multiple changes into a single logical unit of work. These changes are then all committed and stored or all rolled back and thrown away together. The database must maintain a log of these changes to keep track of which things belong together. Large transactions result in a large transaction log. Large logs can negatively impact performance. There are a couple ways to avoid this problem. You could choose not to use transactions at all. Most databases include an autocommit feature that allows all changes to be committed immediately. Alternatively, you could make sure that each thread independently commits its own relatively small transaction. In any case, it's not wise to have large, long-running transactions as part of a process.

 

Prepared Statements

Most databases support the idea of precompiled SQL statements called prepared statements. A prepared statement is a SQL statement with placeholders for parameters that will be supplied later on with actual data values. The statement can be compiled once and then reused even if the parameters change. This saves compilation time on the database platform and improves performance.

Processes usually involve multiple executions of the same SQL statements over and over with different parameter values. This is a perfect situation for prepared statements. Dynamic statements should always be avoided.

 

Application Server Clustering

One of the benefits of using the JEE platform for processing is that you can leverage its ability to cluster multiple application servers. If JMS is used as the transport mechanism to move messages from the dispatcher to the workers and the JMS implementation supports clustered, distributed queues (as many do), the workers can reside on different physical machines. This provides a method to scale the performance of the batch process by adding application servers. A powerful cluster can be assembled using multiple, inexpensive commodity application servers and it can grow with the requirements of the batch process.

 


This is "default" performance processing strategies, in next article I will try to share with you personal view on creating ideal architecture for hig-performance processing, building engines, and possibilities of expansion through introducing CUDA, or ZooKeeper ;-)

 

No comments:

Post a Comment

Datafusion Comet

Hi! Recently I moved to Rust and working on several projects - more insights to come ... one of them was Datafusion - an extremely fast S...