Monday, February 18, 2013

How to investigate memory leaks in Java AIX

AIX is high performance platform but quite difficult system to investigate memory leaks. I was working on one specific JDK release and there was memory leak inside JIT compiler. Here are bunch of tools to look into and some explanation how to read them: ps / svmon / nmon / topas.

Memory Leak: Caused by a program that repeatedly allocates memory without freeing it.

When a process exits, its working storage is freed up immediately and its associated memory frames are put back on the free list.
However any files the process may have opened can stay in memory.

AIX tries to use the maximum amount of free memory for file caching.

High levels of file system cache usually means that is the way the application runs and likes it (you have to decide if this is expected by understanding the workload) or AIX can't find anything else to do with the memory and so thinks it might as well save disk I/O CPU cycles by caching - this is normal and a good idea.
Link to similar article about Oracle AIX memory issue: 

There are couple commands to run to find issues:
  • ps gv | head -n 1; ps gv | egrep -v "RSS" | sort +6b -7 -n -r
  • svmon
  • svmon -P -O summary=basic,unit=MB
  • nmon m

Examples (this is after-fix readings, and fix was to upgrade to another version of JDK):

ps gv | head -n 1; ps gv | egrep -v "RSS" | sort +6b -7 -n -r | head -n 10

$ ps gv | head -n 1; ps gv | egrep -v "RSS" | sort +6b -7 -n -r | head -n 10
  4587574      - A    4943:32 8010 679324 679452 32768    80   128  0.9  1.0 /opt/W
  7077924      - A    5435:52 1930 666140 666268 32768    80   128  1.0  1.0 /opt/W
 15794204      - A    233:22  219 550876 551004 32768    80   128  0.0  1.0 /opt/W
  5439520      - A    236:28  164 530844 530972 32768    80   128  0.0  1.0 /opt/W
 12845156      - A    132:39  423 470300 470428 32768    80   128  0.0  1.0 /opt/W
 12321022      - A    131:39  271 406428 406556 32768    80   128  0.0  1.0 /opt/W
  9764870      - A    68:25  157 255896 256024 32768    80   128  0.0  0.0 /opt/W
  7340098      - A    14:18  387 63356 63428 32768    39    72  0.0  0.0 java -
  7995524      - A    11:54 3049 59212 59332 32768    69   120  0.0  0.0 /opt/t

This command gives you info about current memory usage by processes (top 10). 
RSS:    The amount of RAM used for the text and data segments per process. (RSS:resident set size)
%MEM:    The actual amount of the RSS / Total RAM. Watch for processes that consume 40-70 percent of %MEM.
TRS:    The amount of RAM used for the text segment of a process in kilobytes.
SIZE:    The actual amount of paging space (virtual mem. size) allocated for this process (text and data).

On size/rss column you can see current usage at level of 680MB (for to processes), when I investigated leak I saw usage at level of 8 GB even when my max heap was set to 4GB!


What is really going on

$ svmon
               size       inuse        free         pin     virtual   mmode
memory     16777216    16138874      638342     1431202     7262359     Ded
pg space    1572864      220721

               work        pers        clnt       other
pin          923826           0           0      507376
in use      7230007           0     8908867

PageSize   PoolSize       inuse        pgsp         pin     virtual
s    4 KB         -     9824554       16721      652946      915687
m   64 KB         -      394645       12750       48641      396667

Here we have all memory used by system (and all applications) - just keep in mind its in 4KB pages.
Top line:
size: total memory available: 16777216 * 4 KB = 64GB = 65536 MB
inuse: currently used memory () 16138874 * 4KB = 63042 MB
free: free memory: 638342 * 4KB = 2493 MB
So looks like over 95% of memory is used.
Real value of used pages are on bottom 2 lines, here we have two types of pages: s - small 4KB one and m - medium 64KB, together:
9824554 * 4KB + 394645 *64KB = 39298216 KB + 25257280 KB = 64555496 KB =63042 MB

svmon -P -O summary=basic,unit=MB

Lets look what processes are using:
$ svmon -P -O summary=basic,unit=MB
Unit: MB

     Pid Command          Inuse      Pin     Pgsp  Virtual
 4587574 java           5222.52     32.0     0.94  4920.19
 7077924 java           5212.62     31.9     0.94  4914.32
12845156 java           4914.87     31.4     0.94  4620.29
 5439520 java           2832.61     31.7     0.94  2522.76
15794204 java           2646.86     31.7     0.94  2331.34
12321022 java           2064.07     31.3     0.94  1762.70
 9764870 java            965.89     31.3     0.94   673.54
 8650818 webservd        178.28     30.8     2.31   150.90
11403468 svmon           115.71     30.6     0.94   116.89
15532206 ksh             113.70     30.5     2.31   120.28
 5243046 ksh             113.67     30.5     2.31   120.26
10616930 sleep           113.00     30.5     2.31   119.88

As you can see I have 7 java processes; I'm interested only on first 6 - as that my application. On all of them max heap is set to 4GB, question is:  why first of them are getting over 5GB?
In AIX IBM java version: process memory is divided into  java managed (heap), native (calls to native system: threads, JNI, JDBC...), file caching.

svmon -P <pid>

 Look into details:
     Pid Command          Inuse      Pin     Pgsp  Virtual 64-bit Mthrd  16MB
 4587574 java           1336964     8183      240  1259569      Y     Y     N

     PageSize                Inuse        Pin       Pgsp    Virtual
     s    4 KB               96484        343          0      18689
     m   64 KB               77530        490         15      77555

    Vsid      Esid Type Description              PSize  Inuse   Pin Pgsp Virtual
  5a32d8        72 work text data BSS heap           m   4096     0    0    4096
  5e90de        78 work text data BSS heap           m   4096     0    0    4096
  a74427        7f work text data BSS heap           m   4096     0    0    4096
  db1d5b        7a work text data BSS heap           m   4096     0    0    4096
  d74357        76 work text data BSS heap           m   4096     0    0    4096
  cd444d        7e work text data BSS heap           m   4096     0    0    4096
  b63234        70 work text data BSS heap           m   4096     0    0    4096
  2634a4        75 work text data BSS heap           m   4096     0    0    4096
  5034d2        73 work text data BSS heap           m   4096     0    0    4096
  c49444        77 work text data BSS heap           m   4096     0    0    4096
  f33471        71 work text data BSS heap           m   4096     0    0    4096
  7296f2        7c work text data BSS heap           m   4096     0    0    4096
  c62c46        74 work text data BSS heap           m   4096     0    0    4096
  9c451c        7d work text data BSS heap           m   4096     0    0    4096
  4190c1        79 work text data BSS heap           m   4096     0    0    4096
  a72e25      1002 work text data BSS heap           m   4076     0    0    4076
  6916e9        7b work text data BSS heap           m   3916     0    0    3916
  5996d9      1001 work text data BSS heap           m   3837     0    0    3837
  cc2b4c      1003 work text data BSS heap           m   2640     0    0    2640
  990019  90000000 work shared library text          m   1033     0    8    1052
  c34343  70000000 work default shmat/mmap          sm  12800     0    0   12800
  730ef3         - clnt /dev/fslv00:197305           s   9919     0    -       -
   20002         0 work kernel segment               m    532   487    7     538
  c79247         - clnt /dev/fslv00:193648           s   5021     0    -       -
  c40f44         - clnt /dev/fslv00:197307           s   3058     0    -       -
  e60f66         - clnt /dev/fslv00:197320           s   2580     0    -       -
  396cbb         - clnt /dev/fslv00:172631           s   2505     0    -       -
  c89648         - clnt /dev/fslv00:193602           s   2438     0    -       -
   50005  9ffffffd work shared library              sm   2251     0    0    2251
  cc0e4c         - clnt /dev/fslv00:197330           s   2158     0    -       -
  b99639         - clnt /dev/fslv00:193657           s   2144     0    -       -
  dc275c         - work                              s   1831   343    0    1831
  2b0aab         - clnt /dev/fslv00:197268           s   1762     0    -       -
  9b001b  90020014 work shared library               s   1633     0    0    1633
  a30d23         - clnt /dev/fslv00:173466           s   1523     0    -       -
Full detailed information about process and memory usage.
How to read it:
There are 3 areas that we are interesting in are:
1) Processes that using small esid numbers (70, 71, 72 ... 7e) - that's java heap
2) Processes using 1001, 1002, 1003 - that's native memory
3) -clnt - that files caching
So if you want to know how much memory is used by java heap - you need to summarize all 7x processes and multiply if by 64KB (PSize = m)
Calculations for native memory:
  a72e25      1002 work text data BSS heap           m   4076     0    0    4076
  5996d9      1001 work text data BSS heap           m   3837     0    0    3837
  cc2b4c      1003 work text data BSS heap           m   2640     0    0    2640

 (4076 + 3837 + 2640) * 64 KB  =  675456 KB = 659 MB

nmon m t

Every day tool to check memory usage

-->m [memory usage] --> t (top processes) --> 4 (order in process size)
│ Memory ────────────────────────────────────────────────────────────────────────│
│          Physical  PageSpace |        pages/sec  In     Out | FileSystemCache  │
│% Used       96.2%     14.0%  | to Paging Space   0.0    0.0 | (numperm) 52.8%│% Free        3.8%     86.0%  | to File System    0.0    0.0 | Process   36.3%  │
│MB Used   63028.8MB   861.4MB | Page Scans        0.0        | System     7.0%  │
│MB Free    2507.2MB  5282.6MB | Page Cycles       0.0        | Free       3.8%  │
│Total(MB) 65536.0MB  6144.0MB | Page Steals       0.0        |           ------ │
│                              | Page Faults      12.5        | Total    100.0%  │
│------------------------------------------------------------ | numclient 52.8%  │
│Min/Maxperm     1905MB(  3%)  57163MB( 90%) <--% of RAM      | maxclient 90.0%  │
│Min/Maxfree     960   1088       Total Virtual   70.0GB      | User      86.1%  │
│Min/Maxpgahead    2      8    Accessed Virtual   27.7GB 39.6%| Pinned     8.5%  │
│                                                             | lruable pages   1│
│ Top-Processes-(140) ─────Mode=4  [1=Basic 2=CPU 3=Perf 4=Size 5=I/O 6=Cmds]────│
│  PID     %CPU  Size   Res   Res   Res  Char RAM      Paging       Command      │
│          Used    KB   Set  Text  Data   I/O  Use io other repage               │
│ 4587574   0.8   663M  663M  128   663M    8   1%    0    0    0 java           │
│ 7077924   0.7   650M  650M  128   650M    7   1%    0    0    0 java           │
│15794204   1.2   538M  538M  128   537M  131   1%    0    0    0 java           │
│ 5439520   0.7   518M  518M  128   518M    7   1%    0    0    0 java           │
│12845156   0.7   459M  459M  128   459M   14   1%    0    0    0 java           │
│12321022   0.7   396M  397M  128   396M   26   1%    0    0    0 java           │
│ 9764870   0.4   249M  250M  128   249M   62   0%    0    0    0 java           │
│ 7340098   0.0 63392 63428    72 63356     0   0%    0    0    0 java           │
│ 7995524   0.0 59280 59332   120 59212     0   0%    0    0    0 java           │

Size KB: program on disk size
ResSize: Resident Set Size - how big it is in memory (excluding the pages still in the file system (like code) and some parts on paging disks)
ResText: code pages of the Resident Set
ResData: data and stack pages of the Resident Set         
What interested here is:

% Used       96.2%
% Free        3.8%
and  split:
(numperm) 52.8%
Process   36.4%
System     7.0%
Free       3.8%
Total    100.0%
Over 50% of memory is used by file system cache.


topas -P  This does not tell how much of the application is paged out but how much of the application memory is backed by paging space.(things in memory (working segment) should be backed by paging space by the actual size in memory of the process.)  

svmon -Pt15 | perl -e 'while(<>){print if($.==2||$&&&!$s++);$.=0 if(/^-+$/)}'    top 15 processes using the most memory

ps aux | head -1 ; ps aux | sort -rn +3 | head -20   top memory processes (the above is better)

ps -ef | grep -c LOCAL=NO   shows the number of client connections (each connection takes up memory, so if it is high then...)

svmon -Pg -t 1 |grep Pid ; svmon -Pg -t 10 |grep "N"  top 10 processes using the most paging space
svmon -P -O sortseg=pgsp  shows paging space usage of processes
ipcs -bm - it  will show our memory segments, who owns them and what their size (in bytes). It shows the maximum allocated size, that a memory segment can go to.

Tuesday, February 12, 2013

Thread safe SimpleDateFormat

Run this test class and you will see (as is not predictable run it few times):
 * Please feel free to experiment - not only wrong data but sometimes number format exceptions...
public class SimpleDateTest {

 static SimpleDateFormat df = new SimpleDateFormat("dd-MMM-yyyy");
 static String testdata[] = { "01-Jan-1999", "14-Feb-2001", "31-Dec-2007" };

  * Test method for SDF.
 public void testParse() {
  Runnable r[] = new Runnable[testdata.length];
  for (int i = 0; i < r.length; i++) {
   final int i2 = i;
   r[i] = new Runnable() {
    public void run() {
     try {
      for (int j = 0; j < 1000; j++) {
       String str = testdata[i2];
       String str2 = null;
//         synchronized(df) 
        Date d = df.parse(str);
        str2 = df.format(d);

       Assert.assertEquals("date conversion failed after "
         + j + " iterations.", str, str2);
     } catch (ParseException e) {
      throw new RuntimeException("parse failed");
   new Thread(r[i]).start();

Possible outputs are:
Exception in thread "Thread-0" junit.framework.ComparisonFailure: date conversion failed after 0 iterations. expected:<[01-Jan-1999]> but was:<[14-Feb-2001]>
 at junit.framework.Assert.assertEquals(

Exception in thread "Thread-0" Exception in thread "Thread-1" java.lang.NumberFormatException: For input string: "19992001.E199920014E"

Exception in thread "Thread-0" java.lang.NumberFormatException: multiple points
 at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)

Exception in thread "Thread-0" java.lang.NumberFormatException: For input string: ""
 at java.lang.NumberFormatException.forInputString(Unknown Source)

And solution is synchronize usage of SimpleDateFormat, use ThreadLocal like here: ThreadSafeSimpleDateFormat:

or my last findings use improved version of ThreadLocal  that creates HashMaps of SDF inside: SafeSimpleDateFormat
public class SafeSimpleDateFormat
    private final String _format;
    private static final ThreadLocal<Map<String, SimpleDateFormat>> _dateFormats = new ThreadLocal<Map<String, SimpleDateFormat>>()
        public Map<String, SimpleDateFormat> initialValue()
            return new HashMap<String, SimpleDateFormat>();

    private SimpleDateFormat getDateFormat(String format)
        Map<String, SimpleDateFormat> formatters = _dateFormats.get();
        SimpleDateFormat formatter = formatters.get(format);
        if (formatter == null)
            formatter = new SimpleDateFormat(format);
            formatters.put(format, formatter);
        return formatter;

    public SafeSimpleDateFormat(String format)
        _format = format;

    public String format(Date date)
        return getDateFormat(_format).format(date);

    public String format(Object date)
        return getDateFormat(_format).format(date);

    public Date parse(String day) throws ParseException
        return getDateFormat(_format).parse(day);

    public void setTimeZone(TimeZone tz)

    public void setCalendar(Calendar cal)

    public void setNumberFormat(NumberFormat format)

    public void setLenient(boolean lenient)

    public void setDateFormatSymbols(DateFormatSymbols symbols)

    public void set2DigitYearStart(Date date)

Sunday, February 3, 2013

Performance Processing

Multi-threaded solutions: complete as much work as possible in the shortest time possible. Parallel processing using multiple threads is necessary to bring all available computing resources to bear on the problem. Today's multiple core, multiple CPU servers are especially well suited for multi-threaded processing.


The basic software architecture pattern to follow is a dispatcher-worker pattern. This pattern consists of a centralized controller (the dispatcher) that sends chunks of work to multiple worker threads. This basic pattern is often used in software applications that must do a high volume of work.

The central problem is to determine how to partition the chunks of work to maximize the efficiency of threads and database interactions. Clearly, each thread must operate on an independent logical unit of work. Otherwise, concurrent threads might end up waiting on one another or incorrectly altering results related to other threads. Each thread must be able to complete its job independently from all other threads.

Subsequently, the units of work should be designed to minimize database interactions because these are very expensive. They involve asking the database to retrieve some data (which could require spinning a physical hard drive) and then moving that data over a network to the JEE server.Minimizing the frequency and volume of these interactions is the single most important factor in JEE processing performance. There are many ways to minimize database access and optimize the necessary database interactions.


Only Get the Required Data

The easiest way to minimize database interactions is to carefully construct the algorithm to only get and operate on the data it really needs. This may seem obvious but layers such as Data Access Objects (DAOs) or perhaps an object model based on Hibernate tend to return fully populated objects rather than just the few fields required. It's often convenient to reuse an existing data layer that does these things, but only do so if the time required to retrieve the extra data is acceptable. Otherwise, create new DAOs or JDBC statements to get just the specific data required.


Only Do the Work Required for Each Run

Another way to avoid bringing back unnecessary data and doing pointless work is to create a configurable  process. Processes often do several different but related operations and not all of them are always necessary. A little extra development work is required to provide input parameters that allow certain operations to be switched off for certain runs, but avoiding unnecessary work can provide worthwhile performance improvements.


Only Work on Data that Has Changed

In this same vein of avoiding unnecessary work, it is often possible to implement a feature that tracks which data has changed (and requires new operations) and which data has not changed (and can safely be ignored). Depending on the rate of change of the data, ignoring unchanged values can lead to a large performance improvement.


Use Data Warehousing Techniques to Compress Data over Time

The size of the dataset can be further reduced by exploiting common data warehouse data modelling techniques such as the concept of slowly changing dimensions. Data warehouses are often modelled to contain dimension tables and fact tables. The dimension tables contain all the descriptive attributes on which data is sliced. Fact tables contain the actual aggregated data. For example, there may be a fact table containing order totals with a foreign key to a dimension table that captures the name of the salesperson for the order, allowing the creation of a report to slice order totals by salesperson.

Slowly changing dimensions and slowly changing facts are methods that can be used to compress the volume of this data if the data changes over time. The idea is to put date ranges on the dimensions and facts rather than repeating the same values for each date in the time period. For example, a salesperson's name could change over time if she gets married. Without date range effectiveness on this dimension, it is necessary to capture the name as it was at each run to preserve historical data even though the data most likely does not change often. This is repetitive and wasteful. If the dimension has a date range, then the batch process need only store a row for each different value.

The same can be done with facts if the model requires storing facts at different times. If the result of the computation happens to be the same value as it was the last time, the process could just store a date range with the answer rather than storing the same answer multiple times.


Optimize Database Interactions

Eliminating unnecessary work is the best way to limit database interactions, but, clearly, some interactions must happen. Further strategies can be used to make sure those interactions are as efficient as possible.



One approach is to take advantage of caching technologies. Processes often require access to some set of master data that is reused throughout the process. This master data should be loaded from the database just once and then cached in memory within the application server context and reused.

This can be done using singletons or static variables that hold the data, or caching tools like JBoss Cache, GigaSpaces, Tangosol Coherence, etc. These latter tools provide benefits such as replicating the cached values across multiple JVM instances but introduce added complexity to the application.

One caveat for caching is that it may solve a database interaction problem but create a memory constraint problem because the in-memory cache in the application server tier may grow too large. RAM has become much cheaper in recent years, but most 32-bit JVMs are still limited to 2-4GB of heap space. Be careful that the cache will not exceed the memory space available and cause disk swapping on the application server. Also, consider 64-bit computing architectures that allow larger heap space.


Data Streaming

Another approach for optimizing database interactions is to favour a smaller number of de-normalized queries that retrieve large volumes of data rather than a larger number of more granular queries that retrieve small volumes of data. Relational databases are very good at creating an execution plan for a few complex queries and then streaming back the results as quickly as possible. They perform less well when asked to execute lots of small queries that appear to be randomly organized.

For example, consider a process that needs to compute the shipping cost on a large set of orders. You could choose to define each chunk of work to be a single order. The dispatcher could ask the database for the master list of order IDs and send each ID to a worker thread for processing. That worker thread could then ask the database for the details of each order, do its work, and save the answer back to the database.

To the database, this approach will feel like its getting slammed by lots of concurrent users asking for different orders all at the same time. There will be high contention for resources such as database connections and access to the order table. It will look sort of like a denial of service attack.

On the other hand, you could choose to define the chunks as an arbitrary number of orders, perhaps 1,000. The dispatcher could query the database for all required order columns rather than just the order ID. As each row is returned, the dispatcher would send all the order data required to compute shipping costs to a worker. Each worker would not have to query the database to do its work because everything it needs is provided as input.

As each worker completes its work, it would place the results on a persistence queue rather than immediately sending an individual insert or update to the database. Every time this queue reaches 1,000 entries, the process would send a bulk insert/update statement to the database.

The result is that the database is allowed to do a few, high-volume things as fast as it is able, rather than swapping between numerous small tasks.


Optimize Physical Database Access

Databases often respond slowly when they receive multiple requests that contend for data located on the same physical media. Avoiding this contention will speed the batch process. It's often possible to specify how database tables are segregated on different physical disk drives and divide tables that are likely to receive large numbers of concurrent requests onto different physical drives.


Use Database Tricks

Relational databases offer many configuration options and interaction methods that can be used intelligently to optimize a process. Performance monitoring tools should be used to watch the behavior of the database as the  process runs. This will allow the optimal configuration of settings such as how much memory to commit to the database's shared cache.



A database uses transactions to group multiple changes into a single logical unit of work. These changes are then all committed and stored or all rolled back and thrown away together. The database must maintain a log of these changes to keep track of which things belong together. Large transactions result in a large transaction log. Large logs can negatively impact performance. There are a couple ways to avoid this problem. You could choose not to use transactions at all. Most databases include an autocommit feature that allows all changes to be committed immediately. Alternatively, you could make sure that each thread independently commits its own relatively small transaction. In any case, it's not wise to have large, long-running transactions as part of a process.


Prepared Statements

Most databases support the idea of precompiled SQL statements called prepared statements. A prepared statement is a SQL statement with placeholders for parameters that will be supplied later on with actual data values. The statement can be compiled once and then reused even if the parameters change. This saves compilation time on the database platform and improves performance.

Processes usually involve multiple executions of the same SQL statements over and over with different parameter values. This is a perfect situation for prepared statements. Dynamic statements should always be avoided.


Application Server Clustering

One of the benefits of using the JEE platform for processing is that you can leverage its ability to cluster multiple application servers. If JMS is used as the transport mechanism to move messages from the dispatcher to the workers and the JMS implementation supports clustered, distributed queues (as many do), the workers can reside on different physical machines. This provides a method to scale the performance of the batch process by adding application servers. A powerful cluster can be assembled using multiple, inexpensive commodity application servers and it can grow with the requirements of the batch process.


This is "default" performance processing strategies, in next article I will try to share with you personal view on creating ideal architecture for hig-performance processing, building engines, and possibilities of expansion through introducing CUDA, or ZooKeeper ;-)