Stream vs ParallelStream in Java 8
Java 8 introduced streams, and they have been a buzzword ever since. People who know how to work with streams tend to reach for this functional programming construct whenever they can.
.stream() and .parallelStream() are two methods one can use to process a collection of data. My colleague always told me that people use them without understanding their performance. He also told me that stream is more performant than parallelStream, and I believed that until one day I got a chance to find out for myself.
That day a production issue was reported: the application was taking a significant amount of time to load its data and was ultimately failing. I had to fix it. My investigation showed that over 100 thousand records were being fetched from the database and, for each record, the database was queried again to get some value. That meant over 100 thousand queries were being executed one after another (apart from the other processing that was happening).
Now I gave it a thought. Why would someone write a piece of code that executes queries on an RDBMS sequentially? No query depended on the result of another, so they could be executed in parallel.
So I knew I could optimise this code by using multiple threads, so that more than just the ‘main’ thread does the work. Then it occurred to me: would parallelStream help me out here?
“Umm… but Mr. X said that they are not performant…
What should I do? Should I write all the code to create and maintain threads myself? I am too lazy to do that 😴”
“Wait, let me try and find out if parallelStreams are actually bad.” That is what I thought, so I simulated the scenario to solve the mystery.
Below is the piece of code that was used.
Thread.sleep(10); //Used to simulate the I/O operation
So the code is pretty simple. It creates a list of 100 thousand numbers and processes each number once with stream and once with parallelStream, which means 100 thousand records are being processed (as in my scenario).
To simulate the RDBMS query execution I used Thread.sleep(10), so each simulated query blocks its thread for about 10 milliseconds.
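A minimal sketch of such a benchmark might look like the following. It is a reconstruction from the description above, not the exact code: the class name StreamBenchmark and the helpers simulateQuery and time are hypothetical names chosen for illustration, and the record count can be overridden on the command line because the full 100 thousand records take many minutes to run.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical reconstruction of the benchmark described in the text.
public class StreamBenchmark {

    // Stand-in for one database query: block the calling thread for about 10 ms.
    static int simulateQuery(int record) {
        try {
            Thread.sleep(10); // Used to simulate the I/O operation
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return record;
    }

    // Runs the simulated query once per record and returns the elapsed time.
    static Duration time(List<Integer> records, boolean parallel) {
        Instant start = Instant.now();
        if (parallel) {
            records.parallelStream().forEach(StreamBenchmark::simulateQuery);
        } else {
            records.stream().forEach(StreamBenchmark::simulateQuery);
        }
        return Duration.between(start, Instant.now());
    }

    public static void main(String[] args) {
        // 100_000 matches the scenario in the text; pass a smaller number for a quick run.
        int count = args.length > 0 ? Integer.parseInt(args[0]) : 100_000;
        List<Integer> records = IntStream.range(0, count)
                .boxed()
                .collect(Collectors.toList());

        System.out.println("Start Sync.....");
        System.out.println("Sync: " + time(records, false));
        System.out.println("Start Async.....");
        System.out.println("Async: " + time(records, true));
    }
}
```

Running it with a small argument such as 1000 gives a quick feel for the gap. The absolute timings depend on the machine and on how many threads the parallel stream ends up using.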
Guess what the result was…
Start Sync.....
Sync: PT18M55.200639S
Start Async.....
Async: PT2M54.665221S
parallelStream was around 6 times faster: stream took ~19 minutes, whereas parallelStream took ~3 minutes.
Hence the claim that parallelStream is less performant was not at all true in my case, where a blocking I/O operation was involved.
I used parallelStream in the code to improve the performance and released it to production. No issues have been seen since.
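The reason is that, behind the scenes, parallelStream splits the work across the threads of the common ForkJoinPool. The number of threads depends on the machine; on mine, apparently about 6 threads were working on the data in parallel, which reduced the time of the operation by around 6 times.
This is the lesson I learnt that day.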
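To check how many threads a parallel stream actually uses on a given machine, something like the sketch below can help. The class name CommonPoolInfo and the helper threadsUsed are made up for this example. As far as I know, the common pool's parallelism defaults to one less than the number of available processors (with the calling thread joining in as well) and can be changed with the java.util.concurrent.ForkJoinPool.common.parallelism system property, so the numbers printed will differ from machine to machine.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

// Hypothetical helper for inspecting which threads a parallel stream runs on.
public class CommonPoolInfo {

    // Runs a small parallel stream and records the name of every thread that handled an element.
    static Set<String> threadsUsed(int tasks) {
        Set<String> names = ConcurrentHashMap.newKeySet();
        IntStream.range(0, tasks).parallel().forEach(i -> {
            try {
                Thread.sleep(5); // brief blocking call so the work spreads across threads
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            names.add(Thread.currentThread().getName());
        });
        return names;
    }

    public static void main(String[] args) {
        System.out.println("Available processors: "
                + Runtime.getRuntime().availableProcessors());
        System.out.println("Common pool parallelism: "
                + ForkJoinPool.commonPool().getParallelism());
        System.out.println("Threads used by parallel stream: " + threadsUsed(200));
    }
}
```

The set usually contains the calling thread (for example main) alongside several ForkJoinPool.commonPool-worker threads, which is why the observed speed-up tracks the core count rather than the number of records.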