Requested array size exceeds VM limit

I get this error

java.lang.OutOfMemoryError: Requested array size exceeds VM limit

when upserting an MLSerialPipeline of a relatively large RandomForest model (number of trees: 250, minimum samples per leaf: 5, number of features: 40). However, checking the old_gen on Splunk, I don’t see the memory level spike anywhere close to 100%; it only goes up to ~30%. What do you recommend? Could the reason I don’t see the spike in memory on Splunk be that the allocation happens and fails so fast that it is not captured by Splunk?

That error does not mean that old gen is hitting its limit; it means that the JVM is trying to allocate an array with more elements than it can support, which is typically somewhere between 1 and 2 billion elements. Is there any way you can batch your requests?
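For context, a Java array can hold at most Integer.MAX_VALUE (2^31 - 1) elements, so if the serialized pipeline ends up in a single byte[] of more than ~2 GB, the allocation fails no matter how much heap is free. A rough way to gauge whether your model is in that ballpark is to pickle an equivalent scikit-learn estimator and measure it. This is only a sketch on synthetic data, and the idea that the upsert materializes the whole serialized model in one array is my assumption, not something I have verified.

```python
import pickle

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the real training data; scale n_samples up toward
# ~525,600 rows (1 year at MINUTE interval) to approximate production size.
X, y = make_regression(n_samples=50_000, n_features=40, random_state=0)

model = RandomForestRegressor(
    n_estimators=250, min_samples_leaf=5, n_jobs=-1, random_state=0
).fit(X, y)

payload = pickle.dumps(model)
JVM_MAX_ARRAY_LEN = 2**31 - 1  # a single Java byte[] cannot exceed this length

print(f"serialized model: {len(payload) / 1e6:.0f} MB")
print(f"fits in one JVM byte array: {len(payload) < JVM_MAX_ARRAY_LEN}")
```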

Thanks for the response @tony.li. The object here is a single MLSerialPipeline that fails to upsert, so I don’t think batching the upsert would work here.

However, this issue happens only for some objects. Other pipelines with fewer features do get upserted successfully, but they are still extremely large objects even with very few features, and fetching them takes minutes.

I have two questions here:

  1. How can I get around the upsert error above for large MLSerialPipelines?
  2. Does it make sense for these Random Forest model objects to be so large that fetching a model with only 2 features takes minutes (often more than 3) on the console?

Random Forests in scikit-learn (which I assume is the implementation you are using) have a non-scalable implementation: their size grows almost linearly with the number of samples you train on. It also grows with the number of trees, which is pretty high here, but not with the number of features (the number of features does not matter, given how the Random Forest algorithm works).

Have you tried using the XGBoost Pipe?
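If it helps, here is a rough side-by-side of serialized sizes on synthetic data. Gradient-boosted trees are depth-limited by default, so their size does not grow with the number of training samples the way deep random-forest trees do. This is just a sketch using the plain xgboost API; the XGBoost Pipe will have its own parameter names on top of it.

```python
import pickle

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

X, y = make_regression(n_samples=100_000, n_features=40, random_state=0)

rf = RandomForestRegressor(n_estimators=250, min_samples_leaf=5, n_jobs=-1).fit(X, y)
xgb = XGBRegressor(n_estimators=250, max_depth=6, n_jobs=-1).fit(X, y)

# Compare how big each model is once pickled.
for name, est in [("RandomForest", rf), ("XGBoost", xgb)]:
    print(f"{name}: {len(pickle.dumps(est)) / 1e6:.1f} MB")
```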

We cannot control the number of samples beyond a certain level, as the end users train these models, and their requirement is up to 1 year of data at MINUTE interval. We can, however, reduce the number of trees, which I agree is too high, as a first step. If that doesn’t work, we will try the XGBoost Pipe.

It turns out the model-size scaling of the scikit-learn implementation is even worse than linear:

The size of the model with the default parameters is O( M * N * log (N) ),
where M is the number of trees and N is the number of samples.
In order to reduce the size of the model, you can change these parameters:
min_samples_split, max_leaf_nodes, max_depth and min_samples_leaf.

Ref: https://github.com/scikit-learn/scikit-learn/blob/master/doc/modules/ensemble.rst
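As a quick sketch of what those parameters do to serialized size (synthetic data, exact numbers will vary):

```python
import pickle

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=100_000, n_features=2, random_state=0)

default = RandomForestRegressor(n_estimators=100, n_jobs=-1).fit(X, y)
bounded = RandomForestRegressor(
    n_estimators=100,
    max_depth=12,         # caps the depth of every tree
    min_samples_leaf=50,  # fewer, larger leaves -> fewer nodes overall
    max_leaf_nodes=4096,  # hard cap on the number of leaves per tree
    n_jobs=-1,
).fit(X, y)

for name, est in [("default params", default), ("bounded params", bounded)]:
    print(f"{name}: {len(pickle.dumps(est)) / 1e6:.1f} MB")
```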


@mehdi.maasoumy If you are limited by the size of the JVM and your algorithm is not splittable, then your only option is to use a hardwareProfile and run this on a bigger machine.