Are "Exists in" filters on large data sources a problem?

According to the Appfarm documentation, under the optimize performance page, one should “Avoid filters with ‘Exists in’ on large Data Source”. Is this actually true?

It is possible to find the elements of an array X that “Exists in” a second array Y quite efficiently O(X+Y).

There are also slower implementations of this operation that take significantly longer.

Because the Appfarm documentation warns about the performance of this operation, I assumed Appfarm used a slower implementation. After some testing I am not sure this is the case.

I tried reading form a data source with 600 000 entries, using an “Exists in” filter, where the data source I am filtering against has 2000 entries. This took on average about 750 milliseconds. Then I tried the same thing again, but without the “Exists in” filter. This took on average about 650 milliseconds. In both cases, the same number of entries were read.

There is a difference in performance, but not a very significant difference.

“Exists in” filters don’t seem to have a huge performance impact. Is there still a reason why one should avoid “Exists in” filters on large data sources? Are there other cases where it does have a significant performance impact?

Hi!

Testing performance without the full extent of others users with simultaneous operations, is not a full source of truth. There are multiple factors impacting the performace: both the total app load (if measured from an app), database load (from app, and in total if shared cluster), and potential local variable loads on regions by google gloud.

The ‘exists in’ is in general slower than without, when returning the same number if records. The extent of this difference depends.

An important note with regards to this recommendation: it is not only the database operation itself, but also the fact that any changes to the ‘right side’ (exists in) of this filter (e.g. a new object is added to the right-side data source) results in re-refiltering (e.g. a heavy database query) the data source. Meaning - avoid large ‘exists in’ filters unless you have total control on the frequence of right-side operand changes.

1 Like