
Risk of memory leak in JpaEventPublicationRepository #1022

Open

iirekm opened this issue Jan 18, 2025 · 2 comments
Labels
in: event publication registry · meta: waiting for feedback

Comments


iirekm commented Jan 18, 2025

findIncompletePublicationsPublishedBefore and similar methods do something like entityManager....getResultStream().map(...).toList(). The toList() part can cause a memory leak if there are zillions of events to process (e.g. when the event handler is constantly failing). Paging (e.g. OFFSET ... LIMIT) is missing.
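For illustration, here is a minimal sketch of the page-limited variant I have in mind. The entity and method names below are made up (not the actual JpaEventPublicationRepository internals); the point is only that setMaxResults bounds how many rows getResultList() materializes:

```java
import java.time.Instant;
import java.util.List;

import jakarta.persistence.Entity;
import jakarta.persistence.EntityManager;
import jakarta.persistence.Id;

// Hypothetical, simplified stand-in for the registry's JPA entity.
@Entity
class EventPublicationEntity {
    @Id Long id;
    Instant publicationDate;
    Instant completionDate;
    String listenerId;
    String serializedEvent;
}

class PagedPublicationQuery {

    // Loads at most `pageSize` incomplete publications published before `instant`,
    // instead of materializing the entire backlog in memory.
    static List<EventPublicationEntity> findIncompletePage(EntityManager em, Instant instant, int pageSize) {
        return em.createQuery(
                "select p from EventPublicationEntity p "
                    + "where p.completionDate is null and p.publicationDate < :instant "
                    + "order by p.publicationDate asc",
                EventPublicationEntity.class)
            .setParameter("instant", instant)
            .setMaxResults(pageSize) // becomes LIMIT/OFFSET-style paging in the generated SQL
            .getResultList();
    }
}
```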

I had a similar, expensive production bug in a custom-made transactional outbox, caused by exactly the same problem!

The same problem likely exists in other publication repository implementations (e.g. JDBC).

Another issue I see is order by p.publicationDate asc - after adding paging, it would make denial-of-service attacks pretty easy (when all events in a page end with a processing error, eventually no further events can be processed). Instead, some form of exponential backoff and/or randomness should be used.
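What I mean by backoff, roughly: track how often a publication has been attempted and only pick it up again once an exponentially growing, slightly randomized delay has passed. A sketch of such a policy (nothing like this exists in Spring Modulith today; all names are made up):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical retry policy: exponential backoff with jitter.
class BackoffPolicy {

    private static final Duration BASE = Duration.ofSeconds(10);
    private static final Duration MAX = Duration.ofHours(1);

    // Earliest point in time at which a publication that has already failed
    // `attempts` times should be considered for resubmission again.
    static Instant nextAttemptAfter(Instant lastAttemptTime, int attempts) {
        Duration delay = BASE.multipliedBy(1L << Math.min(attempts, 20)); // exponential, capped exponent
        if (delay.compareTo(MAX) > 0) {
            delay = MAX;
        }
        long jitter = ThreadLocalRandom.current().nextLong(delay.toMillis() / 2 + 1);
        return lastAttemptTime.plus(delay).plusMillis(jitter); // randomness avoids lock-step retries
    }
}
```

The selection query would then only consider publications whose next-attempt time lies in the past, so a page full of permanently failing events stops monopolizing the resubmission.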

Version: 1.3.1

@odrotbohm
Member

If you are concerned about performance and memory consumption, you might rather want to look at the JDBC-based repository implementation. It avoids the overhead of the EntityManager, although it is not fundamentally different in design, as it still tries to read all outstanding event publications at once.

While we can certainly look into improving this, I think “zillions of events” in that table are a bit of a smell anyway. The EPR is not an event store, but primarily a guard against events currently in flight getting lost. If you have to deal with such a great volume of events, I think the better way out is to deal with that fact on the architectural level and resort to things like Kafka etc.

The ordering we apply is to make sure we resubmit the outstanding events in order. The resubmission process currently hands off each publication one by one, which would certainly block the submission of follow-up publications, but a failure would still allow those to be submitted. The failure of a single resubmission would simply cause that publication to stay outstanding.

While I see that a full page of failing resubmissions would cause that page to be tried again and again, I don't quite see how this would end up as a DoS attack vector. It's the application developer's responsibility to trigger the resubmissions. Unless you build something that gives unguarded access to that to outside parties (through an HTTP request or the like), how would someone be able to mount such an attack? That said, one, of course, also needs a strategy to deal with publications that are “stuck”, in some form of a dead letter queue. In other words, that is and has to be an application-level concern.

@odrotbohm added the in: event publication registry and meta: waiting for feedback labels on Jan 23, 2025
@iirekm
Author

iirekm commented Jan 23, 2025

We have 2 scenarios here:

Application produces lots of events

An architectural-level solution won't help when the application produces lots of events and Kafka or RabbitMQ goes down for even a few minutes. That can lead to millions of unprocessed events which, because of the lack of e.g. LIMIT 1000, may require gigabytes of memory that is often unavailable (e.g. when memory is restricted via Docker/k8s container settings).
Under normal circumstances, the functionality that doesn't depend on the broker would keep working, but the lack of (say) LIMIT 1000 will cause an OOM and take down even the functionality that doesn't depend on the broker.

Application produces few events

For applications that produce very few events, where we don't care much about latency, adding a message broker is overkill and the "queue over database" approach is great. Modulith allows this approach thanks to @ApplicationModuleListener (instead of @Externalized plus e.g. @KafkaListener). But even a few events that can't be processed can slowly accumulate over, say, weeks or months and lead to an OOM. Because the accumulation happens slowly and linearly, the staff responsible for application monitoring can easily miss it, even if proper monitoring is configured.
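For context, the "queue over database" approach mentioned above is just a regular listener annotated with @ApplicationModuleListener; a minimal sketch (the event and class names are made up, and the import location should be double-checked against your Spring Modulith version):

```java
import org.springframework.modulith.events.ApplicationModuleListener;
import org.springframework.stereotype.Component;

// Made-up domain event for illustration.
record OrderCompleted(Long orderId) {}

@Component
class OrderCompletedListener {

    // Runs asynchronously in its own transaction; the event publication registry persists
    // the event and resubmission can retry it after a failure – no external broker involved.
    @ApplicationModuleListener
    void on(OrderCompleted event) {
        // ... handle the event, e.g. send a confirmation email
    }
}
```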

Also, in this 'brokerless' scenario, a DoS attack is easy even after adding (e.g.) LIMIT 1000:

  • a developer uses @TransactionalEventListener (which seems to be a valid alternative to @ApplicationModuleListener that removes asynchronicity and adds a convenient auto-retry mechanism)
  • an attacker finds out that some HTTP request generates a message that gets stuck
  • the attacker generates 1000 such requests
  • because of ORDER BY ... LIMIT 1000 we end up with the faulty events at the beginning of the queue; the solution is easy - add a lastAttemptTime column and extend the condition with something like AND now() - lastAttemptTime >= interval '10 seconds' (see the sketch after this list)
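For concreteness, a rough sketch of what the proposed selection could look like, written here against the JDBC-based registry with Spring's JdbcTemplate. The table and column names are illustrative, LAST_ATTEMPT_TIME is the new column proposed above, and the LIMIT syntax varies per database:

```java
import java.sql.Timestamp;
import java.time.Instant;
import java.util.List;
import java.util.Map;

import org.springframework.jdbc.core.JdbcTemplate;

// Rough sketch of the proposed selection – not existing Spring Modulith code.
class PagedResubmissionQuery {

    // Fetches at most `pageSize` incomplete publications whose last attempt is old enough.
    List<Map<String, Object>> findResubmittable(JdbcTemplate jdbc, Instant now, int pageSize) {
        return jdbc.queryForList("""
                SELECT ID, LISTENER_ID, SERIALIZED_EVENT, PUBLICATION_DATE
                  FROM EVENT_PUBLICATION
                 WHERE COMPLETION_DATE IS NULL
                   AND (LAST_ATTEMPT_TIME IS NULL OR LAST_ATTEMPT_TIME <= ?)
                 ORDER BY PUBLICATION_DATE ASC
                 LIMIT ?
                """,
                Timestamp.from(now.minusSeconds(10)), // "now() - lastAttemptTime >= 10 seconds" as a bind parameter
                pageSize);                            // the LIMIT 1000 part – bounds memory regardless of backlog size
    }
}
```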

If the solution is easy (LIMIT 1000 + a new column + a new WHERE condition), why not implement it?
