
Risk of memory leak in JpaEventPublicationRepository #1022

Open

iirekm opened this issue Jan 18, 2025 · 2 comments
Labels
in: event publication registry · meta: waiting for feedback

Comments


iirekm commented Jan 18, 2025

findIncompletePublicationsPublishedBefore and similar methods do something like entityManager....getResultStream().map(...).toList(). The toList() part can cause a memory leak if there are zillions of events to process (e.g. when the event handler is constantly failing). Paging (e.g. OFFSET ... LIMIT) is missing.
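For illustration, here is a minimal sketch of the page-limited variant I have in mind. The entity and method names below are made up (not the actual JpaEventPublicationRepository internals); the point is only that setMaxResults bounds how many rows getResultList() materializes:

```java
import java.time.Instant;
import java.util.List;

import jakarta.persistence.Entity;
import jakarta.persistence.EntityManager;
import jakarta.persistence.Id;

// Hypothetical, simplified stand-in for the registry's JPA entity.
@Entity
class EventPublicationEntity {
    @Id Long id;
    Instant publicationDate;
    Instant completionDate;
    String listenerId;
    String serializedEvent;
}

class PagedPublicationQuery {

    // Loads at most `pageSize` incomplete publications published before `instant`,
    // instead of materializing the entire backlog in memory.
    static List<EventPublicationEntity> findIncompletePage(EntityManager em, Instant instant, int pageSize) {
        return em.createQuery(
                "select p from EventPublicationEntity p "
                    + "where p.completionDate is null and p.publicationDate < :instant "
                    + "order by p.publicationDate asc",
                EventPublicationEntity.class)
            .setParameter("instant", instant)
            .setMaxResults(pageSize) // becomes LIMIT/OFFSET-style paging in the generated SQL
            .getResultList();
    }
}
```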

I had a similar, expensive production bug in a custom-made transactional outbox, caused by exactly the same problem!

The same problem likely exists in other publication repository implementations (e.g. JDBC).

Another issue I see is order by p.publicationDate asc - after adding paging, it would make denial-of-service attacks pretty easy (when all events in a page end with a processing error, eventually no further events can be processed). Instead, some form of exponential backoff and/or randomness should be used.
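What I mean by backoff, roughly: track how often a publication has been attempted and only pick it up again once an exponentially growing, slightly randomized delay has passed. A sketch of such a policy (nothing like this exists in Spring Modulith today; all names are made up):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical retry policy: exponential backoff with jitter.
class BackoffPolicy {

    private static final Duration BASE = Duration.ofSeconds(10);
    private static final Duration MAX = Duration.ofHours(1);

    // Earliest point in time at which a publication that has already failed
    // `attempts` times should be considered for resubmission again.
    static Instant nextAttemptAfter(Instant lastAttemptTime, int attempts) {
        Duration delay = BASE.multipliedBy(1L << Math.min(attempts, 20)); // exponential, capped exponent
        if (delay.compareTo(MAX) > 0) {
            delay = MAX;
        }
        long jitter = ThreadLocalRandom.current().nextLong(delay.toMillis() / 2 + 1);
        return lastAttemptTime.plus(delay).plusMillis(jitter); // randomness avoids lock-step retries
    }
}
```

The selection query would then only consider publications whose next-attempt time lies in the past, so a page full of permanently failing events stops monopolizing the resubmission.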

Version: 1.3.1

@odrotbohm
Member

If you are concerned about performance and memory consumption, you might rather want to look at the JDBC-based repository implementation. It avoids the overhead of the EntityManager, although it is not fundamentally different in design, as it still tries to read all outstanding event publications at once.

While we can certainly look into improving this, I think “zillions of events” in that table are a bit of a smell anyway. The EPR is not an event store, but primarily a guard against events currently in flight getting lost. If you have to deal with such a great volume of events, I think the better way out is to deal with that fact on the architectural level and resort to things like Kafka etc.

The ordering we apply is to make sure we resubmit the outstanding events in order. The resubmission process currently hands off each publication one by one, which would certainly block the submission of follow-up publications, but a failure would still allow those to be submitted. The failure of a single resubmission would simply cause that publication to stay outstanding.

While I see that a full page of failing resubmissions would cause that page to be tried again and again, I don't quite see how this would end up as a DoS attack vector. It's the application developer's responsibility to trigger the resubmissions. Unless you build something that gives unguarded access to that to outside parties (through an HTTP request or the like), how would someone be able to mount such an attack? That said, one, of course, also needs a strategy to deal with publications that are “stuck”, in some form of a dead letter queue. In other words, that is and has to be an application-level concern.

@odrotbohm added the in: event publication registry and meta: waiting for feedback labels on Jan 23, 2025
@iirekm
Author

iirekm commented Jan 23, 2025

We have 2 scenarios here:

Application produces lots of events

An architectural-level solution won't help when the application produces lots of events and Kafka or RabbitMQ goes down for even a few minutes. That can lead to millions of unprocessed events which, because of the lack of e.g. LIMIT 1000, may require gigabytes of memory that is often unavailable (e.g. when memory is restricted via Docker/k8s container settings).
Under normal circumstances, the functionality that doesn't depend on the broker would keep working, but the lack of (say) LIMIT 1000 will cause an OOM and take down even the functionality that doesn't depend on the broker.

Application produces few events

For applications that produce very few events, where we don't care much about latency, adding a message broker is overkill and the "queue over database" approach is great. Modulith allows this approach thanks to @ApplicationModuleListener (instead of @Externalized plus e.g. @KafkaListener). But even a few events that can't be processed can slowly accumulate over, say, weeks or months and lead to an OOM. Because the accumulation happens slowly and linearly, the staff responsible for application monitoring can easily miss it, even if proper monitoring is configured.
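For context, the "queue over database" approach mentioned above is just a regular listener annotated with @ApplicationModuleListener; a minimal sketch (the event and class names are made up, and the import location should be double-checked against your Spring Modulith version):

```java
import org.springframework.modulith.events.ApplicationModuleListener;
import org.springframework.stereotype.Component;

// Made-up domain event for illustration.
record OrderCompleted(Long orderId) {}

@Component
class OrderCompletedListener {

    // Runs asynchronously in its own transaction; the event publication registry persists
    // the event and resubmission can retry it after a failure – no external broker involved.
    @ApplicationModuleListener
    void on(OrderCompleted event) {
        // ... handle the event, e.g. send a confirmation email
    }
}
```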

Also, in this 'brokerless' scenario, a DoS attack is easy even after adding (e.g.) LIMIT 1000:

  • a developer uses @TransactionalEventListener (which seems to be a valid alternative to @ApplicationModuleListener that removes asynchronicity and adds a convenient auto-retry mechanism)
  • an attacker finds out that some HTTP request generates a message that gets stuck
  • the attacker generates 1000 such requests
  • because of ORDER BY ... LIMIT 1000 we end up with the faulty events at the beginning of the queue; the solution is easy - add a lastAttemptTime column and extend the condition with something like AND now() - lastAttemptTime >= interval '10 seconds' (see the sketch after this list)
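For concreteness, a rough sketch of what the proposed selection could look like, written here against the JDBC-based registry with Spring's JdbcTemplate. The table and column names are illustrative, LAST_ATTEMPT_TIME is the new column proposed above, and the LIMIT syntax varies per database:

```java
import java.sql.Timestamp;
import java.time.Instant;
import java.util.List;
import java.util.Map;

import org.springframework.jdbc.core.JdbcTemplate;

// Rough sketch of the proposed selection – not existing Spring Modulith code.
class PagedResubmissionQuery {

    // Fetches at most `pageSize` incomplete publications whose last attempt is old enough.
    List<Map<String, Object>> findResubmittable(JdbcTemplate jdbc, Instant now, int pageSize) {
        return jdbc.queryForList("""
                SELECT ID, LISTENER_ID, SERIALIZED_EVENT, PUBLICATION_DATE
                  FROM EVENT_PUBLICATION
                 WHERE COMPLETION_DATE IS NULL
                   AND (LAST_ATTEMPT_TIME IS NULL OR LAST_ATTEMPT_TIME <= ?)
                 ORDER BY PUBLICATION_DATE ASC
                 LIMIT ?
                """,
                Timestamp.from(now.minusSeconds(10)), // "now() - lastAttemptTime >= 10 seconds" as a bind parameter
                pageSize);                            // the LIMIT 1000 part – bounds memory regardless of backlog size
    }
}
```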

If the solution is easy (LIMIT 1000 + a new column + a new WHERE condition), why not implement it?
