
Intermittent cases of getaddrinfo failing between frontend and backend #482

Open

LennertVA opened this issue May 24, 2024 · 9 comments

Labels: question (Further information is requested)

@LennertVA

Describe the bug
While playing around in Community Edition, using the provided Docker images, I get error 500 very regularly - roughly one in every three to four actions triggers one. According to the debug output it is caused by getaddrinfo failing intermittently. In every case, simply refreshing the interface once or twice makes it go away.

To Reproduce
No specific steps are needed to reproduce it. It happens all over the interface, for any action that calls the backend, in roughly 25% of cases.

Expected behavior
No 500 errors.

Screenshots
Screenshots don't say much except "Error 500 - Internal Error", but whenever it happens this is the cause in the container logs:

frontend    | TypeError: fetch failed
frontend    |     at node:internal/deps/undici/undici:12500:13
frontend    |     at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
frontend    |   [cause]: Error: getaddrinfo ENOTFOUND backend
frontend    |       at GetAddrInfoReqWrap.onlookupall [as oncomplete] (node:dns:118:26) {
frontend    |     errno: -3008,
frontend    |     code: 'ENOTFOUND',
frontend    |     syscall: 'getaddrinfo',
frontend    |     hostname: 'backend'
frontend    |   }
frontend    | }
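
For what it's worth, a quick way to see whether the backend alias resolves from inside the frontend container at the moment of failure (container and alias names are taken from the log above, and this assumes the image ships getent):

    # Repeat the lookup a few times to catch the intermittent failure
    for i in $(seq 1 20); do podman exec frontend getent hosts backend; done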

Environment (please complete the following information):

  • Server OS: official Docker container images imported into podman on RHEL8.9 x86_64
  • Client Browser: Firefox 125.0.2
  • CISO Assistant version: v1.3.5 build 5baf1fc

Additional context
The server OS runs SELinux in full enforcing mode. It took quite a bit of file relabeling and loading of custom policies to get it to run, but now that it does, SELinux appears not to be involved in this (no audit logs of anything being blocked). Still worth mentioning, perhaps.
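
For reference, the kind of SELinux checks and relabeling described above might look roughly like this on RHEL (the volume path and image name are illustrative, not taken from the actual setup):

    # Look for recent denials involving the containers
    ausearch -m AVC,USER_AVC -ts recent
    # Relabel a bind-mounted volume so the container is allowed to use it
    podman run -d --name backend -v /opt/ciso-assistant/db:/code/db:Z <backend-image>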

It is particularly odd that it only fails sometimes, and that when it does, a refresh usually does the trick. That means it is not simply a case of something being broken or blocked, since it does work "usually". Is there a very short timeout configured somewhere for the call? The host server does carry a noticeable load.

@ab-smith
Contributor

Hello @LennertVA

Thank you for the feedback. This reminds me of a discussion on Discord about a user with shared virtualisation and low resources. The trick there was to re-bake the Docker image with a higher timeout on gunicorn, but I will try to set up an environment equivalent to yours in the meantime.
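
For context, "a higher timeout on gunicorn" usually just means raising the worker timeout in the startup command; a minimal sketch, assuming a WSGI module path and port that may not match the actual image entrypoint:

    # Illustrative only: raise the gunicorn worker timeout from the default 30s to 120s
    gunicorn ciso_assistant.wsgi:application --bind 0.0.0.0:8000 --timeout 120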

I've read that you've set up the labels, but have you ruled out SELinux by temporarily switching to permissive mode?

Warm regards,

ab-smith added the question (Further information is requested) label on May 24, 2024
@LennertVA
Author

Hello @LennertVA

Thank you for the feedback. This reminds me of a discussion on Discord about a user with shared virtualisation and low resources.

Well, I wouldn't call it low on resources, but the physical system is certainly well-used. The VM running the Docker images less so - it has a load under 0.5 and over 50% free memory - but I can't rule out that it sometimes has to wait a bit longer than usual for CPU or I/O because higher-priority, heavily loaded VMs are being greedy.

The trick there was to re-bake the Docker image with a higher timeout on gunicorn, but I will try to set up an environment equivalent to yours in the meantime.

Interesting. Definitely curious what you'll find. What mostly threw me off is that the error is in getaddrinfo - I wasn't expecting something as silly as name resolution to be the thing that would start falling over.

I've read that you've set up the labels, but have you ruled out SELinux by temporarily switching to permissive mode?

I have not. The host runs some other containers too, and I don't like the idea of disabling it globally to test a single case. SELinux seems happy enough - I had to grant some permissions and set some labels, but if it were still blocking anything that would be clearly logged, which is no longer happening. But it could be worth testing, you're right.

@LennertVA
Author

I've read that you've set up the labels, but have you ruled out SELinux by temporarily switching to permissive mode?

Just did a setenforce 0, refreshed the CISO Assistant UI, clicked on a few random items (risk assessment and so on) and hit half a dozen error 500s within the first 30 seconds. So it's safe to say that SELinux is not causing this.

The VM running the containers also has a load of ~0.25 (for two CPUs) and memory ~40% in use. So that also doesn't sound like a very likely cause.
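
For completeness, figures like these can be collected with standard tooling, e.g.:

    # Host load average and memory usage
    uptime
    free -m
    # Per-container CPU/memory usage under podman
    podman stats --no-stream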

@ab-smith
Contributor

Thanks @LennertVA for the feedback,
Given that we are unable to reproduce this, we'll try to build an equivalent setup and get back to you.
Regards

@LennertVA
Author

LennertVA commented May 27, 2024

Thanks! I've been looking around for this particular error 3008, and a surprising number of hits deal with multicast-related services and mDNS. Does CISO Assistant use mDNS in any way internally?
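
One way to check whether mDNS is even in the lookup path of the frontend container (a guess at what to look for; the container name is assumed):

    # Resolver configuration visible to the Node process
    podman exec frontend cat /etc/resolv.conf
    # If the image ships an nsswitch.conf, check whether "mdns" appears in the hosts line
    podman exec frontend cat /etc/nsswitch.conf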

For good measure I also turned off the local host firewall for a minute - I honestly can't imagine how that could be the cause, but okay - and still no dice.

ab-smith self-assigned this on Jun 1, 2024
@ab-smith
Contributor

ab-smith commented Jun 1, 2024

So after some digging, it seems that other software projects are reporting this strange behaviour between Node, undici and Docker during DNS resolution.
Other people are suggesting tricks like this one, but right now I'm not a big fan of hard-coding a specific DNS resolution chain:
https://forum.weaviate.io/t/fetch-error-between-ts-client-and-weaviate-when-deployed-with-docker-compose-on-windows/2146/9
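
For readers, tricks of that kind typically amount to forcing the Node resolver's behaviour or pinning DNS servers for the container; a hypothetical example (not necessarily what the linked thread suggests) would be:

    # Prefer IPv4 results over the resolver's verbatim order
    # (assumes the Node version in the image accepts this flag via NODE_OPTIONS)
    podman run -d --name frontend -e NODE_OPTIONS="--dns-result-order=ipv4first" <frontend-image>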

Regardless of these 3008 and DNS warnings, how does this translate to CISO Assistant? Do you get any errors? Are you still seeing random 500 errors after the update?

I've managed to build a home lab on a NUC with Proxmox and RHEL9 to emulate your setup.

Thank you

@LennertVA
Author

LennertVA commented Jun 3, 2024

So after some digging, it seems that other software projects are reporting this strange behaviour between Node, undici and Docker during DNS resolution. Other people are suggesting tricks like this one, but right now I'm not a big fan of hard-coding a specific DNS resolution chain: https://forum.weaviate.io/t/fetch-error-between-ts-client-and-weaviate-when-deployed-with-docker-compose-on-windows/2146/9

Indeed, same thoughts here. It feels like a band-aid rather than a fix - this should not be happening in the first place.

Regardless of these 3008 and DNS warnings, how does this translate to CISO Assistant? Do you get any errors? Are you still seeing random 500 errors after the update?

Yes, every third or fourth click still results in an error 500.

@ab-smith
Contributor

OK, I've managed to reproduce similar issues by artificially introducing latency between the frontend and the backend with toxiproxy. Can you confirm that you haven't split the frontend and backend across different VMs/hosts? Were you using compose with the prebuilt images or with locally built ones?
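
For reference, reproducing the latency between the two services with toxiproxy might look roughly like this (proxy name and ports are illustrative):

    # Proxy the backend through toxiproxy and point the frontend at port 18000 instead of 8000
    toxiproxy-cli create -l 0.0.0.0:18000 -u backend:8000 backend_proxy
    # Add two seconds of latency on traffic towards the backend
    toxiproxy-cli toxic add -t latency -a latency=2000 backend_proxy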

@LennertVA
Author

OK, I've managed to reproduce similar issues by artificially introducing latency between the frontend and the backend with toxiproxy. Can you confirm that you haven't split the frontend and backend across different VMs/hosts? Were you using compose with the prebuilt images or with locally built ones?

Yes, the frontend and backend are on the same host and in the same "pod"; there is nothing in between. I am using the prebuilt images.
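
For reference, whether the two containers really share a pod (and therefore a network namespace) can be double-checked with:

    # List pods, then list containers with the pod they belong to
    podman pod ps
    podman ps --pod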
