getaddrinfo ENOTFOUND occasionally #4798

Closed
1 task done
sudoexec opened this issue May 29, 2024 · 12 comments
Labels
area:documentation Improvements or additions to documentation feature-request Request for new features to be added good first issue Good for newcomers hacktoberfest help wanted May need your help to test or answer

Comments

@sudoexec

sudoexec commented May 29, 2024

📑 I have found these related issues/pull requests

🛡️ Security Policy

Description

I occasionally get getaddrinfo ENOTFOUND errors (0-3 per day).

Uptime Kuma is running in k8s. The upstream DNS is the cluster's coredns, and coredns doesn't show any error logs.
I ran while true; do nslookup example.com && sleep 1; done to test DNS resolution and saw no errors.

The error occurs randomly and I can't reproduce it.
Is there any way to find more details about this error?

👟 Reproduction steps

Can't reproduce.

👀 Expected behavior

No getaddrinfo ENOTFOUND errors.

😓 Actual Behavior

getaddrinfo ENOTFOUND

🐻 Uptime-Kuma Version

1.23.11

💻 Operating System and Arch

k8s

🌐 Browser

125.0.6422.112 (Official Build) Arch Linux (64-bit)

🖥️ Deployment Environment

  • Runtime: k8s v1.18.1
  • Database: sqlite
  • Filesystem used to store the database on: local storage via hostpath
  • number of monitors: 52

📝 Relevant log output

Failing: getaddrinfo ENOTFOUND
@sudoexec sudoexec added the bug Something isn't working label May 29, 2024
@CommanderStorm CommanderStorm added help and removed bug Something isn't working labels May 29, 2024
@CommanderStorm
Collaborator

Same steps as in #4765

getaddrinfo ENOTFOUND test.xyz

  • What is the TTL of the domains you are using?
  • Do you have DNS caching enabled in the settings?

Most commonly, this issue is caused by a DNS resolver that does not like the volume of DNS requests it is getting.
=> your DNS server is dropping SOME requests
=> have you enabled NSCD in the settings to lower the number of DNS requests to one per TTL (instead of one on every request)?

@CommanderStorm CommanderStorm added the area:monitor Everything related to monitors label May 29, 2024
@sudoexec
Author

Same steps as in #4765

getaddrinfo ENOTFOUND test.xyz

  • What is the TTL of the domains you are using?
  • Do you have DNS caching enabled in the settings?

Most commonly, this issue is caused by a DNS resolver that does not like the volume of DNS requests it is getting. => your DNS server is dropping SOME requests => have you enabled NSCD in the settings to lower the number of DNS requests to one per TTL (instead of one on every request)?

  • TTL is 600
  • DNS caching is enabled
    image

@CommanderStorm
Collaborator

I have no clue what could be causing this.

Let's rule out the stupid causes first:

@sudoexec
Author

sudoexec commented May 29, 2024

could you look in the log if NSCD has been successfully started? (possible cause: using a custom UUID/GUID)

ps aux shows that NSCD is running.

have you verified that the TTL is actually 600?

I'm sure TTL is 600

coredns don't have any error logs

Just to make sure: you have activated https://coredns.io/plugins/errors/ and/or https://coredns.io/plugins/log/?
What are the logs?

I have enabled the errors plugin but not the log plugin. I'll enable the log plugin to find more details.

@thielj

thielj commented May 30, 2024

@sudoexec Alpine or other musl based Linux? Can you post a copy of your host's and the running container's /etc/resolv.conf?

I have seen similar issues in the past, including with Kubernetes, usually involving multiple DNS servers or related to search domains. The musl resolver would send out multiple parallel queries and ignore all replies but the first one. If that response was an error, this is what you would get. If the "good" lookup would usually win the race, you wouldn't see this error often.

Also, a regular nslookup or dig (or the DNS monitors in Kuma) performs name lookups differently from, for example, curl or HTTP requests in Node, which use the resolver (getaddrinfo) provided by the C library. I just had a quick Google, and these might give some background:

https://jvns.ca/blog/2022/02/23/getaddrinfo-is-kind-of-weird/
https://medium.com/@hsahu24/understanding-dns-resolution-and-resolv-conf-d17d1d64471c
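
Since nslookup does not go through getaddrinfo, a test loop that does may be closer to what the HTTP monitors experience. The following is a minimal sketch in Python, whose socket.getaddrinfo calls the C library resolver; the probe and watch helpers are hypothetical names, and it assumes Python is available inside the same container so that the same libc and /etc/resolv.conf are used.

```python
import socket
import time
from datetime import datetime, timezone


def probe(host, resolver=socket.getaddrinfo):
    """Resolve host once; return (ok, detail, seconds_taken)."""
    start = time.monotonic()
    try:
        infos = resolver(host, None)
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the resolved IP address.
        detail = sorted({info[4][0] for info in infos})
        ok = True
    except socket.gaierror as exc:
        # exc.errno is an EAI_* code such as socket.EAI_NONAME.
        detail = f"gaierror {exc.errno}: {exc.strerror}"
        ok = False
    return ok, detail, time.monotonic() - start


def watch(host, interval=1.0, rounds=None):
    """Probe repeatedly and print only the failures with a UTC timestamp."""
    done = 0
    while rounds is None or done < rounds:
        ok, detail, elapsed = probe(host)
        if not ok:
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            print(f"{stamp} {host} failed after {elapsed:.3f}s: {detail}")
        done += 1
        time.sleep(interval)
```

Calling watch("example.com") runs until interrupted; the timestamps of any failures can then be compared against the coredns logs.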

(this is just a personal opinion, but I wouldn't touch nscd with a barge pole)

@sudoexec
Author

@thielj Host machine is ubuntu 18.04.
Here are the resolv.conf files:

# Host
nameserver 119.29.29.29

# Container
nameserver 10.96.0.10                 # k8s coredns
search namespace.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
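
To illustrate what the search list and ndots:5 mean for a name like example.com, here is a rough sketch of the candidate names a resolver would try. It is a simplified glibc-style model, not the actual resolver code, and candidate_names is a hypothetical helper.

```python
def candidate_names(name, search, ndots=1):
    """Return the names tried, in order, under a simplified glibc-style model."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute; the search list is skipped.
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [absolute] + expanded
    # Too few dots: walk the search list first, try the name as given last.
    return expanded + [absolute]


search = ["namespace.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for candidate in candidate_names("example.com", search, ndots=5):
    print(candidate)
```

Under this model, example.com has fewer than five dots, so the three search-domain variants are tried before the bare name. Each candidate typically results in separate A and AAAA queries, which multiplies the load on coredns for every external monitor.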

Thanks for the info you provided, I've learned more about DNS internals from it.

Additionally, I've added another nameserver to the Uptime Kuma pod, and there have been no errors in the past 2 days.

@thielj

thielj commented May 31, 2024

If you get more getaddrinfo-related errors: those resolv.conf settings and the internal DNS they lead to are the rabbit hole you need to dig into, all the way from the container/pod down your stack.

https://coredns.io/2017/06/08/how-queries-are-processed-in-coredns/

@CommanderStorm CommanderStorm added area:documentation Improvements or additions to documentation feature-request Request for new features to be added help wanted May need your help to test or answer good first issue Good for newcomers hacktoberfest and removed help area:monitor Everything related to monitors labels May 31, 2024
@CommanderStorm
Collaborator

We should likely document this here
https://github.com/louislam/uptime-kuma/wiki/Troubleshooting

What is your second nameserver? (How did you find its IP? Do you have multiple coredns instances running?)

(Not a kubernetes/dns wizard 😅)

@sudoexec
Author

@thielj Thanks again for your help. I'll try it

@CommanderStorm

Additionally, I've added another nameserver to uptime kuma pod, and there're no errors in the past 2 days.

In fact, the "other nameserver" is 1.1.1.1. I added it in case the problem is caused by coredns.

@thielj

thielj commented May 31, 2024

@sudoexec This probably doesn't do what you expect, and if it does, you're relying on specific implementation behaviour of POSIX getaddrinfo. There are at least four different major implementations, and most of them can be further configured, see nsswitch.conf for an example.

The two most common, and their default behaviour with regard to the DNS resolver, are:

  • glibc, which will query the first server, and if it replies saying that it can't resolve your name, that's the final result. glibc moves on to the next server only if the first one doesn't reply at all within the timeout. For the purpose of monitoring, this can effectively mask problems in your Kubernetes DNS setup. Unless you monitor to show off "all green" to your boss or a client, it's probably not what you want.

  • musl, which will query both servers in parallel, and the first to reply wins. If 1.1.1.1 is faster than coredns and says the name is unresolvable, then that's the final result. This usually ends in your internal DNS winning the race 99.99% of the time. Instead of logging that your coredns is sometimes slow, you will log lookup failures (without knowing that they actually came from 1.1.1.1).

So: If you specify more than one server in resolv.conf, BOTH should be able to resolve ALL your hosts. If you want to implement fallbacks, query routing and such, configure a coredns or dnsmasq instance appropriately and point your resolv.conf to that. If you still want two DNS entries in your resolv.conf, configure two identically redundant instances.
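
The difference between the two strategies can be sketched with a toy model. This is only an illustration of the behaviour described above, not either library's real code; the Server type, the latencies, and the address 10.0.0.5 are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Server:
    name: str
    latency_ms: float       # time until this server replies
    answer: Optional[str]   # an IP address, or None for "name not found"


def sequential_lookup(servers, timeout_ms):
    """glibc-style model: ask in order; the first reply of any kind is final."""
    for server in servers:
        if server.latency_ms <= timeout_ms:
            return server.name, server.answer
        # No reply within the timeout: move on to the next server.
    return None, None


def parallel_lookup(servers, timeout_ms):
    """musl-style model: ask all servers at once; the fastest reply is final."""
    replies = [s for s in servers if s.latency_ms <= timeout_ms]
    if not replies:
        return None, None
    fastest = min(replies, key=lambda s: s.latency_ms)
    return fastest.name, fastest.answer


# An internal name that only coredns knows, on a moment when coredns is slow.
servers = [Server("coredns", 40, "10.0.0.5"), Server("1.1.1.1", 15, None)]
print(sequential_lookup(servers, timeout_ms=5000))  # ('coredns', '10.0.0.5')
print(parallel_lookup(servers, timeout_ms=5000))    # ('1.1.1.1', None)
```

In the parallel case, the "not found" reply from 1.1.1.1 wins the race, which is what would surface as getaddrinfo ENOTFOUND even though coredns could have answered.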

Also, if you run frequent probes, you will eventually see failures. That's pretty normal. With 99.99% reliability, a failure rate of up to 0.01% would be acceptable. Maybe configure your probes to allow for one retry?
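
As a rough back-of-the-envelope check, the 52 monitors from the report can be plugged into this. The 60-second interval is an assumption (it is not stated in this thread), and treating attempts as independent is a simplification, since DNS failures tend to cluster.

```python
def expected_daily_failures(monitors, interval_s, failure_rate, retries=0):
    """Expected probes per day that still fail after `retries` extra attempts."""
    probes_per_day = monitors * 86_400 / interval_s
    # Assuming independent attempts, all (retries + 1) attempts must fail.
    return probes_per_day * failure_rate ** (retries + 1)


print(expected_daily_failures(52, 60, 1e-4))             # about 7.5 per day
print(expected_daily_failures(52, 60, 1e-4, retries=1))  # about 0.00075 per day
```

Under these assumptions, the reported 0-3 errors per day would sit below a 0.01% failure rate, and a single retry would make such failures very rare.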

Alpine/Musl

@skrue

skrue commented Jun 11, 2024

I started seeing this behavior after setting up AdGuard Home. In my previous setup I only had Unbound DNS running on my OPNsense router/firewall. Now, AdGuard will relay all requests that it doesn't decide to block to Unbound, so AdGuard is the primary DNS. My entire home network is whitelisted in AdGuard as is the Uptime Kuma IP, so no blocking should be happening there. I am running Uptime Kuma as an LXC container on my Proxmox host. getaddrinfo ENOTFOUND errors pop up roughly once a day for each monitor that I have configured. I have now increased the retry value from 0 to 2, let's see if that helps.

@sudoexec
Author

Weeks ago, I changed my upstream DNS (which was provided by the cloud service and managed by systemd-resolved) to two other public DNS servers. There have been no getaddrinfo ENOTFOUND errors since.
