Poor error when terraform init fails to acquire a lock on the working directory's backend settings file on Windows (e.g. on a WSL share) #35344

Open
robpomeroy opened this issue Jun 14, 2024 · 12 comments
Labels
bug cli explained a Terraform Core team member has described the root cause of this issue in code v1.8 Issues (primarily bugs) reported against v1.8 releases

Comments

@robpomeroy

Terraform Version

Terraform v1.8.5
on windows_386

Terraform Configuration Files

Not relevant.

Debug Output

https://gist.github.com/robpomeroy/96161fd561e6a109947b636fcd522567

Expected Behavior

Terraform initialises

Actual Behavior

When running terraform init from Windows on a WSL file path, this fails with an error:

│ Error: Error locking state: [{%!s(tfdiags.Severity=69) Error acquiring the state lock Error message: 2 errors occurred:
│       * Incorrect function.
│       * open .terraform\.terraform.tfstate.lock.info: The system cannot find the file specified.
│
│
│
│ Terraform acquires a state lock to protect the state from being written
│ by multiple users at the same time. Please resolve the issue above and try
│ again. For most commands, you can disable locking with the "-lock=false"
│ flag, but this is not recommended. }]

Running from PowerShell provides a little more insight, adding to the above:

'\\wsl.localhost\AlmaLinux9\mnt\wsl\repos\[REDACTED]\Terraform'
CMD.EXE was started with the above path as the current directory.
UNC paths are not supported.  Defaulting to Windows directory.

It looks like terraform init invokes CMD.EXE when run from PowerShell. And CMD.EXE can't handle changing to a UNC path so ends up in the wrong directory. In CMD.EXE, presumably Terraform cannot correctly interpret the UNC path.

Steps to Reproduce

  1. Store a Terraform plan in WSL (e.g. at /tmp/plan or from Windows, e.g. \\wsl.localhost\AlmaLinux9\tmp\plan).
  2. Open PowerShell prompt and navigate to the WSL directory (e.g. Set-Location \\wsl.localhost\AlmaLinux9\tmp\plan)
  3. Run terraform init

Additional Context

In my development environment, I use a junction point to map C:\Repos (link) to \\wsl.localhost\AlmaLinux9\mnt\wsl\repos (target). Hence CMD.EXE allows me to enter the WSL directory using cd C:\Repos, i.e., without needing to use a UNC path. My system is set up for a lot of cross-platform work and I keep my repositories inside WSL for performance reasons (known issue with WSL2 file performance).

Appears to be somewhat related to #29483, albeit in a different context.

References

No response

@robpomeroy robpomeroy added bug new new issue not yet triaged labels Jun 14, 2024
@robpomeroy
Author

robpomeroy commented Jun 14, 2024

PS: As a workaround, I have changed my workflow to call Terraform from WSL rather than from Windows. That works, but it's less comfortable because I cannot use my USB YubiKey for MFA as I normally do.

@apparentlymart
Member

Hi @robpomeroy! Thanks for reporting this.

Based on the trace log it appears that this behavior of running an external program using the Windows command interpreter belongs to the AWS SDK, which is being used indirectly by Terraform's S3 backend.

I'm not deeply familiar with how the AWS SDK deals with credentials, but I do remember that it can be configured (e.g. in the AWS configuration in your home directory) to run an external program to gather the credentials. Do you think it's plausible that the AWS SDK might think it needs to run an external program to gather AWS credentials on your system, and that it's the launching of that program that's happening through the Windows command interpreter?

@apparentlymart
Member

I found the code in the AWS SDK that handles the task of executing an external process to gather credentials, and indeed it does seem to be hard-coded to use cmd.exe /C when running on Windows:

https://github.com/aws/aws-sdk-go-v2/blob/af47d4b51bc085bcf85e2b872b2cf3e86b76b472/credentials/processcreds/provider.go#L90-L94

	if runtime.GOOS == "windows" {
		cmdArgs = []string{"cmd.exe", "/C"}
	} else {
		cmdArgs = []string{"sh", "-c"}
	}

I assume this is a compatibility constraint in that the command to run is an arbitrary string specified in the AWS configuration file and changing it to use PowerShell instead of the traditional Windows command interpreter would likely cause existing configured command lines to be interpreted differently than they are today.
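To make the wrapping concrete, here is a minimal sketch of how such a command line ends up being handed to the shell. The helper name `shellArgs`, the explicit `goos` parameter, and the example command string are all hypothetical and exist only for illustration; the real SDK code checks `runtime.GOOS` directly, and how it appends the configured command is an assumption here.

```go
package main

import (
	"fmt"
	"runtime"
)

// shellArgs is a hypothetical helper that mirrors the branch quoted above:
// it picks the shell prefix for the given OS name and appends the configured
// command. Taking goos as a parameter (instead of reading runtime.GOOS
// inside) lets both branches be exercised on any platform.
func shellArgs(goos, command string) []string {
	var args []string
	if goos == "windows" {
		args = []string{"cmd.exe", "/C"}
	} else {
		args = []string{"sh", "-c"}
	}
	// Assumption: the configured command is passed through as one argument.
	return append(args, command)
}

func main() {
	// "example-credentials-helper" is a made-up program name.
	fmt.Println(shellArgs(runtime.GOOS, "example-credentials-helper --profile demo"))
}
```

On Windows the resulting child process is `cmd.exe`, which inherits the caller's working directory. That would explain why the "UNC paths are not supported" warning appears when the working directory is a `\\wsl.localhost\...` path.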

Regardless of the reason why it's written this way, this code is not directly part of Terraform and so I don't see any clear path to avoiding using cmd.exe in this case. I would hope that this would come into play only if you've specified a credentials command in your AWS configuration files and that you could avoid it by changing the configuration, but of course if the external credentials program is the only way you can access AWS credentials for your environment then that does leave you at a bit of an impasse. Does the AWS CLI also exhibit this behavior?

@robpomeroy
Author

robpomeroy commented Jun 14, 2024

That sounds plausible, Martin. I have AWS calling aws-vault (and ykman) for authentication and MFA. But this could be a red herring. I've been using Terraform/AWS/YubiKey successfully for years, but this is the first time I've run it from Windows on a plan that's inside WSL's filesystem. The CMD.EXE output only surfaces when running Terraform from PowerShell, which is not my usual practice. Most of the time I'm using Cmder (which is essentially CMD.EXE) in Tabby.

If I just run terraform init from CMD.EXE, the problem still arises. Terraform manages to create the .terraform directory and an empty terraform.tfstate. Then it gives up. By this point, I've already been past the YubiKey/ykman prompt, so I assume auth is complete.

@apparentlymart
Member

apparentlymart commented Jun 14, 2024

Indeed, I think you're probably right that this error is a red herring. I expect that the AWS SDK is running cmd.exe /C aws-vault (or similar) and that's causing that warning to be printed, yet the command still succeeds because the command being run doesn't actually care which working directory it's run from.

Then downstream something else fails, unrelated to the launching of the credentials process, and so Terraform returns the "Error locking state" message.

My next guess would be that Terraform is trying to do an I/O operation on the lock information file that the filesystem of the current working directory cannot support for some reason, and thus the Windows kernel is returning this "Incorrect function" error. I recognize "Incorrect function" as the English localization of a Windows error code that we've seen before in some other context, though I don't remember which one.

@robpomeroy
Author

Yes, that explanation fits well. 👍🏻

@apparentlymart
Member

apparentlymart commented Jun 14, 2024

I think this is where that error message is originating:

	err := slowmessage.Do(LockThreshold, func() error {
		id, err := statemgr.LockWithContext(ctx, s, lockInfo)
		l.lockID = id
		return err
	}, l.view.Locking)
	if err != nil {
		diags = diags.Append(tfdiags.Sourceless(
			tfdiags.Error,
			"Error acquiring the state lock",
			fmt.Sprintf(LockErrorMessage, err),
		))
	}

Behind the scenes, Terraform implements the locking for the file that contains the working directory's initialized backend settings (which is named .terraform/terraform.tfstate for historical reasons) by calling LockFileEx:

	func (s *LocalState) lock() error {
		// even though we're failing immediately, an overlapped event structure is
		// required
		ol, err := newOverlapped()
		if err != nil {
			return err
		}
		defer syscall.CloseHandle(ol.HEvent)
		return lockFileEx(
			syscall.Handle(s.stateFileOut.Fd()),
			_LOCKFILE_EXCLUSIVE_LOCK|_LOCKFILE_FAIL_IMMEDIATELY,
			0,              // reserved
			0,              // bytes low
			math.MaxUint32, // bytes high
			ol,
		)
	}

I expect that whatever filesystem \\wsl.localhost refers to does not support this operation and so it's failing with the "Incorrect function" error. Terraform then separately tries to read the lock info file and finds it missing, because the locking step already failed.

I assume from what appears in the path after it that \\wsl.localhost\AlmaLinux9 is referring to the root of a Linux virtual file system inside a WSL context, and so the directory where Terraform is trying to work is presumably a Linux filesystem that cannot support the Windows file locking concepts.

Although I've not seen this particular symptom before, this sort of thing unfortunately commonly arises when crossing between Windows and Linux contexts, because the Windows version of Terraform is built to work with the Windows API and the Linux version of Terraform is built for the Linux API, and WSL often ends up supporting only the common subset of both out of necessity.

Since you seem to be intending to work in a directory in your WSL environment anyway, could you instead use the Linux version of Terraform inside WSL, so that Terraform will expect the filesystem to behave in a Linux-like way rather than a Windows-like way?

@apparentlymart
Member

Although some of the fine details are different, this issue seems to confirm my hypothesis: microsoft/WSL#5762

@apparentlymart
Member

The Go toolchain seems to have a similar problem for the same reasons, and I found a comment that mentions a workaround of diverting the directories needing locking to a different location.

I don't have a Windows/WSL environment to test this on, but I think the Terraform equivalent of that workaround would be to set TF_DATA_DIR to a location that is on a native Windows filesystem rather than in the exported WSL filesystem.

That approach does require some considerable care, though: Terraform expects that each working directory has a distinct "data dir", so if you set the TF_DATA_DIR environment variable you must make sure to set it to a different directory for each distinct working directory where you'll run terraform init, or else the working directory states for different directories will interfere with one another.
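One way to satisfy the "distinct data dir per working directory" requirement is to derive the directory name from the working directory's path. The following is only a sketch of that idea: the helper `dataDirFor`, the base directory, and the example paths are hypothetical, and nothing here is something Terraform does itself.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// dataDirFor is a hypothetical helper that maps a working directory path to
// its own subdirectory under base, so that two different working directories
// never share the same TF_DATA_DIR value.
func dataDirFor(base, workDir string) string {
	sum := sha256.Sum256([]byte(workDir))
	// A short prefix of the hash keeps the directory name manageable.
	return filepath.Join(base, hex.EncodeToString(sum[:8]))
}

func main() {
	// Hypothetical base directory on a native Windows filesystem.
	base := `C:\TerraformData`
	for _, wd := range []string{
		`\\wsl.localhost\AlmaLinux9\tmp\plan`,
		`\\wsl.localhost\AlmaLinux9\tmp\other-plan`,
	} {
		fmt.Printf("TF_DATA_DIR for %s: %s\n", wd, dataDirFor(base, wd))
	}
}
```

The printed value would then be set as TF_DATA_DIR before running terraform init in the corresponding directory. Note that the same directory reached through two different spellings (for example via a junction point and via its UNC path) would hash to different values, so the path should be written consistently.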

@robpomeroy
Author

Yes, running Terraform's Linux build within WSL is my workaround. I suspected this might not be worth fixing.

Took me a while to figure out why Terraform was bombing out. Possibly the end user error message could be adjusted?

@apparentlymart
Member

Unfortunately "Incorrect function", aka ERROR_INVALID_FUNCTION, is a very generic Windows API error code that could represent a number of different problems, and so I think we'd need to be careful about making the error message too specific because giving specific-seeming guidance that is actually irrelevant is often more harmful than giving incomplete information. 😖

However, I think there are some concrete opportunities to improve this code and the error message it returns:

  1. The formatting of this error message suggests that Terraform is confusing itself about whether it's dealing with Go-style errors or Terraform-style diagnostics. It should not be exposing the tfdiags implementation detail and should instead be presenting each of these errors as a separate diagnostic.

  2. Once we are correctly handling the separate diagnostics, hopefully we can also do better than just copying the raw OS error into the first diagnostic, and instead recognize the "invalid function" error as something special that indicates that Terraform is running in an unsuitable filesystem.

    Although I think it would be a mistake for the error message to presume this is a WSL problem, it could still say something general about how the filesystem does not seem to support locking and that placing the working directory on a local filesystem might help.

  3. While researching this I noticed that golang.org/x/sys/windows now has a LockFileEx wrapper, so we could switch to using that instead of our own hand-written stub. The official one does essentially the same thing as ours does, so I don't expect this would actually change anything significant but it would at least make this code easier to maintain in future.
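As a rough illustration of the second point, recognizing the error code could look something like the sketch below. The names `lockUnsupported` and `describeLockError` and the message wording are hypothetical, not Terraform's actual code. The numeric value 1 is ERROR_INVALID_FUNCTION on Windows; on other platforms errno 1 means something else, so a real implementation would presumably live in a Windows-only file.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// errorInvalidFunction is the numeric value of the Windows error code
// ERROR_INVALID_FUNCTION ("Incorrect function."). It is spelled out here so
// that the sketch compiles on any platform.
const errorInvalidFunction syscall.Errno = 1

// lockUnsupported reports whether err wraps ERROR_INVALID_FUNCTION, which in
// this context is taken as a hint that the filesystem cannot lock files.
func lockUnsupported(err error) bool {
	var errno syscall.Errno
	return errors.As(err, &errno) && errno == errorInvalidFunction
}

// describeLockError is a hypothetical helper that turns a lock failure into
// user-facing text, staying general rather than presuming a WSL share.
func describeLockError(err error) string {
	if lockUnsupported(err) {
		return "The filesystem containing the working directory does not " +
			"seem to support file locking. Placing the working directory " +
			"on a local filesystem might help."
	}
	return fmt.Sprintf("Error acquiring the state lock: %s", err)
}

func main() {
	wrapped := fmt.Errorf("locking backend settings file: %w", errorInvalidFunction)
	fmt.Println(describeLockError(wrapped))
	fmt.Println(describeLockError(errors.New("some other failure")))
}
```

Using `errors.As` rather than comparing the error directly means the check still works when the OS error has been wrapped with extra context along the way.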

@apparentlymart apparentlymart added cli explained a Terraform Core team member has described the root cause of this issue in code and removed backend/s3 new new issue not yet triaged labels Jun 14, 2024
@apparentlymart apparentlymart changed the title terraform init fails on Windows in a WSL UNC path Poor error when terraform init fails to acquire a lock on the working directory's backend settings file on Windows (e.g. on a WSL share) Jun 14, 2024
@apparentlymart
Member

If microsoft/WSL#5762 is fixed at some point in a way that makes LockFileEx work on the WSL 9p filesystem then this specific reproduction method will no longer work and we'd need to find a different filesystem that returns ERROR_INVALID_FUNCTION when asked to take an exclusive lock.

@apparentlymart apparentlymart added the v1.8 Issues (primarily bugs) reported against v1.8 releases label Jun 17, 2024