Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

std buffer overflows from improper usage of wtf8ToWtf16Le #20288

Open
sno2 opened this issue Jun 14, 2024 · 10 comments · May be fixed by #20297
Open

std buffer overflows from improper usage of wtf8ToWtf16Le #20288

sno2 opened this issue Jun 14, 2024 · 10 comments · May be fixed by #20297
Labels
bug Observed behavior contradicts documented or intended behavior standard library This issue involves writing Zig code for the standard library.
Milestone

Comments

@sno2
Copy link
Contributor

sno2 commented Jun 14, 2024

From wtf8ToWtf16Le's documentation:

/// Assumes there is enough space for the output.

In the following std functions we do not guarantee this when we call wtf8ToWtf16Le:

We could either make the function itself return an error in some cases or create a wtf8CountCodepoints and make sure that this value is less than fs.max_path_bytes for output buffers.

@squeek502 squeek502 added bug Observed behavior contradicts documented or intended behavior standard library This issue involves writing Zig code for the standard library. labels Jun 14, 2024
@squeek502
Copy link
Collaborator

squeek502 commented Jun 14, 2024

I suspect there are currently quite a few places where the standard library just assumes that windows.PathSpace.data will be large enough.

@sno2
Copy link
Contributor Author

sno2 commented Jun 14, 2024

I will try to go with the wtf8CountCodepoints route and fix the known problems. Then, we can apply similar fixes to other overflows when we find them.

Edit: Wrong unicode API. Will need to create a variant of calcUtf16LeLen.

@squeek502
Copy link
Collaborator

squeek502 commented Jun 14, 2024

Will need to create a variant of calcUtf16LeLen.

That would be one way to go, but I think it can be simpler. It might be sufficient to just check

if (wtf8_path.len > std.fs.max_path_bytes)

since max_path_bytes is

zig/lib/std/fs.zig

Lines 56 to 60 in 82a934b

// Each WTF-16LE code unit may be expanded to 3 WTF-8 bytes.
// If it would require 4 WTF-8 bytes, then there would be a surrogate
// pair in the WTF-16LE, and we (over)account 3 bytes for it that way.
// +1 for the null byte at the end, which can be encoded in 1 byte.
.windows => windows.PATH_MAX_WIDE * 3 + 1,

Definitely check my assumptions on that, though.

The intention is for max_path_bytes to be able to handle windows.PATH_MAX_WIDE WTF-16 code units + a NUL terminator, but it's possible that also means that [windows.PATH_MAX_WIDE:0]u16 can handle max_path_bytes of WTF-8. I might be thinking about it wrong, though.

@sno2
Copy link
Contributor Author

sno2 commented Jun 14, 2024

since max_path_bytes is

After reading this, I was convinced this would be safe as well. However, this test program ends up failing after i=65536 so I am not so sure now.

const std = @import("std");

pub fn main() !void {
    const @"U+10FFFF": [4]u8 = .{ 244, 143, 191, 191 }; // first codepoint that requires 4 bytes in UTF-16LE

    const input = @"U+10FFFF" ** std.fs.max_path_bytes; // just a long number of bytes

    var buffer: [std.os.windows.PATH_MAX_WIDE]u16 = undefined;

    var i: usize = 0;
    while (true) : (i += 1) {
        // std.log.info("i={}", .{i});
        if (i > std.fs.max_path_bytes) return error.NameTooLong;
        _ = std.unicode.utf8ToUtf16Le(&buffer, input[0..i]) catch {};
    }
}
info: i=65536
thread 16168 panic: index out of bounds: index 32768, len 32767
C:\lib\zig\lib\std\unicode.zig:1245:34: 0xd128fd in utf8ToUtf16LeImpl__anon_2556 (foo.exe.obj)
            utf16le[dest_index..][0..2].* = .{ mem.nativeToLittle(u16, high), mem.nativeToLittle(u16, low) };
                                 ^
C:\lib\zig\lib\std\unicode.zig:1207:29: 0xd113ca in utf8ToUtf16Le (foo.exe.obj)
    return utf8ToUtf16LeImpl(utf16le, utf8, .cannot_encode_surrogate_half);
                            ^
C:\dev\zig\foo.zig:14:38: 0xd11189 in main (foo.exe.obj)
        _ = std.unicode.utf8ToUtf16Le(&buffer, input[0..i]) catch {};
                                     ^
C:\lib\zig\lib\std\start.zig:363:53: 0xd114ec in WinStartup (foo.exe.obj)
    std.os.windows.ntdll.RtlExitUserProcess(callMain());
                                                    ^
???:?:?: 0x7ff9339c257c in ??? (KERNEL32.DLL)
???:?:?: 0x7ff93516aa47 in ??? (ntdll.dll)

@squeek502
Copy link
Collaborator

squeek502 commented Jun 14, 2024

Yeah, I was thinking about it wrong. EDIT: ...and I was still thinking about it wrong here

wrong math
  • max_path_bytes is 98,302
  • The shortest possible sequence that expands to 2 UTF-16 code units is 4 bytes long (U+10000)
  • There are ~24,575 such sequences possible in 98,302 bytes
  • If each of those sequences take 2 UTF-16 code units, that means that the WTF-16 buffer would need 49,150 code units of space to safely encode max_path_bytes of WTF-8

EDIT: The correct math now that I finally have a grasp on things:

  • max_path_bytes is 98,302
  • Each byte can correspond to a WTF-16 code unit
  • The WTF-16 buffer would need 98,302 code units of space to safely encode max_path_bytes of WTF-8

@sno2
Copy link
Contributor Author

sno2 commented Jun 14, 2024

That's fine, I still have an idea of how we can skip counting most inputs beforehand because there is a maximum number of WTF-8 bytes where we can know that it will not overflow for.

@squeek502
Copy link
Collaborator

squeek502 commented Jun 14, 2024

Had the same thought. Hopefully I got this right this time:

EDIT: I got it wrong again. See comment below

@squeek502
Copy link
Collaborator

squeek502 commented Jun 14, 2024

Here's what I'm thinking, added to std.unicode:

// TODO: better name
// TODO: unsure if taking a slice for the wtf16 len makes sense, the only reason I think it might
//       is that it makes the length unit (code units rather than bytes) obvious
pub fn wtf8ToWtf16LeBufCheck(wtf16le: []const u16, wtf8: []const u8) !bool {
    if (wtf16le.len >= minWtf16CodeUnitsForWtf8Bytes(wtf8.len)) return true;
    return wtf16le.len >= (try calcWtf16LeLen(wtf8));
}

pub fn minWtf16CodeUnitsForWtf8Bytes(wtf8_bytes: usize) usize {
    // The largest ratio of WTF-8 bytes to WTF-16 code units is 1:1 for
    // one byte WTF-8 sequences, since all other WTF-8 sequence lengths
    // have a ratio of 2:1, 3:1, or 4:2
    return wtf8_bytes;
}

pub fn calcWtf16LeLen(wtf8: []const u8) !usize {
    // calcUtf16LeLen but for WTF-16
}

so then usage would be something like:

var wtf16_dir_path: [windows.PATH_MAX_WIDE]u16 = undefined;
const buf_big_enough = try std.unicode.wtf8ToWtf16LeBufCheck(&wtf16_dir_path, dir_path);
if (!buf_big_enough) {
    return error.NameTooLong;
}

// We now know it is safe to do this conversion, i.e. `len` is guaranteed to be <= wtf16_dir_path.len
// Note that we do *not* know that `dir_path` is valid WTF-8, though, since we don't always need to
// call calcWtf16LeLen in wtf8ToWtf16LeBufCheck
const len = try std.unicode.wtf8ToWtf16Le(wtf16_dir_path[0..], dir_path);

EDIT: I had the implementation of minWtf16CodeUnitsForWtf8Bytes completely wrong at first.

@Vexu Vexu added this to the 0.14.0 milestone Jun 14, 2024
@Trevor-Strong
Copy link

I would prefer renaming the current wtf8ToWtf16Le to something like wtf8ToWtf16LeAssumeCapacity and have wtf8ToWtf16Le be the length checked variant.
This is would also be more inline with other parts of the standard library that are safe by default (i.e. append and appendAssumeCapacity). Although std.unicode in particular seems to favor the unsafe by default methodology.

sno2 added a commit to sno2/zig that referenced this issue Jun 14, 2024
@sno2 sno2 linked a pull request Jun 14, 2024 that will close this issue
@sno2
Copy link
Contributor Author

sno2 commented Jun 15, 2024

@Trevor-Strong I agree, the current design can easily lead to footguns as shown by this issue. However, I didn't include any changes related to that in my PR to avoid bikeshedding. I advise creating a new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Observed behavior contradicts documented or intended behavior standard library This issue involves writing Zig code for the standard library.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants