std buffer overflows from improper usage of `wtf8ToWtf16Le` #20288

sno2 · 2024-06-14T02:07:37Z

From wtf8ToWtf16Le's documentation:

/// Assumes there is enough space for the output.

In the following std functions we do not guarantee this when we call wtf8ToWtf16Le:

We could either make the function itself return an error in some cases or create a wtf8CountCodepoints and make sure that this value is less than fs.max_path_bytes for output buffers.

The text was updated successfully, but these errors were encountered:

squeek502 · 2024-06-14T02:12:03Z

I suspect there are currently quite a few places where the standard library just assumes that windows.PathSpace.data will be large enough.

sno2 · 2024-06-14T02:19:55Z

I will try to go with the wtf8CountCodepoints route and fix the known problems. Then, we can apply similar fixes to other overflows when we find them.

Edit: Wrong unicode API. Will need to create a variant of calcUtf16LeLen.

squeek502 · 2024-06-14T02:28:12Z

Will need to create a variant of calcUtf16LeLen.

That would be one way to go, but I think it can be simpler. It might be sufficient to just check

if (wtf8_path.len > std.fs.max_path_bytes)

since max_path_bytes is

zig/lib/std/fs.zig

Lines 56 to 60 in 82a934b

    
           // Each WTF-16LE code unit may be expanded to 3 WTF-8 bytes. 
        
           // If it would require 4 WTF-8 bytes, then there would be a surrogate 
        
           // pair in the WTF-16LE, and we (over)account 3 bytes for it that way. 
        
           // +1 for the null byte at the end, which can be encoded in 1 byte. 
        
           .windows => windows.PATH_MAX_WIDE * 3 + 1,

Definitely check my assumptions on that, though.

The intention is for max_path_bytes to be able to handle windows.PATH_MAX_WIDE WTF-16 code units + a NUL terminator, but it's possible that also means that [windows.PATH_MAX_WIDE:0]u16 can handle max_path_bytes of WTF-8. I might be thinking about it wrong, though.

sno2 · 2024-06-14T02:56:27Z

since max_path_bytes is

After reading this, I was convinced this would be safe as well. However, this test program ends up failing after i=65536 so I am not so sure now.

const std = @import("std");

pub fn main() !void {
    const @"U+10FFFF": [4]u8 = .{ 244, 143, 191, 191 }; // first codepoint that requires 4 bytes in UTF-16LE

    const input = @"U+10FFFF" ** std.fs.max_path_bytes; // just a long number of bytes

    var buffer: [std.os.windows.PATH_MAX_WIDE]u16 = undefined;

    var i: usize = 0;
    while (true) : (i += 1) {
        // std.log.info("i={}", .{i});
        if (i > std.fs.max_path_bytes) return error.NameTooLong;
        _ = std.unicode.utf8ToUtf16Le(&buffer, input[0..i]) catch {};
    }
}

info: i=65536
thread 16168 panic: index out of bounds: index 32768, len 32767
C:\lib\zig\lib\std\unicode.zig:1245:34: 0xd128fd in utf8ToUtf16LeImpl__anon_2556 (foo.exe.obj)
            utf16le[dest_index..][0..2].* = .{ mem.nativeToLittle(u16, high), mem.nativeToLittle(u16, low) };
                                 ^
C:\lib\zig\lib\std\unicode.zig:1207:29: 0xd113ca in utf8ToUtf16Le (foo.exe.obj)
    return utf8ToUtf16LeImpl(utf16le, utf8, .cannot_encode_surrogate_half);
                            ^
C:\dev\zig\foo.zig:14:38: 0xd11189 in main (foo.exe.obj)
        _ = std.unicode.utf8ToUtf16Le(&buffer, input[0..i]) catch {};
                                     ^
C:\lib\zig\lib\std\start.zig:363:53: 0xd114ec in WinStartup (foo.exe.obj)
    std.os.windows.ntdll.RtlExitUserProcess(callMain());
                                                    ^
???:?:?: 0x7ff9339c257c in ??? (KERNEL32.DLL)
???:?:?: 0x7ff93516aa47 in ??? (ntdll.dll)

squeek502 · 2024-06-14T03:05:57Z

Yeah, I was thinking about it wrong. EDIT: ...and I was still thinking about it wrong here

wrong math

max_path_bytes is 98,302
The shortest possible sequence that expands to 2 UTF-16 code units is 4 bytes long (U+10000)
There are ~24,575 such sequences possible in 98,302 bytes
If each of those sequences take 2 UTF-16 code units, that means that the WTF-16 buffer would need 49,150 code units of space to safely encode max_path_bytes of WTF-8

EDIT: The correct math now that I finally have a grasp on things:

max_path_bytes is 98,302
Each byte can correspond to a WTF-16 code unit
The WTF-16 buffer would need 98,302 code units of space to safely encode max_path_bytes of WTF-8

sno2 · 2024-06-14T03:15:25Z

That's fine, I still have an idea of how we can skip counting most inputs beforehand because there is a maximum number of WTF-8 bytes where we can know that it will not overflow for.

squeek502 · 2024-06-14T03:25:20Z

Had the same thought. Hopefully I got this right this time:

EDIT: I got it wrong again. See comment below

squeek502 · 2024-06-14T04:02:15Z

Here's what I'm thinking, added to std.unicode:

// TODO: better name
// TODO: unsure if taking a slice for the wtf16 len makes sense, the only reason I think it might
//       is that it makes the length unit (code units rather than bytes) obvious
pub fn wtf8ToWtf16LeBufCheck(wtf16le: []const u16, wtf8: []const u8) !bool {
    if (wtf16le.len >= minWtf16CodeUnitsForWtf8Bytes(wtf8.len)) return true;
    return wtf16le.len >= (try calcWtf16LeLen(wtf8));
}

pub fn minWtf16CodeUnitsForWtf8Bytes(wtf8_bytes: usize) usize {
    // The largest ratio of WTF-8 bytes to WTF-16 code units is 1:1 for
    // one byte WTF-8 sequences, since all other WTF-8 sequence lengths
    // have a ratio of 2:1, 3:1, or 4:2
    return wtf8_bytes;
}

pub fn calcWtf16LeLen(wtf8: []const u8) !usize {
    // calcUtf16LeLen but for WTF-16
}

so then usage would be something like:

var wtf16_dir_path: [windows.PATH_MAX_WIDE]u16 = undefined;
const buf_big_enough = try std.unicode.wtf8ToWtf16LeBufCheck(&wtf16_dir_path, dir_path);
if (!buf_big_enough) {
    return error.NameTooLong;
}

// We now know it is safe to do this conversion, i.e. `len` is guaranteed to be <= wtf16_dir_path.len
// Note that we do *not* know that `dir_path` is valid WTF-8, though, since we don't always need to
// call calcWtf16LeLen in wtf8ToWtf16LeBufCheck
const len = try std.unicode.wtf8ToWtf16Le(wtf16_dir_path[0..], dir_path);

EDIT: I had the implementation of minWtf16CodeUnitsForWtf8Bytes completely wrong at first.

Trevor-Strong · 2024-06-14T13:17:51Z

I would prefer renaming the current wtf8ToWtf16Le to something like wtf8ToWtf16LeAssumeCapacity and have wtf8ToWtf16Le be the length checked variant.
This is would also be more inline with other parts of the standard library that are safe by default (i.e. append and appendAssumeCapacity). Although std.unicode in particular seems to favor the unsafe by default methodology.

Closes ziglang#20288

sno2 · 2024-06-15T03:28:41Z

@Trevor-Strong I agree, the current design can easily lead to footguns as shown by this issue. However, I didn't include any changes related to that in my PR to avoid bikeshedding. I advise creating a new issue.

sno2 mentioned this issue Jun 14, 2024

std.fs.Dir: Refactor atomicSymLink from std.fs #20279

Open

squeek502 added bug Observed behavior contradicts documented or intended behavior standard library This issue involves writing Zig code for the standard library. labels Jun 14, 2024

Vexu added this to the 0.14.0 milestone Jun 14, 2024

sno2 added a commit to sno2/zig that referenced this issue Jun 14, 2024

std: fix buffer overflows from improper WTF encoding

0b35080

Closes ziglang#20288

sno2 linked a pull request Jun 14, 2024 that will close this issue

std: fix buffer overflows from improper WTF encoding #20297

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

std buffer overflows from improper usage of `wtf8ToWtf16Le` #20288

std buffer overflows from improper usage of `wtf8ToWtf16Le` #20288

sno2 commented Jun 14, 2024

squeek502 commented Jun 14, 2024 •

edited

Loading

sno2 commented Jun 14, 2024 •

edited

Loading

squeek502 commented Jun 14, 2024 •

edited

Loading

sno2 commented Jun 14, 2024 •

edited

Loading

squeek502 commented Jun 14, 2024 •

edited

Loading

sno2 commented Jun 14, 2024 •

edited

Loading

squeek502 commented Jun 14, 2024 •

edited

Loading

squeek502 commented Jun 14, 2024 •

edited

Loading

Trevor-Strong commented Jun 14, 2024

sno2 commented Jun 15, 2024

std buffer overflows from improper usage of wtf8ToWtf16Le #20288

std buffer overflows from improper usage of wtf8ToWtf16Le #20288

Comments

sno2 commented Jun 14, 2024

squeek502 commented Jun 14, 2024 • edited Loading

sno2 commented Jun 14, 2024 • edited Loading

squeek502 commented Jun 14, 2024 • edited Loading

sno2 commented Jun 14, 2024 • edited Loading

squeek502 commented Jun 14, 2024 • edited Loading

sno2 commented Jun 14, 2024 • edited Loading

squeek502 commented Jun 14, 2024 • edited Loading

squeek502 commented Jun 14, 2024 • edited Loading

Trevor-Strong commented Jun 14, 2024

sno2 commented Jun 15, 2024

std buffer overflows from improper usage of `wtf8ToWtf16Le` #20288

std buffer overflows from improper usage of `wtf8ToWtf16Le` #20288

squeek502 commented Jun 14, 2024 •

edited

Loading

sno2 commented Jun 14, 2024 •

edited

Loading

squeek502 commented Jun 14, 2024 •

edited

Loading

sno2 commented Jun 14, 2024 •

edited

Loading

squeek502 commented Jun 14, 2024 •

edited

Loading

sno2 commented Jun 14, 2024 •

edited

Loading

squeek502 commented Jun 14, 2024 •

edited

Loading

squeek502 commented Jun 14, 2024 •

edited

Loading