[BUG] crash when attempting to use MAC mps when wrapping PyTorch #3092

Open
mytechnotalent opened this issue Jun 21, 2024 · 0 comments
Labels: bug (Something isn't working), mojo-repo (Tag all issues with this label)

mytechnotalent commented Jun 21, 2024

Bug description

Running mojo train.mojo crashes with a bus error. The full stack dump follows.

Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0.      Program arguments: mojo train.mojo
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0  mojo                     0x0000000104448c24 llvm_strlcpy + 51480
1  mojo                     0x0000000104446f10 llvm_strlcpy + 44036
2  mojo                     0x00000001044492c4 llvm_strlcpy + 53176
3  libsystem_platform.dylib 0x000000018fc5f584 _sigtramp + 56
4  libtorch_python.dylib    0x00000001119f2d20 has_torch_function_attr(_object*) + 52
5  libtorch_python.dylib    0x00000001119a323c torch::is_tensor_and_append_overloaded(_object*, std::__1::vector<_object*, std::__1::allocator<_object*>>*) + 92
6  libtorch_python.dylib    0x00000001119a3c18 torch::FunctionParameter::check(_object*, std::__1::vector<_object*, std::__1::allocator<_object*>>&, int, long long*) + 560
7  libtorch_python.dylib    0x00000001119a5688 torch::FunctionSignature::parse(_object*, _object*, _object*, _object**, std::__1::vector<_object*, std::__1::allocator<_object*>>&, bool) + 536
8  libtorch_python.dylib    0x00000001119a67dc torch::PythonArgParser::raw_parse(_object*, _object*, _object*, _object**) + 108
9  libtorch_python.dylib    0x0000000111430bdc torch::autograd::THPVariable_linear(_object*, _object*, _object*) + 116
10 Python                   0x000000010f52ea0c cfunction_call + 72
11 Python                   0x000000010f4bfe18 _PyObject_MakeTpCall + 128
12 Python                   0x000000010f604ff8 _PyEval_EvalFrameDefault + 47004
13 Python                   0x000000010f4c3e70 method_vectorcall + 180
14 Python                   0x000000010f606d24 _PyEval_EvalFrameDefault + 54472
15 Python                   0x000000010f4c3e70 method_vectorcall + 180
16 Python                   0x000000010f606d24 _PyEval_EvalFrameDefault + 54472
17 Python                   0x000000010f4bfb9c _PyObject_FastCallDictTstate + 96
18 Python                   0x000000010f55aeac slot_tp_call + 208
19 Python                   0x000000010f4bfe18 _PyObject_MakeTpCall + 128
20 Python                   0x000000010f604ff8 _PyEval_EvalFrameDefault + 47004
21 Python                   0x000000010f4c3e70 method_vectorcall + 180
22 Python                   0x000000010f606d24 _PyEval_EvalFrameDefault + 54472
23 Python                   0x000000010f4c3e70 method_vectorcall + 180
24 Python                   0x000000010f606d24 _PyEval_EvalFrameDefault + 54472
25 Python                   0x000000010f4bfb9c _PyObject_FastCallDictTstate + 96
26 Python                   0x000000010f55aeac slot_tp_call + 208
27 Python                   0x000000010f4c0d64 _PyObject_Call + 164
28 Python                   0x000000030009858c _PyObject_Call + 8333916364
29 mojo                     0x00000001047dd530 __jit_debug_register_code + 1041480
30 mojo                     0x00000001043a956c
31 mojo                     0x00000001043a8f60
32 mojo                     0x0000000104391960
33 dyld                     0x000000018f8a60e0 start + 2360
mojo crashed!
Please file a bug report.
[73148:1793171:20240621,070907.221789:WARNING process_memory_mac.cc:93] mach_vm_read(0x107750000, 0x8000): (os/kern) invalid address (1)
[73148:1793171:20240621,070907.222027:WARNING process_memory_mac.cc:93] mach_vm_read(0x107750000, 0x8000): (os/kern) invalid address (1)
[73148:1793171:20240621,070907.222151:WARNING process_memory_mac.cc:93] mach_vm_read(0x107750000, 0x8000): (os/kern) invalid address (1)
[73148:1793171:20240621,070907.222275:WARNING process_memory_mac.cc:93] mach_vm_read(0x107750000, 0x8000): (os/kern) invalid address (1)
[73148:1793171:20240621,070907.222396:WARNING process_memory_mac.cc:93] mach_vm_read(0x107750000, 0x8000): (os/kern) invalid address (1)
[73148:1793171:20240621,070907.222517:WARNING process_memory_mac.cc:93] mach_vm_read(0x107750000, 0x8000): (os/kern) invalid address (1)
[73148:1793171:20240621,070907.222634:WARNING process_memory_mac.cc:93] mach_vm_read(0x107750000, 0x8000): (os/kern) invalid address (1)
[73148:1793171:20240621,070907.285115:WARNING in_range_cast.h:38] value -97304226 out of range
[73148:1793171:20240621,070907.288890:WARNING crash_report_exception_handler.cc:257] UniversalExceptionRaise: (os/kern) failure (5)
zsh: bus error  mojo train.mojo
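
To make the dump above easier to read: on macOS, the frames above `_sigtramp` usually belong to the signal handler that prints the dump, and the frame directly below `_sigtramp` is where the signal was raised. The Python sketch below parses a few frames copied from the dump and picks out that frame. The helper names `parse_frames` and `faulting_frame` are hypothetical and only illustrate the reading; they are not part of Mojo or LLVM.

```python
import re

# Matches lines such as "4  libtorch_python.dylib    0x... symbol + 52".
FRAME_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s*(.*)$")

# Frames 2-5 copied from the stack dump in this report.
SAMPLE = """\
2  mojo                     0x00000001044492c4 llvm_strlcpy + 53176
3  libsystem_platform.dylib 0x000000018fc5f584 _sigtramp + 56
4  libtorch_python.dylib    0x00000001119f2d20 has_torch_function_attr(_object*) + 52
5  libtorch_python.dylib    0x00000001119a323c torch::is_tensor_and_append_overloaded(_object*, std::__1::vector<_object*, std::__1::allocator<_object*>>*) + 92
"""


def parse_frames(dump):
    """Return (index, image, symbol) tuples for every frame line in the dump."""
    frames = []
    for line in dump.splitlines():
        match = FRAME_RE.match(line)
        if match:
            index, image, _address, rest = match.groups()
            # Drop the trailing " + <offset>" so only the symbol name remains.
            symbol = re.sub(r"\s\+\s\d+$", "", rest).strip()
            frames.append((int(index), image, symbol))
    return frames


def faulting_frame(frames):
    """Return the frame listed right after _sigtramp, or None if absent."""
    for position, (_, _, symbol) in enumerate(frames):
        if symbol == "_sigtramp" and position + 1 < len(frames):
            return frames[position + 1]
    return None


print(faulting_frame(parse_frames(SAMPLE)))
```

Read this way, the fault appears to be raised in `has_torch_function_attr` inside libtorch_python.dylib, reached from `THPVariable_linear` while PyTorch parses the arguments of a linear call that Mojo invoked through the Python interop layer.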

net.mojo

from python import Python

struct Net:
    """
    Simple neural network for classification.

    Attributes:
        model: Sequential model containing layers of the network.
        device: Device to run the model on (e.g., 'mps' or 'cpu').
    """
    var model: PythonObject
    var device: PythonObject

    fn __init__(inout self):
        """
        Initializes the neural network layers.
        """
        try:
            var torch = Python.import_module("torch")
            var nn = torch.nn
            if torch.backends.mps.is_built():
                self.device = torch.device("mps")
            else:
                self.device = torch.device("cpu")
            self.model = nn.Sequential(
                nn.Linear(2, 5),
                nn.ReLU(),
                nn.Linear(5, 5),
                nn.ReLU(),
                nn.Linear(5, 2)
            ).to(self.device)
        except e:
            print("Error importing PyTorch:", e)
            self.model = None
            self.device = None

    fn __copyinit__(inout self, other: Net):
        """
        Initializes a copy of Net from another instance.

        Args:
            other: Another instance of Net.
        """
        self.model = other.model
        self.device = other.device

    fn forward(self, x: PythonObject) raises -> PythonObject:
        """
        Defines the forward pass of the network.

        Args:
            x: Input tensor.

        Returns:
            Output tensor after passing through the network.
        """
        try:
            if x is None:
                raise ("Input tensor is None")

            var torch = Python.import_module("torch")
            if not torch.is_tensor(x):
                raise ("Input is not a valid tensor")

            var x_tensor = x.to(self.device) if x.device != self.device else x

            if x_tensor is None:
                raise ("Failed to move tensor to the correct device")

            return self.model(x_tensor)
        except e:
            raise Error("Error during forward pass: " + str(e))

    fn backward(self, loss: PythonObject) raises:
        """
        Performs backward pass and updates gradients.

        Args:
            loss: Loss tensor calculated during forward pass.
        """
        try:
            loss.backward()
        except e:
            raise Error("Error during backward pass: " + str(e))

    fn predict_probabilities(self, x: PythonObject) raises -> PythonObject:
        """
        Calculates class probabilities using softmax after forward pass.

        Args:
            x: Input tensor.

        Returns:
            Probability distribution over classes.
        """
        try:
            var torch = Python.import_module("torch")
            var F = torch.nn.functional
            if x is None:
                raise ("Input tensor is None")

            var x_tensor = torch.tensor(x, dtype=torch.float32).unsqueeze(0).to(self.device)

            if x_tensor is None:
                raise ("Failed to create tensor from input")

            var logits = self.model(x_tensor)
            var probabilities = F.softmax(logits, dim=1)
            return probabilities
        except e:
            raise Error("Error calculating probabilities: " + str(e))

    fn predict_number(self, x: PythonObject) raises -> Int:
        """
        Predicts the class label using the trained model.

        Args:
            x: Input tensor.

        Returns:
            Predicted class label.
        """
        try:
            var torch = Python.import_module("torch")
            if x is None:
                raise ("Input tensor is None")

            var x_tensor = torch.tensor(x, dtype=torch.float32).unsqueeze(0).to(self.device)

            if x_tensor is None:
                raise ("Failed to create tensor from input")

            var logits = self.model(x_tensor)
            var prediction = logits.argmax(dim=1).item()  # Get the index of the max probability
            return prediction
        except e:
            raise Error("Error during prediction: " + str(e))
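
A side note on the device selection in net.mojo: `torch.backends.mps.is_built()` only reports whether the installed PyTorch was compiled with MPS support, while `torch.backends.mps.is_available()` also checks that the current machine can use it. The Python sketch below shows a stricter check. `pick_device_name` is a hypothetical helper, and it takes the torch module as an argument so it can be exercised without PyTorch installed. This is not claimed to be the cause of the crash.

```python
from types import SimpleNamespace


def pick_device_name(torch_module):
    """Return "mps" only when the backend is both built and available."""
    mps = getattr(torch_module.backends, "mps", None)
    if mps is not None and mps.is_built() and mps.is_available():
        return "mps"
    return "cpu"


def fake_torch(built, available):
    """Build a stand-in for the torch module (an assumption, for illustration)."""
    mps = SimpleNamespace(is_built=lambda: built, is_available=lambda: available)
    return SimpleNamespace(backends=SimpleNamespace(mps=mps))


# Built but not usable on this machine: fall back to the CPU.
print(pick_device_name(fake_torch(built=True, available=False)))
# Built and usable: select MPS.
print(pick_device_name(fake_torch(built=True, available=True)))
```

With the real library, the call would be `pick_device_name(torch)` after `import torch`, and the result can be passed to `torch.device(...)`.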

train.mojo

from python import Python
from mojo.net import Net

fn main() raises:
    try:
        var torch = Python.import_module("torch")
        var sklearn = Python.import_module("sklearn.model_selection")
        var train_test_split = sklearn.train_test_split
        var optim = torch.optim
        var model = Net()
        var seed_value = 42
        torch.manual_seed(seed_value) 
        var input_data = torch.randn(64, 2)  # example input tensor with batch size 64 and input size 2
        var target_data = torch.randint(0, 2, (64,))  # example target tensor with batch size 64 and 2 classes
        var split_result = train_test_split(input_data, target_data, test_size=0.2, random_state=seed_value)
        var train_inputs = split_result[0]
        var test_inputs = split_result[1]
        var train_targets = split_result[2]
        var test_targets = split_result[3]
        var criterion = torch.nn.CrossEntropyLoss()
        var optimizer = optim.Adam(model.model.parameters(), lr=0.01)
        # training loop
        var num_epochs = 100
        for epoch in range(num_epochs):
            model.model.train()  # set the model to training mode
            optimizer.zero_grad()  # zero the gradients
            var output = model.forward(train_inputs)  # forward pass
            var loss = criterion(output, train_targets)  # calculate the loss
            model.backward(loss)  # backward pass
            optimizer.step()  # update weights
            print('epoch, loss:', epoch + 1, num_epochs, loss.item())
        torch.save(model.model.state_dict(), "model.pth")
        # evaluate the model on test data
        model.model.eval()  # set the model to evaluation mode
        var test_output = model.forward(test_inputs)  # forward pass on test data
        var test_loss = criterion(test_output, test_targets)  # calculate test loss
        print('test loss:', test_loss.item())

    except e:
        print("error during execution:", e)
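
As a control experiment, the same network and training loop can be run in plain Python to see whether the crash also occurs without Mojo. The sketch below is an assumption about what such a control could look like, not code from the original project: `run_control` is a hypothetical name, the scikit-learn split is omitted, and the function returns None when PyTorch is not installed. Unlike train.mojo, it moves the targets to the same device as the model, since plain PyTorch would normally be expected to raise an error when the loss inputs live on different devices.

```python
def run_control(device_name="cpu", steps=3, seed=42):
    """Train the 2-5-5-2 network in plain Python and return the last loss.

    Returns None when PyTorch cannot be imported.
    """
    try:
        import torch
    except ImportError:
        return None

    torch.manual_seed(seed)
    device = torch.device(device_name)
    model = torch.nn.Sequential(
        torch.nn.Linear(2, 5),
        torch.nn.ReLU(),
        torch.nn.Linear(5, 5),
        torch.nn.ReLU(),
        torch.nn.Linear(5, 2),
    ).to(device)

    # Same shapes as train.mojo: 64 samples, 2 features, 2 classes.
    inputs = torch.randn(64, 2).to(device)
    targets = torch.randint(0, 2, (64,)).to(device)

    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

    last_loss = None
    for _ in range(steps):
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        last_loss = loss.item()
    return last_loss


print(run_control("cpu"))
```

On the affected machine, `run_control("mps")` exercises the MPS path. If that completes under Python while `mojo train.mojo` still crashes, the problem more likely sits in the Mojo and Python interop layer than in PyTorch itself.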

Steps to reproduce

file structure

mojonet
  mojo
    __init__.mojo
    net.mojo
  train.mojo

mojo train.mojo

test_net.mojo

from python import Python
from mojo.net import Net 
from testing import assert_true

fn test_net_init() raises:
    """
    Test the initialization of the Net class.
    """
    var net: Net = Net()
    assert_true(net.model is not None, "Model should be initialized")

fn test_net_forward() raises:
    """
    Test the forward pass of the Net class.
    """
    var net: Net = Net()
    var torch: PythonObject = Python.import_module("torch")
    var train_inputs: PythonObject = torch.tensor([0.2656, -0.0026])
    var output: PythonObject = net.forward(train_inputs)
    assert_true(output is not None, "Forward pass should produce an output")

fn test_net_backward() raises:
    """
    Test the backward pass of the Net class.
    """
    var net: Net = Net()
    var torch: PythonObject = Python.import_module("torch")
    var nn: PythonObject = torch.nn
    var criterion: PythonObject = nn.MSELoss()
    var train_inputs: PythonObject = torch.tensor([0.0, 1.0])
    var train_targets: PythonObject = torch.tensor([0.5, -0.5])
    var output: PythonObject = net.forward(train_inputs)
    var loss: PythonObject = criterion(output, train_targets)
    var backward_output: PythonObject = net.backward(loss)
    assert_true(backward_output is None, "Backward pass should not produce an error")
    
fn test_net_predict_probabilities() raises:
    """
    Test the predict_probabilities method of the Net class.
    """
    var net: Net = Net()
    var probabilities: PythonObject = net.predict_probabilities([0.5, -0.5])
    assert_true(probabilities is not None, "Predict probabilities should produce an output")

fn test_net_predict_number() raises:
    """
    Test the predict_number method of the Net class.
    """
    var net: Net = Net()
    var prediction: Int = net.predict_number([0.5, -0.5])
    assert_true(0 <= prediction < 2, "Prediction should be between 0 and 1")

fn main() raises:
    try:
        test_net_init()
        test_net_forward()
        test_net_backward()
        test_net_predict_probabilities()
        test_net_predict_number()
    except e:
        print(e)

mojo test test_net.mojo
result

Testing Time: 3.840s

Total Discovered Tests: 5

Passed : 3 (60.00%)
Failed : 2 (40.00%)
Skipped: 0 (0.00%)

******************** Failure: '/Users/kevinthomas/Desktop/mojonet/test_net.mojo::test_net_backward()' ********************

execution failed

2024-06-21 07:17:30.675299-0400 mojo-repl-entry-point[73384:1797920] flock failed to lock list file (/var/folders/76/xcggvjjn2zq1z1z4l1hgnrn80000gn/C//com.apple.metal/16777235_275/functions.list): errno = 35
2024-06-21 07:17:30.675339-0400 mojo-repl-entry-point[73384:1797920] flock failed to lock list file (/var/folders/76/xcggvjjn2zq1z1z4l1hgnrn80000gn/C//com.apple.metal/16777235_275/functions1.list): errno = 35


error: Execution was interrupted, reason: EXC_BAD_ACCESS (code=2, address=0x10f287170).
The process has been left at the point where it was interrupted, use "thread return -x" to return to the state before expression evaluation.

********************

******************** Failure: '/Users/kevinthomas/Desktop/mojonet/test_net.mojo::test_net_forward()' ********************

execution failed

2024-06-21 07:17:30.666774-0400 mojo-repl-entry-point[73390:1797940] flock failed to lock list file (/var/folders/76/xcggvjjn2zq1z1z4l1hgnrn80000gn/C//com.apple.metal/16777235_275/functions.list): errno = 35


error: Execution was interrupted, reason: EXC_BAD_ACCESS (code=2, address=0x11cddf0b0).
The process has been left at the point where it was interrupted, use "thread return -x" to return to the state before expression evaluation.

********************

System information

- What OS did you install Mojo on? macOS on a Mac with an M3 chip
- Mojo version: `mojo 24.4.0 (2cb57382)`
- Modular CLI version: `modular 0.8.0 (39a426b5)`

mytechnotalent added the bug and mojo-repo labels on Jun 21, 2024