NNLang Documentation

Welcome to the NNLang documentation. NNLang is a declarative language for defining neural network architectures. Its compiler, nnc, produces standalone, zero-dependency native binaries with embedded weights.

Key Features

No runtime dependencies — compiled models are self-contained static binaries
No heap allocation — all memory is statically allocated at compile time
Human-readable source — model architectures are defined in plain-text .nnl files
Systems-first — nnc targets bare-metal-capable output

Documentation Sections

Getting Started — installation and your first model
Language Reference — complete syntax reference
CLI Reference — all compiler commands
Examples — complete working examples
Code Generation — how the compiler works

Getting Started

This guide will help you install and run your first NNLang model.

Installation

From crates.io

cargo install nnlang

From source

git clone https://github.com/gdesouza/nnl
cd nnl
cargo install --path .

Pre-built binaries

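Download the latest release from GitHub. The * in each URL stands for the release version; curl does not expand it, so substitute the exact asset name from the releases page: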

# Linux
curl -L https://github.com/gdesouza/nnl/releases/latest/download/nnc-*-x86_64-unknown-linux-gnu.tar.gz | tar xz
sudo mv nnc /usr/local/bin/

# macOS
curl -L https://github.com/gdesouza/nnl/releases/latest/download/nnc-*-x86_64-apple-darwin.tar.gz | tar xz
sudo mv nnc /usr/local/bin/

Quick Start

1. Create a model file

Save this as model.nnl:

version 0.2;

model my_model {
    config {
        weights: "./weights";
        io: "stdio";
    }

    layer input  = Input(shape: [4]);
    layer fc1   = Dense(units: 3, activation: "relu");
    layer fc2   = Dense(units: 2);
}

2. Create weight files

Create a weights/ directory with:

weights/fc1.weight.npy — [4, 3] matrix
weights/fc1.bias.npy — [3] vector
weights/fc2.weight.npy — [3, 2] matrix
weights/fc2.bias.npy — [2] vector

3. Compile

nnc compile model.nnl --emit exe -o model

4. Run inference

# Input: 4 little-endian float32 values (1.0, 2.0, 3.0, 4.0)
echo -n -e '\x00\x00\x80\x3f\x00\x00\x00@\x00\x00@@\x00\x00\x80@' > input.bin
./model < input.bin > output.bin

Or test with known input/output:

nnc test model.nnl --input test_input.npy --expected expected_output.npy
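
The byte string in step 4 is easier to produce and check from a script. The following is a minimal sketch, assuming the executable reads and writes consecutive little-endian float32 values, as the example bytes suggest. The helper names encode_input and decode_output are illustrative and not part of nnc.

```python
import struct

def encode_input(values):
    """Pack floats as consecutive little-endian float32 values."""
    return struct.pack(f"<{len(values)}f", *values)

def decode_output(raw):
    """Unpack a byte string of little-endian float32 values into a list."""
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}f", raw[: count * 4]))

payload = encode_input([1.0, 2.0, 3.0, 4.0])
# These are the same 16 bytes that the echo command in step 4 writes.
assert payload == b"\x00\x00\x80\x3f\x00\x00\x00@\x00\x00@@\x00\x00\x80@"

# Round-trip check on values that float32 represents exactly.
print(decode_output(encode_input([0.5, -2.0])))
```

Writing payload to input.bin and passing the contents of output.bin to decode_output reproduces step 4 without hand-written escape sequences.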

Next Steps

Language Reference — full language syntax
CLI Reference — all commands
Examples — complete model examples

NNLang Language Reference (v0.2)

A practical reference for writing .nnl model files consumed by the nnc compiler.

File Structure

An NNL file has a fixed top-level structure:

version 0.2;

model <name> {
    config { ... }

    layer <id> = <LayerType>(<params>);
    ...

    connections { ... }   // optional
}

Section	Required	Purpose
`version`	No (warns if absent)	Declares the NNL spec version.
`model`	Yes	Names the model. Determines the generated C symbols (e.g., `model_name_infer`).
`config`	Yes	Global compilation settings (precision, weights path, target, etc.).
Layers	Yes	One or more `layer` declarations defining the network.
`connections`	No	Explicit data-flow graph. Omit for simple sequential models.

Comments

// Line comment — extends to end of line

/* Block comment —
   can span multiple lines */

Config Block

The config block sets compilation and runtime parameters.

config {
    precision: "float32";
    weights: "./weights/mnist.npz";
    target: "generic";
    align: 64;
    batch: 1;
    preprocess: "normalize_0_1";
    io: "stdio";
}

Key	Type	Required	Default	Description
`precision`	String	No	`"float32"`	Tensor data type. `"float32"`, `"float64"`, `"int8"`.
`weights`	String	Yes	—	Path to weights: directory of `.npy` files, `.npz` archive, or `.onnx` file.
`target`	String	No	`"generic"`	SIMD optimization target. `"generic"`, `"avx2"`, `"avx512"`, `"arm_neon"`.
`align`	Number	No	`64`	Memory alignment in bytes for weight and workspace buffers.
`batch`	Number	No	`1`	Inference batch size. Determines static buffer dimensions.
`preprocess`	String	No	`"none"`	Input preprocessing. `"none"`, `"normalize_0_1"`, `"standardize"`.
`preprocess_mean`	Shape	No	—	Per-channel mean for `"standardize"` (e.g., `[0.485, 0.456, 0.406]`).
`preprocess_std`	Shape	No	—	Per-channel std for `"standardize"` (e.g., `[0.229, 0.224, 0.225]`).
`io`	String	No	`"stdio"`	I/O mode for `--emit exe` binaries. `"stdio"` reads stdin and writes stdout; the `nnc new` starter models use `"none"`.

Preprocessing modes:

"normalize_0_1" — divides each input element by 255.0.
"standardize" — applies (x - mean) / std per channel; requires preprocess_mean and preprocess_std.

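The two preprocessing modes can be written as a short NumPy reference. This is a sketch of the arithmetic described above, not the generated C code, and it assumes the channel axis is the last one (HWC), so the per-channel mean and std broadcast over it.

```python
import numpy as np

def normalize_0_1(x):
    """Divide every element by 255.0."""
    return x.astype(np.float32) / np.float32(255.0)

def standardize(x, mean, std):
    """Apply (x - mean) / std per channel; channels are assumed to be the last axis."""
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    return (x.astype(np.float32) - mean) / std

pixels = np.array([[[0, 128, 255]]], dtype=np.uint8)  # shape [1, 1, 3]
print(normalize_0_1(pixels))

image = np.full((2, 2, 3), 0.5, dtype=np.float32)
print(standardize(image, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])[0, 0])
```
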
Layer Types

Every layer is declared as:

layer <id> = <LayerType>(<param>: <value>, ...);

The layer <id> is used for connections and for matching weight tensors (weights are looked up as {id}.{param_name} in the weight source).

Input

Entry point of the network. Defines the input tensor shape (excluding batch dimension).

Parameter	Type	Required	Default
`shape`	Shape	Yes	—

Output shape: the declared shape.

layer input = Input(shape: [28, 28, 1]);

Dense

Fully connected layer: Y = activation(X·W + B), where X is the input vector of length input_dim and W has shape input_dim × units.

Parameter	Type	Required	Default
`units`	Integer	Yes	—
`activation`	String	No	`"none"`

activation accepts "none", "relu", "sigmoid", "softmax".

Weight files: {id}.weight (shape: input_dim × units), {id}.bias (shape: units).

Output shape: [units].

layer fc1 = Dense(units: 128, activation: "relu");
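
A NumPy reference for a single Dense layer can help when preparing expected outputs. The sketch below assumes the weight layout given under "Weight files" ([input_dim, units]), so the product is computed as x @ weight. The function name dense is illustrative.

```python
import numpy as np

def dense(x, weight, bias, activation="none"):
    """Reference forward pass: activation(x @ weight + bias).

    weight has shape [input_dim, units] and bias has shape [units].
    """
    y = x @ weight + bias
    if activation == "relu":
        return np.maximum(y, 0.0)
    if activation == "sigmoid":
        return 1.0 / (1.0 + np.exp(-y))
    if activation == "softmax":
        e = np.exp(y - np.max(y))  # shift by the max for numerical stability
        return e / e.sum()
    if activation == "none":
        return y
    raise ValueError(f"unknown activation: {activation}")

x = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
w = np.ones((4, 3), dtype=np.float32)
b = np.array([0.0, -20.0, 1.0], dtype=np.float32)
print(dense(x, w, b, "relu"))  # pre-activation values are [10, -10, 11]
```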

Conv2D

2D spatial convolution.

Parameter	Type	Required	Default
`filters`	Integer	Yes	—
`kernel`	Integer or Shape	Yes	—
`stride`	Integer	No	`1`
`padding`	String	No	`"valid"`

padding accepts "valid" (no padding) or "same" (zero-pad to preserve spatial dims).

Weight files: {id}.weight (shape: filters × in_channels × kH × kW), {id}.bias (shape: filters).

Output shape (HWC):

"valid": [⌊(H - kH) / stride⌋ + 1, ⌊(W - kW) / stride⌋ + 1, filters]
"same": [⌈H / stride⌉, ⌈W / stride⌉, filters]

layer conv1 = Conv2D(filters: 32, kernel: 3, stride: 1, padding: "valid");
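
The output-shape formulas above translate directly into a small helper, which is handy for checking a layer stack before compiling. This is a sketch of the documented formulas only; conv2d_output_shape is an illustrative name, not an nnc function.

```python
import math

def conv2d_output_shape(h, w, kernel, filters, stride=1, padding="valid"):
    """Return [H_out, W_out, filters] using the formulas above.

    kernel may be an int (square kernel) or a (kH, kW) pair.
    """
    kh, kw = (kernel, kernel) if isinstance(kernel, int) else kernel
    if padding == "valid":
        return [(h - kh) // stride + 1, (w - kw) // stride + 1, filters]
    if padding == "same":
        return [math.ceil(h / stride), math.ceil(w / stride), filters]
    raise ValueError(f"unknown padding: {padding}")

print(conv2d_output_shape(28, 28, 3, 32))                       # [26, 26, 32]
print(conv2d_output_shape(224, 224, 7, 32, stride=2))           # [109, 109, 32]
print(conv2d_output_shape(32, 32, (3, 3), 64, padding="same"))  # [32, 32, 64]
```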

MaxPool2D

Spatial max pooling.

Parameter	Type	Required	Default
`kernel`	Integer or Shape	Yes	—
`stride`	Integer	No	kernel size

Weight files: none.

Output shape: [⌊(H - kH) / stride⌋ + 1, ⌊(W - kW) / stride⌋ + 1, C]

layer pool1 = MaxPool2D(kernel: 2);

AvgPool2D

Spatial average pooling.

Parameter	Type	Required	Default
`kernel`	Integer or Shape	Yes	—
`stride`	Integer	No	kernel size

Weight files: none.

Output shape: same formula as MaxPool2D.

layer pool1 = AvgPool2D(kernel: 2, stride: 2);

Flatten

Reshapes a multi-dimensional tensor into a 1D vector.

No parameters.

Weight files: none.

Output shape: [H × W × C] (product of all input dimensions).

layer flat = Flatten();

BatchNorm

Batch normalization (inference mode — uses stored running statistics).

Parameter	Type	Required	Default
`epsilon`	Number	No	`1e-5`

Weight files: {id}.gamma, {id}.beta, {id}.running_mean, {id}.running_var (each shape: channels).

Output shape: same as input.

layer bn1 = BatchNorm();
layer bn2 = BatchNorm(epsilon: 1e-6);
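
For reference, inference-mode batch normalization is usually computed as gamma * (x - running_mean) / sqrt(running_var + epsilon) + beta. The sketch below assumes nnc follows this standard formulation and that the four parameter vectors broadcast over the last (channel) axis.

```python
import numpy as np

def batchnorm_inference(x, gamma, beta, running_mean, running_var, epsilon=1e-5):
    """Normalize with stored statistics; parameters broadcast over the last axis."""
    return gamma * (x - running_mean) / np.sqrt(running_var + epsilon) + beta

x = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)  # 2 positions, 2 channels
gamma = np.array([1.0, 2.0], dtype=np.float32)
beta = np.array([0.0, 1.0], dtype=np.float32)
mean = np.array([2.0, 3.0], dtype=np.float32)
var = np.array([1.0, 4.0], dtype=np.float32)

# epsilon=0.0 keeps the arithmetic exact for this small example.
print(batchnorm_inference(x, gamma, beta, mean, var, epsilon=0.0))
```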

Dropout

Identity pass-through at inference time. Exists so that models exported from training frameworks can be represented without editing.

Parameter	Type	Required	Default
`rate`	Number	No	`0.5`

The rate parameter is ignored during compilation.

Weight files: none.

Output shape: same as input.

layer drop = Dropout(rate: 0.25);

Add

Element-wise addition of two or more inputs. Requires explicit connections.

No parameters.

Constraint: all inputs must have identical shapes.

Weight files: none.

Output shape: same as each input.

layer res = Add();

Concat

Channel-wise concatenation of two or more inputs. Requires explicit connections.

Parameter	Type	Required	Default
`axis`	Integer	No	`-1`

Constraint: all inputs must have identical shapes except along the concatenation axis.

Weight files: none.

Output shape: input shape with dimension along axis summed across inputs.

layer merged = Concat();
layer merged = Concat(axis: -1);

ReLU

Standalone activation: max(0, x).

No parameters. No weight files.

Output shape: same as input.

layer relu1 = ReLU();

Sigmoid

Standalone activation: 1 / (1 + exp(-x)).

No parameters. No weight files.

Output shape: same as input.

layer sig = Sigmoid();

Softmax

Normalized exponential activation.

Parameter	Type	Required	Default
`axis`	Integer	No	`-1`

No weight files.

Output shape: same as input.

layer sm = Softmax();
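
A numerically stable softmax subtracts the per-axis maximum before exponentiating. The sketch below shows that common technique along a configurable axis; whether the generated code uses the same shift is an assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    """Softmax along `axis`, shifted by the max for numerical stability."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

logits = np.array([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
probs = softmax(logits)
print(probs.sum(axis=-1))  # each row sums to 1, even for large logits
```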

Connections

Implicit Sequential

When the connections block is omitted, layers are connected in declaration order — each layer receives the output of the previous layer. This is the simplest form and works for linear stacks:

model simple {
    config { weights: "./weights"; io: "stdio"; }

    layer input  = Input(shape: [4]);
    layer fc1    = Dense(units: 8, activation: "relu");
    layer output = Dense(units: 2);
}
// Equivalent to: input -> fc1 -> output

Explicit Graph

When a connections block is present, it fully defines the data flow. Use this for skip connections, branches, and multi-input layers.

connections {
    input -> conv1;
    conv1 -> bn1;
    bn1   -> relu1;
    relu1 -> output;
}

Multi-Input Syntax

Layers like Add and Concat accept multiple inputs using bracket syntax:

[input, bn2] -> res;   // feeds both 'input' and 'bn2' into 'res'

Complete Examples

Simple MLP

A minimal multi-layer perceptron:

version 0.2;

model mlp {
    config {
        weights: "./weights";
        io: "stdio";
    }

    layer input  = Input(shape: [4]);
    layer fc1    = Dense(units: 16, activation: "relu");
    layer fc2    = Dense(units: 8, activation: "relu");
    layer output = Dense(units: 3, activation: "softmax");
}

CNN with Pooling

An MNIST digit classifier with convolution and pooling:

version 0.2;

// MNIST handwritten digit classifier
model mnist_classifier {
    config {
        precision: "float32";
        weights: "./weights";
        target: "avx2";
        batch: 1;
        preprocess: "normalize_0_1";
        io: "stdio";
    }

    layer input   = Input(shape: [28, 28, 1]);
    layer conv1   = Conv2D(filters: 32, kernel: 3, stride: 1, padding: "valid");
    layer pool1   = MaxPool2D(kernel: 2);
    layer flatten = Flatten();
    layer fc1     = Dense(units: 128, activation: "relu");
    layer output  = Dense(units: 10, activation: "softmax");
}

ResNet Block with Skip Connections

A residual block using explicit connections and Add:

version 0.2;

model resnet_block {
    config {
        precision: "float32";
        weights: "./weights";
        target: "generic";
        io: "stdio";
    }

    layer input  = Input(shape: [32, 32, 64]);
    layer conv1  = Conv2D(filters: 64, kernel: 3, stride: 1, padding: "same");
    layer bn1    = BatchNorm();
    layer relu1  = ReLU();
    layer conv2  = Conv2D(filters: 64, kernel: 3, stride: 1, padding: "same");
    layer bn2    = BatchNorm();
    layer res    = Add();
    layer relu2  = ReLU();

    connections {
        input -> conv1;
        conv1 -> bn1;
        bn1   -> relu1;
        relu1 -> conv2;
        conv2 -> bn2;
        [input, bn2] -> res;   // skip connection
        res   -> relu2;
    }
}

CLI Reference

nnc is the NNL compiler. It compiles .nnl neural network definitions into standalone, zero-dependency native artifacts.

nnc new

Generate a starter host-language project around a sample NNL model.

nnc new <directory> --project <rust|go|cpp|python>

Behavior

nnc new creates a new directory containing:

a sample model.nnl configured with io: "none"
host-language boilerplate wired to the generated C ABI
a small build script or build file for compiling the model artifact
a README with run instructions

Examples

# Create a Rust starter that compiles the model from build.rs
nnc new demo-rust --project rust

# Create a Go starter with a small build script
nnc new demo-go --project go

# Create a C++ starter with a Makefile
nnc new demo-cpp --project cpp

# Create a Python starter with a shared-library wrapper
nnc new demo-python --project python

nnc compile

Compile an NNL model to a native artifact.

nnc compile <source.nnl> [--emit exe|obj|lib|shared|header] [-o <output>] [--target-triple <triple>]

Flags

Flag	Description	Default
`--emit <format>`	Output format: `exe`, `obj`, `lib`, `shared`, `header`	`exe`
`-o, --output <path>`	Output file path	Source stem with appropriate extension
`--target-triple <triple>`	Target triple for cross-compilation	Host platform

Emit formats

Format	Output	Extension	Notes
`exe`	Standalone executable with `main()` (reads stdin, writes stdout)	`<stem>`	Default
`obj`	Relocatable object file	`<stem>.o`	Also generates `<stem>.h`
`lib`	Static archive	`lib<stem>.a`	Also generates `<stem>.h`
`shared`	Shared library	`lib<stem>.so`	Also generates `<stem>.h`
`header`	C header only	`<stem>.h`	No compilation step

For obj, lib, and shared, a .h header declaring the public C API is generated alongside the output.

Examples

# Compile to a standalone executable (default)
nnc compile mnist.nnl

# Compile to a static library + header
nnc compile mnist.nnl --emit lib -o build/libmnist.a

# Compile to a shared library
nnc compile mnist.nnl --emit shared

# Compile to an object file
nnc compile mnist.nnl --emit obj -o mnist.o

# Generate only the C header
nnc compile mnist.nnl --emit header

# Cross-compile for ARM Cortex-M
nnc compile mnist.nnl --emit obj --target-triple thumbv7em-none-eabi

# Cross-compile for bare-metal ARM
nnc compile model.nnl --emit lib --target-triple arm-none-eabi

nnc inspect

Print a model summary: layers, types, output shapes, parameter counts, and memory estimates.

nnc inspect <source.nnl>

Example

nnc inspect mnist.nnl

Example output:

Model: mnist_classifier (version 0.2)
Precision: float32 | Target: avx2 | Batch: 1

Layer           Type        Output Shape        Params
──────────────────────────────────────────────────────
input           Input       [28, 28, 1]              0
conv1           Conv2D      [26, 26, 32]           320
pool1           MaxPool2D   [13, 13, 32]             0
flatten         Flatten     [5408]                   0
fc1             Dense       [128]                692,352
output          Dense       [10]                   1,290
──────────────────────────────────────────────────────
Total params:    693,962
Weight memory:   2.65 MB
Workspace:       86.5 KB (static buffer)

nnc import

Convert an ONNX model into NNL format with extracted weight files.

nnc import <model.onnx> [-o <output.nnl>] [--weights-dir <dir>]

Flags

Flag	Description	Default
`-o, --output <path>`	Output `.nnl` file path	Source name with `.nnl` extension
`--weights-dir <dir>`	Directory to write extracted `.npy` weight files	`./weights`

Notes

Each ONNX initializer is extracted as a separate .npy file in the weights directory.
Unsupported ONNX operators are emitted as comments in the generated .nnl file.

Examples

# Import with defaults (resnet.nnl + ./weights/)
nnc import resnet.onnx

# Specify output path and weights directory
nnc import resnet.onnx -o models/resnet.nnl --weights-dir models/weights

nnc test

Compile a model, run inference on a given input, and compare the output element-wise against expected values.

nnc test <source.nnl> --input <input.npy> --expected <expected.npy> [--tolerance <tol>]

Flags

Flag	Description	Default
`--input <path>`	Path to input tensor (`.npy`, float32)	Required
`--expected <path>`	Path to expected output tensor (`.npy`, float32)	Required
`--tolerance <tol>`	Maximum allowed absolute difference per element	`1e-5`

Behavior

Compiles the model to a temporary executable.
Feeds the input tensor via stdin as raw float32 bytes.
Reads the output tensor from stdout.
Compares each element against the expected tensor.
Reports up to 10 individual mismatches, then a summary.
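
The comparison step can be reproduced outside nnc when debugging a failing test. The following NumPy sketch implements an element-wise absolute-difference check with a capped number of reported mismatches. The function name compare, the strict greater-than comparison, and the message format are assumptions, not nnc's actual implementation.

```python
import numpy as np

def compare(got, expected, tolerance=1e-5, max_reports=10):
    """Return (passed, mismatch_count, max_diff) for an element-wise check."""
    got = np.asarray(got, dtype=np.float64).ravel()
    expected = np.asarray(expected, dtype=np.float64).ravel()
    if got.shape != expected.shape:
        raise ValueError("element counts differ")
    diff = np.abs(got - expected)
    bad = np.flatnonzero(diff > tolerance)
    # Print at most max_reports individual mismatches.
    for i in bad[:max_reports]:
        print(f"mismatch at [{i}]: got {got[i]:.8f}, expected {expected[i]:.8f}, diff {diff[i]:.2e}")
    max_diff = float(diff.max()) if diff.size else 0.0
    return bad.size == 0, int(bad.size), max_diff

passed, count, max_diff = compare([0.5, 0.72341299], [0.5, 0.72345012])
print(passed, count)
```

Loading the two .npy files with np.load and passing them to compare gives a quick way to see which elements drift before adjusting --tolerance.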

Examples

# Test with default tolerance (1e-5)
nnc test mnist.nnl --input test_input.npy --expected test_output.npy

# Test with relaxed tolerance
nnc test mnist.nnl --input test_input.npy --expected test_output.npy --tolerance 1e-3

Example pass output:

PASS: 10/10 elements within tolerance 1.0e-5 (max diff: 3.42e-7)

Example fail output:

  mismatch at [3]: got 0.72341299, expected 0.72345012, diff 3.71e-5
  mismatch at [7]: got 0.10002345, expected 0.10010000, diff 7.66e-5
FAIL: 2/10 elements exceed tolerance 1.0e-5 (max diff: 7.66e-5)

Exit Codes

Code	Meaning
`0`	Success (compilation succeeded, test passed, import/inspect completed)
`1`	Error (syntax error, validation failure, compilation error, test mismatch, I/O error)

Environment

Requirement	Purpose
Rust toolchain	Building `nnc` from source
C compiler (`cc`, `gcc`, or `clang`) on `PATH`	Used by `nnc compile` to produce native artifacts
Cross-compiler (e.g., `arm-none-eabi-gcc`)	Required when using `--target-triple` for cross-compilation

Examples

The examples/ directory contains complete, self-contained models with pre-generated weights and test data. Each example includes:

A .nnl model definition
A weights/ directory with .npy weight files
test_input.npy and expected_output.npy for verification

Simple MLP (`examples/model/`)

Architecture: [4] → Dense(3) → Dense(2)

A minimal multi-layer perceptron with no activation functions — useful as a smoke test for the compiler pipeline.

Model definition

version 0.2;
model test_mlp {
    config {
        weights: "./weights";
        io: "stdio";
    }
    layer input = Input(shape: [4]);
    layer fc1   = Dense(units: 3);
    layer fc2   = Dense(units: 2);
}

Input: 4 floats
fc1: Dense layer with 3 units (no activation), weights: fc1.weight.npy [4×3], fc1.bias.npy [3]
fc2: Dense layer with 2 units (no activation), weights: fc2.weight.npy [3×2], fc2.bias.npy [2]
Output: 2 floats

Compile and test

# Compile to a standalone executable
nnc compile examples/model/model.nnl --emit exe -o mlp

# Verify against known test data
nnc test examples/model/model.nnl \
    --input examples/model/test_input.npy \
    --expected examples/model/expected_output.npy

MNIST CNN (`examples/mnist/`)

Architecture: [28,28,1] → Conv2D(32) → MaxPool2D(2) → Flatten → Dense(128, relu) → Dense(10, softmax)

A convolutional neural network for MNIST handwritten digit classification.

Model definition

version 0.2;

// MNIST handwritten digit classifier
model mnist_classifier {
    config {
        precision: "float32";
        weights: "./weights";
        target: "avx2";
        batch: 1;
        preprocess: "normalize_0_1";
        io: "stdio";
    }

    layer input   = Input(shape: [28, 28, 1]);
    layer conv1   = Conv2D(filters: 32, kernel: 3, stride: 1, padding: "valid");
    layer pool1   = MaxPool2D(kernel: 2);
    layer flatten = Flatten();
    layer fc1     = Dense(units: 128, activation: "relu");
    layer output  = Dense(units: 10, activation: "softmax");
}

Layer breakdown

Layer	Operation	Output shape	Notes
`input`	Input	[28, 28, 1]	Single-channel grayscale image (HWC)
`conv1`	Conv2D	[26, 26, 32]	32 filters, 3×3 kernel, valid padding
`pool1`	MaxPool2D	[13, 13, 32]	2×2 pooling window
`flatten`	Flatten	[5408]	13 × 13 × 32 = 5408
`fc1`	Dense + ReLU	[128]	Fully connected with ReLU activation
`output`	Dense + Softmax	[10]	10-class probability distribution
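
Parameter counts for this model follow from the weight shapes in the Language Reference: a Conv2D layer holds filters × in_channels × kH × kW weights plus one bias per filter, and a Dense layer holds input_dim × units weights plus one bias per unit. The sketch below works through that arithmetic; the helper names are illustrative.

```python
def conv2d_params(kh, kw, in_channels, filters):
    """Weights [filters, in_channels, kH, kW] plus one bias per filter."""
    return filters * in_channels * kh * kw + filters

def dense_params(input_dim, units):
    """Weights [input_dim, units] plus one bias per unit."""
    return input_dim * units + units

conv1 = conv2d_params(3, 3, 1, 32)     # 3x3 kernel on a 1-channel input
fc1 = dense_params(13 * 13 * 32, 128)  # flattened pool1 output feeds fc1
output = dense_params(128, 10)
total = conv1 + fc1 + output

print(conv1, fc1, output, total)
print(f"{total * 4} bytes of float32 weights")
```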

Preprocessing

preprocess: "normalize_0_1" divides each input pixel by 255.0, mapping raw [0, 255] byte values to [0.0, 1.0] floats. This is applied automatically in the generated inference code.

Compile and test

nnc compile examples/mnist/mnist.nnl --emit exe -o mnist

nnc test examples/mnist/mnist.nnl \
    --input examples/mnist/test_input.npy \
    --expected examples/mnist/expected_output.npy

ResNet Block (`examples/resnet_block/`)

Architecture: A residual block with a skip connection, built with explicit connections and Add.

This example demonstrates non-sequential layer graphs — the connections block allows arbitrary wiring between layers, including multi-input layers like Add.

Model definition

version 0.2;

model resnet_block {
    config {
        precision: "float32";
        weights: "./weights";
        target: "generic";
        io: "stdio";
    }

    layer input  = Input(shape: [32, 32, 64]);
    layer conv1  = Conv2D(filters: 64, kernel: 3, stride: 1, padding: "same");
    layer bn1    = BatchNorm();
    layer relu1  = ReLU();
    layer conv2  = Conv2D(filters: 64, kernel: 3, stride: 1, padding: "same");
    layer bn2    = BatchNorm();
    layer res    = Add();
    layer relu2  = ReLU();

    connections {
        input -> conv1;
        conv1 -> bn1;
        bn1 -> relu1;
        relu1 -> conv2;
        conv2 -> bn2;
        [input, bn2] -> res;
        res -> relu2;
    }
}

Skip connection explained

The key line is [input, bn2] -> res; — this feeds both the original input and the output of bn2 into the Add layer, creating the residual shortcut:

input ──→ conv1 → bn1 → relu1 → conv2 → bn2 ──┐
  │                                              │
  └──────────────────────────────────────────→ Add → relu2

Without the connections block, layers are connected sequentially in declaration order. The connections block overrides this default with explicit wiring.
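
The Code Generation section states that kernels are called in topological order. As an illustration of what that means for this graph, the sketch below orders the layers with Python's standard graphlib module (Python 3.9+). The edge list mirrors the connections block above; the code is not taken from nnc.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# (sources, destination) pairs mirroring the connections block above.
connections = [
    (["input"], "conv1"),
    (["conv1"], "bn1"),
    (["bn1"], "relu1"),
    (["relu1"], "conv2"),
    (["conv2"], "bn2"),
    (["input", "bn2"], "res"),  # skip connection
    (["res"], "relu2"),
]

def execution_order(edges):
    """Return layer ids ordered so every layer follows all of its inputs."""
    sorter = TopologicalSorter()
    for sources, dest in edges:
        sorter.add(dest, *sources)
    return list(sorter.static_order())

order = execution_order(connections)
print(order)
```

Because res depends on both input and bn2, it can only run after the whole conv/bn chain has finished, even though input is available from the start.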

Weight files

Each BatchNorm layer requires four weight files. For bn1 (bn2 follows the same pattern):

bn1.gamma.npy, bn1.beta.npy — learned scale and shift
bn1.running_mean.npy, bn1.running_var.npy — running statistics from training

Compile and test

nnc compile examples/resnet_block/resnet_block.nnl --emit exe -o resnet_block

nnc test examples/resnet_block/resnet_block.nnl \
    --input examples/resnet_block/test_input.npy \
    --expected examples/resnet_block/expected_output.npy

VGG Block (`examples/vgg_block/`)

Architecture: [32,32,3] → Conv2D(64)×2 → AvgPool2D(2) → Flatten → Dense(256, relu) → Dropout(0.5) → Dense(10, softmax)

A VGG-style CNN block for CIFAR-10 classification. Demonstrates stacked convolutions before pooling, AvgPool2D, and Dropout.

Model definition

version 0.2;

// VGG-style CNN block for CIFAR-10 classification
model vgg_block {
    config {
        precision: "float32";
        weights: "./weights";
        target: "generic";
        io: "stdio";
    }

    layer input   = Input(shape: [32, 32, 3]);
    layer conv1   = Conv2D(filters: 64, kernel: 3, stride: 1, padding: "same");
    layer conv2   = Conv2D(filters: 64, kernel: 3, stride: 1, padding: "same");
    layer pool    = AvgPool2D(kernel: 2);
    layer flatten = Flatten();
    layer fc1     = Dense(units: 256, activation: "relu");
    layer drop    = Dropout(rate: 0.5);
    layer output  = Dense(units: 10, activation: "softmax");
}

Key features

AvgPool2D: Average pooling instead of max pooling — useful for smoother feature maps.
Dropout: A no-op during inference, but preserved from training frameworks so the model definition stays faithful to the original.
Stacked Conv2D: Two 3×3 convolutions before pooling give a 5×5 effective receptive field with fewer parameters than a single 5×5 convolution.

Compile and test

nnc compile examples/vgg_block/vgg_block.nnl --emit exe -o vgg_block

nnc test examples/vgg_block/vgg_block.nnl \
    --input examples/vgg_block/test_input.npy \
    --expected examples/vgg_block/expected_output.npy

Binary Classifier (`examples/binary_classifier/`)

Architecture: [16] → Dense(64) → ReLU → Dense(32) → ReLU → Dense(1) → Sigmoid

A binary classifier MLP using standalone activation layers instead of inline activations on Dense.

Model definition

version 0.2;

// Binary classifier MLP for tabular data
// Dense layers with standalone ReLU and Sigmoid activations
model binary_classifier {
    config {
        weights: "./weights";
        io: "stdio";
    }

    layer input   = Input(shape: [16]);
    layer fc1     = Dense(units: 64);
    layer relu1   = ReLU();
    layer fc2     = Dense(units: 32);
    layer relu2   = ReLU();
    layer fc3     = Dense(units: 1);
    layer sigmoid = Sigmoid();
}

Key features

Standalone activations: ReLU() and Sigmoid() as separate layers rather than Dense parameters. This matches the graph structure of many ONNX exports.
Sigmoid output: Produces a single probability value in [0, 1] for binary classification.

Compile and test

nnc compile examples/binary_classifier/binary_classifier.nnl --emit exe -o binary_classifier

nnc test examples/binary_classifier/binary_classifier.nnl \
    --input examples/binary_classifier/test_input.npy \
    --expected examples/binary_classifier/expected_output.npy

Inception Module (`examples/inception_module/`)

Architecture: Three parallel Conv2D branches (1×1, 3×3, 5×5) merged via Concat.

A simplified Inception-style module demonstrating parallel branches and channel-wise concatenation.

Model definition

version 0.2;

// Simplified Inception module: three parallel convolution branches
// (1x1, 3x3, 5x5) concatenated along the channel axis.

model inception_module {
    config {
        precision: "float32";
        weights: "./weights";
        target: "generic";
        io: "stdio";
    }

    layer input   = Input(shape: [32, 32, 64]);
    layer conv1x1 = Conv2D(filters: 32, kernel: 1, stride: 1, padding: "same");
    layer conv3x3 = Conv2D(filters: 32, kernel: 3, stride: 1, padding: "same");
    layer conv5x5 = Conv2D(filters: 32, kernel: 5, stride: 1, padding: "same");
    layer concat  = Concat();
    layer bn      = BatchNorm();
    layer relu    = ReLU();

    connections {
        input -> conv1x1;
        input -> conv3x3;
        input -> conv5x5;
        [conv1x1, conv3x3, conv5x5] -> concat;
        concat -> bn;
        bn -> relu;
    }
}

Connection graph

           ┌→ conv1x1 (32 filters) ──┐
input ────→├→ conv3x3 (32 filters) ──├→ Concat → BatchNorm → ReLU
           └→ conv5x5 (32 filters) ──┘

Key features

Concat: Channel-wise concatenation of three branches (32+32+32 = 96 output channels).
Multi-input bracket syntax: [conv1x1, conv3x3, conv5x5] -> concat; feeds all three branches into the Concat layer.
Parallel branches: The connections block wires input to all three convolutions independently.
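
The channel arithmetic can be checked with NumPy. In this sketch, constant arrays stand in for the three branch outputs (an assumption for illustration only), and np.concatenate along the last axis plays the role of Concat with its default axis of -1.

```python
import numpy as np

# Stand-ins for the three branch outputs; "same" padding keeps the 32x32 spatial dims.
branch_1x1 = np.zeros((32, 32, 32), dtype=np.float32)
branch_3x3 = np.ones((32, 32, 32), dtype=np.float32)
branch_5x5 = np.full((32, 32, 32), 2.0, dtype=np.float32)

merged = np.concatenate([branch_1x1, branch_3x3, branch_5x5], axis=-1)
print(merged.shape)  # (32, 32, 96)

# Channels 0-31 come from the first branch, 32-63 from the second, 64-95 from the third.
print(merged[0, 0, 31], merged[0, 0, 32], merged[0, 0, 64])
```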

Compile and test

nnc compile examples/inception_module/inception_module.nnl --emit exe -o inception_module

nnc test examples/inception_module/inception_module.nnl \
    --input examples/inception_module/test_input.npy \
    --expected examples/inception_module/expected_output.npy

Feature Extractor (`examples/feature_extractor/`)

Architecture: [224,224,3] → Conv2D(32,7) → BN → ReLU → MaxPool → Conv2D(64,3) → BN → ReLU → MaxPool → Flatten → Dense(256) → ReLU → Dense(10) → Softmax

A CNN feature extractor with ImageNet-style preprocessing and standalone Softmax.

Model definition

version 0.2;

// CNN feature extractor with ImageNet-style preprocessing and standalone Softmax
model feature_extractor {
    config {
        precision: "float32";
        weights: "./weights";
        target: "avx2";
        io: "stdio";
        preprocess: "standardize";
        preprocess_mean: [0.485, 0.456, 0.406];
        preprocess_std: [0.229, 0.224, 0.225];
    }

    layer input   = Input(shape: [224, 224, 3]);
    layer conv1   = Conv2D(filters: 32, kernel: 7, stride: 2, padding: "valid");
    layer bn1     = BatchNorm();
    layer relu1   = ReLU();
    layer pool1   = MaxPool2D(kernel: 3, stride: 2);
    layer conv2   = Conv2D(filters: 64, kernel: 3, padding: "valid");
    layer bn2     = BatchNorm();
    layer relu2   = ReLU();
    layer pool2   = MaxPool2D(kernel: 2);
    layer flatten = Flatten();
    layer fc1     = Dense(units: 256);
    layer relu3   = ReLU();
    layer fc2     = Dense(units: 10);
    layer output  = Softmax();
}

Key features

Standalone Softmax: Used as a separate layer rather than a Dense activation parameter.
ImageNet preprocessing: preprocess: "standardize" with per-channel mean and std — the generated binary applies (x - mean) / std per channel automatically.
Strided convolution: Conv2D(kernel: 7, stride: 2) for aggressive spatial downsampling.
MaxPool2D with stride: MaxPool2D(kernel: 3, stride: 2) sets a stride smaller than the kernel, so neighboring pooling windows overlap.

Compile and test

nnc compile examples/feature_extractor/feature_extractor.nnl --emit exe -o feature_extractor

nnc test examples/feature_extractor/feature_extractor.nnl \
    --input examples/feature_extractor/test_input.npy \
    --expected examples/feature_extractor/expected_output.npy

ONNX Import (`examples/import_test/`)

Demonstrates the round-trip workflow: generate an ONNX model in Python, import it into NNL, compile, and verify.

Architecture: [4] → Dense(3, relu) → Dense(2)

Step 1: Generate the ONNX model

# Run in a subshell so the later commands still start from the repository root
(cd examples/import_test && python3 gen_mlp.py)

This creates:

model.onnx — the ONNX model with embedded weights
input.npy — test input [1.0, 2.0, 3.0, 4.0]
expected.npy — expected output computed from the same weights

Step 2: Import into NNL

nnc import examples/import_test/model.onnx \
    -o examples/import_test/model.nnl \
    --weights-dir examples/import_test/weights

This produces a .nnl file and extracts weight tensors into the weights/ directory as .npy files.

Step 3: Compile

nnc compile examples/import_test/model.nnl --emit exe -o import_mlp

Step 4: Test

nnc test examples/import_test/model.nnl \
    --input examples/import_test/input.npy \
    --expected examples/import_test/expected.npy

What gen_mlp.py does

The script builds a two-layer MLP with fixed weights using the ONNX helper API:

Layer 1: Gemm (matrix multiply + bias) → Relu
Layer 2: Gemm

It uses deterministic weights so the expected output can be computed exactly and verified after the NNL round-trip.

Creating Your Own Model

1. Write the `.nnl` file

Define your architecture with layer declarations and an optional connections block:

version 0.2;
model my_model {
    config {
        weights: "./weights";
        io: "stdio";
    }
    layer input = Input(shape: [784]);
    layer fc1   = Dense(units: 64, activation: "relu");
    layer fc2   = Dense(units: 10, activation: "softmax");
}

2. Create the weights directory

Each layer expects specific .npy files named <layer_id>.<param>.npy:

Layer type	Weight files
Dense	`<id>.weight.npy`, `<id>.bias.npy`
Conv2D	`<id>.weight.npy`, `<id>.bias.npy`
BatchNorm	`<id>.gamma.npy`, `<id>.beta.npy`, `<id>.running_mean.npy`, `<id>.running_var.npy`

3. Generate weights with NumPy

import os

import numpy as np

# np.save does not create missing directories, so make weights/ first
os.makedirs("weights", exist_ok=True)

np.save("weights/fc1.weight.npy", np.random.randn(784, 64).astype(np.float32))
np.save("weights/fc1.bias.npy",   np.zeros(64, dtype=np.float32))
np.save("weights/fc2.weight.npy", np.random.randn(64, 10).astype(np.float32))
np.save("weights/fc2.bias.npy",   np.zeros(10, dtype=np.float32))

4. Compile

nnc compile my_model.nnl --emit exe -o my_model

5. Test

Generate test inputs and expected outputs, then verify:

nnc test my_model.nnl --input test_input.npy --expected expected_output.npy

The default tolerance is 1e-5 (element-wise). Adjust with --tolerance if needed.
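
Step 5 needs a test input and a matching expected output. One way to get them is a NumPy reference forward pass over the same weight files. The sketch below does this for the 784 → Dense(64, relu) → Dense(10, softmax) model from step 1. It assumes the test tensors carry no batch dimension (shape [784] and [10]), which this documentation does not specify. All function names are illustrative, and the demo writes to a scratch directory.

```python
import os
import tempfile

import numpy as np

def write_random_weights(weights_dir, seed=1):
    """Write small random float32 weights for the two Dense layers."""
    rng = np.random.default_rng(seed)
    for name, shape in [("fc1", (784, 64)), ("fc2", (64, 10))]:
        weight = rng.standard_normal(shape, dtype=np.float32) * np.float32(0.01)
        np.save(os.path.join(weights_dir, f"{name}.weight.npy"), weight)
        np.save(os.path.join(weights_dir, f"{name}.bias.npy"), np.zeros(shape[1], dtype=np.float32))

def reference_forward(x, weights_dir):
    """NumPy reference for Input[784] -> Dense(64, relu) -> Dense(10, softmax)."""
    def load(name):
        return np.load(os.path.join(weights_dir, name))
    hidden = np.maximum(x @ load("fc1.weight.npy") + load("fc1.bias.npy"), 0.0)
    logits = hidden @ load("fc2.weight.npy") + load("fc2.bias.npy")
    e = np.exp(logits - logits.max())
    return (e / e.sum()).astype(np.float32)

def write_test_pair(weights_dir, out_dir, seed=0):
    """Save test_input.npy and expected_output.npy as float32 arrays."""
    rng = np.random.default_rng(seed)
    x = rng.random(784, dtype=np.float32)
    y = reference_forward(x, weights_dir)
    np.save(os.path.join(out_dir, "test_input.npy"), x)
    np.save(os.path.join(out_dir, "expected_output.npy"), y)
    return x, y

# Demo in a scratch directory; in practice point weights_dir at ./weights.
with tempfile.TemporaryDirectory() as tmp:
    write_random_weights(tmp)
    x, y = write_test_pair(tmp, tmp)
    print(x.shape, y.shape, float(y.sum()))
```

If nnc test then reports mismatches, comparing against this reference helps separate weight-layout mistakes from genuine numerical drift.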

Code Generation

How It Works

nnc generates C source code from the NNL model, then invokes the system C compiler (cc/gcc/clang) to produce the final artifact. This approach is documented in DESIGN.md as ADR-001: C Codegen Backend.

The generated C contains:

static const float weight arrays (placed in .rodata via const)
Statically-allocated workspace buffers for activations
The inference function body with kernel calls in topological order
A .h header declaring the public API

Pipeline

.nnl source → nnc frontend → IR → C source → cc/gcc/clang → native binary

Frontend — parses the .nnl file into an AST (src/syntax/)
Semantic analysis — validates layer types, resolves connections, infers shapes (src/sema/)
IR — builds a typed model graph with topological ordering (src/ir/)
Weights — loads .npy / .npz / ONNX weight tensors (src/weights/)
C emitter — generates a .c source file and .h header (src/codegen/emit.rs)
Toolchain — invokes cc/gcc/clang and ar to produce the requested artifact (src/codegen/toolchain.rs)

Generated C API

For a model named my_model, nnc generates:

#ifndef MY_MODEL_H
#define MY_MODEL_H

#include <stdint.h>

int my_model_infer(const void *input, void *output);
int my_model_input_size(void);   // total float elements in input tensor
int my_model_output_size(void);  // total float elements in output tensor

#endif /* MY_MODEL_H */

input / output are raw float arrays in row-major (HWC) layout
Returns 0 on success
No heap allocation during inference — all buffers are static
All weights are embedded as static const float arrays in .rodata

Output Formats

`--emit` flag	File type	What’s generated	Use case
`exe`	Standalone binary	Binary with `main()` that reads `stdin` / writes `stdout`	Quick testing, CLI inference
`obj`	`.o` relocatable object	Object file + `.h` header	Linking into a larger C/C++ project
`lib`	`.a` static archive	Static library + `.h` header	Distribution as a self-contained library
`shared`	`.so` shared library	Shared object + `.h` header	Dynamic linking, plugins
`header`	`.h` file only	Header with API declarations	Inspection, IDE integration
`c`	`.c` + `.h` source files	Generated C source and header	Debugging, auditing generated code

Under the hood, these map to standard compiler/archiver invocations:

exe → cc -O2 -o output source.c -lm
obj → cc -O2 -c -o output.o source.c
lib → cc -O2 -c + ar rcs output.a output.o
shared → cc -O2 -shared -fPIC -o output.so source.c -lm
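header → direct file copy of the generated .h (no compiler invoked)
c → writes the generated .c and .h files directly (no compiler invoked)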
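For the exe format, the binary reads its input from stdin and writes its output to stdout. The sketch below shows one way to drive such a binary from Python. It assumes both streams carry raw little-endian float32 values, which matches the byte string used in the Quick Start but is not otherwise specified here. The helper names are hypothetical.

```python
import struct
import subprocess

def encode_floats(values):
    """Pack floats as consecutive little-endian float32 values (assumed format)."""
    return struct.pack(f"<{len(values)}f", *values)

def decode_floats(raw):
    """Unpack a byte string of little-endian float32 values into a list."""
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}f", raw[: count * 4]))

def run_model(binary_path, values):
    """Pipe the encoded input to the binary's stdin and decode its stdout."""
    result = subprocess.run(
        [binary_path],
        input=encode_floats(values),
        capture_output=True,
        check=True,  # raise if the binary exits with a non-zero status
    )
    return decode_floats(result.stdout)
```

For example, `run_model("./model", [1.0, 2.0, 3.0, 4.0])` would feed the four-float input from the Quick Start to a binary built with --emit exe.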

Integration Example

Compile a model as a static library:

nnc compile my_model.nnl --emit lib -o libmy_model.a

This produces libmy_model.a and my_model.h in the same directory. Link them into your C project:

#include "my_model.h"

float input[784], output[10];

int main(void) {
    // ... fill input[] with preprocessed data ...
    int rc = my_model_infer(input, output);
    if (rc != 0) return rc;
    // ... use output[] ...
    return 0;
}

Compile and link:

gcc -O2 -o app app.c -L. -lmy_model -lm

Alternatively, link a .o object directly:

nnc compile my_model.nnl --emit obj -o my_model.o
gcc -O2 -o app app.c my_model.o -lm
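The same three functions can also be reached from Python through ctypes when the model is built as a shared library. This is a minimal sketch, assuming a library produced with --emit shared and named libmy_model.so; `bind_model` and `run_inference` are hypothetical helpers, not part of nnc.

```python
import ctypes

def bind_model(lib, name):
    """Declare the C signatures of the three generated functions on `lib`."""
    infer_fn = getattr(lib, f"{name}_infer")
    infer_fn.argtypes = [ctypes.c_void_p, ctypes.c_void_p]
    infer_fn.restype = ctypes.c_int
    for suffix in ("input_size", "output_size"):
        size_fn = getattr(lib, f"{name}_{suffix}")
        size_fn.argtypes = []
        size_fn.restype = ctypes.c_int
    return lib

def run_inference(lib, name, values):
    """Run one inference and return the outputs as a list of floats."""
    n_in = getattr(lib, f"{name}_input_size")()
    n_out = getattr(lib, f"{name}_output_size")()
    if len(values) != n_in:
        raise ValueError(f"expected {n_in} inputs, got {len(values)}")
    in_buf = (ctypes.c_float * n_in)(*values)   # contiguous float32 input
    out_buf = (ctypes.c_float * n_out)()        # zero-initialised output
    rc = getattr(lib, f"{name}_infer")(in_buf, out_buf)
    if rc != 0:
        raise RuntimeError(f"{name}_infer returned {rc}")
    return list(out_buf)
```

Typical use would be `lib = bind_model(ctypes.CDLL("./libmy_model.so"), "my_model")` followed by `run_inference(lib, "my_model", inputs)`. Querying the size functions first avoids hard-coding buffer lengths such as 784 and 10 in the host code.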

Cross-Compilation

When --target-triple is specified, nnc invokes the corresponding cross-compiler instead of cc:

nnc compile model.nnl --emit exe --target-triple arm-none-eabi -o model
# invokes: arm-none-eabi-gcc -O2 -o model model.c -lm

Combine with a SIMD target in the model config for architecture-specific optimizations:

config {
    target: "arm_neon";
}

This adds -mfpu=neon to the compiler flags. Available targets and their flags:

Config `target`	Compiler flag
`"generic"`	(none)
`"avx2"`	`-mavx2`
`"avx512"`	`-mavx512f`
`"arm_neon"`	`-mfpu=neon`

Note: Target flags make the corresponding instruction set available to the C compiler’s autovectorizer. The generated C code uses scalar loops, which the compiler may vectorize automatically depending on its version and optimization level. Hand-tuned SIMD intrinsics (AVX2, NEON) are planned for a future release.
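How the target triple and the config target combine can be sketched as a small command builder. This is illustrative Python rather than nnc's actual toolchain code; the flag table is copied from above, while `compiler_command` and the position of the target flag on the command line are assumptions.

```python
# Config `target` -> extra compiler flags, copied from the table above.
TARGET_FLAGS = {
    "generic": [],
    "avx2": ["-mavx2"],
    "avx512": ["-mavx512f"],
    "arm_neon": ["-mfpu=neon"],
}

def compiler_command(source, output, target="generic", target_triple=None):
    """Assemble an exe-style compiler invocation as an argument list."""
    if target not in TARGET_FLAGS:
        raise ValueError(f"unknown target: {target!r}")
    # With a triple, the cross-compiler is named <triple>-gcc; otherwise cc.
    compiler = f"{target_triple}-gcc" if target_triple else "cc"
    return [compiler, "-O2", *TARGET_FLAGS[target], "-o", output, source, "-lm"]
```

Calling `compiler_command("model.c", "model", target_triple="arm-none-eabi")` reproduces the arm-none-eabi-gcc invocation shown earlier in this section.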

Memory Model

Static workspace buffers — all activation memory is statically allocated (static float arrays). No malloc is ever called.
Liveness-based buffer reuse — the codegen performs liveness analysis on the layer graph and reuses buffer slots when a layer’s output is no longer needed, minimizing total activation memory.
Weights in read-only data — all weight arrays are static const float with alignment attributes, placed in the .rodata section by the C compiler.
Alignment — buffers and weight arrays use __attribute__((aligned(N))) for SIMD-friendly access patterns.
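Liveness-based buffer reuse can be illustrated with a greedy slot assignment over a topologically ordered layer list. This is a simplified sketch, not nnc's actual algorithm: the layer tuples are hypothetical, sizes are in float elements, and a freed slot is simply grown to the largest tensor it ever holds.

```python
def assign_buffers(layers):
    """Assign each layer's output to a reusable buffer slot.

    layers: list of (name, inputs, size) in topological order.
    Returns (slot_of, slot_sizes).
    """
    # Index of the last layer that reads each output.
    last_use = {name: i for i, (name, _, _) in enumerate(layers)}
    for i, (_, inputs, _) in enumerate(layers):
        for src in inputs:
            last_use[src] = i
    last_use[layers[-1][0]] = len(layers)  # the model output is never freed

    slot_of, slot_sizes, free = {}, [], []
    for i, (name, inputs, size) in enumerate(layers):
        # Pick the output slot before releasing inputs, so a layer
        # never writes into a buffer it is still reading from.
        if free:
            slot = free.pop()
            slot_sizes[slot] = max(slot_sizes[slot], size)
        else:
            slot = len(slot_sizes)
            slot_sizes.append(size)
        slot_of[name] = slot
        for src in dict.fromkeys(inputs):  # de-duplicate, keep order
            if last_use[src] == i:
                free.append(slot_of[src])
    return slot_of, slot_sizes

# Hypothetical graph with a skip connection: d = Add(a, c).
layers = [("a", [], 8), ("b", ["a"], 8), ("c", ["b"], 8), ("d", ["a", "c"], 8)]
slot_of, slot_sizes = assign_buffers(layers)
```

In this example `a` stays live until `d` because of the skip connection, so only the slot used by `b` can be recycled, and four outputs fit in three buffers.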

Weight Files

Supported Formats

Format	Description
Directory of `.npy` files	Each file named `{layer_id}.{param}.npy` (e.g., `fc1.weight.npy`, `fc1.bias.npy`)
`.npz` archive	Keys must match `{layer_id}.{param}` (e.g., `fc1.weight`, `fc1.bias`)

Naming Convention

The weights config key points to the weight source. nnc resolves it relative to the .nnl file’s directory.

config {
    weights: "weights/";       // directory of .npy files
    // or
    // weights: "model.npz";   // single .npz archive
}

Expected Shapes Per Layer

Layer	Parameter	Shape
Dense	weight	`[input_dim, units]`
Dense	bias	`[units]`
Conv2D	weight	`[filters, in_channels, kH, kW]`
Conv2D	bias	`[filters]`
BatchNorm	gamma	`[channels]`
BatchNorm	beta	`[channels]`
BatchNorm	running_mean	`[channels]`
BatchNorm	running_var	`[channels]`
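The table can be turned into a small lookup that computes the shapes a layer should have, which is handy when preparing weights by hand. The function and its keyword names (`input_dim`, `in_channels`, `kernel`, and so on) are hypothetical and chosen for this sketch only.

```python
def expected_shapes(layer_type, **p):
    """Return {param: shape} for a layer, following the table above."""
    if layer_type == "Dense":
        return {
            "weight": (p["input_dim"], p["units"]),
            "bias": (p["units"],),
        }
    if layer_type == "Conv2D":
        k_h, k_w = p["kernel"]
        return {
            "weight": (p["filters"], p["in_channels"], k_h, k_w),
            "bias": (p["filters"],),
        }
    if layer_type == "BatchNorm":
        shape = (p["channels"],)
        return {name: shape for name in ("gamma", "beta", "running_mean", "running_var")}
    raise ValueError(f"no shape rule for layer type {layer_type!r}")
```

For instance, `expected_shapes("Dense", input_dim=784, units=128)` yields a `(784, 128)` weight and a `(128,)` bias, matching the weights generated in the Python example in this section.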

Data Types

Precision	Weight dtype
`"float32"`	`float32`
`"float64"`	Not supported; rejected with a compile error since 0.4.0

Generating Test Weights (Python)

import numpy as np

# Create weights matching a Dense layer with 784 inputs and 128 units
np.save("fc1.weight.npy", np.random.randn(784, 128).astype(np.float32))
np.save("fc1.bias.npy", np.zeros(128, dtype=np.float32))

# Or bundle into an .npz archive
np.savez("model.npz",
    **{"fc1.weight": np.random.randn(784, 128).astype(np.float32),
       "fc1.bias": np.zeros(128, dtype=np.float32)})

Error Messages

Error	Meaning	Fix
E003: missing weight	A layer expects a weight file or key that was not found in the weight source.	Ensure the weight source contains an entry named `{layer_id}.{param}` for every parameterised layer.
Shape mismatch	The shape of a loaded weight does not match what the layer definition expects (e.g., expected `[784, 128]` but found `[128, 784]`).	Regenerate or transpose the weight so its shape matches the table above.
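A missing-weight error can often be caught before compiling by checking the weight directory against the expected file names. The sketch below assumes a directory source and the `{layer_id}.{param}.npy` naming described above; `missing_weights` is a hypothetical helper, not an nnc command.

```python
from pathlib import Path

def missing_weights(weights_dir, expected):
    """Return the sorted .npy file names that are absent from weights_dir.

    expected: iterable of (layer_id, param) pairs, e.g. ("fc1", "weight").
    """
    root = Path(weights_dir)
    return sorted(
        f"{layer_id}.{param}.npy"
        for layer_id, param in expected
        if not (root / f"{layer_id}.{param}.npy").is_file()
    )
```

Like the E003 diagnostic, the helper reports every missing entry at once instead of stopping at the first one.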

ONNX Import

Overview

nnc import converts ONNX models to NNL format with extracted weights.

nnc import model.onnx -o model.nnl

Supported ONNX Operators

ONNX Op	NNL Layer
Gemm / MatMul	Dense
Conv	Conv2D
MaxPool	MaxPool2D
AveragePool	AvgPool2D
Flatten	Flatten
BatchNormalization	BatchNorm
Dropout	Dropout
Add	Add
Concat	Concat
Relu	ReLU
Sigmoid	Sigmoid
Softmax	Softmax

Weight Handling

Weights are extracted from ONNX initializers and saved as individual .npy files.
Gemm nodes with transB=1 have their weights automatically transposed to the NNL [in, out] layout.
The batch dimension is stripped from input shapes.
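The transB handling amounts to a matrix transpose. In ONNX, a Gemm node with transB=1 stores its weight as [out, in], whereas NNL expects [in, out]. The sketch below shows the idea on nested lists; `gemm_weight_to_nnl` is a hypothetical name, and the importer's actual implementation may differ.

```python
def transpose(matrix):
    """Swap rows and columns of a rectangular nested list."""
    return [list(column) for column in zip(*matrix)]

def gemm_weight_to_nnl(weight, trans_b):
    """Convert an ONNX Gemm weight to the NNL [in, out] layout."""
    if trans_b:
        return transpose(weight)          # [out, in] -> [in, out]
    return [list(row) for row in weight]  # already [in, out]; copy as-is

# A layer with 3 inputs and 2 outputs, stored as [out, in] = [2, 3].
onnx_weight = [[1, 2, 3],
               [4, 5, 6]]
nnl_weight = gemm_weight_to_nnl(onnx_weight, trans_b=1)
```

After conversion the matrix has 3 rows and 2 columns, so it lines up with the Dense weight shape `[input_dim, units]`.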

Unsupported Operators

Operators without a mapping are emitted as comments in the generated .nnl file:

// UNSUPPORTED: Reshape(reshape_0)

These require manual resolution — replace the comment with an equivalent NNL layer or restructure the model before export.
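Before compiling an imported model, it can help to list the markers that still need attention. The sketch below scans .nnl text for such comments, assuming they always follow the exact `// UNSUPPORTED: Op(node_name)` form shown above. `find_unsupported` is a hypothetical helper.

```python
import re

# Matches lines such as: // UNSUPPORTED: Reshape(reshape_0)
_UNSUPPORTED = re.compile(r"^\s*//\s*UNSUPPORTED:\s*(\w+)\((.*?)\)\s*$")

def find_unsupported(nnl_text):
    """Return (line_number, op, node_name) for each UNSUPPORTED marker."""
    found = []
    for number, line in enumerate(nnl_text.splitlines(), start=1):
        match = _UNSUPPORTED.match(line)
        if match:
            found.append((number, match.group(1), match.group(2)))
    return found
```

An empty result means the importer mapped every operator it encountered.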

Round-Trip Workflow

Train your model in PyTorch, TensorFlow, or another framework.
Export to ONNX (e.g., torch.onnx.export(model, dummy, "model.onnx")).
Import into NNL: nnc import model.onnx -o model.nnl
Compile and run: nnc compile model.nnl -o model && ./model
Test outputs against the original framework to verify correctness.

Limitations

Only float32 weights are supported.
External data files (ONNX data_location = EXTERNAL) are supported since 0.5.0; in earlier versions, weights must be embedded in the .onnx file.
Dynamic shapes are not supported; all dimensions must be fixed at export time.

Release Notes

[0.9.0] — 2026-05-01

Added

New layers: LRN and FakeQuant — Local Response Normalization (LRN(size, alpha, beta, bias)) for AlexNet-style models, and FakeQuant(scale, zero_point, qmin, qmax) for simulated quantization. Both are wired through lexer, parser, IR, shape inference, and codegen.
Explicit per-side pool padding — MaxPool2D and AvgPool2D now accept a padding: [top, left, bottom, right] parameter for asymmetric padding, propagated through shape inference and codegen.

Changed

ONNX import: quantized CNN support — nnc import now maps LRN ONNX nodes, fuses Gemm/MatMul → (Quantize/Dequantize) → Add bias chains, lowers QuantizeLinear nodes into FakeQuant layers, and recognizes asymmetric pads attributes for Conv, MaxPool, and AveragePool shape inference.
ONNX import: tensor data decoding — initializers stored in int32_data / int64_data (instead of raw_data) are now decoded correctly, fixing imports for many torch-exported quantized models.

Fixed

ONNX Reshape → Flatten lowering — Reshape to a [1, N] target shape is now lowered to Flatten even when the input rank is < 3, matching how PyTorch exporters serialize the post-conv flatten.
ONNX DequantizeLinear with missing zero-point — empty zero-point initializers are now treated as zeros instead of failing to dequantize.

[0.8.0] — 2026-04-30

Added

nnc new project scaffolding — generate a starter host-language project around a sample NNL model. Supports --project rust, go, cpp, and python. The scaffold includes a sample model.nnl (configured with io: "none"), host-language boilerplate wired to the generated C ABI, a build script or build file for compiling the model artifact, and a README with run instructions.

Changed

Improved missing-weight diagnostics (E003) — nnc compile now produces a structured, actionable error when required weights are missing. Errors list every missing tensor with its expected shape, identify whether the source is a directory of .npy files, an .npz archive, or another path, and include a hint to run nnc inspect <model> to view expected tensors and shapes. All missing weights are reported in a single error instead of stopping at the first one.

[0.7.0] — 2026-04-26

Added

Compile-time memory check with optional memory_limit config — nnc now computes total static memory (weights + workspace) and emits a W003 warning when it exceeds 256 MB. Add memory_limit: "128MB" to the config block to turn this into a hard compile error (E009). nnc inspect now shows a “Total memory” line. Accepted units: KB, MB, GB.
io: "none" config option — skips main() generation, producing a pure library artifact. Use with --emit lib, --emit shared, or --emit obj for embedding models in host applications. io: "none" with --emit exe produces a clear compile error.
Integration examples — new examples/integration/ directory with documented examples showing how to call an NNL-compiled model from C++, Rust, Go, and Python, using static/shared library linking and FFI.

[0.6.0] — 2026-04-23

Added

New layers: Hardswish, Upsample, Conv1D, MaxPool1D, LayerNorm — five new layer types across all pipeline stages (lexer, parser, IR, shape inference, codegen, ONNX import), completing the Tier 4 roadmap from the ONNX spec.
Hardswish activation — Hardswish(x) = x * min(max(0, x+3), 6) / 6, unlocks MobileNetV3. ONNX HardSwish op imported automatically.
Upsample layer — Upsample(scale: N) with nearest-neighbor interpolation for spatial upsampling. ONNX Upsample and Resize ops imported automatically. Unlocks YOLO-Tiny, U-Net, and encoder-decoder models.
Conv1D layer — 1D convolution with filters, kernel, stride, padding parameters. ONNX Conv ops with 3D weight tensors auto-detected as Conv1D. Enables audio, time-series, and keyword spotting models.
MaxPool1D layer — 1D max pooling with kernel and optional stride. ONNX MaxPool ops with 1D kernel_shape auto-detected. Enables audio and time-series models.
LayerNorm layer — Layer normalization with learnable scale and bias over the last dimension, with configurable epsilon. ONNX LayerNormalization op imported with epsilon and weights. Enables transformer-adjacent models.

[0.5.0] — 2026-04-23

Added

New layers: GlobalAvgPool2D, ReLU6, LeakyReLU, SiLU, Mul — five new layer types across all pipeline stages (lexer, parser, IR, shape inference, codegen, ONNX import), unlocking ResNet-18, MobileNetV1/V2, and EfficientNet model families.
Grouped / depthwise Conv2D — Conv2D now accepts a groups parameter (default 1) for grouped convolution, including depthwise separable convolution (groups == in_channels). ONNX Conv group attribute is imported automatically.
ONNX external tensor data support — nnc import can now load weights stored as external data files (ONNX data_location = EXTERNAL) with offset/length support, fixing import failures for models exported with torch.onnx.export(..., use_external_data_format=True).

Fixed

CHW→HWC weight permutation at Flatten→Dense boundary — nnc import now automatically detects the Flatten→Gemm pattern in ONNX graphs and permutes Dense weight matrix rows from PyTorch’s CHW flatten order to nnc’s HWC order, fixing incorrect inference results for all imported CNNs with Flatten→Dense transitions.
ONNX import empty tensor error — nnc import now produces a clear error message ("tensor '...' has no data") instead of a cryptic npy shape mismatch when tensor data is missing.

[0.4.0] — 2026-04-23

Added

--version / -V flag — nnc --version now prints the version from Cargo.toml.
--emit c flag — nnc compile model.nnl --emit c writes the generated .c and .h files directly without invoking the C compiler, useful for debugging and auditing generated code.

Fixed

Concat codegen for multi-dimensional tensors — fixed incorrect flat memcpy in Concat codegen that produced wrong results when concatenating 3D (HWC) tensors along the channel axis. Now generates proper strided copies for arbitrary concat axes.
ONNX import protobuf decode failure — fixed incorrect field tag numbers in AttributeProto that caused all ONNX imports to fail with a protobuf wire type error. Added missing floats field (tag 7).
Unsupported precision silently accepted — precision: "int8" and precision: "float64" now produce a compile error instead of silently generating incorrect float32 code.
Website hero demo — the output example now shows the realistic workflow (raw bytes piped through Python) instead of implying the binary outputs formatted text.
README DESIGN.md link — corrected broken link to point to docs/src/DESIGN.md.

[0.3.0] — 2025-04-23

Fixed

Conv2D rectangular kernel correctness — fixed a bug where non-square kernels (e.g., kernel: [3, 5]) produced incorrect inference results due to a variable shadowing issue in the generated C code. Square kernels were unaffected. The same shadowing fix was applied to MaxPool2D and AvgPool2D codegen for consistency.

[0.2.0] — 2025-04-20

Initial public release.

Added

NNLang DSL with version 0.2 syntax for defining neural network models
Layers: Input, Dense, Conv2D, MaxPool2D, AvgPool2D, Flatten, BatchNorm, Dropout, Add, Concat, ReLU, Sigmoid, Softmax
C code generation backend with static memory allocation (no heap, no runtime dependencies)
Output formats: exe, obj, lib, shared, header
Cross-compilation via --target-triple flag
SIMD target hints: generic, avx2, avx512, arm_neon
Weight loading from .npy files and .npz archives
ONNX model import via nnc import
nnc inspect command for model summary and shape information
nnc test command for verifying inference correctness against expected outputs
Explicit graph connections with connections { } block and skip connections
Liveness-based buffer reuse for minimal activation memory footprint
mdbook documentation site
