NM Neural Decoding Demo

Decoding Motor Intent from Neural Signals

Comparing MLP, 2D CNN, LSTM, and Transformer architectures for brain-computer interface decoding

This project decodes hand position and velocity from 95-channel neural spike data recorded during a center-out reaching task. The goal: predict where a monkey is moving its hand using only brain signals. Four deep learning architectures were trained and compared on the same dataset (contdata95.mat), with spike counts binned at 50 ms.
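The sequence models below consume sliding windows of 32 consecutive 50 ms bins, each paired with the kinematics at the window's final bin. A minimal sketch of that windowing step, using synthetic stand-in data (the actual field names inside contdata95.mat are not shown here):

```python
import numpy as np

def make_windows(spikes, kinematics, window=32):
    """Build sliding windows of binned spike counts, each paired with the
    kinematics (x/y position, x/y velocity) at the window's last bin."""
    X = np.stack([spikes[t - window:t] for t in range(window, len(spikes) + 1)])
    y = kinematics[window - 1:]
    return X, y

# Synthetic stand-in: 1000 bins (50 ms each) of 95-channel spike counts.
rng = np.random.default_rng(0)
spikes = rng.poisson(2.0, size=(1000, 95)).astype(np.float32)
kin = rng.normal(size=(1000, 4)).astype(np.float32)

X, y = make_windows(spikes, kin)
print(X.shape, y.shape)  # (969, 32, 95) (969, 4)
```

Each of the 969 training examples is a (32 × 95) window; the MLP instead takes a single 95-dimensional bin.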

Live Decoding Replay

Watch each model attempt to decode hand position from neural activity in real time. Blue = actual trajectory, red = model prediction.

Panels: Neural Activity (95 channels) · Decoded Position · live Correlation readout

Architecture Comparison

Pearson correlation between predicted and actual kinematics on the held-out test set.

Model         Avg corr   X pos   Y pos   X vel   Y vel
MLP           0.876      0.878   0.807   0.917   0.902
LSTM          0.987      0.989   0.986   0.987   0.985
2D CNN        0.788      0.608   0.615   0.969   0.960
Transformer   0.740      0.683   0.692   0.787   0.798

Model Architectures

Fully Connected NN

Input (95)
FC 64
FC 64
FC 128
FC 128
FC 256
Output (4)
BatchNorm + Dropout (0.3) after each hidden layer. Takes a single time bin as input, so it uses no temporal context.
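A hedged PyTorch sketch of the layer stack above (activation choice and training details are assumptions, not taken from the source):

```python
import torch
import torch.nn as nn

def block(n_in, n_out, p=0.3):
    """FC layer followed by BatchNorm, ReLU, and Dropout, as described above."""
    return nn.Sequential(nn.Linear(n_in, n_out), nn.BatchNorm1d(n_out),
                         nn.ReLU(), nn.Dropout(p))

mlp = nn.Sequential(
    block(95, 64), block(64, 64), block(64, 128),
    block(128, 128), block(128, 256),
    nn.Linear(256, 4),  # x/y position and x/y velocity
)

x = torch.randn(8, 95)       # one 50 ms bin of spike counts per sample
out = mlp.eval()(x)          # eval mode: BatchNorm uses running stats
print(out.shape)             # torch.Size([8, 4])
```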

2D CNN

Input (1 x 32 x 95)
Conv2D 32 (k=5)
Conv2D 64 (k=5)
Conv2D 128 (k=3)
Conv2D 128 (k=3)
AdaptiveAvgPool2D
FC 64
Output (4)
Treats (time x channels) as a grayscale image. Conv2D captures spatio-temporal patterns. BatchNorm + Dropout (0.3).
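A hedged sketch of the convolutional stack, treating each (32 × 95) window as a one-plane image; padding and activation choices are assumptions:

```python
import torch
import torch.nn as nn

def conv(cin, cout, k):
    """Conv2D block with BatchNorm, ReLU, and Dropout (0.3)."""
    return nn.Sequential(nn.Conv2d(cin, cout, k, padding=k // 2),
                         nn.BatchNorm2d(cout), nn.ReLU(), nn.Dropout(0.3))

cnn = nn.Sequential(
    conv(1, 32, 5), conv(32, 64, 5), conv(64, 128, 3), conv(128, 128, 3),
    nn.AdaptiveAvgPool2d(1),   # collapse the time x channel map to 128 features
    nn.Flatten(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 4),
)

x = torch.randn(8, 1, 32, 95)  # batch of 32-bin x 95-channel windows
out = cnn.eval()(x)
print(out.shape)               # torch.Size([8, 4])
```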

LSTM

Input (32 x 95)
BatchNorm
LSTM 128
Last Hidden State
FC → Output (4)
32-step sequence input captures 1.6s of temporal context. Only the final hidden state is used for prediction.
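A hedged sketch of the LSTM decoder: normalize channels, run the 32-step sequence, and predict from the final hidden state only (the BatchNorm placement is an assumption):

```python
import torch
import torch.nn as nn

class LSTMDecoder(nn.Module):
    def __init__(self, n_channels=95, hidden=128, n_out=4):
        super().__init__()
        self.norm = nn.BatchNorm1d(n_channels)  # per-channel normalization
        self.lstm = nn.LSTM(n_channels, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_out)

    def forward(self, x):  # x: (batch, 32, 95)
        # BatchNorm1d expects (batch, channels, time), so transpose around it.
        x = self.norm(x.transpose(1, 2)).transpose(1, 2)
        _, (h_n, _) = self.lstm(x)
        return self.fc(h_n[-1])  # final hidden state -> 4 kinematic outputs

x = torch.randn(8, 32, 95)
out = LSTMDecoder().eval()(x)
print(out.shape)  # torch.Size([8, 4])
```

Using only the last hidden state means the prediction for each window depends on the full 1.6 s of context the LSTM has accumulated.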

Transformer

Input (32 x 95)
BatchNorm
Linear (95 → 130)
TransformerEncoder (5 heads)
Mean Pool
FC → Output (4)
Encoder-only architecture with self-attention over 32 time steps. Mean pooling across the sequence for prediction.
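A hedged sketch of the encoder-only Transformer: project 95 channels up to d_model = 130 (divisible by 5 heads), self-attend over the 32 steps, then mean-pool. The encoder depth (2 layers here) is an assumption not stated in the source:

```python
import torch
import torch.nn as nn

class TransformerDecoder(nn.Module):
    def __init__(self, n_channels=95, d_model=130, n_heads=5, n_layers=2):
        super().__init__()
        self.norm = nn.BatchNorm1d(n_channels)
        self.proj = nn.Linear(n_channels, d_model)  # 95 -> 130
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.fc = nn.Linear(d_model, 4)

    def forward(self, x):  # x: (batch, 32, 95)
        x = self.norm(x.transpose(1, 2)).transpose(1, 2)
        z = self.encoder(self.proj(x))     # self-attention over 32 steps
        return self.fc(z.mean(dim=1))      # mean pool across the sequence

x = torch.randn(8, 32, 95)
out = TransformerDecoder().eval()(x)
print(out.shape)  # torch.Size([8, 4])
```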

Key Insight

Regularization matters more than architecture complexity. The original MLP used 0.5 dropout, which hurt learning. Reducing dropout to 0.3 brought the single-bin MLP to 0.88 average correlation. The LSTM dominates at 0.987 correlation, benefiting from recurrent memory over 32 time steps. The 2D CNN (0.79) captures spatio-temporal patterns but struggles with position decoding. The Transformer (0.74) underperforms here likely due to limited hyperparameter tuning and the small dataset size.

Original Results

Decoded trajectories from the notebooks. Blue = actual, Red = predicted.

MLP Decode

MLP predicted vs actual hand position

With proper regularization (dropout 0.3), the MLP achieves 0.876 average correlation from single time-bin inputs.

2D CNN Decode

2D CNN predicted vs actual hand position

The 2D CNN excels at velocity (0.97) but is weaker on position (0.61), averaging 0.79 correlation.

LSTM Decode

LSTM predicted vs actual hand position

The LSTM nearly perfectly tracks the actual trajectory, achieving 0.987 average correlation.

Transformer Decode

Transformer predicted vs actual hand position

The Transformer achieves 0.74 average correlation, likely limited by overfitting on the small dataset.