ICLR 2026 · TSALM Workshop

TF-JEPA: Predictive Alignment of Time–Frequency Representations Without Contrastive Pairs

Michael Chaykowsky
Rivian and Volkswagen Group Technologies  ·  Palo Alto, CA

Cross-View Prediction in Latent Space as an Alternative to Contrastive Learning

Learning generalizable representations from multivariate time series is challenging due to complex temporal dynamics, distribution shifts, and the difficulty of designing effective contrastive pairs. TF-JEPA is a non-contrastive self-supervised method that leverages predictive alignment to integrate representations from the time and frequency domains without relying on negative sampling.

TF-JEPA utilizes dual online time and frequency encoders, each paired with its own momentum-updated target encoder, embedding both views into a stable and unified latent space. Experiments on sleep EEG, gesture recognition, mechanical fault detection, and EMG classification demonstrate that TF-JEPA matches or surpasses contrastive and time–frequency consistency baselines.

No Negative Pairs

Replaces contrastive repulsion with cross-view prediction, eliminating the need for large batches or memory queues.

8.2× Less GPU Memory

Removes the quadratic 2B×2B similarity matrix required by NT-Xent, enabling training with batches as small as 32.

Up to +8 pp F₁

Improves cross-dataset transfer macro-F₁ by up to eight percentage points over contrastive baselines.

Predictive Alignment Across Time and Frequency

TF-JEPA couples online time and frequency encoders with momentum-updated target encoders and trains the online branches with a lightweight cosine loss. The architecture rests on three design choices:

Dual EMA Targets

The time and frequency target encoders receive no gradients. After every step their weights are updated as an exponential moving average (EMA, momentum m = 0.995) of the online weights, which provides stable target representations with no backward-pass overhead.
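
The EMA update described above can be sketched in a few lines. This is a minimal, framework-agnostic illustration in plain Python, not the paper's implementation: the function name `ema_update` and the dict-of-lists weight layout are assumptions made for the example, and only the momentum value m = 0.995 comes from the text.

```python
def ema_update(target, online, m=0.995):
    """Update target weights in place: target <- m * target + (1 - m) * online.

    `target` and `online` are hypothetical dicts mapping a parameter name to a
    flat list of floats. No gradients are involved in this step.
    """
    for name, online_w in online.items():
        target[name] = [
            m * t + (1.0 - m) * o for t, o in zip(target[name], online_w)
        ]
    return target


# Toy usage: the target drifts slowly toward the online weights.
online = {"encoder.w": [1.0, 2.0]}
target = {"encoder.w": [0.0, 0.0]}
ema_update(target, online)
print(target["encoder.w"])
```

With m close to 1, a single step moves the target only a small fraction of the way toward the online weights, which is what keeps the prediction targets stable.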

Lightweight Predictors

Two small MLPs (128→256→128) each map an online embedding to a prediction of the opposite view's target embedding. A BYOL-style cosine loss aligns the two domains without negative pairs.
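
As a rough illustration of the objective, the sketch below uses the standard BYOL form of the cosine loss, 2 − 2·cos(p, z), applied in both directions. The function names, the plain-list vectors, and the choice to sum the two directions are assumptions for the example; the paper may weight or average the terms differently.

```python
import math


def cosine_loss(pred, target):
    """BYOL-style loss 2 - 2*cos(pred, target).

    Returns 0 when the vectors point the same way and 4 when they are opposite.
    Assumes both vectors are non-zero.
    """
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 2.0 - 2.0 * dot / (norm_p * norm_t)


def cross_view_loss(pred_time_to_freq, target_freq, pred_freq_to_time, target_time):
    """Sum of both directions: each predictor output is matched to the
    opposite view's EMA target (summing is an assumption of this sketch)."""
    return (cosine_loss(pred_time_to_freq, target_freq)
            + cosine_loss(pred_freq_to_time, target_time))


# Toy usage with 2-d embeddings.
print(cross_view_loss([1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [1.0, 0.0]))
```

Each sample contributes only its own prediction and target pair, so no other sample in the batch enters the loss.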

End-to-End Fine-Tuning

Because the objective avoids contrastive collapse, all encoder weights can be unfrozen during downstream training, allowing full adaptation to the target distribution.

Figure 1. Architecture for TF-JEPA pre-training. Time and frequency views are processed by dual online encoders. Each predictor MLP maps the online embedding to match the EMA target of the opposite view, aligning both domains without negative samples.

Cross-Dataset Transfer Performance

Each model is pre-trained on the source dataset and fine-tuned on the corresponding target dataset with identical classifier heads. All experiments were conducted on a single NVIDIA A10 GPU.

Transfer Task          TS-TCC           TF-C             TF-JEPA          ΔF₁
                       Acc.    F₁       Acc.    F₁       Acc.    F₁
SleepEEG → Epilepsy    85.88   82.48    94.95   91.49    95.31   92.24    +0.75
FD-A → FD-B            73.85   77.31    89.34   91.62    99.28   99.47    +7.85
HAR → Gesture          63.33   59.91    68.33   65.79    75.66   74.34    +8.55
ECG → EMG              85.88   82.48    85.37   80.51    87.80   80.03    −2.45
Table 1. Full transfer performance (%) across four benchmark tasks. Positive ΔF₁ values favor TF-JEPA over the best competing baseline.
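
The ΔF₁ column can be reproduced from the F₁ scores in Table 1 as the TF-JEPA score minus the best baseline score. The snippet below is a small check of that arithmetic; the dictionary layout and the helper name `delta_f1` are illustrative choices, while the numbers are copied from the table.

```python
# F1 scores (%) copied from Table 1.
f1_scores = {
    "SleepEEG -> Epilepsy": {"TS-TCC": 82.48, "TF-C": 91.49, "TF-JEPA": 92.24},
    "FD-A -> FD-B": {"TS-TCC": 77.31, "TF-C": 91.62, "TF-JEPA": 99.47},
    "HAR -> Gesture": {"TS-TCC": 59.91, "TF-C": 65.79, "TF-JEPA": 74.34},
    "ECG -> EMG": {"TS-TCC": 82.48, "TF-C": 80.51, "TF-JEPA": 80.03},
}


def delta_f1(row):
    """TF-JEPA F1 minus the best competing baseline F1, in percentage points."""
    best_baseline = max(v for k, v in row.items() if k != "TF-JEPA")
    return round(row["TF-JEPA"] - best_baseline, 2)


for task, row in f1_scores.items():
    print(f"{task}: {delta_f1(row):+.2f} pp")
```

Note that on ECG → EMG the best baseline by F₁ is TS-TCC rather than TF-C, which is why the difference there is negative.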

Foundation Model Comparisons

TF-JEPA is also compared against large-scale foundation models (NormWear, CBraMod) pre-trained on diverse physiological corpora. TF-JEPA uses only a 13.3% subset of TUH EEG v2.0.1, yet remains competitive across all benchmarks.

Table 2. Target-task performance (%) against foundation models pre-trained on orders of magnitude more data.

EMA Momentum Analysis

An ablation across six momentum values (0.9–0.9995) shows that higher EMA momentum consistently improves transfer metrics. The HAR→Gesture task shows a +11.3 pp gain at m = 0.9995 relative to the lowest momentum (m = 0.9), confirming that target network stability is critical for non-contrastive time-series learning.

Figure 2. Validation F₁ on Gesture after pre-training on HAR with fixed EMA momenta. Higher momentum yields slower convergence but superior final scores.
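
One way to read the momentum values is through the effective averaging horizon of an EMA, roughly 1 / (1 − m) steps. This is a general rule of thumb about exponential moving averages rather than a result from the paper, and the helper name `ema_horizon` is hypothetical. It gives a feel for why higher momentum means a slower-moving target and slower convergence.

```python
def ema_horizon(m):
    """Approximate number of recent steps that dominate an EMA with momentum m."""
    return 1.0 / (1.0 - m)


# The ablation endpoints (0.9, 0.9995) and the default momentum (0.995).
for m in (0.9, 0.995, 0.9995):
    print(f"m = {m}: about {round(ema_horizon(m))} steps")
```

Under this reading, moving from the lowest to the highest momentum in the ablation lengthens the averaging window by a factor of about 200.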

Resource Usage

Because TF-JEPA replaces the contrastive NT-Xent objective with a non-contrastive BYOL-style cosine loss, it eliminates the quadratic 2B×2B similarity matrix that NT-Xent must materialize and back-propagate through at every pre-training step.
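
The difference in scaling can be made concrete by counting similarity terms per step. The sketch below counts entries only and says nothing about bytes or measured memory, which also depend on activations and framework overhead. The function names are hypothetical, and the assumption of two prediction directions (time to frequency and frequency to time) follows the architecture description.

```python
def ntxent_similarity_entries(batch_size):
    """NT-Xent compares every one of the 2B views with every other: (2B)^2 entries."""
    n_views = 2 * batch_size
    return n_views * n_views


def cosine_loss_terms(batch_size, directions=2):
    """A BYOL-style loss needs one cosine term per sample per prediction direction."""
    return directions * batch_size


for b in (32, 128):
    print(b, ntxent_similarity_entries(b), cosine_loss_terms(b))
```

Quadrupling the batch size multiplies the NT-Xent entry count by 16 but the cosine term count only by 4.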

Metric                  TF-JEPA    TF-C       Ratio (TF-JEPA / TF-C)
Total parameters        2.49 M     1.18 M     2.1×
Trainable parameters    1.31 M     1.18 M     1.1×
Peak GPU memory         51 MB      421 MB     0.12×
Avg. step time          26.7 ms    36.0 ms    0.74×
 Forward                12.7 ms    11.0 ms    1.15×
 Backward               14.1 ms    25.0 ms    0.56×

TF-JEPA's forward pass is marginally slower (12.7 ms vs. 11.0 ms) because each step also runs the momentum-updated target encoders, which roughly double the stored parameters (2.49 M total vs. 1.18 M). However, the backward pass is about 1.8× faster (14.1 ms vs. 25.0 ms) because the cosine loss produces a compact O(B) gradient graph in place of the O(B²) graph generated by the contrastive similarity matrix, which more than compensates for the extra forward-pass cost.
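
The headline ratios follow directly from the table entries. The snippet below recomputes them from the rounded values reported above; the variable names are illustrative, and small differences from the quoted figures (for example 8.2× peak-memory reduction) presumably reflect rounding of the underlying measurements.

```python
# Values copied from the resource table (rounded as reported).
tf_jepa = {"peak_mem_mb": 51.0, "step_ms": 26.7, "fwd_ms": 12.7, "bwd_ms": 14.1}
tf_c = {"peak_mem_mb": 421.0, "step_ms": 36.0, "fwd_ms": 11.0, "bwd_ms": 25.0}

memory_reduction = tf_c["peak_mem_mb"] / tf_jepa["peak_mem_mb"]
step_speedup = tf_c["step_ms"] / tf_jepa["step_ms"]
backward_speedup = tf_c["bwd_ms"] / tf_jepa["bwd_ms"]
forward_slowdown = tf_jepa["fwd_ms"] / tf_c["fwd_ms"]

print(f"peak memory reduction: {memory_reduction:.2f}x")
print(f"step-time speed-up:    {step_speedup:.2f}x")
print(f"backward speed-up:     {backward_speedup:.2f}x")
print(f"forward slowdown:      {forward_slowdown:.2f}x")
```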

Figure 3. Pre-training GPU footprint on an NVIDIA L4 (batch 128, seq len 178). TF-JEPA achieves an 8.2× reduction in peak memory and a 1.35× wall-clock speed-up by eliminating the quadratic similarity matrix.

ICLR 2026 — TSALM Workshop Poster

Presented at the 1st ICLR Workshop on Time Series in the Age of Large Models (TSALM).


BibTeX

If you find this work useful, please cite our paper:

@inproceedings{chaykowsky2026tfjepa,
  title     = {{TF-JEPA}: Predictive Alignment of Time--Frequency
               Representations Without Contrastive Pairs},
  author    = {Chaykowsky, Michael},
  booktitle = {1st ICLR Workshop on Time Series in the Age of
               Large Models (TSALM)},
  year      = {2026},
  url       = {https://iclr.cc/virtual/2026/10013864}
}