TF-JEPA: Predictive Alignment of Time–Frequency Representations Without Contrastive Pairs

Overview

Cross-View Prediction in Latent Space as an Alternative to Contrastive Learning

Learning generalizable representations from multivariate time series is challenging due to complex temporal dynamics, distribution shifts, and the difficulty of designing effective contrastive pairs. TF-JEPA is a non-contrastive self-supervised method that leverages predictive alignment to integrate representations from the time and frequency domains without relying on negative sampling.

TF-JEPA utilizes dual online time and frequency encoders, each paired with its own momentum-updated target encoder, embedding both views into a stable and unified latent space. Experiments on sleep EEG, gesture recognition, mechanical fault detection, and EMG classification demonstrate that TF-JEPA matches or surpasses contrastive and time–frequency consistency baselines.

No Negative Pairs

Replaces contrastive repulsion with cross-view prediction, eliminating the need for large batches or memory queues.

8.2× Less GPU Memory

Removes the quadratic 2B×2B similarity matrix that NT-Xent requires, enabling training with batches as small as 32.

Up to +8 pp F₁

Improves cross-dataset transfer macro-F₁ by up to 8.55 percentage points over the best competing baseline (HAR → Gesture).

Method

Predictive Alignment Across Time and Frequency

TF-JEPA pairs online time and frequency encoders with momentum-updated target encoders and trains them with a lightweight cosine loss. The architecture rests on three design choices:

Dual EMA Targets

The time and frequency target encoders are never trained by backpropagation. After every step they are updated by an exponential moving average (EMA, momentum m = 0.995) of the online weights, which provides stable target representations with no gradient overhead.
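The EMA rule described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `ema_update` and the toy weight dictionaries are hypothetical, and plain Python lists stand in for real parameter tensors.

```python
def ema_update(target, online, momentum=0.995):
    """Move each target weight toward the matching online weight.

    new_target = momentum * old_target + (1 - momentum) * online
    The target dict is modified in place; no gradients are involved.
    """
    for name, online_values in online.items():
        target[name] = [
            momentum * t + (1.0 - momentum) * o
            for t, o in zip(target[name], online_values)
        ]
    return target


# Hypothetical toy "weights": one named parameter with two entries.
target_weights = {"layer.weight": [0.0, 1.0]}
online_weights = {"layer.weight": [1.0, 1.0]}

ema_update(target_weights, online_weights)  # uses m = 0.995 from the text
print(target_weights)
```

With m = 0.995 a single step moves each target weight only 0.5% of the way toward its online counterpart, which is why the targets change slowly.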

Lightweight Predictors

Two small MLPs (128→256→128) each map an online embedding to a prediction of the EMA target embedding of the opposite view. A BYOL-style cosine loss aligns the two domains without negative pairs.
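A BYOL-style cosine loss is commonly written as 2 − 2·cos(p, z), where p is a predictor output and z is the target embedding. The sketch below assumes that form and assumes the two cross-view terms are simply summed; the text does not state either detail, and the function names are hypothetical. Plain lists stand in for 128-dimensional embeddings, and targets are treated as constants.

```python
import math


def cosine_loss(prediction, target, eps=1e-12):
    """Return 2 - 2*cos(prediction, target): 0 when aligned, 4 when opposite."""
    dot = sum(p * t for p, t in zip(prediction, target))
    norm_p = math.sqrt(sum(p * p for p in prediction))
    norm_t = math.sqrt(sum(t * t for t in target))
    # eps guards against division by zero for all-zero vectors.
    return 2.0 - 2.0 * dot / max(norm_p * norm_t, eps)


def cross_view_loss(pred_time, target_freq, pred_freq, target_time):
    """Assumed symmetric form: time->frequency term plus frequency->time term."""
    return cosine_loss(pred_time, target_freq) + cosine_loss(pred_freq, target_time)


# Toy 2-D embeddings: the first pair is aligned, the second is orthogonal.
loss = cross_view_loss([1.0, 0.0], [2.0, 0.0], [1.0, 0.0], [0.0, 1.0])
print(loss)
```

Each term depends only on one sample's own prediction and target, so no other sample in the batch enters the loss.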

End-to-End Fine-Tuning

Because the objective avoids contrastive collapse, all encoder weights can be unfrozen during downstream training, allowing full adaptation to the target distribution.

**Figure 1.** Architecture for TF-JEPA pre-training. Time and frequency views are processed by dual online encoders. Each predictor MLP maps the online embedding to match the EMA target of the opposite view, aligning both domains without negative samples.

Results

Cross-Dataset Transfer Performance

Each model is pre-trained on the source dataset and fine-tuned on the corresponding target dataset with identical classifier heads. All experiments were conducted on a single NVIDIA A10 GPU.

Transfer Task	TS-TCC Acc.	TS-TCC F₁	TF-C Acc.	TF-C F₁	TF-JEPA Acc.	TF-JEPA F₁	ΔF₁
SleepEEG → Epilepsy	85.88	82.48	94.95	91.49	95.31	92.24	+0.75
FD-A → FD-B	73.85	77.31	89.34	91.62	99.28	99.47	+7.85
HAR → Gesture	63.33	59.91	68.33	65.79	75.66	74.34	+8.55
ECG → EMG	85.88	82.48	85.37	80.51	87.80	80.03	−2.45

**Table 1.** Full transfer performance (%) across four benchmark tasks. Positive ΔF₁ values favor TF-JEPA over the best competing baseline.
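The ΔF₁ column can be reproduced from the F₁ scores in Table 1 by subtracting the better of the two baseline scores from the TF-JEPA score. The short check below copies the table values verbatim; the helper name `delta_f1` is hypothetical.

```python
# (TS-TCC F1, TF-C F1, TF-JEPA F1) copied from Table 1, in percent.
f1_scores = {
    "SleepEEG -> Epilepsy": (82.48, 91.49, 92.24),
    "FD-A -> FD-B": (77.31, 91.62, 99.47),
    "HAR -> Gesture": (59.91, 65.79, 74.34),
    "ECG -> EMG": (82.48, 80.51, 80.03),
}


def delta_f1(ts_tcc, tf_c, tf_jepa):
    """TF-JEPA F1 minus the best baseline F1, in percentage points."""
    return round(tf_jepa - max(ts_tcc, tf_c), 2)


deltas = {task: delta_f1(*scores) for task, scores in f1_scores.items()}
for task, delta in deltas.items():
    print(f"{task}: {delta:+.2f} pp")
```

On ECG → EMG the best baseline is TS-TCC rather than TF-C, which is why the difference there is negative.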

Foundation Model Comparisons

TF-JEPA is also compared against large-scale foundation models (NormWear, CBraMod) pre-trained on diverse physiological corpora. TF-JEPA uses only a 13.3% subset of TUH EEG v2.0.1, yet remains competitive across all benchmarks.

**Table 2.** Target-task performance (%) against foundation models pre-trained on orders of magnitude more data.

EMA Momentum Analysis

An ablation across six momentum values (0.9–0.9995) shows that higher EMA momentum consistently improves transfer metrics. On the HAR→Gesture task, m = 0.9995 yields a gain of 11.3 pp over the lowest momentum (m = 0.9), indicating that target-network stability is critical for non-contrastive time-series learning.

**Figure 2.** Validation F₁ on Gesture after pre-training on HAR with fixed EMA momenta. Higher momentum yields slower convergence but superior final scores.
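One way to read the momentum values is through the approximate averaging horizon of an EMA, 1 / (1 − m) steps, which is a general property of exponential moving averages and not a quantity reported in the ablation. The sketch below evaluates it for the three momenta named in the text; the helper name `ema_horizon` is hypothetical.

```python
def ema_horizon(momentum):
    """Approximate number of recent steps an EMA averages over: 1 / (1 - m)."""
    return 1.0 / (1.0 - momentum)


# Lowest and highest ablation values, plus the default m = 0.995.
for m in (0.9, 0.995, 0.9995):
    print(f"m = {m}: ~{ema_horizon(m):.0f} steps")
```

The horizon grows from roughly 10 steps at m = 0.9 to roughly 2000 steps at m = 0.9995, which is consistent with the slower convergence noted for higher momentum.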

Resource Usage

Because TF-JEPA replaces the contrastive NT-Xent objective with a non-contrastive BYOL-style cosine loss, it eliminates the quadratic 2B×2B similarity matrix that NT-Xent must materialize and back-propagate through at every pre-training step.

Metric	TF-JEPA	TF-C	Ratio (TF-JEPA / TF-C)
Total parameters	2.49 M	1.18 M	2.1×
Trainable parameters	1.31 M	1.18 M	1.1×
Peak GPU memory	51 MB	421 MB	0.12×
Avg. step time	26.7 ms	36.0 ms	0.74×
Forward	12.7 ms	11.0 ms	1.15×
Backward	14.1 ms	25.0 ms	0.56×
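The ratio column divides the TF-JEPA value by the TF-C value, so entries below 1 favor TF-JEPA. The check below recomputes the column from the table entries; the dictionary layout and variable names are illustrative only.

```python
# (TF-JEPA, TF-C) values copied from the table above; units as listed there.
measurements = {
    "Total parameters (M)": (2.49, 1.18),
    "Trainable parameters (M)": (1.31, 1.18),
    "Peak GPU memory (MB)": (51.0, 421.0),
    "Avg. step time (ms)": (26.7, 36.0),
    "Forward (ms)": (12.7, 11.0),
    "Backward (ms)": (14.1, 25.0),
}

# Ratio = TF-JEPA / TF-C for every row.
ratios = {name: tf_jepa / tf_c for name, (tf_jepa, tf_c) in measurements.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")

# Reciprocals express the same comparison as a reduction or speed-up factor.
memory_reduction = 1.0 / ratios["Peak GPU memory (MB)"]
step_speedup = 1.0 / ratios["Avg. step time (ms)"]
print(f"memory reduction: {memory_reduction:.2f}x, step speed-up: {step_speedup:.2f}x")
```

The two reciprocals correspond to the memory and wall-clock factors quoted with the GPU footprint figure, up to rounding of the table entries.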

TF-JEPA's forward pass is marginally slower (12.7 ms vs. 11.0 ms) because each step also runs the momentum-updated target encoders, which roughly double the stored parameters (2.49 M total vs. 1.31 M trainable). However, the backward pass is 1.8× faster (14.1 ms vs. 25.0 ms) because the cosine loss produces a compact O(B) gradient graph in place of the O(B²) graph generated by the contrastive similarity matrix, which more than compensates for the extra forward-pass cost.
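The O(B) versus O(B²) contrast can be illustrated by counting loss terms as the batch size grows. The sketch below counts only the entries of the 2B×2B NT-Xent similarity matrix and the per-sample cosine terms; it assumes one cosine term per sample for each of the two predictors, and it does not model encoder activations or any other memory. Both function names are hypothetical.

```python
def ntxent_similarity_entries(batch_size):
    """Entries in the 2B x 2B similarity matrix that NT-Xent materializes."""
    return (2 * batch_size) ** 2


def cosine_loss_terms(batch_size):
    """Per-sample cosine terms, assuming one per sample for each of two predictors."""
    return 2 * batch_size


for b in (32, 128, 512):
    quadratic = ntxent_similarity_entries(b)
    linear = cosine_loss_terms(b)
    print(f"B = {b}: {quadratic} similarity entries vs. {linear} cosine terms")
```

Under these assumptions the gap between the two counts is itself a factor of 2B, so it widens as the batch grows.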

**Figure 3.** Pre-training GPU footprint on an NVIDIA L4 (batch 128, seq len 178). TF-JEPA achieves an 8.2× reduction in peak memory and a 1.35× wall-clock speed-up by eliminating the quadratic similarity matrix.