Software engineers encounter EEG data increasingly often — in health-tech projects, in BCI research pipelines, in wearable device software, in academic collaborations with neuroscience labs. The signal has properties that differ enough from the time-series data most software engineers have encountered that standard approaches need adjustment. This piece covers the fundamental characteristics of EEG and the preprocessing choices that matter most for building reliable pipelines.
What You're Actually Recording
EEG measures voltage differences between pairs of electrodes placed on the scalp. The signal reflects the summed postsynaptic potentials of large populations of cortical neurons oriented perpendicular to the scalp surface — primarily the apical dendrites of pyramidal neurons in the cortical layers just below the electrode. The skull attenuates and blurs the signal; you're always measuring a spatial average from a patch of cortex at least a few centimetres across.
Typical amplitudes are in the range of 10–100 microvolts for neural signals of interest. For comparison, a single muscle contraction near an electrode can produce artifacts in the millivolt range, one to two orders of magnitude larger. Powerline interference (50 Hz in Europe and most of Asia, 60 Hz in North America) is omnipresent. Electrode motion against the scalp produces low-frequency drift. Eye movements and blinks generate large, frontally distributed electrical fields. All of this is in your data, overlapping spectrally and temporally with the signals you care about.
Frequency Bands and What They Represent
| Band | Frequency Range | Associated States / Processes |
|---|---|---|
| Delta | 0.5 – 4 Hz | Deep sleep, some pathological states in wakefulness. Large amplitude, globally distributed. |
| Theta | 4 – 8 Hz | Drowsiness, working memory, navigation. Strong at frontal and temporal electrodes. |
| Alpha | 8 – 13 Hz | Relaxed wakefulness, eyes closed. Strongest over occipital (visual) cortex. Attenuates with visual or cognitive engagement — "alpha desynchronization." |
| Mu | 8 – 13 Hz | Sensorimotor rhythm over motor cortex. Attenuates with movement or motor imagery — the key feature for motor BCI applications. |
| Beta | 13 – 30 Hz | Active thinking, concentration, motor output maintenance. Also attenuates with movement. |
| Low Gamma | 30 – 70 Hz | Cognitive integration, feature binding, focused attention. |
| High Gamma | 70 – 150+ Hz | Very local cortical processing; best seen in ECoG, heavily contaminated by EMG in scalp EEG. |
The band edges above are conventions, not hard boundaries. The brain's oscillatory dynamics are continuous, and the "bands" are analysis abstractions. Different research groups use slightly different boundary definitions, so check which definitions were used before comparing results across studies.
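Because the edges are conventions, it helps to keep them in one place in code rather than scattering magic numbers through a pipeline. The sketch below copies the table into a mapping; the names `EEG_BANDS` and `bands_for_frequency` are illustrative, and the 150 Hz cap on high gamma is an assumption standing in for the table's "150+".

```python
# Band edges in Hz, copied from the table above. These are conventions;
# adjust them to match whichever study or toolbox you are comparing against.
EEG_BANDS = {
    "delta": (0.5, 4.0),
    "theta": (4.0, 8.0),
    "alpha": (8.0, 13.0),
    "mu": (8.0, 13.0),  # same range as alpha; distinguished by scalp location
    "beta": (13.0, 30.0),
    "low_gamma": (30.0, 70.0),
    "high_gamma": (70.0, 150.0),  # table says "150+"; 150 is an assumed cap
}


def bands_for_frequency(freq_hz, bands=EEG_BANDS):
    """Return the names of all bands whose half-open range [low, high) contains freq_hz."""
    return [name for name, (low, high) in bands.items() if low <= freq_hz < high]
```

A 10 Hz component maps to both alpha and mu, which is the point: frequency alone cannot tell them apart, and the electrode location has to do that work.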
The Artifact Problem in Detail
Artifacts are the dominant data quality challenge in EEG and the place where naive signal processing fails most spectacularly.
Ocular Artifacts — Eye Movements and Blinks
The eye is an electrical dipole: the retina is electronegative relative to the cornea. Horizontal saccades produce large voltages at frontal electrodes (F7, F8). Vertical eye movements and blinks produce large, characteristic potentials at Fp1 and Fp2 that spread across the whole scalp. Blink artifacts can reach several hundred microvolts, easily 10x the neural signal of interest. In a resting-state recording, an average adult blinks 15–20 times per minute.
Standard removal approaches: independent component analysis (ICA) identifies statistically independent components in the multichannel signal; components with characteristic frontal distribution and blink-like temporal waveforms are labelled as ocular and removed before reconstructing the signal. Automatic ICA labelling tools (ICLabel, MARA) work reasonably well but still require occasional manual review.
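The decompose, label, remove, reconstruct loop can be sketched on synthetic data. This is a toy illustration, not a replacement for ICLabel or MARA: it uses scikit-learn's `FastICA`, and it labels the "blink" component simply as the one with the highest kurtosis, which is an assumption that holds for this synthetic signal but is far cruder than real labelling. The helper names, the mixing matrix, and the four-channel layout are all made up for the example.

```python
import numpy as np
from scipy.stats import kurtosis
from sklearn.decomposition import FastICA


def make_synthetic_recording(fs=250, seconds=20, seed=0):
    """Toy 4-channel recording (Fp1, Fp2, C3, O1) in microvolts; not real EEG."""
    rng = np.random.default_rng(seed)
    t = np.arange(fs * seconds) / fs
    alpha = 10.0 * np.sin(2 * np.pi * 10 * t)
    beta = 5.0 * np.sin(2 * np.pi * 21 * t + 1.0)
    noise = rng.uniform(-10.0, 10.0, t.size)
    # One blink every 4 s, modelled as a 200 microvolt Gaussian bump.
    blink = sum(200.0 * np.exp(-0.5 * ((t - c) / 0.08) ** 2)
                for c in np.arange(2, seconds, 4))
    sources = np.column_stack([alpha, beta, noise, blink])
    mixing = np.array([
        [0.3, 0.3, 0.6, 1.0],   # alpha: strongest at O1
        [0.4, 0.2, 1.0, 0.3],   # beta: strongest at C3
        [0.5, -0.5, 0.3, 0.2],  # broadband noise
        [1.0, 0.9, 0.3, 0.1],   # blink: strongest at Fp1/Fp2
    ])
    return sources @ mixing, blink


def remove_blink_component(data, random_state=0):
    """Remove the most blink-like ICA component from (n_samples, n_channels) data.

    "Blink-like" is approximated as the highest-kurtosis component (an assumption).
    """
    ica = FastICA(n_components=data.shape[1], random_state=random_state, max_iter=1000)
    sources = ica.fit_transform(data)            # (n_samples, n_components)
    blink_idx = int(np.argmax(kurtosis(sources, axis=0)))
    sources[:, blink_idx] = 0.0                  # zero the component out
    cleaned = ica.inverse_transform(sources)     # back to channel space
    return cleaned, blink_idx


data, blink = make_synthetic_recording()
cleaned, blink_idx = remove_blink_component(data)
```

On real recordings the component count, the labelling rule, and the decision to remove a component all deserve more care, which is why manual review remains part of the workflow.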
Muscular Artifacts — EMG Contamination
EMG from facial and scalp muscles is broadband (20–500 Hz) and can be large. Jaw clenching, teeth grinding, forehead tension, swallowing — all produce large muscle artifacts that contaminate EEG, particularly in the beta and gamma bands. This is a severe problem for ambulatory and wearable EEG where subjects are moving naturally. ICA can separate some muscle components, but high-frequency EMG overlaps with neural signals spectrally in ways that make clean separation difficult without dense electrode arrays.
Movement and Electrode Artifacts
Any movement of the electrode relative to the scalp changes the electrode-skin impedance transiently, producing large low-frequency transients. These can look like slow cortical potentials and will survive highpass filtering unless the cutoff is aggressive. Cable movement from subjects moving their heads also introduces inductive and electrostatic pickup. These artifacts are more manageable with good hardware (active electrodes with front-end amplification at the electrode site) and good electrode preparation (skin abrasion, conductive gel, low impedance verification before recording).
The choice of preprocessing pipeline — filtering cutoffs, artifact rejection strategy, reference electrode, epoch length — affects downstream analysis results more than most algorithm choices. Two researchers applying different preprocessing to the same raw data can reach different conclusions. For reproducible and deployable pipelines, preprocessing choices must be explicit, justified, and validated on held-out data.
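One way to make those choices explicit is to hold them in a single immutable object that is saved alongside every result. The sketch below is a hypothetical container, not a standard from any EEG library; the field names and default values are illustrative only.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreprocessingConfig:
    """Hypothetical record of the choices that shape downstream results."""
    highpass_hz: float = 1.0
    bandpass_hz: tuple = (8.0, 30.0)
    notch_hz: float = 50.0
    reference: str = "average"
    artifact_strategy: str = "ica"
    reject_peak_to_peak_uv: float = 150.0
    epoch_tmin_s: float = -0.5   # illustrative value
    epoch_tmax_s: float = 4.0    # illustrative value

    def to_json(self) -> str:
        """Serialise with sorted keys so two configs can be diffed as text."""
        return json.dumps(asdict(self), sort_keys=True, indent=2)


config = PreprocessingConfig(notch_hz=60.0)
print(config.to_json())
```

Storing the serialised config next to each trained model or figure makes it possible to tell later whether two results differ because of the data or because of the preprocessing.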
A Minimal Viable Preprocessing Pipeline
For a motor imagery BCI application, one of the most common EEG classification tasks, a reasonable starting pipeline looks like this:
First, import and inspect the raw data. Check electrode impedances if recorded, visualize the raw signal for gross artifacts, note any bad channels (permanently saturated, flat, or obviously noisy electrodes). Bad channels should be interpolated from neighbours rather than included in analysis.
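Visual inspection can be backed by a simple automatic screen. The sketch below flags channels that are flat or far noisier than the rest, using a robust z-score of per-channel standard deviation. The function name and both thresholds are assumptions chosen for illustration, and the function only detects candidates: interpolation needs electrode positions and is better left to a dedicated toolbox.

```python
import numpy as np


def find_bad_channels(data, flat_uv=0.5, z_thresh=5.0):
    """Flag flat or unusually noisy channels.

    data: array of shape (n_channels, n_samples), in microvolts.
    Returns a sorted list of suspect channel indices.
    """
    std = data.std(axis=1)
    flat = std < flat_uv
    if flat.all():
        return list(range(data.shape[0]))
    # Robust z-score of per-channel std. Median and MAD are used so that
    # the bad channels themselves do not inflate the scale estimate.
    median = np.median(std[~flat])
    mad = np.median(np.abs(std[~flat] - median))
    robust_z = (std - median) / (1.4826 * mad + 1e-12)
    noisy = robust_z > z_thresh
    return sorted(np.flatnonzero(flat | noisy).tolist())
```

Treat the output as a list of channels to look at, not a verdict; saturated channels, for example, need their own check against the amplifier's range.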
Second, filtering. Apply a highpass filter at 1 Hz to remove slow drift. Apply a notch filter at the powerline frequency (50 or 60 Hz) and its harmonics. For motor imagery, a bandpass of 8–30 Hz (mu and beta bands) is typical. Use zero-phase filtering (filtfilt in SciPy or MATLAB) to avoid phase shifts. Zero-phase filtering runs over the data forwards and backwards, so it is only available offline; an online BCI has to use a causal filter and accept its delay. Be aware that highpass filtering at 1 Hz with a sharp filter can introduce ringing around abrupt transients; 0.5 Hz is often a safer choice.
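A minimal offline version of this filtering step using SciPy might look like the following. The filter orders, the notch Q factor, and the function name `filter_eeg` are assumptions for the sketch, not recommendations; the right values depend on the sampling rate and on how much transient distortion the analysis can tolerate.

```python
import numpy as np
from scipy import signal


def filter_eeg(data, fs, highpass_hz=1.0, notch_hz=50.0, band_hz=(8.0, 30.0)):
    """Zero-phase filter data of shape (n_channels, n_samples) sampled at fs Hz."""
    # 1) High-pass to remove slow drift.
    sos_hp = signal.butter(4, highpass_hz, btype="highpass", fs=fs, output="sos")
    out = signal.sosfiltfilt(sos_hp, data, axis=-1)

    # 2) Notch at the powerline frequency and every harmonic below Nyquist.
    for f0 in np.arange(notch_hz, fs / 2, notch_hz):
        b, a = signal.iirnotch(f0, Q=30.0, fs=fs)
        out = signal.filtfilt(b, a, out, axis=-1)

    # 3) Band-pass to the mu/beta range used for motor imagery.
    sos_bp = signal.butter(4, band_hz, btype="bandpass", fs=fs, output="sos")
    return signal.sosfiltfilt(sos_bp, out, axis=-1)
```

With an 8–30 Hz bandpass the notch is largely redundant for the final features, but keeping the broader highpass-plus-notch stage separate is useful when the same data also feed ICA or a wider-band analysis.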
Third, re-referencing. EEG voltages are always relative to a reference electrode. Common choices are average reference (each electrode referenced to the mean across all electrodes), linked mastoids, or Cz. Average reference is theoretically motivated and widely used for source analysis. The choice affects the spatial distribution of the resulting signals.
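Re-referencing is a subtraction, which a short NumPy sketch makes concrete. The function names are illustrative, and the code assumes bad channels have already been interpolated or excluded, since a single noisy channel would otherwise leak into every channel through the average.

```python
import numpy as np


def average_reference(data):
    """Re-reference (n_channels, n_samples) data to the mean across channels."""
    return data - data.mean(axis=0, keepdims=True)


def rereference_to_channels(data, ref_indices):
    """Re-reference to the mean of chosen channels (two mastoids, or Cz alone)."""
    return data - data[ref_indices].mean(axis=0, keepdims=True)
```

After average referencing the channels sum to zero at every sample, so the data lose one rank; methods that invert a covariance matrix need to account for that.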
Fourth, artifact removal. Run ICA on the continuous data. Label components using ICLabel or manual inspection. Remove ocular and muscular components. Reconstruct the signal. Alternatively, once the data are epoched (step five), discard trials whose peak-to-peak amplitude exceeds a threshold (typically 100–150 µV for scalp EEG).
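The threshold-based alternative is simple to sketch. The version below assumes the data have already been cut into epochs with shape (n_epochs, n_channels, n_samples) in microvolts; the function name and the 150 µV default are illustrative.

```python
import numpy as np


def reject_by_peak_to_peak(epochs, threshold_uv=150.0):
    """Drop epochs in which any channel exceeds threshold_uv peak to peak.

    epochs: array of shape (n_epochs, n_channels, n_samples), in microvolts.
    Returns (kept_epochs, keep_mask).
    """
    ptp = np.ptp(epochs, axis=-1)                 # (n_epochs, n_channels)
    keep = (ptp <= threshold_uv).all(axis=1)      # every channel must pass
    return epochs[keep], keep
```

Returning the mask along with the data lets the caller drop the matching labels and report how many trials were rejected, which is worth logging per session.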
Fifth, epoching. Segment the continuous data into epochs time-locked to events of interest (stimulus onset, movement cue, response). Include a pre-stimulus baseline period. Apply baseline correction (subtract mean of baseline window from each time point).
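Epoching and baseline correction amount to slicing and subtracting a mean. The sketch below assumes events are given as sample indices into a (n_channels, n_samples) array and that `tmin` is negative so a pre-stimulus baseline exists; the function name and default window are illustrative.

```python
import numpy as np


def epoch_and_baseline(data, event_samples, fs, tmin=-0.5, tmax=4.0):
    """Cut (n_channels, n_samples) data into baseline-corrected epochs.

    Events whose window would run past either end of the recording are skipped.
    Returns an array of shape (n_epochs, n_channels, n_times).
    """
    if tmin >= 0:
        raise ValueError("tmin must be negative to provide a baseline window")
    start_off = int(round(tmin * fs))   # negative: samples before the event
    stop_off = int(round(tmax * fs))
    n_baseline = -start_off
    epochs = []
    for ev in event_samples:
        start, stop = ev + start_off, ev + stop_off
        if start < 0 or stop > data.shape[1]:
            continue
        segment = data[:, start:stop].astype(float)
        # Subtract each channel's mean over the pre-event window.
        baseline = segment[:, :n_baseline].mean(axis=1, keepdims=True)
        epochs.append(segment - baseline)
    if not epochs:
        return np.empty((0, data.shape[0], stop_off - start_off))
    return np.stack(epochs)
```

Silently skipping truncated events keeps the sketch short, but a real pipeline should record which events were dropped so labels stay aligned with epochs.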
Feature Extraction for Classification
For motor imagery BCI, the standard feature is band power in the mu and beta bands at central electrodes, particularly C3 and C4 (left and right motor cortex). Compute power spectral density using Welch's method or multitaper estimation. The feature vector is typically band power at a set of electrode-frequency combinations, fed to a linear discriminant or SVM classifier.
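As a rough sketch of this feature, the code below estimates band power with `scipy.signal.welch` and stacks it into one row per epoch. The one-second segment length, the log transform, and the function names are assumptions for the example; the log is a common way to make power values less skewed but is not required by the method.

```python
import numpy as np
from scipy import signal


def band_power(epoch, fs, band):
    """Mean Welch PSD inside band (low, high) Hz for each channel.

    epoch: array of shape (n_channels, n_samples).
    """
    nperseg = min(epoch.shape[-1], int(fs))       # about 1 Hz resolution
    freqs, psd = signal.welch(epoch, fs=fs, nperseg=nperseg, axis=-1)
    mask = (freqs >= band[0]) & (freqs < band[1])
    return psd[:, mask].mean(axis=-1)


def band_power_features(epochs, fs, channels, bands=((8.0, 13.0), (13.0, 30.0))):
    """Log band power, shape (n_epochs, len(bands) * len(channels)).

    Columns are ordered band by band: all channels for the first band,
    then all channels for the second.
    """
    feats = []
    for epoch in epochs:
        row = [np.log(band_power(epoch[channels], fs, b)) for b in bands]
        feats.append(np.concatenate(row))
    return np.asarray(feats)
```

Here `channels` would be the indices of C3 and C4 in the recording's montage, and the resulting matrix can go straight into a linear discriminant or SVM.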
Common spatial patterns (CSP) is a supervised spatial filtering method widely used for motor imagery — it finds linear combinations of electrodes that maximise the variance ratio between two classes. CSP is simple, well-understood, and often outperforms more complex approaches on well-curated data. For multi-class problems, one-vs-rest or one-vs-one CSP extensions are standard.
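For the two-class case, CSP reduces to a generalised eigenvalue problem on the class covariance matrices. The sketch below is one common formulation, with trace-normalised covariances and log-variance features; the function names are illustrative, and it assumes the summed covariance is full rank, which does not hold for average-referenced data unless a channel is dropped or the matrix is regularised.

```python
import numpy as np
from scipy import linalg


def _mean_covariance(epochs):
    """Average trace-normalised spatial covariance over epochs.

    epochs: array of shape (n_epochs, n_channels, n_samples).
    """
    covs = []
    for x in epochs:
        c = x @ x.T
        covs.append(c / np.trace(c))
    return np.mean(covs, axis=0)


def fit_csp(epochs_a, epochs_b, n_pairs=1):
    """Return spatial filters of shape (2 * n_pairs, n_channels).

    The first n_pairs rows maximise class-A variance relative to class B;
    the last n_pairs rows do the opposite.
    """
    cov_a = _mean_covariance(epochs_a)
    cov_b = _mean_covariance(epochs_b)
    # Generalised eigenproblem: cov_a w = lambda (cov_a + cov_b) w.
    eigvals, eigvecs = linalg.eigh(cov_a, cov_a + cov_b)
    eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]   # largest eigenvalue first
    picks = np.r_[0:n_pairs, -n_pairs:0]              # both ends of the spectrum
    return eigvecs[:, picks].T


def csp_features(epochs, filters):
    """Log of normalised variance per filter, shape (n_epochs, n_filters)."""
    projected = np.einsum("fc,ecs->efs", filters, epochs)
    var = projected.var(axis=-1)
    return np.log(var / var.sum(axis=1, keepdims=True))
```

Because the filters are fitted with class labels, they must be estimated on training trials only and then applied unchanged to test trials.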
Deep learning approaches — EEGNet, ShallowConvNet, DeepConvNet — have become competitive and sometimes outperform CSP+LDA, particularly on large datasets. EEGNet (Lawhern et al., 2018) is compact and generalises well across tasks, which is important when within-subject training data is limited. On small datasets (the typical BCI scenario), simple features and linear classifiers often remain competitive or superior to deep networks, which overfit more easily.
The subject-specific variability issue is the most important practical constraint. EEG signals differ substantially across individuals in spectral properties, spatial topography, and signal-to-noise ratio. Models trained across subjects perform substantially worse than subject-specific models. For deployed applications, per-session calibration is almost always necessary.