\chapter{Introduction \label{Chapter-Intro}}
A recent study estimated that over 900 million adults worldwide are affected by sleep-disordered breathing (SDB), a common group of respiratory sleep disorders \cite{benjafield2019estimation}. The most common form of SDB is obstructive sleep apnea (OSA), whose clinical manifestations include sleepiness, fatigue, cardiovascular disease, and hypertension. More generally, SDB is linked to a higher incidence of diabetes and stroke, and to increased morbidity \cite{dempsey2010dempsey,patil2007adult,young2002epidemiology}.
The gold standard for assessing SDB is polysomnography (PSG), which captures physical and biological signals such as cardiac (electrocardiogram, ECG) and neurological (electroencephalogram, EEG; electrooculogram, EOG; electromyogram, EMG) activity, airflow, peripheral oxygen saturation (SpO2), thoracic and abdominal respiratory effort, sleeping position, and blood volume changes (photoplethysmography, PPG).
The diagnosis of the disorder relies on detecting repeated respiratory events in which airflow is either reduced (hypopnea) or entirely paused (apnea) during sleep \cite{dempsey2010dempsey,gould2012sleep}.
The current guidelines for scoring respiratory events \cite{troester2023aasm} recommend scoring an apnea when the airflow amplitude decreases by at least 90\% with respect to baseline, and a hypopnea when it decreases by between 30\% and 90\% and the decrease is associated with a cortical arousal (measured with EEG) or with a decrease of $\geq 3\%$ in SpO2 compared to the pre-event baseline.
These events can further be categorized as obstructive or central, depending on whether the event is caused by a physical blockage of the upper airway or by the brain failing to signal breathing, which results in absent breathing effort. An event that shows features of both is classified as mixed.
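Stated compactly, and using notation introduced here only for illustration, let $r$ denote the relative decrease in airflow amplitude with respect to baseline, $\Delta S$ the decrease in SpO2 relative to the pre-event baseline, and $a \in \{0, 1\}$ indicate whether the event is associated with a cortical arousal. The criteria named above can then be sketched as
\begin{equation}
\mbox{event} = \left\{ \begin{array}{ll}
\mbox{apnea} & \mbox{if } r \geq 90\%, \\
\mbox{hypopnea} & \mbox{if } 30\% \leq r < 90\% \mbox{ and } (a = 1 \mbox{ or } \Delta S \geq 3\%).
\end{array} \right.
\end{equation}
This sketch covers only the amplitude, arousal, and desaturation criteria mentioned here and is not a full restatement of the scoring manual.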
Besides respiratory events, PSG is used to score sleep stages, distinguishing between periods of wakefulness (or wake), rapid eye movement (REM) sleep, and non-REM sleep, which is further divided into N1, N2, and N3. Adding up the time spent in each non-wake stage gives the total sleep time (TST, measured in hours). Dividing the number of apneas and hypopneas by the TST gives the apnea-hypopnea index (AHI), which indicates the severity of SDB and which, combined with the clinical presentation, is used for diagnosis.
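Written as a formula, with $N_{\mathrm{A}}$ and $N_{\mathrm{H}}$ denoting the number of scored apneas and hypopneas (symbols introduced here for illustration only), this reads
\begin{equation}
\mathrm{AHI} = \frac{N_{\mathrm{A}} + N_{\mathrm{H}}}{\mathrm{TST}},
\end{equation}
expressed in events per hour. As a purely hypothetical example, a recording with 30 apneas and 60 hypopneas over a TST of 6 hours would yield an AHI of $90 / 6 = 15$ events per hour.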
Although PSG is the gold standard for assessing sleep and diagnosing SDB, it has several downsides. Firstly, due to the large number of sensors and the specialized equipment, setting up and analyzing a full PSG is costly, requires human experts, and might impair sleep quality, which limits its use to one or two nights. Secondly, a single night might have limited diagnostic value \cite{toussaint1995first} and hide within-subject variability in the condition, which can only be revealed by monitoring multiple nights.
Polygraphic setups reduce the sensors to the subset of the full PSG required for adequately scoring respiratory events, recording only airflow, SpO2, and respiratory effort. These so-called home sleep apnea tests (HSATs) are increasingly popular due to their reduced complexity and cost, but they remain relatively uncomfortable and are effectively limited to a few nights of recording.
All these factors contribute to the estimated 93\% of women and 82\% of men with at least moderate OSA who remain undiagnosed \cite{young1997estimation}.
In 2000, PhysioNet kick-started interest in the surrogate assessment of sleep apnea with simpler sensors by holding a competition on its Apnea-ECG dataset, which consisted only of labeled ECG recordings split into one-minute epochs. Although the submitted models reached high performance, later studies showed that they generalized poorly, suggesting that the dataset does not fully cover the broad spectrum of apneic events and may not be representative of real-world, clinically meaningful cohorts \cite{papini2018generalizability}. The following decades therefore saw a wide range of studies that focused on reliability and generalizability, drawing on a multitude of clinical datasets covering different sleep disorders and on a variety of architectures for apnea detection.
For instance, Olsen et al. \cite{olsen2020robust} used the Sleep Heart Health Study (SHHS) \cite{quan1997sleep} and the Multi-Ethnic Study of Atherosclerosis (MESA) \cite{chen2015racial} datasets to develop a neural network with bidirectional GRUs that used ECG inputs to achieve a sensitivity (Se) of 68.7\%, a precision (Pr) of 69.1\%, and an F1-score of 66.6\% on their self-defined event-level metric, and an AHI correlation of $R^2 = 0.829$. Xie et al. \cite{xie2023use} later validated Olsen's model on the Sleep and Obstructive Sleep Apnoea Monitoring with Non-Invasive Applications (SOMNIA) \cite{van2019protocol} dataset, achieving an F1-score of 70.8\%. Both Olsen's and Xie's studies relied on the ground-truth sleep stages scored from PSG to calculate the TST required to obtain the AHI. In a follow-up study, Xie et al. \cite{xie2024multi} developed a multi-task model that, in addition to SDB events, also predicted sleep and wake phases based on ECG and respiratory effort (RE) only, achieving an F1-score of 63.1\%. Xie's two studies highlighted the performance decrease when using surrogate signals to calculate sleep stages compared to using the full PSG. Also using ECG and RE, Fonseca et al. \cite{fonseca2024estimating} achieved an intraclass correlation coefficient of 0.91 across different datasets.
Notably, these studies do not rely on the airflow signal typically acquired with PSG and HSATs, on which the scoring of apnea and hypopnea events is based. Unsurprisingly, using airflow as input increases performance considerably. Li et al. \cite{li2023deep} achieved an F1-score of 85.7\% on classifying one-minute segments of airflow and ECG. Later, Yook et al. \cite{yook2024deep} used airflow and SpO2 together to achieve an F1-score of 93\% on classifying 10-second segments converted into scalograms.
The downside of this approach is that the sensors used to obtain airflow, i.e., nasal cannulas (thin tubes placed under the nostrils) or thermistors (temperature sensors placed on the upper lip), are uncomfortable during sleep and hard to set up properly.
One of the simpler signals to set up and record during sleep is PPG, which can be obtained with a pulse oximeter that illuminates the skin to measure changes in light absorption.
These devices come in a range of forms: wrist-worn, as in most modern smartwatches, or finger-worn, as used in PSG and HSATs, where the sensor is typically mounted on the index finger and can also calculate SpO2.
Lazazzera et al. \cite{lazazzera2020detection} used PPG and SpO2 signals, achieving a sensitivity of 76.9\% and a specificity of 73.2\%, although their dataset consisted of only 96 patients without any co-morbidity. With the same input signals, Wu et al. \cite{wu2024transformer} trained a transformer-based model on a dataset containing patients with co-morbidities and validated its performance on PPG and SpO2 signals measured by a smart ring, resulting in an F1-score of 64.9\%.
In this work, we present an event-level apnea detection model that relies solely on signals obtained with easy-to-use sensors, namely PPG and SpO2. We evaluate its performance using the combination of both signals and using PPG only, the latter being applicable to devices that cannot accurately measure SpO2, such as wrist-worn wearables. Finally, we evaluate the impact of using sleep stages scored from the gold-standard PSG versus surrogate sleep stages predicted from PPG only.