The Variability of Outcomes Used in Efficacy and Effectiveness Trials of Alcohol Brief Interventions: A Systematic Review

Objective: To characterize recent alcohol brief intervention (ABI) efficacy and effectiveness trials; summarize outcomes; and show how variability in outcomes and reporting compromises the evidence base. Method: A systematic review and narrative synthesis of articles from 10 databases were undertaken (January 2000 to November 2017); study selection represented recent, readily available publications. Alcohol brief intervention definitions were informed by National Institute for Health and Clinical Excellence (NICE) Public Health Guideline 24: Alcohol use disorders: prevention. The review was conducted using Centre for Reviews and Dissemination (CRD) guidance and pre-registered on PROSPERO (CRD42016047185). Seven a priori specified domains were used to classify outcomes: biomarkers, alcohol related outcomes, economic factors/resource use, health measures, life impact, intervention factors, and psychological/behavioral factors. Results: The search identified 405 trials from 401 eligible papers. In 405 trials, 2,641 separate outcomes were measured in approximately 1,560 different ways. The most common outcomes used were number of drinks consumed in a week and frequency of heavy episodic drinking. Biomarkers were least frequently used. The most common primary outcome was weekly drinks. By trial type, the most frequent outcome in efficacy and effectiveness trials was frequency of heavy drinking. Conclusions: Consumption outcomes predominated; however, no single outcome was found in all trials. This comprehensive outcome map for ABI effectiveness and efficacy trials can aid decision making in future trials. There was diversity of instruments, time points, and outcome descriptions in methods and results sections. Compliance with reporting guidance would support data synthesis and improve trial quality.
This review establishes the need for a core outcome set/minimum data standard (COS) and supports the Outcome Reporting in Brief Interventions: Alcohol initiative (ORBITAL) to improve standards in the ABI field through a COS for effectiveness and efficacy randomized trials.


Introduction
Alcohol brief interventions (ABIs) are key strategies to address problematic alcohol use worldwide (Coffield et al., 2001; National Institute for Health and Clinical Excellence, 2010; US Preventive Services Task Force, 2004; World Health Organisation [WHO], 2016).
An avoidable problem is the diversity in definition and measurement of outcomes used. This reduces the ability to meaningfully synthesize available information. For example, in a recent and comprehensive review (Kaner et al., 2018), authors excluded 22 of 69 otherwise eligible studies due to outcome reporting issues. Differing outcomes across studies weaken meta-analyses of the efficacy and effectiveness of ABIs and contribute to research waste, as not all articles can be used for the evidence base (Glasziou, 2014). Given the number of reviews mentioning outcome heterogeneity across all populations in which ABIs are now employed, it is no longer appropriate to dismiss this heterogeneity as a limitation when it can and should be addressed.
To address outcome heterogeneity in ABI trials, future ABI studies should use a coherent, consistent set of outcomes, known as a core outcome set (COS). A COS is a feature of a mature research base, and many healthcare areas have developed, or are developing, COS to support advances in their field (COMET Initiative, 2017). A COS reduces selective and inconsistent reporting in research trials, improves the quality of treatment guidance for a condition, and increases the number of studies synthesized in systematic reviews. Both the Consolidated Standards of Reporting Trials (CONSORT) (Moher et al., 2010) and the Standard Protocol Items: Recommendations for Interventional Trials (SPIRIT) statements recommend COS use, and a formal process for COS development has been established by the Core Outcome Measures in Effectiveness Trials (COMET) Initiative (Williamson et al., 2017; Williamson et al., 2012). A COS is a minimum reporting standard, and does not restrict the measurement of additional outcomes. A comprehensive map of outcomes can support decision making on other outcomes to be measured alongside the COS, reducing a potential source of conflict in trial planning (Daykin et al., 2016; Daykin et al., 2017).
Recognizing the benefits an ABI COS could provide, the International Network on  . Although numerous systematic reviews on ABI have been conducted, most have aimed to establish efficacy, effectiveness, and/or cost-effectiveness, and their included studies meet a restrictive set of eligibility criteria, including their pre-specified outcome of interest. No study to date has compiled all outcomes used across ABI studies. This paper fills this gap through a definitive catalogue of outcomes used in recent ABI trial literature. Such a catalogue is needed to a) map outcomes used to demonstrate efficacy and effectiveness in peer-reviewed, published ABI trials, b) demonstrate the variability in outcome type and measurement, c) highlight methodological issues in the ABI field around outcomes and reporting, d) inform COS development, including identifying outcomes for a Delphi prioritization exercise (see Shorter et al., In Press), and e) support ABI trial protocol decision making on outcomes by trial area.

Methods
A review protocol was registered in advance on PROSPERO (CRD42016047185) (Shorter, et al., 2016a). Eligible studies were individual or cluster randomized trials, published in peer-reviewed journals, that focused on the efficacy or effectiveness of ABIs designed to reduce alcohol consumption. Trials that did not analyze outcomes by randomized arm were excluded (e.g. subsample analysis only). Papers with the same trial registration number were included if each assessed different outcomes. Specific search parameters are described below.
Population: Current drinkers (at least one drink in the past year) who were aged 16 years or above. Trials of drinkers aged 15 years or below were excluded, as were trials including individuals seeking treatment for alcohol problems, following related UK NICE guidance (National Institute for Health and Clinical Excellence, 2010).
Intervention: ABIs were defined as those suitable for drinkers not seeking treatment for an alcohol problem but who are identified by screening as having, or being at risk of, problems from their alcohol use (National Institute for Health and Clinical Excellence, 2010).
This definition covers brief advice and extended brief interventions, delivered once or more frequently. An ABI should assess an individual's alcohol use and provide feedback on their alcohol assessment. Trials including a multicomponent intervention arm or where one or more intervention components addressed non-alcohol related health behaviors (e.g., smoking cessation) were included if alcohol intervention components and outcomes could be clearly distinguished.
Comparator: Comparators could be any active or control intervention.
Outcomes: All outcomes analyzed by randomized arm were extracted, including detail of how the outcome was defined and measured where possible. This was used to estimate the variability in outcome measurement, i.e. to what extent an outcome in one paper was exactly the same as in another paper (what the outcome represents, how it was measured and scored, and time period referred to). Other extracted information included: number and nature of sample randomized (sex, age, and population), trial details, including region, number of trial arms, trial arm composition, trial type (efficacy/effectiveness/not reported), and details of follow up timing. These were summarized either as number (%) or mean (SD) of trials included with indication of missing data in the total number. Broad indicators of trial reporting quality were included: stating 'trial' in the title or including a participant flow chart in line with early CONSORT guidance (Begg et al., 1996). Where study information was not provided this was stated. Effectiveness and efficacy ABI reviews often contact authors for missing data; we did not do so because our aim was to highlight where reporting falls short, in order to improve standards in the field, as in other high quality methodological reviews (Harman et al., 2017; Riddle et al., 2008; Thornley & Adams, 1998).
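The matching rule described above (two outcomes count as the same only if they agree on what is measured, how it is measured and scored, and the time period) can be illustrated with a minimal sketch. The record structure, field names, and example values below are hypothetical and are not taken from the review's extraction forms; the sketch only shows how a count of "different ways" could be derived from extracted records under that assumption.

```python
from collections import defaultdict
from typing import NamedTuple


class OutcomeRecord(NamedTuple):
    """One extracted outcome. All field names and values are hypothetical."""
    trial_id: str
    outcome: str      # what the outcome represents
    instrument: str   # question or questionnaire used
    scoring: str      # how the measure was scored
    time_frame: str   # time period referred to


def distinct_measurements(records):
    """Map each outcome name to the set of distinct ways it was measured."""
    ways = defaultdict(set)
    for r in records:
        ways[r.outcome].add((r.instrument, r.scoring, r.time_frame))
    return dict(ways)


# Illustrative records only; these are not data from the included trials.
records = [
    OutcomeRecord("T1", "weekly drinks", "instrument A", "mean drinks", "past 28 days"),
    OutcomeRecord("T2", "weekly drinks", "instrument A", "mean drinks", "past 28 days"),
    OutcomeRecord("T3", "weekly drinks", "instrument B", "mean drinks", "typical week"),
    OutcomeRecord("T3", "frequency of heavy drinking", "instrument B", "days", "past month"),
]

ways = distinct_measurements(records)
total_outcomes = len(records)                        # outcomes measured
total_ways = sum(len(v) for v in ways.values())      # distinct ways of measuring them
```

In this toy example four outcomes are measured in three different ways, because trials T1 and T2 share an identical definition of weekly drinks while T3 does not.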
A taxonomy was created to map outcomes under seven domains: alcohol related outcomes; biomarkers; health measures; economic factors/social impacts; psychological/behavioral factors; life impact; and intervention factors. This was influenced by a range of sources. The first draft was informed by a presentation at the COMET V meeting in Amsterdam (attended by GWS in September 2016), since published in Dodd et al. (2018). However, given the ABI topic area is not directly concerned with physical pathology, many clinical factors in this taxonomy were irrelevant (e.g. musculoskeletal outcomes), whereas other outcomes were not specific enough (e.g. emotional functioning/wellbeing).

Other sources included the Outcome Measures in Rheumatology [OMERACT] filter (Boers et al., 2014); this was helpful to derive core areas such as death, life impact, resource use, or pathophysiological manifestations, but was too broad to capture outcomes relevant to ABIs. The Patient-Reported Outcomes Measurement Information System [PROMIS] (Cella et al., 2007) provided elaboration to describe some outcomes in ABI trials (anxiety, depression, or sleep disturbance) but there were classification limitations, and some outcomes (e.g. PROMIS alcohol use questionnaire) were absent from ABI papers. We drew upon health economic reviews to inform the economic outcomes domain (Barbosa et al., 2015; Barbosa et al., 2010; Bray et al., 2011). Outcome data extracted from ABI trials were used to refine the taxonomy further. GWS created the taxonomy, which was then refined by others (NH, DNB, JWB, AHB, CB, and ELG).
Search results were downloaded to EndNote version X7 and de-duplicated. GWS screened all titles and abstracts of papers and excluded those that did not meet the inclusion criteria. DNB checked 28% of these for accuracy; discrepancies were resolved by discussion.
All full text versions of potentially eligible papers were reviewed by GWS, and all were double-screened by one of ELG, DNB, JWB, AJOD, AHB, and AH; discrepancies were resolved by discussion. Extraction forms were piloted by GWS and KJS. All data were extracted by GWS, and all extracted data were cross-checked for accuracy by at least one of ELG, DNB, SJS, KJS, JWB, AJOD and AHB. Data were presented from all trials, and split by population

Results
Searches identified 33,134 papers after de-duplication to be screened by title and abstract for eligibility. Exclusion at title and abstract stage reflected unambiguous violation of the above PICO (population, intervention, comparator, outcomes) on the basis of topic area (i.e. not alcohol) or a known alcohol treatment sample (such as Project MATCH). Any unclear matches were referred to full text assessment for closer inspection; 1,612 papers were retrieved for full text evaluation against PICO criteria, and 401 were deemed eligible (Figure 1). The 401 included papers covered 405 individual trials (some papers reported two trials), representing 182,272 randomized participants in total (see Supplementary Material C for included papers).

<<<Figure 1>>>
The mean trial size was 450 individuals (range=12-7,935). Typically, higher numbers were randomized in 'primary care', 'emergency department', and 'general population' samples compared to the remaining populations (Table 1). There were slightly more males than females on average (mean % male=56.2; SD=28.1); highest in the 'other' population. Most trials were conducted in North America (60.7%); this was particularly evident in the 'University/College' population with 81.1% of trials from this region. Two-arm trial designs predominated, with trials in the 'emergency department' and 'University/College' populations more likely to have more than two arms. Around 83% of trials had a non-ABI control group. 'Other' populations were most likely to have a non-ABI control (91.2%) and 'General population' were least likely (75.0%). More trials were declared by their authors as efficacy trials (52.8%) compared to effectiveness trials (42.0%). Twenty-one trials did not state their type. Only 'University/College' populations had more effectiveness trials (56.1%) than efficacy trials (40.2%).

<<<Table 1>>>
Just over half the trials indicated they were a trial in the paper's title (52.1%); 63.7% included a flow chart of participants through the trial. 'University/College' populations were least likely to report these elements, with 'general population' trials more likely to state they were a trial (67.2%), and 'other healthcare' populations more likely to include a flow chart (78.2%). Broadly similar percentages had two or three data collection waves (42% and 38% respectively). Longer-term follow-up of two or more years was more likely in 'primary care' (n=7; 14%). Short-term follow-up was more likely in 'University/College' samples. Overall, trialists most often selected three-month intervals for follow-up (e.g. three or six months).
Over time, there was a general increase in the number of ABI trials published per year. The largest number were published in 2014. The number of trials per year is given in Figure 2.

<<<Figure 2>>>

Outcomes
Overall, 2,641 outcomes were extracted from 405 trials. Only 285 trials stated whether their outcomes were primary or secondary. The mean number of outcomes per trial was 6.5 (range 1-56); highest in 'primary care', and lowest in 'University/College' samples.
On average, there were two primary, and four secondary outcomes reported in the included trials. Most trials had at least one alcohol related outcome measure. The highest percentage of trials with at least one health outcome was in the 'primary care' or 'other healthcare' population, least likely in the 'University/College' population. Economic factors or social impacts were most likely in the 'primary care' or 'emergency department' population.
Psychological factors were found in around 28% of trials, most commonly in 'other healthcare' populations. Life impact outcomes were present in 56 trials. Fewer than 10% of trials looked at intervention factors, and 'University/College' samples were more likely to have one outcome of this type. Biomarkers were infrequently used: only 13 trials measured at least one biomarker; this was more likely in 'primary care' or 'other' populations.
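As a quick consistency check, the headline averages can be reproduced from the totals reported in the text (405 trials, 182,272 randomized participants, 2,641 outcomes). The variable names below are illustrative; only the three totals come from the review.

```python
# Totals as reported in the Results text.
trials = 405
participants = 182_272
outcomes = 2_641

# Mean number of participants randomized per trial (reported as 450).
mean_trial_size = participants / trials

# Mean number of outcomes per trial (reported as 6.5).
mean_outcomes_per_trial = outcomes / trials

print(round(mean_trial_size), round(mean_outcomes_per_trial, 1))
```

Both rounded values agree with the figures given in the text.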

Alcohol related outcomes
Alcohol related outcomes include those connected to the amount or pattern of alcohol consumption, those related to the comorbid use of other substances and those reflecting substance use disorder symptomology. As such, the domain is broader than just alcohol consumption measures, but we have retained the term "alcohol related outcomes" for ease of exposition and to maintain consistency with our protocol and Delphi study (Shorter et al., In Press). In the 405 trials, there were 1,456 alcohol related outcomes measured in 744 different ways (Table 2). The most commonly reported alcohol related outcome variables were frequency of heavy drinking (n=213), weekly drinks (n=205), alcohol related problems or consequences (n=190), typical quantity (n=137), typical frequency (n=117), and hazardous or harmful drinking (n=111). Many of the infrequently-measured outcomes were also the most diversely measured. An exception included 'at risk drinking' (which measures risk derived from publicly-available recommendations such as weekly or single episode limits).
By population, 'primary care' trials were most likely to report weekly drinks, frequency of heavy drinking, and at-risk drinking. This was somewhat similar to the 'general population', 'emergency department' and 'University/College' populations, which often measured weekly drinks, frequency of heavy drinking, and alcohol-related problems or consequences. The majority of trials that measured blood alcohol concentration were in the 'University/College' population. 'Other healthcare' populations often measured typical and heavy drinking frequencies, and hazardous and harmful drinking. Frequency of heavy drinking was the most commonly reported outcome in both efficacy and effectiveness trials, with the number of drinks consumed in a week the most frequent primary outcome.

<<<Table 2>>>

Other outcomes
In total, 32 biomarker outcomes were reported across the 405 trials (Table 3). Of these, the most commonly reported was gamma-glutamyltransferase (GGT). Biomarkers were only found in 'primary care', 'other healthcare', and 'other' populations. The most frequent biomarker in efficacy trials was GGT; in effectiveness trials it was carbohydrate-deficient transferrin (CDT). GGT and CDT tied as the most common primary outcome.
In the economic factors/social impacts domain, the most commonly reported outcomes were driving related offences and hospitalizations. This domain includes some overlap with measures of alcohol related consequences in the alcohol related outcome domain, but measures in this domain are intended to assess social costs and impacts, not to assess the possibility of a diagnosable alcohol disorder. In 'primary care', the most commonly reported economic factors/social impacts outcomes were driving-related offences, hospitalizations, other criminal justice use, or other healthcare use. For ABI trials set in the 'emergency department', the most common were seeking alcohol treatment, driving-related offences, and emergency healthcare use. In 'other healthcare' populations, the most commonly-assessed economic variable was that of provider intervention costs. In 'other' populations, given the composition of this group, other criminal justice use was most common. Economic factors/social impacts measures were not commonly reported by 'general population' or 'University/College' ABI trials. The intervention cost to the provider was the most common economic factors/social impacts measure for efficacy trials. Driving related offences was the most common measure in effectiveness trials, and the most reported primary outcome.
The most commonly reported health outcomes were alcohol-exposed pregnancy factors, psychological health measures, sexual violence or coercion, and severity of depression symptoms. In 'primary care', cardiac factors, psychological health, and physical health were most commonly reported. In 'general population' samples, alcohol-exposed pregnancy factors or severity of depression were more commonly reported. Sleep disruption was only measured in 'University/College' ABI trials. 'Other healthcare' populations most commonly reported alcohol-exposed pregnancy factors. The most frequent efficacy outcome in this domain was psychological health; the most common outcome in effectiveness trials was alcohol-exposed pregnancy factors. The most commonly reported outcome from the intervention factors domain was intervention satisfaction; this was true for both effectiveness and efficacy trials. ABI trials in 'University/College' and 'general population' samples were more likely to ask participants about this outcome.
In the domain of psychological and behavioral factors, the most commonly reported outcomes across all trials were drinking refusal self-efficacy, alcohol outcome expectancies, risky behaviors, and readiness to change. ABI trials in the 'primary care' and 'emergency department' populations were least likely to measure these outcomes. By contrast, 'University/College' samples were particularly likely to measure the perception of others' drinking, for example, the typical quantity drunk by a student at their institution. For 'general population' samples, drinking refusal self-efficacy and readiness to change were most common. In 'other healthcare' populations, risky behaviors were the most commonly reported; these include aspects such as sex without effective contraception. Finally, in 'other' populations, anger and aggression, drinking refusal self-efficacy, other psychological factors, and readiness to change were the most commonly reported outcomes. The most frequent measure in efficacy trials was readiness to change; for effectiveness trials it was perception of others' drinking. The most common primary outcome for both was engagement in risky behaviors. Life impact measures were most commonly role functioning or relationship factors or quality of life. The former was most common in effectiveness trials (and as a primary outcome), the latter the most common for efficacy trials.

Discussion
This review is the first to go beyond stating outcome heterogeneity as a weakness in ABI systematic reviews; it quantifies the heterogeneity and inconsistency in outcomes reported in effectiveness and efficacy trials of ABIs. Overall, there were 2,641 outcomes measured in approximately 1,560 different ways, truly a "Tower of Babel". The estimated 1,560 different ways outcomes were measured may be a conservative estimate of the true variability, given the lack of precision in reporting how outcomes were measured. The variation in the outcomes used and reported across ABI trials mirrors that found by similar reviews in other research areas (Harman et al., 2017). For the ABI field, the substantial heterogeneity represents an important challenge. Meta-analyses will continue to be compromised as they cannot draw on all evidence to decide whether ABIs work as intended. Just over half of trials (53%) measured the most common consumption outcome, frequency of heavy drinking; this creates a considerable conflict between the drive for inclusion of all studies meeting criteria in high quality systematic reviews, and the ability to include all studies in the meta-analysis.
Determining efficacy or effectiveness depends on outcomes measured, and therefore all ABI trial papers should contain sufficient detail on outcome measurement. One way this may affect meta-analyses is through pooling an outcome (e.g. weekly drinks) whose single label hides considerable variability. For example, "weekly drinks" may refer to an average week, a typical week, or the last week. It may refer to a typical week in the past month, 28 days, 90 days, six months, or since last measurement. The definition of a drink may be specified or left to the respondent. Weekly drinks may be reported directly or calculated based on other information in a range of different questionnaires. Some of these differences can be converted to equivalent quantities, but others reflect genuine differences in what is measured, and combining them compromises the validity of estimates. At a minimum, trials should report a) what the outcome is, b) the question or questionnaire used to measure it and how this is used (e.g. scale score, or the binary above and below a cut-off point), c) measure of aggregation such as mean value or mean individual difference, and d) time point (e.g. 1, 4, and 8 weeks post intervention).
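The minimum reporting items a) to d) above could be captured in a simple structured record at the data extraction or protocol stage. The following is a minimal sketch only: the class name, field names, and example values are hypothetical and do not represent an agreed standard or the ORBITAL core outcome set.

```python
from dataclasses import dataclass, fields
from typing import Optional, Tuple


@dataclass(frozen=True)
class OutcomeReport:
    """Hypothetical record of the minimum items a trial should report per outcome."""
    outcome: Optional[str] = None                   # a) what the outcome is
    instrument: Optional[str] = None                # b) question or questionnaire used
    scoring: Optional[str] = None                   # b) how it is used (scale score, cut-off)
    aggregation: Optional[str] = None               # c) e.g. mean value, mean individual difference
    time_points: Optional[Tuple[str, ...]] = None   # d) e.g. 1, 4, and 8 weeks post intervention


def missing_items(report: OutcomeReport) -> list:
    """Return the names of reporting items that are absent or empty."""
    return [f.name for f in fields(report) if not getattr(report, f.name)]


# Illustrative example: an outcome described without its instrument or scoring.
partial = OutcomeReport(
    outcome="weekly drinks",
    aggregation="mean value",
    time_points=("1 week", "4 weeks", "8 weeks"),
)
gaps = missing_items(partial)
```

Here `gaps` lists `instrument` and `scoring`, flagging that the description of "weekly drinks" is not yet detailed enough to judge whether it matches the same outcome in another trial.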
Some trials did not specify whether their outcomes were primary or secondary outcomes.
This could be because the trial was a pilot study and specification may not be required (Eldridge et al., 2016), or it might be stated in a trial registry. However, excluding this from reporting is problematic (Begg et al., 1996; Moher et al., 2010). In addition, although one might expect trials to have only one primary outcome, we found that, among those that specified, the average was two primary outcomes. This was an under-estimate of the average because some papers only reported secondary outcomes; their primary outcome(s) were in other papers with the same trial registration number. The correct interpretation of secondary outcomes is 'through' the primary analysis on the premise that, if the primary outcome is positive, then secondary outcomes can help to understand how the ABI worked. The secondary designation may also be useful for outcomes more distal on the causal pathway that reduced drinking would be expected to change. If the primary outcome is neutral, the secondary outcomes are hypothesis-generating. If the primary outcome is 'negative', the secondary outcomes provide insight into how the treatment caused harm (Freemantle, 2001). If change is shown in some primary outcomes but not others, interpretation can become difficult and it may be a challenge to state the ABI brought about change. To improve the aggregation of trials into the evidence base, outcomes (from a COS or otherwise) should be detailed, identified as primary or secondary with a clear statistical analysis plan, well reported in results sections which include point estimates and variability around estimates, and follow reporting guidance.
Alcohol related outcomes, particularly consumption outcomes, were the predominant outcomes measured in ABI trials. Although some have called for an increase in biomarkers in ABI trials (Kypri, 2007), this call has not been heeded; most outcomes were self-reported.
ABI effectiveness or efficacy meta-analyses rely on the outcomes reported without validating them against objective measures (Moyer et al., 2002), exacerbating the problem of outcome heterogeneity in ABI trials. Our review provides the first systematic and quantifiable evidence to support previous calls for standard definitions of ABI outcomes to compare across studies (Bernstein et al., 2010).
Despite efforts to identify literature from across the globe, most trials were from North American or European countries. This may reflect the predominance of publishing or funding opportunities available to those researchers, the high levels of hazardous and harmful use of alcohol in these countries (Rehm et al., 2009), or be a consequence of the pre-specified databases searched. We attempted to minimize English language bias and improve the quality of the review by including studies reported in languages other than English (Moher et al., 2003). The searching was largely conducted in English, and our ability to extract data from articles in languages other than English was limited, as shown in the CONSORT flowchart (Figure 1). Although focusing on peer-reviewed literature may have also limited the number of non-English articles included, it is in keeping with our intention to focus on those articles that are likely to be most accessible and influential for many decision makers. Our searches of the grey literature, which constitute a separate part of our PROSPERO-registered systematic review not reported here, will be one opportunity to explore how improving access to a wider range of literature from low-resource settings, or from reports in languages other than English, may influence the evidence base. Had this limitation been addressed, the findings would likely have shown additional heterogeneity, as the number of eligible trials would have increased.
There was also a predominance of efficacy trials in the included studies, and attention should turn towards effectiveness trials within the different populations. Efficacious interventions may not be effective in routine practice (McCambridge & Saitz, 2017). Some trials did not specify their trial approach as either efficacy or effectiveness, although this may be a consequence of the challenges of specification across the efficacy to effectiveness continuum (Heather, 2014). Short-term follow up was common, as reported by other systematic reviews (Moyer et al., 2002). This is perhaps expected given effect sizes tend to be larger at early follow up, and there are concerns about the longitudinal effects of ABI (Donoghue et al., 2014). The predominant follow-up interval was around three months between data collection points. With around 20% of studies having four or more follow-up points, there is a balance to be struck between minimizing loss to follow-up, timely collection of only important information, and respondent burden (Lin et al., 2012).
By synthesizing outcome selection, this review offers the opportunity to consider outcome choice and the implications for the ABI field; other healthcare areas have noted the importance of design in attrition (Kilburn et al., 2014). Others have considered respondent burden (Cunningham et al., 1999; Kypri, 2007); as the number of outcomes reported was 56 in one trial, this may need careful consideration. Decision making around which outcomes to use for particular trials can be assisted by this outcome map, broken down by research area, effectiveness/efficacy, and primary/secondary/other outcomes. The structure of this outcome map was the product of discussion between co-authors, and we recognize that other structures of categorization may also exist.