## SEM217: David H. Bailey, UC Davis: Why financial research is prone to false statistical discoveries

Tuesday, January 25th @ 11:00-12:30 PM (ONLINE)

Abstract: It is a sad fact that few investment funds, models or strategies actually beat the overall market averages over, say, a 10-year window. Even in academic research work, care must be taken to avoid statistical pitfalls, because: (a) the chance of finding a truly profitable investment design or strategy is very low, due to intense competition; (b) true findings are mostly short-lived, as a result of the non-stationary nature of most financial systems; and (c) it is often difficult to debunk a false claim. Backtest overfitting is a particularly acute problem in finance, both in academic research and commercial development, since it is a simple matter to use a computer program to search thousands, millions or even billions of parameter or weighting variations to find an “optimal” setting. In this talk, we summarize many of these pitfalls, explore why they are so prevalent, and present some tools that can be used to avoid them, including the “False strategy theorem”.
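
The selection effect behind backtest overfitting can be illustrated with a small simulation. The sketch below is not taken from the talk and does not implement the False strategy theorem; it only assumes that each candidate strategy produces independent Gaussian daily returns with zero true mean, and the function name `best_in_sample_sharpe` and its parameters are hypothetical. It reports the best annualized Sharpe ratio found when more and more skill-free strategies are searched.

```python
import random
import statistics


def best_in_sample_sharpe(n_strategies, n_days=252, seed=0):
    """Return the highest annualized Sharpe ratio among pure-noise strategies."""
    rng = random.Random(seed)
    best = float("-inf")
    for _ in range(n_strategies):
        # Assumption: daily returns are i.i.d. Gaussian with zero true mean,
        # so no candidate has any real skill.
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        sharpe = statistics.mean(returns) / statistics.stdev(returns)
        best = max(best, sharpe * 252 ** 0.5)
    return best


# The best backtest looks better as more variations are tried,
# even though every true Sharpe ratio is zero.
for n in (1, 10, 100, 1000):
    print(n, round(best_in_sample_sharpe(n), 2))
```

Because the same seed is reused, each larger search contains the smaller one, so the reported maximum can only grow with the number of trials.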

## SEM217: Ricardo Fernholz, Claremont McKenna College: The Universality of Zipf's Law

Tuesday, February 1st @ 11:00-12:30 PM (ONLINE)

Abstract: A set of data with positive values follows a Pareto distribution if the log-log plot of value versus rank is approximately a straight line. A Pareto distribution satisfies Zipf's law if the log-log plot has a slope of −1. Since many types of ranked data follow Zipf's law, it is considered a form of universality. We show that time-dependent systems with growth and variance parameters that are constant across ranks will follow Zipf's law if and only if two natural conditions, conservation and completeness, are satisfied. We also show that conservative and complete systems that have constant growth parameters but variance parameters that increase with rank are quasi-Zipfian, with a log-log plot that is concave and has a tangent line of slope −1 at some point. Our results explain the universality of Zipf's law for data generated by time-dependent rank-based systems, but ranked data generated by other means frequently follow non-Zipfian Pareto distributions. Our analysis explains why, for example, Zipf's law holds for word frequency, firm size, household wealth, and city size, while it does not hold for earthquake magnitude, cumulative book sales, the intensity of solar flares, and the intensity of wars, all of which follow non-Zipfian Pareto distributions.
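
The slope criterion in the abstract can be checked numerically. The following is a minimal sketch, not the speaker's method: it fits an ordinary least-squares line to log(value) against log(rank) for two synthetic, noise-free data sets. The helper name `log_log_slope` and the sample data are illustrative assumptions.

```python
import math


def log_log_slope(values):
    """Least-squares slope of log(value) against log(rank); rank 1 is the largest value."""
    ranked = sorted(values, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(v) for v in ranked]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Synthetic data: value proportional to 1/rank is Zipfian (slope -1);
# value proportional to 1/rank^2 is Pareto but not Zipfian (slope -2).
zipfian = [1000.0 / rank for rank in range(1, 201)]
steeper = [1000.0 / rank ** 2 for rank in range(1, 201)]

print(round(log_log_slope(zipfian), 3))
print(round(log_log_slope(steeper), 3))
```

Real ranked data are noisy, so an estimated slope near −1 is only suggestive; a least-squares fit on log-log axes is a rough diagnostic rather than a formal test.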

## SEM217: Vitali Kalesnik, Research Affiliates: Rebalancing: The Achilles’ Heel of Conventional Indexing

Tuesday, February 8th @ 11:00-12:30 PM (ONLINE)

Abstract: Traditional passive indexes systematically buy companies at high average valuations and sell at low valuations, causing a drag on index performance. Taking the S&P 500 as an example, in the year before the announcement of an index rebalancing and through the date the rebalancing actually occurs, a typical addition outperforms a typical deletion by about 83%. In the year after the rebalancing, this pattern mean-reverts, and the addition loses relative to the deletion by about 23%. Simple rules, such as delaying rebalancing by a year, using stale capitalization weights, or using fundamental weights to select securities for the index, combined with banding to limit turnover, result in a return advantage of about 15-27 bps with about 20% lower turnover. Advantages outside the US are even larger.

## SEM217: Cancelled

Tuesday, February 15th @ 11:00-12:30 PM

## SEM217: Alex Braun, University of St. Gallen: Common Risk Factors in the Cross Section of Catastrophe Bond Returns

Tuesday, March 1st @ 11:00-12:30 PM (ONLINE)

Abstract: Catastrophe bonds are an alternative asset class with high excess returns, for which no factor pricing model has emerged to date. We analyze the cross section of catastrophe bond returns for the complete market between 2001 and 2020. Our empirical results show that, of all known coupon and yield spread determinants, only (seasonal) event risk significantly impacts realized returns. A novel four-factor model based on these insights explains 60% of the historical excess return variation in the cat bond market and more than halves the asset class’ alpha left by the Fama & French (1993) model with TERM and DEF.

## SEM217: Kathleen Houssels, Aspire Ten25: ESG Alpha - Is the Deck Stacked Against It?

Tuesday, March 8th @ 11:00-12:30 PM

Abstract: Billions of dollars continue to flow into ESG strategies as academics and practitioners seek to understand the alpha potential of ESG investments. We evaluate the question of ESG alpha by looking at a popular ESG index fund to see how its alpha potential stacks up against the extra cost investors pay for its ESG features.

## SEM217: Agostino Capponi, Columbia: Robo-Advising: Personalization and Goals-Based Investing

Tuesday, March 15th @ 11:00-12:30 PM (ONLINE)

Abstract: Robo-advising encompasses any form of algorithmic advice offered to clients. We begin by presenting a dynamic optimization framework based on human-machine interactions, where robo-advisors personalize their portfolios to the clients they serve. We characterize the interaction frequency that strikes the optimal balance between frequent interactions to learn clients' risk attitudes and mitigation of behavioral biases in clients' responses. We then discuss goals-based robo-advising which, rather than optimizing portfolio Sharpe ratios, aims at maximizing satisfaction of investors' goals by the specified deadlines. We introduce a stochastic control framework for goals-based investing, and study the tradeoff between funding the current goal and saving to meet higher-priority future goals.

## SEM217: Sanjiv Das, Santa Clara University: Multimodal Machine Learning at Scale: Democratizing AI for Academic Research in Finance

Tuesday, March 29th @ 11:00-12:30 PM (ONLINE)

Abstract: Data analytics is mostly geared towards tabular data (numerical and categorical). Humans, however, form decisions using not only tabular data but also text they read, such as news and reports. Econometrics and machine learning have been used successfully on tabular data and also on text and images, but the combination of text and tabular data is much more powerful. This presentation will examine how combining natural language processing of text with tabular data brings better results on popular financial risk use cases and also brings machine cognition closer to that of humans. This is especially useful in finance, where humans have long used multimodal data to make decisions.

## SEM217: Hubeyb Gurdogan, CDAR: Bias reduction in optimized portfolios through Multiple Anchor Point Shrinkage (MAPS)

Tuesday, April 5th @ 11:00-12:30 PM

Abstract: Estimation error in a covariance matrix distorts optimized portfolios, and the effect is pronounced when the number of securities p exceeds the number of observations n. In the HL regime where p >> n, we show that a material component of the distortion can be attributed to optimization biases that correspond to the constraints used to construct the portfolio. Using Multiple Anchor Point Shrinkage (MAPS) for eigenvectors developed in Gurdogan & Kercheval (2021), we materially eliminate these optimization biases for large p, and zero them out asymptotically, leading to more accurate portfolios. This work extends the correction of the dispersion bias in Goldberg, Papanicolaou & Shkolnik (2022).

## SEM217: Roger Stein, NYU: Making sense of diagnostic performance when information is limited

Tuesday, April 12th @ 11:00-12:30 PM (Hybrid)

Abstract: In machine learning, drug trials and other domains that involve binary outcomes, it is common to measure the power of a predictive model by constructing an ROC curve and calculating the area under this curve. However, in some cases, it may be difficult to understand the AUC under imperfect conditions. We present results that provide bounds on the AUC in a number of such settings.
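
For readers unfamiliar with the quantity being bounded, the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that pairwise definition. It does not reproduce the bounds presented in the talk; the function name `auc` and the sample scores are illustrative assumptions.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical model scores and true binary outcomes (1 = event, 0 = no event).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
print(auc(scores, labels))
```

In this example 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9. The double loop is quadratic in the sample size, which is acceptable for illustration but not for large data sets.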

## SEM217: Jeongyoun Ahn, University of Georgia: Detecting Outliers in HDLSS Data

Tuesday, April 19th @ 11:00-12:30 PM

Abstract: High-throughput data are usually a product of long and complex experiments in laboratories or fields. Because the data are generated through a multi-step process, the concern about possible contamination is naturally more severe for high-dimensional data than for low-dimensional data. We propose a new two-stage procedure for detecting multiple outliers when the dimension of the data is much larger than the available sample size. In the first stage, the data are split into two disjoint sets, one containing non-outliers and the other containing the rest of the data, which are treated as potential outliers. In the second stage, a series of hypothesis tests is conducted to test the abnormality of each candidate outlier. A nonparametric test based on uniform random rotations is adopted for hypothesis testing. The power of the proposed test is studied under a high-dimensional asymptotic framework and its finite-sample exactness is established under mild conditions. Numerical studies based on simulated examples and face recognition data suggest that the proposed approach is superior to the existing methods, especially in terms of false identification of non-outliers.

## SEM217: Jose Blanchet, Stanford: Distributionally Robust Portfolio Selection and Estimation

Tuesday, April 26th @ 11:00-12:30 PM

Abstract: The focus of this talk is on decision-making rules that are designed to be min-max optimal. The decision-maker chooses a policy class (e.g. affine or even non-parametric classes) and chooses a member of the policy class by playing a (min-max) game against an adversary that chooses a probability distribution in a neighborhood of a baseline model. The two players optimize a risk loss in opposite directions. The adversary is able to choose from a non-parametric family (typically within a Wasserstein ball around the baseline model). When this formulation is applied to standard losses (e.g. Markowitz), we recover exact regularization representations which, combined with the existence of a Nash equilibrium, provide a rich interpretation in terms of the adversarial robustness of traditional techniques. In addition, these formulations provide insights into an optimal selection of regularization parameters that avoids the use of cross-validation. This talk is based on several papers.