SEM217: Tingyue Gan, UC Berkeley: Linking 10-K and the GICS - through Experiments of Text Classification and Clustering

Tuesday, April 16th @ 11:00 AM-12:30 PM (1011 Evans Hall)

Linking 10-K and the GICS - through Experiments of Text Classification and Clustering

Tingyue Gan, UC Berkeley

A 10-K is an annual report on a publicly traded company's financial performance, required by the U.S. Securities and Exchange Commission (SEC). 10-Ks are fairly long and tend to be complicated, but they are among the most comprehensive and important documents a public company publishes each year. The Global Industry Classification Standard (GICS) is an industry taxonomy developed in 1999 by MSCI and S&P Dow Jones Indices, designed to classify a company according to its principal business activity. The GICS hierarchy consists of 11 sectors, subdivided into 24 industry groups, 68 industries, and 157 sub-industries. We ask two questions. First, can a classifier be trained to recognize a firm's GICS sector from the textual information in its 10-K? Second, can we extract from the classifier embeddings (low-dimensional vectors) for 10-Ks that respect their GICS sectors, so that firms within the same sector have embeddings that are close as measured by cosine similarity? We report on a series of experiments with Convolutional Neural Networks (CNNs) for text classification, trained on two variants of document representation: one uses pre-trained word vectors, and the other is based on the simple bag-of-words model.
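
To make the closeness measure in the second question concrete, the following is a minimal Python sketch of cosine similarity applied to raw bag-of-words count vectors. It is not the speaker's method: the talk concerns embeddings extracted from a trained CNN, whereas this sketch only illustrates the similarity measure on unlearned word counts. The helper names (`bag_of_words`, `cosine_similarity`) and the toy text snippets are hypothetical and invented for illustration; they are not taken from any real 10-K.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a document to a sparse term-count vector (hypothetical helper)."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dict-like counts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)


# Toy snippets standing in for 10-K text (invented, not real filings).
docs = {
    "bank_a": "loans deposits interest income credit risk",
    "bank_b": "deposits loans credit losses interest margin",
    "oil_c": "crude oil drilling reserves pipeline production",
}
vectors = {name: bag_of_words(text) for name, text in docs.items()}

# Two firms assumed to share a sector versus two assumed to differ.
sim_same = cosine_similarity(vectors["bank_a"], vectors["bank_b"])
sim_diff = cosine_similarity(vectors["bank_a"], vectors["oil_c"])
print(f"same sector: {sim_same:.3f}, different sector: {sim_diff:.3f}")
```

In this toy setup the two bank-like snippets share four of their six words, so their similarity is well above that of the bank and oil snippets, which share none. Learned embeddings of the kind the abstract describes would be dense, low-dimensional vectors, but the same similarity measure would apply to them.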
