NLP Best Practices for Deriving Financial and ESG Insights from News and Social Media

by: SESAMm, 14 minute read, February 25, 2022

Executive Summary

SESAMm’s AI and Natural Language Processing (NLP) platform analyzes text in billions of web-based articles and messages to generate investment insights that are used in systematic trading, fundamental research, risk management and sustainability analysis.

This technology enables a more quantitative approach to leveraging the value of web data that is less prone to human bias. It addresses a growing need in the public and private investment sectors for robust, timely and granular sentiment and ESG data.

In this article, we will outline the process by which the data is derived and illustrate its effectiveness and predictive value.



Content Coverage and Collection

The genesis of SESAMm’s process is the high-quality content that comprises its data lake – the source from which it draws its insights. SESAMm scans over four million data sources, rigorously selected and curated to maximize coverage of both public and private companies. Three guiding criteria – Quality, Quantity, and Frequency – ensure a consistently high value of input.

Every day the system adds millions of articles to the 16 billion already in the data lake, which goes back to 2008. Coverage is global: 40% of the sources are in English (US and international) and 60% are in other languages spanning the rest of the world. The data lake, which expands every month, comprises over four million sources, including professional news sites, blogs, social media and discussion forums.

The following tables illustrate SESAMm’s data lake distribution (Q1 2022):

Language and Country

Respect for personal privacy figures highly in the data gathering process. We do not capture Personally Identifiable Information (PII) and respect all website terms of service as well as global data handling and privacy laws. SESAMm’s data also doesn’t contain any Material Non-Public Information (MNPI).


Derived Financial Signals and ESG Indicators

SESAMm’s new TextReveal® Streams platform applies Natural Language Processing (NLP) and Artificial Intelligence (AI) expertise to process the premium quality content gathered in its data lake. This is a complex process that involves Named Entity Recognition (NER) and Disambiguation (NED) – the process of identifying entities and distinguishing like-named entities using contextual analysis – and mapping the complex interrelationships between tens of thousands of public and private entities, connecting companies, products and brands by supply chain, location or competitive relationship.


Process representation for NER and NED

Using SESAMm’s TextReveal® Streams, this wealth of information is filtered to focus on four important contexts for systematic data processing, risk management, and alpha discovery:

● Sentiment covering major global indices: world equities (and Small Caps, Emerging), US 3000, Europe 600, KOSPI 50, Japan 500, Japan 225;
● Sentiment covering all assets and derivatives traded on the Euronext exchange;
● Private Company Sentiment on 25,000+ private companies;
● ESG Risks covering 90 major Environmental, Social, and Governance risk categories for the entire company universe, which includes 10,000+ public and 25,000+ private companies with worldwide coverage.

TextReveal® Streams data sets are used by hedge funds (quantitative and fundamental) and asset managers to optimize trade timing and identify new investment opportunities. Private equity deal and credit teams also use the data for deal sourcing and due diligence, and private equity ESG teams use it to manage portfolio company environmental, social, and governance risk and reporting.


Methodology and Technology for Processing Unstructured Data


NLP Workflow — From Data Extraction to Granular Insight Aggregation

Data is continually extracted from an expanding universe of over four million sources daily. As it enters the system it is time-stamped, tagged, indexed, and stored in our data lake to update a point-in-time history extending from 2008 to the present.
The source material is then transformed from raw, unstructured text data into conformed, interconnected, machine-readable data with a clear topic.

NLP workflow for TextReveal® Streams


The Knowledge Graph — Mapping Relationships between Entities

At the heart of the text analytics process is SESAMm’s proprietary Knowledge Graph, a vast map connecting and integrating over 70 million related entities and their keywords. It is essentially a cross-referenced dictionary of keywords, connecting each organization to its brands, products, associated executives, names, nicknames, and in the case of public companies to their exchange identifiers.
Entities within the Knowledge Graph are updated weekly and tagged to ensure changes are properly tracked: the CEO of a company today, for example, may not be the CEO tomorrow, and brands may be bought and sold, changing the parent company with each sale. Weekly updates within the Knowledge Graph ensure the system is aware of these changes.
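To make the idea of a cross-referenced, time-aware dictionary concrete, here is a minimal sketch. It is not SESAMm's implementation: the `Link` structure, the `entities_for` helper, the company names, and all dates are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Link:
    """One keyword-to-entity edge, valid over a date range."""
    keyword: str
    entity: str
    relation: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the link is still valid


# Hypothetical edges; names and dates are placeholders, not real records.
LINKS = [
    Link("acme", "Acme Corp", "name", date(2008, 1, 1)),
    Link("jane doe", "Acme Corp", "ceo", date(2015, 1, 1), date(2020, 6, 30)),
    Link("john roe", "Acme Corp", "ceo", date(2020, 7, 1)),
    Link("zippy", "Acme Corp", "brand", date(2010, 1, 1), date(2018, 12, 31)),
    Link("zippy", "Globex Inc", "brand", date(2019, 1, 1)),
]


def entities_for(keyword: str, as_of: date) -> List[str]:
    """Return the entities a keyword pointed to on a given day."""
    key = keyword.lower()
    return [
        link.entity
        for link in LINKS
        if link.keyword == key
        and link.valid_from <= as_of
        and (link.valid_to is None or as_of <= link.valid_to)
    ]


# The same brand resolves to a different parent before and after its sale.
print(entities_for("Zippy", date(2017, 5, 1)))
print(entities_for("Zippy", date(2021, 5, 1)))
```

Storing a validity range on every edge is one simple way to keep a point-in-time history, so that an article from 2017 is attributed to the owner of the brand at that time rather than to today's owner.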

Named Entity Disambiguation (Named Entity Recognition + Entity Linking) is one of the Natural Language Processing (NLP) techniques that is used to identify named entities in text sources using the entities mapped within the Knowledge Graph universe.
At SESAMm, Named Entity Disambiguation (NED) identifies named entities based on both their context and their usage. Text referencing “Elon”, for example, could refer indirectly to Tesla through its CEO or to a university in North Carolina; only the context allows us to differentiate, and NED takes that context into account when classifying entities. This is superior to simple pattern matching, which limits the number of possible matches, requires frequent manual adjustments, and is unable to distinguish homonyms (identically spelled names that refer to different things).
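The “Elon” example can be sketched with a deliberately simple context-overlap score. This is an illustration of the principle only, not the production NED model: the candidate profiles, the `disambiguate` function, and the scoring rule are all assumptions.

```python
import re
from typing import Optional, Set

# Hypothetical context profiles for the ambiguous mention "Elon".
CANDIDATES = {
    "Tesla": {"ceo", "electric", "car", "cars", "shares", "factory", "musk"},
    "Elon University": {"university", "campus", "students", "carolina", "college"},
}


def tokenize(text: str) -> Set[str]:
    """Lower-case the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def disambiguate(sentence: str) -> Optional[str]:
    """Pick the candidate whose profile shares the most words with the sentence.

    Returns None when no candidate shares any word with the context.
    """
    words = tokenize(sentence)
    scores = {name: len(words & profile) for name, profile in CANDIDATES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate("Elon said the new factory will build electric cars"))
print(disambiguate("Students returned to the Elon campus in North Carolina"))
print(disambiguate("Elon was mentioned yesterday"))
```

A pure pattern matcher would treat all three sentences identically; using the surrounding words is what separates the two entities and lets the third sentence stay unresolved.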
Three other NLP tools are used by SESAMm to identify entities and create actionable insights. These are Lemmatization, Embeddings, and Similarity. Each is explained in more detail below.


Lemmatization — Analyzing the Morphology of Words

News articles, blog posts, and social media discussions reference organizations and associated entities in a wide range of forms and functions. Lemmatization seeks to standardize these references so the system knows they all mean the same thing.
For example, “Tesla,” “his firm,” “the company,” and “it” are all noun phrases that can appear in a single article and refer to a single entity. Even where the referent is immediately clear, it can take different forms. For example, “Tesla” and “Teslas” both refer to the same entity but have a slightly different meaning (semantics) and shape (morphology).
The process of lemmatization standardizes reference shape (morphology) to facilitate identification and aggregation. Lemmatization is a more sophisticated process than stemming, which simply truncates words to their stem and sometimes deletes information.
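The difference between the two approaches can be shown with a toy comparison. The lemma dictionary and the suffix rules below are assumptions made for illustration; real lemmatizers rely on full vocabularies and part-of-speech information, and real stemmers use more careful rules.

```python
# Toy lemma dictionary (illustrative only).
LEMMAS = {
    "teslas": "tesla",
    "companies": "company",
    "bought": "buy",
    "studies": "study",
}


def lemmatize(word: str) -> str:
    """Map a word to its dictionary form, falling back to the lower-cased word."""
    w = word.lower()
    return LEMMAS.get(w, w)


def naive_stem(word: str) -> str:
    """Crude suffix stripping, standing in for a rule-based stemmer."""
    w = word.lower()
    for suffix in ("ies", "es", "s", "ed", "ing"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


for word in ("Teslas", "companies", "bought"):
    print(word, "stem:", naive_stem(word), "lemma:", lemmatize(word))
```

In this sketch the stemmer turns "companies" into the non-word "compan" and leaves the irregular form "bought" untouched, while the lemmatizer returns "company" and "buy".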


Word Embedding — Encoding Context and Meaning

In Natural Language Processing, an Embedding is a numerical representation of a word that enables its manifold contextual meanings to be calculated relationally. Embeddings are typically real-valued vectors with hundreds of dimensions that encode the contexts in which words appear, and thus also encode their meanings.

Because they are vectors in a predefined vector space, they can be compared, scaled, added, and subtracted. The classic simple example of how this works is that the vector representations of King and Queen bear the same relation to each other as the representations of Man and Woman; subtracting the vector that represents Royal from King and Queen leaves vectors close to Man and Woman.
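The arithmetic can be checked on hand-built toy vectors. The three axes used here (royalty, maleness, femaleness) and the values are assumptions chosen so the relations hold exactly; learned Embeddings have hundreds of dimensions and satisfy such relations only approximately.

```python
# Toy 3-d vectors with hand-picked axes: (royalty, maleness, femaleness).
VECTORS = {
    "king": (1, 1, 0),
    "queen": (1, 0, 1),
    "man": (0, 1, 0),
    "woman": (0, 0, 1),
    "royal": (1, 0, 0),
}


def sub(a, b):
    """Element-wise difference of two vectors."""
    return tuple(x - y for x, y in zip(a, b))


def add(a, b):
    """Element-wise sum of two vectors."""
    return tuple(x + y for x, y in zip(a, b))


# Removing "royal" from king and queen leaves man and woman.
king_minus_royal = sub(VECTORS["king"], VECTORS["royal"])
queen_minus_royal = sub(VECTORS["queen"], VECTORS["royal"])

# King and queen differ in the same way as man and woman.
royal_gap = sub(VECTORS["king"], VECTORS["queen"])
plain_gap = sub(VECTORS["man"], VECTORS["woman"])

# The well-known analogy: king - man + woman lands on queen.
analogy = add(sub(VECTORS["king"], VECTORS["man"]), VECTORS["woman"])

print(king_minus_royal, queen_minus_royal, royal_gap == plain_gap, analogy)
```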

Vectorized representation of Embeddings


Using Embeddings is key, both to analyzing how words change meaning depending on context and to understanding the subtle differences between words that refer to the same concept (synonyms).
To take an example, the words business, company, enterprise and firm can all refer to the same thing if the context is ‘organizations’ but represent very different things (and even different parts of speech) if the context changes.

In the phrase, “[Tesla] will be by far the largest firm by market value ever to join the S&P”, for example, one could replace the word firm with company or enterprise without affecting the meaning significantly. Contrast that with “a firm handshake”, where a similar substitution would render the phrase meaningless.
Also, words referring to the same concept can emphasize slightly different aspects of the concept, or imply specific qualities — for example, an enterprise might be assumed to be larger or to have more components than a firm. Embeddings enable machines to make these subtle distinctions.

One advantage of Embeddings is practical: the approach is empirically testable, i.e., we can look at actual usage to determine what a word means.
Another major advantage is that Embeddings are computationally tractable: treating a word’s definition as the set of contexts in which it appears turns words into objects of computation, so those contexts can be examined programmatically to derive meaning.
Just as Lemmatization is an improvement on Stemming, Embeddings are an improvement on techniques such as one-hot encoding, which is close to the common conception of a definition as a single entry in a dictionary.

SESAMm uses the Global Vectors for Word Representation, or GloVe, algorithm to generate Embeddings. This is an unsupervised learning algorithm that begins by examining how frequently each word in a text corpus co-occurs with other words in the same corpus. The result is an Embedding that encapsulates the word and its context together, allowing SESAMm to identify not just specific words in a list, but different forms of the listed words and unlisted synonyms of the word as well.
GloVe is an extension of recent approaches to vector representation, combining the global statistics of matrix factorization techniques like Latent Semantic Analysis (LSA) with the local context-based learning of word2vec. The result is an unsupervised algorithm that performs well at capturing meaning and demonstrating it on tasks like calculating analogies and identifying synonyms.
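The starting point described above, counting how often words co-occur, can be sketched as follows. This covers only the counting step, with raw counts and a symmetric window as simplifying assumptions; GloVe then fits word vectors to the logarithm of such counts, which is not shown here. The function name and the sample sentence are illustrative.

```python
from collections import Counter
from typing import List


def cooccurrence(tokens: List[str], window: int = 2) -> Counter:
    """Count (word, context) pairs that appear within `window` tokens of each other."""
    counts: Counter = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts


tokens = "the firm reported strong earnings and the company raised guidance".split()
counts = cooccurrence(tokens, window=2)

# "firm" and "company" both occur next to "the", a first hint of shared context.
print(counts[("the", "firm")], counts[("the", "company")])
```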

BERT is another algorithm used by SESAMm to generate Embeddings. BERT produces word representations that are dynamically informed by the words around them. The technique was developed by Google and is a transformer-based Machine Learning technique, which means it does not process an input sequence token by token but rather takes the entire sequence as input in one go. This is a big improvement over sequential Recurrent Neural Network (RNN) based models because processing the whole sequence at once parallelizes well on Graphics Processing Units (GPUs).
SESAMm uses BERT for multilingual NLP of its extensive foreign language text because it has been pre-trained on an extensive library of unlabeled data extracted from Wikipedia in more than 100 languages. The BERT model was trained on two tasks: predicting masked words from their context, and next-sentence prediction, in which it predicts whether a candidate next sentence is probable given the first sentence. As a result of this training process, BERT learned contextual Embeddings for words. Due to this comprehensive pre-training, BERT can be fine-tuned with fewer resources on smaller datasets to optimize its performance on specific tasks.
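The two pre-training tasks can be illustrated by the shape of their training examples. The sketch below only builds examples; it does not train or call a model. The helper names are hypothetical, and masking a single chosen position is a simplification of BERT's random masking.

```python
from typing import Dict, List, Tuple

MASK = "[MASK]"


def masked_example(tokens: List[str], position: int) -> Tuple[List[str], str]:
    """Hide one token; the model must predict it from the words on both sides."""
    masked = tokens.copy()
    target = masked[position]
    masked[position] = MASK
    return masked, target


def next_sentence_example(first: str, candidate: str, is_next: bool) -> Dict[str, object]:
    """Pair two sentences with a label saying whether the second follows the first."""
    return {"text": f"[CLS] {first} [SEP] {candidate} [SEP]", "label": int(is_next)}


tokens = "the firm reported record earnings".split()
inputs, target = masked_example(tokens, 1)
print(inputs, "->", target)

pair = next_sentence_example(
    "The firm reported record earnings.", "Its shares rose on the news.", True
)
print(pair)
```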


Cosine Similarity — Linking Words, Sentences, and Topics

Cosine similarity computed on mean-centered vectors is identical to the Pearson correlation coefficient, which highlights another element of the computational tractability of the Embeddings approach: it makes it easy to compare words and contexts for similarity.
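The equivalence with the correlation coefficient can be verified numerically. The two short vectors below are arbitrary illustrative values, and the helper names are assumptions.

```python
import math
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def center(v):
    """Subtract the mean from every component."""
    m = mean(v)
    return [x - m for x in v]


def pearson(a, b):
    """Pearson correlation computed as covariance over the product of standard deviations."""
    ma, mb = mean(a), mean(b)
    cov = mean((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / (pstdev(a) * pstdev(b))


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]

centered_cosine = cosine(center(x), center(y))
correlation = pearson(x, y)
raw_cosine = cosine(x, y)  # without centering the value differs

print(centered_cosine, correlation, raw_cosine)
```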

Converting words to vector representations means we can quickly and easily compare word similarity by comparing the angle between two vectors. This angle is a function of the projection of one vector onto another and can identify similar, opposite, or wholly unrelated vectors, which allows us to compute the similarity of the underlying word that the vector represents.
Two vectors aligned in the same orientation will have a similarity measurement of 1, while two orthogonal vectors have a similarity of 0. If two vectors are diametrically opposed, the similarity measurement is -1. In practice, negative similarities are rare, so we clip negative values to 0.
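A minimal sketch of the measurement and the clipping step, using two-dimensional vectors so the three reference cases are easy to verify by eye. The function name and the zero-vector handling are assumptions.

```python
import math


def clipped_cosine(a, b):
    """Cosine similarity with negative values clipped to 0.

    Returns 0.0 if either vector has zero length.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 0.0
    return max(0.0, dot / norm)


same_direction = clipped_cosine((1.0, 0.0), (2.0, 0.0))   # aligned vectors
orthogonal = clipped_cosine((1.0, 0.0), (0.0, 3.0))       # unrelated vectors
opposite = clipped_cosine((1.0, 0.0), (-1.0, 0.0))        # raw value -1, clipped

print(same_direction, orthogonal, opposite)
```

Note that the length of a vector does not matter, only its direction: (1, 0) and (2, 0) are treated as identical.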

Vectorized representation of cosine similarities



Cosine similarity measures whether two words, sentences, or corpora are close to one another in vector space, or “about” the same thing in semantic space. To answer the question “Is this sentence referencing company X?”, we embed the sentence using the process described above and compute the cosine similarity between the sentence and the embedded company profile.
Analogously, we compute similarities between sentences and the ESG topics SESAMm monitors by taking the maximum similarity between a sentence and each embedded keyword associated with an ESG topic.
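The maximum-over-keywords rule can be sketched as below. The three-dimensional vectors, the keyword lists, and the sentence embedding are hypothetical stand-ins for real Embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical keyword embeddings for two ESG topics.
TOPIC_KEYWORDS = {
    "pollution": {"spill": (0.9, 0.1, 0.0), "emissions": (0.8, 0.2, 0.1)},
    "fraud": {"embezzlement": (0.0, 0.9, 0.2), "forgery": (0.1, 0.8, 0.3)},
}


def topic_similarity(sentence_vec, topic):
    """Similarity to a topic is the best match among that topic's keywords."""
    return max(cosine(sentence_vec, kw) for kw in TOPIC_KEYWORDS[topic].values())


# Pretend embedding of a sentence such as "an oil leak fouled the river".
sentence_vec = (0.85, 0.15, 0.05)
scores = {topic: topic_similarity(sentence_vec, topic) for topic in TOPIC_KEYWORDS}
best = max(scores, key=scores.get)
print(best, scores)
```

Taking the maximum rather than the average means a sentence only needs to be close to one keyword of a topic to be associated with it.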
These similarities allow us to identify whether a sentence references fraud, tax avoidance, pollution, or any other ESG risk topic among the 90+ that SESAMm tracks across the web.
Similarities within ESG topics combine with word counts to resolve the recall/precision problem. Word counts are precise in the sense that if a word is identified within a context then that context (by construction) references the topic.
The virtue of using these NLP techniques is that even if a given keyword list does not include every possible combination of words that a person might use to discuss a topic, relevant entities missed by the word count process will be identified through vector similarity.
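One way to picture how exact word matches and similarities complement each other is a two-stage check. The `mentions_topic` function, the keyword set, and the 0.7 threshold are assumptions for illustration, and the similarity score is taken as a precomputed input.

```python
import re

# Hypothetical keyword list for a "pollution" topic.
POLLUTION_KEYWORDS = {"pollution", "spill", "emissions"}


def mentions_topic(sentence, keywords, similarity, threshold=0.7):
    """Return (flag, reason): an exact keyword hit first, vector similarity second."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    if words & keywords:
        return True, "keyword"      # precise: the topic word is literally present
    if similarity >= threshold:
        return True, "similarity"   # recall: caught despite no listed keyword
    return False, "none"


print(mentions_topic("The spill reached the coast", POLLUTION_KEYWORDS, 0.95))
print(mentions_topic("An oil leak fouled the river", POLLUTION_KEYWORDS, 0.88))
print(mentions_topic("The board met on Tuesday", POLLUTION_KEYWORDS, 0.10))
```

The second sentence contains none of the listed keywords, so a pure word count would miss it; the similarity score is what brings it back.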
This is the power of SESAMm’s NLP expertise — we can scan in seconds many lifetimes’ worth of data to find both the concepts you explicitly ask for and the concepts that are relevant to your search but that you did not think of yourself.


Deep Learning and Neural Networks — Analyzing Sentiment

Once we have identified the concepts and contexts of interest in all the forms in which they appear, we analyze the context to determine the attitude of the speakers.
We use sentiment classification models to score a sentence with three possible outcomes: negative, neutral, or positive. The current classification models are based on Deep Learning technologies. Specifically, we stack Convolutional Neural Networks on top of word Embeddings and tune the hyperparameters (parameters not learned during training) with Bayesian optimization. This architecture improves accuracy and enables fast shipping of production-ready models for a given language. We also produce state-of-the-art frameworks with architecture variations enabling multilingual capabilities, such as transformers and universal sentence encoders.
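The core mechanics of a convolutional layer over word Embeddings (slide a filter across consecutive words, keep the strongest response) can be sketched in a few lines. Everything here is a toy assumption: the two-dimensional embeddings, the hand-set filters, and the margin rule stand in for weights that a real model learns, and a real network would add non-linearities and a trained output layer.

```python
# Hypothetical 2-d word embeddings: (positivity, negativity).
EMBEDDINGS = {
    "strong": (1.0, 0.0), "growth": (0.8, 0.0), "record": (0.7, 0.0),
    "fraud": (0.0, 1.0), "losses": (0.0, 0.9), "probe": (0.0, 0.7),
}
UNKNOWN = (0.0, 0.0)

# Two hand-set filters, each spanning 2 consecutive words.
FILTERS = {
    "positive": [(1.0, -1.0), (1.0, -1.0)],
    "negative": [(-1.0, 1.0), (-1.0, 1.0)],
}


def conv_max(vectors, kernel):
    """Slide the kernel over consecutive word vectors and keep the strongest response."""
    width = len(kernel)
    responses = [
        sum(w * x for k, v in zip(kernel, vectors[i:i + width]) for w, x in zip(k, v))
        for i in range(len(vectors) - width + 1)
    ]
    return max(responses, default=0.0)


def classify(sentence, margin=0.5):
    """Score a sentence as negative, neutral, or positive."""
    vectors = [EMBEDDINGS.get(w, UNKNOWN) for w in sentence.lower().split()]
    pos = conv_max(vectors, FILTERS["positive"])
    neg = conv_max(vectors, FILTERS["negative"])
    if pos - neg > margin:
        return "positive"
    if neg - pos > margin:
        return "negative"
    return "neutral"


for text in ("record growth in sales", "fraud probe widens", "the meeting is on tuesday"):
    print(text, "->", classify(text))
```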


Daily Aggregations — Condensing Information and Extracting Insights

Similarities, embedded word counts, and sentiment are state-of-the-art tools for processing unstructured text data. The same tools are effective cross-linguistically.
Once the information has been extracted from millions of data points, it is aggregated and condensed into actionable insights.
First, every direct or indirect reference to an entity within an article is identified; then sentence-level references are aggregated to obtain an article-level perspective; and finally, all relevant articles are aggregated to give an entity-level view for that day.
In this way reams of data are compressed into several metrics to provide a daily aggregated view for each entity, highlighting trends at a sentence, article and entity level, comparable over a multi-year history.
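The sentence-to-article-to-entity roll-up can be sketched as two grouping steps. The sample records are hypothetical, and using a simple mean at each level is an assumption; the text does not specify the aggregation function.

```python
from collections import defaultdict
from statistics import mean

# One day's hypothetical records: (entity, article_id, sentence_sentiment),
# with sentiment coded as -1.0 (negative), 0.0 (neutral), 1.0 (positive).
SENTENCES = [
    ("Acme Corp", "a1", -1.0), ("Acme Corp", "a1", -1.0), ("Acme Corp", "a1", 0.0),
    ("Acme Corp", "a2", 1.0),
    ("Globex Inc", "a2", 0.0), ("Globex Inc", "a2", 1.0),
]


def daily_view(sentences):
    """Aggregate sentence scores per article, then article scores per entity."""
    by_article = defaultdict(list)
    for entity, article, score in sentences:
        by_article[(entity, article)].append(score)

    by_entity = defaultdict(list)
    for (entity, _article), scores in by_article.items():
        by_entity[entity].append(mean(scores))

    return {
        entity: {"articles": len(scores), "sentiment": mean(scores)}
        for entity, scores in by_entity.items()
    }


view = daily_view(SENTENCES)
print(view)
```

Averaging within each article first keeps one long article from dominating the daily figure simply because it contains more sentences.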


Use Cases

SESAMm’s TextReveal® Streams is used in a variety of investment domains, from asset selection to alpha generation and risk management. Systematic hedge funds track retail interest in real time to identify investment opportunities and to protect their existing positions. In the Private Equity industry, equity and credit deal teams use the data in a variety of ways, from monitoring consumer perspectives via forums and customer reviews when evaluating deal prospects to estimating due diligence risks. Dedicated teams use our data for monitoring portfolio companies for ESG red flags and for streamlining reporting.
Below are two examples of how aggregated TextReveal® Streams data can be used to help identify investment risk and opportunity.


ESG Signals for Equity Trading — LFIS Capital

ESG controversies can significantly impact asset prices in the short term, and it is now estimated that intangible assets, including a company’s ESG rating, account for 90% of its market value.
Working in partnership with LFIS Capital (“LFIS”), a quantitative asset manager and structured investment solutions provider, SESAMm developed Machine Learning and NLP algorithms that analyze ESG keywords in articles, blogs and social media to generate a daily ESG score specific to each stock, which is part of the TextReveal® Streams platform’s core functionality.
When these scores were then incorporated into a simulated strategy for trading stocks in the Stoxx600 ESG-X index, the results were found to be promising.
A simulated long-only strategy running between 2015 and 2020, using the signals, delivered a 7.9% annualized return, 2.9 percentage points higher than the benchmark, for similar annualized volatility (17.3% vs. 17.1%). The information ratio of the strategy was greater than 1, with a tracking error of 2.8%. Results for the past three years were particularly convincing, reflecting the growing interest and news-flow around ESG themes.
Researchers also backtested a hypothetical long/short strategy for all stocks in the Stoxx600 ESG-X index with a market cap of over $7.5bn. This investment strategy delivered a Sharpe ratio of approximately 1 with annualized returns and volatility of 6.1% and 5.9%, respectively, between 2015 and 2020. Like the long-only strategy, returns were particularly robust over the 3 years up to 2020: +6.0% in 2018, +7.3% in 2019, and +11.3% in 2020.

Finally, a simulated “130/30” ESG strategy that combined 100% of the long-only ESG strategy and 30% of the long/short ESG strategy delivered a 10.8% annualized return, 5.8 percentage points higher than that of the Stoxx600 ESG-X index. Annualized volatility was similar at 16.9% vs. 17.1%. The strategy had a tracking error of 3.8% and an information ratio of over 1.5, with consistent outperformance each year.
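The ratios quoted in this section can be cross-checked from the figures given. The sketch assumes the information ratio is excess return over the benchmark divided by tracking error, and that the Sharpe ratio is computed with a zero risk-free rate; the report's exact conventions are not stated here.

```python
def ratio(return_pct: float, risk_pct: float) -> float:
    """Return per unit of risk, with both inputs in percent."""
    return return_pct / risk_pct


# Long-only: 2.9 points of excess return over a 2.8% tracking error.
long_only_ir = ratio(2.9, 2.8)

# Long/short: 6.1% return over 5.9% volatility (risk-free rate assumed to be zero).
long_short_sharpe = ratio(6.1, 5.9)

# 130/30: 5.8 points of excess return over a 3.8% tracking error.
combined_ir = ratio(5.8, 3.8)

print(round(long_only_ir, 2), round(long_short_sharpe, 2), round(combined_ir, 2))
```

Under these assumptions the results come out at roughly 1.04, 1.03, and 1.53, in line with the "greater than 1", "approximately 1", and "over 1.5" figures in the text.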


Simulated results of a hypothetical “130/30” ESG strategy. Source: Bloomberg, LFIS, SESAMm.
Past performance is not an indicator of future results.

Theoretical calculations provided for illustrative purposes only. The investment theme illustrations presented herein do not represent actual transactions currently implemented in any fund or product managed by LFIS.

For access to the full SESAMm – LFIS report on ESG Signals for Equity Trading contact us

ESG Sentiment and Volume as Predictive Indicators — Wirecard

The Wirecard scandal broke on June 21, 2020, when newswires carried the story that the major German payment processor had filed for bankruptcy after admitting that €1.9 billion ($2.3 billion) of purported escrow deposits simply did not exist.
The question is, could SESAMm’s TextReveal® Streams platform have provided investors with an early warning that the scandal was about to break?
The chart below, derived from the platform, shows how key ESG metrics, including ESG Scores (Volumes) and ESG Scores (Sentiment) reacted to the news.
An analysis of the charts pinpoints a shallow rise in the ESG Scores (Volumes) time-series in the early part of June before the eruption on June 21.
The ESG Scores (Sentiment) metric also shows a steady increase in negative sentiment for ‘Governance’, which is the most relevant of the three with regard to the scandal.

Volume Graph

Additionally, for most of the time prior to the crash, Governance was the most negative of the three ESG factors. This was especially the case in late March-early April, and then just before the scandal in early June, when Governance negative sentiment clearly diverged higher from the other two.
The rate-of-change of negative Governance sentiment as it rose and peaked in early June just before the scandal broke was also extremely high, perhaps providing the basis for an early warning signal.
Portfolio managers who had been keeping an eye on the reputational slide in Governance for Wirecard may have decided the company was at high risk of a negative controversy emerging, which might have given them cause to drop the stock prior to the event.
In this way, while they do not provide a hard and fast early warning signal, SESAMm’s ESG scores can nevertheless be used as the basis for a data-driven, rules-based portfolio management approach that can help investors avoid high-risk candidates like Wirecard.
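A rules-based check on the rate of change of negative sentiment might look like the sketch below. The series is invented for illustration and is not Wirecard data; the lookback of 3 observations and the 0.15 threshold are arbitrary assumptions that would need calibration on real history.

```python
from typing import List, Optional


def rate_of_change(series: List[float], lookback: int) -> List[Optional[float]]:
    """Change versus `lookback` observations earlier; None until enough history exists."""
    return [
        None if i < lookback else series[i] - series[i - lookback]
        for i in range(len(series))
    ]


def first_alert(series: List[float], lookback: int = 3, threshold: float = 0.15) -> Optional[int]:
    """Index of the first observation whose rise over the lookback exceeds the threshold."""
    for i, change in enumerate(rate_of_change(series, lookback)):
        if change is not None and change > threshold:
            return i
    return None


# Hypothetical daily negative-Governance sentiment scores.
negative_governance = [0.20, 0.21, 0.20, 0.22, 0.25, 0.31, 0.40, 0.55]

print(first_alert(negative_governance))
print(first_alert([0.20] * 8))
```

A rule of this kind flags an acceleration in negative sentiment rather than its level, which is the behavior described above for Governance in early June.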

For access to the full SESAMm – Wirecard report on ESG Signals for Equity Trading contact us


About SESAMm

SESAMm’s Natural Language Processing and Artificial Intelligence tools analyze over four million data sources daily to identify thousands of public and private companies and their related products, brands, identifiers, and nicknames, turning reams of unstructured text into structured and actionable data.
SESAMm’s TextReveal® Streams platform can be used in a wide variety of quantitative, quantamental and ESG investment use cases. To find out more about how SESAMm can support you in your decision-making, to request a demo, or for any other questions regarding our data, do not hesitate to contact us.



The contents of this blogpost do not constitute an offer or solicitation to buy services or shares in any fund.
The information in this document does not constitute investment advice or an offer to invest or to provide management services and is subject to correction, completion, and amendment. Past performance is not indicative of future results.