New York, Neudata Data Scouting and Insights Summit, March 2022.
Robert Schuessler, SESAMm's Head of Quantitative Research, shares insights on the systematic detection of ESG controversies and how we process 13 years of historical web-sourced data from over 18 billion articles and forums using natural language processing to generate timely insights.
Below is an approximation of this video’s audio content. Watch the video for a better view of graphs, charts, graphics, images, and quotes the presenter might be referring to in context.
These are actually the same articles (Slide 2), even though they don’t look like it. It’s a very common approach to talking about data. I’m sure people coming to presentations like this are familiar with it.
We see a large amount of data and an explanation of how, over time, it's going to increase by a thousand or a million times, and we make comparisons to somewhat out-of-date benchmarks. So in the case of The Economist, comparing it to the volume of data you can store on a DVD; pages of data for the Financial Times; or movies for the Wall Street Journal.
It's a type of journalistic trope I usually think of as "big number gets bigger." One thing that seems common about these articles is a sort of apocalyptic tone to the discussion ("deluge" appeared in two of them). It seems as if they're compressing anxieties about change into the amount of data that we now have to deal with.
So this presentation is about the opposite, “big number gets smaller,” and one thing that I'm going to start with is the appropriate type of benchmark to use when you're evaluating how much data you have to deal with.
Here's Henry David Thoreau and a famous quote of his: “The cost of a thing is the amount of what I will call life which is required to be exchanged for it.” You can apply that to size as well, and we'll see how that works in a second.
18 billion is the big number that we're beginning with. This is the number of records that SESAMm has from 2008 to the present. A record for us is an article, like the Wall Street Journal article or The Economist article, so a very dense amount of information compressed into one article.
That's 250 terabytes. If we look at the decimal representation of that, it's four extra zeros compared to the previous number, so actually, this is "big number gets bigger." But technically, it's the same size: this is the storage for those records, and we'll see how it goes down from here.
Here's another person that you're probably familiar with: Tyler Cowen, a Harvard-trained economist and polymath. Here's a quote from him. If we imagine that you all are superior to Tyler Cowen and can read one of these Wall Street Journal or Economist articles in a second, without pausing to eat or sleep, then over the course of a year you can read 31 million articles, a little more.
So you can read everything ever published by The Economist and move on to niche infrastructure blogs and start reading Reddit.
If you start when you're right out of college and invest an entire career to read a billion—the first billion articles—you'll be in your mid-50s before you finish.
And there will be 17 billion left to read, plus everything that's accumulated in the meantime, which is increasing at an exponential rate. So rather than comparing to DVDs, which is a little awkward, compare to your actual life: there are many lifetimes' worth of data in here. And again, these are dense articles, not just data points.
So this is the question that we're interested in: How can you go through that much data in a reasonable amount of time in order to make decisions?
So don't be worried; I'm not going to go into detail on this. This is what our data model looks like, and what I want to highlight is just the arrows. All of these arrows are opportunities for optimization. At each point where we're ingesting or manipulating data in some way, there's a chance to do something faster.
We start with a document-oriented database. The reason for this is that what people want to see is the actual impetus behind a change. We can aggregate data and identify spikes and troughs, but if we're doing that accurately, then people want to understand what's behind them. So ultimately, what we're going to do is say, "Hey, here are the articles that you should read. These are relevant to your universe." That's why we start with Elasticsearch.
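Spike and trough detection over article counts can be done with a simple rolling baseline. The sketch below is purely illustrative (the talk doesn't describe SESAMm's actual method): it flags days whose article count deviates sharply from a trailing window.

```python
from collections import deque

def spike_indices(counts, window=7, threshold=3.0):
    """Flag positions whose value deviates strongly from a trailing window.

    Illustrative only; thresholds and windowing are arbitrary choices here.
    """
    flagged = []
    recent = deque(maxlen=window)  # trailing baseline of the last `window` values
    for i, c in enumerate(counts):
        if len(recent) == window:
            mean = sum(recent) / window
            var = sum((x - mean) ** 2 for x in recent) / window
            std = var ** 0.5
            dev = abs(c - mean)
            # flag if far outside the baseline (or any deviation when baseline is flat)
            if (std == 0 and dev > 0) or (std > 0 and dev / std > threshold):
                flagged.append(i)
        recent.append(c)
    return flagged

daily_counts = [10, 12, 11, 10, 12, 11, 10, 9, 11, 10, 100, 10, 11]
print(spike_indices(daily_counts))  # the burst at index 10 is flagged
```

Once a spike is flagged, the underlying articles for that day are exactly what a document store like Elasticsearch can hand back to the user.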
We use DistilBERT and the Universal Sentence Encoder, and one thing I want to emphasize: even with the compressed version of BERT, when we're calculating the transformer matrix or calculating encodings, we can do that in less than three milliseconds. But that's still a year and a half if you're working with 18 billion data points, so we have to see how to cut these down. I do want to emphasize that we spend a lot of time developing online algorithms. This may be an unfamiliar concept if you're not a computer scientist, but these are algorithms where, when you add a new data point, you can work with just that data point and the previous result. You don't have to recalculate using all of the previous data points.
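A minimal example of the online-algorithm idea is a running mean and variance via Welford's method (a standard textbook technique, not necessarily what SESAMm uses): each update touches only the new point and the previous state, never the full history.

```python
class OnlineStats:
    """Running mean and variance updated one point at a time (Welford's method).

    Each update is O(1): no re-scan of the previously ingested data points.
    """

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of everything seen so far."""
        return self._m2 / self.n if self.n else 0.0

stats = OnlineStats()
for value in [1, 2, 3, 4]:
    stats.update(value)
print(stats.mean, stats.variance)  # 2.5 1.25
```

The same shape (new point plus small state, constant work per update) is what makes it feasible to keep aggregates current as billions of records stream in.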
So one obvious answer is to get a billion readers and do everything in one second. There are trade-offs to this. It becomes more and more complex as you add servers; the complexity is exponential, not linear. And you can get some surprises in terms of your bills, especially if, like us, you let your clients play with your data.
So what we do is use a hybrid model: the servers that serve the API and the dashboard, we control directly. We can add GPUs so that they can do math as fast as possible, and it also allows us to make sure they're optimized for the type of work they're doing. So even though we have to hire people for this, it ends up being less expensive. It also allows us to avoid a trap: if we were able to just throw servers at every problem, we might be tempted to calculate all embeddings in advance, and that would require 500 petabytes more storage than we currently need. Instead, we calculate them on the fly because we use these optimized servers.
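The 500-petabyte figure can be sanity-checked with a back-of-envelope calculation. All sizing assumptions below (tokens per article, embedding width, dtype) are illustrative guesses, not SESAMm's published figures, but they show how per-token transformer embeddings over 18 billion articles land in that ballpark.

```python
# Back-of-envelope: why precomputing all embeddings would be prohibitive.
# Every sizing assumption here is an illustrative guess, not SESAMm's data.
articles = 18_000_000_000
tokens_per_article = 9_000   # assumed average article length in tokens
dims = 768                   # BERT-base hidden size
bytes_per_float = 4          # float32

total_bytes = articles * tokens_per_article * dims * bytes_per_float
petabytes = total_bytes / 1e15
print(f"~{petabytes:.0f} PB")  # on the order of 500 petabytes
```

Under these assumptions the total comes out just under 500 PB, which is why computing embeddings on demand on GPU-equipped servers is the cheaper path.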
My girlfriend's a dentist, and her father's a dentist. And whenever somebody talks about pulling teeth, they like to leap in and explain, "Actually, it's super easy to pull teeth." I have another contact in Boston who restores 18th-century furniture for museums. And whenever he hears this phrase, he also leaps in and corrects you.
Apparently, if you're doing something that needs exact precision, first you cut off everything you're absolutely sure you don't need, leaving the piece a little bigger than necessary. Then you gradually shave it down until it's the correct size, testing each time before you cut off a slice. So you're actually cutting a lot. That's our approach as well, so maybe an interpretation that works is this "machete, then sandpaper" approach.
So here are a few things that we do. First, we create a knowledge graph. We do this in advance. Those 18 billion articles represent 50 million entities: five million companies, all of their executives, the FIGIs if they're public companies, the brands, and the products. We do all of this in advance so that when somebody gives us their universe, we know the sorts of things people are going to be talking about when they're referring to a company in that universe.
We can also create indexes in advance. We do that for every attribute. I'm not going to read all of these, but language is one of them. We have over a hundred languages represented from four million sources that we pull daily. If you know that you only care about English results, for example, or only about Spanish ones, we don't even have to look at everything else, because we have separate indices for the forty percent of the database that's in English.
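The effect of attribute partitioning can be sketched as routing a query to only the relevant per-language indices. The index names and field names below are hypothetical, not SESAMm's actual schema; the point is that the non-matching partitions are never touched.

```python
# Sketch of per-attribute index routing, assuming one index (or alias) per
# language, e.g. "articles-en", "articles-es". Names and fields are made up.
def build_search(languages, entity_id):
    """Return (index list, query body) restricted to the given languages."""
    indices = ",".join(f"articles-{lang}" for lang in languages)
    body = {"query": {"term": {"entities.id": entity_id}}}
    return indices, body

indices, body = build_search(["en"], "ACME-123")
print(indices)  # only the English partition is searched
```

With a real Elasticsearch client the tuple would be passed to a search call; the sixty percent of the database in other languages never enters the query path.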
We can also create custom indices if we can think of clever things in advance. SESAMm was founded in France, so I have a slightly chauvinistic example for the French people in the room. We do actually do this. We identified monuments, and it paid off. One of our private equity (PE) clients said, "We want to understand people's intent to visit monuments as a measure of how safe they're feeling post-COVID-19." And we were able to turn this around in four days for them because we had thought about it in advance.
As with everything else, there's a trade-off. If you have a hundred thousand partitions, your performance starts to decline.
So you might have thought, when you were looking at the first workflow I showed you, that there's nothing we can do about that last arrow, because once you put results in front of the user, you can't optimize any further. But we just saw the RepRisk presentation, and I think there's an example of convergent evolution here.
So since we work with private equity companies on due diligence, they ask us to do other things as well. And one of the things that we end up doing is identifying ESG risks for the companies because they want to do it in the same time frame that we do the other stuff for them. So we've done something very similar here.
Looking at SASB standards and other ESG frameworks, we create the same sorts of indices as we did for monuments, but for every topic that might be relevant to somebody either evaluating a company in the due diligence process of the PE life cycle or doing their reporting once it's a portfolio company.
Because we have so many data points, we can also benchmark these scores so that we know that for a particular company and sector, this is going to be relevant, and this is not going to be.
And the same for the UN Sustainable Development Goals (SDGs). A lot of companies want to balance out the negatives they're reporting with some positives that the companies, or their portfolio overall, are doing as well. So we prepared these indices in advance.
Whenever I talk about precision and recall, I like to give everybody a refresher, just in case it's not right at the forefront of their minds. When I was in college, this is how I remembered the difference between Type I and Type II errors: if something's true in the world but I don't believe it, then I'm ignorant; if something is not true but I believe it anyway, then I'm gullible. We don't really have to worry about recall because there's so much data. Your problem is definitely in there somewhere, but we don't want to just dump it all on you and say, "Dive in, go find it." So we really spend a lot of time on the Type I side. We want to eliminate false positives as much as possible. And for PE firms, it's particularly important because they often have a fiduciary responsibility to track down and understand any risk that may be material that gets raised. So we don't want to be sending people on wild goose chases.
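The mnemonic maps directly onto the standard definitions. As a quick reference (textbook formulas, with variable names chosen to match the ignorant/gullible framing):

```python
def precision(true_positives, false_positives):
    """Fraction of alerts that were genuine: high precision means not gullible
    (few Type I errors, i.e. believing things that aren't true)."""
    total = true_positives + false_positives
    return true_positives / total if total else 0.0

def recall(true_positives, false_negatives):
    """Fraction of genuine risks that were alerted: high recall means not
    ignorant (few Type II errors, i.e. missing things that are true)."""
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0

# e.g. 90 real risks flagged, 10 false alarms, 30 real risks missed
print(precision(90, 10), recall(90, 30))  # 0.9 0.75
```

The talk's point is that when the corpus is this large, recall is nearly free and the engineering effort goes into pushing precision up.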
The other way we use the data to cut down on time is with a specific workflow to identify controversies. Having 18 billion records provides a data set that lets us understand, for a particular size of company, a particular sector, and the type of risk they're running that we've identified, how likely it is to be material. We score these things and generate a decision tree for each hit that we get. We divide that into five buckets, and we only alert on the top ones, the fours and fives. These are highly flexible, so we can tune them for what the companies actually need. So another way we save time is by avoiding things that would send you on a wild goose chase.
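The bucket-and-alert step can be sketched as follows. The thresholds, the five-bucket split, and the alert cutoff below are all made-up illustrations; the talk only says hits are scored, divided into five buckets, and alerted on the top ones.

```python
# Illustrative sketch: map materiality scores to five buckets and alert only
# on the highest buckets. All thresholds here are invented for the example.
def bucket(score, edges=(0.2, 0.4, 0.6, 0.8)):
    """Map a materiality score in [0, 1] to buckets 1 (lowest) .. 5 (highest)."""
    return 1 + sum(score >= e for e in edges)

def alerts(scored_hits, min_bucket=4):
    """Keep only hits landing in the top buckets; the cutoff is tunable."""
    return [hit for hit, score in scored_hits if bucket(score) >= min_bucket]

hits = [("oil spill report", 0.92), ("minor forum rumor", 0.15),
        ("regulator inquiry", 0.71)]
print(alerts(hits))  # only the high-materiality hits get forwarded
```

Tuning `min_bucket` (or the bucket edges) per client is what makes the alerting "highly flexible" in the sense the talk describes.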
The 18 billion records we start with provide the data for this understanding of which things are important and which aren't: the 50 million records in the knowledge graph, the 100-plus risks that we map, and the decision tree that allows us to understand which things are worth forwarding and which don't need investigation.
For a mid-range portfolio of 400 to 500 companies, over the course of a year, we'll generate 960 alerts. Those alerts come with links to the articles so that you can explore and understand exactly what risks the company is running and dive into the problem, as we've seen, and we get feedback from that as well.
So if we assume that something bad can happen every day of the year, that's 2.6 alerts on average for any given day. So: from 18 billion down to 2.6. And in fact, I have one more slide. I wanted to time this exactly right, both to come to the natural conclusion of "big number gets smaller" and, because this is the last presentation before the break, to dramatize the idea of saving time: I intend to end one minute early.
Stay in touch with SESAMm
Thanks for reading this blog post. Be sure to catch the next issue by subscribing to our blog. And if you'd like a TextReveal demo, send us a message via the form below.