Discover our whitepaper highlighting the power of web data on predictive analytics for the financial industry: how alternative data strengthen the market, the challenges of collecting web data and case studies presenting different approaches, such as ESG, showcasing TextReveal features and capacities for investment purposes.
In the past, investment management institutions relied mostly on traditional data to gain an edge in investing. Traditional data ranges from SEC filings to earnings reports and pricing information: any type of data produced by the company itself. The rise of the digital age, however, has opened up new sources of data for investors beyond the scope of traditional data. The seemingly infinite scope of alternative data includes data produced from credit cards, satellites, social media and, perhaps most importantly, the web.
With the additional integration of alternative data, investment management institutions and hedge funds in particular that once relied only on traditional data now have an edge in predicting the rise and fall of the markets. As increasing numbers of financial institutions jump on the bandwagon of alternative data, spending on alternative data by trading and asset management firms is set to exceed $7 billion by 2020.[1]
What was only a few years ago a question of when institutions should start using data has shifted to the question of how they can organize and structure these mostly unstructured datasets. And with 4 billion webpages and 1.2 million terabytes of data on the internet estimated to be generated globally by 2025, there is no shortage of web data to sort through. As increasing numbers of investment management institutions incorporate alternative web data into their predictive algorithms, it will change the face of investment as we know it.
This white paper is intended to be a guide for investment management institutions (IMs) to better understand how alternative web data is quickly becoming an essential component for generating alpha and mitigating investment risk. In addition, it explores different models of web data crawlers and what IMs need to look for as they incorporate alternative web data into their predictive analytics models.
Section 1: Beating the Market with Alternative Web Data
“Your company’s biggest database isn’t your transaction, CRM, ERP or other internal database. Rather it’s the Web itself…Treat the Internet itself as your organization’s largest data source.” — Gartner
As previously mentioned, alternative data includes any type of data that is beyond the scope of traditional data: satellite imagery, social media data, and web data (which includes news sites, blogs, discussions and forums) along with credit card data. Alternative web data, which falls under the broader category of big data, is typically unstructured and demands a process for structuring it in order to deliver insights.
[1] Alternative data for investment decisions: Today’s innovation could be tomorrow’s requirement. Deloitte Center for Financial Services. 2017.
In private equity, as in most industries, decision-making depends on access to accurate and valuable information. However, these firms often encounter significant challenges when sourcing reliable data, especially when dealing with small, private companies. This article dives into the complexities of identifying high-quality information on smaller companies and underscores its value in investment decisions, operational efficiency, and risk management. It also explores how advanced artificial intelligence (AI) technologies are revolutionizing the identification of these risks, leading to higher rewards and more secure investments, thus providing a competitive edge.
The Challenge of Identifying Valuable Information for Smaller Firms
Lack of valuable data
Sturgeon's Law, which states that "Ninety percent of everything is crap (or noise)," becomes particularly relevant in the context of data sourcing. For private equity and investment firms focused on small companies, finding the golden nuggets of information amid the overwhelming amount of digital noise can be daunting. The data available on these companies is often sparse, fragmented, and difficult to uncover using conventional methods. This scarcity of reliable information makes it challenging for private equity firms to make informed decisions, heightening the risk of overlooking critical issues that could impact their investment process.
The difficulties extend beyond just locating information. Many small companies operate without a significant online presence or may not be required to disclose as much information as publicly traded firms. This lack of transparency can further blur critical data points. Furthermore, the data that is available is often unstructured, residing in various forms such as social media posts, obscure local news articles, or industry-specific reports. Extracting meaningful insights from these disparate sources requires sophisticated data processing capabilities, which traditional methods often lack. As a result, private equity firms are left with a significant challenge: how to separate valuable data from the noise without missing critical risk indicators, thereby optimizing their deal sourcing and investment strategies.
Diverse language and terminology
Smaller firms frequently face existential risks, and the potential rewards for identifying these risks early on can be significant for the private equity firms that invest in them. However, mainstream methods of risk identification often fall short, as these companies may not use standardized language to describe materiality. Instead, risks are discussed in varied and context-specific ways, complicating the task of recognizing relevant information. Therefore, it is essential to adopt a specialized approach that analyzes and decodes these firms' unique terminologies and business idiosyncrasies, ultimately translating them into a standardized language that can be effectively used in risk assessment.
The diversity in language is not just a barrier to risk identification but also to the communication of these risks within and between private equity firms. When a small firm uses industry-specific jargon or localized expressions to describe potential threats, it can lead to misunderstandings or underestimations of the actual risk. For instance, a manufacturing startup in a developing country might describe supply chain disruptions in terms that do not translate easily to a global investor’s risk framework. Additionally, cultural differences in how risk is perceived and reported can lead to further complications. This linguistic diversity necessitates the use of advanced natural language processing tools that can interpret data through a common lens while considering industry-specific contexts. For an insurance company, understanding financial models, insurance principles, and regulatory frameworks is crucial. Conversely, assessing risks for a beauty company requires a focus on product safety, consumer preferences, and market trends. By appreciating the specific contexts of each industry, private equity firms can better identify and evaluate potential risks, enhancing decision-making processes, risk and portfolio management strategies, and operational efficiency.
The dynamic nature of the industries themselves further complicates the challenge. For example, the tech industry evolves rapidly, with new risks emerging as technologies develop and consumer expectations shift. What might be considered a negligible risk today could become a significant issue tomorrow as regulatory landscapes, market conditions, and technological advancements alter the playing field. In contrast, industries like agriculture or real estate might have more stable risk profiles but are subject to sudden changes due to environmental factors or policy shifts. This variability across industries means that a one-size-fits-all approach to risk assessment is inadequate. Private equity firms must adopt flexible, industry-specific risk models that can adapt to the unique characteristics and evolving landscapes of the sectors they invest in, thus optimizing their AI capabilities.
The Power of AI in Enhancing Risk Management in Small Firms
AI technologies, particularly natural language processing (NLP) and machine learning algorithms, are important tools for private equity firms aiming to monitor and manage risks in small firms. These technologies can sift through vast amounts of data, extracting the valuable 10% and identifying patterns, trends, and subtle nuances in the language used to describe risks. By detecting these patterns, AI can reveal potential risks that might not be immediately apparent through traditional methods. This proactive approach to risk identification allows firms to address issues before they escalate, providing a more comprehensive and nuanced understanding of the risks facing small firms.
AI's ability to process unstructured data is particularly valuable in this context. Many of the risks that small firms face are discussed informally in places like social media, niche blogs, or local news outlets. Traditional risk management tools might overlook these sources, but AI-powered tools can analyze them in real-time, detecting emerging threats as they develop. Moreover, AI can cross-reference these insights with structured data from financial reports, regulatory filings, and other formal documents to create a holistic risk profile. This multidimensional analysis helps private equity firms not only identify risks but also understand their potential impact, enabling more informed, data-driven decision-making that enhances operational efficiency and competitive edge.
Beyond risk identification, AI also enhances risk mitigation strategies. By continuously monitoring data and learning from new information, AI systems can adapt to changing conditions, offering updated risk assessments that reflect the latest developments. This dynamic approach allows private equity firms to stay ahead of potential issues, making it possible to implement preventative measures rather than reacting to crises after they occur. In this way, AI capabilities contribute significantly to the optimization of risk management processes.
How SESAMm’s Advanced Technology Enhances Risk Assessment
SESAMm’s TextReveal® is at the forefront of this technological revolution, enabling private equity firms to efficiently navigate the vast digital landscape and extract the crucial information needed for informed decision-making. Through our proprietary data lake amounting to over 25 billion online articles with 15 years of historical data and our AI algorithms, TextReveal® can quickly identify and retrieve valuable insights, even when the information is deeply buried or highly specific. The tool's ability to analyze and understand the diverse language and terminology used in discussions about risks on the web empowers private equity firms to objectively assess the materiality of certain risks or identify emerging threats that have yet to be formally recognized.
TextReveal® goes beyond merely identifying risks—it categorizes them, providing context that helps private equity firms understand the severity and relevance of each risk. For example, if a small biotech firm is mentioned in discussions about regulatory hurdles, TextReveal® can determine whether these mentions are isolated incidents or part of a broader trend. It can also assess whether the language used suggests an imminent threat or a longer-term concern, enabling firms to prioritize their responses accordingly. Additionally, TextReveal® integrates sentiment analysis, which can gauge the overall tone of discussions surrounding a company, offering further actionable insights into potential reputational risks.
SESAMm has developed a proprietary metric, the Intensity Score, which calculates an event's relevance based on its news coverage and sentiment. It uses negative sentiment, article dispersion, and empirical ESG risk measures to determine how likely an article is to represent a high-risk controversy. The Intensity Score gives TextReveal users a clear understanding of which events require their attention.
Users can also opt to receive email alerts for the more severe controversies, ensuring they’re always aware of significant risks. In addition to the severity, controversies are also categorized by risk and sub-risk type, making it easy to analyze specific areas of concern.
Moreover, SESAMm's platform is designed to be intuitive and user-friendly, making it accessible to investment professionals who may not have a technical background. This ease of use ensures private equity firms can quickly incorporate AI-driven insights into their risk management processes without a steep learning curve. By streamlining the data analysis process, TextReveal® allows firms to focus on strategic decision-making, confident they have a comprehensive understanding of the risks and opportunities associated with their investments and portfolio companies. This level of operational efficiency and optimization is key to maintaining a competitive edge in the fast-paced world of private equity.
TextReveal’s Risk Assessment module enables deep company and thematic research in multiple languages through on-the-fly keyword searches. Users have full access to articles, sentiment analysis, and trending topics to get a complete understanding of the risks. We’ve even developed an AI Text Summary feature that provides a quick summary of a selected article, saving time and enabling a faster analysis.
In summary, the integration of AI tools and natural language processing technologies is transforming risk management in private equity, particularly for firms dealing with small, private companies. By leveraging these advanced tools, private equity firms can enhance their due diligence processes, better monitor risks and controversies, and ultimately make more informed investment decisions that lead to higher rewards and operational efficiency.
Reach out to SESAMm
TextReveal's web data analysis of over five million public and private companies is essential for keeping tabs on ESG investment risks. To learn more about how you can analyze web data or request a demo, contact one of our representatives.
In the digital age, data proliferates at an astonishing rate. From news articles to social media posts, the information explosion presents unique challenges in processing and understanding content accurately. One significant challenge is distinguishing between entities with similar or identical names in different contexts. Named entity disambiguation (NED) is a sophisticated technology within natural language processing (NLP) aimed at tackling this issue. This technology ensures that when you search for "Orange," the results accurately reflect whether you meant the color, the fruit, or the multinational corporation. This article explores the concept of NED, underscores its importance, and elaborates on how SESAMm employs this technology to stand out from other companies in the artificial intelligence (AI) landscape.
What Is Named Entity Disambiguation?
Named Entities: Defining the Basics
In data science and text processing, a named entity is defined as any real-world object that can be denoted with a proper name. This includes people like "Elon Musk," companies like "Google," and landmarks like "Mount Everest." These entities are distinct because they refer to unique individuals, organizations, or locations, unlike common nouns such as "manager" or "river," which are non-specific and can refer to many different entities globally.
Defining Named Entity Disambiguation
Named Entity Disambiguation, also known as entity linking, involves identifying which specific entity is referred to in an unstructured text when there are multiple candidates with similar names. This process utilizes a blend of machine learning, knowledge graphs, and other sophisticated NLP algorithms to analyze the text and determine which entity type is relevant in the given context. This determination is important because it affects the interpretation and subsequent processing of the information.
The Importance of Named Entity Disambiguation
The role of NED in text analysis and information processing cannot be overstated, particularly when dealing with large and complex datasets. It enables:
Refined text analytics: For tasks like sentiment analysis, precise entity recognition ensures that emotions or sentiments are accurately associated with the right entities. This is crucial for businesses to understand public perceptions of their products or services accurately.
Efficient construction of knowledge graphs: Knowledge graphs that organize and link real-world information rely heavily on NED to accurately populate and update their data. This accuracy is essential for applications like digital assistants, which use these graphs to provide informed responses to user inquiries.
How Named Entity Disambiguation Works
NED is a complex process that involves multiple steps and methodologies to accurately identify and link named entities in a given text to their correct real-world counterparts.
1. Identifying Named Entities
Before disambiguation can occur, named entities must first be identified within a text. This is typically done using named entity recognition (NER), a preliminary step that involves scanning text data to locate and classify entities into predefined categories such as person names, organizations, locations, dates, and other specific information.
Techniques Used in NER
Rule-based systems: These utilize patterns and linguistic rules, such as capitalization or context indicators (e.g., titles like Mr. or corporate designators like Inc.), to identify entities.
Statistical methods: Techniques like Hidden Markov Models (HMMs) or Conditional Random Fields (CRFs) learn from large datasets of annotated text to recognize entities based on probabilistic models.
Deep learning approaches: More recently, models based on neural networks, particularly those using architectures like LSTM (Long Short-Term Memory) or transformers, have become prevalent. These models benefit from large amounts of training data and have shown superior ability to capture context for more accurate entity recognition.
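As a rough illustration of the rule-based technique listed above, the following sketch tags entities using only capitalization plus the two context cues mentioned (titles and corporate designators). The function name, the label names, and the short cue lists are assumptions made for this example; it is not how any production NER system, including SESAMm's, is implemented.

```python
import re

# Hypothetical cue lists: corporate designators and personal titles.
CORPORATE_SUFFIXES = ("Inc.", "Corp.", "Ltd.", "LLC")
TITLES = ("Mr.", "Ms.", "Mrs.", "Dr.")

# One or more capitalized words, e.g. "Acme Widgets".
CAPITALIZED = r"[A-Z][a-zA-Z]+(?:\s[A-Z][a-zA-Z]+)*"


def find_entities(text):
    """Return (surface form, label) pairs found by two simple rules."""
    entities = []

    # Rule 1: capitalized words followed by a corporate designator.
    suffixes = "|".join(re.escape(s) for s in CORPORATE_SUFFIXES)
    for match in re.finditer(rf"{CAPITALIZED}\s(?:{suffixes})", text):
        entities.append((match.group(0), "ORGANIZATION"))

    # Rule 2: a personal title followed by capitalized words.
    titles = "|".join(re.escape(t) for t in TITLES)
    for match in re.finditer(rf"(?:{titles})\s({CAPITALIZED})", text):
        entities.append((match.group(1), "PERSON"))

    return entities


if __name__ == "__main__":
    print(find_entities("Mr. John Smith joined Acme Widgets Inc. last year."))
```

Rules like these are easy to read and audit, but they miss any entity that appears without a cue, which is one reason statistical and deep learning approaches are usually preferred at scale.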
2. Categorizing Named Entities
Once entities are identified, they need to be categorized accurately. This involves classifying each entity according to its type, which helps in narrowing down the possible meanings in the subsequent disambiguation step.
Methods for categorization
Fine-grained classification: Beyond basic categories, entities can be classified into more specific classes, such as distinguishing between types of organizations (e.g., non-profit vs. corporate) or public figures (e.g., politician vs. artist).
Contextual classification: It involves analyzing the surrounding text to understand an entity's role and relevance, using both the immediate context and broader discourse.
3. Disambiguating Named Entities
The core of NED lies in its ability to distinguish between entities that share the same name. This step is critical because it determines the accuracy of information extraction, search engines, knowledge graph construction, and other NLP applications.
Core Techniques in Disambiguation
Rule-based disambiguation: Applies heuristic rules based on linguistic cues and patterns, such as geographical proximity or typical associations (e.g., Apple might be linked to "technology" if the context involves words like "iPhone" or "MacBook").
Machine learning models: Supervised learning models are trained on datasets where each entity is annotated with its correct reference. These models learn to predict the correct entity based on features extracted from the context.
Unsupervised and semi-supervised methods: These involve clustering similar entities and using algorithms to predict the most likely meaning based on the densities of clusters and the contextual similarity.
Knowledge-based approaches: Utilize large external databases or knowledge graphs that contain information about entities and their relationships. By querying these resources, NED systems can pull contextual information and metadata to resolve ambiguities. For example, linking to a specific Wikipedia page can clarify whether "Jordan" refers to the country, the river, or the basketball player, based on the context.
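The knowledge-based idea can be sketched with the "Jordan" example: each candidate entity carries a set of context words, and the candidate sharing the most words with the surrounding text wins. The toy knowledge base, its context words, and the `disambiguate` function are all hypothetical. Real systems query large knowledge graphs and typically combine this kind of evidence with learned models.

```python
import re

# Toy knowledge base (an assumption for this example): each candidate
# meaning of a mention is described by a handful of context words.
KNOWLEDGE_BASE = {
    "Jordan": {
        "Jordan (country)": {"amman", "kingdom", "border", "government"},
        "Jordan River": {"river", "water", "valley", "banks", "flows"},
        "Michael Jordan": {"basketball", "nba", "bulls", "player", "championship"},
    }
}


def disambiguate(mention, text):
    """Pick the candidate whose context words overlap most with the text.

    Returns None when the mention is unknown or no candidate shares
    any word with the text.
    """
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    candidates = KNOWLEDGE_BASE.get(mention, {})
    scores = {name: len(words & tokens) for name, words in candidates.items()}
    if not scores or max(scores.values()) == 0:
        return None
    return max(scores, key=scores.get)


if __name__ == "__main__":
    sentence = "Jordan scored 40 points as the Bulls won the NBA championship."
    print(disambiguate("Jordan", sentence))
```

Returning `None` when there is no overlapping evidence is a deliberate choice in this sketch: an unresolved mention is usually safer than a confidently wrong link.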
4. Linking Entities to External Databases or Knowledge Graphs
The final step in NED is often linking the disambiguated entity to a unique identifier in an external database or a node in a knowledge graph. This linkage not only confirms the entity’s identity but also enriches the text with semantic information that can be used for further processing and analysis.
Linkage methods
URI Assignment: Each entity is assigned a unique resource identifier (URI) that points to a specific location in a database or a knowledge graph.
Semantic tagging: Entities are tagged with semantic labels that provide additional metadata, enhancing the richness of the data for subsequent analytical tasks.
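A minimal sketch of the two linkage methods above, assuming a made-up namespace (`https://example.org/entity/`) and made-up identifiers and tag names: the disambiguated mention receives a URI, and semantic tags are attached as plain key-value metadata.

```python
from dataclasses import dataclass, field

# Hypothetical namespace; a real system would point at its own
# database or at a public knowledge graph.
BASE_URI = "https://example.org/entity/"


@dataclass
class LinkedEntity:
    surface_form: str  # the text as it appeared in the document
    uri: str           # unique identifier of the resolved entity
    tags: dict = field(default_factory=dict)  # semantic metadata


def link_entity(surface_form, entity_id, **tags):
    """Assign a URI to a disambiguated mention and attach semantic tags."""
    return LinkedEntity(surface_form, BASE_URI + entity_id, dict(tags))


if __name__ == "__main__":
    entity = link_entity("Orange", "orange-sa", type="ORGANIZATION")
    print(entity.uri, entity.tags)
```

Because every mention of the same entity resolves to the same URI, downstream steps can aggregate articles per entity rather than per spelling.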
The combination of these techniques ensures that NED systems can operate with high accuracy and efficiency, making them indispensable in the field of NLP. By understanding and implementing these processes, SESAMm enhances its analytical capabilities, offering precise and context-aware solutions that stand out in the competitive AI landscape.
SESAMm's Innovative Approach to NED
SESAMm has carved a niche in the NLP field by incorporating advanced, proprietary technologies that refine and enhance the NED process:
Cutting-edge algorithms: SESAMm develops and deploys state-of-the-art machine learning approaches and deep learning algorithms designed to increase the precision and reliability of entity disambiguation.
Scalable data processing: SESAMm's platforms are engineered to handle extensive data volumes, making them well-suited for large-scale industrial applications that require robust data analysis capabilities.
Customizable APIs: SESAMm offers adaptable APIs that clients can tailor to fit specific project requirements, whether for financial analysis, marketing research, or other specialized areas.
Seamless knowledge graph integration: By integrating its NED processes with dynamic knowledge graphs, SESAMm enhances its semantic analysis capabilities, enabling deeper insights and more accurate data interpretations.
Conclusion
Named Entity Disambiguation is a fundamental component of modern NLP applications, essential for interpreting the enormous volumes of data generated daily. By accurately identifying and categorizing named entities, NED not only deepens the understanding of text but also improves the efficiency of information processing. SESAMm's approach to NED sets it apart in the AI analytics field, pushing the boundaries of what's possible with smart, context-aware technology solutions. To learn more about SESAMm’s innovative technology and how it is used to identify ESG controversies, request a demo.
In this issue of the "what investors ought to know about…" series, we'll cover natural language processing (NLP), a tool that draws from the computer science and computational linguistics disciplines. In the last topic, we discussed knowledge graphs as the core of text analysis. And if knowledge graphs are the core of the data’s context, NLP is the transition to understanding the data.
What is natural language processing?
Natural language processing is an artificial intelligence (AI) technology that automates the analysis of unstructured text data. It includes natural language understanding, in which algorithms perform linguistic analysis so a machine can "read" text, and natural language generation, which simulates a human's ability to create language. It combines computational linguistics with machine learning and deep learning models.
Where is natural language processing used?
Today, NLP powers applications across many industries, from email filters and virtual assistants to search engines and chatbots. Here's a list of common ways natural language processing is used:
Chatbots: Chatbots are computer programs that use NLP. They simulate human conversation by identifying a sentence's intent, determining suitable topics, keywords, and emotions, and calculating the best response based on the data's interpretation.
Email filters: Email filters apply machine learning using many data samples to sort emails into the right inbox.
Machine translation: Translation software like Google Translate or Microsoft Translator uses NLP to translate text from one language to another, such as English to French.
Natural language generation (NLG): NLG, a subfield of NLP, builds applications or computer systems that can automatically produce natural language texts of various types by using a semantic representation as input. Applications of NLG include question answering and text summarization.
Predicting and autocorrecting text: Predictive text and autocorrect use NLP to recognize and recall commonly used words and names to make text suggestions and correct common errors.
Search engines: Search engines like Google Search use NLP and machine learning to interpret a searcher's intent and provide relevant results. They can even suggest subjects and topics related to the query that the searcher might be interested in.
Virtual and voice assistants: Virtual assistants like Apple's Siri or Amazon's Alexa use NLP technology to understand and respond to voice requests. Speech-to-text can dictate messages and notes, and speech recognition can control everything from smartphone apps and smart speakers to thermostats and home security systems.
Web sentiment analysis: Sentiment analysis automates classifying opinions in a text as positive, negative, or neutral. It's a method companies like SESAMm use to monitor how a brand is perceived on the web and social media.
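To make the last item in the list concrete, here is a deliberately simple lexicon-based sentiment classifier. The word lists and the function name are assumptions for this example. Production systems generally rely on trained models rather than fixed word lists, so this only shows the positive, negative, and neutral labeling idea.

```python
import re

# Tiny illustrative lexicons (assumptions, not a real sentiment resource).
POSITIVE = {"good", "great", "strong", "growth", "improve", "excellent"}
NEGATIVE = {"bad", "poor", "weak", "lawsuit", "scandal", "decline"}


def classify_sentiment(text):
    """Label a text by counting positive and negative lexicon words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(classify_sentiment("The brand reported strong growth this quarter."))
    print(classify_sentiment("A lawsuit and a scandal hurt the brand."))
```

A word-counting approach like this cannot handle negation or sarcasm ("not good" still counts as positive), which is the kind of ambiguity that model-based NLP is meant to resolve.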
Why natural language processing is important to uncover financial-related alternative data
NLP is important because it helps resolve human language ambiguity in big datasets (big data). Languages are complex and diverse: there are hundreds of languages and dialects, each with its own grammar and syntax rules, slang, and terms. In text form, all of this is unstructured data. But with NLP, we can transform unstructured data into structured data and make sense of it.
Because of NLP's power, investors can research and analyze unstructured data from the web to gain insights into financial and ESG data. You can use this wealth of information to focus on systematic data processing, risk management, and alpha discovery through contexts, such as:
Major global indices sentiment
Euronext exchange sentiment
Private company sentiment
ESG risks for public and private companies worldwide
A quick overview of how natural language processing works at SESAMm
At SESAMm, we use named entity recognition (NER), which extracts the names of people, places, and other entities from text, and then named entity disambiguation (NED) to identify named entities based on their context and usage. For example, text referencing "Elon" could refer indirectly to Tesla through its CEO or to a university in North Carolina. NED considers the context when classifying entities for an accurate match. NED is superior to simple pattern matching, which limits the number of possible matches, requires frequent manual adjustments, and can't distinguish homonyms.
Process representation for NER and NED.
When identifying entities and creating actionable insights, SESAMm uses three other NLP tools: lemmatization and stemming, embeddings, and similarity. The lemmatization process normalizes a word into its base form (morphology) to help identify and aggregate entities. Embedding assigns the entity a numerical representation (a vector) to help analyze how words change meaning depending on context and understand the subtle differences between words that refer to the same concept. Similarity measures whether two words, sentences, or objects are close to one another in meaning.
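The relationship between embeddings and similarity can be sketched with cosine similarity, a common (though not the only) way to compare vectors. The three-dimensional vectors below are invented by hand for illustration. Real embeddings are learned from data and have far more dimensions, and nothing here reflects SESAMm's actual models.

```python
import math

# Hand-made toy vectors (assumptions): related words point in similar
# directions, unrelated words do not.
EMBEDDINGS = {
    "profit": [0.9, 0.1, 0.0],
    "earnings": [0.8, 0.2, 0.1],
    "river": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    close = cosine_similarity(EMBEDDINGS["profit"], EMBEDDINGS["earnings"])
    far = cosine_similarity(EMBEDDINGS["profit"], EMBEDDINGS["river"])
    print(f"profit vs earnings: {close:.2f}")
    print(f"profit vs river:    {far:.2f}")
```

In this toy setup "profit" and "earnings" score much closer to each other than "profit" and "river", which is the behavior the similarity step relies on.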
Representation of nodes in a knowledge graph.
Of course, NLP couldn't function without the core of the text analytics process: knowledge graphs. A knowledge graph is a digital representation of a network of real-world entities, the foundation of a search engine or question-answering service. This structured data model puts the schema in context through semantic metadata and linking, providing a framework for analytics, data integration, sharing, and unification. In other words, it's like a map and legend, with the legend labeling the concepts, entities, and events and the map connecting and identifying their relationships. These details are stored in a graph database and visualized as a graph representation, hence the term knowledge graph.
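As a small sketch of the "map and legend" idea, a knowledge graph can be modeled as a set of (subject, relation, object) triples. The class, the relation names, and the handful of facts below are assumptions chosen to mirror the "Elon" example. A production graph would live in a graph database with a much richer schema.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Minimal in-memory graph of (subject, relation, object) triples."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add(self, subject, relation, obj):
        """Store one labeled edge from subject to obj."""
        self._edges[subject].append((relation, obj))

    def neighbors(self, subject, relation=None):
        """Return objects linked from subject, optionally filtered by relation."""
        return [
            obj
            for rel, obj in self._edges.get(subject, [])
            if relation is None or rel == relation
        ]


if __name__ == "__main__":
    graph = KnowledgeGraph()
    graph.add("Elon Musk", "ceo_of", "Tesla")
    graph.add("Elon Musk", "type", "Person")
    graph.add("Elon University", "located_in", "North Carolina")
    print(graph.neighbors("Elon Musk", "ceo_of"))
    print(graph.neighbors("Elon University"))
```

The entity names play the role of the legend, and the labeled edges play the role of the map: following the `ceo_of` edge is what lets a mention of the person be connected to the company.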
SESAMm's natural language processing platform for investment research and analysis
SESAMm is the leading provider of natural language processing and machine learning solutions and analytics for investment firms and corporations.