
In the world of finance, traditional models have predominantly relied on market prices and historical financial data to predict future trends and assess risk. However, the advent of alternative data has revolutionized this field. Alternative data includes a variety of non-traditional data sources such as news articles, social media posts, satellite imagery, audio recordings, and video content. These data sources provide additional insights that can help in making more informed decisions and improving the accuracy of predictive models.

Overview

This article will guide you through the process of creating numerical representations of alternative data and incorporating these representations into existing financial models. We will focus on a simplified method to create such models using an autoregressive approach combined with traditional market price data. By the end of this guide, you will understand how to:

  1. Create numerical representations (embeddings) of alternative data.
  2. Integrate these embeddings into financial models to assess and forecast risk.

Using Alternative Data in Numerical Models

To incorporate alternative data into our models, we convert these diverse data types into numerical representations, known as embeddings. These embeddings allow us to integrate qualitative data into quantitative models, making them actionable and comparable with traditional financial data.

  1. Text Data: By processing text from news articles or social media, we can capture market sentiment and public opinion. For instance, we use advanced models like BERT or GPT-3 to transform sentences into numerical vectors that reflect their meaning and sentiment.
  2. Audio Data: Speech from earnings calls or CEO interviews is converted into text using speech-to-text technologies. Then, we analyze the sentiment and key topics of these transcripts, turning them into numerical scores.
  3. Image Data: Satellite images of parking lots at major retailers can be analyzed to estimate store traffic. We use pre-trained neural networks like ResNet to extract relevant features from these images, converting them into numerical data.

Integrating into Financial Models

These numerical representations are then combined with traditional market data in our predictive models. For example, we might use an autoregressive model that predicts future volatility by considering both historical price data and sentiment scores derived from news articles. This holistic approach allows us to create more accurate and robust risk assessments, ultimately leading to better investment decisions and risk management.
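To make the idea concrete, here is a minimal sketch with synthetic data: returns are generated from an AR(1) process plus a lagged "sentiment" effect, and plain least squares recovers both coefficients. The coefficient values (0.3 and 0.1) and the data are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily returns driven by yesterday's return (autoregressive term)
# and yesterday's news sentiment (exogenous term)
n = 500
sentiment = rng.normal(0, 1, n)
returns = np.zeros(n)
for t in range(1, n):
    returns[t] = 0.3 * returns[t - 1] + 0.1 * sentiment[t - 1] + rng.normal(0, 0.01)

# Ordinary least squares on [intercept, lagged return, lagged sentiment]
X = np.column_stack([np.ones(n - 1), returns[:-1], sentiment[:-1]])
y = returns[1:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
# coef[1] recovers the AR coefficient (~0.3), coef[2] the sentiment effect (~0.1)
```

The same structure carries over to the real models later in this article: lagged market data and lagged alternative-data features enter one regression side by side.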

Using Alternative Data as Numerical Representations

Creating embeddings for text, audio, and images involves converting these data types into numerical representations that can be processed by machine learning models. Various models and techniques can be used to generate these embeddings, each with its own advantages and applications. Here, we will review some of the most popular models and approaches for text, audio, and image embeddings.

Text Embeddings

Text embeddings are numerical representations of text that capture semantic information. Several models are commonly used to create text embeddings:
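Because embeddings are just vectors, semantic closeness can be measured geometrically, most often with cosine similarity. The tiny 4-dimensional vectors below are invented for illustration; real models produce hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative toy "embeddings" (real models produce e.g. 100-768 dimensions)
v_bullish = np.array([0.9, 0.1, 0.4, 0.2])
v_positive = np.array([0.8, 0.2, 0.5, 0.1])
v_bearish = np.array([-0.7, 0.6, -0.2, 0.3])

cosine_similarity(v_bullish, v_positive)  # close to 1: similar meaning
cosine_similarity(v_bullish, v_bearish)   # much lower: dissimilar
```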

Word2Vec

Word2Vec is an early and influential model for generating word embeddings. It uses two architectures: Continuous Bag of Words (CBOW) and Skip-Gram. Word2Vec captures semantic relationships between words by training on large text corpora.

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens
sentences = [["I", "love", "machine", "learning"], ["Word2Vec", "is", "great"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
word_vector = model.wv['machine']

GloVe (Global Vectors for Word Representation)

GloVe is another popular method for generating word embeddings. It captures global statistical information by training on the co-occurrence matrix of words in a corpus.

import gensim.downloader as api

## Load pre-trained GloVe model
glove_model = api.load("glove-wiki-gigaword-100")
word_vector = glove_model['machine']

BERT (Bidirectional Encoder Representations from Transformers)

BERT is a transformer-based model that generates context-aware embeddings. It is pre-trained on large text corpora and can be fine-tuned for specific tasks.

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

## Example text
text = "Machine learning is fascinating."
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)
embedding = outputs.last_hidden_state.mean(dim=1)

GPT-3

GPT-3, developed by OpenAI, is a powerful language model that can generate high-quality embeddings for a wide range of tasks, including text classification, translation, and summarization. Its weights are not publicly downloadable; embeddings are obtained through OpenAI's hosted API. The snippet below therefore uses GPT-2, its open-source predecessor in the same model family, as a locally runnable stand-in.

from transformers import GPT2Tokenizer, GPT2Model

## GPT-2 as an open-source, locally runnable stand-in for the GPT family
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

## Example text
text = "Machine learning is fascinating."
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)
embedding = outputs.last_hidden_state.mean(dim=1)

Audio Embeddings

Audio embeddings capture the features of audio signals and convert them into numerical representations. Common models include:

MFCC (Mel-Frequency Cepstral Coefficients)

MFCCs are widely used in speech and audio processing. They capture the short-term power spectrum of a sound signal.

import librosa

## Load audio file
audio_path = 'path/to/audio.wav'
y, sr = librosa.load(audio_path, sr=None)

## Compute MFCC features
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

VGGish

VGGish is a convolutional neural network pre-trained on a large audio dataset. It extracts features from raw audio signals and generates embeddings.

import tensorflow_hub as hub
import librosa

## Load VGGish model from TensorFlow Hub
vggish_model = hub.load('https://tfhub.dev/google/vggish/1')

## Load the audio at the 16 kHz sample rate VGGish expects
audio_path = 'path/to/audio.wav'
y, sr = librosa.load(audio_path, sr=16000)
embedding = vggish_model(y)

Wav2Vec 2.0

Wav2Vec 2.0 is a transformer-based model developed by Facebook AI. It learns representations directly from raw audio data and is particularly effective for speech recognition tasks.

from transformers import Wav2Vec2Tokenizer, Wav2Vec2Model
import librosa

tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

## Load and preprocess audio file (wav2vec 2.0 expects 16 kHz input)
audio_path = 'path/to/audio.wav'
y, sr = librosa.load(audio_path, sr=16000)
inputs = tokenizer(y, return_tensors="pt", sampling_rate=sr)
outputs = model(**inputs)
embedding = outputs.last_hidden_state.mean(dim=1)

Image Embeddings

Image embeddings convert visual information into numerical representations. Common models include:

Convolutional Neural Networks (CNNs)

CNNs are the standard approach for generating image embeddings. Pre-trained models like ResNet, VGG, and Inception are often used.

from torchvision import models, transforms
from PIL import Image
import torch

## Load pre-trained ResNet and drop its classifier head, so the output
## is a 2048-dimensional feature vector rather than class logits
resnet = models.resnet50(weights="IMAGENET1K_V1")
resnet.fc = torch.nn.Identity()
resnet.eval()

## Preprocess the image
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

## Load the image
image = Image.open('path/to/image.jpg')
image = transform(image).unsqueeze(0)

## Extract features
with torch.no_grad():
    image_features = resnet(image).numpy()

ViT (Vision Transformer)

ViT is a transformer-based model for image classification. It treats images as sequences of patches and processes them similarly to text.

from transformers import ViTImageProcessor, ViTModel
from PIL import Image

## Load pre-trained ViT model
feature_extractor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
model = ViTModel.from_pretrained('google/vit-base-patch16-224')

## Load and preprocess image
image = Image.open('path/to/image.jpg')
inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
embedding = outputs.last_hidden_state.mean(dim=1)

CLIP (Contrastive Language-Image Pre-Training)

CLIP is a model developed by OpenAI that learns joint representations of images and text. It can generate embeddings for both modalities.

from transformers import CLIPProcessor, CLIPModel
from PIL import Image

## Load pre-trained CLIP model
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

## Load and preprocess image
image = Image.open('path/to/image.jpg')
inputs = processor(text=["a photo of a cat"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
image_embedding = outputs.image_embeds

Conclusion

The choice of model for creating embeddings depends on the type of data and the specific application. Word2Vec, GloVe, BERT, and GPT-3 are popular for text embeddings, while MFCC, VGGish, and Wav2Vec 2.0 are commonly used for audio embeddings. For image embeddings, CNNs, ViT, and CLIP are widely used. Each model has its strengths, and the best choice often depends on the specific requirements of the task at hand.

Additive Autoregressive Alternative Data

Creating a volatility forecast model that is autoregressive on the stock price and incorporates various types of events (audio, video, news, or images) involves several steps. We need to:

  1. Gather and preprocess stock price data.
  2. Gather and preprocess event data.
  3. Extract features from event data.
  4. Combine stock price and event features.
  5. Build and train the autoregressive model.

Here is a step-by-step outline with code snippets to demonstrate how you could achieve this:

Step 1: Gather and Preprocess Stock Price Data

We'll use historical stock price data to forecast volatility. You can use libraries like yfinance to get this data.

import yfinance as yf
import pandas as pd

## Fetch historical stock price data
## (auto_adjust=False keeps the 'Adj Close' column in recent yfinance versions)
ticker = "AAPL"
stock_data = yf.download(ticker, start="2020-01-01", end="2023-01-01", auto_adjust=False)
stock_data['Returns'] = stock_data['Adj Close'].pct_change()
stock_data = stock_data.dropna()

Step 2: Gather and Preprocess Event Data

Event data can come from various sources. For simplicity, we'll assume you have transcripts for audio and video events, and news articles as text data.

## Example news articles
news_data = [
    {"date": "2020-01-15", "text": "Apple releases new iPhone model."},
    {"date": "2020-02-20", "text": "Apple reports quarterly earnings."}
]

## Convert news_data to DataFrame
news_df = pd.DataFrame(news_data)
news_df['date'] = pd.to_datetime(news_df['date'])
news_df.set_index('date', inplace=True)

Step 3: Extract Features from Event Data

We need to convert text data into numerical features. One approach is to use text embeddings. For simplicity, let's use a pre-trained model from the transformers library.

from transformers import pipeline

## Initialize the sentiment-analysis pipeline
sentiment_pipeline = pipeline("sentiment-analysis")

## Score each article; sign the score by its label so negative
## headlines get negative values instead of a raw confidence
def signed_sentiment(text):
    result = sentiment_pipeline(text)[0]
    return result['score'] if result['label'] == 'POSITIVE' else -result['score']

news_df['sentiment'] = news_df['text'].apply(signed_sentiment)

## Combine sentiment with stock data (both are indexed by date,
## so an index-aligned join is all that is needed)
stock_data = stock_data.join(news_df['sentiment'], how='left')
stock_data['sentiment'] = stock_data['sentiment'].fillna(0)

Step 4: Combine Stock Price and Event Features

Now, let's create lagged features for both stock returns and sentiment scores to build our autoregressive model.

## Create lagged features
for lag in range(1, 6):  # Lag 1 to 5
    stock_data[f'returns_lag_{lag}'] = stock_data['Returns'].shift(lag)
    stock_data[f'sentiment_lag_{lag}'] = stock_data['sentiment'].shift(lag)

## Drop rows with NaN values created by lagging
stock_data = stock_data.dropna()

Step 5: Build and Train the Autoregressive Model

We'll use a simple linear regression model for demonstration. You can replace this with more sophisticated models such as LSTMs or GRUs.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

## Forecast realized volatility (21-day rolling std of returns) rather
## than the returns themselves, to match the stated goal
stock_data['Volatility'] = stock_data['Returns'].rolling(window=21).std()
stock_data = stock_data.dropna()

## Prepare the features and target variable
features = [col for col in stock_data.columns if 'lag' in col]
X = stock_data[features]
y = stock_data['Volatility']

## Split chronologically; shuffling a time series would leak future
## information into the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

## Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

## Make predictions and evaluate the model
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print(f'Mean Squared Error: {mse}')

Handling Audio, Video, and Images

For handling audio, video, and images, you can use pre-trained models to extract features. For audio, use a speech-to-text model to get transcripts and then follow the same process as above. For video, extract key frames and use image recognition models to get features.
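The key-frame step for video can be sketched as follows. The `sample_keyframes` helper is hypothetical and operates on synthetic NumPy frames here; in practice you would decode real frames first, for example with OpenCV's `cv2.VideoCapture`, and then pass the selected frames to an image model such as the ResNet below.

```python
import numpy as np

def sample_keyframes(frames, every_n=30):
    """Keep every n-th frame (e.g., one frame per second at 30 fps)."""
    return [frame for i, frame in enumerate(frames) if i % every_n == 0]

# Synthetic "video": 90 tiny RGB frames standing in for decoded video frames
frames = [np.zeros((8, 8, 3), dtype=np.uint8) for _ in range(90)]
keyframes = sample_keyframes(frames, every_n=30)
len(keyframes)  # 3 frames: indices 0, 30, 60
```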

Example for Audio:

## Assuming you have an audio file and you have used a speech-to-text service to get the transcript
transcript = "Apple announces the launch of a new product."

## Use the same sentiment analysis as above
audio_sentiment = sentiment_pipeline(transcript)[0]['score']

Example for Images:

Use an image recognition model like a pre-trained ResNet to extract features.

from torchvision import models, transforms
from PIL import Image
import torch

## Load pre-trained ResNet and drop its classifier head, so the output
## is a 2048-dimensional feature vector rather than class logits
resnet = models.resnet50(weights="IMAGENET1K_V1")
resnet.fc = torch.nn.Identity()
resnet.eval()

## Preprocess the image
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

## Load the image
image = Image.open('path_to_image.jpg')
image = transform(image).unsqueeze(0)

## Extract features
with torch.no_grad():
    image_features = resnet(image).numpy()

Incorporate these features into your dataset similarly to how we added sentiment scores.
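One way to do that is to reduce each high-dimensional feature vector to a scalar and join it onto the price frame by date. The column name `image_score`, the mean-activation reduction, and the data below are illustrative assumptions, not the only reasonable choices (a PCA component or a learned head would also work).

```python
import numpy as np
import pandas as pd

# Hypothetical 2048-d ResNet feature vectors keyed by image date
image_features = {
    "2020-01-15": np.random.default_rng(0).normal(size=2048),
    "2020-02-20": np.random.default_rng(1).normal(size=2048),
}

# Reduce each vector to a scalar summary (here, its mean activation)
image_df = pd.DataFrame(
    {"image_score": {pd.Timestamp(d): float(v.mean()) for d, v in image_features.items()}}
)

# Join onto a date-indexed price frame; days without images get 0
price_frame = pd.DataFrame(index=pd.date_range("2020-01-13", periods=5))
price_frame = price_frame.join(image_df, how="left")
price_frame["image_score"] = price_frame["image_score"].fillna(0)
```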

Conclusion

By following these steps, you can build an autoregressive volatility forecast model that incorporates various types of events. This approach can be further refined with more sophisticated models and feature engineering techniques tailored to the specificities of the events and stock data.

Alternative Data Risk Modelling

To model risk for portfolios instead of the price of an individual asset, we need to focus on the overall volatility of the portfolio and how it is affected by various events. The process involves similar steps but with modifications to handle multiple assets and their interactions.

Step 1: Gather and Preprocess Stock Price Data for Multiple Assets

First, we need historical price data for all assets in the portfolio.

import yfinance as yf
import pandas as pd

tickers = ["AAPL", "MSFT", "GOOGL"]

## auto_adjust=False keeps the 'Adj Close' column in recent yfinance versions
stock_data = yf.download(tickers, start="2020-01-01", end="2023-01-01", auto_adjust=False)['Adj Close']
returns = stock_data.pct_change().dropna()

Step 2: Calculate Portfolio Returns and Volatility

Next, we'll calculate the returns and volatility of the portfolio. We'll assume equal weighting for simplicity.

## Assume equal weighting
weights = [1/len(tickers)] * len(tickers)

## Calculate portfolio returns
portfolio_returns = returns.dot(weights)

## Calculate rolling volatility (e.g., 30-day rolling standard deviation)
portfolio_volatility = portfolio_returns.rolling(window=30).std()
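Note that a rolling standard deviation of daily returns is a daily volatility figure; by convention it is often annualized with the square-root-of-time rule, multiplying by √252 for trading days. A small illustration with synthetic returns:

```python
import numpy as np
import pandas as pd

# Synthetic daily portfolio returns (mean and std chosen arbitrarily)
rng = np.random.default_rng(0)
portfolio_returns = pd.Series(rng.normal(0.0005, 0.01, 250))

# 30-day rolling daily volatility, then annualized via the sqrt-of-time rule
daily_vol = portfolio_returns.rolling(window=30).std()
annualized_vol = daily_vol * np.sqrt(252)
```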

Step 3: Gather and Preprocess Event Data

Event data, like news articles, remains the same as in the previous approach. We gather and preprocess it similarly.

## Example news articles
news_data = [
    {"date": "2020-01-15", "text": "Tech sector sees major breakthroughs."},
    {"date": "2020-02-20", "text": "Federal Reserve announces rate cut."}
]

## Convert news_data to DataFrame
news_df = pd.DataFrame(news_data)
news_df['date'] = pd.to_datetime(news_df['date'])
news_df.set_index('date', inplace=True)

Step 4: Extract Features from Event Data

We need to convert text data into numerical features as before.

from transformers import pipeline

## Initialize the sentiment-analysis pipeline
sentiment_pipeline = pipeline("sentiment-analysis")

## Score each article; sign the score by its label so negative
## headlines get negative values instead of a raw confidence
def signed_sentiment(text):
    result = sentiment_pipeline(text)[0]
    return result['score'] if result['label'] == 'POSITIVE' else -result['score']

news_df['sentiment'] = news_df['text'].apply(signed_sentiment)

## Combine sentiment with portfolio volatility (both are indexed by
## date, so an index-aligned join is all that is needed)
portfolio_data = portfolio_volatility.to_frame(name='volatility')
portfolio_data = portfolio_data.join(news_df['sentiment'], how='left')
portfolio_data['sentiment'] = portfolio_data['sentiment'].fillna(0)

Step 5: Create Lagged Features

Create lagged features for both portfolio volatility and sentiment scores to build our autoregressive model.

## Create lagged features
for lag in range(1, 6):  # Lag 1 to 5
    portfolio_data[f'volatility_lag_{lag}'] = portfolio_data['volatility'].shift(lag)
    portfolio_data[f'sentiment_lag_{lag}'] = portfolio_data['sentiment'].shift(lag)

## Drop rows with NaN values created by lagging
portfolio_data = portfolio_data.dropna()

Step 6: Build and Train the Autoregressive Model

We can use a regression model to predict future volatility based on lagged features.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

## Prepare the features and target variable
features = [col for col in portfolio_data.columns if 'lag' in col]
X = portfolio_data[features]
y = portfolio_data['volatility']

## Split chronologically; shuffling a time series would leak future
## information into the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

## Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

## Make predictions and evaluate the model
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print(f'Mean Squared Error: {mse}')

Handling Audio, Video, and Images

For audio, video, and images, you would extract features similarly to the previous approach. For example, for audio, you could use speech-to-text to convert audio into text, then perform sentiment analysis. For images, you could use a pre-trained CNN to extract features.

Example for Audio:

## Assuming you have an audio file and you have used a speech-to-text service to get the transcript
transcript = "Federal Reserve announces rate cut."

## Use the same sentiment analysis as above
audio_sentiment = sentiment_pipeline(transcript)[0]['score']

Example for Images:

Use an image recognition model like a pre-trained ResNet to extract features; the ResNet snippet from the previous section applies here unchanged.

Incorporating Event Features into Portfolio Data

You can add these features to your portfolio data similarly to how we added sentiment scores. This involves integrating the extracted features into the dataframe and creating lagged versions if needed.
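The per-feature lagging loops used earlier can be consolidated into a small reusable helper. `add_lags` is a hypothetical utility, not part of any library, and the `image_score` column below is an illustrative stand-in for any event-derived feature.

```python
import pandas as pd

def add_lags(df, column, n_lags=5):
    """Append lagged copies of `column` as new feature columns, then
    drop the leading rows made NaN by the shifting."""
    for lag in range(1, n_lags + 1):
        df[f"{column}_lag_{lag}"] = df[column].shift(lag)
    return df.dropna()

# Example: lag a hypothetical image-derived feature
frame = pd.DataFrame({"image_score": range(10)})
frame = add_lags(frame, "image_score", n_lags=2)
```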

Conclusion

By expanding the model to handle multiple assets and calculating portfolio volatility, we can build a comprehensive risk model for portfolios. This model can be further refined with more sophisticated techniques and a better understanding of the events' impact on the market.
