Basilica is a REST API that embeds high-dimensional data into spaces that are easy to work with.
You send us your images, text, or whatever else you have, and we send you back a vector of floats that you can feed into the traditional ML models you're already using, like logistic regression.
We use transfer learning to produce these embeddings, meaning you can get value out of small datasets that would normally be very difficult to work with.
For this tutorial, we're going to use the Basilica API to solve two common ML problems: classification and similarity search. We're going to train a classifier that distinguishes cat images from dog images, and then train a nearest neighbors model that takes a dog picture and looks for similar ones in our training set.
Without Basilica, you would need to train sophisticated models on hundreds of thousands of data points to get good performance on photographs. With Basilica, we're going to get really good results with a few thousand data points, using off-the-shelf models that have existed for decades.
Basilica has a REST API, but the easiest way to interact with it is the Python client.
You can install the Python client like so:
pip install basilica
You'll also need the dataset. You can download it, along with the completed Python files we'll be writing in this demo, here:
wget https://storage.googleapis.com/basilica-public/cats_dogs_demo.tgz
tar -xf cats_dogs_demo.tgz
Let's embed our first image, just to see how the API works. We'll use dog.1.jpg:
import basilica

API_KEY = 'SLOW_DEMO_KEY'
with basilica.Connection(API_KEY) as c:
    embedding = c.embed_image_file('images/dog.1.jpg')
print(embedding)
[0.381211, 3.11767, 0.0206775, ..., 0.403363, 0.0, 0.754328]
As you can see, we now have a vector of floats instead of an image.
We're going to be playing around with these images a lot, so let's embed all of them with the batch API, and save the embeddings:
import basilica
from six.moves import zip
import json
import os

EMB_DIR = '/tmp/basilica-embeddings/'
if not os.path.exists(EMB_DIR):
    os.mkdir(EMB_DIR)
IMG_DIR = 'images/'
API_KEY = 'SLOW_DEMO_KEY'

with basilica.Connection(API_KEY) as c:
    filenames = os.listdir(IMG_DIR)
    embeddings = c.embed_image_files(IMG_DIR + f for f in filenames)
    for filename, embedding in zip(filenames, embeddings):
        with open(EMB_DIR + filename + '.emb', 'w') as f:
            f.write(json.dumps(embedding))
        print(filename)
This should take about a minute.
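Each embedding is saved as plain JSON, so reading one back later is just a json.load. Here's a minimal sketch of that round-trip, using a temp directory and a made-up three-float vector in place of real API output:

```python
import json
import os
import tempfile

emb_dir = tempfile.mkdtemp()
fake_embedding = [0.381211, 3.11767, 0.0206775]  # stand-in for a real 2048-float embedding

# write it the same way the batch loop does
with open(os.path.join(emb_dir, 'dog.1.jpg.emb'), 'w') as f:
    f.write(json.dumps(fake_embedding))

# read it back
with open(os.path.join(emb_dir, 'dog.1.jpg.emb'), 'r') as f:
    restored = json.load(f)
```

Because the files are just JSON arrays, any language or tool that reads JSON can consume them later.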
Now that we have these embeddings, let's see how we can use them to train a traditional ML model.
Let's train a classifier that distinguishes cat pictures from dog pictures.
First, we need to import a bunch of things:
import json
import numpy as np
import os
import random
import re
import sklearn.linear_model
import sklearn.preprocessing
import time
Next, let's split the embeddings into train and test sets, then load them into numpy arrays.
EMB_DIR = '/tmp/basilica-embeddings/'

files = [f for f in os.listdir(EMB_DIR)]
random.shuffle(files)
train_size = int(len(files) * 0.8)

x_train = np.zeros((train_size, 2048))
x_test = np.zeros((len(files) - train_size, 2048))
y_train = np.zeros(train_size, dtype=int)
y_test = np.zeros(len(files) - train_size, dtype=int)

for i in range(train_size):
    filename = files[i]
    with open(EMB_DIR + filename, 'r') as f:
        x_train[i] = json.load(f)
    y_train[i] = (0 if re.match('.*cat.*', filename) else 1)

for i in range(len(files) - train_size):
    filename = files[train_size + i]
    with open(EMB_DIR + filename, 'r') as f:
        x_test[i] = json.load(f)
    y_test[i] = (0 if re.match('.*cat.*', filename) else 1)
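Note how the labels come straight from the filenames: anything containing "cat" is class 0, everything else is class 1. Pulled out as a tiny helper, the rule looks like this:

```python
import re

def label_for(filename):
    # Filenames look like 'cat.991.jpg.emb' or 'dog.1004.jpg.emb';
    # cats are class 0, dogs are class 1.
    return 0 if re.match('.*cat.*', filename) else 1
```

If your own dataset encodes labels differently, this is the only piece you'd need to change.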
Finally, let's train the classifier.
x_train = sklearn.preprocessing.normalize(x_train)
x_test = sklearn.preprocessing.normalize(x_test)

model = sklearn.linear_model.LogisticRegression()
model.fit(x_train, y_train)
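Those normalize calls L2-normalize each row, so every embedding ends up with unit length and the regression isn't swayed by differences in vector magnitude. What that does to a single vector, in plain Python:

```python
import math

def l2_normalize(vec):
    # divide by the Euclidean norm so the result has length 1
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

unit = l2_normalize([3.0, 4.0])  # norm is 5, so this becomes [0.6, 0.8]
```

sklearn.preprocessing.normalize just does this row by row across the whole matrix.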
Easy, right? Let's see how well it did:
print('Train accuracy: %.3f' % model.score(x_train, y_train))
print('Test accuracy: %.3f' % model.score(x_test, y_test))
Train accuracy: 0.990
Test accuracy: 0.987
After the test/train split, we had about 2400 datapoints in our test set. Training a classifier to 98% accuracy on 2400 photographs isn't trivial, and we just did it in fewer lines of code than it took to load the data, using an algorithm older than I am.
This is the power of transfer learning. Basilica produced this embedding using a deep neural net trained on millions of generic images, and the features it learned generalized well to our problem.
Just as a sanity check, let's take a quick look at the images where our model assigned the lowest probability to the true answer:
test_proba = model.predict_proba(x_test)
collected = zip(files[train_size:], y_test, test_proba)
probabilities = [(pred[y], f) for f, y, pred in collected]
probabilities.sort()
for prob, filename in probabilities[:3]:
    print('%s: %.2f' % (filename, prob))
dog.1004.jpg.emb: 0.10
cat.991.jpg.emb: 0.28
dog.412.jpg.emb: 0.40
Not bad. It seems like we're making reasonable mistakes.
One of the great things about embeddings is that they're general. You can use the same embedding for lots of different machine learning tasks.
Let's train a model that takes an image of a dog and finds similar dog images in our dataset.
First, we import a bunch of things again:
import basilica
import json
import numpy as np
import os
import random
import re
import sklearn.decomposition
import sklearn.neighbors
import sklearn.preprocessing
import time
Next, we load up the embeddings for all of the dogs:
EMB_DIR = '/tmp/basilica-embeddings/'

files = [f for f in os.listdir(EMB_DIR) if re.match('.*dog.*', f)]
random.shuffle(files)

signatures = np.zeros((len(files), 2048))
for i, filename in enumerate(files):
    with open(EMB_DIR + filename, 'r') as f:
        signatures[i] = json.load(f)
Next, let's fit a nearest neighbors model. The number of dimensions you want for nearest neighbors search is usually much smaller than the number you want for classification, so we're going to PCA down to 200 dimensions first:
scaler = sklearn.preprocessing.StandardScaler(with_std=False)
pca = sklearn.decomposition.PCA(n_components=200, whiten=True)

signatures = sklearn.preprocessing.normalize(signatures)
signatures = scaler.fit_transform(signatures)
signatures = pca.fit_transform(signatures)
signatures = sklearn.preprocessing.normalize(signatures)

nbrs = sklearn.neighbors.NearestNeighbors(n_neighbors=4).fit(signatures)
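A note on that final normalize: once the rows have unit length, Euclidean nearest neighbors and cosine-similarity nearest neighbors agree, because for unit vectors ||u − v||² = 2 − 2·cos(u, v). A quick check of that identity in plain Python, with two arbitrary unit vectors:

```python
import math

def cosine(u, v):
    # cosine similarity of two vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def sq_dist(u, v):
    # squared Euclidean distance
    return sum((a - b) ** 2 for a, b in zip(u, v))

u = [0.6, 0.8]   # two unit vectors
v = [1.0, 0.0]
identity_gap = abs(sq_dist(u, v) - (2 - 2 * cosine(u, v)))
```

So the default Euclidean metric in NearestNeighbors is effectively doing a cosine-similarity search here.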
Now, let's search for similar dog pictures. Let's look for pictures similar to each of the first 3 in our dataset:
IMG_DIR = 'images/'
target_files = ['dog.1.jpg', 'dog.2.jpg', 'dog.3.jpg']
API_KEY = 'SLOW_DEMO_KEY'

with basilica.Connection(API_KEY) as c:
    targets = np.array(list(c.embed_image_files(IMG_DIR + f for f in target_files)))

targets = sklearn.preprocessing.normalize(targets)
targets = scaler.transform(targets)
targets = pca.transform(targets)
targets = sklearn.preprocessing.normalize(targets)

_, all_indices = nbrs.kneighbors(targets)
for indices in all_indices:
    print(' '.join(files[i] for i in indices))
dog.1.jpg.emb dog.283.jpg.emb dog.603.jpg.emb dog.597.jpg.emb
dog.2.jpg.emb dog.849.jpg.emb dog.334.jpg.emb dog.586.jpg.emb
dog.3.jpg.emb dog.1130.jpg.emb dog.1157.jpg.emb dog.607.jpg.emb
You'll notice that we're searching for similar dog pictures, not pictures of similar dogs. Our code thinks that environmental features like cages and fences are just as important as the features of the dogs themselves.
If we wanted to instead cluster similar dog pictures together, we'd need a larger dataset with better labeling, so that we could look for features that distinguish dogs from each other, but not from pictures of the same dog against different backgrounds.
You probably noticed in the previous examples that we were treating each float in the embedding (or in the PCA-ed embedding) as a single feature. You might be wondering: what if I have my own features?
Embeddings play nice with hand-designed features, and with each other. If you were training a tweet classifier, for example, you could combine hand-crafted features like the time of day, Basilica's English language embedding of the tweet, and Basilica's image embedding of any pictures in the tweet, into a single set of features that you feed into a regression.
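Concretely, combining them is just concatenation: each value becomes one more column in your feature matrix. A sketch with made-up numbers (the embedding values here are placeholders, not real Basilica output):

```python
# hypothetical hand-designed features for one tweet
hand_features = [14.0, 1.0]           # hour of day, contains-image flag
# stand-ins for Basilica embeddings (real ones are much longer)
text_embedding = [0.12, -0.40, 0.91]
image_embedding = [0.53, 0.07]

# one feature row: plain concatenation
row = hand_features + text_embedding + image_embedding
```

Stack one such row per example (e.g. with np.vstack) and you have a matrix ready for any of the sklearn models used above.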
If you already have an existing ML pipeline that handles numeric features, and you'd like to improve it with high-dimensional data, embeddings let you add that high-dimensional data right alongside your hand-designed features, with no need to retool.