An embedding takes high-dimensional data that’s hard to work with (such as images or natural language text) and transforms it into a small vector of floats that’s easy to work with.
The most famous and widely used embedding at the moment is word2vec. Word2vec is so useful that most ML engineers working on an NLP problem will embed all of the words before doing anything else.
Our goal with Basilica is to provide the same level of convenience and ubiquity for other types of high-dimensional data, like images, through an API.
We train deep neural networks to perform a mix of tasks on private and public data. The intermediate layers of these networks learn to recognize general features in the data, which we send back to you as a vector of floats. You can read more about this technique in the paper CNN Features off-the-shelf: an Astounding Baseline for Recognition.

Why is Basilica an API, when word2vec runs locally on my CPU?

There are few enough English-language words that you can explicitly compute the embedding for each of them and store a giant lookup table.
When you’re embedding something larger, like an image, that approach no longer works. The models Basilica uses need to run on a GPU and would be painful for most users to deploy, so we provide an API instead.
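The intermediate-layer idea above can be sketched with a toy network: run the input through a feature layer and keep those activations as the embedding, stopping before the task-specific output. Everything here (the sizes, the random weights, and the `embed` helper) is a hypothetical stand-in, not Basilica's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network: the hidden activations, not the final output,
# serve as the embedding (hypothetical sizes and random weights).
W1 = rng.standard_normal((3072, 128))   # input -> general feature layer
W2 = rng.standard_normal((128, 10))     # feature layer -> task-specific head

def embed(x):
    """Return the intermediate-layer activations as the embedding."""
    hidden = np.maximum(x @ W1, 0.0)    # ReLU feature layer
    return hidden                        # stop before the task head

x = rng.standard_normal(3072)           # stand-in for a flattened image
vec = embed(x)
print(vec.shape)                         # (128,) — a small vector of floats
```

A real model like Basilica's is a large CNN, but the principle is the same: the layer just before the task-specific head encodes general-purpose features.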
Usually very little! A very common pattern is to embed your high-dimensional data with Basilica, use PCA to reduce the number of dimensions, and then feed the small number of resulting features into an existing model alongside your hand-engineered features.
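That pattern can be sketched in a few lines. The `embeddings` array below is a random stand-in for Basilica's output, and the PCA step is done from scratch with an SVD (a library implementation such as scikit-learn's `PCA` would work just as well):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: 500 embeddings (hypothetical 2048-dim vectors) plus
# 5 hand-engineered features per data point.
embeddings = rng.standard_normal((500, 2048))
hand_features = rng.standard_normal((500, 5))

# PCA via SVD: center the data, decompose, keep the top k components.
k = 32
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:k].T            # (500, 32)

# Feed the reduced embedding alongside your existing features.
X = np.hstack([reduced, hand_features])  # (500, 37)
print(X.shape)
```

The resulting `X` goes into whatever model you already have; nothing about the downstream model needs to change.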
We currently offer no latency guarantees, but our target latency is 200ms for a single data point. (We recommend embedding in batches, however.) Since the API is freely available right now, you may see latency increase when Basilica is under heavy load. If you are interested in a paid plan with latency guarantees, please contact us.
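Batching amortizes that per-request latency. A minimal sketch of the pattern, with `embed_batch` as a placeholder for one API call (not Basilica's real client interface):

```python
# Chunk data into batches before embedding; batch size of 10 and the
# embed_batch function are placeholders, not Basilica's real API.
def batches(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_batch(batch):
    # Stand-in for one API round-trip returning one vector per item.
    return [[float(len(item))] for item in batch]

images = [f"img_{i}.jpg" for i in range(25)]
vectors = []
for batch in batches(images, 10):        # 3 round-trips instead of 25
    vectors.extend(embed_batch(batch))
print(len(vectors))                       # 25
```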
Basilica is currently free to use for small workloads. We hope to continue having a free tier for as long as costs allow, but embedding large amounts of data with reasonable latency guarantees will require a paid plan.
Embeddings are a form of transfer learning, allowing you to get good results with surprisingly small datasets. In our Image Model Tutorial, we train a classifier with only 3000 data points.
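A sketch of why small datasets suffice: because the embedding already encodes general features, classes tend to form separated clusters, so even a very simple classifier does well. Here random, well-separated clusters stand in for real embeddings, and a nearest-centroid classifier stands in for the tutorial's model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for embeddings of two image classes (3000 points total,
# as in the tutorial): separated clusters in embedding space.
n, d = 1500, 64
class_a = rng.standard_normal((n, d)) + 2.0
class_b = rng.standard_normal((n, d)) - 2.0
X = np.vstack([class_a, class_b])
y = np.array([0] * n + [1] * n)

# Nearest-centroid classifier: one mean vector per class.
centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(points):
    dists = np.linalg.norm(points[:, None, :] - centroids[None], axis=2)
    return dists.argmin(axis=1)

acc = (predict(X) == y).mean()
print(acc)                                # near 1.0 on this toy data
```

On real embeddings the clusters are messier than this toy example, but the same effect is what lets a few thousand labeled points go a long way.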
This is a common question from people who are used to a traditional ML paradigm, where you have a small number of carefully-curated and semantically meaningful features. The floats in the vectors we return aren’t really meaningful features per se, but rather values that happen to be linearly recombinable into meaningful features.
In particular, if you feed a Basilica embedding into a linear regression, you will usually find that it assigns a very small weight to almost every dimension, rather than ignoring most of them.
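You can see this effect with a toy setup where the target is a linear mix of many dimensions (as embedding features tend to be) rather than a sparse handful:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: targets depend a little on *every* dimension,
# mimicking how embedding features carry distributed signal.
n, d = 1000, 128
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d) * 0.1    # small weight on every dimension
y = X @ w_true + 0.01 * rng.standard_normal(n)

# Ordinary least squares fit.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# The fit spreads small weights across nearly all dimensions
# instead of zeroing most of them out.
print(np.mean(np.abs(w) > 1e-3))
```

The printed fraction is close to 1: almost every dimension receives a small but nonzero weight.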
This will change very soon. We cut automatically transforming different image types from our timeline in order to launch quickly. (It is, however, a good idea to preprocess images before sending them to Basilica, if only to reduce their size. We recommend sending us 256x256 or 512x512 JPEGs.)
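That preprocessing step is a few lines with Pillow (assuming you have it installed; the helper name below is ours, not part of any Basilica client):

```python
from io import BytesIO

from PIL import Image

def to_jpeg_256(path_or_image):
    """Downscale an image to 256x256 and re-encode it as JPEG bytes."""
    img = (path_or_image if isinstance(path_or_image, Image.Image)
           else Image.open(path_or_image))
    img = img.convert("RGB").resize((256, 256))
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=90)
    return buf.getvalue()

# Example with an in-memory image standing in for a real file on disk.
payload = to_jpeg_256(Image.new("RGB", (1024, 768), (128, 128, 128)))
print(len(payload) < 1024 * 768 * 3)     # far smaller than the raw pixels
```

Shrinking images before upload reduces both your bandwidth and the time each request spends in flight.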