Serverless Data Engineering
Building products using AI and big data can get complicated, quickly. Combining a specialised AI model with a robust engineering solution can present a big challenge, so we’ve come up with a new way to approach the problem.
We've pioneered a component-based approach to data engineering, building microservices instead of monoliths. Our pipelines are formed from a suite of cloud native components, strung together in a workflow to form a complete end-to-end solution.
These components allow us to deliver robust, dependable AI solutions quickly, whilst still providing agility and flexibility to tailor each application to your business needs.
Elements of Data Engineering
Creating products from reusable components accelerates delivery and increases reliability. To that end, every new piece of functionality required by a Datasparq product or service gets developed as a reusable atomic component, in the form of either an API or a program.
Each component addresses an elemental principle of data engineering. All of our components are production-grade and have been tested on multiple projects to ensure reliability.
Houston is our orchestrator. It’s the glue that connects every other component and ensures that operations happen in the correct order, and only as and when necessary. Houston is the first orchestration tool to enable completely serverless and platform-agnostic workflows.
It can be found at the core of most Datasparq products, whether running ML pipelines or automating simpler data cleaning tasks, and is now available for all to use at callhouston.io.
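To illustrate what "the correct order, and only as and when necessary" means in practice, here is a minimal sketch of dependency-ordered execution. It is not Houston's API: the stage names, the PIPELINE mapping and the run_pipeline function are all hypothetical, and the ordering uses Python's standard graphlib module rather than anything specific to Houston.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each stage maps to the set of stages it depends on.
PIPELINE = {
    "ingest": set(),
    "clean": {"ingest"},
    "train": {"clean"},
    "report": {"clean"},
    "publish": {"train", "report"},
}


def run_pipeline(stages, already_done=frozenset()):
    """Return the stages to execute, in dependency order, skipping completed ones."""
    executed = []
    for stage in TopologicalSorter(stages).static_order():
        if stage in already_done:
            continue  # only run a stage when it is actually needed
        # A real orchestrator would trigger a service here; this sketch only records the name.
        executed.append(stage)
    return executed


order = run_pipeline(PIPELINE)
resumed = run_pipeline(PIPELINE, already_done={"ingest", "clean"})
```

In this sketch, a full run always starts with ingest and ends with publish, while a resumed run skips the stages that have already completed.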
Reliable, repeatable, automated.
We hold our code and deployments to the highest standards, never accepting less than continuous integration, continuous delivery, strict version control, and rigorous testing.
Builder manages environments, builds, configurations, infrastructure and unit tests to ensure repeatable automated deployments. These deployments are subject to strict testing and review, and give us the means to carry out maintenance and changes to production systems with confidence.
Builder also allows us to deploy each of our components into a new project with a single click. Infrastructure is provisioned from Terraform modules, minimising administration and allowing for rapid prototyping.
Our approach to data manipulation has been informed by our experience delivering dozens of big data projects. Not all data projects are the same, but many have similarities, and we have found the use of templates to be vital in accelerating data pipeline development.
Pattern is an extensive library of templates and utilities for running common ETL tasks, such as data ingestion, cleaning and transformations. The library supports Google Cloud Storage, Dataflow, BigQuery, Cloud SQL, and Firestore.
"Anything untested is broken"
The value you can derive from insight or predictions will depend on the quality of the data used. Xu is a cloud native expectation testing service built to monitor big data and machine learning workflows. It makes data testing simple and explainable for both engineers and business users. Tests can be quickly configured from templates to cover all aspects of data quality and consistency testing in projects.
Inbuilt statistical tests, such as the KS test, keep models healthy by identifying small deviations in the distribution of input features, while validation tests will alert when certain data isn’t within expected parameters.
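The two kinds of test described above can be sketched in a few lines of plain Python. This is an illustration of the general technique, not Xu's implementation: the function names and the 0.3 drift threshold are assumptions made for the example, and a production test would normally turn the KS statistic into a p-value rather than compare it to a fixed cut-off.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    largest_gap = 0.0
    for x in ref + cur:
        # Fraction of each sample that is <= x, i.e. the empirical CDF at x.
        ref_cdf = bisect_right(ref, x) / len(ref)
        cur_cdf = bisect_right(cur, x) / len(cur)
        largest_gap = max(largest_gap, abs(ref_cdf - cur_cdf))
    return largest_gap


def has_drifted(reference, current, threshold=0.3):
    """Flag a feature whose distribution has moved away from the reference sample."""
    # The threshold is an illustrative assumption, not a recommended value.
    return ks_statistic(reference, current) > threshold


def out_of_range(values, low, high):
    """Validation test: return the values that fall outside the expected bounds."""
    return [v for v in values if not low <= v <= high]
```

Identical samples give a statistic of 0 and completely separated samples give 1, so the value can be read as how far the current data has shifted from the data the model was trained on.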
Xu supports multiple outputs, including BigQuery, Cloud SQL, JSON, Firestore, and a Hangouts Chat bot. Results can also be viewed via our Xu monitoring dashboard or through any BI application.
Parsing large volumes of raw data from a variety of sources and returning it in a usable format for your software can be a challenge in big data pipelines. Encode leverages Google Cloud Dataflow to ingest data from other systems such as SAP to drastically accelerate data ingress, leaving you more time to focus on the data itself.
Taking machine learning models from notebook to production is never straightforward, but it is a process we’ve honed. Our team of data scientists and engineers has worked closely together to develop Judge, a tool that can deploy multiple ML frameworks with minimal configuration.
Judge can manage anything from simple Docker containers running models to Kubernetes clusters for ML training and batch prediction. Using Judge allows you to create and destroy clusters and ML frameworks in your environment with ease, whilst helping to reduce the management time and costs usually associated with heavy-duty ML applications.
Managing credentials between several cloud services can prove to be a real pain point for many users, but is vital to the security of your data. Connect is a Python-based authentication middleware for protecting cloud services using Google Cloud IAP (Identity-Aware Proxy) or Azure AD. Access control can be managed across platforms using one component, making it much more straightforward to ensure fine-grained resource access for all users.
Connect also allows any software with its own authentication policy to interface securely with Datasparq components, allowing functions to access protected services securely for automating tasks.
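As a rough picture of what an authentication middleware does, the sketch below wraps a standard Python WSGI application and rejects any request that does not carry a valid token. It is not Connect's code: require_signed_header, protected_app and call are hypothetical names, and verify_token is a placeholder supplied by the caller. The default header is the one Identity-Aware Proxy attaches to requests (x-goog-iap-jwt-assertion); a real deployment would need to validate that token's signature and audience, which this sketch does not do.

```python
def require_signed_header(app, verify_token, header="HTTP_X_GOOG_IAP_JWT_ASSERTION"):
    """Wrap a WSGI app so that requests without a valid token are rejected."""
    def middleware(environ, start_response):
        token = environ.get(header)
        # verify_token is a caller-supplied placeholder; no real JWT checks happen here.
        if not token or not verify_token(token):
            start_response("401 Unauthorized", [("Content-Type", "text/plain")])
            return [b"Unauthorized"]
        return app(environ, start_response)
    return middleware


def protected_app(environ, start_response):
    """Hypothetical service sitting behind the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


def call(app, environ):
    """Small helper for trying the sketch: invoke a WSGI app, return (status, body)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body


# Stand-in verifier for demonstration only: accepts a single hard-coded token.
guarded = require_signed_header(protected_app, lambda token: token == "good")
```

Keeping the check in one wrapper means the protected service contains no authentication logic of its own, which is the appeal of managing access control through a single component.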
"If it's not documented it doesn't exist"
Meta produces a data catalogue that helps to navigate and make sense of big data projects, and automatically populates BigQuery table metadata. There are numerous integration opportunities with Meta; for example, when used with Pattern, it can automate the production of documentation for ETL pipelines. This helps to alleviate uncertainty and distrust in data sources. Meta helps bring confidence and an improved understanding of the data you have available to boost the analytical capabilities of business users.
The core of many of our products is a machine learning model. Colosseum provides utilities to speed up your ML tasks, from training, validation and testing all the way through to evaluating ongoing model performance and retraining. It also provides more specialist options for specific tasks, such as image recognition or time series prediction.
Dependable monitoring is vital throughout all phases of deployment, not just in production. Monitor offers high-reliability error catching, alerting and pipeline performance metrics for every phase of development, with a variety of output formats; we like to receive our alerts via email or chat bots for ease of viewing.
If you’re interested in a new way of working with Google Cloud Platform which improves reliability, reduces complexity and gets to results, faster, get in touch via datasparq.ai/contact.