Commiting ourselves, before this year is out, of building a high-performance, low-cost petabyte-scale data platform
The objective is to build a data platform to perform acquision, peristance, processing and publishing. This system will be a combination of bought and built components.
The current landscape consists of disparate data systems. Aligning the efforts and expertise from these various systems to achieve common goals can lead to significantly better outcomes. The limitations and flaws in existing systems present an opportunity to build a superior, unified platform.
It should be made clear early that this may exist as parallel systems utilizing common tools and approaches with local embelishments to provide either technical separation or offer unique service offerings within an ecosystem, and this design does not force a single 'platform', rather a single stack with unification points.
Designing an optimal data platform with the following goals:
- Encourage Sharing: Maximize data value through easy analysis and sharing.
- Engineer Satisfaction: Enhance engineer motivation to improve customer outcomes.
- Simplicity and Speed: Develop simple systems for rapid value delivery.
- Cost Efficiency: Maintain low-cost core systems, adding expenses only where they provide value, emphasizing modular design.
- Reliable: Reliable performance and behaviour, handling errors and failures.
- Support Automations: All components are described in code, all functions are exposed as APIs.
- User Abstraction: Users shouldn't be forced to change behaviours based on internal implementations.
The purpose of this document is to describe how the opportunity to create a unified set of tools and common interfaces for the acquision, processing and publishing of data could be realized.
The system will be used as the primary data platform for CSO, it must support
Key constraints and factors influencing technical direction:
- Support petabyte-scale datasets, with terabyte-scale daily loads.
- Ensure reliable and durable storage.
- Maintain immutable data, potentially used in legal procedures.
- Detect and gracefully recover from job failures.
- Optimize storage and compute costs.
- Ensure transparency, traceability, and auditability of all actions.
- Mix of scheduled, event-based, and ad hoc data processing.
- Majority of data processing will be repeatable.
- Support real-time access to data via dashboards (including PowerBI), and analyst workbooks.
For the purposes of this document, it is reasonably assumed that:
- There is no need for dataset indexing (e.g., BigQuery, Snowflake).
- Target Python 3.10 or later for development.
- Compute will primarily be containers within a Kubernetes-like environment.
- Multiple storage platforms will be used, balancing performance and cost.
- No access to running workers; only data outputs and logs are accessible.
- Manage code on GitHub, leveraging features like locked branches and Actions.
- Building a simple version in-house is faster and cost-effective than adopting an existing, opinionated service.
It is believed these are immutable things:
- BigQuery must be used as the store for data exposed to dashboards.
- Scalability: Design for growth, ensuring the system can handle increasing volumes of data without significant performance degradation or cost escalation.
- Flexibility: Support diverse data types and formats, enabling a wide range of analytical tasks.
- Efficiency: Optimize storage and retrieval processes to minimize costs and improve performance.
- Resilience: Ensure pipelines can recover from failures gracefully and continue processing data.
- Modularity: Build pipelines as modular components, allowing for easy updates and replacements.
- Transparency: Make all data processing steps and transformations visible and auditable.
- User-Friendly: Provide intuitive and accessible interfaces for users to interact with the data platform.
- Consistency: Maintain uniformity across different interfaces to streamline user experience and reduce learning curve.
- Extensibility: Allow easy integration with other tools and platforms via standardized APIs.
- Maintainability: Write clean, well-documented code to simplify maintenance and onboarding of new developers.
- Testability: Ensure all components are thoroughly tested, using automated tests to catch issues early.
- Modularity: Structure the system in a modular way to allow components to be updated or replaced without affecting the whole system.
- RESTful APIs: Expose functionalities through RESTful APIs to standardize interactions and integrations.
- Separation of Concerns: Clearly separate storage and compute functions to optimize resource usage and scalability.
- Encourage Flow: Fewer systems, put work, code, documentation and tests in the same system.
- Ensure data can be found and read reliably.
- System components should have high uptime and minimal downtime.
- Minimize time to respond to data queries.
- Minimize time between when data is committed and is available to query.
- Establish and enforce staleness thresholds for data to ensure its relevance.
- Ensure that all committed data conforms to predefined schemas.
- Verify that datasets contain accurate and correct record counts.
- Guarantee no unaccounted-for data, either processed successfully or resulting in clear, unambiguous errors.
- Ensure all non-raw data conforms to a schema.
- Implement robust job failure reporting, ensuring even catastrophic errors are eventually reported and logged.
- Ensure that committed data remains readable and intact over time.
- Retain all types of data (business data, system logs, and error logs) for specified retention periods, complying with data retention policies.
- Ensure all data is able to be traced through the system (e.g. this version of the code operated on this version of the data)
flowchart TD
ESI(Dashboards) --> BQ
USER(User) --> WB[Analyst Workbench]
WB --> CATALOG[Data Catalog]
WB --> BROKER[Data Broker]
WB --> PIPE[Pipelines]
PIPE --> BROKER
subgraph storage
CATALOG
BROKER --> DATA[LTS]
BROKER --> BQ[BigQuery]
end
subgraph consumers
ESI
USER
end
Analyst Workbench
The design and implementation of the workbench is not part of this document, however, the interfaces to build pipelines and
Data Catalog
Metadata Catalog facilitating
- Data Discovery
- Data Security
- Ownership
- Encryption
- Form
- Provenance
- Change tracking
Data Broker
Data federation, single interface for various data source.
Enforcement point for access permissions and perform decryption for authorized users.
sequenceDiagram
User->>+Broker: Get data
Broker->>+Catalog: Where is this data
Catalog-->>-Broker: It's here +perms +schema
Broker-->>+Store: Retrive Data
Store-->>-Broker: Here's the data
Broker->>-User: Here's your data
Pipelines
development
- toolset does logging, retries
- unit testable
operations
- idempotent (may cost storage)
- backfill/replayable
- pausable
- observable
- step failure doesn't kill entire workflow
Data Storage
Data pipelines split data into categories:
- raw Postel's Law - keep everything and only reject malicious data
- managed
- published
Leverage BigQuery for what it was designed for
Just as important as the primary components, but are generally not touchable outside the development and operations teams.
- performance metrics are we tending toward failure, what is the slow part
- alerting when there's problems, we should know
- pipeline libraries error handling, logging, performance, all taken care of for data engineers
- dynamic configuration configuration should not be baked into code or repositories unless it is specific to that code
- code abstractions engineers want to access data, not care about where it is hosted
Active
- schema validation
Passive
- record counts
- job durations
Active
- execute jobs writing to a null-writer to exercise everything but the commit
Passive
The primary gateway for Quality is moving from the uncontrolled environment of the developer
Approach | Purpose | Proposed Implementation |
---|---|---|
Regression Tests | - | pytest |
Test Coverage | - | - |
Code Typing | - | mypy |
Code Form | - | ruff |
Secret Scanning | Detect Leaked Secrets | not trufflehog, fides |
Dry Runs | Test execution before deployment | - |
Weak Coding Practices | - | - |
Composition Analysis | Poor maintenance | bombast , trivy |
Maintainability | Difficult to read code | radon |
Current usage of Explorer (30 days to 24 June 2024)
Unique users: 33
Queries: 1050
Average Execution: 10.51 seconds
Median Execution: 2.20 seconds
Processed Bytes: 0.99 Tb
Estimates are based on observed performance and anticipated volumes:
- Cloud Run (*): 100 thousand executions, 8 CPU, 16 Gb, 60 seconds = £1k
- Storage: 1 Petabyte (compressed approx 4/5ths) 200Tb = £4k
Approximately £60k per annum at the upper end of the estimate.
(*), this is double the current memory and CPU specification of the comparitor execution environment and 4x the execution time of the existing service, but as prices were rounded up to the nearest £1k it made no difference to the cost estimate.
Anticipated most likely scenario is 10x longer queries to to data volumes and fewer queries per month; scaling just the number of queries to 50k/mo, the cost estimate remains the same - primarily due to the small numbers involved rounding up to £1k per month.
Focusng on BigQuery as the largest part of the experience and cost differentiation.
- BigQuery £45k per month (150 Tb interactive querying, 1 Pb processing)
- BigQuery £46k per month (1600 Slots)
Assuming all data in BigQuery as per current MVD.
Approximately £500k per annum at the lower end of the estimate.
This assumes a flat growth of data per query, what is more likely is some datasets will be many terabytes themselves, so rather than queries processing 10s of gigabytes as per Explorer today, some queries will be processing 10s of terabytes. Approximating this to 10 Pb of data accessed in queries is £1.74 million per annum.
- Infrastructure over Application advocates challenging approach.
- Short-termism encouraging a buy-over-build, rather than a buy-for-value approach.