Overview
The TWAICE Pull Stack is designed for batch data processing. It enables the secure and efficient extraction of data from customer databases and APIs in a batched manner.
This customer-facing document explains the key features and requirements of the TWAICE Pull Stack, which integrates with customer data sources to provide secure, efficient, and reliable data extraction.
Why do we need this document?
This document explains the value proposition and requirements of the Pull Stack so that customers can check whether their setup is compatible with those requirements, which simplifies the onboarding process.
Key Value Propositions
1. Flexible Connectivity
Connection with Customer Data Sources
On-Site, On-Premise, or Cloud: Whether your data source is hosted on-site, on-premise, or in another cloud environment, the TWAICE Pull Stack can connect seamlessly, providing flexibility and adaptability to your existing infrastructure.
2. Efficient Data Retrieval
Batch Data Pull
Optimized Pull Intervals: The Pull Stack is designed for batch data processing, with pull intervals optimized based on the customer's infrastructure. This ensures efficient data retrieval without overloading the network or the data source. It is not suitable for the stream data processing typical of bus systems.
3. Reliability and Continuity
Alerting on Data Failures
Avoiding Data Gaps: The system includes robust alerting mechanisms to notify stakeholders of any data failures, helping to avoid data gaps and ensure continuous data availability.
Automatic Backfilling
After Connection Issues: In case of connection issues, the Pull Stack supports automatic backfilling, ensuring that no data is lost and all historical data is captured once the connection is restored.
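The backfilling idea can be illustrated with a minimal sketch. It is not the Pull Stack's actual implementation: the function name backfill_windows and its parameters are hypothetical, and the sketch only assumes that the time of the last successful pull is known. It splits the gap since that pull into consecutive batch windows that can be re-requested once the connection is restored.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Tuple


def backfill_windows(
    last_success: datetime, now: datetime, batch: timedelta
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive [start, end) windows covering the gap since the
    last successful pull. Hypothetical helper, for illustration only."""
    if batch <= timedelta(0):
        raise ValueError("batch must be a positive duration")
    start = last_success
    while start < now:
        # The final window is shortened so it never reaches past 'now'.
        end = min(start + batch, now)
        yield start, end
        start = end


if __name__ == "__main__":
    outage_start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
    restored_at = outage_start + timedelta(hours=2, minutes=30)
    for window in backfill_windows(outage_start, restored_at, timedelta(hours=1)):
        print(window[0].isoformat(), "->", window[1].isoformat())
```

Because the windows are contiguous and non-overlapping, pulling each of them once covers the whole outage without gaps or duplicate ranges.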
4. Security and Compliance
Static IP Addresses
For Network Security: To enhance security, static IP addresses are provided for integration with the customer's network. This facilitates easier and more secure network configurations and whitelisting.
Site-to-Site VPN Support
Additional VPN: For customers requiring extra security, an additional Site-to-Site VPN can be established separately. This ensures secure and encrypted data transmission between the customer's data source and the TWAICE Pull Stack.
5. Designed for Batch Processing
Optimized for Batches
Efficient Handling: The Pull Stack is specifically designed for handling batches of data, making it ideal for periodic data extraction and processing tasks. This design ensures optimal performance and reliability in batch processing environments.
System Requirements
Python Library: A Python library (version 3.8 or higher) is required for TWAICE to connect and retrieve data from the customer's data source.
Data Source Accessibility: The Pull Stack must be able to access the customer's data source in batches.
Implementation Steps
Set Up Pull Client: Deploy the Pull Client to interface with the customer’s database or API. Ensure the Python library (version 3.8 or higher) is installed and configured. Once TWAICE has received all connection details and credentials, setup and testing of the Pull Stack take two weeks.
Configure Site-to-Site VPN: If the database is on-premises, set up a Site-to-Site VPN to securely connect the Pull Stack to the local database.
Deploy Pull Stack in AWS Cloud: Provision the Pull Stack in an isolated VPC within the AWS Cloud.
Configure Security Settings: Apply necessary security configurations, including IP whitelisting and other access controls.
Schedule Batch Data Retrieval: Define the batch processing schedule to periodically pull data from the customer’s data source.
What do we need from the customer for the configuration?
1. Credentials of the Datastore
REST Endpoint
If your data source is a REST endpoint, please provide:
URL: The endpoint URL where the data can be accessed.
Authentication Details:
API Key: If your endpoint uses API key-based authentication, provide the API key.
Token: If token-based authentication is used, provide the access token.
Username/Password: If basic authentication is used, provide the username and password.
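As a rough illustration of how the three authentication options above typically translate into HTTP request headers, consider the following sketch. The function build_auth_headers is hypothetical, and the header name X-API-Key and the Bearer scheme are assumptions; the actual header names depend on the customer's endpoint. Only the Basic scheme (Base64 of "username:password") is standardized.

```python
import base64
from typing import Dict, Optional


def build_auth_headers(
    api_key: Optional[str] = None,
    token: Optional[str] = None,
    username: Optional[str] = None,
    password: Optional[str] = None,
    api_key_header: str = "X-API-Key",  # assumption: header name varies per API
) -> Dict[str, str]:
    """Return HTTP headers for exactly one of the supported auth methods."""
    if api_key is not None:
        return {api_key_header: api_key}
    if token is not None:
        # Assumption: the token is sent as a Bearer token.
        return {"Authorization": f"Bearer {token}"}
    if username is not None and password is not None:
        # HTTP Basic authentication: Base64 of "username:password".
        raw = f"{username}:{password}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    raise ValueError("no credentials supplied")


if __name__ == "__main__":
    print(build_auth_headers(username="user", password="pass"))
```

Knowing which of these schemes the endpoint expects, including the exact header name for API keys, helps avoid round trips during setup.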
Datastore
If your data source is a traditional datastore (e.g., SQL database, NoSQL database), please provide:
Connection String: The connection string or URL to access the datastore.
Authentication Details:
Username: The username with necessary permissions to access the datastore.
Password: The corresponding password for the provided username.
2. Additional Configuration
We need the following additional information from the customer to set up the pull configuration.
Event-to-Query Delay:
The time delay between when an event occurs and when the data is available for querying in the datastore.
Query Limits and Bottlenecks:
Any restrictions on the number of queries or data pulls that can be performed in a day.
Any potential bottlenecks or rate limits that could impact data retrieval.
Sensor Data Volume:
The number of sensors from which data can be pulled simultaneously for a given time interval.
Data Pull Frequency:
The maximum frequency at which data can be pulled from the data source.
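To show why these four values matter, the sketch below combines them into a pull schedule. It is an illustration under stated assumptions, not the Pull Stack's configuration format: PullConfig, effective_interval, sensor_batches, and latest_safe_end are hypothetical names, and the daily query limit is assumed to be spread evenly across the day.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass(frozen=True)
class PullConfig:
    """Hypothetical container for the customer-provided values."""
    event_to_query_delay: timedelta  # time until an event becomes queryable
    max_queries_per_day: int         # daily query limit
    max_sensors_per_query: int       # sensors that can be pulled at once
    min_pull_interval: timedelta     # derived from the maximum pull frequency


def effective_interval(cfg: PullConfig, queries_per_pull: int = 1) -> timedelta:
    """Shortest interval that respects both the frequency cap and the
    daily query limit (assuming queries are spread evenly over the day)."""
    if cfg.max_queries_per_day <= 0 or queries_per_pull <= 0:
        raise ValueError("query counts must be positive")
    by_quota = timedelta(days=1) * queries_per_pull / cfg.max_queries_per_day
    return max(cfg.min_pull_interval, by_quota)


def sensor_batches(tags: List[str], cfg: PullConfig) -> List[List[str]]:
    """Split sensor tags into groups no larger than the per-query limit."""
    size = cfg.max_sensors_per_query
    return [tags[i:i + size] for i in range(0, len(tags), size)]


def latest_safe_end(now: datetime, cfg: PullConfig) -> datetime:
    """Latest timestamp whose data should already be queryable."""
    return now - cfg.event_to_query_delay


if __name__ == "__main__":
    cfg = PullConfig(timedelta(minutes=2), 96, 2, timedelta(minutes=5))
    print(effective_interval(cfg))
    print(sensor_batches(["s1", "s2", "s3"], cfg))
    print(latest_safe_end(datetime.now(timezone.utc), cfg))
```

In this sketch the event-to-query delay bounds how recent a pull window may end, the query limit and pull frequency bound how often pulls can run, and the sensor volume determines how many queries each pull needs.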
3. Data Format Requirements
To ensure compatibility with our Pull Stack, the data returned from your datastore or REST endpoint should be structured with the following fields:
Sensor Tag: A unique identifier for each sensor.
Relative Time: The time at which the data was recorded, relative to a specific reference point or epoch.
Sensor Value: The value recorded by the sensor.
Example Data Format
Here is an example of how the data should be formatted:
[
  {
    "sensor_tag": "temperature_sensor_01",
    "relative_time": 1622547800,
    "sensor_value": 22.5
  },
  {
    "sensor_tag": "humidity_sensor_02",
    "relative_time": 1622547860,
    "sensor_value": 55.2
  }
]
In this example:
sensor_tag: Identifies each sensor (e.g., temperature_sensor_01, humidity_sensor_02).
relative_time: The timestamp of the recorded data (e.g., 1622547800).
sensor_value: The value recorded by the sensor (e.g., 22.5, 55.2).
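Customers who want to check their output against this format before onboarding could use a small validation sketch like the one below. The function validate_records is hypothetical and not part of any TWAICE library. The type checks (a string tag, numeric time and value) are assumptions inferred from the example above rather than stated requirements.

```python
import json
from numbers import Real
from typing import Any, Dict, List

REQUIRED_FIELDS = ("sensor_tag", "relative_time", "sensor_value")


def validate_records(payload: str) -> List[Dict[str, Any]]:
    """Parse a JSON payload and check that every record carries the three
    required fields. Raises ValueError on the first problem found."""
    records = json.loads(payload)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            raise ValueError(f"record {i} is not a JSON object")
        missing = [f for f in REQUIRED_FIELDS if f not in rec]
        if missing:
            raise ValueError(f"record {i} is missing fields: {missing}")
        # Assumption: tags are strings, time and value are numbers.
        if not isinstance(rec["sensor_tag"], str):
            raise ValueError(f"record {i}: sensor_tag must be a string")
        for field in ("relative_time", "sensor_value"):
            value = rec[field]
            if isinstance(value, bool) or not isinstance(value, Real):
                raise ValueError(f"record {i}: {field} must be a number")
    return records


if __name__ == "__main__":
    sample = (
        '[{"sensor_tag": "temperature_sensor_01", '
        '"relative_time": 1622547800, "sensor_value": 22.5}]'
    )
    print(len(validate_records(sample)), "record(s) valid")
```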