Data Pipelines
Data Pipes provides a “No Code” user interface for building Data Ingestion and Data Preparation pipelines: it ingests data from several data sources, such as Relational Databases, files (uploaded from the local file system, SFTP, or Cloud Storage), REST APIs, and Streams, and prepares that data for Analytics.
Data Pipes supports data ingestion from multiple source systems. The following Data Sources are supported currently:
CSV/JSON Files (Uploaded by the User)
CSV, JSON, Excel, Parquet hosted on S3 or SFTP
Streaming Data Sources: AWS IoT, Kinesis
Relational Databases: MySQL, SQL Server, Oracle, PostgreSQL
JSON REST APIs
Data Pipes adds support for additional connectors on an ongoing basis.
Key Features
Scheduled Replication
Data can be replicated one time or on a user-defined schedule. The scheduling system is flexible and can accommodate choices such as hourly, daily at 8 AM, or weekly on Sunday at 11 PM; replication can also be event driven.
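As a rough illustration, the schedule choices above map onto standard cron expressions. This is a hypothetical sketch only; the Data Pipes scheduler UI collects the same information through its own controls rather than raw cron strings.

```python
# Hypothetical mapping of the schedule choices described above to cron syntax.
SCHEDULES = {
    "hourly":             "0 * * * *",   # at minute 0 of every hour
    "daily_at_8am":       "0 8 * * *",   # every day at 08:00
    "weekly_sunday_11pm": "0 23 * * 0",  # every Sunday at 23:00
}
```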
Ingestion Mode
Data Office Admins can choose between two ingestion modes: full replication and CDC (Change Data Capture) replication. Full replication performs a complete copy of all tables from the source on a user-selected schedule, while CDC only replicates changes. CDC significantly reduces the daily bandwidth and replication time requirements (by roughly a factor of 1,000 and a factor of 10, respectively).
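The sketch below contrasts the two modes using a simple watermark-based incremental query, one common way to approximate CDC. It is illustrative only: the table, column names, and connection are hypothetical, and Data Pipes' actual CDC mechanism is not shown here.

```python
import sqlite3  # stand-in for any relational source; connection details are hypothetical

def full_replication(conn):
    # Full replication: copy every row of the table on each scheduled run.
    return conn.execute("SELECT * FROM orders").fetchall()

def incremental_replication(conn, last_synced_at):
    # Incremental (CDC-style) load: read only rows changed since the last run,
    # which is why bandwidth and replication time drop sharply versus a full copy.
    return conn.execute(
        "SELECT * FROM orders WHERE updated_at > ?", (last_synced_at,)
    ).fetchall()
```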
Streaming Sources
Data Pipes allows ingestion of data from MQTT-based sources. MQTT clients or publishers are expected to push data to an MQTT broker in AWS; Data Pipes leverages the AWS IoT Core service for this. Once the data is pushed to the broker, Data Pipes loads it into Athena in near real time.
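For context, a publisher pushing a reading to the AWS IoT Core broker could look like the following minimal sketch using boto3. The topic name and payload fields are hypothetical; Data Pipes handles the downstream load into Athena.

```python
import json
import boto3

# Hypothetical publisher: a device or client sends a reading to the AWS IoT Core
# MQTT broker; Data Pipes then picks it up and loads it into Athena near real-time.
iot = boto3.client("iot-data")

reading = {"device_id": "sensor-42", "temperature_c": 21.7, "ts": "2024-01-01T08:00:00Z"}

iot.publish(
    topic="factory/sensors/temperature",  # hypothetical MQTT topic
    qos=1,
    payload=json.dumps(reading).encode("utf-8"),
)
```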
ETL & Business Rules
AWS DataBrew is natively embedded in Data Pipes to allow the creation of business rules with over 250 different transformations via an interactive web interface. Those transformations include filtering anomalies, converting data to standard formats, and correcting invalid values.
Business rules can be defined when creating a new ingestion, or post ingestion. The business rules will run every time the data sources are ingested by Data Pipes, as scheduled.
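In Data Pipes these rules are built interactively in the DataBrew UI; the pandas sketch below only illustrates the kinds of transformations mentioned above. Column names and thresholds are hypothetical, and this is not DataBrew's API.

```python
import pandas as pd

# Illustrative equivalents of typical business rules; not Data Pipes/DataBrew code.
def apply_business_rules(df: pd.DataFrame) -> pd.DataFrame:
    df = df[df["amount"].between(0, 1_000_000)].copy()               # filter anomalous amounts
    df["order_date"] = pd.to_datetime(df["order_date"]).dt.date      # convert to a standard date format
    df["country"] = df["country"].str.upper().replace({"UK": "GB"})  # correct non-standard values
    return df
```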
Automated PII Scanning
Data Pipes allows scanning any replicated data for PII. The results of the PII scan are shown in the interface so users can choose how to handle each PII column. Possible choices include masking the data, tokenizing the data, and applying a Data Security tag.
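The sketch below shows what masking and tokenization of a flagged column could look like in principle. It is an assumption-labeled illustration, not Data Pipes' internal implementation; the salt and truncation length are arbitrary.

```python
import hashlib

def mask_email(value: str) -> str:
    # Masking: hide most of the value while keeping its general shape.
    local, _, domain = value.partition("@")
    return local[:1] + "***@" + domain

def tokenize(value: str, salt: str = "pipeline-salt") -> str:
    # Tokenization: replace the value with a stable, non-reversible token,
    # so joins still work without exposing the raw PII.
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

print(mask_email("jane.doe@example.com"))  # j***@example.com
print(tokenize("jane.doe@example.com"))    # deterministic 16-character token
```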
Configuring a pipeline
Any user can create an ingestion pipeline.
Ingestion pipeline creation is broadly classified into five steps:
Configure Source
Configure Destination
Select Replication mode
Configure Replication
Start Replication
Configure Source:
In this step the user has to provide information about the data source, depending on the type of source. Upon creation, Data Pipes will validate the source and throw appropriate errors if validation fails; upon successful validation, the user can move on to configuring the destination.
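The kind of information collected for a relational source might look like the following. Field names here are hypothetical and only illustrate what the UI asks for; they are not the actual Data Pipes schema.

```python
# Hypothetical example of the details collected for a MySQL source.
source_config = {
    "type": "mysql",
    "host": "db.example.internal",
    "port": 3306,
    "database": "sales",
    "username": "replication_user",
    "password": "********",          # validated by Data Pipes before proceeding
    "tables": ["orders", "customers"],
}
```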
Configure Destination:
In this step the user will have to provide details of the destination, including valid credentials for it. Currently Data Pipes supports Athena and Snowflake as destinations. Data Pipes will validate the credentials and the permissions granted to them before letting the user move on to selecting the replication mode.
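As a hedged illustration, destination entries for the two supported targets could contain details like the following. Field names are hypothetical placeholders, not the product's actual configuration format.

```python
# Hypothetical destination entries; field names are illustrative only.
athena_destination = {
    "type": "athena",
    "s3_staging_dir": "s3://example-bucket/athena-results/",
    "workgroup": "primary",
    "region": "us-east-1",
}

snowflake_destination = {
    "type": "snowflake",
    "account": "xy12345.us-east-1",
    "warehouse": "LOAD_WH",
    "database": "ANALYTICS",
    "username": "pipes_user",
    "password": "********",  # credentials and their permissions are validated by Data Pipes
}
```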
Select Replication mode:
Data Pipes supports various replication modes depending on the selected source. In this step the user will have the option to select the type of replication: a one-time ingestion or an incremental ingestion.
Configure Replication:
In this step, the user will be asked to select the domain and dataset into which they want to load the data. The user will also be prompted to provide the pipeline name and the table name for the ingested data. Once the pipeline completes, the ingested data will be visible in that specific domain.
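A minimal, hypothetical example of the values entered in this step is shown below; the names are placeholders chosen for illustration.

```python
# Hypothetical replication details entered in this step.
replication_config = {
    "domain": "sales",                       # domain the data will be visible under
    "dataset": "orders_raw",
    "pipeline_name": "mysql_orders_daily",
    "table_name": "orders",
}
```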
Start Replication:
Once the pipeline is configured, the user will have to click the play button to start the replication. For incremental pipelines, the user will also be prompted to enter a schedule specifying when, or how frequently, each incremental run should happen.
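In the product this step is performed from the UI via the play button; the sketch below only illustrates, under that assumption, what attaching an incremental schedule and starting the pipeline amounts to. The function and schedule format are hypothetical.

```python
# Hypothetical sketch of starting an incremental pipeline with a schedule.
incremental_schedule = {
    "mode": "incremental",
    "cron": "0 2 * * *",  # run the next incremental load every day at 02:00
}

def start_replication(pipeline_name: str, schedule: dict) -> None:
    # Placeholder for the action triggered by the play button in the UI.
    print(f"Starting pipeline '{pipeline_name}' on schedule '{schedule['cron']}'")

start_replication("mysql_orders_daily", incremental_schedule)
```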