File-based Ingestion
With file-based ingestion, Data Pipes can ingest data from S3-based sources, or you can upload a CSV or JSON file directly to the data lake.
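For the S3-based path, the source file simply needs to be reachable in a bucket. As a minimal sketch (the bucket name, object key, and local file path below are hypothetical, not part of Data Pipes), uploading a CSV with boto3 looks like this:

```python
import boto3

# Hypothetical S3 location: substitute a bucket Data Pipes can read from.
BUCKET = "my-data-lake-landing"
KEY = "incoming/orders.csv"

s3 = boto3.client("s3")

# Upload a local CSV so it is available as an S3-based source.
s3.upload_file("orders.csv", BUCKET, KEY)
print(f"Uploaded to s3://{BUCKET}/{KEY}")
```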
CSV Upload:
Replication modes: Full, DataBrew
Click on the pipeline icon on the left-hand side of the Data Pipes portal.
You will land on the ingestion pipelines page, which lists all the pipelines you have initiated.
Click on the plus icon to create a new pipeline.
Next, click on the configure source block to create the source.
Select File -> CSV source
Upon clicking CSV source, a form appears where you can upload your own file to ingest into the data lake.
Once the source is configured, click on the configure destination option and select the data lake's default destination (Athena/Snowflake).
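With the Athena destination, ingested tables end up queryable with standard SQL. As a rough illustration (the database, table, and results bucket below are hypothetical, not Data Pipes defaults), running a query via boto3 might look like:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical database/table names and results bucket.
response = athena.start_query_execution(
    QueryString="SELECT * FROM orders LIMIT 10",
    QueryExecutionContext={"Database": "sales_domain"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print("Query execution id:", response["QueryExecutionId"])
```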
Select a replication mode:
Full - the file is ingested as-is, without any changes.
DataBrew - lets you prepare or transform the data before ingestion. Data Pipes leverages the AWS DataBrew service for this.
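To make the DataBrew mode concrete, here is a minimal sketch of registering a transformation recipe with boto3. The recipe name, column names, and steps are hypothetical, and in Data Pipes this is driven through the UI rather than calls like these:

```python
import boto3

databrew = boto3.client("databrew")

# Hypothetical recipe: rename a column and lower-case its values
# before the file is ingested into the data lake.
databrew.create_recipe(
    Name="clean-orders-recipe",  # assumption: an illustrative recipe name
    Steps=[
        {
            "Action": {
                "Operation": "RENAME",
                "Parameters": {"sourceColumn": "Cust_Email", "targetColumn": "email"},
            }
        },
        {
            "Action": {
                "Operation": "LOWER_CASE",
                "Parameters": {"sourceColumn": "email"},
            }
        },
    ],
)
```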
Once the replication mode is selected, click on configure pipeline to provide details such as the pipeline name, domain name (where the data is to be loaded), and table name.
Click on the play button to start the pipeline.
Once started, each pipeline goes through several steps:
Fetching Schema - shows the actual schema of the data.
PII scan running - each pipeline undergoes PII scanning to identify sensitive PII data. If PII is detected, you can decide whether or not to mask/tokenize the affected columns (see the sketch after these steps).
Started - once you have selected the columns to mask (optional), the actual ingestion begins.
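As referenced above, the sketch below illustrates the kind of masking/tokenization applied to a flagged PII column. It uses plain pandas and SHA-256 hashing as a stand-in; the column name and salt are hypothetical, and Data Pipes performs this step internally:

```python
import hashlib
import pandas as pd

SALT = "replace-with-a-secret-salt"  # assumption: a salt kept outside the data lake

def tokenize(value: str) -> str:
    """Replace a PII value with a deterministic, irreversible token."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()

df = pd.read_csv("orders.csv")                        # the file being ingested
df["email"] = df["email"].astype(str).map(tokenize)   # hypothetical flagged PII column
df.to_csv("orders_tokenized.csv", index=False)
```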