HANDS-ON-LAB

Build Real Estate Transactions Pipeline

Problem Statement

This hands-on lab builds a real-time data pipeline for real estate transactions. The aim is to upload a dataset of real estate transactions in the city of Sacramento to an S3 bucket, push the records into an Amazon Kinesis data stream with a Python script, and then consume and query the stream using Flink SQL.

Tasks

  1. Create a new S3 bucket and upload the dataset file containing the real estate transactions in the city of Sacramento.

  2. Develop a Python script to read the dataset file and push the records to a Kinesis data stream. Include an extra column called "txn_timestamp" to track the time of record insertion (see the Python sketch after this list).

  3. Set up a Flink SQL environment.

  4. Write a Flink SQL query to create a table with appropriate column types matching the dataset structure provided below (see the Flink SQL sketch after the column table).

  5. Configure Flink to consume records from the Kinesis data stream and insert them into the created table.

  6. Execute a select statement in Flink SQL to display the contents of the table.
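
The following is a minimal Python sketch for tasks 1 and 2, assuming boto3 is installed and AWS credentials are configured. The bucket name, object key, and stream name are placeholders to replace with your own. It reads the uploaded CSV from S3, adds the "txn_timestamp" column, and pushes each record to Kinesis as JSON.

    import csv
    import io
    import json
    from datetime import datetime, timezone

    import boto3

    BUCKET = "my-real-estate-bucket"          # placeholder: bucket from task 1
    KEY = "sacramento_transactions.csv"       # placeholder: uploaded dataset file
    STREAM_NAME = "real-estate-transactions"  # placeholder: your Kinesis stream

    s3 = boto3.client("s3")
    kinesis = boto3.client("kinesis")

    # If the dataset is still local, upload it first (task 1):
    # s3.upload_file("sacramento_transactions.csv", BUCKET, KEY)

    # Read the dataset file from the S3 bucket.
    body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read().decode("utf-8")

    for row in csv.DictReader(io.StringIO(body)):
        # Extra column tracking the time of record insertion (task 2).
        row["txn_timestamp"] = datetime.now(timezone.utc).isoformat()
        kinesis.put_record(
            StreamName=STREAM_NAME,
            Data=json.dumps(row).encode("utf-8"),
            PartitionKey=row.get("zip") or "unknown",
        )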


Join the hands-on lab now and master real-time data processing with a Real Estate Transactions pipeline.

Learnings

  • Knowledge of setting up and utilizing AWS services such as S3 and Kinesis.

  • Proficiency in Python scripting for data ingestion to Kinesis.

  • Understanding of Flink SQL syntax for creating tables and querying data.

  • Experience in working with different data types in Flink SQL.

  • Familiarity with real-time data processing pipelines using AWS services.

 

Please note that the exercise provides the column names and their corresponding data types:

Column       Data type
street       string
city         string
zip          string
state        string
beds         int
baths        int
sq__ft       int
type         string
sale_date    string
price        int
latitude     double
longitude    double
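
Putting tasks 4-6 together, below is a minimal Flink SQL sketch. It assumes the Flink Kinesis connector is available in your Flink SQL environment, that the producer writes JSON, and that the stream name and region shown are placeholders to adjust to your setup.

    -- Create a table matching the dataset, plus the txn_timestamp column
    -- added by the producer (task 4). `type` is quoted because TYPE is a
    -- reserved word in Flink SQL.
    CREATE TABLE real_estate_transactions (
        street        STRING,
        city          STRING,
        zip           STRING,
        state         STRING,
        beds          INT,
        baths         INT,
        sq__ft        INT,
        `type`        STRING,
        sale_date     STRING,
        price         INT,
        latitude      DOUBLE,
        longitude     DOUBLE,
        txn_timestamp STRING
    ) WITH (
        'connector' = 'kinesis',                 -- consume from the stream (task 5)
        'stream' = 'real-estate-transactions',   -- placeholder: your stream name
        'aws.region' = 'us-east-1',              -- placeholder: your region
        'scan.stream.initpos' = 'TRIM_HORIZON',  -- read from the start of the stream
        'format' = 'json'
    );

    -- Display the contents of the table (task 6).
    SELECT * FROM real_estate_transactions;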

FAQs

Q1. What is the purpose of the "txn_timestamp" column in the pipeline?

The "txn_timestamp" column tracks the time of record insertion, providing information about when each transaction was processed in the pipeline.

 

Q2. Can I customize the Flink SQL table based on the provided column names and data types?

Yes, you can write a Flink SQL query to create a table with appropriate column names and data types matching the structure of the real estate transactions dataset.

 

Q3. How can I display the contents of the Flink SQL table?

By executing a SELECT statement in Flink SQL, you can retrieve and display the contents of the table, showing the real estate transaction records.