Nevenka Lukic

Unveiling the Azure Data Lake for Bike Share Data Analytics

Updated: Oct 24, 2023


This is the second major project in the Data Engineering with Azure specialization on Udacity that I recently completed. If you want to read about my first project, you can find it in this blog.


Data Engineering with Azure could be your gateway to thrilling projects like building a comprehensive data lake solution. In this blog post, we will walk through the journey of developing a data lake solution for Divvy, a bike sharing program based in Chicago, Illinois, USA.


You can find the code related to this project in my GitHub repository.


Project Overview


Divvy bike share is a popular program that allows riders in Chicago to access bikes via kiosks or a mobile application. The anonymized bike trip data from Divvy is made publicly available by the City of Chicago for analysis, forming the foundation of this data engineering project.


The goal of this project was to develop a data lake solution with Azure Databricks using a lakehouse architecture. The objectives included designing a star schema, importing the data into Azure Databricks with Delta Lake to create the bronze and gold data stores, and transforming the data into the star schema for the gold data store.


Building the Star Schema


To ensure a robust foundation for analysis, a star schema was designed around two fact tables: one for trip facts and one for payment facts. The trip fact table holds the trip duration and the rider's age at the time of the trip, while the payment fact table holds the payment amount. The trip facts connect to rider, station, and date dimensions; the payment facts connect to rider and date dimensions.
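The schema above can be sketched as a plain Python structure. The table and column names here are my assumptions for illustration; the actual project defines them in Spark, following the schema diagrams from the course.

```python
# A minimal sketch of the star schema described above. Names such as
# "fact_trip" and "rider_age_at_trip" are assumptions, not the project's
# exact identifiers.
star_schema = {
    "fact_trip": {
        "measures": ["trip_duration_minutes", "rider_age_at_trip"],
        "dimension_keys": ["rider_id", "start_station_id", "end_station_id", "date_id"],
    },
    "fact_payment": {
        "measures": ["amount"],
        "dimension_keys": ["rider_id", "date_id"],
    },
    "dim_rider": ["rider_id", "first_name", "last_name", "birthday", "is_member"],
    "dim_station": ["station_id", "name", "latitude", "longitude"],
    "dim_date": ["date_id", "day_of_week", "month", "quarter", "year"],
}

# Sanity check: every dimension key on a fact table resolves to a dimension.
key_to_dim = {
    "rider_id": "dim_rider",
    "start_station_id": "dim_station",
    "end_station_id": "dim_station",
    "date_id": "dim_date",
}
for fact in ("fact_trip", "fact_payment"):
    assert all(k in key_to_dim for k in star_schema[fact]["dimension_keys"])
```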






The ELT Process: Extract, Load, and Transform






Extract Step


In the extraction phase, Spark code was written in Jupyter notebooks and Python scripts in Databricks. This code extracts the information from CSV files stored in the Databricks file system and writes it out to Delta file locations, taking advantage of Azure's distributed storage options.
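In Spark the extract step is roughly `spark.read.csv(...)` followed by `df.write.format("delta")`. As a hedged, standard-library-only sketch of the same parsing logic, assuming a Divvy-style trips CSV layout (the column names are my guesses, not the official file format):

```python
import csv
import io

# Hypothetical sample rows in a Divvy-style trips CSV layout; in the project
# this data lives as CSV files in the Databricks file system.
raw = io.StringIO(
    "trip_id,started_at,ended_at,rider_id\n"
    "T1,2023-02-01 08:00:00,2023-02-01 08:25:00,R42\n"
    "T2,2023-02-01 09:10:00,2023-02-01 09:40:00,R7\n"
)

rows = list(csv.DictReader(raw))

# The Spark equivalent would be approximately:
#   df = spark.read.csv(path, header=True, inferSchema=True)
#   df.write.format("delta").mode("overwrite").save(delta_path)
print(len(rows), rows[0]["trip_id"])  # → 2 T1
```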






Load Step


The loading step implemented key features of data lakes on Azure: creating tables and loading data from Delta files. Using spark.sql statements, the tables were created and populated from the files extracted in the previous step.
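The pattern behind those spark.sql statements is a `CREATE TABLE ... USING DELTA LOCATION` pointing at the Delta files written during extraction. A small sketch, assuming hypothetical table and path names:

```python
# Builds the DDL string the load step would pass to spark.sql(). The table
# name and Delta path below are assumptions for illustration.
def create_table_sql(table: str, delta_path: str) -> str:
    return (
        f"CREATE TABLE IF NOT EXISTS {table} "
        f"USING DELTA LOCATION '{delta_path}'"
    )

stmt = create_table_sql("staging_trips", "/delta/trips")
print(stmt)
# In Databricks this string would be executed with spark.sql(stmt).
```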





Transform Step


Transforming the data employed Spark and Databricks to run the ELT process, creating both the fact and dimension tables. The Python scripts were written to match the schema diagrams, generating the appropriate keys and facts. The transform scripts adhered to a few essential criteria: write to Delta, use overwrite mode, and save the results as tables in Delta.
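The two derived fact fields mentioned earlier, trip duration and rider age at the time of the trip, can be sketched in plain Python. In the project these are computed with Spark column expressions before the result is written to Delta in overwrite mode; the function and column names here are assumptions:

```python
from datetime import date, datetime

def trip_duration_minutes(started_at: datetime, ended_at: datetime) -> float:
    """Trip duration in minutes, from start and end timestamps."""
    return (ended_at - started_at).total_seconds() / 60.0

def rider_age_at_trip(birthday: date, trip_date: date) -> int:
    """Rider's age in whole years on the day of the trip."""
    # Subtract one year if the birthday hasn't occurred yet that year.
    before_birthday = (trip_date.month, trip_date.day) < (birthday.month, birthday.day)
    return trip_date.year - birthday.year - before_birthday

print(trip_duration_minutes(datetime(2023, 2, 1, 8, 0), datetime(2023, 2, 1, 8, 25)))  # → 25.0
print(rider_age_at_trip(date(1990, 6, 15), date(2023, 2, 1)))  # → 32
```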





Conclusion


As a new developer, I found this project a perfect introduction to Azure services and data engineering. Navigating through it allowed me to understand the principles of creating a data lake solution using a lakehouse architecture. The step-by-step approach, coupled with the guidance from the Udacity course, provided a solid foundation for understanding Azure Databricks, Delta Lake, and the intricacies of designing star schemas. Working on this project not only honed my technical skills but also boosted my confidence in working with a cloud-based data engineering platform.


In conclusion, this project was not just about creating a data lake for bike share data analytics; it was a transformative experience that equipped me with the skills and knowledge necessary to embark on a fulfilling career as a data engineer and developer.


Embark on your own data engineering journey with Azure and unlock a world of possibilities in data analytics and insights. Happy coding! 🚀


