Zero to Data: Creating Synthetic E-commerce Datasets with Python

Introduction: The Foundation of Data Analysis

Before diving into the powerful world of SQL and the dynamic visualizations of Power BI, every data analysis project starts with its raw material: the dataset. For this global web store case study, the first crucial step was building a structured and realistic dataset from scratch using Python.

Generating synthetic data allows us to design the exact structure of the tables (a Star Schema is the goal), ensuring the data reflects the complexity of global transactions, customers, and products needed for a meaningful analysis.

🛠️ Key Libraries and Tools

The entire process relies on three fundamental Python libraries:


Pandas: Essential for structuring and manipulating data into DataFrames for easy export to CSV.

NumPy: Used for numerical operations, such as generating random quantities and applying statistical distributions to prices.

Faker: The star of the show! It efficiently creates realistic yet fictional data points such as names, emails, countries, and cities.


To get started, we only need to install them:

```
pip install Faker pandas numpy
```
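As a quick illustration, here is the kind of value each Faker call returns. The outputs shown in the comments are just examples; every run produces different values:

```python
from faker import Faker

fake = Faker()

# Each call returns a fresh, realistic-looking but entirely fictional value
print(fake.name())     # e.g. "Allison Hill"
print(fake.email())    # e.g. "donaldgarcia@example.net"
print(fake.country())  # e.g. "Saint Lucia"
print(fake.city())     # e.g. "Lake Jeffreyside"
```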


💻 The Generation Code: Building a Star Schema

The modeling process generates four main tables, following a classic e-commerce transactional structure: two Dimension Tables (master data) and two Fact Tables (transactional data).

1. Dimension Tables (The 'Who' and 'What')

These tables describe the entities involved in the transactions.

Customers (customers.csv): Generated using Faker to assign unique IDs, names, emails, and crucial geographical details (country, city) for global analysis.

Products (products.csv): This simulates the product catalog, including IDs, specific product names, categories, and a base price for each product (which the order items will later reference). This structure is vital for calculating KPIs like "Sales by Category." A sketch of both dimension tables follows.
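Here is a minimal sketch of how the two dimension tables can be generated. The row counts, column names, and category list are illustrative choices for this sketch, not values dictated by the case study:

```python
import numpy as np
import pandas as pd
from faker import Faker

fake = Faker()
Faker.seed(42)      # seed Faker for reproducible fake values
np.random.seed(42)  # seed NumPy for reproducible prices and picks

NUM_CUSTOMERS = 1000  # illustrative sizes
NUM_PRODUCTS = 100

# Customers dimension: unique IDs plus Faker-generated personal
# and geographical details for the global analysis
customers = pd.DataFrame({
    "customer_id": range(1, NUM_CUSTOMERS + 1),
    "name": [fake.name() for _ in range(NUM_CUSTOMERS)],
    "email": [fake.email() for _ in range(NUM_CUSTOMERS)],
    "country": [fake.country() for _ in range(NUM_CUSTOMERS)],
    "city": [fake.city() for _ in range(NUM_CUSTOMERS)],
})

# Products dimension: catalog with categories and a base price drawn
# from a normal distribution, clipped so prices stay positive
categories = ["Electronics", "Clothing", "Home", "Sports", "Books"]
products = pd.DataFrame({
    "product_id": range(1, NUM_PRODUCTS + 1),
    "product_name": [f"{fake.word().title()} {fake.word().title()}"
                     for _ in range(NUM_PRODUCTS)],
    "category": np.random.choice(categories, size=NUM_PRODUCTS),
    "base_price": np.round(
        np.clip(np.random.normal(50, 20, NUM_PRODUCTS), 5, None), 2),
})
```

Seeding both Faker and NumPy makes the dataset reproducible, which is handy when you later want to re-run the pipeline and compare results downstream.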

2. Fact Tables (The 'When' and 'How Much')

These are the largest tables and the engine of our analysis, recording every sales activity.

Orders (orders.csv): This table holds the transaction headers, randomly assigning each order a customer_id and an order_date within a specified period.

Order Items (order_items.csv): This is the granular detail table. It links orders to products (order_id, product_id), generates the quantity, and calculates the total price for each item line by pulling the base price from the products table. A sketch of both fact tables follows.
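A corresponding sketch for the two fact tables. To keep the snippet self-contained it creates stand-in IDs and prices; in the full script these come from the customers and products DataFrames built earlier. The date window, order count, and line-item ranges are again illustrative:

```python
import numpy as np
import pandas as pd

np.random.seed(42)

# Stand-ins for the dimension tables built earlier; in the full
# script these come from the customers and products DataFrames
customer_ids = np.arange(1, 1001)
products = pd.DataFrame({
    "product_id": np.arange(1, 101),
    "base_price": np.round(np.random.uniform(5, 120, 100), 2),
})

NUM_ORDERS = 5000  # illustrative

# Orders: one header row per transaction, with a random customer
# and a random date inside a one-year window
start = pd.Timestamp("2024-01-01")
orders = pd.DataFrame({
    "order_id": np.arange(1, NUM_ORDERS + 1),
    "customer_id": np.random.choice(customer_ids, size=NUM_ORDERS),
    "order_date": start + pd.to_timedelta(
        np.random.randint(0, 365, NUM_ORDERS), unit="D"),
})

# Order items: 1-4 line items per order, each linking a product,
# a quantity, and a total price derived from the product's base price
rows = []
for order_id in orders["order_id"]:
    for _ in range(np.random.randint(1, 5)):
        product = products.sample(1).iloc[0]
        qty = np.random.randint(1, 6)
        rows.append({
            "order_id": order_id,
            "product_id": int(product["product_id"]),
            "quantity": qty,
            "total_price": round(qty * product["base_price"], 2),
        })
order_items = pd.DataFrame(rows)
```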

📋 The Complete Python Script

The complete script is simply the snippets above combined into one file, finished with an export step that writes the four CSV files, ready for upload:

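A minimal sketch of that export step, assuming the customers, products, orders, and order_items DataFrames from the snippets above are in scope:

```python
# Write each table to its own CSV, ready for upload to BigQuery.
# index=False keeps pandas' internal row index out of the files,
# so only the modeled columns appear.
customers.to_csv("customers.csv", index=False)
products.to_csv("products.csv", index=False)
orders.to_csv("orders.csv", index=False)
order_items.to_csv("order_items.csv", index=False)
```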

✅ Conclusion: Ready for the Cloud and Analysis

By executing this Python script, we've successfully generated four well-structured CSV files: customers.csv, products.csv, orders.csv, and order_items.csv.

These clean, interconnected datasets are now perfectly modeled for an analytical workflow. The next crucial steps in this case study involve:

Cloud Data Processing (BigQuery): We'll upload these CSV files to Google BigQuery to handle large-scale data querying. This is where we'll leverage the power of SQL to perform advanced transformations, calculate Key Performance Indicators (KPIs), and prepare the aggregated tables.

Business Intelligence (Power BI): Finally, we'll connect Power BI directly to the transformed data in BigQuery. This allows us to create an interactive dashboard, visualizing the KPIs (like revenue, sales by region, and top products) to tell the complete story of our global e-commerce business.

In the next post we will dive into SQL transformations in BigQuery and the final visualization in Power BI!