In this segment, you will learn about the data that you will be dealing with in this project in detail. So, go through the following video where our SME, Mayukh will take you through the datasets.
As discussed in the video, a centralised Kafka server has been set up where the data will be hosted. The details of the Kafka server, such as the broker and topic, will be you in the next segment.
The data is based on an Online Retail Data Set in the UCI Machine Learning Repository. Each order’s invoice has been represented in a JSON format. The sample data looks like this.
As you can see, the data contains the following information:
- Invoice number: Identifier of the invoice
- Country: Country where the order is placed
- Timestamp: Time at which the order is placed
- Type: Whether this is a new order or a return order
- SKU (Stock Keeping Unit): Identifier of the product being ordered
- Title: Name of the product is ordered
- Unit price: Price of a single unit of the product
- Quantity: Quantity of the product being ordered
You will be using these columns to derive some additional columns as well, which will help you in calculating the KPIs. You will learn more about this in the next segment.
Acknowledgement
The data was published in the UCI Machine Learning Repository by Dr Daqing Chen, Director: Public Analytics group.
Reference
Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197aC”208, 2012 (Published online before print: 27 August 2012. doi: 10:1057/dbm.2012.17).
Report an error