
🏪 Instacart Market Basket Analysis

By Xiaoyu Gui

Table of contents

  1. 🧾 Summary and Business Recommendations
  2. 🧽 Data Preprocessing
  3. 🧐 Exploratory Data Analysis
    1. 🛒 Customer Behavior
    2. 🍌 Product
  4. 📝 Future work

🧾 Summary and Business Recommendations

  1. Order volumes are considerably higher during daytime hours (9 AM to 5 PM) on weekends. The company should therefore consider attracting more drivers during these peak hours by adjusting delivery fees. This would help ensure sufficient driver availability and timely deliveries, leading to increased customer satisfaction and loyalty.
  2. The largest share of orders (~48%) is placed more than 7 days after the customer's previous order. To encourage more frequent purchases, the company could use targeted strategies to increase customer loyalty and drive repeat business. For example, coupons, discounts, and free shipping could be offered to customers who place a second order above a certain amount within a week.
  3. Among the top 20 products, 15 (75%) are organic. Organic products also have higher reorder rates than non-organic products. Given the growing demand for healthier food options, the company should consider increasing the visibility of organic products across its websites and mobile applications.
  4. The company may consider building a dynamic dashboard based on the treemap (Figure 10) with regular updates. It can facilitate ongoing monitoring of overall sales conditions, streamline data storage and analysis, and improve decision making efficiency.

🧽 Data Preprocessing

The original datasets were obtained from Kaggle and come from Instacart, a grocery delivery service. There are five datasets, covering orders, the products purchased in each order, products, departments, and aisles.

When the datasets were joined, the row counts did not match, which suggested many-to-one relationships or duplicate keys. An inspection of the unique values in the department and aisle datasets revealed repeated ids paired with implausible names, so two departments and three aisles were dropped. In practice it is more common to contact the data providers about this kind of issue, but here intuition works just fine since stores probably do not sell “illegal drugs” or “nuclear missiles”.
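The duplicate-id check described above can be sketched as follows. This is a minimal illustration only: the rows, the ids, and the helper name `find_conflicting_ids` are hypothetical and do not reproduce the actual Kaggle files.

```python
from collections import defaultdict

# Hypothetical (id, name) rows standing in for the department lookup table.
# The ids and the legitimate names are made up for illustration.
departments = [
    (1, "frozen"),
    (2, "other"),
    (2, "illegal drugs"),
    (3, "bakery"),
    (3, "nuclear missiles"),
    (4, "produce"),
]


def find_conflicting_ids(rows):
    """Return {id: [names]} for every id that maps to more than one name."""
    names_by_id = defaultdict(list)
    for row_id, name in rows:
        if name not in names_by_id[row_id]:
            names_by_id[row_id].append(name)
    return {i: names for i, names in names_by_id.items() if len(names) > 1}


conflicts = find_conflicting_ids(departments)
for row_id, names in conflicts.items():
    print(row_id, names)
```

Any id reported here would break a one-to-one join, so the conflicting rows need to be reviewed (and the implausible ones dropped) before merging.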

Some aisle names also contain stray digits and punctuation. These entries were cleaned, as shown in Table 1.

| Original | Cleaned |
| --- | --- |
| 1_!ice68'_cream68'_toppings | ice cream toppings |
| 10_!bakery68'_desserts | bakery desserts |

Table 1: Example of Data Cleaning Process
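One way to perform the cleaning in Table 1 is sketched below. The pattern is inferred from the two example rows only (a leading `<digits>_!` prefix, stray `68'` fragments, and underscores as word separators), so it is an assumption and may not cover every messy entry. The function name `clean_aisle_name` is hypothetical.

```python
import re


def clean_aisle_name(raw: str) -> str:
    """Clean an aisle name, assuming the debris pattern seen in Table 1."""
    name = re.sub(r"^\d+_!", "", raw)   # drop the leading "<digits>_!" prefix
    name = name.replace("68'", "")      # drop the stray "68'" fragments
    name = name.replace("_", " ")       # underscores separate words
    return re.sub(r"\s+", " ", name).strip()


for raw in ["1_!ice68'_cream68'_toppings", "10_!bakery68'_desserts"]:
    print(raw, "->", clean_aisle_name(raw))
```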

🧐 Exploratory Data Analysis

🛒 Customer Behavior

Figure 1: Distribution of Number of Orders Per User

The dataset covers approximately 200k users, 3.2 million orders, and 50k unique products. As Figure 1 shows, the number of orders per user is right-skewed and decays roughly exponentially. Around 29% of users placed 1-5 orders, about 25% placed 6-10 orders, and no more than 9% placed more than 40 orders. There is a peak at 99 orders per user, which may indicate a data issue.

Figure 2-1: Heatmap of Order Volume by Day of Week and Time of Day

According to Figure 2-1, more than 34% of orders are placed on Saturday and Sunday, and about 72% of orders are placed during daytime hours (i.e. 9 AM to 5 PM), which aligns with normal working hours.

Therefore, it is recommended that the company increase the delivery fee during peak hours, specifically from 9 AM to 5 PM on weekends, to ensure an adequate number of available drivers. Such an approach would guarantee timely order deliveries and increase customer satisfaction and loyalty.

Figure 2-2: Distribution of Orders by Day of Week

Figure 2-3: Distribution of Orders by Time of Day

Note: The original dataset encodes the day of week only as an integer from 0 to 6. This analysis assumes that day 0 is Saturday and day 1 is Sunday, given that people tend to shop for groceries more on weekends. Day 2 also has the most orders among the remaining days, which supports this assumption: intuitively, on Monday people are likely to order whatever they missed during weekend shopping. More details are shown in Figures 2-2 and 2-3. A follow-up with the data providers would still be helpful.
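The day-of-week assumption and the two shares discussed above can be expressed as a short sketch. The label mapping follows the assumption stated in the note (0 = Saturday, 1 = Sunday); treating "9 AM to 5 PM" as `9 <= hour < 17` is a further assumption, and the toy orders and function names are hypothetical.

```python
from collections import Counter

# Assumed mapping from the note above: 0 = Saturday, 1 = Sunday, 2 = Monday, ...
DOW_LABELS = ["Saturday", "Sunday", "Monday", "Tuesday",
              "Wednesday", "Thursday", "Friday"]


def weekend_share(order_dows):
    """Share of orders placed on day 0 or day 1 (assumed Saturday/Sunday)."""
    counts = Counter(order_dows)
    total = sum(counts.values())
    return (counts[0] + counts[1]) / total if total else 0.0


def daytime_share(order_hours, start=9, end=17):
    """Share of orders with start <= hour < end (boundary choice is assumed)."""
    if not order_hours:
        return 0.0
    return sum(start <= h < end for h in order_hours) / len(order_hours)


# Hypothetical (day_of_week, hour_of_day) pairs, not taken from the dataset.
toy_orders = [(0, 10), (0, 14), (1, 9), (2, 18), (3, 12), (5, 20), (6, 8), (1, 16)]
dows = [d for d, _ in toy_orders]
hours = [h for _, h in toy_orders]

print("weekend share:", weekend_share(dows))
print("daytime share:", daytime_share(hours))
print("busiest day:", DOW_LABELS[Counter(dows).most_common(1)[0][0]])
```

If the data providers confirm a different encoding, only `DOW_LABELS` and the indices used in `weekend_share` would need to change.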



Figure 3: Distribution of Order Size

As for the number of products per order, Figure 3 indicates that over 62% of orders contain no more than 10 products, and less than 10% of orders have more than 20 items.



Figure 4: Distribution of Days Since Prior Order

Figure 4 shows the distribution of the number of days since a customer's prior order. There appears to be a weekly cycle, with a peak every seven days. There is also a peak at day 30, which probably means that the data maintainer aggregates any interval longer than 30 days into that bin. A follow-up investigation is needed.
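Both observations (a peak every seven days and a spike in the last bin) can be checked programmatically. The sketch below uses a hypothetical histogram whose counts are invented to mimic the shape described for Figure 4; the helper names are also hypothetical.

```python
def local_peaks(counts):
    """Return indices whose count is strictly greater than both neighbours."""
    return [
        i for i in range(1, len(counts) - 1)
        if counts[i] > counts[i - 1] and counts[i] > counts[i + 1]
    ]


def last_bin_spike(counts):
    """True if the final bin exceeds the one before it, hinting at a capped value."""
    return len(counts) >= 2 and counts[-1] > counts[-2]


# Hypothetical order counts indexed by days since prior order (0 to 30).
counts_by_day = [
    20, 50, 70, 85, 90, 95, 110,     # days 0-6
    150, 100, 70, 60, 55, 52, 58,    # days 7-13
    75, 50, 40, 35, 32, 31, 34,      # days 14-20
    45, 30, 25, 22, 21, 20, 23,      # days 21-27
    30, 18, 160,                     # days 28-30
]

peaks = local_peaks(counts_by_day)
print("interior peaks at days:", peaks)
print("all peaks on multiples of 7:", all(p % 7 == 0 for p in peaks))
print("spike in last bin:", last_bin_spike(counts_by_day))
```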

Figure 5: Probability Density Histogram of Second Order Size relative to individual subpopulations. Orders are divided into three subpopulations by time since last order: “Same day” (day_since_prior_order = 0), “Within a week” (0 < day_since_prior_order <= 7), and “More than a week” (day_since_prior_order > 7).

Figure 5 compares the distribution of order size across subpopulations with different intervals since the prior order. Small second orders are more common among same-day orders, which suggests that people who reorder on the same day may simply have forgotten an item or two.

Looking closer at the individual subpopulations, only about 2% of orders fall into the “same day” category, and no more than 40% of orders are placed within a week of the customer’s previous order. The largest share of orders (~48%) falls into the “more than a week” category. Additionally, the median sizes of the second order in the “same day”, “within a week”, and “more than a week” subpopulations are 5, 8, and 9, respectively. This aligns with the common understanding that people tend to purchase more items when ordering after a longer interval.

As a result, to encourage customers who place orders more than a week apart to shift into the “within a week” category, it is recommended that marketing, sales, and other relevant departments collaborate to develop targeted strategies. One approach could be to offer coupons, discounts, or free shipping for customers whose second order exceeds a certain amount within a week.
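The three subpopulations defined in the Figure 5 caption, and the per-group median order size, can be computed along these lines. The toy orders are hypothetical, and treating a missing interval (`None`) as a first order that belongs to no group is an assumption.

```python
from statistics import median


def interval_bucket(days_since_prior):
    """Map days since prior order to the subpopulations used in Figure 5."""
    if days_since_prior is None:      # assumed: first order, no prior interval
        return None
    if days_since_prior == 0:
        return "Same day"
    if days_since_prior <= 7:
        return "Within a week"
    return "More than a week"


# Hypothetical (days_since_prior_order, order_size) pairs.
toy_orders = [(None, 12), (0, 3), (0, 5), (4, 8), (7, 10), (8, 9), (15, 11), (30, 14)]

sizes_by_bucket = {}
for days, size in toy_orders:
    bucket = interval_bucket(days)
    if bucket is not None:
        sizes_by_bucket.setdefault(bucket, []).append(size)

medians = {bucket: median(sizes) for bucket, sizes in sizes_by_bucket.items()}
for bucket, value in medians.items():
    print(bucket, value)
```

The same bucketing could be used to flag customers in the “More than a week” group as candidates for the targeted offers suggested above.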

🍌 Product

As for individual products, the most ordered product is the banana. The top 20 products by volume percentage all come from the produce and dairy eggs departments. These items are perishable and commonly used in everyday meals, which aligns with expectations. Among the top 20 products, 15 (75%) are organic (Table 2). Furthermore, organic products have higher reorder rates than non-organic products.

With the increasing trend towards natural and healthy foods, this presents a promising opportunity for the company to explore further. For instance, the company could enhance its promotion of organic products by increasing their visibility on its websites and apps.

| Product Name | Volume Percentage | Reorder Rate |
| --- | --- | --- |
| Banana | 1.46% | 0.8435 |
| Bag of Organic Bananas | 1.17% | 0.8326 |
| Organic Strawberries | 0.82% | 0.7777 |
| Organic Baby Spinach | 0.75% | 0.7725 |
| Organic Hass Avocado | 0.66% | 0.7966 |
| Organic Avocado | 0.54% | 0.7581 |
| Large Lemon | 0.47% | 0.696 |
| Strawberries | 0.44% | 0.6982 |
| Limes | 0.43% | 0.681 |
| Organic Whole Milk | 0.43% | 0.8304 |
| Organic Raspberries | 0.42% | 0.7691 |
| Organic Yellow Onion | 0.35% | 0.6971 |
| Organic Garlic | 0.33% | 0.6801 |
| Organic Zucchini | 0.32% | 0.6884 |
| Organic Blueberries | 0.31% | 0.6288 |
| Cucumber Kirby | 0.3% | 0.6917 |
| Organic Fuji Apple | 0.27% | 0.7119 |
| Organic Lemon | 0.27% | 0.6899 |
| Apple Honeycrisp Organic | 0.26% | 0.7352 |
| Organic Grape Tomatoes | 0.26% | 0.6555 |

Table 2: Top 20 Best Selling Products With Reorder Rates
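The organic share and a rough reorder-rate comparison can be recomputed directly from Table 2. In this sketch, a product is treated as organic when its name contains the word "organic", which is an assumption about how the label was derived. The comparison covers only these 20 rows, not the full catalog.

```python
from statistics import mean

# (product name, reorder rate) pairs copied from Table 2.
top20 = [
    ("Banana", 0.8435), ("Bag of Organic Bananas", 0.8326),
    ("Organic Strawberries", 0.7777), ("Organic Baby Spinach", 0.7725),
    ("Organic Hass Avocado", 0.7966), ("Organic Avocado", 0.7581),
    ("Large Lemon", 0.696), ("Strawberries", 0.6982),
    ("Limes", 0.681), ("Organic Whole Milk", 0.8304),
    ("Organic Raspberries", 0.7691), ("Organic Yellow Onion", 0.6971),
    ("Organic Garlic", 0.6801), ("Organic Zucchini", 0.6884),
    ("Organic Blueberries", 0.6288), ("Cucumber Kirby", 0.6917),
    ("Organic Fuji Apple", 0.7119), ("Organic Lemon", 0.6899),
    ("Apple Honeycrisp Organic", 0.7352), ("Organic Grape Tomatoes", 0.6555),
]


def is_organic(name: str) -> bool:
    """Assumed rule: a product is organic if its name mentions 'organic'."""
    return "organic" in name.lower()


organic_rates = [rate for name, rate in top20 if is_organic(name)]
other_rates = [rate for name, rate in top20 if not is_organic(name)]

organic_share = len(organic_rates) / len(top20)
print(f"organic products: {len(organic_rates)} of {len(top20)} ({organic_share:.0%})")
print(f"mean reorder rate, organic: {mean(organic_rates):.4f}")
print(f"mean reorder rate, non-organic: {mean(other_rates):.4f}")
```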

As Figures 6 and 7 show, the departments and aisles with the highest reorder rates are mostly food, drinks, and personal care items, which get used up quickly.

Figure 6: Department Reorder Rates

Figure 7: Aisle Reorder Rates

According to Figure 8, the department with the largest order volume is produce. The “other” category in the pie chart comprises departments that each account for less than 2% of total order volume: personal care, babies, international, alcohol, pets, missing, other and bulk.

Figure 8: Pie Chart of Order Volume Percentage By Department

According to Figure 9, the single aisle with the largest order volume is fresh fruits. Due to space limitations, aisles outside the top 15 by order volume are grouped into the “other” category.

Figure 9: Pie Chart of Order Volume Percentage By Aisle

The treemap below (Figure 10) provides a visual representation of order volumes and reorder rates across departments and aisles, using size and color to convey information. To facilitate ongoing monitoring of overall sales conditions, the company may consider turning this treemap into a dynamic dashboard that can be accessed by product managers and other senior staff. This approach can also help optimize data storage and analysis and make the decision-making process more efficient.

Figure 10: Treemap of Department and Aisles, sized by order volumes and colored by reorder rates. More detailed data can be found by hovering over the cells.

📝 Future work

The questions that require follow up with the data providers include the following:

Improvement to data quality:

Extra information:

With these questions answered, it is possible to further enhance the insights provided above.