# Migrating 2.5M Messages: A Case Study on Using MongoDB Aggregation Pipeline to Extract Data from Multiple Collections

### Introduction: What is Aggregation in MongoDB?

MongoDB's aggregation framework is a powerful tool for performing data processing and transformation operations on the documents within a collection. The core of this framework is the aggregation pipeline, which allows you to process data through a sequence of stages. Each stage transforms the documents as they pass through, and the result is a new set of documents. The aggregation pipeline can perform a variety of operations, including filtering, grouping, sorting, reshaping, and more. This makes it an essential feature for complex data processing tasks that go beyond simple queries.

### The Problem to Solve

In a production scenario, I faced a unique challenge. I needed to perform a database migration and send all the data to our client. The client required all the data and metadata, which were spread across multiple collections, to be consolidated into a single JSON file. This was necessary to ensure the client could easily import the data into their system without dealing with multiple sources. The collections involved were `customer_chat`, `attendant`, and `messages`, each containing different parts of the required information.

### How I Solved It with an Aggregation Pipeline

To solve this problem, I used MongoDB's aggregation pipeline to join the data from the different collections into a single, cohesive dataset. This pipeline was executed on the `chatrooms` collection, where each document contained keys that linked to related data in the `customer_chat`, `attendant`, and `messages` collections. By utilizing fields such as `v._id` to connect with `customer_chat`, `servedBy._id` to link with `attendant`, and `_id` to join with `messages`, we were able to seamlessly integrate the disparate data sources into a single document structure. Below is the pipeline I used and an explanation of each stage:

```json
[
  {
    $lookup: {
      from: "customer_chat",
      localField: "customer._id",
      foreignField: "_id",
      as: "client"
    }
  },
  {
    $unwind: {
      path: "$client",
      preserveNullAndEmptyArrays: true
    }
  },
  {
    $lookup: {
      from: "attendant",
      localField: "servedBy._id",
      foreignField: "_id",
      as: "attendant"
    }
  },
  {
    $unwind: {
      path: "$attendant",
      preserveNullAndEmptyArrays: true
    }
  },
  {
    $lookup: {
      from: "messages",
      localField: "_id",
      foreignField: "rid",
      as: "messages"
    }
  },
  {
    $project: {
      _id: 1,
      msgs: 1,
      usersCount: 1,
      departmentId: 1,
      omnichannel: 1,
      responseBy: 1,
      client: "$client",
      attendant: "$attendant",
      messages: "$messages"
    }
  }
]
```

1. **$lookup:**
    
    * The first `$lookup` stage joins the `customer_chat` collection with the main collection using the `v._id` field. The matched documents are stored in the `client` field.
        
    * The second `$lookup` stage joins the `attendant` collection using the `servedBy._id` field, storing the results in the `attendant` field.
        
    * The third `$lookup` stage joins the `messages` collection using the `_id` field from the main collection and `rid` field from the `messages` collection, storing the results in the `messages` field.
        
2. **$unwind:**
    
    * The `$unwind` stages flatten the arrays created by the `$lookup` stages. Using `preserveNullAndEmptyArrays: true` ensures that documents without matching entries in the joined collections are still included in the results.
        
3. **$project:**
    
    * The `$project` stage reshapes each document to include only the necessary fields: `_id`, `msgs`, `usersCount`, `departmentId`, `omnichannel`, `responseBy`, and the joined fields (`client`, `attendant`, `messages`).
        

By running this pipeline, we can create a single, unified document for each entry in the main collection, containing all relevant data from the associated collections. This approach ensures that all the required data is consolidated into a single JSON structure, ready for export.

### Why and When to Use Pipeline Aggregation

The aggregation pipeline is particularly useful in scenarios where you need to:

1. **Combine Data from Multiple Collections:**
    
    * When data is distributed across several collections, the aggregation pipeline's `$lookup` stage can effectively join these collections into a single, unified view.
        
2. **Perform Complex Transformations:**
    
    * The pipeline can perform sophisticated data transformations and calculations, which are often necessary for preparing data for reporting, analysis, or migration.
        
3. **Optimize Query Performance:**
    
    * Aggregation pipelines can offload complex processing to the database server, which can be more efficient than performing multiple queries and processing the data in the application layer.
        
4. **Generate Customized Reports:**
    
    * You can use aggregation pipelines to create detailed and customized reports by grouping, filtering, and transforming the data in various ways.
        

In our case, the need to migrate and consolidate data from multiple collections into a single JSON file for our client made the aggregation pipeline the ideal solution. It allowed us to efficiently join, transform, and export the data, ensuring that the client received a comprehensive and easy-to-import dataset.

By understanding and leveraging the power of MongoDB's aggregation pipeline, you can tackle complex data processing challenges and streamline your workflows, making it an invaluable tool for any MongoDB-based application.
