Structure the operation using unordered bulkWrite with upserts, a compound unique index on product identifier fields, and timestamp management via $currentDate or server-side date operators
For processing 1 million products daily, individual update/insert operations would create excessive network overhead and database load. The optimal solution is MongoDB's bulkWrite() with unordered operations and upserts. This approach batches many operations into a single database call, sharply reducing network round trips while handling both updates and inserts. Unordered mode lets the remaining operations continue even if some fail, providing better fault tolerance at this scale.
The architecture relies on three critical components: a unique identifier for matching existing products, unordered bulk operations for maximum throughput, and server-side timestamp handling for accuracy. With ordered: false, the server is free to execute operations in parallel, which can improve write throughput by roughly 5-10x compared to ordered operations. A unique index on your product identifier (such as SKU or product ID) enables efficient matching and prevents duplicate entries.
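To make the shape of these operations concrete, here is a minimal sketch of building the upsert operations passed to bulkWrite(). The field names (sku, name, price, lastSeen, firstSeen) are illustrative, not prescriptive:

```javascript
// Build an array of unordered upsert operations for bulkWrite().
function buildUpsertOps(products) {
  return products.map((p) => ({
    updateOne: {
      filter: { sku: p.sku },                    // matched via the unique index
      update: {
        $set: { name: p.name, price: p.price },
        $currentDate: { lastSeen: true },        // server-side timestamp
        $setOnInsert: { firstSeen: new Date() }, // only set when inserting (client clock)
      },
      upsert: true,                              // insert when no document matches
    },
  }));
}

// Usage with the Node.js driver (requires a live connection):
// await collection.bulkWrite(buildUpsertOps(batch), { ordered: false });
```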
At this scale, several optimizations are essential. First, batch size matters: 1,000-5,000 operations per batch balances network efficiency against memory usage. Second, consider temporarily dropping non-essential indexes during the bulk operation and recreating them afterward to reduce write overhead. Third, unordered mode (ordered: false) is critical for performance: it allows MongoDB to parallelize writes and continue processing even if individual operations fail.
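Splitting the daily feed into batches is straightforward; a sketch of the chunking step (the batch size of 1,000 is one reasonable choice within the range above):

```javascript
// Yield fixed-size slices of a large array, one per bulkWrite() call.
function* batches(items, size = 1000) {
  for (let i = 0; i < items.length; i += size) {
    yield items.slice(i, i + size);
  }
}

// Usage (ops() stands in for whatever builds your operation array):
// for (const batch of batches(allProducts, 1000)) {
//   await collection.bulkWrite(ops(batch), { ordered: false });
// }
```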
Server-side timestamps: Use $$NOW or $$CLUSTER_TIME in an aggregation pipeline update to ensure consistent, server-generated timestamps regardless of client clock variations.
$currentDate operator: For simpler implementations, use $currentDate: { lastSeen: true }, which sets the field to the current server time.
Preserving creation time: Use $setOnInsert for fields that should only be set when a new document is inserted, such as firstSeen or createdAt.
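The three points above can be sketched as two alternative update shapes (field names and values are illustrative):

```javascript
// 1. Classic update document: $currentDate always stamps lastSeen with server
//    time; $setOnInsert fires only when the upsert inserts a new document,
//    but its value comes from the client clock.
const classicUpdate = {
  $set: { price: 9.99 },
  $currentDate: { lastSeen: true },
  $setOnInsert: { firstSeen: new Date() },
};

// 2. Aggregation pipeline update: $$NOW is the server's current time, and
//    $ifNull keeps the existing firstSeen on documents that already have one,
//    so both timestamps are server-generated.
const pipelineUpdate = [
  {
    $set: {
      price: 9.99,
      lastSeen: '$$NOW',
      firstSeen: { $ifNull: ['$firstSeen', '$$NOW'] },
    },
  },
];
```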
With 1 million daily operations, concurrent updates are inevitable. Without proper safeguards, you may encounter duplicate key errors (E11000) when multiple processes attempt to upsert the same product simultaneously. To handle this, implement retry logic with exponential backoff, or use optimistic locking with version numbers if your data source provides them. The aggregation pipeline approach described above also helps, since it only applies changes when its conditions are met.
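A minimal sketch of the retry-with-backoff pattern, assuming runBatch is any async function that performs the bulkWrite (the error-code check mirrors the Node.js driver's shape, where E11000 surfaces as code 11000):

```javascript
// Retry a batch on transient duplicate-key races, with exponential backoff.
async function withRetry(runBatch, maxAttempts = 3, baseDelayMs = 100) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await runBatch();
    } catch (err) {
      const isDupKey =
        err.code === 11000 ||
        (err.writeErrors || []).some((e) => e.code === 11000);
      if (!isDupKey || attempt >= maxAttempts) throw err; // give up
      // Wait 100 ms, 200 ms, 400 ms, ... before the next attempt.
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
}
```

A retried duplicate-key upsert succeeds on the second attempt because the first contender has already inserted the document, so the retry matches it as a plain update.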
Beyond the unique index on your product identifier, consider additional indexes based on your query patterns. For efficient batch operations, ensure your filter fields are indexed. However, during massive write operations, each index adds overhead: every upsert must update all indexes on the collection. A balanced approach is to maintain only essential indexes during daily operations and build analytical indexes on a hidden secondary dedicated to reporting.
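For reference, a sketch of the unique index that the upsert filters match against (the vendorId/sku compound key and the index name are hypothetical; adapt them to your identifier fields):

```javascript
// Compound unique index backing the upsert filter.
const indexSpec = { vendorId: 1, sku: 1 };
const indexOptions = { unique: true, name: 'uniq_vendor_sku' };

// Usage (requires a live connection):
// await collection.createIndex(indexSpec, indexOptions);
```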
Implement comprehensive error handling that captures partial results. When an unordered bulk operation fails partway, the bulk write error it raises exposes writeErrors, an array of individual operation failures. Log these for analysis and consider retrying the failed operations in smaller batches. Monitor db.currentOp() during large batch operations to detect performance bottlenecks and adjust batch sizes accordingly.
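A sketch of pulling failed operations back out for retry; the error shape (writeErrors with index, code, errmsg) mirrors the Node.js driver's bulk write error, simplified here to plain objects:

```javascript
// Map each write error back to the operation that caused it, so failures
// can be logged and resubmitted in a smaller batch.
function collectFailures(bulkError, originalOps) {
  return (bulkError.writeErrors || []).map((we) => ({
    code: we.code,
    message: we.errmsg,
    op: originalOps[we.index], // index points back into the submitted batch
  }));
}
```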