Exporting Magento 2 Data: Flatten EAV with SQL & Node

How to optimize complex EAV joins in MySQL using index hints to prevent full table scans on catalogs exceeding 1 million SKUs.
Complete Node.js stream backpressure implementations that keep memory usage under 100MB while processing millions of records.

When migrating off Magento 2, the first obstacle is always the database schema. Magento does not store data in clean flat rows — it uses an Entity-Attribute-Value (EAV) model that spreads data across dozens of tables with store-scope inheritance. Understanding this before writing SQL will save you days.

This guide covers two extraction problems: order export (the simpler case) and product catalog export (the genuinely hard case), followed by a production-grade Node.js pipeline to ingest that data into your new service databases.

Part 1: Exporting Orders

Full Order + Payment + Shipping Export

SELECT
    so.entity_id            AS order_id,
    so.increment_id         AS magento_order_number,
    so.status               AS order_status,
    so.grand_total          AS total_amount,
    so.order_currency_code  AS currency,  -- grand_total is stored in the order currency
    so.created_at           AS order_created_at,
    so.customer_email,
    so.customer_firstname   AS customer_first,
    so.customer_lastname    AS customer_last,

    -- Shipping address (denormalized)
    soa.street              AS ship_street,
    soa.city                AS ship_city,
    soa.region              AS ship_region,
    soa.postcode            AS ship_postcode,
    soa.country_id          AS ship_country,
    soa.telephone           AS ship_phone,

    -- Payment method
    sop.method              AS payment_method,
    sop.last_trans_id       AS payment_transaction_id,

    -- Shipment (NULL if not yet fulfilled; an order with several shipments yields one row per shipment)
    sos.entity_id           AS shipment_id,
    sos.created_at          AS shipped_at

FROM sales_order so
LEFT JOIN sales_order_address soa
    ON soa.parent_id = so.entity_id AND soa.address_type = 'shipping'
LEFT JOIN sales_order_payment sop
    ON sop.parent_id = so.entity_id
LEFT JOIN sales_shipment sos
    ON sos.order_id = so.entity_id

WHERE so.status NOT IN ('canceled', 'fraud')
  AND so.created_at >= '2022-01-01 00:00:00'
ORDER BY so.created_at ASC;

Order Line Items (Second Pass)

SELECT
    soi.order_id,
    soi.sku,
    soi.name                AS product_name,
    soi.qty_ordered,
    soi.qty_shipped,
    soi.qty_refunded,
    soi.price               AS unit_price,
    soi.row_total,
    soi.product_type,
    soi.parent_item_id      -- non-null for configurable child rows

FROM sales_order_item soi

WHERE soi.parent_item_id IS NULL  -- skip phantom child rows for configurables
ORDER BY soi.order_id ASC, soi.item_id ASC;

Join on order_id in your ingestion script to reconstruct the full order object.
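
As a rough illustration of that join, the sketch below groups line items by order_id and attaches them to their parent orders. The function name attachItems and the sample rows are hypothetical; the field names follow the column aliases used in the two queries.

```javascript
// Sketch: attach line items to their parent orders by order_id.
function attachItems(orders, items) {
    const itemsByOrder = new Map();
    for (const item of items) {
        const key = String(item.order_id); // normalize, since CSV exports yield strings
        if (!itemsByOrder.has(key)) itemsByOrder.set(key, []);
        itemsByOrder.get(key).push(item);
    }
    // Orders without line items get an empty array rather than undefined
    return orders.map(order => ({
        ...order,
        items: itemsByOrder.get(String(order.order_id)) || []
    }));
}

// Hypothetical sample rows
const sampleOrders = [
    { order_id: 1, magento_order_number: '000000001' },
    { order_id: 2, magento_order_number: '000000002' }
];
const sampleItems = [
    { order_id: 1, sku: 'TSHIRT-BLUE-M', qty_ordered: 2 },
    { order_id: 1, sku: 'MUG-01', qty_ordered: 1 }
];

const fullOrders = attachItems(sampleOrders, sampleItems);
console.log(fullOrders[0].items.length); // 2
console.log(fullOrders[1].items.length); // 0
```

Building the Map once keeps the join linear in the number of rows. For exports too large to hold in memory, the same grouping can be done per batch, provided both files are sorted by order_id.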

Part 2: Exporting the Product Catalog (The Hard Part)

This is where most migration engineers underestimate the effort. The product catalog uses full EAV with store-scope inheritance: a value at store_id = 0 (Admin/Global) is the default; a value at a specific store_id overrides it for that store view. A naive SELECT * FROM catalog_product_entity returns only the static columns (sku, type_id, timestamps), and reading the value tables without the fallback rule yields incomplete or wrong data for any store view.
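
To make the inheritance rule concrete, here is a minimal JavaScript sketch of the same fallback applied in application code. The row shape ({ attribute_code, store_id, value }) and the function name resolveAttributes are assumptions for illustration, not part of Magento.

```javascript
// Sketch: resolve one product's attributes for a store view.
// valueRows: hypothetical rows of the form { attribute_code, store_id, value }.
function resolveAttributes(valueRows, storeId) {
    const globalValues = {};
    const storeValues = {};
    for (const { attribute_code: code, store_id: scope, value } of valueRows) {
        if (scope === 0) globalValues[code] = value;
        else if (scope === storeId) storeValues[code] = value;
        // rows for other store views are ignored
    }
    const resolved = {};
    const codes = new Set([...Object.keys(globalValues), ...Object.keys(storeValues)]);
    for (const code of codes) {
        // A missing or NULL store-level value falls back to the global default
        resolved[code] = storeValues[code] ?? globalValues[code] ?? null;
    }
    return resolved;
}

const rows = [
    { attribute_code: 'name', store_id: 0, value: 'Blue Shirt' },
    { attribute_code: 'name', store_id: 1, value: 'Blue Shirt (EU)' },
    { attribute_code: 'name', store_id: 2, value: 'Blaues Hemd' },
    { attribute_code: 'status', store_id: 0, value: 1 }
];
console.log(resolveAttributes(rows, 1)); // { name: 'Blue Shirt (EU)', status: 1 }
```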

The correct approach is a two-step process.

Step 1: Materialize Attribute IDs

The attribute_id values are environment-specific — they differ between Magento installations. Run this once and use the result to populate your export query:

SELECT attribute_id, attribute_code, backend_type
FROM eav_attribute
WHERE entity_type_id = (
    SELECT entity_type_id FROM eav_entity_type
    WHERE entity_type_code = 'catalog_product'
)
AND attribute_code IN (
    'name', 'url_key', 'description', 'short_description',
    'price', 'special_price', 'status', 'visibility', 'weight'
);
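
The backend_type column tells you which value table holds each attribute. A small sketch of turning the Step 1 result into a lookup keyed by attribute_code follows; the helper name indexAttributes is hypothetical, and the sample IDs are placeholders that will differ on your installation.

```javascript
// Sketch: index the Step 1 result rows by attribute_code.
function indexAttributes(rows) {
    const byCode = new Map();
    for (const row of rows) {
        byCode.set(row.attribute_code, {
            id: row.attribute_id,
            // 'static' attributes are plain columns on catalog_product_entity;
            // every other backend_type has its own value table
            table: row.backend_type === 'static'
                ? 'catalog_product_entity'
                : `catalog_product_entity_${row.backend_type}`
        });
    }
    return byCode;
}

// Placeholder rows shaped like the query result
const attributeIndex = indexAttributes([
    { attribute_id: 73, attribute_code: 'name', backend_type: 'varchar' },
    { attribute_id: 75, attribute_code: 'description', backend_type: 'text' },
    { attribute_id: 77, attribute_code: 'price', backend_type: 'decimal' }
]);
console.log(attributeIndex.get('description').table); // catalog_product_entity_text
```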

Step 2: Flattened Product Export with Store-Scope Fallback

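This query exports products for the store view with store_id = 1. For each attribute, it prefers the store-specific value and falls back to the global default (store_id = 0). On Adobe Commerce installations with Content Staging, the value tables link to the entity on row_id rather than entity_id, so adjust the join columns there. Replace the attribute_id values with results from Step 1: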

SELECT
    e.entity_id,
    e.sku,
    e.type_id                                           AS product_type,
    e.created_at,

    -- Name (varchar): prefer store-specific, fallback to global
    COALESCE(v_name_s.value, v_name_g.value)            AS name,
    COALESCE(v_url_s.value, v_url_g.value)              AS url_key,

    -- Status: 1=Enabled, 2=Disabled (int)
    COALESCE(i_status_s.value, i_status_g.value)        AS status,
    -- Visibility: 1=Not visible, 4=Catalog+Search (int)
    COALESCE(i_vis_s.value, i_vis_g.value)              AS visibility,

    -- Price (decimal): global scope by default; if Catalog Price Scope is set to Website, add store-level joins as for name
    d_price.value                                       AS price,
    d_special.value                                     AS special_price,
    d_weight.value                                      AS weight

FROM catalog_product_entity e

-- === VARCHAR: name ===
LEFT JOIN catalog_product_entity_varchar v_name_s
    ON v_name_s.entity_id = e.entity_id AND v_name_s.attribute_id = 73 AND v_name_s.store_id = 1
LEFT JOIN catalog_product_entity_varchar v_name_g
    ON v_name_g.entity_id = e.entity_id AND v_name_g.attribute_id = 73 AND v_name_g.store_id = 0

-- === VARCHAR: url_key ===
LEFT JOIN catalog_product_entity_varchar v_url_s
    ON v_url_s.entity_id = e.entity_id AND v_url_s.attribute_id = 120 AND v_url_s.store_id = 1
LEFT JOIN catalog_product_entity_varchar v_url_g
    ON v_url_g.entity_id = e.entity_id AND v_url_g.attribute_id = 120 AND v_url_g.store_id = 0

-- === INT: status ===
LEFT JOIN catalog_product_entity_int i_status_s
    ON i_status_s.entity_id = e.entity_id AND i_status_s.attribute_id = 96 AND i_status_s.store_id = 1
LEFT JOIN catalog_product_entity_int i_status_g
    ON i_status_g.entity_id = e.entity_id AND i_status_g.attribute_id = 96 AND i_status_g.store_id = 0

-- === INT: visibility ===
LEFT JOIN catalog_product_entity_int i_vis_s
    ON i_vis_s.entity_id = e.entity_id AND i_vis_s.attribute_id = 99 AND i_vis_s.store_id = 1
LEFT JOIN catalog_product_entity_int i_vis_g
    ON i_vis_g.entity_id = e.entity_id AND i_vis_g.attribute_id = 99 AND i_vis_g.store_id = 0

-- === DECIMAL: price, special_price, weight (global scope assumed) ===
LEFT JOIN catalog_product_entity_decimal d_price
    ON d_price.entity_id = e.entity_id AND d_price.attribute_id = 77 AND d_price.store_id = 0
LEFT JOIN catalog_product_entity_decimal d_special
    ON d_special.entity_id = e.entity_id AND d_special.attribute_id = 78 AND d_special.store_id = 0
LEFT JOIN catalog_product_entity_decimal d_weight
    ON d_weight.entity_id = e.entity_id AND d_weight.attribute_id = 80 AND d_weight.store_id = 0

-- Only export enabled products
WHERE COALESCE(i_status_s.value, i_status_g.value) = 1
ORDER BY e.entity_id ASC;
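
Because the join pattern repeats for every attribute, the statement can also be generated from the Step 1 rows instead of being edited by hand. The following is a sketch under assumptions: buildFlatProductQuery and VALUE_TYPES are hypothetical names, and every attribute is treated as store-scoped, which is harmless for global attributes because the store-level join simply finds no row.

```javascript
// Backend types that have their own catalog_product_entity_<type> value table
const VALUE_TYPES = new Set(['varchar', 'int', 'decimal', 'text', 'datetime']);

// Sketch: generate the store-scope fallback query from Step 1 rows.
function buildFlatProductQuery(attributes, storeId) {
    if (!Number.isInteger(storeId) || storeId < 0) {
        throw new TypeError('storeId must be a non-negative integer');
    }
    const selects = ['e.entity_id', 'e.sku', 'e.type_id AS product_type'];
    const joins = [];

    for (const { attribute_id: id, attribute_code: code, backend_type: type } of attributes) {
        // These values are interpolated into raw SQL, so validate them strictly
        if (!Number.isInteger(id) || !/^[a-z][a-z0-9_]*$/.test(code) || !VALUE_TYPES.has(type)) {
            throw new TypeError(`Unsupported attribute definition: ${code}`);
        }
        const table = `catalog_product_entity_${type}`;
        const storeAlias = `${code}_s`;
        const globalAlias = `${code}_g`;
        selects.push(`COALESCE(${storeAlias}.value, ${globalAlias}.value) AS \`${code}\``);
        for (const [alias, scope] of [[storeAlias, storeId], [globalAlias, 0]]) {
            joins.push(
                `LEFT JOIN ${table} ${alias} ON ${alias}.entity_id = e.entity_id ` +
                `AND ${alias}.attribute_id = ${id} AND ${alias}.store_id = ${scope}`
            );
        }
    }

    return [
        `SELECT ${selects.join(', ')}`,
        'FROM catalog_product_entity e',
        ...joins,
        'ORDER BY e.entity_id ASC'
    ].join('\n');
}

console.log(buildFlatProductQuery(
    [{ attribute_id: 73, attribute_code: 'name', backend_type: 'varchar' }],
    1
));
```

The strict validation is deliberate: attribute codes and IDs cannot be bound as query parameters when they form part of aliases and table names, so rejecting anything unexpected is the safer default.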

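Performance: On catalogs with 25,000+ SKUs, this query will be slow. Run EXPLAIN first; EXPLAIN ANALYZE (available from MySQL 8.0.18) actually executes the query, so use it on a replica or on a small entity_id range. Confirm that each EAV value table still has its composite index on (entity_id, attribute_id, store_id), which a stock Magento schema defines as a unique key. Batch by entity_id ranges (WHERE e.entity_id BETWEEN 1 AND 5000) so that no single long-running query monopolizes your production database.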
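
The range batching can be driven by a small generator. This is a sketch with a hypothetical helper name (entityIdRanges); it assumes you first fetch the bounds with SELECT MIN(entity_id), MAX(entity_id) FROM catalog_product_entity and then bind each pair to a WHERE e.entity_id BETWEEN ? AND ? clause.

```javascript
// Sketch: yield inclusive [start, end] entity_id ranges for batched exports.
function* entityIdRanges(minId, maxId, batchSize) {
    if (!Number.isInteger(batchSize) || batchSize <= 0) {
        throw new RangeError('batchSize must be a positive integer');
    }
    for (let start = minId; start <= maxId; start += batchSize) {
        yield [start, Math.min(start + batchSize - 1, maxId)];
    }
}

for (const [from, to] of entityIdRanges(1, 12000, 5000)) {
    console.log(`WHERE e.entity_id BETWEEN ${from} AND ${to}`);
}
// WHERE e.entity_id BETWEEN 1 AND 5000
// WHERE e.entity_id BETWEEN 5001 AND 10000
// WHERE e.entity_id BETWEEN 10001 AND 12000
```

Ranges are based on IDs rather than row counts, so gaps left by deleted products make some batches smaller than others. That is usually acceptable for an export.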

Direct MySQL Streaming via mysql2 Stream API

To extract large datasets with a constant, low memory footprint, we stream rows directly from the MySQL network socket using the mysql2 driver’s streaming API.

Implementation of Direct MySQL Stream Export

// stream-export.js — Direct MySQL Streaming Export
const mysql = require('mysql2');
const fs = require('fs');
const { Transform, pipeline } = require('stream');
const { promisify } = require('util');
const pipelinePromise = promisify(pipeline);

// Initialize database connection pool with streaming optimization
const pool = mysql.createPool({
    host: 'localhost',
    user: 'magento_user',
    password: 'password',
    database: 'magento2',
    waitForConnections: true,
    connectionLimit: 10,
    queueLimit: 0,
    // Enable support for big numbers to prevent truncation or conversion overhead
    supportBigNumbers: true,
    bigNumberStrings: true
});

// Transform stream to map raw EAV rows to clean catalog objects
class CatalogTransformer extends Transform {
    constructor(options = {}) {
        // Set objectMode to true to handle rows as JavaScript objects rather than buffers
        super({ ...options, objectMode: true });
        this.recordsProcessed = 0;
    }

    _transform(row, encoding, callback) {
        try {
            // Apply lightweight denormalization or mapping
            const product = {
                productId: row.entity_id,
                sku: row.sku,
                type: row.product_type,
                name: row.name || 'Unnamed Product',
                urlKey: row.url_key || '',
                price: parseFloat(row.price) || 0.0,
                status: row.status === 1 ? 'enabled' : 'disabled',
                visibility: this.mapVisibility(row.visibility),
                exportedAt: new Date().toISOString()
            };

            // Push the formatted object down the stream chain
            this.push(JSON.stringify(product) + '\n');
            
            this.recordsProcessed++;
            if (this.recordsProcessed % 10000 === 0) {
                console.log(`[Stream] Formatted ${this.recordsProcessed.toLocaleString()} products...`);
            }

            // Signal that we are ready for the next row
            callback();
        } catch (err) {
            callback(err); // Propagate error to destroy the pipeline
        }
    }

    mapVisibility(code) {
        switch (code) {
            case 1: return 'Not Visible';
            case 2: return 'Catalog';
            case 3: return 'Search';
            case 4: return 'Catalog, Search';
            default: return 'Unknown';
        }
    }
}

async function runExport() {
    const startTime = Date.now();
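    const outputFile = './exports/catalog-products.jsonl';
    // createWriteStream does not create missing directories, so ensure ./exports exists
    fs.mkdirSync('./exports', { recursive: true });
    const writeStream = fs.createWriteStream(outputFile, { encoding: 'utf8' });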

    // The SQL query to retrieve the EAV flattened catalog
    const querySql = `
        SELECT
            e.entity_id,
            e.sku,
            e.type_id AS product_type,
            COALESCE(v_name.value, v_name_g.value) AS name,
            COALESCE(v_url.value, v_url_g.value) AS url_key,
            COALESCE(i_status.value, i_status_g.value) AS status,
            COALESCE(i_vis.value, i_vis_g.value) AS visibility,
            d_price.value AS price
        FROM catalog_product_entity e
        LEFT JOIN catalog_product_entity_varchar v_name ON v_name.entity_id = e.entity_id AND v_name.attribute_id = 73 AND v_name.store_id = 1
        LEFT JOIN catalog_product_entity_varchar v_name_g ON v_name_g.entity_id = e.entity_id AND v_name_g.attribute_id = 73 AND v_name_g.store_id = 0
        LEFT JOIN catalog_product_entity_varchar v_url ON v_url.entity_id = e.entity_id AND v_url.attribute_id = 120 AND v_url.store_id = 1
        LEFT JOIN catalog_product_entity_varchar v_url_g ON v_url_g.entity_id = e.entity_id AND v_url_g.attribute_id = 120 AND v_url_g.store_id = 0
        LEFT JOIN catalog_product_entity_int i_status ON i_status.entity_id = e.entity_id AND i_status.attribute_id = 96 AND i_status.store_id = 1
        LEFT JOIN catalog_product_entity_int i_status_g ON i_status_g.entity_id = e.entity_id AND i_status_g.attribute_id = 96 AND i_status_g.store_id = 0
        LEFT JOIN catalog_product_entity_int i_vis ON i_vis.entity_id = e.entity_id AND i_vis.attribute_id = 99 AND i_vis.store_id = 1
        LEFT JOIN catalog_product_entity_int i_vis_g ON i_vis_g.entity_id = e.entity_id AND i_vis_g.attribute_id = 99 AND i_vis_g.store_id = 0
        LEFT JOIN catalog_product_entity_decimal d_price ON d_price.entity_id = e.entity_id AND d_price.attribute_id = 77 AND d_price.store_id = 0
        ORDER BY e.entity_id ASC
    `;

    console.log('Initiating database connection and streaming query...');

    // Get a dedicated connection from the pool
    const connection = await new Promise((resolve, reject) => {
        pool.getConnection((err, conn) => {
            if (err) reject(err);
            else resolve(conn);
        });
    });

    try {
        // Execute query and retrieve the native readable stream
        // We set highWaterMark on the database stream to control row buffering
        const dbStream = connection.query(querySql).stream({ 
            objectMode: true,
            highWaterMark: 128 // Limit internal stream queue to 128 rows
        });

        const transformer = new CatalogTransformer();

        // Pipeline automatically binds error events and closes streams on failure/completion
        await pipelinePromise(
            dbStream,
            transformer,
            writeStream
        );

        const duration = ((Date.now() - startTime) / 1000).toFixed(2);
        console.log(`\n✅ Export Completed Successfully in ${duration}s`);
        console.log(`   Destination: ${outputFile}`);
        console.log(`   Total Records: ${transformer.recordsProcessed.toLocaleString()}`);

    } catch (err) {
        console.error('\n✗ Pipeline Failure:', err);
        process.exitCode = 1; // report the failure to the calling shell or scheduler
    } finally {
        connection.release();
        pool.end();
    }
}

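// Errors raised before the try block (e.g. a failed pool.getConnection) land here
runExport().catch(err => {
    console.error('\n✗ Fatal:', err);
    process.exit(1);
});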

Decoupling Memory Consumption: Backpressure & TCP Windows

The pipeline’s key design characteristic is that its memory footprint remains independent of the size of the database catalog. Whether exporting 10,000 SKUs or 10,000,000 SKUs, heap usage stays roughly constant (the target here is under 100MB). This is achieved via backpressure propagation:

Transform Stream Buffer Limit: As the database stream emits rows, they enter the CatalogTransformer’s input buffer, which accepts up to highWaterMark objects (16 by default in object mode) before reporting that it is full.
HighWaterMark Boundaries: If the destination stream (writing to disk or posting to a network API) slows down, its buffer fills first. Once it reaches its highWaterMark, its write() call returns false, pipeline stops feeding it, and the stall propagates upstream through the transformer to the database stream.
Pausing the Socket Read: When the database stream’s own buffer is full, its push() returns false and the mysql2 driver pauses the connection, so it stops reading packets from the TCP socket.
TCP Receive Window Saturation: The operating system’s TCP receive buffer for that socket fills up. The client then advertises a zero receive window, telling the MySQL server that it cannot accept more data.
Database Server Pause: The MySQL server blocks on its network write and stops sending rows until the window reopens. It will not wait indefinitely: if the stall exceeds net_write_timeout (60 seconds by default), the server aborts the connection, so a consumer that can pause for long periods needs a higher timeout or an intermediate file.
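
The signal at the heart of this chain, write() returning false once highWaterMark is reached, can be observed in isolation. The sketch below uses only Node's built-in stream module; slowSink is a hypothetical stand-in for a slow destination that never acknowledges a write.

```javascript
const { Writable } = require('stream');

// A sink that holds every acknowledgement, simulating a stalled destination
const pendingCallbacks = [];
const slowSink = new Writable({
    objectMode: true,
    highWaterMark: 3, // buffer limit, counted in objects
    write(row, _enc, callback) {
        pendingCallbacks.push(callback); // never called here, so nothing drains
    }
});

// write() returns false as soon as the buffered count reaches highWaterMark
const accepted = [1, 2, 3, 4].map(n => slowSink.write({ n }));

console.log(accepted);                // [ true, true, false, false ]
console.log(slowSink.writableLength); // 4
```

Note that write() still accepts the chunk after returning false; the return value is advisory. pipeline and pipe() respect it by pausing the source until the sink emits 'drain', which is what keeps the buffers bounded.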

Managing V8 Garbage Collection Under High Throughput

In a streaming migration pipeline, Node.js processes thousands of objects per second. If memory management is neglected, the V8 garbage collector will struggle to keep up, leading to high CPU usage, GC thrashing, and eventually out-of-memory crashes.

Young vs. Old Generation Allocation

The V8 heap is divided into several spaces, primarily the New Space (Young Generation) and the Old Space (Old Generation).

New Space: All newly created objects (like the parsed database rows and formatted output objects) are initially allocated here. This space is small (usually 16MB to 64MB) and optimized for rapid collection. A minor garbage collection cycle (Scavenge) scans this space frequently, clearing short-lived objects.
Old Space: Objects that survive multiple Scavenge cycles are promoted to the Old Space. The Old Space is much larger, and cleaning it requires a major GC cycle (Mark-Sweep-Compact), which pauses the Node.js event loop (stop-the-world phases).

To keep garbage collection efficient:

Avoid Object Retention: Ensure that you do not keep references to streamed rows in global arrays, cache maps, or long-lived closures. If a row object is referenced after its _transform callback completes, V8 cannot clean it up during a Scavenge cycle. It will be promoted to the Old Space, causing memory to leak over the duration of the migration.
Minimize Object Creation: In hot code paths, avoid allocating unnecessary objects. For instance, rather than re-creating regex patterns or configuration maps inside _transform, declare them as static constants outside the stream class.
Node.js GC Flags: To monitor garbage collection activity during large exports, launch the migration process with:
```
node --trace-gc stream-export.js
```
If you notice a high frequency of Mark-sweep operations, objects are probably being promoted into the Old Space. Look for arrays, maps, or closures that still hold references to row objects after _transform has returned.
Manual GC Triggering: In testing environments, you can enforce deterministic memory measurements by running Node.js with the --expose-gc flag and calling global.gc() manually at the end of every batch flush.
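
Alongside --trace-gc, heap usage can be sampled from inside the process. The following is a minimal sketch built on process.memoryUsage(); the helper name createHeapSampler is hypothetical, and calling it from the existing every-10,000-rows logging branch is only one possible placement.

```javascript
// Sketch: track current and peak heap usage during a long-running export.
function createHeapSampler() {
    let peakBytes = 0;
    return {
        // Record a sample and return the current heap usage in bytes
        sample() {
            const { heapUsed } = process.memoryUsage();
            peakBytes = Math.max(peakBytes, heapUsed);
            return heapUsed;
        },
        // Highest heapUsed value seen so far, in megabytes
        peakMb() {
            return peakBytes / (1024 * 1024);
        }
    };
}

const sampler = createHeapSampler();
sampler.sample();
console.log(`[Heap] peak ${sampler.peakMb().toFixed(1)} MB`);
```

A peak that keeps climbing across samples, rather than oscillating around a plateau, is the pattern to look for when checking whether rows are being retained.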

Part 3: The Production Node.js Ingestion Pipeline

Pipeline Architecture

CSV File → Readable Stream → csv-parse → Batch Collector → DB Upsert (with retry)
                                                         ↓ (on max retries)
                                                   Dead-Letter File (JSONL)

Implementation

// migrate.js — Production-grade Magento → PostgreSQL pipeline
const { pipeline, Transform } = require('stream');
const { promisify } = require('util');
const { parse } = require('csv-parse');
const fs = require('fs');
const db = require('./db'); // your pg connection pool

const pipe = promisify(pipeline);

const BATCH_SIZE = 500;
const MAX_RETRIES = 3;
const RETRY_BASE_MS = 500;

const dlqStream = fs.createWriteStream('./failed-rows.jsonl', { flags: 'a' });
let processed = 0, failed = 0;
const startTime = Date.now();

// Exponential backoff retry
async function withRetry(fn, label) {
    for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {
        try {
            return await fn();
        } catch (err) {
            if (attempt === MAX_RETRIES) throw err;
            const delay = RETRY_BASE_MS * Math.pow(2, attempt - 1);
            console.warn(`\n⚠ ${label} failed (attempt ${attempt}). Retrying in ${delay}ms…`);
            await new Promise(r => setTimeout(r, delay));
        }
    }
}

// Upsert batch — idempotent by magento_order_id
async function upsertBatch(batch) {
    const client = await db.connect();
    try {
        await client.query('BEGIN');
        for (const row of batch) {
            await client.query(`
                INSERT INTO orders (
                    magento_order_id, magento_increment_id, status,
                    total_amount, currency, customer_email, created_at
                ) VALUES ($1,$2,$3,$4,$5,$6,$7)
                ON CONFLICT (magento_order_id) DO UPDATE SET
                    status       = EXCLUDED.status,
                    total_amount = EXCLUDED.total_amount,
                    updated_at   = NOW()
            `, [
                row.order_id, row.magento_order_number, row.order_status,
                parseFloat(row.total_amount) || 0, row.currency,
                row.customer_email, row.order_created_at
            ]);
        }
        await client.query('COMMIT');
    } catch (err) {
        await client.query('ROLLBACK');
        throw err;
    } finally {
        client.release();
    }
}

// Transform stream: collect rows into batches, flush with backpressure
function createBatchCollector(batchSize, onBatch) {
    let buffer = [];

    const flush = async (rows, callback) => {
        try {
            await withRetry(() => onBatch(rows), `batch ~row ${processed}`);
            processed += rows.length;
            process.stdout.write(
                `\r✓ ${processed.toLocaleString()} rows | ✗ ${failed} failed | ` +
                `${((Date.now() - startTime) / 1000).toFixed(0)}s elapsed`
            );
        } catch (err) {
            failed += rows.length;
            console.error(`\n✗ Permanent batch failure: ${err.message}`);
            rows.forEach(r => dlqStream.write(JSON.stringify(r) + '\n'));
        }
        callback();
    };

    return new Transform({
        objectMode: true,
        async transform(row, _enc, callback) {
            buffer.push(row);
            if (buffer.length >= batchSize) {
                const toFlush = buffer.splice(0, batchSize);
                await flush(toFlush, callback);
            } else {
                callback();
            }
        },
        async flush(callback) {
            if (buffer.length > 0) await flush(buffer, callback);
            else callback();
        }
    });
}

async function migrate(csvPath) {
    console.log(`\nMigrating: ${csvPath} | Batch: ${BATCH_SIZE} | Retries: ${MAX_RETRIES}\n`);
    await pipe(
        fs.createReadStream(csvPath, { encoding: 'utf8' }),
        parse({ columns: true, skip_empty_lines: true, trim: true }),
        createBatchCollector(BATCH_SIZE, upsertBatch)
    );
    dlqStream.end();
    const elapsed = ((Date.now() - startTime) / 1000).toFixed(1);
    console.log(`\n\n✅ Done in ${elapsed}s — ${processed.toLocaleString()} rows | ${failed} DLQ`);
    if (failed > 0) console.log(`   DLQ: ./failed-rows.jsonl`);
}

migrate(process.argv[2] || './orders.csv').catch(err => {
    console.error('\n✗ Fatal:', err.message);
    process.exit(1);
});

Key Design Decisions

Idempotency (ON CONFLICT DO UPDATE): The pipeline can be safely restarted, provided orders.magento_order_id carries a unique constraint (ON CONFLICT requires one). If it crashes at row 47,000, rows 1–47,000 are simply updated to the same values when you re-run. No duplicates.

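Dead-Letter Queue: Batches that exhaust all retries are written to failed-rows.jsonl. After the migration, inspect the file, fix the root cause, and replay those rows. The DLQ is JSONL while migrate.js as written parses CSV, so a replay needs a JSONL parsing stage in place of csv-parse, and the input should be a copy of the DLQ because the script appends new failures to failed-rows.jsonl.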
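
The DLQ file holds one JSON object per line, which csv-parse will not read. A minimal sketch of a parsing stage for replays follows, using only Node built-ins; createJsonlParser is a hypothetical helper, not part of migrate.js.

```javascript
const { Transform } = require('stream');
const { StringDecoder } = require('string_decoder');

// Sketch: turn a byte stream of newline-delimited JSON into row objects.
function createJsonlParser() {
    const decoder = new StringDecoder('utf8'); // keeps multi-byte characters intact across chunks
    let tail = '';                             // incomplete last line of the previous chunk

    function emitLines(stream, lines) {
        for (const line of lines) {
            if (line.trim() !== '') stream.push(JSON.parse(line));
        }
    }

    return new Transform({
        readableObjectMode: true, // bytes in, row objects out
        transform(chunk, _enc, callback) {
            try {
                const lines = (tail + decoder.write(chunk)).split('\n');
                tail = lines.pop(); // may be a partial line; keep it for the next chunk
                emitLines(this, lines);
                callback();
            } catch (err) {
                callback(err); // a malformed line fails the pipeline
            }
        },
        flush(callback) {
            try {
                emitLines(this, [tail + decoder.end()]);
                callback();
            } catch (err) {
                callback(err);
            }
        }
    });
}
```

In migrate(), this stage would take the place of parse(...) when the input path ends in .jsonl. The rows come out with the same keys they were written with, so upsertBatch needs no changes.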

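Backpressure: The callback() in the Transform stream is not called until upsertBatch resolves or the batch has been routed to the DLQ. While a batch is in flight the Transform accepts no further rows, its input buffer fills, and pipeline stops pulling from the CSV parser and the file stream. No manual pause()/resume() is needed.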

stream.pipeline: Using the promisified pipeline instead of manually chaining .pipe() ensures that if any stream in the chain errors, all other streams are automatically destroyed and file handles are released.

# Run migration
node migrate.js ./exports/magento-orders.csv

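# Replay only failed rows (work on a copy; needs a JSONL parsing stage in place of csv-parse)
cp ./failed-rows.jsonl ./replay.jsonl
node migrate.js ./replay.jsonl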

For the full architectural context of where this extracted data lands in a microservice ecosystem, see Why You Should Migrate from Magento to Microservices and the Zero-Downtime Migration Blueprint.

Go deeper: Architecting a 21-Service E-commerce Ecosystem with Golang & DDD — the distributed microservices architecture that this data pipeline feeds into.

Frequently Asked Questions

Why does Magento 2’s EAV (Entity-Attribute-Value) database design make direct SQL extraction challenging?

EAV distributes a single entity’s attributes across multiple typed value tables (e.g., catalog_product_entity_varchar, _int, _decimal) to support dynamic schema changes. Reconstructing a single flat product record requires one join per attribute and scope, which becomes a major performance bottleneck on large catalogs unless the joins are covered by composite indexes and the export is batched by entity_id range.

How do Node.js streams prevent memory overflow during large Magento catalog exports?

Instead of loading millions of database rows into memory at once, we stream the result set row by row from the MySQL connection and pipe it through Node.js transform streams. Backpressure handles memory management: if the destination write stream (like a CSV writer or Elasticsearch API) is slow, the stall propagates back to the database read stream, which pauses reading and keeps memory usage stable below 100MB.

How do you handle store-scope inheritance when exporting Magento 2 product attributes via SQL?

Magento attributes inherit values from the default store scope (store_id = 0) unless overridden at a specific store view. SQL queries must perform LEFT JOIN operations against both store_id = 0 and the target store_id, using COALESCE(store_val.value, default_val.value) to fall back gracefully to the default attribute value.
