본문으로 건너뛰기
Data Engineer Path

Code nodes — writing transforms in Python or SQL

When standard nodes (filter, aggregate, join) can't finish a transform, drop a short Python or SQL snippet into a code node and keep the work inside the same pipeline.

9

The pipeline you built in the previous lesson stopped at column selection and renaming. Real mart datasets quickly run into transforms that one or two standard nodes can't finish — currency normalization, parsing substrings, reconciling against external codes. These belong in a code node. This lesson inserts one code node into the pipeline from before and runs it again.

Transforms the standard nodes handle vs. ones they don't

Patterns the standard transform nodes handle well:

  • Column selection · renaming · simple type casts
  • Single-condition filters (status = 'active')
  • Single-key aggregates (sum, count, avg)
  • A simple join between two datasets

Patterns that move naturally into a code node:

  • A single column packing two pieces of information that need regex parsing ("4,660 ₩"4660 + "KRW")
  • The same column arriving in multiple representations ("Seoul" / "서울" / "SEL") that you normalize through a lookup
  • Joining against semi-static reference data (exchange rates, external code tables) and adding a column
  • Reshaping one row into many (explode) or many rows into one (pivot)
  • Data-quality checks (failing the run when row count drops below a threshold, detecting duplicate keys)

The one-line rule of thumb: if expressing the transform in one or two lines of SQL or pandas makes the intent clearer, use a code node. If you have to chain five or more standard nodes, a code node is usually shorter from there.

A code asset lives in two places

The Codes asset in portal lives in two distinct places:

  1. Codes assets inside a collection — Reusable snippets stored inside a collection. Multiple pipelines can pull them in, and they are versioned separately.
  2. Code nodes inline in a pipeline — Inline scripts used only inside that pipeline. Suited to short transforms.

This lesson starts with (2), the inline code node, then closes by promoting the same snippet into (1), a collection code asset.

Drop one code node in

Reopen the pipeline from the previous lesson.

  1. Drag Node library → Transform → Code Node (Python) between the transform node and the sink node.
  2. Break the existing arrows and reconnect them as Source → Transform → Code Node → Sink.
  3. In the code node's configuration panel, set the input dataset alias (defaults to df).
  4. Leave the output dataset alias at the same name (df) and mutate that dataframe in place inside the body.

The body of a code node has this shape:

def transform(df):
    # df: the output dataframe from the previous node
    # return value: the input dataframe for the next node
    df["price_krw"] = (
        df["price_krw_str"]
        .str.replace(",", "", regex=False)
        .str.replace("₩", "", regex=False)
        .str.strip()
        .astype("int64")
    )
    df["currency"] = "KRW"
    return df[["order_date", "item_code", "price_krw", "currency"]]

Save, then click Validate in the upper-right of the canvas. Portal statically analyzes the function signature and return type. Green means it passed.

SQL code nodes sit in the same place

Inside the same code-node library you'll also find Code Node (SQL). When one or two inputs need a join, group by, or window function, SQL is often shorter.

SELECT
  order_date,
  category,
  SUM(quantity * price_krw) AS revenue_krw,
  COUNT(DISTINCT order_id) AS order_count
FROM df
WHERE status = 'paid'
GROUP BY 1, 2

Here df is the virtual table name bound to the previous node's output. To accept a second input, expand the code node's input ports to two and reference the second input's alias (e.g. df2) directly inside the SQL.

Run it once and see what changed

Click Run again. Confirm two things on the output dataset:

  • Does the price_krw column come through as an integer type (check the schema preview)?
  • Does the input row count match the output row count as intended (identical if you didn't add a filter)?

Promote the snippet to a collection code asset

When the same normalization logic starts repeating across two or three pipelines, the inline code node becomes a maintenance burden. At that point, move it with this pattern:

  1. From the code node's upper-right menu, choose Extract to collection code.
  2. Name the asset (e.g. normalize_price_krw) and pick the collection to store it in.
  3. The code node in the original pipeline becomes a call into that asset.
  4. From any other pipeline, pull it in directly with Node library → Codes → normalize_price_krw.

A promoted snippet lives on its own page with a change history. Fix the transform once and every dependent pipeline picks up the change on the next run.

Self-check

  • Did you add one inline Python code node and take a run end-to-end?
  • Did you also express the same transform as a SQL code node at least once (optional)?
  • Did you extract the snippet into a collection code asset?
  • Did you change at least one line of the AI assistant's first draft instead of accepting it verbatim?

Next lesson

The next lesson makes the pipeline from this one run automatically at a fixed time every day, sets the run mode (overwrite, append, CDC), and routes failure alerts to the right channel.