Pulling external data with connectors
Register one connector against an external database, S3 bucket, or REST API, and land production data as a portal dataset. The work is organized along three axes: authentication, schema, and permissions.
The first surface an engineer touches is Connectors. You start with portal holding zero rows and finish with one external production system wired in, producing the input every later pipeline reads from. This lesson takes one connector from start to finish and confirms that a dataset lands in the collection tree as a result.
Prerequisites
- Access to one external system. Any of the following works:
  - A read-only account on an internal production database (PostgreSQL, MySQL, Snowflake, or any RDBMS portal supports)
  - An access key with read permission on S3 or an S3-compatible object store
  - A REST API endpoint + token (e.g. an internal metrics API)
- One collection to work in. The collection you built in lesson 03 of the Analyst Path is fine to reuse.
Don't keep production credentials on your own laptop. Portal stores credentials encrypted in a secrets store, with an ops-team rotation flow on top.
A connector is always built in two steps
Every connector in portal goes through the same two steps:
- Pick a type and register authentication — Which external system (DB, object store, REST, message queue), and how you authenticate (username/password, key pair, OAuth, token).
- Map the schema — Which objects inside that system (tables, bucket prefixes, endpoint paths), and what shape they should take as portal datasets.
Once these two steps are done, the connector becomes the origin of one or more datasets.
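One way to picture the two steps is as a small data model: step 1 fills in the system type and authentication, step 2 fills in the mappings. The sketch below is illustrative only. The class names, field names, and the secret name orders-db-readonly are hypothetical and are not portal's API.

```python
from dataclasses import dataclass, field


# Hypothetical model of a connector; none of these names come from portal itself.
@dataclass(frozen=True)
class Auth:
    method: str      # "password", "key_pair", "oauth", or "token"
    secret_ref: str  # name of a secrets-store entry, never the secret value


@dataclass(frozen=True)
class DatasetMapping:
    source_object: str  # table, bucket prefix, or endpoint path
    dataset_name: str   # name of the portal dataset it lands as


@dataclass
class Connector:
    system_type: str  # step 1: which external system
    auth: Auth        # step 1: how to authenticate
    mappings: list[DatasetMapping] = field(default_factory=list)  # step 2

    def datasets(self) -> list[str]:
        """Names of the datasets this connector is the origin of."""
        return [m.dataset_name for m in self.mappings]


orders_db = Connector(
    system_type="postgresql",
    auth=Auth(method="password", secret_ref="orders-db-readonly"),
    mappings=[
        DatasetMapping("public.orders", "src_postgres_orders"),
        DatasetMapping("public.customers", "src_postgres_customers"),
    ],
)
print(orders_db.datasets())
```

Note that the model only ever holds a reference to a secret, which mirrors the rule that plaintext credentials stay out of the form.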
Step 1 — Register authentication
- Click Connectors in the left sidebar and choose Create connector inside a collection.
- From the type list, pick the external system you want. The common starting points are PostgreSQL, S3, and REST API.
- Fill in non-secret identification: host, port, database name (or bucket name, base URL).
- For credentials, choose Register a new secret in the secrets store and let portal manage it. Don't paste plaintext into the form.
- Click Test connection. Portal does one handshake against the external system. Green means success; red reports which step (DNS, authentication, or permission) failed.
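The staged result of Test connection can be sketched as an ordered list of checks that stops at the first failure. This is a minimal illustration of the idea, not how portal implements it. The function name run_connection_test and the stub checks are assumptions.

```python
from typing import Callable, Optional

# Stage order taken from the failure categories the test reports.
STAGES = ("dns", "authentication", "permission")


def run_connection_test(checks: dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run each stage in order; return the first failing stage, or None if all pass."""
    for stage in STAGES:
        try:
            passed = checks[stage]()
        except Exception:
            # A check that raises counts as a failure of that stage.
            passed = False
        if not passed:
            return stage
    return None


# Stub checks standing in for a real handshake (hypothetical).
failed_stage = run_connection_test({
    "dns": lambda: True,
    "authentication": lambda: False,
    "permission": lambda: True,
})
print("green" if failed_stage is None else f"red: {failed_stage} failed")
```

Stopping at the first failure keeps the message actionable: a permission error is only meaningful once DNS and authentication have already succeeded.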
Step 2 — Map the schema
Once the connection test passes, portal scans the external system's schema once.
- DB connector: Tables inside the database appear as a tree. Check the ones you want to pull. Column-level selection, column aliasing, and type casts live on the same screen.
- S3 connector: Specify one prefix and a file pattern (*.parquet, *.csv.gz). Multiple files under the same prefix are grouped as partitions of a single dataset.
- REST API connector: Specify an endpoint and a JSON response path ($.data.items[*]). Portal flattens that path's result into a single tabular view.
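The S3 and REST mappings above can be approximated in a few lines of Python. This is a sketch under assumptions: the object keys and the response body are made up, the path $.data.items[*] is hand-rolled rather than evaluated by a JSONPath library, and the dotted column names for nested objects are one possible flattening, not necessarily the one portal uses.

```python
from fnmatch import fnmatchcase


def group_partitions(keys, prefix, patterns):
    """Return the keys under one prefix whose file name matches any pattern."""
    matched = []
    for key in keys:
        if not key.startswith(prefix):
            continue
        file_name = key.rsplit("/", 1)[-1]
        if any(fnmatchcase(file_name, pattern) for pattern in patterns):
            matched.append(key)
    return sorted(matched)


def flatten_items(response):
    """Hand-rolled equivalent of $.data.items[*]: one flat row per item."""
    rows = []
    for item in response["data"]["items"]:
        row = {}
        for key, value in item.items():
            if isinstance(value, dict):
                # Assumption: one level of nesting becomes dotted column names.
                for sub_key, sub_value in value.items():
                    row[f"{key}.{sub_key}"] = sub_value
            else:
                row[key] = value
        rows.append(row)
    return rows


# Made-up bucket listing and API response for illustration.
keys = [
    "raw/orders/2024-01.parquet",
    "raw/orders/2024-02.parquet",
    "raw/orders/README.txt",
    "raw/users/2024-01.parquet",
]
partitions = group_partitions(keys, "raw/orders/", ["*.parquet", "*.csv.gz"])

response = {"data": {"items": [
    {"id": 1, "host": {"name": "a", "zone": "eu"}},
    {"id": 2, "host": {"name": "b", "zone": "us"}},
]}}
rows = flatten_items(response)
print(partitions)
print(rows)
```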
In all three cases the last step decides the name of the dataset that will be created. A single connector can produce several datasets — a consistent naming convention inside one collection (e.g. src_<system>_<entity>) makes nodes easier to find in later pipelines.
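If your team wants to enforce the src_<system>_<entity> convention, a small helper can build and check names before they are typed into the form. The helper names and the slug rules below are assumptions made for illustration; portal does not necessarily validate names this way.

```python
import re

# src_, then a system segment, then one or more entity segments.
NAME_PATTERN = re.compile(r"src_[a-z0-9]+(_[a-z0-9]+)+")


def _slug(text: str) -> str:
    """Lowercase the text and collapse anything non-alphanumeric into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def source_dataset_name(system: str, entity: str) -> str:
    """Build a dataset name following src_<system>_<entity>."""
    system_slug, entity_slug = _slug(system), _slug(entity)
    if not system_slug or not entity_slug:
        raise ValueError("system and entity must contain letters or digits")
    return f"src_{system_slug}_{entity_slug}"


def follows_convention(name: str) -> bool:
    """Check an existing dataset name against the convention."""
    return NAME_PATTERN.fullmatch(name) is not None


print(source_dataset_name("postgres", "Order Items"))
print(follows_convention("orders_copy"))
```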
Run the ingest once
After schema mapping, the Run ingest button lights up. Click it and portal pulls the data once and lands it as a dataset. When the run finishes, confirm three things:
- A new dataset (e.g. src_postgres_orders) appears in the collection tree.
- The row count matches what you expect from the external system (compare with a simple count query).
- The schema preview infers types as intended (especially: did the date column come in as a string?).
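The row-count and type checks can be scripted once you have the landed rows and the schema preview in hand. The sketch below uses an in-memory SQLite table as a stand-in for the external database. The table, the sample rows, and the schema dictionary are all made up, and the type label "string" is an assumption about how the preview names its types.

```python
import sqlite3
from datetime import date


def source_row_count(conn, table):
    """Run the simple count query against the source system."""
    # The table name comes from trusted config here; never interpolate user input.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def _is_iso_date(value):
    try:
        date.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


def date_columns_typed_as_string(schema, sample_rows):
    """Return string-typed columns whose sampled values all parse as ISO dates."""
    suspects = []
    for column, inferred_type in schema.items():
        if inferred_type != "string":
            continue
        values = [row[column] for row in sample_rows if row.get(column) is not None]
        if values and all(_is_iso_date(value) for value in values):
            suspects.append(column)
    return suspects


# Stand-in for the external database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, ordered_on TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2024-01-05"), (2, "2024-01-06")],
)

# Stand-in for what landed in portal (hypothetical shapes).
landed_rows = [
    {"id": 1, "ordered_on": "2024-01-05"},
    {"id": 2, "ordered_on": "2024-01-06"},
]
landed_schema = {"id": "integer", "ordered_on": "string"}

counts_match = source_row_count(conn, "orders") == len(landed_rows)
suspects = date_columns_typed_as_string(landed_schema, landed_rows)
print(counts_match, suspects)
```

A column that shows up in the suspects list is a candidate for a type cast on the schema mapping screen.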
The dataset that lands at this point is a verbatim copy of the source. Transformation is the next step.
Keep it out of the analyst's view for now
Source datasets are usually scoped to engineers only. If analysts start querying source datasets directly from the collection tree, you lose the ability to trace which widget breaks when downstream transforms change.
The recommended pattern is:
- Keep source datasets inside a data engineer collection with tight Reader permissions.
- The mart datasets you'll expose to analysts are produced by the pipeline in the next lesson and land in a separate collection, where the analyst group gets Reader access on that collection only.
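The intended end state can be written down as a small permissions table and checked mechanically. This is a sketch of the pattern only. The collection names, group names, and the dataset mart_orders_daily are hypothetical, and portal's actual permission model may have more roles than Reader.

```python
# Hypothetical Reader grants per collection.
READER_GRANTS = {
    "eng_sources": {"data-engineers"},
    "analytics_marts": {"data-engineers", "analysts"},
}

# Hypothetical placement of datasets in collections.
DATASET_COLLECTION = {
    "src_postgres_orders": "eng_sources",
    "mart_orders_daily": "analytics_marts",
}


def can_read(group: str, dataset: str) -> bool:
    """A group can read a dataset only through a Reader grant on its collection."""
    collection = DATASET_COLLECTION[dataset]
    return group in READER_GRANTS.get(collection, set())


def leaked_sources(group: str) -> list[str]:
    """Source datasets (src_ prefix) that the group can read but should not."""
    return sorted(
        dataset
        for dataset in DATASET_COLLECTION
        if dataset.startswith("src_") and can_read(group, dataset)
    )


print(can_read("analysts", "mart_orders_daily"))
print(leaked_sources("analysts"))
```

Because access is granted per collection, keeping source and mart datasets in separate collections is what makes the analyst grant safe.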
This separation comes back during the handoff step in lesson 06.
Self-check
- Does the new connector show up in the left tree, with Test connection green?
- Did the ingest produce one new dataset?
- Are the credentials registered in portal's secrets store, with no plaintext in the form?
- Are analysts blocked from reading the source dataset?
Next lesson
In the next lesson you'll place this dataset as a source node in the Workflow editor and take your first pipeline through one full cycle.