Redshift
Important Capabilities
Capability | Status | Notes |
---|---|---|
Asset Containers | ✅ | Enabled by default. Supported for types - Database, Schema. |
Classification | ✅ | Optionally enabled via classification.enabled . |
Column-level Lineage | ✅ | Optionally enabled via configuration (mixed or sql_based lineage needs to be enabled). |
Data Profiling | ✅ | Optionally enabled via configuration. |
Dataset Usage | ✅ | Optionally enabled via include_usage_statistics . |
Descriptions | ✅ | Enabled by default. |
Detect Deleted Entities | ✅ | Enabled by default via stateful ingestion. |
Domains | ✅ | Supported via the domain config field. |
Platform Instance | ✅ | Enabled by default. |
Schema Metadata | ✅ | Enabled by default. |
Table-Level Lineage | ✅ | Optionally enabled via configuration. |
Test Connection | ✅ | Enabled by default. |
This plugin extracts the following:
- Metadata for databases, schemas, views and tables
- Column types associated with each table
- Table, row, and column statistics via optional SQL profiling
- Table lineage
- Usage statistics
Prerequisites
The DataHub Redshift connector requires specific database privileges to extract metadata, lineage, and usage statistics from your Amazon Redshift cluster.
Permission Overview
DataHub requires three categories of permissions:
- System Table Access - Access to Redshift system tables for lineage and usage statistics
- System View Access - Access to system views for metadata discovery
- Data Access - Access to user schemas and tables for profiling and classification
System Table and View Permissions
Execute the following commands as a database superuser or user with sufficient privileges to grant these permissions:
-- Core system access (required for lineage and usage statistics)
ALTER USER datahub WITH SYSLOG ACCESS UNRESTRICTED;
-- Core metadata extraction (always required)
GRANT SELECT ON pg_catalog.svv_redshift_databases TO datahub; -- Database information and properties
GRANT SELECT ON pg_catalog.svv_redshift_schemas TO datahub; -- Schema information within databases
GRANT SELECT ON pg_catalog.svv_external_schemas TO datahub; -- External schemas (Spectrum, federated)
GRANT SELECT ON pg_catalog.svv_table_info TO datahub; -- Table metadata, statistics, and properties
GRANT SELECT ON pg_catalog.svv_external_tables TO datahub; -- External table definitions (Spectrum)
GRANT SELECT ON pg_catalog.svv_external_columns TO datahub; -- External table column information
GRANT SELECT ON pg_catalog.pg_class_info TO datahub; -- Table creation timestamps and basic info
-- Essential pg_catalog tables for table discovery
GRANT SELECT ON pg_catalog.pg_class TO datahub; -- Table and view definitions
GRANT SELECT ON pg_catalog.pg_namespace TO datahub; -- Schema namespace information
GRANT SELECT ON pg_catalog.pg_description TO datahub; -- Table and column descriptions/comments
GRANT SELECT ON pg_catalog.pg_database TO datahub; -- Database catalog information
GRANT SELECT ON pg_catalog.pg_attribute TO datahub; -- Column definitions and properties
GRANT SELECT ON pg_catalog.pg_attrdef TO datahub; -- Column default values
GRANT SELECT ON pg_catalog.svl_user_info TO datahub; -- User information for ownership
-- Datashare lineage (enabled by default)
GRANT SELECT ON pg_catalog.svv_datashares TO datahub; -- Cross-cluster datashare information
-- Choose ONE based on your Redshift type:
-- For Provisioned Clusters:
GRANT SELECT ON pg_catalog.stv_mv_info TO datahub; -- Materialized view information (provisioned)
-- For Serverless Workgroups:
-- GRANT SELECT ON pg_catalog.svv_user_info TO datahub; -- User information (serverless alternative)
-- GRANT SELECT ON pg_catalog.svv_mv_info TO datahub; -- Materialized view information (serverless)
-- Schema access (required to read tables in each schema)
GRANT USAGE ON SCHEMA <schema_to_ingest> TO datahub; -- Replace with actual schema names
Important: Make sure to grant USAGE on any schema you want to ingest from:
GRANT USAGE ON SCHEMA <schema_to_ingest> TO datahub;
Detailed Permission Breakdown
The following sections provide detailed information about which permissions are required for specific features and configurations.
Core System Views (Always Required)
These system views are accessed in all DataHub configurations:
-- Schema discovery
GRANT SELECT ON pg_catalog.svv_redshift_schemas TO datahub; -- Schema information within databases
GRANT SELECT ON pg_catalog.svv_external_schemas TO datahub; -- External schemas (Spectrum, federated)
-- Database information
GRANT SELECT ON pg_catalog.svv_redshift_databases TO datahub; -- Database information and properties
-- Table metadata and statistics
GRANT SELECT ON pg_catalog.svv_table_info TO datahub; -- Table metadata, statistics, and properties
-- External table support
GRANT SELECT ON pg_catalog.svv_external_tables TO datahub; -- External table definitions (Spectrum)
GRANT SELECT ON pg_catalog.svv_external_columns TO datahub; -- External table column information
-- Table creation timestamps
GRANT SELECT ON pg_catalog.pg_class_info TO datahub; -- Table creation timestamps and basic info
Conditional System Tables (Feature Dependent)
Shared Database (Datashare Consumer)
-- Required when is_shared_database = True
GRANT SELECT ON pg_catalog.svv_redshift_tables TO datahub; -- Table information in shared databases
GRANT SELECT ON pg_catalog.svv_redshift_columns TO datahub; -- Column information in shared databases
Redshift Serverless Workgroups
-- Required for serverless workgroups
GRANT SELECT ON pg_catalog.svv_user_info TO datahub; -- User information (serverless alternative)
GRANT SELECT ON pg_catalog.svv_mv_info TO datahub; -- Materialized view information (serverless)
Redshift Provisioned Clusters
-- Required for provisioned clusters
GRANT SELECT ON pg_catalog.svl_user_info TO datahub; -- User information for ownership (superuser table)
GRANT SELECT ON pg_catalog.stv_mv_info TO datahub; -- Materialized view information (provisioned)
Datashares Lineage
-- Required when include_share_lineage: true (default)
GRANT SELECT ON pg_catalog.svv_datashares TO datahub; -- Cross-cluster datashare information
Recommended Permission Set
For a typical provisioned cluster with default settings:
-- Core system access
ALTER USER datahub WITH SYSLOG ACCESS UNRESTRICTED;
-- Always required
GRANT SELECT ON pg_catalog.svv_redshift_databases TO datahub; -- Database information and properties
GRANT SELECT ON pg_catalog.svv_redshift_schemas TO datahub; -- Schema information within databases
GRANT SELECT ON pg_catalog.svv_external_schemas TO datahub; -- External schemas (Spectrum, federated)
GRANT SELECT ON pg_catalog.svv_table_info TO datahub; -- Table metadata, statistics, and properties
GRANT SELECT ON pg_catalog.svv_external_tables TO datahub; -- External table definitions (Spectrum)
GRANT SELECT ON pg_catalog.svv_external_columns TO datahub; -- External table column information
GRANT SELECT ON pg_catalog.pg_class_info TO datahub; -- Table creation timestamps and basic info
-- Essential pg_catalog tables for table discovery
GRANT SELECT ON pg_catalog.pg_class TO datahub; -- Table and view definitions
GRANT SELECT ON pg_catalog.pg_namespace TO datahub; -- Schema namespace information
GRANT SELECT ON pg_catalog.pg_description TO datahub; -- Table and column descriptions/comments
GRANT SELECT ON pg_catalog.pg_database TO datahub; -- Database catalog information
GRANT SELECT ON pg_catalog.pg_attribute TO datahub; -- Column definitions and properties
GRANT SELECT ON pg_catalog.pg_attrdef TO datahub; -- Column default values
GRANT SELECT ON pg_catalog.svl_user_info TO datahub; -- User information for ownership (superuser table)
-- Datashares (since include_share_lineage defaults to true)
GRANT SELECT ON pg_catalog.svv_datashares TO datahub; -- Cross-cluster datashare information
-- Provisioned cluster materialized views
GRANT SELECT ON pg_catalog.stv_mv_info TO datahub; -- Materialized view information (provisioned)
-- Schema access (required to read tables in each schema)
GRANT USAGE ON SCHEMA <schema_to_ingest> TO datahub; -- Replace with actual schema names
Data Access Privileges (Required for Data Profiling and Classification)
Important: The system table permissions above only provide access to metadata. To enable data profiling, classification, or any feature that reads actual table data, you must grant additional privileges:
-- Grant USAGE privilege on schemas (required to access schema objects)
GRANT USAGE ON SCHEMA public TO datahub;
GRANT USAGE ON SCHEMA your_schema_name TO datahub;
-- Grant SELECT privilege on existing tables for data access
GRANT SELECT ON ALL TABLES IN SCHEMA public TO datahub;
GRANT SELECT ON ALL TABLES IN SCHEMA your_schema_name TO datahub;
-- Grant privileges on future objects (recommended for production)
-- IMPORTANT: These must be run by each user who will create tables/views
-- OR by a superuser with FOR ROLE clause
-- Option 1: If you (as admin) will create all future tables/views:
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO datahub;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON VIEWS TO datahub;
ALTER DEFAULT PRIVILEGES IN SCHEMA your_schema_name GRANT SELECT ON TABLES TO datahub;
ALTER DEFAULT PRIVILEGES IN SCHEMA your_schema_name GRANT SELECT ON VIEWS TO datahub;
-- Option 2: If other users will create tables/views, run this for each user:
-- ALTER DEFAULT PRIVILEGES FOR ROLE other_user_name IN SCHEMA public GRANT SELECT ON TABLES TO datahub;
-- ALTER DEFAULT PRIVILEGES FOR ROLE other_user_name IN SCHEMA public GRANT SELECT ON VIEWS TO datahub;
-- Option 3: For all future users (requires superuser):
-- ALTER DEFAULT PRIVILEGES FOR ALL ROLES IN SCHEMA public GRANT SELECT ON TABLES TO datahub;
-- ALTER DEFAULT PRIVILEGES FOR ALL ROLES IN SCHEMA public GRANT SELECT ON VIEWS TO datahub;
The permissions are split into two categories:
- System table permissions (above) - Required for metadata extraction, lineage, and usage statistics
- Data access permissions (this section) - Required for data profiling, classification, and any feature that reads actual table content
Default privileges only apply to objects created by the user who ran the ALTER DEFAULT PRIVILEGES command. If multiple users create tables in your schemas, you need to:
- Run the commands as each user, OR
- Use FOR ROLE other_user_name for each user who creates objects, OR
- Use FOR ALL ROLES (requires superuser privileges)
Common gotcha: If User A runs ALTER DEFAULT PRIVILEGES and User B creates a table, DataHub won't have access to User B's table unless you used Option 2 or 3 above.
Alternative approach: Instead of default privileges, consider using a scheduled job to periodically grant access to new tables:
-- Run this periodically to catch new tables
GRANT SELECT ON ALL TABLES IN SCHEMA your_schema_name TO datahub;
Optional: Datashare Privileges
To enable cross-cluster lineage through datashares, grant the following privileges:
-- Grant SHARE privilege on datashares (replace with actual datashare names)
GRANT SHARE ON your_datashare_name TO datahub;
Ingestion of multiple redshift databases, namespaces
If multiple databases are present in the Redshift namespace (or provisioned cluster), you need to set up a separate ingestion per database.
Ingestion recipes for all databases in a particular Redshift namespace should use the same platform instance.
If you have multiple Redshift namespaces that you want to ingest into DataHub, it is highly recommended that you specify a platform_instance per namespace in the recipe. It can be the namespace ID or another human-readable name, but it must be unique across all your Redshift namespaces. See the sketch below.
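For example, a minimal sketch of two recipes covering two databases in the same namespace (the host, credentials, database names, and platform_instance value are placeholders), both pinned to the same platform_instance:
source:
  type: redshift
  config:
    host_port: example.something.us-west-2.redshift.amazonaws.com:5439
    database: database_one
    platform_instance: my_namespace # same value for every database in this namespace
    username: user
    password: pass
A second, separately run recipe for the other database reuses the same platform_instance:
source:
  type: redshift
  config:
    host_port: example.something.us-west-2.redshift.amazonaws.com:5439
    database: database_two
    platform_instance: my_namespace
    username: user
    password: pass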
Lineage
There are multiple lineage collector implementations as Redshift does not support table lineage out of the box.
stl_scan_based
The stl_scan based collector uses Redshift's stl_insert and stl_scan system tables to discover lineage between tables.
Pros:
- Fast
- Reliable
Cons:
- Does not work with Spectrum/external tables, because those scans do not show up in the stl_scan table.
- If a table depends on a view, the view won't be listed as a dependency; instead, the table will be connected to the view's dependencies.
sql_based
The sql_based collector uses Redshift's stl_insert to discover all the insert queries and uses SQL parsing to discover the dependencies.
Pros:
- Works with Spectrum tables
- Views are connected properly if a table depends on them
Cons:
- Slow.
- Less reliable as the query parser can fail on certain queries
mixed
Uses both collectors above, first applying the sql_based collector and then the stl_scan_based one.
Pros:
- Works with Spectrum tables
- Views are connected properly if a table depends on them
- A bit more reliable than sql_based alone
Cons:
- Slow
- May be incorrect at times as the query parser can fail on certain queries
The Redshift stl system tables used for lineage retain at most seven days of log history, and sometimes closer to 2-5 days. This means you cannot extract lineage for queries issued outside that window. The collector is selected via the table_lineage_mode config option, as shown in the sketch below.
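A minimal sketch of the relevant config options (values shown are illustrative):
include_table_lineage: true
table_lineage_mode: mixed            # one of: sql_based, stl_scan_based, mixed
extract_column_level_lineage: true   # column-level lineage requires the sql_based or mixed mode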
Datashares Lineage
Datashare lineage is enabled by default and can be disabled by setting include_share_lineage: False.
It is mandatory to run Redshift ingestion of the datashare producer namespace at least once, so that lineage shows up correctly after the datashare consumer namespace is ingested.
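For example, to turn datashare lineage off in the recipe config:
include_share_lineage: false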
Profiling
Profiling runs SQL queries on the Redshift cluster to gather statistics about tables. To do that, the user needs read access to the tables that should be profiled.
If you don't want to grant read access to the tables, you can enable table-level profiling, which collects table statistics without reading the data:
profiling:
profile_table_level_only: true
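A slightly fuller sketch, assuming a hypothetical my_db.my_schema prefix in profile_pattern to limit which tables are profiled:
profiling:
  enabled: true
  profile_table_level_only: true   # table-level stats only; no column-level queries
profile_pattern:
  allow:
    - "my_db.my_schema.*"          # hypothetical pattern; profile only tables in this schema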
Caveats
System table access: The SYSLOG ACCESS UNRESTRICTED privilege gives the user visibility to data generated by other users. For example, STL_QUERY and STL_QUERYTEXT contain the full text of INSERT, UPDATE, and DELETE statements.
Datashare lineage: For cross-cluster lineage through datashares, the datahub user requires SHARE privileges on datashares in both producer and consumer namespaces. See the Amazon Redshift datashare documentation for more information.
CLI based Ingestion
Starter Recipe
Check out the following recipe to get started with ingestion! See below for full configuration options.
For general pointers on writing and running a recipe, see our main recipe guide.
source:
  type: redshift
  config:
    # Coordinates
    host_port: example.something.us-west-2.redshift.amazonaws.com:5439
    database: DemoDatabase

    # Credentials
    username: user
    password: pass

    # Options
    options:
      # driver_option: some-option

    include_table_lineage: true
    include_usage_statistics: true
    # The following option is only used when include_usage_statistics is true.
    # It appends the domain after the Redshift username extracted from the Redshift
    # audit history, in the format username@email_domain.
    email_domain: mydomain.com

    profiling:
      enabled: true
      # Only collect table-level profiling information
      profile_table_level_only: true

sink:
  # sink configs
#------------------------------------------------------------------------------
# Extra options when running Redshift behind a proxy
#------------------------------------------------------------------------------
source:
  type: redshift
  config:
    host_port: my-proxy-hostname:5439
    options:
      connect_args:
        # Check all available options here: https://pypi.org/project/redshift-connector/
        ssl_insecure: "false" # Specifies whether the IdP host's server certificate will be verified

sink:
  # sink configs
Config Details
Note that a . is used to denote nested fields in the YAML recipe.
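For instance, the dotted field classification.enabled in the table below corresponds to the following nesting in the recipe:
classification:
  enabled: true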
Field | Description |
---|---|
host_port ✅ string | host URL |
bucket_duration Enum | One of: "DAY", "HOUR" |
convert_urns_to_lowercase boolean | Whether to convert dataset urns to lowercase. Default: False |
database string | database Default: dev |
default_schema string | The default schema to use if the sql parser fails to parse the schema with sql_based lineage collector Default: public |
email_domain One of string, null | Email domain of your organisation so users can be displayed on UI appropriately. Default: None |
enable_stateful_lineage_ingestion boolean | Enable stateful lineage ingestion. This will store lineage window timestamps after successful lineage ingestion, and will not re-run lineage ingestion for the same timestamps in a subsequent run. Default: True |
enable_stateful_profiling boolean | Enable stateful profiling. This will store profiling timestamps per dataset after successful profiling, and will not run profiling again in a subsequent run if the table has not been updated. Default: True |
enable_stateful_usage_ingestion boolean | Enable stateful usage ingestion. This will store usage window timestamps after successful usage ingestion, and will not re-run usage ingestion for the same timestamps in a subsequent run. Default: True |
end_time string(date-time) | Latest date of lineage/usage to consider. Default: Current time in UTC |
extra_client_options object | Default: {} |
extract_column_level_lineage boolean | Whether to extract column level lineage. This config works with rest-sink only. Default: True |
format_sql_queries boolean | Whether to format sql queries Default: False |
include_copy_lineage boolean | Whether lineage should be collected from copy commands Default: True |
include_operational_stats boolean | Whether to display operational stats. Default: True |
include_read_operational_stats boolean | Whether to report read operational stats. Experimental. Default: False |
include_share_lineage boolean | Whether lineage should be collected from datashares Default: True |
include_table_lineage boolean | Whether table lineage should be ingested. Default: True |
include_table_location_lineage boolean | If the source supports it, include table lineage to the underlying storage location. Default: True |
include_table_rename_lineage boolean | Whether we should follow alter table ... rename to statements when computing lineage. Default: True |
include_tables boolean | Whether tables should be ingested. Default: True |
include_top_n_queries boolean | Whether to ingest the top_n_queries. Default: True |
include_unload_lineage boolean | Whether lineage should be collected from unload commands Default: True |
include_usage_statistics boolean | Generate usage statistic. email_domain config parameter needs to be set if enabled Default: False |
include_view_column_lineage boolean | Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires include_view_lineage to be enabled. Default: True |
include_view_lineage boolean | Populates view->view and table->view lineage using DataHub's sql parser. Default: True |
include_views boolean | Whether views should be ingested. Default: True |
incremental_lineage boolean | When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run. Default: False |
is_serverless boolean | Whether target Redshift instance is serverless (alternative is provisioned cluster) Default: False |
lineage_generate_queries boolean | Whether to generate queries entities for the SQL-based lineage collector. Default: True |
match_fully_qualified_names boolean | Whether schema_pattern is matched against fully qualified schema name <database>.<schema> . Default: False |
options object | Any options specified here will be passed to SQLAlchemy.create_engine as kwargs. To set connection arguments in the URL, specify them under connect_args . |
password One of string(password), null | password Default: None |
patch_custom_properties boolean | Whether to patch custom properties on existing datasets rather than replace. Default: True |
platform_instance One of string, null | The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details. Default: None |
platform_instance_map One of string, null | A holder for platform -> platform_instance mappings to generate correct dataset urns Default: None |
resolve_temp_table_in_lineage boolean | Whether to resolve temp table appear in lineage to upstream permanent tables. Default: True |
skip_external_tables boolean | Whether to skip EXTERNAL tables. Default: False |
sqlalchemy_uri One of string, null | URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters. Default: None |
start_time string(date-time) | Earliest date of lineage/usage to consider. Default: Last full day in UTC (or hour, depending on bucket_duration ). You can also specify relative time with respect to end_time such as '-7 days' Or '-7d'. Default: None |
table_lineage_mode Enum | One of: "sql_based", "stl_scan_based", "mixed" |
top_n_queries integer | Number of top queries to save to each table. Default: 10 |
use_file_backed_cache boolean | Whether to use a file backed cache for the view definitions. Default: True |
username One of string, null | username Default: None |
env string | The environment that all assets produced by this connector belong to Default: PROD |
domain map(str,AllowDenyPattern) | A class to store allow deny regexes |
domain. key .allow array | List of regex patterns to include in ingestion Default: ['.*'] |
domain. key .allow.string string | |
domain. key .ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
domain. key .deny array | List of regex patterns to exclude from ingestion. Default: [] |
domain. key .deny.string string | |
profile_pattern AllowDenyPattern | A class to store allow deny regexes |
profile_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
s3_lineage_config S3LineageProviderConfig | Any source that produces s3 lineage from/to Datasets should inherit this class. |
s3_lineage_config.ignore_non_path_spec_path boolean | Ignore paths that are not match in path_specs. It only applies if path_specs are specified. Default: False |
s3_lineage_config.strip_urls boolean | Strip filename from s3 url. It only applies if path_specs are not specified. Default: True |
s3_lineage_config.path_specs array | List of PathSpec. See below the details about PathSpec Default: [] |
s3_lineage_config.path_specs.PathSpec PathSpec | |
s3_lineage_config.path_specs.PathSpec.include ❓ string | Path to table. Name variable {table} is used to mark the folder with dataset. In absence of {table} , file level dataset will be created. Check below examples for more details. |
s3_lineage_config.path_specs.PathSpec.allow_double_stars boolean | Allow double stars in the include path. This can affect performance significantly if enabled Default: False |
s3_lineage_config.path_specs.PathSpec.autodetect_partitions boolean | Autodetect partition(s) from the path. If set to true, it will autodetect partition key/value if the folder format is {partition_key}={partition_value} for example year=2024 Default: True |
s3_lineage_config.path_specs.PathSpec.default_extension One of string, null | For files without extension it will assume the specified file type. If it is not set the files without extensions will be skipped. Default: None |
s3_lineage_config.path_specs.PathSpec.enable_compression boolean | Enable or disable processing compressed files. Currently .gz and .bz files are supported. Default: True |
s3_lineage_config.path_specs.PathSpec.include_hidden_folders boolean | Include hidden folders in the traversal (folders starting with . or _). Default: False |
s3_lineage_config.path_specs.PathSpec.sample_files boolean | Instead of listing all files, only a handful of sample files are used to infer the schema. File count and file size calculation will be disabled. This can affect performance significantly if enabled. Default: True |
s3_lineage_config.path_specs.PathSpec.table_name One of string, null | Display name of the dataset. Combination of named variables from include path and strings Default: None |
s3_lineage_config.path_specs.PathSpec.traversal_method Enum | One of: "ALL", "MIN_MAX", "MAX" |
s3_lineage_config.path_specs.PathSpec.exclude One of array, null | list of paths in glob pattern which will be excluded while scanning for the datasets Default: [] |
s3_lineage_config.path_specs.PathSpec.exclude.string string | |
s3_lineage_config.path_specs.PathSpec.file_types array | Only files with the extensions specified here (a subset of the default value) will be scanned to create datasets. Other files will be omitted. Default: ['csv', 'tsv', 'json', 'parquet', 'avro'] |
s3_lineage_config.path_specs.PathSpec.file_types.string string | |
s3_lineage_config.path_specs.PathSpec.tables_filter_pattern AllowDenyPattern | A class to store allow deny regexes |
s3_lineage_config.path_specs.PathSpec.tables_filter_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
schema_pattern AllowDenyPattern | A class to store allow deny regexes |
schema_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
table_pattern AllowDenyPattern | A class to store allow deny regexes |
table_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
user_email_pattern AllowDenyPattern | A class to store allow deny regexes |
user_email_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
view_pattern AllowDenyPattern | A class to store allow deny regexes |
view_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
classification ClassificationConfig | |
classification.enabled boolean | Whether classification should be used to auto-detect glossary terms Default: False |
classification.info_type_to_term map(str,string) | |
classification.max_workers integer | Number of worker processes to use for classification. Set to 1 to disable. Default: 4 |
classification.sample_size integer | Number of sample values used for classification. Default: 100 |
classification.classifiers array | Classifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in the sequence take precedence. Default: [{'type': 'datahub', 'config': None}] |
classification.classifiers.DynamicTypedClassifierConfig DynamicTypedClassifierConfig | |
classification.classifiers.DynamicTypedClassifierConfig.type ❓ string | The type of the classifier to use. For DataHub, use datahub |
classification.classifiers.DynamicTypedClassifierConfig.config One of object, null | The configuration required for initializing the classifier. If not specified, uses defaults for classifer type. Default: None |
classification.column_pattern AllowDenyPattern | A class to store allow deny regexes |
classification.column_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
classification.table_pattern AllowDenyPattern | A class to store allow deny regexes |
classification.table_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
profiling GEProfilingConfig | |
profiling.catch_exceptions boolean | Default: True |
profiling.enabled boolean | Whether profiling should be done. Default: False |
profiling.field_sample_values_limit integer | Upper limit for number of sample values to collect for all columns. Default: 20 |
profiling.include_field_distinct_count boolean | Whether to profile for the number of distinct values for each column. Default: True |
profiling.include_field_distinct_value_frequencies boolean | Whether to profile for distinct value frequencies. Default: False |
profiling.include_field_histogram boolean | Whether to profile for the histogram for numeric fields. Default: False |
profiling.include_field_max_value boolean | Whether to profile for the max value of numeric columns. Default: True |
profiling.include_field_mean_value boolean | Whether to profile for the mean value of numeric columns. Default: True |
profiling.include_field_median_value boolean | Whether to profile for the median value of numeric columns. Default: True |
profiling.include_field_min_value boolean | Whether to profile for the min value of numeric columns. Default: True |
profiling.include_field_null_count boolean | Whether to profile for the number of nulls for each column. Default: True |
profiling.include_field_quantiles boolean | Whether to profile for the quantiles of numeric columns. Default: False |
profiling.include_field_sample_values boolean | Whether to profile for the sample values for all columns. Default: True |
profiling.include_field_stddev_value boolean | Whether to profile for the standard deviation of numeric columns. Default: True |
profiling.limit One of integer, null | Max number of documents to profile. By default, profiles all documents. Default: None |
profiling.max_number_of_fields_to_profile One of integer, null | A positive integer that specifies the maximum number of columns to profile for any table. None implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up. Default: None |
profiling.max_workers integer | Number of worker threads to use for profiling. Set to 1 to disable. Default: 20 |
profiling.offset One of integer, null | Offset in documents to profile. By default, uses no offset. Default: None |
profiling.partition_datetime One of string(date-time), null | If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only Bigquery supports this. Default: None |
profiling.partition_profiling_enabled boolean | Whether to profile partitioned tables. Only BigQuery and Aws Athena supports this. If enabled, latest partition data is used for profiling. Default: True |
profiling.profile_external_tables boolean | Whether to profile external tables. Only Snowflake and Redshift supports this. Default: False |
profiling.profile_if_updated_since_days One of number, null | Profile table only if it has been updated since these many number of days. If set to null , no constraint of last modified time for tables to profile. Supported only in snowflake and BigQuery . Default: None |
profiling.profile_nested_fields boolean | Whether to profile complex types like structs, arrays and maps. Default: False |
profiling.profile_table_level_only boolean | Whether to perform profiling at table-level only, or include column-level profiling as well. Default: False |
profiling.profile_table_row_count_estimate_only boolean | Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. Default: False |
profiling.profile_table_row_limit One of integer, null | Profile tables only if their row count is less than specified count. If set to null , no limit on the row count of tables to profile. Supported only in Snowflake , BigQuery . Supported for Oracle based on gathered stats. Default: 5000000 |
profiling.profile_table_size_limit One of integer, null | Profile tables only if their size is less than specified GBs. If set to null , no limit on the size of tables to profile. Supported only in Snowflake , BigQuery and Databricks . Supported for Oracle based on calculated size from gathered stats. Default: 5 |
profiling.query_combiner_enabled boolean | This feature is still experimental and can be disabled if it causes issues. Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible. Default: True |
profiling.report_dropped_profiles boolean | Whether to report datasets or dataset columns which were not profiled. Set to True for debugging purposes. Default: False |
profiling.sample_size integer | Number of rows to be sampled from the table for column-level profiling. Applicable only if use_sampling is set to True. Default: 10000 |
profiling.turn_off_expensive_profiling_metrics boolean | Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10. Default: False |
profiling.use_sampling boolean | Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables. Default: True |
profiling.operation_config OperationConfig | |
profiling.operation_config.lower_freq_profile_enabled boolean | Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling. Default: False |
profiling.operation_config.profile_date_of_month One of integer, null | Number between 1 and 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take effect. Default: None |
profiling.operation_config.profile_day_of_week One of integer, null | Number between 0 and 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take effect. Default: None |
profiling.tags_to_ignore_sampling One of array, null | Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on use_sampling . Default: None |
profiling.tags_to_ignore_sampling.string string | |
stateful_ingestion One of StatefulStaleMetadataRemovalConfig, null | Default: None |
stateful_ingestion.enabled boolean | Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False Default: False |
stateful_ingestion.fail_safe_threshold number | Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'. Default: 75.0 |
stateful_ingestion.remove_stale_metadata boolean | Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True |
The JSONSchema for this configuration is inlined below.
{
"$defs": {
"AllowDenyPattern": {
"additionalProperties": false,
"description": "A class to store allow deny regexes",
"properties": {
"allow": {
"default": [
".*"
],
"description": "List of regex patterns to include in ingestion",
"items": {
"type": "string"
},
"title": "Allow",
"type": "array"
},
"deny": {
"default": [],
"description": "List of regex patterns to exclude from ingestion.",
"items": {
"type": "string"
},
"title": "Deny",
"type": "array"
},
"ignoreCase": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Whether to ignore case sensitivity during pattern matching.",
"title": "Ignorecase"
}
},
"title": "AllowDenyPattern",
"type": "object"
},
"BucketDuration": {
"enum": [
"DAY",
"HOUR"
],
"title": "BucketDuration",
"type": "string"
},
"ClassificationConfig": {
"additionalProperties": false,
"properties": {
"enabled": {
"default": false,
"description": "Whether classification should be used to auto-detect glossary terms",
"title": "Enabled",
"type": "boolean"
},
"sample_size": {
"default": 100,
"description": "Number of sample values used for classification.",
"title": "Sample Size",
"type": "integer"
},
"max_workers": {
"default": 4,
"description": "Number of worker processes to use for classification. Set to 1 to disable.",
"title": "Max Workers",
"type": "integer"
},
"table_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter tables for classification. This is used in combination with other patterns in parent config. Specify regex to match the entire table name in `database.schema.table` format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"column_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter columns for classification. This is used in combination with other patterns in parent config. Specify regex to match the column name in `database.schema.table.column` format."
},
"info_type_to_term": {
"additionalProperties": {
"type": "string"
},
"default": {},
"description": "Optional mapping to provide glossary term identifier for info type",
"title": "Info Type To Term",
"type": "object"
},
"classifiers": {
"default": [
{
"type": "datahub",
"config": null
}
],
"description": "Classifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in sequence take precedance.",
"items": {
"$ref": "#/$defs/DynamicTypedClassifierConfig"
},
"title": "Classifiers",
"type": "array"
}
},
"title": "ClassificationConfig",
"type": "object"
},
"DynamicTypedClassifierConfig": {
"additionalProperties": false,
"properties": {
"type": {
"description": "The type of the classifier to use. For DataHub, use `datahub`",
"title": "Type",
"type": "string"
},
"config": {
"anyOf": [
{},
{
"type": "null"
}
],
"default": null,
"description": "The configuration required for initializing the classifier. If not specified, uses defaults for classifer type.",
"title": "Config"
}
},
"required": [
"type"
],
"title": "DynamicTypedClassifierConfig",
"type": "object"
},
"FolderTraversalMethod": {
"enum": [
"ALL",
"MIN_MAX",
"MAX"
],
"title": "FolderTraversalMethod",
"type": "string"
},
"GEProfilingConfig": {
"additionalProperties": false,
"properties": {
"enabled": {
"default": false,
"description": "Whether profiling should be done.",
"title": "Enabled",
"type": "boolean"
},
"operation_config": {
"$ref": "#/$defs/OperationConfig",
"description": "Experimental feature. To specify operation configs."
},
"limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Max number of documents to profile. By default, profiles all documents.",
"title": "Limit"
},
"offset": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Offset in documents to profile. By default, uses no offset.",
"title": "Offset"
},
"profile_table_level_only": {
"default": false,
"description": "Whether to perform profiling at table-level only, or include column-level profiling as well.",
"title": "Profile Table Level Only",
"type": "boolean"
},
"include_field_null_count": {
"default": true,
"description": "Whether to profile for the number of nulls for each column.",
"title": "Include Field Null Count",
"type": "boolean"
},
"include_field_distinct_count": {
"default": true,
"description": "Whether to profile for the number of distinct values for each column.",
"title": "Include Field Distinct Count",
"type": "boolean"
},
"include_field_min_value": {
"default": true,
"description": "Whether to profile for the min value of numeric columns.",
"title": "Include Field Min Value",
"type": "boolean"
},
"include_field_max_value": {
"default": true,
"description": "Whether to profile for the max value of numeric columns.",
"title": "Include Field Max Value",
"type": "boolean"
},
"include_field_mean_value": {
"default": true,
"description": "Whether to profile for the mean value of numeric columns.",
"title": "Include Field Mean Value",
"type": "boolean"
},
"include_field_median_value": {
"default": true,
"description": "Whether to profile for the median value of numeric columns.",
"title": "Include Field Median Value",
"type": "boolean"
},
"include_field_stddev_value": {
"default": true,
"description": "Whether to profile for the standard deviation of numeric columns.",
"title": "Include Field Stddev Value",
"type": "boolean"
},
"include_field_quantiles": {
"default": false,
"description": "Whether to profile for the quantiles of numeric columns.",
"title": "Include Field Quantiles",
"type": "boolean"
},
"include_field_distinct_value_frequencies": {
"default": false,
"description": "Whether to profile for distinct value frequencies.",
"title": "Include Field Distinct Value Frequencies",
"type": "boolean"
},
"include_field_histogram": {
"default": false,
"description": "Whether to profile for the histogram for numeric fields.",
"title": "Include Field Histogram",
"type": "boolean"
},
"include_field_sample_values": {
"default": true,
"description": "Whether to profile for the sample values for all columns.",
"title": "Include Field Sample Values",
"type": "boolean"
},
"max_workers": {
"default": 20,
"description": "Number of worker threads to use for profiling. Set to 1 to disable.",
"title": "Max Workers",
"type": "integer"
},
"report_dropped_profiles": {
"default": false,
"description": "Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.",
"title": "Report Dropped Profiles",
"type": "boolean"
},
"turn_off_expensive_profiling_metrics": {
"default": false,
"description": "Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10.",
"title": "Turn Off Expensive Profiling Metrics",
"type": "boolean"
},
"field_sample_values_limit": {
"default": 20,
"description": "Upper limit for number of sample values to collect for all columns.",
"title": "Field Sample Values Limit",
"type": "integer"
},
"max_number_of_fields_to_profile": {
"anyOf": [
{
"exclusiveMinimum": 0,
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.",
"title": "Max Number Of Fields To Profile"
},
"profile_if_updated_since_days": {
"anyOf": [
{
"exclusiveMinimum": 0,
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Profile table only if it has been updated since these many number of days. If set to `null`, no constraint of last modified time for tables to profile. Supported only in `snowflake` and `BigQuery`.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery"
]
},
"title": "Profile If Updated Since Days"
},
"profile_table_size_limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 5,
"description": "Profile tables only if their size is less than specified GBs. If set to `null`, no limit on the size of tables to profile. Supported only in `Snowflake`, `BigQuery` and `Databricks`. Supported for `Oracle` based on calculated size from gathered stats.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery",
"unity-catalog",
"oracle"
]
},
"title": "Profile Table Size Limit"
},
"profile_table_row_limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 5000000,
"description": "Profile tables only if their row count is less than specified count. If set to `null`, no limit on the row count of tables to profile. Supported only in `Snowflake`, `BigQuery`. Supported for `Oracle` based on gathered stats.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery",
"oracle"
]
},
"title": "Profile Table Row Limit"
},
"profile_table_row_count_estimate_only": {
"default": false,
"description": "Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. ",
"schema_extra": {
"supported_sources": [
"postgres",
"mysql"
]
},
"title": "Profile Table Row Count Estimate Only",
"type": "boolean"
},
"query_combiner_enabled": {
"default": true,
"description": "*This feature is still experimental and can be disabled if it causes issues.* Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.",
"title": "Query Combiner Enabled",
"type": "boolean"
},
"catch_exceptions": {
"default": true,
"description": "",
"title": "Catch Exceptions",
"type": "boolean"
},
"partition_profiling_enabled": {
"default": true,
"description": "Whether to profile partitioned tables. Only BigQuery and Aws Athena supports this. If enabled, latest partition data is used for profiling.",
"schema_extra": {
"supported_sources": [
"athena",
"bigquery"
]
},
"title": "Partition Profiling Enabled",
"type": "boolean"
},
"partition_datetime": {
"anyOf": [
{
"format": "date-time",
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only Bigquery supports this.",
"schema_extra": {
"supported_sources": [
"bigquery"
]
},
"title": "Partition Datetime"
},
"use_sampling": {
"default": true,
"description": "Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables. ",
"schema_extra": {
"supported_sources": [
"bigquery",
"snowflake"
]
},
"title": "Use Sampling",
"type": "boolean"
},
"sample_size": {
"default": 10000,
"description": "Number of rows to be sampled from table for column level profiling.Applicable only if `use_sampling` is set to True.",
"schema_extra": {
"supported_sources": [
"bigquery",
"snowflake"
]
},
"title": "Sample Size",
"type": "integer"
},
"profile_external_tables": {
"default": false,
"description": "Whether to profile external tables. Only Snowflake and Redshift supports this.",
"schema_extra": {
"supported_sources": [
"redshift",
"snowflake"
]
},
"title": "Profile External Tables",
"type": "boolean"
},
"tags_to_ignore_sampling": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on `use_sampling`.",
"title": "Tags To Ignore Sampling"
},
"profile_nested_fields": {
"default": false,
"description": "Whether to profile complex types like structs, arrays and maps. ",
"title": "Profile Nested Fields",
"type": "boolean"
}
},
"title": "GEProfilingConfig",
"type": "object"
},
"LineageMode": {
"enum": [
"sql_based",
"stl_scan_based",
"mixed"
],
"title": "LineageMode",
"type": "string"
},
"OperationConfig": {
"additionalProperties": false,
"properties": {
"lower_freq_profile_enabled": {
"default": false,
"description": "Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling.",
"title": "Lower Freq Profile Enabled",
"type": "boolean"
},
"profile_day_of_week": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number between 0 to 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take affect.",
"title": "Profile Day Of Week"
},
"profile_date_of_month": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number between 1 to 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take affect.",
"title": "Profile Date Of Month"
}
},
"title": "OperationConfig",
"type": "object"
},
"PathSpec": {
"additionalProperties": false,
"properties": {
"include": {
"description": "Path to table. Name variable `{table}` is used to mark the folder with dataset. In absence of `{table}`, file level dataset will be created. Check below examples for more details.",
"title": "Include",
"type": "string"
},
"exclude": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": [],
"description": "list of paths in glob pattern which will be excluded while scanning for the datasets",
"title": "Exclude"
},
"file_types": {
"default": [
"csv",
"tsv",
"json",
"parquet",
"avro"
],
"description": "Files with extenstions specified here (subset of default value) only will be scanned to create dataset. Other files will be omitted.",
"items": {
"type": "string"
},
"title": "File Types",
"type": "array"
},
"default_extension": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "For files without extension it will assume the specified file type. If it is not set the files without extensions will be skipped.",
"title": "Default Extension"
},
"table_name": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Display name of the dataset.Combination of named variables from include path and strings",
"title": "Table Name"
},
"enable_compression": {
"default": true,
"description": "Enable or disable processing compressed files. Currently .gz and .bz files are supported.",
"title": "Enable Compression",
"type": "boolean"
},
"sample_files": {
"default": true,
"description": "Not listing all the files but only taking a handful amount of sample file to infer the schema. File count and file size calculation will be disabled. This can affect performance significantly if enabled",
"title": "Sample Files",
"type": "boolean"
},
"allow_double_stars": {
"default": false,
"description": "Allow double stars in the include path. This can affect performance significantly if enabled",
"title": "Allow Double Stars",
"type": "boolean"
},
"autodetect_partitions": {
"default": true,
"description": "Autodetect partition(s) from the path. If set to true, it will autodetect partition key/value if the folder format is {partition_key}={partition_value} for example `year=2024`",
"title": "Autodetect Partitions",
"type": "boolean"
},
"traversal_method": {
"$ref": "#/$defs/FolderTraversalMethod",
"default": "MAX",
"description": "Method to traverse the folder. ALL: Traverse all the folders, MIN_MAX: Traverse the folders by finding min and max value, MAX: Traverse the folder with max value"
},
"include_hidden_folders": {
"default": false,
"description": "Include hidden folders in the traversal (folders starting with . or _",
"title": "Include Hidden Folders",
"type": "boolean"
},
"tables_filter_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "The tables_filter_pattern configuration field uses regular expressions to filter the tables part of the Pathspec for ingestion, allowing fine-grained control over which tables are included or excluded based on specified patterns. The default setting allows all tables."
}
},
"required": [
"include"
],
"title": "PathSpec",
"type": "object"
},
"S3LineageProviderConfig": {
"additionalProperties": false,
"description": "Any source that produces s3 lineage from/to Datasets should inherit this class.",
"properties": {
"path_specs": {
"default": [],
"description": "List of PathSpec. See below the details about PathSpec",
"items": {
"$ref": "#/$defs/PathSpec"
},
"title": "Path Specs",
"type": "array"
},
"strip_urls": {
"default": true,
"description": "Strip filename from s3 url. It only applies if path_specs are not specified.",
"title": "Strip Urls",
"type": "boolean"
},
"ignore_non_path_spec_path": {
"default": false,
"description": "Ignore paths that are not match in path_specs. It only applies if path_specs are specified.",
"title": "Ignore Non Path Spec Path",
"type": "boolean"
}
},
"title": "S3LineageProviderConfig",
"type": "object"
},
"StatefulStaleMetadataRemovalConfig": {
"additionalProperties": false,
"description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
"properties": {
"enabled": {
"default": false,
"description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
"title": "Enabled",
"type": "boolean"
},
"remove_stale_metadata": {
"default": true,
"description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
"title": "Remove Stale Metadata",
"type": "boolean"
},
"fail_safe_threshold": {
"default": 75.0,
"description": "Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.",
"maximum": 100.0,
"minimum": 0.0,
"title": "Fail Safe Threshold",
"type": "number"
}
},
"title": "StatefulStaleMetadataRemovalConfig",
"type": "object"
}
},
"additionalProperties": false,
"properties": {
"schema_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for schemas to filter in ingestion. Specify regex to only match the schema name. e.g. to match all tables in schema analytics, use the regex 'analytics'"
},
"table_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"view_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"classification": {
"$ref": "#/$defs/ClassificationConfig",
"default": {
"enabled": false,
"sample_size": 100,
"max_workers": 4,
"table_pattern": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"column_pattern": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"info_type_to_term": {},
"classifiers": [
{
"config": null,
"type": "datahub"
}
]
},
"description": "For details, refer to [Classification](../../../../metadata-ingestion/docs/dev_guides/classification.md)."
},
"enable_stateful_profiling": {
"default": true,
"description": "Enable stateful profiling. This will store profiling timestamps per dataset after successful profiling. and will not run profiling again in subsequent run if table has not been updated. ",
"title": "Enable Stateful Profiling",
"type": "boolean"
},
"enable_stateful_lineage_ingestion": {
"default": true,
"description": "Enable stateful lineage ingestion. This will store lineage window timestamps after successful lineage ingestion. and will not run lineage ingestion for same timestamps in subsequent run. ",
"title": "Enable Stateful Lineage Ingestion",
"type": "boolean"
},
"bucket_duration": {
"$ref": "#/$defs/BucketDuration",
"default": "DAY",
"description": "Size of the time window to aggregate usage stats."
},
"end_time": {
"description": "Latest date of lineage/usage to consider. Default: Current time in UTC",
"format": "date-time",
"title": "End Time",
"type": "string"
},
"start_time": {
"default": null,
"description": "Earliest date of lineage/usage to consider. Default: Last full day in UTC (or hour, depending on `bucket_duration`). You can also specify relative time with respect to end_time such as '-7 days' Or '-7d'.",
"format": "date-time",
"title": "Start Time",
"type": "string"
},
"enable_stateful_usage_ingestion": {
"default": true,
"description": "Enable stateful lineage ingestion. This will store usage window timestamps after successful usage ingestion. and will not run usage ingestion for same timestamps in subsequent run. ",
"title": "Enable Stateful Usage Ingestion",
"type": "boolean"
},
"top_n_queries": {
"default": 10,
"description": "Number of top queries to save to each table.",
"exclusiveMinimum": 0,
"title": "Top N Queries",
"type": "integer"
},
"user_email_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "regex patterns for user emails to filter in usage."
},
"include_operational_stats": {
"default": true,
"description": "Whether to display operational stats.",
"title": "Include Operational Stats",
"type": "boolean"
},
"include_read_operational_stats": {
"default": false,
"description": "Whether to report read operational stats. Experimental.",
"title": "Include Read Operational Stats",
"type": "boolean"
},
"format_sql_queries": {
"default": false,
"description": "Whether to format sql queries",
"title": "Format Sql Queries",
"type": "boolean"
},
"include_top_n_queries": {
"default": true,
"description": "Whether to ingest the top_n_queries.",
"title": "Include Top N Queries",
"type": "boolean"
},
"email_domain": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Email domain of your organisation so users can be displayed on UI appropriately.",
"title": "Email Domain"
},
"incremental_lineage": {
"default": false,
"description": "When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.",
"title": "Incremental Lineage",
"type": "boolean"
},
"s3_lineage_config": {
"$ref": "#/$defs/S3LineageProviderConfig",
"default": {
"path_specs": [],
"strip_urls": true,
"ignore_non_path_spec_path": false
},
"description": "Common config for S3 lineage generation"
},
"convert_urns_to_lowercase": {
"default": false,
"description": "Whether to convert dataset urns to lowercase.",
"title": "Convert Urns To Lowercase",
"type": "boolean"
},
"env": {
"default": "PROD",
"description": "The environment that all assets produced by this connector belong to",
"title": "Env",
"type": "string"
},
"platform_instance_map": {
"anyOf": [
{
"additionalProperties": {
"type": "string"
},
"type": "object"
},
{
"type": "null"
}
],
"default": null,
"description": "A holder for platform -> platform_instance mappings to generate correct dataset urns",
"title": "Platform Instance Map"
},
"platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.",
"title": "Platform Instance"
},
"stateful_ingestion": {
"anyOf": [
{
"$ref": "#/$defs/StatefulStaleMetadataRemovalConfig"
},
{
"type": "null"
}
],
"default": null
},
"options": {
"additionalProperties": true,
"description": "Any options specified here will be passed to [SQLAlchemy.create_engine](https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine) as kwargs. To set connection arguments in the URL, specify them under `connect_args`.",
"title": "Options",
"type": "object"
},
"profile_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter tables (or specific columns) for profiling during ingestion. Note that only tables allowed by the `table_pattern` will be considered."
},
"domain": {
"additionalProperties": {
"$ref": "#/$defs/AllowDenyPattern"
},
"default": {},
"description": "Attach domains to databases, schemas or tables during ingestion using regex patterns. Domain key can be a guid like *urn:li:domain:ec428203-ce86-4db3-985d-5a8ee6df32ba* or a string like \"Marketing\".) If you provide strings, then datahub will attempt to resolve this name to a guid, and will error out if this fails. There can be multiple domain keys specified.",
"title": "Domain",
"type": "object"
},
"include_views": {
"default": true,
"description": "Whether views should be ingested.",
"title": "Include Views",
"type": "boolean"
},
"include_tables": {
"default": true,
"description": "Whether tables should be ingested.",
"title": "Include Tables",
"type": "boolean"
},
"include_table_location_lineage": {
"default": true,
"description": "If the source supports it, include table lineage to the underlying storage location.",
"title": "Include Table Location Lineage",
"type": "boolean"
},
"include_view_lineage": {
"default": true,
"description": "Populates view->view and table->view lineage using DataHub's sql parser.",
"title": "Include View Lineage",
"type": "boolean"
},
"include_view_column_lineage": {
"default": true,
"description": "Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires `include_view_lineage` to be enabled.",
"title": "Include View Column Lineage",
"type": "boolean"
},
"use_file_backed_cache": {
"default": true,
"description": "Whether to use a file backed cache for the view definitions.",
"title": "Use File Backed Cache",
"type": "boolean"
},
"profiling": {
"$ref": "#/$defs/GEProfilingConfig",
"default": {
"enabled": false,
"operation_config": {
"lower_freq_profile_enabled": false,
"profile_date_of_month": null,
"profile_day_of_week": null
},
"limit": null,
"offset": null,
"profile_table_level_only": false,
"include_field_null_count": true,
"include_field_distinct_count": true,
"include_field_min_value": true,
"include_field_max_value": true,
"include_field_mean_value": true,
"include_field_median_value": true,
"include_field_stddev_value": true,
"include_field_quantiles": false,
"include_field_distinct_value_frequencies": false,
"include_field_histogram": false,
"include_field_sample_values": true,
"max_workers": 20,
"report_dropped_profiles": false,
"turn_off_expensive_profiling_metrics": false,
"field_sample_values_limit": 20,
"max_number_of_fields_to_profile": null,
"profile_if_updated_since_days": null,
"profile_table_size_limit": 5,
"profile_table_row_limit": 5000000,
"profile_table_row_count_estimate_only": false,
"query_combiner_enabled": true,
"catch_exceptions": true,
"partition_profiling_enabled": true,
"partition_datetime": null,
"use_sampling": true,
"sample_size": 10000,
"profile_external_tables": false,
"tags_to_ignore_sampling": null,
"profile_nested_fields": false
}
},
"username": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "username",
"title": "Username"
},
"password": {
"anyOf": [
{
"format": "password",
"type": "string",
"writeOnly": true
},
{
"type": "null"
}
],
"default": null,
"description": "password",
"title": "Password"
},
"host_port": {
"description": "host URL",
"title": "Host Port",
"type": "string"
},
"database": {
"default": "dev",
"description": "database",
"title": "Database",
"type": "string"
},
"sqlalchemy_uri": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters.",
"title": "Sqlalchemy Uri"
},
"default_schema": {
"default": "public",
"description": "The default schema to use if the sql parser fails to parse the schema with `sql_based` lineage collector",
"title": "Default Schema",
"type": "string"
},
"is_serverless": {
"default": false,
"description": "Whether target Redshift instance is serverless (alternative is provisioned cluster)",
"title": "Is Serverless",
"type": "boolean"
},
"lineage_generate_queries": {
"default": true,
"description": "Whether to generate queries entities for the SQL-based lineage collector.",
"title": "Lineage Generate Queries",
"type": "boolean"
},
"include_table_lineage": {
"default": true,
"description": "Whether table lineage should be ingested.",
"title": "Include Table Lineage",
"type": "boolean"
},
"include_copy_lineage": {
"default": true,
"description": "Whether lineage should be collected from copy commands",
"title": "Include Copy Lineage",
"type": "boolean"
},
"include_share_lineage": {
"default": true,
"description": "Whether lineage should be collected from datashares",
"title": "Include Share Lineage",
"type": "boolean"
},
"include_usage_statistics": {
"default": false,
"description": "Generate usage statistic. email_domain config parameter needs to be set if enabled",
"title": "Include Usage Statistics",
"type": "boolean"
},
"include_unload_lineage": {
"default": true,
"description": "Whether lineage should be collected from unload commands",
"title": "Include Unload Lineage",
"type": "boolean"
},
"include_table_rename_lineage": {
"default": true,
"description": "Whether we should follow `alter table ... rename to` statements when computing lineage. ",
"title": "Include Table Rename Lineage",
"type": "boolean"
},
"table_lineage_mode": {
"$ref": "#/$defs/LineageMode",
"default": "mixed",
"description": "Which table lineage collector mode to use. Available modes are: [stl_scan_based, sql_based, mixed]"
},
"extra_client_options": {
"additionalProperties": true,
"default": {},
"title": "Extra Client Options",
"type": "object"
},
"match_fully_qualified_names": {
"default": false,
"description": "Whether `schema_pattern` is matched against fully qualified schema name `<database>.<schema>`.",
"title": "Match Fully Qualified Names",
"type": "boolean"
},
"extract_column_level_lineage": {
"default": true,
"description": "Whether to extract column level lineage. This config works with rest-sink only.",
"title": "Extract Column Level Lineage",
"type": "boolean"
},
"patch_custom_properties": {
"default": true,
"description": "Whether to patch custom properties on existing datasets rather than replace.",
"title": "Patch Custom Properties",
"type": "boolean"
},
"resolve_temp_table_in_lineage": {
"default": true,
"description": "Whether to resolve temp table appear in lineage to upstream permanent tables.",
"title": "Resolve Temp Table In Lineage",
"type": "boolean"
},
"skip_external_tables": {
"default": false,
"description": "Whether to skip EXTERNAL tables.",
"title": "Skip External Tables",
"type": "boolean"
}
},
"required": [
"host_port"
],
"title": "RedshiftConfig",
"type": "object"
}
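All of the options above go under the source config block of an ingestion recipe. The following is a minimal sketch only; the host, credentials, pattern values, and sink endpoint are placeholders you must replace for your environment:
source:
  type: redshift
  config:
    host_port: your-cluster.example.us-east-1.redshift.amazonaws.com:5439   # placeholder endpoint
    database: dev
    username: datahub
    password: "${REDSHIFT_PASSWORD}"          # supplied via environment variable
    schema_pattern:
      allow:
        - "analytics"
    include_usage_statistics: true
    email_domain: example.com                 # needed when usage statistics are enabled
    table_lineage_mode: mixed
    profiling:
      enabled: false
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080             # placeholder DataHub endpoint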
Troubleshooting
Schema Discovery Issues
If you're not seeing all schemas or tables after following the setup steps, check the following:
Missing Schemas
1. Check schema filtering configuration (a database-qualified variant is sketched after this list):
# In your recipe, ensure schema patterns are correct
schema_pattern:
  allow:
    - "your_schema_name"
    - "public"
  # Remove deny patterns that might be blocking schemas
2. Verify permissions on specific schemas:
-- Test if you can see schemas
SELECT schema_name, schema_type
FROM svv_redshift_schemas
WHERE database_name = 'your_database';
-- Test external schemas
SELECT schemaname, eskind, databasename
FROM SVV_EXTERNAL_SCHEMAS;
3. Check for external schemas: External schemas (Redshift Spectrum) require all of the following permissions:
GRANT SELECT ON pg_catalog.svv_external_schemas TO datahub_user;
GRANT SELECT ON pg_catalog.svv_external_tables TO datahub_user;
GRANT SELECT ON pg_catalog.svv_external_columns TO datahub_user;
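Also note that when match_fully_qualified_names is enabled (see the config reference above), schema_pattern is matched against database.schema rather than the bare schema name, so bare-name patterns stop matching. A minimal sketch with placeholder names:
match_fully_qualified_names: true
schema_pattern:
  allow:
    - "your_database.your_schema_name"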
Missing Tables Within Schemas
1. Check table filtering:
table_pattern:
  allow:
    - "your_database.your_schema.your_table"   # patterns match the full database.schema.table name
  # Ensure no overly restrictive deny patterns
2. Test table visibility:
-- For regular tables
SELECT schemaname, tablename, tabletype
FROM pg_tables
WHERE schemaname = 'your_schema';
-- For views
SELECT schemaname, viewname
FROM pg_views
WHERE schemaname = 'your_schema';
-- For external tables
SELECT schemaname, tablename
FROM SVV_EXTERNAL_TABLES
WHERE schemaname = 'your_schema';
Configuration Issues
1. Database specification: Ensure you're connecting to the correct database - Redshift ingestion runs against one database at a time:
database: "your_actual_database_name" # Not the cluster name
2. Schema access permissions: Ensure you have USAGE permissions on the schemas you want to discover:
-- Check if you have USAGE on schemas
SELECT n.nspname as schema_name,
has_schema_privilege('datahub_user', n.nspname, 'USAGE') as has_usage
FROM pg_catalog.pg_namespace n
WHERE n.nspname NOT LIKE 'pg_%'
AND n.nspname != 'information_schema';
-- Grant USAGE if missing
GRANT USAGE ON SCHEMA your_schema_name TO datahub_user;
3. Shared database configuration: If using datashare consumers, add:
is_shared_database: true
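For context, is_shared_database is an ordinary source config option, so in a recipe it would sit alongside the connection settings; a minimal sketch with placeholder values (host and database names are assumptions to illustrate placement only):
source:
  type: redshift
  config:
    host_port: your-consumer-cluster.example:5439   # placeholder
    database: your_shared_database                  # the datashare consumer database
    is_shared_database: true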
Permission Test Queries
Run these to verify your permissions are working:
-- Test core permissions
SELECT COUNT(*) FROM svv_redshift_schemas WHERE database_name = 'your_database';
SELECT COUNT(*) FROM svv_table_info WHERE database = 'your_database';
-- Test external permissions
SELECT COUNT(*) FROM svv_external_schemas;
SELECT COUNT(*) FROM svv_external_tables;
Data Profiling Issues
Profile Data Not Appearing
1. Check data access permissions:
Ensure you have USAGE on schemas and SELECT on tables (grant statements are sketched after this list):
-- Test schema access
SELECT has_schema_privilege('datahub_user', 'your_schema', 'USAGE');
-- Test table access
SELECT has_table_privilege('datahub_user', 'your_schema.your_table', 'SELECT');
2. Enable table-level profiling only:
If you cannot grant SELECT on tables, restrict profiling to table-level statistics:
profiling:
  enabled: true                     # profiling is off by default
  profile_table_level_only: true
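If you are able to grant read access instead of restricting to table-level profiling, statements along these lines usually suffice (schema and user names are placeholders):
-- Grant schema usage and read access for profiling
GRANT USAGE ON SCHEMA your_schema TO datahub_user;
GRANT SELECT ON ALL TABLES IN SCHEMA your_schema TO datahub_user;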
Lineage Issues
Missing Lineage Information
1. Check lineage configuration (a column-level lineage fragment is sketched after this list):
table_lineage_mode: stl_scan_based # or sql_based, mixed
include_usage_statistics: true
2. Verify SYSLOG ACCESS:
-- Check whether the user has unrestricted access to system log tables
SELECT usename, syslogaccess
FROM svl_user_info
WHERE usename = 'datahub_user';
-- syslogaccess should be 'UNRESTRICTED'
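If table lineage appears but column-level lineage does not, note that column-level lineage requires the sql_based or mixed collector. A hedged recipe fragment using options from the config reference above:
table_lineage_mode: mixed
include_table_lineage: true
extract_column_level_lineage: true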
Cross-Cluster Lineage (Datashares)
For lineage across datashares, ensure:
- The DataHub user has SHARE privileges on the datashares
- Both the producer and consumer clusters are ingested
- include_share_lineage: true is set in the configuration
-- Check datashare access
SELECT * FROM svv_datashares WHERE share_name = 'your_share';
Code Coordinates
- Class Name: datahub.ingestion.source.redshift.redshift.RedshiftSource
- Browse on GitHub
Questions
If you've got any questions on configuring ingestion for Redshift, feel free to ping us on our Slack.