Version: Next

Elasticsearch

Important Capabilities

Capability	Status	Notes
Detect Deleted Entities	✅	Enabled by default via stateful ingestion.
Platform Instance	✅	Enabled by default.

This plugin extracts the following:

Metadata for indexes
Column types associated with each index field

CLI based Ingestion

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
  type: "elasticsearch"
  config:
    # Coordinates
    host: 'localhost:9200'

    # Credentials
    username: user # optional
    password: pass # optional

    # SSL support
    use_ssl: False
    verify_certs: False
    ca_certs: "./path/ca.cert"
    client_cert: "./path/client.cert"
    client_key: "./path/client.key"
    ssl_assert_hostname: False
    ssl_assert_fingerprint: "./path/cert.fingerprint"

    # Options
    url_prefix: "" # optional url_prefix
    env: "PROD"
    index_pattern:
      allow: [".*some_index_name_pattern.*"]
      deny: [".*skip_index_name_pattern.*"]
    ingest_index_templates: False
    index_template_pattern:
      allow: [".*some_index_template_name_pattern.*"]

sink:
# sink configs
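
To get a feel for how the allow and deny lists of index_pattern interact, here is a minimal Python sketch. It is not the connector's code: the helper name is_allowed is hypothetical, and the sketch assumes that patterns are matched from the start of the index name (re.match), that a deny match wins over an allow match, and that case is ignored by default, as the ignoreCase option suggests. The pattern lists are the index_pattern defaults from the JSON schema on this page.

```python
import re
from typing import List


def is_allowed(name: str, allow: List[str], deny: List[str], ignore_case: bool = True) -> bool:
    """Hypothetical helper: True if `name` passes the allow/deny lists."""
    flags = re.IGNORECASE if ignore_case else 0
    # Assumption: a deny match always wins over an allow match.
    if any(re.match(pattern, name, flags) for pattern in deny):
        return False
    return any(re.match(pattern, name, flags) for pattern in allow)


# index_pattern defaults from the schema: allow everything,
# deny names starting with "_" or "ilm-history".
default_allow = [".*"]
default_deny = ["^_.*", "^ilm-history.*"]

for index in ["orders-2024", "_internal", "ilm-history-5-000001"]:
    print(index, is_allowed(index, default_allow, default_deny))
```

Under these assumptions only orders-2024 passes the filter; the other two names are caught by the deny list.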

Config Details

Options

Note that a . is used to denote nested fields in the YAML recipe.

Field	Type	Description	Default
api_key	One of object, string, null	API key authentication. Accepts either a list with id and api_key (UTF-8 representation), or a base64-encoded string of id and api_key joined by ':'.	None
ca_certs	One of string, null	Path to a certificate authority (CA) certificate.	None
client_cert	One of string, null	Path to the file containing the private key and the certificate, or the certificate only if using client_key.	None
client_key	One of string, null	Path to the file containing the private key, if using separate certificate and key files.	None
host	string	The Elasticsearch host URI.	localhost:9200
ingest_index_templates	boolean	Ingests Elasticsearch index templates if enabled.	False
password	One of string, null	The password credential.	None
platform_instance	One of string, null	The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.	None
ssl_assert_fingerprint	One of string, null	Verify the supplied certificate fingerprint if not None.	None
ssl_assert_hostname	boolean	Use hostname verification if not False.	False
url_prefix	string	An enterprise may run multiple Elasticsearch clusters behind a single endpoint. In that setup, url_prefix is used to route requests to a specific cluster.	"" (empty string)
use_ssl	boolean	Whether to use SSL for the connection.	False
username	One of string, null	The username credential.	None
verify_certs	boolean	Whether to verify SSL certificates.	False
env	string	The environment that all assets produced by this connector belong to.	PROD
collapse_urns	CollapseUrns		
collapse_urns.urns_suffix_regex	array	List of regex patterns to remove from the name of the URN. Indices whose names differ only by a removed suffix are treated as the same dataset. The patterns are applied in order for each URN. Multiple patterns are mainly needed when the suffixes to remove come in different formats, e.g. names ending with -YYYY-MM-DD as well as names ending with -epochtime would require two regex patterns to remove the suffixes across all URNs.	
collapse_urns.urns_suffix_regex.string	string		
index_pattern	AllowDenyPattern	A class to store allow/deny regexes.	
index_pattern.ignoreCase	One of boolean, null	Whether to ignore case during pattern matching.	True
index_template_pattern	AllowDenyPattern	A class to store allow/deny regexes.	
index_template_pattern.ignoreCase	One of boolean, null	Whether to ignore case during pattern matching.	True
profiling	ElasticProfiling		
profiling.enabled	boolean	Whether to enable profiling for the Elasticsearch source.	False
profiling.operation_config	OperationConfig		
profiling.operation_config.lower_freq_profile_enabled	boolean	Whether to profile at a lower frequency. This does not do any scheduling; it only adds checks for when not to run profiling.	False
profiling.operation_config.profile_date_of_month	One of integer, null	Day of the month, from 1 to 31 (inclusive). If not specified, this field has no effect.	None
profiling.operation_config.profile_day_of_week	One of integer, null	Day of the week, from 0 to 6 (inclusive); 0 is Monday and 6 is Sunday. If not specified, this field has no effect.	None
stateful_ingestion	One of StatefulIngestionConfig, null	Stateful ingestion config.	None
stateful_ingestion.enabled	boolean	Whether to enable stateful ingestion. Defaults to True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified; otherwise False.	False
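
The effect of collapse_urns.urns_suffix_regex can be sketched in Python. This is an illustration under assumptions rather than the connector's implementation: the function collapse_name and the two example patterns are hypothetical, and the sketch assumes each pattern is stripped from the index name in list order.

```python
import re
from typing import List


def collapse_name(index_name: str, suffix_patterns: List[str]) -> str:
    """Hypothetical helper: strip each suffix pattern from the name, in order."""
    for pattern in suffix_patterns:
        index_name = re.sub(pattern, "", index_name)
    return index_name


# Example patterns (assumptions): a -YYYY-MM-DD suffix and a 10-digit epoch suffix.
patterns = [r"-\d{4}-\d{2}-\d{2}$", r"-\d{10}$"]

print(collapse_name("logs-2024-01-31", patterns))  # logs
print(collapse_name("logs-1706659200", patterns))  # logs
print(collapse_name("customers", patterns))        # customers
```

Both dated variants collapse to the same name, which is why two patterns are needed when both suffix formats occur.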

The JSONSchema for this configuration is inlined below.

{
  "$defs": {
    "AllowDenyPattern": {
      "additionalProperties": false,
      "description": "A class to store allow deny regexes",
      "properties": {
        "allow": {
          "default": [
            ".*"
          ],
          "description": "List of regex patterns to include in ingestion",
          "items": {
            "type": "string"
          },
          "title": "Allow",
          "type": "array"
        },
        "deny": {
          "default": [],
          "description": "List of regex patterns to exclude from ingestion.",
          "items": {
            "type": "string"
          },
          "title": "Deny",
          "type": "array"
        },
        "ignoreCase": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "null"
            }
          ],
          "default": true,
          "description": "Whether to ignore case sensitivity during pattern matching.",
          "title": "Ignorecase"
        }
      },
      "title": "AllowDenyPattern",
      "type": "object"
    },
    "CollapseUrns": {
      "additionalProperties": false,
      "properties": {
        "urns_suffix_regex": {
          "description": "List of regex patterns to remove from the name of the URN. All of the indices before removal of URNs are considered as the same dataset. These are applied in order for each URN.\n        The main case where you would want to have multiple of these if the name where you are trying to remove suffix from have different formats.\n        e.g. ending with -YYYY-MM-DD as well as ending -epochtime would require you to have 2 regex patterns to remove the suffixes across all URNs.",
          "items": {
            "type": "string"
          },
          "title": "Urns Suffix Regex",
          "type": "array"
        }
      },
      "title": "CollapseUrns",
      "type": "object"
    },
    "ElasticProfiling": {
      "additionalProperties": false,
      "properties": {
        "enabled": {
          "default": false,
          "description": "Whether to enable profiling for the elastic search source.",
          "title": "Enabled",
          "type": "boolean"
        },
        "operation_config": {
          "$ref": "#/$defs/OperationConfig",
          "description": "Experimental feature. To specify operation configs."
        }
      },
      "title": "ElasticProfiling",
      "type": "object"
    },
    "OperationConfig": {
      "additionalProperties": false,
      "properties": {
        "lower_freq_profile_enabled": {
          "default": false,
          "description": "Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling.",
          "title": "Lower Freq Profile Enabled",
          "type": "boolean"
        },
        "profile_day_of_week": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Number between 0 to 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take affect.",
          "title": "Profile Day Of Week"
        },
        "profile_date_of_month": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Number between 1 to 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take affect.",
          "title": "Profile Date Of Month"
        }
      },
      "title": "OperationConfig",
      "type": "object"
    },
    "StatefulIngestionConfig": {
      "additionalProperties": false,
      "description": "Basic Stateful Ingestion Specific Configuration for any source.",
      "properties": {
        "enabled": {
          "default": false,
          "description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
          "title": "Enabled",
          "type": "boolean"
        }
      },
      "title": "StatefulIngestionConfig",
      "type": "object"
    }
  },
  "additionalProperties": false,
  "properties": {
    "env": {
      "default": "PROD",
      "description": "The environment that all assets produced by this connector belong to",
      "title": "Env",
      "type": "string"
    },
    "platform_instance": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.",
      "title": "Platform Instance"
    },
    "stateful_ingestion": {
      "anyOf": [
        {
          "$ref": "#/$defs/StatefulIngestionConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Stateful Ingestion Config"
    },
    "host": {
      "default": "localhost:9200",
      "description": "The elastic search host URI.",
      "title": "Host",
      "type": "string"
    },
    "username": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "The username credential.",
      "title": "Username"
    },
    "password": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "The password credential.",
      "title": "Password"
    },
    "api_key": {
      "anyOf": [
        {},
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "API Key authentication. Accepts either a list with id and api_key (UTF-8 representation), or a base64 encoded string of id and api_key combined by ':'.",
      "title": "Api Key"
    },
    "use_ssl": {
      "default": false,
      "description": "Whether to use SSL for the connection or not.",
      "title": "Use Ssl",
      "type": "boolean"
    },
    "verify_certs": {
      "default": false,
      "description": "Whether to verify SSL certificates.",
      "title": "Verify Certs",
      "type": "boolean"
    },
    "ca_certs": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Path to a certificate authority (CA) certificate.",
      "title": "Ca Certs"
    },
    "client_cert": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Path to the file containing the private key and the certificate, or cert only if using client_key.",
      "title": "Client Cert"
    },
    "client_key": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Path to the file containing the private key if using separate cert and key files.",
      "title": "Client Key"
    },
    "ssl_assert_hostname": {
      "default": false,
      "description": "Use hostname verification if not False.",
      "title": "Ssl Assert Hostname",
      "type": "boolean"
    },
    "ssl_assert_fingerprint": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Verify the supplied certificate fingerprint if not None.",
      "title": "Ssl Assert Fingerprint"
    },
    "url_prefix": {
      "default": "",
      "description": "There are cases where an enterprise would have multiple elastic search clusters. One way for them to manage is to have a single endpoint for all the elastic search clusters and use url_prefix for routing requests to different clusters.",
      "title": "Url Prefix",
      "type": "string"
    },
    "index_pattern": {
      "$ref": "#/$defs/AllowDenyPattern",
      "default": {
        "allow": [
          ".*"
        ],
        "deny": [
          "^_.*",
          "^ilm-history.*"
        ],
        "ignoreCase": true
      },
      "description": "regex patterns for indexes to filter in ingestion."
    },
    "ingest_index_templates": {
      "default": false,
      "description": "Ingests ES index templates if enabled.",
      "title": "Ingest Index Templates",
      "type": "boolean"
    },
    "index_template_pattern": {
      "$ref": "#/$defs/AllowDenyPattern",
      "default": {
        "allow": [
          ".*"
        ],
        "deny": [
          "^_.*"
        ],
        "ignoreCase": true
      },
      "description": "The regex patterns for filtering index templates to ingest."
    },
    "profiling": {
      "$ref": "#/$defs/ElasticProfiling",
      "description": "Configs to ingest data profiles from ElasticSearch."
    },
    "collapse_urns": {
      "$ref": "#/$defs/CollapseUrns",
      "description": "List of regex patterns to remove from the name of the URN. All of the indices before removal of URNs are considered as the same dataset. These are applied in order for each URN.\n        The main case where you would want to have multiple of these if the name where you are trying to remove suffix from have different formats.\n        e.g. ending with -YYYY-MM-DD as well as ending -epochtime would require you to have 2 regex patterns to remove the suffixes across all URNs."
    }
  },
  "title": "ElasticsearchSourceConfig",
  "type": "object"
}
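
The api_key description above mentions a base64-encoded string of id and api_key combined by ':'. A minimal Python sketch of producing such a string follows. The function name encode_api_key and the credential values are placeholders chosen for illustration, not real keys or part of the connector.

```python
import base64


def encode_api_key(key_id: str, api_key: str) -> str:
    """Hypothetical helper: join id and api_key with ':' and base64-encode the result."""
    token = f"{key_id}:{api_key}".encode("utf-8")
    return base64.b64encode(token).decode("ascii")


# Placeholder credentials for illustration only.
encoded = encode_api_key("my-key-id", "my-api-key")
print(encoded)
```

The resulting string could then be supplied as the api_key value in the recipe, in place of username and password.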

Code Coordinates

Class Name: datahub.ingestion.source.elastic_search.ElasticsearchSource
Browse on GitHub

Questions

If you've got any questions on configuring ingestion for Elasticsearch, feel free to ping us on our Slack.
