Elasticsearch
Important Capabilities
Capability | Status | Notes |
---|---|---|
Detect Deleted Entities | ✅ | Enabled by default via stateful ingestion. |
Platform Instance | ✅ | Enabled by default. |
This plugin extracts the following:
- Metadata for indexes
- Column types associated with each index field
CLI based Ingestion
Starter Recipe
Check out the following recipe to get started with ingestion! See below for full configuration options.
For general pointers on writing and running a recipe, see our main recipe guide.
source:
  type: "elasticsearch"
  config:
    # Coordinates
    host: 'localhost:9200'
    # Credentials
    username: user # optional
    password: pass # optional
    # SSL support
    use_ssl: False
    verify_certs: False
    ca_certs: "./path/ca.cert"
    client_cert: "./path/client.cert"
    client_key: "./path/client.key"
    ssl_assert_hostname: False
    ssl_assert_fingerprint: "./path/cert.fingerprint"
    # Options
    url_prefix: "" # optional url_prefix
    env: "PROD"
    index_pattern:
      allow: [".*some_index_name_pattern*"]
      deny: [".*skip_index_name_pattern*"]
    ingest_index_templates: False
    index_template_pattern:
      allow: [".*some_index_template_name_pattern*"]
sink:
  # sink configs
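To get a feel for how the `index_pattern` allow and deny lists in the recipe select indices, here is a minimal Python sketch. It assumes that deny patterns are checked first, that patterns are matched from the start of the index name with `re.match`, and that matching is case-insensitive by default (mirroring `ignoreCase: True`). `is_allowed` is a hypothetical helper written for illustration, not part of the DataHub API, and the connector's actual matching logic may differ.

```python
import re
from typing import Sequence


def is_allowed(
    name: str,
    allow: Sequence[str] = (".*",),
    deny: Sequence[str] = (),
    ignore_case: bool = True,
) -> bool:
    """Hypothetical helper: return True if `name` passes the allow/deny filter."""
    flags = re.IGNORECASE if ignore_case else 0
    # Assumption: a deny match always wins over an allow match.
    if any(re.match(pattern, name, flags) for pattern in deny):
        return False
    # Assumption: patterns are anchored at the start of the name (re.match).
    return any(re.match(pattern, name, flags) for pattern in allow)


# Default deny list for index_pattern, as listed in the JSON schema on this page.
default_deny = ["^_.*", "^ilm-history.*"]

for index_name in ["orders-2024", "_internal", "ilm-history-5-000001"]:
    print(index_name, is_allowed(index_name, deny=default_deny))
```

Under these assumptions, a pattern such as `orders.*` would also match `Orders-2024` unless `ignoreCase` is set to false.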
Config Details
- Options
- Schema
Note that a . is used to denote nested fields in the YAML recipe.
Field | Description |
---|---|
api_key One of object, string, null | API Key authentication. Accepts either a list with id and api_key (UTF-8 representation), or a base64 encoded string of id and api_key combined by ':'. Default: None |
ca_certs One of string, null | Path to a certificate authority (CA) certificate. Default: None |
client_cert One of string, null | Path to the file containing the private key and the certificate, or cert only if using client_key. Default: None |
client_key One of string, null | Path to the file containing the private key if using separate cert and key files. Default: None |
host string | The Elasticsearch host URI. Default: localhost:9200 |
ingest_index_templates boolean | Ingests ES index templates if enabled. Default: False |
password One of string, null | The password credential. Default: None |
platform_instance One of string, null | The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details. Default: None |
ssl_assert_fingerprint One of string, null | Verify the supplied certificate fingerprint if not None. Default: None |
ssl_assert_hostname boolean | Use hostname verification if not False. Default: False |
url_prefix string | Some enterprises run multiple Elasticsearch clusters behind a single endpoint and use url_prefix to route requests to the different clusters. Default: "" (empty string) |
use_ssl boolean | Whether to use SSL for the connection or not. Default: False |
username One of string, null | The username credential. Default: None |
verify_certs boolean | Whether to verify SSL certificates. Default: False |
env string | The environment that all assets produced by this connector belong to. Default: PROD |
collapse_urns CollapseUrns | |
collapse_urns.urns_suffix_regex array | List of regex patterns to remove from the end of index names when building URNs. All indices that share a name once the suffix is removed are treated as the same dataset. The patterns are applied in order for each URN. Multiple patterns are mainly needed when the suffixes to remove come in different formats, e.g. names ending with -YYYY-MM-DD as well as names ending with -epochtime would require two regex patterns to remove the suffixes across all URNs. |
collapse_urns.urns_suffix_regex.string string | |
index_pattern AllowDenyPattern | Regex patterns for indexes to filter in ingestion. |
index_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
index_template_pattern AllowDenyPattern | The regex patterns for filtering index templates to ingest. |
index_template_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
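profiling ElasticProfiling | Configs to ingest data profiles from Elasticsearch. |
profiling.enabled boolean | Whether to enable profiling for the Elasticsearch source. Default: False |
profiling.operation_config OperationConfig | Experimental feature. To specify operation configs. |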
profiling.operation_config.lower_freq_profile_enabled boolean | Whether to run profiling at a lower frequency. This does not schedule anything; it only adds checks that determine when profiling should not run. Default: False |
profiling.operation_config.profile_date_of_month One of integer, null | Day of the month, from 1 to 31 (both inclusive). If not specified, defaults to None and this field has no effect. Default: None |
profiling.operation_config.profile_day_of_week One of integer, null | Day of the week, from 0 to 6 (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to None and this field has no effect. Default: None |
stateful_ingestion One of StatefulIngestionConfig, null | Stateful Ingestion Config Default: None |
stateful_ingestion.enabled boolean | Whether to enable stateful ingestion. If not set, it is enabled when a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, and disabled otherwise. Default: False |
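As an illustration of `collapse_urns.urns_suffix_regex` from the table above, the following Python sketch strips suffix patterns from index names in order, so that time-partitioned indices collapse to one dataset name. The helper `collapse_name`, the two example patterns, and the assumption that each pattern is anchored to the end of the name are all illustrative; the connector's real behaviour may differ in detail.

```python
import re
from typing import Iterable


def collapse_name(name: str, suffix_patterns: Iterable[str]) -> str:
    """Hypothetical helper: strip each suffix pattern, in order, from the end of `name`."""
    for pattern in suffix_patterns:
        # Assumption: each pattern is anchored to the end of the name.
        name = re.sub(f"(?:{pattern})$", "", name)
    return name


# Illustrative patterns: a -YYYY-MM-DD date suffix and a 10-digit epoch suffix.
patterns = [r"-\d{4}-\d{2}-\d{2}", r"-\d{10}"]

for index in ["logs-2024-01-31", "logs-2024-02-01", "logs-1706659200"]:
    print(index, "->", collapse_name(index, patterns))
```

In this sketch all three index names collapse to the same name, which is why two patterns are needed when both suffix formats are in use.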
The JSONSchema for this configuration is inlined below.
{
"$defs": {
"AllowDenyPattern": {
"additionalProperties": false,
"description": "A class to store allow deny regexes",
"properties": {
"allow": {
"default": [
".*"
],
"description": "List of regex patterns to include in ingestion",
"items": {
"type": "string"
},
"title": "Allow",
"type": "array"
},
"deny": {
"default": [],
"description": "List of regex patterns to exclude from ingestion.",
"items": {
"type": "string"
},
"title": "Deny",
"type": "array"
},
"ignoreCase": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Whether to ignore case sensitivity during pattern matching.",
"title": "Ignorecase"
}
},
"title": "AllowDenyPattern",
"type": "object"
},
"CollapseUrns": {
"additionalProperties": false,
"properties": {
"urns_suffix_regex": {
"description": "List of regex patterns to remove from the name of the URN. All of the indices before removal of URNs are considered as the same dataset. These are applied in order for each URN.\n The main case where you would want to have multiple of these if the name where you are trying to remove suffix from have different formats.\n e.g. ending with -YYYY-MM-DD as well as ending -epochtime would require you to have 2 regex patterns to remove the suffixes across all URNs.",
"items": {
"type": "string"
},
"title": "Urns Suffix Regex",
"type": "array"
}
},
"title": "CollapseUrns",
"type": "object"
},
"ElasticProfiling": {
"additionalProperties": false,
"properties": {
"enabled": {
"default": false,
"description": "Whether to enable profiling for the elastic search source.",
"title": "Enabled",
"type": "boolean"
},
"operation_config": {
"$ref": "#/$defs/OperationConfig",
"description": "Experimental feature. To specify operation configs."
}
},
"title": "ElasticProfiling",
"type": "object"
},
"OperationConfig": {
"additionalProperties": false,
"properties": {
"lower_freq_profile_enabled": {
"default": false,
"description": "Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling.",
"title": "Lower Freq Profile Enabled",
"type": "boolean"
},
"profile_day_of_week": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number between 0 to 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take effect.",
"title": "Profile Day Of Week"
},
"profile_date_of_month": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number between 1 to 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take effect.",
"title": "Profile Date Of Month"
}
},
"title": "OperationConfig",
"type": "object"
},
"StatefulIngestionConfig": {
"additionalProperties": false,
"description": "Basic Stateful Ingestion Specific Configuration for any source.",
"properties": {
"enabled": {
"default": false,
"description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
"title": "Enabled",
"type": "boolean"
}
},
"title": "StatefulIngestionConfig",
"type": "object"
}
},
"additionalProperties": false,
"properties": {
"env": {
"default": "PROD",
"description": "The environment that all assets produced by this connector belong to",
"title": "Env",
"type": "string"
},
"platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.",
"title": "Platform Instance"
},
"stateful_ingestion": {
"anyOf": [
{
"$ref": "#/$defs/StatefulIngestionConfig"
},
{
"type": "null"
}
],
"default": null,
"description": "Stateful Ingestion Config"
},
"host": {
"default": "localhost:9200",
"description": "The elastic search host URI.",
"title": "Host",
"type": "string"
},
"username": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "The username credential.",
"title": "Username"
},
"password": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "The password credential.",
"title": "Password"
},
"api_key": {
"anyOf": [
{},
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "API Key authentication. Accepts either a list with id and api_key (UTF-8 representation), or a base64 encoded string of id and api_key combined by ':'.",
"title": "Api Key"
},
"use_ssl": {
"default": false,
"description": "Whether to use SSL for the connection or not.",
"title": "Use Ssl",
"type": "boolean"
},
"verify_certs": {
"default": false,
"description": "Whether to verify SSL certificates.",
"title": "Verify Certs",
"type": "boolean"
},
"ca_certs": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path to a certificate authority (CA) certificate.",
"title": "Ca Certs"
},
"client_cert": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path to the file containing the private key and the certificate, or cert only if using client_key.",
"title": "Client Cert"
},
"client_key": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Path to the file containing the private key if using separate cert and key files.",
"title": "Client Key"
},
"ssl_assert_hostname": {
"default": false,
"description": "Use hostname verification if not False.",
"title": "Ssl Assert Hostname",
"type": "boolean"
},
"ssl_assert_fingerprint": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Verify the supplied certificate fingerprint if not None.",
"title": "Ssl Assert Fingerprint"
},
"url_prefix": {
"default": "",
"description": "There are cases where an enterprise would have multiple elastic search clusters. One way for them to manage is to have a single endpoint for all the elastic search clusters and use url_prefix for routing requests to different clusters.",
"title": "Url Prefix",
"type": "string"
},
"index_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [
"^_.*",
"^ilm-history.*"
],
"ignoreCase": true
},
"description": "regex patterns for indexes to filter in ingestion."
},
"ingest_index_templates": {
"default": false,
"description": "Ingests ES index templates if enabled.",
"title": "Ingest Index Templates",
"type": "boolean"
},
"index_template_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [
"^_.*"
],
"ignoreCase": true
},
"description": "The regex patterns for filtering index templates to ingest."
},
"profiling": {
"$ref": "#/$defs/ElasticProfiling",
"description": "Configs to ingest data profiles from ElasticSearch."
},
"collapse_urns": {
"$ref": "#/$defs/CollapseUrns",
"description": "List of regex patterns to remove from the name of the URN. All of the indices before removal of URNs are considered as the same dataset. These are applied in order for each URN.\n The main case where you would want to have multiple of these if the name where you are trying to remove suffix from have different formats.\n e.g. ending with -YYYY-MM-DD as well as ending -epochtime would require you to have 2 regex patterns to remove the suffixes across all URNs."
}
},
"title": "ElasticsearchSourceConfig",
"type": "object"
}
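The `api_key` option accepts, among other forms, a base64 encoded string of the id and api_key joined by ':'. A minimal Python sketch of producing such a string is shown here; `encode_api_key` is a hypothetical helper and the credentials are placeholders, so treat it as an illustration rather than the connector's own code.

```python
import base64


def encode_api_key(key_id: str, api_key: str) -> str:
    """Hypothetical helper: base64-encode "<id>:<api_key>" from its UTF-8 bytes."""
    raw = f"{key_id}:{api_key}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


# Placeholder credentials for illustration only.
encoded = encode_api_key("my-key-id", "my-api-key-secret")
print(encoded)
```

The resulting string could then be supplied as the `api_key` value in the recipe, ideally through an environment variable rather than in plain text.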
Code Coordinates
- Class Name: datahub.ingestion.source.elastic_search.ElasticsearchSource
- Browse on GitHub
Questions
If you've got any questions on configuring ingestion for Elasticsearch, feel free to ping us on our Slack.