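Elasticsearch: ingest pipelines - tips and tricks

In this article I walk through a few examples that show some tricks for working with Elasticsearch ingest pipelines. The tricks are simple, but they are useful in many real-world scenarios. For more on ingest pipelines, see the "Ingest pipeline" chapter of our earlier article "Elastic：开发者上手指南" (Elastic: a developer's getting-started guide).

Adding a last_update_time to documents

Ingest pipelines help with data processing in two ways:

Pre-processing documents before they are finally ingested into the Elasticsearch data nodes (also known as data processing)
Repairing existing documents to meet changed business requirements (also known as data patching)

Whether in the pre-processing stage or the data-repair stage, there is usually a shared goal: add a last_update_time timestamp field that records when the change was applied.

There are a couple of ways to add the current timestamp.

Reusing the ingest timestamp during the pre-processing stage

The following command simulates an ingest pipeline:

POST _ingest/pipeline/_simulate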
{
  "pipeline": {
    "description": "pre-processing: add back last_update_time by using the ingest object",
    "processors": [
      {
        "set": {
          "field": "last_update_time",
          "value": "{{_ingest.timestamp}}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "action": "an action",
        "user": "balabala",
        "procurement_id": "12_yuy190"
      }
    }
  ]
}

The result of the above command is:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "action": "an action",
          "last_update_time": "2023-03-09T00:02:35.886218305Z",
          "procurement_id": "12_yuy190",
          "user": "balabala"
        },
        "_ingest": {
          "timestamp": "2023-03-09T00:02:35.886218305Z"
        }
      }
    }
  ]
}

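Above, we record when the document was processed by setting last_update_time to the time at which the ingest pipeline ran.

Adding a script processor that creates the current timestamp in a script

POST _ingest/pipeline/_simulate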
{
  "pipeline": {
    "description": "any stage: add back last_update_time by script",
    "processors": [
      {
        "script": {
          "lang": "painless",
          "source": """
              def ts = new Date();
              ctx.last_update_time = ts;
          """
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "action": "an action",
        "user": "balabala",
        "procurement_id": "12_yuy190"
      }
    }
  ]
}

The simulation result is:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "action": "an action",
          "last_update_time": "2023-03-09T00:06:00.095Z",
          "procurement_id": "12_yuy190",
          "user": "balabala"
        },
        "_ingest": {
          "timestamp": "2023-03-09T00:06:00.095106511Z"
        }
      }
    }
  ]
}

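The script reads the current time on the node and stores it in last_update_time. Interestingly, last_update_time now differs slightly from _ingest.timestamp. The script value has millisecond precision, while _ingest.timestamp keeps nanoseconds. The two are also captured at different moments: _ingest.timestamp is recorded when the document enters the pipeline, and the script processor runs after that, so the script value is, if anything, slightly later.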

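Pipeline design approach

We have now tested the ways of adding the current timestamp. But many data-processing workflows need to add this timestamp, so does that mean you have to copy and paste the code/processors above every time you create a new pipeline?

Starting with Elasticsearch 6.5.x there is a new "pipeline" processor that lets the current pipeline call another pipeline. This is really cool: pipelines suddenly behave like functions or APIs that you can call whenever needed, so the code can be reused.

With this feature you may want to refactor your existing pipeline code: extract the common business logic into a function pipeline that the main-stream pipelines can call. In our scenario, adding last_update_time is such a function pipeline.

Create the new function pipeline as follows:

PUT _ingest/pipeline/func_add_last_update_time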
{
  "version": 1,
  "description": "any stage: add back last update timestamp using script approach",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": """
        ctx.last_update_time = new Date();
        """
      }
    }
  ]
}

Great, func_add_last_update_time is in place. Now suppose we have another pipeline for some business logic, and we want it to add a last_update_time whenever a change is applied to a document.

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{user_id}:%{action}:%{action_id}:%{action_ts}:%{description}"
        }
      },
      {
        "date": {
          "field": "action_ts",
          "formats": ["yyyy/MM/dd HH-mm-ss"],
          "target_field": "action_ts"
        }
      },
      {
        "remove": {
          "field": [
            "message"
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "jaime_10234:filling_procurement_form:12_yuy190:2019/04/18 13-12-09:procurement for abc company on spare parts"
      }
    }
  ]
}

The result of running the above is:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "action": "filling_procurement_form",
          "action_ts": "2019-04-18T13:12:09.000Z",
          "description": "procurement for abc company on spare parts",
          "user_id": "jaime_10234",
          "action_id": "12_yuy190"
        },
        "_ingest": {
          "timestamp": "2023-03-09T00:11:30.535138719Z"
        }
      }
    }
  ]
}

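The main-stream pipeline above is very simple: it parses the given message into individual fields, converts the timestamp field, and removes the original message once it has been used. To add a last_update_time, just add a pipeline processor later in the workflow:

POST _ingest/pipeline/_simulate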
{
  "pipeline": {
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{user_id}:%{action}:%{action_id}:%{action_ts}:%{description}"
        }
      },
      {
        "date": {
          "field": "action_ts",
          "formats": ["yyyy/MM/dd HH-mm-ss"],
          "target_field": "action_ts"
        }
      },
      {
        "pipeline": {
          "name": "func_add_last_update_time"
        }
      },
      {
        "remove": {
          "field": [
            "message"
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "jaime_10234:filling_procurement_form:12_yuy190:2019/04/18 13-12-09:procurement for abc company on spare parts"
      }
    }
  ]
}

The result of the above command is:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "last_update_time": "2023-03-09T00:21:37.511Z",
          "user_id": "jaime_10234",
          "action_id": "12_yuy190",
          "action": "filling_procurement_form",
          "action_ts": "2019-04-18T13:12:09.000Z",
          "description": "procurement for abc company on spare parts"
        },
        "_ingest": {
          "timestamp": "2023-03-09T00:21:37.51032725Z"
        }
      }
    }
  ]
}

Now we get last_update_time with no trouble at all. See how easy it is to reuse code!
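
To make the function analogy concrete, here is a minimal Python sketch that models pipelines as lists of processor functions, with a pipeline processor that simply runs another registered pipeline. It is not Elasticsearch code: the names PIPELINES, run, pipeline_processor and the individual processor functions are hypothetical, and the dissect step is approximated with a plain string split.

```python
from datetime import datetime, timezone

# Hypothetical in-memory registry: pipeline name -> list of processor functions.
PIPELINES = {}

def run(name, doc):
    """Apply every processor of the named pipeline to doc, in order."""
    for processor in PIPELINES[name]:
        processor(doc)
    return doc

def pipeline_processor(name):
    """Model of the pipeline processor: delegate to another registered pipeline."""
    return lambda doc: run(name, doc)

def add_last_update_time(doc):
    doc["last_update_time"] = datetime.now(timezone.utc).isoformat()

def dissect_message(doc):
    # Rough stand-in for the dissect pattern: five colon-separated fields.
    keys = ["user_id", "action", "action_id", "action_ts", "description"]
    doc.update(zip(keys, doc["message"].split(":", 4)))

def remove_message(doc):
    doc.pop("message", None)

# The reusable "function" pipeline and a main pipeline that calls it.
PIPELINES["func_add_last_update_time"] = [add_last_update_time]
PIPELINES["main"] = [
    dissect_message,
    pipeline_processor("func_add_last_update_time"),
    remove_message,
]

result = run("main", {
    "message": "jaime_10234:filling_procurement_form:12_yuy190:"
               "2019/04/18 13-12-09:procurement for abc company on spare parts"
})
```

Any other main pipeline can reuse the timestamp logic by adding one pipeline_processor("func_add_last_update_time") entry, which mirrors what the pipeline processor does in the simulate request above.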

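Exception handling when calling another pipeline

Now that we can call another pipeline like a function, a new question surfaces: how to handle exceptions raised in the called pipeline.

Let's create a new function pipeline and another main-stream pipeline to illustrate the scenario:

PUT _ingest/pipeline/func_convert_age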
{
  "processors": [
    {
      "convert": {
        "field": "age",
        "type": "integer"
      }
    }
  ]
}

Simulate the following ingest pipeline:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{name}:%{age}"
        }
      },
      {
        "pipeline": {
          "name": "func_convert_age"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "helen wong:45"
      }
    },
    {
      "_source": {
        "message": "josh blake:a46"
      }
    }
  ]
}

Result:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "message": "helen wong:45",
          "name": "helen wong",
          "age": 45
        },
        "_ingest": {
          "timestamp": "2023-03-09T00:25:48.38476263Z"
        }
      }
    },
    {
      "error": {
        "root_cause": [
          {
            "type": "illegal_argument_exception",
            "reason": "unable to convert [a46] to integer"
          }
        ],
        "type": "illegal_argument_exception",
        "reason": "unable to convert [a46] to integer",
        "caused_by": {
          "type": "number_format_exception",
          "reason": "for input string: \"a46\""
        }
      }
    }
  ]
}

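The first document in the simulation works as expected, but the second does not, for a simple reason: "unable to convert [a46] to integer". To catch this exception and act on it, just add an on_failure clause:

POST _ingest/pipeline/_simulate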
{
  "pipeline": {
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{name}:%{age}"
        }
      },
      {
        "pipeline": {
          "name": "func_convert_age",
          "on_failure": [
            {
              "set": {
                "field": "error",
                "value": "{{ _ingest.on_failure_processor_type }} - {{ _ingest.on_failure_message }}"
              }
            },
            {
              "remove": {
                "field": [ "name", "age" ]
              }
            }
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "helen wong:45"
      }
    },
    {
      "_source": {
        "message": "josh blake:a46"
      }
    }
  ]
}

The above command returns:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "message": "helen wong:45",
          "name": "helen wong",
          "age": 45
        },
        "_ingest": {
          "timestamp": "2023-03-09T00:27:50.726106798Z"
        }
      }
    },
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "message": "josh blake:a46",
          "error": "convert - for input string: \\\"a46\\\""
        },
        "_ingest": {
          "timestamp": "2023-03-09T00:27:50.726149381Z"
        }
      }
    }
  ]
}

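The problematic document now keeps its message field untouched and gains a new error field describing the exception. Later, if you want to query only the documents that hit an exception, an exists query does the job.

// replace {{target_index}} with the resulting index
GET {{target_index}}/_search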
{
  "query": {
    "exists": {
      "field": "error"
    }
  }
}
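
The on_failure behaviour shown in the simulation above can be thought of as a try/except around the called pipeline. The following Python sketch models that idea under simplifying assumptions: convert_age stands in for func_convert_age, the handler names are hypothetical, and the error text comes from Python rather than from Elasticsearch.

```python
def convert_age(doc):
    # Stand-in for the convert processor inside func_convert_age.
    doc["age"] = int(doc["age"])

def call_with_on_failure(processor, on_failure, doc):
    """Run processor; if it raises, run the on_failure handlers instead of failing the document."""
    try:
        processor(doc)
    except Exception as exc:
        for handler in on_failure:
            handler(doc, exc)
    return doc

def set_error(doc, exc):
    # Models the set processor that records the failure message.
    doc["error"] = f"convert - {exc}"

def remove_parsed_fields(doc, exc):
    # Models the remove processor that drops the half-parsed fields.
    for field in ("name", "age"):
        doc.pop(field, None)

def process(message):
    # Rough stand-in for the dissect pattern "%{name}:%{age}".
    name, age = message.split(":", 1)
    doc = {"message": message, "name": name, "age": age}
    return call_with_on_failure(convert_age, [set_error, remove_parsed_fields], doc)

good = process("helen wong:45")
bad = process("josh blake:a46")
```

As in the simulate response, the good document ends up with an integer age, while the bad one keeps only message plus an error field, which is what makes the exists query a convenient filter.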

Applying a pipeline to existing documents (data-repair stage)

To apply a pipeline you have created to existing documents, simply use _update_by_query:

// replace {{pipeline_name}} with any valid pipeline
POST blog_pipeline_tips1/_update_by_query?pipeline={{pipeline_name}}

The important part is the added pipeline parameter, which must name the exact id of the pipeline to run.
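
If you drive this from a script, it can help to build the request in one place. The sketch below is a hypothetical helper (update_by_query_request is not part of any client library) that only assembles the method, path and optional body; sending it with an HTTP client of your choice is left out. The query that restricts the update to documents without last_update_time is an illustrative assumption, not something the original workflow requires.

```python
from urllib.parse import quote, urlencode

def update_by_query_request(index, pipeline, query=None):
    """Return (method, path, body) for an _update_by_query call that runs a pipeline."""
    path = f"/{quote(index, safe='')}/_update_by_query?{urlencode({'pipeline': pipeline})}"
    body = {"query": query} if query is not None else None
    return "POST", path, body

method, path, body = update_by_query_request(
    "blog_pipeline_tips1",
    "func_add_last_update_time",
    # Assumption for illustration: only touch documents that lack the timestamp.
    query={"bool": {"must_not": {"exists": {"field": "last_update_time"}}}},
)
```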

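Setting a default pipeline for an index

Starting with Elasticsearch 6.5.x, a new index setting named index.default_pipeline was introduced. It simply means that every ingested document is pre-processed by the default pipeline, unless the indexing request names its own pipeline; the add-last_update_time use case, for example, should run on every new incoming document of the index. The syntax is straightforward:

PUT app_log_1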
{
  "settings": {
    "default_pipeline": "func_add_last_update_time"
  }
}
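
The request above creates a new index. For an index that already exists, the same setting can be changed through the update index settings API (PUT <index>/_settings). The Python sketch below only builds the two request shapes; default_pipeline_request is a hypothetical helper name, and the pipeline name is assumed to be the func_add_last_update_time pipeline created earlier.

```python
def default_pipeline_request(index, pipeline, index_exists):
    """Return (method, path, body) that sets index.default_pipeline."""
    if index_exists:
        # Existing index: update the setting in place.
        return "PUT", f"/{index}/_settings", {"index": {"default_pipeline": pipeline}}
    # New index: pass the setting at creation time.
    return "PUT", f"/{index}", {"settings": {"index": {"default_pipeline": pipeline}}}

create_request = default_pipeline_request("app_log_1", "func_add_last_update_time", index_exists=False)
update_request = default_pipeline_request("app_log_1", "func_add_last_update_time", index_exists=True)
```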

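Conditional switching logic

Setting a value based on a source field

As an example, suppose we have a field named categoryvalue. If its value equals plant, the categorycode field is set to a. Here is the logic matrix:

categoryvalue = "plant", categorycode = "a"
categoryvalue = "animal", categorycode = "b"

The corresponding pipeline can be written like this:

POST _ingest/pipeline/_simulate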
{
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "categorycode",
          "value": "a",
          "if": "ctx.categoryvalue == 'plant'"
        }
      },
      {
        "set": {
          "field": "categorycode",
          "value": "b",
          "if": "ctx.categoryvalue == 'animal'"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "categoryvalue": "plant"
      }
    },
    {
      "_source": {
        "categoryvalue": "animal"
      }
    },
    {
      "_source": {
        "categoryvalue": "unknown"
      }
    }
  ]
}

The response is:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "categoryvalue": "plant",
          "categorycode": "a"
        },
        "_ingest": {
          "timestamp": "2023-03-09T00:45:25.659339217Z"
        }
      }
    },
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "categoryvalue": "animal",
          "categorycode": "b"
        },
        "_ingest": {
          "timestamp": "2023-03-09T00:45:25.659388883Z"
        }
      }
    },
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "categoryvalue": "unknown"
        },
        "_ingest": {
          "timestamp": "2023-03-09T00:45:25.659396717Z"
        }
      }
    }
  ]
}

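The switching logic relies on the if clause of the set processor; the ctx variable is the document context, which gives access to the fields of the current document.

Simple, isn't it?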
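
As a way to reason about the behaviour, the two conditional set processors can be modeled as a list of (condition, field, value) rules that are tried in order. This Python sketch is only a model of that logic; RULES and apply_conditional_sets are hypothetical names.

```python
# Each rule mirrors one set processor: an "if" condition, a target field, a value.
RULES = [
    (lambda ctx: ctx.get("categoryvalue") == "plant", "categorycode", "a"),
    (lambda ctx: ctx.get("categoryvalue") == "animal", "categorycode", "b"),
]

def apply_conditional_sets(ctx):
    """Run every rule in order; a rule only writes its field when its condition holds."""
    for condition, field, value in RULES:
        if condition(ctx):
            ctx[field] = value
    return ctx

docs = [apply_conditional_sets({"categoryvalue": v}) for v in ("plant", "animal", "unknown")]
```

A document whose categoryvalue matches no rule passes through unchanged, which is why the third document in the response has no categorycode.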

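Passing parameters to a pipeline - 1

If you read the official documentation of the pipeline processor, you will not find a single sentence on how to pass parameters to a pipeline. In practice there is a workaround (although it is a bit ugly).

PUT _ingest/pipeline/pipmultiply2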
{
  "processors": [
    {
      "script": {
        "source": """ 
          if (ctx.paramvalue != null) {
              ctx.finalvalue = ctx.paramvalue * 2;
  
              // remove the original "paramvalue" field
              ctx.remove("paramvalue");
          }
        """
      }
    }
  ]
}

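We first create a pipeline named pipmultiply2, which simply multiplies the value supplied in the paramvalue field by 2. Note that the field-existence check is done with:

if (ctx.paramvalue != null) …

The product is stored in the finalvalue field. After that we also remove the parameter field:

ctx.remove("paramvalue")

Now it is time to simulate the pipeline and supply a test parameter:

POST _ingest/pipeline/_simulate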
{
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "paramvalue",
          "value": 100
        }
      },
      {
        "pipeline": {
          "name": "pipmultiply2"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {}
    }
  ]
}

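The resulting value will be exactly 200. As mentioned, this approach works but is a bit ugly, because we have to set a field on the target document (paramvalue here) before running the multiply-by-2 pipeline, and remove paramvalue afterwards (if necessary).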
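
One way to see why this is awkward is to model it in Python: the "argument" travels inside the document itself. The sketch below is a model only, with hypothetical function names. It also shows a side effect worth keeping in mind: if the document already has a field called paramvalue, the caller overwrites it and the callee then deletes it.

```python
def pip_multiply2(ctx):
    """Model of the pipmultiply2 script: read the argument from the document, then clean up."""
    if ctx.get("paramvalue") is not None:
        ctx["finalvalue"] = ctx["paramvalue"] * 2
        del ctx["paramvalue"]
    return ctx

def call_with_param(ctx, value):
    """Model of the caller: a set processor followed by the pipeline processor."""
    ctx["paramvalue"] = value  # the "argument" is written into the document
    return pip_multiply2(ctx)

plain = call_with_param({}, 100)
clobbered = call_with_param({"paramvalue": 7, "other": 1}, 100)
```

Choosing a parameter field name that cannot collide with real document fields avoids the second case.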

Passing parameters to a pipeline - 2

We now know the "ugly" way of passing a parameter to a pipeline. If you don't like it, there is another workaround:

PUT _scripts/test_parameter_pip
{
  "script": {
    "lang": "painless",
    "source": """ 
      if (params['paramvalue'] != null) {
        ctx.finalvalue = params['paramvalue'] * 2;  
      }
    """
  }
}

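We created a stored script in the Elasticsearch cluster. Its logic is simple: multiply the parameter's value by 2. Note that the existence check is applied with:

if (params['paramvalue'] != null) …

Now test the script in a pipeline:

POST _ingest/pipeline/_simulate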
{
  "pipeline": {
    "processors": [
      {
        "script": {
          "id": "test_parameter_pip",
          "params": {
            "paramvalue": 12
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "hello"
      }
    }
  ]
}

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "message": "hello",
          "finalvalue": 24
        },
        "_ingest": {
          "timestamp": "2023-03-09T00:55:19.670380755Z"
        }
      }
    }
  ]
}

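The final result contains a finalvalue field with the value 24. Technically this approach does not use the pipeline processor, but the way it works is appealing and it still preserves reusability (although the reusable part is abstracted through a stored script rather than the pipeline processor... I know this sounds confusing :))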
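
For comparison with the first approach, here is the same kind of Python model for the stored-script variant. The function name stored_script is hypothetical; the point is that the argument arrives in a separate params mapping, so the document never carries a temporary field.

```python
def stored_script(ctx, params):
    """Model of the stored script: the argument comes from params, not from the document."""
    if params.get("paramvalue") is not None:
        ctx["finalvalue"] = params["paramvalue"] * 2
    return ctx

with_param = stored_script({"message": "hello"}, {"paramvalue": 12})
without_param = stored_script({"message": "hello"}, {})
```

Each script processor that references the stored script can supply its own params, which is what keeps the logic reusable.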

August 6, 2024 • ar
