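How to search more than 10,000 documents in Elasticsearch

Hello, this is Fujii from SIOS Technology.
In this article, I go over what to watch out for when searching large amounts of data in Elasticsearch.
Elasticsearch is a distributed, RESTful search engine provided by Elastic. (official site)
By default, Elasticsearch limits the number of hits a search can return. In particular, retrieving more than 10,000 hits takes some extra work.
Below, I introduce three ways to retrieve more than 10,000 hits from Elasticsearch, using TypeScript.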

Libraries used
Elasticsearch 7.12.1
Elasticsearch Node.js client

First, import the required library.
The import code below is needed every time, so it is omitted from here on.

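import { RequestParams, Client } from '@elastic/elasticsearch';
// The node URL is an assumption for a local cluster; change it to match your environment.
export const elasticsearchClient = new Client({ node: 'http://localhost:9200' });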

1 Registering the data
2 A plain search
3 Setting index.max_result_window
4 Using search_after
5 Using scroll
6 Summary

Registering the data

Here, I registered 40,000 documents in an index named "user", each with a "name" and a "gender" field: 20,000 male and 20,000 female.

const createData = async (index: string, data: any, id?: string) => {
  const params: RequestParams.Create = {
    index: index,
    // Passing an id fixes the ID assigned to the document when it is indexed in Elasticsearch
    id: id,
    refresh: true,
    body: data,
  };
  const result = await elasticsearchClient.create(params);
  return result;
};
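const number = 20000;
// Register `number` male and `number` female users.
// The name and id values below are assumed placeholders.
for (let i = 0; i < number; i++) {
  await createData('user', { name: `male${i}`, gender: 'male' }, `male-${i}`);
  await createData('user', { name: `female${i}`, gender: 'female' }, `female-${i}`);
}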

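This article experiments with searches that filter on gender.

A plain search

This is a search with no special settings.
The search hits are in result.body.hits.hits. Half of the registered documents have gender: 'male', so 20,000 documents should match, but only 10 are returned.
This is because the size parameter defaults to 10, so Elasticsearch returns only the first 10 hits.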

const searchData = async (index: string, query: any) => {
  const params: RequestParams.Search = {
    index: index,
    body: {
      query: query,
    },
  };
  const result = await elasticsearchClient.search(params);
  return result;
};
const result = await searchData('user', {
  match: {
    gender: 'male',
  },
});
console.log(result.body);

Result

{
  took: 10,
  timed_out: false,
  _shards: { total: 1, successful: 1, skipped: 0, failed: 0 },
  hits: {
    total: { value: 10000, relation: 'gte' },
    max_score: 0.6931471,
    hits: [
      [Object], [Object],
      [Object], [Object],
      [Object], [Object],
      [Object], [Object],
      [Object], [Object]
    ]
  }
}

The size parameter changes the maximum number of hits returned.
Setting size to a value larger than 10 returns more than 10 hits.

const searchData = async (index: string, query: any) => {
  const params: RequestParams.Search = {
    index: index,
    body: {
      query: query,
    },
    size: 10000, // maximum number of hits to return
  };
  const result = await elasticsearchClient.search(params);
  return result;
};

Result

{
  took: 247,
  timed_out: false,
  _shards: { total: 1, successful: 1, skipped: 0, failed: 0 },
  hits: {
    total: { value: 10000, relation: 'gte' },
    max_score: 0.6931471,
    hits: [
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object],
      ... 9900 more items
    ]
  }
}

However, setting size to a value larger than 10000 results in an error.

const searchData = async (index: string, query: any) => {
  const params: RequestParams.Search = {
    index: index,
    body: {
      query: query,
    },
    size: 10001,
  };
  const result = await elasticsearchClient.search(params);
  return result;
};

Result

ResponseError: search_phase_execution_exception: [illegal_argument_exception] Reason: Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.
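
The error message states the rule directly: from + size must not exceed index.max_result_window. As a minimal sketch, the check can be reproduced on the client side before sending a request. The helper name fitsResultWindow and the constant are hypothetical and not part of the Elasticsearch client.

```typescript
// Default value of index.max_result_window when the index setting is not changed.
const DEFAULT_MAX_RESULT_WINDOW = 10000;

// Returns true when a from/size request fits inside the result window.
const fitsResultWindow = (
  from: number,
  size: number,
  maxResultWindow: number = DEFAULT_MAX_RESULT_WINDOW,
): boolean => from + size <= maxResultWindow;

console.log(fitsResultWindow(0, 10000)); // true
console.log(fitsResultWindow(0, 10001)); // false
console.log(fitsResultWindow(9995, 10)); // false, because from + size is 10005
```

Note that the limit applies to the sum, so a small size with a large from is rejected in the same way.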

Setting index.max_result_window

The index setting max_result_window holds the maximum value allowed for from + size. Its default value is 10000, which is why size cannot be set to a value larger than 10000.
elasticsearchClient.indices.getSettings retrieves an index's settings, and elasticsearchClient.indices.putSettings updates them.
Here, I set max_result_window to 50000.

const putIndexSettings = async (setting: any, index?: string) => {
  const param: RequestParams.IndicesPutSettings = {
    index: index,
    body: setting,
  };
  const result = await elasticsearchClient.indices.putSettings(param);
  return result;
};
const getIndexSettings = async (index?: string) => {
  const param: RequestParams.IndicesGetSettings = {
    index: index,
  };
  const result = await elasticsearchClient.indices.getSettings(param);
  return result;
};
const result1 = await getIndexSettings('user');
console.log(result1);
await putIndexSettings(
  {
    index: {
      max_result_window: 50000,
    },
  },
  'user',
);
const result2 = await getIndexSettings('user');
console.log(result2);

Result
result1

{
  index: {
    routing: { allocation: [Object] },
    number_of_shards: '1',
    provided_name: 'user',
    creation_date: '1678242539081',
    number_of_replicas: '1',
    uuid: 'a-Hef2ssTAu5FYacvD_uEQ',
    version: { created: '7120199' }
  }
}

result2

{
  index: {
    routing: { allocation: [Object] },
    number_of_shards: '1',
    provided_name: 'user',
    max_result_window: '50000',
    creation_date: '1678242539081',
    number_of_replicas: '1',
    uuid: 'a-Hef2ssTAu5FYacvD_uEQ',
    version: { created: '7120199' }
  }
}

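With size set to 50000 as shown below and the same search as before, all 20,000 documents are retrieved. (result.body.hits.total.value still shows 10000, but relation: 'gte' marks it as a lower bound rather than an exact count; result.body.hits.hits does contain all 20,000 documents.)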

const searchData = async (index: string, query: any) => {
  const params: RequestParams.Search = {
    index: index,
    body: {
      query: query,
    },
    size: 50000,
  };
  const result = await elasticsearchClient.search(params);
  return result;
};

Result

{
  took: 491,
  timed_out: false,
  _shards: { total: 1, successful: 1, skipped: 0, failed: 0 },
  hits: {
    total: { value: 10000, relation: 'gte' },
    max_score: 0.6931471,
    hits: [
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object],
      ... 19900 more items
    ]
  }
}
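
The hits.total.value of 10000 in the output above comes from the track_total_hits search option, which by default stops counting at 10,000 matches. As a small sketch, a search body that requests an exact count could be built like this. The helper name buildExactCountBody is hypothetical.

```typescript
// Build a search body that asks Elasticsearch for an exact hit count.
const buildExactCountBody = (query: Record<string, unknown>) => ({
  query,
  track_total_hits: true, // count every match instead of stopping at 10,000
});

const exactCountBody = buildExactCountBody({ match: { gender: 'male' } });
console.log(JSON.stringify(exactCountBody));
```

Passing this object as the body of a RequestParams.Search should make hits.total report the exact number of matches with relation: 'eq', at the cost of some extra counting work.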

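However, increasing max_result_window consumes more memory and slows searches down, so the search_after or scroll approach described below is reportedly more effective.
Reference

Using search_after

With search_after, the processing goes as follows:

Create a point in time
Repeat fetching 10,000 hits at a time until everything has been retrieved
Set search_after to the sort values of the last hit in the previous result
Delete the point in time

There are two things to keep in mind with this approach.
Specify a sort so that the order does not change between searches.
Create a point in time so that documents added or updated by other users during the search do not affect the results.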

const searchDataWithSearchAfter = async <T>(
  index: string,
  query: any,
): Promise<T[]> => {
  const size = 10000;
  const keep_alive = '1m';
  let search_after: unknown[] | undefined = undefined; // undefined for the first search
  const sources: T[] = [];
  // Create a point in time
  const pitResult = await elasticsearchClient.openPointInTime({
    index,
    keep_alive,
  });
  const pitId = pitResult.body?.id;
  while (true) {
    const params: RequestParams.Search = {
      body: {
        size,
        query,
        pit: {
          id: pitId,
          keep_alive,
        },
        sort: [{ _id: 'desc' }],
        search_after,
        track_total_hits: false, // skip counting total hits for performance
      },
    };
    const searchResult = await elasticsearchClient.search(params);
    const hits = searchResult.body?.hits?.hits;
    if (hits) {
      sources.push(...hits);
    }
    // Fewer hits than size means everything has been retrieved
    if (!hits || hits.length < size) {
      break;
    }
    // Continue from the sort values of the last hit
    search_after = hits[hits.length - 1].sort;
  }
  // Delete the point in time
  await elasticsearchClient.closePointInTime({ body: { id: pitId } });
  return sources;
};

Result

I logged the number of hits and confirmed that all 20,000 documents were retrieved.
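
To see how the search_after cursor advances without a running cluster, here is a small in-memory sketch. fakeSearch is a hypothetical stand-in for a single search call sorted by _id in descending order, and the sample ids are made up; only the loop structure mirrors the steps listed above.

```typescript
type SortedHit = { _id: string; sort: string[] };

// Stand-in for one search call: returns up to `size` hits in descending _id order,
// starting strictly after the given sort values.
const fakeSearch = (all: SortedHit[], size: number, searchAfter?: string[]): SortedHit[] => {
  const sorted = [...all].sort((a, b) => (a._id < b._id ? 1 : a._id > b._id ? -1 : 0));
  return sorted
    .filter((hit) => searchAfter === undefined || hit._id < searchAfter[0])
    .slice(0, size);
};

// Pages through every hit, feeding the last hit's sort values into the next call.
const collectAll = (all: SortedHit[], size: number): SortedHit[] => {
  const collected: SortedHit[] = [];
  let searchAfter: string[] | undefined = undefined;
  while (true) {
    const hits = fakeSearch(all, size, searchAfter);
    collected.push(...hits);
    if (hits.length < size) break; // a short page means nothing is left
    searchAfter = hits[hits.length - 1].sort;
  }
  return collected;
};

// 25 sample hits with ids id100 .. id124 (same length, so string order is stable).
const sampleHits: SortedHit[] = [];
for (let i = 0; i < 25; i++) {
  const id = 'id' + (100 + i);
  sampleHits.push({ _id: id, sort: [id] });
}
console.log(collectAll(sampleHits, 10).length); // 25
```

When the total is an exact multiple of the page size, the loop issues one extra request that comes back empty, which is also how the code above terminates in that case.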

Using scroll

According to the official documentation, this method is no longer recommended.

- We no longer recommend using the scroll API for deep pagination. If you need to preserve the index state while paging through more than 10,000 hits, use the search_after parameter with a point in time (PIT).

The client provides a helper, so I used it.

const scrollSearchData = async (index: string, query: any) => {
  const sources: any[] = [];
  const params: RequestParams.Search = {
    index: index,
    body: {
      query: query,
    },
  };
  const scrollSearch = elasticsearchClient.helpers.scrollSearch(params);
  for await (const result of scrollSearch) {
    sources.push(...result.body.hits.hits);
  }
  return sources;
};
const result = await scrollSearchData('user', {
  match: {
    gender: 'male',
  },
});
console.log(result.length);

Result

As with search_after, I logged the number of hits and confirmed that all 20,000 documents were retrieved.
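
The helper hides the scroll bookkeeping. As a rough sketch of the request pattern behind the scroll API (an initial search that opens a scroll context, repeated scroll requests until a batch comes back empty, then clearing the context), here is an in-memory stand-in. createFakeBackend and its methods are hypothetical and are not part of the Elasticsearch client.

```typescript
type ScrollPage = { scrollId: string; hits: number[] };
type ScrollBackend = {
  search: () => ScrollPage;
  scroll: (scrollId: string) => ScrollPage;
  clear: (scrollId: string) => void;
  openContexts: () => number;
};

// Hypothetical in-memory backend that serves `docs` in batches of `size`.
const createFakeBackend = (docs: number[], size: number): ScrollBackend => {
  const cursors: { [id: string]: number } = {};
  let nextId = 0;
  // Returns the next batch for an open scroll context and advances its cursor.
  const scroll = (scrollId: string): ScrollPage => {
    const start = cursors[scrollId];
    const hits = docs.slice(start, start + size);
    cursors[scrollId] = start + hits.length;
    return { scrollId, hits };
  };
  return {
    // Initial search: opens a scroll context and returns the first batch.
    search: () => {
      const scrollId = 'scroll-' + nextId++;
      cursors[scrollId] = 0;
      return scroll(scrollId);
    },
    scroll,
    // Releases the scroll context once the caller is done.
    clear: (scrollId) => {
      delete cursors[scrollId];
    },
    openContexts: () => Object.keys(cursors).length,
  };
};

// Collects every batch, then clears the scroll context.
const scrollAll = (backend: ScrollBackend): number[] => {
  const collected: number[] = [];
  let page = backend.search();
  while (page.hits.length > 0) {
    collected.push(...page.hits);
    page = backend.scroll(page.scrollId);
  }
  backend.clear(page.scrollId);
  return collected;
};

const scrollDocs: number[] = [];
for (let i = 0; i < 25; i++) scrollDocs.push(i);
const scrollBackend = createFakeBackend(scrollDocs, 10);
console.log(scrollAll(scrollBackend).length); // 25
```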