Term aggregation on multi-word phrases for a word cloud


In the previous post, I set fielddata=true on a text field so that term aggregation works on it and the documents can be used for a word cloud.

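But what I actually want is this:

If I index United Kingdom, I want the term aggregation to return a single bucket, United Kingdom, with a count of 1.

In practice, the term aggregation returns two buckets: United with a count of 1 and Kingdom with a count of 1.

United Kingdom is split on the space, and each word is aggregated separately.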
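The behavior can be reproduced outside Elasticsearch with a few lines of Python. This is only a rough sketch: `whitespace_tokens` and `doc_counts` are hypothetical helper names, and splitting on whitespace is a stand-in for the real analyzer, not what Elasticsearch does internally.

```python
from collections import Counter


def whitespace_tokens(text):
    # Assumption: a plain whitespace split stands in for a word-level tokenizer.
    return text.split()


def doc_counts(docs, tokenize):
    """For each term, count how many documents contain it (like doc_count)."""
    counts = Counter()
    for text in docs:
        # A set makes each document count at most once per term.
        counts.update(set(tokenize(text)))
    return counts


counts = doc_counts(["United Kingdom"], whitespace_tokens)
for term, count in counts.items():
    print(term, count)
```

Because the phrase is never stored as one term, no amount of counting can bring `United Kingdom` back as a single bucket.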

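A term aggregation counts the terms that are actually stored in the ES index, so the text has to be analyzed into the right terms at index time. That made me rethink the analyzer in the index settings.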

ngram splits the text into character fragments such as K, Ki, i, in, n, ng, so it is out!

edge_ngram has the same problem.

So I went with shingle.
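To see why character n-grams do not fit here, the sketch below imitates them in plain Python. The function names are made up for illustration, and the gram sizes of 1 and 2 are an assumption chosen to match the fragments described above.

```python
def ngrams(text, min_gram=1, max_gram=2):
    """Character n-grams taken from every start position in the text."""
    out = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                out.append(text[start:start + size])
    return out


def edge_ngrams(text, min_gram=1, max_gram=2):
    """Character n-grams anchored at the beginning of the text only."""
    longest = min(max_gram, len(text))
    return [text[:size] for size in range(min_gram, longest + 1)]


kingdom_ngrams = ngrams("Kingdom")
kingdom_edges = edge_ngrams("Kingdom")
print(kingdom_ngrams)
print(kingdom_edges)
```

Both variants cut inside a word, so neither can ever produce a two-word term like `United Kingdom`.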

I defined the settings and mapping with a shingle analyzer as shown below, indexed a few test documents, and ran a term aggregation.

PUT /my-index-000004
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_shingle": {
          "tokenizer": "standard",
          "filter": [
            "shingle"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "analyzer": "standard_shingle",
        "fielddata": true
      }
    }
  }
}

POST my-index-000004/_doc
{ "message": "United Kingdom mom what" }

POST my-index-000004/_doc
{ "message": "United" }

POST my-index-000004/_doc
{ "message": "mom" }

POST my-index-000004/_doc
{ "message": "United Kingdom" }

GET my-index-000004/_search
{
  "size": 0,
  "aggs": {
    "NAME": {
      "terms": {
        "field": "message",
        "size": 10
      }
    }
  }
}

As I wanted, United Kingdom is now counted as a single term.
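The counts for the four test documents can be approximated by hand. The Python sketch below assumes the shingle filter emits every single word plus every pair of adjacent words, and that each bucket counts documents rather than occurrences. `shingles` is a hypothetical helper, and the whitespace split only stands in for the standard tokenizer.

```python
from collections import Counter


def shingles(text, size=2):
    """Return the single words plus every run of `size` adjacent words."""
    words = text.split()  # assumption: stand-in for the standard tokenizer
    out = list(words)
    for i in range(len(words) - size + 1):
        out.append(" ".join(words[i:i + size]))
    return out


docs = ["United Kingdom mom what", "United", "mom", "United Kingdom"]

doc_count = Counter()
for text in docs:
    # A set makes each document count at most once per term.
    doc_count.update(set(shingles(text)))

for term, count in doc_count.most_common():
    print(term, count)
```

Under these assumptions `United Kingdom` appears in two documents, next to side-effect pairs such as `Kingdom mom` and `mom what` that come from the first document.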

I also tried it without "fielddata": true. The option is required to get the result above; without it, Elasticsearch returns an error suggesting that the field needs "fielddata": true.

## To count which words appear most often in a document, I indexed part of the description of Russia (source: Wikipedia) and ran a term aggregation.

POST my-index-000004/_doc
{ "message": "Russia (Russian: Россия, tr. Rossiya, pronounced [rɐˈsʲijə]), or the Russian Federation,[b] is a transcontinental country spanning Eastern Europe and Northern Asia. It is the largest country in the world by area, covering over 17,125,191 square "

}

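I was able to get a result like the following.

* The long text was not indexed in full, possibly because of some of the special characters.

To index many long sentences, I still need to look into an additional mapping option related to maximum length.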

That's it.

I tried term aggregation by phrase length to see which words appear most often in a document.

Let's go, let's go, let's go!
