Google NLP API parsing

Using the API provided by Google to perform syntactic analysis.

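Syntactic analysis extracts linguistic information from text: it splits a document into sentences, then splits each sentence into smaller tokens for further analysis. The Google NLP API returns the part of speech of every token and the relationships between tokens.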

Analyzing syntax

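In GCP, create an API key and confirm that the NLP API status is enabled. For the detailed GCP setup steps, see the official documentation. (Or I may write them up later if I get the chance.)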

API Enabled

Because this is an introduction, I use Google Cloud Shell. In everyday use you can swap some of these steps for the language and IDE you are used to.

Add an environment variable

export API_KEY=<YOUR_KEY>

Once that is set, create text.json, the JSON file containing the text to send to the API

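{
  "document": {
    "type": "PLAIN_TEXT",
    "content": "Beirut rescuers search the site for possible survivor 30 days after the explosion."
  },
  "encodingType": "UTF8"
}

Notes on the fields (JSON has no comment syntax, so keep these out of the file itself):

  • type: required; one of TYPE_UNSPECIFIED, PLAIN_TEXT, HTML.
  • content: the input string to analyze.
  • gcsContentUri: a Google Cloud Storage URI of the form gs://bucket_name/object_name, used instead of content.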
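If you would rather generate the request body than type it by hand, a minimal Python sketch could look like the following. The function name build_request_body is my own for illustration and not part of any Google library; it only assembles the same fields shown in text.json, and serializing with the json module guarantees the file is valid JSON.

```python
import json


def build_request_body(text: str, encoding_type: str = "UTF8") -> dict:
    """Assemble the request body for a plain-text document.

    Hypothetical helper: it mirrors the fields of text.json and nothing more.
    """
    return {
        "document": {"type": "PLAIN_TEXT", "content": text},
        "encodingType": encoding_type,
    }


if __name__ == "__main__":
    body = build_request_body(
        "Beirut rescuers search the site for possible survivor "
        "30 days after the explosion."
    )
    # Write the body to text.json so it can be passed to curl with --data-binary.
    with open("text.json", "w", encoding="utf-8") as fh:
        json.dump(body, fh, ensure_ascii=False, indent=2)
```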

Reference for the standard JSON input: https://cloud.google.com/natural-language/docs

POST the data with curl

curl "https://language.googleapis.com/v1/documents:analyzeSyntax?key=${API_KEY}"   -s -X POST -H "Content-Type: application/json" --data-binary @text.json
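Outside Cloud Shell, the same request can be sent from Python using only the standard library. This is a sketch under the assumption that the endpoint and key parameter are exactly those in the curl command above; build_request and analyze_syntax are names I made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Same endpoint as the curl command above.
ENDPOINT = "https://language.googleapis.com/v1/documents:analyzeSyntax"


def build_request(api_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that curl would issue."""
    query = urllib.parse.urlencode({"key": api_key})
    return urllib.request.Request(
        url=f"{ENDPOINT}?{query}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze_syntax(api_key: str, body: dict) -> dict:
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(api_key, body)) as resp:
        return json.load(resp)
```

To use it, load text.json with json.load, read the key from os.environ["API_KEY"], and pass both to analyze_syntax. Keeping build_request separate makes it possible to inspect the URL and payload without touching the network.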

The response contains the parsed information

{
  "sentences": [
    {
      "text": {
        "content": "Beirut rescuers search the site for possible survivor 30 days after the explosion.",
        "beginOffset": 0
      }
    }
  ],
  "tokens": [
    {
      "text": {
        "content": "Beirut",
        "beginOffset": 0
      },
      "partOfSpeech": {
        "tag": "NOUN",
        "aspect": "ASPECT_UNKNOWN",
        "case": "CASE_UNKNOWN",
        "form": "FORM_UNKNOWN",
        "gender": "GENDER_UNKNOWN",
        "mood": "MOOD_UNKNOWN",
        "number": "SINGULAR",
        "person": "PERSON_UNKNOWN",
        "proper": "PROPER",
        "reciprocity": "RECIPROCITY_UNKNOWN",
        "tense": "TENSE_UNKNOWN",
        "voice": "VOICE_UNKNOWN"
      },
      "dependencyEdge": {
        "headTokenIndex": 1,
        "label": "NN"
      },
      "lemma": "Beirut"
    },
    {
      "text": {
        "content": "rescuers",
        "beginOffset": 7
      },
      "partOfSpeech": {
        "tag": "NOUN",
        "aspect": "ASPECT_UNKNOWN",
        "case": "CASE_UNKNOWN",
        "form": "FORM_UNKNOWN",
        "gender": "GENDER_UNKNOWN",
        "mood": "MOOD_UNKNOWN",
        "number": "PLURAL",
        "person": "PERSON_UNKNOWN",
        "proper": "PROPER_UNKNOWN",
        "reciprocity": "RECIPROCITY_UNKNOWN",
        "tense": "TENSE_UNKNOWN",
        "voice": "VOICE_UNKNOWN"
      },
      "dependencyEdge": {
        "headTokenIndex": 2,
        "label": "NSUBJ"
      },
      "lemma": "rescuer"
    },
    {
      "text": {
        "content": "search",
        "beginOffset": 16
      },
      "partOfSpeech": {
        "tag": "VERB",
        "aspect": "ASPECT_UNKNOWN",
        "case": "CASE_UNKNOWN",
        "form": "FORM_UNKNOWN",
        "gender": "GENDER_UNKNOWN",
        "mood": "INDICATIVE",
        "number": "NUMBER_UNKNOWN",
        "person": "PERSON_UNKNOWN",
        "proper": "PROPER_UNKNOWN",
        "reciprocity": "RECIPROCITY_UNKNOWN",
        "tense": "PRESENT",
        "voice": "VOICE_UNKNOWN"
      },
......
  ],
  "language": "en"
}

A few observations on the result above

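  • partOfSpeech: tag gives the part of speech; rescuers is a noun and search is a verb.
  • lemma: the canonical form of the word; for example run, runs and ran all map to run.
  • headTokenIndex: the index of the token that this token modifies. (Indexes start from zero.)
  • dependencyEdge: essentially the edges of a graph; it tells you how each word relates to the others, as in the figure below.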
dependency tree
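The fields described above can be read out of the response with a few lines of Python. The sketch below uses a sample trimmed from the output shown earlier; the third token has no dependencyEdge or lemma because that part of the output is elided in this post, so the code treats both as optional. The helper names summarize and children_by_head are hypothetical.

```python
from collections import defaultdict

# Trimmed copy of the response above. "search" is incomplete on purpose:
# the rest of that token is elided in the original output.
sample = {
    "tokens": [
        {"text": {"content": "Beirut", "beginOffset": 0},
         "partOfSpeech": {"tag": "NOUN"},
         "dependencyEdge": {"headTokenIndex": 1, "label": "NN"},
         "lemma": "Beirut"},
        {"text": {"content": "rescuers", "beginOffset": 7},
         "partOfSpeech": {"tag": "NOUN"},
         "dependencyEdge": {"headTokenIndex": 2, "label": "NSUBJ"},
         "lemma": "rescuer"},
        {"text": {"content": "search", "beginOffset": 16},
         "partOfSpeech": {"tag": "VERB"}},
    ]
}


def summarize(response: dict) -> list:
    """Return (word, tag, lemma, edge label, head word) for every token."""
    tokens = response.get("tokens", [])
    rows = []
    for tok in tokens:
        edge = tok.get("dependencyEdge")
        head = None
        if edge is not None and 0 <= edge["headTokenIndex"] < len(tokens):
            head = tokens[edge["headTokenIndex"]]["text"]["content"]
        rows.append((
            tok["text"]["content"],
            tok["partOfSpeech"]["tag"],
            tok.get("lemma"),
            edge["label"] if edge else None,
            head,
        ))
    return rows


def children_by_head(response: dict) -> dict:
    """Map each head token index to the indexes of the tokens modifying it."""
    tree = defaultdict(list)
    for i, tok in enumerate(response.get("tokens", [])):
        edge = tok.get("dependencyEdge")
        # Skip tokens without an edge, and tokens whose head is themselves.
        if edge is None or edge["headTokenIndex"] == i:
            continue
        tree[edge["headTokenIndex"]].append(i)
    return dict(tree)


for row in summarize(sample):
    print(row)
```

Indexing the tree by token position rather than by word avoids merging repeated words such as "the", which appears twice in the example sentence.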

Once the text has been parsed and tagged like this, it can be put to further use.

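For a more detailed introduction and the meaning of each parameter, see the official docs, which also include detailed notes for many languages.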
