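Using the API provided by Google to do syntactic analysis.
Syntactic analysis extracts linguistic information from text: it splits a document into sentences, splits each sentence into smaller tokens, and then analyzes those further. The Google NLP API returns the part of speech of each token and the relationships between tokens.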
Analyzing syntax
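In GCP, create an API key and confirm that the NLP API status is enabled; for the detailed GCP setup steps, see the official documentation. (Or I may write them up later if I get the chance.)
Because this is an introduction, I use Google Cloud Shell; in day-to-day use you can swap some of these steps for your preferred language and IDE.
Add an environment variable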
export API_KEY=<YOUR_KEY>
After confirming the variable is set, create text.json, the JSON file holding the text to send to the API
{
"document":{
"type":"PLAIN_TEXT",
"content": "Beirut rescuers search the site for possible survivor 30 days after the explosion."
},
"encodingType": "UTF8"
}
JSON does not allow comments, so the notes on each field go here instead of inside the file:
- type: required; one of TYPE_UNSPECIFIED, PLAIN_TEXT, HTML.
- content: the input string.
- gcsContentUri: a Google Cloud Storage URI (gs://bucket_name/object_name), given in place of content.
For the standard JSON input fields, see: https://cloud.google.com/natural-language/docs
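If you would rather generate text.json than type it by hand, the same request body can be built in Python, which guarantees the file is valid JSON. This is only a minimal sketch: the helper names build_payload and write_payload are made up for illustration, and the fields are the ones shown above.

```python
import json


def build_payload(text: str) -> dict:
    """Build an analyzeSyntax request body for inline plain text."""
    return {
        "document": {
            "type": "PLAIN_TEXT",  # HTML is the other document type listed above
            "content": text,       # inline text; gcsContentUri would replace this field
        },
        "encodingType": "UTF8",
    }


def write_payload(text: str, path: str = "text.json") -> None:
    """Write the request body to a file that curl can send with --data-binary."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(build_payload(text), f, ensure_ascii=False, indent=2)
```

Calling write_payload("Beirut rescuers search the site for possible survivor 30 days after the explosion.") would produce a file equivalent to the one above.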
Use curl to POST the data
curl "https://language.googleapis.com/v1/documents:analyzeSyntax?key=${API_KEY}" -s -X POST -H "Content-Type: application/json" --data-binary @text.json
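Outside Cloud Shell, the same call can be made from Python's standard library. The sketch below mirrors the curl command (same URL, same Content-Type header, POST body as JSON); the function names build_request and analyze_syntax are hypothetical helpers, not part of any Google client library, and it assumes the key is still in the API_KEY environment variable.

```python
import json
import os
import urllib.parse
import urllib.request

ENDPOINT = "https://language.googleapis.com/v1/documents:analyzeSyntax"


def build_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Prepare the POST request without sending it."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"key": api_key})
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze_syntax(api_key: str, payload: dict) -> dict:
    """Send the request and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(build_request(api_key, payload)) as resp:
        return json.load(resp)


def analyze_file(path: str = "text.json") -> dict:
    """Read the key from API_KEY and the body from text.json, as in the curl call."""
    with open(path, encoding="utf-8") as f:
        payload = json.load(f)
    return analyze_syntax(os.environ["API_KEY"], payload)
```

Keeping build_request separate from the network call makes it possible to inspect the URL and body before anything is sent.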
This returns the parsed information
{
"sentences": [
{
"text": {
"content": "Beirut rescuers search the site for possible survivor 30 days after the explosion.",
"beginOffset": 0
}
}
],
"tokens": [
{
"text": {
"content": "Beirut",
"beginOffset": 0
},
"partOfSpeech": {
"tag": "NOUN",
"aspect": "ASPECT_UNKNOWN",
"case": "CASE_UNKNOWN",
"form": "FORM_UNKNOWN",
"gender": "GENDER_UNKNOWN",
"mood": "MOOD_UNKNOWN",
"number": "SINGULAR",
"person": "PERSON_UNKNOWN",
"proper": "PROPER",
"reciprocity": "RECIPROCITY_UNKNOWN",
"tense": "TENSE_UNKNOWN",
"voice": "VOICE_UNKNOWN"
},
"dependencyEdge": {
"headTokenIndex": 1,
"label": "NN"
},
"lemma": "Beirut"
},
{
"text": {
"content": "rescuers",
"beginOffset": 7
},
"partOfSpeech": {
"tag": "NOUN",
"aspect": "ASPECT_UNKNOWN",
"case": "CASE_UNKNOWN",
"form": "FORM_UNKNOWN",
"gender": "GENDER_UNKNOWN",
"mood": "MOOD_UNKNOWN",
"number": "PLURAL",
"person": "PERSON_UNKNOWN",
"proper": "PROPER_UNKNOWN",
"reciprocity": "RECIPROCITY_UNKNOWN",
"tense": "TENSE_UNKNOWN",
"voice": "VOICE_UNKNOWN"
},
"dependencyEdge": {
"headTokenIndex": 2,
"label": "NSUBJ"
},
"lemma": "rescuer"
},
{
"text": {
"content": "search",
"beginOffset": 16
},
"partOfSpeech": {
"tag": "VERB",
"aspect": "ASPECT_UNKNOWN",
"case": "CASE_UNKNOWN",
"form": "FORM_UNKNOWN",
"gender": "GENDER_UNKNOWN",
"mood": "INDICATIVE",
"number": "NUMBER_UNKNOWN",
"person": "PERSON_UNKNOWN",
"proper": "PROPER_UNKNOWN",
"reciprocity": "RECIPROCITY_UNKNOWN",
"tense": "PRESENT",
"voice": "VOICE_UNKNOWN"
},
......
],
"language": "en"
}
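Most fields in partOfSpeech come back as *_UNKNOWN placeholders, so the response is easier to read once those are filtered out. The following is a rough sketch of one way to do that; summarize_tokens and informative_features are illustrative names, and SAMPLE is an abbreviated copy of the first two tokens from the response above.

```python
def informative_features(part_of_speech: dict) -> dict:
    """Keep only morphology fields that carry information (drop tag and *_UNKNOWN)."""
    return {
        name: value
        for name, value in part_of_speech.items()
        if name != "tag" and not value.endswith("_UNKNOWN")
    }


def summarize_tokens(response: dict) -> list:
    """Return (text, tag, lemma, features) for every token in the response."""
    rows = []
    for token in response.get("tokens", []):
        pos = token["partOfSpeech"]
        rows.append((
            token["text"]["content"],
            pos["tag"],
            token.get("lemma"),
            informative_features(pos),
        ))
    return rows


# Abbreviated from the response above: only some partOfSpeech fields are kept.
SAMPLE = {
    "tokens": [
        {
            "text": {"content": "Beirut", "beginOffset": 0},
            "partOfSpeech": {"tag": "NOUN", "case": "CASE_UNKNOWN",
                             "number": "SINGULAR", "proper": "PROPER"},
            "lemma": "Beirut",
        },
        {
            "text": {"content": "rescuers", "beginOffset": 7},
            "partOfSpeech": {"tag": "NOUN", "case": "CASE_UNKNOWN",
                             "number": "PLURAL", "proper": "PROPER_UNKNOWN"},
            "lemma": "rescuer",
        },
    ],
    "language": "en",
}

for row in summarize_tokens(SAMPLE):
    print(row)
```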
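A few things to notice in the result above:
- partOfSpeech: tag gives the part of speech; rescuers is a NOUN and search is a VERB.
- lemma: the canonical form of the word; for example run, runs, and ran all have the lemma run.
- headTokenIndex: the index of the token that this token modifies (indices start at zero).
- dependencyEdge: in essence this can be read as a graph that tells you how the words relate to one another, as in the figure below.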
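Read as data, that graph is just a list of (dependent, label, head) edges, recovered by following each token's headTokenIndex into the tokens array. The sketch below assumes the API's convention that the root token points at its own index, and the function name dependency_edges is made up for illustration. TOKENS is trimmed from the response above; the edge for search is elided there, so it is left out here rather than guessed.

```python
def dependency_edges(tokens: list) -> list:
    """Return (dependent, label, head) triples from a list of API tokens."""
    words = [token["text"]["content"] for token in tokens]
    edges = []
    for index, token in enumerate(tokens):
        edge = token.get("dependencyEdge")
        if edge is None:
            continue  # edge not available in this excerpt
        head = edge["headTokenIndex"]
        if head == index:
            continue  # assumed convention: the root token points at itself
        edges.append((words[index], edge["label"], words[head]))
    return edges


# Trimmed from the response above; only the fields used here are kept.
TOKENS = [
    {"text": {"content": "Beirut"},
     "dependencyEdge": {"headTokenIndex": 1, "label": "NN"}},
    {"text": {"content": "rescuers"},
     "dependencyEdge": {"headTokenIndex": 2, "label": "NSUBJ"}},
    {"text": {"content": "search"}},  # dependencyEdge elided in the output above
]

for dependent, label, head in dependency_edges(TOKENS):
    print(f"{dependent} --{label}--> {head}")
```

Here Beirut attaches to rescuers with the label NN, and rescuers attaches to search with the label NSUBJ, matching the headTokenIndex values 1 and 2 in the response.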
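Once the text has been processed and tagged like this, it can be put to further use.
For more details and the meaning of each parameter, see the official docs, which also include detailed explanations for many languages.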