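Extracting UserAgent, IP, and other information from Nginx logs
Sometimes we need a large number of UserAgent strings. Nginx access logs contain a huge number of real ones, so extracting every UserAgent string from an Nginx log file is well worth doing.
Python code that extracts the UserAgent strings from an Nginx log file: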
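import re

# Pattern for Nginx's default "combined" log format, compiled once.
# Raw strings keep \d and \[ intact; the fields are separated by single spaces.
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>[\d.]{7,}) '
    r'- - \[(?P<datetime>[^\[\]]+)\] '
    r'"(?P<request>[^"]+)" (?P<status>\d+) '
    r'(?P<size>\d+) "[^"]+" "(?P<user_agent>[^"]+)"'
)


def extract_nginx_log(log):
    """Return the fields of one log line as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(log)
    return match.groupdict() if match else None


ua_set = set()  # simple de-duplication: a set keeps each UserAgent once
with open('access.log', 'r') as fp:
    for line in fp:  # read line by line instead of loading the whole file
        fields = extract_nginx_log(line)
        if fields:  # skip lines that do not follow the expected format
            ua_set.add(fields['user_agent'])

with open('user_agent.txt', 'w') as fp:
    for ua in ua_set:
        fp.write(ua + '\n')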
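
The same named groups also give access to the client IP mentioned in the title. The following is a minimal sketch, not a definitive implementation: it assumes the log uses Nginx's default "combined" format, the helper name `count_ips` is hypothetical, and the sample lines are made up for illustration. It counts how many requests each IP made and skips any line that does not match.

```python
import re
from collections import Counter

# Assumed layout: Nginx "combined" format, as in the script above.
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>[\d.]{7,}) '
    r'- - \[(?P<datetime>[^\[\]]+)\] '
    r'"(?P<request>[^"]+)" (?P<status>\d+) '
    r'(?P<size>\d+) "[^"]+" "(?P<user_agent>[^"]+)"'
)


def count_ips(lines):
    """Count requests per client IP, ignoring lines that do not match."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            counts[match.group('remote_addr')] += 1
    return counts


# Made-up sample lines (documentation IP ranges), for illustration only.
sample = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0800] "GET / HTTP/1.1" 200 612 "-" "Mozilla/5.0"',
    '203.0.113.5 - - [10/Oct/2023:13:55:37 +0800] "GET /a HTTP/1.1" 404 153 "-" "curl/8.0.1"',
    '198.51.100.7 - - [10/Oct/2023:13:55:38 +0800] "GET / HTTP/1.1" 200 612 "-" "Mozilla/5.0"',
    'not a log line',
]

ip_counts = count_ips(sample)
for ip, n in ip_counts.most_common():  # most active IPs first
    print(ip, n)
```

To run this against a real log, pass an open file object instead of the sample list, for example `count_ips(open('access.log'))`; any iterable of lines works.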