Why this test was written

I had previously written a lightweight Python crawler framework, pycrawler. Since crawling is an I/O-bound workload, I wanted to use concurrency, but Python itself does not support concurrency well, so I switched to the concurrent networking library eventlet.
The eventlet library is very easy to use: you can specify how many urls are fetched concurrently, and the first version of the framework set this value to 10. However, the project at hand had 140k+
urls waiting to be crawled, and fetching only 10 urls concurrently was far too slow. After going through the eventlet documentation I found no mention of a limit on the maximum concurrency, only this one sentence:

Note also that imap is memory-bounded and won’t consume gigabytes of
memory if the list of urls grows to the tens of thousands (yes, we had
that problem in production once!).

So I tried setting the concurrency to 1000. This did indeed improve throughput, but the crawl results became problematic. In this project, urls are parsed out of downloaded pages and appended to a task queue; with the concurrency set to 1000, only about 80k urls had been visited by the time the crawl finished, far below the expected 140k. Something unexpected was clearly going wrong during the crawl.
So I wanted to find out the optimal concurrency for the eventlet library.


Test environment

  • Python 2.7
  • Mac OS X Yosemite 10.10.3
  • eventlet 0.17.3


Test approach

The core idea of the test is to download the same page concurrently, measure the total and average elapsed time, and check whether the number of responses returned equals the number of urls requested.
The test code is as follows:

import time
import eventlet
from eventlet.green import urllib2


def fetchone(url):
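    # try to download the page; any network or HTTP error results in html = None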
    try:
        res = urllib2.urlopen(url)
    except (IOError, urllib2.HTTPError):
        html = None
    else:
        html = res.read()
    return html


def fetch(urls):
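    # run fetchone over the urls through a green thread pool, count non-empty responses and report timing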
    pool = eventlet.GreenPool()
    start = time.time()
    results = pool.imap(fetchone, urls)
    end = time.time()
    count = 0
    for html in results:
        if html:
            count += 1
    ms = (end - start) * 1000
    total = round(ms, 4)
    ave = round(ms / len(urls), 4)
    print('Try {0} urls and get {1} responses. Total {2}ms average {3}ms'.format(len(urls), count, total, ave))

if __name__ == '__main__':
    base = ['http://www.zhihu.com']
    fetch(base)
    for i in xrange(10, 110, 10):
        fetch(base*i)

I initially tested against the Google homepage, but the requests were frequent enough that Google started returning
503s. Also, since I was testing from the US, Google's response time was too short to be a useful measurement, so I switched to the Zhihu homepage for the test.
The test results are as follows:

Try 1 urls and get 1 responses. Total 0.9949ms average 0.9949ms
Try 10 urls and get 10 responses. Total 0.041ms average 0.0041ms
Try 20 urls and get 20 responses. Total 0.041ms average 0.0021ms
Try 30 urls and get 30 responses. Total 0.0401ms average 0.0013ms                                                          
Try 40 urls and get 40 responses. Total 0.042ms average 0.001ms
Try 50 urls and get 50 responses. Total 0.0479ms average 0.001ms
Try 60 urls and get 60 responses. Total 0.061ms average 0.001ms
Try 70 urls and get 70 responses. Total 0.0451ms average 0.0006ms
Try 80 urls and get 80 responses. Total 0.0429ms average 0.0005ms
Try 90 urls and get 90 responses. Total 0.041ms average 0.0005ms
Try 100 urls and get 100 responses. Total 0.0448ms average 0.0004ms

Result analysis

From the test results we can see that a single request takes an extremely long time, which is presumably caused by eventlet's internal implementation. As the concurrency increases, the average time per url gradually decreases. Within the range covered by this test, at a concurrency of 100 the average time is 0.0004ms. I therefore suspected that raising the concurrency further would keep shortening the average time until it reached some minimum and then started rising again. I tried raising the concurrency to 1000, but that run failed with errors.
Considering the load placed on the target server versus the performance gain from pushing the concurrency even higher, a concurrency of 100 can be taken as a reasonably optimal choice. In a real project, an adaptive algorithm could be designed to adjust the concurrency dynamically for the best performance.
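As a rough reference, here is a minimal sketch of how such a cap could be applied with eventlet. It reuses the idea of the fetchone function from the test code above; the crawl() helper and the default of 100 are illustrative, and the adaptive tuning mentioned above is not implemented:

import eventlet
from eventlet.green import urllib2  # green (non-blocking) replacement for urllib2


def fetchone(url):
    # download one page; return None on any network or HTTP error
    try:
        return urllib2.urlopen(url).read()
    except IOError:
        return None


def crawl(urls, concurrency=100):
    # GreenPool(concurrency) caps the number of green threads, so at most
    # `concurrency` requests are in flight at a time; imap yields results
    # in the same order as the input urls.
    pool = eventlet.GreenPool(concurrency)
    return [html for html in pool.imap(fetchone, urls) if html]

An adaptive version would adjust this value between batches based on observed error rates or response times instead of hard-coding 100.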

Disclaimer

  • This is my first technical article on Jianshu, so there are bound to be plenty of flaws; I hope more experienced readers will not hesitate to point them out.
  • The test code is not very rigorous in many places; for example, it does not account for differences in network conditions between test runs.
  • The crawler framework mentioned at the beginning is my first open-source project, written in my spare time; please point out anything that is incomplete.

An Oracle logon trigger is typically used to audit user logon information or to restrict user logons. It is not used very often, but it is still a good technique. Two notes before the example:

 

  1. It cannot audit DBA user logons.

  2. When it can be used:

It is advised you use this trigger only when

(1) not using archive logging on the database (i.e. the database is in noarchivelog mode), or

(2) there are few logons to the database (the logon volume is low).

 



Step1. Create the audit table

CREATE TABLE gavin.bxj_logonlog(
  os_user varchar2(30),
  user_name varchar2(30),
  logon_time date,
  session_user varchar2(30),
  ip_address varchar2(15),
  program varchar2(30)
);

Step2. Create the Logon Trigger

CREATE OR REPLACE TRIGGER gavin.bxj_on_logon
  AFTER logon ON DATABASE
DECLARE
  user_name varchar2(30);
  os_user      varchar2(30);
  v_sid        number;
  v_su         varchar2(15);
  v_program    varchar2(30);
  v_ip         varchar2(15);
BEGIN
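  -- look up the current session's sid, then read osuser, username and program from v$session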
  EXECUTE IMMEDIATE 'select distinct sid from sys.v_$mystat'
    INTO v_sid;

  EXECUTE IMMEDIATE 'select osuser, username, program from sys.v_$session where sid = :b1'
    INTO os_user, user_name, v_program
    USING v_sid;

  SELECT sys_context('userenv', 'SESSION_USER') INTO v_su from dual;
  SELECT sys_context('userenv', 'IP_ADDRESS') INTO v_ip from dual;

  INSERT INTO gavin.bxj_logonlog
  VALUES
    (os_user, user_name, sysdate, v_su, v_ip, v_program);

  -- block logons by the WWW account
  IF (user_name = 'WWW') THEN
    DBMS_SESSION.SET_IDENTIFIER('about to raise app_error..');
    RAISE_APPLICATION_ERROR(-20003, 'You are not allowed to connect to the database');
  END IF;

  -- enable 10046 trace for all operations in GAVINTEST sessions
  IF user = 'GAVINTEST' THEN
    EXECUTE IMMEDIATE 'alter session set statistics_level=ALL';
    EXECUTE IMMEDIATE 'alter session set max_dump_file_size=UNLIMITED';
    EXECUTE IMMEDIATE 'alter session set tracefile_identifier='''||user||'_10046''';
    EXECUTE IMMEDIATE 'alter session set events ''10046 trace name context forever, level 12''';
  END IF;   
END;
/

Step3. Log on with a normal account to verify that records are captured in the BXJ_LOGONLOG audit table.
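A quick way to do this check from any SQL client (a minimal sketch, querying the audit table created in Step1):

-- show the logons captured so far, most recent first
SELECT *
  FROM gavin.bxj_logonlog
 ORDER BY logon_time DESC;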

 

Step4. Log on with the WWW account to test whether the logon is still allowed (it should be rejected by the trigger).

 

Step5. Log on with the GAVINTEST account to verify that a trace file is generated.
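One way to confirm this is to have the GAVINTEST session report its own trace file location (a sketch that assumes Oracle 11g or later, where V$DIAG_INFO is available and the account is allowed to query it; the file name will contain the GAVINTEST_10046 identifier set by the trigger):

-- run inside the GAVINTEST session
SELECT value
  FROM v$diag_info
 WHERE name = 'Default Trace File';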

 

Thanks and Regards

Reference: Metalink – How To Create A Trigger To Capture User Information On Logon [ID 454088.1]
