Changeset 150
- Timestamp: 04/26/08 18:11:04
- Files:
- trunk/README.txt (modified) (1 diff)
- trunk/development.ini.tmpl (moved) (moved from trunk/development.ini) (1 diff)
- trunk/etc (deleted)
- trunk/shakespeare.egg-info/paste_deploy_config.ini_tmpl (modified) (1 diff)
- trunk/shakespeare/__init__.py (modified) (1 diff)
- trunk/shakespeare/cache.py (modified) (1 diff)
- trunk/shakespeare/cache_test.py (modified) (1 diff)
- trunk/shakespeare/concordance.py (modified) (1 diff)
- trunk/shakespeare/concordance_test.py (modified) (1 diff)
- trunk/shakespeare/config/routing.py (modified) (1 diff)
- trunk/shakespeare/controllers/site.py (moved) (moved from trunk/shakespeare/wsgiplain.py) (1 diff)
- trunk/shakespeare/gutenberg.py (modified) (1 diff)
- trunk/shakespeare/gutenberg_test.py (modified) (1 diff)
- trunk/shakespeare/index.py (modified) (1 diff)
- trunk/shakespeare/moby.py (modified) (1 diff)
- trunk/shakespeare/moby_test.py (modified) (1 diff)
- trunk/shakespeare/model/__init__.py (modified) (1 diff)
- trunk/shakespeare/model/dm.py (moved) (moved from trunk/shakespeare/dm.py) (1 diff)
- trunk/shakespeare/src/eb_test.py (modified) (1 diff)
- trunk/shakespeare/template (deleted)
- trunk/shakespeare/templates/__init__.py (added)
- trunk/shakespeare/templates/concordance.html (copied) (copied from trunk/shakespeare/template/concordance.html) (1 diff)
- trunk/shakespeare/templates/concordance_by_word.html (copied) (copied from trunk/shakespeare/template/concordance_by_word.html) (1 diff)
- trunk/shakespeare/templates/guide.html (copied) (copied from trunk/shakespeare/template/guide.html)
- trunk/shakespeare/templates/index.html (copied) (copied from trunk/shakespeare/template/index.html) (1 diff)
- trunk/shakespeare/templates/layout.html (copied) (copied from trunk/shakespeare/template/layout.html) (1 diff)
- trunk/shakespeare/templates/view.html (copied) (copied from trunk/shakespeare/template/view.html) (1 diff)
- trunk/shakespeare/templates/view_annotate.html (copied) (copied from trunk/shakespeare/template/view_annotate.html)
- trunk/shakespeare/tests/__init__.py (modified) (1 diff)
- trunk/shakespeare/tests/functional/test_site.py (moved) (moved from trunk/shakespeare/site_test.py) (1 diff)
- trunk/shakespeare/tests/test_model.py (moved) (moved from trunk/shakespeare/dm_test.py) (1 diff)
- trunk/shakespeare/tests/test_models.py (deleted)
- trunk/shakespeare/wsgi.py (deleted)
trunk/README.txt
--- Revision 148
+++ Revision 150
 Introduction
 ************

 The Open Shakespeare package provides a full open set of shakespeare's works
 (often in multiple versions) along with ancillary material, a variety of tools
 and a python API.

 Specifically in addition to the works themselves (often in multiple versions)
 there is an introduction, a chronology, explanatory notes, a concordance and
 search facilities.

 All material is open source/open knowledge so that anyone can use, redistribute
 and reuse these materials freely. For exact details of the license under which
 this package is made available please see COPYING.txt.

 Open Shakespeare has been developed under the aegis of the Open Knowledge
 Foundation (http://www.okfn.org/).

 Contact the Project
 *******************

 Please mail info@okfn.org or join the okfn-discuss mailing list:

     http://lists.okfn.org/listinfo/okfn-discuss


 Installation and Setup
 **********************

 1. Install the code
 ===================

 1.1: (EITHER) Install using setup.py (preferred)
 ------------------------------------------------

 Install ``shakespeare`` using easy_install::

     easy_install shakespeare

 NB: If you don't have easy_install you can get from here:

 <http://peak.telecommunity.com/DevCenter/EasyInstall#installation-instructions>

 Make a config file as follows::

     paster make-config shakespeare config.ini

 Tweak the config file as appropriate and then setup the application::

     paster setup-app config.ini

 1.2 (OR) Get the code straight from subversion
 ------------------------------------------------

 1. Check out the subversion trunk::

     svn co https://knowledgeforge.net/shakespeare/svn/trunk

 2. Do::

     sudo python setup.py develop


 2. Cache Directory
 ==================

 Create a cache directory where texts and other material can be stored

 This directory needs to be semi-permanent so do *not* put under a location such
 as /tmp.


-3. Create a configuration file
-==============================
-
-1. copy the template at etc/shakespeare.conf.new to a suitable new location
-   (suggestion: etc/shakespeare.conf)
-
-2. edit to reflect your setup (see comments in file)
-
-3. make sure the config file can be found:
-    1. EITHER: it must be located at etc/shakespeare.conf relative to the
-       directory from which you run scripts
-
-    2. OR: set the SHAKESPEARECONF environment variable to contain the path to
-       the configuration file
-

 5. Initialize the system
 ========================

 Run: $ bin/shakespeare-admin init

 This may take some time to run so be patient

 TIP: using sqlite building the concordance really **does** seem to run forever
 so recommend using postgresql or mysql if you are going to build the
 concordance.


 Getting Started
 ***************

 As a user:
 ==========

 Start up the web interface by running the webserver:

     $ bin/shakespeare-admin runserver

 Then visit http://localhost:8080/ using your favourite web browser.

 As a developer:
 ===============

+0. Copy development.ini.tmpl to development.ini and edit to your taste.
+
 1. Check out the administrative commands: $ bin/shakespeare-admin help.

-2. Run the tests: $ py.test
-
-Note that:
-
-  * The tests use [py.test] so you will need to have installed this
+2. Run the tests using either py.test of nosetests::

-  * To run the website tests (site_test etc) you will need to install [twill]
-    and have the webserver running
+    $ nosetests shakespeare

-[py.test]: http://codespeak.net/py/current/doc/getting-started.html
-[twill]: http://twill.idyll.org/
-
trunk/development.ini.tmpl
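The development.ini.tmpl change adds a `cachedir` setting under `[DEFAULT]` that relies on `%(here)s`. As a rough illustration of how such a value expands and how `[DEFAULT]` keys become visible in other sections, here is a minimal sketch using the Python 3 standard-library `configparser`. This is an assumption-laden stand-in, not Paste Deploy itself: the `load_sample` helper, the trimmed `SAMPLE` text and the `/srv/shakespeare` directory are hypothetical, and `here` is passed in by hand because only Paste Deploy supplies it automatically.

```python
import configparser

# Trimmed, hypothetical excerpt modelled on the settings in the template.
SAMPLE = """
[DEFAULT]
debug = true
cachedir = %(here)s/cache

[app:main]
cache_dir = %(here)s/data
"""


def load_sample(here):
    """Parse SAMPLE, supplying 'here' by hand as an interpolation default."""
    parser = configparser.ConfigParser(defaults={"here": here})
    parser.read_string(SAMPLE)
    return parser


parser = load_sample("/srv/shakespeare")
# A key set in [DEFAULT] can also be read through any other section.
print(parser.get("app:main", "cachedir"))
print(parser.get("app:main", "cache_dir"))
```

The two settings are deliberately distinct: `cachedir` holds local copies of texts, while `cache_dir` is the application's own data directory.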
--- Revision 148
+++ Revision 150
 #
 # shakespeare - Pylons development environment configuration
 #
 # The %(here)s variable will be replaced with the parent directory of this file
 #
 [DEFAULT]
 debug = true
 # Uncomment and replace with the address which should receive any error reports
 #email_to = you@yourdomain.com
 smtp_server = localhost
 error_email_from = paste@localhost
+
+# directory where we can store all local copies of texts
+# at present should be different from the app's cache_dir
+cachedir = %(here)s/cache
+

 [server:main]
 use = egg:Paste#http
 host = 0.0.0.0
 port = 5000

 [app:main]
 use = egg:shakespeare
 full_stack = true
 cache_dir = %(here)s/data
 beaker.session.key = shakespeare
 beaker.session.secret = somesecret

 # If you'd like to fine-tune the individual locations of the cache data dirs
 # for the Cache data, or the Session saves, un-comment the desired settings
 # here:
 #beaker.cache.data_dir = %(here)s/data/cache
 #beaker.session.data_dir = %(here)s/data/sessions

 # WARNING: *THE LINE BELOW MUST BE UNCOMMENTED ON A PRODUCTION ENVIRONMENT*
 # Debug mode will enable the interactive debugging tool, allowing ANYONE to
 # execute malicious code after an exception is raised.
 #set debug = false
+
+# using sqlite in memory leads to thread issues when using db ...
+# sqlobject.dburi = sqlite:///:memory:
+sqlobject.dburi = postgres://<username>:<password>@localhost/<your-dbname>


 # Logging configuration
 [loggers]
 keys = root, shakespeare

 [handlers]
 keys = console

 [formatters]
 keys = generic

 [logger_root]
 level = INFO
 handlers = console

 [logger_shakespeare]
 level = DEBUG
 handlers =
 qualname = shakespeare

 [handler_console]
 class = StreamHandler
 args = (sys.stderr,)
 level = NOTSET
 formatter = generic

 [formatter_generic]
 format = %(asctime)s,%(msecs)03d %(levelname)-5.5s [%(name)s] %(message)s
 datefmt = %H:%M:%S
+
trunk/shakespeare.egg-info/paste_deploy_config.ini_tmpl
--- Revision 148
+++ Revision 150
 #
 # shakespeare - Pylons configuration
 #
 # The %(here)s variable will be replaced with the parent directory of this file
 #
 [DEFAULT]
 debug = true
 email_to = you@yourdomain.com
 smtp_server = localhost
 error_email_from = paste@localhost
+
+# directory where we can store all local copies of texts
+# at present should be different from the app's cache_dir
+cachedir = ./cache

 [server:main]
 use = egg:Paste#http
 host = 0.0.0.0
 port = 5000

 [app:main]
 use = egg:shakespeare
 full_stack = true
 cache_dir = %(here)s/data
 beaker.session.key = shakespeare
 beaker.session.secret = ${app_instance_secret}
 app_instance_uuid = ${app_instance_uuid}

 # If you'd like to fine-tune the individual locations of the cache data dirs
 # for the Cache data, or the Session saves, un-comment the desired settings
 # here:
 #beaker.cache.data_dir = %(here)s/data/cache
 #beaker.session.data_dir = %(here)s/data/sessions

 # WARNING: *THE LINE BELOW MUST BE UNCOMMENTED ON A PRODUCTION ENVIRONMENT*
 # Debug mode will enable the interactive debugging tool, allowing ANYONE to
 # execute malicious code after an exception is raised.
 set debug = false

+# using sqlite in memory leads to thread issues when using db ...
+# sqlobject.dburi = sqlite:///:memory:
+sqlobject.dburi = postgres://<username>:<password>@localhost/<your-dbname>

 # Logging configuration
 [loggers]
 keys = root

 [handlers]
 keys = console

 [formatters]
 keys = generic

 [logger_root]
 level = INFO
 handlers = console

 [handler_console]
 class = StreamHandler
 args = (sys.stderr,)
 level = NOTSET
 formatter = generic

 [formatter_generic]
 format = %(asctime)s %(levelname)-5.5s [%(name)s] %(message)s
+
+
+[misc]
+# directory where we can store all local copies of texts
+cachedir = ./cache
+
+[db]
+# sqlobject database uri. see sqlobject documentation for details
+# uri = postgres://user:pass@host/dbname
+uri = sqlite:/:memory:
+
+[web]
+# directory where the templates used by web front end are kept
+template_dir = ./src/shakespeare/template
+
+[annotater]
+# url at which marginalia files (css/js etc) should be mounted
+marginalia_prefix = /marginalia
trunk/shakespeare/__init__.py
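In the `__init__.py` change, `conf()` still decides which file to load in the same way: the `SHAKESPEARECONF` environment variable wins if set, otherwise the default path is used, which is now `./development.ini`. The lookup on its own can be sketched as follows. The function name `resolve_conf_path` and its `environ` parameter are hypothetical, added only so the logic can be exercised without touching the real environment; the Paste and Pylons loading step is left out.

```python
import os
import tempfile


def resolve_conf_path(application_name="shakespeare", environ=None):
    """Return the config path, preferring the <APPNAME>CONF variable."""
    environ = os.environ if environ is None else environ
    default_path = os.path.abspath("./development.ini")
    env_var_name = application_name.upper() + "CONF"  # e.g. SHAKESPEARECONF
    conf_path = environ.get(env_var_name, default_path)
    if not os.path.exists(conf_path):
        raise ValueError("No Configuration file exists at: %s" % conf_path)
    return conf_path


# Usage: point the variable at a file that exists for the duration of the call.
with tempfile.NamedTemporaryFile(suffix=".ini") as handle:
    found = resolve_conf_path(environ={"SHAKESPEARECONF": handle.name})
    print(found == handle.name)
```

Because the default is relative to the current working directory, scripts run from another directory would need the environment variable set.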
--- Revision 148
+++ Revision 150
 __version__ = '0.5dev'
 __application_name__ = 'shakespeare'

 def conf():
     import os
-    defaultPath = os.path.abspath('./etc/%s.conf' % __application_name__)
+    defaultPath = os.path.abspath('./development.ini')
     envVarName = __application_name__.upper() + 'CONF'
     confPath = os.environ.get(envVarName, defaultPath)
     if not os.path.exists(confPath):
         raise ValueError('No Configuration file exists at: %s' % confPath)
-    import ConfigParser
-    conf = ConfigParser.SafeConfigParser()
-    conf.read(confPath)
+
+    # register the config
+    import paste.deploy
+    import shakespeare.config.environment
+    pasteconf = paste.deploy.appconfig('config:' + confPath)
+
+    shakespeare.config.environment.load_environment(pasteconf.global_conf,
+        pasteconf.local_conf)
+    from pylons import config
+    conf = config
+
+    # import ConfigParser
+    # conf = ConfigParser.SafeConfigParser()
+    # conf.read(confPath)
+
     return conf

trunk/shakespeare/cache.py
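For orientation on cache.py: `Cache.path` maps a remote URL to a file under the cache directory by dropping the protocol and prefixing the file name with the version string. A standalone sketch of that mapping is shown here. The free function `cache_path` and the `/var/cache/shakespeare` root are hypothetical, and the real class also handles downloading, which is omitted.

```python
import os


def cache_path(cache_root, remote_url, version=""):
    """Mirror the URL-to-local-path mapping done by Cache.path."""
    protocol_end = remote_url.index(":") + 3  # skip past "://"
    path = remote_url[protocol_end:]
    base, name = os.path.split(path)
    # The version string is prefixed to the file name, not the directory.
    return os.path.join(cache_root, base, version + name)


url = "http://www.gutenberg.org/dirs/GUTINDEX.ALL"
print(cache_path("/var/cache/shakespeare", url))
print(cache_path("/var/cache/shakespeare", url, "cleaned"))
```

On a POSIX system the second call yields a path ending in `www.gutenberg.org/dirs/cleanedGUTINDEX.ALL`, which is the case checked by `test_path_2` in cache_test.py.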
--- Revision 139
+++ Revision 150
 import os
 import urllib

 import shakespeare
 conf = shakespeare.conf()

 class Cache(object):
     """Provide a local filesystem cache for material.
     """

     def __init__(self, cache_path):
         self.cache_path = cache_path

     def path(self, remote_url, version=''):
         """Get local path to text of remote url.
         @type: string giving version of text (''|'cleaned')
         """
         protocolEnd = remote_url.index(':') + 3 # add 3 for ://
         path = remote_url[protocolEnd:]
         base, name = os.path.split(path)
         name = version + name
         offset = os.path.join(base, name)
         localPath = self.path_from_offset(offset)
         return localPath

     def download_url(self, url, overwrite=False):
         """Download a url to the local cache
         @overwrite: if True overwrite an existing local copy otherwise don't
         """
         localPath = self.path(url)
         dirpath = os.path.dirname(localPath)
         if overwrite or not(os.path.exists(localPath)):
             if not os.path.exists(dirpath):
                 os.makedirs(dirpath)
             # use wget as it seems to work more reliably on wikimedia
             # see extensive comments on issue in shakespeare.eb.Wikimedia class
             # rgrp: 2008-03-18 use urllib rather than wget despite these issues
             # as wget is fairly specific to linux/unix and even there may not
             # be installed.
             # cmd = 'wget -O %s %s' % (localPath, url)
             # os.system(cmd)
             urllib.urlretrieve(url, localPath)

     def path_from_offset(self, offset):
         "Get full path of file in cache given by offset."
         return os.path.join(self.cache_path, offset)


-default_path = shakespeare.conf().get('misc', 'cachedir')
+default_path = shakespeare.conf()['cachedir']
 default = Cache(default_path)

trunk/shakespeare/cache_test.py
--- Revision 50
+++ Revision 150
 import os
 import shutil
 import tempfile

 import shakespeare.cache

 class TestCache(object):

+    @classmethod
     def setup_class(cls):
         cls.cache_path = tempfile.mkdtemp()
         cls.cache = shakespeare.cache.Cache(cls.cache_path)
         cls.url = 'http://www.gutenberg.org/dirs/GUTINDEX.ALL'
         cls.url2 = 'http://project.knowledgeforge.net/shakespeare/svn/trunk/CHANGELOG.txt'

+    @classmethod
     def teardown_class(cls):
         shutil.rmtree(cls.cache_path)

     def test_path(self):
         exp = os.path.join(self.cache_path, self.url[7:])
         out = self.cache.path(self.url)
         assert out == exp

     def test_path_2(self):
         exp = os.path.join(self.cache_path,
                 'www.gutenberg.org/dirs/cleanedGUTINDEX.ALL')
         out = self.cache.path(self.url, 'cleaned')
         assert exp == out

     def test_download_url(self):
         exp = os.path.join(self.cache_path, self.url2[7:])
         self.cache.download_url(self.url2, overwrite=True)
         assert os.path.exists(exp)

trunk/shakespeare/concordance.py
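As background for concordance.py, whose change only swaps `shakespeare.dm` for `shakespeare.model`: the word-counting side of `ConcordanceBuilder` can be approximated without a database. The sketch reuses the regex, the `non_words` list and the roman-numeral check from the file, but `count_words` is a hypothetical helper, the sample lines are invented, and no SQLObject records are written.

```python
import re

WORD_REGEX = re.compile(r"\b(\w+)\b", re.U | re.M | re.I)
NON_WORDS = ["d", "t"]  # fragments left over from forms such as accus'd


def is_roman_numeral(word):
    digits = ["i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix"]
    others = ["l", "x", "c"]
    if word == "i":
        return False  # treated as the pronoun, not a numeral
    while word[0] in others:
        if len(word) == 1:
            return True
        word = word[1:]
    return word in digits


def count_words(lines):
    """Count lower-cased words, skipping non-words and roman numerals."""
    stats = {}
    for line in lines:
        for match in WORD_REGEX.finditer(line):
            word = match.group().lower()
            if word in NON_WORDS or is_roman_numeral(word):
                continue
            stats[word] = stats.get(word, 0) + 1
    return stats


print(count_words(["Thou art accus'd, I say", "ACT IV"]))
```

Here `accus'd` contributes only `accus`, `I` is kept as a word, and `IV` is dropped as a numeral.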
--- Revision 74
+++ Revision 150
 """
 Concordance (and statistics) for texts in database.

 To build concordance use ConcordanceBuilder. To access concordance/statistics
 use Concordance/Statistics class. Concordance and statistics are provided as
 dictionaries keyed by words.

 NB: all word keys have been lower-cased in order to render them
 case-insensitive
 """
 import re

 import sqlobject

 import shakespeare.index
 import shakespeare.cache


 class ConcordanceBase(object):
     """
     TODO: caching??
     """
-    sqlcc = shakespeare.dm.Concordance
-    sqlstat = shakespeare.dm.Statistic
+    sqlcc = shakespeare.model.Concordance
+    sqlstat = shakespeare.model.Statistic

     def __init__(self, filter_names=None):
         """
         @param filter_names: a list of id names with which to filter results
             (i.e. only return results relating to those texts)
         """
         self._filter_names = filter_names
         self.sqlcc_filter = self._make_filter(self.sqlcc)
         self.sqlstat_filter = self._make_filter(self.sqlstat)

     def _make_filter(self, sqlobj):
         sql_filter = True
         if self._filter_names is not None:
             arglist = []
             for name in self._filter_names:
                 newarg = sqlobj.q.textID == self._name2id(name)
                 arglist.append(newarg)
             sql_filter = sqlobject.OR(*arglist)
         return sql_filter

     def _name2id(self, name):
-        return shakespeare.dm.Material.byName(name).id
+        return shakespeare.model.Material.byName(name).id

     def keys(self):
         """Return list of *distinct* words in concordance/statistics
         """
         all = self.sqlstat.select(self.sqlstat_filter,
                 orderBy=self.sqlstat.q.word,
                 )
         words = [ xx.word for xx in list(all) ]
         distinct = list(set(words))
         distinct.sort()
         return distinct


 class Concordance(ConcordanceBase):
     """Concordance by word for a set of texts
     """

     def get(self, word):
         """Get list of occurrences for word
         @return: sqlobject query list
         """
         select = self.sqlcc.select(sqlobject.AND(self.sqlcc_filter, self.sqlcc.q.word==word))
         return select

 class Statistics(ConcordanceBase):

     def get(self, word):
         select = self.sqlstat.select(
             sqlobject.AND(self.sqlstat_filter, self.sqlstat.q.word==word)
             )
         total = 0
         for stat in select:
             total += stat.occurrences
         return total

 class ConcordanceBuilder(object):
     """Build a concordance and associated statistics for a set of texts.

     """

     # multiline, unicode and ignorecase
     word_regex = re.compile(r'\b(\w+)\b', re.U | re.M | re.I)

     words_to_ignore = [
         # 'a', 'the', 'and', 'as', 'are', 'be', 'but', 'in'
         ]
     non_words = [
         'd', # accus'd
         't',
         ]

     def is_roman_numeral(self, word):
         digits = [ 'i', 'ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix' ]
         others = [ 'l', 'x', 'c' ]
         if word == 'i': return False # exception because this conflicts with I
         while word[0] in others:
             if len(word) == 1:
                 return True
             else:
                 word = word[1:]
         return word in digits

     def ignore_word(self, word):
         "Return True if this word should not be added to the concordance."
         bool1 = word in self.words_to_ignore
         bool2 = word in self.non_words
         # do roman numerals
         bool3 = self.is_roman_numeral(word)
         return bool1 or bool2 or bool3

     def _text_already_done(self, text):
-        numrecs = shakespeare.dm.Concordance.select(
-                shakespeare.dm.Concordance.q.textID==text.id
+        numrecs = shakespeare.model.Concordance.select(
+                shakespeare.model.Concordance.q.textID==text.id
                 ).count()
         return numrecs > 0

     def add_text(self, name, text=None):
         """Add a text to the concordance.
         @param name: name of text to add
         @param text: [optional] a file-like object containing text data. If not
             provided will default to using file in cache associated with named
             text
         """
-        dmText = shakespeare.dm.Material.byName(name)
+        dmText = shakespeare.model.Material.byName(name)
         if self._text_already_done(dmText):
             msg = 'Have already added to concordance text: %s' % dmText
             # raise ValueError(msg)
             print msg
             print 'Skipping'
             return
         if text is None:
             tpath = dmText.get_cache_path('plain')
             text = file(tpath)
         lineCount = 0
         charIndex = 0
         stats = {}
-        trans = shakespeare.dm.Concordance._connection.transaction()
+        trans = shakespeare.model.Concordance._connection.transaction()
         for line in text.readlines():
             for match in self.word_regex.finditer(line):
                 word = match.group().lower() # case insensitive
                 if self.ignore_word(word):
                     continue
-                shakespeare.dm.Concordance(connection=trans,
+                shakespeare.model.Concordance(connection=trans,
                         text=dmText,
                         word=word,
                         line=lineCount,
                         char_index=charIndex+match.start())
                 stats[word] = stats.get(word, 0) + 1
             lineCount += 1
             charIndex += len(line)
         trans.commit()
-        trans = shakespeare.dm.Concordance._connection.transaction()
+        trans = shakespeare.model.Concordance._connection.transaction()
         for word, value in stats.items():
-            tresults = shakespeare.dm.Statistic.select(
+            tresults = shakespeare.model.Statistic.select(
                     sqlobject.AND(
-                        shakespeare.dm.Statistic.q.textID == dmText.id,
-                        shakespeare.dm.Statistic.q.word == word
+                        shakespeare.model.Statistic.q.textID == dmText.id,
+                        shakespeare.model.Statistic.q.word == word
                     ))
             try:
                 dbstat = list(tresults)[0]
                 dbstat.occurrences += value
             except:
-                shakespeare.dm.Statistic(
+                shakespeare.model.Statistic(
                         connection=trans,
                         text=dmText,
                         word=word,
                         occurrences=value
                         )
         trans.commit()


     def remove_text(self, name):
         """Remove a text from the concordance.

         @param name: as for add_text
         """
-        dmText = shakespeare.dm.Material.byName(name)
-        recs = shakespeare.dm.Concordance.select(
-                shakespeare.dm.Concordance.q.textID==dmText.id
+        dmText = shakespeare.model.Material.byName(name)
+        recs = shakespeare.model.Concordance.select(
+                shakespeare.model.Concordance.q.textID==dmText.id
                 )
         for rec in recs:
