Commit 989dd1dd authored by Kenneth Reitz's avatar Kenneth Reitz

pip v1.5.6, setuptools v3.6

parent 0468ef22
...@@ -27,8 +27,8 @@ PROFILE_PATH="$BUILD_DIR/.profile.d/python.sh" ...@@ -27,8 +27,8 @@ PROFILE_PATH="$BUILD_DIR/.profile.d/python.sh"
DEFAULT_PYTHON_VERSION="python-2.7.6" DEFAULT_PYTHON_VERSION="python-2.7.6"
PYTHON_EXE="/app/.heroku/python/bin/python" PYTHON_EXE="/app/.heroku/python/bin/python"
PIP_VERSION="1.5.4" PIP_VERSION="1.5.6"
SETUPTOOLS_VERSION="2.1" SETUPTOOLS_VERSION="3.6"
# Setup bpwatch # Setup bpwatch
export PATH=$PATH:$ROOT_DIR/vendor/bpwatch export PATH=$PATH:$ROOT_DIR/vendor/bpwatch
......
Metadata-Version: 1.1
Name: pip
Version: 1.5.4
Summary: A tool for installing and managing Python packages.
Home-page: http://www.pip-installer.org
Author: The pip developers
Author-email: python-virtualenv@groups.google.com
License: MIT
Description:
Project Info
============
* Project Page: https://github.com/pypa/pip
* Install howto: http://www.pip-installer.org/en/latest/installing.html
* Changelog: http://www.pip-installer.org/en/latest/news.html
* Bug Tracking: https://github.com/pypa/pip/issues
* Mailing list: http://groups.google.com/group/python-virtualenv
* Docs: http://www.pip-installer.org/
* User IRC: #pip on Freenode.
* Dev IRC: #pypa on Freenode.
Quickstart
==========
First, :doc:`Install pip <installing>`.
Install a package from `PyPI`_:
::
$ pip install SomePackage
[...]
Successfully installed SomePackage
Show what files were installed:
::
$ pip show --files SomePackage
Name: SomePackage
Version: 1.0
Location: /my/env/lib/pythonx.x/site-packages
Files:
../somepackage/__init__.py
[...]
List what packages are outdated:
::
$ pip list --outdated
SomePackage (Current: 1.0 Latest: 2.0)
Upgrade a package:
::
$ pip install --upgrade SomePackage
[...]
Found existing installation: SomePackage 1.0
Uninstalling SomePackage:
Successfully uninstalled SomePackage
Running setup.py install for SomePackage
Successfully installed SomePackage
Uninstall a package:
::
$ pip uninstall SomePackage
Uninstalling SomePackage:
/my/env/lib/pythonx.x/site-packages/somepackage
Proceed (y/n)? y
Successfully uninstalled SomePackage
.. _PyPI: http://pypi.python.org/pypi/
Keywords: easy_install distutils setuptools egg virtualenv
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Software Development :: Build Tools
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.1
Classifier: Programming Language :: Python :: 3.2
Classifier: Programming Language :: Python :: 3.3
AUTHORS.txt
CHANGES.txt
LICENSE.txt
MANIFEST.in
PROJECT.txt
README.rst
setup.cfg
setup.py
docs/configuration.rst
docs/cookbook.rst
docs/development.rst
docs/distribute_setuptools.rst
docs/index.rst
docs/installing.rst
docs/logic.rst
docs/news.rst
docs/quickstart.rst
docs/usage.rst
docs/user_guide.rst
docs/reference/index.rst
docs/reference/pip.rst
docs/reference/pip_freeze.rst
docs/reference/pip_install.rst
docs/reference/pip_list.rst
docs/reference/pip_search.rst
docs/reference/pip_show.rst
docs/reference/pip_uninstall.rst
docs/reference/pip_wheel.rst
pip/__init__.py
pip/__main__.py
pip/basecommand.py
pip/baseparser.py
pip/cmdoptions.py
pip/download.py
pip/exceptions.py
pip/index.py
pip/locations.py
pip/log.py
pip/pep425tags.py
pip/req.py
pip/runner.py
pip/status_codes.py
pip/util.py
pip/wheel.py
pip.egg-info/PKG-INFO
pip.egg-info/SOURCES.txt
pip.egg-info/dependency_links.txt
pip.egg-info/entry_points.txt
pip.egg-info/not-zip-safe
pip.egg-info/requires.txt
pip.egg-info/top_level.txt
pip/_vendor/__init__.py
pip/_vendor/pkg_resources.py
pip/_vendor/re-vendor.py
pip/_vendor/six.py
pip/_vendor/_markerlib/__init__.py
pip/_vendor/_markerlib/markers.py
pip/_vendor/colorama/__init__.py
pip/_vendor/colorama/ansi.py
pip/_vendor/colorama/ansitowin32.py
pip/_vendor/colorama/initialise.py
pip/_vendor/colorama/win32.py
pip/_vendor/colorama/winterm.py
pip/_vendor/distlib/__init__.py
pip/_vendor/distlib/compat.py
pip/_vendor/distlib/database.py
pip/_vendor/distlib/index.py
pip/_vendor/distlib/locators.py
pip/_vendor/distlib/manifest.py
pip/_vendor/distlib/markers.py
pip/_vendor/distlib/metadata.py
pip/_vendor/distlib/resources.py
pip/_vendor/distlib/scripts.py
pip/_vendor/distlib/t32.exe
pip/_vendor/distlib/t64.exe
pip/_vendor/distlib/util.py
pip/_vendor/distlib/version.py
pip/_vendor/distlib/w32.exe
pip/_vendor/distlib/w64.exe
pip/_vendor/distlib/wheel.py
pip/_vendor/distlib/_backport/__init__.py
pip/_vendor/distlib/_backport/misc.py
pip/_vendor/distlib/_backport/shutil.py
pip/_vendor/distlib/_backport/sysconfig.cfg
pip/_vendor/distlib/_backport/sysconfig.py
pip/_vendor/distlib/_backport/tarfile.py
pip/_vendor/html5lib/__init__.py
pip/_vendor/html5lib/constants.py
pip/_vendor/html5lib/html5parser.py
pip/_vendor/html5lib/ihatexml.py
pip/_vendor/html5lib/inputstream.py
pip/_vendor/html5lib/sanitizer.py
pip/_vendor/html5lib/tokenizer.py
pip/_vendor/html5lib/utils.py
pip/_vendor/html5lib/filters/__init__.py
pip/_vendor/html5lib/filters/_base.py
pip/_vendor/html5lib/filters/alphabeticalattributes.py
pip/_vendor/html5lib/filters/inject_meta_charset.py
pip/_vendor/html5lib/filters/lint.py
pip/_vendor/html5lib/filters/optionaltags.py
pip/_vendor/html5lib/filters/sanitizer.py
pip/_vendor/html5lib/filters/whitespace.py
pip/_vendor/html5lib/serializer/__init__.py
pip/_vendor/html5lib/serializer/htmlserializer.py
pip/_vendor/html5lib/treebuilders/__init__.py
pip/_vendor/html5lib/treebuilders/_base.py
pip/_vendor/html5lib/treebuilders/dom.py
pip/_vendor/html5lib/treebuilders/etree.py
pip/_vendor/html5lib/treebuilders/etree_lxml.py
pip/_vendor/html5lib/treewalkers/__init__.py
pip/_vendor/html5lib/treewalkers/_base.py
pip/_vendor/html5lib/treewalkers/dom.py
pip/_vendor/html5lib/treewalkers/etree.py
pip/_vendor/html5lib/treewalkers/genshistream.py
pip/_vendor/html5lib/treewalkers/lxmletree.py
pip/_vendor/html5lib/treewalkers/pulldom.py
pip/_vendor/html5lib/trie/__init__.py
pip/_vendor/html5lib/trie/_base.py
pip/_vendor/html5lib/trie/datrie.py
pip/_vendor/html5lib/trie/py.py
pip/_vendor/requests/__init__.py
pip/_vendor/requests/adapters.py
pip/_vendor/requests/api.py
pip/_vendor/requests/auth.py
pip/_vendor/requests/cacert.pem
pip/_vendor/requests/certs.py
pip/_vendor/requests/compat.py
pip/_vendor/requests/cookies.py
pip/_vendor/requests/exceptions.py
pip/_vendor/requests/hooks.py
pip/_vendor/requests/models.py
pip/_vendor/requests/sessions.py
pip/_vendor/requests/status_codes.py
pip/_vendor/requests/structures.py
pip/_vendor/requests/utils.py
pip/_vendor/requests/packages/__init__.py
pip/_vendor/requests/packages/charade/__init__.py
pip/_vendor/requests/packages/charade/__main__.py
pip/_vendor/requests/packages/charade/big5freq.py
pip/_vendor/requests/packages/charade/big5prober.py
pip/_vendor/requests/packages/charade/chardistribution.py
pip/_vendor/requests/packages/charade/charsetgroupprober.py
pip/_vendor/requests/packages/charade/charsetprober.py
pip/_vendor/requests/packages/charade/codingstatemachine.py
pip/_vendor/requests/packages/charade/compat.py
pip/_vendor/requests/packages/charade/constants.py
pip/_vendor/requests/packages/charade/cp949prober.py
pip/_vendor/requests/packages/charade/escprober.py
pip/_vendor/requests/packages/charade/escsm.py
pip/_vendor/requests/packages/charade/eucjpprober.py
pip/_vendor/requests/packages/charade/euckrfreq.py
pip/_vendor/requests/packages/charade/euckrprober.py
pip/_vendor/requests/packages/charade/euctwfreq.py
pip/_vendor/requests/packages/charade/euctwprober.py
pip/_vendor/requests/packages/charade/gb2312freq.py
pip/_vendor/requests/packages/charade/gb2312prober.py
pip/_vendor/requests/packages/charade/hebrewprober.py
pip/_vendor/requests/packages/charade/jisfreq.py
pip/_vendor/requests/packages/charade/jpcntx.py
pip/_vendor/requests/packages/charade/langbulgarianmodel.py
pip/_vendor/requests/packages/charade/langcyrillicmodel.py
pip/_vendor/requests/packages/charade/langgreekmodel.py
pip/_vendor/requests/packages/charade/langhebrewmodel.py
pip/_vendor/requests/packages/charade/langhungarianmodel.py
pip/_vendor/requests/packages/charade/langthaimodel.py
pip/_vendor/requests/packages/charade/latin1prober.py
pip/_vendor/requests/packages/charade/mbcharsetprober.py
pip/_vendor/requests/packages/charade/mbcsgroupprober.py
pip/_vendor/requests/packages/charade/mbcssm.py
pip/_vendor/requests/packages/charade/sbcharsetprober.py
pip/_vendor/requests/packages/charade/sbcsgroupprober.py
pip/_vendor/requests/packages/charade/sjisprober.py
pip/_vendor/requests/packages/charade/universaldetector.py
pip/_vendor/requests/packages/charade/utf8prober.py
pip/_vendor/requests/packages/chardet/__init__.py
pip/_vendor/requests/packages/chardet/big5freq.py
pip/_vendor/requests/packages/chardet/big5prober.py
pip/_vendor/requests/packages/chardet/chardetect.py
pip/_vendor/requests/packages/chardet/chardistribution.py
pip/_vendor/requests/packages/chardet/charsetgroupprober.py
pip/_vendor/requests/packages/chardet/charsetprober.py
pip/_vendor/requests/packages/chardet/codingstatemachine.py
pip/_vendor/requests/packages/chardet/compat.py
pip/_vendor/requests/packages/chardet/constants.py
pip/_vendor/requests/packages/chardet/cp949prober.py
pip/_vendor/requests/packages/chardet/escprober.py
pip/_vendor/requests/packages/chardet/escsm.py
pip/_vendor/requests/packages/chardet/eucjpprober.py
pip/_vendor/requests/packages/chardet/euckrfreq.py
pip/_vendor/requests/packages/chardet/euckrprober.py
pip/_vendor/requests/packages/chardet/euctwfreq.py
pip/_vendor/requests/packages/chardet/euctwprober.py
pip/_vendor/requests/packages/chardet/gb2312freq.py
pip/_vendor/requests/packages/chardet/gb2312prober.py
pip/_vendor/requests/packages/chardet/hebrewprober.py
pip/_vendor/requests/packages/chardet/jisfreq.py
pip/_vendor/requests/packages/chardet/jpcntx.py
pip/_vendor/requests/packages/chardet/langbulgarianmodel.py
pip/_vendor/requests/packages/chardet/langcyrillicmodel.py
pip/_vendor/requests/packages/chardet/langgreekmodel.py
pip/_vendor/requests/packages/chardet/langhebrewmodel.py
pip/_vendor/requests/packages/chardet/langhungarianmodel.py
pip/_vendor/requests/packages/chardet/langthaimodel.py
pip/_vendor/requests/packages/chardet/latin1prober.py
pip/_vendor/requests/packages/chardet/mbcharsetprober.py
pip/_vendor/requests/packages/chardet/mbcsgroupprober.py
pip/_vendor/requests/packages/chardet/mbcssm.py
pip/_vendor/requests/packages/chardet/sbcharsetprober.py
pip/_vendor/requests/packages/chardet/sbcsgroupprober.py
pip/_vendor/requests/packages/chardet/sjisprober.py
pip/_vendor/requests/packages/chardet/universaldetector.py
pip/_vendor/requests/packages/chardet/utf8prober.py
pip/_vendor/requests/packages/urllib3/__init__.py
pip/_vendor/requests/packages/urllib3/_collections.py
pip/_vendor/requests/packages/urllib3/connection.py
pip/_vendor/requests/packages/urllib3/connectionpool.py
pip/_vendor/requests/packages/urllib3/exceptions.py
pip/_vendor/requests/packages/urllib3/fields.py
pip/_vendor/requests/packages/urllib3/filepost.py
pip/_vendor/requests/packages/urllib3/poolmanager.py
pip/_vendor/requests/packages/urllib3/request.py
pip/_vendor/requests/packages/urllib3/response.py
pip/_vendor/requests/packages/urllib3/util.py
pip/_vendor/requests/packages/urllib3/contrib/__init__.py
pip/_vendor/requests/packages/urllib3/contrib/ntlmpool.py
pip/_vendor/requests/packages/urllib3/contrib/pyopenssl.py
pip/_vendor/requests/packages/urllib3/packages/__init__.py
pip/_vendor/requests/packages/urllib3/packages/ordered_dict.py
pip/_vendor/requests/packages/urllib3/packages/six.py
pip/_vendor/requests/packages/urllib3/packages/ssl_match_hostname/__init__.py
pip/_vendor/requests/packages/urllib3/packages/ssl_match_hostname/_implementation.py
pip/backwardcompat/__init__.py
pip/commands/__init__.py
pip/commands/bundle.py
pip/commands/completion.py
pip/commands/freeze.py
pip/commands/help.py
pip/commands/install.py
pip/commands/list.py
pip/commands/search.py
pip/commands/show.py
pip/commands/uninstall.py
pip/commands/unzip.py
pip/commands/wheel.py
pip/commands/zip.py
pip/vcs/__init__.py
pip/vcs/bazaar.py
pip/vcs/git.py
pip/vcs/mercurial.py
pip/vcs/subversion.py
\ No newline at end of file
[console_scripts]
pip = pip:main
pip2.7 = pip:main
pip2 = pip:main
[testing]
pytest
virtualenv>=1.10
scripttest>=1.3
mock
\ No newline at end of file
######################## BEGIN LICENSE BLOCK ########################
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
__version__ = "1.0.3"
from sys import version_info
def detect(aBuf):
if ((version_info < (3, 0) and isinstance(aBuf, unicode)) or
(version_info >= (3, 0) and not isinstance(aBuf, bytes))):
raise ValueError('Expected a bytes object, not a unicode object')
from . import universaldetector
u = universaldetector.UniversalDetector()
u.reset()
u.feed(aBuf)
u.close()
return u.result
def _description_of(path):
"""Return a string describing the probable encoding of a file."""
from charade.universaldetector import UniversalDetector
u = UniversalDetector()
for line in open(path, 'rb'):
u.feed(line)
u.close()
result = u.result
if result['encoding']:
return '%s: %s with confidence %s' % (path,
result['encoding'],
result['confidence'])
else:
return '%s: no result' % path
def charade_cli():
"""
Script which takes one or more file paths and reports on their detected
encodings
Example::
% chardetect.py somefile someotherfile
somefile: windows-1252 with confidence 0.5
someotherfile: ascii with confidence 1.0
"""
from sys import argv
for path in argv[1:]:
print(_description_of(path))
\ No newline at end of file
'''
support ';python -m charade <file1> [file2] ...' package execution syntax (2.7+)
'''
from charade import charade_cli
charade_cli()
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is Mozilla Universal charset detector code.
#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 2001
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
# Mark Pilgrim - port to Python
# Shy Shalom - original C code
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
from .charsetprober import CharSetProber
from .constants import eNotMe
from .compat import wrap_ord
FREQ_CAT_NUM = 4
UDF = 0 # undefined
OTH = 1 # other
ASC = 2 # ascii capital letter
ASS = 3 # ascii small letter
ACV = 4 # accent capital vowel
ACO = 5 # accent capital other
ASV = 6 # accent small vowel
ASO = 7 # accent small other
CLASS_NUM = 8 # total classes
Latin1_CharToClass = (
OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH, # 00 - 07
OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH, # 08 - 0F
OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH, # 10 - 17
OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH, # 18 - 1F
OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH, # 20 - 27
OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH, # 28 - 2F
OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH, # 30 - 37
OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH, # 38 - 3F
OTH, ASC, ASC, ASC, ASC, ASC, ASC, ASC, # 40 - 47
ASC, ASC, ASC, ASC, ASC, ASC, ASC, ASC, # 48 - 4F
ASC, ASC, ASC, ASC, ASC, ASC, ASC, ASC, # 50 - 57
ASC, ASC, ASC, OTH, OTH, OTH, OTH, OTH, # 58 - 5F
OTH, ASS, ASS, ASS, ASS, ASS, ASS, ASS, # 60 - 67
ASS, ASS, ASS, ASS, ASS, ASS, ASS, ASS, # 68 - 6F
ASS, ASS, ASS, ASS, ASS, ASS, ASS, ASS, # 70 - 77
ASS, ASS, ASS, OTH, OTH, OTH, OTH, OTH, # 78 - 7F
OTH, UDF, OTH, ASO, OTH, OTH, OTH, OTH, # 80 - 87
OTH, OTH, ACO, OTH, ACO, UDF, ACO, UDF, # 88 - 8F
UDF, OTH, OTH, OTH, OTH, OTH, OTH, OTH, # 90 - 97
OTH, OTH, ASO, OTH, ASO, UDF, ASO, ACO, # 98 - 9F
OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH, # A0 - A7
OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH, # A8 - AF
OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH, # B0 - B7
OTH, OTH, OTH, OTH, OTH, OTH, OTH, OTH, # B8 - BF
ACV, ACV, ACV, ACV, ACV, ACV, ACO, ACO, # C0 - C7
ACV, ACV, ACV, ACV, ACV, ACV, ACV, ACV, # C8 - CF
ACO, ACO, ACV, ACV, ACV, ACV, ACV, OTH, # D0 - D7
ACV, ACV, ACV, ACV, ACV, ACO, ACO, ACO, # D8 - DF
ASV, ASV, ASV, ASV, ASV, ASV, ASO, ASO, # E0 - E7
ASV, ASV, ASV, ASV, ASV, ASV, ASV, ASV, # E8 - EF
ASO, ASO, ASV, ASV, ASV, ASV, ASV, OTH, # F0 - F7
ASV, ASV, ASV, ASV, ASV, ASO, ASO, ASO, # F8 - FF
)
# 0 : illegal
# 1 : very unlikely
# 2 : normal
# 3 : very likely
Latin1ClassModel = (
# UDF OTH ASC ASS ACV ACO ASV ASO
0, 0, 0, 0, 0, 0, 0, 0, # UDF
0, 3, 3, 3, 3, 3, 3, 3, # OTH
0, 3, 3, 3, 3, 3, 3, 3, # ASC
0, 3, 3, 3, 1, 1, 3, 3, # ASS
0, 3, 3, 3, 1, 2, 1, 2, # ACV
0, 3, 3, 3, 3, 3, 3, 3, # ACO
0, 3, 1, 3, 1, 1, 1, 3, # ASV
0, 3, 1, 3, 1, 1, 3, 3, # ASO
)
class Latin1Prober(CharSetProber):
def __init__(self):
CharSetProber.__init__(self)
self.reset()
def reset(self):
self._mLastCharClass = OTH
self._mFreqCounter = [0] * FREQ_CAT_NUM
CharSetProber.reset(self)
def get_charset_name(self):
return "windows-1252"
def feed(self, aBuf):
aBuf = self.filter_with_english_letters(aBuf)
for c in aBuf:
charClass = Latin1_CharToClass[wrap_ord(c)]
freq = Latin1ClassModel[(self._mLastCharClass * CLASS_NUM)
+ charClass]
if freq == 0:
self._mState = eNotMe
break
self._mFreqCounter[freq] += 1
self._mLastCharClass = charClass
return self.get_state()
def get_confidence(self):
if self.get_state() == eNotMe:
return 0.01
total = sum(self._mFreqCounter)
if total < 0.01:
confidence = 0.0
else:
confidence = ((float(self._mFreqCounter[3]) / total)
- (self._mFreqCounter[1] * 20.0 / total))
if confidence < 0.0:
confidence = 0.0
# lower the confidence of latin1 so that other more accurate
# detector can take priority.
confidence = confidence * 0.5
return confidence
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is Mozilla Universal charset detector code.
#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 2001
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
# Mark Pilgrim - port to Python
# Shy Shalom - original C code
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
from . import constants
import sys
import codecs
from .latin1prober import Latin1Prober # windows-1252
from .mbcsgroupprober import MBCSGroupProber # multi-byte character sets
from .sbcsgroupprober import SBCSGroupProber # single-byte character sets
from .escprober import EscCharSetProber # ISO-2122, etc.
import re
MINIMUM_THRESHOLD = 0.20
ePureAscii = 0
eEscAscii = 1
eHighbyte = 2
class UniversalDetector:
def __init__(self):
self._highBitDetector = re.compile(b'[\x80-\xFF]')
self._escDetector = re.compile(b'(\033|~{)')
self._mEscCharSetProber = None
self._mCharSetProbers = []
self.reset()
def reset(self):
self.result = {'encoding': None, 'confidence': 0.0}
self.done = False
self._mStart = True
self._mGotData = False
self._mInputState = ePureAscii
self._mLastChar = b''
if self._mEscCharSetProber:
self._mEscCharSetProber.reset()
for prober in self._mCharSetProbers:
prober.reset()
def feed(self, aBuf):
if self.done:
return
aLen = len(aBuf)
if not aLen:
return
if not self._mGotData:
# If the data starts with BOM, we know it is UTF
if aBuf[:3] == codecs.BOM:
# EF BB BF UTF-8 with BOM
self.result = {'encoding': "UTF-8", 'confidence': 1.0}
elif aBuf[:4] in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE):
# FF FE 00 00 UTF-32, little-endian BOM
# 00 00 FE FF UTF-32, big-endian BOM
self.result = {'encoding': "UTF-32", 'confidence': 1.0}
elif aBuf[:4] == b'\xFE\xFF\x00\x00':
# FE FF 00 00 UCS-4, unusual octet order BOM (3412)
self.result = {
'encoding': "X-ISO-10646-UCS-4-3412",
'confidence': 1.0
}
elif aBuf[:4] == b'\x00\x00\xFF\xFE':
# 00 00 FF FE UCS-4, unusual octet order BOM (2143)
self.result = {
'encoding': "X-ISO-10646-UCS-4-2143",
'confidence': 1.0
}
elif aBuf[:2] == codecs.BOM_LE or aBuf[:2] == codecs.BOM_BE:
# FF FE UTF-16, little endian BOM
# FE FF UTF-16, big endian BOM
self.result = {'encoding': "UTF-16", 'confidence': 1.0}
self._mGotData = True
if self.result['encoding'] and (self.result['confidence'] > 0.0):
self.done = True
return
if self._mInputState == ePureAscii:
if self._highBitDetector.search(aBuf):
self._mInputState = eHighbyte
elif ((self._mInputState == ePureAscii) and
self._escDetector.search(self._mLastChar + aBuf)):
self._mInputState = eEscAscii
self._mLastChar = aBuf[-1:]
if self._mInputState == eEscAscii:
if not self._mEscCharSetProber:
self._mEscCharSetProber = EscCharSetProber()
if self._mEscCharSetProber.feed(aBuf) == constants.eFoundIt:
self.result = {
'encoding': self._mEscCharSetProber.get_charset_name(),
'confidence': self._mEscCharSetProber.get_confidence()
}
self.done = True
elif self._mInputState == eHighbyte:
if not self._mCharSetProbers:
self._mCharSetProbers = [MBCSGroupProber(), SBCSGroupProber(),
Latin1Prober()]
for prober in self._mCharSetProbers:
if prober.feed(aBuf) == constants.eFoundIt:
self.result = {'encoding': prober.get_charset_name(),
'confidence': prober.get_confidence()}
self.done = True
break
def close(self):
if self.done:
return
if not self._mGotData:
if constants._debug:
sys.stderr.write('no data received!\n')
return
self.done = True
if self._mInputState == ePureAscii:
self.result = {'encoding': 'ascii', 'confidence': 1.0}
return self.result
if self._mInputState == eHighbyte:
proberConfidence = None
maxProberConfidence = 0.0
maxProber = None
for prober in self._mCharSetProbers:
if not prober:
continue
proberConfidence = prober.get_confidence()
if proberConfidence > maxProberConfidence:
maxProberConfidence = proberConfidence
maxProber = prober
if maxProber and (maxProberConfidence > MINIMUM_THRESHOLD):
self.result = {'encoding': maxProber.get_charset_name(),
'confidence': maxProber.get_confidence()}
return self.result
if constants._debug:
sys.stderr.write('no probers hit minimum threshhold\n')
for prober in self._mCharSetProbers[0].mProbers:
if not prober:
continue
sys.stderr.write('%s confidence = %s\n' %
(prober.get_charset_name(),
prober.get_confidence()))
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is Mozilla Communicator client code.
#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 1998
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
# Mark Pilgrim - port to Python
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
from .mbcharsetprober import MultiByteCharSetProber
from .codingstatemachine import CodingStateMachine
from .chardistribution import Big5DistributionAnalysis
from .mbcssm import Big5SMModel
class Big5Prober(MultiByteCharSetProber):
def __init__(self):
MultiByteCharSetProber.__init__(self)
self._mCodingSM = CodingStateMachine(Big5SMModel)
self._mDistributionAnalyzer = Big5DistributionAnalysis()
self.reset()
def get_charset_name(self):
return "Big5"
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is Mozilla Communicator client code.
#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 1998
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
# Mark Pilgrim - port to Python
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
from .euctwfreq import (EUCTWCharToFreqOrder, EUCTW_TABLE_SIZE,
EUCTW_TYPICAL_DISTRIBUTION_RATIO)
from .euckrfreq import (EUCKRCharToFreqOrder, EUCKR_TABLE_SIZE,
EUCKR_TYPICAL_DISTRIBUTION_RATIO)
from .gb2312freq import (GB2312CharToFreqOrder, GB2312_TABLE_SIZE,
GB2312_TYPICAL_DISTRIBUTION_RATIO)
from .big5freq import (Big5CharToFreqOrder, BIG5_TABLE_SIZE,
BIG5_TYPICAL_DISTRIBUTION_RATIO)
from .jisfreq import (JISCharToFreqOrder, JIS_TABLE_SIZE,
JIS_TYPICAL_DISTRIBUTION_RATIO)
from .compat import wrap_ord
ENOUGH_DATA_THRESHOLD = 1024
SURE_YES = 0.99
SURE_NO = 0.01
MINIMUM_DATA_THRESHOLD = 3
class CharDistributionAnalysis:
def __init__(self):
# Mapping table to get frequency order from char order (get from
# GetOrder())
self._mCharToFreqOrder = None
self._mTableSize = None # Size of above table
# This is a constant value which varies from language to language,
# used in calculating confidence. See
# http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
# for further detail.
self._mTypicalDistributionRatio = None
self.reset()
def reset(self):
"""reset analyser, clear any state"""
# If this flag is set to True, detection is done and conclusion has
# been made
self._mDone = False
self._mTotalChars = 0 # Total characters encountered
# The number of characters whose frequency order is less than 512
self._mFreqChars = 0
def feed(self, aBuf, aCharLen):
"""feed a character with known length"""
if aCharLen == 2:
# we only care about 2-bytes character in our distribution analysis
order = self.get_order(aBuf)
else:
order = -1
if order >= 0:
self._mTotalChars += 1
# order is valid
if order < self._mTableSize:
if 512 > self._mCharToFreqOrder[order]:
self._mFreqChars += 1
def get_confidence(self):
"""return confidence based on existing data"""
# if we didn't receive any character in our consideration range,
# return negative answer
if self._mTotalChars <= 0 or self._mFreqChars <= MINIMUM_DATA_THRESHOLD:
return SURE_NO
if self._mTotalChars != self._mFreqChars:
r = (self._mFreqChars / ((self._mTotalChars - self._mFreqChars)
* self._mTypicalDistributionRatio))
if r < SURE_YES:
return r
# normalize confidence (we don't want to be 100% sure)
return SURE_YES
def got_enough_data(self):
# It is not necessary to receive all data to draw conclusion.
# For charset detection, certain amount of data is enough
return self._mTotalChars > ENOUGH_DATA_THRESHOLD
def get_order(self, aBuf):
# We do not handle characters based on the original encoding string,
# but convert this encoding string to a number, here called order.
# This allows multiple encodings of a language to share one frequency
# table.
return -1
class EUCTWDistributionAnalysis(CharDistributionAnalysis):
def __init__(self):
CharDistributionAnalysis.__init__(self)
self._mCharToFreqOrder = EUCTWCharToFreqOrder
self._mTableSize = EUCTW_TABLE_SIZE
self._mTypicalDistributionRatio = EUCTW_TYPICAL_DISTRIBUTION_RATIO
def get_order(self, aBuf):
# for euc-TW encoding, we are interested
# first byte range: 0xc4 -- 0xfe
# second byte range: 0xa1 -- 0xfe
# no validation needed here. State machine has done that
first_char = wrap_ord(aBuf[0])
if first_char >= 0xC4:
return 94 * (first_char - 0xC4) + wrap_ord(aBuf[1]) - 0xA1
else:
return -1
class EUCKRDistributionAnalysis(CharDistributionAnalysis):
def __init__(self):
CharDistributionAnalysis.__init__(self)
self._mCharToFreqOrder = EUCKRCharToFreqOrder
self._mTableSize = EUCKR_TABLE_SIZE
self._mTypicalDistributionRatio = EUCKR_TYPICAL_DISTRIBUTION_RATIO
def get_order(self, aBuf):
# for euc-KR encoding, we are interested
# first byte range: 0xb0 -- 0xfe
# second byte range: 0xa1 -- 0xfe
# no validation needed here. State machine has done that
first_char = wrap_ord(aBuf[0])
if first_char >= 0xB0:
return 94 * (first_char - 0xB0) + wrap_ord(aBuf[1]) - 0xA1
else:
return -1
class GB2312DistributionAnalysis(CharDistributionAnalysis):
def __init__(self):
CharDistributionAnalysis.__init__(self)
self._mCharToFreqOrder = GB2312CharToFreqOrder
self._mTableSize = GB2312_TABLE_SIZE
self._mTypicalDistributionRatio = GB2312_TYPICAL_DISTRIBUTION_RATIO
def get_order(self, aBuf):
# for GB2312 encoding, we are interested
# first byte range: 0xb0 -- 0xfe
# second byte range: 0xa1 -- 0xfe
# no validation needed here. State machine has done that
first_char, second_char = wrap_ord(aBuf[0]), wrap_ord(aBuf[1])
if (first_char >= 0xB0) and (second_char >= 0xA1):
return 94 * (first_char - 0xB0) + second_char - 0xA1
else:
return -1
class Big5DistributionAnalysis(CharDistributionAnalysis):
def __init__(self):
CharDistributionAnalysis.__init__(self)
self._mCharToFreqOrder = Big5CharToFreqOrder
self._mTableSize = BIG5_TABLE_SIZE
self._mTypicalDistributionRatio = BIG5_TYPICAL_DISTRIBUTION_RATIO
def get_order(self, aBuf):
# for big5 encoding, we are interested
# first byte range: 0xa4 -- 0xfe
# second byte range: 0x40 -- 0x7e , 0xa1 -- 0xfe
# no validation needed here. State machine has done that
first_char, second_char = wrap_ord(aBuf[0]), wrap_ord(aBuf[1])
if first_char >= 0xA4:
if second_char >= 0xA1:
return 157 * (first_char - 0xA4) + second_char - 0xA1 + 63
else:
return 157 * (first_char - 0xA4) + second_char - 0x40
else:
return -1
class SJISDistributionAnalysis(CharDistributionAnalysis):
def __init__(self):
CharDistributionAnalysis.__init__(self)
self._mCharToFreqOrder = JISCharToFreqOrder
self._mTableSize = JIS_TABLE_SIZE
self._mTypicalDistributionRatio = JIS_TYPICAL_DISTRIBUTION_RATIO
def get_order(self, aBuf):
# for sjis encoding, we are interested
# first byte range: 0x81 -- 0x9f , 0xe0 -- 0xfe
# second byte range: 0x40 -- 0x7e, 0x81 -- oxfe
# no validation needed here. State machine has done that
first_char, second_char = wrap_ord(aBuf[0]), wrap_ord(aBuf[1])
if (first_char >= 0x81) and (first_char <= 0x9F):
order = 188 * (first_char - 0x81)
elif (first_char >= 0xE0) and (first_char <= 0xEF):
order = 188 * (first_char - 0xE0 + 31)
else:
return -1
order = order + second_char - 0x40
if second_char > 0x7F:
order = -1
return order
class EUCJPDistributionAnalysis(CharDistributionAnalysis):
def __init__(self):
CharDistributionAnalysis.__init__(self)
self._mCharToFreqOrder = JISCharToFreqOrder
self._mTableSize = JIS_TABLE_SIZE
self._mTypicalDistributionRatio = JIS_TYPICAL_DISTRIBUTION_RATIO
def get_order(self, aBuf):
# for euc-JP encoding, we are interested
# first byte range: 0xa0 -- 0xfe
# second byte range: 0xa1 -- 0xfe
# no validation needed here. State machine has done that
char = wrap_ord(aBuf[0])
if char >= 0xA0:
return 94 * (char - 0xA1) + wrap_ord(aBuf[1]) - 0xa1
else:
return -1
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is Mozilla Communicator client code.
#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 1998
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
# Mark Pilgrim - port to Python
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
from . import constants
import sys
from .charsetprober import CharSetProber
class CharSetGroupProber(CharSetProber):
def __init__(self):
CharSetProber.__init__(self)
self._mActiveNum = 0
self._mProbers = []
self._mBestGuessProber = None
def reset(self):
CharSetProber.reset(self)
self._mActiveNum = 0
for prober in self._mProbers:
if prober:
prober.reset()
prober.active = True
self._mActiveNum += 1
self._mBestGuessProber = None
def get_charset_name(self):
if not self._mBestGuessProber:
self.get_confidence()
if not self._mBestGuessProber:
return None
# self._mBestGuessProber = self._mProbers[0]
return self._mBestGuessProber.get_charset_name()
def feed(self, aBuf):
for prober in self._mProbers:
if not prober:
continue
if not prober.active:
continue
st = prober.feed(aBuf)
if not st:
continue
if st == constants.eFoundIt:
self._mBestGuessProber = prober
return self.get_state()
elif st == constants.eNotMe:
prober.active = False
self._mActiveNum -= 1
if self._mActiveNum <= 0:
self._mState = constants.eNotMe
return self.get_state()
return self.get_state()
def get_confidence(self):
st = self.get_state()
if st == constants.eFoundIt:
return 0.99
elif st == constants.eNotMe:
return 0.01
bestConf = 0.0
self._mBestGuessProber = None
for prober in self._mProbers:
if not prober:
continue
if not prober.active:
if constants._debug:
sys.stderr.write(prober.get_charset_name()
+ ' not active\n')
continue
cf = prober.get_confidence()
if constants._debug:
sys.stderr.write('%s confidence = %s\n' %
(prober.get_charset_name(), cf))
if bestConf < cf:
bestConf = cf
self._mBestGuessProber = prober
if not self._mBestGuessProber:
return 0.0
return bestConf
# else:
# self._mBestGuessProber = self._mProbers[0]
# return self._mBestGuessProber.get_confidence()
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is Mozilla Universal charset detector code.
#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 2001
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
# Mark Pilgrim - port to Python
# Shy Shalom - original C code
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
from . import constants
import re
class CharSetProber:
def __init__(self):
pass
def reset(self):
self._mState = constants.eDetecting
def get_charset_name(self):
return None
def feed(self, aBuf):
pass
def get_state(self):
return self._mState
def get_confidence(self):
return 0.0
def filter_high_bit_only(self, aBuf):
aBuf = re.sub(b'([\x00-\x7F])+', b' ', aBuf)
return aBuf
def filter_without_english_letters(self, aBuf):
aBuf = re.sub(b'([A-Za-z])+', b' ', aBuf)
return aBuf
def filter_with_english_letters(self, aBuf):
# TODO
return aBuf
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is mozilla.org code.
#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 1998
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
# Mark Pilgrim - port to Python
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
from .constants import eStart
from .compat import wrap_ord
class CodingStateMachine:
def __init__(self, sm):
self._mModel = sm
self._mCurrentBytePos = 0
self._mCurrentCharLen = 0
self.reset()
def reset(self):
self._mCurrentState = eStart
def next_state(self, c):
# for each byte we get its class
# if it is first byte, we also get byte length
# PY3K: aBuf is a byte stream, so c is an int, not a byte
byteCls = self._mModel['classTable'][wrap_ord(c)]
if self._mCurrentState == eStart:
self._mCurrentBytePos = 0
self._mCurrentCharLen = self._mModel['charLenTable'][byteCls]
# from byte's class and stateTable, we get its next state
curr_state = (self._mCurrentState * self._mModel['classFactor']
+ byteCls)
self._mCurrentState = self._mModel['stateTable'][curr_state]
self._mCurrentBytePos += 1
return self._mCurrentState
def get_current_charlen(self):
return self._mCurrentCharLen
def get_coding_state_machine(self):
return self._mModel['name']
######################## BEGIN LICENSE BLOCK ########################
# Contributor(s):
# Ian Cordasco - port to Python
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
import sys
if sys.version_info < (3, 0):
base_str = (str, unicode)
else:
base_str = (bytes, str)
def wrap_ord(a):
if sys.version_info < (3, 0) and isinstance(a, base_str):
return ord(a)
else:
return a
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is Mozilla Universal charset detector code.
#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 2001
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
# Mark Pilgrim - port to Python
# Shy Shalom - original C code
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
_debug = 0
eDetecting = 0
eFoundIt = 1
eNotMe = 2
eStart = 0
eError = 1
eItsMe = 2
SHORTCUT_THRESHOLD = 0.95
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is mozilla.org code.
#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 1998
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
# Mark Pilgrim - port to Python
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
from .mbcharsetprober import MultiByteCharSetProber
from .codingstatemachine import CodingStateMachine
from .chardistribution import EUCKRDistributionAnalysis
from .mbcssm import CP949SMModel
class CP949Prober(MultiByteCharSetProber):
def __init__(self):
MultiByteCharSetProber.__init__(self)
self._mCodingSM = CodingStateMachine(CP949SMModel)
# NOTE: CP949 is a superset of EUC-KR, so the distribution should be
# not different.
self._mDistributionAnalyzer = EUCKRDistributionAnalysis()
self.reset()
def get_charset_name(self):
return "CP949"
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is mozilla.org code.
#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 1998
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
# Mark Pilgrim - port to Python
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
from . import constants
from .escsm import (HZSMModel, ISO2022CNSMModel, ISO2022JPSMModel,
ISO2022KRSMModel)
from .charsetprober import CharSetProber
from .codingstatemachine import CodingStateMachine
from .compat import wrap_ord
class EscCharSetProber(CharSetProber):
def __init__(self):
CharSetProber.__init__(self)
self._mCodingSM = [
CodingStateMachine(HZSMModel),
CodingStateMachine(ISO2022CNSMModel),
CodingStateMachine(ISO2022JPSMModel),
CodingStateMachine(ISO2022KRSMModel)
]
self.reset()
def reset(self):
CharSetProber.reset(self)
for codingSM in self._mCodingSM:
if not codingSM:
continue
codingSM.active = True
codingSM.reset()
self._mActiveSM = len(self._mCodingSM)
self._mDetectedCharset = None
def get_charset_name(self):
return self._mDetectedCharset
def get_confidence(self):
if self._mDetectedCharset:
return 0.99
else:
return 0.00
def feed(self, aBuf):
for c in aBuf:
# PY3K: aBuf is a byte array, so c is an int, not a byte
for codingSM in self._mCodingSM:
if not codingSM:
continue
if not codingSM.active:
continue
codingState = codingSM.next_state(wrap_ord(c))
if codingState == constants.eError:
codingSM.active = False
self._mActiveSM -= 1
if self._mActiveSM <= 0:
self._mState = constants.eNotMe
return self.get_state()
elif codingState == constants.eItsMe:
self._mState = constants.eFoundIt
self._mDetectedCharset = codingSM.get_coding_state_machine() # nopep8
return self.get_state()
return self.get_state()
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is mozilla.org code.
#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 1998
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
# Mark Pilgrim - port to Python
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
from .constants import eStart, eError, eItsMe
HZ_cls = (
1,0,0,0,0,0,0,0, # 00 - 07
0,0,0,0,0,0,0,0, # 08 - 0f
0,0,0,0,0,0,0,0, # 10 - 17
0,0,0,1,0,0,0,0, # 18 - 1f
0,0,0,0,0,0,0,0, # 20 - 27
0,0,0,0,0,0,0,0, # 28 - 2f
0,0,0,0,0,0,0,0, # 30 - 37
0,0,0,0,0,0,0,0, # 38 - 3f
0,0,0,0,0,0,0,0, # 40 - 47
0,0,0,0,0,0,0,0, # 48 - 4f
0,0,0,0,0,0,0,0, # 50 - 57
0,0,0,0,0,0,0,0, # 58 - 5f
0,0,0,0,0,0,0,0, # 60 - 67
0,0,0,0,0,0,0,0, # 68 - 6f
0,0,0,0,0,0,0,0, # 70 - 77
0,0,0,4,0,5,2,0, # 78 - 7f
1,1,1,1,1,1,1,1, # 80 - 87
1,1,1,1,1,1,1,1, # 88 - 8f
1,1,1,1,1,1,1,1, # 90 - 97
1,1,1,1,1,1,1,1, # 98 - 9f
1,1,1,1,1,1,1,1, # a0 - a7
1,1,1,1,1,1,1,1, # a8 - af
1,1,1,1,1,1,1,1, # b0 - b7
1,1,1,1,1,1,1,1, # b8 - bf
1,1,1,1,1,1,1,1, # c0 - c7
1,1,1,1,1,1,1,1, # c8 - cf
1,1,1,1,1,1,1,1, # d0 - d7
1,1,1,1,1,1,1,1, # d8 - df
1,1,1,1,1,1,1,1, # e0 - e7
1,1,1,1,1,1,1,1, # e8 - ef
1,1,1,1,1,1,1,1, # f0 - f7
1,1,1,1,1,1,1,1, # f8 - ff
)
HZ_st = (
eStart,eError, 3,eStart,eStart,eStart,eError,eError,# 00-07
eError,eError,eError,eError,eItsMe,eItsMe,eItsMe,eItsMe,# 08-0f
eItsMe,eItsMe,eError,eError,eStart,eStart, 4,eError,# 10-17
5,eError, 6,eError, 5, 5, 4,eError,# 18-1f
4,eError, 4, 4, 4,eError, 4,eError,# 20-27
4,eItsMe,eStart,eStart,eStart,eStart,eStart,eStart,# 28-2f
)
HZCharLenTable = (0, 0, 0, 0, 0, 0)
HZSMModel = {'classTable': HZ_cls,
'classFactor': 6,
'stateTable': HZ_st,
'charLenTable': HZCharLenTable,
'name': "HZ-GB-2312"}
ISO2022CN_cls = (
2,0,0,0,0,0,0,0, # 00 - 07
0,0,0,0,0,0,0,0, # 08 - 0f
0,0,0,0,0,0,0,0, # 10 - 17
0,0,0,1,0,0,0,0, # 18 - 1f
0,0,0,0,0,0,0,0, # 20 - 27
0,3,0,0,0,0,0,0, # 28 - 2f
0,0,0,0,0,0,0,0, # 30 - 37
0,0,0,0,0,0,0,0, # 38 - 3f
0,0,0,4,0,0,0,0, # 40 - 47
0,0,0,0,0,0,0,0, # 48 - 4f
0,0,0,0,0,0,0,0, # 50 - 57
0,0,0,0,0,0,0,0, # 58 - 5f
0,0,0,0,0,0,0,0, # 60 - 67
0,0,0,0,0,0,0,0, # 68 - 6f
0,0,0,0,0,0,0,0, # 70 - 77
0,0,0,0,0,0,0,0, # 78 - 7f
2,2,2,2,2,2,2,2, # 80 - 87
2,2,2,2,2,2,2,2, # 88 - 8f
2,2,2,2,2,2,2,2, # 90 - 97
2,2,2,2,2,2,2,2, # 98 - 9f
2,2,2,2,2,2,2,2, # a0 - a7
2,2,2,2,2,2,2,2, # a8 - af
2,2,2,2,2,2,2,2, # b0 - b7
2,2,2,2,2,2,2,2, # b8 - bf
2,2,2,2,2,2,2,2, # c0 - c7
2,2,2,2,2,2,2,2, # c8 - cf
2,2,2,2,2,2,2,2, # d0 - d7
2,2,2,2,2,2,2,2, # d8 - df
2,2,2,2,2,2,2,2, # e0 - e7
2,2,2,2,2,2,2,2, # e8 - ef
2,2,2,2,2,2,2,2, # f0 - f7
2,2,2,2,2,2,2,2, # f8 - ff
)
ISO2022CN_st = (
eStart, 3,eError,eStart,eStart,eStart,eStart,eStart,# 00-07
eStart,eError,eError,eError,eError,eError,eError,eError,# 08-0f
eError,eError,eItsMe,eItsMe,eItsMe,eItsMe,eItsMe,eItsMe,# 10-17
eItsMe,eItsMe,eItsMe,eError,eError,eError, 4,eError,# 18-1f
eError,eError,eError,eItsMe,eError,eError,eError,eError,# 20-27
5, 6,eError,eError,eError,eError,eError,eError,# 28-2f
eError,eError,eError,eItsMe,eError,eError,eError,eError,# 30-37
eError,eError,eError,eError,eError,eItsMe,eError,eStart,# 38-3f
)
ISO2022CNCharLenTable = (0, 0, 0, 0, 0, 0, 0, 0, 0)
ISO2022CNSMModel = {'classTable': ISO2022CN_cls,
'classFactor': 9,
'stateTable': ISO2022CN_st,
'charLenTable': ISO2022CNCharLenTable,
'name': "ISO-2022-CN"}
ISO2022JP_cls = (
2,0,0,0,0,0,0,0, # 00 - 07
0,0,0,0,0,0,2,2, # 08 - 0f
0,0,0,0,0,0,0,0, # 10 - 17
0,0,0,1,0,0,0,0, # 18 - 1f
0,0,0,0,7,0,0,0, # 20 - 27
3,0,0,0,0,0,0,0, # 28 - 2f
0,0,0,0,0,0,0,0, # 30 - 37
0,0,0,0,0,0,0,0, # 38 - 3f
6,0,4,0,8,0,0,0, # 40 - 47
0,9,5,0,0,0,0,0, # 48 - 4f
0,0,0,0,0,0,0,0, # 50 - 57
0,0,0,0,0,0,0,0, # 58 - 5f
0,0,0,0,0,0,0,0, # 60 - 67
0,0,0,0,0,0,0,0, # 68 - 6f
0,0,0,0,0,0,0,0, # 70 - 77
0,0,0,0,0,0,0,0, # 78 - 7f
2,2,2,2,2,2,2,2, # 80 - 87
2,2,2,2,2,2,2,2, # 88 - 8f
2,2,2,2,2,2,2,2, # 90 - 97
2,2,2,2,2,2,2,2, # 98 - 9f
2,2,2,2,2,2,2,2, # a0 - a7
2,2,2,2,2,2,2,2, # a8 - af
2,2,2,2,2,2,2,2, # b0 - b7
2,2,2,2,2,2,2,2, # b8 - bf
2,2,2,2,2,2,2,2, # c0 - c7
2,2,2,2,2,2,2,2, # c8 - cf
2,2,2,2,2,2,2,2, # d0 - d7
2,2,2,2,2,2,2,2, # d8 - df
2,2,2,2,2,2,2,2, # e0 - e7
2,2,2,2,2,2,2,2, # e8 - ef
2,2,2,2,2,2,2,2, # f0 - f7
2,2,2,2,2,2,2,2, # f8 - ff
)
ISO2022JP_st = (
eStart, 3,eError,eStart,eStart,eStart,eStart,eStart,# 00-07
eStart,eStart,eError,eError,eError,eError,eError,eError,# 08-0f
eError,eError,eError,eError,eItsMe,eItsMe,eItsMe,eItsMe,# 10-17
eItsMe,eItsMe,eItsMe,eItsMe,eItsMe,eItsMe,eError,eError,# 18-1f
eError, 5,eError,eError,eError, 4,eError,eError,# 20-27
eError,eError,eError, 6,eItsMe,eError,eItsMe,eError,# 28-2f
eError,eError,eError,eError,eError,eError,eItsMe,eItsMe,# 30-37
eError,eError,eError,eItsMe,eError,eError,eError,eError,# 38-3f
eError,eError,eError,eError,eItsMe,eError,eStart,eStart,# 40-47
)
ISO2022JPCharLenTable = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
ISO2022JPSMModel = {'classTable': ISO2022JP_cls,
'classFactor': 10,
'stateTable': ISO2022JP_st,
'charLenTable': ISO2022JPCharLenTable,
'name': "ISO-2022-JP"}
ISO2022KR_cls = (
2,0,0,0,0,0,0,0, # 00 - 07
0,0,0,0,0,0,0,0, # 08 - 0f
0,0,0,0,0,0,0,0, # 10 - 17
0,0,0,1,0,0,0,0, # 18 - 1f
0,0,0,0,3,0,0,0, # 20 - 27
0,4,0,0,0,0,0,0, # 28 - 2f
0,0,0,0,0,0,0,0, # 30 - 37
0,0,0,0,0,0,0,0, # 38 - 3f
0,0,0,5,0,0,0,0, # 40 - 47
0,0,0,0,0,0,0,0, # 48 - 4f
0,0,0,0,0,0,0,0, # 50 - 57
0,0,0,0,0,0,0,0, # 58 - 5f
0,0,0,0,0,0,0,0, # 60 - 67
0,0,0,0,0,0,0,0, # 68 - 6f
0,0,0,0,0,0,0,0, # 70 - 77
0,0,0,0,0,0,0,0, # 78 - 7f
2,2,2,2,2,2,2,2, # 80 - 87
2,2,2,2,2,2,2,2, # 88 - 8f
2,2,2,2,2,2,2,2, # 90 - 97
2,2,2,2,2,2,2,2, # 98 - 9f
2,2,2,2,2,2,2,2, # a0 - a7
2,2,2,2,2,2,2,2, # a8 - af
2,2,2,2,2,2,2,2, # b0 - b7
2,2,2,2,2,2,2,2, # b8 - bf
2,2,2,2,2,2,2,2, # c0 - c7
2,2,2,2,2,2,2,2, # c8 - cf
2,2,2,2,2,2,2,2, # d0 - d7
2,2,2,2,2,2,2,2, # d8 - df
2,2,2,2,2,2,2,2, # e0 - e7
2,2,2,2,2,2,2,2, # e8 - ef
2,2,2,2,2,2,2,2, # f0 - f7
2,2,2,2,2,2,2,2, # f8 - ff
)
ISO2022KR_st = (
eStart, 3,eError,eStart,eStart,eStart,eError,eError,# 00-07
eError,eError,eError,eError,eItsMe,eItsMe,eItsMe,eItsMe,# 08-0f
eItsMe,eItsMe,eError,eError,eError, 4,eError,eError,# 10-17
eError,eError,eError,eError, 5,eError,eError,eError,# 18-1f
eError,eError,eError,eItsMe,eStart,eStart,eStart,eStart,# 20-27
)
ISO2022KRCharLenTable = (0, 0, 0, 0, 0, 0)
ISO2022KRSMModel = {'classTable': ISO2022KR_cls,
'classFactor': 6,
'stateTable': ISO2022KR_st,
'charLenTable': ISO2022KRCharLenTable,
'name': "ISO-2022-KR"}
# flake8: noqa
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is mozilla.org code.
#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 1998
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
# Mark Pilgrim - port to Python
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
import sys
from . import constants
from .mbcharsetprober import MultiByteCharSetProber
from .codingstatemachine import CodingStateMachine
from .chardistribution import EUCJPDistributionAnalysis
from .jpcntx import EUCJPContextAnalysis
from .mbcssm import EUCJPSMModel
class EUCJPProber(MultiByteCharSetProber):
def __init__(self):
MultiByteCharSetProber.__init__(self)
self._mCodingSM = CodingStateMachine(EUCJPSMModel)
self._mDistributionAnalyzer = EUCJPDistributionAnalysis()
self._mContextAnalyzer = EUCJPContextAnalysis()
self.reset()
def reset(self):
MultiByteCharSetProber.reset(self)
self._mContextAnalyzer.reset()
def get_charset_name(self):
return "EUC-JP"
def feed(self, aBuf):
aLen = len(aBuf)
for i in range(0, aLen):
# PY3K: aBuf is a byte array, so aBuf[i] is an int, not a byte
codingState = self._mCodingSM.next_state(aBuf[i])
if codingState == constants.eError:
if constants._debug:
sys.stderr.write(self.get_charset_name()
+ ' prober hit error at byte ' + str(i)
+ '\n')
self._mState = constants.eNotMe
break
elif codingState == constants.eItsMe:
self._mState = constants.eFoundIt
break
elif codingState == constants.eStart:
charLen = self._mCodingSM.get_current_charlen()
if i == 0:
self._mLastChar[1] = aBuf[0]
self._mContextAnalyzer.feed(self._mLastChar, charLen)
self._mDistributionAnalyzer.feed(self._mLastChar, charLen)
else:
self._mContextAnalyzer.feed(aBuf[i - 1:i + 1], charLen)
self._mDistributionAnalyzer.feed(aBuf[i - 1:i + 1],
charLen)
self._mLastChar[0] = aBuf[aLen - 1]
if self.get_state() == constants.eDetecting:
if (self._mContextAnalyzer.got_enough_data() and
(self.get_confidence() > constants.SHORTCUT_THRESHOLD)):
self._mState = constants.eFoundIt
return self.get_state()
def get_confidence(self):
contxtCf = self._mContextAnalyzer.get_confidence()
distribCf = self._mDistributionAnalyzer.get_confidence()
return max(contxtCf, distribCf)
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is mozilla.org code.
#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 1998
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
# Mark Pilgrim - port to Python
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
from .mbcharsetprober import MultiByteCharSetProber
from .codingstatemachine import CodingStateMachine
from .chardistribution import EUCKRDistributionAnalysis
from .mbcssm import EUCKRSMModel
class EUCKRProber(MultiByteCharSetProber):
def __init__(self):
MultiByteCharSetProber.__init__(self)
self._mCodingSM = CodingStateMachine(EUCKRSMModel)
self._mDistributionAnalyzer = EUCKRDistributionAnalysis()
self.reset()
def get_charset_name(self):
return "EUC-KR"
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is mozilla.org code.
#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 1998
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
# Mark Pilgrim - port to Python
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
from .mbcharsetprober import MultiByteCharSetProber
from .codingstatemachine import CodingStateMachine
from .chardistribution import EUCTWDistributionAnalysis
from .mbcssm import EUCTWSMModel
class EUCTWProber(MultiByteCharSetProber):
def __init__(self):
MultiByteCharSetProber.__init__(self)
self._mCodingSM = CodingStateMachine(EUCTWSMModel)
self._mDistributionAnalyzer = EUCTWDistributionAnalysis()
self.reset()
def get_charset_name(self):
return "EUC-TW"
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is mozilla.org code.
#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 1998
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
# Mark Pilgrim - port to Python
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
from .mbcharsetprober import MultiByteCharSetProber
from .codingstatemachine import CodingStateMachine
from .chardistribution import GB2312DistributionAnalysis
from .mbcssm import GB2312SMModel
class GB2312Prober(MultiByteCharSetProber):
def __init__(self):
MultiByteCharSetProber.__init__(self)
self._mCodingSM = CodingStateMachine(GB2312SMModel)
self._mDistributionAnalyzer = GB2312DistributionAnalysis()
self.reset()
def get_charset_name(self):
return "GB2312"
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is Mozilla Universal charset detector code.
#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 2001
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
# Mark Pilgrim - port to Python
# Shy Shalom - original C code
# Proofpoint, Inc.
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
import sys
from . import constants
from .charsetprober import CharSetProber
class MultiByteCharSetProber(CharSetProber):
def __init__(self):
CharSetProber.__init__(self)
self._mDistributionAnalyzer = None
self._mCodingSM = None
self._mLastChar = [0, 0]
def reset(self):
CharSetProber.reset(self)
if self._mCodingSM:
self._mCodingSM.reset()
if self._mDistributionAnalyzer:
self._mDistributionAnalyzer.reset()
self._mLastChar = [0, 0]
def get_charset_name(self):
pass
def feed(self, aBuf):
aLen = len(aBuf)
for i in range(0, aLen):
codingState = self._mCodingSM.next_state(aBuf[i])
if codingState == constants.eError:
if constants._debug:
sys.stderr.write(self.get_charset_name()
+ ' prober hit error at byte ' + str(i)
+ '\n')
self._mState = constants.eNotMe
break
elif codingState == constants.eItsMe:
self._mState = constants.eFoundIt
break
elif codingState == constants.eStart:
charLen = self._mCodingSM.get_current_charlen()
if i == 0:
self._mLastChar[1] = aBuf[0]
self._mDistributionAnalyzer.feed(self._mLastChar, charLen)
else:
self._mDistributionAnalyzer.feed(aBuf[i - 1:i + 1],
charLen)
self._mLastChar[0] = aBuf[aLen - 1]
if self.get_state() == constants.eDetecting:
if (self._mDistributionAnalyzer.got_enough_data() and
(self.get_confidence() > constants.SHORTCUT_THRESHOLD)):
self._mState = constants.eFoundIt
return self.get_state()
def get_confidence(self):
return self._mDistributionAnalyzer.get_confidence()
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is Mozilla Universal charset detector code.
#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 2001
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
# Mark Pilgrim - port to Python
# Shy Shalom - original C code
# Proofpoint, Inc.
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
from .charsetgroupprober import CharSetGroupProber
from .utf8prober import UTF8Prober
from .sjisprober import SJISProber
from .eucjpprober import EUCJPProber
from .gb2312prober import GB2312Prober
from .euckrprober import EUCKRProber
from .cp949prober import CP949Prober
from .big5prober import Big5Prober
from .euctwprober import EUCTWProber
class MBCSGroupProber(CharSetGroupProber):
def __init__(self):
CharSetGroupProber.__init__(self)
self._mProbers = [
UTF8Prober(),
SJISProber(),
EUCJPProber(),
GB2312Prober(),
EUCKRProber(),
CP949Prober(),
Big5Prober(),
EUCTWProber()
]
self.reset()
**1.5.6 (2014-05-16)**
* Upgrade requests to 2.3.0 to fix an issue with proxies on Python 3.4.1
(PR #1821).
**1.5.5 (2014-05-03)**
* Fixes #1632. Uninstall issues on debianized pypy, specifically issues with
setuptools upgrades. (PR #1743)
* Update documentation to point at https://bootstrap.pypa.io/get-pip.py for
bootstrapping pip.
* Update docs to point to https://pip.pypa.io/
* Upgrade the bundled projects (distlib==0.1.8, html5lib==1.0b3, six==1.6.1,
colorama==0.3.1, setuptools==3.4.4).
**1.5.4 (2014-02-21)** **1.5.4 (2014-02-21)**
......
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment