Defaultsettings

snoop.defaultsettings #

Default settings file.

This file is imported both in the Docker image and in the testing configuration.

Attributes#

ALLOWED_HOSTS #

List of domains to allow requests for.

Loaded from environment variable SNOOP_HOSTNAME, default is * (no restrictions).
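
The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the helper name `allowed_hosts_from_env` is hypothetical, and wrapping the value in a single-element list is an assumption.

```python
import os


def allowed_hosts_from_env(environ=os.environ):
    """Return a one-element host list from SNOOP_HOSTNAME, or ["*"] if it is unset.

    Hypothetical helper: the real settings file may build the list differently.
    """
    return [environ.get("SNOOP_HOSTNAME", "*")]


ALLOWED_HOSTS = allowed_hosts_from_env()
```

Passing the environment as a parameter keeps the helper easy to exercise without changing the real process environment.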

ALWAYS_QUEUE_NOW #

Setting this to True disables the Task queueing system and executes Task functions in the foreground. Used for testing.

base_dir #

Helper pointing to root dir of repository.

CELERY_DB_REUSE_MAX #

Instruct Celery to not reuse database connections.

CHILD_QUEUE_LIMIT #

Limit for queueing large numbers of child tasks.

DATABASE_ROUTERS #

Activate our database router under snoop.data.collections.CollectionsRouter.

DATABASES #

Django databases configuration.

Gets populated from the SNOOP_COLLECTIONS constant at import time.
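
One way such an import-time expansion could look is sketched below. The structure of SNOOP_COLLECTIONS is not documented on this page, so the list-of-dicts shape with a `name` key, the `collection_` alias prefix, the helper name `build_databases` and all example values are assumptions made purely for illustration.

```python
def build_databases(default_db, collections):
    """Return a Django-style DATABASES dict with one extra entry per collection."""
    databases = {"default": dict(default_db)}
    for collection in collections:
        alias = "collection_" + collection["name"]  # hypothetical naming scheme
        # Copy the default connection settings, changing only the database name.
        databases[alias] = dict(default_db, NAME=alias)
    return databases


# Example values, invented for illustration only.
DEFAULT_DB = {"ENGINE": "django.db.backends.postgresql", "NAME": "snoop", "HOST": "db"}
SNOOP_COLLECTIONS = [{"name": "testdata"}, {"name": "emails"}]
DATABASES = build_databases(DEFAULT_DB, SNOOP_COLLECTIONS)
```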

DEBUG #

Enable debug logging.

Loaded from the environment variable with the same name.

DEFAULT_AUTO_FIELD #

Define type for automatically generated primary keys. This is needed since Django 3.2.

DISPATCH_MAX_QUEUE_SIZE #

Don't queue anything on a queue if its length is greater than this value.

DISPATCH_MIN_QUEUE_SIZE #

If the task count on the queue is less than this value (70%), and if we would queue at least another DISPATCH_QUEUE_LIMIT, then dispatch more tasks. This is used to reduce waiting between batches.
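
Read together with DISPATCH_MAX_QUEUE_SIZE, the rule above might be expressed as the following check. This is one interpretation of the descriptions on this page, not the dispatcher's real code; the function name `should_dispatch` and its parameter names are hypothetical.

```python
def should_dispatch(queue_length, would_queue, min_size, max_size, queue_limit):
    """Decide whether to push more tasks onto a queue (illustrative only).

    queue_length: tasks currently on the queue
    would_queue:  tasks we could queue right now
    """
    if queue_length > max_size:
        # Queue is already over the maximum: never add to it.
        return False
    # Top up only when the queue has drained below the minimum
    # and there is at least a full batch of new work available.
    return queue_length < min_size and would_queue >= queue_limit
```

With invented values `min_size=700`, `max_size=1000` and `queue_limit=100`, a queue holding 500 tasks would be topped up when 150 more are available, but not when only 50 are.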

DISPATCH_QUEUE_LIMIT #

Count of pending tasks to trigger per collection when finding an empty queue.

A single worker core running zero-length tasks handles at most around 40 tasks/s, so keeping one core occupied for 5 minutes takes about 12000 tasks.
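
The figure of 12000 follows directly from the two numbers quoted above. The constant names below are illustrative and do not appear in the settings file.

```python
TASKS_PER_SECOND_PER_CORE = 40   # approximate ceiling quoted above
KEEP_BUSY_SECONDS = 5 * 60       # five minutes, in seconds

# Tasks needed to keep one core busy for the whole interval.
tasks_needed = TASKS_PER_SECOND_PER_CORE * KEEP_BUSY_SECONDS
print(tasks_needed)  # 12000
```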

INSTALLED_APPS #

List of Django apps to load.

LANGUAGE_CODE #

Django locale.

MIDDLEWARE #

List of Django middleware to load.

NLP_TEXT_LENGTH_LIMIT #

Truncate text sent to NLP service after this many characters.

OCR_ENABLED #

Flag to enable/disable OCR processing.

OCR_PROCESSES_PER_DOC #

Number of parallel OCR processes used by this task with pdf2pdfocr.py.

REST_FRAMEWORK #

Configuration for Django Rest Framework.

Disables authentication, allows all access. Sets JSON as the default input and output.
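
A Django REST Framework configuration with this effect typically looks like the sketch below. The keys and class paths are standard Django REST Framework settings, but whether this module sets exactly these entries is an assumption.

```python
REST_FRAMEWORK = {
    # No authentication backends: requests are never authenticated.
    "DEFAULT_AUTHENTICATION_CLASSES": [],
    # Allow every request, authenticated or not.
    "DEFAULT_PERMISSION_CLASSES": ["rest_framework.permissions.AllowAny"],
    # JSON as the default output format.
    "DEFAULT_RENDERER_CLASSES": ["rest_framework.renderers.JSONRenderer"],
    # JSON as the default input format.
    "DEFAULT_PARSER_CLASSES": ["rest_framework.parsers.JSONParser"],
}
```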

RETRY_LIMIT_TASKS #

Number of BROKEN/ERROR tasks to retry every minute, while their fail count has not reached the limit.

See TASK_RETRY_FAIL_LIMIT.

SECRET_KEY #

Django secret key.

Loaded from the environment variable with the same name.

SILENCED_SYSTEM_CHECKS #

Used to disable Django warnings.

SNOOP_CLEAR_MOUNTS_EVERY_TASK #

Run "killall" on the various mount sub-processes started by the system. Only useful when running one worker per task, otherwise tasks will interfere with each other.

SNOOP_COLLECTIONS #

Static configuration for the collections list and settings.

Provided through an environment variable at server boot time.

The DATABASES setting is expanded here with the databases for all these collections.

SNOOP_COLLECTIONS_ELASTICSEARCH_URL #

URL pointing to Elasticsearch server.

SNOOP_DOCUMENT_CHILD_QUERY_LIMIT #

Limit page size when listing directory children.

SNOOP_DOCUMENT_LOCATIONS_QUERY_LIMIT #

Limit page size when listing document locations.

SNOOP_FEED_PAGE_SIZE #

Pagination size for the /feed URLs.

Todo

Remove this value, as the API is not used anymore.

SNOOP_NLP_URL #

URL pointing to the NLP server.

SNOOP_RABBITMQ_HTTP_PASSWORD #

Password for the RabbitMQ HTTP interface. Defaults to 'guest'.

SNOOP_RABBITMQ_HTTP_URL #

URL pointing to RabbitMQ message queue.

Of the form "1.2.3.4:1234/_path/" (no "http://" prefix). Used to query queue lengths.

Username and password configs follow.
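
As a rough sketch of how a queue length could be read from this setting: the RabbitMQ management HTTP API exposes per-queue details at `/api/queues/{vhost}/{name}`, and the JSON it returns includes a `messages` count. The helper names below are hypothetical, and prepending `http://` in this way is an assumption based on the URL form described above.

```python
from urllib.parse import quote


def queue_info_url(rabbitmq_http_url, queue_name, vhost="/"):
    """Build the management API URL for one queue (hypothetical helper)."""
    base = "http://" + rabbitmq_http_url.rstrip("/")
    # The vhost must be percent-encoded; the default vhost "/" becomes "%2F".
    return "{}/api/queues/{}/{}".format(
        base, quote(vhost, safe=""), quote(queue_name, safe="")
    )


def queue_length(queue_info):
    """Extract the message count from a decoded API response."""
    return queue_info.get("messages", 0)


url = queue_info_url("1.2.3.4:1234/_path/", "example_queue")
```

The request itself would be sent with HTTP basic authentication using the username and password settings, and the decoded JSON body passed to `queue_length`.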

SNOOP_RABBITMQ_HTTP_USERNAME #

Username for the RabbitMQ HTTP interface. Defaults to 'guest'.

SNOOP_S3FS_MOUNT_DIR #

Location on disk where s3fs mounts are stored.

SNOOP_S3FS_MOUNT_LIMIT #

Global limit of parallel S3 mounts (buckets).

SNOOP_TASK_DISABLE_TAIL_QUEUE #

Flag to disable queueing more tasks of same/different type after a task completes.

Useful for running tests with the Celery Eager executor, to avoid infinite loops.

SNOOP_TEMP_STORAGE #

Full disk path pointing to temp storage.

SNOOP_TIKA_URL #

URL pointing to Apache Tika server.

SNOOP_TOTAL_WORKER_COUNT #

Rough total number of executors to be run on the system.

STATIC_ROOT #

Full disk path to the static directory, for Django.

STATIC_URL #

URL path pointing to static files, for Django.

SYNC_RETRY_LIMIT_DIRS #

If there are no pending tasks, this is how many directories will be retried by sync every minute.

SYSTEM_QUEUES #

List of "system queues" - Celery tasks that must be executed periodically.

One execution of any of these functions will work on all collections under a for loop.

TABLES_SPLIT_FILE_ROW_COUNT #

Number of rows inside each table split. Limits the time spent by a single unarchive task to a few minutes, increasing parallelism.

This limits the number of child (row) documents for a given table to the inode performance limit of 4000 files per dir.
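
The splitting idea can be illustrated with a small helper that cuts a table into fixed-size row ranges. This is a sketch only: the function name `split_row_ranges` is hypothetical and the real unarchive task may split tables differently.

```python
def split_row_ranges(total_rows, rows_per_split):
    """Return half-open (start, end) row ranges of at most rows_per_split rows."""
    return [
        (start, min(start + rows_per_split, total_rows))
        for start in range(0, total_rows, rows_per_split)
    ]


# A 10-row table cut into splits of at most 4 rows.
ranges = split_row_ranges(10, 4)
print(ranges)  # [(0, 4), (4, 8), (8, 10)]
```

Each range can then be handled by its own task, which is what bounds the running time of any single one.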

TASK_PREFIX #

Prefix to add to all snoop task queues.

Todo

Remove this value, as it's not used anymore.

TASK_RETRY_AFTER_MINUTES #

Errored tasks are retried at most once every this many minutes.

TASK_RETRY_FAIL_LIMIT #

Errored tasks are retried at most this number of times.

The actual value is higher, since we retry very old tasks more times.
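
Combined with TASK_RETRY_AFTER_MINUTES, the basic retry condition might be sketched as below. The function and parameter names are hypothetical, and the sketch deliberately leaves out the extra retries granted to very old tasks, since their rule is not described here.

```python
def should_retry(fail_count, minutes_since_last_try, fail_limit, retry_after_minutes):
    """Return True if an errored task is eligible for another attempt.

    Illustrative only: ignores the additional retries given to very old tasks.
    """
    under_limit = fail_count < fail_limit
    waited_long_enough = minutes_since_last_try >= retry_after_minutes
    return under_limit and waited_long_enough
```

For example, with an invented limit of 3 failures and a 10 minute wait, a task that failed twice and was last tried 15 minutes ago would be retried, while one that already failed 3 times would not.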

UNARCHIVE_THREADS #

Number of threads that will be used by 7z to unarchive.

URL_PREFIX #

Configuration to set the URL prefix for all service routes. For example: "snoop/".

WORKER_PREFETCH #

Celery-rabbitmq prefetch count.

WORKER_TASK_LIMIT #

Maximum number of tasks a single worker process completes before it is restarted.

WSGI_APPLICATION #

Configure which WSGI application to use, for Django.