HomeLab Mini PC (x86): Deploying paperless-ngx, an Open-Source Paperless Document Manager, with Docker

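NO.1
Introduction to paperless-ngx

GitHub repository

https://github.com/paperless-ngx/paperless-ngx

Docker image

https://hub.docker.com/r/paperlessngx/paperless-ngx

A community-supported, supercharged version of Paperless: scan, index and archive all your paper documents.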

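Features

  • Organize and index your scanned documents with tags, correspondents, types, and more.

  • Performs OCR on your documents, adds selectable text to image-only documents, and adds tags, correspondents and document types to your documents.

  • Supports PDF documents, images, plain text files, and Office documents (Word, Excel, Powerpoint, and LibreOffice equivalents).

      • Office document support is optional and provided by Apache Tika (see the configuration docs).

  • Paperless stores your documents directly on disk. Filenames and folders are managed by Paperless, and their format can be configured freely.

  • Single-page application front end.

      • Includes a dashboard that shows basic statistics and has document upload.

      • Filtering by tags, correspondents, types, and more.

      • Custom views can be saved and displayed on the dashboard.

  • Full-text search helps you find what you need.

      • Auto-completion suggests relevant words from your documents.

      • Results are sorted by relevance to your search query.

      • Highlighting shows which parts of the document matched the query.

      • Searching for similar documents ("More like this").

  • Email processing: Paperless adds documents from your email accounts.

      • Configure multiple accounts, with filters for each account.

      • When adding documents from mail, Paperless can move the mails to a new folder, mark them as read, flag them as important, or delete them.

  • Machine-learning-powered document matching.

      • Paperless-ngx learns from your documents and, once a few documents have been stored, can automatically assign tags, correspondents and types to new ones.

  • Optimized for multi-core systems: Paperless-ngx consumes multiple documents in parallel.

  • The integrated sanity checker makes sure that your document archive is in good health.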


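In short, it turns paper documents into electronic ones.

The first things that come to mind are receipts and contracts, followed by manuals of all kinds, and then magazines, books and the like.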

NO.2
Installing paperless-ngx

Official documentation

https://docs.paperless-ngx.com/setup/#docker_script

docker-compose reference files

https://github.com/paperless-ngx/paperless-ngx/tree/main/docker/compose

Create a docker-compose.yml file with the following content

# docker-compose file for running paperless from the Docker Hub.
# This file contains everything paperless needs to run.
# Paperless supports amd64, arm and arm64 hardware.
#
# All compose files of paperless configure paperless in the following way:
#
# - Paperless is (re)started on system boot, if it was running before shutdown.
# - Docker volumes for storing data are managed by Docker.
# - Folders for importing and exporting files are created in the same directory
# as this file and mounted to the correct folders inside the container.
# - Paperless listens on port 8000.
#
# In addition to that, this docker-compose file adds the following optional
# configurations:
#
# - Instead of SQLite (default), PostgreSQL is used as the database server.
#
# To install and update paperless with this file, do the following:
#
# - Copy this file as 'docker-compose.yml' and the files 'docker-compose.env'
# and '.env' into a folder.
# - Run 'docker-compose pull'.
# - Run 'docker-compose run --rm webserver createsuperuser' to create a user.
# - Run 'docker-compose up -d'.
#
# For more extensive installation and update instructions, refer to the
# documentation.

version: "3.4"
services:
  broker:
    image: docker.io/library/redis:7
    restart: unless-stopped
    volumes:
      - ./redisdata:/data

  db:
    image: docker.io/library/postgres:15
    restart: unless-stopped
    volumes:
      - ./pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: paperless

  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    restart: unless-stopped
    depends_on:
      - db
      - broker
    ports:
      - "3033:8000"
    healthcheck:
      test: ["CMD", "curl", "-fs", "-S", "--max-time", "2", "http://localhost:8000"]
      interval: 30s
      timeout: 10s
      retries: 5
    volumes:
      - ./data:/usr/src/paperless/data
      - ./media:/usr/src/paperless/media
      - ./export:/usr/src/paperless/export
      - ./consume:/usr/src/paperless/consume
    env_file: docker-compose.env
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_ADMIN_USER: admin
      PAPERLESS_ADMIN_PASSWORD: admin

volumes:
  data:
  media:
  pgdata:
  redisdata:

Create a docker-compose.env file

Copy the following content into it as-is

# The UID and GID of the user used to run paperless in the container. Set this
# to your UID and GID on the host so that you have write access to the
# consumption directory.
#USERMAP_UID=1000
#USERMAP_GID=1000

# Additional languages to install for text recognition, separated by a
# whitespace. Note that this is
# different from PAPERLESS_OCR_LANGUAGE (default=eng), which defines the
# language used for OCR.
# The container installs English, German, Italian, Spanish and French by
# default.
# See https://packages.debian.org/search?keywords=tesseract-ocr-&searchon=names&suite=buster
# for available languages.
#PAPERLESS_OCR_LANGUAGES=tur ces

###############################################################################
# Paperless-specific settings #
###############################################################################

# All settings defined in the paperless.conf.example can be used here. The
# Docker setup does not use the configuration file.
# A few commonly adjusted settings are provided below.

# This is required if you will be exposing Paperless-ngx on a public domain
# (if doing so please consider security measures such as reverse proxy)
#PAPERLESS_URL=https://paperless.example.com

# Adjust this key if you plan to make paperless available publicly. It should
# be a very long sequence of random characters. You don't need to remember it.
#PAPERLESS_SECRET_KEY=change-me

# Use this variable to set a timezone for the Paperless Docker containers. If not specified, defaults to UTC.
#PAPERLESS_TIME_ZONE=America/Los_Angeles

# The default language to use for OCR. Set this to the language most of your
# documents are written in.
#PAPERLESS_OCR_LANGUAGE=eng

# Set if accessing paperless via a domain subpath e.g. https://domain.com/PATHPREFIX and using a reverse-proxy like traefik or nginx
#PAPERLESS_FORCE_SCRIPT_NAME=/PATHPREFIX
#PAPERLESS_STATIC_URL=/PATHPREFIX/static/ # trailing slash required
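The template above leaves every setting commented out. As a minimal sketch of filling in the host-specific ones, the commands below append the current user's UID/GID and a random secret key to docker-compose.env. The 64-character hex key format and the use of /dev/urandom are my own choices for illustration, not something the paperless-ngx docs prescribe; any long random string should do.

```shell
# Run once from the directory that holds docker-compose.env.
ENV_FILE="docker-compose.env"

# 32 random bytes rendered as 64 hex characters, read from /dev/urandom.
SECRET_KEY="$(head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n')"

# Append host-specific values; the commented-out defaults above them stay untouched.
{
  echo "USERMAP_UID=$(id -u)"
  echo "USERMAP_GID=$(id -g)"
  echo "PAPERLESS_SECRET_KEY=${SECRET_KEY}"
} >> "$ENV_FILE"
```

Matching USERMAP_UID and USERMAP_GID to the host user is what gives that user write access to the ./consume folder, as the comment at the top of the file explains. Running the snippet a second time appends duplicate lines, so run it only once.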


Open the port in the firewall

sudo ufw allow 3033

Pull the images and start the services

docker-compose up
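The first start can take a while because the images are pulled and the database is initialized. A small readiness check, sketched below, reuses the same curl probe as the healthcheck in the compose file. The function name wait_until_up and its defaults (30 attempts, 2 seconds apart) are hypothetical and only for illustration.

```shell
# Hypothetical helper: poll a URL until it answers or the attempts run out.
wait_until_up() {
  url="$1"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Same probe as the compose healthcheck: fail on HTTP errors, 2 s timeout.
    if curl -fsS --max-time 2 -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  return 1
}
```

One way to use it: start the stack in the background with `docker-compose up -d`, then run `wait_until_up http://localhost:3033 && echo "paperless-ngx is up"`. Port 3033 is the host port mapped in the compose file above.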


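NO.3
Using paperless-ngx

Reference documentation

https://docs.paperless-ngx.com/configuration/#PAPERLESS_ADMIN_USER

The docker-compose.yml above already sets the superuser name and password

PAPERLESS_ADMIN_USER: admin

PAPERLESS_ADMIN_PASSWORD: admin

Browse to IP:port (port 3033 in the compose file above)

This brings up the login page

Username: admin

Password: admin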


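The login succeeds

Upload a document, for example a photo of a receipt

Click "Browse files"

Once the upload has finished, click "Open document"

Alternatively, click the Documents tab on the left

and double-click the document you want to view

The details of the digitized document are displayed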

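The content shown here is the text recognized automatically by OCR

It can be found by keyword from the search bar

However, by default only English is recognized; Simplified Chinese is not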
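The same keyword search is also reachable from the command line through the paperless-ngx REST API, using the `query` parameter on `/api/documents/`. The sketch below assumes HTTP basic authentication with the admin/admin account from this guide and the 3033 port mapping. The function name and the PAPERLESS_BASE_URL, PAPERLESS_USER and PAPERLESS_PASS variables are hypothetical names introduced here, not paperless-ngx settings.

```shell
# Hypothetical helper: full-text search through the paperless-ngx REST API.
# Defaults match this guide (port 3033, admin/admin); override via environment.
paperless_search() {
  base="${PAPERLESS_BASE_URL:-http://localhost:3033}"
  user="${PAPERLESS_USER:-admin}"
  pass="${PAPERLESS_PASS:-admin}"
  # --get with --data-urlencode lets curl URL-encode the search terms.
  curl -fsS --max-time 10 -u "${user}:${pass}" \
    --get --data-urlencode "query=$1" \
    "${base}/api/documents/"
}
```

Calling `paperless_search "invoice"` should print a JSON response listing the matching documents. Only text that OCR actually recognized can match, so the language limitation described above applies here as well.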


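OCR configuration documentation

https://docs.paperless-ngx.com/configuration/#ocr

To recognize Simplified Chinese, a parameter has to be added to docker-compose.yml

PAPERLESS_OCR_LANGUAGE: chi_sim

However, with only this parameter added, an error is reported saying the language is not supported

The container installs only English, German, Italian, Spanish and French by default (see the comment in docker-compose.env), so none of the bundled OCR languages covers Simplified Chinese

One workaround is to run the document through a different OCR tool when saving it and enter the recognized text by hand, which makes it easier to find later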
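The comments in docker-compose.env point at a possible cause of the error: PAPERLESS_OCR_LANGUAGES installs additional language packs, while PAPERLESS_OCR_LANGUAGE only selects among the languages already installed. A sketch of the two settings together follows. I have not verified it on the setup in this post. The hyphenated `chi-sim` follows the Debian package naming (tesseract-ocr-chi-sim), and the `+` joins several Tesseract languages.

```shell
# docker-compose.env additions (a sketch; not verified on the setup in this post).
# Language pack to install at container start; package-style name with a hyphen.
PAPERLESS_OCR_LANGUAGES=chi-sim
# Languages used for OCR; Tesseract-style codes with an underscore, joined by "+".
PAPERLESS_OCR_LANGUAGE=chi_sim+eng
```

After editing the file, `docker-compose up -d` recreates the webserver container with the new environment. If PAPERLESS_OCR_LANGUAGE is also set under `environment:` in docker-compose.yml, keep the two values consistent.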

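NO.4
Tips

It can hold PDFs, images and so on

I have not used it heavily yet; later on I may add

  1. ID documents and cards (ID card, social security card, housing fund card, bank cards, membership cards, etc.)

  2. Receipts (supermarket receipts, food delivery receipts, deposit slips and other transaction records)

  3. Contracts and agreements (employment contract, medical check-up reports, resignation agreement, non-disclosure agreement, bank contracts, rental lease, etc.)

  4. Instructions (manuals, etc.)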
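For loading many such files at once, the ./consume folder mounted in docker-compose.yml can be used instead of the browser upload: paperless-ngx imports files that appear there. The helper below is a hypothetical sketch. The function name and the CONSUME_DIR variable are my own, and it copies rather than moves so the originals stay where they are.

```shell
# Hypothetical helper: copy files into the consume folder mounted in docker-compose.yml.
queue_for_import() {
  # Defaults to ./consume next to docker-compose.yml; override with CONSUME_DIR.
  dir="${CONSUME_DIR:-./consume}"
  mkdir -p "$dir"
  for f in "$@"; do
    cp -- "$f" "$dir"/
  done
}
```

For example, `queue_for_import ~/scans/*.pdf` (the path is only an illustration) queues every PDF in that folder. If the copy fails with a permission error, check that USERMAP_UID and USERMAP_GID in docker-compose.env match the host user.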

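The first drawback I noticed in early use is that, with the default configuration, OCR only handles English

Later I may pair it with a self-hosted OCR server; the manual step is more tedious, but the results are more accurate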

END.
