This repository has been archived on 2024-11-02. You can view files and clone it, but cannot push or open issues or pull requests.

DEPRECATED - DIAGRAMS ARE NOW IN PDF FORMAT.

https://git.fjla.uk/owlboard/dgp2 supports new PDF format schedule cards and offers some automated validation of codes. This project will not be maintained.

diagram-parser

This is an experimental project and is not yet used as part of the OwlBoard stack.

Language

The implementation language is so far undecided. Documents for parsing are likely to be a few hundred lines long, so searching them may become processor-intensive, which makes Go a good candidate; however, Python offers an array of libraries which could be helpful.

File formats

Diagrams are received in DOCX format but can easily be converted to ODT, DOC, or PDF, which provides flexibility in the languages and libraries used in the implementation.
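
As a rough illustration of why DOCX is workable without third-party libraries: a DOCX file is a ZIP archive whose main text lives in `word/document.xml`. The sketch below, in Python, pulls out the text of each paragraph using only the standard library. The function name `docx_paragraphs` is hypothetical, and real schedule cards may need extra handling (for example, table layout) that is not shown here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a DOCX archive.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(data: bytes) -> list[str]:
    """Return the text of each paragraph in a DOCX file supplied as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{W_NS}p"):
        # A paragraph's text is split across one or more <w:t> elements.
        text = "".join(node.text or "" for node in para.iter(f"{W_NS}t"))
        paragraphs.append(text)
    return paragraphs
```

Working from bytes rather than a file path means an attachment can be parsed in memory and never written to disk.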

Aims

The aim of diagram-parser is to simplify the addition of PIS codes that are not yet in the OwlBoard data source. The planned implementation is as follows:

  • diagram-parser is subscribed to an email inbox (IMAP/POP3)
  • Formatted train-crew schedule cards are sent to the inbox and loaded by diagram-parser
  • The list of existing PIS codes is loaded and a list of non-existent codes (within the range 0000-9999) is compiled
  • If a code is found both in the diagram and on the list of non-existent codes, a Gitea issue is opened providing details of the code.
  • Once the program has run and extracted only the relevant details, the email is deleted and the file is closed and not stored.
  • The eventual aim is to avoid any manual searching of the files.
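
The code-comparison steps above could be sketched in Python as follows. This is a minimal sketch under assumptions: the function names are hypothetical, and the pattern treats any standalone four-digit number as a PIS code. Real schedule cards are likely to contain other four-digit numbers (times, for example), so a stricter pattern would be needed in practice.

```python
import re

# Assumption: a PIS code appears in the card text as a standalone
# four-digit number. This will also match times and similar values.
CODE_PATTERN = re.compile(r"\b\d{4}\b")


def missing_codes(existing: set[str]) -> set[str]:
    """Return every code in 0000-9999 that is not in the data source."""
    all_codes = {f"{n:04d}" for n in range(10000)}
    return all_codes - existing


def codes_to_report(card_text: str, existing: set[str]) -> list[str]:
    """Return codes found in the card that the data source does not have."""
    found = set(CODE_PATTERN.findall(card_text))
    return sorted(found & missing_codes(existing))
```

Each code returned by `codes_to_report` would then become a candidate for a new issue.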

Currently, new codes are added only when I am told about them face to face or find them myself, and each one must then be looked up and added to the data source manually.

Points to Remember

  • Emails received should be verified.
    • A pre-authorised key should be present in the subject field; any email not matching the key should be discarded.
  • Attachment formats may vary slightly.
    • The format of the attachment should be checked and any errors handled gracefully.
  • Avoid duplicate issues.
    • Issues opened should contain the missing PIS code in their title. This application should check for any open issue containing the missing code to avoid duplicates.
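
Two of the checks above, the subject key and the duplicate-issue test, might look something like this in Python. The sketch assumes the whole subject line must equal the key and that the titles of open issues have already been fetched from Gitea by some other means. All function names and the title format are hypothetical.

```python
import hmac
import re


def subject_is_authorised(subject: str, key: str) -> bool:
    """True if the subject, ignoring surrounding whitespace, equals the key.

    Assumption: the whole subject line is the key. compare_digest is used
    so the comparison time does not depend on where the strings differ.
    """
    return hmac.compare_digest(subject.strip().encode(), key.encode())


def issue_title(code: str) -> str:
    """Hypothetical title format that always contains the missing code."""
    return f"Missing PIS code: {code}"


def already_reported(code: str, open_titles: list[str]) -> bool:
    """True if any open issue title contains the code as a whole word."""
    pattern = re.compile(rf"\b{re.escape(code)}\b")
    return any(pattern.search(title) for title in open_titles)
```

Matching the code as a whole word avoids treating an unrelated longer number in a title as a report of the same code.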

Main external dependencies (Expected)

  • imaplib
  • email
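
As a rough sketch of how these two standard-library modules might fit together, the example below fetches raw messages over IMAP and extracts any DOCX attachments in memory. The function names and the `host`, `user`, and `password` parameters are hypothetical, and error handling, subject verification, and message deletion are left out.

```python
import email
import imaplib
from email import policy


def fetch_raw_messages(host: str, user: str, password: str) -> list[bytes]:
    """Download every message in the inbox as raw RFC 822 bytes."""
    with imaplib.IMAP4_SSL(host) as conn:
        conn.login(user, password)
        conn.select("INBOX")
        _, data = conn.search(None, "ALL")
        raw_messages = []
        for num in data[0].split():
            _, msg_data = conn.fetch(num.decode(), "(RFC822)")
            # msg_data[0] is a (header, body) tuple; the body is the message.
            raw_messages.append(msg_data[0][1])
        return raw_messages


def docx_attachments(raw: bytes) -> list[tuple[str, bytes]]:
    """Return (filename, content) for each .docx attachment in a message."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    found = []
    for part in msg.iter_attachments():
        name = part.get_filename() or ""
        if name.lower().endswith(".docx"):
            # decode=True undoes the base64 transfer encoding.
            found.append((name, part.get_payload(decode=True)))
    return found
```

Checking only the file extension is a weak format check; the attachment contents should still be validated before parsing.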