
owemdjee

Data Science & Image Processing amalgam library in C/C++.

This place is a gathering spot & integration workplace for the C & C++ libraries we choose to use. Think "Façade Pattern" and you're getting warm. 😉 The heavy data lifting will be done in the referenced libraries, while this lib will provide some glue and common ground for them to work in/with.


Reason for this repo

git submodules haven't been the most, ah, "user-friendly" method to track and manage a set of libraries that you wish to follow at source level.

A few problems have been repeatedly observed over our lifetime with git:

  • when the importance of & interest in a submoduled library is waning and you want to migrate to another, you can of course invoke git to ditch the old sow and bring in the shiny new one, but that stuff gets quite finicky when you are pedalling back & forth through your commit tree while, e.g., bug-hunting or doing maintenance work on a release branch which isn't up to snuff with the fashion kids yet.

    Yup, that's been much less of a problem since about 2018, but old scars need more than a pat on the arm to heal, if you get my drift.

  • folks haven't always been the happy campers they were supposed to be when facing a set of submodules and wanting to feel safe and sure in their "knowledge" that each library X is at commit Y while the top of the module tree is itself at commit Z -- because we are busy producing a production release, perhaps? That's a wee bit stressful, and there have been enough "flukes" with git to make that a not-so-ironclad-as-we-would-like position.

    Over time, I've created several bash shell scripts to help with that buzzin' feelin' of absolute certainty. Useful perhaps, but the cuteness of those wears off pretty darn quickly when many nodes in the submodule tree start cluttering their git repo with those.

And?

This repo is made to ensure we have a single point of reference for all the data munching stuff, at least.

We don't need to git submodule add all those data processing libs into our applications this way, as this repo is the single submodule to bother those projects with. The scripts and other material in here provide the means for your build and test tools to quickly and easily verify that every lib in here is at the commit spot it's supposed to be at.

And when we want to add another lib about data/image processing, we do that in here, so the application-level git repo sees a very stable, singular submodule all the time: this repo/lib, not the stuff that will change as external libs gain and lose momentum over time. (We're talking multiyear timespans here!)

Critique?

It's not the most brilliant solution to our problems, as this, of course, becomes a single point of failure that way, but past experience with similar "solutions" has shown that, while it's maybe not always fun, at least we keep the management crap in one place, and that has been worth it every time.

And why not do away with git submodule entirely and use packages instead? Because this stuff is important enough, and other, quite painful experience has shown us that (binary & source) packages are a wonder and a hassle too: I'd rather have my code tracked and tagged at source level all the way, because that has reduced several bug situations from man-weeks to man-hours: like Gentoo, compile it all, one compiler only. It doesn't matter whether the bug is in your own code or elsewhere: there are enough moments where one is helped enormously by the ability to step through and temporarily tweak a bit of code here or there to aid the debugging process that I, at least, prefer full source code.

And that's what this repo is here to provide: the source code gathered and ready for use on our machines.

Why is this repo a solution? And does it scale?

The worst bit first: it scales like rotten eggs. The problem there is two-fold: first, there are (relatively) few people who want to track progress at the bleeding edge, so tooling is consequently limited in power and availability compared to conservative approaches (counted the number of package managers lately?).

Meanwhile, I'm in a spot where I want to ride the bleeding edge, at least most of the time, and I happen to like it that way: my world is much more R&D than product maintenance, so having a means to track, relatively easily, the latest developments in subjects and material of interest is a boon to me. Sure, I'll moan and rant about it once in a while, but if I wanted to really get rid of the need to be flexible and adapt to changes, sometimes often, I'd have gone with the conservative stability of package managers and LTS releases already. Which I've done for other parts of my environment, but do not intend to do for the part which is largely covered by this repo: source libraries which I intend to use or am using already in research tools I'm developing for others and myself.

For that purpose, this repo is a solution, though -- granted -- a sub-optimal one in that it doesn't scale very well. I don't think there's any automated process available to make this significantly faster and more scalable anyway: the fact that I'm riding the bleeding edge and wish to be able to backpedal at will when the latest change of direction or state of affairs of a component is off the rails (from my perspective at least) requires me to be flexible and adaptable to the full gamut of change. There are alternative approaches, also within the git world, but they haven't shown real appeal vs. old skool git submodules -- which are cranky at times and a pain in the neck when you want to ditch something but still need it in another dev branch, moan moan moan, but anyway... -- so here we are.

Side note: submodules which have been picked up for experimentation and inspection but were later deleted from this A-list are struck through in the overview below. The rationale: we can still see why we struck an item off the list, and we won't make the mistake of re-introducing it after a long time, having forgotten that we already had a look, without at least running into the struck-through entry and re-evaluating that reason first.


Intent

Inter-process communications (IPC)

Lowest possible run-time cost, a.k.a. "run-time overhead": the aim is to have IPC which does not noticeably impact UX (User Experience of the application: responsiveness / UI) on reasonably powered machines. (Users are not expected to have the latest or fastest hardware.)

As large images (PDF page renders), at least, will be transferred, we need a "binary-able" protocol, i.e. one that can carry raw binary payloads.
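To make that "binary-able" requirement concrete, here is a minimal sketch in C of the kind of length-prefixed framing that lets large page renders travel over a pipe or socket without text-encoding overhead. The frame layout and names are hypothetical, not this repo's actual wire format; endianness and partial-write handling are deliberately left out.

```c
/* Hypothetical sketch only, not the project's actual IPC protocol. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t msg_type;       /* hypothetical tag, e.g. 1 = "page render" */
    uint64_t payload_bytes;  /* size of the binary payload that follows  */
} frame_header_t;

/* Write one frame (fixed-size header + raw payload) to a pipe/socket wrapped in a FILE*. */
static int write_frame(FILE *out, uint32_t msg_type,
                       const void *payload, uint64_t payload_bytes) {
    frame_header_t hdr = { msg_type, payload_bytes };
    if (fwrite(&hdr, sizeof hdr, 1, out) != 1)
        return -1;
    if (payload_bytes &&
        fwrite(payload, 1, (size_t)payload_bytes, out) != payload_bytes)
        return -1;
    return fflush(out) == 0 ? 0 : -1;
}
```

The receiving side would read the fixed-size header first and then exactly payload_bytes of image data, so image bytes never need to be base64'd or otherwise inflated.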

Programming Languages used: intent and purposes

We expect to use these languages in processes which require this type of IPC:

  • C / C++ (backend No.1)

    • PDF renderer (mupdf)
    • metadata & annotations extractor (mupdf et al)
    • very probably also the database interface (SQLite)
    • [page] image processing (leptonica, OpenCV, ImageMagick?, whatever turns out to be useful and reasonable to integrate, particularly between the PDF page renderer and the OCR engine), to help us provide a user-tunable PDF text+metadata extractor
    • OCR (tesseract)
    • "A.I."-assisted tooling to help process and clean PDFs: cover pages, abstract/summary extraction for meta-research, etc. (think ngrams, xdelta, SVM, tensors, author identification, document categorization, document similarity / [near-]duplicate / revision detection, tagging, ...)
    • document identifier key generator, a.k.a. content hasher, for creating a unique key for each document, which can be used as a database record index, etc. (see the BLAKE3+Base36 sketch after this list)
      • old: Qiqqa SHA1B
      • new: BLAKE3+Base36
  • C# ("business logic" / "middleware": the glue logic)

  • Java (SOLR / Lucene: our choice for the "full text search database" ~ backend No.2)

  • JavaScript (UI, mostly. Think Electron, web browser, Chromely, WebView2, that sort of thing)
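As announced in the content-hasher item above, document keys move from Qiqqa's old SHA1-based scheme to BLAKE3+Base36. The following is a minimal sketch only, assuming the official BLAKE3 C API (blake3.h); the Base36 step is plain big-integer long division and is not necessarily the exact encoding the project settles on.

```c
/* Hypothetical sketch, not Qiqqa's actual key routine. */
#include "blake3.h"     /* official BLAKE3 C reference implementation */
#include <stdint.h>
#include <string.h>

/* Hash a document's bytes with BLAKE3 and render the 256-bit digest as a
 * Base36 string usable as a database record key. */
void document_key(const unsigned char *doc, size_t doc_len,
                  char *out, size_t out_len) {
    uint8_t digest[BLAKE3_OUT_LEN];             /* 32 bytes */
    blake3_hasher hasher;
    blake3_hasher_init(&hasher);
    blake3_hasher_update(&hasher, doc, doc_len);
    blake3_hasher_finalize(&hasher, digest, sizeof digest);

    /* Base36-encode the digest via repeated long division of the big integer. */
    static const char alphabet[] = "0123456789abcdefghijklmnopqrstuvwxyz";
    char tmp[64];                               /* 256 bits -> at most 50 base36 digits */
    size_t n = 0;
    uint8_t num[BLAKE3_OUT_LEN];
    memcpy(num, digest, sizeof num);
    int nonzero = 1;
    while (nonzero) {
        unsigned rem = 0;
        nonzero = 0;
        for (size_t i = 0; i < sizeof num; i++) {   /* big-endian division by 36 */
            unsigned cur = rem * 256u + num[i];
            num[i] = (uint8_t)(cur / 36u);
            rem = cur % 36u;
            if (num[i]) nonzero = 1;
        }
        tmp[n++] = alphabet[rem];
    }
    /* Digits come out least-significant first; reverse into the caller's buffer. */
    size_t i = 0;
    while (n > 0 && i + 1 < out_len)
        out[i++] = tmp[--n];
    out[i] = '\0';
}
```

A 256-bit digest encodes to at most 50 Base36 digits, so a 64-byte output buffer is plenty for a database record key.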

For the SOLR/Lucene backend we intend to use the regular SOLR APIs, which do not require specialized binary IPC.
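For illustration only: a plain HTTP round-trip is all that's needed to talk to SOLR's select handler. The sketch below uses libcurl and a hypothetical local core name ("qiqqa"); it is not the actual client code of this project, just a demonstration that no custom binary IPC is involved on this path.

```c
/* Hypothetical sketch: query SOLR over its regular HTTP API with libcurl. */
#include <curl/curl.h>
#include <stdio.h>

/* Assumes curl_global_init(CURL_GLOBAL_DEFAULT) was called once at startup. */
int solr_query(const char *query_url) {
    CURL *curl = curl_easy_init();
    if (!curl)
        return -1;
    curl_easy_setopt(curl, CURLOPT_URL, query_url);
    /* libcurl's default write callback prints the JSON response to stdout */
    CURLcode rc = curl_easy_perform(curl);
    if (rc != CURLE_OK)
        fprintf(stderr, "solr query failed: %s\n", curl_easy_strerror(rc));
    curl_easy_cleanup(curl);
    return rc == CURLE_OK ? 0 : -1;
}

/* e.g.: solr_query("http://localhost:8983/solr/qiqqa/select?q=title:entropy&wt=json"); */
```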

We will probably choose a web-centric UI approach where images are compressed and cached in the backend, while being provided as <picture> or <img> tag references (URLs) in the HTML generated by the backend. However, we keep our options open ATM, as further testing is expected to hit a few obstacles there (smart caching is required, as we will be processing lots of documents in "background bulk processes" alongside the browsing and other more direct user activity), so a websocket or similar push technology may be employed: there we may benefit from dedicated IPC for large binary and text data transfers.

Scripting the System: Languages Considered for Scripting by Users

Python has been considered. Given its loud presence in the AI communities, we still may integrate it one day. However, personally I'm not a big fan of the language and don't use it unless it's prudent to do so, e.g. when extending or tweaking previous work produced by others. Also, it turns out, it's not exactly easy to integrate (CPython), and I don't see a need for it beyond this one project / product: Qiqqa.

I've looked at Lua as a scripting language suitable for users (it's used quite a lot in the gaming industry and elsewhere); initial trials to get something going did not uncover major obstacles, but the question "how do I debug Lua scripts?" does not produce any viable project / product that goes beyond old skool printf-style debugging. Not a prime candidate therefore, as we expect that users will pick this up, when they like it, and grow their user scripts to unanticipated size and complexity: I've seen this happen multiple times in my career. Lua does not provide a scalable growth path from my perspective, due to the lack of a decent, customizable debugger.

The third candidate is JavaScript. While Artifex/mupdf comes with mujs, which is a simple engine, it suffers from two drawbacks: it's ES5-only and it does not provide a debugger mechanism beyond old skool print. Nice for nerds, but this is user-facing and thus not a viable option.

The other JavaScript engines considered are of varying size, performance and complexity. Some of them offer ways to integrate with the [F12] Chrome browser Developer Tools debugger, which would be very nice to have available. The road traveled there, along the various JavaScript engines, is this:

UPDATE 2021/June: JerryScript, duktape, XS/moddable, escargot: these have been dropped as we picked QuickJS. After some initial hassle with that codebase, we picked a different branch to test, which was cleaner and compiled out of the box (CMake > MSVC) -- always a good omen for a codebase when you have cross-platform portability in mind.
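For reference, embedding QuickJS for user scripts boils down to very little host code. This is a hypothetical sketch using the public QuickJS C API, not the integration layer in this repo; error reporting would of course be routed through the UI rather than stderr.

```c
/* Hypothetical embedding sketch: evaluate a user script with QuickJS and report errors. */
#include "quickjs.h"
#include <stdio.h>
#include <string.h>

int run_user_script(const char *script_src) {
    JSRuntime *rt = JS_NewRuntime();
    JSContext *ctx = JS_NewContext(rt);

    /* Evaluate the script in the global scope, tagging it with a pseudo-filename. */
    JSValue result = JS_Eval(ctx, script_src, strlen(script_src),
                             "<user-script>", JS_EVAL_TYPE_GLOBAL);
    int rc = 0;
    if (JS_IsException(result)) {
        JSValue exc = JS_GetException(ctx);
        const char *msg = JS_ToCString(ctx, exc);
        fprintf(stderr, "script error: %s\n", msg ? msg : "(unknown)");
        if (msg)
            JS_FreeCString(ctx, msg);
        JS_FreeValue(ctx, exc);
        rc = -1;
    }
    JS_FreeValue(ctx, result);
    JS_FreeContext(ctx);
    JS_FreeRuntime(rt);
    return rc;
}
```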


🡻 all (index) | 🡺 next section
