I know Calibre can remove DRM, but it seems that Calibre does not remove things like watermarks, references to the buyer by name, etc. Now maybe I can try to find those manually, but that is an error-prone process. Plus, what if they embed a unique digital signature that ties back to me? I understand that this is a very uncommon practice, but I do not want to find myself in a bad place.
I suppose the only way to remove a digital signature of any sort is to buy two of the same e-book by different people, diff them, and remove anything that differentiates them.
Is there any tool that does this or automates the process? Or am I being too paranoid, and this is not a real threat?
The bad news is that uploading e-books will involve programming on your part (for your sanity at least).
The good news is that it should be far easier than other mediums.
If you are approaching this from a complete-safety perspective (because you live in a fiefdom that owes tribute to the publishers’ guild), then you’re going to want to OCR the pages of the book and use the text to make a brand-new book free of metadata. I’m pretty sure a Python crash course could get you up and running in a month or 6.
If you want what’s closest to the original product, then you’ll need a Python script that strips everything from the book into just a text document, then convert that back into your own book. You’ll have to review the text document to see whether any random code from the book, like invisible text, came along with it.
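As a rough illustration of the "review the text for invisible text" step, here is a minimal sketch that flags Unicode format characters (zero-width spaces, soft hyphens and the like) in an extracted text. The function name `find_invisible_chars` is made up for this example, and it only covers one kind of invisible content: it will not notice text hidden through styling, such as white-on-white or zero-size fonts.

```python
import unicodedata


def find_invisible_chars(text):
    """Return (index, code point, name) for every Unicode format (Cf) character."""
    hits = []
    for i, ch in enumerate(text):
        # Category "Cf" covers zero-width spaces/joiners, soft hyphens, BOMs, etc.
        if unicodedata.category(ch) == "Cf":
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")))
    return hits


if __name__ == "__main__":
    sample = "plain\u200btext with a soft\u00adhyphen"
    for hit in find_invisible_chars(sample):
        print(hit)
```

Anything it reports still needs a human look, since some format characters (soft hyphens, for instance) are legitimate typesetting.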
Both options are so simple from a programming perspective that I’ve never seen scripts to strip e-book protections. A real “the solution is left un-worked as a challenge for the reader” situation. And from what I know, the publishers have switched to focusing on selling hard copies as their bread and butter, and striking deals with libraries for other revenue. Big money is still in mandatory university textbooks.
Source: Never actually done what you’re asking for
Thanks for your advice. I am a programmer by craft so I can definitely do that. If I understood what you said correctly, I think the only issue may be books with important content that is not text, i.e. graphics and images (and unfortunately, many of the books I am interested in have that).
gImageReader or ocrmypdf will get you the PDF text, but afterwards the text will need fiddling with and cleaning up. Use LibreOffice, languagetool, write-good, etc. to make finding the oddballs easy.
pdftk is what you want for editing PDF metadata.
Gimp is what you’ll need for editing images: looking for watermarks, smoothing edges, lowering quality, introducing random noise, etc.
exiftool is what you’ll need for image metadata. Or take a screenshot, add a bit of noise or de-noise, and add it back to the new PDF.
Scrivener or LibreOffice if you want to polish/republish, though that’s a ton of work.
Even with OCR, couldn’t your copy at least in theory be laced with strategically placed minor word changes? Say throughout the book you pick 30 spots to change a word without changing the meaning of the text, or you introduce a typo. If every copy gets a different set of those that would be a unique identifier.
I think I have heard of that being done with imperceptible changes in films sent to theaters for showings.
@matcha_addict@lemy.lol In this situation, I’d advise acquiring a copy from an alternative source, then just compare the texts of the two.
In practice though, if you’re already going the OCR route, then just cut the pages out of a real book with a utility knife and run them through a sheet-fed scanner. All they get to know is that some asshole cyberpunk script kiddie jacked your book while you were waiting at a bus stop.
have a look at “snowdrop” (search for it together with “steganography”). it’s basically the opposite of what you want, but worth mentioning here. watermarks could be placed into whitespace (not limited to actual spaces or line breaks); intentionally changed use of paragraphs, tabs or even page boundaries could possibly be detected after scanning and even after OCR. IMHO snowdrop uses, depending on the chosen operation mode, small errors like misspelled words, commas etc., but it also has a mode that keeps the grammar fine and has no misspelled words…
how do you make sure that by diff’ing two versions you cover “everything” that has been deliberately placed into both documents but carries literally the same information in each?
let’s say you bought two books at two different stores with two different watermarks. if the watermark contains the date and time of the purchase, and the only difference between the two is the minutes because you bought them within the same hour, the remaining watermark would point to all buyers who bought exactly this book in this hour, worldwide. that could still be “very” precise depending on all the other(!) buyers, if there were any at all within that timeframe. what if the watermark includes the unix epoch? then the part which is the same in both watermarks would not be bounded by hours, but by seconds, 10 seconds, 100 seconds etc.
and you could not know whether there were other watermarks hidden that just happened to be the same for your two (or three?) purchases (same country, continent, payment method, credit card holder name, name of the internet provider used during the purchase, browser used etc.). it fully depends on the creator of the watermark what gets included and what does not. if you happen to know all that (without any possible exceptions) you might be on the safe side, but if not…
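To make the epoch arithmetic concrete, here is a small sketch, assuming the watermark stores the purchase time as a decimal unix timestamp (an assumption for illustration, not something known about any real store). The hypothetical helper `shared_epoch_window` takes the two timestamps, keeps the leading digits they share, and returns the start and width in seconds of the time window those shared digits still pin down.

```python
from typing import Tuple


def shared_epoch_window(ts_a: int, ts_b: int) -> Tuple[int, int]:
    """Return (window_start, window_width_seconds) implied by the shared leading digits."""
    a, b = str(ts_a), str(ts_b)
    if len(a) != len(b):
        raise ValueError("timestamps must have the same number of digits")
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    # Every digit that differs (or follows a differing digit) is unknown,
    # so the window is ten to the power of the number of remaining digits.
    width = 10 ** (len(a) - shared)
    start = (ts_a // width) * width
    return start, width


if __name__ == "__main__":
    # Two purchases made a few minutes apart.
    print(shared_epoch_window(1700000123, 1700000987))
```

With those two example values the digits that survive a diff still narrow the purchase down to a 1000-second window, which is the point made above: removing what differs does not remove what both copies have in common.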
my general suggestion here is:
- if you want to be sure not to get into trouble, then just don’t do it.
- if that book is too expensive compared to its content, just not buying it possibly also helps the market to fix the problem.
- save that time and instead help those who already fight for a better world.
- search for books that are already licence-free (or e.g. “cc”-licensed) and promote those instead; help improve free resources like openstreetmap, wiki*, but do not publish licence-poisoned content there: write it yourself, always.
- write your own book and publish it free.
just to mention… the “safe” side sometimes seems limited but maybe is actually not, if you really look at it.
Diffing should reveal any differences, even white space. I suppose with white space it may be harder to fix, as you have to figure out the neutral state. But it is still possible.
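A minimal sketch of that kind of diff, using Python's standard `difflib`, where runs of whitespace are kept as separate tokens so that a doubled space or a swapped tab shows up next to changed words. The names `tokenize` and `diff_tokens` are made up for this example, and the two sample strings stand in for text extracted from two copies. As pointed out earlier in the thread, it can only show what differs between the copies, not anything that is identical in both.

```python
import difflib
import re


def tokenize(text):
    # Keep runs of whitespace as their own tokens so spacing changes stay visible.
    return re.findall(r"\s+|\S+", text)


def diff_tokens(a, b):
    """Return (operation, tokens_from_a, tokens_from_b) for each differing region."""
    ta, tb = tokenize(a), tokenize(b)
    matcher = difflib.SequenceMatcher(None, ta, tb, autojunk=False)
    return [
        (op, ta[i1:i2], tb[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


if __name__ == "__main__":
    copy_a = "The quick brown fox  jumps."  # note the double space
    copy_b = "The fast brown fox jumps."
    for change in diff_tokens(copy_a, copy_b):
        print(change)
```

Printing the token lists rather than the raw text makes whitespace differences readable, because each token is shown in quotes.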
Regarding the time stamp, I actually did think of this and you’re right. It would work especially for a small online bookstore. I believe the two books just have to be bought at very different times, and ideally differ in other things too, like buyers with different last names and even different general locations for the billing address.
Regarding your other points… You make good points, so I will consider them.
i have to admit that my point ‘just don’t do it’ does not, in reality, guarantee that you stay out of trouble. it is still possible to be sued for things someone else did.
also one suggestion to think about:
if the seller just sprays some random changes over a book for every sold version, one would have differences in “every” sold version to every other sold version. by blindly changing those parts to something else you could reveal which exact two/three versions you had for diffing.
UPDATE: someone else here had the same thought a bit earlier…
my suggestion to not do it stays the same ;-)
it could be interesting to figure out how these things work and what could be done to prevent or circumvent such prevention, but actually doing it seems risky no matter what.
Just use calibre to change the format to rtf or even txt, and then back to mobi or epub, and poof.
Neither of those formats stores any significant metadata; with rtf, though, the original formatting gets preserved.
From what I gathered, Amazon is the only one that includes an identifier. Look for a string starting with “atv:kin”.
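A minimal sketch of that lookup, under the assumption that the identifier sits in the file as plain, uncompressed ASCII bytes (the post only gives the prefix, so the exact layout is unknown to me). The helper names `find_marker_offsets` and `scan_file` are made up for this example; all they do is report the byte offsets where the prefix occurs.

```python
MARKER = b"atv:kin"  # prefix quoted in the post above


def find_marker_offsets(data, marker=MARKER):
    """Return every byte offset at which `marker` occurs in `data`."""
    offsets = []
    pos = data.find(marker)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(marker, pos + 1)
    return offsets


def scan_file(path, marker=MARKER):
    """Read a file as raw bytes and report where the marker appears."""
    with open(path, "rb") as fh:
        return find_marker_offsets(fh.read(), marker)


if __name__ == "__main__":
    demo = b"header atv:kin:example trailer"
    print(find_marker_offsets(demo))
```

An empty result only means the prefix was not found as plain bytes; it says nothing about identifiers stored in a compressed or otherwise encoded part of the file.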
The DeDRM fork removes it: https://github.com/noDRM/DeDRM_tools/blob/bf2471e65b1f52bb5292caeba70a9aea31bf6653/DeDRM_plugin/mobidedrm.py#L254
Buy it with a stolen credit card via VPN.