I need to figure out how to minify a 13 GB txt file and transform it into JSON.
Any thoughts?
I'm not sure how to minify unstructured text. If it's already JSON (or parsing it is a simple operation), then jq is great.
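For the already-JSON case, "minifying" just means re-serializing without whitespace; `jq -c .` does that from the shell, and a minimal Python equivalent (the sample string here is made up) looks like:

```python
import json

pretty = '{ "a" : 1,\n  "b" : [ 1, 2 ] }'
# Parse, then re-dump with no spaces after separators
minified = json.dumps(json.loads(pretty), separators=(',', ':'))
print(minified)  # {"a":1,"b":[1,2]}
```

Note this still parses the whole document, so for a 13 GB file you'd want a streaming parser rather than `json.loads`.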
We have a tool for this.

Which?
I don't think a txt file can be minified without altering its content. You'd probably have to remove irrelevant spaces, tabs, and newlines (picking a placeholder character for the newlines), or just trim the spaces and explode the text into an array on newlines, then encode that array to JSON. But you have to watch memory usage, and you have to write the code yourself.
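The trim-and-explode idea above can be sketched like this, streaming the array out element by element so memory stays flat regardless of file size (the paths and the choice to skip blank lines are my assumptions):

```python
import json

def txt_to_json_array(input_path, output_path):
    # Stream line by line so the 13 GB input never sits in memory at once
    with open(input_path, 'r', encoding='utf-8') as infile, \
         open(output_path, 'w', encoding='utf-8') as outfile:
        outfile.write('[')
        first = True
        for line in infile:
            trimmed = line.strip()
            if not trimmed:
                continue  # skip blank lines
            if not first:
                outfile.write(',')
            # json.dumps handles quoting and escaping of each element
            outfile.write(json.dumps(trimmed))
            first = False
        outfile.write(']')
```

The output is one big JSON array, but it is written incrementally, so peak memory is roughly one line of input.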
import json

input_path = '13gb.txt'
output_path = 'output.json'

def clean_chunk(chunk):
    # Collapse newlines, tabs and runs of spaces into single spaces
    return ' '.join(chunk.split())

with open(input_path, 'r', encoding='utf-8') as infile, \
     open(output_path, 'w', encoding='utf-8') as outfile:
    outfile.write('{"data":"')  # Start JSON
    first = True
    for line in infile:
        minified = clean_chunk(line)
        if not minified:
            continue
        if not first:
            outfile.write(' ')  # keep a separator between the original lines
        # json.dumps handles quote, backslash and control-character escaping
        outfile.write(json.dumps(minified)[1:-1])
        first = False
    outfile.write('"}')  # End JSON
If you only want to remove spaces, \t, \n and that kind of thing, you can use the code above.

This code would require about 16 GB of RAM to run without a memory-related error; you have to optimize it.
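One way the memory concern can bite: if the file has few or no newlines, `for line in infile` can itself pull most of the 13 GB into memory as a single "line". Reading fixed-size blocks keeps usage bounded; a sketch under the same `{"data":"..."}` output format (paths and chunk size are placeholders), carrying any word cut at a chunk boundary over to the next read:

```python
import json

def minify_in_chunks(input_path, output_path, chunk_size=1 << 20):
    with open(input_path, 'r', encoding='utf-8') as infile, \
         open(output_path, 'w', encoding='utf-8') as outfile:
        outfile.write('{"data":"')
        carry = ''   # partial word cut off at a chunk boundary
        first = True
        while True:
            chunk = infile.read(chunk_size)  # at most ~1 MiB of text at a time
            if not chunk:
                break
            words = (carry + chunk).split()
            # If the chunk ends mid-word, keep the tail for the next round
            if not chunk[-1].isspace() and words:
                carry = words.pop()
            else:
                carry = ''
            for word in words:
                if not first:
                    outfile.write(' ')
                outfile.write(json.dumps(word)[1:-1])  # escaped, without the outer quotes
                first = False
        if carry:
            if not first:
                outfile.write(' ')
            outfile.write(json.dumps(carry)[1:-1])
        outfile.write('"}')
```

Peak memory is on the order of `chunk_size`, independent of how the input's newlines are distributed.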
You could split the TXT file into multiple subfiles and then process them with pandas in batches; I think that's probably the easiest solution currently. You'll need to handle some edge cases depending on the file.
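A sketch of that batching idea, using `itertools.islice` to form the batches in memory rather than physically splitting the file (the paths, batch size, and the `text` column name are all assumptions), writing JSON Lines output batch by batch:

```python
from itertools import islice

import pandas as pd

def txt_to_json_lines(input_path, output_path, batch_size=100_000):
    # Process the file in fixed-size batches of lines instead of loading it whole
    with open(input_path, encoding='utf-8') as infile, \
         open(output_path, 'w', encoding='utf-8') as outfile:
        while True:
            batch = list(islice(infile, batch_size))
            if not batch:
                break
            df = pd.DataFrame({'text': [line.rstrip('\n') for line in batch]})
            # One JSON object per input line (JSON Lines), appended per batch
            df.to_json(outfile, orient='records', lines=True)
            outfile.write('\n')  # separate this batch from the next
```

JSON Lines keeps each record independently parseable, which is usually friendlier than one giant JSON document at this scale; per-batch cleanup (the "edge cases") would go on `df` before `to_json`.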